We begin by defining the variables in our framework.  We start with the outcome variable, $Y$.  Next, we consider our treatment variable, $A_k$, which takes the value 1 if treatment is given at time $k$ and 0 otherwise.  To begin, we will use treatment strategies that are exclusively treatment or exclusively no treatment, corresponding to $\overline{a} = (1,1,\dots,1) = \overline{1}$ and $\overline{a} = (0,0,\dots,0) = \overline{0}$ respectively.  The next measured variable is $L$, which represents the covariate(s) to be included.  Note that both the covariates, $L$, and the outcome, $Y$, are affected by an unmeasured common cause, $U$.  The diagram below illustrates these relationships.  

![title](image3.png)

The purpose of this investigation is to estimate the average causal effect of treatment, defined as 
$$ \mathbb{E}\big[Y^{a=1}\big] - \mathbb{E}\big[Y^{a=0}\big]$$ 
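
To make this estimand concrete, here is a minimal sketch, assuming a hypothetical population in which both potential outcomes are known for every individual (something that is never available in real data). The names `y_a0`, `y_a1`, and `true_effect` are illustrative and are not part of the analysis that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 2.0  # hypothetical effect size, chosen for illustration

# Potential outcomes for every individual under no treatment and under treatment
y_a0 = rng.normal(loc=0.0, scale=1.0, size=n)
y_a1 = y_a0 + true_effect

# Average causal effect: E[Y^{a=1}] - E[Y^{a=0}]
ace = y_a1.mean() - y_a0.mean()

# With randomized treatment, the observed difference in means estimates the same quantity
a = rng.binomial(n=1, p=0.5, size=n)
y_obs = np.where(a == 1, y_a1, y_a0)
ace_hat = y_obs[a == 1].mean() - y_obs[a == 0].mean()
```

When treatment is not randomized and depends on time-varying covariates, the simple difference in observed means is no longer a valid estimate, which is the motivation for the g-formula.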

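We want to build out the g-formula, which standardizes the conditional mean of the outcome over the covariate history under the treatment strategy $\overline{a}$  
$$ \mathbb{E} \big[Y_{i}^{\overline{a}}\big]  = \sum_{l_{i-1}, \, l_{i-2}} \mathbb{E} \big[Y_i \mid A_{i-1} = a_{i-1},  \; A_{i-2} = a_{i-2}, \; L_{i-1} = l_{i-1}, \; L_{i-2} = l_{i-2} \big] \; P^{\overline{a}} \big[L_{i-1} = l_{i-1}, \; L_{i-2} = l_{i-2} \big]$$ 

Here $i$ indexes time, and $P^{\overline{a}}$ denotes the distribution of the covariates when treatment is set to $\overline{a}$.  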
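
The standardization idea behind the g-formula can be illustrated with a minimal single-time-point sketch: the conditional mean of $Y$ given treatment and covariate is averaged over the covariate distribution. The probabilities and conditional means below are made-up values for illustration, not results from this analysis.

```python
import numpy as np

# Hypothetical inputs for one time point with a single binary covariate L
p_l = np.array([0.6, 0.4])            # P(L = 0), P(L = 1)

# mean_y[a, l] = E[Y | A = a, L = l]
mean_y = np.array([[1.0, 3.0],
                   [2.0, 5.0]])

def standardized_mean(a):
    """Sum over l of E[Y | A = a, L = l] * P(L = l)."""
    return float(np.sum(mean_y[a] * p_l))

e_y_a0 = standardized_mean(0)   # 1.0*0.6 + 3.0*0.4
e_y_a1 = standardized_mean(1)   # 2.0*0.6 + 5.0*0.4
effect = e_y_a1 - e_y_a0
```

With time-varying treatment and covariates, the same weighting is applied over the whole covariate history, which is why a model for $L$ is needed alongside the model for $Y$.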

We can do this by building out two models, one for $Y$ and one for $L$.  We will begin by using a continuous $Y$ and a binary $L$ for simplicity.  

For $Y$, we will use a linear regression that includes two time lags.  The model will look something like this 

$$\mathbb{E} \big[Y_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_t \big] = \theta_{0,t} + \theta_1 Y_{t} + \theta_2 Y_{t-1} + \theta_3 A_{t}+ \theta_4 A_{t-1} + \theta_5 L_t + \theta_6 L_{t-1} $$ 
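
As a minimal sketch of how such a lagged linear model can be fit, the example below builds a design matrix with one column per term on the right-hand side and solves the least-squares problem with NumPy. The data and the coefficient values in `theta` are hypothetical and were chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # hypothetical number of individuals observed at one time point

# Columns: intercept, Y_t, Y_{t-1}, A_t, A_{t-1}, L_t, L_{t-1}
X = np.column_stack([
    np.ones(n),
    rng.normal(size=n), rng.normal(size=n),
    rng.binomial(1, 0.5, size=n), rng.binomial(1, 0.5, size=n),
    rng.binomial(1, 0.5, size=n), rng.binomial(1, 0.5, size=n),
])

# Hypothetical coefficients theta_0 ... theta_6
theta = np.array([0.5, 0.3, 0.1, 1.0, 0.5, -0.4, -0.2])
y_next = X @ theta + rng.normal(scale=0.1, size=n)

# Ordinary least squares fit of Y_{t+1} on the lagged terms
theta_hat, *_ = np.linalg.lstsq(X, y_next, rcond=None)
```

The fitted `theta_hat` should be close to `theta` because the simulated noise is small relative to the sample size.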

For each $L$, we will use a logistic regression, also with two time lags.  This will give us something similar to 
$$ \text{logit} \, P\big[L_{t+1} = 1 \mid \overline{A}_t, \overline{L}_t, \overline{Y}_{t+1} \big] = \alpha_{0,t} + \alpha_1 Y_{t+1} + \alpha_2 Y_{t} + \alpha_3 Y_{t-1} +  \alpha_4 A_{t}+ \alpha_5 A_{t-1} + \alpha_6 L_t + \alpha_7 L_{t-1} $$ 
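
To show how a fitted logistic model of this form turns a history into a probability, here is a minimal sketch. The coefficient values in `alpha` and the example `history` are hypothetical, and `expit` and `logit` are small helper functions defined here rather than library calls.

```python
import numpy as np

def expit(x):
    """Inverse of the logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients alpha_0 ... alpha_7
alpha = np.array([-0.5, 0.2, 0.1, 0.0, 0.8, 0.3, 0.6, 0.2])

# One individual's history: Y_{t+1}, Y_t, Y_{t-1}, A_t, A_{t-1}, L_t, L_{t-1}
history = np.array([1.0, 0.5, 0.0, 1, 1, 0, 1])

linear_predictor = alpha[0] + alpha[1:] @ history
prob_l_next = expit(linear_predictor)   # P[L_{t+1} = 1 | history]
```

Applying `logit` to `prob_l_next` recovers the linear predictor, which is the relationship the model equation expresses.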


<!---what quadratic time term needed to be added here? --->

We first need to simulate the data. We will build out the covariates of the population.  For simplicity, we will use two binary covariates, $L_1$ and $L_2$.  Given a population of size $n$, we will assign probabilities as depicted in the image below.  

![title](image2.png)

Using these assigned covariates, we can then randomly assign each individual in the population to treatment or no treatment, $A$.  Combining this information, we can simulate the outcomes, $Y_{i,t}$, for the population.  

In [39]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk
import math
import csv
import statsmodels.api as sm
import statsmodels.formula.api as smf
import random

The data is being simulated using the following equations 

$$ \text{logit} \, P\big[L_k = 1\big] = \alpha_0 + \alpha_1 L_{k-1} + \alpha_2 L_{k-2} + \alpha_3 A_{k-1} + \alpha_4 A_{k-2} $$ 


$$ \text{logit} \, P\big[A_k = 1\big] = \beta_0 + \beta_1 L_{k} + \beta_2 L_{k-1} + \beta_3 A_{k-1} + \beta_4 A_{k-2} $$ 

In [15]:
## establishing constants 
indiv = 100   ## number of individuals in study 
# p1 = 0.75   ## probability of having covariate trait 1 
# p2 = 0.75  ## probability of having covariate trait 2
# pA = 0.5   ## probability of receiving treatment  


time = 10 ## number of time points being considered 
t_delay = 2 ## number of time delays included in model 

## building out simulated data 
columns = ["indiv", "time", "A", "Y", "U", "L1"]
coef_columns = ["indiv", "alpha_0", "alpha_1", "alpha_2", "alpha_3", "alpha_4", "beta_0", "beta_1", "beta_2", "beta_3", "beta_4"]
rows = []        ## one row per individual per time point 
coef_rows = []   ## one row of simulation coefficients per individual 

def expit(x):
    ## inverse logit: maps a linear predictor to a probability 
    return np.exp(x)/(1+np.exp(x))

for ii in range(1,indiv+1):
    
    ## individual-specific coefficients for the L model (alpha) and the A model (beta) 
    alpha_0, alpha_1, alpha_2, alpha_3, alpha_4 = np.random.uniform(low = -1, high = 1, size = 5)
    beta_0, beta_1, beta_2, beta_3, beta_4 = np.random.uniform(low = -1, high = 1, size = 5)
    coef_rows.append([ii, alpha_0, alpha_1, alpha_2, alpha_3, alpha_4, beta_0, beta_1, beta_2, beta_3, beta_4])
    
    L_hist = []   ## covariate history for this individual 
    A_hist = []   ## treatment history for this individual 
     
    for jj in range(0, time+1): 
        
        ## creating an unobserved variable that affects covariates 
        U = np.random.uniform(low = 0.1, high = 1)
        
        ## linear predictor for L1: U enters only at baseline, lag terms enter once they exist 
        x_L = alpha_0
        if jj == 0: 
            x_L += U
        if jj >= 1: 
            x_L += alpha_1*L_hist[jj-1] + alpha_3*A_hist[jj-1]
        if jj >= 2: 
            x_L += alpha_2*L_hist[jj-2] + alpha_4*A_hist[jj-2]
        L1 = np.random.binomial(n=1, p = expit(x_L))
        
        ## linear predictor for A: depends on the current L1 and on the available lags 
        x_A = beta_0 + beta_1*L1
        if jj == 0: 
            x_A += U
        if jj >= 1: 
            x_A += beta_2*L_hist[jj-1] + beta_3*A_hist[jj-1]
        if jj >= 2: 
            x_A += beta_4*A_hist[jj-2]
        A = np.random.binomial(n=1, p = expit(x_A))
         
        Y = np.random.normal(loc = U, scale = 1)
        
        L_hist.append(L1)
        A_hist.append(A)
        rows.append([ii, jj, A, Y, U, L1])

## building the data frames once, instead of appending row by row 
df = pd.DataFrame(rows, columns = columns)
coefficient_df = pd.DataFrame(coef_rows, columns = coef_columns)

In [17]:
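# creating shifted values within each individual, so that lags never cross from one individual to the next 
df["A_1"] = df.groupby("indiv")["A"].shift(1)
df["A_2"] = df.groupby("indiv")["A"].shift(2)
df["Y_1"] = df.groupby("indiv")["Y"].shift(1)
df["Y_2"] = df.groupby("indiv")["Y"].shift(2)
df["L1_1"] = df.groupby("indiv")["L1"].shift(1)
df["L1_2"] = df.groupby("indiv")["L1"].shift(2)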

df.to_csv("Sim_Data.csv")
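
One caveat with lagged columns in long-format data is that a plain `shift` does not know where one individual ends and the next begins. The toy frame below is a hypothetical example, not the simulated data, and shows the difference between shifting the whole column and shifting within each individual.

```python
import pandas as pd

toy = pd.DataFrame({
    "indiv": [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "A":     [1, 0, 1, 0, 0, 1],
})

# A plain shift carries individual 1's last value into individual 2's first row
toy["A_1_plain"] = toy["A"].shift(1)

# Shifting within each individual leaves the first row of every individual missing
toy["A_1_grouped"] = toy.groupby("indiv")["A"].shift(1)
```

In this notebook the models are only fit at time points where the required lags exist, so the affected rows are the early time points of each individual.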

Starting with the model for Y, we use a linear regression to get the following model, 
$$\mathbb{E} \big[Y_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_t \big] = \theta_{0,t} + \theta_1 Y_{t} + \theta_2 Y_{t-1} + \theta_3 A_{t}+ \theta_4 A_{t-1} + \theta_5 L_t + \theta_6 L_{t-1} $$ 



In [18]:
columns = ["time", "int", "A", "L1", "A_1", "A_2", "Y_1", "Y_2", "L1_1", "L1_2"]
Y_model_rows = []
## creating our models: one linear regression per time point 
for ii in range(t_delay, time+1): 
    temp_df = df[df.time == ii]
    ## use the precomputed lag columns Y_1 and Y_2; Y.shift() inside the formula would shift across individuals 
    Y_model = smf.ols('Y ~ A + L1 + A_1 + A_2 + Y_1 + Y_2 + L1_1 + L1_2', data=temp_df).fit()
    Y_model_rows.append([ii] + list(Y_model.params))

## DataFrame.append was removed in pandas 2.0, so the rows are collected first 
Y_model_df = pd.DataFrame(Y_model_rows, columns = columns)

In [19]:
Y_model_df.to_csv("Y_model_data.csv")

In [90]:
## Creating model for L1 
## SHOULD THIS INCLUDE Y??? 

columns = ["time", "alpha_0", "alpha_1", "alpha_2", "alpha_3", "alpha_4"]
train_columns = ["L1_1", "L1_2", "A_1", "A_2"]
L1_model_rows = []

for ii in range(1, time+1): 
    temp_df = df[df.time == ii]   
    if ii == 1: 
        ## only one lag is available at time 1, so alpha_2 and alpha_4 are left missing 
        L1_model = sm.Logit(np.asarray(temp_df["L1"], dtype=float), np.asarray(sm.add_constant(temp_df[["L1_1", "A_1"]]), dtype=float)).fit()
        params = list(L1_model.params)
        L1_model_rows.append([ii, params[0], params[1], np.nan, params[2], np.nan])
    else: 
        L1_model = sm.Logit(np.asarray(temp_df["L1"], dtype=float), np.asarray(sm.add_constant(temp_df[train_columns]), dtype=float)).fit()
        L1_model_rows.append([ii] + list(L1_model.params))

## DataFrame.append was removed in pandas 2.0, so the rows are collected first 
L1_model_df = pd.DataFrame(L1_model_rows, columns = columns)
    

Optimization terminated successfully.
         Current function value: 0.677269
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.677507
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.632987
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.663671
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.640181
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.632510
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.628300
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.627476
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.645710
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.646454
  

In [91]:
## Creating model for A 

columns = ["time", "beta_0", "beta_1", "beta_2", "beta_3", "beta_4"] 
train_columns = ["L1", "L1_1", "A_1", "A_2"]
A_model_rows = []

for ii in range(1, time+1): 
    temp_df = df[df.time == ii]
    if ii == 1: 
        ## only one lag is available at time 1, so beta_4 is left missing 
        A_model = sm.Logit(np.asarray(temp_df["A"], dtype=float), np.asarray(sm.add_constant(temp_df[["L1", "L1_1", "A_1"]]), dtype=float)).fit()
        A_model_rows.append([ii] + list(A_model.params) + [np.nan])
    else:
        A_model = sm.Logit(np.asarray(temp_df["A"], dtype=float), np.asarray(sm.add_constant(temp_df[train_columns]), dtype=float)).fit()
        A_model_rows.append([ii] + list(A_model.params))

## DataFrame.append was removed in pandas 2.0, so the rows are collected first 
A_model_df = pd.DataFrame(A_model_rows, columns = columns)
   

Optimization terminated successfully.
         Current function value: 0.619917
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.683137
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.648152
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.642491
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.607466
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.672750
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.626664
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.643394
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.664249
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.662615
  

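Now, using these three models, we can calculate the following expectations for $\bar{a} = \bar{0}$ and $\bar{a} = \bar{1}$ respectively

$$ \mathbb{E} \big[Y_{i}^{\overline{a}}\big]  = \sum_{l} \mathbb{E} \big[Y_i \mid A_{i-1} = a_{i-1},  \; A_{i-2} = a_{i-2}, \; L_{1, i-1} = l_{1, i-1}, \; L_{1, i-2} = l_{1, i-2}, \; L_{2, i-1} = l_{2, i-1}, \; L_{2, i-2} = l_{2, i-2} \big] \; P^{\overline{a}} \big[ l \big]$$ 

The sum runs over the covariate values $l = (l_{1, i-1}, l_{1, i-2}, l_{2, i-1}, l_{2, i-2})$, and $P^{\overline{a}}[l]$ is their probability when treatment is set to $\overline{a}$.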
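
One common way to evaluate expectations of this kind is by Monte Carlo simulation: draw baseline covariates, simulate the covariate history forward under a fixed treatment, and average the outcome model's predictions. The sketch below is a simplified illustration under several assumptions that differ from the notebook: a single binary covariate, coefficients that are constant over time, an outcome model without lagged $Y$ terms, and made-up coefficient values. The function name `g_computation_mean` is hypothetical.

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

def g_computation_mean(a, alpha, theta, baseline_L, n_times, n_draws, rng):
    """Monte Carlo estimate of E[Y^a] at time n_times with treatment fixed to `a`.

    alpha: coefficients of logit P[L_k = 1] on (1, L_{k-1}, L_{k-2}, A_{k-1}, A_{k-2})
    theta: coefficients of E[Y] on (1, A_{k-1}, A_{k-2}, L_{k-1}, L_{k-2})
    """
    # L[k] holds the simulated covariate at time k for all draws
    L = [rng.choice(baseline_L, size=n_draws)]
    for k in range(1, n_times):
        x_L = alpha[0] + alpha[1] * L[k - 1] + alpha[3] * a
        if k >= 2:
            x_L = x_L + alpha[2] * L[k - 2] + alpha[4] * a
        L.append(rng.binomial(1, expit(x_L)))
    # Plug the simulated covariate history and the fixed treatment into the outcome model
    y = (theta[0] + theta[1] * a + theta[2] * a
         + theta[3] * L[-1] + theta[4] * L[-2])
    return float(np.mean(y))

rng = np.random.default_rng(3)
baseline_L = np.array([0, 1, 1, 0, 1])          # hypothetical observed L at time 0
alpha = np.array([-0.2, 0.5, 0.3, -0.8, -0.4])  # hypothetical covariate-model coefficients
theta = np.array([1.0, 2.0, 0.5, -1.0, -0.5])   # hypothetical outcome-model coefficients

e_y_a1 = g_computation_mean(1, alpha, theta, baseline_L, n_times=10, n_draws=5000, rng=rng)
e_y_a0 = g_computation_mean(0, alpha, theta, baseline_L, n_times=10, n_draws=5000, rng=rng)
effect = e_y_a1 - e_y_a0
```

In the notebook's setting, the time-specific coefficients stored in `L1_model_df` and `Y_model_df` would take the place of the fixed `alpha` and `theta` used here.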

In [92]:
A_model_df.head()

Unnamed: 0,time,beta_0,beta_1,beta_2,beta_3,beta_4
0,1,-0.736521,-0.658712,-0.260118,1.26892,Nan
1,2,-0.406433,0.061678,-0.100935,-0.1296,0.561374
2,3,0.329748,-0.386072,-0.720246,-0.0970855,1.00538
3,4,-0.930438,-0.288135,0.0260276,1.01494,0.590457
4,5,0.0983186,-0.742778,-0.7334,1.03745,0.714804


In [29]:
n = 1000 
final_model_df = pd.DataFrame(columns = ["time", "exp_Y_A1", "exp_Y_A0"])

for ii in range(0,n):
    ## draw a baseline covariate value from the observed values at time 0 
    L0 = random.choice(list(df["L1"][df["time"] == 0]))
    
    ## TODO: the steps below are not yet implemented 
    ##   L1 = 
    ##   x_A = beta_0 + beta_1*L1 + beta_2*float(df["L1"][(df.time == jj-1) & (df.indiv == ii)])+ beta_3*float(df["A"][(df.time == jj-1) & (df.indiv == ii)])
    ##   A1 = A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))
    ##   L2 = 
    ##   for jj in range(t_delay, time):



The final step is to calculate the difference at each time point, using the following
$$ \mathbb{E}\big[Y^{a=1}\big] - \mathbb{E}\big[Y^{a=0}\big]$$ 

In [62]:
random.choice(list(L_0s))

1.0

In [59]:
type(list(L_0s))

list