We can establish our framework by defining the variables to be used.  We start with the outcome variable, $Y$.  Next, we consider out treatment variable, $A_k$, which takes on value 1 if there is treatment, and is 0 otherwise.  To begin, we will use treatment strategies that are exclusively treatment or exclusively no treatment, corresponding to $\overline{a} = (1,1,\dots,1) = \overline{1}$ and $\overline{a} = (0,0,\dots,0) = \overline{0}$ respectively.  The next measurable variable is $L$, which represents the covariate(s) to be included.  Note that both the covariates, $L$, and the outcome, $Y$ are affected by an unmeasured common cause, $U$.  The diagram below illustrates these relationships.  

![title](image1.png)

The purpose of this investigation is to measure the average causal effect of treatment, which can be estimated using 
$$ \mathbb{E}\big[Y^{a=1}\big] - \mathbb{E}\big[Y^{a=0}\big]$$ 

We want to build out the g-formula as follows  
$$ \mathbb{E} \big[Y_{i}^{\overline{a}}\big]  = \sum_{i} \mathbb{E} \big[Y_i \mid A_{i-1} = a_{i-1},  \; A_{i-2} = a_{i-2}, \; L_{i-1}, \; L_{i-2} \big]$$ 

We can do this by building out two models, one for $Y$ And one for $L$.  We will begin by using a continous $Y$ and a binary $L$ for simplicity.  

For $Y$, we will use a linear regression for a time delay of t=2.  The model will look something like this 

$$\mathbb{E} \big[Y_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_t \big] = \theta_{0,t} + \theta_1 Y_{t} + \theta_2 Y_{t-1} + \theta_3 A_{t}+ \theta_4 A_{t-1} + \theta_5 L_t + \theta_6 L_{t-1} $$ 

For $L$, we will use a logistic regression, also calculated on a time delay of t=2.  This will give us something similar to 
$$ logistP\big[L_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_{t+1} \big] = \alpha_{0,t} + \alpha_1 Y_{t+1} + \alpha_2 Y_{t} + \alpha_3 Y_{t-1} +  \alpha_4 A_{t}+ \alpha_5 A_{t-1} + \alpha_6 L_t + \alpha_7 L_{t-1} $$ 


<!---what quadratic time term needed to be added here? --->

We first need to simulate the data. We will first build out the covariates of the population.  For simplicity, we will use two binary covariates, $L_1$ and $L_2$.  Given a population of size $n$, we will assign probabilities as depicted in the image below.  

![title](image2.png)

Using these assigned covariates, we can also randomly assign treatment or not for $A$, for the entire population.  Then, using this information together, we can simulate the outcomes, $Y_{i,t}$ for the population.  

In [1]:
import sklearn as sk
import pandas as pd
import numpy as np
import scipy as sp
import math
import csv
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
## establishing constants 
n = 100   ## number of individuals in study 
p1 = 0.75   ## probability of having covariate trait 1 
p2 = 0.75  ## probability of having covariate trait 2
pA = 0.5   ## probability of receiving treatment  
time = 10 ## number of time points being considered 
t_delay = 2 ## number of time delays included in model 

## building out simulated data 
columns = ["index", "time", "A", "Y", "U", "L1", "L2"]
df = pd.DataFrame(columns = columns)
df.head()

for ii in range(1,n+1):
    ## assuming a continuous treatment here 
    A = 3*np.random.binomial(n=1, p=pA)
    
    ## creating an unobserved variable that affects covariates 
    U = np.random.uniform(low = 0.1, high = 1)
    
    ## assuming first covariate is constant 
    L1 = np.random.binomial(n=1, p=p1*U)
    
    
    for jj in range(0, time+1): 
        if jj == 0: 
            L2 = 1 
            
        L2 = np.random.binomial(n = 1, p=p2*L2)
        
        Y = np.random.normal(loc = A+L1+L2, scale = 1)
        
        df.loc[len(df)+1] = [ii, jj, A, Y, U, L1, L2]
# df.head(200)

In [3]:
## creating shifted values for 
df["A_1"] = df.A.shift(1)
df["A_2"] = df.A.shift(2)
df["Y_1"] = df.Y.shift(1)
df["Y_2"] = df.Y.shift(2)
df["L1_1"] = df.L1.shift(1)
df["L1_2"] = df.L1.shift(2)
df["L2_1"] = df.L2.shift(1)
df["L2_2"] = df.L2.shift(2)


## making sure individuals' values dont get muddled 
df["A_1"][df.time == 0] = 0
df["A_2"][df.time == 0] = 0
df["Y_1"][df.time == 0] = "NaN"  ## WHAT SHOULD THIS BE EQUAL TO???
df["Y_2"][df.time == 0] = "NaN"
df["L1_1"][df.time == 0] = df["L1"][df.time == 0]
df["L1_2"][df.time == 0] = df["L1"][df.time == 0]
df["L2_1"][df.time == 0] = df["L2"][df.time == 0]
df["L2_2"][df.time == 0] = df["L2"][df.time == 0]

df["A_2"][df.time == 1] = 0
df["Y_2"][df.time == 1] = "NaN"
df["L1_2"][df.time == 1] = df["L1"][df.time == 0]
df["L2_2"][df.time == 1] = df["L2"][df.time == 0]

# df.to_csv("Sim_Data.csv")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a

In [None]:
# df = csv.reader(open("Sim_Data.csv"))
df.head()

Unnamed: 0,index,time,A,Y,U,L1,L2,A_1,A_2,Y_1,Y_2,L1_1,L1_2,L2_1,L2_2
1,1.0,0.0,0.0,0.490524,0.846235,1.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0
2,1.0,1.0,0.0,2.862,0.846235,1.0,0.0,0.0,0.0,0.490524,,1.0,,0.0,
3,1.0,2.0,0.0,0.509378,0.846235,1.0,0.0,0.0,0.0,2.862,0.490524,1.0,1.0,0.0,0.0
4,1.0,3.0,0.0,2.486942,0.846235,1.0,0.0,0.0,0.0,0.509378,2.862,1.0,1.0,0.0,0.0
5,1.0,4.0,0.0,0.282136,0.846235,1.0,0.0,0.0,0.0,2.48694,0.509378,1.0,1.0,0.0,0.0


In [20]:
Y_model_df = pd.DataFrame(columns = ["int", "A", "L1", "L2", "A_1", "A_2", "Y_1", "Y_2", "L1_1", "L1_2", "L2_1", "L2_2"])
## creating our models 
for ii in range(0, time+1): 
    temp_df = df[df.time == ii]
    Y_model = smf.ols('Y ~ A + L1 + L2 + A_1 + A_2 + Y.shift(1) + Y.shift(2) + L1_1 + L1_2 + L2_1 + L2_2', data=temp_df).fit()
    Y_model_df.loc[len(Y_model_df)+1] = [ii] + [Y_model.params[i] for i in range(0,12)]

ValueError: cannot set a row with mismatched columns

In [21]:
[ii] + [Y_model.params[i] for i in range(0,12)]

[0,
 0.068748049928044508,
 1.0584304065340755,
 0.37223791156898395,
 0.29547604508801839,
 -1.4130701670430043e-17,
 0.0,
 0.082142910207975123,
 -0.057527624403048858,
 0.37223791156898511,
 0.37223791156898511,
 0.295476045088018,
 0.295476045088018]

In [24]:
Y_model.params

Intercept     6.874805e-02
A             1.058430e+00
L1            3.722379e-01
L2            2.954760e-01
A_1          -1.413070e-17
A_2           0.000000e+00
Y.shift(1)    8.214291e-02
Y.shift(2)   -5.752762e-02
L1_1          3.722379e-01
L1_2          3.722379e-01
L2_1          2.954760e-01
L2_2          2.954760e-01
dtype: float64