We can establish our framework by defining the variables to be used.  We start with the outcome variable, $Y$.  Next, we consider out treatment variable, $A_k$, which takes on value 1 if there is treatment, and is 0 otherwise.  To begin, we will use treatment strategies that are exclusively treatment or exclusively no treatment, corresponding to $\overline{a} = (1,1,\dots,1) = \overline{1}$ and $\overline{a} = (0,0,\dots,0) = \overline{0}$ respectively.  The next measurable variable is $L$, which represents the covariate(s) to be included.  Note that both the covariates, $L$, and the outcome, $Y$ are affected by an unmeasured common cause, $U$.  The diagram below illustrates these relationships.  

![title](image3.png)

The purpose of this investigation is to measure the average causal effect of treatment, which can be estimated using 
$$ \mathbb{E}\big[Y^{a=1}\big] - \mathbb{E}\big[Y^{a=0}\big]$$ 

We want to build out the g-formula as follows  
$$ \mathbb{E} \big[Y_{i}^{\overline{a}}\big]  = \sum_{i} \mathbb{E} \big[Y_i \mid A_{i-1} = a_{i-1},  \; A_{i-2} = a_{i-2}, \; L_{i-1}, \; L_{i-2} \big]$$ 

We can do this by building out two models, one for $Y$ And one for $L$.  We will begin by using a continous $Y$ and a binary $L$ for simplicity.  

For $Y$, we will use a linear regression for a time delay of t=2.  The model will look something like this 

$$\mathbb{E} \big[Y_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_t \big] = \theta_{0,t} + \theta_1 Y_{t} + \theta_2 Y_{t-1} + \theta_3 A_{t}+ \theta_4 A_{t-1} + \theta_5 L_t + \theta_6 L_{t-1} $$ 

For each $L$, we will use a logistic regression, also calculated on a time delay of t=2.  This will give us something similar to 
$$ logistP\big[L_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_{t+1} \big] = \alpha_{0,t} + \alpha_1 Y_{t+1} + \alpha_2 Y_{t} + \alpha_3 Y_{t-1} +  \alpha_4 A_{t}+ \alpha_5 A_{t-1} + \alpha_6 L_t + \alpha_7 L_{t-1} $$ 


<!---what quadratic time term needed to be added here? --->

We first need to simulate the data. We will build out the covariates of the population.  For simplicity, we will use two binary covariates, $L_1$ and $L_2$.  Given a population of size $n$, we will assign probabilities as depicted in the image below.  

![title](image2.png)

Using these assigned covariates, we can also randomly assign treatment or not for $A$, for the entire population.  Then, using this information together, we can simulate the outcomes, $Y_{i,t}$ for the population.  

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk
import math
import csv
import statsmodels.api as sm
import statsmodels.formula.api as smf

The data is being simulated using the following equations 

$$logit(L_k) = \alpha_0 + \alpha_1 \cdot L_{k-1} + \alpha_2 \cdot L_{k-2} + \alpha_3 A_{k-1} + \alpha_4 A_{K-2}$$ 


$$ logit(A_k) = \beta_0 + \beta_1 L_{k} + \beta_2 L_{k-1} + \beta_3 A_{k-1} + \beta_4 A_{K-2} $$ 

In [15]:
## establishing constants 
indiv = 100   ## number of individuals in study 
# p1 = 0.75   ## probability of having covariate trait 1 
# p2 = 0.75  ## probability of having covariate trait 2
# pA = 0.5   ## probability of receiving treatment  


time = 10 ## number of time points being considered 
t_delay = 2 ## number of time delays included in model 

## building out simulated data 
columns = ["indiv", "time", "A", "Y", "U", "L1"]
## columns = ["indiv", "time", "A", "Y", "U", "L1", "L2"]
df = pd.DataFrame(columns = columns)
coefficient_df = pd.DataFrame(columns = ["indiv", "alpha_0", "alpha_1", "alpha_2", "alpha_3", "alpha_4", "beta_0", "beta_1", "beta_2", "beta_3", "beta_4"])
df.head()

for ii in range(1,indiv+1):
    
    alpha_0 = np.random.uniform(low = -1, high = 1)
    alpha_1 = np.random.uniform(low = -1, high = 1)
    alpha_2 = np.random.uniform(low = -1, high = 1)
    alpha_3 = np.random.uniform(low = -1, high = 1)
    alpha_4 = np.random.uniform(low = -1, high = 1)

    beta_0 = np.random.uniform(low = -1, high = 1)
    beta_1 = np.random.uniform(low = -1, high = 1)
    beta_2 = np.random.uniform(low = -1, high = 1)
    beta_3 = np.random.uniform(low = -1, high = 1)
    beta_4 = np.random.uniform(low = -1, high = 1)
    
    coefficient_df.loc[len(coefficient_df)+1] = [ii, alpha_0, alpha_1, alpha_2, alpha_3, alpha_4, beta_0, beta_1, beta_2, beta_3, beta_4, ]
     
    for jj in range(0, time+1): 
        
        ## creating an unobserved variable that affects covariates 
        U = np.random.uniform(low = 0.1, high = 1)
            
        if jj == 0: 
            x_L = alpha_0 + U 
            L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))
            
            x_A = beta_0 + U + beta_1*L1 
            A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))
            
        elif jj == 1: 
            x_L = alpha_0 + alpha_1*float(df["L1"][(df.time == jj-1) & (df.indiv == ii)])+alpha_3*float(df["A"][(df.time == jj-1) & (df.indiv == ii)])
            L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))

            x_A = beta_0 + beta_1*L1 + beta_2*float(df["L1"][(df.time == jj-1) & (df.indiv == ii)])+ beta_3*float(df["A"][(df.time == jj-1) & (df.indiv == ii)])
            A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

        else: 
            x_L = alpha_0 + alpha_1*float(df["L1"][(df.time == jj-1) & (df.indiv == ii)])+alpha_2*float(df["L1"][(df.time == jj-2) & (df.indiv == ii)])+alpha_3*float(df["A"][(df.time == jj-1) & (df.indiv == ii)])+alpha_4*float(df["A"][(df.time == jj-2) & (df.indiv == ii)])
            L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))
            
        
            x_A = beta_0 + beta_1*L1 + beta_2*float(df["L1"][(df.time == jj-1) & (df.indiv == ii)])+ beta_3*float(df["A"][(df.time == jj-1) & (df.indiv == ii)])+ beta_4*float(df["A"][(df.time == jj-2) & (df.indiv == ii)])
            A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))
         
        Y = np.random.normal(loc = U, scale = 1)
        
        ## df.loc[len(df)+1] = [ii, jj, A, Y, U, L1, L2]
        df.loc[len(df)+1] = [ii, jj, A, Y, U, L1]
# df.head(200)

In [16]:
df.head(300)

Unnamed: 0,indiv,time,A,Y,U,L1
1,1.0,0.0,0.0,-0.277944,0.745989,1.0
2,1.0,1.0,0.0,-0.426052,0.117601,1.0
3,1.0,2.0,0.0,-0.905881,0.548949,0.0
4,1.0,3.0,1.0,-0.005919,0.369481,1.0
5,1.0,4.0,0.0,1.530409,0.380216,1.0
6,1.0,5.0,0.0,1.228272,0.514340,1.0
7,1.0,6.0,1.0,0.770156,0.939312,1.0
8,1.0,7.0,0.0,0.307017,0.472137,1.0
9,1.0,8.0,1.0,-0.443416,0.358669,1.0
10,1.0,9.0,0.0,-0.697259,0.362676,1.0


In [17]:
# creating shifted values 
df["A_1"] = df.A.shift(1)
df["A_2"] = df.A.shift(2)
df["Y_1"] = df.Y.shift(1)
df["Y_2"] = df.Y.shift(2)
df["L1_1"] = df.L1.shift(1)
df["L1_2"] = df.L1.shift(2)
# df["L2_1"] = df.L2.shift(1)
# df["L2_2"] = df.L2.shift(2)


## making sure individuals' values dont get muddled 
# df["A_1"][df.time == 0] = 0
# df["A_2"][df.time == 0] = 0
# df["Y_1"][df.time == 0] = 0   ## WHAT SHOULD THIS BE EQUAL TO???
# df["Y_2"][df.time == 0] = 0   ## HERE TOO 
# df["L1_1"][df.time == 0] = df["L1"][df.time == 0]
# df["L1_2"][df.time == 0] = df["L1"][df.time == 0]
# df["L2_1"][df.time == 0] = df["L2"][df.time == 0]
# df["L2_2"][df.time == 0] = df["L2"][df.time == 0]

# df["A_2"][df.time == 1] = 0
# df["Y_2"][df.time == 1] = 0   ## AND THIS ONE
# df["L1_2"][df.time == 1] = df["L1"][df.time == 1]
# df["L2_2"][df.time == 1] = df["L2"][df.time == 1]

df.to_csv("Sim_Data.csv")

Starting with the model for Y, we use a linear regression to get the following model, 
$$\mathbb{E} \big[Y_{t+1} \mid \overline{A}_t, \overline{L}_t, \overline{Y}_t \big] = \theta_{0,t} + \theta_1 Y_{t} + \theta_2 Y_{t-1} + \theta_3 A_{t}+ \theta_4 A_{t-1} + \theta_5 L_t + \theta_6 L_{t-1} $$ 



In [18]:
columns = ["time", "int", "A", "L1", "A_1", "A_2", "Y_1", "Y_2", "L1_1", "L1_2"]
Y_model_df = pd.DataFrame(columns = columns)
## creating our models 
for ii in range(t_delay, time+1): 
    temp_df = df[df.time == ii]
    Y_model = smf.ols('Y ~ A + L1 + A_1 + A_2 + Y.shift(1) + Y.shift(2) + L1_1 + L1_2', data=temp_df).fit()
    Y_model_df = Y_model_df.append(pd.DataFrame([ii] + [Y_model.params[i] for i in range(0,9)], index = columns).transpose(), ignore_index=True)

In [19]:
Y_model_df.to_csv("Y_model_data.csv")

Now, we build the models for the two covariates separately.  For the first covariate, the model will look like this

$$ logistP\big[L_{1, t+1} \mid \overline{A}_t, \overline{L}_{1,t}, \overline{Y}_{t+1} \big] = \alpha_{0,t} + \alpha_1 Y_{t+1} + \alpha_2 Y_{t} + \alpha_3 Y_{t-1} +  \alpha_4 A_{t}+ \alpha_5 A_{t-1} + \alpha_6 L_{1,t} + \alpha_7 L_{1,t-1}$$ 


In [20]:
## Creating model for L1 
columns = ["time", "int", "A", "A_1", "A_2", "Y", "Y_1", "Y_2", "L1_1", "L1_2"]
train_columns = ["A", "A_1", "A_2", "Y", "Y_1", "Y_2", "L1_1", "L1_2"]
L1_model_df = pd.DataFrame(columns = columns)

for ii in range(t_delay, time+1): 
    temp_df = df[df.time == ii]
    L1_model = sm.Logit(np.asarray(temp_df["L1"]), np.asarray(sm.add_constant(temp_df[train_columns]))).fit()
    L1_model_df = L1_model_df.append(pd.DataFrame([ii] + [L1_model.params[i] for i in range(0,9)], index = columns).transpose(), ignore_index=True)
    

Optimization terminated successfully.
         Current function value: 0.664132
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.610657
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.654680
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.620087
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.604641
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.585663
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.575409
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.636681
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.627147
         Iterations 5


In [21]:
## Creating model for A 
columns = ["time", "int", "L1", "A_1", "A_2", "Y", "Y_1", "Y_2", "L1_1", "L1_2"]
train_columns = ["L1", "A_1", "A_2", "Y", "Y_1", "Y_2", "L1_1", "L1_2"]
L1_model_df = pd.DataFrame(columns = columns)

for ii in range(t_delay, time+1): 
    temp_df = df[df.time == ii]
    L1_model = sm.Logit(np.asarray(temp_df["A"]), np.asarray(sm.add_constant(temp_df[train_columns]))).fit()
    L1_model_df = L1_model_df.append(pd.DataFrame([ii] + [L1_model.params[i] for i in range(0,9)], index = columns).transpose(), ignore_index=True)
   

Optimization terminated successfully.
         Current function value: 0.600753
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.638258
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.633307
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.581134
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.659461
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.609115
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.638397
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.652973
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.645955
         Iterations 5


For the first covariate, the model will look like this

$$ logistP\big[L_{2, t+1} \mid \overline{A}_t, \overline{L}_{1,t}, \overline{Y}_{t+1} \big] = \alpha_{0,t} + \alpha_1 Y_{t+1} + \alpha_2 Y_{t} + \alpha_3 Y_{t-1} +  \alpha_4 A_{t}+ \alpha_5 A_{t-1} + \alpha_6 L_{2,t} + \alpha_7 L_{2,t-1} + \alpha_8 L_{1,t+1} + \alpha_9 L_{1,t} + \alpha_{10} L_{1,t-1} $$

NOTE the assumption here that the model for $L_{2, t+1}$ includes the data point $L_{1,t+1}$ whereas in the above, the reverse was not true 

In [279]:
## Model for L2
columns = ["time", "int", "A", "L1", "A_1", "A_2", "Y", "Y_1", "Y_2", "L1_1", "L1_2", "L2_1", "L2_2"]
train_columns = ["A", "L1", "A_1", "A_2", "Y", "Y_1", "Y_2", "L1_1", "L1_2", "L2_1", "L2_2"]
L2_model_df = pd.DataFrame(columns = columns)
for ii in range(t_delay, time+1): 
    temp_df = df[df.time == ii]
    L2_model = sm.Logit(np.asarray(temp_df["L2"]), np.asarray(sm.add_constant(temp_df[train_columns]))).fit()
    L2_model_df = L2_model_df.append(pd.DataFrame([ii] + [L2_model.params[i] for i in range(0,12)], index = columns).transpose(), ignore_index=True)


Optimization terminated successfully.
         Current function value: 0.509789
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.482829
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.388738
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.464160
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.415793
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.423139
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.357553
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.430307
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.471741
         Iterations 6


Now, using these three models, we can calculate the following expectations

$$ \mathbb{E} \big[Y_{i}^{\overline{a}}\big]  = \sum_{i} \mathbb{E} \big[Y_i \mid A_{i-1} = a_{i-1},  \; A_{i-2} = a_{i-2}, \; L_{1, i-1}, \; L_{1, i-2}, \; L_{2, i-1}, \; L_{2, i-2} \big]$$ 

for the respective $\bar{a} = \bar{0}$ and $\bar{a} = \bar{1}$

In [376]:
## n = 1000 
columns = ["time", "index", "Y", "A", "L1", "L2"]
est_df = pd.DataFrame(columns = columns)

for ii in range(t_delay, time+1): 
    for jj in range(1, indiv+1): 
        A = int(df.A[(df.indiv == jj) & (df.time == ii)])
        
        temp = np.append(1,np.asarray(df[(df.indiv == jj) & (df.time == ii)])[0][[2,7,8,3,9,10,11,12,13,14]])
        L1 = sum(np.asarray(L1_model_df[L1_model_df.time == ii])[0][1:]*temp)
        
        temp2 = np.append(np.append(1, np.asarray(df[(df.indiv == jj) & (df.time == ii)])[0][[2]]), np.append(L1, np.asarray(df[(df.indiv == jj) & (df.time == ii)])[0][[7, 8, 9, 3, 10,11,12,13,14]]))
        L2 = sum(np.asarray(L2_model_df[L2_model_df.time == ii])[0][1:]*temp2)
        
        temp3 = np.append(np.array([1, A, L1]), np.append(L2, np.asarray(df[(df.indiv == jj) & (df.time == ii)])[0][[7,8,9,10,11,12,13,14]]))
        Y = sum(np.asarray(Y_model_df[Y_model_df.time == ii])[0][1:]*temp3)
        
        est_df.loc[len(est_df)+1] = [ii, jj, Y, A, L1, L2]

Calculating the final difference for each time, using the following
$$ \mathbb{E}\big[Y^{a=1}\big] - \mathbb{E}\big[Y^{a=0}\big]$$ 

In [377]:
for ii in range(t_delay, time+1): 
    print("time =", ii, ", " ,np.mean(est_df.Y[(est_df.time == ii) & (est_df.A == 1)])-np.mean(est_df.Y[(est_df.time == ii) & (est_df.A == 0)]))


time = 2 ,  0.46105361018825164
time = 3 ,  0.526697548525421
time = 4 ,  1.2878634635729955
time = 5 ,  1.9127264887202207
time = 6 ,  0.6644596028793268
time = 7 ,  1.3200164322851768
time = 8 ,  0.6295412209039655
time = 9 ,  -0.12736163858216631
time = 10 ,  0.8957420697237359
