We can establish our framework by defining the variables to be used.  We start with the outcome variable, $Y$.  Next, we consider out treatment variable, $A_k$, which takes on value 1 if there is treatment, and is 0 otherwise.  To begin, we will use treatment strategies that are exclusively treatment or exclusively no treatment, corresponding to $\overline{a} = (1,1,\dots,1) = \overline{1}$ and $\overline{a} = (0,0,\dots,0) = \overline{0}$ respectively.  The next measurable variable is $L$, which represents the covariate(s) to be included.  Note that both the covariates, $L$, and the outcome, $Y$ are affected by an unmeasured common cause, $U$.  The diagram below illustrates these relationships.  

![title](image4.png)

## Data Creation

We first need to simulate the data. We will build out the covariates of the population.  For simplicity, we will use one binary covariate, $L_1$.  

The data is being simulated using the following equations, where U is an underlying confounder.  

$$logitP[L_k] = \alpha_0 + \alpha_1 \cdot L_{k-1} + \alpha_2 \cdot L_{k-2} + \alpha_3 A_{k-1} + \alpha_4 A_{K-2} + \alpha_5 U$$ 


$$ logit[A_k] = \beta_0 + \beta_1 L_{k} + \beta_2 L_{k-1} + \beta_3 A_{k-1} + \beta_4 A_{K-2} $$ 

Then, an end $Y$ value will be pulled from the following 
$$ Y \sim N(\mu = U, \sigma = 1) $$ 

where $U \sim Unif(0.1, 1)$ 


## G-formula Simulation Study 

The purpose of this investigation is to measure the average causal effect of treatment, which can be estimated using 
$$ \mathbb{E}\big[Y^{a=1}\big] - \mathbb{E}\big[Y^{a=0}\big]$$ 
for the respective $\bar{a} = \bar{0}$ and $\bar{a} = \bar{1}$

We want to build out the g-formula as follows  
$$ \mathbb{E} \big[Y^{\overline{a}}\big]  = \sum_{l_i} \mathbb{E} \big[Y \mid A_{0} = a_{0},  \; A_{1} = a_{1}, \cdots, \; A_{t} = a_{t},  \; L_{0} = l_0, \; L_{1} = l_1, \cdots, \; L_{t} = l_t,\big] \prod_{k=0}^t P(L_k = l_k \mid \overline{L}_{k-1}, \overline{A}_{k-1})  $$

We can do this by building out two models, one for $Y$ And one for $L$.  We will begin by using a continous $Y$ and a binary $L$ for simplicity.  

For $Y$, we will use a linear regression, and the model will look something like this 

$$\mathbb{E} \big[Y \mid \overline{A}_t, \overline{L}_t \big] = \theta_{0} + \theta_1 A_{t}+ \theta_2 A_{t-1} + \cdots + \theta_j A_0 + \theta_{j+1} L_t + \theta_{j+2} L_{t-1} + \cdots + \theta_{j+k} L_0 $$ 

For each $L$, we will use a logistic regression, also calculated on a time delay of t=2.  This will give us something similar to 

$$ logit(L_k) = \gamma_0 + \gamma_1 L_{k-1} + \gamma_2 L_{k-2} + \gamma_3 A_{k-1} + \gamma_4 A_{k-2} $$ 

For the treatment $A$, we will also use a logistic regression on a time delay of 2, similar to, 
$$ logit(A_k) = \zeta_0 + \zeta_1 L_{k} + \zeta_2 L_{k-1} + \zeta_3 A_{k-1} + \zeta_4 A_{k-2} $$ 



Using these models, I simulated 500 mean casual effects for 500 individual datasets.  
<!---what quadratic time term needed to be added here? --->

## Results 
Mean of means of differences: 1.7217286172967964e-07

Variance of means of differences: 2.3894070465421745e-10


In [62]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk
import math
import csv
import statsmodels.api as sm
import statsmodels.formula.api as smf
import random
import matplotlib.pyplot as plt

In [127]:
#########################################################################
##[FUNCTION] data_creation simulates data for a given number of 
## individuals(indiv) over a set amount of time (max_time), and can 
## include as many covariates as desired (number_of_covariates)

## -- need to create the functionality for multiple covariates

#########################################################################


def data_creation(indiv, max_time, number_of_covariates, Y_full, alpha, beta): 

    columns = ["indiv", "time","U", "A", "Y",  "L1"]
    df = pd.DataFrame(columns = columns)
    
    for ii in range(1,indiv+1):
     
        ## creating an unobserved variable that affects covariates 
        U = np.random.uniform(low = 0.1, high = 1 )
            
        for jj in range(0, max_time+1): 
            if jj == 0: 
                x_L = alpha[0] + alpha[5]*U 
                L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))

                x_A = beta[0] + beta[1]*L1 
                A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

                df.loc[len(df)+1] = [ii, jj, U, A, "NaN",  L1]

            elif jj == 1: 
                x_L = np.sum(alpha*np.array([1, float(df["L1"][(df.time == jj-1) & (df.indiv == ii)]), \
                            0, float(df["A"][(df.time == jj-1) & (df.indiv == ii)]), 0, U]))
                
                L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))
                
                
                x_A = np.sum(beta*np.array([1.0,L1, float(df["L1"][(df.time == jj-1) & (df.indiv == ii)]), \
                      float(df["A"][(df.time == jj-1) & (df.indiv == ii)]), 0.0]))
                
                A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

                df.loc[len(df)+1] = [ii, jj, U, A, "NaN", L1]

            else: 
                x_L = np.sum(alpha*np.array([1, float(df["L1"][(df.time == jj-1) & (df.indiv == ii)]), \
                      float(df["L1"][(df.time == jj-2) & (df.indiv == ii)]), float(df["A"]\
                      [(df.time == jj-1) & (df.indiv == ii)]), float(df["A"][(df.time == jj-2)\
                      & (df.indiv == ii)]), U]))
                
                L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))


                x_A = np.sum(beta*np.array([1,L1,float(df["L1"][(df.time == jj-1) & (df.indiv == ii)]), \
                      float(df["A"][(df.time == jj-1) & (df.indiv == ii)]), float(df["A"]\
                    [(df.time == jj-2) & (df.indiv == ii)])]))
                
                A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

                if jj == max_time: 
                    x_Y = 0.5 + U 
                    Y = np.random.binomial(n=1, p = np.exp(x_Y)/(1+np.exp(x_Y)))
                    df.loc[len(df)+1] = [ii, jj, U, A, Y, L1]

                else: 
                    df.loc[len(df)+1] = [ii, jj, U, A, "NaN", L1]

    # creating shifted values 
    if Y_full == "TRUE":
        for kk in range(1,max_time+1):
            df["L1_"+str(kk)] = df.L1.shift(kk)
            df["A_"+str(kk)] = df.A.shift(kk)
    else:
        for kk in range(1,4):
            df["L1_"+str(kk)] = df.L1.shift(kk)
            df["A_"+str(kk)] = df.A.shift(kk)

    return(df); 

In [202]:
#########################################################################
##[FUNCTION] data_creation simulates data for a given number of 
## individuals(indiv) over a set amount of time (max_time), and can 
## include as many covariates as desired (number_of_covariates)

## -- need to create the functionality for multiple covariates

#########################################################################


def data_creation2(indiv, max_time, number_of_covariates, Y_full, alpha, beta): 

    columns = ["indiv", "time","U", "A", "Y",  "L1"]
    df = pd.DataFrame(columns = columns)
     
    ## creating an unobserved variable that affects covariates 
    U = np.random.uniform(low = 0.1, high = 1, size = indiv)
            
    for jj in range(0, max_time+1): 
        if jj == 0: 
            x_L = alpha[0] + alpha[5]*U 
            L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))

            x_A = beta[0] + beta[1]*L1 
            A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

            df = pd.DataFrame({"indiv":range(1,indiv+1), "time":jj,"U":U, "A":A, "Y":["Nan"]*indiv, "L1":L1})
            
        elif jj == 1: 
            x_L = np.sum(alpha*np.transpose(np.array([[1.0]*indiv, df["L1"][(df.time == jj-1)], \
                        [0.0]*indiv, df["A"][(df.time == jj-1)], [0.0]*indiv, U])), axis = 1)

            L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))


            x_A = np.sum(beta*np.transpose(np.array([[1.0]*indiv, L1, df["L1"][(df.time == jj-1)],\
                  df["A"][(df.time == jj-1)], [0.0]*indiv ])), axis = 1)
                         
            A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

            temp_df = pd.DataFrame({"indiv":range(1,indiv+1), "time":jj, "U":U, "A":A, "Y":["Nan"]*indiv, "L1":L1})
            df = pd.concat([df, temp_df])

        else: 
            x_L = np.sum(alpha*np.transpose(np.array([[1.0]*indiv, df["L1"][(df.time == jj-1)], \
                  df["L1"][(df.time == jj-2)], df["A"][(df.time == jj-1)], \
                  df["A"][(df.time == jj-2)], U])), axis = 1)

            L1 = np.random.binomial(n=1, p = np.exp(x_L)/(1+np.exp(x_L)))


            x_A = np.sum(beta*np.transpose(np.array([[1.0]*indiv,L1,df["L1"][(df.time == jj-1)], \
                  df["A"][(df.time == jj-1)] , df["A"][(df.time == jj-2)]])), axis = 1)

            A = np.random.binomial(n=1, p = np.exp(x_A)/(1+np.exp(x_A)))

            if jj == max_time: 
                x_Y = 0.5 + U 
                Y = np.random.binomial(n=1, p = np.exp(x_Y)/(1+np.exp(x_Y)))                
                temp_df = pd.DataFrame({"indiv":range(1,indiv+1), "time":jj,"U":U, "A":A, "Y":Y, "L1":L1})
                df = pd.concat([df, temp_df])


            else: 
                temp_df = pd.DataFrame({"indiv":range(1,indiv+1), "time":jj,"U":U, "A":A, "Y":["Nan"]*indiv, "L1":L1})
                df = pd.concat([df, temp_df])


    # creating shifted values 
    if Y_full == "TRUE":
        for kk in range(1,max_time+1):
            df["L1_"+str(kk)] = df.L1.shift(kk)
            df["A_"+str(kk)] = df.A.shift(kk)
    else:
        for kk in range(1,4):
            df["L1_"+str(kk)] = df.L1.shift(kk)
            df["A_"+str(kk)] = df.A.shift(kk)
            
    df = df.reset_index(drop=True)

    return(df); 

In [222]:
#########################################################################
##[FUNCTION] Y_model_creation creates the linear regression model for 
## the observed Ys based on the treatments (A) and covariates (L)  

#########################################################################

def Y_model_creation(df, max_time): 
    temp_df = df[df.time == max_time]
    train_columns ='+'.join(map(str, np.append(list(df)[0:2],list(df)[6:])))
    temp_df = temp_df.astype(float)
    Y_model = smf.ols("Y~"+train_columns, data=temp_df).fit(); 
    return(Y_model)

In [129]:
#########################################################################
##[FUNCTION] covariate_model_creation creates the logistic regression 
## for the observed covariate (L) data from the previous covariates 
## and the previous treatments (A) 

## -- need to create the functionality for multiple covariates
## SHOULD THIS BE FOR ALL HISTORY UP TO THAT POINT TO BE MORE 
## ACCURATE WHEN CALCULATING THE EXPECTATION??? 

#########################################################################


def covariate_model_creation(df, max_time): 
    columns = ["time", "gamma_0", "gamma_1", "gamma_2", "gamma_3", "gamma_4", \
              "gamma_5", "gamma_6"]
    train_columns = ["L1_1", "L1_2", "L1_3", "A_1", "A_2",  "A_3"]
    L1_model_df = pd.DataFrame(columns = columns)

    for ii in range(1, (max_time+1)): 
        temp_df = df[df.time == ii] 
        if ii == 1: 
            L1_model = sm.Logit(np.asarray(temp_df["L1"]), \
                       np.asarray(sm.add_constant(temp_df[["L1_1", "A_1"]]))).fit(); 
            L1_model_df = L1_model_df.append(pd.DataFrame([ii] + \
                         [L1_model.params[i] for i in range(0,2)] + ["Nan"] + \
                         ["Nan"] + [L1_model.params[2]] + ["Nan"] + ["Nan"], \
                         index = columns).transpose(), ignore_index=True)
        elif ii == 2: 
            L1_model = sm.Logit(np.asarray(temp_df["L1"]), \
                       np.asarray(sm.add_constant(temp_df[["L1_1", "L1_2", \
                       "A_1", "A_2"]]))).fit(); 
            L1_model_df = L1_model_df.append(pd.DataFrame([ii] + [L1_model.params[i] \
                          for i in range(0,3)] + ["Nan"] + [L1_model.params[i] for i \
                          in range(3,5)] + ["Nan"], index = columns).transpose(), ignore_index=True)
        else: 
            L1_model = sm.Logit(np.asarray(temp_df["L1"]), \
                       np.asarray(sm.add_constant(temp_df[train_columns]))).fit(); 
            L1_model_df = L1_model_df.append(pd.DataFrame([ii] + [L1_model.params[i] for \
                          i in range(0,7)], index = columns).transpose(), ignore_index=True)
    return(L1_model_df)

$$ logit(L_k) = \gamma_0 + \gamma_1 L_{k-1} + \gamma_2 L_{k-2} + \gamma_3 L_{k-3} + \gamma_4 A_{k-1} + \gamma_5 A_{k-2} + \gamma_6 A_{k-3} $$ 

$$ logit(A_k) = \zeta_0 + \zeta_1 L_{k} + \zeta_2 L_{k-1} + \zeta_3 A_{k-1} + \zeta_4 A_{k-2} $$ 

In [130]:
#########################################################################
##[FUNCTION] treatment_model_creation creates the logistic regression 
## for the observed treatment (A) data from the current and previous 
## covariates and the previous treatments (A) 

## -- need to create the functionality for multiple covariates
#########################################################################


def treatment_model_creation(df, max_time): 
    columns = ["time", "zeta_0", "zeta_1", "zeta_2", "zeta_3", "zeta_4"]
    train_columns = ["L1", "L1_1", "A_1", "A_2"]
    A_model_df = pd.DataFrame(columns = columns)

    for ii in range(1, (max_time+1)): 
        temp_df = df[df.time == ii]   
        if ii == 1: 
            A_model = sm.Logit(np.asarray(temp_df["A"]), np.asarray(sm.add_constant(temp_df[["L1", \
                      "L1_1", "A_1"]]))).fit()
            A_model_df = A_model_df.append(pd.DataFrame([ii] + [A_model.params[i] for i in range(0,4)]\
                         + ["Nan"], index = columns).transpose(), ignore_index=True)
        else: 
            A_model = sm.Logit(np.asarray(temp_df["A"]), np.asarray(sm.add_constant(temp_df[train_columns]))).fit()
            A_model_df = A_model_df.append(pd.DataFrame([ii] + [A_model.params[i] for i in range(0,5)],\
                         index = columns).transpose(), ignore_index=True)
    return(A_model_df)


In [277]:
#########################################################################
##[FUNCTION] simulation_run calculates the causal effect over an  
## established number of repetitions using the models for outcome (Y) 
## and the covariates (L) 

## -- need to create the functionality for multiple covariates

#########################################################################


def simulation_run(df, Y_model, L1_model_df, max_time, Y_full, test_value): 
    reps = 10000
    final_results = np.empty(reps) 

    ### establishing treatment of interest
    A_test = [test_value]*max_time

    for ii in range(0,reps):
        values = np.empty(max_time)
        values[0] = random.choice(list(df["L1"][df["time"] == 0]))
        if values[0] == 0: 
            prod = 1-np.mean(list(df["L1"][df["time"] == 0]))
        else: 
            prod = np.mean(list(df["L1"][df["time"] == 0]))

        for jj in range(1, max_time):
            if jj == 1: 
                values[jj] = np.sum(np.array([L1_model_df.ix[jj-1,][i] for i \
                             in [1,2,5]])*[1.0,values[jj-1],A_test[jj-1]])
            elif jj == 2: 
                values[jj] = np.sum(np.array([L1_model_df.ix[jj-1,][i] for i \
                             in [1,2,3,5,6]])*[1.0,values[jj-1],values[jj-2], \
                             A_test[jj-1], A_test[jj-2]])
            else: 
                values[jj] = np.sum(np.array([L1_model_df.ix[jj-1,][i] for i \
                             in range(1,8)])*[1.0,values[jj-1],values[jj-2], \
                             values[jj-2], A_test[jj-1], A_test[jj-2], A_test[jj-3]])
            prod = prod*(np.exp(values[jj])/(1+np.exp(values[jj])))

        if Y_full == "TRUE": 
            list1 = [A_test[max_time-i] for i in range(1,max_time+2)]
            list2 = [values[max_time-i] for i in range(1,max_time+2)]

        else: 
            list1 = [A_test[max_time-i] for i in range(1,5)]
            list2 = [values[max_time-i] for i in range(1,5)]
            
        result = [None]*(len(list1)+len(list2))
        result[::2] = list1
        result[1::2] = list2
        result = [1] + result

        Y_exp = np.sum(np.array(Y_model.params)*result)

        final_results[ii] = prod*Y_exp

    return(np.mean(final_results)) 

In [459]:
#########################################################################
##[FUNCTION] simulation_run calculates the causal effect over an  
## established number of repetitions using the models for outcome (Y) 
## and the covariates (L) 

## -- need to create the functionality for multiple covariates

#########################################################################


def simulation_run2(df, Y_model, L1_model_df, max_time, Y_full, test_value): 
    reps = 1000
    final_results = np.empty(reps) 

    ### establishing treatment of interest
    A_test = [test_value]*(max_time+1) 

    values = pd.DataFrame(np.random.choice(np.array(df["L1"][df["time"] == 0]), reps))
    prod = np.empty(reps) 

    prod[values[[0]] == 0 ] = 1-np.mean(list(df["L1"][df["time"] == 0]))
    prod[values[[0]] != 0] = np.mean(list(df["L1"][df["time"] == 0]))

    values[1] = np.sum(np.array([L1_model_df.ix[0,][i] for i in [1,2,5]])*np.transpose(np.array([[1.0]*reps,list(values[0]),[A_test[0]]*reps])), axis = 1)
    prod = prod*(np.exp(values[1])/(1+np.exp(values[1])))

    values[2] = np.sum(np.array([L1_model_df.ix[1,][i] for i \
                         in [1,2,3,5,6]])*np.transpose(np.array([[1.0]*reps, list(values[1]),list(values[0]), \
                         [A_test[1]]*reps, [A_test[0]]*reps])), axis = 1 )
    prod = prod*(np.exp(values[2])/(1+np.exp(values[2])))                                                            

    for jj in range(3, max_time+1):
        values[jj] = np.sum(np.array([L1_model_df.ix[jj-1,][i] for i in range(1,8)])*np.transpose(np.array([[1.0]*reps,list(values[jj-1]),list(values[jj-2]), list(values[jj-2]), [A_test[jj-1]]*reps, [A_test[jj-2]]*reps, [A_test[jj-3]]*reps])), axis = 1)
        prod = prod*(np.exp(values[jj])/(1+np.exp(values[jj])))
        
    values.head(250)

    if Y_full == "TRUE": 
        Y_A = [A_test]*reps
        Y_L = np.array(values)
        Y_exp = np.array(Y_model.params[0])*([1.0]*reps) + np.sum(Y_A*np.array([Y_model.params[i]\
                for i in [1,4,6,8,10,12,14,16,18,20,22,24]]), axis = 1)+np.sum([Y_model.params[i] for i in \
                [2,3,5,7,9,11,13,15,17,19,21,23]]*Y_L, axis = 1)
    else: 
        Y_A = [A_test*4]*reps
        Y_L = np.array([values[0], values[1], values[2], values[3], values[4]])
        Y_exp = np.array(Y_model.params[0])*([1.0]*reps) + np.sum(Y_A*np.array([Y_model.params[i]\
                for i in [1,4,6,8]]), axis = 1)+np.sum([Y_model.params[i] for i in \
                [2,3,5,7]]*Y_L, axis = 1)

    return(np.mean(prod*Y_exp)) 

In [458]:
reps = 1000
final_results = np.empty(reps) 

### establishing treatment of interest
A_test = [test_value]*(max_time+1) 

values = pd.DataFrame(np.random.choice(np.array(df["L1"][df["time"] == 0]), reps))
prod = np.empty(reps) 

prod[values[[0]] == 0 ] = 1-np.mean(list(df["L1"][df["time"] == 0]))
prod[values[[0]] != 0] = np.mean(list(df["L1"][df["time"] == 0]))

values[1] = np.sum(np.array([L1_model_df.ix[0,][i] for i in [1,2,5]])*np.transpose(np.array([[1.0]*reps,list(values[0]),[A_test[0]]*reps])), axis = 1)
prod = prod*(np.exp(values[1])/(1+np.exp(values[1])))


values[2] = np.sum(np.array([L1_model_df.ix[1,][i] for i in [1,2,3,5,6]])*np.transpose(np.array([[1.0]*reps, list(values[0]), list(values[1]), [A_test[1]]*reps, [A_test[0]]*reps])), axis =1)
prod = prod*(np.exp(values[2])/(1+np.exp(values[2])))                                                            


for jj in range(3, max_time+1):
  values[jj] = np.sum(np.array([L1_model_df.ix[jj-1,][i] for i in range(1,8)])*np.transpose(np.array([[1.0]*reps,list(values[jj-1]),list(values[jj-2]), list(values[jj-2]), [A_test[jj-1]]*reps, [A_test[jj-2]]*reps, [A_test[jj-3]]*reps])), axis = 1)
  prod = prod*(np.exp(values[jj])/(1+np.exp(values[jj])))

values.head(250)



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0,1.111254,1.067103,1.214161,1.665270,2.240954,0.725185,1.243461,1.677858,1.541395,1.811500,1.025072
1,0,1.111254,1.067103,1.214161,1.665270,2.240954,0.725185,1.243461,1.677858,1.541395,1.811500,1.025072
2,1,1.102624,1.037928,1.212345,1.647352,2.241282,0.726519,1.243407,1.678194,1.541529,1.811529,1.025040
3,0,1.111254,1.067103,1.214161,1.665270,2.240954,0.725185,1.243461,1.677858,1.541395,1.811500,1.025072
4,1,1.102624,1.037928,1.212345,1.647352,2.241282,0.726519,1.243407,1.678194,1.541529,1.811529,1.025040
5,1,1.102624,1.037928,1.212345,1.647352,2.241282,0.726519,1.243407,1.678194,1.541529,1.811529,1.025040
6,1,1.102624,1.037928,1.212345,1.647352,2.241282,0.726519,1.243407,1.678194,1.541529,1.811529,1.025040
7,1,1.102624,1.037928,1.212345,1.647352,2.241282,0.726519,1.243407,1.678194,1.541529,1.811529,1.025040
8,1,1.102624,1.037928,1.212345,1.647352,2.241282,0.726519,1.243407,1.678194,1.541529,1.811529,1.025040
9,0,1.111254,1.067103,1.214161,1.665270,2.240954,0.725185,1.243461,1.677858,1.541395,1.811500,1.025072


In [190]:
## establishing constants 
indiv = 500   ## number of individuals in study 
max_time = 10 ## number of time points being considered 
t_delay = 2 ## number of time delays included in model 
num_sims = 50 
results = np.empty(num_sims)

## RUNNING SIMULATIONS 
start_time = time.time() 
for ii in range(num_sims):
    print(ii) 
    df = data_creation(indiv,max_time, 2, "TRUE") 
    Y_model = Y_model_creation(df, max_time)
    L1_model_df = covariate_model_creation(df, max_time)
    results[ii] = simulation_run(df, Y_model, L1_model_df, max_time, "TRUE")

elapsed_time = time.time() - start_time

0
Optimization terminated successfully.
         Current function value: 0.521503
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.493448
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.504559
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.553813
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.416014
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.474845
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.492815
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.472614
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.484225
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.490118


In [180]:
# plt.hist(list(results[~np.isnan(results)]))

In [196]:
print(np.mean(results))
print(np.var(results)/50)

0.133507746116
0.00710806020948


In [398]:
Results = pd.DataFrame(results)
Results.to_csv("SIM_RESULTS")

## DOUBLY ROBUST METHOD

$$ logit[P(A_{m,i} = 1 \mid \bar{l}_{m,i}, \bar{a}_{m-1,i}; \alpha )] = w_m (\bar{l}_{m,i}, \bar{a}_{m-1,i}; \alpha) $$ 



"Correct" model:
$$logit(P[\hat{A}_{m,i}]) = \alpha_0 + \alpha_1 \cdot L_{m,i} + \alpha_2 \cdot A_{m-1,i} + \alpha_3 \cdot L_{m-1,i} + \alpha_4 \cdot L_{m-2,i} + \alpha_5 \cdot A_{m-2,i} + \alpha_6 \cdot A_{m-3,i}$$ 

"Incorrect model":
$$logit(P[\hat{A}_{m,i}]) = \alpha_0 + \alpha_1 \cdot L_{m-3,i} + \alpha_2 \cdot A_{m-3,i}$$ 


In [68]:
#########################################################################
##[FUNCTION] pi_function creates the w_m function given the following:
## the alpha model of A_{m,i}, the dataframe, the time (m), and an 
## indicator of whether this is the correct or incorrect model 

## do I need to do something in here like 1-expit for those A_j == 0?? 
## i.e. what I did in the last line here 
#########################################################################

def pi_function(m, alpha_model, df, indiv, alpha_wrong): 
    product = [1]*indiv
    for jj in range(2, m+1): 
        if alpha_wrong == "FALSE": 
            x = np.sum(alpha_model*np.array(sm.add_constant(df[df.time == jj][["L1", "A_1", \
                "L1_1", "L1_2", "A_2"]])), axis = 1) 
        else: 
            x = np.sum(alpha_model*np.array(sm.add_constant(df[df.time == jj][["L1_3", "A_3"]])), axis = 1)
        product = product*sp.special.expit(x)
    
    x = np.array(np.divide([1]*indiv, product))
    x[np.where(df[df.time == m]["A_1"] == 0.0)] = 1 - x[np.where(df[df.time == m]["A_1"] == 0.0)]
    return(x)    

In [210]:
#########################################################################
##[FUNCTION] alpha_model_creation creates the logistic regression 
## for the observed treatment (A) data from the current and previous 
## covariates and the previous treatments (A) over all time periods and
## individuals 

## -- need to create the functionality for multiple covariates
#########################################################################


def alpha_model_creation(df, wrong): 
    alpha_df = pd.DataFrame(columns = ["A", "l", "a_1", "l_1", "l_2", "l_3", "a_2", "a_3"])
     
    for ii in range(1,len(df)): 
        if df.loc[ii]["time"] > 2.0:
            alpha_df.loc[len(alpha_df)+1] = [df.loc[ii].A, df.loc[ii].L1, df.loc[ii]["A_1"], \
                                             df.loc[ii]["L1_1"], df.loc[ii]["L1_2"],  df.loc[ii]["L1_3"], \
                                             df.loc[ii]["A_2"], df.loc[ii]["A_3"]]

    if wrong == "TRUE":
        alpha_model = sm.Logit(np.asarray(alpha_df.A),np.asarray(sm.add_constant(alpha_df[["l_3", "a_3"]]))).fit().params
    else: 
        alpha_model = sm.Logit(np.asarray(alpha_df.A),np.asarray(sm.add_constant(alpha_df[["l", "a_1", "l_1",\
                      "l_2", "a_2"]]))).fit().params
    return(alpha_model)  

In [70]:
#########################################################################
##[FUNCTION] DR_estimate_creation calculates the causal effect for a 
## given treatment of interest (test_value), including an indicator 
## of whether the correct or incorrect model is being used 

#########################################################################

def DR_estimate_creation(test_value, max_time, df, indiv, wrong_model):
    alpha_model = alpha_model_creation(df,wrong_model)
    
    A_test = [test_value]*indiv 
    model_df = pd.DataFrame(columns = ["time", "beta_0", "beta_1", "beta_2", \
                "beta_3", "beta_4", "beta_5", "beta_6", "phi"])
    time.counter = max_time
    T = list(df[df.time == max_time].Y)

    while(time.counter > 2.0): 
        time_df = df.loc[df.time == time.counter]
        time_df["T"] = np.array(T)
        pi = pi_function(time.counter, alpha_model, df, indiv, wrong_model) 
        time_df["pi"] = pi 
        train_columns ='+'.join(map(str, np.append(list(time_df)[6:12], \
                        list(time_df)[13])))
        time_df = time_df.astype(float)
        S_model = smf.ols("T~"+train_columns, data=time_df).fit()
        model_df = model_df.append(pd.DataFrame([time.counter] + \
                   [S_model.params[i] for i in range(0,8)]).transpose(), ignore_index=True)
        time_df["A_1"] = np.array(A_test)
        new_T = np.sum([S_model.params[i] for i in range(0,8)]*\
                np.array(sm.add_constant(time_df.loc[:,np.append(list(time_df)[6:12], \
                list(time_df)[13])], has_constant='add')), axis=1)
        T = sp.special.expit(new_T)
        time.counter = time.counter-1
    
    return(np.nanmean(T))  

In [462]:
## CONSTANTS 
alpha = np.random.uniform(low = -1.0, high = 1.0, size = 6)
beta = np.random.uniform(low = -1.0, high = 1.0, size = 5)
alpha[5] = alpha[5] + 1.5
indiv = 1000 
max_time = 11
num_sims = 5
results_g_formula = np.empty(num_sims)
results_dr_estimator = np.empty(num_sims)

for ii in range(0, num_sims): 
    print(ii) 
    
    df = data_creation2(indiv, max_time, 1, "TRUE", alpha, beta) 
    Y_model = Y_model_creation(df, max_time)
    L1_model_df = covariate_model_creation(df, max_time)
    results_g_formula[ii] = simulation_run2(df, Y_model, L1_model_df, max_time, "TRUE", 1) - simulation_run2(df, Y_model, L1_model_df, max_time, "TRUE", 0)
    
    df = df.iloc[:,0:12]
    results_dr_estimator[ii] = DR_estimate_creation(1.0, max_time, df, indiv, "TRUE")-\
    DR_estimate_creation(0.0, max_time, df, indiv, "TRUE")

0
Optimization terminated successfully.
         Current function value: 0.570341
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.491204
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.487255
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.502533
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.496035
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.505612
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.487435
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.489860
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.472231
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.509489




Optimization terminated successfully.
         Current function value: 0.683707
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


Optimization terminated successfully.
         Current function value: 0.683707
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


1
Optimization terminated successfully.
         Current function value: 0.563820
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.513432
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.498012
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.509849
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.522858
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.473487
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.475601
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.476730
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.520972
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.500733




Optimization terminated successfully.
         Current function value: 0.683277
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


Optimization terminated successfully.
         Current function value: 0.683277
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


2
Optimization terminated successfully.
         Current function value: 0.573787
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.509179
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.501638
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.487718
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.490745
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.463192
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.479191
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.473661
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.495298
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.482227




Optimization terminated successfully.
         Current function value: 0.684444
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


Optimization terminated successfully.
         Current function value: 0.684444
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


3
Optimization terminated successfully.
         Current function value: 0.595912
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.517846
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.471329
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.479417
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.488121
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.486653
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.491764
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.495465
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.501917
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.503183




Optimization terminated successfully.
         Current function value: 0.682615
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


Optimization terminated successfully.
         Current function value: 0.682615
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


4
Optimization terminated successfully.
         Current function value: 0.553485
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.528446
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.462136
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.509402
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.495586
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.486505
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.497313
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.514074
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.506890
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.483361




Optimization terminated successfully.
         Current function value: 0.682557
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


Optimization terminated successfully.
         Current function value: 0.682557
         Iterations 4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result = result.union(other)
  result = result.union(other)


In [432]:
print(np.mean(results_g_formula))
print(np.var(results_g_formula/num_sims))

nan
nan


In [435]:
print(np.mean(results_dr_estimator))
print(np.sqrt(np.var(results_dr_estimator)/num_sims)) 

2.06147067317e-05
0.000338003249718


In [434]:
results_g_formula

array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  n