# Chapter 14: G-estimation of Structural Nested Models
This notebook goes through Chapter 14 of “Hernán MA, Robins JM (2019). Causal Inference. Boca Raton: Chapman & Hall/CRC, forthcoming”, which details g-estimation of structural nested models. Within this notebook, I will use zEpid to recreate the analyses detailed in Chapter 14. As an introduction to causal inference and the associated methods, I highly recommend reviewing this book; the preprint is available for free at: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

## Data Preparation
Data comes from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). The NHEFS was jointly initiated by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the United States Public Health Service. A detailed description of the NHEFS, together with publicly available data sets and documentation, can be found at wwwn.cdc.gov/nchs/nhanes/nhefs/

The data set used in the book and this tutorial is a subset of the full NHEFS. First, we will load the data and run some basic variable manipulations.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from zepid.causal.snm import GEstimationSNM

df = pd.read_csv('Data/nhefs.csv')

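# Recoding some variables. These binary versions are not used by the models
# below (which use the categorical C(...) terms) and are dropped by the subset.
df['inactive'] = np.where(df['active'] == 2, 1, 0)
df['no_exercise'] = np.where(df['exercise'] == 2, 1, 0)
df['university'] = np.where(df['education'] == 5, 1, 0)

# Keeping only the columns used in this chapter; .copy() avoids pandas'
# SettingWithCopyWarning when the squared terms are added below
df = df[['death', 'qsmk', 'sex', 'race', 'age', 'education',
         'smokeintensity', 'smokeyrs', 'exercise', 'active', 'wt71', 'wt82_71']].copy()

# Squared terms for the continuous covariates
df['age_sq'] = df['age']**2
df['smkyr_sq'] = df['smokeyrs']**2
df['wt71_sq'] = df['wt71']**2
df['smkint_sq'] = df['smokeintensity']**2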

# Treatment model
pi_model = ('sex + race + age + age_sq + C(education) + smokeyrs + smkyr_sq + '
            'C(exercise) + C(active) + wt71 + wt71_sq + smokeintensity + smkint_sq')
# Missing outcome data model
m_model = ('qsmk + sex + race + age + age_sq + C(education) + smokeyrs + smkyr_sq + '
           'C(exercise) + C(active) + wt71 + wt71_sq + smokeintensity + smkint_sq')

As described in the book, inverse probability of censoring weights should be used to deal with missing outcome data. `GEstimationSNM` can calculate these weights automatically through the `missing_model()` method. We will use this to account for missing outcome data.
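To make the idea of inverse probability of censoring weights concrete, here is a minimal sketch in plain Python. It assumes a single categorical covariate, so the probability of being observed can be estimated by simple stratum proportions. The function name `ipc_weights` and the toy data are hypothetical, and this is not how `missing_model()` is implemented internally (the summary output indicates it fits a logistic model).

```python
from collections import defaultdict

def ipc_weights(strata, observed):
    """Unstabilized weights: 1 / Pr(observed | stratum) for observed rows, 0 otherwise.

    Assumption: Pr(observed | stratum) is estimated by the stratum proportion,
    a stand-in for the fitted missing-outcome model.
    """
    n = defaultdict(int)
    n_obs = defaultdict(int)
    for s, o in zip(strata, observed):
        n[s] += 1
        n_obs[s] += o
    # Observed rows are up-weighted to also represent the censored rows of their stratum
    return [n[s] / n_obs[s] if o else 0.0 for s, o in zip(strata, observed)]

# Hypothetical toy data: 8 people in 2 strata, 3 with missing outcomes
strata = [0, 0, 0, 0, 1, 1, 1, 1]
observed = [1, 1, 1, 0, 1, 1, 0, 0]
weights = ipc_weights(strata, observed)
print(weights)
```

In each stratum the weights of the observed rows add up to the full stratum size, which is the sense in which the weighted complete cases stand in for everyone.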

## Section 14.5
We will now estimate the following structural nested mean model
$$E[Y^a - Y^{a=0} | A=a, L] = \psi a$$
We will diverge slightly from the book. In the book, they first demonstrate an inefficient method to solve for $\psi$. `GEstimationSNM` has two options available: a grid-search and a closed-form solution. Both produce the same results, but the closed-form solution is much faster.

The grid-search approach uses the Nelder-Mead algorithm to search for the value of $\psi$. Since we are not evaluating the entire parameter space, we cannot get confidence intervals directly from the search. Instead, a nonparametric bootstrap can be used.
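The logic behind a search-based solver can be sketched as follows: for each candidate $\psi$, compute $H(\psi) = Y - \psi A$ and check how strongly it is still associated with treatment given $L$; the estimate is the $\psi$ where that association vanishes. The sketch below is a simplification under stated assumptions: simulated toy data, a single binary confounder, $\Pr(A=1|L)$ estimated by stratum means, a plain fixed grid instead of Nelder-Mead, and a covariance-type estimating function instead of a fitted logistic coefficient. All function names are hypothetical and it is not zEpid's implementation.

```python
import random

def simulate(n=2000, psi_true=3.0, seed=0):
    """Hypothetical toy data: binary confounder L, treatment A, outcome Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < (0.7 if l else 0.3) else 0
        y = 1.0 + 2.0 * l + psi_true * a + rng.gauss(0.0, 1.0)
        rows.append((l, a, y))
    return rows

def propensity_by_stratum(rows):
    """Estimate Pr(A=1 | L) as the treated proportion in each stratum of L."""
    n, n1 = {}, {}
    for l, a, _ in rows:
        n[l] = n.get(l, 0) + 1
        n1[l] = n1.get(l, 0) + a
    return {l: n1[l] / n[l] for l in n}

def estimating_function(psi, rows, p):
    """Sum of (A - Pr(A=1|L)) * H(psi), where H(psi) = Y - psi * A."""
    return sum((a - p[l]) * (y - psi * a) for l, a, y in rows)

def search_psi(rows, p, lo=0.0, hi=6.0, step=0.01):
    """Scan candidate values and keep the psi with the smallest |estimating function|."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda psi: abs(estimating_function(psi, rows, p)))

rows = simulate()
p = propensity_by_stratum(rows)
psi_hat = search_psi(rows, p)
print(round(psi_hat, 2))  # should land near the simulated effect of 3
```

The resolution of the answer is limited by the grid step, and every candidate requires a pass over the data, which is why a search is slower than solving for $\psi$ directly.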

### Grid-Search
The following code uses the grid-search approach.

In [2]:
# Initializing G-estimation 
snm = GEstimationSNM(df, exposure='qsmk', outcome='wt82_71')

# Specifying Pr(A=1|L) model
snm.exposure_model(model=pi_model, print_results=False)

# Specifying censoring model
snm.missing_model(m_model, stabilized=False, print_results=False)

# Specifying SNM
snm.structural_nested_model(model='qsmk')

# Grid-search solution
snm.fit(solver='search')
snm.summary(decimal=4)

           G-estimation of Structural Nested Mean Model               
Treatment:        qsmk                   No. Observations:     1629      
Outcome:          wt82_71                No. Missing Outcome:  63        
Missing model:    Logistic       
Method:           Nelder-Mead              No. Iterations:   38        
Alpha values:     0                        Optimized:        True      
----------------------------------------------------------------------
SNM:     psi*qsmk
----------------------------------------------------------------------
qsmk                      3.4459                        


This is the same answer as detailed in the book. You can also compare to the code available at: https://github.com/jrfiedler/causal_inference_python_code/blob/master/chapter14.ipynb. My procedure runs a little faster than that notebook. I did not run the confidence interval procedure here because it would take longer than necessary. For confidence intervals, the closed-form solution is much faster.
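For reference, a nonparametric percentile bootstrap can be sketched as below. This is an illustration on simulated toy data, not the NHEFS analysis: the estimator is a simple one-parameter closed-form g-estimator with $\Pr(A=1|L)$ taken as stratum means, and the names `closed_form_psi` and `bootstrap_ci` are hypothetical. In the actual analysis, each resampled data set would instead be passed through the full `GEstimationSNM` workflow shown above.

```python
import random

def closed_form_psi(rows):
    """One-parameter estimate, with Pr(A=1|L) estimated by stratum means (assumption)."""
    n, n1 = {}, {}
    for l, a, _ in rows:
        n[l] = n.get(l, 0) + 1
        n1[l] = n1.get(l, 0) + a
    num = sum((a - n1[l] / n[l]) * y for l, a, y in rows)
    den = sum((a - n1[l] / n[l]) * a for l, a, _ in rows)
    return num / den

def bootstrap_ci(rows, n_boot=200, seed=1):
    """Approximate 95% percentile interval from resampling rows with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        closed_form_psi(rng.choices(rows, k=len(rows))) for _ in range(n_boot)
    )
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper

# Hypothetical toy data: binary confounder L, treatment A, true effect of 3
rng = random.Random(0)
rows = []
for _ in range(1000):
    l = 1 if rng.random() < 0.5 else 0
    a = 1 if rng.random() < (0.7 if l else 0.3) else 0
    rows.append((l, a, 1.0 + 2.0 * l + 3.0 * a + rng.gauss(0.0, 1.0)))

point = closed_form_psi(rows)
lower, upper = bootstrap_ci(rows)
print(round(point, 2), round(lower, 2), round(upper, 2))
```

Note that the treatment model is re-estimated inside every resample, so the interval reflects the uncertainty from that step as well.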

## Section 14.6
We will now use the closed-form solution for the g-estimation procedure. The code is below.

In [3]:
# Initializing G-estimation 
snm = GEstimationSNM(df, exposure='qsmk', outcome='wt82_71')
snm.exposure_model(model=pi_model, print_results=False)
snm.structural_nested_model(model='qsmk')
snm.missing_model(m_model, stabilized=False, print_results=False)
snm.fit(solver='closed')
snm.summary(decimal=4)

           G-estimation of Structural Nested Mean Model               
Treatment:        qsmk                   No. Observations:     1629      
Outcome:          wt82_71                No. Missing Outcome:  63        
Missing model:    Logistic       
Method:           Closed-form
----------------------------------------------------------------------
SNM:     psi*qsmk
----------------------------------------------------------------------
qsmk                      3.4459                        
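For intuition on why no search is needed, note that for a linear one-parameter model the estimating equation $\sum_i W_i \, (A_i - \Pr(A=1|L_i)) \, (Y_i - \psi A_i) = 0$ is linear in $\psi$ and can be solved directly. The sketch below works through a tiny hand-made data set. It assumes $\Pr(A=1|L)$ is given by stratum proportions and uses unit weights; the function `closed_form_psi` is hypothetical and is not zEpid's code.

```python
def closed_form_psi(a, p, y, w=None):
    """Solve sum w * (a - p) * (y - psi * a) = 0 for psi."""
    if w is None:
        w = [1.0] * len(a)  # no censoring weights in this toy example
    num = sum(wi * (ai - pi) * yi for wi, ai, pi, yi in zip(w, a, p, y))
    den = sum(wi * (ai - pi) * ai for wi, ai, pi in zip(w, a, p))
    return num / den

# Hypothetical toy data: two strata of L with four people each
a = [1, 0, 0, 0, 1, 1, 1, 0]
y = [5.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 4.0]
# Pr(A=1 | L) taken as the treated proportion in each stratum: 1/4 and 3/4
p = [0.25] * 4 + [0.75] * 4

psi_hat = closed_form_psi(a, p, y)
print(psi_hat)  # 3.5
```

Here the treated-minus-untreated difference in mean outcome is 3 in the first stratum and 4 in the second, and the estimate of 3.5 is a weighted average of the two.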


### Two-parameter Structural Nested Models
We will now expand our structural nested model to include an interaction term between `qsmk` and `smokeintensity`. Our SNM will look like
$$E[Y^a - Y^{a=0} | A=a, L] = \psi_1 a + \psi_2 a V$$
where $V$ is smoking intensity.

In [4]:
# Initializing G-estimation 
snm = GEstimationSNM(df, exposure='qsmk', outcome='wt82_71')
snm.exposure_model(model=pi_model, print_results=False)
snm.missing_model(m_model, stabilized=False, print_results=False)
snm.structural_nested_model(model='qsmk + qsmk:smokeintensity')
snm.fit(solver='closed')
snm.summary(decimal=5)

           G-estimation of Structural Nested Mean Model               
Treatment:        qsmk                   No. Observations:     1629      
Outcome:          wt82_71                No. Missing Outcome:  63        
Missing model:    Logistic       
Method:           Closed-form
----------------------------------------------------------------------
SNM:     psi*qsmk + psi*qsmk:smokeintensity
----------------------------------------------------------------------
qsmk                      2.85947                       
qsmk:smokeintensity       0.03004                       
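With two parameters, the closed-form approach amounts to solving two linear estimating equations at once: $\sum_i (A_i - \Pr(A=1|L_i)) \, (1, V_i)^T \, (Y_i - \psi_1 A_i - \psi_2 A_i V_i) = 0$. The following is a minimal sketch under simplifying assumptions: noise-free toy data generated with $\psi_1 = 2$ and $\psi_2 = 0.5$, $V$ as the only covariate, and $\Pr(A=1|V)$ taken as the treated proportion at each level of $V$. The function name `two_parameter_psi` is hypothetical, and censoring weights are omitted.

```python
def two_parameter_psi(a, v, p, y):
    """Solve sum (a - p) * [1, v] * (y - psi1*a - psi2*a*v) = 0 for (psi1, psi2)."""
    m11 = m12 = m22 = b1 = b2 = 0.0
    for ai, vi, pi, yi in zip(a, v, p, y):
        r = ai - pi  # residual from the treatment model
        m11 += r * ai
        m12 += r * ai * vi
        m22 += r * ai * vi * vi
        b1 += r * yi
        b2 += r * vi * yi
    # 2x2 linear system [[m11, m12], [m12, m22]] @ [psi1, psi2] = [b1, b2] (Cramer's rule)
    det = m11 * m22 - m12 * m12
    psi1 = (b1 * m22 - m12 * b2) / det
    psi2 = (m11 * b2 - m12 * b1) / det
    return psi1, psi2

# Hypothetical toy data: three levels of V with four people each
v = [0] * 4 + [10] * 4 + [20] * 4
a = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
p = [0.25] * 4 + [0.5] * 4 + [0.75] * 4  # treated proportion at each level of V
y = [1.0 + 0.2 * vi + 2.0 * ai + 0.5 * ai * vi for ai, vi in zip(a, v)]

psi1, psi2 = two_parameter_psi(a, v, p, y)
print(round(psi1, 6), round(psi2, 6))
```

Because the outcome was generated without noise, the solution should recover the two values used to build the data, up to floating-point rounding.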


In [5]:
# Initializing G-estimation 
snm = GEstimationSNM(df, exposure='qsmk', outcome='wt82_71')
snm.exposure_model(model=pi_model, print_results=False)
snm.missing_model(m_model, stabilized=False, print_results=False)
snm.structural_nested_model(model='qsmk + qsmk:smokeintensity')
snm.fit(solver='search')
snm.summary(decimal=5)

           G-estimation of Structural Nested Mean Model               
Treatment:        qsmk                   No. Observations:     1629      
Outcome:          wt82_71                No. Missing Outcome:  63        
Missing model:    Logistic       
Method:           Nelder-Mead              No. Iterations:   144       
Alpha values:     0                        Optimized:        True      
----------------------------------------------------------------------
SNM:     psi*qsmk + psi*qsmk:smokeintensity
----------------------------------------------------------------------
qsmk                      2.85947                       
qsmk:smokeintensity       0.03004                       


Both approaches provide the same answer, but the grid-search takes longer. This is because we are numerically searching over potential values of $\psi$ rather than solving for them directly. The Nelder-Mead search works well but can take a while to converge, since we don't provide any derivatives. So, the question is: why would you ever use the grid-search approach?

## Fine Point 14.2
Hernán and Robins mention some interesting sensitivity analyses for g-estimation with unmeasured confounding. Specifically, they state "G-estimation relies on the fact ... conditional exchangeability given $L$ holds. Now consider a setting in which conditional exchangeability does not hold. ... But g-estimation does not require that $\alpha = 0$." Essentially, we can place a bound on the magnitude of nonexchangeability. For example, we can imagine the magnitude of nonexchangeability is $\alpha = 0.1$. Instead of searching for the $\psi$ at which $\alpha = 0$, we search for the $\psi$ at which $\alpha = 0.1$.

Returning to the question of why you would use the grid-search approach: currently, only the grid-search approach supports this sensitivity analysis over $\alpha$. Using the numbers from the book, we will conduct a sensitivity analysis where $\alpha = 0.1$. We will do this for the one-parameter SNM.

In [6]:
# Initializing G-estimation 
snm = GEstimationSNM(df, exposure='qsmk', outcome='wt82_71')
snm.exposure_model(model=pi_model, print_results=False)
snm.missing_model(m_model, stabilized=False, print_results=False)
snm.structural_nested_model(model='qsmk')

# Search solution
snm.fit(solver='search',
        alpha_value=0.1)  # Sensitivity analysis 
snm.summary(decimal=5)

           G-estimation of Structural Nested Mean Model               
Treatment:        qsmk                   No. Observations:     1629      
Outcome:          wt82_71                No. Missing Outcome:  63        
Missing model:    Logistic       
Method:           Nelder-Mead              No. Iterations:   37        
Alpha values:     0.1                      Optimized:        True      
----------------------------------------------------------------------
SNM:     psi*qsmk
----------------------------------------------------------------------
qsmk                      -1.95224                      


If $\alpha = 0.1$ is a reasonable magnitude for the unmeasured confounding, the estimate moves from 3.45 to -1.95, which suggests our results are quite sensitive to unmeasured confounding of this magnitude.
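The mechanics of this kind of sensitivity analysis can be sketched on toy data. For each candidate $\psi$, fit a logistic model for treatment that includes $H(\psi) = Y - \psi A$ and record its coefficient $\alpha(\psi)$; then pick the $\psi$ whose coefficient is closest to the chosen target instead of closest to zero. The sketch rests on loud assumptions: simulated randomized data with no covariates, a hand-written two-parameter Newton-Raphson fit, and a fixed grid instead of Nelder-Mead. All names are hypothetical and this is not how zEpid implements `alpha_value`.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_slope(h, a, iters=15):
    """Newton-Raphson fit of logit Pr(A=1 | H) = b0 + alpha * H; returns alpha."""
    b0 = alpha = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for hi, ai in zip(h, a):
            p = sigmoid(b0 + alpha * hi)
            w = p * (1.0 - p)
            g0 += ai - p           # score for the intercept
            g1 += (ai - p) * hi    # score for the slope
            h00 += w
            h01 += w * hi
            h11 += w * hi * hi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        alpha += (h00 * g1 - h01 * g0) / det
    return alpha

def alpha_curve(a, y, grid):
    """alpha(psi) for each candidate psi, with H(psi) = Y - psi * A."""
    return [(psi, logistic_slope([yi - psi * ai for ai, yi in zip(a, y)], a))
            for psi in grid]

def psi_for_alpha(curve, target):
    """Candidate psi whose fitted alpha is closest to the target value."""
    return min(curve, key=lambda pair: abs(pair[1] - target))[0]

# Hypothetical toy data: randomized treatment, true effect 3, unit-variance noise
rng = random.Random(0)
a = [1 if rng.random() < 0.5 else 0 for _ in range(400)]
y = [3.0 * ai + rng.gauss(0.0, 1.0) for ai in a]

grid = [2.0 + 0.02 * i for i in range(101)]
curve = alpha_curve(a, y, grid)
psi_zero = psi_for_alpha(curve, 0.0)   # usual g-estimate
psi_shift = psi_for_alpha(curve, 0.1)  # sensitivity analysis with alpha = 0.1
print(round(psi_zero, 2), round(psi_shift, 2))
```

In this toy setup a positive target for $\alpha$ pulls the estimate below the usual g-estimate, the same direction as the shift seen in the NHEFS result above, although the size of the shift depends on the variability of the outcome.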

## Conclusion
That concludes Chapter 14 of "Causal Inference" by Hernán and Robins. Please review the other tutorials on this site for more details and features of `GEstimationSNM`. In the next tutorial, we will go through causal survival analysis.