# Generalizability
First, we will approach to problem of generalizability. To frame our problem, imagine we took a random sample from our target population. We collected some basic data from everyone in the random sample. We then recruited individuals to take part in a trial we were conducting. Of the 3,000 individuals included in our study, 486 participated in our trial. However, the trial participants were not a random sample of our target population. Therefore, we are worried about generalizing our results from the trial participants to our target population.

## Randomized Control Trial
For simplicity, we will first generalize our results from a randomized trial (we will not need to worry about confounding). There are three options for generalizing results in *zEpid*; inverse probability of sampling weights (`IPSW`), g-transport formula (`GTransportFormula`), and doubly robust estimator (`AIPSW`).

Before we start generalizing our results, let's take a look at the data and estimate the sample average treatment effect ($SATE$). The $SATE$ is defined as
$$SATE = E[Y^{a=1}] - E[Y^{a=0}]$$
Our sample is indicated by `S=1` and includes only 486 individuals. We are interested in the causal effect of $A$ on $Y$.

In [1]:
import numpy as np
import pandas as pd

import zepid
from zepid import RiskDifference
from zepid import load_generalize_data
from zepid.causal.generalize import IPSW, GTransportFormula, AIPSW

print(zepid.__version__)

0.9.0


In [2]:
df = load_generalize_data(False)
df['W_sq'] = df['W']**2
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      3000 non-null   int64  
 1   Y       486 non-null    float64
 2   A       486 non-null    float64
 3   S       3000 non-null   int64  
 4   L       3000 non-null   int64  
 5   W       3000 non-null   float64
 6   W_sq    3000 non-null   float64
dtypes: float64(4), int64(3)
memory usage: 164.2 KB


In [3]:
dfs = df.loc[df['S'] == 1].copy()

rd = RiskDifference()
rd.fit(dfs, exposure='A', outcome='Y')
rd.summary()

Comparison:0 to 1.0
+-----+-------+-------+
|     |   D=1 |   D=0 |
| E=1 |    84 |   172 |
+-----+-------+-------+
| E=0 |    51 |   179 |
+-----+-------+-------+ 

                         Risk Difference                              
        Risk  SD(Risk)  Risk_LCL  Risk_UCL
Ref:0  0.222     0.027     0.168     0.275
1.0    0.328     0.029     0.271     0.386
----------------------------------------------------------------------
       RiskDifference  SD(RD)  RD_LCL  RD_UCL
Ref:0           0.000     NaN     NaN     NaN
1.0             0.106    0.04   0.028   0.185
----------------------------------------------------------------------
       RiskDifference    CLD  LowerBound  UpperBound
Ref:0           0.000    NaN         NaN         NaN
1.0             0.106  0.157      -0.672       0.328
----------------------------------------------------------------------
Missing E:    0
Missing D:    0
Missing E&D:  0


Therefore, the treatment effect of $A$ on $Y$ was 0.11 (95% CL: 0.028, 0.185) in the trial. However, we are concerned about the generalizability of our trial results to our target population. Specifically, we are worried about the individuals who enrolled into our study no longer being representative of the target. We believe $L$ and $W$ are modifiers and have different distributions between the trial population and the target population. Let's compare three methods to deal with this approach

### IPSW
Inverse Probability of Sampling Weights (IPSW) are an approach to reweight the study sample to reflect the full population. Similar to other inverse probability weighting approaches, we generate weights to create a pseudo-population that is reflective of the population we want to draw inference regarding.

IPSW are sampling weights, which weight the observed sample to be reflective of the target population. For generating these weights, factors that (1) differ between the sample and the target and (2) are modifiers should be included in this model. Remember that if something has an effect on the outcome, *it must be a modifier on at least one scale* (risk difference / risk ratio). Therefore, it would be prudent to include strong causes of the outcome that differ substantially between the sample and target.

In our example, we assume that `L` and `W` are sufficient for our results to generalize from the sample to the target population. Below is code to estimate the target population risk difference

In [4]:
ipsw = IPSW(df, exposure='A', outcome='Y', selection='S', generalize=True)
ipsw.sampling_model('L + W + W_sq + L:W + L:W_sq', print_results=False)
ipsw.fit()
ipsw.summary()

rd = ipsw.risk_difference

           Inverse Probability of Sampling Weights
Treatment:        A               Sample Observations:  486                 
Outcome:          Y               Target Observations:  2514                
Target estimate:  Generalize      IP Treatment Weights: No                  
----------------------------------------------------------------------
Risk Difference:  0.0542
Risk Ratio:       1.1665


Confidence intervals come from a boostrapping procedure. This bootstrapping procedure is different from other estimators. Instead of resampling from our entire study sample, we need to account for random error in selection of the study sample and random error in the selection of the basic data collection. To do this, we divide our data into the two pieces, sample from them independently, then stack them again. We then estimate the risk difference

In [5]:
# Step 1: divide data
dfss = df.loc[df['S'] == 1].copy()
dftp = df.loc[df['S'] == 0].copy()
rd_bs = []

for i in range(200):
    # Step 2: Resample data
    dfs = dfss.sample(n=dfss.shape[0], replace=True)
    dft = dftp.sample(n=dftp.shape[0], replace=True)

    # Step 3: restack the data
    dfb = pd.concat([dfs, dft])

    # Step 4: Estimate IPSW
    ipsw = IPSW(dfb, exposure='A', outcome='Y', selection='S', generalize=True)
    ipsw.sampling_model('L + W + L:W', print_results=False)
    ipsw.fit()

    rd_bs.append(ipsw.risk_difference)

se = np.std(rd_bs, ddof=1)

print('95% LCL:', np.round(rd - 1.96*se, 3))
print('95% UCL:', np.round(rd + 1.96*se, 3))

95% LCL: -0.04
95% UCL: 0.148


Therefore, the probability of `Y` given everyone in the target population had `A=1` would have been 5 percentage points higher (95% CL: -0.04, 0.15) than if everyone in the target population had `A=0`. Note that this conclusion is different than our $SATE$

### G-transport Formula
Alternatively, we can also use the g-formula to estimate the causal effect in our target population. Instead of weighting our population, we will estimate a model (including the modifiers) for our trial participants. Then we will use the fitted parametric model to predict the counterfactual outcomes in both the study sample and the target population sample

In [6]:
gtf = GTransportFormula(df, exposure='A', outcome='Y', selection='S', generalize=True)
gtf.outcome_model('A + L + W + W_sq + A:L + A:W + A:W_sq', print_results=False)
gtf.fit()
gtf.summary()

rd = gtf.risk_difference

                       g-Transport formula
Treatment:        A               Sample Observations:  486                 
Outcome:          Y               Target Observations:  2514                
Target estimate:  Generalize     
----------------------------------------------------------------------
Risk Difference:  0.0567
Risk Ratio:       1.1806


Similarly, we use a bootstrapping procedure for confidence intervals. The same procedure as previously described is used

In [7]:
# Step 1: divide data
dfss = df.loc[df['S'] == 1].copy()
dftp = df.loc[df['S'] == 0].copy()
rd_bs = []

for i in range(200):
    # Step 2: Resample data
    dfs = dfss.sample(n=dfss.shape[0], replace=True)
    dft = dftp.sample(n=dftp.shape[0], replace=True)

    # Step 3: restack the data
    dfb = pd.concat([dfs, dft])

    # Step 4: Estimate IPSW
    gtf = GTransportFormula(dfb, exposure='A', outcome='Y', selection='S', generalize=True)
    gtf.outcome_model('A + L + W + W_sq + A:L + A:W + A:W_sq', print_results=False)
    gtf.fit()

    rd_bs.append(gtf.risk_difference)

se = np.std(rd_bs, ddof=1)
print('95% LCL:', np.round(rd - 1.96 * se, 3))
print('95% UCL:', np.round(rd + 1.96 * se, 3))

95% LCL: -0.054
95% UCL: 0.168


Therefore, the probability of `Y` given everyone in the target population had `A=1` would have been 6 percentage points higher (95% CL: -0.05, 0.16) than if everyone in the target population had `A=0`. Note that this conclusion is different than our $SATE$, but similar to IPSW.

### Augmented-IPSW
Similarly to causal inference in an observational study, we may be concerned regarding model misspecification. Through AIPSW, we have 'two chances' to get our model specified correctly. Essentially, it is a recipe to merge IPSW and the g-transport formula together

In [8]:
aipw = AIPSW(df, exposure='A', outcome='Y', selection='S', generalize=True)
aipw.sampling_model('L + W + W_sq + L:W + L:W_sq', print_results=False)
aipw.outcome_model('A + L + W + W_sq + A:L + A:W + A:W_sq', print_results=False)
aipw.fit()
aipw.summary()

rd = aipw.risk_difference

           Augmented Inverse Probability of Sampling Weights          
Treatment:        A               Sample Observations:  486                 
Outcome:          Y               Target Observations:  2514                
Target estimate:  Generalize     
----------------------------------------------------------------------
Risk Difference:  0.055
Risk Ratio:       1.1738


Again, we use a bootstrapping procedure to obtain our confidence intervals

In [9]:
# Step 1: divide data
dfss = df.loc[df['S'] == 1].copy()
dftp = df.loc[df['S'] == 0].copy()
rd_bs = []

for i in range(200):
    # Step 2: Resample data
    dfs = dfss.sample(n=dfss.shape[0], replace=True)
    dft = dftp.sample(n=dftp.shape[0], replace=True)

    # Step 3: restack the data
    dfb = pd.concat([dfs, dft])

    # Step 4: Estimate IPSW
    aipw = AIPSW(dfb, exposure='A', outcome='Y', selection='S', generalize=True)
    aipw.sampling_model('L + W + W_sq + L:W + L:W_sq', print_results=False)
    aipw.outcome_model('A + L + W + W_sq + A:L + A:W + A:W_sq', print_results=False)
    aipw.fit()

    rd_bs.append(aipw.risk_difference)

se = np.std(rd_bs, ddof=1)
print('95% LCL:', np.round(rd - 1.96 * se, 3))
print('95% UCL:', np.round(rd + 1.96 * se, 3))

95% LCL: -0.054
95% UCL: 0.164


Therefore, the probability of `Y` given everyone in the target population had `A=1` would have been 6 percentage points higher (95% CL: -0.05, 0.16) than if everyone in the target population had `A=0`. Note that this conclusion is different than our $SATE$, but similar to both IPSW and g-transport. 

## Observational Study
In the previous examples, we assumed $Y^a \amalg A$. Let's generalize to observational studies with confounders, i.e. $Y^a \amalg A | L$. For observational studies, we will need to make some minor tweaks to our previous estimation procedures. In the following, we will assume that `L` and `W` are both potential confounders

In [10]:
df = load_generalize_data(True)
df['W_sq'] = df['W']**2
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      3000 non-null   int64  
 1   Y       486 non-null    float64
 2   A       486 non-null    float64
 3   S       3000 non-null   int64  
 4   L       3000 non-null   int64  
 5   W       3000 non-null   float64
 6   W_sq    3000 non-null   float64
dtypes: float64(4), int64(3)
memory usage: 164.2 KB


### IPSW
For IPSW to account for confounding, IPTW must be generated and passed to `IPSW`. First we will calculate the inverse probability of treatment weights with `IPTW`. Then we will specify the optional `weights` argument for `IPSW`

In [11]:
ipsw = IPSW(df, exposure='A', outcome='Y', selection='S', generalize=True)
ipsw.sampling_model('L + W + W_sq + L:W + L:W_sq', print_results=False)
ipsw.treatment_model('L + W + W_sq', print_results=False)
ipsw.fit()
ipsw.summary()

rd = ipsw.risk_difference

           Inverse Probability of Sampling Weights
Treatment:        A               Sample Observations:  486                 
Outcome:          Y               Target Observations:  2514                
Target estimate:  Generalize      IP Treatment Weights: Yes                 
----------------------------------------------------------------------
Risk Difference:  0.0506
Risk Ratio:       1.1572


Therefore, the probability of `Y` given everyone in the target population had `A=1` would have been 5 percentage points higher than if everyone in the target population had `A=0`. As we would hope (since I simulated the data, I know the true answer), our results are similar to the true value.

Confidence intervals are more complex, since we need to also re-estimate our IPTW to account for that variability. Below is code to estimate the corresponding confidence intervals

In [12]:
# Step 1: divide data
dfss = df.loc[df['S'] == 1].copy()
dftp = df.loc[df['S'] == 0].copy()
rd_bs = []

for i in range(200):
    # Step 2: Resample data
    dfs = dfss.sample(n=dfss.shape[0], replace=True)
    dft = dftp.sample(n=dftp.shape[0], replace=True)

    # Step 3: restack the data
    dfb = pd.concat([dfs, dft])
    
    # Step 4: Estimate IPSW
    ipsw = IPSW(dfb, exposure='A', outcome='Y', selection='S', 
                generalize=True)
    ipsw.sampling_model('L + W + L:W', print_results=False)
    ipsw.treatment_model('L + W + W_sq', print_results=False)
    ipsw.fit()

    rd_bs.append(ipsw.risk_difference)

se = np.std(rd_bs, ddof=1)

print('95% LCL:', np.round(rd - 1.96*se, 3))
print('95% UCL:', np.round(rd + 1.96*se, 3))

95% LCL: -0.083
95% UCL: 0.184


### G-transport formula
Implementation-wise, the g-transport formula remains the same. The only requirement is that we include all confounders for the A-Y relationship into the g-transport formula. This makes no difference in our example, because our modifiers of concern in the RCT are also the confounders in our observational study. The g-transport formula has the disadvantage of requiring that all confounders are measured in both the study sample and the target population sample. If all confounders are not measured in the target population sample, IPSW may be the only option to generalize results

In [13]:
gtf = GTransportFormula(df, exposure='A', outcome='Y', selection='S', generalize=True)
gtf.outcome_model('A + L + W + W_sq + A:L', print_results=False)
gtf.fit()
gtf.summary()

                       g-Transport formula
Treatment:        A               Sample Observations:  486                 
Outcome:          Y               Target Observations:  2514                
Target estimate:  Generalize     
----------------------------------------------------------------------
Risk Difference:  0.0419
Risk Ratio:       1.1353


### Augmented-IPSW
Similar to `IPSW`, we need to calculate IPTW for the augmented-IPSW. Below is code to estimate `AIPSW` and the corresponding confidence intervals

In [14]:
aipw = AIPSW(df, exposure='A', outcome='Y', selection='S', 
             generalize=True)
aipw.sampling_model('L + W + W_sq + L:W + L:W_sq', print_results=False)
aipw.treatment_model('L + W + W_sq', print_results=False)
aipw.outcome_model('A + L + W + W_sq + A:L', print_results=False)
aipw.fit()
aipw.summary()

           Augmented Inverse Probability of Sampling Weights          
Treatment:        A               Sample Observations:  486                 
Outcome:          Y               Target Observations:  2514                
Target estimate:  Generalize     
----------------------------------------------------------------------
Risk Difference:  0.0426
Risk Ratio:       1.1369


In [15]:
# Step 1: divide data
dfss = df.loc[df['S'] == 1].copy()
dftp = df.loc[df['S'] == 0].copy()
rd_bs = []

for i in range(200):
    # Step 2: Resample data
    dfs = dfss.sample(n=dfss.shape[0], replace=True)
    dft = dftp.sample(n=dftp.shape[0], replace=True)

    # Step 3: restack the data
    dfb = pd.concat([dfs, dft])
    
    # Step 4: Estimate AIPSW
    aipw = AIPSW(dfb, exposure='A', outcome='Y', selection='S', 
                 generalize=True)
    aipw.sampling_model('L + W + W_sq + L:W + L:W_sq', print_results=False)
    aipw.treatment_model('L + W + W_sq', print_results=False)
    aipw.outcome_model('A + L + W + W_sq + A:L', print_results=False)
    aipw.fit()

    rd_bs.append(aipw.risk_difference)

se = np.std(rd_bs, ddof=1)

print('95% LCL:', np.round(rd - 1.96*se, 3))
print('95% UCL:', np.round(rd + 1.96*se, 3))

95% LCL: -0.095
95% UCL: 0.197


# Conclusion
In this tutorial, I demonstrated three different estimators to generalize both randomized trial results and observational results to a target population. In the next tutorial, we will address the problem of transportability.

## Further Readings
Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, & Cole SR. (2017). Generalizing study results: a potential outcomes perspective. Epidemiology (Cambridge, Mass.), 28(4), 553

Dahabreh IJ, Hernan MA, Robertson SE, Buchanan A, Steingrimsson JA. (2019). Generalizing trial findings in nested trial designs with sub-sampling of non-randomized individuals. arXiv preprint arXiv:1902.06080