## Review on DoubleLingo: Causal Estimation with Large Language Models
### Motivation
The paper addresses a significant limitation of large language models (LLMs) in natural language processing (NLP). Despite the transformative impact of LLMs on NLP, as highlighted by works such as Vaswani et al. (2017) and Min et al. (2023), these models are weak estimators of causal parameters. Because of the explicit and implicit regularization in their training (Neyshabur, 2017; Chernozhukov et al., 2018), estimates built directly on LLM predictions do not consistently converge to the true causal effect. Specifically, using a popular LLM such as BERT (Devlin et al., 2019) to predict propensity scores $P(A \mid T)$ introduces bias, so these models are unsuitable for direct causal inference without adjustment.

### Model Setup
<div style="text-align: center;">
    <img src="images/causal_graph.jpg" alt="Causal Graph" width="400"/>
</div>

1. **Variables**:
   - **$A$**: Binary variable indicating whether a patient receives antibiotics $(1)$ or not $(0)$.
   - **$Y$**: Binary outcome variable denoting whether the disease progresses $(1)$ or not $(0)$.
   - **$T$**: Patient medical records.
   - **$C$**: Confounding variables contained in the records.

2. **Equations**:
   - **Equation $(8)$**: $Y = A\theta_0 + g_0(T) + U$, where $\theta_0$ is the ATE, $g_0(T)$ is a function of the confounders, and $U$ is the error term with $E[U | T, A] = 0$.
   - **Equation $(9)$**: $A = m_0(T) + V$, where $m_0(T)$ models the probability of treatment given the covariates, and $V$ is the error term with $E[V | T] = 0$.
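
To make the setup concrete, the following is a minimal simulation sketch of Equations $(8)$ and $(9)$. It is not taken from the paper: the scalar covariate standing in for the text $T$, the functional forms of $m_0$ and $g_0$, the noise scale, and the helper name `simulate_plm` are all assumptions chosen for illustration. It shows that a naive difference in means is far from $\theta_0$ when $T$ drives both treatment and outcome.

```python
import numpy as np


def simulate_plm(n, theta0=0.5, seed=0):
    """Draw (T, A, Y) from a partially linear model (hypothetical functional forms)."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=n)                   # scalar stand-in for the text covariates
    m0 = 1.0 / (1.0 + np.exp(-2.0 * T))      # assumed propensity m_0(T) = P(A = 1 | T)
    A = rng.binomial(1, m0)                  # V = A - m_0(T) then satisfies E[V | T] = 0
    g0 = np.sin(T) + T                       # assumed confounding function g_0(T)
    U = rng.normal(scale=0.5, size=n)        # outcome noise with E[U | T, A] = 0
    Y = theta0 * A + g0 + U                  # Equation (8), with a continuous outcome
    return T, A, Y


T, A, Y = simulate_plm(n=20_000)

# Naive contrast that ignores T: it absorbs the confounding carried by g_0(T).
naive = Y[A == 1].mean() - Y[A == 0].mean()
print(f"true theta_0 = 0.5, naive difference in means = {naive:.3f}")
```

The outcome is continuous here for simplicity, whereas the review describes a binary $Y$; the point is only to illustrate why adjusting for $T$ is needed.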

### Double Machine Learning Approach
1. **Orthogonalization**:
   - Compute the residuals $\hat{V} = A - \hat{m}_0(T)$, where $\hat{m}_0$ is the machine learning estimator of $m_0$. This step removes the influence of $T$ from $A$, thereby isolating the variation in treatment from the confounding influence of $T$.

2. **Sample Splitting**:
   - The dataset of size $N$ is split into two equal parts: main sample (indices $I$) and auxiliary sample (indices $I_C$). The estimators $\hat{m}_0$ and $\hat{g}_0$ are trained on $I_C$ and then used to estimate $\theta_0$ on $I$.

3. **Estimation of $\theta_0$**:
   - **Treatment Residuals $\hat{V}$**: Derived by predicting $A$ (treatment) from covariates $T$, where $\hat{V} = A - \hat{m}_0(T)$. These residuals isolate the effect of treatment from the confounding variables.
   - **Outcome Adjusted for Covariates**: Calculated as $\hat{W} = Y - \hat{g}_0(T)$, adjusting the outcome $Y$ for influences of $T$, leaving the portion of $Y$ unexplained by $T$.
   - The adjusted outcome $\hat{W}$ is then regressed on the residuals $\hat{V}$ in a linear model:
         $$
            \hat{W} = \theta_0 \hat{V}
         $$
   - The regression provides an estimate of $\theta_0$, the causal effect of treatment net of the influence of $T$. With $n = N/2$ observations in the main sample $I$, the estimator $\hat{\theta}_0$ is calculated as follows:
     $$
     \hat{\theta}_0 = \left(\frac{1}{n} \sum_{i \in I} \hat{V}_{i} A_i \right)^{-1} \left(\frac{1}{n} \sum_{i \in I} \hat{V}_{i} (Y_i - \hat{g}_0(T_i)) \right)
     $$
   - This formula corrects for the effects of $T$ on both $A$ and $Y$, focusing solely on the treatment effect.

4. **Error Decomposition**:
   - **Term $A$**: The leading term of the scaled estimation error $\sqrt{n}(\hat{\theta}_0 - \theta_0)$; it converges to a Gaussian distribution, which is what makes standard inference on $\theta_0$ possible.
   - **Term $B$**: Accounts for the regularization bias due to errors in the machine learning estimators; it vanishes when the product of the convergence rates of $\hat{m}_0$ and $\hat{g}_0$ is faster than $n^{-1/2}$.
   - **Term $C$**: A remainder term driven by overfitting of the machine learning estimators; it is controlled through sample splitting, which ensures it remains small.
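
The steps above can be sketched on synthetic data as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the data-generating process, the gradient-boosting learners, and the choice to fit $\hat{g}_0$ on untreated units only (which uses $E[Y \mid T, A = 0] = g_0(T)$ under Equation $(8)$) are all assumptions made here. The last step evaluates the estimator formula for $\hat{\theta}_0$ given above on the main sample.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
N, theta0 = 8_000, 0.5

# Synthetic data from Equations (8) and (9); T is a scalar stand-in for text.
T = rng.normal(size=(N, 1))
m0 = 1.0 / (1.0 + np.exp(-2.0 * T[:, 0]))            # assumed propensity
A = rng.binomial(1, m0)
Y = theta0 * A + np.sin(T[:, 0]) + T[:, 0] + rng.normal(scale=0.5, size=N)

# Sample splitting: the auxiliary half I_C fits the nuisance models,
# the main half I is used to estimate theta_0.
perm = rng.permutation(N)
I_C, I = perm[: N // 2], perm[N // 2:]

# m_hat approximates m_0(T) = P(A = 1 | T).
m_hat = GradientBoostingClassifier(random_state=0).fit(T[I_C], A[I_C])

# g_hat approximates g_0(T); it is fitted on the untreated units of I_C only.
controls = I_C[A[I_C] == 0]
g_hat = GradientBoostingRegressor(random_state=0).fit(T[controls], Y[controls])

# Residuals on the main sample.
V_hat = A[I] - m_hat.predict_proba(T[I])[:, 1]       # treatment residuals
W_hat = Y[I] - g_hat.predict(T[I])                   # outcome adjusted for T

# Estimator: (mean of V_hat * A)^{-1} * (mean of V_hat * (Y - g_hat(T))).
theta_hat = np.mean(V_hat * W_hat) / np.mean(V_hat * A[I])

naive = Y[A == 1].mean() - Y[A == 0].mean()
print(f"theta_0 = {theta0}, DML estimate = {theta_hat:.3f}, naive = {naive:.3f}")
```

A common refinement described by Chernozhukov et al. (2018) is cross-fitting: swap the roles of $I$ and $I_C$, repeat the estimation, and average the two estimates so that no data are wasted.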


### Models for Faster Convergence
A concern arises from the requirement that both machine learning estimators converge at a rate of at least $n^{-1/4}$ to obtain the desired $\sqrt{n}$-consistent estimate of $\theta_0$. Research on the convergence rate of encoder-based transformer classifiers such as BERT, specifically in the context of semiparametric inference, remains sparse. To address potential convergence issues:

- **BERT+Adapter**: Instead of fine-tuning the full model, the authors use parameter-efficient transfer learning through adapters, as proposed by Houlsby et al. (2019). Training far fewer parameters may allow faster convergence, although no comprehensive theoretical bounds have been established so far.

![BERT Adapter Model](images/bert_adapter.jpg)

- **Embedding+FFN**: Rather than fully fine-tuning BERT, the authors train a feedforward network (FFN) on top of pre-trained text embeddings. BERT's pre-trained [CLS] encoding is one candidate, but it was originally trained for next-sentence prediction and may not provide semantically rich sentence representations. Instead, they employ embeddings from pre-trained sentence transformers (Reimers and Gurevych, 2019), which offer more nuanced semantic understanding:
   - `all-mpnet-base-v2`: based on MPNet (Song et al., 2020) and fine-tuned on over 1 billion sentence pairs, including paper abstracts
   - `SPECTER`: pre-trained on a dataset of scientific paper titles and abstracts
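
The Embedding+FFN idea can be sketched as a small feedforward classifier trained on top of fixed sentence embeddings. The sketch below is only an illustration under assumptions: random vectors stand in for real embeddings so that it runs without downloading a model, the treatment labels are synthetic, and the hidden-layer size and use of scikit-learn's `MLPClassifier` are choices made here, not the paper's architecture.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_docs, dim = 1_000, 768  # assumed embedding dimension

# Placeholder for frozen sentence embeddings. With the sentence-transformers
# package one would instead compute them from a list of strings `texts`, e.g.
#   embeddings = SentenceTransformer("all-mpnet-base-v2").encode(texts)
embeddings = rng.normal(size=(n_docs, dim))

# Hypothetical treatment assignment that depends on the embedding.
w = rng.normal(size=dim) / np.sqrt(dim)
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * (embeddings @ w))))

# Only the feedforward head is trained; the embeddings stay fixed.
train, test = slice(0, 800), slice(800, None)
ffn = MLPClassifier(hidden_layer_sizes=(128,), early_stopping=True,
                    max_iter=200, random_state=0)
ffn.fit(embeddings[train], A[train])

# Out-of-sample propensity estimates m_hat(T) = P(A = 1 | T).
propensity = ffn.predict_proba(embeddings[test])[:, 1]
print(propensity[:5])
```

Because the encoder is never updated, the embeddings can be computed once and reused for both the treatment model and the outcome model.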


### Results

- Main Findings
   - Among `BERT+Adapter`, `MPNetV2+FFN`, `SPECTER+FFN`, `TF-IDF+FFN`, and Oracle (C), the three DoubleLingo estimators obtain the lowest ATE relative absolute error, 0.103, compared with 0.115 for the Oracle. The Oracle is an outcome regression with full access to the otherwise unobserved variable $C$.
   - In theory, adjusting for $C$ should fully account for the confounding between treatment $A$ and outcome $Y$. In practice, $C$ does not capture every aspect of the relationship. The paper suggests that other variables, particularly $T$, might also influence $Y$, so including $T$ in the model can improve estimation accuracy. That DoubleLingo, which incorporates $T$, outperforms the theoretically ideal Oracle indicates that real-world applications should consider more than the theoretically sufficient set of confounders.

- Convergence Experiment
   - They obtained rough estimates that BERT+Adapter, MPNetV2+FFN, SPECTER+FFN, and TF-IDF+FFN converge respectively at $n^{-0.57}$, $n^{-0.64}$, $n^{-0.67}$, and $n^{-0.56}$, all faster than the desired $n^{-0.25}$ rate.
   - Estimation accuracy also improves as the sample size increases, which is in line with consistency of the estimator.


- Note
   - According to Table 2 in Appendix B of the paper, predictive accuracy on $m_0$ and $g_0$ alone does not directly translate into more accurate causal estimates (Wood-Doughty et al., 2018).
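
Convergence rates like those in the convergence experiment are typically read off a log-log plot: if the error behaves as $c \cdot n^{-r}$, then $\log(\text{error})$ is linear in $\log n$ with slope $-r$. The sketch below illustrates that calculation only. The sample sizes and error values are synthetic (generated with a known rate of 0.5), and this is not necessarily the procedure the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_sizes = np.array([500, 1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)

# Hypothetical estimation errors decaying as n^{-0.5}, with multiplicative noise.
true_rate = 0.5
noise = np.exp(rng.normal(scale=0.05, size=sample_sizes.size))
errors = 2.0 * sample_sizes ** (-true_rate) * noise

# error ~ c * n^{-r}  implies  log(error) ~ log(c) - r * log(n),
# so a straight-line fit in log-log space recovers the rate r.
slope, intercept = np.polyfit(np.log(sample_sizes), np.log(errors), deg=1)
estimated_rate = -slope

required_rate = 0.25  # the n^{-1/4} requirement for each nuisance estimator
print(f"estimated rate: n^-{estimated_rate:.2f}")
print(f"faster than n^-1/4: {estimated_rate > required_rate}")
```

With only a handful of sample sizes the fitted slope is itself noisy, which is consistent with the review describing the reported rates as rough estimates.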



```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Set a random seed for reproducibility
np.random.seed(42)

# Toy data for a quick smoke test without the CSV file (uncomment to use).
# Column names follow the CSV: X = text, T = treatment, Y = outcome.
# df = pd.DataFrame({
#     'X': [
#         "Medical record 1", "Medical record 2", "Medical record 3",
#         "Medical record 4", "Medical record 5", "Medical record 6",
#         "Medical record 7", "Medical record 8", "Medical record 9", "Medical record 10"
#     ],
#     'T': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
#     'Y': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
# })

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('data/subpopA_physics_medicine.csv')
df.head(3)
```




| Unnamed: 0 | X | Y | T | C |
|---|---|---|---|---|
| 0 | Thyroid function and prevalent and incident me... | 0 | 0 | 0 |
| 1 | An open-system quantum simulator with trapped ... | 0 | 1 | 1 |
| 2 | On Thermodynamic Interpretation of Copula Entr... | 0 | 1 | 1 |


```python
# Example code for Double Machine Learning
# Note: in this dataset, X holds the text, T the treatment (written A in the
# review above), and Y the outcome.

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['X']).toarray()
treatment = df['T'].values
outcome = df['Y'].values

# Number of random sample splits
n_iterations = 10
ate_estimates = []

for _ in tqdm(range(n_iterations)):
    # Random split into two halves: main sample I and auxiliary sample I_C.
    # A fresh random state per iteration gives a different split each time.
    X_I, X_IC, treatment_I, treatment_IC, outcome_I, outcome_IC = train_test_split(
        X, treatment, outcome, test_size=0.5,
        random_state=np.random.randint(0, 10000))

    # Train the treatment model on I_C, then predict P(treatment = 1 | text) on I
    treatment_model = LogisticRegression()
    treatment_model.fit(X_IC, treatment_IC)
    treatment_preds_I = treatment_model.predict_proba(X_I)[:, 1]

    # Train the outcome model on I_C (text features only), then predict on I
    outcome_model = LogisticRegression()
    outcome_model.fit(X_IC, outcome_IC)
    outcome_preds_I = outcome_model.predict_proba(X_I)[:, 1]

    # Calculate residuals on I
    treatment_residuals = treatment_I - treatment_preds_I
    outcome_residuals = outcome_I - outcome_preds_I

    # OLS of outcome residuals on treatment residuals;
    # params[0] is the intercept and params[1] the treatment effect
    treatment_effect_model = sm.OLS(outcome_residuals, sm.add_constant(treatment_residuals))
    results = treatment_effect_model.fit()
    ate_estimates.append(results.params[1])

# Mean and standard deviation of the ATE estimates across splits
ate_mean = np.mean(ate_estimates)
ate_std = np.std(ate_estimates)

print(f"Estimated ATE Mean using Double ML: {ate_mean:.4f}")
print(f"Estimated ATE Std Deviation using Double ML: {ate_std:.4f}")
```

```text
Estimated ATE Mean using Double ML: 0.0938
Estimated ATE Std Deviation using Double ML: 0.0054
```



