$\Large \textbf{Deriving a lower bound for hidden confounding in observational studies (Part 1)}$

This notebook offers a tutorial on assessing and measuring hidden confounding in synthetic data, using methods from "Hidden yet quantifiable: A lower bound for confounding strength using randomized trials". Detailed information is in the accompanying paper.

The core idea is to compare the average treatment effect (ATE) estimates from a randomized trial and an observational study through hypothesis testing. The approach uses bounds derived from sensitivity analysis to measure the distance between these estimates. Each set of bounds corresponds to a chosen confounding strength ($\Gamma$); by testing increasing values of $\Gamma$, we can establish a lower bound for the true confounding strength.
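To make the comparison concrete, here is a minimal sketch of what such a test could look like. It is not the implementation in `test_confounding`: the helper name `interval_test`, the choice of statistic (the smaller of the two standardised gaps between the trial ATE and the end points of the sensitivity interval), the normal approximation, and the assumption that the two estimates are independent are all illustrative assumptions.

```python
import math
from statistics import NormalDist


def interval_test(ate_rct, var_rct, lower, upper, var_lower, var_upper, alpha=0.05):
    """Test H0: the trial ATE lies inside the sensitivity interval [lower, upper].

    Hypothetical helper: the statistic is the smaller of the two standardised
    gaps between the trial ATE and the interval end points.
    """
    # Gap to each end point, scaled by the standard error of the difference
    # (variances add because the two estimates are assumed independent).
    t_lower = (ate_rct - lower) / math.sqrt(var_rct + var_lower)
    t_upper = (upper - ate_rct) / math.sqrt(var_rct + var_upper)
    statistic = min(t_lower, t_upper)
    # One-sided test: reject when the ATE falls significantly outside the interval.
    critical_value = NormalDist().inv_cdf(alpha)
    return statistic, statistic < critical_value


# Toy numbers (illustrative only): a narrow interval excludes the trial ATE,
# a wide interval contains it.
stat_narrow, reject_narrow = interval_test(2.0, 0.01, 0.5, 1.5, 0.01, 0.01)
stat_wide, reject_wide = interval_test(2.0, 0.01, -1.0, 3.0, 0.01, 0.01)
print(reject_narrow, reject_wide)
```

A larger $\Gamma$ widens the interval, so a $\Gamma$ that is too small to explain the gap between the two studies leads to a rejection, while a sufficiently large one does not.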

In [1]:
# Required imports

import numpy as np
from scipy.stats import bootstrap
from sklearn.ensemble import RandomForestClassifier

from test_confounding.ate_bounds.ate_bounds import BootstrapSensitivityAnalysis
from test_confounding.cate_bounds.cate_bounds import MultipleCATEBoundEstimators
from test_confounding.cate_bounds.utils_cate_bounds import compute_bootstrap_variance
from test_confounding.datasets import synthetic
from test_confounding.utils_general import e_x_func

To implement our test, we need a randomized trial and an observational study that satisfy Assumptions 1, 2, and 3 of the paper. Specifically, the conditional ATE (CATE) must be transportable from the trial to the observational study, the trial must be internally valid, and the support of the trial must be contained in that of the observational study. The code below creates a synthetic dataset with these properties.

In [2]:
n_obs = 20000  # observational study size
n_rct = 5000  # randomized trial size
true_conf = 6.0  # true confounding strength, i.e. \Gamma^* in the paper
effective_conf = 1.0  # interpolates between adversarial propensity score (1.0) and uncorrelated hidden confounder (0.0)
seed = 42  # seed for reproducibility

data = synthetic.Synthetic(
    num_examples_obs=n_obs,
    num_examples_rct=n_rct,
    gamma_star=true_conf,
    effective_conf=effective_conf,
    sigma_y=0.01,
    seed=seed,
)

The paper proposes two testing procedures, determined by the target population. Firstly, we consider the randomized trial as the target population. In this approach, we derive CATE sensitivity bounds from observational data, average them over the trial, and then compare this result with the ATE directly computed from the trial. For the CATE bounds, we use the BLearner.
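The trial-side quantities in this comparison are a difference-in-means ATE and its variance. The sketch below illustrates one common way to obtain them with a nonparametric bootstrap. The function names are hypothetical and the code is not meant to reproduce `compute_bootstrap_variance`, whose internals are not shown in this notebook; the simulated data are purely illustrative.

```python
import numpy as np


def diff_in_means(y, t):
    """Difference-in-means ATE estimate for a randomized trial."""
    return y[t == 1].mean() - y[t == 0].mean()


def bootstrap_ate_variance(y, t, n_bootstraps=200, seed=0):
    """Variance of the difference-in-means estimate across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        # Resample units with replacement, keeping (y, t) pairs together.
        idx = rng.integers(0, n, size=n)
        estimates[b] = diff_in_means(y[idx], t[idx])
    return estimates.var(ddof=1)


# Illustrative data: constant treatment effect of 2 with unit-variance noise.
rng = np.random.default_rng(1)
t_toy = rng.integers(0, 2, size=2000)
y_toy = 1.0 + 2.0 * t_toy + rng.normal(size=2000)

ate_toy = diff_in_means(y_toy, t_toy)
ate_var_toy = bootstrap_ate_variance(y_toy, t_toy)
print(ate_toy, ate_var_toy)
```

With few bootstrap resamples the variance estimate is itself noisy, so the number of resamples is a trade-off between run time and stability.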

Next, we test the null hypothesis of "sufficient confounding strength", meaning that the chosen $\Gamma$ is large enough to explain the discrepancy between the observational study and the trial. We run the test for increasing values of $\Gamma$ and stop at the first value for which the null hypothesis is not rejected; this sequential procedure avoids a multiple testing problem.
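The sequential procedure can be sketched as follows. This is an illustration under stated assumptions, not the code behind `run_multiple_cate_hypothesis_test`: the function name `sequential_lower_bound`, the one-sided normal critical value, and the toy test statistics are all hypothetical, and `alpha` is expressed here as a probability.

```python
from statistics import NormalDist


def sequential_lower_bound(gammas, statistic_fn, alpha=0.05):
    """Test increasing gammas and stop at the first one that is not rejected.

    Returns the first non-rejected gamma (or None if all are rejected) and a
    dictionary with the outcome of every test that was actually run.
    """
    critical_value = NormalDist().inv_cdf(alpha)
    results = {}
    for gamma in sorted(gammas):
        statistic = statistic_fn(gamma)
        reject = statistic < critical_value
        results[gamma] = {"test_statistic": statistic, "reject": reject}
        if not reject:
            # Larger gammas are never tested, so no correction for
            # multiple comparisons is applied here.
            return gamma, results
    return None, results


# Toy statistics (illustrative only), increasing with gamma.
toy_statistics = {1.0: -8.0, 2.0: -4.0, 3.0: -2.0, 4.0: -1.0, 5.0: 0.5}
gamma_lb, results = sequential_lower_bound(toy_statistics, toy_statistics.get)
print(gamma_lb, sorted(results))
```

Because the loop stops at the first non-rejection, values of $\Gamma$ beyond that point have no test statistic at all.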

In [3]:
n_bootstrap = 50 #bootstrap samples
user_conf = [1.0, 3.0, 5.0, 5.5, 6.0, 7.0] #conf strengths to be tested

x_rct, t_rct, y_rct = (
    data.rct.x,
    data.rct.t,
    data.rct.y,
)

x_obs, t_obs, y_obs = (
    data.x,
    data.t,
    data.y,
)

ate = y_rct[t_rct == 1].mean() - y_rct[t_rct == 0].mean()
ate_variance = compute_bootstrap_variance(Y=y_rct, T=t_rct, n_bootstraps=n_bootstrap, arm=None)

bounds_estimator = MultipleCATEBoundEstimators(
    gammas=user_conf, n_bootstrap=n_bootstrap
)

bounds_estimator.fit(x_obs=x_obs, t_obs=t_obs, y_obs=y_obs)

All CATE estimators are now instantiated.
All CATE bounds estimators are now trained. Elapsed time: 40.76 seconds


In [4]:
from test_confounding.test import run_multiple_cate_hypothesis_test

results_dict_cate = run_multiple_cate_hypothesis_test(
    bounds_estimator=bounds_estimator,
    ate=ate,
    ate_variance=ate_variance,
    alpha=5.0,
    x_rct=x_rct,
    user_conf=user_conf,
    verbose=False,
)

We now present the outcomes of the testing procedure. We identify $\Gamma_{LB}=5.5$ as a lower bound for the true strength $\Gamma^*=6.0$. Since $\Gamma_{LB} \le \Gamma^*$, the bound is valid, and its closeness to $\Gamma^*$ indicates reasonable power. In practical applications, a finer discretisation of the tested $\Gamma$ values would be preferable.

In [5]:
def display_gamma_info(gamma_dict, user_conf): 
    print("Test Results:")
    print("-" * 30)
    
    for gamma in user_conf:
        if gamma in gamma_dict:
            value = gamma_dict[gamma]
            reject = "Yes" if value['reject'] else "No"
            test_statistic = value['test_statistic']
        else:
            # Gammas beyond the first non-rejection are not tested, so report them as not rejected
            reject = "No"
            test_statistic = "N/A"

        print(f"Γ: {gamma}")
        print(f"  - Reject Null Hypothesis: {reject}")
        print(f"  - Test Statistic: {test_statistic}")
        print("-" * 30)
    
    gamma_effective = gamma_dict.get('gamma_effective', 'Not available')
    print(f"Γ_{{LB}} (Lower Bound): {gamma_effective}")


display_gamma_info(results_dict_cate, user_conf)

Test Results:
------------------------------
Γ: 1.0
  - Reject Null Hypothesis: Yes
  - Test Statistic: -10.756724410509511
------------------------------
Γ: 3.0
  - Reject Null Hypothesis: Yes
  - Test Statistic: -5.2264810086448374
------------------------------
Γ: 5.0
  - Reject Null Hypothesis: Yes
  - Test Statistic: -1.976865212607717
------------------------------
Γ: 5.5
  - Reject Null Hypothesis: No
  - Test Statistic: -1.2038360113769537
------------------------------
Γ: 6.0
  - Reject Null Hypothesis: No
  - Test Statistic: N/A
------------------------------
Γ: 7.0
  - Reject Null Hypothesis: No
  - Test Statistic: N/A
------------------------------
Γ_{LB} (Lower Bound): 5.5


Finally, we apply the same methodology with the observational study as the target population. We restrict attention to the region where the trial and the observational study overlap, which we obtain by trimming. Here, we estimate ATE sensitivity bounds on the observational study using the QB estimator and compare them to the (weighted) ATE estimated in the trial.

We observe that the test is still valid and also identifies $\Gamma_{LB}=5.5$ as the lower bound for the true confounding strength.
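One way to read the trimming and re-weighting step is sketched below. The sketch assumes that the probabilities of trial membership are already available, that treatment is assigned with probability 0.5 in the trial, and that inverse-odds weights move the trial estimate towards the observational population. The function name `transported_ate` and the toy numbers are hypothetical.

```python
import numpy as np


def transported_ate(y, t, pi_s, n_obs, threshold=0.001):
    """Weighted trial ATE targeting the observational population.

    y, t  : outcomes and binary treatments of the trial units
    pi_s  : probability of trial membership for each trial unit
    n_obs : number of observational units kept after trimming
    """
    # Trimming: drop trial units with negligible probability of trial membership.
    keep = pi_s > threshold
    y, t, pi_s = y[keep], t[keep], pi_s[keep]
    # Per-unit treated-minus-control contrast; the factor 2 is 1 / 0.5,
    # the assumed treatment probability in the trial.
    contrast = 2.0 * (y * t - y * (1 - t))
    # Inverse-odds weights shift the trial sample towards the observational one.
    odds = (1.0 - pi_s) / pi_s
    rct_to_obs_ratio = keep.sum() / n_obs
    return rct_to_obs_ratio * np.mean(contrast * odds)


# Toy trial (illustrative only): effect of 2, balanced arms, and one unit
# that falls below the trimming threshold.
t_toy = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
y_toy = np.array([1.0, 3.0, 1.0, 3.0, 100.0])
pi_toy = np.array([0.25, 0.25, 0.25, 0.25, 0.0005])

estimate = transported_ate(y_toy, t_toy, pi_toy, n_obs=12)
print(estimate)
```

In the toy example the selection probability is constant among the kept units, so the weights cancel against the sample-size ratio and the estimate reduces to the unweighted contrast.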

In [6]:
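# Fit a selection model: s = 1 for trial samples, s = 0 for observational samples
x = np.concatenate((data.rct.x.reshape(-1), data.x.reshape(-1))).reshape(-1, 1)
s = np.concatenate((np.ones(data.rct.x.size), np.zeros(data.x.size)))
clf = RandomForestClassifier(max_depth=5, random_state=seed)
clf.fit(x, s)
pi_s = clf.predict_proba(x)[:, 1]  # estimated probability of belonging to the trial

# Trimming: keep only samples with a non-negligible probability of trial membership
O_idx = pi_s > 0.001

# The first n_rct entries of O_idx refer to the trial, the remaining ones to the observational study
x_obs, t_obs, y_obs = (
    data.x[O_idx[n_rct:]],
    data.t[O_idx[n_rct:]],
    data.y[O_idx[n_rct:]],
)
x_rct, t_rct, y_rct = (
    data.rct.x[O_idx[:n_rct]],
    data.rct.t[O_idx[:n_rct]],
    data.rct.y[O_idx[:n_rct]],
)

# Weighted ATE in the trial, with bootstrap variance
mask = np.logical_and(O_idx, s)  # trimmed trial samples
rct_to_obs_ratio = s[O_idx].sum() / (s[O_idx].size - s[O_idx].sum())
# Per-sample contrast weighted by the inverse odds of trial membership
# (the factor 2 is consistent with a 0.5 treatment probability in the trial)
ys = 2 * (y_rct * t_rct - y_rct * (1 - t_rct)) * (1 - pi_s[mask]) / pi_s[mask]
bootstrap_rct = bootstrap((ys,), np.mean, n_resamples=n_bootstrap, axis=0)
std_rct = bootstrap_rct.standard_error
var_rct = np.power(std_rct, 2) * (rct_to_obs_ratio**2)
mean_rct = rct_to_obs_ratio * ys.mean()

# Bootstrap distribution of the QB sensitivity bounds on the observational study
bootstrap_sa = BootstrapSensitivityAnalysis(
    sa_name="QB",
    inputs=x_obs,
    treatment=t_obs,
    outcome=y_obs,
    gammas=user_conf,
    seed=seed,
    e_x_func=e_x_func,
)
bounds_dist = bootstrap_sa.bootstrap(num_samples=n_bootstrap, fast_quantile=True)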


Quantile functions are now trained for QB. Starting bootstrap.
Elapsed time for 50 bootstrap samples: 292.27 seconds


In [7]:
from test_confounding.test import run_multiple_ate_hypothesis_test

results_dict_ate = run_multiple_ate_hypothesis_test(
    mean_rct=mean_rct,
    var_rct=var_rct,
    bounds_dist=bounds_dist,
    alpha=5.0,
    user_conf=user_conf,
    verbose=False,
)

In [8]:
display_gamma_info(results_dict_ate, user_conf)

Test Results:
------------------------------
Γ: 1.0
  - Reject Null Hypothesis: Yes
  - Test Statistic: -12.92638469497381
------------------------------
Γ: 3.0
  - Reject Null Hypothesis: Yes
  - Test Statistic: -5.849617052695993
------------------------------
Γ: 5.0
  - Reject Null Hypothesis: Yes
  - Test Statistic: -1.98155584903497
------------------------------
Γ: 5.5
  - Reject Null Hypothesis: No
  - Test Statistic: -1.1879252583555024
------------------------------
Γ: 6.0
  - Reject Null Hypothesis: No
  - Test Statistic: N/A
------------------------------
Γ: 7.0
  - Reject Null Hypothesis: No
  - Test Statistic: N/A
------------------------------
Γ_{LB} (Lower Bound): 5.5
