# **Target Trial Emulation**

### **Submitted by:**
- **Ladrera**, Raiken
- **Tibon**, Hestia

## **Instructions**

Assignment 1 for Clustering: New and novel methods in Machine Learning are made either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

In this exercise (Assignment 1 of the Clustering Topic), we will try to develop a novel method of Target Trial Emulation by integrating clustering concepts into the existing framework. Target Trial Emulation is a new methodological framework in epidemiology that tries to account for the biases in older, traditional designs.

These are the instructions:
- Look at this website: https://rpubs.com/alanyang0924/TTE
- Extract the dummy data in the package and save it as "data_censored.csv"
- Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
- Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
- Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
- Do this in pairs, preferably with your thesis partner.
- Push to your GitHub repository.

HINT: For those who don't have a thesis topic yet, you can actually develop a thesis topic out of this assignment.
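
One possible place to integrate clustering, sketched here under assumptions rather than prescribed by the assignment: group patients by their baseline covariates and carry the cluster label forward as a stratification variable. The routine below is a dependency-light NumPy version of Lloyd's k-means. The function name `kmeans_labels`, the choice of `k=2`, the covariates used, and the toy values are all illustrative and are not taken from `data_censored.csv`.

```python
import numpy as np
import pandas as pd

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every row to every centroid
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                new_centroids[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

# Toy baseline rows (made-up values)
baseline = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [30, 32, 31, 60, 62, 61],
    "x2": [-1.0, -0.8, -1.1, 1.0, 1.2, 0.9],
})
features = baseline[["age", "x2"]]
# Standardise so that age does not dominate the distance
z = (features - features.mean()) / features.std()
baseline["cluster"] = kmeans_labels(z.to_numpy(), k=2)
print(baseline)
```

In practice a library implementation such as scikit-learn's `KMeans` would be the more usual choice; the hand-rolled version only keeps the sketch free of extra dependencies.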

## **1. Setup**

In [1]:
import os
import pandas as pd

estimand_pp = "PP"  # Per-protocol
estimand_itt = "ITT"  # Intention-to-treat

# Directories for saving outputs
trial_pp_dir = os.path.join(os.getcwd(), "trial_pp")
trial_itt_dir = os.path.join(os.getcwd(), "trial_itt")

os.makedirs(trial_pp_dir, exist_ok=True)
os.makedirs(trial_itt_dir, exist_ok=True)

print(f"Directories created:\n{trial_pp_dir}\n{trial_itt_dir}")


Directories created:
C:\Users\meizi\Documents\GitHub\TTE-v2\trial_pp
C:\Users\meizi\Documents\GitHub\TTE-v2\trial_itt


## **2. Data Preparation**

In [14]:
data_path = "data_censored.csv" 
data_censored = pd.read_csv(data_path)
print(data_censored.head())

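# Keep only the columns used downstream.
# Note: "censored" is not in this list, so it is not available to the censoring model in section 3.
columns_needed = ["id", "period", "treatment", "outcome", "eligible", "age", "x1", "x2", "x3"]
trial_pp = data_censored[columns_needed].copy()
trial_itt = data_censored[columns_needed].copy()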

print(trial_pp.head())
print(trial_itt.head())


   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  
0         0         1  
1         0         0  
2         0         0  
3         0         0  
4         0         0  
   id  period  treatment  outcome  eligible  age  x1        x2  x3
0   1       0          1        0         1   36   1  1.146148   0
1   1       1          1        0         0   37   1  0.002200   0
2   1       2          1        0         0   38   0 -0.481762   0
3   1       3          1        0         0   39   0  0.007872   0
4   1       4          1        0         0   40   1  0.216054   0
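
The first printout shows a `censored` column that the column selection above drops. If the censoring model in section 3 is meant to use it (an assumption about intent, not something the notebook states), one option is to keep the column and derive an indicator of remaining uncensored. The toy frame and the name `not_censored` are illustrative only.

```python
import pandas as pd

# Toy stand-in for data_censored.csv (made-up values, same column names)
data_censored = pd.DataFrame({
    "id": [1, 1, 2, 2], "period": [0, 1, 0, 1], "treatment": [1, 1, 0, 1],
    "x1": [1, 0, 0, 1], "x2": [0.5, -0.2, 1.1, 0.3], "x3": [0, 0, 1, 1],
    "x4": [0.7, 0.7, -0.1, -0.1], "age": [36, 37, 50, 51],
    "age_s": [0.08, 0.17, 1.25, 1.33], "outcome": [0, 0, 0, 1],
    "censored": [0, 1, 0, 0], "eligible": [1, 0, 1, 0],
})

columns_needed = ["id", "period", "treatment", "outcome", "eligible",
                  "censored", "age", "x1", "x2", "x3"]

# Fail early if the file lacks an expected column
missing = [c for c in columns_needed if c not in data_censored.columns]
if missing:
    raise KeyError(f"Missing columns: {missing}")

trial_pp = data_censored[columns_needed].copy()
# Censoring weights usually model the probability of remaining uncensored
trial_pp["not_censored"] = 1 - trial_pp["censored"]
print(trial_pp[["id", "period", "censored", "not_censored"]])
```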

## **3. Weight Models and Censoring**

### **3.1. Treatment and Switching Weight Models**

In [23]:
import statsmodels.api as sm
import numpy as np
import pandas as pd

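def check_perfect_separation(data, outcome_col, covariates):
    """Drop covariates that are constant or that look perfectly predictive of the outcome.

    Heuristic only: a covariate is dropped when each of its distinct values maps
    to a single outcome value. A continuous covariate whose values are all unique
    (such as x2 in this dataset) always meets that condition, so it is dropped
    even when it does not actually separate the outcome.
    """
    valid_covariates = covariates.copy()
    for col in covariates:
        if data[col].nunique() <= 1 or data.groupby(col)[outcome_col].nunique().max() == 1:
            print(f"Warning: {col} perfectly predicts {outcome_col}, removing it.")
            valid_covariates.remove(col)
    if not valid_covariates:
        print("Warning: No valid covariates remain after filtering.")
    return valid_covariates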


def fit_logistic_model(data, outcome_col, covariates):
    """Fit a logistic regression model with increased solver accuracy."""
    if not covariates:
        return np.full(len(data), 0.5)  # Default probability if no valid predictors
    
    X = sm.add_constant(data[covariates])
    y = data[outcome_col]
    try:
        model = sm.Logit(y, X).fit_regularized(disp=0, alpha=1e-6, maxiter=500)
        return np.clip(model.predict(X), 1e-6, 1 - 1e-6)
    except Exception as e:
        print(f"Warning: Logistic regression failed for {outcome_col}: {e}")
        return np.full(len(data), 0.5)


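def calculate_weights(data, treatment_col, numerator_covariates, denominator_covariates):
    """Calculate stabilized inverse probability of treatment weights (IPTW).

    Numerator and denominator models are fitted separately within each stratum
    of previous-period treatment (0 or 1).
    """
    data = data.copy()
    # Lag treatment by one row. The shift runs over the whole frame, not per id, so
    # (with rows sorted by id and period) the first row of each id takes the last
    # treatment of the previous id, and only the very first row has no lagged value.
    prev_treatment = data[treatment_col].shift(1)
    
    for prev_value in [0, 1]:
        subset = data.loc[prev_treatment == prev_value].copy()
        if subset.empty:
            continue
        
        num_cov = check_perfect_separation(subset, treatment_col, numerator_covariates.copy())
        denom_cov = check_perfect_separation(subset, treatment_col, denominator_covariates.copy())
        
        # Fitted P(treatment = 1) is stored only for rows in this stratum; other rows stay NaN
        data.loc[subset.index, f"num_propensity_{prev_value}"] = fit_logistic_model(subset, treatment_col, num_cov)
        data.loc[subset.index, f"denom_propensity_{prev_value}"] = fit_logistic_model(subset, treatment_col, denom_cov)
    
    # Each row belongs to at most one stratum, so the row-wise sum (which skips NaN)
    # picks that stratum's fitted probability. The ratio uses P(treatment = 1) for every
    # row, regardless of the treatment actually received. Rows with no lagged value get
    # a weight of 0; the 1e-6 guards against division by zero.
    data["stabilized_weight"] = (
        data.filter(like="num_propensity_").sum(axis=1) /
        (data.filter(like="denom_propensity_").sum(axis=1) + 1e-6)
    )
    return data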

def apply_ipcw(data, censor_col, numerator_covariates, denominator_covariates):
    """Apply inverse probability of censoring weights (IPCW)."""
    data = data.copy()
    num_cov = check_perfect_separation(data, censor_col, numerator_covariates.copy())
    denom_cov = check_perfect_separation(data, censor_col, denominator_covariates.copy())
    
    data["num_censor_prob"] = fit_logistic_model(data, censor_col, num_cov)
    data["denom_censor_prob"] = fit_logistic_model(data, censor_col, denom_cov)
    
    data["ipcw_weight"] = (
        data["num_censor_prob"] / (data["denom_censor_prob"] + 1e-6)
    )
    return data

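# Example application: treatment weights for the per-protocol trial
numerator_covariates = ["age"]
denominator_covariates = ["age", "x1", "x3"]
trial_pp = calculate_weights(trial_pp, "treatment", numerator_covariates, denominator_covariates)

# Censoring weights.
# Note: "eligible" is passed as the event column here; the dataset's "censored"
# column is not among the columns kept in trial_pp.
numerator_censor_covariates = ["x2"]
denominator_censor_covariates = ["x2", "x1"]
trial_pp = apply_ipcw(trial_pp, "eligible", numerator_censor_covariates, denominator_censor_covariates)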




Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
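
In `calculate_weights` above, the ratio is built from the fitted probability of `treatment = 1` for every row, and treatment is lagged across the whole frame. A more conventional construction, sketched here under assumptions, uses the probability of the treatment each row actually received and lags treatment within each patient. The sketch covers only the pandas bookkeeping: it assumes fitted probabilities already sit in columns hypothetically named `p_num` and `p_denom`, assigns weight 1 to baseline rows as a simplifying choice, and uses made-up values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [1, 1, 0, 0, 1],
    "p_num":     [0.5, 0.6, 0.6, 0.5, 0.4],    # P(treatment = 1), numerator model
    "p_denom":   [0.5, 0.8, 0.75, 0.5, 0.25],  # P(treatment = 1), denominator model
})
df = df.sort_values(["id", "period"])

# Lag treatment within each patient so ids never borrow each other's history
df["prev_treatment"] = df.groupby("id")["treatment"].shift(1)

# Probability of the treatment actually received under each model
treated = df["treatment"] == 1
obs_num = np.where(treated, df["p_num"], 1 - df["p_num"])
obs_denom = np.where(treated, df["p_denom"], 1 - df["p_denom"])

# Per-period ratio; baseline rows (no previous treatment) are given weight 1 here
df["w_period"] = np.where(df["prev_treatment"].isna(), 1.0, obs_num / obs_denom)

# Cumulative product over follow-up gives the weight at each period
df["weight"] = df.groupby("id")["w_period"].cumprod()
print(df[["id", "period", "treatment", "prev_treatment", "w_period", "weight"]])
```

Whether weights should be accumulated over time, and how baseline rows are handled, depends on the estimand; both choices here are assumptions for illustration.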


In [20]:
print(trial_pp["x2"].value_counts())
print(trial_pp.groupby("x2")["eligible"].nunique())


x2
 1.146148    1
-1.162366    1
 0.024422    1
-1.711934    1
 0.151505    1
            ..
 0.846777    1
-0.980451    1
 1.503913    1
 0.231787    1
-1.340497    1
Name: count, Length: 725, dtype: int64
x2
-3.284355    1
-2.789628    1
-2.778994    1
-2.716380    1
-2.614978    1
            ..
 2.465086    1
 2.831169    1
 2.866680    1
 3.321383    1
 3.907648    1
Name: eligible, Length: 725, dtype: int64
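
The output above shows 725 distinct `x2` values, each occurring once, so counting distinct `eligible` values per `x2` value returns 1 everywhere and says nothing about separation. A hedged alternative is to run the per-value check only on low-cardinality covariates. The function name `flag_separation` and the threshold `max_levels=10` are hypothetical choices for this sketch.

```python
import pandas as pd

def flag_separation(data, outcome_col, covariates, max_levels=10):
    """Return covariates that are constant or, if low-cardinality, map each level to one outcome."""
    flagged = []
    for col in covariates:
        n_levels = data[col].nunique()
        if n_levels <= 1:
            flagged.append(col)  # constant column carries no information
        elif n_levels <= max_levels:
            # The per-level check is only meaningful for categorical-like columns
            if data.groupby(col)[outcome_col].nunique().max() == 1:
                flagged.append(col)
    return flagged

# Made-up data: 12 rows, one column of each kind
toy = pd.DataFrame({
    "y":     [0, 0, 0, 1, 1, 1] * 2,
    "sep":   [0, 0, 0, 1, 1, 1] * 2,        # binary, perfectly aligned with y
    "cont":  [i * 0.1 for i in range(12)],  # all values unique
    "const": [1] * 12,
    "ok":    [0, 1] * 6,                    # binary, both y values at each level
})
print(flag_separation(toy, "y", ["sep", "cont", "const", "ok"]))
```

A continuous covariate can still separate the outcome through a threshold; this check does not detect that case and only avoids the false alarm seen with `x2`.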


In [21]:
print(trial_pp[["x1", "x2", "x3", "age", "eligible"]].corr())


                x1        x2        x3       age  eligible
x1        1.000000  0.072324 -0.034552  0.040720  0.060087
x2        0.072324  1.000000 -0.067664  0.002637  0.074121
x3       -0.034552 -0.067664  1.000000 -0.135130  0.079643
age       0.040720  0.002637 -0.135130  1.000000 -0.319038
eligible  0.060087  0.074121  0.079643 -0.319038  1.000000


In [24]:
print(trial_pp.isnull().sum())


id                      0
period                  0
treatment               0
outcome                 0
eligible                0
age                     0
x1                      0
x2                      0
x3                      0
num_propensity_0      340
denom_propensity_0    340
num_propensity_1      386
denom_propensity_1    386
stabilized_weight       0
num_censor_prob         0
denom_censor_prob       0
ipcw_weight             0
dtype: int64
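
The null counts follow from the stratified fit: each row is scored only by the model for its own previous-treatment stratum, so it is missing in the other stratum's columns (340 and 386 missing values respectively). One way to collapse each pair into a single column is sketched below on made-up values; the column name `num_propensity` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num_propensity_0": [np.nan, 0.30, np.nan, 0.40],
    "num_propensity_1": [np.nan, np.nan, 0.70, np.nan],
})

# Take the stratum-0 value where present, otherwise the stratum-1 value
df["num_propensity"] = df["num_propensity_0"].combine_first(df["num_propensity_1"])

# Rows in neither stratum (no previous treatment) stay NaN instead of silently becoming 0
print(df)
print(df["num_propensity"].isna().sum())
```

Keeping those rows as NaN makes it explicit that a decision is still needed for them, rather than letting a row-wise sum turn them into a zero weight.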
