# **Target Trial Emulation**

### **Submitted by:**
- **Ladrera**, Raiken
- **Tibon**, Hestia

## **Instructions**

Assignment 1 for Clustering: Novel methods in Machine Learning are developed either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

In this exercise (Assignment 1 of the Clustering Topic), we will try to develop a novel method of Target Trial Emulation by integrating clustering concepts into the existing framework. Target Trial Emulation is a new methodological framework in epidemiology that tries to account for the biases in older, traditional designs.

These are the instructions:
- Look at this website: https://rpubs.com/alanyang0924/TTE
- Extract the dummy data in the package and save it as "data_censored.csv"
- Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
- Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
- Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
- Do this in pairs, preferably with your thesis partner.
- Push to your GitHub repository.
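
One possible place for the clustering step in TTE-v2 is to group participants by their baseline covariates before any weight model is fitted. The sketch below is a hypothetical illustration and not part of the original R workflow: it runs a small hand-written k-means on synthetic baseline rows. The helper name `kmeans_labels`, the farthest-point initialisation, and the choice of `age` and `x2` as clustering features are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def kmeans_labels(features, k, n_iter=50):
    """Hypothetical helper: plain Lloyd's k-means with farthest-point initialisation."""
    X = np.asarray(features, dtype=float)
    # Start from the first row, then repeatedly add the row farthest from the chosen centroids.
    centroids = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign every row to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

# Synthetic baseline rows (one per participant): two well-separated groups.
rng = np.random.default_rng(0)
baseline = pd.DataFrame({
    "id": np.arange(1, 21),
    "age": np.r_[rng.normal(30, 1, 10), rng.normal(60, 1, 10)],
    "x2": np.r_[rng.normal(-1, 0.1, 10), rng.normal(1, 0.1, 10)],
})

# Standardise the features so that age does not dominate the distance.
feats = baseline[["age", "x2"]]
feats = (feats - feats.mean()) / feats.std()
baseline["cluster"] = kmeans_labels(feats, k=2)

print(baseline.groupby("cluster")[["age", "x2"]].mean())
```

The resulting `cluster` column could then be merged back onto the person-period data by `id` and used, for example, as an extra covariate in the weight models or as a stratification variable. Which of these is appropriate is a design decision for TTE-v2.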


## **1. Setup**

In [1]:
import os
import pandas as pd
import statsmodels.api as sm

estimand_pp = "PP"  # Per-protocol
estimand_itt = "ITT"  # Intention-to-treat

trial_pp_dir = os.path.join(os.getcwd(), "trial_pp")
trial_itt_dir = os.path.join(os.getcwd(), "trial_itt")

os.makedirs(trial_pp_dir, exist_ok=True)
os.makedirs(trial_itt_dir, exist_ok=True)

print(f"Directories created:\n{trial_pp_dir}\n{trial_itt_dir}")

Directories created:
C:\Users\meizi\Documents\GitHub\TTE-v2\trial_pp
C:\Users\meizi\Documents\GitHub\TTE-v2\trial_itt


## **2. Data Preparation**

In [2]:
data_path = "data_censored.csv" 
data_censored = pd.read_csv(data_path)

print("Initial Data Preview:")
#print(data_censored.head())

columns_needed = ["id", "period", "treatment", "outcome", "eligible", "age", "x1", "x2", "x3", "censored"]
trial_pp = data_censored[columns_needed].copy()
trial_itt = data_censored[columns_needed].copy()

print("Prepared Data (PP Model):")
print(trial_pp.head())

print("Prepared Data (ITT Model):")
print(trial_itt.head())

Initial Data Preview:
Prepared Data (PP Model):
   id  period  treatment  outcome  eligible  age  x1        x2  x3  censored
0   1       0          1        0         1   36   1  1.146148   0         0
1   1       1          1        0         0   37   1  0.002200   0         0
2   1       2          1        0         0   38   0 -0.481762   0         0
3   1       3          1        0         0   39   0  0.007872   0         0
4   1       4          1        0         0   40   1  0.216054   0         0
Prepared Data (ITT Model):
   id  period  treatment  outcome  eligible  age  x1        x2  x3  censored
0   1       0          1        0         1   36   1  1.146148   0         0
1   1       1          1        0         0   37   1  0.002200   0         0
2   1       2          1        0         0   38   0 -0.481762   0         0
3   1       3          1        0         0   39   0  0.007872   0         0
4   1       4          1        0         0   40   1  0.216054   0         0


## **3. Weight Models and Censoring**

#### **3.1. Trial Class: Treatment and Censoring**

In [3]:
class Trial:
    """Holds one analysis dataset and the censoring model fitted to it."""

    def __init__(self, data):
        self.data = data.copy()
        self.switch_model = None
        self.switch_weights = None
        self.censor_model = None
        self.censor_weights = None
        self.numerator = None
        self.denominator = None
        self.censor_event = None
        self.pool_models = None
        self.model_fitted = False

    def set_censor_weight_model(self, censor_event, numerator, denominator, pool_models="none"):
        # Store the model specification only; fitting happens in calculate_weights().
        self.censor_event = censor_event
        self.numerator = numerator
        self.denominator = denominator
        self.pool_models = pool_models
        self.model_fitted = False
        print(f"Censor Model set: 1 - {self.censor_event} ~ {self.numerator} / {self.denominator}")

    def calculate_weights(self, save_path, model_type="logit"):
        # Only the denominator specification is used here; `numerator`,
        # `pool_models` and `model_type` are stored or accepted but not used.
        os.makedirs(save_path, exist_ok=True)
        # DataFrame.eval reads the string arithmetically, so "x2 + x1"
        # becomes a single predictor holding the row-wise sum.
        self.data["denom"] = self.data.eval(self.denominator)
        # 1 = not censored, 0 = censored.
        self.data["censor_binary"] = 1 - self.data[self.censor_event]
        X = sm.add_constant(self.data["denom"])
        y = self.data["censor_binary"]
        model = sm.Logit(y, X).fit(disp=0)
        model.save(os.path.join(save_path, "censor_model.pickle"))
        self.censor_model = model
        # Predicted probability of remaining uncensored for each row.
        self.censor_weights = model.predict(X)
        self.model_fitted = True
        print(f"Censor weights saved in {save_path}.")

    @property
    def get_censor_weights(self):
        # Fit lazily on first access, saving the model under "trial_default".
        if not self.model_fitted:
            print("Model not fitted. Running `calculate_weights()`...")
            self.calculate_weights("trial_default")
        return self.censor_weights
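
In the R workflow, a right-hand side such as `x2 + x1` is a model formula that names two covariates, whereas `DataFrame.eval("x2 + x1")` adds the two columns into a single predictor. The sketch below contrasts the two readings on toy data. `formula_terms` is a hypothetical helper that only handles plain `+`-separated column names.

```python
import pandas as pd

def formula_terms(rhs):
    """Hypothetical helper: split 'x2 + x1' into ['x2', 'x1'] (no interactions or transforms)."""
    return [term.strip() for term in rhs.split("+") if term.strip()]

toy = pd.DataFrame({"x1": [1, 0, 1], "x2": [0.5, -0.2, 1.0]})

# Arithmetic reading: one column holding the row-wise sum.
summed = toy.eval("x2 + x1")

# Formula reading: one design-matrix column per covariate.
design = toy[formula_terms("x2 + x1")]

print(summed.tolist())
print(list(design.columns))
```

Passing `design` through `sm.add_constant` and into `sm.Logit` would estimate one coefficient per covariate instead of a single coefficient for their sum.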

#### **3.2. Example Usage**

**Per-Protocol (PP) Model**

In [4]:
trial_pp = Trial(data_censored)
trial_pp.set_censor_weight_model(
    censor_event="censored",
    numerator="x2",
    denominator="x2 + x1",
    pool_models="none"
)
print(trial_pp.get_censor_weights.head())

Censor Model set: 1 - censored ~ x2 / x2 + x1
Model not fitted. Running `calculate_weights()`...
Censor weights saved in trial_default.
0    0.882943
1    0.908124
2    0.933494
3    0.925940
4    0.903820
dtype: float64


**Intention-To-Treat (ITT) Model**

In [5]:
trial_itt = Trial(data_censored)
trial_itt.set_censor_weight_model(
    censor_event="censored",
    numerator="x2",
    denominator="x2 + x1",
    pool_models="numerator"
)
print(trial_itt.get_censor_weights.head())

Censor Model set: 1 - censored ~ x2 / x2 + x1
Model not fitted. Running `calculate_weights()`...
Censor weights saved in trial_default.
0    0.882943
1    0.908124
2    0.933494
3    0.925940
4    0.903820
dtype: float64


## **4. Calculate Weights**

In [6]:
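import statsmodels.api as sm
import pandas as pd

df = pd.read_csv("data_censored.csv")

# Covariates for the censoring model, plus an intercept.
X = df[["x1", "x2", "x3", "x4"]]
X = sm.add_constant(X)

# Logistic regression for the probability of being censored.
y = df["censored"]
model = sm.Logit(y, X).fit()

print(model.summary())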

Optimization terminated successfully.
         Current function value: 0.258039
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:               censored   No. Observations:                  725
Model:                          Logit   Df Residuals:                      720
Method:                           MLE   Df Model:                            4
Date:                Sat, 08 Mar 2025   Pseudo R-squ.:                 0.07436
Time:                        17:59:49   Log-Likelihood:                -187.08
converged:                       True   LL-Null:                       -202.11
Covariance Type:            nonrobust   LLR p-value:                 4.761e-06
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.1848      0.222     -9.839      0.000      -2.620      -1.750
x1            -0.6156      0.

In [7]:
print(df.columns) 

Index(['id', 'period', 'treatment', 'x1', 'x2', 'x3', 'x4', 'age', 'age_s',
       'outcome', 'censored', 'eligible'],
      dtype='object')


## **5. Specify Outcome Model**

***Fit Logistic Regression Models to Estimate Probabilities.*** 

In R, `calculate_weights()` estimates the censoring and treatment-assignment models separately. Here, one logistic regression is fitted for each, and the predicted probabilities are inverted to obtain the censoring weights and the *treatment weights*. To use *stabilized weights* (as in R), a numerator model also has to be estimated.
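
As a reference point, stabilized inverse probability weights for a single period are commonly written as below. The notation is introduced here for illustration and does not come from the R tutorial: $A_t$ is the treatment received, $C_t$ the censoring indicator, $V$ the numerator covariates, and $L_t$ the denominator covariates.

```latex
sw^{A}_{t} = \frac{P(A_t = a_t \mid V)}{P(A_t = a_t \mid L_t)},
\qquad
sw^{C}_{t} = \frac{P(C_t = 0 \mid V)}{P(C_t = 0 \mid L_t)}
```

Both ratios use the probability of what was actually observed: the treatment value $a_t$ that the person received, and remaining uncensored ($C_t = 0$). Setting the numerators to 1 gives the unstabilized weights.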

**Censored:**

In [8]:
import statsmodels.api as sm
import pandas as pd

df = pd.read_csv("data_censored.csv")

# Covariates for the censoring model, plus an intercept.
X_censor = df[["x1", "x2"]]
X_censor = sm.add_constant(X_censor)

y_censor = df["censored"]

model_censor = sm.Logit(y_censor, X_censor).fit()

# Predicted P(censored = 1) for each row, and its inverse.
df["p_censor"] = model_censor.predict(X_censor)
df["weight_censor"] = 1 / df["p_censor"]

print(model_censor.summary())


Optimization terminated successfully.
         Current function value: 0.267425
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:               censored   No. Observations:                  725
Model:                          Logit   Df Residuals:                      722
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Mar 2025   Pseudo R-squ.:                 0.04069
Time:                        17:59:53   Log-Likelihood:                -193.88
converged:                       True   LL-Null:                       -202.11
Covariance Type:            nonrobust   LLR p-value:                 0.0002679
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.2059      0.165    -13.339      0.000      -2.530      -1.882
x1            -0.7019      0.
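
Inverse probability of censoring weights are conventionally built from the probability of remaining uncensored, that is, one minus the predicted censoring probability. The sketch below shows that arithmetic on toy values. The probabilities in `p_cens_num` and `p_cens_den` are made-up stand-ins for the predictions of a numerator and a denominator model, and the column names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "censored": [0, 0, 1, 0],
    # Assumed model outputs: P(censored = 1) from a numerator and a denominator model.
    "p_cens_num": [0.10, 0.10, 0.10, 0.10],
    "p_cens_den": [0.05, 0.20, 0.50, 0.10],
})

# Probability of remaining uncensored under each model.
p_unc_num = 1 - toy["p_cens_num"]
p_unc_den = 1 - toy["p_cens_den"]

# Unstabilized weight: inverse probability of remaining uncensored.
toy["ipcw"] = 1 / p_unc_den
# Stabilized weight: numerator probability divided by denominator probability.
toy["ipcw_stabilized"] = p_unc_num / p_unc_den

# The weights are applied to the rows that remain in the analysis (uncensored).
print(toy.loc[toy["censored"] == 0, ["ipcw", "ipcw_stabilized"]])
```

When the numerator and denominator models predict the same probability, as in the last row, the stabilized weight is 1.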

**Treatment:**

In [9]:
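# Denominator model: treatment given covariates, plus an intercept.
X_treatment = df[["x1", "x2", "x3"]]
X_treatment = sm.add_constant(X_treatment)

y_treatment = df["treatment"]

model_treatment = sm.Logit(y_treatment, X_treatment).fit()

# Predicted P(treatment = 1) for each row, and its inverse.
df["p_treatment"] = model_treatment.predict(X_treatment)
df["weight_treatment"] = 1 / df["p_treatment"]

print(model_treatment.summary())

# Numerator model: intercept only, i.e. the marginal P(treatment = 1).
X_numerator = sm.add_constant(df[[]])
model_numerator = sm.Logit(y_treatment, X_numerator).fit()

# Ratio of the marginal to the covariate-conditional P(treatment = 1).
df["p_numerator"] = model_numerator.predict(X_numerator)
df["stabilized_weight"] = df["p_numerator"] / df["p_treatment"]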

Optimization terminated successfully.
         Current function value: 0.682194
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              treatment   No. Observations:                  725
Model:                          Logit   Df Residuals:                      721
Method:                           MLE   Df Model:                            3
Date:                Sat, 08 Mar 2025   Pseudo R-squ.:                 0.01281
Time:                        17:59:55   Log-Likelihood:                -494.59
converged:                       True   LL-Null:                       -501.01
Covariance Type:            nonrobust   LLR p-value:                  0.005012
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2294      0.124     -1.850      0.064      -0.472       0.014
x1             0.1908      0.
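
Inverse probability of treatment weights are conventionally based on the probability of the treatment each person actually received: $p$ for treated rows and $1 - p$ for untreated rows. The sketch below illustrates that per-row choice on toy values. The probabilities in `p_den` and `p_num` are made up, standing in for the predictions of a denominator and a numerator model, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "treatment": [1, 0, 1, 0],
    "p_den": [0.8, 0.8, 0.4, 0.4],  # assumed P(treatment = 1 | covariates)
    "p_num": [0.5, 0.5, 0.5, 0.5],  # assumed marginal P(treatment = 1)
})

treated = toy["treatment"] == 1

# Probability of the treatment actually received under each model.
pr_den = np.where(treated, toy["p_den"], 1 - toy["p_den"])
pr_num = np.where(treated, toy["p_num"], 1 - toy["p_num"])

toy["iptw"] = 1 / pr_den                  # unstabilized
toy["iptw_stabilized"] = pr_num / pr_den  # stabilized

print(toy)
```

In this toy table the untreated row with `p_den = 0.8` receives the largest weight, because being untreated was unlikely given its covariates.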

## **6. Expand Trials**

In [10]:
import pandas as pd
import numpy as np

df["trial_period"] = 0
df["followup_time"] = 0

def expand_trials(df, chunk_size=500):
    # Process the data in chunks of `chunk_size` rows. Every row is assigned
    # to trial period 0 with follow-up time 0, so no new rows are created.
    expanded_data = []

    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i + chunk_size].copy()
        chunk["trial_period"] = 0
        chunk["followup_time"] = 0
        expanded_data.append(chunk)

    return pd.concat(expanded_data, ignore_index=True)

df_expanded = expand_trials(df, chunk_size=500)

def load_expanded_data(df, seed=1234, p_control=0.5):
    # Keep every row with outcome == 1 and keep each outcome == 0 row
    # with probability `p_control`.
    np.random.seed(seed)

    df_sampled = df.copy()

    if "outcome" in df_sampled.columns:
        # Rows flagged by the mask are the outcome == 0 rows to drop.
        mask = (df_sampled["outcome"] == 0) & (np.random.rand(len(df_sampled)) > p_control)
        df_sampled = df_sampled[~mask]

    return df_sampled

df_loaded = load_expanded_data(df_expanded, seed=1234, p_control=0.5)
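
Expanding into a sequence of trials is usually understood as starting one trial at every period in which a person is eligible and attaching that person's later rows as follow-up. The sketch below illustrates that idea on toy person-period data. `expand_sequence_of_trials` and the `assigned_treatment` column are hypothetical names, and the sketch deliberately ignores censoring and treatment switching, which a full implementation would have to handle.

```python
import pandas as pd

def expand_sequence_of_trials(long_df):
    """Hypothetical helper: one trial per eligible person-period, later rows as follow-up."""
    pieces = []
    for _, person in long_df.groupby("id"):
        person = person.sort_values("period")
        for start in person.loc[person["eligible"] == 1, "period"]:
            followup = person[person["period"] >= start].copy()
            followup["trial_period"] = start
            followup["followup_time"] = followup["period"] - start
            # Treatment at the trial's baseline period, carried through follow-up.
            followup["assigned_treatment"] = followup["treatment"].iloc[0]
            pieces.append(followup)
    return pd.concat(pieces, ignore_index=True)

toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "eligible":  [1, 1, 0, 1, 0],
    "treatment": [1, 1, 0, 0, 0],
    "outcome":   [0, 0, 1, 0, 0],
})

expanded = expand_sequence_of_trials(toy)
print(expanded.groupby("trial_period").size())
```

Person 1 is eligible at periods 0 and 1 and therefore contributes to two trials, while person 2 contributes only to the trial starting at period 0.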


In [92]:
trial_summary = df_expanded.groupby("trial_period").size().reset_index(name="count")
print(trial_summary)

print("\n" + "-"*50 + "\n")

trial_summary = df_expanded.groupby("trial_period").agg({
    "followup_time": ["mean", "std"],
    "outcome": ["mean", "sum"],
    "stabilized_weight": ["mean", "sum"]  # Use the correct weight column
}).reset_index()

print(trial_summary)

print("\n" + "-"*50 + "\n")

for period, group in df_expanded.groupby("trial_period"):
    print(f"Trial Period: {period}")
    print(group.head())  # Show first few records for each trial period
    print("\n" + "-"*50 + "\n")


   trial_period  count
0             0    725

--------------------------------------------------

  trial_period followup_time        outcome     stabilized_weight            
                        mean  std      mean sum              mean         sum
0            0           0.0  0.0  0.015172  11          1.020809  740.086598

--------------------------------------------------

Trial Period: 0
   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  p_censor  weight_censor  p_treatment  weight_treatment  \
0         0         1  0.085615     