# **Target Trial Emulation**

### **Submitted by:**
- **Ladrera**, Raiken
- **Tibon**, Hestia

## **Instructions**

Assignment 1 for Clustering: Novel methods in machine learning are developed either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

In this exercise (Assignment 1 of the Clustering topic), we will try to develop a novel method of Target Trial Emulation by integrating clustering concepts into the existing framework. Target Trial Emulation is a new methodological framework in epidemiology that tries to account for the biases in traditional designs.

These are the instructions:
- Look at this website: https://rpubs.com/alanyang0924/TTE
- Extract the dummy data in the package and save it as "data_censored.csv"
- Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
- Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
- Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
- Do this in pairs, preferably with your thesis partner.
- Push to your GitHub repository.

HINT: For those who don't have a thesis topic yet, you can actually develop a thesis topic out of this assignment.
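
One place a clustering step could slot in is before the weight models: group participants by their baseline covariates, then carry the cluster label forward as an extra covariate or stratifier. The sketch below only illustrates that idea and is not the method prescribed by the assignment. The `kmeans_labels` helper and the `baseline` frame (including its values) are hypothetical, and a plain NumPy implementation of Lloyd's k-means stands in for a library clusterer.

```python
import numpy as np
import pandas as pd

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer cluster label per row of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical baseline rows (one per id at period 0); the values are made up.
baseline = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [30, 32, 31, 60, 62, 61],
    "x2": [0.1, -0.2, 0.0, 1.5, 1.7, 1.4],
})

# Standardize so that age and x2 contribute on the same scale.
features = baseline[["age", "x2"]]
z = (features - features.mean()) / features.std()

baseline["cluster"] = kmeans_labels(z.to_numpy(), k=2)
print(baseline)
```

The resulting `cluster` column could then be merged back onto the long-format data by `id`. Whether it belongs in the weight models, the outcome model, or a stratified analysis is a design decision for TTE-v2.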

## **1. Setup**

In [1]:
import os
import pandas as pd

# Define the estimand variables
estimand_pp = "PP"  # Per-protocol
estimand_itt = "ITT"  # Intention-to-treat

# Create directories for saving outputs
trial_pp_dir = os.path.join(os.getcwd(), "trial_pp")
trial_itt_dir = os.path.join(os.getcwd(), "trial_itt")

os.makedirs(trial_pp_dir, exist_ok=True)
os.makedirs(trial_itt_dir, exist_ok=True)

print(f"Directories created:\n{trial_pp_dir}\n{trial_itt_dir}")


Directories created:
C:\Users\meizi\Documents\GitHub\TTE-v2\trial_pp
C:\Users\meizi\Documents\GitHub\TTE-v2\trial_itt


## **2. Data Preparation**

In [3]:
# Load the dataset
data_path = "data_censored.csv"  # Path to uploaded file
data_censored = pd.read_csv(data_path)

# Display first few rows
print(data_censored.head())

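# Keep only the core trial columns. Note: this drops the covariates (x1-x4, age, age_s)
# and the "censored" column, which the weight models in Section 3 rely on.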
columns_needed = ["id", "period", "treatment", "outcome", "eligible"]
trial_pp = data_censored[columns_needed].copy()
trial_itt = data_censored[columns_needed].copy()

# Display processed data
print(trial_pp.head())
print(trial_itt.head())


   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  
0         0         1  
1         0         0  
2         0         0  
3         0         0  
4         0         0  
   id  period  treatment  outcome  eligible
0   1       0          1        0         1
1   1       1          1        0         0
2   1       2          1        0         0
3   1       3          1        0         0
4   1       4          1        0         0
   id  period  treatment  outcome  eligible
0   1       0          1        0         1
1   1       1          
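
The printed head shows long-format person-period data: one row per `id` and `period`, with 0/1 indicator columns. Before fitting any models it can help to confirm that structure. The sketch below is one way to do so. The `check_person_period` helper is hypothetical, the first five rows repeat the values printed above for `id` 1, and the row for `id` 2 is made up for illustration.

```python
import pandas as pd

# First five rows as printed above (id 1), plus one made-up row for a second id.
df = pd.DataFrame({
    "id":        [1, 1, 1, 1, 1, 2],
    "period":    [0, 1, 2, 3, 4, 0],
    "treatment": [1, 1, 1, 1, 1, 0],
    "outcome":   [0, 0, 0, 0, 0, 0],
    "eligible":  [1, 0, 0, 0, 0, 1],
})

def check_person_period(df):
    """Return a dict of basic structural checks for long-format trial data."""
    binary_cols = ["treatment", "outcome", "eligible"]
    # Each (id, period) pair should appear at most once.
    no_duplicates = not df.duplicated(["id", "period"]).any()
    # Within each id, periods should increase by exactly 1.
    steps = df.sort_values(["id", "period"]).groupby("id")["period"].diff().dropna()
    consecutive = bool((steps == 1).all())
    # Indicator columns should only contain 0 or 1.
    binary_ok = bool(df[binary_cols].isin([0, 1]).all().all())
    return {
        "no_duplicates": no_duplicates,
        "consecutive_periods": consecutive,
        "binary_indicators": binary_ok,
    }

checks = check_person_period(df)
print(checks)

# Number of eligible rows per participant.
print(df.groupby("id")["eligible"].sum())
```

Running the same helper on the full `data_censored` frame would show whether any participant has gaps or repeated periods.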

## **3. Weight Models and Censoring**

In [5]:
import pandas as pd
import statsmodels.formula.api as smf
import os

# Create directory to store models if it doesn't exist.
# Note: this rebinds trial_pp_dir (set in Section 1) to a relative "switch_models" folder.
trial_pp_dir = "switch_models"
os.makedirs(trial_pp_dir, exist_ok=True)

# Function to set treatment switching weight models
def set_switch_weight_model(data, numerator_formula, denominator_formula, save_path):
    # Fit numerator model
    num_model = smf.logit(numerator_formula, data=data).fit(maxiter=1000)  # Increased max iterations
    
    # Fit denominator model
    denom_model = smf.logit(denominator_formula, data=data).fit(maxiter=1000)  # Increased max iterations
    
    # Save models
    num_model.save(os.path.join(save_path, "numerator_model.pickle"))
    denom_model.save(os.path.join(save_path, "denominator_model.pickle"))
    
    return {"numerator_model": num_model, "denominator_model": denom_model}

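# Function to set other informative censoring weight models.
# pool_models is accepted so the call mirrors the R interface, but it is not used here.
def set_censor_weight_model(data, censor_event, numerator_formula, denominator_formula, pool_models, save_path):
    data["censored"] = data[censor_event]  # Copy the censoring indicator into the "censored" column used by the formulas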

    # Fit numerator model
    num_model = smf.logit(numerator_formula, data=data).fit(maxiter=1000)  # Increased max iterations
    
    # Fit denominator model
    denom_model = smf.logit(denominator_formula, data=data).fit(maxiter=1000)  # Increased max iterations
    
    # Save models
    num_model.save(os.path.join(save_path, "censor_numerator_model.pickle"))
    denom_model.save(os.path.join(save_path, "censor_denominator_model.pickle"))
    
    return {"numerator_model": num_model, "denominator_model": denom_model}

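# Function to calculate stabilized inverse probability weights
def calculate_weights(data, models, weight_col="ipw"):
    # Predicted probability that the modelled event equals 1
    num_preds = models["numerator_model"].predict(data)
    denom_preds = models["denominator_model"].predict(data)

    # Use the probability of the value actually observed in each row:
    # p when the event is 1, and 1 - p when it is 0
    event = data[models["numerator_model"].model.endog_names]
    num_obs = num_preds.where(event == 1, 1 - num_preds)
    denom_obs = denom_preds.where(event == 1, 1 - denom_preds)

    # Avoid division by zero
    denom_obs = denom_obs.clip(lower=1e-6)

    # Compute stabilized weights
    data[weight_col] = num_obs / denom_obs
    return data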

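# Example DataFrame (replace this with your actual dataset).
# Caution: in these five rows x3 equals treatment and x1 is its complement, so the
# denominator model is perfectly separated and fails with the LinAlgError shown in the output.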
trial_pp = pd.DataFrame({
    "treatment": [0, 1, 0, 1, 0],
    "age": [25, 35, 45, 30, 40],
    "x1": [1, 0, 1, 0, 1],
    "x3": [0, 1, 0, 1, 0],
    "x2": [0, 1, 1, 0, 1],
    "censored": [0, 1, 0, 1, 0]
})

# Fit switching weight models
switch_models = set_switch_weight_model(
    data=trial_pp,
    numerator_formula="treatment ~ age",
    denominator_formula="treatment ~ age + x1 + x3",
    save_path=trial_pp_dir
)

# Fit censoring weight models
censor_models = set_censor_weight_model(
    data=trial_pp,
    censor_event="censored",
    numerator_formula="censored ~ x2",
    denominator_formula="censored ~ x2 + x1",
    pool_models="none",
    save_path=trial_pp_dir
)

# Calculate weights
trial_pp = calculate_weights(trial_pp, switch_models, weight_col="switch_ipw")
trial_pp = calculate_weights(trial_pp, censor_models, weight_col="censor_ipw")

# Display results
print(trial_pp)



Optimization terminated successfully.
         Current function value: 0.630237
         Iterations 5


  return 1/(1+np.exp(-X))


         Current function value: inf
         Iterations: 1000


  return np.sum(np.log(self.cdf(q * linpred)))


LinAlgError: Singular matrix
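
The `Singular matrix` error is consistent with a problem in the five-row example frame rather than in the weight functions themselves. In that frame `x3` equals `treatment` row for row and `x1` is its complement. Their sum `x1 + x3` is therefore constant, which makes it collinear with the intercept. The sketch below shows one way to flag this before fitting. The `separating_predictors` helper is hypothetical and is only meant for binary or categorical predictors, since a continuous column with all-distinct values would be flagged trivially.

```python
import pandas as pd

# The same toy values used in the example frame above.
toy = pd.DataFrame({
    "treatment": [0, 1, 0, 1, 0],
    "age": [25, 35, 45, 30, 40],
    "x1": [1, 0, 1, 0, 1],
    "x3": [0, 1, 0, 1, 0],
})

def separating_predictors(df, outcome, predictors):
    """List categorical predictors whose every level maps to a single outcome value."""
    flagged = []
    for col in predictors:
        # Number of distinct outcome values seen within each level of the predictor.
        outcomes_per_level = df.groupby(col)[outcome].nunique()
        if df[col].nunique() > 1 and (outcomes_per_level == 1).all():
            flagged.append(col)
    return flagged

# Predictors that perfectly determine the treatment in this frame.
print(separating_predictors(toy, "treatment", ["x1", "x3"]))

# A single unique value here means x1 + x3 is constant (collinear with the intercept).
print((toy["x1"] + toy["x3"]).unique())
```

Fitting on the full `data_censored` frame, where these covariates are not deterministic functions of treatment in the rows printed in Section 2, should avoid the problem. That remains an expectation to verify rather than a guaranteed result.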