<a href="https://colab.research.google.com/github/nmansour67/skills-introduction-to-github/blob/main/First_Causal_Inference_Model_step1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

"""
Section 2: The “RWE Foundry” Protocol



How do you build this engine? You need a Propensity Score Matching (PSM) pipeline.

The Problem with Real Data: It is biased.

If you compare patients who got "AI Treatment" vs. "No AI," you might find the AI patients did worse.

Why? Because doctors might have used the AI only on the sickest patients.

The Fix: You must mathematically "match" patients so that the treated and untreated groups look alike on the confounders you measured, which emulates a randomized trial.



Section 3: Technical Workshop, Your First Causal Inference Model



Target Audience: The Data Scientist, The MD-PhD, The Curious Skeptic.



We are going to solve the "Selection Bias" problem. We will generate a dataset where a new "AI Protocol" looks like it kills people, only because it was used on sicker patients. Then we will use Propensity Score Matching (PSM) to show that it actually saves lives.

Step 1: Generate the "Confounded" Dataset

Open Google Colab. We will simulate 1,000 patients.

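The Sicker Patients: Older, with higher severity scores (standing in for more comorbidities). They are more likely to get the AI Protocol, because doctors are worried.

The Truth: The AI Protocol reduces mortality risk by roughly 20 percentage points for a typical patient. In the simulation below, this is a -0.8 coefficient on the log-odds of death, which takes a patient of age 65 with severity 5 from about 44% risk to about 26%.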



Now let’s go back to coding, the technical part of empowering the healthcare professional. First, a short recap of how it works and the terminology. Python code is the language in which we communicate with the machine; it is how we give it orders and tell it which statistical methods, mathematical models, or data analyses we want it to run. These machines have been trained to use these methods and analysis protocols. We just need to trigger them and tell them what to do.

Consider a big coffee machine that can make eight types of coffee: ristretto, espresso, macchiato, americano, cappuccino, cafe latte, glace, and frappe. The machine does not produce the cup of coffee you want unless you order it and tell it your choice. The same is true of the artificial intelligence machines we are dealing with.

First, we start by writing the code. Since we do not have technical expertise in writing code, we have the privilege of communicating with LLMs in our own natural language, telling them exactly what we want, step by step. The LLM generates the Python code for us. All we have to do is copy the generated code and paste it into a code cell in the Google Colab environment. If no cell is open yet, add one with [+ Code] from the menu. Once you have pasted the code, click [Run] or [Run All]. The machine will carry out all the statistical, mathematical, or data analysis steps included in the Python code and generate results.

Sometimes the code is not perfect. No worries! Google Colab is augmented with “Gemini”, the Google AI LLM tool, which can help diagnose the parts that are not working, propose modifications, and ask whether you want them applied. If you choose “yes”, Gemini applies the modifications and runs the corrected code again. Most probably you will get results. If not, expect Gemini to propose further modifications and run the code again, although the first fix is often enough.

Here comes your role. It is not in the technical part, not in writing and modifying the code, but in analyzing the results of running the code. Whether you are a physician, nurse, student in training, or research scientist, you are the one with the know-how to scrutinize the data analysis report and pick up its anomalies, and that is your power. Our promise is to empower you with AI tools, so that AI does the technical part on your behalf, following your directions and plans. Then you, as the scientist with the expertise in your own field, judge the outcomes. This is your role as Customer Zero, so let us play that role.
"""


# Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Fix the random seed so the simulated cohort is reproducible
np.random.seed(42)
n_patients = 1000

# 1. Generate Covariates (Patient Features)
age = np.random.normal(65, 10, n_patients)
# Illness severity on a simplified scale (a stand-in for a score such as APACHE II)
severity_score = np.random.normal(5, 2, n_patients)

# 2. Determine who gets the "AI Protocol" (Treatment Assignment)
# Doctors give the AI tool to SICKER, OLDER patients.
# Logistic function to determine probability of treatment
prob_treatment = 1 / (1 + np.exp(-(0.1 * age + 0.5 * severity_score - 10)))
treatment = np.random.binomial(1, prob_treatment)

# 3. Simulate Outcome (Mortality)
# Base Risk + Age/Severity Risk - TREATMENT BENEFIT (The AI helps!)
# Notice the -0.8 coefficient for treatment, on the log-odds scale (It saves lives)
mortality_risk = 1 / (1 + np.exp(-(0.05 * age + 0.3 * severity_score - 0.8 * treatment - 5)))
outcome = np.random.binomial(1, mortality_risk)

# Assemble one row per patient
df = pd.DataFrame({
    'Age': age,
    'Severity': severity_score,
    'Treated_with_AI': treatment,
    'Mortality': outcome
})

print("Dataset Generated.")
print("Average Severity of Treated:", df[df['Treated_with_AI']==1]['Severity'].mean())
print("Average Severity of Untreated:", df[df['Treated_with_AI']==0]['Severity'].mean())
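The two logistic formulas above are easier to read once they are turned into probabilities. The following is a minimal sketch, not part of the original workshop code: it plugs two hypothetical patients (the ages and severities are illustrative assumptions, not rows from the dataset) into the same coefficients to show how likely each is to receive the AI Protocol, and what the -0.8 treatment coefficient does to their risk of death. The helper `sigmoid` and the names `pt_age`, `pt_severity`, and `summary` are made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: converts log-odds into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Two hypothetical patients (illustrative values, not taken from the dataset).
patients = {
    "typical (age 65, severity 5)": (65, 5),
    "older and sicker (age 75, severity 7)": (75, 7),
}

summary = {}
for label, (pt_age, pt_severity) in patients.items():
    # Same coefficients as the treatment-assignment model in the simulation.
    p_gets_ai = sigmoid(0.1 * pt_age + 0.5 * pt_severity - 10)
    # Same coefficients as the outcome model, without and with the AI Protocol.
    base_logit = 0.05 * pt_age + 0.3 * pt_severity - 5
    risk_without_ai = sigmoid(base_logit)
    risk_with_ai = sigmoid(base_logit - 0.8)
    summary[label] = (p_gets_ai, risk_without_ai, risk_with_ai)

    print(label)
    print(f"  Chance of receiving the AI Protocol: {p_gets_ai:.1%}")
    print(f"  Mortality risk without AI: {risk_without_ai:.1%}")
    print(f"  Mortality risk with AI:    {risk_with_ai:.1%}")

# A -0.8 coefficient on the log-odds scale corresponds to this odds ratio.
odds_ratio = np.exp(-0.8)
print(f"Odds ratio for the AI Protocol: {odds_ratio:.2f}")
```

Reading the two patients side by side shows the trap in miniature: the sicker patient is both more likely to be treated and more likely to die, even though the treatment lowers risk for each patient individually.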
"""
Step 2: The "Naive" Analysis (The Trap)

If you just compare mortality rates, what do you see?
"""

# Python
# Naive comparison
mortality_treated = df[df['Treated_with_AI']==1]['Mortality'].mean()
mortality_untreated = df[df['Treated_with_AI']==0]['Mortality'].mean()

print(f"Mortality Rate (Treated): {mortality_treated*100:.1f}%")
print(f"Mortality Rate (Untreated): {mortality_untreated*100:.1f}%")

Dataset Generated.
Average Severity of Treated: 6.121680873315007
Average Severity of Untreated: 4.6951766537856106
Mortality Rate (Treated): 42.2%
Mortality Rate (Untreated): 38.6%
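
The naive comparison makes the AI Protocol look harmful, because the treated group was sicker to begin with. As a rough sketch of the matching idea described in Section 2, and not necessarily the exact pipeline this workshop goes on to build, the block below estimates a propensity score with `LogisticRegression` and pairs each treated patient with the untreated patient whose score is closest, using `NearestNeighbors`. It assumes 1:1 matching with replacement and no caliper, which are simplifying choices made for illustration. The names `scored`, `ps_model`, and `matched_controls` are hypothetical, and the cohort is regenerated with the same seed only so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Recreate the simulated cohort so this block is self-contained.
np.random.seed(42)
n_patients = 1000
age = np.random.normal(65, 10, n_patients)
severity_score = np.random.normal(5, 2, n_patients)
prob_treatment = 1 / (1 + np.exp(-(0.1 * age + 0.5 * severity_score - 10)))
treatment = np.random.binomial(1, prob_treatment)
mortality_risk = 1 / (1 + np.exp(-(0.05 * age + 0.3 * severity_score - 0.8 * treatment - 5)))
outcome = np.random.binomial(1, mortality_risk)
df = pd.DataFrame({'Age': age, 'Severity': severity_score,
                   'Treated_with_AI': treatment, 'Mortality': outcome})

# 1. Propensity score: estimated probability of treatment given age and severity.
covariates = df[['Age', 'Severity']]
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(covariates, df['Treated_with_AI'])
scored = df.assign(Propensity=ps_model.predict_proba(covariates)[:, 1])

# 2. For each treated patient, find the untreated patient with the closest score
#    (assumption: 1:1 nearest-neighbor matching, with replacement, no caliper).
treated = scored[scored['Treated_with_AI'] == 1]
untreated = scored[scored['Treated_with_AI'] == 0]
nn = NearestNeighbors(n_neighbors=1)
nn.fit(untreated[['Propensity']])
_, match_idx = nn.kneighbors(treated[['Propensity']])
matched_controls = untreated.iloc[match_idx.ravel()]

# 3. Compare mortality before and after matching.
naive_diff = treated['Mortality'].mean() - untreated['Mortality'].mean()
matched_diff = treated['Mortality'].mean() - matched_controls['Mortality'].mean()

print(f"Naive difference (treated - untreated): {naive_diff * 100:+.1f} points")
print(f"Matched difference (treated - matched controls): {matched_diff * 100:+.1f} points")
print("Mean severity, treated vs matched controls:",
      round(treated['Severity'].mean(), 2), "vs",
      round(matched_controls['Severity'].mean(), 2))
```

If the matching works, the mean severity of the matched controls should sit much closer to that of the treated group than the unmatched averages printed earlier, and the matched difference is the number to interpret. Matching only adjusts for the covariates in the model, so it cannot correct for confounders that were never measured.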
