### What this demonstrates

This notebook demonstrates a `Smart Diagnostic System` rather than a single "model":

1. It intelligently profiles the patient first.

2. It selects the mathematically superior tool for that specific profile.

3. It outputs a precision score.
   
This end-to-end inference demo concludes our research project.

NOTE: This is for demo purposes only. A real evaluation should feed the system held-out test data that the models have never seen.

### 1. SYSTEM SETUP

In [1]:
import pandas as pd
import numpy as np
import joblib
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Path Setup
dataset_dir = Path("../dataset/modified")
path_train_clustered = dataset_dir / "train_with_clusters.csv"

# Model Paths
path_strategy_a = Path("../models/strategyA")  # Specialists
path_strategy_b = Path("../models/strategyB")  # Global Meta
path_strategy_c = Path("../models/strategyC")  # Oversampled

print("Inference System Initialized.")

Inference System Initialized.


### 2. LOAD & RECONSTRUCT THE ROUTER (K-MEANS)

In [2]:
# To route new patients we need the K-Means model from the clustering step.
# If it was not saved in step C, it could be rebuilt with the same random_state;
# for this demo we use a simpler nearest-centroid router instead.
# NOTE: In a real deployment, you would load 'kmeans.pkl' and 'scaler.pkl'.

print("Loading data to reconstruct the Router...")
df = pd.read_csv(path_train_clustered)

# Filter out the outlier (Cluster 4) to match our training environment
df = df[df['cluster'] != 4].copy()

# SIMPLEST ROUTER APPROACH:
# We calculate the "Average Profile" (Centroid) of each cluster from the labeled data.
# When a new patient arrives, we check which Centroid they are closest to.
# The features selected below should match those used in the c_clustering notebook.

numeric_cols_for_routing = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Remove target and leakage
numeric_cols_for_routing = [c for c in numeric_cols_for_routing if c not in ['composite_score', 'cluster', 'Year', 'PredictionYear']]

print(f"Routing based on {len(numeric_cols_for_routing)} numeric features.")

# Calculate Centroids
cluster_centroids = df.groupby('cluster')[numeric_cols_for_routing].mean()
display(cluster_centroids)

print("Router Logic Built: We will match patients to these average profiles.")

Loading data to reconstruct the Router...
Routing based on 122 numeric features.


cluster,Marriages_03,Migration_03,ADL_Dress_03,ADL_Walk_03,ADL_Bath_03,ADL_Eat_03,ADL_Bed_03,ADL_Toilet_03,Num_ADL_03,IADL_Money_03,...,SpouseEarnings_12,hincome_12,hinc_business_12,hinc_rent_12,hinc_assets_12,hinc_cap_12,Pension_12,SpousePension_12,AttendReligiousServices_12,SpeaksEnglish_12
0,1.023006,0.069018,0.115031,0.050613,0.032209,0.01227,0.081288,0.046012,0.222393,0.015337,...,8496.932515,40552.147239,10184.04908,46.01227,343.558282,10552.147239,8496.932515,6595.092025,0.430982,0.009202
1,1.087719,0.0,0.035088,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,...,3508.77193,152631.578947,74035.087719,30877.192982,719.298246,105438.596491,12982.45614,6666.666667,0.333333,0.035088
2,2.098266,0.121387,0.008671,0.00578,0.00289,0.00578,0.00578,0.008671,0.028902,0.00289,...,7630.057803,77803.468208,22167.630058,-895.953757,1627.16763,22890.17341,16242.774566,7514.450867,0.317919,0.031792
3,0.978937,0.05015,0.023069,0.009027,0.002006,0.002006,0.015045,0.002006,0.03009,0.002006,...,5155.466399,62647.943831,16298.89669,20.060181,1293.881645,17642.928786,9869.608826,16389.167503,0.399198,0.012036
5,0.980861,0.17823,0.011962,0.007177,0.0,0.0,0.008373,0.005981,0.021531,0.0,...,7248.803828,80083.732057,15155.502392,358.851675,241.626794,15741.626794,26973.684211,3911.483254,0.328947,0.059809


Router Logic Built: We will match patients to these average profiles.
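
The note in the cell above says a real deployment would load 'kmeans.pkl' and 'scaler.pkl' rather than rebuild the router. A minimal sketch of that save-and-reload round trip follows. It runs on synthetic data, and the cluster count, `random_state`, feature count, and temporary directory are all illustrative assumptions, not the settings used in the c_clustering notebook.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clustering feature matrix (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))

# Fit the scaler first, then K-Means on the scaled features.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaler.transform(X_train))

# Persist both artifacts so inference never has to refit them.
artifact_dir = Path(tempfile.mkdtemp())  # hypothetical location
joblib.dump(scaler, artifact_dir / "scaler.pkl")
joblib.dump(kmeans, artifact_dir / "kmeans.pkl")

# At inference time: load, scale with the SAME scaler, then assign a cluster.
loaded_scaler = joblib.load(artifact_dir / "scaler.pkl")
loaded_kmeans = joblib.load(artifact_dir / "kmeans.pkl")

new_rows = rng.normal(size=(5, 4))
routed = loaded_kmeans.predict(loaded_scaler.transform(new_rows))
print(routed)
```

Saving the scaler alongside the clusterer matters because cluster assignments are only meaningful in the feature space the clusterer was fitted on.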


### 3. LOAD THE SPECIALIST MODELS

In [3]:
print("Loading Hybrid Model System...")

models = {}

# Load Strategy A (Specialists) for Clusters 2, 3, 5
for c_id in [2, 3, 5]:
    path = path_strategy_a / f"specialist_model_cluster_{c_id}.pkl"
    models[c_id] = joblib.load(path)
    print(f" - Loaded Specialist Model A for Cluster {c_id}")

# Load Strategy B (Global) for Cluster 1 (Wealthy)
models[1] = joblib.load(path_strategy_b / "global_meta_feature_model.pkl")
print(" - Loaded Global Model B for Cluster 1 (The Wealthy)")

# Load Strategy C (Oversampled) for Cluster 0 (Frail)
models[0] = joblib.load(path_strategy_c / "strategy_c_oversampled_model.pkl")
print(" - Loaded Oversampled Model C for Cluster 0 (The Frail)")

print("\nAll Systems Online.")

Loading Hybrid Model System...
 - Loaded Specialist Model A for Cluster 2
 - Loaded Specialist Model A for Cluster 3
 - Loaded Specialist Model A for Cluster 5
 - Loaded Global Model B for Cluster 1 (The Wealthy)
 - Loaded Oversampled Model C for Cluster 0 (The Frail)

All Systems Online.
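
The cluster-to-strategy assignment used in this notebook can also be written as a single lookup table, which keeps the loading code and the routing logic in agreement. The sketch below is one possible layout, not the notebook's own structure: `MODEL_ROUTES` and `resolve_route` are hypothetical names, while the file names and directories are the ones used in the loading cell. The third field records that Strategy B expects a `cluster` column as an input feature, as the inference function in Section 4 assumes.

```python
from pathlib import Path

MODELS_DIR = Path("../models")

# cluster id -> (strategy label, model file, needs 'cluster' column as a feature)
MODEL_ROUTES = {
    0: ("C", MODELS_DIR / "strategyC" / "strategy_c_oversampled_model.pkl", False),
    1: ("B", MODELS_DIR / "strategyB" / "global_meta_feature_model.pkl", True),
    2: ("A", MODELS_DIR / "strategyA" / "specialist_model_cluster_2.pkl", False),
    3: ("A", MODELS_DIR / "strategyA" / "specialist_model_cluster_3.pkl", False),
    5: ("A", MODELS_DIR / "strategyA" / "specialist_model_cluster_5.pkl", False),
}


def resolve_route(cluster_id):
    """Return (strategy, model_path, needs_cluster_feature) for a cluster id."""
    try:
        return MODEL_ROUTES[int(cluster_id)]
    except KeyError:
        # Cluster 4 was filtered out as an outlier, so it has no model.
        raise ValueError(f"No model registered for cluster {cluster_id}") from None


strategy, model_path, needs_cluster = resolve_route(1)
print(strategy, model_path.name, needs_cluster)
```

A table like this fails loudly for an unregistered cluster instead of silently falling through to a default branch.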


### 4. THE INFERENCE ENGINE (Function)

In [4]:
def predict_patient_health(new_patient_data):
    """
    input: dict or series containing patient data
    output: predicted composite_score
    """
    # 1. Convert to DataFrame
    patient_df = pd.DataFrame([new_patient_data])
    
    # 2. ROUTING (Find the Cluster)
    # We calculate the distance to the centroids built in Step 2.
    # Note: distances here use raw, unscaled values, so large-magnitude columns
    # (e.g. income) dominate. In production, scale the input first.
    
    distances = {}
    for c_id, centroid in cluster_centroids.iterrows():
        # Euclidean distance on numeric columns
        dist = np.linalg.norm(patient_df[numeric_cols_for_routing].values - centroid.values)
        distances[c_id] = dist
        
    # Pick the closest cluster
    assigned_cluster = min(distances, key=distances.get)
    
    print(f"\n--- Processing New Patient ---")
    print(f"Router: Patient identified as Cluster {assigned_cluster}")
    
    # 3. HYBRID LOGIC SWITCH
    if assigned_cluster == 0:
        print("Logic: Applying Strategy C (Optimized for Frailty)...")
        model = models[0]
        # Strategy C pipeline expects raw features (minus cluster)
        prediction = model.predict(patient_df)[0]
        
    elif assigned_cluster == 1:
        print("Logic: Applying Strategy B (Optimized for Small/Wealthy Groups)...")
        model = models[1]
        # Strategy B expects 'cluster' column to be present as a feature
        patient_df['cluster'] = str(assigned_cluster) 
        prediction = model.predict(patient_df)[0]
        
    else: # Clusters 2, 3, 5
        print(f"Logic: Applying Strategy A (Specialist for Cluster {assigned_cluster})...")
        model = models[assigned_cluster]
        # Strategy A pipeline expects raw features
        prediction = model.predict(patient_df)[0]
        
    print(f"✅ PREDICTED COMPOSITE SCORE: {prediction:.2f}")
    return prediction
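
The routing step in this function measures Euclidean distance on raw values, and the comment in the function notes that production input would be scaled first. The sketch below shows one way to do that with z-scores, assuming the training means and standard deviations are available. `build_scaled_router` and `route` are hypothetical helpers, and the toy frame is invented to show how an income-sized column can outweigh a small-range column such as an ADL count.

```python
import numpy as np
import pandas as pd


def build_scaled_router(train_df, feature_cols, label_col="cluster"):
    """Compute z-score parameters and per-cluster centroids in scaled space."""
    means = train_df[feature_cols].mean()
    stds = train_df[feature_cols].std(ddof=0).replace(0, 1.0)  # avoid divide-by-zero
    scaled = (train_df[feature_cols] - means) / stds
    centroids = scaled.groupby(train_df[label_col]).mean()
    return means, stds, centroids


def route(patient, means, stds, centroids):
    """Assign a patient (dict of raw features) to the nearest scaled centroid."""
    cols = centroids.columns
    x = (pd.Series(patient).loc[cols].astype(float) - means.loc[cols]) / stds.loc[cols]
    distances = np.linalg.norm(centroids.to_numpy() - x.to_numpy(), axis=1)
    return centroids.index[int(np.argmin(distances))]


# Toy data (invented): income differs little between clusters, ADL count differs a lot.
toy = pd.DataFrame({
    "hincome": [40000.0, 42000.0, 38000.0, 41000.0, 39000.0, 43000.0],
    "num_adl": [3.0, 3.0, 2.0, 0.0, 0.0, 1.0],
    "cluster": [0, 0, 0, 5, 5, 5],
})
features = ["hincome", "num_adl"]
patient = {"hincome": 41000.0, "num_adl": 3.0}

# Raw-distance routing, as in the demo function.
raw_centroids = toy.groupby("cluster")[features].mean()
raw_dist = np.linalg.norm(raw_centroids.to_numpy() - np.array([41000.0, 3.0]), axis=1)
raw_choice = raw_centroids.index[int(np.argmin(raw_dist))]

# Scaled routing.
means, stds, centroids = build_scaled_router(toy, features)
scaled_choice = route(patient, means, stds, centroids)

print("raw:", raw_choice, "scaled:", scaled_choice)
```

On this toy data the raw router is decided almost entirely by income, while the scaled router lets the ADL count contribute on equal footing.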

### 5. DEMO: SIMULATING NEW PATIENTS

In [5]:
# Let's pick 3 real examples from our dataset to simulate "New Patients"
# We pick one from Cluster 0, one from 1, and one from 5.

sample_patients = []

# Get a sample from Cluster 0 (The Frail)
p0 = df[df['cluster'] == 0].iloc[0].drop(['composite_score', 'cluster', 'Year', 'PredictionYear']).to_dict()
sample_patients.append(p0)

# Get a sample from Cluster 1 (The Wealthy)
p1 = df[df['cluster'] == 1].iloc[0].drop(['composite_score', 'cluster', 'Year', 'PredictionYear']).to_dict()
sample_patients.append(p1)

# Get a sample from Cluster 5 (The Pros)
p5 = df[df['cluster'] == 5].iloc[0].drop(['composite_score', 'cluster', 'Year', 'PredictionYear']).to_dict()
sample_patients.append(p5)

# --- RUN THE SYSTEM ---
for i, patient in enumerate(sample_patients):
    print(f"\n[Test Case {i+1}]")
    pred = predict_patient_health(patient)


[Test Case 1]

--- Processing New Patient ---
Router: Patient identified as Cluster 0
Logic: Applying Strategy C (Optimized for Frailty)...
✅ PREDICTED COMPOSITE SCORE: 131.36

[Test Case 2]

--- Processing New Patient ---
Router: Patient identified as Cluster 1
Logic: Applying Strategy B (Optimized for Small/Wealthy Groups)...
✅ PREDICTED COMPOSITE SCORE: 216.41

[Test Case 3]

--- Processing New Patient ---
Router: Patient identified as Cluster 0
Logic: Applying Strategy C (Optimized for Frailty)...
✅ PREDICTED COMPOSITE SCORE: 109.81
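
Note that Test Case 3 was sampled from Cluster 5, yet the router assigned it to Cluster 0. One way to quantify how often the centroid router disagrees with the stored labels is to route every labeled row and compare. The following is a rough sketch on invented toy data; `routing_agreement` is a hypothetical helper, and on the real data it would be called with `df` and `numeric_cols_for_routing`.

```python
import numpy as np
import pandas as pd


def routing_agreement(train_df, feature_cols, label_col="cluster"):
    """Route each row to its nearest raw centroid and compare with its stored label."""
    centroids = train_df.groupby(label_col)[feature_cols].mean()
    X = train_df[feature_cols].to_numpy(dtype=float)
    # Distance matrix of shape (n_rows, n_clusters) via broadcasting.
    dists = np.linalg.norm(X[:, None, :] - centroids.to_numpy()[None, :, :], axis=2)
    routed = centroids.index.to_numpy()[dists.argmin(axis=1)]
    stored = train_df[label_col].to_numpy()
    agreement = float((routed == stored).mean())
    confusion = pd.crosstab(stored, routed, rownames=["stored"], colnames=["routed"])
    return agreement, confusion


# Toy data (invented) with one large-scale and one small-scale feature.
toy = pd.DataFrame({
    "hincome": [40000.0, 42000.0, 38000.0, 41000.0, 39000.0, 43000.0],
    "num_adl": [3.0, 3.0, 2.0, 0.0, 0.0, 1.0],
    "cluster": [0, 0, 0, 5, 5, 5],
})

agreement, confusion = routing_agreement(toy, ["hincome", "num_adl"])
print(f"Agreement with stored labels: {agreement:.2f}")
print(confusion)
```

A low agreement rate would suggest the router needs the original scaler and K-Means model before its assignments can be trusted.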


### 6. DEMO: SIMULATING "TRUE" NEW PATIENTS (Perturbed Data)

In [6]:
# Instead of using exact training rows (which the models have already seen),
# we take a profile and tweak a few values. This simulates a "New Person" who
# looks similar to, but is not identical to, a row in the training data.

print("--- Simulating NEW Patients (Data never seen by model) ---")

# --- CASE 1: A Frail Patient (Cluster 0 Profile) ---
# We take a base profile and lower their health stats to test Strategy C
new_patient_frail = df[df['cluster'] == 0].iloc[5].to_dict() # Fixed base row (position 5)
# Tweak values to make them unique
new_patient_frail['Num_Illnesses_03'] += 1  # Sicker
new_patient_frail['Num_ADL_03'] = 2.0       # Higher impairment
new_patient_frail['hincome_03'] += 500      # Slightly different income
# Clean up dict
for key in ['composite_score', 'cluster', 'Year', 'PredictionYear']:
    new_patient_frail.pop(key, None)

print("\n[Test Case 1: The Frail Profile]")
predict_patient_health(new_patient_frail)


# --- CASE 2: A Wealthy Patient (Cluster 1 Profile) ---
# We test Strategy B
new_patient_wealthy = df[df['cluster'] == 1].iloc[5].to_dict()
# Tweak values
new_patient_wealthy['hincome_03'] = 850000.0 # Huge income change
new_patient_wealthy['hinc_cap_03'] += 20000
# Clean up dict
for key in ['composite_score', 'cluster', 'Year', 'PredictionYear']:
    new_patient_wealthy.pop(key, None)

print("\n[Test Case 2: The Wealthy Profile]")
predict_patient_health(new_patient_wealthy)


# --- CASE 3: An Average Worker (Cluster 5 Profile) ---
# We test Strategy A
new_patient_pro = df[df['cluster'] == 5].iloc[10].to_dict()
# Tweak values
new_patient_pro['JobHrsWeekly_03'] = 50.0   # Works more
new_patient_pro['Age_03'] = '60-69'         # Change age group (assumes Age_03 is categorical)
# Clean up dict
for key in ['composite_score', 'cluster', 'Year', 'PredictionYear']:
    new_patient_pro.pop(key, None)

print("\n[Test Case 3: The Professional Profile]")
predict_patient_health(new_patient_pro)

--- Simulating NEW Patients (Data never seen by model) ---

[Test Case 1: The Frail Profile]

--- Processing New Patient ---
Router: Patient identified as Cluster 0
Logic: Applying Strategy C (Optimized for Frailty)...
✅ PREDICTED COMPOSITE SCORE: 88.14

[Test Case 2: The Wealthy Profile]

--- Processing New Patient ---
Router: Patient identified as Cluster 1
Logic: Applying Strategy B (Optimized for Small/Wealthy Groups)...
✅ PREDICTED COMPOSITE SCORE: 133.49

[Test Case 3: The Professional Profile]

--- Processing New Patient ---
Router: Patient identified as Cluster 5
Logic: Applying Strategy A (Specialist for Cluster 5)...
✅ PREDICTED COMPOSITE SCORE: 113.36


113.365
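
The perturbed profiles above are still derived from training rows, so they illustrate the routing flow rather than measure accuracy. As the note at the top says, a real evaluation needs data the models have never seen. A minimal sketch of that workflow follows, assuming a held-out split is made before any model is trained. `holdout_split` and `mae_by_cluster` are hypothetical helpers, and the numbers in the usage example are invented.

```python
import numpy as np
import pandas as pd


def holdout_split(df, test_fraction=0.2, seed=42):
    """Shuffle row positions and split into disjoint train and test frames."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    n_test = int(round(len(df) * test_fraction))
    return df.iloc[shuffled[n_test:]], df.iloc[shuffled[:n_test]]


def mae_by_cluster(y_true, y_pred, clusters):
    """Mean absolute error of predictions, reported per routed cluster."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    report = pd.DataFrame({"cluster": list(clusters), "abs_err": abs_err})
    return report.groupby("cluster")["abs_err"].mean()


# Invented example: true scores, predicted scores, and the cluster each row was routed to.
y_true = [100.0, 120.0, 200.0, 210.0]
y_pred = [110.0, 110.0, 190.0, 230.0]
clusters = [0, 0, 1, 1]
per_cluster_mae = mae_by_cluster(y_true, y_pred, clusters)
print(per_cluster_mae)

# The split itself, on a placeholder frame of 10 rows.
frame = pd.DataFrame({"row_id": range(10)})
train_part, test_part = holdout_split(frame)
print(len(train_part), len(test_part))
```

Reporting the error per cluster shows whether each strategy actually helps the group it was chosen for.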