Third-order features engineered from the original dataset that strengthened the model's predictive power.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the original data
url = 'https://raw.githubusercontent.com/nandarishik/Ferry-Internship/main/realistic_medication_adherence_data.csv'
df = pd.read_csv(url)

print("Data loaded.")

Data loaded.


In [2]:
# Clean missing values
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            # Categorical column: fill with the most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            # Numeric column: fill with the median
            df[col] = df[col].fillna(df[col].median())

print("Missing values handled.")

Missing values handled.


In [3]:
# --- 1. "Patient Readiness" Composite Score ---
# Select and scale the key numeric features
readiness_features = df[['health_literacy_score', 'social_support_index', 'belief_in_medication']]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(readiness_features)

# Create the composite score
df['patient_readiness_score'] = (
    scaled_features[:, 0] +
    scaled_features[:, 1] +
    scaled_features[:, 2] +
    df['provider_consistency'].astype(int)
)

# --- 2. "Literacy & Income" Interaction Feature ---
# Map income_bracket to a number
income_numeric_map = {'Low': 1, 'Medium': 2, 'High': 3}
df['income_numeric'] = df['income_bracket'].map(income_numeric_map)

# Create the interaction feature
df['literacy_x_income'] = df['health_literacy_score'] * df['income_numeric']

print("3rd order features created.")

3rd order features created.


In [4]:
# Create the target variable y
y = df['medication_adherence']

# Create the feature set X, dropping the original and helper columns
X_final = df.drop([
    'medication_adherence',
    'health_literacy_score',
    'social_support_index',
    'belief_in_medication',
    'provider_consistency',
    'income_bracket',
    'income_numeric'
], axis=1)

# One-hot encode any remaining categorical columns
X_final = pd.get_dummies(X_final, drop_first=True)

print("Final feature set X prepared.")
print("Final features shape:", X_final.shape)

Final feature set X prepared.
Final features shape: (500, 24)


In [5]:
# Split the data
X_train_final, X_test_final, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=42
)

# Use the best model parameters we found from hyperparameter tuning
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42
)

# Train the model
model.fit(X_train_final, y_train)

# Make predictions and evaluate
y_pred_final = model.predict(X_test_final)
accuracy_final = accuracy_score(y_test, y_pred_final)

print(f"\nFinal Model Accuracy with Targeted Features: {accuracy_final:.2f}\n")
print("Final Classification Report:")
print(classification_report(y_test, y_pred_final))


Final Model Accuracy with Targeted Features: 0.72

Final Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.61      0.67        46
           1       0.71      0.81      0.76        54

    accuracy                           0.72       100
   macro avg       0.72      0.71      0.71       100
weighted avg       0.72      0.72      0.72       100



## **Understanding**

## **The Challenge: Hitting the Performance Plateau**

My journey started with a major success: I fixed a broken, leaky dataset and built a solid, reliable Random Forest model. This model gave me a realistic accuracy of **69%** and, more importantly, a crucial insight: medication adherence was driven by human factors like **health literacy**, **social support**, and **trust in the provider**.

However, I hit a wall. Both advanced algorithms (like XGBoost) and extensive tuning (`GridSearchCV`) failed to improve that 69% score. My model was good, but it couldn't get any better. This told me that the limitation wasn't the model's engine but the fuel: the features themselves.

---

## **The Breakthrough: A New Hypothesis**

Instead of adding more general features (which had previously failed), I formed a new, targeted hypothesis: **What if I could create better features by combining and refining the signals from the predictors I already knew were the most powerful?**

This led me to create two new, highly effective features.

### **Feature 1: The `patient_readiness_score`**

* **The Story:** I knew that a patient's understanding (`health_literacy_score`), their personal conviction (`belief_in_medication`), their support system (`social_support_index`), and their trust in their doctor (`provider_consistency`) were the four most important factors. I theorized that these individual traits could be combined into a single, powerful metric representing a patient's overall preparedness to succeed with their treatment.
* **How I Did It:** I created a composite "readiness" score. To do this fairly, I first scaled the numeric features so they were on the same level, then added them together with the score for `provider_consistency`. This new feature provided the model with a concentrated, high-level summary of a patient's psychosocial state, making the key patterns much easier to detect.
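
As a rough illustration of the scale-then-sum step, here is a minimal sketch on made-up values (the four toy patients are hypothetical, not rows from the dataset). Because each standardized column has mean zero, the composite's average simply equals the share of patients with a consistent provider.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy patients; values are illustrative only
toy = pd.DataFrame({
    'health_literacy_score': [0.2, 0.5, 0.8, 0.9],
    'social_support_index': [1.0, 2.5, 3.0, 4.5],
    'belief_in_medication': [0.3, 0.6, 0.5, 0.9],
    'provider_consistency': [False, True, False, True],
})

numeric_cols = ['health_literacy_score', 'social_support_index', 'belief_in_medication']

# Standardize each numeric column to mean 0 and unit variance
z = StandardScaler().fit_transform(toy[numeric_cols])

# Sum the z-scores and add 1 when the provider is consistent
toy['patient_readiness_score'] = z.sum(axis=1) + toy['provider_consistency'].astype(int)

print(toy['patient_readiness_score'].round(2))
```

In this toy frame the last patient, who is highest on every input, gets the largest score, and the first patient gets the smallest.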

### **Feature 2: The `literacy_x_income` Interaction**

* **The Story:** I hypothesized that the challenges of low health literacy are often amplified by having a low income. A patient might struggle to understand complex instructions, but that struggle becomes a critical barrier if they also can't afford transportation for follow-up questions or co-pays for simpler medication alternatives.
* **How I Did It:** I created an interaction feature that multiplied a patient's `health_literacy_score` by a numeric encoding of their `income_bracket` (Low = 1, Medium = 2, High = 3). This didn't just tell the model whether a patient had low literacy *or* low income; the product is smallest when a patient has **both**, which makes the compounded risk explicit. This allowed the model to learn a more nuanced, real-world pattern.
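
To make the interaction concrete, the following sketch computes the product for four hypothetical patients at the corners of the literacy/income grid. The values are invented for illustration; only the mapping and the multiplication mirror the notebook.

```python
import pandas as pd

income_numeric_map = {'Low': 1, 'Medium': 2, 'High': 3}

# Hypothetical patients covering the four literacy/income corners
toy = pd.DataFrame({
    'health_literacy_score': [0.3, 0.3, 0.9, 0.9],
    'income_bracket': ['Low', 'High', 'Low', 'High'],
})

toy['income_numeric'] = toy['income_bracket'].map(income_numeric_map)
toy['literacy_x_income'] = toy['health_literacy_score'] * toy['income_numeric']

print(toy)
```

The low-literacy, low-income patient gets the smallest product and the high/high patient the largest. The two mixed patients land on approximately the same value here, so the product alone cannot distinguish them.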

---

## **Inference**

When I trained my model with this new, refined set of features, the results were definitive.

* **The Performance Jump:** The model's accuracy **broke through the plateau, rising from 69% to 72%** (72 of the 100 held-out test patients classified correctly).
* **A More Balanced Model:** Crucially, the model became better balanced. It maintained its excellent ability to identify adherent patients (81% recall) while significantly improving its ability to correctly identify at-risk, non-adherent patients (61% recall).
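
For readers who want to see where the recall figures come from, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in the notebook; they are inferred here as the only whole numbers consistent with the rounded classification report (supports of 46 and 54), so treat them as a reconstruction.

```python
# Counts implied by the rounded report (an inference, not printed output)
tn, fp = 28, 18   # true class 0: correctly flagged / missed
fn, tp = 10, 44   # true class 1: missed / correctly identified

# Recall = correctly identified members of a class / all members of that class
recall_non_adherent = tn / (tn + fp)
recall_adherent = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"{recall_non_adherent:.2f} {recall_adherent:.2f} {accuracy:.2f}")
# prints: 0.61 0.81 0.72
```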

The final story is clear: my initial model was right about *what* mattered. But by intelligently engineering features that captured the *relationships between* those key factors, I built a more nuanced and powerful model. This journey proves that the most successful data science projects often rely not just on powerful algorithms, but on a deep understanding of the problem to guide the creation of truly meaningful features.


## **Testing**

In [6]:
# Define the new patient data as a list of dictionaries
new_patients_data = [
    {
        # Case 1: High-Risk Patient (Saanvi)
        'age': 68, 'gender': 'Female', 'education_level': 'Secondary', 'income_bracket': 'Low',
        'location_type': 'Rural', 'hemoglobin_level': 10.2, 'iron_deficiency_status': True,
        'comorbidities_count': 3, 'lab_test_frequency': 1, 'side_effects_reported': True,
        'medication_type': 'Iron Tablets', 'dosage_frequency': 'Daily', 'prescription_duration_days': 60,
        'tablets_dispensed': 60, 'pill_count_last_visit': 10, 'refill_gap_days': 15,
        'health_literacy_score': 0.3, 'depression_score': 2.5, 'social_support_index': 1.5,
        'belief_in_medication': 0.4, 'distance_to_clinic_km': 25, 'insurance_status': False,
        'medication_cost_inr': 350, 'provider_consistency': False,
        'name': 'Saanvi (High Risk)'
    },
    {
        # Case 2: Low-Risk Patient (Rohan)
        'age': 35, 'gender': 'Male', 'education_level': 'Postgraduate', 'income_bracket': 'High',
        'location_type': 'Urban', 'hemoglobin_level': 13.5, 'iron_deficiency_status': False,
        'comorbidities_count': 0, 'lab_test_frequency': 4, 'side_effects_reported': False,
        'medication_type': 'Oral Supplements', 'dosage_frequency': 'Daily', 'prescription_duration_days': 30,
        'tablets_dispensed': 30, 'pill_count_last_visit': 2, 'refill_gap_days': 1,
        'health_literacy_score': 0.95, 'depression_score': 0.5, 'social_support_index': 4.5,
        'belief_in_medication': 0.9, 'distance_to_clinic_km': 5, 'insurance_status': True,
        'medication_cost_inr': 200, 'provider_consistency': True,
        'name': 'Rohan (Low Risk)'
    },
    {
        # Case 3: Ambiguous Patient (Priya)
        'age': 45, 'gender': 'Female', 'education_level': 'Graduate', 'income_bracket': 'Medium',
        'location_type': 'Urban', 'hemoglobin_level': 11.0, 'iron_deficiency_status': True,
        'comorbidities_count': 1, 'lab_test_frequency': 2, 'side_effects_reported': True,
        'medication_type': 'Injections', 'dosage_frequency': 'Weekly', 'prescription_duration_days': 90,
        'tablets_dispensed': 12, 'pill_count_last_visit': 5, 'refill_gap_days': 8,
        'health_literacy_score': 0.8, 'depression_score': 1.8, 'social_support_index': 2.5,
        'belief_in_medication': 0.6, 'distance_to_clinic_km': 12, 'insurance_status': True,
        'medication_cost_inr': 450, 'provider_consistency': True,
        'name': 'Priya (Ambiguous)'
    }
]

# Create a DataFrame
test_cases_df = pd.DataFrame(new_patients_data)

In [7]:
# Create a copy to avoid changing the original test cases
processed_cases = test_cases_df.copy()

# --- Feature Engineering ---
# 1. "Patient Readiness" Score (using the SAME scaler fitted on the training data)
readiness_features = processed_cases[['health_literacy_score', 'social_support_index', 'belief_in_medication']]
scaled_features = scaler.transform(readiness_features) # Use .transform(), NOT .fit_transform()
processed_cases['patient_readiness_score'] = (
    scaled_features[:, 0] + scaled_features[:, 1] + scaled_features[:, 2] +
    processed_cases['provider_consistency'].astype(int)
)

# 2. "Literacy & Income" Interaction
income_numeric_map = {'Low': 1, 'Medium': 2, 'High': 3}
processed_cases['income_numeric'] = processed_cases['income_bracket'].map(income_numeric_map)
processed_cases['literacy_x_income'] = processed_cases['health_literacy_score'] * processed_cases['income_numeric']

# --- Final Preprocessing ---
# 1. Drop the original/helper columns
processed_cases = processed_cases.drop([
    'name', 'medication_adherence',
    'health_literacy_score', 'social_support_index', 'belief_in_medication',
    'provider_consistency', 'income_bracket', 'income_numeric'
], axis=1, errors='ignore')  # errors='ignore' skips columns that are absent

# 2. One-hot encode. Keep every dummy here: with drop_first=True, a small batch
#    drops whichever category sorts first in that batch, which need not be the training baseline.
processed_cases = pd.get_dummies(processed_cases)

# 3. Align columns with the training data to ensure an exact match
# This drops dummies the model never saw and adds any missing training columns, filled with 0
processed_cases = processed_cases.reindex(columns=X_train_final.columns, fill_value=0)

print("New patient data has been processed.")

New patient data has been processed.
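
One detail worth checking when encoding new patients: `pd.get_dummies(..., drop_first=True)` decides which category to drop from the rows it is given, so on a single patient it drops the only category present. The sketch below uses a hypothetical two-column training layout (`age`, `gender_Male`) to show the effect and one way around it; it is an illustration, not the notebook's actual column set.

```python
import pandas as pd

# Hypothetical training-time columns (baseline 'Female' was dropped in training)
train_columns = ['age', 'gender_Male']

one_patient = pd.DataFrame([{'age': 35, 'gender': 'Male'}])

# drop_first=True removes the only category present, so 'gender_Male' is
# missing and reindex fills it with 0
dropped = pd.get_dummies(one_patient, drop_first=True).reindex(
    columns=train_columns, fill_value=0
)

# Keeping every dummy and letting reindex discard the extras preserves the 1
kept = pd.get_dummies(one_patient).reindex(columns=train_columns, fill_value=0)

print(int(dropped.loc[0, 'gender_Male']), int(kept.loc[0, 'gender_Male']))
# prints: 0 1
```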


In [8]:
# Make predictions
predictions = model.predict(processed_cases)
probabilities = model.predict_proba(processed_cases)

# Add results to our original DataFrame for easy interpretation
test_cases_df['Prediction (1=Adherent)'] = predictions
test_cases_df['Confidence (Adherent)'] = [f"{prob[1]*100:.1f}%" for prob in probabilities]
test_cases_df['Confidence (Non-Adherent)'] = [f"{prob[0]*100:.1f}%" for prob in probabilities]

# Display the final results
display(test_cases_df[['name', 'Prediction (1=Adherent)', 'Confidence (Adherent)', 'Confidence (Non-Adherent)']])

| | name | Prediction (1=Adherent) | Confidence (Adherent) | Confidence (Non-Adherent) |
|---|---|---|---|---|
| 0 | Saanvi (High Risk) | 0 | 23.5% | 76.5% |
| 1 | Rohan (Low Risk) | 1 | 77.6% | 22.4% |
| 2 | Priya (Ambiguous) | 1 | 56.4% | 43.6% |


## **Final Conclusions**

The predictions are in, and the model behaves sensibly on all three cases: its predicted probabilities track how clear-cut each patient profile is.

***
## Saanvi (High Risk): A Clear Negative
* **Prediction:** Not Adherent (with 76.5% confidence)
* **Analysis:** The model confidently identified Saanvi as a high-risk patient, as intended. Her combination of low health literacy, low social support, low income, and provider inconsistency was a powerful negative signal that the model learned to associate with non-adherence. This is a clear success.

***
## Rohan (Low Risk): A Clear Positive
* **Prediction:** Adherent (with 77.6% confidence)
* **Analysis:** As expected, the model confidently classified Rohan as very likely to be adherent. His strong positive attributes (high literacy, great social support, provider consistency) created a high `patient_readiness_score`, which the model correctly interpreted as a strong indicator of success.

***
## Priya (Ambiguous): The Nuanced Verdict
* **Prediction:** Adherent (but with only 56.4% confidence)
* **Analysis:** This is the most impressive result. The model's final prediction is "Adherent," suggesting her positive traits (good literacy, provider consistency) ultimately outweighed her negative ones (side effects, higher depression score). However, the **low confidence** is the key insight. The model is essentially telling us: *"My best guess is that she will be adherent, but I am not very sure. There are significant risk factors here."*

In a real-world clinical setting, this low confidence score is incredibly valuable. It acts as a flag for healthcare providers, indicating that Priya is a "borderline" case who needs extra monitoring and support, even if the final prediction is positive.
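
A flag like this could be sketched as a simple probability band around 0.5. The probabilities below are copied from the results table above; the 0.40 to 0.60 band and the `needs_review` helper are hypothetical choices for illustration, and a real cutoff would need clinical input.

```python
# Adherence probabilities taken from the results table above
adherent_prob = {'Saanvi': 0.235, 'Rohan': 0.776, 'Priya': 0.564}

# Hypothetical review band; a real cutoff would need clinical input
LOW, HIGH = 0.40, 0.60

def needs_review(p, low=LOW, high=HIGH):
    """Return True when the probability is too close to 0.5 to act on alone."""
    return low <= p <= high

flagged = [name for name, p in adherent_prob.items() if needs_review(p)]
print(flagged)  # ['Priya']
```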

This confirms that our final model doesn't just make black-and-white decisions; it understands nuance and uncertainty, which is the hallmark of a truly effective predictive tool. This is a great result to end on. ✅




## **Interface for Doctors**

In [9]:
!pip install gradio -q


In [11]:
import gradio as gr
import pandas as pd

# Income mapping
income_map = {'Low': 1, 'Medium': 2, 'High': 3}

def predict_adherence(
    age, gender, education_level, income_bracket, location_type, hemoglobin_level,
    iron_deficiency_status, comorbidities_count, lab_test_frequency, side_effects_reported,
    medication_type, dosage_frequency, prescription_duration_days, tablets_dispensed,
    pill_count_last_visit, refill_gap_days, health_literacy_score, depression_score,
    social_support_index, belief_in_medication, distance_to_clinic_km, insurance_status,
    medication_cost_inr, provider_consistency
):
    # Build DataFrame
    input_data = pd.DataFrame([{
        'age': age,
        'gender': gender,
        'education_level': education_level,
        'income_bracket': income_bracket,
        'location_type': location_type,
        'hemoglobin_level': hemoglobin_level,
        'iron_deficiency_status': iron_deficiency_status,
        'comorbidities_count': comorbidities_count,
        'lab_test_frequency': lab_test_frequency,
        'side_effects_reported': side_effects_reported,
        'medication_type': medication_type,
        'dosage_frequency': dosage_frequency,
        'prescription_duration_days': prescription_duration_days,
        'tablets_dispensed': tablets_dispensed,
        'pill_count_last_visit': pill_count_last_visit,
        'refill_gap_days': refill_gap_days,
        'health_literacy_score': health_literacy_score,
        'depression_score': depression_score,
        'social_support_index': social_support_index,
        'belief_in_medication': belief_in_medication,
        'distance_to_clinic_km': distance_to_clinic_km,
        'insurance_status': insurance_status,
        'medication_cost_inr': medication_cost_inr,
        'provider_consistency': provider_consistency
    }])

    # Feature engineering
    scaled_features = scaler.transform(input_data[['health_literacy_score','social_support_index','belief_in_medication']])
    input_data['patient_readiness_score'] = (
        scaled_features[:,0] + scaled_features[:,1] + scaled_features[:,2] +
        input_data['provider_consistency'].astype(int)
    )
    input_data['income_numeric'] = input_data['income_bracket'].map(income_map)
    input_data['literacy_x_income'] = input_data['health_literacy_score'] * input_data['income_numeric']

    # Drop helper columns and one-hot encode
    input_data = input_data.drop(['health_literacy_score','social_support_index','belief_in_medication',
                                  'provider_consistency','income_bracket','income_numeric'], axis=1)
    # Keep every dummy: on a single row, drop_first=True would drop the only category present
    input_data = pd.get_dummies(input_data)
    input_data = input_data.reindex(columns=X_final.columns, fill_value=0)  # Use trained X columns

    # Prediction
    pred = model.predict(input_data)[0]
    proba = model.predict_proba(input_data)[0]

    result = "Likely Adherent" if pred==1 else "Likely Non-Adherent"
    conf = f"Confidence (Adherent): {proba[1]*100:.1f}% | Confidence (Non-Adherent): {proba[0]*100:.1f}%"

    return result, conf


In [13]:
import gradio as gr
import pandas as pd
import joblib

income_map = {'Low': 1, 'Medium': 2, 'High': 3}

# Prediction function
def predict_adherence(
    age, gender, education_level, income_bracket, location_type,
    hemoglobin_level, iron_deficiency_status, comorbidities_count,
    lab_test_frequency, side_effects_reported, medication_type,
    dosage_frequency, prescription_duration_days, tablets_dispensed,
    pill_count_last_visit, refill_gap_days, health_literacy_score,
    depression_score, social_support_index, belief_in_medication,
    distance_to_clinic_km, insurance_status, medication_cost_inr,
    provider_consistency
):
    input_data = pd.DataFrame([{
        'age': age, 'gender': gender, 'education_level': education_level,
        'income_bracket': income_bracket, 'location_type': location_type,
        'hemoglobin_level': hemoglobin_level, 'iron_deficiency_status': iron_deficiency_status,
        'comorbidities_count': comorbidities_count, 'lab_test_frequency': lab_test_frequency,
        'side_effects_reported': side_effects_reported, 'medication_type': medication_type,
        'dosage_frequency': dosage_frequency, 'prescription_duration_days': prescription_duration_days,
        'tablets_dispensed': tablets_dispensed, 'pill_count_last_visit': pill_count_last_visit,
        'refill_gap_days': refill_gap_days, 'health_literacy_score': health_literacy_score,
        'depression_score': depression_score, 'social_support_index': social_support_index,
        'belief_in_medication': belief_in_medication, 'distance_to_clinic_km': distance_to_clinic_km,
        'insurance_status': insurance_status, 'medication_cost_inr': medication_cost_inr,
        'provider_consistency': provider_consistency
    }])

    # Feature engineering
    scaled_features = scaler.transform(input_data[['health_literacy_score','social_support_index','belief_in_medication']])
    input_data['patient_readiness_score'] = (
        scaled_features[:,0] + scaled_features[:,1] + scaled_features[:,2] +
        input_data['provider_consistency'].astype(int)
    )
    input_data['income_numeric'] = input_data['income_bracket'].map(income_map)
    input_data['literacy_x_income'] = input_data['health_literacy_score'] * input_data['income_numeric']

    # Drop helper columns and one-hot encode
    input_data = input_data.drop(['health_literacy_score','social_support_index','belief_in_medication',
                                  'provider_consistency','income_bracket','income_numeric'], axis=1)
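    # Keep every dummy: on a single row, drop_first=True would drop the only category present.
    # reindex then discards the extra baseline dummies and fills unseen training columns with 0.
    input_data = pd.get_dummies(input_data)
    input_data = input_data.reindex(columns=model.feature_names_in_, fill_value=0)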

    # Prediction
    pred = model.predict(input_data)[0]
    proba = model.predict_proba(input_data)[0]

    result = "✅ Likely Adherent" if pred==1 else "⚠️ Likely Non-Adherent"
    conf = f"Confidence: Adherent {proba[1]*100:.1f}% | Non-Adherent {proba[0]*100:.1f}%"

    return result, conf

# --- Gradio Interface ---
with gr.Blocks(theme=gr.themes.Default(primary_hue="blue", font="Roboto")) as demo:

    gr.Markdown("## 💊 Medication Adherence Predictor", elem_id="title")
    gr.Markdown("Enter patient information below to predict likelihood of adherence.", elem_id="subtitle")

    # --- Demographics & Lifestyle ---
    with gr.Tab("Demographics & Lifestyle"):
        age = gr.Number(label="Patient Age (years)", value=30)
        gender = gr.Dropdown(["Male","Female"], label="Gender")
        education_level = gr.Dropdown(["Secondary","Graduate","Postgraduate"], label="Highest Education Level")
        income_bracket = gr.Dropdown(["Low","Medium","High"], label="Income Bracket")
        location_type = gr.Dropdown(["Urban","Rural"], label="Residential Area Type")
        distance_to_clinic_km = gr.Number(label="Distance to Clinic (km)", value=5)

    # --- Health Metrics ---
    with gr.Tab("Health Metrics"):
        hemoglobin_level = gr.Number(label="Hemoglobin Level (g/dL)", value=12.0)
        iron_deficiency_status = gr.Checkbox(label="Has Iron Deficiency?")
        comorbidities_count = gr.Number(label="Number of Chronic Conditions", value=0)
        lab_test_frequency = gr.Number(label="Annual Lab Test Frequency", value=2)
        side_effects_reported = gr.Checkbox(label="Side Effects Reported?")
        depression_score = gr.Number(label="Depression Score", value=1.0)

    # --- Medication & Treatment ---
    with gr.Tab("Medication & Treatment"):
        medication_type = gr.Dropdown(["Iron Tablets","Oral Supplements","Injections"], label="Medication Type")
        dosage_frequency = gr.Dropdown(["Daily","Weekly","Monthly"], label="Dosage Frequency")
        prescription_duration_days = gr.Number(label="Prescription Duration (days)", value=30)
        tablets_dispensed = gr.Number(label="Tablets Dispensed", value=30)
        pill_count_last_visit = gr.Number(label="Pills Remaining from Last Visit", value=0)
        refill_gap_days = gr.Number(label="Days Since Last Refill", value=5)
        insurance_status = gr.Checkbox(label="Insurance Coverage Available?")
        medication_cost_inr = gr.Number(label="Medication Cost (INR)", value=200)
        provider_consistency = gr.Checkbox(label="Consistent Healthcare Provider?")

    # --- Psychosocial ---
    with gr.Tab("Psychosocial"):
        health_literacy_score = gr.Number(label="Health Literacy Score (0–1)", value=0.5)
        social_support_index = gr.Number(label="Social Support Index (0–5)", value=3)
        belief_in_medication = gr.Number(label="Belief in Medication Effectiveness (0–1)", value=0.5)

    # Predict button and outputs
    predict_btn = gr.Button("Predict Adherence", elem_id="predict_btn")

    with gr.Row():
        result_output = gr.Textbox(label="Prediction", interactive=False)
        confidence_output = gr.Textbox(label="Confidence", interactive=False)



    predict_btn.click(
        fn=predict_adherence,
        inputs=[
            age, gender, education_level, income_bracket, location_type,
            hemoglobin_level, iron_deficiency_status, comorbidities_count,
            lab_test_frequency, side_effects_reported, medication_type,
            dosage_frequency, prescription_duration_days, tablets_dispensed,
            pill_count_last_visit, refill_gap_days, health_literacy_score,
            depression_score, social_support_index, belief_in_medication,
            distance_to_clinic_km, insurance_status, medication_cost_inr,
            provider_consistency
        ],
        outputs=[result_output, confidence_output]
    )

demo.launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://aff81b2be778286a2d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


