In [1]:
import pandas as pd
import numpy as np

# Reload the original data
url = 'https://raw.githubusercontent.com/nandarishik/Ferry-Internship/main/realistic_medication_adherence_data.csv'
df = pd.read_csv(url)

# Clean missing values
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            # Categorical column: fill gaps with the most frequent value.
            # Assign back instead of calling fillna(..., inplace=True) on df[col],
            # which operates on a copy and stops working in pandas 3.0.
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            # Numeric column: fill gaps with the median
            df[col] = df[col].fillna(df[col].median())

print("✅ Data reloaded and cleaned.")

✅ Data reloaded and cleaned.


In [2]:
# --- 1. Domain-Specific Composite Score ---
med_type_complexity = {'Injections': 3, 'Iron Tablets': 2, 'Oral Supplements': 1}
dosage_freq_complexity = {'Daily': 3, 'Weekly': 2, 'Monthly': 1}
df['treatment_complexity'] = (
    df['medication_type'].map(med_type_complexity) +
    df['dosage_frequency'].map(dosage_freq_complexity) +
    df['side_effects_reported'].astype(int) +
    df['comorbidities_count']
)

# --- 2. Polynomial Features ---
df['age_squared'] = df['age']**2
df['depression_score_squared'] = df['depression_score']**2

# --- 3. Log Transformation ---
df['distance_log'] = np.log1p(df['distance_to_clinic_km'])

print("✅ Advanced features created successfully.")

✅ Advanced features created successfully.
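
One thing worth checking in the composite score: `Series.map` with a dictionary returns `NaN` for any value that is not a key, and that `NaN` then propagates into the sum. The sketch below shows a coverage check on a toy frame. The `unmapped_values` helper and the `Patches` category are hypothetical and are not part of the original dataset.

```python
import pandas as pd

# Same mapping as in the cell above
med_type_complexity = {"Injections": 3, "Iron Tablets": 2, "Oral Supplements": 1}

# Hypothetical toy data; "Patches" is an invented category the mapping does not cover
toy = pd.DataFrame({"medication_type": ["Injections", "Iron Tablets", "Patches"]})


def unmapped_values(series, mapping):
    """Return the sorted distinct non-null values of `series` that are not keys of `mapping`."""
    return sorted(set(series.dropna().unique()) - set(mapping))


missing = unmapped_values(toy["medication_type"], med_type_complexity)
mapped = toy["medication_type"].map(med_type_complexity)
n_nan = int(mapped.isna().sum())

print("Unmapped categories:", missing)
print("Rows that became NaN:", n_nan)
```

Running the same check on `df['medication_type']` and `df['dosage_frequency']` before building `treatment_complexity` would confirm that no row silently ends up with a missing score.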


In [3]:
# Create the target variable y
y = df['medication_adherence']

# Create the feature set X
X_advanced = df.drop([
    # Drop the target
    'medication_adherence',

    # Drop original columns that were replaced or used in new features
    'age',
    'depression_score',
    'distance_to_clinic_km',
    'medication_type',
    'dosage_frequency',
    'side_effects_reported',
    'comorbidities_count'
], axis=1)

# One-hot encode any remaining categorical columns
X_advanced = pd.get_dummies(X_advanced, drop_first=True)

print("✅ Final feature set X prepared for modeling.")
print("Final features shape:", X_advanced.shape)

✅ Final feature set X prepared for modeling.
Final features shape: (500, 23)


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split the data
X_train_adv, X_test_adv, y_train, y_test = train_test_split(
    X_advanced, y, test_size=0.2, random_state=42
)

# Use the best model parameters we found earlier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42
)

# Train the model on the new advanced data
model.fit(X_train_adv, y_train)

# Make predictions and evaluate
y_pred_adv = model.predict(X_test_adv)
accuracy_adv = accuracy_score(y_test, y_pred_adv)

print(f"\n🚀 Final Model Accuracy with Advanced Features: {accuracy_adv:.2f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_adv))


🚀 Final Model Accuracy with Advanced Features: 0.63

Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.48      0.54        46
           1       0.63      0.76      0.69        54

    accuracy                           0.63       100
   macro avg       0.63      0.62      0.62       100
weighted avg       0.63      0.63      0.62       100
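
The accuracy above is measured on 100 test rows, so it carries some sampling noise. As a rough gauge, the sketch below computes a binomial standard error and a normal-approximation 95% interval for the reported 0.63. The `accuracy_standard_error` helper is a name introduced here for illustration, and the normal approximation is an assumption that is only approximate at this sample size.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimated on n_test independent rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


accuracy = 0.63   # reported in the output above
n_test = 100      # total support in the classification report

se = accuracy_standard_error(accuracy, n_test)
low = accuracy - 1.96 * se
high = accuracy + 1.96 * se

print(f"standard error: {se:.3f}")
print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
```

This is why the per-class recall in the report is worth reading alongside the headline accuracy when comparing models on a test set of this size.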



## Conclusions

This is our final result. The advanced feature engineering **decreased the model's performance**: accuracy fell to 63%, down from the 69% of the tuned baseline.

This is not a failure; it's the most important lesson of the entire project.

---
## The Lesson: Simplicity Wins
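This outcome is common in machine learning: features that look "smarter" on paper do not necessarily give the model more to work with. Squaring a non-negative feature and applying `log1p` are monotonic transformations, and tree-based models split on the ordering of values, so `age_squared`, `depression_score_squared`, and `distance_log` give a random forest essentially the same information as the columns they replaced (assuming the depression scores are non-negative). The composite `treatment_complexity` score is the one real change. It collapses four columns into a single sum, so the model can no longer tell a daily oral supplement from a monthly injection, since both contribute 4 before side effects and comorbidities are added. The likely effect is lost signal rather than added insight, and the model's ability to generalize to new data got worse.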
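
The point about transformed features can be checked on a toy example: a tree split depends only on how the rows are ordered by a feature, so an order-preserving transformation leaves the chosen partition unchanged. The sketch below is a simplified stand-in for a single split inside a tree, not scikit-learn's implementation. The `gini` and `best_split_partition` helpers, the ages, and the labels are all invented for illustration.

```python
import math


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_partition(values, labels):
    """Return the set of row indices sent left by the best single Gini split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        # Only consider cut points between two distinct feature values
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical toy data: non-negative ages and a binary adherence label
ages = [22, 35, 41, 47, 58, 63, 70, 29]
adherent = [1, 1, 1, 0, 0, 0, 0, 1]

raw_partition = best_split_partition(ages, adherent)
squared_partition = best_split_partition([a ** 2 for a in ages], adherent)
log_partition = best_split_partition([math.log1p(a) for a in ages], adherent)

same_partition = raw_partition == squared_partition == log_partition
print("Same rows on each side of the split:", same_partition)
```

The thresholds themselves differ between the raw and transformed columns, so predictions for unseen values that fall between two training values can shift slightly, but the grouping of the training rows is the same.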

Most importantly, the **recall for non-adherent patients (Class 0) dropped to 0.48**. This means the model now misses more than half of the at-risk patients you would want to help (about 24 of the 46 non-adherent patients in the test set), which makes it the least useful model we've built for practical purposes.
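
To make the recall figure concrete, the counts behind the classification report can be recovered from the recall and support columns. This is a small sketch: the `counts_from_report` helper is a name introduced here, and because it works from values rounded to two decimals, the result is a reconstruction rather than the model's actual confusion matrix.

```python
def counts_from_report(recall_0, support_0, recall_1, support_1):
    """Recover per-class counts from recall and support for a binary report."""
    correct_0 = round(recall_0 * support_0)   # class 0 rows predicted as 0
    missed_0 = support_0 - correct_0          # class 0 rows predicted as 1
    correct_1 = round(recall_1 * support_1)   # class 1 rows predicted as 1
    missed_1 = support_1 - correct_1          # class 1 rows predicted as 0
    return correct_0, missed_0, missed_1, correct_1


# Recall and support values taken from the report above
correct_0, missed_0, missed_1, correct_1 = counts_from_report(0.48, 46, 0.76, 54)

accuracy = (correct_0 + correct_1) / (46 + 54)
precision_0 = correct_0 / (correct_0 + missed_1)

print("Non-adherent patients caught:", correct_0)
print("Non-adherent patients missed:", missed_0)
print(f"Accuracy: {accuracy:.2f}, class 0 precision: {precision_0:.2f}")
```

The reconstructed counts reproduce the reported accuracy and class 0 precision (both 0.63), which suggests the recovered numbers match the report.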

---
## Final Verdict and Project Recommendation
We have now completed the full data science lifecycle. We started with a broken model, fixed it, established a strong baseline, and ran multiple experiments to improve it. The data has given us a clear answer.

**Your best model is the first one we properly tuned, before any feature engineering was attempted.**

That model is the winner because it:
1.  Achieved the highest and most stable accuracy (**69%**).
2.  Had the most **balanced performance** for identifying both adherent and non-adherent patients.
3.  Relied on the original, most **interpretable features**, leading to clear, actionable insights (like the importance of health literacy).

Congratulations, Nanda! You've successfully taken a complex problem from a flawed starting point to a robust, well-understood, and realistic conclusion. You now have a reliable baseline model and, more importantly, a data-driven reason to trust that a simpler approach is the most effective one for this problem.