**Second-Order Feature Engineering**

In [None]:
import pandas as pd
import numpy as np

# Reload the original data
url = 'https://raw.githubusercontent.com/nandarishik/Ferry-Internship/main/realistic_medication_adherence_data.csv'
df = pd.read_csv(url)

# Clean missing values
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            # Categorical column: fill with the most frequent value.
            # Assigning back avoids the chained inplace call that pandas 3.0 will not support.
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            # Numeric column: fill with the median
            df[col] = df[col].fillna(df[col].median())

print("Data loaded and processed.")

Data loaded and processed.


In [None]:
# --- 1. Domain-Specific Composite Score ---
med_type_complexity = {'Injections': 3, 'Iron Tablets': 2, 'Oral Supplements': 1}
dosage_freq_complexity = {'Daily': 3, 'Weekly': 2, 'Monthly': 1}
df['treatment_complexity'] = (
    df['medication_type'].map(med_type_complexity) +
    df['dosage_frequency'].map(dosage_freq_complexity) +
    df['side_effects_reported'].astype(int) +
    df['comorbidities_count']
)

# --- 2. Polynomial Features ---
df['age_squared'] = df['age']**2
df['depression_score_squared'] = df['depression_score']**2

# --- 3. Log Transformation ---
df['distance_log'] = np.log1p(df['distance_to_clinic_km'])

print("Advanced features created successfully.")

Advanced features created successfully.
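
One thing worth checking in the composite score: `Series.map` returns `NaN` for any category that has no key in the mapping dictionary, and that `NaN` then propagates into `treatment_complexity`. The sketch below shows one way to detect this. The `toy` frame, the `'Capsules'` category, and the helper `unmapped_values` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
# 'Capsules' is deliberately absent from the mapping below.
toy = pd.DataFrame({
    'medication_type': ['Injections', 'Iron Tablets', 'Capsules'],
    'dosage_frequency': ['Daily', 'Weekly', 'Monthly'],
})

med_type_complexity = {'Injections': 3, 'Iron Tablets': 2, 'Oral Supplements': 1}
dosage_freq_complexity = {'Daily': 3, 'Weekly': 2, 'Monthly': 1}


def unmapped_values(series, mapping):
    """Return the sorted distinct non-null values in `series` with no key in `mapping`."""
    return sorted(set(series.dropna().unique()) - set(mapping))


missing_med = unmapped_values(toy['medication_type'], med_type_complexity)
missing_freq = unmapped_values(toy['dosage_frequency'], dosage_freq_complexity)
print("Unmapped medication types:", missing_med)
print("Unmapped dosage frequencies:", missing_freq)

# The unmapped category becomes NaN after .map(), so the sum would be NaN too.
n_nan = toy['medication_type'].map(med_type_complexity).isna().sum()
print("Rows with NaN after mapping:", n_nan)
```

Running the same check on `df` before building `treatment_complexity` would confirm whether every category is covered by the two dictionaries.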


In [None]:
# Create the target variable y
y = df['medication_adherence']

# Create the feature set X
X_advanced = df.drop([
    # Drop the target
    'medication_adherence',

    # Drop original columns that were replaced or used in new features
    'age',
    'depression_score',
    'distance_to_clinic_km',
    'medication_type',
    'dosage_frequency',
    'side_effects_reported',
    'comorbidities_count'
], axis=1)

# One-hot encode any remaining categorical columns
X_advanced = pd.get_dummies(X_advanced, drop_first=True)

print("Dataset X ready.")
print("Final features shape:", X_advanced.shape)

Dataset X ready.
Final features shape: (500, 23)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split the data
X_train_adv, X_test_adv, y_train, y_test = train_test_split(
    X_advanced, y, test_size=0.2, random_state=42
)

# Use the best model parameters we found earlier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42
)

# Train the model on the new advanced data
model.fit(X_train_adv, y_train)

# Make predictions and evaluate
y_pred_adv = model.predict(X_test_adv)
accuracy_adv = accuracy_score(y_test, y_pred_adv)

print(f"\nFinal Model Accuracy with Advanced Features: {accuracy_adv:.2f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_adv))


Final Model Accuracy with Advanced Features: 0.63

Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.48      0.54        46
           1       0.63      0.76      0.69        54

    accuracy                           0.63       100
   macro avg       0.63      0.62      0.62       100
weighted avg       0.63      0.63      0.62       100
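
With only 100 rows in the test set, a single split gives a fairly noisy accuracy estimate, so differences of a few points between experiments can come from the split alone. As a minimal sketch, stratified k-fold cross-validation averages over several splits. The data here is a synthetic stand-in generated with `make_classification` (an assumption so that the example is self-contained), sized to match the 500 x 23 feature matrix; `X_demo`, `y_demo`, and `model_demo` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as X_advanced (500 rows, 23 features).
X_demo, y_demo = make_classification(n_samples=500, n_features=23, random_state=42)

# Same hyperparameters as the model trained above.
model_demo = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42,
)

# Five stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring='accuracy')

print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

In the notebook itself, `X_advanced` and `y` would replace the synthetic arrays; the standard deviation across folds gives a rough sense of how much a single-split accuracy can move.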



## **Conclusions**

This is my **final result**. The advanced feature engineering **reduced the model’s performance**: accuracy fell to **63%**, compared with 69% for the tuned model trained on the original features.

I don’t see this as a failure. In fact, I think it’s the **most important lesson** of the entire project.

---

## **The Lesson: Simplicity Wins**

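What happened here is a common outcome in machine learning. My attempts to design more complex, “smarter” features ended up introducing **more noise than signal**. The `treatment_complexity` score collapsed four original columns into a single sum, and because I dropped those columns afterwards, the model could no longer see them individually. The squared and log-transformed features, assuming the underlying values are non-negative, only rescale the originals without changing their order, so a tree-based model gains little from them. As a result, the model’s **generalization to new data got worse**.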

The most critical red flag was the **recall for non-adherent patients (Class 0)**, which fell to **0.48**. With 46 non-adherent patients in the test set, the model was now **missing more than half of the very patients who are most at risk**, the exact group this project was meant to help. In practical terms, this version of the model is the **least useful one I’ve built**.
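
To make the recall figure concrete, the counts behind the classification report can be worked back from its rounded values. The confusion matrix below is inferred from the report (supports of 46 and 54, recalls of 0.48 and 0.76), not taken from notebook output, so treat the exact counts as a reconstruction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Counts inferred from the rounded report (an assumption, not notebook output):
#   class 0 (non-adherent): 22 predicted correctly, 24 predicted as adherent
#   class 1 (adherent):     41 predicted correctly, 13 predicted as non-adherent
y_true = np.array([0] * 46 + [1] * 54)
y_pred = np.array([0] * 22 + [1] * 24 + [1] * 41 + [0] * 13)

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
recall_class0 = recall_score(y_true, y_pred, pos_label=0)
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(f"Recall for class 0: {recall_class0:.2f}")  # 22 / 46
print(f"Accuracy: {accuracy:.2f}")                 # (22 + 41) / 100
```

Under this reconstruction, 24 of the 46 non-adherent patients in the test set are predicted as adherent, which is where the "more than half" figure comes from.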

---

## **My Final Verdict and Recommendation**

At this point, I’ve walked through the **full data science lifecycle**: starting with a broken model, fixing it, building a solid baseline, and running multiple experiments to try and improve it. The data has now given me a **clear and decisive answer**.

**The best model so far is the first one I properly tuned, before I tried any feature engineering.**

That model stands out because it:

1. Delivered the **highest and most stable accuracy (69%)**.
2. Struck the best **balance** between catching adherent and non-adherent patients.
3. Relied on the original features, which were not only effective but also the most **interpretable**, giving me **actionable insights** like the importance of health literacy.

---

Looking back, I feel I’ve successfully taken a **complex, messy problem** from a weak starting point all the way to a **robust and realistic conclusion**. The lesson is clear: for this dataset and this problem, **simplicity is more powerful than complexity**.

That said, my **next step** is to try a more powerful algorithm such as **XGBoost** on this dataset. It may uncover stronger patterns, and I will check whether it does so without sacrificing interpretability.
