Feature Engineering

In [1]:
import pandas as pd
import numpy as np

# Use the "raw" URL for the CSV file
url = 'https://raw.githubusercontent.com/nandarishik/Ferry-Internship/main/realistic_medication_adherence_data.csv'

df = pd.read_csv(url)

# Preview the first 5 rows of the data
df.head()


Unnamed: 0,age,gender,education_level,income_bracket,location_type,hemoglobin_level,iron_deficiency_status,comorbidities_count,lab_test_frequency,side_effects_reported,...,refill_gap_days,health_literacy_score,depression_score,social_support_index,belief_in_medication,distance_to_clinic_km,insurance_status,medication_cost_inr,provider_consistency,medication_adherence
0,56,Female,Secondary,Low,Rural,13.42,False,2,5,False,...,14.0,0.69,0.06,3.66,0.62,19.19,True,492.45,True,0
1,24,Male,Graduate,Medium,Urban,11.18,False,0,3,False,...,17.0,0.73,1.28,0.07,0.78,4.45,False,452.67,True,0
2,25,Female,Secondary,Low,Urban,13.77,False,0,2,False,...,16.0,0.21,1.54,0.23,0.71,1.47,False,322.68,False,0
3,45,Male,Secondary,Medium,Rural,14.57,False,4,2,True,...,6.0,0.62,1.25,2.25,0.48,12.62,False,362.98,True,0
4,32,Male,Graduate,High,Urban,13.57,False,0,0,False,...,19.0,0.46,1.69,4.71,0.35,0.38,True,124.1,True,1


In [2]:
# Loop through each column to fill missing values
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            # Fill text columns with the most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            # Fill numeric columns with the median
            df[col] = df[col].fillna(df[col].median())

print("Missing values handled.")

Missing values handled.


In [3]:
# Interaction Feature: Frailty Index
df['frailty_index'] = df['age'] * (df['comorbidities_count'] + 1)

# Ratio Feature: Financial Burden
income_numeric_map = {'Low': 1, 'Medium': 2, 'High': 3}
df['income_numeric'] = df['income_bracket'].map(income_numeric_map)
df['financial_burden'] = df['medication_cost_inr'] / (df['income_numeric'] + 1)

# Binning Feature: Age Group
bins = [17, 35, 55, 81]
labels = ['Young_Adult', 'Middle_Aged', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

print("Features engineered.")

Features engineered.


In [4]:
# Select features, dropping original columns we replaced and helper columns
X = df.drop([
    'medication_adherence', 'age', 'income_bracket', 'income_numeric',
    'medication_cost_inr', 'comorbidities_count'
], axis=1)
y = df['medication_adherence']

# One-hot encode all remaining categorical columns (like our new 'age_group')
X = pd.get_dummies(X, drop_first=True)

print("PreProcessing Complete.")
print("Final features shape:", X.shape)

PreProcessing Complete.
Final features shape: (500, 27)


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Data split into training and testing sets.")

Data split into training and testing sets.


In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Use the best parameters we found from hyperparameter tuning
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nFinal Model Accuracy: {accuracy:.2f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred))


Final Model Accuracy: 0.68

Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.54      0.61        46
           1       0.67      0.80      0.73        54

    accuracy                           0.68       100
   macro avg       0.68      0.67      0.67       100
weighted avg       0.68      0.68      0.67       100



In [7]:
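# Pearson correlation of each feature with the target.
# Sorting in descending order lists the ten most positive correlations,
# so strongly negative features do not appear here.
corr = X.corrwith(y)
print(corr.sort_values(ascending=False).head(10))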


health_literacy_score               0.297665
provider_consistency                0.167828
social_support_index                0.165901
belief_in_medication                0.138714
hemoglobin_level                    0.048991
age_group_Senior                    0.045172
prescription_duration_days          0.024084
tablets_dispensed                   0.023183
medication_type_Oral Supplements    0.019235
iron_deficiency_status              0.013766
dtype: float64


### Explanation of My Engineered Features

When I looked at the limits of my original dataset, I realized I needed to create features that better reflect the real-world challenges patients face. Here’s what I built and why:

1. **Frailty Index**
   I created this by multiplying a patient’s **age** by their **comorbidities_count** plus one. The extra one keeps the index equal to age, rather than zero, for patients with no comorbidities.
   My reasoning was simple: risk isn’t just additive, it multiplies. A 75-year-old with three other health conditions is in a far more fragile state than a 35-year-old with the same three conditions. Because I drop the raw **age** and **comorbidities_count** columns afterwards, the single **frailty_index** is the only place the model sees this combined vulnerability.

2. **Financial Burden**
   This came from dividing **medication_cost_inr** by the patient’s **income_bracket**, which I first mapped to numbers (Low = 1, Medium = 2, High = 3) and then increased by one.
   The idea here was context. A ₹500 prescription means very different things depending on who you are: almost negligible for a high-income patient, but a serious obstacle for someone in a lower bracket. By creating this ratio, I gave the model a way to measure **relative financial strain**, which feels much closer to reality than just looking at raw costs.

3. **Age Group**
   Instead of keeping age as a continuous number, I binned it into three categories: **Young_Adult** (17 to 34), **Middle_Aged** (35 to 54), and **Senior** (55 to 80).
   This was my way of handling the **non-linear behavior of age**. From what I’ve observed, adherence isn’t a straight line: young adults might do well, middle age tends to dip (work, family, responsibilities), and then seniors often return to higher adherence. By splitting age into groups, I gave the model a chance to treat each group differently instead of forcing a single trend across all ages.
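
A quick way to sanity-check the three formulas by hand is the sketch below. It uses plain Python rather than pandas, so `bisect_right` stands in for `pd.cut(..., right=False)`. The helper function names and the example patients are made up for illustration and are not rows from the dataset.

```python
from bisect import bisect_right

# Same constants as the notebook cell that builds the features.
INCOME_NUMERIC_MAP = {"Low": 1, "Medium": 2, "High": 3}
AGE_BINS = [17, 35, 55, 81]
AGE_LABELS = ["Young_Adult", "Middle_Aged", "Senior"]


def frailty_index(age, comorbidities_count):
    # The +1 keeps age as the baseline for patients with no comorbidities.
    return age * (comorbidities_count + 1)


def financial_burden(medication_cost_inr, income_bracket):
    # Cost divided by (ordinal income + 1), as in the notebook.
    return medication_cost_inr / (INCOME_NUMERIC_MAP[income_bracket] + 1)


def age_group(age):
    # Left-closed bins [17, 35), [35, 55), [55, 81), matching right=False.
    # Ages outside the bins return None (pd.cut would give NaN instead).
    if not AGE_BINS[0] <= age < AGE_BINS[-1]:
        return None
    return AGE_LABELS[bisect_right(AGE_BINS, age) - 1]


# Hypothetical patients with made-up values.
print(frailty_index(75, 3), frailty_index(35, 3))                    # 300 140
print(financial_burden(500, "Low"), financial_burden(500, "High"))   # 250.0 125.0
print(age_group(34), age_group(35), age_group(80), age_group(81))
```

Two things stand out from the numbers. With the Low = 1 to High = 3 mapping plus one, the same ₹500 prescription scores only twice as high for a low-income patient as for a high-income one, so the ratio is a fairly gentle adjustment. Also, any age below 17 or above 80 falls outside the bins and would end up without an age group.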


