# Machine Problem: Treatment Efficacy Prediction Engine

## 1. Project Overview

**Objective:**
Develop a machine learning regression system that predicts a patient's **Improvement Score** (0-10) based on their demographic profile, medical condition, and prescribed treatment plan.

**The Problem:**
Doctors currently prescribe medication based on general guidelines, yet patient responses vary widely. By predicting the "Improvement Score" *before* treatment begins, this tool aims to help physicians choose the most effective treatment plan (Drug + Dosage + Duration) for a specific individual, effectively creating a "Personalized Medicine" recommender.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load data
df = pd.read_csv("real_drug_dataset.csv")
# Patient_ID is only an identifier; Side_Effects is known only after treatment,
# so it cannot be used at prediction time
df = df.drop(columns=['Patient_ID', 'Side_Effects'])

In [None]:
# Test if csv can be read successfully
df.head(10)

### Phase 1: Exploratory Data Analysis (EDA)

1. **Univariate Analysis:** Plot histograms of `Improvement_Score`. Is it a Bell Curve (Normal Distribution) or skewed?

In [None]:
# Histogram
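
A numeric check can accompany the histogram when answering the bell-curve question. The sketch below uses a synthetic stand-in for `df['Improvement_Score']` (an assumption, since the real CSV is not loaded here), prints the bin counts a 10-bin histogram would draw, and computes the sample skewness. On the real data, `sns.histplot(df['Improvement_Score'], kde=True)` would draw the plot itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['Improvement_Score'] (assumption: not the real dataset)
rng = np.random.default_rng(0)
scores = pd.Series(
    np.clip(rng.normal(loc=7.0, scale=1.5, size=1000), 0, 10),
    name="Improvement_Score",
)

# Bin counts: the same numbers a histogram with 10 unit-wide bins would display
counts, edges = np.histogram(scores, bins=10, range=(0, 10))
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {count}")

# Sample skewness: near 0 is roughly symmetric, negative is left-skewed, positive is right-skewed
skewness = scores.skew()
print(f"Skewness: {skewness:.3f}")
```

A skewness close to zero together with a single central peak supports a roughly normal shape; a clearly positive or negative value points to a skewed distribution.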


2. **Bivariate Analysis:**
* Does `Age` correlate with `Improvement_Score`? (Scatter plot).
* Do certain drugs (`Drug_Name`) consistently perform better for certain conditions (`Condition`)? (Box plots).

In [None]:
# Scatter plot
plt.figure(figsize=(14, 8))
sns.scatterplot(data=df, x='Age', y='Improvement_Score', hue='Condition', alpha=0.6)
sns.regplot(data=df, x='Age', y='Improvement_Score', scatter=False, color='black')  #Regression Line
plt.xticks(np.arange(15, 85, 5))    # Set x-ticks from 15 to 80 with increment of 5

# Scatter plot titles and labels
plt.title('Relationship Between Patient Age and Improvement Score', fontsize=14)
plt.xlabel('Age (Years)', fontsize=12)
plt.ylabel('Improvement Score (0-10)', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Moves legend outside to keep plot clean
plt.grid(True, linestyle='--', alpha=0.7)


# Show the plot
plt.tight_layout()
plt.show()

--------------------------------------------------------------------------------------------------------------------------------------
**Inference: Age vs. Improvement Score**

**Observation:**
Visual inspection of the scatter plot reveals **no discernible linear relationship** between a patient's Age and their Improvement Score.

1. **Visual Evidence:**
The data points are scattered uniformly across the chart without any clear upward or downward trend. The regression line (black line) is approximately horizontal, which is consistent with a correlation coefficient near zero (`0.01`).
2. **Practical Implication:**
This suggests that **Age is not a strong standalone predictor** of treatment outcome in this dataset. A 20-year-old patient is just as likely to achieve a high improvement score (e.g., 9.0) as a 70-year-old patient.
3. **Modeling Note:**
While Age alone does not predict the score, it should not necessarily be discarded. It may still provide predictive power when **combined** with other features (e.g., *older patients might react differently to high dosages compared to younger patients*), which a simple two-variable plot cannot capture.
---------------------------------------------------------------------------------------------------------------------------------------
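
The link between a flat regression line and a near-zero correlation can be checked numerically. The following is a minimal sketch on synthetic data in which `Age` and `Improvement_Score` are generated independently (an assumption for illustration, not the real dataset). It computes the Pearson coefficient and shows that the fitted slope equals $r \cdot s_y / s_x$, so a small $r$ forces a nearly horizontal line.

```python
import numpy as np
import pandas as pd

# Synthetic, independent Age and score columns (assumption: not the real dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.integers(18, 80, size=1000),
    "Improvement_Score": rng.uniform(4, 10, size=1000).round(1),
})

# Pearson correlation coefficient (the default method of Series.corr)
r = demo["Age"].corr(demo["Improvement_Score"])

# Least-squares line, the same kind of fit that sns.regplot draws
slope, intercept = np.polyfit(demo["Age"], demo["Improvement_Score"], deg=1)

# For simple linear regression, slope = r * (std of y / std of x)
slope_from_r = r * demo["Improvement_Score"].std() / demo["Age"].std()

print(f"Pearson r: {r:.3f}")
print(f"Fitted slope: {slope:.4f}, slope from r: {slope_from_r:.4f}")
```

On the real data the same two lines, applied to `df['Age']` and `df['Improvement_Score']`, would give the coefficient quoted in the inference.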



In [None]:
# Box plots
plt.figure(figsize=(14, 8))
sns.boxplot(data=df, x='Condition', y='Improvement_Score', hue='Drug_Name', palette='tab20')

plt.title('Comparison of Drug Efficacy by Medical Condition', fontsize=14)
plt.xlabel('Medical Condition', fontsize=12)
plt.ylabel('Improvement Score', fontsize=12)
plt.legend(title='Drug Name', bbox_to_anchor=(1.05, 1), loc='upper left') # Moves legend outside to keep plot clean
plt.grid(True, linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()

--------------------------------------------------------------------------------------------------------------------------------------
**Inference: Drug Efficacy by Medical Condition**

**Observation:**
Visual inspection of the box plot reveals that while some drugs perform better on average, there is significant overlap in effectiveness across treatments for the same condition.

1. **Visual Evidence:**
* **The "Winners":** Certain drugs show higher median lines (the line inside the box). For example, **Insulin Glargine** (Diabetes) and **Losartan** (Hypertension) appear to have the highest median improvement scores (~7.3).
* **The "Risky" Options:** Drugs like **Sertraline** (Depression) display very "tall" boxes with long whiskers. This indicates high variance, meaning the drug works amazingly for some (score 10) but poorly for others (score < 4).
* **The "Reliable" Options:** Drugs like **Metoprolol** (Hypertension) have shorter boxes, suggesting they produce consistent, predictable results, even if they aren't always the highest scorers.


2. **Practical Implication:**
This suggests that a **"one-size-fits-all" prescription strategy is suboptimal**. Since no single drug guarantees a high score for every patient, doctors cannot rely solely on the "average" best drug; they must consider individual patient characteristics.
3. **Modeling Note:**
The high variance (tall boxes) indicates that **Condition and Drug Name alone are not enough** to predict the outcome accurately. The machine learning model will need to leverage other features (like *Age, Dosage, or Gender*) to determine *which* specific patients will fall into the high-success range of a volatile drug like Sertraline.

---------------------------------------------------------------------------------------------------------------------------------------
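
The "median line" and "box height" readings above can be put into numbers with a grouped summary. This sketch uses synthetic scores and hypothetical names (`Cond_X`, `Drug_A`, `Drug_B`), so its figures say nothing about the real drugs. It only shows how median and interquartile range (IQR) per drug could be tabulated.

```python
import numpy as np
import pandas as pd

# Synthetic example: Drug_A is consistent, Drug_B is volatile (assumption, hypothetical names)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Condition": ["Cond_X"] * 400,
    "Drug_Name": ["Drug_A"] * 200 + ["Drug_B"] * 200,
    "Improvement_Score": np.clip(
        np.concatenate([rng.normal(7.0, 0.5, 200), rng.normal(6.5, 2.0, 200)]), 0, 10
    ),
})

def iqr(scores):
    """Interquartile range: the height of the box in a box plot."""
    return scores.quantile(0.75) - scores.quantile(0.25)

# One row per (Condition, Drug_Name) pair, sorted by median score
summary = (
    demo.groupby(["Condition", "Drug_Name"])["Improvement_Score"]
    .agg(median="median", iqr=iqr, n="count")
    .sort_values("median", ascending=False)
)
print(summary.round(2))
```

A high median with a small IQR corresponds to a "reliable" option, while a large IQR corresponds to a "risky" one.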

3. **Correlation Matrix:** Use a Heatmap to see if `Dosage_mg` and `Treatment_Duration_days` are correlated.

In [None]:
# Heatmap 
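
One possible starting point for the heatmap is the correlation matrix itself. The sketch below builds it on synthetic numeric columns (an assumption; the column names follow the dataset, the values do not). The resulting `corr` frame is what would be passed to `sns.heatmap(corr, annot=True)`.

```python
import numpy as np
import pandas as pd

# Synthetic numeric columns (assumption: values are random, not the real dataset)
rng = np.random.default_rng(7)
n = 500
demo = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Dosage_mg": rng.choice([50, 100, 250, 500, 850], size=n),
    "Treatment_Duration_days": rng.integers(5, 60, size=n),
    "Improvement_Score": rng.uniform(4, 10, size=n).round(1),
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = demo.select_dtypes("number").corr()
print(corr.round(2))

# The single cell the task asks about
print(corr.loc["Dosage_mg", "Treatment_Duration_days"])
```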

---

### Phase 2: Data Preprocessing & Feature Engineering
1. **Encoding:** Convert `Gender`, `Condition`, and `Drug_Name` into numbers using **One-Hot Encoding** (`pd.get_dummies`).

In [None]:
# List each category and its count for Gender, Condition, and Drug_Name
print(df.Gender.value_counts().sort_index())
print(df.Condition.value_counts().sort_index())
print(df.Drug_Name.value_counts().sort_index())

In [None]:
# One hot encoding
print(f"Original Data:\n{df}\n")

df_encoded = pd.get_dummies(df, columns=['Gender', 'Condition', 'Drug_Name'], drop_first=True)
print(f"One-Hot Encoded Data using Pandas:\n{df_encoded}\n")
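
The effect of `drop_first=True` is easiest to see on a tiny frame. In this sketch the category values `Female` and `Male` are assumed for illustration. With all dummy columns kept, the columns always sum to 1, so one of them is redundant; dropping the first removes that redundancy.

```python
import pandas as pd

# Tiny illustrative frame (assumption: category values chosen for the example)
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Dosage_mg": [50, 100, 250, 500],
})

full = pd.get_dummies(demo, columns=["Gender"])
reduced = pd.get_dummies(demo, columns=["Gender"], drop_first=True)

print(list(full.columns))     # ['Dosage_mg', 'Gender_Female', 'Gender_Male']
print(list(reduced.columns))  # ['Dosage_mg', 'Gender_Male']

# With both dummies kept, every row sums to exactly 1 (perfect collinearity)
row_sums = full["Gender_Female"].astype(int) + full["Gender_Male"].astype(int)
print(row_sums.tolist())      # [1, 1, 1, 1]
```

The dropped category becomes the reference level: a row with every remaining dummy equal to 0 belongs to it.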

2. **Feature Engineering (The "Secret Sauce"):**
* Create a new feature: `Total_Drug_Exposure = Dosage_mg * Treatment_Duration_days`.
* Create a binned feature: `Age_Group` (e.g., Young, Middle, Senior).

In [11]:
# Total Drug Exposure
df_encoded['Total_Drug_Exposure'] = df_encoded['Dosage_mg'] * df_encoded['Treatment_Duration_days']

# Age group
df_encoded['Age_Group'] = pd.cut(
    df_encoded['Age'], bins=[0, 39, 59, 150], labels=['Young', 'Middle', 'Senior']
)
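
Because `pd.cut` uses right-inclusive intervals by default, the bins `[0, 39, 59, 150]` mean (0, 39], (39, 59], and (59, 150]. A quick check on boundary ages (values chosen here for illustration) shows where each edge falls:

```python
import pandas as pd

# Boundary ages chosen to probe each bin edge
ages = pd.Series([18, 39, 40, 59, 60, 85], name="Age")
groups = pd.cut(ages, bins=[0, 39, 59, 150], labels=["Young", "Middle", "Senior"])

print(pd.DataFrame({"Age": ages, "Age_Group": groups}))
# 18 and 39 -> Young, 40 and 59 -> Middle, 60 and 85 -> Senior
```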

In [None]:
# Encode Age Group
df_features = pd.get_dummies(df_encoded, columns=['Age_Group'], drop_first=True)
print(df_features)

3. **Scaling:** Normalize `Dosage_mg` and `Age` using **MinMax Scaler** or **Standard Scaler** so that features with large magnitudes do not dominate scale-sensitive models.

**Standard Scaler**
$$ z = \frac{x - \mu}{\sigma}$$

**MinMax Scaler**
$$ X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
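
The MinMax formula can be sketched from scratch in a fit/transform style. The function names `fit_minmax` and `transform_minmax` are hypothetical, and the demo values are made up for the example.

```python
import pandas as pd

def fit_minmax(X):
    """Return the per-column minimum and range (max - min) of a DataFrame."""
    x_min = X.min()
    x_range = X.max() - x_min
    # Constant columns have range 0; replace with 1 to avoid division by zero
    x_range = x_range.replace(0, 1)
    return x_min, x_range

def transform_minmax(X, x_min, x_range):
    """Apply X_scaled = (X - X_min) / (X_max - X_min) column by column."""
    return (X - x_min) / x_range

# Small made-up frame (assumption: illustrative values only)
demo = pd.DataFrame({"Age": [20, 40, 60, 80], "Dosage_mg": [50, 250, 450, 850]})
x_min, x_range = fit_minmax(demo)
scaled = transform_minmax(demo, x_min, x_range)
print(scaled)
```

Every scaled column runs from 0 at its minimum to 1 at its maximum; for example, a dosage of 250 maps to (250 - 50) / (850 - 50) = 0.25.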

In [13]:
# Standard Scaler - Scratch Approach
def fit_scaler(X):
    """
    Parameters:
    X: DataFrame of numeric features

    Returns:
    mean: Series. Mean of each feature
    std:  Series. Standard deviation of each feature
    """

    mean = X.mean()
    std = X.std(ddof=0)  # ddof=0: population std (divides by n); ddof=1: sample std (divides by n-1)

    # Constant columns have std 0; replace with 1 to avoid division by zero
    std = std.replace(0, 1)

    return mean, std

def transform_scaler(X, mean, std):
    """
    Parameters:
    X: DataFrame of numeric features
    mean: Series returned by fit_scaler
    std:  Series returned by fit_scaler

    Returns:
    z: DataFrame. Scaled data
    """

    z = (X - mean) / std
    
    return z

In [None]:
# All numerical columns that need scaling
columns_to_scale = ['Age', 'Dosage_mg', 'Treatment_Duration_days', 'Total_Drug_Exposure']

print("Before scaling:")
print(df_features[columns_to_scale].head(10))

X_to_scale = df_features[columns_to_scale]

# Fit the scaler
mean, std = fit_scaler(X_to_scale)
print(f"Mean of features: {mean.values}")
print(f"Std of features: {std.values}")

# Transform the data
z = transform_scaler(X_to_scale, mean, std)

# Replace the original columns with scaled versions
df_scaled = df_features.copy()
df_scaled[columns_to_scale] = z

print("After scaling:")
print(df_scaled[columns_to_scale].head(10))

Before scaling:
   Age  Dosage_mg  Total_Drug_Exposure
0   56         50                  450
1   69        500                12000
2   46        100                 2500
3   32        850                37400
4   60        850                29750
5   25        850                42500
6   78        250                10000
7   38        100                 1500
8   56        850                47600
9   75        850                16150
Mean of features: [   49.857   352.65  11405.45 ]
Std of features: [   18.10520784   295.27187049 11977.91291492]
After scaling:
        Age  Dosage_mg  Total_Drug_Exposure
0  0.339295  -1.024988            -0.914638
1  1.057320   0.499032             0.049637
2 -0.213033  -0.855652            -0.743489
3 -0.986291   1.684380             2.170207
4  0.560226   1.684380             1.531531
5 -1.372920   1.684380             2.595991
6  1.554415  -0.347646            -0.117337
7 -0.654894  -0.855652            -0.826976
8  0.339295   1.684380             3.021774
9  1.388716   1.684380             0.396108

In [None]:
# Standard Scaler - Library Approach
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_features[columns_to_scale])

print(f"Mean of features: {scaler.mean_}")
print(f"Std of features: {scaler.scale_}")

z_library = scaler.transform(df_features[columns_to_scale])
df_scaled_library = df_features.copy()
df_scaled_library[columns_to_scale] = z_library

print("After scaling:")
print(df_scaled_library[columns_to_scale].head(10))

Mean of features: [   49.857   352.65  11405.45 ]
Std of features: [   18.10520784   295.27187049 11977.91291492]
After scaling:
        Age  Dosage_mg  Total_Drug_Exposure
0  0.339295  -1.024988            -0.914638
1  1.057320   0.499032             0.049637
2 -0.213033  -0.855652            -0.743489
3 -0.986291   1.684380             2.170207
4  0.560226   1.684380             1.531531
5 -1.372920   1.684380             2.595991
6  1.554415  -0.347646            -0.117337
7 -0.654894  -0.855652            -0.826976
8  0.339295   1.684380             3.021774
9  1.388716   1.684380             0.396108
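
As a sanity check on the printed tables, the first scaled row can be recomputed by hand from $z = (x - \mu) / \sigma$. The numbers below are copied from the output above (row 0, plus the printed mean and std); because the printed mean is rounded, the result should agree only to within rounding.

```python
# Values copied from the printed output (row 0, fitted mean, fitted std)
row0 = {"Age": 56, "Dosage_mg": 50, "Total_Drug_Exposure": 450}
mean = {"Age": 49.857, "Dosage_mg": 352.65, "Total_Drug_Exposure": 11405.45}
std = {"Age": 18.10520784, "Dosage_mg": 295.27187049, "Total_Drug_Exposure": 11977.91291492}

# z = (x - mean) / std for each feature
z0 = {col: (row0[col] - mean[col]) / std[col] for col in row0}

for col, z in z0.items():
    print(f"{col}: {z:.4f}")
```

The results should line up with the first row of the scaled table (about 0.3393, -1.0250, and -0.9146).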


---

### Phase 3: Model Development
1. **Baseline Model:** Train a simple **Linear Regression** and calculate the $R^2$ score. (Note: it will likely be low. This is your baseline to beat.)

In [None]:
# Linear Regression 
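
A from-scratch baseline could look like the sketch below: ordinary least squares via `np.linalg.lstsq`, scored with $R^2 = 1 - SS_{res} / SS_{tot}$ on a held-out split. The data is synthetic and linear by construction (an assumption), so the score here says nothing about the real dataset. `sklearn.linear_model.LinearRegression` is the library equivalent.

```python
import numpy as np

# Synthetic data with a known linear relationship (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 6.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Simple 80/20 split (rows are already in random order)
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def add_intercept(X):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(X)), X])

# Ordinary least squares fit on the training rows
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
y_pred = add_intercept(X_test) @ coef

# R^2 on the held-out rows
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"Intercept and coefficients: {coef.round(3)}")
print(f"Test R^2: {r2:.3f}")
```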


2. **Advanced Model 1:** Train a **Decision Tree Regressor**. Trees can capture non-linear relationships and feature interactions (e.g., maybe high dosage is good for young people but bad for old people).

In [None]:
# Decision Tree Regressor
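
To illustrate why a tree can beat the linear baseline when features interact, the sketch below builds synthetic data with a made-up rule: high dosage helps only patients under 40. That rule is an assumption for illustration, not a finding from the dataset. It then compares test $R^2$ for `LinearRegression` and a shallow `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with an Age x Dosage interaction (assumption: made-up rule)
rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(18, 80, size=n)
dosage = rng.uniform(50, 850, size=n)
y = 5.0 + 3.0 * ((age < 40) & (dosage > 450)) + rng.normal(scale=0.3, size=n)
X = np.column_stack([age, dosage])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linear = LinearRegression().fit(X_train, y_train)
# max_depth limits tree size, which helps against overfitting
tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)

# .score returns R^2 for regressors
r2_linear = linear.score(X_test, y_test)
r2_tree = tree.score(X_test, y_test)
print(f"Linear R^2: {r2_linear:.3f}, Tree R^2: {r2_tree:.3f}")
```

On the real data the gap may be smaller or absent; `max_depth` is a tuning choice rather than a recommended value.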


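3. **Advanced Model 2 (Champion):** Train a **Random Forest Regressor** or **Gradient Boosting Regressor**. Both combine many trees: a random forest averages independently grown trees to reduce variance, while gradient boosting adds trees sequentially, each one correcting the errors of those before it.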

In [None]:
# Random Forest


# Gradient Boosting
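
Both ensemble candidates can be trained and compared in one loop. This is a minimal sketch on synthetic data with a made-up interaction rule (an assumption) and untuned hyperparameters; on the real problem `X` and `y` would come from the scaled feature frame and `Improvement_Score`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with an Age x Dosage interaction (assumption: made-up rule)
rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(18, 80, size=n)
dosage = rng.uniform(50, 850, size=n)
y = 5.0 + 3.0 * ((age < 40) & (dosage > 450)) + rng.normal(scale=0.3, size=n)
X = np.column_stack([age, dosage])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its test R^2
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```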



---

### Phase 4: Evaluation & Interpretation

1. **Metrics:** Report **MAE** (Mean Absolute Error) and **RMSE** (Root Mean Squared Error).
2. **Feature Importance:** Extract which factors mattered most. Was it the *Drug Name* or the *Duration*? 
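
Both steps can be sketched together. The metrics follow $\text{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ and $\text{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$, and importance is read from a fitted forest's `feature_importances_`. The data is synthetic, with a target driven by duration only, and `Noise_Feature` is a hypothetical column; both are assumptions, so the ranking below is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic features; the target depends on duration only (assumption)
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "Treatment_Duration_days": rng.integers(5, 60, size=n),
    "Dosage_mg": rng.choice([50, 100, 250, 500, 850], size=n),
    "Noise_Feature": rng.normal(size=n),  # hypothetical irrelevant column
})
y = 4.0 + 0.08 * X["Treatment_Duration_days"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Error metrics computed directly from their definitions
errors = y_test.to_numpy() - y_pred
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```

RMSE is never smaller than MAE and penalizes large errors more heavily. With one-hot encoded inputs, the importances of all `Drug_Name_*` columns would need to be summed to judge the drug as a whole against `Treatment_Duration_days`.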
