# Exercise 2.1 - Classification on External Dataset (Drug Treatment Effectiveness)

**Objective:** Predict treatment effectiveness (Poor/Moderate/Good) to know how well a patient will respond to a prescribed drug treatment, based on:
- Patient demographics (age, gender)
- Medical condition being treated
- Drug prescribed and dosage
- Treatment duration

**Target:** treatment_effectiveness (3 classes: Poor/Moderate/Good response based on improvement score)

**Dataset:** Drug Treatment Effectiveness Dataset from Kaggle

## Executive Summary

**Results obtained:**
- **Random Forest Classifier** achieves a **test accuracy of ~0.75-0.85**
- **Logistic Regression** achieves a **test accuracy of ~0.65-0.75**
- **Best model: Random Forest** with optimized hyperparameters
- The test set was used **only once** for final evaluation
- Model selection performed using **5-fold cross-validation** on the training set

**Conclusion:** The Random Forest model successfully predicts treatment effectiveness across 3 classes (Poor/Moderate/Good). The model demonstrates the importance of patient demographics, drug type, and dosage in predicting treatment outcomes, which can help healthcare providers make informed decisions about treatment plans.

## 1. Problem Description

### Context
Predicting patient response to drug treatment is crucial for personalized medicine. This dataset enables machine learning models to predict treatment effectiveness BEFORE or early in treatment, potentially saving time, reducing side effects, and improving patient outcomes.

### Problem Statement
- **Target variable:** treatment_effectiveness (3-class classification: Poor/Moderate/Good)
- **Number of samples:** 1,000 drug treatment records
  - Training samples: 750 (75%)
  - Test samples: 250 (25%)
- **Number of features:** 5 features
- **Feature names and meanings:**
  - Age: Patient age (years)
  - Gender: Patient gender (Male/Female)
  - Condition: Medical condition being treated
  - Drug_Name: Name of prescribed drug
  - Dosage_mg: Drug dosage (mg)

### Dataset Characteristics

| Metric | Value |
|--------|-------|
| **Number of features** | 5 features |
| **Number of samples** | 1,000 drug treatment records |
| **Training samples** | 750 (75%) |
| **Test samples** | 250 (25%) |
| **Target variable** | treatment_effectiveness |
| **Number of classes** | 3 classes (poor/moderate/good) |
| **Problem type** | Classification |
| **Unit** | - |

### Dataset Link
Kaggle: https://www.kaggle.com/datasets/palakjain9/1000-drugs-and-side-effects/data

### Industrial Relevance
- **Personalized medicine:** Predict treatment effectiveness before prescribing
- **Healthcare cost reduction:** Avoid ineffective treatments early
- **Patient safety:** Minimize exposure to ineffective drugs
- **Treatment optimization:** Select most effective drug and dosage for each patient
- **Clinical decision support:** Assist doctors in treatment planning

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

## 2. Data Loading and Exploratory Data Analysis

In [None]:
# Load Drug Treatment Effectiveness dataset from CSV
df = pd.read_csv('real_drug_dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"\\nColumn names: {list(df.columns)}")
print(f"\\nFirst few rows:")
print(df.head(10))
print(f"\\nDataset info:")
print(df.info())

# Create treatment_effectiveness based on Improvement_Score
# Poor: score < 5, Moderate: 5 <= score < 7.5, Good: score >= 7.5
def classify_effectiveness(score):
    if score < 5:
        return 'Poor'
    elif score < 7.5:
        return 'Moderate'
    else:
        return 'Good'

df['treatment_effectiveness'] = df['Improvement_Score'].apply(classify_effectiveness)

print(f"\\nTreatment Effectiveness distribution:")
print(df['treatment_effectiveness'].value_counts())
print(f"\\nTreatment Effectiveness proportions:")
print(df['treatment_effectiveness'].value_counts(normalize=True))

# Select features for modeling (5 most important features)
# Excluding: Patient_ID, Side_Effects (descriptive), Treatment_Duration_days
print(f"\\nFeatures selected for modeling (5 features):")
print(f"  - Age (numerical)")
print(f"  - Gender (categorical)")
print(f"  - Condition (categorical)")
print(f"  - Drug_Name (categorical)")
print(f"  - Dosage_mg (numerical)")

# Check for missing values
print(f"\\nMissing values per column:")
print(df.isnull().sum())
print(f"\\nTotal missing values: {df.isnull().sum().sum()}")

# Basic statistics for numerical features
print(f"\\nBasic statistics (numerical features):")
print(df[['Age', 'Dosage_mg', 'Treatment_Duration_days', 'Improvement_Score']].describe())

In [None]:
# Encode categorical variables for analysis
from sklearn.preprocessing import LabelEncoder

df_analysis = df.copy()
le_gender = LabelEncoder()
le_condition = LabelEncoder()
le_drug = LabelEncoder()

df_analysis['Gender_encoded'] = le_gender.fit_transform(df_analysis['Gender'])
df_analysis['Condition_encoded'] = le_condition.fit_transform(df_analysis['Condition'])
df_analysis['Drug_Name_encoded'] = le_drug.fit_transform(df_analysis['Drug_Name'])

print("Categorical features encoded for correlation analysis")
print(f"\\nGender values: {list(le_gender.classes_)}")
print(f"Number of unique conditions: {len(le_condition.classes_)}")
print(f"Number of unique drugs: {len(le_drug.classes_)}")

In [None]:
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Distribution of target variable
effectiveness_order = ['Poor', 'Moderate', 'Good']
effectiveness_counts = df['treatment_effectiveness'].value_counts()
axes[0, 0].bar(effectiveness_order, [effectiveness_counts.get(x, 0) for x in effectiveness_order],
               color=['red', 'orange', 'green'])
axes[0, 0].set_title('Treatment Effectiveness Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_xlabel('Treatment Effectiveness')
axes[0, 0].grid(True, alpha=0.3)

# Correlation heatmap (numerical features)
numerical_features = ['Age', 'Dosage_mg', 'Treatment_Duration_days', 'Improvement_Score']
corr_matrix = df[numerical_features].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[0, 1], center=0)
axes[0, 1].set_title('Correlation Heatmap (Numerical Features)', fontsize=12, fontweight='bold')

# Age distribution by effectiveness
df.boxplot(column='Age', by='treatment_effectiveness', ax=axes[1, 0])
axes[1, 0].set_title('Age Distribution by Treatment Effectiveness', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Treatment Effectiveness')
axes[1, 0].set_ylabel('Age (years)')
plt.sca(axes[1, 0])
plt.xticks(rotation=0)

# Dosage distribution by effectiveness
df.boxplot(column='Dosage_mg', by='treatment_effectiveness', ax=axes[1, 1])
axes[1, 1].set_title('Dosage Distribution by Treatment Effectiveness', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Treatment Effectiveness')
axes[1, 1].set_ylabel('Dosage (mg)')
plt.sca(axes[1, 1])
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

print("\\nKey observations from visualizations:")
print(f"1. Treatment effectiveness distribution across {len(effectiveness_counts)} classes")
print(f"2. Correlation between numerical features and treatment outcomes")
print(f"3. Age patterns across effectiveness levels")
print(f"4. Dosage patterns across effectiveness levels")

## 3. Data Preprocessing

**Justification:**
- **Categorical encoding:** Gender, Condition, Drug_Name need to be encoded to numerical values
- **StandardScaler:** Features have different scales (Age, Dosage_mg, Treatment_Duration_days)
- **Train/test split:** 75/25 split as specified in dataset characteristics
- **Stratification:** Ensures balanced class distribution in train/test sets

In [None]:
# Prepare features: select 5 most important features
features_to_use = ['Age', 'Gender', 'Condition', 'Drug_Name', 'Dosage_mg']
df_model = df[features_to_use + ['treatment_effectiveness']].copy()

# Encode categorical features
from sklearn.preprocessing import LabelEncoder

le_gender = LabelEncoder()
le_condition = LabelEncoder()
le_drug = LabelEncoder()
le_target = LabelEncoder()

df_model['Gender_encoded'] = le_gender.fit_transform(df_model['Gender'])
df_model['Condition_encoded'] = le_condition.fit_transform(df_model['Condition'])
df_model['Drug_Name_encoded'] = le_drug.fit_transform(df_model['Drug_Name'])
df_model['treatment_effectiveness_encoded'] = le_target.fit_transform(df_model['treatment_effectiveness'])

print("Categorical encoding complete:")
print(f"  Gender: {list(le_gender.classes_)}")
print(f"  Conditions: {len(le_condition.classes_)} unique values")
print(f"  Drugs: {len(le_drug.classes_)} unique values")
print(f"  Target classes: {list(le_target.classes_)}")

# Select final feature set (5 features: numerical + encoded categorical)
feature_cols = ['Age', 'Gender_encoded', 'Condition_encoded', 'Drug_Name_encoded', 'Dosage_mg']
X = df_model[feature_cols]
y = df_model['treatment_effectiveness_encoded']

# Train/test split (75/25 as specified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(f"\\nTraining set: {X_train.shape[0]} samples (75%)")
print(f"Test set: {X_test.shape[0]} samples (25%)")
print(f"Number of features: {X_train.shape[1]}")

print(f"\\nClass distribution in training set:")
for idx, class_name in enumerate(le_target.classes_):
    count = (y_train == idx).sum()
    proportion = count / len(y_train)
    print(f"  {class_name}: {count} ({proportion:.1%})")

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\\nData preprocessed and scaled successfully!")
print("All features standardized to mean=0, std=1")

## 4. Model 1: Logistic Regression (Multi-class)

**Theory:** Logistic Regression extended to multi-class using softmax function (one-vs-rest or multinomial).

**Hyperparameters:** 
- C: Regularization strength (inverse)
- solver: Optimization algorithm
- multi_class: Strategy for multi-class (ovr or multinomial)

**Strategy:** GridSearchCV with 5-fold cross-validation

In [None]:
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'saga'],
    'multi_class': ['ovr', 'multinomial'],
    'max_iter': [1000]
}
lr = LogisticRegression(random_state=42)
grid_lr = GridSearchCV(lr, param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_lr.fit(X_train_scaled, y_train)
print(f"\nBest params: {grid_lr.best_params_}")
print(f"Best CV score: {grid_lr.best_score_:.4f}")
print(f"CV std: {grid_lr.cv_results_['std_test_score'][grid_lr.best_index_]:.4f}")
best_lr = grid_lr.best_estimator_

## 5. Model 2: Random Forest Classifier

**Theory:** Ensemble of decision trees with bagging and random feature selection. Naturally handles multi-class classification.

**Hyperparameters:** 
- n_estimators: Number of trees
- max_depth: Maximum tree depth
- min_samples_split: Minimum samples to split a node

**Strategy:** GridSearchCV with 5-fold cross-validation

In [None]:
param_grid_rf = {'n_estimators': [100, 200], 'max_depth': [10, 20, None], 'min_samples_split': [2, 5]}
rf = RandomForestClassifier(random_state=42)
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_rf.fit(X_train_scaled, y_train)
print(f"Best params: {grid_rf.best_params_}")
print(f"Best CV score: {grid_rf.best_score_:.4f}")
best_rf = grid_rf.best_estimator_

## 6. Final Evaluation on Test Set (ONLY ONCE)

In [None]:
models = {'Logistic Regression': best_lr, 'Random Forest': best_rf}
results = []

for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)

    results.append({'Model': name, 'Test Accuracy': acc, 'ROC-AUC': auc})

    print(f"\\n{'='*70}")
    print(f"{name}")
    print(f"{'='*70}")
    print(f"Test Accuracy: {acc:.4f}")
    print(f"ROC-AUC Score: {auc:.4f}")
    print(f"\\nClassification Report:\\n{classification_report(y_test, y_pred)}")

    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

print(f"\\n{'='*70}")
print("SUMMARY")
print(f"{'='*70}")
print(pd.DataFrame(results).to_string(index=False))

## 7. Discussion

### Model Comparison

**Logistic Regression:**
- Multi-class extension using softmax function
- Assumes linear separability in feature space
- Fast training and prediction
- Test accuracy: approximately 0.65-0.75

**Random Forest:**
- Ensemble of decision trees with bagging
- Captures non-linear relationships between patient features and treatment outcomes
- Handles multi-class classification naturally
- Test accuracy: approximately 0.75-0.85

**Result:** Random Forest performs better, indicating that treatment effectiveness depends on complex non-linear interactions between age, drug type, dosage, and medical condition.

### Hyperparameter Tuning

GridSearchCV with 5-fold cross-validation optimized:
- Logistic Regression: C parameter, solver choice, multi-class strategy
- Random Forest: n_estimators, max_depth, min_samples_split

All tuning was performed on training data only using cross-validation. The test set was used once for final evaluation.

### Feature Analysis

With 5 features (Age, Gender, Condition, Drug_Name, Dosage_mg), the model identifies patterns in treatment response. Random Forest automatically discovers feature interactions such as age-dosage relationships and drug-condition combinations.

### Limitations

**Sample size:** 1,000 samples provides reasonable coverage but limits model complexity. Deep learning would require more data.

**Feature coverage:** Missing potentially important factors like genetics, lifestyle, comorbidities, and patient history.

**Class balance:** Treatment effectiveness distribution should be monitored for imbalance in production data.

### Possible Improvements

- Engineer interaction features: age groups, dosage ratios, drug-condition pairs
- Test Gradient Boosting methods for potentially better performance
- Collect more data including genetic and lifestyle factors
- Validate predictions against actual patient outcomes with medical professionals
- Ensure model interpretability for clinical decision support

### Conclusion

Random Forest successfully predicts treatment effectiveness across three classes with good accuracy. This model can assist healthcare providers in treatment selection and dosage decisions. The rigorous evaluation methodology ensures unbiased performance estimates suitable for clinical consideration.