<a href="https://colab.research.google.com/github/kritonai/heartdisease/blob/main/student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart Failure Prediction - Model Training & Evaluation**

## Let's Build Some Models! 🚀

In this notebook, we'll:
- Split our data into training and test sets
- Train multiple machine learning models
- Evaluate and compare their performance
- Select the best model for heart disease prediction

<img src="https://miro.medium.com/v2/resize:fit:1400/1*cG6U1qstYDijh9bPL42e-Q.jpeg" width="600">

# Importing Libraries

In [None]:
# Data handling
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc,
    roc_auc_score
)

# Set style
plt.style.use('ggplot')
sns.set_palette("husl")

print("All libraries imported successfully! ✓")

All libraries imported successfully! ✓


## Suppose df_label is the dataframe at the end of your data preprocessing step.

# Train-Test Split: Why Do We Need It?

<img src="https://miro.medium.com/v2/resize:fit:1400/1*-8_kogvwmL1H6ooN1A1tsQ.png" width="600">

## The Problem
If we train and test on the same data, our model will **memorize** the answers instead of **learning patterns**. This is like:
- 📚 Memorizing exam questions instead of understanding concepts
- 🎮 Playing a game level you've already seen vs. a new level

## The Solution: Train-Test Split

We split our data into two parts:

### Training Set (typically 70-80%)
- Used to **train** the model
- Model learns patterns from this data
- Model **sees the answers** here

### Test Set (typically 20-30%)
- Used to **evaluate** the model
- Model has **never seen** this data before
- Tells us how well the model generalizes to new patients

## Important Rules 🚨
1. **Never train on test data** - That's cheating!
2. **Split BEFORE any preprocessing** that uses statistics (like scaling)
3. **Use the same split** for fair comparison between models
4. **Stratify** when classes are imbalanced (keeps the same proportion in both sets)

## How It Works
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
- `test_size=0.2`: Use 20% for testing, 80% for training
- `random_state=42`: Makes the split reproducible (same split every time)
- `stratify=y`: Keeps class proportions balanced in both sets
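
As a rough illustration of what `stratify` does, the sketch below splits a small, imbalanced label vector and checks the class proportions afterwards. The arrays `X_demo` and `y_demo` are made-up toy data (an assumption for this example), not the heart disease dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 20% positive class (not the real dataset)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# Stratified split: train and test keep the same share of positives
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters most when one class is rare.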

## Train-Test Split

In [None]:
# Split your dataset into training and testing subsets. Use a test size of 20%. Research what "stratify" is and decide if you should implement it in your case

# YOUR CODE HERE



# Feature Scaling: Making Features Comparable


## Why Scale?

Imagine you're comparing patients:
- Age: 28-77 years
- Cholesterol: 100-600 mg/dl
- Oldpeak: 0-6.2

Without scaling, Cholesterol dominates any distance-based calculation simply because its values span a much larger range!

## Which Models Need Scaling?

### ✅ Need Scaling:
- **KNN**: Uses distances between points
- **SVM**: Uses distances to find decision boundary
- **Logistic Regression**: The solver converges faster and regularization treats features evenly when they are scaled
- **Naive Bayes**: `GaussianNB` models each feature separately, so scaling rarely changes its predictions; we use the scaled data to keep the comparison consistent

### ❌ Don't Need Scaling:
- **Decision Trees**: Split on thresholds, not distances
- **Random Forest**: Ensemble of decision trees

## Important: Scale AFTER Splitting! 🚨

```python
# ❌ WRONG - Information leakage!
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)

# ✅ CORRECT - No leakage
X_train, X_test = train_test_split(X)
scaler.fit(X_train)  # Learn only from training
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
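
A minimal, self-contained sketch of the leak-free pattern is shown below. It runs on randomly generated columns whose ranges loosely mimic Age and Cholesterol; the names `X_toy` and `scaler_toy` are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales (not the real dataset)
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.uniform(28, 77, size=200),    # age-like column
    rng.uniform(100, 600, size=200),  # cholesterol-like column
])

X_train_toy, X_test_toy = train_test_split(X_toy, test_size=0.2, random_state=42)

scaler_toy = StandardScaler()
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)  # statistics come from train only
X_test_toy_scaled = scaler_toy.transform(X_test_toy)        # reuse the train statistics

print("Train means:", X_train_toy_scaled.mean(axis=0).round(3))
print("Train stds: ", X_train_toy_scaled.std(axis=0).round(3))
print("Test means: ", X_test_toy_scaled.mean(axis=0).round(3))
```

The test means come out close to zero but not exactly zero. That is expected: the test set is transformed with the training statistics, not its own.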

In [None]:
# YOUR CODE HERE

# Identify numerical columns that need scaling

# Initialize the scaler

# Fit on training data ONLY

# Transform both train and test

# Print the numerical columns that you scaled, before and after the scaling

# Machine Learning Models Overview

You'll train 6 different models. Each has different strengths!

| Model | Type | Best For | Key Strength |
|-------|------|----------|-------------|
| **Logistic Regression** | Linear | Baseline, interpretability | Simple, fast, probabilistic |
| **Decision Tree** | Tree | Non-linear patterns | Easy to visualize and interpret |
| **Random Forest** | Ensemble | Robust predictions | Reduces overfitting, handles complexity |
| **Naive Bayes** | Probabilistic | Small datasets | Fast, works with high dimensions |
| **SVM** | Margin-based | Complex boundaries | Powerful for non-linear problems |
| **KNN** | Instance-based | Local patterns | Simple, no training phase |

<img src="https://miro.medium.com/v2/resize:fit:1400/1*m6kKsW0O-wWH8Xg_ZA3v8w.png" width="700">

# 1️⃣ Logistic Regression (Baseline Model)

## What is it?
Despite its name, Logistic Regression is used for **classification**, not regression!

## How it works:
- Draws a **linear decision boundary** between classes
- Outputs **probabilities** (0 to 1) instead of just class labels
- Uses a sigmoid function to squash predictions between 0 and 1
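
To make the sigmoid step concrete, here is a small sketch of how a linear score becomes a probability and then a class label. The weights, bias, and patient values are invented for illustration; a fitted `LogisticRegression` would learn its own.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and one patient's (already scaled) features
weights = np.array([0.8, -0.5])
bias = 0.1
patient = np.array([1.2, 0.4])

score = weights @ patient + bias          # linear part: 0.8*1.2 - 0.5*0.4 + 0.1
probability = sigmoid(score)              # squashed into (0, 1)
predicted_class = int(probability >= 0.5) # default decision threshold

print(f"score={score:.2f}, probability={probability:.3f}, class={predicted_class}")
```

A score of exactly 0 maps to a probability of 0.5, which is why the linear decision boundary sits where the score changes sign.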

## Why start here?
- ✅ Simple and fast
- ✅ Good baseline to beat
- ✅ Provides probability estimates
- ✅ Interpretable coefficients

## When to use:
- Linear relationships between features
- Need probability scores
- Want to understand feature importance

<img src="https://miro.medium.com/v2/resize:fit:828/1*dm6ZaX5fuSmuVvM4Ds-vcg.gif" width="400">

In [None]:
# Initialize the model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
log_reg.fit(X_train_scaled, y_train_onehot)

# Make predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba_log = log_reg.predict_proba(X_test_scaled)[:, 1]  # Probability of class 1

# Calculate accuracy
accuracy_log = accuracy_score(y_test_onehot, y_pred_log)

print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_log:.4f} ({accuracy_log*100:.2f}%)")
print("\nFirst 10 predictions vs actual:")
comparison = pd.DataFrame({
    'Actual': y_test_onehot.values[:10],
    'Predicted': y_pred_log[:10],
    'Probability': y_pred_proba_log[:10]
})
print(comparison)

# 2️⃣ Decision Tree

## What is it?
A tree of yes/no questions that leads to a prediction!

## How it works:
```
Is Oldpeak > 1.0?
├── Yes → Is MaxHR < 140?
│   ├── Yes → Heart Disease ❤️
│   └── No → Healthy ✅
└── No → Healthy ✅
```
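
The diagram above is just nested if/else logic. The sketch below writes it out by hand; the thresholds are the illustrative ones from the diagram, not values learned from the data, and `toy_tree_predict` is a hypothetical helper name.

```python
def toy_tree_predict(oldpeak, max_hr):
    """Hand-written version of the two-question tree in the diagram."""
    if oldpeak > 1.0:
        if max_hr < 140:
            return "Heart Disease"
        return "Healthy"
    return "Healthy"

# Three hypothetical patients, one per leaf of the tree
print(toy_tree_predict(oldpeak=2.0, max_hr=120))
print(toy_tree_predict(oldpeak=2.0, max_hr=160))
print(toy_tree_predict(oldpeak=0.5, max_hr=120))
```

A real `DecisionTreeClassifier` chooses the questions and thresholds itself by searching for the splits that best separate the classes.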

## Advantages:
- ✅ Very interpretable (you can visualize the tree!)
- ✅ Handles non-linear relationships
- ✅ No feature scaling needed
- ✅ Can capture complex interactions

## Disadvantages:
- ❌ Prone to overfitting (memorizing training data)
- ❌ Unstable (small data changes = different tree)
- ❌ Can create overly complex trees

<img src="https://miro.medium.com/v2/resize:fit:1170/1*XMId5sJqPtm8-RIwVVz2tg.png" width="500">

In [None]:
## YOUR CODE HERE


# 3️⃣ Random Forest

## What is it?
**"Wisdom of the Crowd"** - Many decision trees voting together!

## How it works:
1. Create 100 different decision trees (each sees random subset of data)
2. Each tree makes a prediction
3. Take a **majority vote** (most common prediction wins)

```
Tree 1: Heart Disease ❤️
Tree 2: Heart Disease ❤️
Tree 3: Healthy ✅         } → Final: Heart Disease ❤️
Tree 4: Heart Disease ❤️
...
Tree 100: Heart Disease ❤️
```
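
The voting step can be sketched in a few lines. The list of votes is made up for illustration, and `majority_vote` is a hypothetical helper. As far as we know, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so treat this as a conceptual picture only.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the individual tree predictions."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from five trees for one patient
tree_votes = ["Heart Disease", "Heart Disease", "Healthy", "Heart Disease", "Healthy"]
final_prediction = majority_vote(tree_votes)
print(final_prediction)  # prints: Heart Disease
```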

## Why it's powerful:
- ✅ Reduces overfitting (trees average out their mistakes)
- ✅ More stable than single decision tree
- ✅ Handles complex patterns
- ✅ Provides feature importance
- ✅ One of the best "out-of-the-box" models

## Trade-offs:
- ❌ Less interpretable (can't visualize 100 trees easily)
- ❌ Slower to train and predict
- ❌ Needs more memory

<img src="https://miro.medium.com/v2/resize:fit:1092/1*i0o8mjFfCn-uD79-F1Cqkw.png" width="600">

In [None]:
# YOUR CODE HERE

# 4️⃣ Naive Bayes

## What is it?
A probabilistic model based on Bayes' Theorem. It calculates the probability of a patient having heart disease given their features.

## How it works:
For each patient, it calculates:
```
P(Heart Disease | Features) = P(Features | Heart Disease) × P(Heart Disease) / P(Features)
```

## The "Naive" Part:
Assumes all features are **independent** (they don't influence each other)
- Example: Assumes Age and Cholesterol are independent
- This is often false but works surprisingly well anyway!
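
Here is the same formula worked through with numbers. All probabilities below are invented to keep the arithmetic easy to follow; they are not estimates from the dataset.

```python
# Hypothetical probabilities, chosen only to make the arithmetic easy to follow
p_disease = 0.5                       # P(Heart Disease), the prior
p_healthy = 1 - p_disease

# P(Features | class) for two features, multiplied as if independent (the "naive" step)
p_features_given_disease = 0.8 * 0.6  # P(f1 | disease) * P(f2 | disease)
p_features_given_healthy = 0.3 * 0.4  # P(f1 | healthy) * P(f2 | healthy)

# P(Features) via the law of total probability
p_features = (p_features_given_disease * p_disease
              + p_features_given_healthy * p_healthy)

posterior_disease = p_features_given_disease * p_disease / p_features
print(f"P(Heart Disease | Features) = {posterior_disease:.2f}")  # prints 0.80
```

`GaussianNB` follows the same recipe, except that each `P(feature | class)` comes from a normal distribution fitted to that feature within the class.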

## Advantages:
- ✅ Very fast to train and predict
- ✅ Works well with small datasets
- ✅ Good with high-dimensional data
- ✅ Provides probability estimates

## When to use:
- Small training datasets
- Need fast predictions
- Text classification (spam detection, sentiment analysis)

<img src="https://miro.medium.com/v2/resize:fit:1400/1*tjcmj9cDQ-rHXAtxCu5bRQ.png" width="500">

In [None]:
## YOUR CODE HERE

# 5️⃣ Support Vector Machine (SVM)

## What is it?
Finds the **best line** (or hyperplane) that separates the two classes with the **maximum margin**.

## How it works:
1. Find the line that separates classes
2. Maximize the distance to the nearest points from each class
3. These nearest points are called **support vectors**

## The Kernel Trick 🎩:
Can handle non-linear boundaries using kernels:
- **Linear**: Straight line separation
- **RBF (Radial Basis Function)**: Curved, complex boundaries
- **Polynomial**: Polynomial curves
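
To see why the kernel choice matters, the sketch below fits a linear and an RBF SVM on scikit-learn's synthetic "circles" data, where one class forms a ring around the other. This data is an assumption chosen for illustration and has nothing to do with the heart disease features.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other (not linearly separable)
X_circ, y_circ = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=42)

kernel_scores = {}
for kernel in ["linear", "rbf"]:
    model = SVC(kernel=kernel)
    model.fit(X_circ, y_circ)
    # Training accuracy, used here for illustration only
    kernel_scores[kernel] = model.score(X_circ, y_circ)

print(kernel_scores)
```

No straight line can separate a ring from its center, so the linear kernel should score far lower than RBF here. On real data the gap is usually smaller, which is why the kernel is worth tuning.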

## Advantages:
- ✅ Effective in high-dimensional spaces
- ✅ Works well with clear margin of separation
- ✅ Versatile (different kernels for different problems)
- ✅ Can resist overfitting when `C` and `gamma` are tuned well

## Disadvantages:
- ❌ Slow with large datasets (>10,000 samples)
- ❌ Sensitive to feature scaling (must scale!)
- ❌ Hard to interpret
- ❌ Needs tuning of hyperparameters (C, gamma)

<img src="https://miro.medium.com/v2/resize:fit:1400/1*ZpkLQf2FNfzfH4HXeMw4MQ.png" width="500">

In [None]:
# YOUR CODE HERE

# 6️⃣ K-Nearest Neighbors (KNN)

## What is it?
**"You are the average of your 5 closest friends"** - looks at nearby data points to make predictions!

## How it works:
For a new patient:
1. Find the K nearest neighbors (patients with similar features)
2. Look at their diagnoses (heart disease or healthy)
3. Take a **majority vote**

```
New Patient: ?

5 Nearest Neighbors:
1. Heart Disease ❤️
2. Heart Disease ❤️
3. Healthy ✅          } → Prediction: Heart Disease ❤️
4. Heart Disease ❤️   (3 votes vs 2 votes)
5. Healthy ✅
```
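
The three steps can be written from scratch in a few lines of NumPy. This is a teaching sketch with made-up points and a hypothetical helper `knn_predict`; in the exercise you would use `KNeighborsClassifier` instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_known, y_known, x_new, k=5):
    """Classify x_new by majority vote among its k nearest known points."""
    distances = np.linalg.norm(X_known - x_new, axis=1)  # Euclidean distance to every known point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    return Counter(y_known[nearest].tolist()).most_common(1)[0][0]

# Hypothetical, already-scaled patients: 1 = heart disease, 0 = healthy
X_known = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_known = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_known, y_known, np.array([0.95, 1.0]), k=3))  # prints 1
```

Because the prediction depends directly on raw distances, a single unscaled feature with a large range would dominate which neighbors count as "nearest".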

## Choosing K:
- **K=1**: Just copy nearest neighbor (overfitting!)
- **K=5**: More stable, less overfitting
- **K=100**: Too smooth, underfitting
- **Rule of thumb**: K = √(number of samples)

## Advantages:
- ✅ Simple to understand
- ✅ No training phase (just store data)
- ✅ Can handle complex decision boundaries
- ✅ Naturally handles multi-class problems

## Disadvantages:
- ❌ Slow prediction (must compare to all training points)
- ❌ Needs feature scaling (distances must be comparable)
- ❌ Sensitive to irrelevant features
- ❌ Doesn't work well in high dimensions (curse of dimensionality)

<img src="https://miro.medium.com/v2/resize:fit:1400/1*wW8O-0xVQUFhBGexx2B6hg.png" width="500">

In [None]:
# YOUR CODE HERE

# Model Evaluation: Beyond Accuracy

## Why Accuracy Isn't Enough

Imagine a model that predicts **everyone has heart disease**:
- If 55% of patients have heart disease → **55% accuracy!**
- But it's completely useless!

We need more metrics to truly understand performance.

<img src="https://miro.medium.com/v2/resize:fit:1400/1*LQ1YMKBlbDhH9K6Ujz8QTw.jpeg" width="600">

## Confusion Matrix: The Foundation

A confusion matrix shows 4 types of predictions:

```
                    Predicted
                 Healthy  Disease
Actual  Healthy    TN       FP      
        Disease    FN       TP
```

- **True Positive (TP)**: Correctly predicted disease ✅
- **True Negative (TN)**: Correctly predicted healthy ✅
- **False Positive (FP)**: Predicted disease, actually healthy ❌ (Type 1 Error)
- **False Negative (FN)**: Predicted healthy, actually disease ❌ (Type 2 Error)

### Medical Context:
- **FP (False Alarm)**: Tell healthy person they're sick → unnecessary worry/tests
- **FN (Missed Disease)**: Tell sick person they're healthy → dangerous!

In medical diagnosis, **False Negatives are usually worse** than False Positives!
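
As a quick check of how the four cells are counted, the sketch below builds a confusion matrix from ten made-up labels (0 = healthy, 1 = disease). The label lists are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = healthy, 1 = disease
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # prints: TN=3, FP=1, FN=2, TP=4
```

Here the two false negatives are the sick patients the model sent home as healthy, which is the costly error in this setting.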

## Key Metrics Explained

### 1. Accuracy
**"Overall, how often is the model correct?"**
```
Accuracy = (TP + TN) / (TP + TN + FP + FN)
```
- Good when classes are balanced
- Misleading with imbalanced classes

### 2. Precision
**"When it predicts disease, how often is it right?"**
```
Precision = TP / (TP + FP)
```
- Important when False Positives are costly
- Example: Avoid unnecessary surgeries

### 3. Recall (Sensitivity)
**"Of all actual disease cases, how many did we catch?"**
```
Recall = TP / (TP + FN)
```
- Important when False Negatives are dangerous
- Example: Don't miss cancer patients

### 4. F1-Score
**"Balanced measure of Precision and Recall"**
```
F1 = 2 × (Precision × Recall) / (Precision + Recall)
```
- Harmonic mean of Precision and Recall
- Good single metric when you need balance

### Trade-off Example:
```
Model A: High Precision (90%), Low Recall (50%)
→ When it says disease, it's usually right, but misses many cases

Model B: Low Precision (60%), High Recall (95%)
→ Catches almost all cases, but many false alarms

Which is better? Depends on the cost of errors!
```
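
The four formulas above can be checked by hand with a small worked example. The counts below are invented for illustration.

```python
# Hypothetical confusion-matrix counts for one model (100 patients in total)
tp, tn, fp, fn = 40, 35, 10, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 75 / 100
precision = tp / (tp + fp)                          # 40 / 50
recall = tp / (tp + fn)                             # 40 / 55
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Note that F1 lands between precision and recall but closer to the lower of the two, which is the point of using a harmonic mean.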

In [None]:
def evaluate_model(name, y_true, y_pred, y_pred_proba=None):
    """
    Comprehensive evaluation of a classification model
    """
    print(f"\n{'='*60}")
    print(f"  {name} - Detailed Evaluation")
    print(f"{'='*60}\n")

    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    # Print metrics
    print("Metrics:")
    print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  Precision: {precision:.4f} ({precision*100:.2f}%)")
    print(f"  Recall:    {recall:.4f} ({recall*100:.2f}%)")
    print(f"  F1-Score:  {f1:.4f}\n")

    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    print("Confusion Matrix:")
    print(f"                Predicted")
    print(f"              Healthy  Disease")
    print(f"Actual Healthy   {cm[0,0]:3d}     {cm[0,1]:3d}")
    print(f"       Disease   {cm[1,0]:3d}     {cm[1,1]:3d}\n")

    # Interpretation
    tn, fp, fn, tp = cm.ravel()
    print(f"True Negatives (TN):  {tn} - Correctly predicted healthy")
    print(f"False Positives (FP): {fp} - Healthy predicted as disease")
    print(f"False Negatives (FN): {fn} - Disease predicted as healthy (DANGEROUS!)")
    print(f"True Positives (TP):  {tp} - Correctly predicted disease\n")

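    # Classification Report
    print("Classification Report:")
    print(classification_report(y_true, y_pred, target_names=['Healthy', 'Disease']))

    # ROC AUC needs predicted probabilities, so compute it only when they are provided
    roc_auc = roc_auc_score(y_true, y_pred_proba) if y_pred_proba is not None else None
    if roc_auc is not None:
        print(f"ROC AUC: {roc_auc:.4f}\n")

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'confusion_matrix': cm
    }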

## Evaluating All Models

In [None]:
# Store results for comparison
results = {}

# Evaluate Logistic Regression
results['Logistic Regression'] = evaluate_model(
    'Logistic Regression',
    y_test_onehot,
    y_pred_log,
    y_pred_proba_log
)

# YOUR CODE HERE

# Evaluate Decision Tree

# Evaluate Random Forest

# Evaluate Naive Bayes

# Evaluate SVM

# Evaluate KNN

# Model Comparison

Let's compare all models side by side to find our best performer!

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'Precision': [results[m]['precision'] for m in results.keys()],
    'Recall': [results[m]['recall'] for m in results.keys()],
    'F1-Score': [results[m]['f1'] for m in results.keys()]
})

# Sort by F1-Score (balanced metric)
comparison_df = comparison_df.sort_values('F1-Score', ascending=False)

print("\n" + "="*80)
print("                         MODEL COMPARISON")
print("="*80 + "\n")
print(comparison_df.to_string(index=False))
print("\n" + "="*80)

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']

for idx, (ax, metric, color) in enumerate(zip(axes.flat, metrics, colors)):
    data = comparison_df.sort_values(metric, ascending=True)
    ax.barh(data['Model'], data[metric], color=color, alpha=0.7)
    ax.set_xlabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    ax.set_xlim([0, 1])

    # Add value labels
    for i, v in enumerate(data[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center')

plt.tight_layout()
plt.show()

**Your Interpretation Here:**

Which model performed best? Why? What metrics did you prioritize and why?

# Visualizing Confusion Matrices

In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for idx, (model_name, result) in enumerate(results.items()):
    cm = result['confusion_matrix']

    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Healthy', 'Disease'],
                yticklabels=['Healthy', 'Disease'],
                ax=axes[idx], cbar=False)

    axes[idx].set_title(f'{model_name}\nF1: {result["f1"]:.3f}',
                        fontsize=12, fontweight='bold')
    axes[idx].set_ylabel('Actual', fontsize=10)
    axes[idx].set_xlabel('Predicted', fontsize=10)

plt.tight_layout()
plt.show()

# Feature Importance: What Matters Most?

Tree-based models (Decision Tree, Random Forest) can tell us **which features** are most important for predictions!

## How it works:
- Measures how much each feature **reduces uncertainty** (impurity) when the trees split on it
- Higher importance = splits on that feature account for more of the total impurity reduction
- Helps understand what the model learned
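
The sketch below shows impurity-based importance on synthetic data where we know the answer in advance: one column determines the label and the other is pure noise. The column names and data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the first column
rng = np.random.default_rng(42)
n = 500
informative = rng.normal(size=n)  # drives the label
noise = rng.normal(size=n)        # unrelated to the label
X_imp = np.column_stack([informative, noise])
y_imp = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_imp, y_imp)

# Importances are normalized so that they sum to 1
for name, importance in zip(["informative", "noise"], forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

On real data the picture is rarely this clean. Correlated features share importance between them, so a low score does not always mean a feature is irrelevant.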

## Why it's useful:
- ✅ Understand model decisions
- ✅ Identify key risk factors
- ✅ Feature selection (remove unimportant features)
- ✅ Medical insights (which factors are associated with heart disease?)

In [None]:
# Get feature importance from Random Forest (usually most reliable)
feature_importance = pd.DataFrame({
    'Feature': X_train_label.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance (Random Forest):")
print("="*50)
print(feature_importance.to_string(index=False))

# Visualize
plt.figure(figsize=(12, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'],
         color='steelblue', alpha=0.7)
plt.xlabel('Importance', fontsize=12, fontweight='bold')
plt.title('Feature Importance - Random Forest', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

**Your Interpretation Here:**

Which features are most important? Does this align with medical knowledge? Are there any surprises?

# Cross-Validation: More Robust Evaluation

## The Problem with Single Train-Test Split:
- Results depend on **which samples** ended up in test set
- Lucky/unlucky split can make models look better/worse
- Not using all data for training AND testing

## Cross-Validation Solution:

**K-Fold Cross-Validation** (typically K=5 or K=10):

```
Fold 1: [Test] [Train] [Train] [Train] [Train]
Fold 2: [Train] [Test] [Train] [Train] [Train]
Fold 3: [Train] [Train] [Test] [Train] [Train]
Fold 4: [Train] [Train] [Train] [Test] [Train]
Fold 5: [Train] [Train] [Train] [Train] [Test]

Final Score = Average of all 5 folds
```

## Benefits:
- ✅ More reliable estimate of performance
- ✅ Uses all data for both training and testing
- ✅ Reduces variance in results
- ✅ Shows standard deviation (how stable is the model?)

## Trade-off:
- Takes K times longer (trains K models instead of 1)
- But gives much more confidence in results!
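
One way to run K-fold cross-validation is sketched below on synthetic data from `make_classification` (an assumption for this example). The scaler sits inside a pipeline so that it is refit on the training part of every fold, which keeps validation data out of the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the real dataset)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling happens inside each fold because it is part of the pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(pipeline, X_cv, y_cv, cv=folds, scoring="f1")

print("Fold F1 scores:", cv_scores.round(3))
print(f"Mean F1: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The spread between the five fold scores gives a feel for how much a single train-test split could mislead you.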

<img src="https://miro.medium.com/v2/resize:fit:1400/1*rgba1BIOUys7wQcXcL4U5A.png" width="500">

In [None]:
# Perform 5-fold cross-validation on our best models
print("\n" + "="*70)
print("                    5-FOLD CROSS-VALIDATION RESULTS")
print("="*70 + "\n")

cv_results = {}



# Logistic Regression
log_scores = cross_val_score(log_reg, X_train_scaled, y_train_onehot, cv=5, scoring='f1')
cv_results['Logistic Regression'] = log_scores
print(f"Logistic Regression:")
print(f"  Scores: {log_scores}")
print(f"  Mean F1: {log_scores.mean():.4f} (+/- {log_scores.std() * 2:.4f})\n")

# YOUR CODE HERE

# SVM

# Random Forest

## **Training Set, Validation Set, Testing Set**

When we build machine learning models, we do not use all the data in the same way.  
We split the data so that learning, model selection, and final evaluation are kept separate.  
Each split has a specific role and answers a different question.

- The **training set** is used to learn the model parameters.
- The **validation set** is used to compare models or hyperparameters and decide which one is better.
- The **test set** is used only once, at the end, to estimate how well the chosen model generalizes to unseen data.
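
A three-way split can be built from two calls to `train_test_split`, as in the sketch below. The 60/20/20 proportions and the toy arrays are assumptions for illustration; other ratios are equally valid.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples with balanced labels
X_all = np.arange(2000).reshape(1000, 2)
y_all = np.array([0, 1] * 500)

# Step 1: hold out 20% as the final test set
X_temp, X_test_3, y_temp, y_test_3 = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Step 2: take 25% of the remaining 80% as validation (20% of the total)
X_train_3, X_val_3, y_train_3, y_val_3 = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train_3), len(X_val_3), len(X_test_3))  # prints: 600 200 200
```

With cross-validation, the validation role is played by the rotating folds of the training set, so only the train and test split has to be made by hand.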


<img src="https://miro.medium.com/1*OECM6SWmlhVzebmSuvMtBg.png">

# Hyperparameter Tuning: Optimizing Performance

## What are Hyperparameters?

Settings you choose **before** training that control how the model learns:

### Random Forest Hyperparameters:
- `n_estimators`: Number of trees (more trees usually give more stable predictions, with diminishing returns and longer training)
- `max_depth`: How deep each tree can grow
- `min_samples_split`: Minimum samples needed to split a node
- `min_samples_leaf`: Minimum samples in a leaf node

### KNN Hyperparameters:
- `n_neighbors`: Number of neighbors (K value)
- `weights`: All equal or weighted by distance?
- `metric`: Distance metric to use

### SVM Hyperparameters:
- `C`: Inverse regularization strength (higher = weaker regularization, more complex boundary)
- `gamma`: Kernel coefficient (higher = each training point's influence reaches less far)
- `kernel`: Type of decision boundary (linear, rbf, poly)

## Grid Search:
Try **all combinations** of hyperparameters and pick the best!

```python
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15]
}
# Tests: 3 × 3 = 9 combinations
```
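
If you want to check the size of a grid before running a search, scikit-learn's `ParameterGrid` can enumerate it. The sketch below counts the combinations and the resulting number of model fits, assuming 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

small_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15],
}

n_combinations = len(ParameterGrid(small_grid))
cv_folds = 5  # assumed number of cross-validation folds
n_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {n_fits} model fits")
# prints: 9 combinations x 5 folds = 45 model fits
```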

## Warning:
- Can be SLOW with many parameters
- Use small grid for learning
- Cross-validation inside makes it even slower!

### What happens in this GridSearchCV example (mental model)

- Start with a **train / test split**
  - `X_train_label, y_train_label` → used for tuning and training
  - `X_test_label, y_test_label` → kept untouched until the end

- Define a **hyperparameter grid**
  - 108 different Random Forest configurations
  - Each configuration is a different “model recipe”

- Call `grid_search.fit(X_train_label, y_train_label)`
  - GridSearchCV only sees the **training set**

- Inside GridSearchCV:
  - Split the training set into **5 CV folds**
  - For each hyperparameter combination:
    - Train on 4 folds
    - Validate on 1 fold
    - Repeat until each fold has been validation once
    - Average the 5 F1 scores → one CV score per combination

- Compare all combinations
  - Select the one with the **highest mean CV F1**

- **Refit step (automatic)**
  - Take the best hyperparameters
  - Train one final model on **100% of the training data**
  - Store it as `best_estimator_`

- Final evaluation
  - Use `best_estimator_` to predict on the **test set**
  - Compute test F1 once
  - This is the honest performance estimate

- Key takeaway
  - CV folds are **validation**, not the real test
  - The test set is used **only once**, at the very end
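
The same flow can be sketched end to end on a small scale. The example below tunes a KNN pipeline on synthetic data; the data, the grid values, and the variable names are all assumptions for illustration, not the settings you must use.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a stratified train/test split (not the real dataset)
X_gs, y_gs = make_classification(n_samples=400, n_features=8, random_state=42)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=42, stratify=y_gs
)

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
knn_grid = {"knn__n_neighbors": [3, 5, 11], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, knn_grid, cv=5, scoring="f1")
search.fit(X_gs_train, y_gs_train)  # only the training set is seen here

print("Best params:", search.best_params_)
print(f"Best CV F1: {search.best_score_:.3f}")

# The refitted best model is evaluated on the test set once, at the end
test_f1_demo = f1_score(y_gs_test, search.predict(X_gs_test))
print(f"Test F1: {test_f1_demo:.3f}")
```

A test F1 slightly below the best CV F1 is normal, because the CV score belongs to the winner of a comparison and is therefore a little optimistic.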


In [None]:
# Hyperparameter tuning for Random Forest
print("Hyperparameter Tuning for Random Forest...\n")

# Define parameter grid (keeping it small for speed)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='f1',            # Optimize for F1-score
    n_jobs=-1,               # Use all CPU cores
    verbose=1
)

# Fit (this might take a while!)
print(f"Testing {len(param_grid['n_estimators']) * len(param_grid['max_depth']) * len(param_grid['min_samples_split']) * len(param_grid['min_samples_leaf'])} combinations...")
grid_search.fit(X_train_label, y_train_label)

# Best parameters
print("\n" + "="*60)
print("BEST PARAMETERS FOUND:")
print("="*60)
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest F1-Score (Cross-Validation): {grid_search.best_score_:.4f}")

# Test on test set
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test_label)
test_f1 = f1_score(y_test_label, y_pred_best)
print(f"Test Set F1-Score: {test_f1:.4f}")
print("="*60)


## Your Task:
Perform hyperparameter tuning with GridSearchCV for your **top 3 models** so far (the ones with the best scores in the cross-validation you ran above).

- Use the hyperparameters listed above for Random Forest, SVM, and KNN. If another model is in your top 3, research which of its hyperparameters matter most for tuning. You **don't need** 108 combinations for every model; 20 or fewer is fine.


In [None]:
# Hyperparameter tuning for 2nd Model
print("Hyperparameter Tuning for 2nd Model...\n")

# Define parameter grid for 2nd Model
# YOUR CODE HERE



# Initialize GridSearchCV
# YOUR CODE HERE

# Fit (this might take a while!)
# YOUR CODE HERE

# Best parameters
# YOUR CODE HERE

# Test on test set
# YOUR CODE HERE


In [None]:
# Hyperparameter tuning for 3rd Model
print("Hyperparameter Tuning for 3rd Model...\n")

# Define parameter grid for 3rd Model
# YOUR CODE HERE



# Initialize GridSearchCV
# YOUR CODE HERE

# Fit (this might take a while!)
# YOUR CODE HERE

# Best parameters
# YOUR CODE HERE

# Test on test set
# YOUR CODE HERE


## **YOUR INTERPRETATION HERE**

# Final Model Selection & Insights

## 📊 Dataset Information
- **Total Samples**:
- **Training Samples**:
- **Test Samples**:
- **Number of Features**:

## 🏆 Best Performing Model
- **Model**:
- **F1-Score**: (OR THE METRIC YOU USED)

## 📈 Top 3 Most Important Features
1.
2.
3.

## 💡 Key Insights
1.
2.
3.
4.

## 🎯 Recommendations
-
-
-
-

## 🔭 How would you improve this project?
-
-
-
-

## 🏁 Final Comment on the Project
-