# 2.1 - Supervised Learning Essentials

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/madeforai/madeforai/blob/main/docs/understanding-ai/module-2/2.1-supervised-learning.ipynb)

---

**Master the art of prediction: from house prices to disease diagnosis, supervised learning powers 70% of real-world AI applications.**

## 📚 What You'll Learn

- **The supervised learning workflow**: From problem formulation to model deployment
- **Regression fundamentals**: Predicting continuous values (prices, temperatures, quantities)
- **Classification fundamentals**: Predicting categories (spam/not spam, disease/healthy)
- **Evaluation metrics that matter**: How to measure success for regression and classification
- **Overfitting and regularization**: The #1 challenge in ML and how to fix it

## ⏱️ Estimated Time
35-40 minutes

## 📋 Prerequisites
- Completed Module 1 (especially 1.3 and 1.4)
- Understanding of train-test splits and gradient descent
- Basic Python and scikit-learn familiarity

## 🎯 Welcome to the Workhorse of Machine Learning

If unsupervised learning is like exploring a new city without a map, **supervised learning is like having a GPS with turn-by-turn directions**.

Here's why supervised learning dominates the ML landscape in 2026:

- **70% of industry ML applications** use supervised learning
- **Billion-dollar industries** built on it: fraud detection, medical diagnosis, recommendation systems
- **Interpretable and reliable**: You know what you're predicting and can measure success
- **Effective with good labels**: Given enough high-quality labeled examples, standard algorithms perform well

**The Core Idea:**
> "Show me examples of correct input→output pairs, and I'll learn the pattern so well that I can predict the output for new inputs."

Think of it like learning to grade essays. After seeing 1000 essays with teacher grades (labeled data), you start recognizing patterns: "Strong thesis = high grade, poor grammar = low grade." Eventually, you can grade new essays yourself.

That's supervised learning! Let's dive deep. 🏊‍♂️
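
As a minimal sketch of that core idea, the snippet below learns from five labeled input→output pairs and predicts the output for an unseen input. The toy relationship `y = 2x + 1` and the names `X_known`, `y_known`, and `X_new` are assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Labeled examples: inputs X_known and their "correct answers" y_known (y = 2x + 1)
X_known = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_known = 2 * X_known.ravel() + 1

model = LinearRegression()
model.fit(X_known, y_known)           # learn the pattern from labeled pairs

X_new = np.array([[10.0]])            # an input the model has never seen
prediction = model.predict(X_new)[0]
print(f"Predicted output for x=10: {prediction:.2f}")
```

Because the toy data is perfectly linear, the model recovers the rule and predicts 21 for `x = 10`. Real data is noisy, which is why the rest of this notebook spends so much time on evaluation.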

In [None]:
# Setup: Install and import libraries
# Uncomment if running in Google Colab
# !pip install numpy pandas matplotlib seaborn scikit-learn plotly -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.datasets import make_regression, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score
)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
warnings.filterwarnings('ignore')
np.random.seed(42)

print("✅ Libraries loaded successfully!")
print("📘 Module 2.1: Supervised Learning Essentials")
print("🎯 Let's master prediction!")

## 📊 Part 1: Regression - Predicting Numbers

### What is Regression?

**Regression** predicts a **continuous numerical value** based on input features.

**Real-World Examples:**
- 🏠 Predicting house prices from size, location, bedrooms
- 📈 Forecasting stock prices from historical data
- 🌡️ Estimating temperature from weather patterns
- 💰 Predicting customer lifetime value from purchase history
- 🚗 Estimating used car prices from mileage, age, brand

**Key Characteristics:**
- Output is a **number** (can be any value on a continuous scale)
- Examples: $350,000, 23.7°C, $1,245.99
- Measured using **distance metrics** (how far off were we?)

Let's build a complete regression model from scratch!

<!-- [PLACEHOLDER IMAGE]
Prompt for image generation:
"Create an educational diagram showing regression concept.
Style: Clean, modern, split into 3 panels.
Panel 1: Scatter plot of house size vs price with points
Panel 2: Same plot with best-fit line drawn through points
Panel 3: New house (marked with star) with predicted price shown
Elements:
- X-axis: House Size (sq ft)
- Y-axis: Price ($)
- Best-fit line in blue
- Training points in gray
- New prediction in red star
- Arrows showing 'Learn from data' → 'Find pattern' → 'Predict new'
Color scheme: Professional blue and orange gradient.
Format: Wide horizontal 16:9 layout." -->

### 🏠 Case Study: California Housing Prices

We'll predict median house prices in California neighborhoods (census block groups) using features like:
- Median income in the area
- Median house age
- Average number of rooms and bedrooms
- Population and average household occupancy
- Location (latitude/longitude)

In [None]:
# Load the real California housing dataset that ships with scikit-learn
from sklearn.datasets import fetch_california_housing

# Load the dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedianHouseValue')

# Convert to actual dollar values (multiply by 100,000)
y = y * 100000

print("🏠 California Housing Dataset Loaded\n")
print(f"Samples: {len(X):,}")
print(f"Features: {X.shape[1]}")
print(f"\nFeature names: {list(X.columns)}")
print(f"\nTarget: Median house value in USD")
print(f"Price range: ${y.min():,.0f} - ${y.max():,.0f}")
print(f"Average price: ${y.mean():,.0f}")

# Show first few rows
print("\n📊 First 5 samples:")
sample_df = X.head().copy()
sample_df['MedianHouseValue'] = y.head()
display(sample_df)

In [None]:
# Explore relationships between features and price
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# Select key features to visualize
features_to_plot = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
feature_labels = ['Median Income', 'House Age', 'Avg Rooms', 'Avg Bedrooms', 'Population', 'Avg Occupancy']

for idx, (feature, label) in enumerate(zip(features_to_plot, feature_labels)):
    ax = axes[idx]
    
    # Scatter plot with transparency
    ax.scatter(X[feature], y, alpha=0.3, s=10, color='#3b82f6', edgecolors='none')
    
    ax.set_xlabel(label, fontsize=11, fontweight='bold')
    ax.set_ylabel('Median House Value ($)', fontsize=11, fontweight='bold')
    ax.set_title(f'{label} vs Price', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # Format y-axis
    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
    
    # Add correlation coefficient
    corr = np.corrcoef(X[feature], y)[0, 1]
    ax.text(0.05, 0.95, f'Correlation: {corr:.2f}', 
           transform=ax.transAxes, fontsize=10, verticalalignment='top',
           bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Feature-Price Relationships in California Housing', 
            fontsize=15, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n🔍 Key Observations:")
print("   → Median Income shows strongest positive correlation with price")
print("   → House age has a weak positive correlation with price")
print("   → Number of rooms has a weak positive correlation with price")
print("   → Population and occupancy show almost no linear correlation with price")

### 📐 Building a Linear Regression Model

**Linear Regression** is the simplest and most interpretable regression algorithm.

**The Math (Don't Panic!):**
```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
```

Where:
- `y` = predicted value (house price)
- `β₀` = intercept (baseline price)
- `β₁, β₂, ...` = coefficients (how much each feature affects price)
- `x₁, x₂, ...` = feature values (income, age, rooms, etc.)

**Goal:** Find the β values that minimize the squared prediction error on the training data!
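
To make "find the best β" concrete, here is a small sketch that solves the least-squares problem directly with NumPy and compares the result with `LinearRegression`. The synthetic data, the chosen `true_beta`, and the intercept of 4.0 are assumptions for illustration only, not values from the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
true_beta = np.array([3.0, -2.0, 0.5])           # hypothetical "true" coefficients
y_toy = 4.0 + X_toy @ true_beta + rng.normal(scale=0.1, size=200)

# Least squares by hand: prepend a column of ones so beta_hat[0] is the intercept
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])
beta_hat, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

sk_model = LinearRegression().fit(X_toy, y_toy)
print("by hand:", np.round(beta_hat, 3))
print("sklearn:", np.round(np.r_[sk_model.intercept_, sk_model.coef_], 3))
```

Both approaches minimize the same sum of squared errors, so the two printed rows should agree and sit close to the values used to generate the data.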

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"📊 Data Split:")
print(f"   Training: {len(X_train):,} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"   Testing: {len(X_test):,} samples ({len(X_test)/len(X)*100:.0f}%)")

# Feature scaling (important for many algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\n✅ Features scaled to mean=0, std=1")

# Train linear regression
print(f"\n🔧 Training Linear Regression...")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = lr_model.predict(X_train_scaled)
y_test_pred = lr_model.predict(X_test_scaled)

print(f"✅ Model trained!\n")

# Evaluate performance
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

print("📊 Model Performance:")
print("="*60)
print(f"{'Metric':<20} {'Training':>15} {'Testing':>15}")
print("="*60)
print(f"{'R² Score':<20} {train_r2:>15.4f} {test_r2:>15.4f}")
print(f"{'RMSE (Root MSE)':<20} ${train_rmse:>14,.0f} ${test_rmse:>14,.0f}")
print(f"{'MAE (Mean Abs Err)':<20} ${train_mae:>14,.0f} ${test_mae:>14,.0f}")
print("="*60)

print(f"\n💡 Interpretation:")
print(f"   → R² = {test_r2:.1%}: Model explains {test_r2:.1%} of price variation")
print(f"   → On average, predictions are off by ${test_mae:,.0f}")
print(f"   → Training and test scores are similar → Good generalization!")

In [None]:
# Visualize predictions vs actual values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Left: Training set
ax1.scatter(y_train, y_train_pred, alpha=0.4, s=20, color='#3b82f6', edgecolors='none')
ax1.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 
        'r--', linewidth=3, label='Perfect Predictions')
ax1.set_xlabel('Actual Price ($)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Predicted Price ($)', fontsize=12, fontweight='bold')
ax1.set_title(f'Training Set Predictions\n(R² = {train_r2:.3f})',
             fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
ax1.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Right: Test set
ax2.scatter(y_test, y_test_pred, alpha=0.4, s=20, color='#10b981', edgecolors='none')
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
        'r--', linewidth=3, label='Perfect Predictions')
ax2.set_xlabel('Actual Price ($)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Predicted Price ($)', fontsize=12, fontweight='bold')
ax2.set_title(f'Test Set Predictions (Unseen Data!)\n(R² = {test_r2:.3f})',
             fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
ax2.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.suptitle('Actual vs Predicted House Prices\n(Points closer to red line = better predictions)', 
            fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n🎯 What This Shows:")
print("   → Points near the red line = accurate predictions")
print("   → Points far from the line = prediction errors")
print("   → Test set (green) shows similar pattern to training (blue)")
print("   → Model generalizes well to unseen data! ✅")

In [None]:
# Analyze feature importance (coefficients)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr_model.coef_,
    'Abs_Coefficient': np.abs(lr_model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("\n🔍 Feature Importance (Coefficients):")
print("="*60)
for idx, row in feature_importance.iterrows():
    effect = "increases" if row['Coefficient'] > 0 else "decreases"
    print(f"{row['Feature']:<15}: ${row['Coefficient']:>10,.0f} ({effect} price)")
print("="*60)

# Visualize coefficients
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#10b981' if c > 0 else '#ef4444' for c in feature_importance['Coefficient']]
bars = ax.barh(feature_importance['Feature'], feature_importance['Coefficient'], 
              color=colors, alpha=0.7, edgecolor='white', linewidth=2)

ax.set_xlabel('Coefficient Value (Impact on Price)', fontsize=12, fontweight='bold')
ax.set_title('Feature Impact on House Price\n(Green = Positive, Red = Negative)', 
            fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.axvline(x=0, color='black', linestyle='-', linewidth=1)

plt.tight_layout()
plt.show()

print("\n💡 Business Insights:")
print("   → MedInc (income) has one of the largest positive coefficients")
print("   → Latitude and Longitude carry large negative coefficients (location, location, location!)")
print("   → AveOccup (occupancy) and Population have coefficients close to zero")
print("   → Each coefficient is the price change per one standard deviation, since features were scaled")

### 📊 Understanding Regression Metrics

**3 Key Metrics to Judge Regression Models:**

#### 1️⃣ R² Score (R-Squared)
- **Range:** Up to 1 (1 = perfect, 0 = no better than always predicting the mean; negative on held-out data when the model does worse than that)
- **Meaning:** Fraction of variance in the target explained by the features
- **Example:** R² = 0.60 means "60% of price variation is explained by our features"
- **Pro:** Easy to interpret, dimensionless
- **Con:** Training R² never decreases as you add features, so it can flatter an overly complex model

#### 2️⃣ RMSE (Root Mean Squared Error)
- **Range:** 0 to ∞ (0 = perfect)
- **Meaning:** Square root of the average squared error, in the original units
- **Example:** RMSE = $69,000 means "a typical error is about $69K, with large misses weighted more heavily"
- **Pro:** Penalizes large errors more (squares before averaging)
- **Con:** Sensitive to outliers

#### 3️⃣ MAE (Mean Absolute Error)
- **Range:** 0 to ∞ (0 = perfect)
- **Meaning:** Average absolute difference from truth
- **Example:** MAE = $49,000 means "on average, predictions are off by $49K"
- **Pro:** More robust to outliers than RMSE
- **Con:** Weights every dollar of error equally, so one huge miss counts no more than many small misses with the same total

**Which to use?**
- Use **R²** for overall model quality
- Use **RMSE** if large errors are very bad (e.g., medical dosage)
- Use **MAE** for interpretability ("typical error")
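
The three metrics are easy to compute by hand. The sketch below does so on four made-up house prices (hypothetical values, chosen so that one prediction is badly off) and cross-checks the results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical prices: three good predictions and one that misses by $100K
y_true = np.array([200_000.0, 300_000.0, 400_000.0, 500_000.0])
y_pred = np.array([210_000.0, 290_000.0, 400_000.0, 400_000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                     # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))              # root of the average squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# Cross-check against scikit-learn's implementations
sk_mae = mean_absolute_error(y_true, y_pred)
sk_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
sk_r2 = r2_score(y_true, y_pred)
```

Here MAE is $30,000 while RMSE is roughly $50,500: the single $100K miss dominates the squared-error average, which is exactly the outlier sensitivity described above.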

## 🏷️ Part 2: Classification - Predicting Categories

### What is Classification?

**Classification** predicts a **discrete category/label** from input features.

**Real-World Examples:**
- 📧 Spam detection (spam vs. not spam)
- 🏥 Disease diagnosis (healthy vs. diseased)
- 💳 Fraud detection (fraudulent vs. legitimate)
- 😊 Sentiment analysis (positive, neutral, negative)
- 🐱 Image recognition (cat, dog, bird, etc.)

**Types of Classification:**
- **Binary:** 2 classes (yes/no, 0/1, true/false)
- **Multi-class:** 3+ classes, one label (apple, orange, banana)
- **Multi-label:** Multiple classes possible (tags: sports, politics, tech)

**Key Difference from Regression:**
- Regression: "How much?" → $250,000
- Classification: "Which one?" → Class A

<!-- [PLACEHOLDER IMAGE]
Prompt for image generation:
"Create an educational diagram showing classification concept.
Style: Clean, modern, split into 2 sections.
Section 1 (Binary Classification):
- Scatter plot with two distinct clusters (red and blue points)
- Decision boundary line separating the clusters
- Labels: 'Class 0' and 'Class 1'
Section 2 (Multi-class Classification):
- Scatter plot with three clusters (red, blue, green)
- Decision boundaries separating all three classes
- Labels: 'Class A', 'Class B', 'Class C'
- New unlabeled point (star) being classified
Color scheme: Professional, high contrast.
Format: Wide horizontal 16:9 layout." -->

### 🩺 Case Study: Heart Disease Prediction

We'll predict whether a patient has heart disease based on:
- Age, sex, chest pain type
- Resting blood pressure
- Cholesterol levels
- Maximum heart rate achieved
- Exercise-induced angina
- And more clinical measurements

In [None]:
# Generate synthetic heart disease data
np.random.seed(42)
n_patients = 1000

# Create realistic patient data
data = pd.DataFrame({
    'age': np.random.randint(30, 80, n_patients),
    'sex': np.random.choice([0, 1], n_patients),  # 0=Female, 1=Male
    'chest_pain_type': np.random.choice([0, 1, 2, 3], n_patients),
    'resting_bp': np.random.randint(90, 200, n_patients),
    'cholesterol': np.random.randint(150, 350, n_patients),
    'fasting_blood_sugar': np.random.choice([0, 1], n_patients, p=[0.85, 0.15]),
    'rest_ecg': np.random.choice([0, 1, 2], n_patients),
    'max_heart_rate': np.random.randint(70, 200, n_patients),
    'exercise_angina': np.random.choice([0, 1], n_patients, p=[0.7, 0.3]),
    'oldpeak': np.random.uniform(0, 6, n_patients),
    'slope': np.random.choice([0, 1, 2], n_patients)
})

# Create target variable (heart disease presence)
# Higher probability if: older, male, high BP/cholesterol, lower max HR
disease_prob = 0.1  # Base probability
disease_prob += (data['age'] > 60) * 0.25
disease_prob += (data['sex'] == 1) * 0.20
disease_prob += (data['resting_bp'] > 150) * 0.20
disease_prob += (data['cholesterol'] > 250) * 0.25
disease_prob += (data['max_heart_rate'] < 120) * 0.20
disease_prob += (data['exercise_angina'] == 1) * 0.30
disease_prob += (data['chest_pain_type'] == 3) * 0.15

data['heart_disease'] = (np.random.random(n_patients) < disease_prob).astype(int)

print("🩺 Heart Disease Dataset Created\n")
print(f"Total patients: {len(data):,}")
print(f"Features: {data.shape[1] - 1}")
print(f"\nClass Distribution:")
print(f"   No disease (0): {(data['heart_disease']==0).sum()} ({(data['heart_disease']==0).sum()/len(data)*100:.1f}%)")
print(f"   Has disease (1): {(data['heart_disease']==1).sum()} ({(data['heart_disease']==1).sum()/len(data)*100:.1f}%)")

print("\n📊 First 5 patients:")
display(data.head())

In [None]:
# Visualize feature distributions by class
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

features_to_plot = ['age', 'resting_bp', 'cholesterol', 'max_heart_rate', 'oldpeak', 'chest_pain_type']
feature_labels = ['Age', 'Resting BP', 'Cholesterol', 'Max Heart Rate', 'Oldpeak', 'Chest Pain Type']

for idx, (feature, label) in enumerate(zip(features_to_plot, feature_labels)):
    ax = axes[idx]
    
    # Separate by class
    no_disease = data[data['heart_disease'] == 0][feature]
    has_disease = data[data['heart_disease'] == 1][feature]
    
    # Plot histograms
    ax.hist(no_disease, bins=30, alpha=0.6, label='No Disease', color='#10b981', edgecolor='white', linewidth=0.5)
    ax.hist(has_disease, bins=30, alpha=0.6, label='Has Disease', color='#ef4444', edgecolor='white', linewidth=0.5)
    
    ax.set_xlabel(label, fontsize=11, fontweight='bold')
    ax.set_ylabel('Count', fontsize=11, fontweight='bold')
    ax.set_title(f'{label} Distribution by Disease Status', fontsize=12, fontweight='bold')
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3, axis='y')

plt.suptitle('Feature Distributions: Healthy vs Heart Disease', 
            fontsize=15, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n🔍 Key Observations:")
print("   → Age: Disease patients tend to be older")
print("   → BP & Cholesterol: Higher values more common in disease patients")
print("   → Max Heart Rate: Disease patients have lower max heart rates")
print("   → These patterns will help our classifier distinguish classes!")

### 📊 Building a Logistic Regression Classifier

**Logistic Regression** is one of the most widely used binary classification algorithms.

**Despite the name, it is used for classification, not regression!** It models the probability of a class and then applies a threshold.

**How it works:**
1. Computes a weighted sum of features (like linear regression)
2. Passes result through **sigmoid function** → outputs probability (0 to 1)
3. If probability > 0.5 → Class 1, else → Class 0

**Sigmoid Function:**
```
P(y=1) = 1 / (1 + e^(-z))
where z = β₀ + β₁x₁ + β₂x₂ + ...
```

**Why sigmoid?** It squashes any real number into the open interval (0, 1), perfect for probabilities!
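
The three steps above can be reproduced by hand. The sketch below fits a `LogisticRegression` on a small synthetic dataset (generated with `make_classification`, an assumption for illustration; it is not the heart disease data) and rebuilds its probabilities from the learned intercept and coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
toy_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Step 1: weighted sum of features (z = β₀ + β₁x₁ + ...)
z = toy_clf.intercept_[0] + X_toy @ toy_clf.coef_[0]
# Step 2: sigmoid turns z into P(y=1)
manual_proba = sigmoid(z)
# Step 3: threshold at 0.5 to get a class label
manual_labels = (manual_proba > 0.5).astype(int)

print("sigmoid(0) =", sigmoid(0.0))
print("max difference vs predict_proba:",
      np.max(np.abs(manual_proba - toy_clf.predict_proba(X_toy)[:, 1])))
```

The hand-computed probabilities should match `predict_proba` up to floating-point error, which shows that the fitted model is just a linear score passed through the sigmoid.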

In [None]:
# Prepare features and target
X = data.drop('heart_disease', axis=1)
y = data['heart_disease']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps class balance
)

print(f"📊 Data Split:")
print(f"   Training: {len(X_train):,} patients")
print(f"   Testing: {len(X_test):,} patients")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\n✅ Features scaled")

# Train logistic regression
print(f"\n🔧 Training Logistic Regression Classifier...")
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = clf.predict(X_train_scaled)
y_test_pred = clf.predict(X_test_scaled)

# Get probability predictions
y_train_proba = clf.predict_proba(X_train_scaled)[:, 1]
y_test_proba = clf.predict_proba(X_test_scaled)[:, 1]

print(f"✅ Model trained!\n")

# Evaluate performance
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
train_prec = precision_score(y_train, y_train_pred)
test_prec = precision_score(y_test, y_test_pred)
train_rec = recall_score(y_train, y_train_pred)
test_rec = recall_score(y_test, y_test_pred)
train_f1 = f1_score(y_train, y_train_pred)
test_f1 = f1_score(y_test, y_test_pred)
train_auc = roc_auc_score(y_train, y_train_proba)
test_auc = roc_auc_score(y_test, y_test_proba)

print("📊 Model Performance:")
print("="*60)
print(f"{'Metric':<20} {'Training':>15} {'Testing':>15}")
print("="*60)
print(f"{'Accuracy':<20} {train_acc:>15.3f} {test_acc:>15.3f}")
print(f"{'Precision':<20} {train_prec:>15.3f} {test_prec:>15.3f}")
print(f"{'Recall':<20} {train_rec:>15.3f} {test_rec:>15.3f}")
print(f"{'F1-Score':<20} {train_f1:>15.3f} {test_f1:>15.3f}")
print(f"{'ROC-AUC':<20} {train_auc:>15.3f} {test_auc:>15.3f}")
print("="*60)

print(f"\n💡 Quick Interpretation:")
print(f"   → Accuracy: {test_acc:.1%} of predictions are correct")
print(f"   → Precision: {test_prec:.1%} of predicted disease cases are actual disease")
print(f"   → Recall: {test_rec:.1%} of actual disease cases were caught")
print(f"   → ROC-AUC: {test_auc:.3f} (0.5=random, 1.0=perfect)")

### 📊 Understanding Classification Metrics

Classification has MORE metrics than regression because different use cases care about different errors!

#### 1️⃣ Accuracy
**Formula:** (Correct Predictions) / (Total Predictions)
- **Pro:** Easy to understand
- **Con:** Misleading for imbalanced datasets
- **Example:** 99% accuracy sounds great, but if only 1% have disease, predicting "no disease" for everyone gives 99% accuracy!
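
A quick sketch of that accuracy trap, using scikit-learn's `DummyClassifier` as the "always predict no disease" baseline. The 990/10 class split and the placeholder feature matrix are made-up values for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 990 healthy patients (0), 10 with disease (1)
y_imbalanced = np.array([0] * 990 + [1] * 10)
X_placeholder = np.zeros((len(y_imbalanced), 1))   # this baseline ignores the features

# Always predicts the majority class, i.e. "no disease" for everyone
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_imbalanced)
y_baseline_pred = baseline.predict(X_placeholder)

baseline_acc = accuracy_score(y_imbalanced, y_baseline_pred)
baseline_recall = recall_score(y_imbalanced, y_baseline_pred)
print(f"Accuracy: {baseline_acc:.1%}")
print(f"Recall:   {baseline_recall:.1%}")
```

The baseline scores 99% accuracy while catching none of the sick patients (recall of 0%), so accuracy alone says little on imbalanced data.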

#### 2️⃣ Precision
**Formula:** True Positives / (True Positives + False Positives)
- **Question:** "Of all predicted disease cases, how many actually have it?"
- **When to prioritize:** When false alarms are expensive (e.g., spam filter)
- **Medical example:** Don't want healthy patients unnecessarily treated

#### 3️⃣ Recall (Sensitivity)
**Formula:** True Positives / (True Positives + False Negatives)
- **Question:** "Of all actual disease cases, how many did we catch?"
- **When to prioritize:** When missing positives is dangerous (e.g., cancer screening)
- **Medical example:** Don't want to miss any sick patients

#### 4️⃣ F1-Score
**Formula:** 2 × (Precision × Recall) / (Precision + Recall)
- **Meaning:** Harmonic mean of precision and recall
- **Use:** When you need balance between precision and recall

#### 5️⃣ ROC-AUC
**Range:** 0 to 1 (0.5 = random guessing, 1.0 = perfect; below 0.5 is worse than random)
- **Meaning:** Area under the ROC curve (True Positive Rate vs False Positive Rate)
- **Use:** Overall classifier quality, threshold-independent
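
The formulas above all come from the four cells of the confusion matrix. This sketch computes them by hand for ten made-up labels (`y_actual` and `y_predicted` are hypothetical) and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 4 patients with disease (1), 6 healthy (0)
y_actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_predicted = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With 3 true positives, 1 false positive, and 1 false negative, precision, recall, and F1 all come out to 0.75 while accuracy is 0.80.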

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Left: Confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', square=True,
           xticklabels=['No Disease', 'Has Disease'],
           yticklabels=['No Disease', 'Has Disease'],
           cbar_kws={'label': 'Count'},
           ax=ax1, annot_kws={'size': 14, 'weight': 'bold'})
ax1.set_xlabel('Predicted', fontsize=13, fontweight='bold')
ax1.set_ylabel('Actual', fontsize=13, fontweight='bold')
ax1.set_title('Confusion Matrix\n(Test Set)', fontsize=14, fontweight='bold', pad=15)

# Right: ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
ax2.plot(fpr, tpr, linewidth=3, color='#3b82f6', label=f'Logistic Regression (AUC = {test_auc:.3f})')
ax2.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier (AUC = 0.500)')
ax2.set_xlabel('False Positive Rate', fontsize=13, fontweight='bold')
ax2.set_ylabel('True Positive Rate', fontsize=13, fontweight='bold')
ax2.set_title('ROC Curve\n(Closer to top-left = better)', fontsize=14, fontweight='bold', pad=15)
ax2.legend(fontsize=11, loc='lower right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Break down confusion matrix
tn, fp, fn, tp = cm.ravel()
print("\n📊 Confusion Matrix Breakdown:")
print("="*60)
print(f"True Negatives (TN):  {tn:4d} ← Correctly predicted healthy")
print(f"False Positives (FP): {fp:4d} ← Healthy but predicted disease (Type I error)")
print(f"False Negatives (FN): {fn:4d} ← Disease but predicted healthy (Type II error) ⚠️")
print(f"True Positives (TP):  {tp:4d} ← Correctly predicted disease")
print("="*60)

print(f"\n💡 Medical Implications:")
print(f"   → We correctly identified {tp} disease patients (can treat!)")
print(f"   → We missed {fn} disease patients (dangerous!)")
print(f"   → We had {fp} false alarms (unnecessary worry/tests)")
print(f"   → Recall = {test_rec:.1%}: We catch {test_rec:.1%} of disease cases")

## ⚖️ Part 3: Overfitting and Regularization

### The #1 Challenge in Machine Learning

**Overfitting** = Model memorizes training data instead of learning patterns

**The Problem:**
- Training accuracy: 99% ✅
- Test accuracy: 65% ❌

**Analogy:** A student who memorizes exact exam questions but can't solve new problems.

**Signs of Overfitting:**
1. Large gap between training and test performance
2. Model performs well on training data, poorly on new data
3. Model is too complex for the amount of data

**Underfitting** = Opposite problem, model is too simple
- Training accuracy: 60% ❌
- Test accuracy: 58% ❌

**Goal:** Find the sweet spot → **Good fit**
- Training accuracy: 85% ✅
- Test accuracy: 83% ✅
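
The train/test gap is easy to observe in practice. As a rough sketch, the code below trains decision trees of increasing depth on noisy synthetic data and prints both accuracies. The dataset parameters (`flip_y=0.2` label noise, 20 features) and the depths tried are assumptions picked to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y=0.2 randomizes about 20% of the labels
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

gaps = {}
for depth in (1, 4, None):   # None lets the tree grow until it fits every training point
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc, test_acc = tree.score(X_tr, y_tr), tree.score(X_te, y_te)
    gaps[depth] = (train_acc, test_acc)
    print(f"max_depth={depth!s:>4}: train={train_acc:.2f}  test={test_acc:.2f}")
```

The unlimited-depth tree reaches perfect training accuracy by memorizing the noisy labels, yet its test accuracy falls well short of that. A large gap like this is the first sign of overfitting listed above.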

<!-- [PLACEHOLDER IMAGE]
Prompt for image generation:
"Create an educational diagram showing overfitting, underfitting, and good fit.
Style: Clean, modern, three panels side-by-side.
Panel 1 - Underfitting:
- Scatter plot with curved data pattern
- Simple straight line that doesn't capture the curve
- Label: 'Too Simple - High Bias'
Panel 2 - Good Fit:
- Same data
- Smooth curve that follows the pattern
- Label: 'Just Right - Balanced'
Panel 3 - Overfitting:
- Same data
- Extremely wiggly line passing through every point
- Label: 'Too Complex - High Variance'
Color scheme: Red (underfitting), Green (good fit), Blue (overfitting).
Format: Wide horizontal 16:9 layout." -->

### 💊 The Cure: Regularization

**Regularization** = Adding a penalty for model complexity

**How it works:**
```
Old loss function: Loss = Error on training data
New loss function: Loss = Error on training data + λ × (Model Complexity)
```

Where λ (lambda) = regularization strength (a hyperparameter you tune; scikit-learn calls it `alpha`)

**Two Popular Types:**

#### 🅻1️⃣ L1 Regularization (Lasso)
- Penalty = sum of absolute values of coefficients
- **Effect:** Drives some coefficients to EXACTLY zero
- **Use case:** Feature selection (automatic feature elimination)
- **Formula:** `λ × Σ|βᵢ|`

#### 🅻2️⃣ L2 Regularization (Ridge)
- Penalty = sum of squared coefficients
- **Effect:** Shrinks all coefficients toward zero (but almost never exactly to zero)
- **Use case:** When all features matter, just reduce their impact
- **Formula:** `λ × Σβᵢ²`
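
To see the difference between the two penalties, the sketch below fits `Lasso` and `Ridge` on synthetic data where only 3 of 10 features influence the target, then counts coefficients that are exactly zero. The data comes from `make_regression` and the strength `alpha=1.0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, but only 3 actually influence the target
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X_toy, y_toy)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print(f"Exactly-zero coefficients -> Lasso: {n_zero_lasso}, Ridge: {n_zero_ridge}")
```

Lasso should zero out at least some of the uninformative features, while Ridge keeps every coefficient small but nonzero. That is why L1 doubles as a feature selector.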

**Choosing λ:**
- λ = 0: No regularization (may overfit)
- λ too small: Still overfits
- λ just right: Sweet spot! ✨
- λ too large: Underfits (model too constrained)
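
A small sweep makes these regimes visible. This sketch fits `Ridge` at four strengths on a deliberately overfit-prone synthetic problem (45 training rows, 30 features; all values are assumptions for illustration) and records training R², test R², and the size of the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Few samples and many features: a setting where overfitting is likely
X_toy, y_toy = make_regression(
    n_samples=60, n_features=30, n_informative=5, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

alphas_demo = [0.01, 1.0, 100.0, 10000.0]   # scikit-learn's name for λ is alpha
train_scores, test_scores, coef_norms = [], [], []
for alpha in alphas_demo:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    train_scores.append(model.score(X_tr, y_tr))
    test_scores.append(model.score(X_te, y_te))
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>8}: train R²={train_scores[-1]:.3f}  "
          f"test R²={test_scores[-1]:.3f}  ||β||={coef_norms[-1]:.1f}")
```

As λ grows, the coefficient norm and the training R² can only go down. The test R² is the number to watch: the best λ is the one that maximizes performance on held-out data. In a real project you would pick it with a validation set or cross-validation rather than the final test set.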

In [None]:
# Demonstrate overfitting vs regularization
# Generate synthetic data with polynomial pattern
np.random.seed(42)
X_demo = np.sort(np.random.rand(50, 1) * 10, axis=0)
y_demo = 2 * X_demo + 3 * np.sin(X_demo) + np.random.randn(50, 1) * 2

# Create polynomial features (degree 15 - way too complex!)
poly_15 = PolynomialFeatures(degree=15)
X_demo_poly = poly_15.fit_transform(X_demo)

# Train three models:
# 1. Linear (underfitting)
# 2. Polynomial without regularization (overfitting)
# 3. Polynomial with regularization (good fit)

linear_model = LinearRegression()
linear_model.fit(X_demo, y_demo)

overfit_model = LinearRegression()
overfit_model.fit(X_demo_poly, y_demo)

regularized_model = Ridge(alpha=10.0)  # L2 regularization
regularized_model.fit(X_demo_poly, y_demo)

# Create smooth line for plotting
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
X_plot_poly = poly_15.transform(X_plot)

# Predictions
y_linear = linear_model.predict(X_plot)
y_overfit = overfit_model.predict(X_plot_poly)
y_regularized = regularized_model.predict(X_plot_poly)

# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))

# Underfitting
ax1.scatter(X_demo, y_demo, alpha=0.6, s=60, color='gray', edgecolors='white', linewidth=1.5, label='Data')
ax1.plot(X_plot, y_linear, linewidth=3, color='#ef4444', label='Linear Model')
ax1.set_title('Underfitting\n(Too Simple - High Bias)', fontsize=13, fontweight='bold')
ax1.set_xlabel('X', fontsize=11, fontweight='bold')
ax1.set_ylabel('y', fontsize=11, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Overfitting
ax2.scatter(X_demo, y_demo, alpha=0.6, s=60, color='gray', edgecolors='white', linewidth=1.5, label='Data')
ax2.plot(X_plot, y_overfit, linewidth=3, color='#3b82f6', label='Degree-15 Polynomial')
ax2.set_title('Overfitting\n(Too Complex - High Variance)', fontsize=13, fontweight='bold')
ax2.set_xlabel('X', fontsize=11, fontweight='bold')
ax2.set_ylabel('y', fontsize=11, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
ax2.set_ylim(y_demo.min() - 10, y_demo.max() + 10)

# Good fit (with regularization)
ax3.scatter(X_demo, y_demo, alpha=0.6, s=60, color='gray', edgecolors='white', linewidth=1.5, label='Data')
ax3.plot(X_plot, y_regularized, linewidth=3, color='#10b981', label='Regularized Polynomial')
ax3.set_title('Good Fit with Regularization\n(Balanced - Low Bias & Variance)', fontsize=13, fontweight='bold')
ax3.set_xlabel('X', fontsize=11, fontweight='bold')
ax3.set_ylabel('y', fontsize=11, fontweight='bold')
ax3.legend(fontsize=10)
ax3.grid(True, alpha=0.3)

plt.suptitle('The Bias-Variance Tradeoff', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n🎯 What You're Seeing:")
print("   LEFT (Red): Too simple → misses the pattern (underfitting)")
print("   MIDDLE (Blue): Too complex → memorizes noise (overfitting)")
print("   RIGHT (Green): Just right → captures pattern, ignores noise (regularization!)")
print("\n💡 Regularization prevents the wild oscillations of overfitting!")

In [None]:
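# Compare Ridge vs Lasso regularization on the California housing data.
# The heart disease cells above reused X_train/X_test/y_train/y_test,
# so reload and re-split the housing data before fitting regressors.
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
y = y * 100000  # convert to dollars, as in Part 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Try different regularization strengths
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]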

ridge_scores = []
lasso_scores = []

for alpha in alphas:
    # Ridge (L2)
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    ridge_scores.append(ridge.score(X_test_scaled, y_test))
    
    # Lasso (L1)
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    lasso_scores.append(lasso.score(X_test_scaled, y_test))

# Plot
fig, ax = plt.subplots(figsize=(12, 7))

ax.semilogx(alphas, ridge_scores, marker='o', linewidth=3, markersize=10, 
           label='Ridge (L2) Regularization', color='#3b82f6')
ax.semilogx(alphas, lasso_scores, marker='s', linewidth=3, markersize=10, 
           label='Lasso (L1) Regularization', color='#f59e0b')

ax.set_xlabel('Regularization Strength (α)', fontsize=13, fontweight='bold')
ax.set_ylabel('R² Score on Test Set', fontsize=13, fontweight='bold')
ax.set_title('Effect of Regularization Strength\n(Higher α = More Regularization)', 
            fontsize=14, fontweight='bold', pad=15)
ax.grid(True, alpha=0.3)

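# Mark the best-scoring points.
# Note: picking α by its test-set score is for illustration only; in practice,
# choose it on a validation set or with cross-validation.
best_ridge_idx = np.argmax(ridge_scores)
best_lasso_idx = np.argmax(lasso_scores)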

ax.plot(alphas[best_ridge_idx], ridge_scores[best_ridge_idx], 
       marker='*', markersize=20, color='red', label='Optimal Ridge')
ax.plot(alphas[best_lasso_idx], lasso_scores[best_lasso_idx], 
       marker='*', markersize=20, color='green', label='Optimal Lasso')

ax.legend(fontsize=11)

plt.tight_layout()
plt.show()

print("\n📊 Regularization Comparison:")
print("="*60)
print(f"Best Ridge α: {alphas[best_ridge_idx]:.3f} → R² = {ridge_scores[best_ridge_idx]:.4f}")
print(f"Best Lasso α: {alphas[best_lasso_idx]:.3f} → R² = {lasso_scores[best_lasso_idx]:.4f}")
print("="*60)

print("\n💡 What to look for (check each point against your plot):")
print("   → Very small α (< 0.01): almost identical to unregularized linear regression")
print("   → Moderate α: often the best test performance; the exact range depends on the model and data")
print("   → Very large α: coefficients shrink too far, causing underfitting")
print("   → Ridge is generally more stable than Lasso here; scikit-learn scales the two penalties differently, so the same α is not equally strong")
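
One way to see why Ridge and Lasso behave differently is the textbook special case of orthonormal features, where both penalties have closed-form solutions. The sketch below illustrates that special case only; it is not how scikit-learn fits these models. `ridge_shrink` and `soft_threshold` are hypothetical helper names, the starting coefficients are invented, and `lam` follows the textbook objectives (½‖y − Xw‖² plus λ‖w‖₁ for Lasso, or plus (λ/2)‖w‖² for Ridge), which are scaled differently from scikit-learn's `alpha`.

```python
def ridge_shrink(w_ols, lam):
    """L2 penalty: scale every coefficient toward zero by the same factor."""
    return [w / (1.0 + lam) for w in w_ols]


def soft_threshold(w_ols, lam):
    """L1 penalty: move each coefficient toward zero by lam, clipping at zero."""
    shrunk = []
    for w in w_ols:
        if w > lam:
            shrunk.append(w - lam)
        elif w < -lam:
            shrunk.append(w + lam)
        else:
            shrunk.append(0.0)  # small coefficients are removed entirely
    return shrunk


# Hypothetical unregularized coefficients, chosen only for illustration
w_ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge_coefs = ridge_shrink(w_ols, lam)
lasso_coefs = soft_threshold(w_ols, lam)

print("Ridge:", [round(w, 3) for w in ridge_coefs])
print("Lasso:", lasso_coefs)
```

In this toy case Ridge keeps all four coefficients nonzero, while Lasso sets the two smallest to exactly zero. This is why Lasso can act as a feature selector.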

## 🎯 Exercise 1: Build and Compare Regressors

**Objective:** Practice building regression models and understanding regularization

**Task:**  
Using the California housing dataset:
1. Train THREE models: LinearRegression, Ridge(alpha=1.0), Lasso(alpha=1.0)
2. Compare their R² scores on the test set
3. Examine feature coefficients for each model
4. Which model performs best? Why?

<details>
<summary>💡 Hint: Getting Started</summary>

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Train models
lr = LinearRegression()
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)

# Fit on training data
lr.fit(X_train_scaled, y_train)
# ... (do same for ridge and lasso)

# Evaluate on test data
print(f"Linear Regression R²: {lr.score(X_test_scaled, y_test):.4f}")
```
</details>

In [None]:
# Your code here!






## 🎯 Exercise 2: Classification Threshold Tuning

**Objective:** Understand how classification thresholds affect precision and recall

**Context:**  
By default, logistic regression uses 0.5 as the decision threshold:
- P(disease) > 0.5 → Predict disease
- P(disease) ≤ 0.5 → Predict no disease

**Task:**  
1. Try different thresholds: 0.3, 0.5, 0.7
2. For each threshold, calculate precision and recall
3. Observe the tradeoff: lowering the threshold raises recall (or leaves it unchanged) and typically lowers precision

**Questions:**
- Which threshold would you choose for disease screening (where missing cases is dangerous)?
- Which threshold for spam detection (where false alarms are annoying)?

<details>
<summary>💡 Hint: How to Apply Threshold</summary>

```python
from sklearn.metrics import precision_score, recall_score

# Get predicted probabilities for the positive class
probs = clf.predict_proba(X_test_scaled)[:, 1]

# Apply custom threshold
threshold = 0.3
predictions = (probs > threshold).astype(int)

# Calculate metrics
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
```
</details>
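
To make the tradeoff concrete before you try it on the real model, here is a tiny hand-made example. The labels and probabilities are invented for illustration, and `precision_recall_at` is a hypothetical helper written in plain Python. With your fitted classifier you would use `precision_score` and `recall_score` as in the hint.

```python
def precision_recall_at(y_true, probs, threshold):
    """Apply a decision threshold, then compute precision and recall by hand."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy data: 4 positive cases, 6 negative cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.35, 0.65, 0.4, 0.3, 0.2, 0.1, 0.05]

for threshold in [0.3, 0.5, 0.7]:
    precision, recall = precision_recall_at(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, the 0.3 threshold catches every positive case but raises two false alarms, while 0.7 raises none but misses half of the positives.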

In [None]:
# Your code here!






## 🎓 Key Takeaways

Let's recap what we've mastered in supervised learning:

- ✅ **Supervised learning fundamentals**: Learn from labeled input-output pairs
- ✅ **Regression**: Predict continuous values (prices, quantities, measurements)
  - Metrics: R², RMSE, MAE
  - Algorithms: Linear Regression, Ridge, Lasso
- ✅ **Classification**: Predict discrete categories (spam/not spam, disease/healthy)
  - Metrics: Accuracy, Precision, Recall, F1, ROC-AUC
  - Algorithms: Logistic Regression (and more in 2.2!)
- ✅ **Overfitting**: The #1 ML challenge, where a model memorizes the training data instead of learning the pattern
- ✅ **Regularization**: L1 (Lasso) and L2 (Ridge) penalties reduce overfitting
- ✅ **Metric selection**: Choose metrics based on business cost of errors

### 🤔 The Big Picture:

**70% of real-world ML is supervised learning** because:
1. You have clear targets (labeled data)
2. Success is measurable (test set performance)
3. It works reliably when you have enough representative labeled data
4. It's often interpretable (with linear models, the coefficients help explain predictions)

**The workflow:**
1. Define problem (regression or classification?)
2. Collect & label data
3. Split: train-validation-test
4. Train models, tune hyperparameters
5. Evaluate honestly on test set
6. Deploy and monitor!
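
Steps 3 to 5 of the workflow can be sketched in plain Python as follows. This is a minimal illustration on invented synthetic data, not a production recipe: `split_indices`, `fit_ridge_1d`, `mse`, and `take` are hypothetical helpers, and the 60/20/20 split is an assumed choice. With scikit-learn you would typically call `train_test_split` twice instead.

```python
import random


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle 0..n-1 and cut it into train / validation / test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def fit_ridge_1d(xs, ys, alpha):
    """One-feature ridge regression in closed form; returns (slope, intercept)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha > 0 shrinks the slope toward zero
    return slope, y_mean - slope * x_mean


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def take(values, indices):
    """Select the items of values at the given indices."""
    return [values[i] for i in indices]


# Synthetic data (invented for illustration): y = 2x + 1 plus noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

# Step 3: split into train / validation / test
train, val, test = split_indices(len(xs))

# Step 4: tune alpha using the validation set only
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {}
for alpha in alphas:
    model = fit_ridge_1d(take(xs, train), take(ys, train), alpha)
    val_errors[alpha] = mse(model, take(xs, val), take(ys, val))
best_alpha = min(val_errors, key=val_errors.get)

# Step 5: touch the test set once, with the chosen alpha
final_model = fit_ridge_1d(take(xs, train), take(ys, train), best_alpha)
test_error = mse(final_model, take(xs, test), take(ys, test))
print(f"best alpha = {best_alpha}, test MSE = {test_error:.3f}")
```

Keeping the test set out of the tuning loop is what makes the final number an honest estimate: the validation set absorbs the optimism that comes from choosing the best of several candidates.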

You now have the foundation to solve 70% of ML problems! 🚀

## 📖 Further Learning

**Recommended Reading:**
- [scikit-learn Supervised Learning Guide](https://scikit-learn.org/stable/supervised_learning.html) - Official comprehensive reference
- [Andrew Ng's ML Course](https://www.coursera.org/learn/machine-learning) - The gold standard (free)
- [Google's ML Crash Course](https://developers.google.com/machine-learning/crash-course) - Interactive and practical

**Deep Dives:**
- [StatQuest: Regularization](https://www.youtube.com/watch?v=Q81RR3yKn30) - Visual explanations
- [Understanding ROC Curves](https://www.youtube.com/watch?v=4jRBRDbJemM) - Essential for classification
- [Bias-Variance Tradeoff](https://www.youtube.com/watch?v=EuBBz3bI-aA) - Core ML concept

**Interactive Practice:**
- [Kaggle Learn: Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) - Hands-on exercises
- [Google Colab Tutorials](https://colab.research.google.com/notebooks/intro.ipynb) - Free GPU notebooks

**Textbook** (for the curious):
- [An Introduction to Statistical Learning](https://www.statlearning.com/) - Free textbook, highly accessible

## ➡️ What's Next?

You've mastered the fundamentals of supervised learning! Next up:

**Coming up in Chapter 2.2 - Classification vs Regression:**
- **When to use which**: Decision framework for choosing regression vs classification
- **Advanced regression**: Polynomial regression, feature engineering tricks
- **Advanced classification**: Multi-class, multi-label, and imbalanced datasets
- **Model selection**: How to choose the right algorithm for your problem
- **Real-world case studies**: End-to-end projects with messy data

You'll go from understanding concepts to making confident modeling decisions! 🎯

Ready to dive deeper? Open **[Chapter 2.2 - Classification vs Regression](2.2-classification-vs-regression.ipynb)**!

---

### 💬 Feedback & Community

**Questions?** Join our [Discord community](https://discord.gg/madeforai)

**Found a bug?** [Open an issue on GitHub](https://github.com/madeforai/madeforai/issues)

**Share your projects!** Tweet with #MadeForAI

**Keep learning!** 🌟