# Chapter 3.9-3.10: ROC Curves and Class Imbalance

Goal: Understand why accuracy is misleading for imbalanced data, use ROC/AUC for evaluation, and apply class weighting.

## Topics
- ROC curves: TPR vs FPR across thresholds
- AUC as a summary metric (1.0 = perfect, 0.5 = random)
- Class imbalance and why accuracy fails
- Using `class_weight='balanced'` to handle imbalance

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, roc_auc_score)

## Quick Recap

- **ROC curve** plots True Positive Rate vs False Positive Rate at every threshold
- **AUC** (Area Under the Curve) summarizes the ROC: 1.0 = perfect, 0.5 = random guessing
- **Class imbalance** = one class is much more common than the other
- With imbalanced data, a model can get high accuracy by always predicting the majority class
- `class_weight='balanced'` weights each class inversely to its frequency, so mistakes on the minority class count more during training
- With imbalanced data, focus on **precision**, **recall**, and **F1** instead of accuracy
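As a quick illustration of the recap, here is a minimal sketch on synthetic data (an assumption: `make_classification` with roughly a 90/10 split, not the Spotify file). `DummyClassifier` stands in for "a model that always predicts the majority class".

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Spotify dataset): about 90% class 0, 10% class 1
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_pred)
baseline_recall = recall_score(y_test, y_pred)  # recall for class 1 (the minority)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Minority-class recall: {baseline_recall:.2f}")
```

The baseline never finds a single positive case, yet its accuracy looks respectable. That gap is why the recap points to precision, recall, and F1.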

## Data

We'll use the **Spotify** dataset with an intentionally imbalanced target: predicting "viral hits" (popularity > 70).

In [None]:
# Load Spotify data
spotify = pd.read_csv('../../Textbook/data/spotify.csv')
spotify.head()

## Practice

### 1. Use AI — Create an imbalanced target and check balance

Create a binary target: `viral = (popularity > 70).astype(int)`. Check the class distribution — how imbalanced is it?
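The same pattern on a tiny made-up table, as a sketch. The `toy` DataFrame and its popularity values are hypothetical, not rows from the Spotify file; the column names mirror the ones used in this exercise.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({"popularity": [12, 35, 48, 55, 61, 66, 70, 71, 83, 90]})

# Strictly greater than 70, so a song with popularity exactly 70 is NOT viral
toy["viral"] = (toy["popularity"] > 70).astype(int)

# Share of each class (normalize=True gives proportions instead of counts)
class_share = toy["viral"].value_counts(normalize=True)
print(class_share)
```

Using `normalize=True` makes the imbalance easy to read as a proportion rather than a raw count.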

In [None]:
# Create binary target for viral hits
spotify['viral'] = ...

# Check class distribution
...

### 2. Use AI — Fit logistic regression and evaluate

Select the features (`danceability`, `energy`, `valence`, `tempo`, `loudness`, `acousticness`), split the data into training and test sets, and fit a `LogisticRegression` model. Print both the **accuracy** and the full **classification report**.

In [None]:
# Step 1: Select features and target


# Step 2: Train/test split (80/20, random_state=42)


# Step 3: Fit LogisticRegression


# Step 4: Print accuracy and classification report


### 3. Interpretation — The accuracy trap

Look at the classification report. Is accuracy high? What about recall for the minority class (viral hits)? What's happening?

**Your answer:**

(Write your answer here)

### 4. Use AI — Fit with class_weight='balanced' and compare

Fit a new `LogisticRegression(class_weight='balanced')`. Display the confusion matrices for both models (unweighted and balanced) side by side.
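To see what `'balanced'` means numerically, the sketch below computes the weights for a hypothetical label vector (90 negatives, 10 positives, chosen only for illustration) using scikit-learn's `compute_class_weight` helper, and checks them against the formula `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn would use for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same thing by hand: n_samples / (n_classes * count of each class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Here the minority class gets a weight nine times larger than the majority class, matching the 9:1 ratio of the counts.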

In [None]:
# Step 1: Fit LogisticRegression with class_weight='balanced'


# Step 2: Display both confusion matrices side by side
# Hint: use plt.subplots(1, 2, figsize=(12, 5))


### 5. Interpretation — Balanced vs unweighted

What changed between the two models? Did accuracy go up or down? Did recall for viral hits improve? Is the balanced model "better"?

**Your answer:**

(Write your answer here)

### 6. Use AI — Plot ROC curves for both models

Plot the ROC curves for the unweighted and balanced logistic regression models on the same plot. Include the AUC score in the legend. Add a dashed diagonal line for reference (random classifier).
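The plotting pattern can be sketched on a tiny hand-made example. The labels and the two score lists below are hypothetical (they stand in for `predict_proba()[:, 1]` output from two models); only the `roc_curve` / `roc_auc_score` calls carry over to the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and scores standing in for two fitted models
y_true = [0, 0, 1, 1]
model_scores = {
    "Model A": [0.1, 0.4, 0.35, 0.8],  # one positive is ranked below a negative
    "Model B": [0.2, 0.3, 0.6, 0.9],   # every positive is ranked above every negative
}

aucs = {}
fig, ax = plt.subplots(figsize=(6, 5))
for name, scores in model_scores.items():
    fpr, tpr, _ = roc_curve(y_true, scores)
    aucs[name] = roc_auc_score(y_true, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")

# Dashed diagonal: the expected curve of a random classifier
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random (AUC = 0.50)")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC curves (toy example)")
ax.legend(loc="lower right")
plt.show()
```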

In [None]:
# Plot ROC curves for both models on the same plot
# Note: call model.predict_proba(X_test)[:, 1] to get the predicted probability of the positive class


### 7. Interpretation — ROC and AUC

Looking at the ROC curves:
- Which model is actually better at **ranking** predictions?
- Does AUC tell a different story than accuracy?
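One way to think about "ranking": AUC depends only on the order of the scores, while accuracy depends on where they fall relative to a threshold. The sketch below uses hypothetical scores and halves them, which keeps the order but moves every score below 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and scores for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
shrunk = scores / 2  # same ordering, but every score is now below 0.5

# AUC only looks at the ranking, so it is unchanged by the rescaling
auc_original = roc_auc_score(y_true, scores)
auc_shrunk = roc_auc_score(y_true, shrunk)

# Accuracy at a fixed 0.5 threshold does change
acc_original = accuracy_score(y_true, (scores >= 0.5).astype(int))
acc_shrunk = accuracy_score(y_true, (shrunk >= 0.5).astype(int))

print(f"AUC:      {auc_original:.2f} vs {auc_shrunk:.2f}")
print(f"Accuracy: {acc_original:.2f} vs {acc_shrunk:.2f}")
```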

**Your answer:**

(Write your answer here)

### 8. Use AI — Compare multiple models with ROC curves

Fit logistic regression, random forest, and SVM (all with `class_weight='balanced'`) on the Spotify data. Plot all three ROC curves on the same plot with AUC in the legend.

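Note: For SVM, use `SVC(class_weight='balanced', probability=True)` to enable `predict_proba()`. Setting `probability=True` runs an internal cross-validation to calibrate the probabilities, so fitting is noticeably slower.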
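An alternative worth knowing, sketched here on synthetic data (an assumption: `make_classification`, not the Spotify file): ROC and AUC only need a continuous score, so the output of `decision_function()` also works with `roc_auc_score` and `roc_curve`, without turning on probability estimates.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: about 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# probability is left at its default (False), so predict_proba is unavailable
svm = SVC(class_weight="balanced")
svm.fit(X_train, y_train)

# Signed distance to the decision boundary; larger means "more positive"
margin_scores = svm.decision_function(X_test)
svm_auc = roc_auc_score(y_test, margin_scores)
print(f"SVM AUC from decision_function: {svm_auc:.3f}")
```

The exercise itself asks for `probability=True`, which keeps the code identical across all three models; this variant is only a faster option when probabilities are not needed.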

In [None]:
# Fit all three models and plot ROC curves


### 9. Interpretation — Why accuracy misleads

Why is accuracy a misleading metric for imbalanced data? Give a concrete example using the viral hits scenario. (For instance: if only 5% of songs are viral, what accuracy would a model get by predicting "not viral" for every song?)

**Your answer:**

(Write your answer here)

## Discussion

In the real world, many classification problems have imbalanced classes. Before you evaluate any model, what's the very first thing you should check about your data?

(Discuss with a neighbor)