# Class 3: Evaluation Metrics for Classification

**Week 6: Supervised Learning Algorithms**

## Overview
Welcome to Class 3! Today, we'll learn how to **evaluate** classification models to understand their performance. We'll use **accuracy**, **precision**, **recall**, **F1-score**, and **confusion matrices** to assess the logistic regression, KNN, and decision tree models from Classes 1 and 2. By the end, you'll be able to measure and compare model quality using scikit-learn.

## Objectives
- Understand why evaluation metrics matter.
- Learn the definitions and use cases for accuracy, precision, recall, and F1-score.
- Explore confusion matrices to analyze model errors.
- Evaluate and compare multiple models on the Iris dataset.

## Agenda
1. Why evaluate models?
2. Understanding evaluation metrics
3. Confusion matrix explained
4. Hands-on: Evaluate models

Let’s get started!

## 1. Why Evaluate Models?

Training a model isn’t enough: we need to know **how well it performs**. A model can score well on training data yet fail on new data, because it may have memorized the training examples rather than learned a general pattern. Evaluation metrics help us:
- Quantify model performance.
- Compare different models (e.g., logistic regression vs. KNN).
- Identify issues like overfitting or poor predictions.

**Example**: In Class 2, we saw predictions from KNN and decision trees. But how do we know which model is better?

**Question**: Why might a model with 100% accuracy on training data still be bad? (Hint: Think about new data.)
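One way to explore the question above is a small experiment. The sketch below is not part of the Iris workflow: it assumes a synthetic, deliberately noisy dataset from `make_classification`, and the names `X_demo`, `y_demo`, and `deep_tree` are illustrative only. It fits a decision tree with no depth limit and compares accuracy on the training data with accuracy on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 injects label noise
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# No max_depth, so the tree can keep splitting until it fits every training point
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(X_tr, y_tr)

train_acc = deep_tree.score(X_tr, y_tr)  # accuracy on data the tree has seen
test_acc = deep_tree.score(X_te, y_te)   # accuracy on held-out data
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

The gap between the two numbers is the reason we always report metrics on the test set rather than the training set.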

## 2. Understanding Evaluation Metrics

We’ll focus on four common metrics for classification:

- **Accuracy**: Fraction of correct predictions (correct / total).
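  - Good for balanced datasets, but misleading if classes are imbalanced: a model that always predicts the majority class in a 95/5 split scores 95% accuracy while finding no positives.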
  - Example: 90 correct out of 100 = 90% accuracy.

- **Precision**: Fraction of positive predictions that are correct (true positives / predicted positives).
  - Important when false positives are costly (e.g., spam detection).
  - Example: If model predicts 10 emails as spam and 8 are actually spam, precision = 8/10 = 0.8.

- **Recall**: Fraction of actual positives correctly identified (true positives / actual positives).
  - Important when false negatives are costly (e.g., disease detection).
  - Example: If 10 patients have a disease and model identifies 7, recall = 7/10 = 0.7.

- **F1-Score**: Harmonic mean of precision and recall (2 * precision * recall / (precision + recall)).
  - Balances precision and recall, useful for imbalanced data.
  - Example: Precision = 0.8, recall = 0.7 → F1 = 2 * 0.8 * 0.7 / (0.8 + 0.7) ≈ 0.747.
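To make the warning about imbalanced classes concrete, here is a minimal sketch. It assumes a made-up label vector with 95 negatives and 5 positives and a "model" that always predicts the negative class; the names `y_true_imb` and `y_pred_imb` are hypothetical and not part of the Iris exercise.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true_imb = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
y_pred_imb = np.zeros_like(y_true_imb)

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
rec_imb = recall_score(y_true_imb, y_pred_imb)
# Precision is undefined here (no positive predictions);
# zero_division=0 makes F1 report 0 instead of raising a warning
f1_imb = f1_score(y_true_imb, y_pred_imb, zero_division=0)

print(f"Accuracy: {acc_imb:.2f}")
print(f"Recall:   {rec_imb:.2f}")
print(f"F1-Score: {f1_imb:.2f}")
```

Accuracy looks strong even though the model never finds a single positive case, which is why recall and F1 are worth checking alongside it.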

**Question**: When might precision matter more than recall? (Pause and discuss!)

## 3. Confusion Matrix Explained

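A **confusion matrix** summarizes a model’s predictions by comparing predicted vs. actual labels. For binary classification (e.g., setosa = 0, versicolor = 1, with 1 treated as the positive class), rows are actual labels and columns are predicted labels, which matches the layout that scikit-learn’s `confusion_matrix` returns: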

|                  | Predicted 0 | Predicted 1 |
|------------------|-------------|-------------|
| **Actual 0**     | True Negative (TN) | False Positive (FP) |
| **Actual 1**     | False Negative (FN) | True Positive (TP) |

- **True Positive (TP)**: Correctly predicted positive (e.g., predicted versicolor, actually versicolor).
- **True Negative (TN)**: Correctly predicted negative.
- **False Positive (FP)**: Incorrectly predicted positive (e.g., predicted versicolor, actually setosa).
- **False Negative (FN)**: Incorrectly predicted negative.

Metrics are calculated from the matrix:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-Score = 2 * TP / (2 * TP + FP + FN)
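The formulas above can be checked by hand on a small example. The sketch below assumes ten made-up labels (`y_true_toy` and `y_pred_toy` are illustrative names), unpacks the four cells of the confusion matrix, and compares the hand-computed values with scikit-learn’s metric functions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels for illustration: 4 actual negatives, 6 actual positives
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"By hand: acc={accuracy:.3f}, prec={precision:.3f}, rec={recall:.3f}, f1={f1:.3f}")
print(
    "sklearn: "
    f"acc={accuracy_score(y_true_toy, y_pred_toy):.3f}, "
    f"prec={precision_score(y_true_toy, y_pred_toy):.3f}, "
    f"rec={recall_score(y_true_toy, y_pred_toy):.3f}, "
    f"f1={f1_score(y_true_toy, y_pred_toy):.3f}"
)
```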

We’ll visualize confusion matrices to make errors clear.

## 4. Hands-On: Evaluate Models

We’ll evaluate the logistic regression, KNN, and decision tree models from Classes 1 and 2 on the Iris dataset (binary: setosa vs. versicolor). We’ll compute accuracy, precision, recall, and F1-score, then visualize the confusion matrices.

**Steps**:
1. Load and prepare the Iris dataset.
2. Train all three models.
3. Compute evaluation metrics.
4. Visualize confusion matrices.

Let’s dive into the code!

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create binary classification: setosa (0) vs. versicolor (1)
mask = y < 2
X_binary = X[mask]
y_binary = y[mask]

# Use two features (petal length, petal width) for consistency
X_binary = X_binary[:, 2:4]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_binary, y_binary, test_size=0.2, random_state=42)

# Check the data
print("Feature names:", iris.feature_names[2:4])
print("Target names:", iris.target_names[:2])
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

**What did we do?**
- Loaded the same binary Iris dataset (100 samples, petal length/width).
- Split into 80% training (80 samples) and 20% testing (20 samples).
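A random split does not guarantee that both classes appear in equal numbers in the test set, and that affects how precision and recall should be read. As an optional check (not part of the original workflow), the sketch below rebuilds the same binary dataset under separate names (`X_chk`, `y_chk`, both illustrative) and counts the test labels for the plain split and for a split that uses the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the binary dataset so this cell runs on its own
iris_chk = load_iris()
mask_chk = iris_chk.target < 2
X_chk = iris_chk.data[mask_chk][:, 2:4]
y_chk = iris_chk.target[mask_chk]

# Same settings as the split above
_, _, _, y_te_plain = train_test_split(
    X_chk, y_chk, test_size=0.2, random_state=42
)

# Stratified variant: keeps the 50/50 class ratio in both train and test sets
_, _, _, y_te_strat = train_test_split(
    X_chk, y_chk, test_size=0.2, random_state=42, stratify=y_chk
)

# bincount gives [number of setosa, number of versicolor]
print("Test class counts (plain):     ", np.bincount(y_te_plain, minlength=2))
print("Test class counts (stratified):", np.bincount(y_te_strat, minlength=2))
```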

Now, let’s train all three models.

In [None]:
# Train models
# Logistic Regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

# Decision Tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("Models trained! Ready to evaluate.")

Let’s compute evaluation metrics for each model.

In [None]:
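# Function to compute and print metrics
# Note: precision, recall, and F1 use scikit-learn's default positive class,
# which is label 1 (versicolor) in this binary setup
def print_metrics(y_true, y_pred, model_name):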
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{model_name}:")
    print(f"  Accuracy: {acc:.3f}")
    print(f"  Precision: {prec:.3f}")
    print(f"  Recall: {rec:.3f}")
    print(f"  F1-Score: {f1:.3f}")
    print()

# Compute metrics for all models
print_metrics(y_test, y_pred_lr, "Logistic Regression")
print_metrics(y_test, y_pred_knn, "KNN (k=3)")
print_metrics(y_test, y_pred_dt, "Decision Tree")

**Your turn!**
- Look at the metrics. Which model has the highest accuracy? F1-score?
- Are precision and recall similar for all models? Why might they differ?
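scikit-learn also offers `classification_report`, which prints precision, recall, and F1 for every class in a single table. The sketch below is an alternative to the helper above rather than a replacement. It retrains only the logistic regression model so the cell runs on its own, and the `_rep` names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Rebuild the binary dataset and the split used in this class
iris_rep = load_iris()
mask_rep = iris_rep.target < 2
X_rep = iris_rep.data[mask_rep][:, 2:4]
y_rep = iris_rep.target[mask_rep]
X_tr_rep, X_te_rep, y_tr_rep, y_te_rep = train_test_split(
    X_rep, y_rep, test_size=0.2, random_state=42
)

clf_rep = LogisticRegression(random_state=42).fit(X_tr_rep, y_tr_rep)
y_pred_rep = clf_rep.predict(X_te_rep)

# Plain Python strings for the two class names
class_names = [str(name) for name in iris_rep.target_names[:2]]

# Text table with one row per class
print(classification_report(
    y_te_rep, y_pred_rep, labels=[0, 1], target_names=class_names
))

# The same numbers as a nested dict, convenient for comparing models in code
report_dict = classification_report(
    y_te_rep, y_pred_rep, labels=[0, 1], target_names=class_names,
    output_dict=True,
)
print("Versicolor F1:", report_dict["versicolor"]["f1-score"])
```

Unlike `print_metrics`, the report shows the metrics with each class treated as the positive class in turn.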

Now, let’s visualize the confusion matrices.

In [None]:
# Function to plot confusion matrix
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(5, 4))
    class_names = iris.target_names[:2]
    sns.heatmap(
        cm, annot=True, fmt="d", cmap="Blues",
        xticklabels=class_names, yticklabels=class_names,
    )
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

# Plot confusion matrices
plot_confusion_matrix(y_test, y_pred_lr, "Logistic Regression Confusion Matrix")
plot_confusion_matrix(y_test, y_pred_knn, "KNN (k=3) Confusion Matrix")
plot_confusion_matrix(y_test, y_pred_dt, "Decision Tree Confusion Matrix")

**Discussion**:
- Look at the confusion matrices:
  - **Diagonal** (top-left, bottom-right): Correct predictions (TN, TP).
  - **Off-diagonal**: Errors (FP, FN).
- Which model has the fewest errors?
- Do any models make more false positives than false negatives?

**Your turn!**
- Try changing the KNN `n_neighbors` to 7 in the training cell and re-run the metrics and confusion matrix. Does performance improve?
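To take the exercise a step further, several values of `n_neighbors` can be compared in one loop. This is a minimal sketch that rebuilds the data so it runs on its own; the candidate values `(1, 3, 5, 7)` and the `results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the binary dataset and the split used in this class
iris_k = load_iris()
mask_k = iris_k.target < 2
X_k = iris_k.data[mask_k][:, 2:4]
y_k = iris_k.target[mask_k]
X_tr_k, X_te_k, y_tr_k, y_te_k = train_test_split(
    X_k, y_k, test_size=0.2, random_state=42
)

# Train one KNN model per candidate k and record its test metrics
results = {}
for k in (1, 3, 5, 7):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr_k, y_tr_k)
    y_pred_k = knn_k.predict(X_te_k)
    results[k] = (
        accuracy_score(y_te_k, y_pred_k),
        f1_score(y_te_k, y_pred_k),
    )

for k, (acc_k, f1_k) in results.items():
    print(f"k={k}: accuracy={acc_k:.3f}, F1={f1_k:.3f}")
```

If every value of k gives the same scores, the two classes are probably easy to separate with these features, so the choice of k matters little here.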

## Wrap-Up

Today, you:
- Learned why **evaluation metrics** are critical.
- Computed **accuracy**, **precision**, **recall**, and **F1-score**.
- Used **confusion matrices** to analyze model errors.
- Evaluated logistic regression, KNN, and decision tree models.

**Homework**:
- Re-run the notebook with a different train-test split (change `random_state` to 123) and check how metrics change.
- Explore the [scikit-learn metrics documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) (5-10 min).

**Next Class**:
- We’ll cover **hyperparameter tuning** with grid search and work on a **mini-project** to tie everything together.
- Be ready to build and compare models!

Any questions?