# Class 2: K-Nearest Neighbors and Decision Trees

**Week 6: Supervised Learning Algorithms**

## Overview
Welcome to Class 2! Today, we'll explore two new supervised learning algorithms: **k-nearest neighbors (KNN)** and **decision trees**. We'll compare them to the logistic regression model from Class 1 and see how they work for classification tasks. By the end, you'll train KNN and decision tree models using scikit-learn on the Iris dataset.

## Objectives
- Understand how k-nearest neighbors (KNN) classifies data based on proximity.
- Learn how decision trees make predictions using hierarchical rules.
- Compare KNN, decision trees, and logistic regression.
- Train both algorithms and visualize their predictions.

## Agenda
1. Introduction to k-nearest neighbors (KNN)
2. Introduction to decision trees
3. Hands-on: Train KNN and decision tree models
4. Compare predictions

Let's dive in!

## 1. Introduction to K-Nearest Neighbors (KNN)

**K-nearest neighbors (KNN)** is a simple, intuitive classification algorithm. It predicts the class of a new sample by looking at the **k closest samples** in the training data and taking a majority vote.

**How it works**:
- Compute the distance (e.g., Euclidean) between a new sample and all training samples.
- Find the *k* closest samples (neighbors).
- Assign the class that appears most among those neighbors.
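
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN internally; the function name `knn_predict` and the toy measurements are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training samples."""
    # Step 1: Euclidean distance from x_new to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical values): [petal length, petal width] in cm
X_toy = np.array([[1.4, 0.2], [1.5, 0.2], [1.3, 0.3],
                  [4.5, 1.5], [4.7, 1.4], [4.0, 1.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # 0 = setosa, 1 = versicolor

print(knn_predict(X_toy, y_toy, np.array([1.6, 0.3]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.2, 1.4]), k=3))
```

Later in this class we use scikit-learn's `KNeighborsClassifier`, which follows the same idea but uses faster neighbor searches.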

**Key parameter**: *k* (number of neighbors).
- Small *k* (e.g., 1): Sensitive to noise; tends to overfit.
- Large *k*: Smoother predictions, but may blur real local patterns (underfit).
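
Both extremes of *k* can be seen on a small synthetic dataset. The sketch below uses made-up data (not Iris) with 60 points of class 0 and 40 of class 1; the variable names are chosen only for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, illustrative data: two overlapping clouds of points
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
                    rng.normal(2.0, 1.0, size=(40, 2))])
y_demo = np.array([0] * 60 + [1] * 40)

# k = 1: each training point is its own nearest neighbor,
# so the model simply reproduces the training labels
knn_small = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
train_acc_small = knn_small.score(X_demo, y_demo)
print("k=1 training accuracy:", train_acc_small)

# k = number of training samples: every point votes on every prediction,
# so the model always returns the majority class
knn_large = KNeighborsClassifier(n_neighbors=len(X_demo)).fit(X_demo, y_demo)
pred_large = knn_large.predict(X_demo)
print("k=100 predicted classes:", np.unique(pred_large))
```

A perfect training score at *k* = 1 says nothing about new data; it only shows that the model has memorized the training set.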

**Example**: If a new Iris flower's closest 3 neighbors are two *setosa* and one *versicolor*, KNN predicts *setosa*.

**Pros**: Simple; no real training phase (the model just stores the training data). **Cons**: Slow at prediction time on large datasets; sensitive to irrelevant features and to feature scale.

**Question**: What might happen if *k* is too large? (Pause and discuss!)

## 2. Introduction to Decision Trees

**Decision trees** are supervised learning algorithms (used here for classification) that split data into regions based on feature values, creating a tree of decisions.

**How it works**:
- Start at the root and ask a question (e.g., "Is petal length > 2.5 cm?").
- Follow branches based on answers, splitting data at each node.
- Reach a leaf node, which gives the predicted class.
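
A trained tree behaves like a set of nested `if` statements. The hand-written sketch below walks from the root to a leaf; the function name and both thresholds are hypothetical, chosen for illustration rather than learned from data.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Walk a tiny hand-written decision tree from root to leaf."""
    # Root node: first question
    if petal_length_cm <= 2.5:
        return "setosa"        # leaf
    # Internal node: second question, asked only on the "long petal" branch
    if petal_width_cm <= 1.7:
        return "versicolor"    # leaf
    return "virginica"         # leaf

print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.4))
print(classify_iris(5.8, 2.2))
```

A learning algorithm such as scikit-learn's `DecisionTreeClassifier` picks the questions and thresholds automatically from the training data.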

**Key parameter**: Max depth (the maximum number of splits along any path from the root to a leaf).
- Shallow trees: Simple, may underfit.
- Deep trees: Complex, may overfit.
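
To see what the depth limit does, the sketch below fits trees of several depths on the full three-class Iris data (an assumption made for this example; the hands-on section uses a binary subset) and reports how large each tree grows.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)

summary = {}
for depth in [1, 2, 3, None]:  # None removes the depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_all, y_all)
    summary[depth] = (tree.get_depth(), tree.get_n_leaves())
    # score() here is the fraction of training samples classified correctly
    print(f"max_depth={depth}: depth={tree.get_depth()}, "
          f"leaves={tree.get_n_leaves()}, "
          f"training accuracy={tree.score(X_all, y_all):.3f}")
```

Training accuracy rises as the tree deepens, but that alone does not tell you how well the tree handles new flowers.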

**Example**: A decision tree might split Iris data first by petal length, then petal width, to separate species.
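
scikit-learn can print the rules a fitted tree has learned. This sketch fits a shallow tree on the two petal features of the full Iris data and prints it with `export_text`; the features and thresholds it picks depend on the data and the random seed, so treat the printed tree as one possible outcome.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()
X_petal = iris_demo.data[:, 2:4]  # petal length, petal width

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_demo.fit(X_petal, iris_demo.target)

# Render the fitted tree as indented if/else rules
rules = export_text(tree_demo, feature_names=["petal length (cm)", "petal width (cm)"])
print(rules)
```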

**Pros**: Interpretable, handles mixed data. **Cons**: Can overfit, sensitive to small changes.

**Question**: Why might a very deep tree be a problem? (Hint: Think about memorizing the data.)

## 3. Hands-On: Train KNN and Decision Tree Models

We'll use the same Iris dataset (binary: *setosa* vs. *versicolor*) as Class 1 to train KNN and decision tree models. We'll also revisit logistic regression for comparison.

**Steps**:
1. Load and prepare the Iris dataset (same as Class 1).
2. Train a KNN model and experiment with *k*.
3. Train a decision tree model.
4. Visualize predictions and decision boundaries.

Let’s get coding!

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create binary classification: setosa (0) vs. versicolor (1)
mask = y < 2
X_binary = X[mask]
y_binary = y[mask]

# Use two features (petal length, petal width) for visualization
X_binary = X_binary[:, 2:4]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_binary, y_binary, test_size=0.2, random_state=42)

# Check the data
print("Feature names:", iris.feature_names[2:4])
print("Target names:", iris.target_names[:2])
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

**What did we do?**
- Loaded the same binary Iris dataset as Class 1 (100 samples, petal length/width).
- Split into 80% training (80 samples) and 20% testing (20 samples).

Let’s train a KNN model with *k=3*.

In [None]:
# Train KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test)

print("KNN (k=3) predictions:", y_pred_knn)
print("Actual labels:", y_test)

**Your turn!**
- Compare `y_pred_knn` and `y_test`. How many predictions are correct?
- Try changing `n_neighbors` to 1 or 7 in the cell below and re-run. What changes?

In [None]:
# Experiment with different k
k = 7  # Try 1, 5, 7, etc.
knn_experiment = KNeighborsClassifier(n_neighbors=k)
knn_experiment.fit(X_train, y_train)
y_pred_knn_experiment = knn_experiment.predict(X_test)
print(f"KNN (k={k}) predictions:", y_pred_knn_experiment)
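
Instead of editing the cell by hand for each value, you can loop over several values of *k* and count the correct test predictions. The sketch below rebuilds the same binary, two-feature split so it runs on its own; the variable names (`X_tr`, `correct_by_k`, and so on) are chosen for this example to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the binary (setosa vs. versicolor), petal-only data
X_all, y_all = load_iris(return_X_y=True)
keep = y_all < 2
X_two, y_two = X_all[keep][:, 2:4], y_all[keep]
X_tr, X_te, y_tr, y_te = train_test_split(X_two, y_two, test_size=0.2, random_state=42)

correct_by_k = {}
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Count test samples where the prediction matches the true label
    correct_by_k[k] = int((model.predict(X_te) == y_te).sum())
    print(f"k={k}: {correct_by_k[k]} of {len(y_te)} test predictions correct")
```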

Now, let’s train a decision tree.

In [None]:
# Train decision tree model
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt.predict(X_test)

print("Decision tree predictions:", y_pred_dt)
print("Actual labels:", y_test)

**Question**: Compare the decision tree predictions to KNN. Are they similar?
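
One way to compare two sets of predictions is to check them element by element. The sketch below is self-contained, so it retrains both models on the same split; names such as `knn_cmp` and `dt_cmp` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same binary, petal-only data and split as in the cells above
X_all, y_all = load_iris(return_X_y=True)
keep = y_all < 2
X_two, y_two = X_all[keep][:, 2:4], y_all[keep]
X_tr, X_te, y_tr, y_te = train_test_split(X_two, y_two, test_size=0.2, random_state=42)

knn_cmp = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
dt_cmp = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Boolean array: True where both models predict the same class
agree = knn_cmp.predict(X_te) == dt_cmp.predict(X_te)
print(f"Models agree on {agree.sum()} of {len(agree)} test samples")
print("Test indices where they disagree:", np.flatnonzero(~agree))
```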

## 4. Compare Predictions

Let’s train a logistic regression model (from Class 1) and visualize decision boundaries for all three algorithms to see how they differ.

In [None]:
# Train logistic regression for comparison
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

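# Function to plot decision boundaries
def plot_decision_boundary(model, X, y, title):
    """Shade the model's predicted class over a 2D grid and overlay the samples.

    Draws on the current matplotlib axes, so call plt.subplot(...) first.
    """
    # Pad the plotting range by 0.5 cm around the data
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    # Build a dense grid of (petal length, petal width) points
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    # Predict a class for every grid point, then reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Shaded regions show predicted classes; dots show the actual samples
    plt.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Setosa", color="blue")
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Versicolor", color="orange")
    plt.xlabel("Petal Length (cm)")
    plt.ylabel("Petal Width (cm)")
    plt.title(title)
    plt.legend()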

# Plot decision boundaries for all models
plt.figure(figsize=(15, 4))

plt.subplot(1, 3, 1)
plot_decision_boundary(lr, X_binary, y_binary, "Logistic Regression")

plt.subplot(1, 3, 2)
plot_decision_boundary(knn, X_binary, y_binary, "KNN (k=3)")

plt.subplot(1, 3, 3)
plot_decision_boundary(dt, X_binary, y_binary, "Decision Tree")

plt.tight_layout()
plt.show()

**Discussion**:
- Look at the decision boundaries:
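  - **Logistic Regression**: Linear boundary (a single straight line).
  - **KNN**: Follows the training points locally, so it can bend or look jagged, especially for small *k*.
  - **Decision Tree**: Axis-aligned splits, which give blocky, rectangular regions.
  - Note: *setosa* and *versicolor* are well separated in these two features, so all three boundaries may look fairly simple here.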
- Which boundary looks most flexible? Which looks simplest?
- Try changing `n_neighbors` or `max_depth` and re-run the plots. What changes?
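
The straight-line boundary of logistic regression can be read directly from the fitted coefficients: it is the set of points where `w_length * length + w_width * width + b = 0`. The sketch below solves that equation for petal width. It is self-contained and, as a simplifying assumption, fits on all 100 binary samples rather than on the training split, so the numbers will differ slightly from the `lr` model above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary (setosa vs. versicolor), petal-only data
X_all, y_all = load_iris(return_X_y=True)
keep = y_all < 2
X_two, y_two = X_all[keep][:, 2:4], y_all[keep]

lr_demo = LogisticRegression(random_state=42).fit(X_two, y_two)
w_length, w_width = lr_demo.coef_[0]
b = lr_demo.intercept_[0]

# Solve w_length * length + w_width * width + b = 0 for width
slope = -w_length / w_width
intercept = -b / w_width
print(f"Boundary: petal width = {slope:.2f} * petal length + {intercept:.2f}")

# Sanity check: points on this line get a decision score of (about) zero
lengths = np.array([2.0, 3.0])
on_line = np.column_stack([lengths, slope * lengths + intercept])
scores = lr_demo.decision_function(on_line)
print("Decision scores on the line:", scores)
```

KNN and decision trees have no comparable single equation, which is one reason their boundaries can take more complicated shapes.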

## Wrap-Up

Today, you:
- Learned how **KNN** uses neighbors to classify data.
- Explored **decision trees** and their rule-based splits.
- Trained both models on the Iris dataset.
- Compared decision boundaries with logistic regression.

**Homework**:
- Re-run the KNN model with a different *k* (e.g., 1, 10) and note how predictions change.
- Check out the [scikit-learn KNN documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and [DecisionTreeClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) (5-10 min).

**Next Class**:
- We’ll dive into **evaluation metrics** (accuracy, precision, recall, F1-score) to measure how good our models are.
- Bring questions about today’s models!

Any questions before we finish?