# Classification Methods Workshop

**CMSC 173 - Machine Learning**  
**University of the Philippines - Cebu**  
**Instructor:** Noel Jeffrey Pinton  
**Department:** Department of Computer Science

---

## 📚 Learning Objectives

By the end of this workshop, you will be able to:

1. **Understand** the fundamental concepts of classification and supervised learning
2. **Implement** Naive Bayes, K-Nearest Neighbors, and Decision Trees from scratch
3. **Apply** classification algorithms to real-world datasets
4. **Evaluate** classification performance using appropriate metrics
5. **Compare** different classification methods and choose the best for a given problem

**Estimated Time:** 75-90 minutes  
**Prerequisites:** Linear algebra, Python, NumPy, basic probability

---

## 📋 Table of Contents

1. [Setup & Imports](#setup)
2. [Part 1: Motivation & Background](#part1)
3. [Part 2: Naive Bayes Classification](#part2)
4. [Part 3: K-Nearest Neighbors](#part3)
5. [Part 4: Decision Trees](#part4)
6. [Part 5: Comparison & Evaluation](#part5)
7. [Part 6: Advanced Topics](#part6)
8. [Student Challenge](#challenge)
9. [Solutions](#solutions)
10. [Summary & Next Steps](#summary)

<a id='setup'></a>
## 1. Setup & Imports

In [None]:
# Environment setup
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')

print("✅ Environment setup complete!")
print(f"NumPy version: {np.__version__}")
print("Python packages loaded successfully")

<a id='part1'></a>
## 2. Part 1: Motivation & Background

### What is Classification?

**Classification** is a supervised learning task where we predict a discrete class label for new observations based on training examples.

**Key Characteristics:**
- Supervised learning (requires labeled training data)
- Discrete output (categories/classes)
- Learn from examples
- Make predictions on new data

**Real-World Applications:**
- Medical diagnosis (disease detection)
- Email spam filtering
- Credit scoring (loan approval)
- Image recognition (object classification)
- Sentiment analysis (positive/negative reviews)

In [None]:
# Load the classic Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print("Iris Dataset Overview")
print("=" * 50)
print(f"Number of samples: {X_iris.shape[0]}")
print(f"Number of features: {X_iris.shape[1]}")
print(f"Number of classes: {len(target_names)}")
print(f"\nFeatures: {feature_names}")
print(f"Classes: {list(target_names)}")
print(f"\nClass distribution:")
for i, name in enumerate(target_names):
    count = np.sum(y_iris == i)
    print(f"  {name}: {count} samples")

# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot of first two features
for i, name in enumerate(target_names):
    mask = y_iris == i
    axes[0].scatter(X_iris[mask, 0], X_iris[mask, 1], 
                   s=80, alpha=0.6, edgecolors='k', label=name)
axes[0].set_xlabel(feature_names[0])
axes[0].set_ylabel(feature_names[1])
axes[0].set_title('Iris Dataset: 2D Visualization', fontweight='bold')
axes[0].legend()

# Box plot of all features
df = pd.DataFrame(X_iris, columns=feature_names)
df['species'] = [target_names[i] for i in y_iris]
df_melt = pd.melt(df, id_vars='species', var_name='feature', value_name='value')
sns.boxplot(data=df_melt, x='feature', y='value', hue='species', ax=axes[1])
axes[1].set_title('Feature Distributions by Class', fontweight='bold')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

<a id='part2'></a>
## 3. Part 2: Naive Bayes Classification

### Theory: Bayes' Theorem

$$P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) \cdot P(C_k)}{P(\mathbf{x})}$$

**Naive Assumption:** Features are conditionally independent given the class

$$P(\mathbf{x} | C_k) = \prod_{i=1}^{d} P(x_i | C_k)$$

**Classification Rule:**

$$\hat{y} = \arg\max_{k} P(C_k) \prod_{i=1}^{d} P(x_i | C_k)$$
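
The classification rule above is usually evaluated in log space, so that the product of many small likelihoods becomes a sum. The following is a minimal sketch of that idea, assuming Gaussian per-feature likelihoods; the priors, means, and variances are made-up illustrative values, and `gaussian_log_pdf` is a hypothetical helper rather than part of any library.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log of the univariate Gaussian density, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Hypothetical parameters for 2 classes and 2 features (illustrative only)
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])      # shape (n_classes, n_features)
variances = np.array([[1.0, 1.0], [1.0, 1.0]])  # shape (n_classes, n_features)

x = np.array([1.8, 2.1])  # one new observation

# log P(C_k) + sum_i log P(x_i | C_k), one score per class
log_scores = np.log(priors) + gaussian_log_pdf(x, means, variances).sum(axis=1)
y_hat = int(np.argmax(log_scores))

# Normalize to posteriors; subtracting the max keeps np.exp from underflowing
shifted = log_scores - log_scores.max()
posterior = np.exp(shifted) / np.exp(shifted).sum()

print(f"Predicted class: {y_hat}, posterior: {posterior}")
```

Because $P(\mathbf{x})$ is the same for every class, it can be dropped for the arg max and only reappears as the normalizing sum in the last step.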

In [None]:
# Implement Gaussian Naive Bayes from scratch
class GaussianNBFromScratch:
    """Gaussian Naive Bayes classifier for educational purposes"""
    
    def __init__(self):
        self.classes = None
        self.priors = {}
        self.means = {}
        self.vars = {}
    
    def fit(self, X, y):
        """Fit Gaussian Naive Bayes"""
        n_samples = X.shape[0]
        self.classes = np.unique(y)
        
        # Compute parameters for each class
        for c in self.classes:
            X_c = X[y == c]
            
            # Prior probability
            self.priors[c] = len(X_c) / n_samples
            
            # Mean and variance for each feature
            self.means[c] = X_c.mean(axis=0)
            self.vars[c] = X_c.var(axis=0)
        
        return self
    
    def _gaussian_pdf(self, x, mean, var):
        """Compute Gaussian probability density function"""
        eps = 1e-9  # Small value to avoid division by zero
        coeff = 1.0 / np.sqrt(2 * np.pi * var + eps)
        exponent = np.exp(-((x - mean) ** 2) / (2 * var + eps))
        return coeff * exponent
    
    def predict_proba(self, X):
        """Predict class probabilities for X"""
        n_samples = X.shape[0]
        n_classes = len(self.classes)
        probs = np.zeros((n_samples, n_classes))
        
        for idx, c in enumerate(self.classes):
            # Compute log probabilities to avoid underflow
            log_prior = np.log(self.priors[c])
            log_likelihood = np.sum(
                np.log(self._gaussian_pdf(X, self.means[c], self.vars[c]) + 1e-9),
                axis=1
            )
            probs[:, idx] = log_prior + log_likelihood
        
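        # Convert back from log space and normalize. Subtract the row-wise
        # maximum first so np.exp cannot underflow every class to zero.
        probs = np.exp(probs - probs.max(axis=1, keepdims=True))
        probs = probs / probs.sum(axis=1, keepdims=True)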
        
        return probs
    
    def predict(self, X):
        """Predict class labels for X"""
        probs = self.predict_proba(X)
        return self.classes[np.argmax(probs, axis=1)]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Train from scratch
nb_scratch = GaussianNBFromScratch()
nb_scratch.fit(X_train, y_train)

# Train sklearn version
nb_sklearn = GaussianNB()
nb_sklearn.fit(X_train, y_train)

# Compare predictions
y_pred_scratch = nb_scratch.predict(X_test)
y_pred_sklearn = nb_sklearn.predict(X_test)

print("Gaussian Naive Bayes Results")
print("=" * 50)
print(f"From Scratch - Train Accuracy: {accuracy_score(y_train, nb_scratch.predict(X_train)):.3f}")
print(f"From Scratch - Test Accuracy:  {accuracy_score(y_test, y_pred_scratch):.3f}")
print(f"\nSklearn     - Train Accuracy: {accuracy_score(y_train, nb_sklearn.predict(X_train)):.3f}")
print(f"Sklearn     - Test Accuracy:  {accuracy_score(y_test, y_pred_sklearn):.3f}")
print(f"\nPredictions match: {np.array_equal(y_pred_scratch, y_pred_sklearn)}")

# Display learned parameters
print("\nLearned Parameters (From Scratch):")
print("-" * 50)
for c in nb_scratch.classes:
    print(f"\nClass {target_names[c]}:")
    print(f"  Prior: {nb_scratch.priors[c]:.3f}")
    print(f"  Means: {nb_scratch.means[c]}")
    print(f"  Variances: {nb_scratch.vars[c]}")

In [None]:
# Visualize Naive Bayes decision boundaries (2D projection)
from matplotlib.colors import ListedColormap

def plot_decision_boundaries(X, y, classifier, title, feature_indices=(0, 1)):
    """Plot decision boundaries for 2D data"""
    # Use only two features
    X_2d = X[:, feature_indices]
    
    # Train classifier on 2D data
    classifier.fit(X_2d, y)
    
    # Create mesh
    h = 0.02  # Step size
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ['#FF0000', '#00FF00', '#0000FF']
    
    plt.figure(figsize=(10, 7))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
    
    # Plot training points
    for i, name in enumerate(target_names):
        mask = y == i
        plt.scatter(X_2d[mask, 0], X_2d[mask, 1], 
                   c=cmap_bold[i], s=80, alpha=0.7, 
                   edgecolors='k', label=name)
    
    plt.xlabel(feature_names[feature_indices[0]])
    plt.ylabel(feature_names[feature_indices[1]])
    plt.title(title, fontweight='bold', fontsize=13)
    plt.legend()
    plt.tight_layout()
    plt.show()

# Plot Naive Bayes boundaries
plot_decision_boundaries(X_train, y_train, GaussianNB(), 
                        'Naive Bayes Decision Boundaries', (0, 1))

<a id='part3'></a>
## 4. Part 3: K-Nearest Neighbors

### Theory: The KNN Algorithm

**Core Idea:** A new point takes the most common label among its $k$ closest training points

**Algorithm:**
1. Choose number of neighbors $k$
2. For a new point:
   - Compute distance to all training points
   - Find $k$ nearest neighbors
   - Take majority vote of their labels
   - Return predicted class

**Key Parameters:**
- $k$: Number of neighbors (an odd $k$ avoids tied votes in binary classification)
- Distance metric: Euclidean, Manhattan, Minkowski, etc.
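
The distance metrics listed above are all special cases of the Minkowski distance. As a small sketch (the function name `minkowski_distance` is chosen here for illustration), setting `p=1` gives Manhattan distance and `p=2` gives Euclidean distance:

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
d_euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)

print(f"Manhattan: {d_manhattan}, Euclidean: {d_euclidean}")
```

Both metrics sum over raw feature differences, which is why features measured on large scales dominate the distance unless the data is standardized first.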

In [None]:
# Implement KNN from scratch
class KNNFromScratch:
    """K-Nearest Neighbors classifier for educational purposes"""
    
    def __init__(self, k=3, distance_metric='euclidean'):
        self.k = k
        self.distance_metric = distance_metric
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """Store training data (lazy learning)"""
        self.X_train = X
        self.y_train = y
        return self
    
    def _euclidean_distance(self, x1, x2):
        """Compute Euclidean distance"""
        return np.sqrt(np.sum((x1 - x2) ** 2, axis=1))
    
    def _manhattan_distance(self, x1, x2):
        """Compute Manhattan distance"""
        return np.sum(np.abs(x1 - x2), axis=1)
    
    def predict(self, X):
        """Predict class labels for X"""
        predictions = []
        
        for x in X:
            # Compute distances to all training points
            if self.distance_metric == 'euclidean':
                distances = self._euclidean_distance(self.X_train, x)
            elif self.distance_metric == 'manhattan':
                distances = self._manhattan_distance(self.X_train, x)
            else:
                raise ValueError(f"Unknown distance metric: {self.distance_metric}")
            
            # Find k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            k_nearest_labels = self.y_train[k_indices]
            
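            # Majority vote (np.bincount requires non-negative integer labels;
            # argmax breaks ties in favor of the smallest label)
            most_common = np.bincount(k_nearest_labels).argmax()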
            predictions.append(most_common)
        
        return np.array(predictions)

# Scale features for KNN: otherwise features with large ranges dominate the distances
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train from scratch with different k values
print("KNN Performance for Different k Values")
print("=" * 50)

for k in [1, 3, 5, 7, 9]:
    knn_scratch = KNNFromScratch(k=k)
    knn_scratch.fit(X_train_scaled, y_train)
    y_pred = knn_scratch.predict(X_test_scaled)
    
    train_pred = knn_scratch.predict(X_train_scaled)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, y_pred)
    
    print(f"k={k}: Train Acc={train_acc:.3f}, Test Acc={test_acc:.3f}")

# Compare with sklearn
print("\nComparison with Sklearn (k=5):")
print("-" * 50)
knn_sklearn = KNeighborsClassifier(n_neighbors=5)
knn_sklearn.fit(X_train_scaled, y_train)
print(f"Sklearn - Train Accuracy: {knn_sklearn.score(X_train_scaled, y_train):.3f}")
print(f"Sklearn - Test Accuracy:  {knn_sklearn.score(X_test_scaled, y_test):.3f}")

In [None]:
# Visualize effect of k on decision boundaries
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, k in enumerate([1, 5, 15]):
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # Use only two features for visualization
    X_2d = X_train_scaled[:, [0, 1]]
    knn.fit(X_2d, y_train)
    
    # Create mesh
    h = 0.02
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ['#FF0000', '#00FF00', '#0000FF']
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
    
    for i, name in enumerate(target_names):
        mask = y_train == i
        axes[idx].scatter(X_2d[mask, 0], X_2d[mask, 1],
                         c=cmap_bold[i], s=60, alpha=0.7,
                         edgecolors='k', label=name)
    
    axes[idx].set_xlabel(f'Feature 1 (scaled)')
    axes[idx].set_ylabel(f'Feature 2 (scaled)')
    axes[idx].set_title(f'KNN with k={k}', fontweight='bold')
    if idx == 0:
        axes[idx].legend()

plt.suptitle('Effect of k on Decision Boundaries', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Find optimal k using cross-validation
k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

optimal_k = k_values[np.argmax(cv_scores)]

plt.figure(figsize=(10, 6))
plt.plot(k_values, cv_scores, marker='o', linewidth=2)
plt.axvline(x=optimal_k, color='r', linestyle='--', label=f'Optimal k = {optimal_k}')
plt.xlabel('Number of Neighbors (k)', fontweight='bold')
plt.ylabel('Cross-Validation Accuracy', fontweight='bold')
plt.title('Finding Optimal k using Cross-Validation', fontweight='bold', fontsize=13)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Optimal k: {optimal_k}")
print(f"Best CV accuracy: {max(cv_scores):.3f}")

<a id='part4'></a>
## 5. Part 4: Decision Trees

### Theory: Decision Tree Learning

**Goal:** Build a tree of if-then-else rules to classify data

**CART Algorithm (Classification And Regression Trees):**
1. Start with all data at root
2. Find the best split: the feature and threshold that most reduce impurity (equivalently, maximize information gain)
3. Partition data into left and right subsets
4. Recursively build subtrees
5. Stop when stopping criterion met

**Splitting Criteria:**

**Gini Impurity:**
$$\text{Gini}(D) = 1 - \sum_{k=1}^{C} p_k^2$$

**Entropy** (information gain is the reduction in entropy produced by a split):
$$\text{Entropy}(D) = -\sum_{k=1}^{C} p_k \log_2(p_k)$$
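
To make the two impurity formulas and the split search concrete, here is a minimal sketch for a single numeric feature. The helper names (`gini`, `entropy`, `impurity_decrease`, `best_threshold`) and the toy data are assumptions made for illustration; scikit-learn's actual tree builder is considerably more optimized.

```python
import numpy as np

def class_proportions(labels):
    """Proportion p_k of each class that is present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    p = class_proportions(labels)  # only present classes, so no log2(0)
    return float(-np.sum(p * np.log2(p)))

def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

def best_threshold(feature, labels, impurity=gini):
    """Try midpoints between consecutive unique values; return (threshold, decrease)."""
    values = np.unique(feature)
    best = (None, 0.0)
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2.0
        left, right = labels[feature <= t], labels[feature > t]
        gain = impurity_decrease(labels, left, right, impurity)
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data: one feature that separates the two classes perfectly
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

threshold, decrease = best_threshold(feature, labels)
print(f"Gini(parent)    = {gini(labels):.3f}")
print(f"Entropy(parent) = {entropy(labels):.3f}")
print(f"Best threshold  = {threshold}, Gini decrease = {decrease:.3f}")
```

A full tree repeats this search over every feature at every node and recurses on the two resulting subsets until a stopping criterion such as `max_depth` is reached.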

In [None]:
# Train Decision Tree with different depths
print("Decision Tree Performance for Different Depths")
print("=" * 50)

for depth in [2, 3, 5, 10, None]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    
    train_acc = dt.score(X_train, y_train)
    test_acc = dt.score(X_test, y_test)
    
    depth_str = str(depth) if depth is not None else 'None'
    print(f"max_depth={depth_str:>4}: Train Acc={train_acc:.3f}, Test Acc={test_acc:.3f}")

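# Train final model with max_depth=3 (a shallow, readable tree; in practice,
# choose the depth by cross-validation rather than by test accuracy)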
dt_final = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_final.fit(X_train, y_train)

print(f"\nFinal Model (max_depth=3):")
print(f"  Number of leaves: {dt_final.get_n_leaves()}")
print(f"  Tree depth: {dt_final.get_depth()}")
print(f"  Number of features used: {np.sum(dt_final.feature_importances_ > 0)}")

In [None]:
# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt_final, 
          feature_names=feature_names,
          class_names=list(target_names),  # a plain list is accepted across sklearn versions
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree Visualization (max_depth=3)', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Feature importance
importances = dt_final.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices], color='steelblue', edgecolor='black')
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features', fontweight='bold')
plt.ylabel('Importance', fontweight='bold')
plt.title('Feature Importance in Decision Tree', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()

print("Feature Importance Ranking:")
for i, idx in enumerate(indices):
    print(f"{i+1}. {feature_names[idx]}: {importances[idx]:.3f}")

In [None]:
# Visualize decision boundaries for decision tree
plot_decision_boundaries(X_train, y_train, 
                        DecisionTreeClassifier(max_depth=3, random_state=42),
                        'Decision Tree Decision Boundaries (max_depth=3)', (0, 1))

<a id='part5'></a>
## 6. Part 5: Comparison & Evaluation

### Performance Metrics

For classification, we use multiple metrics:

1. **Accuracy:** Proportion of correct predictions
2. **Precision:** TP / (TP + FP) - Of predicted positives, how many are correct?
3. **Recall:** TP / (TP + FN) - Of actual positives, how many did we find?
4. **F1-Score:** Harmonic mean of precision and recall
5. **Confusion Matrix:** Table of actual vs. predicted class counts
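
As a worked example of the formulas above, the sketch below counts TP, FP, FN, and TN by hand for a small binary problem. The label arrays are made up for illustration. For the three-class datasets in this workshop, the same quantities are computed per class and then combined (the code cells use `average='weighted'`).

```python
import numpy as np

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")
```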

In [None]:
# Train all three classifiers
classifiers = {
    'Naive Bayes': GaussianNB(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree (depth=3)': DecisionTreeClassifier(max_depth=3, random_state=42)
}

results = []

for name, clf in classifiers.items():
    # Use scaled data for KNN, original for others
    if 'KNN' in name:
        clf.fit(X_train_scaled, y_train)
        y_pred = clf.predict(X_test_scaled)
        train_acc = clf.score(X_train_scaled, y_train)
    else:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        train_acc = clf.score(X_train, y_train)
    
    test_acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    results.append({
        'Method': name,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })

# Display results
results_df = pd.DataFrame(results)
print("\nPerformance Comparison on Iris Dataset")
print("=" * 80)
print(results_df.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of test accuracies
axes[0].bar(results_df['Method'], results_df['Test Acc'], color='steelblue', edgecolor='black')
axes[0].set_ylabel('Test Accuracy', fontweight='bold')
axes[0].set_title('Test Accuracy Comparison', fontweight='bold')
axes[0].set_ylim([0.85, 1.0])
axes[0].set_xticklabels(results_df['Method'], rotation=15, ha='right')

# Grouped bar plot of all metrics
metrics = ['Precision', 'Recall', 'F1-Score']
x = np.arange(len(results_df))
width = 0.25

for i, metric in enumerate(metrics):
    axes[1].bar(x + i * width, results_df[metric], width, label=metric, edgecolor='black')

axes[1].set_ylabel('Score', fontweight='bold')
axes[1].set_title('Performance Metrics Comparison', fontweight='bold')
axes[1].set_xticks(x + width)
axes[1].set_xticklabels(results_df['Method'], rotation=15, ha='right')
axes[1].legend()
axes[1].set_ylim([0.85, 1.0])

plt.tight_layout()
plt.show()

In [None]:
# Display confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, clf) in enumerate(classifiers.items()):
    # Get predictions
    if 'KNN' in name:
        y_pred = clf.predict(X_test_scaled)
    else:
        y_pred = clf.predict(X_test)
    
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Plot
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=target_names, yticklabels=target_names,
                ax=axes[idx], cbar=True)
    axes[idx].set_xlabel('Predicted', fontweight='bold')
    axes[idx].set_ylabel('Actual', fontweight='bold')
    axes[idx].set_title(f'{name}', fontweight='bold')

plt.suptitle('Confusion Matrices', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

<a id='part6'></a>
## 7. Part 6: Advanced Topics

### Text Classification with Naive Bayes

Naive Bayes is particularly effective for text classification tasks like spam detection.
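
To see what a multinomial Naive Bayes model does with word counts, here is a rough pure-Python sketch. The toy corpus and the helper names `log_score` and `classify` are hypothetical. It uses add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` in scikit-learn's `MultinomialNB`, and it ignores words outside the training vocabulary, as `CountVectorizer.transform` does.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (text, label) with 1 = spam, 0 = ham
docs = [
    ("free prize now", 1),
    ("meeting at noon", 0),
    ("free money", 1),
    ("lunch at noon", 0),
]

vocab = sorted({w for text, _ in docs for w in text.split()})
word_counts = {0: Counter(), 1: Counter()}  # per-class word frequencies
doc_counts = Counter()                      # per-class document counts
for text, label in docs:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_score(text, label, alpha=1.0):
    """log P(label) + sum over words of log P(word | label), with smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(docs))
    for w in text.split():
        if w in vocab:  # skip words never seen in training
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
    return score

def classify(text):
    return max((0, 1), key=lambda c: log_score(text, c))

print(classify("free prize"))     # spam-like words
print(classify("lunch at noon"))  # ham-like words
```

Smoothing matters here because a single word with zero count in a class would otherwise force that class's probability to zero.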

In [None]:
# Simple text classification example
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample email dataset (spam/ham)
emails = [
    "Free money now!!!",
    "Hi Bob, how was your weekend?",
    "Win a lottery jackpot!",
    "Meeting at 3pm tomorrow",
    "Claim your prize immediately",
    "Lunch next week?",
    "URGENT: Act now for free prize",
    "Can you send the report?"
]

labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

# Vectorize text
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(emails)

# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_text, labels)

# Test predictions
test_emails = [
    "Free prize waiting for you",
    "See you at the meeting"
]

X_test_text = vectorizer.transform(test_emails)
predictions = mnb.predict(X_test_text)
probabilities = mnb.predict_proba(X_test_text)

print("Text Classification Demo (Spam Detection)")
print("=" * 50)
for email, pred, prob in zip(test_emails, predictions, probabilities):
    label = "SPAM" if pred == 1 else "HAM"
    confidence = prob[pred]
    print(f"\nEmail: \"{email}\"")
    print(f"Prediction: {label} (confidence: {confidence:.2%})")

### Ensemble Methods Preview: Random Forest

Random Forest combines multiple decision trees to improve performance.
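
The combining step can be illustrated with a simplified hard-vote sketch. The per-tree predictions below are invented, and `majority_vote` is a hypothetical helper. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is only an approximation of the idea.

```python
import numpy as np

# Hypothetical predictions from 5 trees (rows) for 4 samples (columns)
tree_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
    [0, 1, 2, 1],
])

def majority_vote(preds, n_classes):
    """Return the most-voted class for each sample (column)."""
    # votes[s, c] = number of trees that predicted class c for sample s
    votes = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return votes.argmax(axis=1)

ensemble_pred = majority_vote(tree_preds, n_classes=3)
print(ensemble_pred)
```

Individual trees disagree on some samples, but their errors fall on different samples, so the vote recovers a consistent answer.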

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Compare single tree vs Random Forest
dt_single = DecisionTreeClassifier(max_depth=3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

dt_single.fit(X_train, y_train)
rf.fit(X_train, y_train)

print("Single Decision Tree vs Random Forest")
print("=" * 50)
print(f"Decision Tree - Test Accuracy: {dt_single.score(X_test, y_test):.3f}")
print(f"Random Forest  - Test Accuracy: {rf.score(X_test, y_test):.3f}")
print(f"\nImprovement: {(rf.score(X_test, y_test) - dt_single.score(X_test, y_test)):.3f}")

print("\nNote: Random Forest typically provides more robust predictions")
print("by averaging multiple trees. This will be covered in detail in")
print("the Ensemble Methods module!")

<a id='challenge'></a>
## 8. Student Challenge

### Challenge Tasks

Apply what you've learned to the **Wine dataset**. Complete the following tasks:

1. **Load and explore** the Wine dataset
2. **Split** the data into train/test sets (70/30 split)
3. **Train** all three classifiers (Naive Bayes, KNN, Decision Tree)
4. **Tune** hyperparameters using cross-validation:
   - Find optimal k for KNN
   - Find optimal max_depth for Decision Tree
5. **Compare** performance using accuracy, precision, recall, F1-score
6. **Visualize** confusion matrices for all three methods
7. **Interpret** which method works best and why

**Bonus Challenge:**
- Try the Breast Cancer dataset instead
- Experiment with feature selection
- Create ROC curves for binary classification

In [None]:
# Task 1: Load and explore Wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

# YOUR CODE HERE
print("Wine Dataset Overview:")
print(f"Number of samples: {X_wine.shape[0]}")
print(f"Number of features: {X_wine.shape[1]}")
print(f"Number of classes: {len(np.unique(y_wine))}")
print(f"\nFeature names: {wine.feature_names[:5]}...")  # Show first 5
print(f"Target names: {list(wine.target_names)}")

In [None]:
# Tasks 2-7: YOUR CODE HERE
# Split data, train classifiers, tune parameters, compare performance

# Hint for Task 2: Use train_test_split with test_size=0.3
# Hint for Task 3: Don't forget to scale data for KNN!
# Hint for Task 4: Use GridSearchCV or loop over parameter values
# Hint for Task 5: Use the metrics functions from sklearn.metrics
# Hint for Task 6: Use confusion_matrix and sns.heatmap

pass  # Remove this and add your code

<a id='solutions'></a>
## 9. Solutions

### Challenge Solutions

In [None]:
# Solution to Student Challenge

# Task 1: Already done above

# Task 2: Split data
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42, stratify=y_wine
)

# Scale for KNN
scaler_wine = StandardScaler()
X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
X_test_wine_scaled = scaler_wine.transform(X_test_wine)

# Task 3 & 4: Train and tune classifiers
wine_classifiers = {}

# Naive Bayes (no tuning needed)
wine_classifiers['Naive Bayes'] = GaussianNB()
wine_classifiers['Naive Bayes'].fit(X_train_wine, y_train_wine)

# KNN with optimal k
k_range = range(1, 21)
k_scores = []
for k in k_range:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_temp, X_train_wine_scaled, y_train_wine, cv=5)
    k_scores.append(scores.mean())

optimal_k_wine = k_range[np.argmax(k_scores)]
print(f"Optimal k for Wine dataset: {optimal_k_wine}")

wine_classifiers['KNN'] = KNeighborsClassifier(n_neighbors=optimal_k_wine)
wine_classifiers['KNN'].fit(X_train_wine_scaled, y_train_wine)

# Decision Tree with optimal depth
depth_range = [2, 3, 5, 7, 10, 15]
depth_scores = []
for depth in depth_range:
    dt_temp = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(dt_temp, X_train_wine, y_train_wine, cv=5)
    depth_scores.append(scores.mean())

optimal_depth_wine = depth_range[np.argmax(depth_scores)]
print(f"Optimal depth for Wine dataset: {optimal_depth_wine}")

wine_classifiers['Decision Tree'] = DecisionTreeClassifier(
    max_depth=optimal_depth_wine, random_state=42
)
wine_classifiers['Decision Tree'].fit(X_train_wine, y_train_wine)

# Task 5: Compare performance
print("\n" + "="*70)
print("Performance on Wine Dataset")
print("="*70)

wine_results = []
for name, clf in wine_classifiers.items():
    # Use appropriate data (scaled for KNN)
    if name == 'KNN':
        y_pred = clf.predict(X_test_wine_scaled)
        train_acc = clf.score(X_train_wine_scaled, y_train_wine)
    else:
        y_pred = clf.predict(X_test_wine)
        train_acc = clf.score(X_train_wine, y_train_wine)
    
    test_acc = accuracy_score(y_test_wine, y_pred)
    precision = precision_score(y_test_wine, y_pred, average='weighted')
    recall = recall_score(y_test_wine, y_pred, average='weighted')
    f1 = f1_score(y_test_wine, y_pred, average='weighted')
    
    wine_results.append({
        'Method': name,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })

wine_results_df = pd.DataFrame(wine_results)
print(wine_results_df.to_string(index=False))

# Task 6: Confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, clf) in enumerate(wine_classifiers.items()):
    if name == 'KNN':
        y_pred = clf.predict(X_test_wine_scaled)
    else:
        y_pred = clf.predict(X_test_wine)
    
    cm = confusion_matrix(y_test_wine, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=wine.target_names,
                yticklabels=wine.target_names,
                ax=axes[idx])
    axes[idx].set_xlabel('Predicted', fontweight='bold')
    axes[idx].set_ylabel('Actual', fontweight='bold')
    axes[idx].set_title(f'{name}', fontweight='bold')

plt.suptitle('Confusion Matrices - Wine Dataset', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Task 7: Interpretation
print("\nInterpretation:")
print("-" * 70)
best_method = wine_results_df.loc[wine_results_df['Test Acc'].idxmax(), 'Method']
best_acc = wine_results_df['Test Acc'].max()
print(f"Best performing method: {best_method}")
print(f"Test accuracy: {best_acc:.3f}")
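print("\nFactors worth checking against the results above:")
print("- Feature scaling matters for KNN because Wine features span very different ranges")
print("- Decision trees can capture feature interactions but may overfit small datasets")
print("- Naive Bayes assumes conditionally independent, Gaussian-distributed features")
print("- The test set is small, so minor accuracy gaps may not be meaningful")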

<a id='summary'></a>
## 10. Summary & Next Steps

### Key Takeaways

#### 1. Classification Methods
- **Naive Bayes:** Fast probabilistic method, assumes independence
- **K-Nearest Neighbors:** Instance-based, non-parametric, needs scaling
- **Decision Trees:** Interpretable rules, handles mixed types, prone to overfitting

#### 2. Method Selection
- Choose based on: data size, dimensionality, interpretability needs
- Always try multiple methods and compare
- Use cross-validation for hyperparameter tuning

#### 3. Best Practices
- Scale features for distance-based methods (KNN)
- Use appropriate metrics (not just accuracy)
- Visualize decision boundaries when possible
- Check for overfitting (train vs test performance)

### Next Steps

1. **Practice:** Apply to different datasets (UCI ML Repository)
2. **Read:** 
   - "The Elements of Statistical Learning" (Hastie et al.)
   - "Pattern Recognition and Machine Learning" (Bishop)
3. **Explore:**
   - Support Vector Machines (SVM)
   - Ensemble Methods (Random Forest, Boosting)
   - Neural Networks
4. **Build:** Create end-to-end ML pipeline with preprocessing, model selection, evaluation
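
One possible starting point for such a pipeline is sketched below, assuming scikit-learn's `Pipeline` and `GridSearchCV` on the Wine dataset. The step names `"scale"` and `"knn"` and the candidate values for `n_neighbors` are arbitrary choices for illustration. Putting the scaler inside the pipeline means it is re-fit on each cross-validation training fold, so no information from the validation fold leaks into preprocessing.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Preprocessing and model as a single estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Model selection: 5-fold cross-validation over k on the training set only
search = GridSearchCV(
    pipe, param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9]}, cv=5
)
search.fit(X_tr, y_tr)

# Evaluation: the held-out test set is touched once, at the end
best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te, y_te)
print(f"Best k: {best_k}, test accuracy: {test_accuracy:.3f}")
```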

### Additional Resources

- **Scikit-learn Documentation:** https://scikit-learn.org/stable/
- **Kaggle Learn:** https://www.kaggle.com/learn
- **UCI ML Repository:** https://archive.ics.uci.edu/ml/
- **StatQuest Videos:** https://www.youtube.com/c/joshstarmer

---

### Workshop Complete! 🎉

You now have hands-on experience with three fundamental classification methods. Keep practicing and exploring!

In [None]:
# Final summary visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

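# Compare both datasets
combined_results = pd.concat([
    results_df.assign(Dataset='Iris'),
    wine_results_df.assign(Dataset='Wine')
], ignore_index=True)

# Strip hyperparameter suffixes such as " (k=5)" so each method gets one row
# in the pivot tables below instead of separate rows padded with NaN
combined_results['Method'] = combined_results['Method'].str.replace(
    r'\s*\(.*\)$', '', regex=True
)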

# Test accuracy comparison
pivot_acc = combined_results.pivot(index='Method', columns='Dataset', values='Test Acc')
pivot_acc.plot(kind='bar', ax=axes[0], color=['steelblue', 'orange'], edgecolor='black')
axes[0].set_ylabel('Test Accuracy', fontweight='bold')
axes[0].set_title('Test Accuracy: Iris vs Wine', fontweight='bold')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=15, ha='right')
axes[0].legend(title='Dataset')
axes[0].set_ylim([0.8, 1.0])

# F1-score comparison
pivot_f1 = combined_results.pivot(index='Method', columns='Dataset', values='F1-Score')
pivot_f1.plot(kind='bar', ax=axes[1], color=['steelblue', 'orange'], edgecolor='black')
axes[1].set_ylabel('F1-Score', fontweight='bold')
axes[1].set_title('F1-Score: Iris vs Wine', fontweight='bold')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=15, ha='right')
axes[1].legend(title='Dataset')
axes[1].set_ylim([0.8, 1.0])

plt.suptitle('Final Performance Comparison', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("🎉 Workshop Complete! Great job!")
print("="*70)
print("\nYou've successfully learned and implemented:")
print("  ✓ Naive Bayes Classification")
print("  ✓ K-Nearest Neighbors")
print("  ✓ Decision Trees")
print("  ✓ Performance Evaluation")
print("  ✓ Model Comparison")
print("\nKeep practicing and exploring! 🚀")