# Support Vector Machines (SVM)

## Overview

**Support Vector Machines (SVM)** are powerful supervised learning models for classification and regression that find the optimal decision boundary (hyperplane) separating different classes.

### Core Concept

*"Find the hyperplane that maximally separates the classes"*

### Key Ideas

1. **Maximum Margin**: SVM finds the decision boundary with the largest distance to the nearest data points (support vectors)
2. **Kernel Trick**: Implicitly map data to a higher-dimensional feature space to handle non-linear patterns
3. **Support Vectors**: Only boundary points matter for defining the decision surface
4. **Soft Margin**: Allow some misclassifications to handle noisy data

## Mathematical Foundation

### Linear SVM (Binary Classification)

**Goal**: Find hyperplane \(w^T x + b = 0\) that maximizes margin.

**Decision Function**:
\[
f(x) = \text{sign}(w^T x + b)
\]

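**Margin**: Width of the band between the two margin hyperplanes \(w^T x + b = \pm 1\); the nearest points lie at distance \(1/||w||\) from the decision boundary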
\[
\text{margin} = \frac{2}{||w||}
\]
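
As a quick numerical check of the margin formula, the sketch below fits a linear `SVC` and reads the margin off the learned weights. The toy blobs and the large `C` (used here to approximate the hard-margin case) are illustrative assumptions, not part of the experiments later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs (illustrative data)
X, y = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.8, random_state=0)

# A large C approximates the hard-margin problem on separable data
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_.ravel()
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines
half_margin = 1.0 / np.linalg.norm(w)    # decision boundary to nearest point

# Geometric distance of each support vector to the decision boundary
sv_dist = np.abs(clf.decision_function(clf.support_vectors_)) / np.linalg.norm(w)

print(f"margin width 2/||w||: {margin_width:.3f}")
print(f"half margin 1/||w||:  {half_margin:.3f}")
print(f"support-vector distances: {np.round(sv_dist, 3)}")
```

On separable data the support vectors should sit at distance \(1/||w||\) from the boundary, up to solver tolerance.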

**Hard Margin Optimization**:
\[
\min_{w,b} \frac{1}{2}||w||^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \quad \forall i
\]

**Soft Margin (with slack variables \(\xi_i\))**:
\[
\min_{w,b,\xi} \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i
\]
\[
\text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0
\]

where:
- \(C\) = penalty on margin violations (inversely related to regularization strength; a larger \(C\) generally yields a narrower margin)
- \(\xi_i\) = slack variables (allow margin violations and misclassifications)
- Large \(C\) → hard margin (less tolerance)
- Small \(C\) → soft margin (more tolerance)
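
To make the role of the slack variables concrete, the following sketch recovers \(\xi_i = \max(0, 1 - y_i(w^T x_i + b))\) from a fitted linear `SVC` and evaluates the soft-margin objective for a few values of \(C\). The overlapping blobs and the chosen `C` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs so that some slack is required (illustrative data)
X, y01 = make_blobs(n_samples=100, centers=[[-1, -1], [1, 1]],
                    cluster_std=1.2, random_state=0)
y = np.where(y01 == 1, 1, -1)  # labels in {-1, +1}, as in the formulation

norms, slack_sums = [], []
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    f = clf.decision_function(X)          # w^T x + b
    xi = np.maximum(0.0, 1.0 - y * f)     # slack of each training point
    w = clf.coef_.ravel()
    objective = 0.5 * w @ w + C * xi.sum()
    norms.append(np.linalg.norm(w))
    slack_sums.append(xi.sum())
    print(f"C={C:<6} ||w||={norms[-1]:.3f}  "
          f"sum(xi)={slack_sums[-1]:.2f}  objective={objective:.2f}")
```

As \(C\) grows, the total slack should shrink while \(||w||\) grows, which corresponds to a narrower margin.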

### Kernel Trick

**Idea**: Map data to a higher-dimensional space where it may become linearly separable.

\[
\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D \quad (D \gg d)
\]

**Kernel Function**: Compute inner product in high-dimensional space without explicit mapping
\[
K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
\]
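
A small worked example may help here. For 2-D inputs, the homogeneous degree-2 polynomial kernel \((x^T z)^2\) corresponds to an explicit 3-D feature map. The helper names `phi` and `k_poly2` below are hypothetical and chosen only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x^T z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_poly2(x, z):
    """The same inner product, computed directly in the input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # inner product in the 3-D feature space
implicit = k_poly2(x, z)     # kernel evaluated on the 2-D inputs

print(f"phi(x) . phi(z) = {explicit:.6f}")
print(f"K(x, z)         = {implicit:.6f}")
```

Both numbers agree, but the kernel never builds the higher-dimensional vectors. That saving is what makes very high-dimensional (or infinite-dimensional) feature spaces practical.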

**Common Kernels**:

1. **Linear**: \(K(x_i, x_j) = x_i^T x_j\)

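2. **Polynomial**: \(K(x_i, x_j) = (\gamma x_i^T x_j + r)^p\), where \(p\) is the polynomial degree (`degree` in scikit-learn) and \(r\) is the constant term (`coef0`)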

3. **RBF (Gaussian)**: \(K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)\)

4. **Sigmoid**: \(K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)\)
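
The four formulas can be checked against scikit-learn's pairwise kernel helpers. In this sketch the values of `gamma`, `r`, and `degree` and the random data are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel
)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma, r, degree = 0.5, 1.0, 3   # illustrative values

gram = X @ X.T                                                  # x_i^T x_j
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # ||x_i - x_j||^2

K_linear = gram
K_poly = (gamma * gram + r) ** degree
K_rbf = np.exp(-gamma * sq_dists)
K_sigmoid = np.tanh(gamma * gram + r)

checks = {
    "linear": np.allclose(K_linear, linear_kernel(X)),
    "poly": np.allclose(K_poly, polynomial_kernel(X, degree=degree, gamma=gamma, coef0=r)),
    "rbf": np.allclose(K_rbf, rbf_kernel(X, gamma=gamma)),
    "sigmoid": np.allclose(K_sigmoid, sigmoid_kernel(X, gamma=gamma, coef0=r)),
}
print(checks)
```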

## Topics Covered

1. Linear SVM theory and implementation
2. Kernel methods and the kernel trick
3. Hyperparameter tuning (C, gamma, kernel)
4. Feature scaling importance
5. SVC vs LinearSVC
6. Multi-class classification
7. SVR (Support Vector Regression)
8. Practical applications and best practices

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# SVM models
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR

# Other models for comparison
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Utilities
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV,
    learning_curve, validation_curve
)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, make_scorer
)
from sklearn.datasets import (
    make_classification, make_circles, make_moons,
    load_breast_cancer, load_wine, load_diabetes
)

np.random.seed(42)
sns.set_style('whitegrid')
print("✓ Libraries imported successfully")

## 1. Linear SVM Intuition

### 1.1 Visualizing the Maximum Margin Concept

In [None]:
# Generate linearly separable data
from sklearn.datasets import make_blobs

X_linear, y_linear = make_blobs(n_samples=100, n_features=2, centers=2, 
                                cluster_std=1.0, center_box=(-5, 5), random_state=42)

print("Linear SVM Demonstration")
print("="*70)
print(f"Samples: {X_linear.shape[0]}")
print(f"Features: {X_linear.shape[1]}")
print(f"Classes: {np.unique(y_linear)}")

# Train linear SVM
svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_linear, y_linear)

print(f"\nSupport Vectors: {svm_linear.n_support_}")
print(f"Total: {svm_linear.support_vectors_.shape[0]} out of {X_linear.shape[0]} samples")
print(f"Percentage: {svm_linear.support_vectors_.shape[0]/X_linear.shape[0]*100:.1f}%")

In [None]:
def plot_svm_decision_boundary(model, X, y, title="SVM Decision Boundary"):
    """Plot SVM decision boundary with margins."""
    plt.figure(figsize=(10, 7))
    
    # Create mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    
    # Plot margins (if available)
    if hasattr(model, 'decision_function'):
        decision_values = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
        decision_values = decision_values.reshape(xx.shape)
        
        # Decision boundary and margins
        plt.contour(xx, yy, decision_values, colors='k', levels=[-1, 0, 1],
                   alpha=0.5, linestyles=['--', '-', '--'])
    
    # Plot data points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=50)
    
    # Highlight support vectors
    if hasattr(model, 'support_vectors_'):
        plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
                   s=200, linewidth=2, facecolors='none', edgecolors='green',
                   label='Support Vectors')
    
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
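    # Only draw a legend when a labeled artist (the support vectors) exists
    if hasattr(model, 'support_vectors_'):
        plt.legend()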
    plt.tight_layout()
    plt.show()

# Plot
plot_svm_decision_boundary(svm_linear, X_linear, y_linear, 
                          "Linear SVM: Maximum Margin Classifier")

print("\n💡 Key Observations:")
print("   - Solid line = Decision boundary (w^T x + b = 0)")
print("   - Dashed lines = Margins (w^T x + b = ±1)")
print("   - Green circles = Support vectors (only these points define the boundary)")
print("   - SVM maximizes the distance between margins")
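
To see why only the support vectors define the boundary, the sketch below rebuilds \(w\) and the decision values from `dual_coef_`, `support_vectors_`, and `intercept_` alone. It regenerates similar blob data so that it runs on its own; the data settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=42)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for each support vector
alpha_y = clf.dual_coef_.ravel()
sv = clf.support_vectors_

w = alpha_y @ sv                       # w = sum_i alpha_i y_i x_i
manual = X @ w + clf.intercept_[0]     # w^T x + b for every sample

w_matches = np.allclose(w, clf.coef_.ravel())
f_matches = np.allclose(manual, clf.decision_function(X))
print(f"support vectors used: {len(sv)} of {len(X)}")
print(f"w matches coef_: {w_matches}, f matches decision_function: {f_matches}")
```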

### 1.2 Effect of C Parameter (Regularization)

In [None]:
print("Effect of C Parameter")
print("="*70)
print("C = Penalty on margin violations (inversely related to regularization strength)")
print("  - Large C: Hard margin (less tolerance for errors)")
print("  - Small C: Soft margin (more tolerance for errors)\n")

# Test different C values
C_values = [0.01, 1.0, 100.0]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_linear, y_linear)
    
    # Create mesh
    h = 0.02
    x_min, x_max = X_linear[:, 0].min() - 1, X_linear[:, 0].max() + 1
    y_min, y_max = X_linear[:, 1].min() - 1, X_linear[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    
    # Decision boundary
    decision_values = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    decision_values = decision_values.reshape(xx.shape)
    axes[idx].contour(xx, yy, decision_values, colors='k', levels=[-1, 0, 1],
                     alpha=0.5, linestyles=['--', '-', '--'])
    
    axes[idx].scatter(X_linear[:, 0], X_linear[:, 1], c=y_linear, 
                     cmap='RdYlBu', edgecolors='k', s=50)
    axes[idx].scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
                     s=200, linewidth=2, facecolors='none', edgecolors='green')
    
    axes[idx].set_title(f'C = {C}\n{svm.n_support_.sum()} Support Vectors')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   - Small C (0.01): Wide margin, many support vectors, may underfit")
print("   - Medium C (1.0): Balanced margin")
print("   - Large C (100.0): Narrow margin, few support vectors, may overfit")

## 2. The Kernel Trick

### 2.1 Non-Linear Data: When Linear SVM Fails

In [None]:
# Generate non-linearly separable data
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)

print("Non-Linear Datasets")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Circles
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, 
               cmap='RdYlBu', edgecolors='k', s=50)
axes[0].set_title('Concentric Circles (Non-Linearly Separable)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Moons
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, 
               cmap='RdYlBu', edgecolors='k', s=50)
axes[1].set_title('Half Moons (Non-Linearly Separable)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n💡 These patterns cannot be separated by a straight line!")
print("   Solution: Use kernel trick to map to higher dimensions")
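
Before turning to kernels, it can help to do the mapping by hand once. The sketch below appends the squared radius as a third feature, which is a hand-crafted lift chosen for this dataset rather than what a kernel does internally, and compares a linear SVM on the raw and lifted features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)

# Hand-crafted lift: add x1^2 + x2^2 as a third feature
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, r2])

# Training accuracy of a linear SVM in each feature space
raw_acc = SVC(kernel="linear", C=1.0).fit(X, y).score(X, y)
lifted_acc = SVC(kernel="linear", C=1.0).fit(X_lifted, y).score(X_lifted, y)

print(f"linear SVM, raw 2-D features:    {raw_acc:.3f}")
print(f"linear SVM, lifted 3-D features: {lifted_acc:.3f}")
```

In the lifted space the two rings sit at different heights, so a plane can separate them.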

### 2.2 RBF (Gaussian) Kernel

In [None]:
print("RBF Kernel Demonstration")
print("="*70)
print("RBF Kernel: K(x, x') = exp(-gamma * ||x - x'||^2)")
print("  - Corresponds to an implicit infinite-dimensional feature space")
print("  - gamma controls the 'reach' of a single training example\n")

# Compare linear vs RBF kernel on circles dataset
kernels = ['linear', 'rbf']
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for idx, kernel in enumerate(kernels):
    svm = SVC(kernel=kernel, C=1.0, gamma='scale')
    svm.fit(X_circles, y_circles)
    
    # Create mesh
    h = 0.02
    x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
    y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles,
                     cmap='RdYlBu', edgecolors='k', s=50)
    
    accuracy = svm.score(X_circles, y_circles)
    axes[idx].set_title(f'{kernel.upper()} Kernel\nAccuracy: {accuracy:.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n💡 RBF kernel captures the circular pattern; the linear kernel cannot")
print("   Note: the accuracy in the titles is measured on the training data")

### 2.3 Effect of Gamma Parameter (RBF Kernel)

In [None]:
print("Effect of Gamma Parameter (RBF Kernel)")
print("="*70)
print("gamma = Kernel coefficient")
print("  - Large gamma: High curvature, narrow decision boundaries (may overfit)")
print("  - Small gamma: Low curvature, smooth decision boundaries\n")

gamma_values = [0.1, 1.0, 10.0]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, gamma in enumerate(gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_moons, y_moons)
    
    h = 0.02
    x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
    y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons,
                     cmap='RdYlBu', edgecolors='k', s=50)
    
    accuracy = svm.score(X_moons, y_moons)
    axes[idx].set_title(f'gamma = {gamma}\nAccuracy: {accuracy:.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n💡 Gamma controls model complexity:")
print("   - Too small: Underfitting (overly smooth)")
print("   - Too large: Overfitting (too wiggly)")
print("   - Use cross-validation to find optimal gamma")
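
One way to follow the cross-validation advice is `validation_curve`, which scores a range of `gamma` values on held-out folds. The gamma range and 5-fold setting in this sketch are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=42)
gammas = np.logspace(-2, 3, 6)   # 0.01 ... 1000 (illustrative range)

train_scores, val_scores = validation_curve(
    SVC(kernel="rbf", C=1.0), X, y,
    param_name="gamma", param_range=gammas, cv=5,
)

# Average over the 5 folds for each gamma
for g, tr, va in zip(gammas, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"gamma={g:<8g} train={tr:.3f}  cv={va:.3f}")

best_gamma = gammas[val_scores.mean(axis=1).argmax()]
print(f"best gamma by CV accuracy: {best_gamma:g}")
```

A growing gap between the training and cross-validation scores at large gamma is the usual sign of overfitting.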

### 2.4 Comparing All Kernels

In [None]:
print("Kernel Comparison on Moons Dataset")
print("="*70)

kernels = ['linear', 'poly', 'rbf', 'sigmoid']
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, kernel in enumerate(kernels):
    # Special parameters for polynomial
    if kernel == 'poly':
        svm = SVC(kernel=kernel, C=1.0, degree=3, gamma='scale')
    else:
        svm = SVC(kernel=kernel, C=1.0, gamma='scale')
    
    svm.fit(X_moons, y_moons)
    
    h = 0.02
    x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
    y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons,
                     cmap='RdYlBu', edgecolors='k', s=50)
    
    accuracy = svm.score(X_moons, y_moons)
    n_sv = svm.support_vectors_.shape[0]
    axes[idx].set_title(f'{kernel.upper()} Kernel\nAccuracy: {accuracy:.3f}, Support Vectors: {n_sv}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\nKernel Summary:")
print("  Linear: Fast, interpretable, good for linearly separable data")
print("  Polynomial: Good for polynomial relationships, needs degree tuning")
print("  RBF: Most versatile, good default choice, needs C and gamma tuning")
print("  Sigmoid: Similar to neural networks, rarely used")

## 3. Importance of Feature Scaling

### 3.1 SVM with and without Scaling

In [None]:
# Create dataset with different scales
np.random.seed(42)
X_unscaled = np.random.randn(200, 2)
X_unscaled[:, 0] *= 10  # First feature has large scale
X_unscaled[:, 1] *= 0.1  # Second feature has small scale
y_unscaled = (X_unscaled[:, 0] + X_unscaled[:, 1] > 0).astype(int)

print("Importance of Feature Scaling for SVM")
print("="*70)
print(f"Feature 1 range: [{X_unscaled[:, 0].min():.2f}, {X_unscaled[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X_unscaled[:, 1].min():.2f}, {X_unscaled[:, 1].max():.2f}]")
print(f"\nFeature 1 is ~100x larger scale than Feature 2!\n")

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_unscaled)

# Train SVMs
svm_unscaled = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_scaled = SVC(kernel='rbf', C=1.0, gamma='scale')

svm_unscaled.fit(X_unscaled, y_unscaled)
svm_scaled.fit(X_scaled, y_unscaled)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, (X, svm, title) in enumerate([
    (X_unscaled, svm_unscaled, 'Without Scaling'),
    (X_scaled, svm_scaled, 'With Scaling')
]):
    h = 0.02 if idx == 1 else 0.5
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X[:, 0], X[:, 1], c=y_unscaled,
                     cmap='RdYlBu', edgecolors='k', s=50)
    
    accuracy = svm.score(X, y_unscaled)
    axes[idx].set_title(f'{title}\nAccuracy: {accuracy:.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

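print("\n💡 KEY LESSON: Scale features before fitting an SVM!")
print("   SVM is sensitive to feature scales because:")
print("   - Kernels such as RBF depend on Euclidean distances between points")
print("   - Large-scale features dominate the distance calculation")
print("   - Scaling puts all features on a comparable footing")
print("   Note: here the label is driven almost entirely by Feature 1, so the")
print("   accuracy gap may be small; compare the shape of the decision regions")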
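
The effect of scaling is often clearer on real data with mixed units. The sketch below is an illustration under assumptions (the wine dataset and 5-fold cross-validation). It compares an RBF `SVC` on raw features with the same model inside a `Pipeline` that standardizes first.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

raw_model = SVC(kernel="rbf", C=1.0, gamma="scale")
scaled_model = make_pipeline(StandardScaler(),
                             SVC(kernel="rbf", C=1.0, gamma="scale"))

# The pipeline refits the scaler inside every CV training fold
raw_scores = cross_val_score(raw_model, X, y, cv=5)
scaled_scores = cross_val_score(scaled_model, X, y, cv=5)

print(f"CV accuracy without scaling: {raw_scores.mean():.3f}")
print(f"CV accuracy with scaling:    {scaled_scores.mean():.3f}")
```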

## 4. Real-World Classification: Breast Cancer Detection

### 4.1 Dataset Exploration and Preprocessing

In [None]:
# Load dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

print("Breast Cancer Wisconsin Dataset")
print("="*70)
print(f"Samples: {X_cancer.shape[0]}")
print(f"Features: {X_cancer.shape[1]}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: Malignant={np.sum(y_cancer==0)}, Benign={np.sum(y_cancer==1)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

# Scale features (CRITICAL for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\n✓ Features scaled using StandardScaler")

In [None]:
# Train SVM with RBF kernel
print("\nTraining SVM (RBF Kernel)...")
print("="*70)

svm_cancer = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)

start_time = time()
svm_cancer.fit(X_train_scaled, y_train)
train_time = time() - start_time

# Predictions
y_pred = svm_cancer.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Training time: {train_time:.3f}s")
print(f"Accuracy: {accuracy:.4f}")
print(f"Support vectors: {svm_cancer.n_support_}")
print(f"Total support vectors: {svm_cancer.support_vectors_.shape[0]} / {X_train.shape[0]}")
print(f"Percentage: {svm_cancer.support_vectors_.shape[0]/X_train.shape[0]*100:.1f}%")

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

### 4.2 Hyperparameter Tuning with Grid Search

In [None]:
print("Hyperparameter Tuning for SVM")
print("="*70)

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf', 'poly']
}

print("Parameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

print(f"\nTotal combinations: {len(param_grid['C']) * len(param_grid['gamma']) * len(param_grid['kernel'])}")
print("Performing Grid Search with 5-fold CV...\n")

# Grid search
grid_search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

start_time = time()
grid_search.fit(X_train_scaled, y_train)
grid_time = time() - start_time

print(f"\nGrid Search completed in {grid_time:.2f}s")
print(f"\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest CV Score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_svm = grid_search.best_estimator_
y_pred_best = best_svm.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_best)

print(f"Test Set Accuracy: {test_accuracy:.4f}")
print(f"Improvement over default: {(test_accuracy - accuracy)*100:+.2f} percentage points")

In [None]:
# Visualize grid search results
results_df = pd.DataFrame(grid_search.cv_results_)

# Plot for RBF kernel
rbf_results = results_df[results_df['param_kernel'] == 'rbf']

# Create pivot table for heatmap
pivot_data = []
for gamma_val in param_grid['gamma']:
    if gamma_val not in ['scale', 'auto']:
        row_data = []
        for c_val in param_grid['C']:
            mask = (rbf_results['param_C'] == c_val) & (rbf_results['param_gamma'] == gamma_val)
            if mask.any():
                score = rbf_results[mask]['mean_test_score'].values[0]
                row_data.append(score)
        if row_data:
            pivot_data.append(row_data)

if pivot_data:
    pivot_array = np.array(pivot_data)
    
    plt.figure(figsize=(10, 6))
    sns.heatmap(pivot_array, annot=True, fmt='.3f', cmap='YlGnBu',
                xticklabels=param_grid['C'],
                yticklabels=[g for g in param_grid['gamma'] if g not in ['scale', 'auto']])
    plt.xlabel('C')
    plt.ylabel('gamma')
    plt.title('SVM Grid Search Results (RBF Kernel)\nCV Accuracy Scores')
    plt.tight_layout()
    plt.show()

print("\n💡 Heatmap shows CV accuracy for different C and gamma combinations")
print("   Darker blue = Better performance")

## 5. SVC vs LinearSVC

### Key Differences

| Aspect | SVC | LinearSVC |
|--------|-----|----------|
| Algorithm | libsvm (SMO) | liblinear |
| Kernels | All (linear, rbf, poly, sigmoid) | Linear only |
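| Scalability | O(n²) to O(n³) in the number of samples | Roughly O(n), much faster on large data |
| Best For | Small-medium datasets, non-linear | Large datasets, linear problems |
| Probability | Requires `probability=True` | Not built in (can be wrapped in `CalibratedClassifierCV`) |
| Multi-class | One-vs-One | One-vs-Rest |
| Default loss | Hinge | Squared hinge (the intercept is also regularized) |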
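
`LinearSVC` has no `predict_proba`, but a calibration wrapper can supply probabilities. The sketch below uses `CalibratedClassifierCV` with its default sigmoid calibration; the dataset, split, and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale + LinearSVC, then calibrate its decision scores into probabilities
base = make_pipeline(StandardScaler(),
                     LinearSVC(C=1.0, max_iter=10000, random_state=42))
calibrated = CalibratedClassifierCV(base, cv=5)
calibrated.fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)
print(proba[:3].round(3))
print(f"accuracy: {calibrated.score(X_te, y_te):.3f}")
```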

### 5.1 Performance Comparison

In [None]:
print("SVC vs LinearSVC Comparison")
print("="*70)

# Train both models
models = {
    'SVC (linear kernel)': SVC(kernel='linear', C=1.0, random_state=42),
    'LinearSVC': LinearSVC(C=1.0, random_state=42, max_iter=10000)
}

results = []

for name, model in models.items():
    # Train
    start_time = time()
    model.fit(X_train_scaled, y_train)
    train_time = time() - start_time
    
    # Predict
    start_time = time()
    y_pred = model.predict(X_test_scaled)
    predict_time = time() - start_time
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Train Time (s)': train_time,
        'Predict Time (s)': predict_time
    })
    
    print(f"{name:25} - Accuracy: {accuracy:.4f}, Train: {train_time:.3f}s")

results_df = pd.DataFrame(results)
print("\n" + results_df.to_string(index=False))

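print("\n💡 LinearSVC scales better for linear problems (the speed gap grows with dataset size)")
print("   Accuracy can differ slightly from SVC(kernel='linear'): LinearSVC uses squared hinge loss by default")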
print("   Use LinearSVC for:")
print("   - Large datasets (>10,000 samples)")
print("   - Text classification")
print("   - High-dimensional data")
print("   \n   Use SVC when:")
print("   - Non-linear kernel needed")
print("   - Small-medium datasets")
print("   - Need probability estimates")

## 6. Multi-class Classification

### 6.1 Wine Dataset (3 classes)

In [None]:
# Load wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

print("Multi-class Classification with SVM - Wine Dataset")
print("="*70)
print(f"Samples: {X_wine.shape[0]}")
print(f"Features: {X_wine.shape[1]}")
print(f"Classes: {wine.target_names}")
print(f"Class distribution: {np.bincount(y_wine)}")

# Split and scale
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

scaler_wine = StandardScaler()
X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
X_test_wine_scaled = scaler_wine.transform(X_test_wine)

# Train SVM
svm_wine = SVC(kernel='rbf', C=10, gamma='scale', random_state=42)
svm_wine.fit(X_train_wine_scaled, y_train_wine)

# Predict
y_pred_wine = svm_wine.predict(X_test_wine_scaled)

# Evaluate
accuracy_wine = accuracy_score(y_test_wine, y_pred_wine)

print(f"\nMulti-class SVM Accuracy: {accuracy_wine:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test_wine, y_pred_wine, target_names=wine.target_names))

# Confusion matrix
cm_wine = confusion_matrix(y_test_wine, y_pred_wine)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_wine, annot=True, fmt='d', cmap='Blues',
            xticklabels=wine.target_names,
            yticklabels=wine.target_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('SVM Multi-class Confusion Matrix')
plt.tight_layout()
plt.show()

print("\n💡 SVM handles multi-class automatically using One-vs-One strategy")
print(f"   Number of binary classifiers: {len(svm_wine.classes_) * (len(svm_wine.classes_) - 1) // 2}")
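
The two multi-class strategies show up in the shape of `decision_function`. The sketch below uses a synthetic 4-class problem (an assumption, chosen so that the number of class pairs differs from the number of classes).

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# SVC always trains one-vs-one internally; the shape option only changes the output
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr_view = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
lin = LinearSVC(max_iter=10000).fit(X, y)   # trains one-vs-rest

ovo_shape = ovo.decision_function(X).shape
ovr_shape = ovr_view.decision_function(X).shape
lin_shape = lin.decision_function(X).shape

print(f"SVC, ovo output: {ovo_shape}  (one column per class pair)")
print(f"SVC, ovr output: {ovr_shape}  (pairwise results aggregated per class)")
print(f"LinearSVC:       {lin_shape}  (one column per class)")
```

With \(K\) classes, one-vs-one needs \(K(K-1)/2\) binary classifiers, so 4 classes give 6 columns in the `ovo` output.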

## 7. Support Vector Regression (SVR)

### Concept

**SVR** fits a function that is as flat as possible while keeping deviations from the targets within \(\epsilon\); larger deviations are allowed but penalized through slack variables.

**Epsilon-insensitive loss**:
\[
|y - f(x)|_\epsilon = \begin{cases}
0 & \text{if } |y - f(x)| \leq \epsilon \\
|y - f(x)| - \epsilon & \text{otherwise}
\end{cases}
\]

**Epsilon tube**: Predictions within \(\epsilon\) of true value incur no penalty.
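
The loss is simple to write down directly. In this sketch the function name `epsilon_insensitive_loss` and the sample values are hypothetical, chosen for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the residual outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 1.0, 1.0, 1.0])
y_pred = np.array([1.0, 1.05, 1.5, 0.0])   # residuals: 0, 0.05, 0.5, 1.0

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first two predictions fall inside the tube and cost nothing; the other two are charged only for the part of the residual beyond \(\epsilon\).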

### 7.1 SVR Example

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target

print("Support Vector Regression - Diabetes Dataset")
print("="*70)
print(f"Samples: {X_diabetes.shape[0]}")
print(f"Features: {X_diabetes.shape[1]}")

# Split data
X_train_diab, X_test_diab, y_train_diab, y_test_diab = train_test_split(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42
)

# Scale (IMPORTANT for SVR too!)
scaler_diab = StandardScaler()
X_train_diab_scaled = scaler_diab.fit_transform(X_train_diab)
X_test_diab_scaled = scaler_diab.transform(X_test_diab)

# Train SVR with different kernels
svr_models = {
    'Linear': SVR(kernel='linear', C=1.0),
    'RBF': SVR(kernel='rbf', C=1.0, gamma='scale'),
    'Polynomial': SVR(kernel='poly', C=1.0, degree=3, gamma='scale')
}

svr_results = []

for name, svr in svr_models.items():
    svr.fit(X_train_diab_scaled, y_train_diab)
    y_pred = svr.predict(X_test_diab_scaled)
    
    mse = mean_squared_error(y_test_diab, y_pred)
    r2 = r2_score(y_test_diab, y_pred)
    
    svr_results.append({
        'Kernel': name,
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'R² Score': r2
    })
    
    print(f"{name:15} - R²: {r2:.4f}, RMSE: {np.sqrt(mse):.2f}")

svr_df = pd.DataFrame(svr_results)
print("\n" + svr_df.to_string(index=False))

print("\n💡 SVR applies the kernel trick to regression as well")
print("   No kernel is best everywhere: tune C and epsilon to the scale of the target")
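
Because `C` and `epsilon` are expressed in the units of the target, the defaults are rarely a good fit for a target like this one. The sketch below tunes both with a small grid; the grid values are assumptions rather than recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# make_pipeline names the SVR step "svr", hence the "svr__" prefixes below
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf", gamma="scale"))
param_grid = {
    "svr__C": [1, 10, 100, 1000],
    "svr__epsilon": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

print(f"best params: {search.best_params_}")
print(f"best CV R²:  {search.best_score_:.3f}")
print(f"test R²:     {search.score(X_te, y_te):.3f}")
```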

## 8. Best Practices and Guidelines

### 8.1 Decision Guide

In [None]:
print("SVM Decision Guide")
print("="*70)

guide = [
    {
        'Scenario': 'Linearly separable data',
        'Model': 'LinearSVC',
        'Reason': 'Fastest, simplest, works well'
    },
    {
        'Scenario': 'Non-linear patterns',
        'Model': 'SVC with RBF kernel',
        'Reason': 'Most flexible, handles complexity'
    },
    {
        'Scenario': 'Large dataset (>10k samples)',
        'Model': 'LinearSVC or SGDClassifier',
        'Reason': 'Better scalability'
    },
    {
        'Scenario': 'High-dimensional data',
        'Model': 'LinearSVC',
        'Reason': 'Often linear patterns emerge in high-D'
    },
    {
        'Scenario': 'Need probability estimates',
        'Model': 'SVC(probability=True)',
        'Reason': 'Enables predict_proba()'
    },
    {
        'Scenario': 'Regression task',
        'Model': 'SVR (kernel depends on data)',
        'Reason': 'Handles outliers well with epsilon-tube'
    },
]

guide_df = pd.DataFrame(guide)
print(guide_df.to_string(index=False))

### 8.2 Common Pitfalls and Solutions

In [None]:
print("\nCommon Pitfalls with SVM")
print("="*70)

pitfalls = [
    ("❌ Pitfall", "✓ Solution"),
    ("-" * 40, "-" * 40),
    ("Not scaling features", "Always use StandardScaler"),
    ("Using default C and gamma", "Grid search to find optimal values"),
    ("Wrong kernel choice", "Start with linear, then try RBF if needed"),
    ("Using SVC on large datasets", "Use LinearSVC for >10k samples"),
    ("Ignoring convergence warnings", "Increase max_iter parameter"),
    ("Not setting random_state", "Set for reproducibility"),
    ("Using probability=True by default", "Only use when needed (slower training)"),
    ("Forgetting to scale test data", "Use fitted scaler from training"),
    ("Overfitting with large gamma", "Use cross-validation to tune"),
]

for pitfall, solution in pitfalls:
    print(f"{pitfall:<45} {solution}")

## Summary and Quick Reference

### Quick Reference Code

```python
from sklearn.svm import SVC, SVR, LinearSVC
from sklearn.preprocessing import StandardScaler

# ALWAYS SCALE FEATURES FOR SVM!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ===== CLASSIFICATION =====

# Linear SVM (for linearly separable data)
svm_linear = LinearSVC(C=1.0, max_iter=10000, random_state=42)

# SVC with RBF kernel (most common choice)
svm_rbf = SVC(
    kernel='rbf',      # 'linear', 'poly', 'rbf', 'sigmoid'
    C=1.0,             # Regularization (smaller = softer margin)
    gamma='scale',     # Kernel coefficient ('scale', 'auto', or float)
    probability=False, # True for predict_proba (slower)
    random_state=42
)

# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0, gamma='scale')

# ===== REGRESSION =====

# SVR with RBF kernel
svr = SVR(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    epsilon=0.1  # Width of epsilon-insensitive tube
)

# Train and predict (pick any of the estimators defined above)
model = svm_rbf
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
```

### Key Hyperparameters

**C (Regularization)**:
- Controls margin width vs misclassification penalty
- Small C → Wider margin, more tolerance (may underfit)
- Large C → Narrower margin, less tolerance (may overfit)
- Typical range: [0.01, 0.1, 1, 10, 100]

**gamma (RBF/Poly kernel)**:
- Controls reach of single training example
- Small gamma → Far reach (smooth, may underfit)
- Large gamma → Close reach (complex, may overfit)
- Typical values: [0.001, 0.01, 0.1, 1, 'scale', 'auto']
- 'scale': 1 / (n_features * X.var())
- 'auto': 1 / n_features
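
The two string options can be reproduced by hand. The sketch below (data chosen arbitrarily) computes both values and checks that passing the `'scale'` value as a float gives the same model as `gamma='scale'`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

gamma_scale = 1.0 / (X.shape[1] * X.var())   # 'scale': uses the variance of all entries of X
gamma_auto = 1.0 / X.shape[1]                # 'auto': depends only on n_features

by_name = SVC(kernel="rbf", gamma="scale").fit(X, y)
by_value = SVC(kernel="rbf", gamma=gamma_scale).fit(X, y)

same = np.allclose(by_name.decision_function(X), by_value.decision_function(X))
print(f"gamma 'scale' = {gamma_scale:.4f}, gamma 'auto' = {gamma_auto:.4f}")
print(f"identical decision values: {same}")
```

Since `'scale'` depends on the variance of `X`, its value changes when the features are standardized, while `'auto'` does not.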

**kernel**:
- 'linear': Best for linearly separable, high-dimensional data
- 'rbf': Best default choice for non-linear
- 'poly': For polynomial relationships (specify degree)
- 'sigmoid': Rarely used (similar to neural net)

### Kernel Selection Strategy

```
1. Start with LinearSVC
   ├─ Good performance? → Done!
   └─ Poor performance? → Try RBF

2. Try SVC with RBF kernel
   ├─ Tune C and gamma with grid search
   └─ Usually best for non-linear problems

3. If RBF doesn't work well:
   ├─ Try polynomial kernel
   └─ Consider other algorithms (RF, XGBoost)
```

### Computational Complexity

| Model | Training | Prediction | Scalability |
|-------|----------|------------|-------------|
| LinearSVC | O(n × d) | O(d) | Excellent |
| SVC | O(n² × d) to O(n³ × d) | O(n_sv × d) | Poor for n>10k |

where:
- n = number of samples
- d = number of features
- n_sv = number of support vectors

### Feature Scaling Rules

✓ **Always scale** for SVM (use StandardScaler)
✓ Fit scaler on training data only
✓ Transform both train and test with same scaler
✓ For grid search, put the scaler inside a Pipeline so it is refit on each CV training fold
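
A minimal sketch of these rules combined, assuming the breast cancer dataset and a small illustrative grid. Step names such as `"scaler"` and `"svc"` are arbitrary labels that determine the `svc__` parameter prefix.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # fit only on the training part of each fold
    ("svc", SVC(kernel="rbf")),
])
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)                # raw features go in; the pipeline scales them

test_acc = grid.score(X_te, y_te)
print(f"best params: {grid.best_params_}")
print(f"test accuracy: {test_acc:.3f}")
```

Passing unscaled data to the pipeline keeps validation-fold statistics out of the scaler, which fitting a scaler once before the search does not.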

### SVM Strengths

✓ Effective in high-dimensional spaces
✓ Memory efficient (only stores support vectors)
✓ Versatile (different kernels)
✓ Works well with clear margin of separation
✓ Can resist overfitting in high dimensions when C and the kernel are chosen carefully

### SVM Limitations

✗ Poor scalability to large datasets (n > 10k)
✗ Sensitive to feature scaling
✗ No native probability estimates (probability=True adds them through an extra calibration step)
✗ Difficult to interpret (black box)
✗ Sensitive to noise (use soft margin)
✗ Slow training with RBF kernel

### When to Use SVM

✓ **Use SVM when:**
- Small to medium datasets (<10k samples)
- High-dimensional data
- Clear margin of separation
- Need robust classifier
- Text classification, image classification

✗ **Avoid SVM when:**
- Very large datasets (>100k samples)
- Need probability estimates (or use probability=True)
- Need interpretability
- Many noisy features

### Comparison with Other Algorithms

| Algorithm | Pros over SVM | Cons vs SVM |
|-----------|---------------|-------------|
| Logistic Regression | Faster, probabilistic, interpretable | Only linear |
| Random Forest | Handles non-linearity, often faster to train, no scaling needed | No explicit margin maximization |
| XGBoost | Often stronger on large tabular datasets, scales better | More hyperparameters, no margin-based theory |
| Neural Networks | More flexible, scalable | Needs more data, harder to tune |

### Further Reading

- **Foundational Book**: Vapnik (1995) - "The Nature of Statistical Learning Theory"
- **Tutorial**: "A Practical Guide to Support Vector Classification" by Hsu, Chang & Lin
- **Book**: "Learning with Kernels" by Schölkopf & Smola
- **sklearn Documentation**: https://scikit-learn.org/stable/modules/svm.html

### Next Steps

- Custom kernels
- One-Class SVM (anomaly detection)
- Nu-SVM variants
- Kernel approximation for large datasets
- SVM for structured prediction