# K-Nearest Neighbors (KNN) and Radius Neighbors

## Overview

**K-Nearest Neighbors (KNN)** is a simple, intuitive, non-parametric algorithm that makes predictions based on the most similar examples in the training data.

### Core Concept

*"You are the average of your k closest neighbors"*

### Key Ideas

1. **Instance-Based Learning**: No explicit training phase - stores all data
2. **Lazy Learning**: Computation happens at prediction time
3. **Distance-Based**: Uses distance metrics to find neighbors
4. **Non-Parametric**: Makes no assumptions about data distribution
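
The four ideas above can be sketched in a few lines of plain Python. This is only an illustration: `TinyKNN`, `X_toy`, and `y_toy` are hypothetical names introduced here, and the class is not how scikit-learn implements its estimators. Note that `fit` only memorizes the data, while `predict_one` does all the distance work.

```python
from collections import Counter
import math


class TinyKNN:
    """Illustrative k-NN classifier (hypothetical helper, not scikit-learn's code)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only memorizes the data: instance-based, lazy learning.
        self.X_ = [tuple(row) for row in X]
        self.y_ = list(y)
        return self

    def predict_one(self, x):
        # All the work happens at prediction time: one distance per stored sample.
        dists = [(math.dist(x, xi), yi) for xi, yi in zip(self.X_, self.y_)]
        nearest = sorted(dists, key=lambda pair: pair[0])[: self.k]
        # Majority vote among the k closest labels.
        return Counter(label for _, label in nearest).most_common(1)[0][0]


X_toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
y_toy = ["a", "a", "a", "b", "b", "b"]

model = TinyKNN(k=3).fit(X_toy, y_toy)
print(model.predict_one((0.05, 0.05)))  # query close to the "a" cluster
print(model.predict_one((1.0, 1.1)))    # query close to the "b" cluster
```

No distribution is fitted and no parameters are learned; the stored samples are the model.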

## Mathematical Foundation

### Classification

**Prediction**: Majority vote among k nearest neighbors
\[
\hat{y} = \text{mode}\{y_1, y_2, ..., y_k\}
\]

Or with distance weighting:
\[
\hat{y} = \arg\max_c \sum_{i \in N_k(x)} w_i \cdot \mathbb{1}(y_i = c)
\]

where \(N_k(x)\) is the set of indices of the k training points nearest to \(x\), and \(w_i = \frac{1}{d(x, x_i)}\) or \(w_i = \frac{1}{d(x, x_i)^2}\)
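
To make the weighted vote concrete, here is a small sketch in which uniform and inverse-distance voting disagree. The function `weighted_vote` and its `eps` guard against zero distances are assumptions made for this example; scikit-learn treats exact-zero distances with its own rule.

```python
from collections import defaultdict


def weighted_vote(distances, labels, power=1, eps=1e-12):
    """Return the class with the largest summed weight w_i = 1 / d_i**power."""
    scores = defaultdict(float)
    for d, label in zip(distances, labels):
        scores[label] += 1.0 / (d ** power + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)


# Five neighbors: two very close ones of class 0, three farther ones of class 1.
distances = [0.1, 0.2, 1.0, 1.1, 1.2]
labels = [0, 0, 1, 1, 1]

uniform_winner = max(set(labels), key=labels.count)  # plain majority vote
weighted_winner = weighted_vote(distances, labels)    # inverse-distance vote

print(uniform_winner, weighted_winner)
```

The majority vote picks class 1 (three votes to two), while the inverse-distance vote picks class 0 because its two neighbors are roughly ten times closer.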

### Regression

**Prediction**: Average of k nearest neighbors
\[
\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_i
\]

Or with distance weighting:
\[
\hat{y} = \frac{\sum_{i \in N_k(x)} w_i \cdot y_i}{\sum_{i \in N_k(x)} w_i}
\]
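
A worked instance of both regression formulas, as a sketch. The helper `knn_regress` and the toy distances and targets are made up for illustration, and the `eps` term is an assumption to keep the weights finite.

```python
def knn_regress(distances, targets, weighted=False, eps=1e-12):
    """Average the neighbors' targets, optionally with weights w_i = 1 / d_i."""
    if weighted:
        weights = [1.0 / (d + eps) for d in distances]
    else:
        weights = [1.0] * len(targets)
    return sum(w * y for w, y in zip(weights, targets)) / sum(weights)


distances = [0.5, 1.0, 2.0]
targets = [10.0, 20.0, 40.0]

plain = knn_regress(distances, targets)                    # (10 + 20 + 40) / 3
weighted = knn_regress(distances, targets, weighted=True)  # (20 + 20 + 20) / 3.5
print(round(plain, 3), round(weighted, 3))
```

With weights 2, 1, and 0.5, the weighted prediction is pulled toward the closest neighbor's target of 10.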

### Distance Metrics

**Euclidean (L2)**:
\[
d(x, x') = \sqrt{\sum_{j=1}^{d}(x_j - x'_j)^2}
\]

**Manhattan (L1)**:
\[
d(x, x') = \sum_{j=1}^{d}|x_j - x'_j|
\]

**Minkowski (general)**:
\[
d(x, x') = \left(\sum_{j=1}^{d}|x_j - x'_j|^p\right)^{1/p}
\]
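
The relationship between these metrics can be checked numerically. The sketch below uses hypothetical helper names (`minkowski`, `chebyshev`) and a 3-4-5 triangle so the results are easy to verify by hand; Chebyshev is the limiting case of Minkowski as p grows.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Limit of the Minkowski distance as p grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))


x, x_prime = (0.0, 0.0), (3.0, 4.0)

print(minkowski(x, x_prime, p=1))   # Manhattan: 3 + 4
print(minkowski(x, x_prime, p=2))   # Euclidean: sqrt(9 + 16)
print(chebyshev(x, x_prime))        # L-infinity: max(3, 4)
```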

## Topics Covered

1. KNN for classification and regression
2. Effect of k parameter
3. Distance metrics and their impact
4. Weighted vs uniform voting
5. Feature scaling importance
6. Curse of dimensionality
7. Radius Neighbors (fixed-radius approach)
8. Computational efficiency and optimization
9. Best practices and practical applications

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# KNN models
from sklearn.neighbors import (
    KNeighborsClassifier, KNeighborsRegressor,
    RadiusNeighborsClassifier, RadiusNeighborsRegressor,
    NearestNeighbors
)

# Other models for comparison
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Ridge

# Utilities
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV,
    learning_curve, validation_curve
)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, mean_absolute_error
)
from sklearn.datasets import (
    make_classification, make_regression, make_moons,
    load_breast_cancer, load_wine, load_diabetes, load_iris
)

np.random.seed(42)
sns.set_style('whitegrid')
print("✓ Libraries imported successfully")

## 1. KNN Intuition and Visualization

### 1.1 How KNN Works

In [None]:
# Generate simple 2D dataset
X_simple, y_simple = make_classification(
    n_samples=100, n_features=2, n_redundant=0,
    n_informative=2, n_clusters_per_class=1,
    random_state=42
)

print("KNN Classification Visualization")
print("="*70)
print(f"Samples: {X_simple.shape[0]}")
print(f"Features: {X_simple.shape[1]}")
print(f"Classes: {np.unique(y_simple)}")

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_simple, y_simple)

# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_simple[:, 0].min() - 1, X_simple[:, 0].max() + 1
y_min, y_max = X_simple[:, 1].min() - 1, X_simple[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(12, 5))

# Decision boundary
plt.subplot(1, 2, 1)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X_simple[:, 0], X_simple[:, 1], c=y_simple,
           cmap='RdYlBu', edgecolors='k', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Decision Boundary (k=5)')

# Demonstrate nearest neighbors for a test point
plt.subplot(1, 2, 2)
plt.scatter(X_simple[:, 0], X_simple[:, 1], c=y_simple,
           cmap='RdYlBu', edgecolors='k', s=50, alpha=0.6)

# Pick a test point
test_point = np.array([[0, 0]])
plt.scatter(test_point[0, 0], test_point[0, 1], 
           marker='*', s=500, c='green', edgecolors='black', linewidths=2,
           label='Test Point', zorder=10)

# Find k nearest neighbors
distances, indices = knn.kneighbors(test_point)
neighbors = X_simple[indices[0]]

# Highlight neighbors
plt.scatter(neighbors[:, 0], neighbors[:, 1],
           s=200, facecolors='none', edgecolors='green', linewidths=2,
           label='5 Nearest Neighbors')

# Draw lines to neighbors
for neighbor in neighbors:
    plt.plot([test_point[0, 0], neighbor[0]], 
            [test_point[0, 1], neighbor[1]],
            'g--', alpha=0.5, linewidth=1)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN: Finding 5 Nearest Neighbors')
plt.legend()

plt.tight_layout()
plt.show()

# Show prediction
prediction = knn.predict(test_point)[0]
neighbor_labels = y_simple[indices[0]]

print(f"\nTest Point: {test_point[0]}")
print(f"Neighbor Labels: {neighbor_labels}")
print(f"Majority Vote: {prediction}")
print(f"Distances to neighbors: {distances[0]}")

print("\n💡 KNN predicts based on majority vote of k nearest neighbors")

## 2. Effect of k Parameter

### 2.1 Visualizing Different k Values

In [None]:
print("Effect of k Parameter")
print("="*70)
print("k = Number of neighbors to consider")
print("  - Small k: More complex boundary (may overfit)")
print("  - Large k: Smoother boundary (may underfit)\n")

# Test different k values
k_values = [1, 5, 20, 50]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_simple, y_simple)
    
    # Decision boundary
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_simple[:, 0], X_simple[:, 1], c=y_simple,
                     cmap='RdYlBu', edgecolors='k', s=50)
    
    accuracy = knn.score(X_simple, y_simple)
    axes[idx].set_title(f'k = {k}\nTraining Accuracy: {accuracy:.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   k=1: Very complex boundary, perfect training accuracy (overfits!)")
print("   k=5-20: Balanced complexity")
print("   k=50: Very smooth boundary (may underfit)")
print("\n   Use cross-validation to find optimal k!")

### 2.2 Finding Optimal k with Cross-Validation

In [None]:
# Load iris dataset for more robust example
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print("Finding Optimal k - Iris Dataset")
print("="*70)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Scale features (important for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different k values
k_range = range(1, 31)
train_scores = []
cv_scores = []
test_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # Train score
    knn.fit(X_train_scaled, y_train)
    train_scores.append(knn.score(X_train_scaled, y_train))
    
    # CV score
    cv_score = cross_val_score(knn, X_train_scaled, y_train, cv=5).mean()
    cv_scores.append(cv_score)
    
    # Test score
    test_scores.append(knn.score(X_test_scaled, y_test))

# Find best k
best_k = k_range[np.argmax(cv_scores)]
best_cv_score = max(cv_scores)

print(f"Best k (by CV): {best_k}")
print(f"Best CV Score: {best_cv_score:.4f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, 'o-', label='Training Score', linewidth=2)
plt.plot(k_range, cv_scores, 's-', label='CV Score', linewidth=2)
plt.plot(k_range, test_scores, '^-', label='Test Score', linewidth=2)
plt.axvline(x=best_k, color='red', linestyle='--', alpha=0.5, 
           label=f'Best k={best_k}')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN: Finding Optimal k')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   - Training accuracy tends to decrease as k increases")
print("   - CV score shows bias-variance tradeoff")
print("   - Choose k where CV score is maximized")
print("   - Typically k = 3-10 works well for many problems")
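
One refinement worth considering: the loop above scales the full training set once and then cross-validates, so each validation fold has already influenced the scaler. A possible alternative, sketched here under the assumption that a `Pipeline` plus `GridSearchCV` fits the workflow, refits the scaler inside every fold. The variable names (`X_tr`, `pipe`, `search`) are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 31))}, cv=5)
search.fit(X_tr, y_tr)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("CV accuracy:", round(search.best_score_, 4))
print("Held-out accuracy:", round(search.score(X_te, y_te), 4))  # looked at once
```

On a dataset as small and clean as iris the difference is likely negligible, but the pattern carries over to cases where it matters.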

## 3. Distance Metrics

### 3.1 Comparing Different Distance Metrics

In [None]:
print("Distance Metrics Comparison")
print("="*70)

# Distance metrics to test
metrics = {
    'euclidean': 'Euclidean (L2)',
    'manhattan': 'Manhattan (L1)',
    'chebyshev': 'Chebyshev (L∞)',
    'minkowski': 'Minkowski (p=3)'
}

results = []

for metric_name, metric_label in metrics.items():
    if metric_name == 'minkowski':
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric_name, p=3)
    else:
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric_name)
    
    # CV score
    cv_score = cross_val_score(knn, X_train_scaled, y_train, cv=5).mean()
    
    # Train and test
    knn.fit(X_train_scaled, y_train)
    test_score = knn.score(X_test_scaled, y_test)
    
    results.append({
        'Metric': metric_label,
        'CV Score': cv_score,
        'Test Score': test_score
    })
    
    print(f"{metric_label:20} - CV: {cv_score:.4f}, Test: {test_score:.4f}")

results_df = pd.DataFrame(results)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(metrics))
width = 0.35

ax.bar(x - width/2, results_df['CV Score'], width, label='CV Score', alpha=0.8)
ax.bar(x + width/2, results_df['Test Score'], width, label='Test Score', alpha=0.8)

ax.set_xlabel('Distance Metric')
ax.set_ylabel('Accuracy')
ax.set_title('KNN Performance with Different Distance Metrics')
ax.set_xticks(x)
ax.set_xticklabels(results_df['Metric'], rotation=15, ha='right')
ax.legend()
ax.grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Distance Metric Notes:")
print("   Euclidean: Most common, good default")
print("   Manhattan: Often a reasonable choice for high-dimensional or sparse data")
print("   Chebyshev: Maximum difference across dimensions")
print("   Minkowski: Generalization (p=1→Manhattan, p=2→Euclidean)")

## 4. Uniform vs Distance-Weighted Voting

### 4.1 Comparing Weighting Strategies

In [None]:
print("Uniform vs Distance-Weighted Voting")
print("="*70)
print("uniform: All neighbors have equal vote")
print("distance: Closer neighbors have more influence (1/distance)\n")

weights_options = ['uniform', 'distance']
k_values = [1, 3, 5, 10, 20]

uniform_scores = []
distance_scores = []

for k in k_values:
    # Uniform
    knn_uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    uniform_score = cross_val_score(knn_uniform, X_train_scaled, y_train, cv=5).mean()
    uniform_scores.append(uniform_score)
    
    # Distance-weighted
    knn_distance = KNeighborsClassifier(n_neighbors=k, weights='distance')
    distance_score = cross_val_score(knn_distance, X_train_scaled, y_train, cv=5).mean()
    distance_scores.append(distance_score)
    
    print(f"k={k:2} - Uniform: {uniform_score:.4f}, Distance: {distance_score:.4f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, uniform_scores, 'o-', label='Uniform Weighting', linewidth=2)
plt.plot(k_values, distance_scores, 's-', label='Distance Weighting', linewidth=2)
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('CV Accuracy')
plt.title('Uniform vs Distance-Weighted KNN')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 When to use distance weighting:")
print("   - When closer neighbors should have more influence")
print("   - Helps with larger k values")
print("   - Makes exact ties in voting unlikely")
print("   - Note: k=1 gives same result for both")

## 5. Importance of Feature Scaling

### 5.1 KNN with and without Scaling

In [None]:
# Create dataset with different scales
np.random.seed(42)
X_unscaled = np.random.randn(200, 2)
X_unscaled[:, 0] *= 10  # Feature 1: large scale
X_unscaled[:, 1] *= 0.5  # Feature 2: small scale
y_unscaled = (X_unscaled[:, 0] + 5 * X_unscaled[:, 1] > 0).astype(int)

print("Importance of Feature Scaling for KNN")
print("="*70)
print(f"Feature 1 range: [{X_unscaled[:, 0].min():.2f}, {X_unscaled[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X_unscaled[:, 1].min():.2f}, {X_unscaled[:, 1].max():.2f}]")
print(f"\nFeature 1 is ~20x larger scale than Feature 2!\n")

# Scale data
scaler_demo = StandardScaler()
X_scaled_demo = scaler_demo.fit_transform(X_unscaled)

# Train KNN
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled = KNeighborsClassifier(n_neighbors=5)

knn_unscaled.fit(X_unscaled, y_unscaled)
knn_scaled.fit(X_scaled_demo, y_unscaled)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, (X, knn, title) in enumerate([
    (X_unscaled, knn_unscaled, 'Without Scaling'),
    (X_scaled_demo, knn_scaled, 'With Scaling')
]):
    h = 0.5 if idx == 0 else 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X[:, 0], X[:, 1], c=y_unscaled,
                     cmap='RdYlBu', edgecolors='k', s=50)
    
    accuracy = knn.score(X, y_unscaled)
    axes[idx].set_title(f'{title}\nAccuracy: {accuracy:.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n💡 CRITICAL: Always scale features for KNN!")
print("   KNN uses distance → large-scale features dominate")
print("   Without scaling, Feature 1 dominates distance calculation")
print("   With scaling, both features contribute equally")
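
The effect can also be seen on three hand-picked points. This sketch assumes feature standard deviations of 10 and 0.5, mirroring the synthetic data above; the points `a` and `b` and the helper `scale` are hypothetical.

```python
import math

# Assumed feature standard deviations, matching the synthetic data above.
std = (10.0, 0.5)

query = (0.0, 0.0)
a = (5.0, 0.0)   # half a standard deviation away along feature 1
b = (0.0, 1.0)   # two standard deviations away along feature 2


def scale(point):
    """Divide each feature by its standard deviation (the means are zero here)."""
    return tuple(v / s for v, s in zip(point, std))


raw = {name: math.dist(query, p) for name, p in (("a", a), ("b", b))}
scaled = {name: math.dist(scale(query), scale(p)) for name, p in (("a", a), ("b", b))}

print("raw nearest:   ", min(raw, key=raw.get))        # b looks closer
print("scaled nearest:", min(scaled, key=scaled.get))  # a is closer
```

In raw units `b` is 1 away and `a` is 5 away, so `b` wins. After scaling, `a` is 0.5 away and `b` is 2 away, so the nearest neighbor flips.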

## 6. KNN for Regression

### 6.1 Regression Example

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target

print("KNN Regression - Diabetes Dataset")
print("="*70)
print(f"Samples: {X_diabetes.shape[0]}")
print(f"Features: {X_diabetes.shape[1]}")

# Split
X_train_diab, X_test_diab, y_train_diab, y_test_diab = train_test_split(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42
)

# Scale
scaler_diab = StandardScaler()
X_train_diab_scaled = scaler_diab.fit_transform(X_train_diab)
X_test_diab_scaled = scaler_diab.transform(X_test_diab)

# Test different k values
k_values_reg = [1, 3, 5, 10, 20, 50]
reg_results = []

for k in k_values_reg:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_diab_scaled, y_train_diab)
    
    y_pred = knn_reg.predict(X_test_diab_scaled)
    
    mse = mean_squared_error(y_test_diab, y_pred)
    mae = mean_absolute_error(y_test_diab, y_pred)
    r2 = r2_score(y_test_diab, y_pred)
    
    reg_results.append({
        'k': k,
        'MSE': mse,
        'MAE': mae,
        'R²': r2
    })
    
    print(f"k={k:2} - R²: {r2:.4f}, RMSE: {np.sqrt(mse):.2f}, MAE: {mae:.2f}")

reg_df = pd.DataFrame(reg_results)

# Plot R² vs k
plt.figure(figsize=(10, 6))
plt.plot(reg_df['k'], reg_df['R²'], 'o-', linewidth=2, markersize=8)
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('R² Score')
plt.title('KNN Regression: R² Score vs k')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

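# Note: k is picked on the test set here for illustration only;
# in practice, select k by cross-validation on the training data.
best_k = reg_df.loc[reg_df['R²'].idxmax(), 'k']
best_r2 = reg_df['R²'].max()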

print(f"\nBest k: {best_k:.0f}")
print(f"Best R²: {best_r2:.4f}")

print("\n💡 KNN Regression: Predicts average of k nearest neighbors")
print("   Larger k → Smoother predictions")
print("   Smaller k → More localized predictions")

## 7. Radius Neighbors

### Concept

**Radius Neighbors**: Instead of finding k nearest neighbors, find all neighbors within a fixed radius.

**Advantages**:
- Adapts to local density
- More neighbors in dense regions, fewer in sparse regions

**Disadvantages**:
- Radius must be chosen carefully
- May have no neighbors (radius too small)
- May have too many neighbors (radius too large)
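
A brute-force sketch of the fixed-radius query shows both the density adaptation and the empty-neighborhood problem. The function `radius_neighbors` and the toy `points` are defined here for illustration; this is not the scikit-learn method of the same name.

```python
import math


def radius_neighbors(points, query, radius):
    """Return indices of all points within `radius` of `query` (brute force)."""
    return [i for i, p in enumerate(points) if math.dist(p, query) <= radius]


# A dense cluster near the origin and two isolated points far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (9.0, 0.0)]

print(radius_neighbors(points, (0.05, 0.05), radius=0.5))  # dense region: 4 neighbors
print(radius_neighbors(points, (5.0, 5.2), radius=0.5))    # sparse region: 1 neighbor
print(radius_neighbors(points, (7.0, 7.0), radius=0.5))    # empty neighborhood
```

The third query has no neighbors at all, which is the case a classifier has to handle explicitly, for example through an outlier label.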

### 7.1 Radius Neighbors Classifier

In [None]:
print("Radius Neighbors Classifier")
print("="*70)

# Use breast cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

# Split and scale
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

scaler_cancer = StandardScaler()
X_train_cancer_scaled = scaler_cancer.fit_transform(X_train_cancer)
X_test_cancer_scaled = scaler_cancer.transform(X_test_cancer)

# Test different radius values
radius_values = [0.5, 1.0, 2.0, 5.0, 10.0]
radius_results = []

for radius in radius_values:
    try:
        rad_clf = RadiusNeighborsClassifier(radius=radius, weights='distance',
                                            outlier_label='most_frequent')
        rad_clf.fit(X_train_cancer_scaled, y_train_cancer)
        
        y_pred = rad_clf.predict(X_test_cancer_scaled)
        accuracy = accuracy_score(y_test_cancer, y_pred)
        
        # Count average neighbors
        sample_neighbors = rad_clf.radius_neighbors(X_test_cancer_scaled[:10])
        avg_neighbors = np.mean([len(n) for n in sample_neighbors[1]])
        
        radius_results.append({
            'Radius': radius,
            'Accuracy': accuracy,
            'Avg Neighbors': avg_neighbors
        })
        
        print(f"Radius={radius:5.1f} - Accuracy: {accuracy:.4f}, Avg Neighbors: {avg_neighbors:.1f}")
    except ValueError as e:
        print(f"Radius={radius:5.1f} - Error: {str(e)[:50]}...")

# Compare with KNN
knn_cancer = KNeighborsClassifier(n_neighbors=5)
knn_cancer.fit(X_train_cancer_scaled, y_train_cancer)
knn_accuracy = knn_cancer.score(X_test_cancer_scaled, y_test_cancer)

print(f"\nKNN (k=5) Accuracy: {knn_accuracy:.4f}")

if radius_results:
    radius_df = pd.DataFrame(radius_results)
    best_radius = radius_df.loc[radius_df['Accuracy'].idxmax()]
    print(f"\nBest Radius: {best_radius['Radius']}")
    print(f"Best Accuracy: {best_radius['Accuracy']:.4f}")

print("\n💡 Radius Neighbors:")
print("   - Use when local density varies significantly")
print("   - Requires careful radius tuning")
print("   - KNN is usually more robust")

## 8. Curse of Dimensionality

### 8.1 KNN Performance vs Number of Features

In [None]:
print("Curse of Dimensionality")
print("="*70)
print("As dimensions increase, distance between points becomes less meaningful\n")

# Generate datasets with different dimensions
n_samples = 200
dimensions = [2, 5, 10, 20, 50, 100]
results_dim = []

for n_features in dimensions:
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=min(n_features, 10),
        n_redundant=0,
        random_state=42
    )
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    
    # Scale
    scaler_dim = StandardScaler()
    X_train_scaled = scaler_dim.fit_transform(X_train)
    X_test_scaled = scaler_dim.transform(X_test)
    
    # Train KNN
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_scaled, y_train)
    
    # Evaluate
    cv_score = cross_val_score(knn, X_train_scaled, y_train, cv=5).mean()
    test_score = knn.score(X_test_scaled, y_test)
    
    # Calculate average distance to the nearest training neighbor
    distances, _ = knn.kneighbors(X_test_scaled)
    avg_distance = distances[:, 0].mean()  # Test points are not in the training set, so column 0 is the nearest neighbor
    
    results_dim.append({
        'Dimensions': n_features,
        'CV Score': cv_score,
        'Test Score': test_score,
        'Avg Distance': avg_distance
    })
    
    print(f"Dimensions: {n_features:3d} - CV: {cv_score:.4f}, "
          f"Test: {test_score:.4f}, Avg Distance: {avg_distance:.2f}")

dim_df = pd.DataFrame(results_dim)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy vs dimensions
axes[0].plot(dim_df['Dimensions'], dim_df['CV Score'], 'o-', label='CV Score', linewidth=2)
axes[0].plot(dim_df['Dimensions'], dim_df['Test Score'], 's-', label='Test Score', linewidth=2)
axes[0].set_xlabel('Number of Dimensions')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('KNN Performance vs Dimensionality')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Average distance vs dimensions
axes[1].plot(dim_df['Dimensions'], dim_df['Avg Distance'], 'o-', linewidth=2, color='red')
axes[1].set_xlabel('Number of Dimensions')
axes[1].set_ylabel('Average Distance to Nearest Neighbor')
axes[1].set_title('Distance Increase with Dimensionality')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Curse of Dimensionality:")
print("   - Performance degrades in high dimensions")
print("   - Points become nearly equidistant (distance less meaningful)")
print("   - Need exponentially more data as dimensions increase")
print("   \n   Solutions:")
print("   - Feature selection/dimensionality reduction (PCA)")
print("   - Use algorithms less affected (tree-based, linear models)")
print("   - Increase sample size significantly")
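
The "nearly equidistant" effect can be observed directly on random data. The sketch below assumes points drawn uniformly from the unit hypercube and a fixed seed; `nearest_to_farthest_ratio` is a hypothetical helper. It reports how close the nearest point is relative to the farthest one as the dimension grows.

```python
import math
import random


def nearest_to_farthest_ratio(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)


for d in (2, 10, 100, 500):
    print(f"d={d:3d}  nearest/farthest = {nearest_to_farthest_ratio(d):.3f}")
```

A ratio near 0 means the nearest neighbor is much closer than a typical point. As the ratio approaches 1, "nearest" carries little information, which is why neighbor-based predictions degrade.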

## 9. Computational Efficiency

### 9.1 Training vs Prediction Time

In [None]:
print("Computational Efficiency: KNN vs Other Algorithms")
print("="*70)

# Use larger dataset
X_large, y_large = make_classification(
    n_samples=5000, n_features=20, random_state=42
)

X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)

scaler_large = StandardScaler()
X_train_large_scaled = scaler_large.fit_transform(X_train_large)
X_test_large_scaled = scaler_large.transform(X_test_large)

# Models to compare
models = {
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

efficiency_results = []

for name, model in models.items():
    # Training time
    start = time()
    model.fit(X_train_large_scaled, y_train_large)
    train_time = time() - start
    
    # Prediction time
    start = time()
    y_pred = model.predict(X_test_large_scaled)
    predict_time = time() - start
    
    # Accuracy
    accuracy = accuracy_score(y_test_large, y_pred)
    
    efficiency_results.append({
        'Model': name,
        'Train Time (s)': train_time,
        'Predict Time (s)': predict_time,
        'Accuracy': accuracy
    })
    
    print(f"{name:25} - Train: {train_time:.4f}s, Predict: {predict_time:.4f}s, "
          f"Accuracy: {accuracy:.4f}")

eff_df = pd.DataFrame(efficiency_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training time
axes[0].barh(eff_df['Model'], eff_df['Train Time (s)'], alpha=0.8)
axes[0].set_xlabel('Training Time (seconds)')
axes[0].set_title('Training Time Comparison')
axes[0].grid(alpha=0.3, axis='x')

# Prediction time
axes[1].barh(eff_df['Model'], eff_df['Predict Time (s)'], alpha=0.8, color='orange')
axes[1].set_xlabel('Prediction Time (seconds)')
axes[1].set_title('Prediction Time Comparison')
axes[1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 KNN Efficiency Characteristics:")
print("   Training: Brute force just stores the data; tree indexes are built at fit time")
print("   Prediction: O(n*d) per query with brute force - compares against every training sample")
print("   \n   Improvements:")
print("   - Use KD-Tree or Ball-Tree for faster neighbor search")
print("   - algorithm='auto' picks a search structure heuristically")
print("   - Consider approximate methods for very large datasets")

### 9.2 Neighbor Search Algorithms

In [None]:
print("Neighbor Search Algorithm Comparison")
print("="*70)

algorithms = ['brute', 'kd_tree', 'ball_tree', 'auto']
algo_results = []

for algo in algorithms:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    
    # Training (building tree structure)
    start = time()
    knn.fit(X_train_large_scaled, y_train_large)
    train_time = time() - start
    
    # Prediction
    start = time()
    y_pred = knn.predict(X_test_large_scaled)
    predict_time = time() - start
    
    accuracy = accuracy_score(y_test_large, y_pred)
    
    algo_results.append({
        'Algorithm': algo,
        'Train Time (s)': train_time,
        'Predict Time (s)': predict_time,
        'Accuracy': accuracy
    })
    
    print(f"{algo:15} - Train: {train_time:.4f}s, Predict: {predict_time:.4f}s")

algo_df = pd.DataFrame(algo_results)
print("\n" + algo_df.to_string(index=False))

print("\n💡 Algorithm Selection:")
print("   brute: O(n*d) per query - Checks all points, slow but always works")
print("   kd_tree: ~O(d log n) per query - Fast for low dimensions (d < 20)")
print("   ball_tree: ~O(d log n) per query - Holds up better as d grows")
print("   auto: Picks a structure heuristically from the data")
print("   \n   Recommendation: Use 'auto' (default)")

## 10. Best Practices and Guidelines

### 10.1 Decision Guide

In [None]:
print("KNN Decision Guide")
print("="*70)

guide = [
    {
        'Scenario': 'Small dataset (<1000 samples)',
        'Recommendation': 'KNN works well',
        'Note': 'Good baseline, easy to implement'
    },
    {
        'Scenario': 'Large dataset (>10k samples)',
        'Recommendation': 'Consider other algorithms',
        'Note': 'Prediction time becomes bottleneck'
    },
    {
        'Scenario': 'Low dimensions (d < 20)',
        'Recommendation': 'KNN performs well',
        'Note': 'Use kd_tree for speed'
    },
    {
        'Scenario': 'High dimensions (d > 50)',
        'Recommendation': 'Apply dimensionality reduction first',
        'Note': 'Or use algorithms designed for high-D'
    },
    {
        'Scenario': 'Non-linear decision boundary',
        'Recommendation': 'KNN handles well',
        'Note': 'No assumptions about data distribution'
    },
    {
        'Scenario': 'Need fast predictions',
        'Recommendation': 'Avoid KNN',
        'Note': 'Use parametric models instead'
    },
    {
        'Scenario': 'Imbalanced classes',
        'Recommendation': 'Use weighted KNN',
        'Note': 'Or combine with resampling'
    },
]

guide_df = pd.DataFrame(guide)
print(guide_df.to_string(index=False))

### 10.2 Common Pitfalls

In [None]:
print("\nCommon Pitfalls with KNN")
print("="*70)

pitfalls = [
    ("❌ Pitfall", "✓ Solution"),
    ("-" * 40, "-" * 40),
    ("Not scaling features", "Always use StandardScaler"),
    ("Using default k=5 without tuning", "Cross-validate to find optimal k"),
    ("Applying to high-dimensional data", "Use PCA or feature selection first"),
    ("Using KNN on large datasets", "Consider other algorithms or sampling"),
    ("Ignoring computational cost", "Profile prediction time for production"),
    ("Not considering distance metric", "Try Manhattan for high-D sparse data"),
    ("Using uniform weights always", "Try distance weighting for larger k"),
    ("Forgetting algorithm parameter", "Use algorithm='auto' for optimization"),
    ("Treating all features equally", "Consider feature weighting or selection"),
]

for pitfall, solution in pitfalls:
    print(f"{pitfall:<45} {solution}")

## Summary and Quick Reference

### Quick Reference Code

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# ALWAYS SCALE FEATURES!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ===== CLASSIFICATION =====

knn_clf = KNeighborsClassifier(
    n_neighbors=5,       # Number of neighbors (tune with CV!)
    weights='uniform',   # 'uniform' or 'distance'
    algorithm='auto',    # 'auto', 'brute', 'kd_tree', 'ball_tree'
    metric='minkowski',  # Distance metric
    p=2,                 # Power for Minkowski (2=Euclidean)
    n_jobs=-1            # Parallel processing
)

# ===== REGRESSION =====

knn_reg = KNeighborsRegressor(
    n_neighbors=5,
    weights='uniform',
    algorithm='auto',
    metric='minkowski',
    p=2
)

# ===== RADIUS NEIGHBORS =====

from sklearn.neighbors import RadiusNeighborsClassifier

rad_clf = RadiusNeighborsClassifier(
    radius=1.0,
    weights='distance',
    outlier_label='most_frequent'  # For points with no neighbors
)

# Train and predict (shown for the classifier; the other estimators share this API)
knn_clf.fit(X_train_scaled, y_train)
predictions = knn_clf.predict(X_test_scaled)
probabilities = knn_clf.predict_proba(X_test_scaled)  # Classifiers only
```

### Key Hyperparameters

**n_neighbors (k)**:
- Most important parameter
- Small k → Complex boundary (overfitting)
- Large k → Smooth boundary (underfitting)
- Typical range: [3, 5, 7, 10, 15, 20]
- Use odd numbers for binary classification (avoid ties)
- Rule of thumb: k ≈ √n (n = number of training samples) as a starting point for tuning

**weights**:
- 'uniform': All neighbors equal vote
- 'distance': Closer neighbors more influence (weight = 1/distance)
- Distance weighting helpful for larger k

**metric** (distance measure):
- 'minkowski' (default): General Lp metric; with the default p=2 it is Euclidean
- 'euclidean': Standard L2 distance
- 'manhattan': L1 distance, sometimes preferred for high-D
- 'chebyshev': L∞ (maximum difference)

**algorithm** (neighbor search):
- 'auto': Chooses best automatically (recommended)
- 'brute': O(n × d) per query - Always works
- 'kd_tree': ~O(d log n) per query - Fast for d < 20
- 'ball_tree': ~O(d log n) per query - Better for high-D

### Computational Complexity

| Phase | Brute Force | Tree-based (low-D) |
|-------|-------------|-------------------|
| Training | O(1) | O(n log n) |
| Prediction (per sample) | O(n × d) | O(log n × d) |
| Memory | O(n × d) | O(n × d) |
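
The brute-force column can be read off a direct NumPy implementation. This is a sketch under assumed names (`brute_force_knn`, `X_demo`, `query`) with random data, not scikit-learn's internal routine.

```python
import numpy as np


def brute_force_knn(X_train, query, k):
    """Indices of the k nearest training rows to `query`, closest first."""
    # One pass over all n rows and d columns: O(n * d) work per query.
    sq_dists = np.sum((X_train - query) ** 2, axis=1)
    # argpartition finds the k smallest in O(n) without a full sort.
    candidates = np.argpartition(sq_dists, k - 1)[:k]
    # Sort only those k candidates by distance.
    return candidates[np.argsort(sq_dists[candidates])]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 20))   # n = 1000 samples, d = 20 features
query = rng.normal(size=20)

print(brute_force_knn(X_demo, query, k=5))
```

Squared distances are used because they preserve the neighbor ordering and skip the square root. Nothing is precomputed, so the "training" cost is only that of keeping `X_demo` in memory.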

### Feature Scaling

✓ **CRITICAL**: Always scale features for KNN
✓ Use StandardScaler or MinMaxScaler
✓ Fit on training data only
✓ Transform both train and test

### KNN Strengths

✓ Simple and intuitive
✓ No training phase (fast to update)
✓ Non-parametric (no assumptions)
✓ Handles non-linear decision boundaries
✓ Multi-class naturally supported
✓ Works for both classification and regression

### KNN Limitations

✗ Slow predictions on large datasets
✗ Memory intensive (stores all training data)
✗ Sensitive to irrelevant features
✗ Suffers from curse of dimensionality
✗ Sensitive to feature scaling
✗ Requires choosing k
✗ Poor with imbalanced data

### When to Use KNN

✓ **Use KNN when:**
- Small to medium datasets (<10k samples)
- Low dimensionality (d < 20)
- Non-linear decision boundaries
- Need simple baseline
- Data frequently updated (no retraining)
- Multi-modal class distributions

✗ **Avoid KNN when:**
- Large datasets (>100k samples)
- High dimensionality (d > 50)
- Need fast predictions
- Many irrelevant features
- Interpretability needed
- Storage is limited

### Comparison with Other Algorithms

| vs Algorithm | KNN Advantages | KNN Disadvantages |
|--------------|----------------|-------------------|
| Logistic Reg | Non-linear, multi-modal | Slower, less interpretable |
| Decision Tree | Smoother boundaries | Slower, needs scaling |
| SVM | No training optimization, natively multi-class | Slower prediction at scale, more sensitive to noise |
| Random Forest | Simpler, few hyperparameters | Slower prediction, often lower accuracy |
| Neural Networks | Much simpler | Less flexible, worse performance |

### Practical Tips

1. **Start with k=5**: A reasonable starting point (sklearn's default), then tune
2. **Cross-validate k**: Test odd numbers [3, 5, 7, 9, 11, 15, 21]
3. **Scale features**: Use StandardScaler (mandatory!)
4. **Try distance weighting**: Especially with larger k
5. **Use PCA**: For high-dimensional data (d > 20)
6. **Feature selection**: Remove irrelevant features
7. **Check prediction time**: Profile on production-sized data
8. **Use algorithm='auto'**: Let sklearn choose best structure

### Further Reading

- **Classic Paper**: Cover & Hart (1967) - "Nearest Neighbor Pattern Classification"
- **Book**: "An Introduction to Statistical Learning" - Chapter on KNN
- **sklearn Documentation**: https://scikit-learn.org/stable/modules/neighbors.html
- **Efficiency**: Ball Tree and KD-Tree algorithms

### Next Steps

- Approximate Nearest Neighbors (ANN) for large-scale
- Locally weighted learning
- Distance metric learning
- Hybrid methods (KNN + other algorithms)
- KNN for anomaly detection