# Lesson 5A: K-Nearest Neighbors Theory

KNN is one of the simplest ML algorithms: classify a point based on its neighbors' votes!

## Introduction

Imagine moving to a new neighborhood. To guess the average income, you'd ask your K closest neighbors and average their incomes. That's KNN! Averaging gives a regression estimate; for classification, the neighbors vote on a class instead.

KNN makes no assumptions about the data distribution. It simply memorizes the training data and looks up the nearest neighbors at prediction time.

## Table of Contents

1. What is KNN?
2. Distance metrics
3. Choosing K
4. Algorithm steps
5. KNN from scratch
6. Curse of dimensionality

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from collections import Counter

np.random.seed(42)
print('✅ Libraries loaded')

## Distance Metrics

**Euclidean Distance (most common):**

$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$

**Manhattan Distance:**

$d(x, y) = \sum_{i=1}^{n}|x_i - y_i|$

**Minkowski Distance (generalization):**

$d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$

Setting $p = 1$ gives the Manhattan distance and $p = 2$ gives the Euclidean distance.

In [None]:
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

def manhattan_distance(x1, x2):
    return np.sum(np.abs(x1 - x2))

# Example
point1 = np.array([1, 2])
point2 = np.array([4, 6])
print(f'Euclidean: {euclidean_distance(point1, point2):.2f}')
print(f'Manhattan: {manhattan_distance(point1, point2):.2f}')
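The Minkowski formula can be sketched the same way. The function name `minkowski_distance` and the default `p=2` are choices made for this illustration, not part of the lesson's code. With `p=1` it should match the Manhattan distance and with `p=2` the Euclidean distance.

```python
import numpy as np

def minkowski_distance(x1, x2, p=2):
    """Minkowski distance of order p (illustrative helper, assumes p >= 1)."""
    if p < 1:
        raise ValueError('p must be >= 1 for a valid distance metric')
    return np.sum(np.abs(x1 - x2) ** p) ** (1 / p)

# Same two points as the example above
point1 = np.array([1, 2])
point2 = np.array([4, 6])

d1 = minkowski_distance(point1, point2, p=1)  # Manhattan case
d2 = minkowski_distance(point1, point2, p=2)  # Euclidean case
d3 = minkowski_distance(point1, point2, p=3)

print(f'p=1: {d1:.2f}')
print(f'p=2: {d2:.2f}')
print(f'p=3: {d3:.2f}')
```

As `p` grows, the largest coordinate difference dominates the sum more and more, so the distance shrinks toward that single largest difference.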

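## Choosing K

- **K=1:** Sensitive to noise, overfits
- **Large K:** Underfits, boundaries too smooth
- **Rule of thumb:** K ≈ √n (n = number of training samples)
- **Best practice:** Choose K by cross-validation

**Prefer an odd K** for binary classification: with two classes, an odd number of votes cannot tie. With three or more classes, ties are still possible even when K is odd.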
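Choosing K by cross-validation can be sketched with NumPy alone. Everything here is an assumption made for illustration: the synthetic two-class data, the helper names `knn_predict` and `cross_val_accuracy`, the 5-fold split, and the list of candidate K values.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote KNN with Euclidean distance (illustrative helper)."""
    # Pairwise distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Labels of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Most frequent label in each row (labels must be non-negative ints)
    return np.array([np.bincount(row).argmax() for row in votes])

def cross_val_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean validation accuracy over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
        scores.append(np.mean(y_pred == y[val_idx]))
    return float(np.mean(scores))

# Synthetic data: two overlapping Gaussian blobs (hypothetical example data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]
cv_scores = {k: cross_val_accuracy(X, y, k) for k in candidate_ks}
best_k = max(cv_scores, key=cv_scores.get)

for k, score in cv_scores.items():
    print(f'K={k:>2}: mean CV accuracy = {score:.3f}')
print(f'Best K: {best_k}')
```

The same fold split is reused for every K (fixed `seed`), so the scores are comparable. The √n rule of thumb is a reasonable place to center the candidate list.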

## KNN Algorithm

1. Load training data (memorize all points)
2. For new point x:
   - Compute distance to all training points
   - Find K nearest neighbors
   - Vote: Most common class wins
3. Return prediction

In [None]:
class KNN:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        # "Training" only stores the data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        # Compute distances
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Get K nearest
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Vote
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

print('✅ KNN implementation complete!')

In [None]:
# Test on iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNN(k=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')
print('\n✅ KNN from scratch works!')
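The income analogy from the introduction averages the neighbors' values rather than taking a vote, which is the regression form of KNN. A minimal sketch follows. The function name `knn_regress` and the street and income numbers are made up for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    k_indices = np.argsort(distances)[:k]
    return float(np.mean(y_train[k_indices]))

# Hypothetical data: house position along a street (km) and income (thousands)
positions = np.array([[0.0], [0.1], [0.2], [1.5], [1.6], [3.0]])
incomes = np.array([50.0, 54.0, 58.0, 90.0, 94.0, 120.0])

# A new house at 0.15 km: its three closest neighbors are at 0.0, 0.1 and 0.2 km
estimate = knn_regress(positions, incomes, np.array([0.15]), k=3)
print(f'Estimated income: {estimate:.1f}')
```

Only the final aggregation step differs from the classifier above: `np.mean` replaces the `Counter` vote.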

## Curse of Dimensionality

**Problem:** In high dimensions, all points are far apart!

- As dimensions increase, distances to the nearest and farthest points become nearly equal, so distance metrics lose their discriminating power
- "Nearest" neighbors aren't actually near
- Performance degrades

**Solution:** Feature selection, dimensionality reduction (PCA)
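A small experiment can make the problem concrete. This sketch assumes uniformly random points in the unit hypercube, which is an illustrative setup and not a dataset from this lesson. It measures how the nearest-to-farthest distance ratio changes as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500
dimensions = [2, 10, 100, 1000]
ratios = []

for d in dimensions:
    # Random training points and one random query point in [0, 1]^d
    points = rng.random((n_points, d))
    query = rng.random(d)
    distances = np.sqrt(np.sum((points - query) ** 2, axis=1))
    # A ratio near 1 means the nearest point is barely closer than the farthest
    ratio = distances.min() / distances.max()
    ratios.append(ratio)
    print(f'd={d:>4}: nearest/farthest = {ratio:.3f}')
```

In low dimensions the ratio is small, so the nearest neighbor is much closer than the farthest point. As the dimension grows the ratio climbs toward 1, which is why "nearest" stops carrying much information.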

## Conclusion

**Pros:**

- ✅ Simple, intuitive
- ✅ No training time (`fit` only stores the data)
- ✅ No assumptions about data distribution
- ✅ Works for multi-class

**Cons:**

- ❌ Slow predictions (must compute all distances)
- ❌ Sensitive to scale (normalize features!)
- ❌ Curse of dimensionality
- ❌ Memory intensive (stores the whole training set)

**Next:** Lesson 5B - Optimized KNN with scikit-learn!
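As a closing illustration of the scale-sensitivity point in the cons list, the sketch below uses hypothetical age and income data. It shows how an unscaled feature with large values can dominate the Euclidean distance, and how z-score standardization (one common normalization choice) changes which neighbor is nearest.

```python
import numpy as np

# Hypothetical data: columns are age (years) and income (dollars)
X_train = np.array([
    [31, 58000],
    [60, 50500],
    [25, 30000],
    [45, 90000],
    [52, 42000],
], dtype=float)
query = np.array([30.0, 50000.0])

def nearest_index(X, x):
    """Index of the row of X closest to x (Euclidean distance)."""
    return int(np.argmin(np.sqrt(np.sum((X - x) ** 2, axis=1))))

# Raw features: income differences in the thousands swamp age differences
raw_nearest = nearest_index(X_train, query)

# Standardize with statistics from the training data only,
# then apply the same transform to the query point
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_scaled = (X_train - mean) / std
query_scaled = (query - mean) / std
scaled_nearest = nearest_index(X_scaled, query_scaled)

print(f'Nearest on raw features:    {X_train[raw_nearest]}')
print(f'Nearest on scaled features: {X_train[scaled_nearest]}')
```

On the raw features the 60-year-old with a nearly identical income is the nearest neighbor. After scaling, the 31-year-old is, because age now contributes on the same footing as income.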