# 🌸 Iris Flower Classification using K-Nearest Neighbors (KNN)

## 📊 Project Overview

This notebook demonstrates the **K-Nearest Neighbors (KNN)** algorithm for multi-class classification on the famous Iris dataset.

### 🎯 Learning Objectives:
- Understand KNN algorithm for classification
- Explore different distance metrics
- Find optimal K value
- Implement complete ML pipeline
- Evaluate multi-class classification
- Visualize decision boundaries

---

## 🤔 What is K-Nearest Neighbors?

**KNN** is a simple, **instance-based** learning algorithm that classifies a data point based on the majority class of its K nearest neighbors.

### Key Characteristics:

| Aspect | Description |
|--------|-------------|
| **Type** | Non-parametric, instance-based |
| **Learning** | Lazy learning (no explicit training; `fit` only stores the training data) |
| **Prediction** | Compare with all training samples |
| **Decision** | Majority vote among K neighbors |
| **Distance** | Euclidean, Manhattan, Minkowski, etc. |
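
The distance metrics in the table can be written out directly. The following is a minimal from-scratch sketch, not scikit-learn's implementation; the function names and the two sample points are made up for illustration. It shows that Manhattan and Euclidean are the `p=1` and `p=2` cases of the Minkowski distance.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """L1 distance: Minkowski with p=1 (sum of absolute differences)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """L2 distance: Minkowski with p=2 (straight-line distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up points in a 2-feature space
p1, p2 = (0.0, 0.0), (3.0, 4.0)

print(manhattan(p1, p2))   # 7.0
print(euclidean(p1, p2))   # 5.0
print(minkowski(p1, p2, 3))
```

As `p` grows, the largest single-feature difference (here 4) increasingly dominates the result.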

### How It Works:

```
1. Choose K (number of neighbors)
2. Calculate distance from new point to all training points
3. Select K nearest neighbors
4. Take majority vote
5. Assign the most common class
```
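
The five steps above can be sketched in plain Python. This is an illustrative brute-force version under simple assumptions (Euclidean distance, unweighted votes); the function name `knn_predict` and the sample points are hypothetical and are not the notebook's data or scikit-learn's code.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: count votes and return the most common class
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (petal length, petal width) points for illustration only
train_points = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5)]
train_labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

print(knn_predict(train_points, train_labels, query=(1.6, 0.4), k=3))  # setosa
```

In this sketch a tied vote goes to whichever tied class appears first among the nearest neighbors; choosing an odd K reduces ties but cannot rule them out with three classes.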

### Visual Example:

```
Classify "?" with K=5:

    Setosa (●)      Versicolor (■)      Virginica (▲)

        ●                  ■                  ▲
      ●   ●       ?      ■   ■            ▲   ▲
        ●                  ■                  ▲

5 Nearest Neighbors: ● ● ● ■ ■
Votes: Setosa=3, Versicolor=2
→ Prediction: Setosa
```
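
The vote in this diagram can be tallied in a few lines. The sketch below also converts the counts into vote shares, which is how a uniform-weight KNN can report class probabilities (the fraction of the K neighbors belonging to each class); the variable names are illustrative.

```python
from collections import Counter

# The five nearest neighbors from the diagram above
neighbors = ["Setosa", "Setosa", "Setosa", "Versicolor", "Versicolor"]

votes = Counter(neighbors)
k = len(neighbors)

# Vote share per class: the fraction of the K neighbors that belong to it
vote_share = {species: count / k for species, count in votes.items()}
prediction = max(vote_share, key=vote_share.get)

print(vote_share)   # {'Setosa': 0.6, 'Versicolor': 0.4}
print(prediction)   # Setosa
```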

## 📚 Import Libraries

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

# Machine Learning - Dataset
from sklearn.datasets import load_iris

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler

# Machine Learning - Models
from sklearn.neighbors import KNeighborsClassifier

# Machine Learning - Evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay
)

# Statistical analysis
from scipy import stats

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")

## 📁 Load Dataset

**Dataset**: Iris Flower Dataset (built into scikit-learn)

### Dataset Information:
- **Source**: Introduced by Ronald Fisher (1936); also distributed via the UCI Machine Learning Repository
- **Samples**: 150 (50 per class)
- **Features**: 4 numerical features
- **Classes**: 3 species (Setosa, Versicolor, Virginica)
- **Type**: Multi-class classification

### Features:
1. **sepal_length**: Sepal length in cm
2. **sepal_width**: Sepal width in cm
3. **petal_length**: Petal length in cm
4. **petal_width**: Petal width in cm

### Target Classes:
- **0**: Setosa
- **1**: Versicolor
- **2**: Virginica

In [None]:
# Load the Iris dataset from scikit-learn
iris = load_iris()

# Extract features and target
X = iris.data
y = iris.target

# Get feature and target names
feature_names = iris.feature_names
target_names = iris.target_names

# Create a DataFrame for easier viewing
df = pd.DataFrame(X, columns=feature_names)
df['species'] = pd.Categorical.from_codes(y, target_names)
df['species_code'] = y

print(f"Dataset shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {len(feature_names)}")
print(f"\nFeature names: {feature_names}")
print(f"Target names: {list(target_names)}")
print("\n" + "="*70)
print("First 5 rows:")
df.head()

## 🔍 Exploratory Data Analysis (EDA)

### Step 1: Basic Information

In [None]:
# Dataset info
print("📊 Dataset Information:")
print("="*70)
df.info()

In [None]:
# Statistical summary
print("📈 Statistical Summary:")
print("="*70)
df.describe()

In [None]:
# Check for missing values
print("🔍 Missing Values:")
print("="*70)
missing = df.isnull().sum()
print(missing if missing.sum() > 0 else "No missing values found! ✅")

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"🔍 Duplicate rows: {duplicates}")
if duplicates > 0:
    print(f"Removing {duplicates} duplicate rows...")
    df = df.drop_duplicates()
    print(f"✅ New shape: {df.shape}")

### Step 2: Class Distribution Analysis

In [None]:
# Analyze target variable distribution
print("🎯 Class Distribution:")
print("="*70)

class_counts = df['species'].value_counts()
print(f"\nClass Counts:")
print(class_counts)
print(f"\nPercentage:")
print(df['species'].value_counts(normalize=True) * 100)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=df, x='species', ax=axes[0], palette='Set2', order=target_names)
axes[0].set_title('Species Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Species', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].tick_params(axis='x', rotation=45)

# Pie chart
colors = ['#90EE90', '#FFB6C1', '#87CEEB']
# Take labels from class_counts itself so each wedge is named after the class it counts
axes[1].pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', 
            colors=colors, startangle=90)
axes[1].set_title('Species Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

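# Only claim balance if every class really has the same count
if class_counts.nunique() == 1:
    print("\n✅ Dataset is perfectly balanced!")
else:
    print("\nℹ️ Class counts differ slightly (for example, after duplicate removal).")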

### Step 3: Feature Distribution Analysis

In [None]:
# Distribution of all features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

colors_hist = ['skyblue', 'lightcoral', 'lightgreen', 'gold']

for idx, col in enumerate(feature_names):
    axes[idx].hist(df[col], bins=20, edgecolor='black', alpha=0.7, color=colors_hist[idx])
    axes[idx].set_title(f'Distribution of {col}', fontweight='bold', fontsize=12)
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, col in enumerate(feature_names):
    sns.boxplot(y=df[col], ax=axes[idx], color=colors_hist[idx])
    axes[idx].set_title(f'Box Plot of {col}', fontweight='bold', fontsize=12)
    axes[idx].set_ylabel(col, fontsize=10)
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

### Step 4: Feature Relationships - Pairwise Plots

In [None]:
# Pairwise scatter plots colored by species
print("📊 Creating pairwise feature relationships...")
sns.pairplot(df, hue='species', markers=['o', 's', 'D'], 
             palette='Set2', height=2.5, diag_kind='kde')
plt.suptitle('Pairwise Feature Relationships by Species', y=1.02, fontsize=16, fontweight='bold')
plt.show()

print("\n💡 Observations:")
print("- Setosa (green) is clearly separated from others")
print("- Versicolor and Virginica have some overlap")
print("- Petal measurements are more discriminative than sepal")

### Step 5: Correlation Analysis

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df[feature_names].corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, linewidths=1, square=True, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\n🔍 Correlation Insights:")
print("High positive correlations:")
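# Keep only the upper triangle so each feature pair is listed once
# and self-correlations on the diagonal are excluded
high_corr = (
    correlation_matrix
    .where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
    .stack()
    .sort_values(ascending=False)
)
print(high_corr[high_corr > 0.8])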

### Step 6: Feature Analysis by Species

In [None]:
# Violin plots - Feature comparison by species
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for idx, feature in enumerate(feature_names):
    sns.violinplot(data=df, x='species', y=feature, ax=axes[idx], palette='Set2')
    axes[idx].set_title(f'{feature} by Species', fontweight='bold', fontsize=12)
    axes[idx].set_xlabel('Species', fontsize=10)
    axes[idx].set_ylabel(feature, fontsize=10)
    axes[idx].grid(alpha=0.3, axis='y')
    axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Statistical summary by species
print("📊 Mean Values by Species:")
print("="*70)
print(df.groupby('species')[feature_names].mean())

print("\n📊 Standard Deviation by Species:")
print("="*70)
print(df.groupby('species')[feature_names].std())

## 🔧 Data Preprocessing

### Step 1: Prepare Features and Target

In [None]:
# Separate features and target
X = df[feature_names].values
y = df['species_code'].values

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature names: {feature_names}")
print(f"Target classes: {target_names}")

### Step 2: Train-Test Split

In [None]:
# Split data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("✅ Data split completed!")
print("="*70)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"\nTraining target distribution:")
unique, counts = np.unique(y_train, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f"  {target_names[cls]}: {cnt}")
print(f"\nTesting target distribution:")
unique, counts = np.unique(y_test, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f"  {target_names[cls]}: {cnt}")

### Step 3: Feature Scaling

**⚠️ CRITICAL for KNN**: Features must be scaled because KNN uses distance metrics!

Without scaling, features with larger ranges will dominate the distance calculation.
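
A small made-up example shows the effect. The sketch below uses plain Python with z-score standardization based on the population standard deviation, the same formula `StandardScaler` applies; the data and helper names are hypothetical. Feature 1 spans thousands while feature 2 spans single digits, and the nearest neighbor of the query changes once both are on the same scale.

```python
import math
from statistics import mean, pstdev

def standardize(rows, means, stds):
    """Apply z-score scaling column by column: (value - mean) / std."""
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_label(points, labels, query):
    """Label of the single closest training point (K=1)."""
    return min(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[1]

# Made-up data: feature 1 spans thousands, feature 2 spans single digits
train = [(1000.0, 1.0), (1019.0, 9.0), (2000.0, 5.0), (3000.0, 5.0)]
labels = ["P", "R", "S", "T"]
query = (1020.0, 1.0)

# Fit the scaling statistics on the training data only, then reuse them for the query
columns = list(zip(*train))
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]

train_scaled = standardize(train, means, stds)
query_scaled = standardize([query], means, stds)[0]

print(nearest_label(train, labels, query))                # R
print(nearest_label(train_scaled, labels, query_scaled))  # P
```

In raw units, a gap of 20 on feature 1 (about 1% of its range) outweighs a gap of 8 on feature 2 (its entire range), so the query is matched to `R`. After scaling, `P`, which is identical on feature 2 and almost identical on feature 1, becomes the nearest point.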

In [None]:
# Feature Scaling using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for better readability
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_names)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=feature_names)

print("✅ Feature scaling completed!")
print("="*70)
print("\nOriginal data (first 5 rows):")
print(pd.DataFrame(X_train[:5], columns=feature_names))
print("\nScaled data (first 5 rows):")
print(X_train_scaled_df.head())

print("\n📊 Scaling Statistics:")
print(f"Mean of scaled features: {X_train_scaled.mean(axis=0).round(10)}")
print(f"Std of scaled features: {X_train_scaled.std(axis=0).round(2)}")

## 🤖 Model Training

### Model 1: Basic KNN (K=5, Euclidean Distance)

In [None]:
# Train basic KNN model
print("🤖 Training K-Nearest Neighbors Model...")
print("="*70)

# Initialize KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')

# Train the model
knn.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = knn.predict(X_train_scaled)
y_test_pred = knn.predict(X_test_scaled)

# Get prediction probabilities
y_train_pred_proba = knn.predict_proba(X_train_scaled)
y_test_pred_proba = knn.predict_proba(X_test_scaled)

print("✅ Model training completed!")
print(f"\nModel parameters:")
print(f"  K (n_neighbors): {knn.n_neighbors}")
print(f"  Weights: {knn.weights}")
print(f"  Distance metric: {knn.metric}")

## 📊 Model Evaluation

### Step 1: Accuracy Score

In [None]:
# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("📊 ACCURACY SCORES")
print("="*70)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy:  {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print("="*70)

if train_accuracy - test_accuracy > 0.05:
    print("⚠️ Possible overfitting detected!")
else:
    print("✅ Model generalizes well!")

### Step 2: Classification Report

In [None]:
# Detailed classification report
print("\n📋 CLASSIFICATION REPORT - Testing Set:")
print("="*70)
print(classification_report(y_test, y_test_pred, target_names=target_names))

# Per-class metrics
precision = precision_score(y_test, y_test_pred, average=None)
recall = recall_score(y_test, y_test_pred, average=None)
f1 = f1_score(y_test, y_test_pred, average=None)

metrics_df = pd.DataFrame({
    'Species': target_names,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1
})

print("\n📊 Per-Class Metrics Summary:")
print(metrics_df.to_string(index=False))

### Step 3: Confusion Matrix

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion Matrix - K-Nearest Neighbors', fontsize=14, fontweight='bold')
plt.ylabel('Actual Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

print("\n📊 Confusion Matrix Breakdown:")
print("="*70)
for i, species in enumerate(target_names):
    print(f"{species}:")
    print(f"  Correctly classified: {cm[i, i]}")
    print(f"  Misclassified: {cm[i].sum() - cm[i, i]}")
    if cm[i].sum() > 0:
        print(f"  Class accuracy: {cm[i, i] / cm[i].sum() * 100:.2f}%")
    print()
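
The per-class numbers in the classification report can be derived from a confusion matrix by hand. The sketch below assumes rows are actual classes and columns are predicted classes; the 3x3 matrix is hypothetical and is not a result from this notebook.

```python
def per_class_metrics(cm):
    """Precision, recall, and F1 per class from a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    n = len(cm)
    results = []
    for i in range(n):
        tp = cm[i][i]
        predicted_as_i = sum(cm[r][i] for r in range(n))  # column sum
        actually_i = sum(cm[i])                           # row sum
        precision = tp / predicted_as_i if predicted_as_i else 0.0
        recall = tp / actually_i if actually_i else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results

# Hypothetical 3-class confusion matrix, not a result from this notebook
cm = [
    [15, 0, 0],
    [0, 14, 1],
    [0, 2, 13],
]

metrics = per_class_metrics(cm)
for name, (p, r, f1) in zip(["setosa", "versicolor", "virginica"], metrics):
    print(f"{name:<11} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Macro average: unweighted mean of the per-class values
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
print(f"macro F1 = {macro_f1:.3f}")
```

Precision reads down a column (of everything predicted as a class, how much was right), while recall reads across a row (of everything that truly is a class, how much was found).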

### Step 4: Prediction Examples

In [None]:
# Show some prediction examples
print("🔍 Sample Predictions:")
print("="*70)

# Select 5 random test samples
sample_indices = np.random.choice(len(X_test), 5, replace=False)

for idx in sample_indices:
    actual = target_names[y_test[idx]]
    predicted = target_names[y_test_pred[idx]]
    probabilities = y_test_pred_proba[idx]
    
    print(f"\nSample {idx + 1}:")
    print(f"  Features: {X_test[idx]}")
    print(f"  Actual: {actual}")
    print(f"  Predicted: {predicted}")
    print(f"  Probabilities:")
    for i, species in enumerate(target_names):
        print(f"    {species}: {probabilities[i]:.4f} ({probabilities[i]*100:.2f}%)")
    print(f"  Correct: {'✅' if actual == predicted else '❌'}")

## 🎛️ Finding Optimal K Value

Let's test different K values to find the optimal one.
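
One way to compare K values without consulting the test set is cross-validation on the training data alone. The sketch below is a from-scratch illustration using leave-one-out validation on made-up 2-D points; the helper names and data are hypothetical, and scikit-learn's `cross_val_score` would normally do this job.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (query, truth) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, query, k) == truth
    return hits / len(points)

# Made-up training data: two clusters plus one "b" point sitting inside cluster "a"
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (3.0, 3.0), (3.1, 2.9), (2.9, 3.2), (3.2, 3.1),
          (1.05, 1.05)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first K with the highest accuracy

print(scores)
print("best k:", best_k)  # best k: 3
```

With K=1 the single stray point misleads several of its neighbors, while K=3 outvotes it, which is the noise sensitivity that small K values are known for.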

In [None]:
# Test different K values
print("🔍 Finding Optimal K Value...")
print("="*70)

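k_range = range(1, 31)
train_scores = []
test_scores = []
cv_means = []

for k in k_range:
    knn_temp = KNeighborsClassifier(n_neighbors=k)

    # 5-fold cross-validation on the training data drives the choice of K
    cv_means.append(cross_val_score(knn_temp, X_train_scaled, y_train, cv=5).mean())

    knn_temp.fit(X_train_scaled, y_train)
    train_scores.append(knn_temp.score(X_train_scaled, y_train))
    test_scores.append(knn_temp.score(X_test_scaled, y_test))

# Pick K by cross-validated accuracy so the test set is not used for model selection
best_idx = int(np.argmax(cv_means))
optimal_k = k_range[best_idx]
best_score = test_scores[best_idx]

print(f"\n✅ Optimal K Value: {optimal_k} (mean CV accuracy: {cv_means[best_idx]:.4f})")
print(f"Test Accuracy at K={optimal_k}: {best_score:.4f} ({best_score*100:.2f}%)")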

# Plot K vs Accuracy
plt.figure(figsize=(12, 6))
plt.plot(k_range, train_scores, marker='o', linestyle='-', linewidth=2, 
         markersize=6, label='Training Accuracy', color='blue')
plt.plot(k_range, test_scores, marker='s', linestyle='-', linewidth=2, 
         markersize=6, label='Testing Accuracy', color='red')
plt.axvline(x=optimal_k, color='green', linestyle='--', linewidth=2, 
            label=f'Optimal K={optimal_k}')
plt.xlabel('K Value (Number of Neighbors)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('K Value vs Accuracy', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.xticks(range(1, 31, 2))
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print(f"- K=1 shows overfitting (training accuracy = {train_scores[0]:.4f})")
print(f"- Optimal K={optimal_k} balances bias and variance")
print(f"- Very large K values lead to underfitting")

## 🔄 Model Variants - Different Distance Metrics

### Variant 1: Manhattan Distance (L1)

In [None]:
# KNN with Manhattan distance
print("🤖 Training KNN with Manhattan Distance...")
print("="*70)

knn_manhattan = KNeighborsClassifier(n_neighbors=optimal_k, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)

y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print(f"\n📊 Manhattan Distance Results:")
print(f"Accuracy: {accuracy_manhattan:.4f} ({accuracy_manhattan*100:.2f}%)")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_manhattan, target_names=target_names))

### Variant 2: Weighted KNN (Distance-Based Weights)

In [None]:
# KNN with distance-based weights
print("🤖 Training KNN with Distance Weights...")
print("="*70)

knn_weighted = KNeighborsClassifier(n_neighbors=optimal_k, weights='distance')
knn_weighted.fit(X_train_scaled, y_train)

y_pred_weighted = knn_weighted.predict(X_test_scaled)
accuracy_weighted = accuracy_score(y_test, y_pred_weighted)

print(f"\n📊 Weighted KNN Results:")
print(f"Accuracy: {accuracy_weighted:.4f} ({accuracy_weighted*100:.2f}%)")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_weighted, target_names=target_names))

### Variant 3: Minkowski Distance (p=3)

In [None]:
# KNN with Minkowski distance (p=3)
print("🤖 Training KNN with Minkowski Distance (p=3)...")
print("="*70)

knn_minkowski = KNeighborsClassifier(n_neighbors=optimal_k, metric='minkowski', p=3)
knn_minkowski.fit(X_train_scaled, y_train)

y_pred_minkowski = knn_minkowski.predict(X_test_scaled)
accuracy_minkowski = accuracy_score(y_test, y_pred_minkowski)

print(f"\n📊 Minkowski Distance Results:")
print(f"Accuracy: {accuracy_minkowski:.4f} ({accuracy_minkowski*100:.2f}%)")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_minkowski, target_names=target_names))

## 📊 Model Comparison

In [None]:
# Compare all model variants

# Predictions for the K=optimal_k Euclidean model
# (`knn` is the K=5 baseline, so it must not be reused for this row)
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k)
knn_optimal.fit(X_train_scaled, y_train)
y_pred_optimal = knn_optimal.predict(X_test_scaled)

comparison_df = pd.DataFrame({
    'Model': [
        f'KNN (K={knn.n_neighbors}, Euclidean)',
        f'KNN (K={optimal_k}, Euclidean)',
        f'KNN (K={optimal_k}, Manhattan)',
        f'KNN (K={optimal_k}, Weighted)',
        f'KNN (K={optimal_k}, Minkowski p=3)'
    ],
    'Accuracy': [
        test_accuracy,
        accuracy_score(y_test, y_pred_optimal),
        accuracy_manhattan,
        accuracy_weighted,
        accuracy_minkowski
    ],
    'Precision (macro avg)': [
        precision_score(y_test, y_test_pred, average='macro'),
        precision_score(y_test, y_pred_optimal, average='macro'),
        precision_score(y_test, y_pred_manhattan, average='macro'),
        precision_score(y_test, y_pred_weighted, average='macro'),
        precision_score(y_test, y_pred_minkowski, average='macro')
    ],
    'Recall (macro avg)': [
        recall_score(y_test, y_test_pred, average='macro'),
        recall_score(y_test, y_pred_optimal, average='macro'),
        recall_score(y_test, y_pred_manhattan, average='macro'),
        recall_score(y_test, y_pred_weighted, average='macro'),
        recall_score(y_test, y_pred_minkowski, average='macro')
    ],
    'F1-Score (macro avg)': [
        f1_score(y_test, y_test_pred, average='macro'),
        f1_score(y_test, y_pred_optimal, average='macro'),
        f1_score(y_test, y_pred_manhattan, average='macro'),
        f1_score(y_test, y_pred_weighted, average='macro'),
        f1_score(y_test, y_pred_minkowski, average='macro')
    ]
})

print("\n📊 MODEL COMPARISON - All Variants")
print("="*100)
print(comparison_df.to_string(index=False))
print("="*100)

# Find best model
best_model_idx = comparison_df['Accuracy'].idxmax()
print(f"\n🏆 Best Model: {comparison_df.iloc[best_model_idx]['Model']}")
print(f"   Accuracy: {comparison_df.iloc[best_model_idx]['Accuracy']:.4f}")

In [None]:
# Visual comparison
fig, ax = plt.subplots(figsize=(14, 8))

x = np.arange(len(comparison_df))
width = 0.2

metrics = ['Accuracy', 'Precision (macro avg)', 'Recall (macro avg)', 'F1-Score (macro avg)']
colors_bar = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for i, metric in enumerate(metrics):
    ax.bar(x + i*width, comparison_df[metric], width, label=metric, color=colors_bar[i], alpha=0.8)

ax.set_xlabel('Model Variant', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('KNN Model Variants Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x + width * 1.5)
ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right', fontsize=9)
ax.legend(fontsize=10)
ax.grid(alpha=0.3, axis='y')
ax.set_ylim([0.9, 1.05])

plt.tight_layout()
plt.show()

## 🎓 Cross-Validation

In [None]:
# Perform 5-fold cross-validation
print("🔄 Performing 5-Fold Cross-Validation...")
print("="*70)

knn_best = KNeighborsClassifier(n_neighbors=optimal_k)
cv_scores = cross_val_score(knn_best, X_train_scaled, y_train, cv=5, scoring='accuracy')

print(f"\n📊 Cross-Validation Results:")
print(f"Fold scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Min accuracy: {cv_scores.min():.4f}")
print(f"Max accuracy: {cv_scores.max():.4f}")

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-', linewidth=2, markersize=10, color='#3498db')
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', linewidth=2, 
            label=f'Mean: {cv_scores.mean():.4f}')
plt.fill_between(range(1, 6), cv_scores.mean() - cv_scores.std(), 
                 cv_scores.mean() + cv_scores.std(), alpha=0.2, color='red')
plt.xlabel('Fold', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Cross-Validation Accuracy Scores', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.xticks(range(1, 6))
plt.tight_layout()
plt.show()

## 🎛️ Hyperparameter Tuning with GridSearchCV

In [None]:
# Grid Search for best hyperparameters
print("🔍 Performing Grid Search for Hyperparameter Tuning...")
print("="*70)

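param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance'],
    # Minkowski with p=1 is Manhattan and p=2 is Euclidean, so listing both
    # 'euclidean' and 'minkowski' (default p=2) would test the same metric twice
    'metric': ['minkowski'],
    'p': [1, 2, 3]
}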

grid_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_scaled, y_train)

print(f"\n✅ Grid Search Completed!")
print("="*70)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_test_pred_best = best_model.predict(X_test_scaled)

best_test_accuracy = accuracy_score(y_test, y_test_pred_best)
best_precision = precision_score(y_test, y_test_pred_best, average='macro')
best_recall = recall_score(y_test, y_test_pred_best, average='macro')
best_f1 = f1_score(y_test, y_test_pred_best, average='macro')

print(f"\n📊 Best Model Performance on Test Set:")
print(f"Accuracy:  {best_test_accuracy:.4f} ({best_test_accuracy*100:.2f}%)")
print(f"Precision: {best_precision:.4f}")
print(f"Recall:    {best_recall:.4f}")
print(f"F1-Score:  {best_f1:.4f}")

## 📊 Visualizing Decision Boundaries

Let's visualize how KNN makes decisions using 2D feature space.
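
The idea behind a decision-boundary plot is to classify every point on a regular grid and color it by the predicted class. The sketch below shows that idea in plain Python with a text grid instead of a contour plot; the points, labels, and grid spacing are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# Made-up 2-D training points with single-character class labels
points = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (4.0, 4.0), (4.5, 3.8), (3.8, 4.4)]
labels = ["o", "o", "o", "x", "x", "x"]

# Build a regular grid over the feature space (the role np.meshgrid plays in a plot)
step = 0.5
xs = [i * step for i in range(11)]   # 0.0, 0.5, ..., 5.0
ys = [i * step for i in range(11)]

# Classify every grid point; each character is the predicted class at that location
rows = []
for y in reversed(ys):               # top row first so y increases upward
    rows.append("".join(knn_predict(points, labels, (x, y), k=3) for x in xs))

print("\n".join(rows))
```

The line where the printed characters switch from `o` to `x` is the decision boundary; a finer grid step gives a smoother picture at the cost of more predictions.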

In [None]:
# Decision boundary visualization using 2 features (petal length and petal width)
print("🎨 Creating Decision Boundary Visualization...")
print("Using petal length and petal width (most discriminative features)")

# Select two features (indices 2 and 3 are petal length and petal width)
X_2d = X[:, [2, 3]]
X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(
    X_2d, y, test_size=0.3, random_state=42, stratify=y
)

# Scale
scaler_2d = StandardScaler()
X_train_2d_scaled = scaler_2d.fit_transform(X_train_2d)
X_test_2d_scaled = scaler_2d.transform(X_test_2d)

# Train KNN on 2D data
knn_2d = KNeighborsClassifier(n_neighbors=optimal_k)
knn_2d.fit(X_train_2d_scaled, y_train_2d)

# Create mesh
h = 0.02
x_min, x_max = X_train_2d_scaled[:, 0].min() - 1, X_train_2d_scaled[:, 0].max() + 1
y_min, y_max = X_train_2d_scaled[:, 1].min() - 1, X_train_2d_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict on mesh
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(12, 8))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ['#FF0000', '#00FF00', '#0000FF']

plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap_light)

# Plot training points
for i, color in enumerate(cmap_bold):
    mask = y_train_2d == i  # boolean mask selecting the points of class i
    plt.scatter(X_train_2d_scaled[mask, 0], X_train_2d_scaled[mask, 1],
                c=color, label=target_names[i], edgecolor='black', s=100, alpha=0.7)

plt.xlabel('Petal Length (scaled)', fontsize=12)
plt.ylabel('Petal Width (scaled)', fontsize=12)
plt.title(f'KNN Decision Boundary (K={optimal_k})', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Accuracy on 2D features
accuracy_2d = knn_2d.score(X_test_2d_scaled, y_test_2d)
print(f"\n✅ Accuracy using only 2 features: {accuracy_2d:.4f} ({accuracy_2d*100:.2f}%)")
print("💡 Even with just 2 features, KNN achieves high accuracy!")

## 💾 Model Persistence (Optional)

Save the best model for future use.
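
The save-and-load round trip works the same way for any picklable Python object. The sketch below uses an in-memory buffer and a plain dictionary as a stand-in for the fitted scaler and model; the dictionary keys and values are hypothetical.

```python
import io
import pickle

# Stand-in for fitted objects; any picklable Python object behaves the same way
bundle = {
    "scaler_means": [5.8, 3.0, 3.8, 1.2],   # hypothetical values
    "n_neighbors": 5,
    "feature_order": ["sepal length", "sepal width", "petal length", "petal width"],
}

# Serialize to an in-memory buffer (a file opened with 'wb' is used the same way)
buffer = io.BytesIO()
pickle.dump(bundle, buffer)

# Rewind to the start and load the object back
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == bundle)  # True
```

Keeping the scaler and the model in one bundle is one way to make sure they are always loaded as a matching pair. Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source.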

In [None]:
# Save the best model and scaler
import pickle

# Save best model from GridSearch
with open('knn_best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# Save scaler
with open('knn_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("✅ Model and scaler saved successfully!")
print("Files: knn_best_model.pkl, knn_scaler.pkl")

# Example: Load and use model
print("\n📖 Example - How to load and use the saved model:")
print("""
# Load model
with open('knn_best_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Load scaler
with open('knn_scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# Make predictions
new_data = [[5.1, 3.5, 1.4, 0.2]]  # Example iris flower
new_data_scaled = loaded_scaler.transform(new_data)
prediction = loaded_model.predict(new_data_scaled)
print(f"Predicted species: {target_names[prediction[0]]}")
""")

## 📝 Summary and Key Takeaways

### 🎯 Model Performance:
- KNN achieved excellent accuracy on the Iris dataset
- The optimal K was chosen by scanning K = 1 to 30, and GridSearchCV tuned K, weights, and metric with 5-fold cross-validation
- Different distance metrics showed similar performance
- Weighted KNN can improve predictions by giving more importance to closer neighbors

### 💡 Key Learnings:

1. **KNN is simple but powerful**
   - No training phase (lazy learning)
   - Works well for small-medium datasets
   - Easy to understand and implement

2. **Feature scaling is CRITICAL for KNN**
   - Without scaling, features with larger ranges dominate
   - StandardScaler transforms all features to similar scales
   - ALWAYS scale features before applying KNN

3. **Choosing the right K is important**
   - K=1: Overfitting, sensitive to noise
   - K=large: Underfitting, loses local patterns
   - Use cross-validation to find optimal K
   - For Iris, the optimal K often falls roughly between 3 and 7, depending on the split

4. **Distance metrics matter**
   - Euclidean: Most common, works well generally
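   - Manhattan: Sums absolute differences; sometimes preferred for high-dimensional data
   - Distance weighting (`weights='distance'`): Not a metric but a voting scheme in which closer neighbors get more influence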

5. **KNN has trade-offs**
   - ✅ Pros: Simple, no assumptions, works for non-linear problems
   - ❌ Cons: Slow prediction, memory-intensive, curse of dimensionality
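
The distance-weighted voting mentioned under point 4 can be sketched as follows. Each neighbor votes with weight 1/distance, so a few close neighbors can outvote a larger number of distant ones. The function name and the neighbor list are made up, and the handling of zero distances is a simplification.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """Pick a class from (distance, label) pairs using 1/distance weights."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if distance == 0:
            return label          # simplification: an exact match decides the vote
        totals[label] += 1.0 / distance
    return max(totals, key=totals.get)

# Made-up neighbors: two close Versicolor points, three farther Setosa points
neighbors = [(0.5, "versicolor"), (0.6, "versicolor"),
             (2.0, "setosa"), (2.5, "setosa"), (3.0, "setosa")]

print(weighted_vote(neighbors))  # versicolor
```

A plain majority vote over the same five neighbors would return Setosa (3 votes to 2), so the two schemes can disagree when the nearest points belong to the minority class.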

### 🔍 Iris Dataset Insights:

1. **Class Separability**
   - Setosa is completely separable from others
   - Versicolor and Virginica have slight overlap
   - Overall, well-structured data for classification

2. **Feature Importance**
   - Petal length and petal width are most discriminative
   - Sepal measurements are less discriminative
   - Even using just 2 features achieves high accuracy

3. **Perfect for Learning**
   - Clean data (no missing values)
   - Balanced classes (50 samples each)
   - Small size (150 samples)
   - Real-world botanical measurements

---

## 🚀 Next Steps:

1. **Try KNN on other datasets**
   - Wine quality dataset
   - Breast cancer dataset
   - Handwritten digits (MNIST)

2. **Compare with other algorithms**
   - Logistic Regression
   - Support Vector Machines (SVM)
   - Decision Trees
   - Random Forest
   - Neural Networks

3. **Explore advanced techniques**
   - KD-trees for faster prediction
   - Ball-trees for high-dimensional data
   - Feature selection
   - Dimensionality reduction (PCA)

4. **Deploy your model**
   - Create a web app with Streamlit or Flask
   - Build a REST API
   - Deploy to cloud (Heroku, AWS, GCP)

---

## 📚 Additional Resources:

- **Scikit-Learn KNN Documentation**: https://scikit-learn.org/stable/modules/neighbors.html
- **StatQuest KNN Video**: https://www.youtube.com/watch?v=HVXime0nQeI
- **Understanding Distance Metrics**: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
- **Curse of Dimensionality**: https://towardsdatascience.com/the-curse-of-dimensionality-50dc6e49aa1e

---

**✅ Project Complete! You've successfully implemented K-Nearest Neighbors for Iris Classification!**

**Happy Learning! 🌸🚀**