# Week 6 Lab: K-Nearest Neighbors for Customer Segmentation

## Business Scenario

You've been hired as a lead data scientist at **TrendMart**, a rapidly growing e-commerce company with over 100,000 active customers. The marketing team is struggling with several key challenges:

1. **Generic Marketing**: Current campaigns treat all customers the same, leading to low engagement rates
2. **Resource Waste**: Marketing budget is spread thin across ineffective broad campaigns
3. **Customer Churn**: Unable to identify at-risk customers before they leave
4. **Product Recommendations**: Poor personalization leads to low conversion rates

The marketing director wants to implement **targeted customer segmentation** to:
- **Identify customer types** based on purchasing behavior and demographics
- **Personalize marketing campaigns** for each customer segment
- **Optimize marketing spend** by focusing on high-value customer groups
- **Improve customer retention** through targeted interventions

Your task is to build a K-Nearest Neighbors classification system that can:
- Classify customers into meaningful segments
- Handle mixed data types (numeric and categorical)
- Provide interpretable results for marketing strategy
- Scale to handle new customers in real time

## Learning Objectives
By completing this lab, you will:
- Understand the K-Nearest Neighbors algorithm and its applications
- Learn different distance metrics (Euclidean, Manhattan, Minkowski)
- Understand why feature scaling is critical for KNN
- Optimize the k parameter through validation techniques
- Visualize decision boundaries and algorithm behavior
- Handle multi-class classification problems
- Compare KNN performance across different scenarios
- Create actionable business insights from model results

## Part 1: Setup and Data Generation

First, let's import necessary libraries and generate synthetic customer data that reflects real e-commerce patterns.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, validation_curve
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score, 
    precision_score, recall_score, f1_score
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# SciPy names the Manhattan (L1) distance `cityblock`; there is no `manhattan` function
from scipy.spatial.distance import euclidean, cityblock, minkowski
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (10, 6)

print("Setup complete!")
print("Available distance metrics: Euclidean, Manhattan, Minkowski")
print("KNN is an instance-based, lazy learning algorithm")
print("Key consideration: Feature scaling is CRITICAL for KNN!")

### Generate Synthetic Customer Data

We'll create realistic customer data with features that naturally form clusters:
- **Demographics**: Age, income, location
- **Behavior**: Purchase frequency, average order value, browsing time
- **Preferences**: Category preferences, brand loyalty, price sensitivity
- **Engagement**: Email opens, social media activity

We'll create 4 distinct customer segments:
1. **Budget Shoppers**: Price-sensitive, low spending
2. **Premium Buyers**: High income, luxury preferences
3. **Frequent Shoppers**: Regular purchases, moderate spending
4. **Occasional Buyers**: Infrequent but targeted purchases

In [None]:
def generate_customer_data(n_samples=2000):
    """
    Generate synthetic customer data with distinct segments.
    """
    np.random.seed(42)
    
    # Create customer segments with different characteristics
    segment_names = ['Budget_Shoppers', 'Premium_Buyers', 'Frequent_Shoppers', 'Occasional_Buyers']
    n_per_segment = n_samples // 4
    
    all_customers = []
    
    for segment in segment_names:
        if segment == 'Budget_Shoppers':
            # Budget-conscious customers
            age = np.random.normal(35, 8, n_per_segment)
            income = np.random.normal(35000, 8000, n_per_segment)
            avg_order_value = np.random.normal(25, 8, n_per_segment)
            purchase_frequency = np.random.normal(2, 1, n_per_segment)  # purchases per month
            browsing_time = np.random.normal(15, 5, n_per_segment)  # minutes per session
            price_sensitivity = np.random.normal(8, 1, n_per_segment)  # scale 1-10
            brand_loyalty = np.random.normal(4, 1.5, n_per_segment)  # scale 1-10
            email_engagement = np.random.normal(3, 1, n_per_segment)  # opens per month
            social_activity = np.random.normal(2, 1, n_per_segment)  # interactions per month
            
        elif segment == 'Premium_Buyers':
            # High-income, luxury-focused customers
            age = np.random.normal(45, 10, n_per_segment)
            income = np.random.normal(85000, 15000, n_per_segment)
            avg_order_value = np.random.normal(150, 40, n_per_segment)
            purchase_frequency = np.random.normal(4, 1.5, n_per_segment)
            browsing_time = np.random.normal(25, 8, n_per_segment)
            price_sensitivity = np.random.normal(3, 1, n_per_segment)
            brand_loyalty = np.random.normal(8, 1, n_per_segment)
            email_engagement = np.random.normal(8, 2, n_per_segment)
            social_activity = np.random.normal(5, 2, n_per_segment)
            
        elif segment == 'Frequent_Shoppers':
            # Regular, moderate-spending customers
            age = np.random.normal(30, 8, n_per_segment)
            income = np.random.normal(55000, 12000, n_per_segment)
            avg_order_value = np.random.normal(75, 20, n_per_segment)
            purchase_frequency = np.random.normal(8, 2, n_per_segment)
            browsing_time = np.random.normal(30, 10, n_per_segment)
            price_sensitivity = np.random.normal(6, 1.5, n_per_segment)
            brand_loyalty = np.random.normal(6, 1.5, n_per_segment)
            email_engagement = np.random.normal(12, 3, n_per_segment)
            social_activity = np.random.normal(8, 3, n_per_segment)
            
        else:  # Occasional_Buyers
            # Infrequent but targeted purchases
            age = np.random.normal(40, 12, n_per_segment)
            income = np.random.normal(60000, 15000, n_per_segment)
            avg_order_value = np.random.normal(120, 30, n_per_segment)
            purchase_frequency = np.random.normal(1, 0.5, n_per_segment)
            browsing_time = np.random.normal(45, 15, n_per_segment)
            price_sensitivity = np.random.normal(5, 2, n_per_segment)
            brand_loyalty = np.random.normal(7, 2, n_per_segment)
            email_engagement = np.random.normal(6, 2, n_per_segment)
            social_activity = np.random.normal(3, 2, n_per_segment)
        
        # Clip values to realistic ranges
        age = np.clip(age, 18, 70)
        income = np.clip(income, 20000, 200000)
        avg_order_value = np.clip(avg_order_value, 10, 500)
        purchase_frequency = np.clip(purchase_frequency, 0.5, 15)
        browsing_time = np.clip(browsing_time, 5, 120)
        price_sensitivity = np.clip(price_sensitivity, 1, 10)
        brand_loyalty = np.clip(brand_loyalty, 1, 10)
        email_engagement = np.clip(email_engagement, 0, 20)
        social_activity = np.clip(social_activity, 0, 15)
        
        # Create categorical features
        if segment == 'Budget_Shoppers':
            preferred_category = np.random.choice(['Essentials', 'Home', 'Books'], n_per_segment, p=[0.5, 0.3, 0.2])
            location_type = np.random.choice(['Urban', 'Suburban', 'Rural'], n_per_segment, p=[0.3, 0.5, 0.2])
        elif segment == 'Premium_Buyers':
            preferred_category = np.random.choice(['Electronics', 'Fashion', 'Luxury'], n_per_segment, p=[0.3, 0.4, 0.3])
            location_type = np.random.choice(['Urban', 'Suburban', 'Rural'], n_per_segment, p=[0.6, 0.3, 0.1])
        elif segment == 'Frequent_Shoppers':
            preferred_category = np.random.choice(['Fashion', 'Electronics', 'Health'], n_per_segment, p=[0.4, 0.3, 0.3])
            location_type = np.random.choice(['Urban', 'Suburban', 'Rural'], n_per_segment, p=[0.5, 0.4, 0.1])
        else:  # Occasional_Buyers
            preferred_category = np.random.choice(['Electronics', 'Luxury', 'Sports'], n_per_segment, p=[0.4, 0.3, 0.3])
            location_type = np.random.choice(['Urban', 'Suburban', 'Rural'], n_per_segment, p=[0.4, 0.5, 0.1])
        
        # Create segment dataframe
        segment_data = pd.DataFrame({
            'age': age,
            'annual_income': income,
            'avg_order_value': avg_order_value,
            'purchase_frequency': purchase_frequency,
            'browsing_time': browsing_time,
            'price_sensitivity': price_sensitivity,
            'brand_loyalty': brand_loyalty,
            'email_engagement': email_engagement,
            'social_activity': social_activity,
            'preferred_category': preferred_category,
            'location_type': location_type,
            'customer_segment': segment
        })
        
        all_customers.append(segment_data)
    
    # Combine all segments
    df = pd.concat(all_customers, ignore_index=True)
    
    # Shuffle the data
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    return df

# Generate the dataset
df = generate_customer_data(2000)
print(f"Dataset shape: {df.shape}")
print(f"\nCustomer segment distribution:")
print(df['customer_segment'].value_counts())
print(f"\nFirst few rows:")
df.head()

## Part 2: Exploratory Data Analysis

Before building our KNN model, let's understand the data structure and segment characteristics.

### Exercise 2.1: Basic Data Exploration
**Task**: Explore the dataset structure and segment distributions.

In [None]:
# TODO: Display basic dataset information
print("Dataset Information:")
print("="*50)
# YOUR CODE HERE: Display dataset info and summary statistics
______

print("\nMissing Values Check:")
# YOUR CODE HERE: Check for missing values
______

print("\nNumeric Feature Statistics by Segment:")
numeric_features = ['age', 'annual_income', 'avg_order_value', 'purchase_frequency', 
                   'browsing_time', 'price_sensitivity', 'brand_loyalty', 
                   'email_engagement', 'social_activity']

# Display mean values by segment for key features
segment_stats = df.groupby('customer_segment')[numeric_features[:4]].mean().round(1)
print(segment_stats)

### Exercise 2.2: Visualize Segment Characteristics
**Task**: Create visualizations to understand how segments differ across key features.

In [None]:
# Create comprehensive segment analysis visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

# Key features to visualize
key_features = ['annual_income', 'avg_order_value', 'purchase_frequency', 
                'browsing_time', 'price_sensitivity', 'brand_loyalty']

for i, feature in enumerate(key_features):
    # Create box plots for each feature by segment
    df.boxplot(column=feature, by='customer_segment', ax=axes[i])
    axes[i].set_title(f'{feature} by Customer Segment')
    axes[i].set_xlabel('Customer Segment')
    axes[i].set_ylabel(feature)
    # Rotate x-axis labels for better readability
    axes[i].tick_params(axis='x', rotation=45)

plt.suptitle('')  # Remove automatic title
plt.tight_layout()
plt.show()

# TODO: Create a scatter plot for a key feature pair
print("\nFeature Relationships (Income vs. Order Value):")
# Create scatter plot for income vs order value, colored by segment
plt.figure(figsize=(12, 8))
for segment in df['customer_segment'].unique():
    segment_data = df[df['customer_segment'] == segment]
    plt.scatter(segment_data['annual_income'], segment_data['avg_order_value'], 
               label=segment, alpha=0.6, s=50)

plt.xlabel('Annual Income ($)')
plt.ylabel('Average Order Value ($)')
plt.title('Customer Segments: Income vs Order Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# TODO: Analyze categorical feature distributions
print("\nCategorical Feature Analysis:")
# YOUR CODE HERE: Create crosstab for preferred_category vs customer_segment
category_analysis = ______
print("Preferred Category by Segment:")
print(category_analysis)

### Exercise 2.3: Feature Scaling Analysis
**Task**: Examine why feature scaling is critical for KNN by comparing feature ranges.
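
As a minimal sketch of the problem, consider two hypothetical customers described only by age and annual income. The values and the assumed per-feature spreads below are made up for illustration and are not taken from the lab dataset. The example shows how the share of the squared Euclidean distance shifts once each feature is divided by a typical spread.

```python
import numpy as np

# Two hypothetical customers: [age in years, annual income in dollars]
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Share of the squared Euclidean distance contributed by each feature
squared_diffs = (a - b) ** 2
shares = squared_diffs / squared_diffs.sum()
print(f"Unscaled shares (age, income): {shares.round(4)}")

# Rescale each feature by an assumed typical spread (illustrative values only)
assumed_spread = np.array([10.0, 15_000.0])
scaled_diffs = ((a - b) / assumed_spread) ** 2
scaled_shares = scaled_diffs / scaled_diffs.sum()
print(f"Scaled shares (age, income):   {scaled_shares.round(4)}")
```

Before rescaling, the 1,000-dollar income gap swamps the 35-year age gap purely because of units. After rescaling, the age gap accounts for almost all of the distance.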

In [None]:
# Analyze the importance of feature scaling for KNN
print("FEATURE SCALING ANALYSIS - Why it's CRITICAL for KNN")
print("="*60)

# Display feature ranges
feature_ranges = df[numeric_features].agg(['min', 'max', 'mean', 'std']).round(2)
feature_ranges.loc['range'] = feature_ranges.loc['max'] - feature_ranges.loc['min']

print("Raw Feature Statistics:")
print(feature_ranges)

# TODO: Calculate distance between two sample customers without scaling
# Select the numeric columns first so each row comes back as a float array (not object dtype)
customer1 = df[numeric_features].iloc[0].values
customer2 = df[numeric_features].iloc[1].values

print(f"\nDistance Calculation Example (Unscaled):")
print(f"Customer 1: {customer1}")
print(f"Customer 2: {customer2}")
print(f"Feature differences: {np.abs(customer1 - customer2)}")

# Calculate Euclidean distance
unscaled_distance = euclidean(customer1, customer2)
print(f"Unscaled Euclidean distance: {unscaled_distance:.2f}")

# Show which features dominate
squared_diffs = (customer1 - customer2) ** 2
print(f"\nSquared differences by feature:")
for i, feature in enumerate(numeric_features):
    contribution = squared_diffs[i] / sum(squared_diffs) * 100
    print(f"  {feature:20s}: {squared_diffs[i]:8.1f} ({contribution:5.1f}% of total)")

print(f"\n⚠️  PROBLEM: Features with larger scales (like annual_income) dominate the distance!")
print(f"   Unscaled KNN would classify almost entirely by income and largely ignore other features.")
print(f"   SOLUTION: Scaling puts all features on a comparable range so none dominates by units alone.")

## Part 3: Data Preprocessing

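KNN relies entirely on distances between feature vectors, so categorical features must be encoded numerically and numeric features must be scaled before training.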

### Exercise 3.1: Handle Categorical Variables
**Task**: Encode categorical variables for KNN.
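
To see why one-hot encoding is usually preferred over integer codes for a distance-based model, the sketch below compares distances between location types under both encodings. The integer codes are an arbitrary, hypothetical assignment chosen for illustration.

```python
import numpy as np

# Integer codes impose an ordering that the categories do not have
label_codes = {"Urban": 0.0, "Suburban": 1.0, "Rural": 2.0}

# One-hot vectors place every pair of distinct categories equally far apart
one_hot = {
    "Urban":    np.array([1.0, 0.0, 0.0]),
    "Suburban": np.array([0.0, 1.0, 0.0]),
    "Rural":    np.array([0.0, 0.0, 1.0]),
}

for first, second in [("Urban", "Suburban"), ("Urban", "Rural")]:
    d_label = abs(label_codes[first] - label_codes[second])
    d_onehot = np.linalg.norm(one_hot[first] - one_hot[second])
    print(f"{first} vs {second}: integer codes = {d_label:.2f}, one-hot = {d_onehot:.2f}")
```

With integer codes, Urban looks twice as far from Rural as from Suburban, which is an artifact of the coding. With one-hot vectors, all distinct categories are equidistant.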

In [None]:
# Prepare data for KNN
df_processed = df.copy()

# TODO: One-hot encode categorical variables
categorical_features = ['preferred_category', 'location_type']

print("Encoding categorical variables...")
for feature in categorical_features:
    print(f"  {feature}: {df[feature].nunique()} unique values")
    # YOUR CODE HERE: Create dummy variables for categorical features
    encoded_features = ______
    df_processed = pd.concat([df_processed, encoded_features], axis=1)

# Drop original categorical columns
df_processed = df_processed.drop(categorical_features, axis=1)

print(f"\nOriginal features: {df.shape[1]}")
print(f"After encoding: {df_processed.shape[1]}")
print(f"New feature columns:")
new_columns = set(df_processed.columns) - set(df.columns)
for col in sorted(new_columns):
    print(f"  - {col}")

### Exercise 3.2: Create Train/Test Split
**Task**: Split data while maintaining segment proportions.

In [None]:
# Separate features and target
X = df_processed.drop('customer_segment', axis=1)
y = df_processed['customer_segment']

# TODO: Create stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=______, random_state=42, stratify=______  # YOUR CODE HERE: 20% test, stratify by target
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"Features: {X_train.shape[1]}")

# Verify stratification worked
print("\nSegment distribution (should be similar):")
print("Training:")
print(y_train.value_counts(normalize=True).sort_index())
print("\nTest:")
print(y_test.value_counts(normalize=True).sort_index())

### Exercise 3.3: Feature Scaling Comparison
**Task**: Compare different scaling methods and their impact on KNN.
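
For reference, the two scalers compared in this exercise apply the following per-feature transformations. The subscript "train" is notation introduced here to stress that the statistics are estimated from the training data only and then reused for the test data.

$$
z = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}} \quad \text{(StandardScaler)}
\qquad
x' = \frac{x - x_{\min,\text{train}}}{x_{\max,\text{train}} - x_{\min,\text{train}}} \quad \text{(MinMaxScaler, default range } [0, 1] \text{)}
$$

One consequence is that a test value outside the training range can fall slightly outside $[0, 1]$ after min-max scaling.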

In [None]:
# TODO: Compare different scaling methods
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler()
}

scaled_data = {}

for scaler_name, scaler in scalers.items():
    # YOUR CODE HERE: Fit scaler on training data and transform both sets
    X_train_scaled = ______
    X_test_scaled = ______
    
    scaled_data[scaler_name] = {
        'X_train': X_train_scaled,
        'X_test': X_test_scaled,
        'scaler': scaler
    }

# Compare scaling effects
print("SCALING COMPARISON:")
print("="*50)

# Show effect on first few features
sample_features = ['age', 'annual_income', 'avg_order_value']
sample_indices = [list(X.columns).index(feat) for feat in sample_features]

print(f"Original data (first 3 samples, selected features):")
print(X_train.iloc[:3][sample_features])

for scaler_name, data in scaled_data.items():
    print(f"\n{scaler_name} scaled data:")
    sample_scaled = data['X_train'][:3, sample_indices]
    sample_df = pd.DataFrame(sample_scaled, columns=sample_features)
    print(sample_df)

# TODO: Calculate distance comparison
print(f"\nDistance comparison between first two customers:")
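# Cast to float so one-hot (boolean) columns can be used in the distance calculation
customer1_orig = X_train.iloc[0].to_numpy(dtype=float)
customer2_orig = X_train.iloc[1].to_numpy(dtype=float)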
orig_distance = euclidean(customer1_orig, customer2_orig)
print(f"  Unscaled: {orig_distance:.2f}")

for scaler_name, data in scaled_data.items():
    customer1_scaled = data['X_train'][0]
    customer2_scaled = data['X_train'][1]
    scaled_distance = euclidean(customer1_scaled, customer2_scaled)
    print(f"  {scaler_name}: {scaled_distance:.2f}")

## Part 4: Understanding Distance Metrics

KNN uses distance metrics to find nearest neighbors. Let's explore different metrics.
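
The Minkowski distance with parameter `p` is the `p`-th root of the sum of absolute coordinate differences raised to the power `p`. The following sketch implements it in plain Python for a small hand-picked pair of points. The helper name `minkowski_distance` is introduced here for illustration and is not part of any library.

```python
def minkowski_distance(u, v, p):
    """Minkowski (Lp) distance between two equal-length sequences of numbers."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

u = [0.0, 0.0]
v = [3.0, 4.0]

manhattan_d = minkowski_distance(u, v, p=1)   # |3| + |4|
euclidean_d = minkowski_distance(u, v, p=2)   # sqrt(3^2 + 4^2)
cubic_d = minkowski_distance(u, v, p=3)       # (3^3 + 4^3)^(1/3)

print(f"p=1 (Manhattan): {manhattan_d:.4f}")
print(f"p=2 (Euclidean): {euclidean_d:.4f}")
print(f"p=3:             {cubic_d:.4f}")
```

The distance shrinks as `p` grows and approaches the largest single coordinate difference (4 in this example), so larger `p` values put more weight on the feature with the biggest gap.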

### Exercise 4.1: Distance Metrics Comparison
**Task**: Compare Euclidean, Manhattan, and Minkowski distances.

In [None]:
# Use StandardScaler data for distance metric comparison
X_train_scaled = scaled_data['StandardScaler']['X_train']
X_test_scaled = scaled_data['StandardScaler']['X_test']

# TODO: Calculate different distances between sample customers
customer1 = X_train_scaled[0]
customer2 = X_train_scaled[1]
customer3 = X_train_scaled[2]

print("DISTANCE METRICS COMPARISON")
print("="*50)
print(f"Comparing distances between Customer 1 and others...\n")

customers = [('Customer 2', customer2), ('Customer 3', customer3)]

for name, customer in customers:
    print(f"Customer 1 vs {name}:")
    
    # Calculate different distance metrics
    euclidean_dist = euclidean(customer1, customer)
    manhattan_dist = ______  # YOUR CODE HERE: Calculate Manhattan distance
    minkowski_dist_p3 = minkowski(customer1, customer, p=3)
    
    print(f"  Euclidean (L2):   {euclidean_dist:.4f}")
    print(f"  Manhattan (L1):   {manhattan_dist:.4f}")
    print(f"  Minkowski (p=3):  {minkowski_dist_p3:.4f}")
    print()

# Summarize distance metric characteristics
print("Distance Metric Characteristics:")
print("-" * 30)
print("• Euclidean (L2): Standard straight-line distance")
print("  - Sensitive to outliers")
print("  - Works well with continuous features")
print("\n• Manhattan (L1): Sum of absolute differences")
print("  - More robust to outliers")
print("  - Often preferred for high-dimensional data")
print("\n• Minkowski (Lp): Generalization of both")
print("  - p=1: Manhattan, p=2: Euclidean")
print("  - Higher p values give more weight to the largest feature differences")

# TODO: Test KNN with different distance metrics
print("\nKNN Performance with Different Distance Metrics:")
print("="*50)

distance_metrics = {
    'euclidean': 'Euclidean',
    'manhattan': 'Manhattan',
    'minkowski': 'Minkowski (p=3)'
}

for metric, name in distance_metrics.items():
    if metric == 'minkowski':
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn = ______  # YOUR CODE HERE: Create KNN with appropriate metric
    
    # Fit and evaluate
    knn.fit(X_train_scaled, y_train)
    accuracy = knn.score(X_test_scaled, y_test)
    print(f"{name:20s}: {accuracy:.4f}")

## Part 5: K Parameter Optimization

The choice of k (number of neighbors) is crucial for KNN performance.
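
To make the effect of k concrete, here is a minimal from-scratch sketch of majority-vote KNN on a hypothetical one-dimensional dataset. The function `knn_predict` and the data are illustrative assumptions, not the lab's model. The query point sits next to two `A` points, with three `B` points farther away.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict a label for `query` by majority vote among its k nearest 1-D points."""
    # Sort training indices by absolute distance to the query
    order = sorted(range(len(train_points)), key=lambda i: abs(train_points[i] - query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) returns the label with the highest vote count
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training set: two 'A' points near the query, three 'B' points farther away
train_points = [0.0, 1.0, 3.0, 3.5, 4.0]
train_labels = ["A", "A", "B", "B", "B"]
query = 1.4

for k in (1, 3, 5):
    print(f"k={k}: predicted {knn_predict(train_points, train_labels, query, k)}")
```

Small k follows the local neighborhood and predicts `A`. With k=5 the more numerous but more distant `B` points outvote it. This is the same bias-variance trade-off the exercises below explore on the customer data.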

### Exercise 5.1: K Value Analysis
**Task**: Find the optimal k value using validation curves.

In [None]:
# TODO: Test different k values
k_values = list(range(1, 31, 2))  # Test odd values from 1 to 29
train_accuracies = []
test_accuracies = []

print("K VALUE OPTIMIZATION")
print("="*40)
print(f"Testing k values: {k_values[:5]}...{k_values[-5:]}")

for k in k_values:
    # YOUR CODE HERE: Create and train KNN with different k values
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(______) # Fit on scaled training data
    
    # Calculate accuracies
    train_acc = knn.score(X_train_scaled, y_train)
    test_acc = ______  # YOUR CODE HERE: Calculate test accuracy
    
    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

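# Find the k with the highest test accuracy
# NOTE: Choosing k on the test set leaks test information into model selection,
# so this accuracy is optimistic. Exercise 5.2 selects k with cross-validation instead.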
optimal_k_idx = np.argmax(test_accuracies)
optimal_k = k_values[optimal_k_idx]
best_test_acc = test_accuracies[optimal_k_idx]

print(f"\nOptimal k: {optimal_k}")
print(f"Best test accuracy: {best_test_acc:.4f}")

# Plot k vs accuracy
plt.figure(figsize=(12, 6))
plt.plot(k_values, train_accuracies, 'o-', label='Training Accuracy', linewidth=2)
plt.plot(k_values, test_accuracies, 'o-', label='Test Accuracy', linewidth=2)
plt.axvline(optimal_k, color='red', linestyle='--', alpha=0.7, label=f'Optimal k = {optimal_k}')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN Performance vs K Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Analyze overfitting vs underfitting
print("\nBias-Variance Analysis:")
print(f"k=1:  Train={train_accuracies[0]:.4f}, Test={test_accuracies[0]:.4f} (High Variance/Overfitting)")
print(f"k={optimal_k}:  Train={train_accuracies[optimal_k_idx]:.4f}, Test={test_accuracies[optimal_k_idx]:.4f} (Optimal)")
print(f"k={k_values[-1]}: Train={train_accuracies[-1]:.4f}, Test={test_accuracies[-1]:.4f} (High Bias/Underfitting)")

### Exercise 5.2: Cross-Validation for Robust K Selection
**Task**: Use cross-validation to select k more robustly.
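
One refinement worth knowing: if the scaler is fit on the full training set before cross-validation, each validation fold has already influenced the scaling statistics. A common remedy is to wrap the scaler and the classifier in a `Pipeline` so the scaler is re-fit inside every fold. The sketch below shows the pattern on synthetic stand-in data from `make_blobs`. The names `X_demo`, `y_demo`, `cv_demo`, and the k grid are illustrative assumptions, not the lab's variables.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the lab's customer dataset)
X_demo, y_demo = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

# The scaler is re-fit on the training portion of every fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv_demo, scoring="accuracy")
search.fit(X_demo, y_demo)

print(f"Best k: {search.best_params_['knn__n_neighbors']}")
print(f"Mean CV accuracy: {search.best_score_:.4f}")
```

On well-separated data the difference from pre-scaling is usually small, but the pipeline version keeps the validation folds untouched by preprocessing.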

In [None]:
# TODO: Use cross-validation for k selection
from sklearn.model_selection import StratifiedKFold

cv_k_values = [1, 3, 5, 7, 9, 11, 15, 19, 25]
cv_scores_mean = []
cv_scores_std = []

print("CROSS-VALIDATION K SELECTION")
print("="*50)

# Use stratified k-fold to maintain class balance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for k in cv_k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # Perform stratified 5-fold cross-validation on the training data
    cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv=cv, scoring='accuracy')
    
    mean_score = cv_scores.mean()
    std_score = cv_scores.std()
    
    cv_scores_mean.append(mean_score)
    cv_scores_std.append(std_score)
    
    print(f"k={k:2d}: {mean_score:.4f} (+/- {std_score:.4f})")

# Find best k from cross-validation
best_cv_k_idx = np.argmax(cv_scores_mean)
best_cv_k = cv_k_values[best_cv_k_idx]
best_cv_score = cv_scores_mean[best_cv_k_idx]

print(f"\nBest k from cross-validation: {best_cv_k}")
print(f"Cross-validation accuracy: {best_cv_score:.4f}")

# Plot cross-validation results
plt.figure(figsize=(10, 6))
plt.errorbar(cv_k_values, cv_scores_mean, yerr=cv_scores_std, 
             marker='o', linewidth=2, capsize=5)
plt.axvline(best_cv_k, color='red', linestyle='--', alpha=0.7, 
           label=f'Best k = {best_cv_k}')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Cross-Validation K Selection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Part 6: Model Training and Evaluation

Now let's train our final KNN model with optimal parameters.

### Exercise 6.1: Train Final KNN Model
**Task**: Train KNN with optimal k and evaluate performance.

In [None]:
# TODO: Train final KNN model with optimal parameters
final_knn = KNeighborsClassifier(
    n_neighbors=best_cv_k,
    metric='euclidean',
    weights='uniform'  # Can also try 'distance' for distance-weighted voting
)

# YOUR CODE HERE: Fit the model
______

# Make predictions
y_pred = final_knn.predict(X_test_scaled)
y_pred_proba = final_knn.predict_proba(X_test_scaled)

# Calculate comprehensive metrics
print("FINAL KNN MODEL PERFORMANCE")
print("="*50)
print(f"Model Configuration:")
print(f"  k = {best_cv_k}")
print(f"  Distance metric: Euclidean")
print(f"  Feature scaling: StandardScaler")
print(f"  Features: {X_train_scaled.shape[1]}")

print(f"\nOverall Performance:")
accuracy = accuracy_score(y_test, y_pred)
print(f"  Accuracy: {accuracy:.4f}")

# Per-class performance
print(f"\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=sorted(df['customer_segment'].unique())))

# TODO: Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=sorted(df['customer_segment'].unique()),
            yticklabels=sorted(df['customer_segment'].unique()))
plt.title('Customer Segment Classification - Confusion Matrix')
plt.xlabel('Predicted Segment')
plt.ylabel('Actual Segment')
plt.tight_layout()
plt.show()

# Calculate per-segment accuracy (equivalent to per-class recall: correct / actual count)
print("\nPer-Segment Accuracy (Recall):")
for i, segment in enumerate(sorted(df['customer_segment'].unique())):
    segment_accuracy = cm[i, i] / cm[i, :].sum()
    print(f"  {segment:20s}: {segment_accuracy:.4f}")

### Exercise 6.2: Compare with Different Configurations
**Task**: Compare uniform vs distance-weighted voting.
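
The difference between the two voting schemes can be sketched in a few lines of plain Python. With uniform weights every neighbor counts once. With distance weights each neighbor counts in proportion to the inverse of its distance, which is how scikit-learn describes `weights='distance'`. The `vote` helper and the neighbor list are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

def vote(neighbors, weighted):
    """Return the winning label from (distance, label) pairs.

    With weighted=False every neighbor counts once; with weighted=True each
    neighbor counts 1/distance (assumes all distances are strictly positive).
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += (1.0 / distance) if weighted else 1.0
    return max(totals, key=totals.get)

# Hypothetical neighbors of one query customer (k=3)
neighbors = [(0.1, "Premium_Buyers"), (1.0, "Occasional_Buyers"), (1.0, "Occasional_Buyers")]

print("Uniform vote: ", vote(neighbors, weighted=False))
print("Distance vote:", vote(neighbors, weighted=True))
```

The uniform vote follows the two-to-one majority, while the distance-weighted vote follows the single very close neighbor.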

In [None]:
# TODO: Compare different KNN configurations
configurations = {
    'Uniform Weights': {'weights': 'uniform'},
    'Distance Weights': {'weights': 'distance'},
    'Manhattan Distance': {'weights': 'uniform', 'metric': 'manhattan'},
    'Large K (k=15)': {'weights': 'uniform', 'n_neighbors': 15}
}

print("KNN CONFIGURATION COMPARISON")
print("="*50)

results = {}

for config_name, params in configurations.items():
    # Set default parameters
    knn_params = {
        'n_neighbors': best_cv_k,
        'metric': 'euclidean',
        'weights': 'uniform'
    }
    # Update with specific configuration
    knn_params.update(params)
    
    # Train model
    knn_config = KNeighborsClassifier(**knn_params)
    knn_config.fit(______) # YOUR CODE HERE: Fit on scaled training data
    
    # Evaluate
    train_acc = knn_config.score(X_train_scaled, y_train)
    test_acc = ______  # YOUR CODE HERE: Calculate test accuracy
    
    results[config_name] = {'train': train_acc, 'test': test_acc}
    print(f"{config_name:20s}: Train={train_acc:.4f}, Test={test_acc:.4f}")

# Visualize comparison
config_names = list(results.keys())
train_scores = [results[name]['train'] for name in config_names]
test_scores = [results[name]['test'] for name in config_names]

plt.figure(figsize=(12, 6))
x = range(len(config_names))
width = 0.35

plt.bar([i - width/2 for i in x], train_scores, width, label='Training', alpha=0.7)
plt.bar([i + width/2 for i in x], test_scores, width, label='Test', alpha=0.7)

plt.xlabel('Configuration')
plt.ylabel('Accuracy')
plt.title('KNN Configuration Comparison')
plt.xticks(x, config_names, rotation=45, ha='right')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Part 7: Model Interpretation and Business Insights

### Exercise 7.1: Analyze Misclassified Customers
**Task**: Understand which customers are being misclassified and why.
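
When reading the confidence scores in this exercise, keep in mind that with uniform weights `predict_proba` reports the fraction of the k neighbors that belong to each class, so confidence can only take values that are multiples of 1/k. The toy example below uses made-up one-dimensional data with k=3 to illustrate this.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training data with two classes
X_demo = np.array([[0.0], [2.0], [4.0], [9.0], [11.0], [12.0]])
y_demo = np.array(["A", "A", "A", "B", "B", "B"])

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# The three nearest neighbors of x=6 are 4.0 (A), 9.0 (B), and 2.0 (A)
proba = knn_demo.predict_proba([[6.0]])[0]
for label, p in zip(knn_demo.classes_, proba):
    print(f"P({label}) = {p:.3f}")
```

Two of the three neighbors are `A`, so the reported confidence is 2/3. A small k therefore gives coarse confidence values, which matters when thresholding on confidence.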

In [None]:
# TODO: Analyze misclassified samples
# Get test set with predictions
test_results = X_test.copy()
test_results['actual_segment'] = y_test
test_results['predicted_segment'] = y_pred
test_results['correct'] = (y_test == y_pred)

# Find misclassified customers
misclassified = test_results[~test_results['correct']]
print(f"MISCLASSIFICATION ANALYSIS")
print("="*50)
print(f"Total test samples: {len(test_results)}")
print(f"Misclassified: {len(misclassified)} ({len(misclassified)/len(test_results):.1%})")

# Analyze misclassification patterns
print(f"\nMisclassification Patterns:")
misclass_patterns = misclassified.groupby(['actual_segment', 'predicted_segment']).size()
print(misclass_patterns)

# TODO: Find borderline cases
print(f"\nBorderline Cases Analysis:")
# Get prediction probabilities for misclassified samples
misclassified_indices = test_results[~test_results['correct']].index
test_indices = test_results.index
misclass_prob_indices = [list(test_indices).index(idx) for idx in misclassified_indices]

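# Show the misclassified customers with the lowest prediction confidence
print(f"Examples of uncertain predictions (lowest confidence first):")
lowest_confidence_first = sorted(misclass_prob_indices, key=lambda idx: np.max(y_pred_proba[idx]))
for i, prob_idx in enumerate(lowest_confidence_first[:5]):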
    actual = y_test.iloc[prob_idx]
    predicted = y_pred[prob_idx]
    max_prob = np.max(y_pred_proba[prob_idx])
    
    print(f"  Customer {i+1}: {actual} → {predicted} (confidence: {max_prob:.3f})")

# Visualize prediction confidence
confidence_scores = np.max(y_pred_proba, axis=1)
plt.figure(figsize=(10, 6))
plt.hist(confidence_scores, bins=20, alpha=0.7, edgecolor='black')
plt.axvline(confidence_scores.mean(), color='red', linestyle='--', 
           label=f'Mean confidence: {confidence_scores.mean():.3f}')
plt.xlabel('Prediction Confidence')
plt.ylabel('Number of Customers')
plt.title('Distribution of Prediction Confidence Scores')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nConfidence Statistics:")
print(f"  Mean confidence: {confidence_scores.mean():.3f}")
print(f"  Min confidence: {confidence_scores.min():.3f}")
print(f"  Customers with confidence < 0.5: {(confidence_scores < 0.5).sum()}")

### Exercise 7.2: Feature Importance Analysis
**Task**: Understand which features are most important for classification.
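
KNN has no built-in feature importances, so this exercise uses a variance ratio as a proxy: the variance of the segment means divided by the average variance within segments. The sketch below applies the same ratio to two hypothetical features across two made-up segments. The helper `variance_ratio` and the numbers are illustrative only.

```python
import numpy as np

def variance_ratio(groups):
    """Variance of the group means divided by the mean within-group variance."""
    group_means = [np.mean(g) for g in groups]
    between_var = np.var(group_means)                          # population variance of the means
    within_var = np.mean([np.var(g, ddof=1) for g in groups])  # sample variance, like pandas .var()
    return between_var / within_var if within_var > 0 else 0.0

# Hypothetical feature values for two segments
separating_feature = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]   # segment means 2 and 12
overlapping_feature = [[1.0, 5.0, 9.0], [2.0, 5.0, 8.0]]     # segment means 5 and 5

print(f"Separating feature ratio:  {variance_ratio(separating_feature):.2f}")
print(f"Overlapping feature ratio: {variance_ratio(overlapping_feature):.2f}")
```

A feature whose segment means are far apart relative to the spread inside each segment gets a high ratio. A feature with identical segment means gets a ratio of zero, no matter how much it varies overall.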

In [None]:
# TODO: Analyze feature importance by examining feature differences between segments
print("FEATURE IMPORTANCE ANALYSIS")
print("="*50)

# Calculate feature statistics by segment
feature_analysis = []
segments = sorted(df['customer_segment'].unique())

for feature in numeric_features:
    # Calculate variance between segments (F-statistic like measure)
    segment_means = []
    for segment in segments:
        segment_data = df[df['customer_segment'] == segment][feature]
        segment_means.append(segment_data.mean())
    
    # Between-segment variance of the segment means vs. average within-segment variance
    between_var = np.var(segment_means)
    within_var = np.mean([df[df['customer_segment'] == seg][feature].var() for seg in segments])
    
    # Variance ratio as a simple importance proxy (F-like, not a formal ANOVA F-statistic)
    f_ratio = between_var / within_var if within_var > 0 else 0
    
    feature_analysis.append({
        'feature': feature,
        'f_ratio': f_ratio,
        'between_var': between_var,
        'within_var': within_var
    })

# Sort by importance
feature_df = pd.DataFrame(feature_analysis).sort_values('f_ratio', ascending=False)

print("Feature Importance (F-ratio - higher is more discriminative):")
print("-" * 60)
for idx, row in feature_df.iterrows():
    print(f"{row['feature']:20s}: {row['f_ratio']:8.2f}")

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_df)), feature_df['f_ratio'], alpha=0.7)
plt.yticks(range(len(feature_df)), feature_df['feature'])
plt.xlabel('F-ratio (Discriminative Power)')
plt.title('Feature Importance for Customer Segmentation')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# TODO: Show segment characteristics for top features
print(f"\nTop 3 Most Discriminative Features:")
top_features = feature_df.head(3)['feature'].tolist()
segment_profiles = df.groupby('customer_segment')[top_features].mean().round(1)
print(segment_profiles)

## Part 8: Business Applications and Recommendations

### Exercise 8.1: Customer Segment Profiling
**Task**: Create detailed profiles for each customer segment.

In [None]:
# TODO: Create comprehensive segment profiles
print("CUSTOMER SEGMENT PROFILES")
print("="*60)

# Calculate comprehensive statistics by segment
segment_profiles = {}

for segment in segments:
    segment_data = df[df['customer_segment'] == segment]
    
    profile = {
        'size': len(segment_data),
        'percentage': len(segment_data) / len(df) * 100,
        'avg_age': segment_data['age'].mean(),
        'avg_income': segment_data['annual_income'].mean(),
        'avg_order_value': segment_data['avg_order_value'].mean(),
        'purchase_frequency': segment_data['purchase_frequency'].mean(),
        'browsing_time': segment_data['browsing_time'].mean(),
        'price_sensitivity': segment_data['price_sensitivity'].mean(),
        'brand_loyalty': segment_data['brand_loyalty'].mean(),
        'email_engagement': segment_data['email_engagement'].mean(),
        'top_category': segment_data['preferred_category'].mode().iloc[0],
        'top_location': segment_data['location_type'].mode().iloc[0]
    }
    
    segment_profiles[segment] = profile

# Display detailed profiles
for segment, profile in segment_profiles.items():
    print(f"\n{segment.upper().replace('_', ' ')}")
    print("-" * 40)
    print(f"Size: {profile['size']:,} customers ({profile['percentage']:.1f}%)")
    print(f"Demographics: {profile['avg_age']:.0f} years old, ${profile['avg_income']:,.0f} income")
    print(f"Shopping: ${profile['avg_order_value']:.0f} avg order, {profile['purchase_frequency']:.1f} purchases/month")
    print(f"Behavior: {profile['browsing_time']:.0f}min browsing, prefers {profile['top_category']}")
    print(f"Engagement: {profile['email_engagement']:.1f} email opens/month")
    print(f"Preferences: Price sensitivity {profile['price_sensitivity']:.1f}/10, Brand loyalty {profile['brand_loyalty']:.1f}/10")
    print(f"Location: Primarily {profile['top_location']}")

# TODO: Calculate business value by segment
print(f"\n\nBUSINESS VALUE ANALYSIS")
print("="*50)

for segment, profile in segment_profiles.items():
    # Calculate annual value per customer
    monthly_value = profile['avg_order_value'] * profile['purchase_frequency']
    annual_value = monthly_value * 12
    total_segment_value = annual_value * profile['size']
    
    print(f"\n{segment.replace('_', ' ').title()}:")
    print(f"  Annual value per customer: ${annual_value:,.0f}")
    print(f"  Total segment value: ${total_segment_value:,.0f}")
    print(f"  Marketing receptivity: {profile['email_engagement']/20*100:.0f}%")
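
The per-segment loop above can also be expressed as one grouped aggregation. The sketch below uses a small made-up DataFrame (the segment names and values are illustrative only) and assumes, as the cell above does, that `purchase_frequency` is measured in purchases per month.

```python
import pandas as pd

# Hypothetical toy data, two customers per segment
toy = pd.DataFrame({
    'customer_segment': ['Budget', 'Budget', 'Premium', 'Premium'],
    'avg_order_value': [20.0, 40.0, 200.0, 100.0],
    'purchase_frequency': [2.0, 4.0, 1.0, 3.0],
})

# Segment size and segment-level means in a single pass
value = toy.groupby('customer_segment').agg(
    n_customers=('avg_order_value', 'size'),
    avg_order_value=('avg_order_value', 'mean'),
    purchase_frequency=('purchase_frequency', 'mean'),
)

# Same formula as the loop: mean order value x mean monthly frequency x 12
value['annual_value_per_customer'] = (
    value['avg_order_value'] * value['purchase_frequency'] * 12
)
value['total_segment_value'] = value['annual_value_per_customer'] * value['n_customers']
print(value)
```

Multiplying the two segment means, as here and in the loop, is an approximation. Averaging the per-customer product `avg_order_value * purchase_frequency` would also capture any correlation between order size and frequency.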

### Exercise 8.2: Marketing Strategy Recommendations
**Task**: Generate actionable marketing recommendations based on segment analysis.

In [None]:
# Generate comprehensive business recommendations
print("\n" + "="*80)
print("TRENDMART CUSTOMER SEGMENTATION - MARKETING STRATEGY RECOMMENDATIONS")
print("="*80)

print(f"\n📊 MODEL PERFORMANCE SUMMARY")
print("-" * 50)
print(f"• Classification Accuracy: {accuracy:.1%}")
print(f"• Model Type: K-Nearest Neighbors (k={best_cv_k})")
print(f"• Features Used: {X_train.shape[1]} (scaled and encoded)")
print(f"• Ready for real-time customer classification")

print(f"\n🎯 SEGMENT-SPECIFIC MARKETING STRATEGIES")
print("-" * 50)

# Strategy recommendations by segment
strategies = {
    'Budget_Shoppers': {
        'focus': 'Price-driven campaigns',
        'tactics': ['Discount promotions', 'Free shipping offers', 'Bundle deals', 'Clearance alerts'],
        'channels': ['Email newsletters', 'SMS promotions'],
        'kpis': ['Conversion rate', 'Cart abandonment reduction']
    },
    'Premium_Buyers': {
        'focus': 'Quality and exclusivity',
        'tactics': ['Premium product showcases', 'Early access to new arrivals', 'Luxury brand partnerships', 'VIP customer service'],
        'channels': ['Personalized emails', 'Social media ads', 'Influencer partnerships'],
        'kpis': ['Average order value', 'Customer lifetime value']
    },
    'Frequent_Shoppers': {
        'focus': 'Engagement and loyalty',
        'tactics': ['Loyalty point programs', 'Personalized recommendations', 'Regular engagement campaigns', 'Cross-selling'],
        'channels': ['App notifications', 'Personalized emails', 'Social media'],
        'kpis': ['Purchase frequency', 'Customer retention rate']
    },
    'Occasional_Buyers': {
        'focus': 'Targeted activation',
        'tactics': ['Need-based targeting', 'Seasonal campaigns', 'Wishlist reminders', 'Time-limited offers'],
        'channels': ['Retargeting ads', 'Quarterly email campaigns'],
        'kpis': ['Purchase activation rate', 'Time between purchases']
    }
}

for segment, strategy in strategies.items():
    segment_name = segment.replace('_', ' ').title()
    profile = segment_profiles[segment]
    
    print(f"\n🔹 {segment_name}")
    print(f"   Size: {profile['size']:,} customers ({profile['percentage']:.1f}% of base)")
    print(f"   Focus: {strategy['focus']}")
    print(f"   Key Tactics: {', '.join(strategy['tactics'][:3])}")
    print(f"   Channels: {', '.join(strategy['channels'])}")
    print(f"   Success Metrics: {', '.join(strategy['kpis'])}")

print(f"\n💡 IMPLEMENTATION ROADMAP")
print("-" * 50)
print(f"1. PHASE 1 - MODEL DEPLOYMENT (Week 1-2)")
print(f"   • Deploy KNN model to production environment")
print(f"   • Set up real-time customer classification pipeline")
print(f"   • Create segment dashboards for marketing team")
print(f"")
print(f"2. PHASE 2 - CAMPAIGN SETUP (Week 3-4)")
print(f"   • Design segment-specific email templates")
print(f"   • Configure automated campaign triggers")
print(f"   • Set up A/B testing framework")
print(f"")
print(f"3. PHASE 3 - LAUNCH & OPTIMIZE (Week 5+)")
print(f"   • Launch targeted campaigns for each segment")
print(f"   • Monitor performance and adjust tactics")
print(f"   • Continuously retrain model with new customer data")

print(f"\n📈 EXPECTED BUSINESS IMPACT")
print("-" * 50)
total_customers = sum(profile['size'] for profile in segment_profiles.values())
total_annual_value = sum(
    profile['size'] * profile['avg_order_value'] * profile['purchase_frequency'] * 12
    for profile in segment_profiles.values()
)

print(f"• Customer Base: {total_customers:,} customers")
print(f"• Total Annual GMV: ${total_annual_value:,.0f}")
print(f"• Hypothesized Improvements with Targeted Marketing (illustrative targets, to be validated by A/B tests):")
print(f"  - Email engagement: +25-40% through personalization")
print(f"  - Conversion rates: +15-30% through segment-specific offers")
print(f"  - Customer retention: +10-20% through loyalty programs")
print(f"  - Marketing ROI: +50-100% through better targeting")

print(f"\n⚠️  MONITORING & MAINTENANCE")
print("-" * 50)
print(f"• Model Retraining: Monthly with new customer data")
print(f"• Performance Monitoring: Track classification accuracy weekly")
print(f"• Segment Drift Detection: Monitor changing customer behaviors")
print(f"• Business Metrics: Weekly campaign performance reviews")
print(f"• Feature Updates: Quarterly addition of new behavioral data")

print(f"\n" + "="*80)
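
The "Segment Drift Detection" item in the monitoring list can be made concrete in many ways. One simple option, sketched here as an assumption rather than a prescribed method, is a two-sample Kolmogorov-Smirnov test comparing a feature's training-time distribution with a recent window. The helper name `feature_drifted`, the threshold `alpha`, and the simulated order values are all hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference, current, alpha=0.01):
    """Return True if a two-sample KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(reference, current)
    return bool(result.pvalue < alpha)

rng = np.random.default_rng(0)

# Hypothetical order values: training window vs. a later window with a shifted mean
reference_orders = rng.normal(loc=60, scale=15, size=1000)
shifted_orders = rng.normal(loc=75, scale=15, size=1000)

print("Shifted window drifted:", feature_drifted(reference_orders, shifted_orders))
print("Identical window drifted:", feature_drifted(reference_orders, reference_orders))
```

In practice the test would be run per numeric feature on a schedule, and a flagged feature would prompt a closer look before retraining.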

## Part 9: Model Deployment Considerations

### Exercise 9.1: Create Production Pipeline
**Task**: Design a production-ready pipeline for real-time customer classification.

In [None]:
# TODO: Create a production pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Define preprocessing steps
numeric_features = ['age', 'annual_income', 'avg_order_value', 'purchase_frequency',
                   'browsing_time', 'price_sensitivity', 'brand_loyalty', 
                   'email_engagement', 'social_activity']

categorical_features = ['preferred_category', 'location_type']

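# Preprocessing for raw (unencoded) customer records.
# Note: `preprocessor` is shown for reference only and is NOT used in the pipeline
# below; categorical columns passed through as strings would still need encoding
# before KNN can compute distances.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', 'passthrough', categorical_features)  # Will be handled separately
    ]
)

# Create full pipeline (simplified for demo: X_train is already encoded,
# so a single StandardScaler is applied to every column)
production_pipeline = Pipeline([
    ('preprocessor', StandardScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=best_cv_k, metric='euclidean'))
])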

# Train on the processed data (already encoded)
production_pipeline.fit(X_train, y_train)

# Test pipeline performance
pipeline_score = production_pipeline.score(X_test, y_test)
print(f"PRODUCTION PIPELINE")
print("="*50)
print(f"Pipeline Accuracy: {pipeline_score:.4f}")
print(f"Ready for deployment: {'✅' if pipeline_score > 0.8 else '⚠️'}")

# Demonstrate real-time prediction
print(f"\nReal-time Prediction Example:")
sample_customer = X_test.iloc[0:1]  # Take first test customer
predicted_segment = production_pipeline.predict(sample_customer)[0]
prediction_proba = production_pipeline.predict_proba(sample_customer)[0]
confidence = max(prediction_proba)

print(f"Customer Features: {sample_customer.iloc[0].head(3).to_dict()}...")  # Show first 3 features
print(f"Predicted Segment: {predicted_segment}")
print(f"Confidence: {confidence:.3f}")
print(f"Actual Segment: {y_test.iloc[0]}")
print(f"Prediction: {'✅ Correct' if predicted_segment == y_test.iloc[0] else '❌ Incorrect'}")

# TODO: Show deployment architecture
print(f"\nPRODUCTION ARCHITECTURE RECOMMENDATIONS")
print("-" * 50)
print(f"1. Model Serving:")
print(f"   • REST API with Flask/FastAPI")
print(f"   • Response time: <100ms for real-time classification")
print(f"   • Input: Customer features (JSON)")
print(f"   • Output: Segment + confidence score")
print(f"")
print(f"2. Data Pipeline:")
print(f"   • Feature engineering from raw customer data")
print(f"   • Real-time feature computation")
print(f"   • Feature validation and error handling")
print(f"")
print(f"3. Monitoring:")
print(f"   • Model performance metrics")
print(f"   • Data drift detection")
print(f"   • Prediction confidence monitoring")
print(f"   • Business impact tracking")
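
The cell above trains on already-encoded features, so it cannot accept a raw customer record directly. The sketch below shows one way a pipeline could take raw records (numeric plus categorical) and return a segment with a confidence score, matching the "Input: JSON, Output: Segment + confidence" outline. Everything here is an assumption for illustration: the toy data, the two segment labels, `k=5`, and the helper `classify_customer` are hypothetical, and one-hot encoding is just one reasonable choice for the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data with two well-separated segments
rng = np.random.default_rng(1)
n = 100
train = pd.DataFrame({
    'annual_income': np.concatenate([rng.normal(30_000, 3_000, n),
                                     rng.normal(120_000, 5_000, n)]),
    'avg_order_value': np.concatenate([rng.normal(25, 5, n),
                                       rng.normal(250, 20, n)]),
    'location_type': ['Rural'] * n + ['Urban'] * n,
})
labels = np.array(['Budget_Shoppers'] * n + ['Premium_Buyers'] * n)

numeric_cols = ['annual_income', 'avg_order_value']
categorical_cols = ['location_type']

raw_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_cols),
            # Categories unseen during training are encoded as all zeros
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ],
        sparse_threshold=0,  # always return a dense array
    )),
    ('classifier', KNeighborsClassifier(n_neighbors=5, metric='euclidean')),
])
raw_pipeline.fit(train, labels)

def classify_customer(record: dict) -> dict:
    """Classify one raw customer record (e.g. parsed from a JSON request body)."""
    frame = pd.DataFrame([record])
    proba = raw_pipeline.predict_proba(frame)[0]
    best = int(np.argmax(proba))
    return {'segment': str(raw_pipeline.classes_[best]),
            'confidence': float(proba[best])}

result = classify_customer(
    {'annual_income': 125_000, 'avg_order_value': 240, 'location_type': 'Urban'}
)
print(result)
```

Because scaling and encoding live inside the pipeline, the fitted object can be saved and loaded as a single artifact, and the serving code never has to repeat the preprocessing by hand.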

## Conclusion

Congratulations! You've successfully completed a comprehensive K-Nearest Neighbors analysis for customer segmentation.

### Key Takeaways:

1. **KNN Fundamentals**: Learned how KNN classifies based on nearest neighbor voting
2. **Feature Scaling Critical**: Demonstrated why scaling is essential for distance-based algorithms
3. **Distance Metrics**: Compared Euclidean, Manhattan, and Minkowski distances
4. **K Optimization**: Used validation curves and cross-validation for optimal k selection
5. **Multi-class Classification**: Successfully classified customers into 4 distinct segments
6. **Business Application**: Translated technical results into actionable marketing strategies
7. **Production Considerations**: Designed deployment architecture for real-time classification
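
Takeaway 2 can be reproduced in miniature. The sketch below uses five made-up customers (the ages and incomes are hypothetical) and finds the nearest neighbor of the first one, with and without standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: columns are [age, annual_income]
customers = np.array([
    [25, 50_000],   # 0: query customer
    [65, 50_500],   # 1: similar income, very different age
    [26, 52_000],   # 2: similar age, slightly higher income
    [45, 90_000],   # 3
    [35, 30_000],   # 4
], dtype=float)

def nearest_to_first(points):
    """Index of the point closest (Euclidean) to points[0], excluding itself."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(customers)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(customers))

print("Nearest neighbor without scaling:", nearest_raw)
print("Nearest neighbor with scaling:", nearest_scaled)
```

Without scaling, income differences measured in dollars swamp the age difference, so customer 1 is chosen despite a 40-year age gap. After standardization both features contribute on a comparable scale and customer 2 becomes the nearest neighbor.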

### Skills Practiced:
- K-Nearest Neighbors algorithm implementation and optimization
- Feature scaling and preprocessing for distance-based methods
- Distance metric comparison and selection
- Hyperparameter tuning with cross-validation
- Multi-class classification evaluation
- Business insight generation from model results
- Production pipeline design for real-time deployment

### Business Impact:
The KNN customer segmentation model enables TrendMart to:
- **Personalize Marketing**: Target campaigns to specific customer types
- **Optimize Spend**: Focus marketing budget on high-value segments
- **Improve Retention**: Identify at-risk customers and intervene appropriately
- **Scale Operations**: Automatically classify new customers in real-time

### Next Steps:
In the next lab, we'll explore Decision Trees for product recommendation systems, learning about interpretable models that can provide clear business rules and handle mixed data types naturally.