# k-NN with Distance Metrics - Customer Segmentation Lab

## Business Scenario

You work as a data scientist for RetailIQ, an e-commerce analytics company that helps online retailers better understand their customers. One of your clients, an online fashion retailer, wants to segment their customer base to create more targeted marketing campaigns.

Your task is to develop a customer segmentation model using k-Nearest Neighbors with various distance metrics. The client has provided data on customer purchasing behavior, demographics, and engagement metrics.

The client has already identified five customer segments in previous marketing research:
- Segment 0: Occasional Shoppers (low frequency, low value)
- Segment 1: Loyal Regular Shoppers (high frequency, moderate value)
- Segment 2: High-Value Enthusiasts (high frequency, high value)
- Segment 3: Big Spenders (low frequency, high value)
- Segment 4: New Customers (recent first purchase)

The goal is to build a model that can accurately classify new customers into these segments based on their behavior and attributes, so that marketing strategies can be personalized for each segment.

## The Process

By the end of this lab, you will have:
1. Analyzed the dataset to understand the characteristics of customer features
2. Preprocessed the data appropriately for distance calculations
3. Implemented k-NN with different distance metrics
4. Evaluated and compared the performance of each metric
5. Tuned and optimized the best-performing model
6. Evaluated the performance of the final model and best distance metric


## Step 0: Setup - Import Libraries and Load Data

First, let's import all the necessary libraries and load our dataset.

In [None]:
# CodeGrade step0
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Set random seed for reproducibility
np.random.seed(42)

### Loading and exploring the dataset

The dataset contains customer information and a target variable 'segment' indicating their assigned segment (0-4).

In [None]:
# CodeGrade step0
# Load the dataset
customer_data = pd.read_csv('retail_customer_data.csv')

In [None]:
# Run this cell without changes
# Display basic information about the dataset
customer_data.info()

In [None]:
# Run this cell without changes
# Check the first few rows of the dataset
customer_data.head()

In [None]:
# Run this cell without changes
# Check the distribution of the target variable
customer_data['segment'].value_counts(normalize=True)

In [None]:
# Run this cell without changes
# Check basic statistics of the dataset
customer_data.describe()

## Part 1: Analyzing Dataset Characteristics

Before selecting a distance metric, we need to understand the characteristics of our data. Let's look at feature distributions and correlations.

### Feature Distributions
Analyzing feature distributions will help us determine whether we need to standardize the data before computing distances.

In [None]:
# CodeGrade step1
# Create histograms of all features to observe their distributions
# Select all numeric columns except the target
numeric_columns = None

# Plot histograms (DataFrame.hist creates its own figure, so no separate plt.figure call is needed)
customer_data[numeric_columns].hist(bins=20, figsize=(15, 10))
plt.tight_layout()
plt.show()

# Create a correlation matrix to identify feature relationships - use the full dataframe including segment
correlation_matrix = None

# Plot the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.show()

### Data Characteristics Analysis

Based on the histograms and correlation matrix above, consider the following questions:

1. Do you observe any features with significantly different scales? What impact would this have on distance calculations?

2. Are there correlations between features? Which distance metric might be more appropriate for correlated features?

3. Based on your analysis, which preprocessing steps would you recommend before applying k-NN with distance metrics?
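To make question 1 concrete, here is a minimal sketch of how a large-scale feature can dominate Euclidean distance. The three customers and the feature names (`annual_spend_usd`, `visits_per_month`) are hypothetical and are not taken from the lab dataset; the z-score statistics are computed from these three rows only, purely for illustration.

```python
import numpy as np

# Hypothetical customers: [annual_spend_usd, visits_per_month]
a = np.array([1000.0, 2.0])
b = np.array([1050.0, 20.0])  # similar spend, very different visit frequency
c = np.array([1080.0, 2.0])   # slightly larger spend gap, identical visit frequency

def euclidean(u, v):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Distances on the raw features: the spend column dominates
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Z-score each column so both features contribute on a comparable scale
X = np.vstack([a, b, c])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ab = euclidean(X_scaled[0], X_scaled[1])
scaled_ac = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw features, customer `b` appears closer to `a` than `c` does, even though `b` visits ten times as often. After standardization the ordering flips, which is why k-NN results can change substantially with scaling.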

## Part 2: Data Preprocessing

Now, let's prepare our data for k-NN modeling, applying the preprocessing steps you identified as necessary.

In [None]:
# CodeGrade step2
# Prepare features (X) and target (y)
X = None
y = None

# Split the data into training and testing sets (75/25 split, random_state=42)
X_train, X_test, y_train, y_test = None

# Standardize the features
scaler = StandardScaler()
X_train_scaled = None
X_test_scaled = None

In [None]:
# Run this cell without changes to display results
# Display a comparison of original vs. scaled data for the first sample
print("Original first sample:")
print(X_train.iloc[0].values[:5], "...")
print("\nScaled first sample:")
print(X_train_scaled[0][:5], "...")

## Part 3: Implementing k-NN with Different Distance Metrics

Now, let's implement k-NN with various distance metrics and evaluate their performance using cross-validation accuracy.
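As a quick reference before filling in the cell below, the three metrics can be computed by hand for a pair of points. This is a small sketch on made-up vectors `u` and `v`, not on the lab data.

```python
import numpy as np

# Two made-up feature vectors (illustration only)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.5])

# Absolute per-feature differences: [3.0, 2.0, 0.5]
diff = np.abs(u - v)

euclidean = float(np.sqrt(np.sum(diff ** 2)))  # straight-line distance
manhattan = float(np.sum(diff))                # sum of per-feature differences
chebyshev = float(np.max(diff))                # largest single-feature difference

print(euclidean, manhattan, chebyshev)
```

For any pair of points, Chebyshev is never larger than Euclidean, and Euclidean is never larger than Manhattan. Manhattan accumulates many small differences, while Chebyshev only reacts to the single most different feature.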

In [None]:
# CodeGrade step3
# Distance metrics to test
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
None:
    # Create and evaluate model with different metrics and k=5
    knn = None
    # Get cross val scores for model
    cv_scores = None
    # Store the mean of cv scores as value and metric name as key in results dictionary
    None
    
best_metric = max(results, key=results.get)

In [None]:
# Run this cell without changes
# Display the cross-validation results and the best metric
print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")

## Part 4: Implementing Mahalanobis Distance

Mahalanobis distance is particularly useful for datasets with correlated features. Let's implement it separately and compare its performance to the other metrics we tested.
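To see why correlation matters, the sketch below computes Mahalanobis distance by hand on synthetic, strongly correlated two-dimensional data. The data, the variable names (`demo_cov`, `demo_inv_cov`), and the helper `mahalanobis` are assumptions made for illustration; they are separate from the variables the lab cell asks you to define.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in which the second feature closely tracks the first
x = rng.normal(size=500)
demo_data = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=500)])

# Covariance matrix (columns are features) and its inverse
demo_cov = np.cov(demo_data, rowvar=False)
demo_inv_cov = np.linalg.inv(demo_cov)

def mahalanobis(u, v, inv_cov):
    """sqrt((u - v)^T * inv_cov * (u - v))"""
    d = u - v
    return float(np.sqrt(d @ inv_cov @ d))

origin = np.zeros(2)
along = np.array([1.0, 0.9])     # moves with the correlation
against = np.array([1.0, -0.9])  # same Euclidean length, moves against it

d_along = mahalanobis(origin, along, demo_inv_cov)
d_against = mahalanobis(origin, against, demo_inv_cov)

print(f"along the correlation:   {d_along:.2f}")
print(f"against the correlation: {d_against:.2f}")
```

Both test points are equally far from the origin in Euclidean terms, but the point that breaks the usual relationship between the two features is much farther away in Mahalanobis terms.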

In [None]:
# CodeGrade step4
# Calculate covariance matrix from the training data
cov = None

# Implement k-NN with Mahalanobis distance and k=5
knn_mahalanobis = None

# Evaluate performance via cross validation
cv_scores_mahalanobis = None
results['mahalanobis'] = cv_scores_mahalanobis.mean()

In [None]:
# Run this cell without changes
print(f"Average CV accuracy with Mahalanobis: {cv_scores_mahalanobis.mean():.4f}")

# Update best metric if necessary
if results['mahalanobis'] > results[best_metric]:
    best_metric = 'mahalanobis'
    print(f"New best metric: {best_metric} with accuracy: {results[best_metric]:.4f}")

## Part 5: Hyperparameter Tuning

Now, let's optimize our model by finding the best k value and weighting scheme for the top-performing distance metric.

Use the following information for your grid search:
- 'n_neighbors': [1, 3, 5, 7, 9]
- 'weights': ['uniform', 'distance']
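The two weighting schemes can give different predictions for the same neighbors. The following is a hand-rolled sketch of the voting step only, using hypothetical neighbor distances and labels. The `vote` helper is not part of scikit-learn, and it assumes all distances are nonzero.

```python
import numpy as np

# Hypothetical neighbors of one query point: distances and segment labels
distances = np.array([0.2, 1.5, 1.6, 1.7, 1.8])
labels = np.array([2, 0, 0, 0, 1])

def vote(distances, labels, k, weights):
    """Predict a label from the k nearest neighbors.

    weights='uniform'  -> every neighbor counts equally
    weights='distance' -> each neighbor counts 1 / distance
    """
    idx = np.argsort(distances)[:k]
    w = np.ones(k) if weights == 'uniform' else 1.0 / distances[idx]
    classes = np.unique(labels[idx])
    totals = np.array([w[labels[idx] == c].sum() for c in classes])
    return int(classes[np.argmax(totals)])

print(vote(distances, labels, k=5, weights='uniform'))   # majority of the 5 neighbors
print(vote(distances, labels, k=5, weights='distance'))  # the very close neighbor outweighs the rest
```

With uniform weights, the three moderately distant neighbors in segment 0 win the vote. With distance weights, the single very close neighbor in segment 2 outweighs them, which is why the grid search treats `weights` as a parameter worth tuning alongside `n_neighbors`.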

In [None]:
# CodeGrade step5
# Define parameter grid
param_grid = None

# Create base model with best metric
base_model = None

# Initialize and run grid search
grid_search = None 
grid_search.fit(X_train_scaled, y_train)

# Get best parameters and accuracy
best_params = None
best_cv_accuracy = None

In [None]:
# Run this cell without changes
# Report the grid search results, then plot accuracy for different k values
# to understand the relationship between k and model performance
print(f"Best parameters: {best_params}")
print(f"Best cross-validation accuracy: {best_cv_accuracy:.4f}")

k_range = range(1, 31)
k_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    k_scores.append(scores.mean())

plt.figure(figsize=(10, 6))
plt.plot(k_range, k_scores)
plt.xlabel('Value of k')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Accuracy for Different Values of k (Euclidean distance)')
plt.grid(True)
plt.show()

## Part 6: Final Model Evaluation

Let's build our final model with the optimized parameters and evaluate it on the test set.
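When reading the evaluation output, it helps to know how the summary numbers relate to the confusion matrix. Below is a small sketch on a made-up 3-class matrix (rows are true labels, columns are predicted labels, matching the axis labels used in this lab's plot). The counts are invented for illustration.

```python
import numpy as np

# Made-up confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([
    [50,  5,  0],
    [10, 30,  5],
    [ 0,  5, 45],
])

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: correct predictions over all true members of the class (row sums)
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

# Precision per class: correct predictions over all predictions of the class (column sums)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {np.round(recall, 3)}")
print(f"precision: {np.round(precision, 3)}")
```

A class with low recall is one the model often misses. A class with low precision is one the model over-predicts. Both matter when each segment receives a different marketing treatment.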

In [None]:
# CodeGrade step6
# Build final model with best parameters
final_model = None

# Make predictions on test set
y_pred = None

# Calculate accuracy on test set
test_accuracy = None

# Create confusion matrix
cm = None

In [None]:
# Run this cell without changes
print(f"Test set accuracy: {test_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()