# Customer Segmentation & Purchase Prediction
## Module 4 Assignment - Masai School

### Learning Path for Beginners 🎓

This notebook will guide you through your first data science project! We'll learn:
1. **Data Exploration** - Understanding your data
2. **Preprocessing** - Cleaning and preparing data
3. **Clustering** - Grouping similar customers
4. **Prediction** - Building ML models
5. **Optimization** - Making models better

**Total Points: 100**
- Part 1: Data Exploration & Preprocessing (20 points)
- Part 2: Customer Segmentation (25 points)
- Part 3: Predictive Modeling (35 points)
- Part 4: Model Optimization (20 points)

## Step 0: Import Libraries

First, let's import all the tools we need. Think of these as your toolbox!

In [None]:
# Data manipulation libraries
import pandas as pd  # For working with tables (DataFrames)
import numpy as np   # For mathematical operations

# Visualization libraries
import matplotlib.pyplot as plt  # For creating plots
import seaborn as sns           # For beautiful statistical plots

# Machine Learning libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             silhouette_score)

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All libraries imported successfully!")

---
# Part 1: Data Exploration & Preprocessing (20 points)

## 🎯 Learning Goals:
- Load and understand your dataset
- Find and fix missing values
- Detect and handle outliers
- Create visualizations to understand data patterns

### 1.1 Load the Dataset

In [None]:
# Load the data
df = pd.read_csv('customer_data.csv')

print("Dataset loaded successfully!\n")
print(f"📊 Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print("\n" + "="*50)
print("First 5 rows of the dataset:")
print("="*50)
df.head()

### 1.2 Understanding the Data Structure

**What does `.info()` tell us?**
- Number of rows and columns
- Data types of each column (numbers, text, etc.)
- How many non-null (non-empty) values each column has
- Memory usage

In [None]:
print("📋 Dataset Information:")
print("="*50)
df.info()

### 1.3 Statistical Summary

**What does `.describe()` show?**
- `count`: How many values
- `mean`: Average value
- `std`: Standard deviation (how spread out the data is)
- `min/max`: Smallest and largest values
- `25%, 50%, 75%`: Quartiles (data distribution)

In [None]:
print("📈 Statistical Summary of Numerical Features:")
print("="*50)
df.describe().round(2)

### 1.4 Check for Missing Values

**Why are missing values important?**
- Many machine learning models (K-Means and Logistic Regression, for example) can't handle missing values directly
- Missing data can indicate patterns (e.g., customers not providing info)
- We need to decide: fill them, remove them, or use them as information

In [None]:
# Count missing values
missing_values = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100

# Create a summary DataFrame
missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Values': missing_values.values,
    'Percentage': missing_percent.values
})

# Only show columns with missing values
missing_df = missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)

print("🔍 Missing Values Analysis:")
print("="*50)
if len(missing_df) > 0:
    print(missing_df.to_string(index=False))
    print(f"\n⚠️ Total columns with missing values: {len(missing_df)}")
else:
    print("✅ No missing values found!")

### 1.5 Visualize Missing Values

A heatmap helps us see patterns in missing data visually.

In [None]:
# Create a missing value heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=True, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap\n(Yellow = Missing, Purple = Present)', fontsize=14, fontweight='bold')
plt.xlabel('Columns')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 1.6 Handle Missing Values

**Strategy:**
- **Numerical columns**: Fill with median (middle value - less affected by outliers)
- **Categorical columns**: Fill with mode (most frequent value)

**Why median for numbers?**
- Mean can be skewed by extreme values
- Median represents the "typical" value better
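
To make the mean-versus-median point concrete, here is a small sketch. The spend values are made up for illustration and are not taken from `customer_data.csv`.

```python
import statistics

# Hypothetical spend values: five typical customers plus one extreme spender
spend = [120, 150, 180, 200, 250, 50_000]

mean_spend = statistics.mean(spend)      # pulled upward by the 50,000 value
median_spend = statistics.median(spend)  # average of the two middle values (180 and 200)

print(f"Mean:   {mean_spend:.2f}")
print(f"Median: {median_spend:.2f}")
```

A single extreme customer drags the mean far above what most customers spend, while the median stays at 190. That is why median imputation is the safer default for skewed columns.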

In [None]:
# Separate numerical and categorical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Remove customer_id and target variable from processing
if 'customer_id' in numerical_cols:
    numerical_cols.remove('customer_id')
if 'high_value_customer' in numerical_cols:
    numerical_cols.remove('high_value_customer')
if 'customer_id' in categorical_cols:
    categorical_cols.remove('customer_id')

print(f"📊 Numerical columns: {len(numerical_cols)}")
print(numerical_cols)
print(f"\n📝 Categorical columns: {len(categorical_cols)}")
print(categorical_cols)

# Fill missing values
print("\n🔧 Filling missing values...")

# Numerical: Fill with median
for col in numerical_cols:
    n_missing = df[col].isnull().sum()  # count BEFORE filling so the message is accurate
    if n_missing > 0:
        median_value = df[col].median()
        # Assign back instead of inplace=True on a column selection,
        # which newer pandas versions warn about and may not apply
        df[col] = df[col].fillna(median_value)
        print(f"  ✓ {col}: Filled {n_missing} values with median = {median_value:.2f}")

# Categorical: Fill with mode
for col in categorical_cols:
    n_missing = df[col].isnull().sum()
    if n_missing > 0:
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
        print(f"  ✓ {col}: Filled {n_missing} values with mode = {mode_value}")

print("\n✅ All missing values handled!")
print(f"Remaining missing values: {df.isnull().sum().sum()}")

### 1.7 Detect Outliers

**What are outliers?**
- Values that are very different from most other values
- Example: If most customers spend ₹100-500, but one spends ₹50,000

**Why detect them?**
- They can mess up our models
- But sometimes they're important (VIP customers!)

**Method: IQR (Interquartile Range)**
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Outliers: Values < Q1 - 1.5×IQR OR > Q3 + 1.5×IQR
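
Before applying the rule to the real data, it can help to walk through the arithmetic once on a tiny example. The numbers below are invented for illustration. `np.percentile` uses linear interpolation by default, as does the pandas `.quantile()` call used in this notebook.

```python
import numpy as np

# Hypothetical spend values with one obvious extreme value
values = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 50_000])

q1, q3 = np.percentile(values, [25, 75])  # 25th and 75th percentiles
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower}, {upper}]")
print(f"Outliers: {outliers}")
```

Here Q1 = 212.5 and Q3 = 437.5, so the bounds are -125 and 775, and only the 50,000 value is flagged.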

In [None]:
def detect_outliers_iqr(data, column):
    """
    Detect outliers using IQR method
    Returns: lower bound, upper bound, outlier indices
    """
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    return lower_bound, upper_bound, outliers.index

# Check outliers for key numerical columns
print("🔍 Outlier Detection (IQR Method):")
print("="*70)

outlier_summary = []
for col in ['total_spend', 'num_transactions', 'avg_transaction_value', 'days_since_last_purchase']:
    lower, upper, outlier_idx = detect_outliers_iqr(df, col)
    outlier_count = len(outlier_idx)
    outlier_percent = (outlier_count / len(df)) * 100
    
    outlier_summary.append({
        'Column': col,
        'Outliers': outlier_count,
        'Percentage': f"{outlier_percent:.2f}%",
        'Lower Bound': f"{lower:.2f}",
        'Upper Bound': f"{upper:.2f}"
    })

outlier_df = pd.DataFrame(outlier_summary)
print(outlier_df.to_string(index=False))

### 1.8 Visualize Outliers with Box Plots

**How to read a box plot:**
- Box: Middle 50% of data (Q1 to Q3)
- Line in box: Median
- Whiskers: Extend to the smallest and largest values within 1.5×IQR of the box
- Dots beyond whiskers: Outliers

In [None]:
# Create box plots for key numerical features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Box Plots: Outlier Detection', fontsize=16, fontweight='bold')

cols_to_plot = ['total_spend', 'num_transactions', 'avg_transaction_value', 'days_since_last_purchase']
colors = ['skyblue', 'lightcoral', 'lightgreen', 'plum']

for idx, (col, color) in enumerate(zip(cols_to_plot, colors)):
    row = idx // 2
    col_pos = idx % 2
    
    axes[row, col_pos].boxplot(df[col].dropna(), vert=True, patch_artist=True,
                                boxprops=dict(facecolor=color),
                                medianprops=dict(color='red', linewidth=2))
    axes[row, col_pos].set_title(col.replace('_', ' ').title(), fontweight='bold')
    axes[row, col_pos].set_ylabel('Value')
    axes[row, col_pos].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 1.9 Distribution Plots

**Why visualize distributions?**
- See if data is normally distributed (bell curve)
- Identify skewness (data leaning left or right)
- Spot multiple peaks (might indicate different customer groups!)

In [None]:
# Distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Distribution of Key Features', fontsize=16, fontweight='bold')

for idx, col in enumerate(cols_to_plot):
    row = idx // 2
    col_pos = idx % 2
    
    axes[row, col_pos].hist(df[col].dropna(), bins=50, color=colors[idx], alpha=0.7, edgecolor='black')
    axes[row, col_pos].set_title(col.replace('_', ' ').title(), fontweight='bold')
    axes[row, col_pos].set_xlabel('Value')
    axes[row, col_pos].set_ylabel('Frequency')
    axes[row, col_pos].axvline(df[col].median(), color='red', linestyle='--', linewidth=2, label='Median')
    axes[row, col_pos].axvline(df[col].mean(), color='green', linestyle='--', linewidth=2, label='Mean')
    axes[row, col_pos].legend()
    axes[row, col_pos].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 1.10 Correlation Analysis

**What is correlation?**
- Measures how two variables move together
- Range: -1 to +1
  - +1: Perfect positive (both increase together)
  - 0: No relationship
  - -1: Perfect negative (one increases, other decreases)

**Why is it important?**
- Find which features are related to your target (high_value_customer)
- Detect multicollinearity (features too similar)
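
The three reference points on the correlation scale can be reproduced with a few made-up arrays. This is only an illustrative sketch; the variable names are hypothetical and unrelated to the customer dataset.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_up = 2 * x + 1                                # rises with x
y_down = -3 * x + 10                            # falls as x rises
y_flat = np.array([2.0, 1.0, 3.0, 1.0, 2.0])    # symmetric pattern, no linear trend

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson correlation
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_flat = np.corrcoef(x, y_flat)[0, 1]

print(f"Perfect positive: {r_up:+.2f}")
print(f"Perfect negative: {r_down:+.2f}")
print(f"No linear relationship: {r_flat:+.2f}")
```

Note that correlation only captures linear relationships: a value near 0 does not rule out a non-linear pattern.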

In [None]:
# Select numerical columns for correlation
numerical_features = df[numerical_cols + ['high_value_customer']]

# Calculate correlation matrix
correlation_matrix = numerical_features.corr()

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix\n(Closer to 1 or -1 = Stronger Relationship)', 
          fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show correlation with target variable
print("\n🎯 Correlation with Target Variable (high_value_customer):")
print("="*50)
target_corr = correlation_matrix['high_value_customer'].sort_values(ascending=False)
print(target_corr.to_string())

---
# Part 2: Customer Segmentation using Clustering (25 points)

## 🎯 Learning Goals:
- Understand what clustering is
- Use K-Means algorithm to group customers
- Find the optimal number of clusters
- Visualize and interpret customer segments

### 2.1 What is Clustering? 🤔

**Clustering = Grouping similar things together**

Imagine you have 1000 customers. Some spend a lot, some visit often, some buy many products. 
Clustering helps us group similar customers together automatically!

**K-Means Algorithm:**
1. Choose K (number of groups)
2. Place K random points as "centers"
3. Assign each customer to nearest center
4. Move centers to average of their group
5. Repeat steps 3-4 until groups stabilize
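
The five steps above can be sketched in a few lines of NumPy. This is a teaching sketch only, not what scikit-learn does internally (its `KMeans` uses a smarter default initialization and other optimizations). The function name `simple_kmeans` and the six demo points are hypothetical.

```python
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means following the five steps described above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: distance from every point to every center, then pick the nearest
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Six made-up points forming two obvious groups
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

labels, centers = simple_kmeans(X_demo, k=2)
print("Labels: ", labels)
print("Centers:", centers.round(2))
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is that the first three points end up together and the last three end up together.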

**Why use it?**
- Marketing: Target different customer groups differently
- Business: Understand customer behavior patterns
- Strategy: Personalize offers for each segment

### 2.2 Prepare Data for Clustering

In [None]:
# Select features for clustering
# We'll use: spending, transactions, recency, and product variety
clustering_features = [
    'total_spend',
    'num_transactions', 
    'avg_transaction_value',
    'days_since_last_purchase',
    'num_visits',
    'product_categories_purchased',
    'discount_used'
]

# Create clustering dataset
X_cluster = df[clustering_features].copy()

print("📊 Features selected for clustering:")
print("="*50)
for i, feature in enumerate(clustering_features, 1):
    print(f"{i}. {feature}")

print(f"\nShape: {X_cluster.shape}")
print("\nFirst few rows:")
X_cluster.head()

### 2.3 Feature Scaling

**Why scale features?**

Imagine measuring distance between customers using:
- Total spend: ₹100 to ₹10,000 (huge range!)
- Number of transactions: 1 to 30 (small range)

Without scaling, total_spend would dominate the clustering!

**StandardScaler:**
- Transforms each feature to have mean=0, std=1
- Formula: (value - mean) / std
- Now all features are on the same scale
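
To check the formula by hand, the sketch below scales one made-up column manually and compares it with `StandardScaler`. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`, the NumPy default), whereas the pandas `.std()` method defaults to the sample version (`ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data (one column, four customers)
spend = np.array([[100.0], [500.0], [1000.0], [10000.0]])

# Manual version of (value - mean) / std, using the population std (ddof=0)
manual = (spend - spend.mean(axis=0)) / spend.std(axis=0)

# Same transformation via scikit-learn
scaled = StandardScaler().fit_transform(spend)

print("Manual: ", manual.ravel().round(3))
print("Sklearn:", scaled.ravel().round(3))
print("Match:  ", np.allclose(manual, scaled))
```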

In [None]:
# Initialize scaler
scaler = StandardScaler()

# Fit and transform
X_scaled = scaler.fit_transform(X_cluster)

print("✅ Features scaled successfully!")
print("\nBefore scaling (first customer):")
print(X_cluster.iloc[0].to_dict())
print("\nAfter scaling (first customer):")
print(dict(zip(clustering_features, X_scaled[0])))
print("\n📊 Scaled data statistics:")
print(f"Mean: {X_scaled.mean(axis=0).round(10)}")
print(f"Std Dev: {X_scaled.std(axis=0).round(2)}")

### 2.4 Find Optimal Number of Clusters

**The Challenge:** How many groups should we create?

**Method 1: Elbow Method**
- Try different K values (2, 3, 4, 5...)
- Calculate "inertia" (the sum of squared distances from each point to its cluster center; lower means tighter clusters)
- Plot K vs Inertia
- Look for the "elbow" - where adding more clusters doesn't help much

**Method 2: Silhouette Score**
- Measures how well-separated clusters are
- Range: -1 to +1
- Higher = better separated clusters
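
As a quick intuition check for the silhouette score, the sketch below scores the same six made-up points under a sensible grouping and under a deliberately scrambled one. The point coordinates and label lists are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two obvious groups of three points each
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible groups
mixed_labels = [0, 1, 0, 1, 0, 1]  # mixes points from both groups

good_score = silhouette_score(points, good_labels)
mixed_score = silhouette_score(points, mixed_labels)

print(f"Sensible grouping:  {good_score:.3f}")
print(f"Scrambled grouping: {mixed_score:.3f}")
```

The sensible grouping scores close to +1, while the scrambled one scores much lower, which is the behaviour we rely on when comparing different values of K.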

In [None]:
# Test different numbers of clusters
K_range = range(2, 11)
inertias = []
silhouette_scores = []

print("🔍 Testing different number of clusters...")
print("="*50)

for k in K_range:
    # Create and fit K-Means
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    
    # Calculate metrics
    inertias.append(kmeans.inertia_)
    sil_score = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(sil_score)
    
    print(f"K={k}: Inertia={kmeans.inertia_:.2f}, Silhouette={sil_score:.3f}")

# Find best K based on silhouette score
best_k = K_range[silhouette_scores.index(max(silhouette_scores))]
print(f"\n✨ Best K based on Silhouette Score: {best_k}")

### 2.5 Visualize Elbow and Silhouette Plots

In [None]:
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Elbow Plot
axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[0].set_ylabel('Inertia', fontsize=12)
axes[0].set_title('Elbow Method\n(Look for the "elbow" point)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Silhouette Plot
axes[1].plot(K_range, silhouette_scores, 'ro-', linewidth=2, markersize=8)
axes[1].axvline(best_k, color='green', linestyle='--', linewidth=2, label=f'Best K={best_k}')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Score\n(Higher is better)', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 2.6 Apply K-Means Clustering

In [None]:
# Use the best K (or you can manually choose based on business needs)
final_k = best_k  # You can change this if needed

print(f"🎯 Applying K-Means with K={final_k} clusters...")

# Create and fit final K-Means model
kmeans_final = KMeans(n_clusters=final_k, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_scaled)

# Add cluster labels to original dataframe
df['Cluster'] = cluster_labels

print("\n✅ Clustering complete!")
print("\n📊 Cluster Distribution:")
print("="*50)
cluster_counts = df['Cluster'].value_counts().sort_index()
for cluster, count in cluster_counts.items():
    percentage = (count / len(df)) * 100
    print(f"Cluster {cluster}: {count} customers ({percentage:.1f}%)")

### 2.7 Analyze Cluster Characteristics

Now let's understand what makes each cluster unique!

In [None]:
# Calculate mean values for each cluster
cluster_profile = df.groupby('Cluster')[clustering_features].mean()

print("📊 Cluster Profiles (Average Values):")
print("="*70)
print(cluster_profile.round(2))

# Create a more readable comparison
print("\n🎯 Cluster Characteristics:")
print("="*70)
for cluster in range(final_k):
    print(f"\n{'='*70}")
    print(f"CLUSTER {cluster}:")
    print(f"{'='*70}")
    cluster_data = df[df['Cluster'] == cluster]
    print(f"📈 Size: {len(cluster_data)} customers ({len(cluster_data)/len(df)*100:.1f}%)")
    print(f"💰 Avg Spend: ₹{cluster_data['total_spend'].mean():.2f}")
    print(f"🛒 Avg Transactions: {cluster_data['num_transactions'].mean():.1f}")
    print(f"📅 Avg Days Since Purchase: {cluster_data['days_since_last_purchase'].mean():.1f}")
    print(f"🏪 Avg Visits: {cluster_data['num_visits'].mean():.1f}")
    print(f"📦 Avg Product Categories: {cluster_data['product_categories_purchased'].mean():.1f}")
    
    # Determine cluster type
    if cluster_data['total_spend'].mean() > df['total_spend'].median() and \
       cluster_data['days_since_last_purchase'].mean() < df['days_since_last_purchase'].median():
        cluster_type = "🌟 High-Value Active Customers"
    elif cluster_data['total_spend'].mean() > df['total_spend'].median():
        cluster_type = "💎 High Spenders"
    elif cluster_data['days_since_last_purchase'].mean() > df['days_since_last_purchase'].median():
        cluster_type = "😴 Inactive/At-Risk Customers"
    else:
        cluster_type = "👥 Regular Customers"
    
    print(f"\n🏷️ Cluster Type: {cluster_type}")

### 2.8 Visualize Clusters

In [None]:
# Create scatter plots for cluster visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Customer Segmentation Visualization', fontsize=16, fontweight='bold')

# Plot 1: Total Spend vs Num Transactions
for cluster in range(final_k):
    cluster_data = df[df['Cluster'] == cluster]
    axes[0, 0].scatter(cluster_data['total_spend'], 
                       cluster_data['num_transactions'],
                       label=f'Cluster {cluster}', 
                       alpha=0.6, s=50)
axes[0, 0].set_xlabel('Total Spend', fontsize=11)
axes[0, 0].set_ylabel('Number of Transactions', fontsize=11)
axes[0, 0].set_title('Spending vs Transaction Frequency', fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Days Since Last Purchase vs Num Visits
for cluster in range(final_k):
    cluster_data = df[df['Cluster'] == cluster]
    axes[0, 1].scatter(cluster_data['days_since_last_purchase'], 
                       cluster_data['num_visits'],
                       label=f'Cluster {cluster}', 
                       alpha=0.6, s=50)
axes[0, 1].set_xlabel('Days Since Last Purchase', fontsize=11)
axes[0, 1].set_ylabel('Number of Visits', fontsize=11)
axes[0, 1].set_title('Recency vs Visit Frequency', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Average Transaction Value vs Product Categories
for cluster in range(final_k):
    cluster_data = df[df['Cluster'] == cluster]
    axes[1, 0].scatter(cluster_data['avg_transaction_value'], 
                       cluster_data['product_categories_purchased'],
                       label=f'Cluster {cluster}', 
                       alpha=0.6, s=50)
axes[1, 0].set_xlabel('Average Transaction Value', fontsize=11)
axes[1, 0].set_ylabel('Product Categories Purchased', fontsize=11)
axes[1, 0].set_title('Transaction Value vs Product Diversity', fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Cluster Size
cluster_counts.plot(kind='bar', ax=axes[1, 1], color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'][:final_k])
axes[1, 1].set_xlabel('Cluster', fontsize=11)
axes[1, 1].set_ylabel('Number of Customers', fontsize=11)
axes[1, 1].set_title('Cluster Distribution', fontweight='bold')
axes[1, 1].tick_params(axis='x', rotation=0)
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

---
# Part 3: Predictive Modeling (35 points)

## 🎯 Learning Goals:
- Build classification models to predict high-value customers
- Compare different algorithms
- Evaluate model performance
- Understand key metrics

### 3.1 What is Classification? 🤔

**Classification = Predicting categories**

In our case:
- **Input**: Customer features (age, spending, etc.)
- **Output**: Is this a high-value customer? (0 = No, 1 = Yes)

**Why predict high-value customers?**
- Target marketing campaigns
- Allocate resources efficiently
- Personalize offers
- Prevent churn of valuable customers

**Algorithms we'll try:**
1. **Logistic Regression**: Simple, interpretable
2. **Decision Tree**: Easy to understand, visual
3. **Random Forest**: Powerful, combines many trees

### 3.2 Prepare Data for Modeling

In [None]:
# Select features for prediction
feature_columns = [
    'age', 'total_spend', 'num_transactions', 'avg_transaction_value',
    'days_since_last_purchase', 'num_visits', 'product_categories_purchased',
    'discount_used', 'Cluster'
]

# Encode categorical variables
df_model = df.copy()

# Encode gender
le_gender = LabelEncoder()
df_model['gender_encoded'] = le_gender.fit_transform(df_model['gender'].fillna('Unknown'))

# Encode city_tier
le_city = LabelEncoder()
df_model['city_tier_encoded'] = le_city.fit_transform(df_model['city_tier'])

# Encode membership_type
le_membership = LabelEncoder()
df_model['membership_encoded'] = le_membership.fit_transform(df_model['membership_type'])

# Add encoded features to feature list
feature_columns.extend(['gender_encoded', 'city_tier_encoded', 'membership_encoded'])

# Create feature matrix (X) and target vector (y)
X = df_model[feature_columns]
y = df_model['high_value_customer']

print("✅ Data prepared for modeling!")
print(f"\n📊 Feature Matrix Shape: {X.shape}")
print(f"🎯 Target Variable Shape: {y.shape}")
print(f"\n📈 Target Distribution:")
print(y.value_counts())
print(f"\nPercentage of high-value customers: {(y.sum()/len(y)*100):.1f}%")

### 3.3 Train-Test Split

**Why split data?**
- **Training Set (80%)**: Model learns from this
- **Test Set (20%)**: Model evaluated on this (never seen before!)

**Why not use all data for training?**
- We need to know how well the model works on NEW data
- Models can "memorize" training data (overfitting)
- Test set simulates real-world unseen data
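
The "memorizing" problem is easy to see on synthetic data. The sketch below uses `make_classification` as a stand-in dataset (it is not the customer data) and trains a decision tree with no depth limit, so the tree is free to fit every training row.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise so there is something to overfit
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# No max_depth, so the tree keeps splitting until it fits the training set
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

The training accuracy looks perfect, but the test accuracy is noticeably lower. Without a held-out test set, we would never see that gap.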

In [None]:
# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,  # 20% for testing
    random_state=42,  # For reproducibility
    stratify=y  # Keep same proportion of 0s and 1s in both sets
)

print("✅ Data split complete!")
print("\n📊 Training Set:")
print(f"  Features: {X_train.shape}")
print(f"  Target: {y_train.shape}")
print(f"  High-value %: {(y_train.sum()/len(y_train)*100):.1f}%")

print("\n📊 Test Set:")
print(f"  Features: {X_test.shape}")
print(f"  Target: {y_test.shape}")
print(f"  High-value %: {(y_test.sum()/len(y_test)*100):.1f}%")

### 3.4 Scale Features

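Just like in clustering, we scale the features. Logistic Regression benefits most from this; tree-based models are largely insensitive to feature scale, but we use the same scaled data for all three models for consistency.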

In [None]:
# Initialize scaler
scaler_model = StandardScaler()

# Fit on training data only! (Important: avoid data leakage)
X_train_scaled = scaler_model.fit_transform(X_train)
X_test_scaled = scaler_model.transform(X_test)

print("✅ Features scaled successfully!")
print("\n⚠️ Important: We fit the scaler ONLY on training data!")
print("   This prevents 'data leakage' from test set.")

### 3.5 Build and Evaluate Models

**Evaluation Metrics Explained:**

1. **Accuracy**: (Correct Predictions) / (Total Predictions)
   - Simple but can be misleading with imbalanced data

2. **Precision**: Of all predicted high-value, how many actually are?
   - Important when false positives are costly

3. **Recall**: Of all actual high-value customers, how many did we catch?
   - Important when false negatives are costly

4. **F1-Score**: Harmonic mean of Precision and Recall
   - Good overall metric

**Confusion Matrix:**
```
                 Predicted
              No       Yes
Actual  No    TN       FP
        Yes   FN       TP
```
- TN (True Negative): Correctly predicted NOT high-value
- TP (True Positive): Correctly predicted high-value
- FN (False Negative): Missed a high-value customer
- FP (False Positive): Wrongly predicted as high-value
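
To connect the four metrics to the confusion matrix, here is a worked example with made-up counts (they are not results from this notebook). Each metric is computed directly from TN, FP, FN and TP.

```python
# Hypothetical confusion-matrix counts for 100 test customers
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct / total
precision = tp / (tp + fp)                           # of predicted high-value, how many are real
recall = tp / (tp + fn)                              # of real high-value, how many we caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

With these counts, accuracy is 0.85, precision is about 0.78 and recall is 0.875. The functions `precision_score`, `recall_score` and `f1_score` used in the cells below compute the same quantities from the label arrays.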

#### Model 1: Logistic Regression

In [None]:
print("🤖 Training Logistic Regression...")
print("="*70)

# Create and train model
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_lr_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)

print("\n📊 Logistic Regression Results:")
print(f"  Accuracy:  {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print(f"  Precision: {lr_precision:.4f}")
print(f"  Recall:    {lr_recall:.4f}")
print(f"  F1-Score:  {lr_f1:.4f}")

print("\n📈 Classification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Not High-Value', 'High-Value']))

# Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Not High-Value', 'High-Value'],
            yticklabels=['Not High-Value', 'High-Value'])
plt.title('Confusion Matrix - Logistic Regression', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

#### Model 2: Decision Tree

In [None]:
print("🌳 Training Decision Tree...")
print("="*70)

# Create and train model
dt_model = DecisionTreeClassifier(random_state=42, max_depth=10, min_samples_split=20)
dt_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test_scaled)
y_pred_dt_proba = dt_model.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics
dt_accuracy = accuracy_score(y_test, y_pred_dt)
dt_precision = precision_score(y_test, y_pred_dt)
dt_recall = recall_score(y_test, y_pred_dt)
dt_f1 = f1_score(y_test, y_pred_dt)

print("\n📊 Decision Tree Results:")
print(f"  Accuracy:  {dt_accuracy:.4f} ({dt_accuracy*100:.2f}%)")
print(f"  Precision: {dt_precision:.4f}")
print(f"  Recall:    {dt_recall:.4f}")
print(f"  F1-Score:  {dt_f1:.4f}")

print("\n📈 Classification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['Not High-Value', 'High-Value']))

# Confusion Matrix
cm_dt = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Not High-Value', 'High-Value'],
            yticklabels=['Not High-Value', 'High-Value'])
plt.title('Confusion Matrix - Decision Tree', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

#### Model 3: Random Forest

In [None]:
print("🌲 Training Random Forest...")
print("="*70)

# Create and train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10, min_samples_split=20)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)
y_pred_rf_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)

print("\n📊 Random Forest Results:")
print(f"  Accuracy:  {rf_accuracy:.4f} ({rf_accuracy*100:.2f}%)")
print(f"  Precision: {rf_precision:.4f}")
print(f"  Recall:    {rf_recall:.4f}")
print(f"  F1-Score:  {rf_f1:.4f}")

print("\n📈 Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Not High-Value', 'High-Value']))

# Confusion Matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Oranges',
            xticklabels=['Not High-Value', 'High-Value'],
            yticklabels=['Not High-Value', 'High-Value'])
plt.title('Confusion Matrix - Random Forest', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

### 3.6 Compare All Models

In [None]:
# Create comparison dataframe
model_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [lr_accuracy, dt_accuracy, rf_accuracy],
    'Precision': [lr_precision, dt_precision, rf_precision],
    'Recall': [lr_recall, dt_recall, rf_recall],
    'F1-Score': [lr_f1, dt_f1, rf_f1]
})

print("\n📊 Model Comparison:")
print("="*80)
print(model_comparison.to_string(index=False))

# Find best model
best_model_idx = model_comparison['F1-Score'].idxmax()
best_model_name = model_comparison.loc[best_model_idx, 'Model']
best_f1 = model_comparison.loc[best_model_idx, 'F1-Score']

print(f"\n🏆 Best Model: {best_model_name} (F1-Score: {best_f1:.4f})")

# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(model_comparison))
width = 0.2

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
# Use a separate name so the `colors` list from the Part 1 plots is not overwritten
metric_colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for i, (metric, color) in enumerate(zip(metrics, metric_colors)):
    ax.bar(x + i*width, model_comparison[metric], width, label=metric, color=color, alpha=0.8)

ax.set_xlabel('Model', fontsize=12, fontweight='bold')
ax.set_ylabel('Score', fontsize=12, fontweight='bold')
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x + width * 1.5)
ax.set_xticklabels(model_comparison['Model'])
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
ax.set_ylim([0, 1.1])

plt.tight_layout()
plt.show()

### 3.7 Feature Importance

**Which features are most important for predictions?**

Understanding this helps us:
- Focus on collecting important data
- Understand what drives high-value customers
- Make business decisions

In [None]:
# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("🔍 Feature Importance (Random Forest):")
print("="*50)
print(feature_importance.to_string(index=False))

# Visualize
plt.figure(figsize=(12, 8))
plt.barh(feature_importance['Feature'][:10], feature_importance['Importance'][:10], color='steelblue')
plt.xlabel('Importance Score', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.title('Top 10 Most Important Features', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

---
# Part 4: Model Optimization & Business Insights (20 points)

## 🎯 Learning Goals:
- Fine-tune model parameters
- Use cross-validation for robust evaluation
- Extract business insights
- Create actionable recommendations

### 4.1 Hyperparameter Tuning with GridSearchCV

**What are hyperparameters?**
- Settings that control how the model learns
- Examples: tree depth, number of trees, learning rate

**GridSearchCV:**
- Try different combinations of hyperparameters
- Use cross-validation to find the best combination
- Automatically selects the best model
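
Grid search can get expensive quickly, so it is worth counting the work before running it. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations in the grid used in the next cell. The names `demo_grid`, `cv_folds`, `n_combinations` and `n_fits` are just illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the tuning cell below, copied here so this snippet runs on its own
demo_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

n_combinations = len(ParameterGrid(demo_grid))  # 3 * 4 * 3 * 3
n_fits = n_combinations * cv_folds              # one model per combination per fold

print(f"Combinations: {n_combinations}")
print(f"Total model fits with {cv_folds}-fold CV: {n_fits}")
```

That is 108 combinations and 540 model fits, which is why the next cell may take a few minutes. Trimming the grid is the simplest way to speed it up.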

In [None]:
print("🔧 Hyperparameter Tuning for Random Forest...")
print("="*70)
print("This may take a few minutes...\n")

# Define parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='f1',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

# Fit grid search
grid_search.fit(X_train_scaled, y_train)

print("\n✅ Grid Search Complete!")
print("\n🏆 Best Parameters:")
print("="*50)
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\n📊 Best Cross-Validation F1-Score: {grid_search.best_score_:.4f}")

# Use best model
best_rf_model = grid_search.best_estimator_

### 4.2 Evaluate Optimized Model

In [None]:
# Make predictions with optimized model
y_pred_optimized = best_rf_model.predict(X_test_scaled)

# Calculate metrics
optimized_accuracy = accuracy_score(y_test, y_pred_optimized)
optimized_precision = precision_score(y_test, y_pred_optimized)
optimized_recall = recall_score(y_test, y_pred_optimized)
optimized_f1 = f1_score(y_test, y_pred_optimized)

print("📊 Optimized Random Forest Results:")
print("="*50)
print(f"  Accuracy:  {optimized_accuracy:.4f} ({optimized_accuracy*100:.2f}%)")
print(f"  Precision: {optimized_precision:.4f}")
print(f"  Recall:    {optimized_recall:.4f}")
print(f"  F1-Score:  {optimized_f1:.4f}")

# Compare with original
# Differences are in percentage points (pp), not relative percent change
print("\n📈 Improvement (percentage points):")
print("="*50)
print(f"  Accuracy:  {(optimized_accuracy - rf_accuracy)*100:+.2f} pp")
print(f"  Precision: {(optimized_precision - rf_precision)*100:+.2f} pp")
print(f"  Recall:    {(optimized_recall - rf_recall)*100:+.2f} pp")
print(f"  F1-Score:  {(optimized_f1 - rf_f1)*100:+.2f} pp")

### 4.3 Cross-Validation Analysis

**What is Cross-Validation?**
- Split data into K parts (folds)
- Train on K-1 parts, test on 1 part
- Repeat K times with different test part
- Average the results

**Why use it?**
- More reliable performance estimate
- Every sample is used for both training and validation, but never in the same fold
- Reduces risk of lucky/unlucky split
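
The steps above can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for this illustration, not a scikit-learn function. It splits row indices into K contiguous folds without shuffling and shows that each row lands in the test part exactly once.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 10 rows split into 5 folds: train on 8 rows, test on 2, five times
folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train_idx} test={test_idx}")
```

In practice you rarely write this yourself. When you pass `cv=5` with a classifier, scikit-learn uses stratified folds, which also keep the class proportions similar in every fold.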

In [None]:
print("🔄 Performing 5-Fold Cross-Validation...")
print("="*70)

# Perform cross-validation
cv_scores = cross_val_score(best_rf_model, X_train_scaled, y_train, cv=5, scoring='f1')

print("\n📊 Cross-Validation Results:")
print("="*50)
print(f"  Fold 1: {cv_scores[0]:.4f}")
print(f"  Fold 2: {cv_scores[1]:.4f}")
print(f"  Fold 3: {cv_scores[2]:.4f}")
print(f"  Fold 4: {cv_scores[3]:.4f}")
print(f"  Fold 5: {cv_scores[4]:.4f}")
print("\n" + "="*50)
print(f"  Mean F1-Score: {cv_scores.mean():.4f}")
print(f"  Std Deviation: {cv_scores.std():.4f}")
# With only 5 folds this is a rough spread of scores, not a formal 95% confidence interval
print(f"  Approx. Range (mean ± 2 std): [{cv_scores.mean() - 2*cv_scores.std():.4f}, {cv_scores.mean() + 2*cv_scores.std():.4f}]")

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), cv_scores, 'bo-', linewidth=2, markersize=10)
plt.axhline(cv_scores.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {cv_scores.mean():.4f}')
plt.fill_between(range(1, 6), 
                 cv_scores.mean() - cv_scores.std(), 
                 cv_scores.mean() + cv_scores.std(), 
                 alpha=0.2, color='red')
plt.xlabel('Fold Number', fontsize=12, fontweight='bold')
plt.ylabel('F1-Score', fontsize=12, fontweight='bold')
plt.title('5-Fold Cross-Validation Results', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 4.4 Business Insights & Recommendations

Now let's translate our technical findings into business value!

In [None]:
print("\n" + "="*80)
print(" "*20 + "BUSINESS INSIGHTS & RECOMMENDATIONS")
print("="*80)

print("\n🎯 1. CUSTOMER SEGMENTATION INSIGHTS:")
print("-" * 80)
for cluster in range(final_k):
    cluster_data = df[df['Cluster'] == cluster]
    print(f"\n   Segment {cluster}: {len(cluster_data)} customers ({len(cluster_data)/len(df)*100:.1f}%)")
    print(f"   💰 Average Spending: ₹{cluster_data['total_spend'].mean():.2f}")
    print(f"   📅 Recency: {cluster_data['days_since_last_purchase'].mean():.1f} days")
    print(f"   🎯 High-Value %: {(cluster_data['high_value_customer'].sum()/len(cluster_data)*100):.1f}%")

print("\n\n💡 2. KEY PREDICTIVE FACTORS:")
print("-" * 80)
top_features = feature_importance.head(5)
for idx, row in top_features.iterrows():
    print(f"   • {row['Feature']}: {row['Importance']:.4f}")

print("\n\n📊 3. MODEL PERFORMANCE:")
print("-" * 80)
print(f"   • Our model can identify high-value customers with {optimized_accuracy*100:.1f}% accuracy")
print(f"   • Precision: {optimized_precision*100:.1f}% (When we predict high-value, we're right {optimized_precision*100:.1f}% of the time)")
print(f"   • Recall: {optimized_recall*100:.1f}% (We catch {optimized_recall*100:.1f}% of all high-value customers)")

print("\n\n🚀 4. ACTIONABLE RECOMMENDATIONS:")
print("-" * 80)
print("\n   A. For Marketing Team:")
print("      • Target high-spend, high-frequency customers with premium offers")
print("      • Re-engage customers who haven't purchased in 40+ days")
print("      • Create campaigns for customers with high product category diversity")

print("\n   B. For Sales Team:")
print(f"      • Focus on the top predictive feature: {feature_importance.iloc[0]['Feature']}")
print("      • Prioritize customers with high transaction values")
print("      • Nurture customers with potential (medium spend, high frequency)")

print("\n   C. For Customer Success:")
print("      • Monitor days since last purchase for at-risk customers")
print("      • Incentivize frequent visits to increase engagement")
print("      • Personalize experiences based on cluster characteristics")

print("\n   D. For Product Team:")
print("      • Encourage cross-category purchases")
print("      • Optimize discount strategies based on customer value")
print("      • Design loyalty programs for high-frequency buyers")

print("\n\n💰 5. EXPECTED BUSINESS IMPACT:")
print("-" * 80)
# Rough revenue estimate: assumes each converted customer gains the average spend difference
high_value_avg = df[df['high_value_customer'] == 1]['total_spend'].mean()
regular_avg = df[df['high_value_customer'] == 0]['total_spend'].mean()
value_diff = high_value_avg - regular_avg

print(f"   • High-value customers spend ₹{value_diff:.2f} more on average")
print(f"   • By identifying and nurturing potential high-value customers:")
print(f"     - Converting a further 10% of the customer base to high-value")
print(f"     - Could increase revenue by roughly ₹{value_diff * len(df) * 0.10:.2f}")

print("\n" + "="*80)
print(" "*25 + "END OF ANALYSIS")
print("="*80)

## 🎓 Summary: What You've Learned

Congratulations! You've completed your first end-to-end data science project! Here's what you've mastered:

### Part 1: Data Exploration & Preprocessing
- ✅ Load and explore datasets
- ✅ Handle missing values strategically
- ✅ Detect and understand outliers
- ✅ Visualize data distributions
- ✅ Analyze correlations

### Part 2: Customer Segmentation
- ✅ Understand clustering concepts
- ✅ Apply K-Means algorithm
- ✅ Find optimal number of clusters
- ✅ Interpret customer segments
- ✅ Extract business insights

### Part 3: Predictive Modeling
- ✅ Build classification models
- ✅ Compare multiple algorithms
- ✅ Evaluate with proper metrics
- ✅ Understand confusion matrices
- ✅ Identify important features

### Part 4: Optimization & Insights
- ✅ Tune hyperparameters
- ✅ Use cross-validation
- ✅ Generate business recommendations
- ✅ Calculate business impact

## 🚀 Next Steps

1. **Experiment**: Try different feature combinations
2. **Improve**: Test other algorithms (XGBoost, Neural Networks)
3. **Deploy**: Think about how to use this model in production
4. **Learn More**: Explore deep learning, NLP, computer vision

## 📚 Resources for Further Learning

- **Scikit-learn Documentation**: https://scikit-learn.org/
- **Kaggle**: Practice with real datasets
- **Coursera/Udemy**: Structured courses
- **Towards Data Science**: Articles and tutorials

---

**Good luck with your data science journey! 🌟**