# Statistical Modeling and Inferencing
## Assignment 1

**Name:** Himanshu Soni  
**Roll Number:** 2025em1100506  
**Dataset:** Wholesale Customers Dataset (Clustering Analysis)  
**Submission Date:** December 3, 2025

---

**Note:** This assignment is submitted as a Jupyter Notebook. All code and analysis are contained within this document.


## Executive Summary

This assignment presents a comprehensive clustering analysis of the Wholesale Customers Dataset to identify distinct customer segments based on their purchasing behavior across different product categories. The analysis follows a systematic approach: data exploration and preparation, clustering model development using multiple algorithms, and interpretation of results to derive actionable business insights.

The analysis identified [X] distinct customer segments through K-Means and Hierarchical clustering algorithms, each characterized by unique purchasing patterns. Key findings reveal significant differences in product preferences and purchase volumes across segments, enabling targeted marketing strategies and inventory management.

The results demonstrate the practical value of unsupervised learning in customer segmentation, providing businesses with actionable insights for personalized marketing, product recommendations, and strategic decision-making. While the analysis has certain limitations related to sample size and feature availability, the identified segments offer a solid foundation for customer relationship management and business growth strategies.


## Dataset Selection

**Chosen Dataset:** Dataset 2 - Wholesale Customers Dataset

**Justification:**
- The dataset is well-suited for clustering analysis with clear business applications
- Moderate sample size (~440 observations) allows for efficient computation and clear visualization
- Multiple numerical features enable meaningful segmentation
- Clustering analysis provides actionable insights for customer relationship management
- The problem domain (wholesale customer segmentation) has clear practical implications

**Dataset URL:** https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv


## Phase 0: Setup and Library Imports


In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Clustering algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Metrics for clustering evaluation
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Dimensionality reduction
from sklearn.decomposition import PCA

# Hierarchical clustering visualization
from scipy.cluster.hierarchy import dendrogram, linkage

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("All libraries imported successfully!")


---

## Part 1: Data Exploration and Preparation

### Step 1.1: Data Loading


In [None]:
# Load the dataset from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"
df = pd.read_csv(url)

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nColumn Information:")
print(df.info())
print("\nColumn Names:")
print(df.columns.tolist())


**Initial Observations:**
- The dataset contains [X] rows and [X] columns
- Features include: Channel, Region, and various product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
- Channel and Region are categorical variables, although they are stored as integer codes
- The remaining six variables are numerical (representing annual spending in monetary units)


### Step 1.2: Descriptive Statistics


In [None]:
# Descriptive statistics for numerical variables
print("Descriptive Statistics for Numerical Variables:")
print(df.describe())


In [None]:
# Check for categorical variables
print("\nCategorical Variables:")
print("\nChannel value counts:")
print(df['Channel'].value_counts())
print("\nRegion value counts:")
print(df['Region'].value_counts())


**Key Statistics Interpretation:**
- The dataset shows significant variation in spending across product categories
- Mean values show which product categories attract the highest average spending
- Standard deviations suggest high variability in customer purchasing behavior
- Channel (likely: Hotel/Restaurant/Cafe vs Retail) and Region (likely: geographic regions) are categorical variables stored as integer codes, so their means and standard deviations in the `describe()` output are not meaningful as quantities


### Step 1.3: Data Quality Assessment


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage
})
print("Missing Values Analysis:")
print(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("\n✓ No missing values found in the dataset")


In [None]:
# Visualize missing values (if any)
if missing_df['Missing Count'].sum() > 0:
    plt.figure(figsize=(10, 6))
    sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
    plt.title('Missing Values Heatmap')
    plt.show()
else:
    print("No missing values to visualize")


In [None]:
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print("\nDuplicate rows found:")
    print(df[df.duplicated()])
else:
    print("✓ No duplicate rows found")


**Data Quality Summary:**
- Missing values: [Document findings]
- Duplicates: [Document findings]
- **Handling Strategy:** [Document how issues were handled or note that no issues were found]


### Step 1.4: Outlier Detection


In [None]:
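# Get numerical spending columns. Channel and Region are stored as integers,
# so select_dtypes alone would include them; drop them explicitly.
categorical_cols = ['Channel', 'Region']
numerical_cols = [col for col in df.select_dtypes(include=[np.number]).columns
                  if col not in categorical_cols]
print("Numerical columns:", numerical_cols)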

# Create box plots for outlier detection
n_cols = len(numerical_cols)
n_rows = (n_cols + 2) // 3

fig, axes = plt.subplots(n_rows, 3, figsize=(18, 5*n_rows))
axes = axes.flatten()

for i, col in enumerate(numerical_cols):
    df.boxplot(column=col, ax=axes[i])
    axes[i].set_title(f'Box Plot: {col}')
    axes[i].set_ylabel('Value')

# Hide extra subplots
for i in range(n_cols, len(axes)):
    axes[i].axis('off')

plt.tight_layout()
plt.show()


In [None]:
# Calculate IQR-based outliers
outlier_summary = []

for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_count = len(outliers)
    outlier_percentage = (outlier_count / len(df)) * 100

    outlier_summary.append({
        'Feature': col,
        'Lower Bound': lower_bound,
        'Upper Bound': upper_bound,
        'Outlier Count': outlier_count,
        'Outlier Percentage': outlier_percentage
    })

outlier_df = pd.DataFrame(outlier_summary)
print("Outlier Detection Summary (IQR Method):")
print(outlier_df)


In [None]:
# Z-score method for additional validation
from scipy import stats

zscore_outliers = {}
for col in numerical_cols:
    z_scores = np.abs(stats.zscore(df[col]))
    outliers = df[z_scores > 3]
    zscore_outliers[col] = len(outliers)

print("Outlier Detection Summary (Z-score method, threshold=3):")
for col, count in zscore_outliers.items():
    print(f"{col}: {count} outliers ({count/len(df)*100:.2f}%)")


**Outlier Analysis Findings:**
- Multiple features show presence of outliers, which is expected in wholesale customer data
- Outliers may represent high-volume customers (e.g., large retailers or restaurant chains)
- **Handling Strategy:** We retain the outliers because they represent legitimate business cases and may form distinct customer groups. The trade-off is that K-Means is sensitive to extreme values, so any very small cluster in the results should be checked to see whether it is driven by a handful of high-volume customers.


### Step 1.5: Distribution Analysis


In [None]:
# Create histograms with KDE for all numerical variables
fig, axes = plt.subplots(n_rows, 3, figsize=(18, 5*n_rows))
axes = axes.flatten()

for i, col in enumerate(numerical_cols):
    # density=True puts the histogram on the same scale as the KDE curve
    df[col].hist(bins=30, ax=axes[i], alpha=0.7, edgecolor='black', density=True)
    df[col].plot.density(ax=axes[i], color='red', linewidth=2)
    axes[i].set_title(f'Distribution: {col}')
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Density')
    axes[i].grid(True, alpha=0.3)

# Hide extra subplots
for i in range(n_cols, len(axes)):
    axes[i].axis('off')

plt.tight_layout()
plt.show()


In [None]:
# Q-Q plots for normality check
from scipy import stats

fig, axes = plt.subplots(n_rows, 3, figsize=(18, 5*n_rows))
axes = axes.flatten()

for i, col in enumerate(numerical_cols):
    stats.probplot(df[col], dist="norm", plot=axes[i])
    axes[i].set_title(f'Q-Q Plot: {col}')
    axes[i].grid(True, alpha=0.3)

# Hide extra subplots
for i in range(n_cols, len(axes)):
    axes[i].axis('off')

plt.tight_layout()
plt.show()


In [None]:
# Calculate skewness for each numerical variable
skewness = df[numerical_cols].skew()
print("Skewness Analysis:")
print(skewness)
print("\nInterpretation:")
print("Values close to 0 indicate a roughly symmetric distribution (not necessarily a normal one)")
print("Positive values indicate right skew, negative values indicate left skew")


**Distribution Characteristics:**
- Most variables show right-skewed distributions (positive skewness)
- This is typical for spending data where most customers have moderate spending, but a few have very high spending
- Q-Q plots indicate deviations from normality, which is expected for business data
- The skewed distributions suggest that a log transformation may be beneficial; its effect on skewness is evaluated in Step 1.8


### Step 1.6: Correlation Analysis


In [None]:
# Calculate correlation matrix for numerical variables
correlation_matrix = df[numerical_cols].corr()

# Create correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix Heatmap', fontsize=16, pad=20)
plt.tight_layout()
plt.show()


In [None]:
# Identify strong correlations (absolute value > 0.5)
strong_correlations = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.5:
            strong_correlations.append({
                'Variable 1': correlation_matrix.columns[i],
                'Variable 2': correlation_matrix.columns[j],
                'Correlation': corr_value
            })

if strong_correlations:
    strong_corr_df = pd.DataFrame(strong_correlations)
    print("Strong Correlations (|r| > 0.5):")
    print(strong_corr_df.sort_values('Correlation', key=abs, ascending=False))
else:
    print("No strong correlations found (|r| > 0.5)")


**Key Correlation Findings:**
- [Document strong relationships found]
- High correlations between certain product categories suggest customers who buy one category also tend to buy related categories
- These relationships will be useful for understanding customer purchasing patterns in clustering analysis


### Step 1.7: Categorical Variable Handling


In [None]:
# Check categorical variables
print("Channel unique values:", df['Channel'].unique())
print("Region unique values:", df['Region'].unique())

# For clustering, we have two options:
# 1. Include Channel and Region as features (encoded)
# 2. Exclude them and cluster based only on product spending

# We'll create encoded versions but note that for clustering, we may choose to exclude them
# as they represent known customer characteristics rather than purchasing behavior

# Create a copy for encoding
df_encoded = df.copy()

# Label encoding for Channel and Region (the categories are nominal, so the order of the codes carries no meaning)
from sklearn.preprocessing import LabelEncoder

le_channel = LabelEncoder()
le_region = LabelEncoder()

df_encoded['Channel_encoded'] = le_channel.fit_transform(df['Channel'])
df_encoded['Region_encoded'] = le_region.fit_transform(df['Region'])

print("\nEncoded values:")
print("Channel mapping:", dict(zip(le_channel.classes_, range(len(le_channel.classes_)))))
print("Region mapping:", dict(zip(le_region.classes_, range(len(le_region.classes_)))))


**Encoding Justification:**
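- Used Label Encoding for Channel and Region for convenience. Both are nominal (unordered) categories, so the integer codes must not be interpreted as distances; one-hot encoding would be the safer choice if they were ever added to the clustering features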
- For clustering analysis, we will primarily focus on product spending patterns (numerical features)
- Channel and Region can be used for validation/comparison but may not be included in the main clustering features
- This approach allows us to discover natural customer segments based on purchasing behavior rather than pre-defined categories
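
To illustrate why integer codes should not be read as distances, the sketch below compares the pairwise distances implied by label codes with those obtained from one-hot encoding. The three-row `toy` frame is hypothetical and only mimics how Region is stored; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row per Region code (not the real data)
toy = pd.DataFrame({"Region": [1, 2, 3]})

# Distances implied by treating the integer codes as numbers
codes = toy["Region"].to_numpy(dtype=float)
label_dist = np.abs(codes[:, None] - codes[None, :])

# One-hot encoding: each category gets its own 0/1 column
onehot = pd.get_dummies(toy["Region"], prefix="Region").to_numpy(dtype=float)
onehot_dist = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=2)

print("Label-code distances:\n", label_dist)
print("One-hot distances:\n", onehot_dist.round(3))
```

With label codes, regions 1 and 3 appear twice as far apart as regions 1 and 2, which is an artifact of the coding. With one-hot columns every pair of distinct regions is equally distant.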


### Step 1.8: Feature Transformations


In [None]:
# Given the right-skewed distributions, we could apply log transformation
# However, for clustering, standardization is more critical
# We'll apply log transformation to see the effect, but final decision will be made during clustering phase

# Create log-transformed version for comparison
df_log = df[numerical_cols].copy()
for col in numerical_cols:
    df_log[f'{col}_log'] = np.log1p(df[col])  # log1p to handle zeros

print("Log transformation applied (using log1p to handle zeros)")
print("Original vs Log-transformed skewness comparison:")

original_skew = df[numerical_cols].skew()
log_skew = df_log[[f'{col}_log' for col in numerical_cols]].skew()

comparison = pd.DataFrame({
    'Original Skewness': original_skew,
    'Log-transformed Skewness': log_skew
})
print(comparison)


**Transformation Decision:**
- Log transformation is expected to reduce skewness substantially; the comparison table above quantifies the effect
- Distance-based algorithms need features on a comparable scale, so standardization is the essential step. It does not change the shape of a distribution, so any skew remains after scaling
- We will use StandardScaler in the clustering phase to remove the scale differences
- Log transformation may be applied if it improves clustering quality, but we start with the standardized original features
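
The difference between scaling and transforming can be checked on synthetic data. The following is a minimal sketch, assuming a lognormal column as a stand-in for a spending variable; the `zscore` helper and the `spend` series are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "spending" column (lognormal), not the real dataset
spend = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

def zscore(s):
    """Standardize to mean 0 and population standard deviation 1."""
    return (s - s.mean()) / s.std(ddof=0)

skew_raw = spend.skew()
skew_scaled = zscore(spend).skew()                 # scaling only
skew_log_scaled = zscore(np.log1p(spend)).skew()   # log first, then scale

print(f"Raw skewness:            {skew_raw:.3f}")
print(f"Standardized skewness:   {skew_scaled:.3f}")
print(f"Log + standardized skew: {skew_log_scaled:.3f}")
```

Standardization leaves the skewness unchanged because it is a linear rescaling, whereas the log transform changes the shape of the distribution before scaling.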


### Step 1.9: Final Data Preparation Summary


In [None]:
print("=== DATA PREPARATION SUMMARY ===\n")
print(f"Original Dataset Shape: {df.shape}")
print(f"\nData Quality Issues Found:")
print(f"  - Missing Values: {df.isnull().sum().sum()}")
print(f"  - Duplicate Rows: {df.duplicated().sum()}")
print(f"\nFeatures:")
print(f"  - Numerical Features: {len(numerical_cols)}")
print(f"    {numerical_cols}")
print(f"  - Categorical Features: Channel, Region")
print(f"\nPreprocessing Steps Completed:")
print("  1. ✓ Data loaded and basic information extracted")
print("  2. ✓ Descriptive statistics generated")
print("  3. ✓ Missing values and duplicates checked")
print("  4. ✓ Outliers identified (retained for clustering)")
print("  5. ✓ Distribution analysis completed")
print("  6. ✓ Correlation analysis completed")
print("  7. ✓ Categorical variables encoded")
print("  8. ✓ Feature transformation options evaluated")
print(f"\nFinal Dataset Ready for Clustering:")
print(f"  - Shape: {df.shape}")
print(f"  - All features identified and prepared")


---

## Part 2: Model Development and Validation

### Step 2.1: Data Preprocessing for Clustering


In [None]:
# For clustering, we'll focus on product spending patterns
# Exclude Channel and Region explicitly: they are integer-coded, so they would
# otherwise be treated as numerical features (we can use them for validation later)
features_for_clustering = [col for col in numerical_cols if col not in ('Channel', 'Region')]
print("Features selected for clustering:", features_for_clustering)

# Extract features
X = df[features_for_clustering].copy()
print(f"\nFeature matrix shape: {X.shape}")
print("First few rows:")
print(X.head())


In [None]:
# Standardization is critical for clustering algorithms that use distance metrics
# We'll use StandardScaler (z-score normalization)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=features_for_clustering, index=X.index)

print("Data standardized using StandardScaler")
print("\nOriginal data statistics:")
print(X.describe())
print("\nStandardized data statistics:")
print(X_scaled_df.describe())


**Scaler Choice Justification:**
- **StandardScaler (Z-score normalization)** chosen over MinMaxScaler
- StandardScaler centers data around mean=0 and scales to std=1, which is ideal for distance-based clustering algorithms
- Preserves the relative relationships between features while removing scale differences
- Works well with K-Means and Hierarchical clustering algorithms
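- Both scalers are affected by outliers. StandardScaler relies on the mean and standard deviation, which extreme values inflate, while MinMaxScaler is determined entirely by the two most extreme values, so a single high-volume customer compresses everyone else into a narrow band. RobustScaler (median and IQR) is an alternative if outliers dominate the results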
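
The effect of one extreme customer on each scaler can be sketched with a few lines of NumPy. The values in `spend` are hypothetical, and the three helper functions reproduce the default formulas of scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` by hand rather than calling the library.

```python
import numpy as np

# Hypothetical spending: five ordinary customers and one very large buyer
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

def standard(x):
    # Mean 0, population standard deviation 1 (StandardScaler default)
    return (x - x.mean()) / x.std()

def minmax(x):
    # Rescale to the [0, 1] interval (MinMaxScaler default)
    return (x - x.min()) / (x.max() - x.min())

def robust(x):
    # Center on the median, scale by the interquartile range (RobustScaler default)
    q25, q75 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q75 - q25)

# Width of the band occupied by the five ordinary customers after scaling
spread = {}
for name, scale in [("StandardScaler", standard),
                    ("MinMaxScaler", minmax),
                    ("RobustScaler", robust)]:
    scaled_inliers = scale(spend)[:5]
    spread[name] = float(scaled_inliers.max() - scaled_inliers.min())
    print(f"{name:15s} inlier spread: {spread[name]:.4f}")
```

In this toy case the ordinary customers occupy about 4% of the MinMax range and about 0.11 standard deviations under z-scoring, while the robust scaling keeps them clearly separated.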


### Step 2.2: Determine Optimal Number of Clusters


#### Method 1: Elbow Method


In [None]:
# Calculate WCSS (Within-Cluster Sum of Squares) for k=1 to k=10
wcss = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, marker='o', linestyle='--', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('WCSS (Within-Cluster Sum of Squares)', fontsize=12)
plt.title('Elbow Method for Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(k_range)
plt.tight_layout()
plt.show()

# Display WCSS values
wcss_df = pd.DataFrame({'k': k_range, 'WCSS': wcss})
print("WCSS Values:")
print(wcss_df)


#### Method 2: Silhouette Analysis


In [None]:
# Calculate silhouette scores for k=2 to k=10
silhouette_scores = []
k_range_sil = range(2, 11)

for k in k_range_sil:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    silhouette_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(k_range_sil, silhouette_scores, marker='o', linestyle='--', linewidth=2, markersize=8, color='green')
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Average Silhouette Score', fontsize=12)
plt.title('Silhouette Analysis for Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(k_range_sil)
plt.tight_layout()
plt.show()

# Display silhouette scores
silhouette_df = pd.DataFrame({'k': k_range_sil, 'Silhouette Score': silhouette_scores})
print("Silhouette Scores:")
print(silhouette_df)
print(f"\nOptimal k based on highest silhouette score: k={k_range_sil[np.argmax(silhouette_scores)]}")


In [None]:
# Compare both methods
comparison_df = pd.DataFrame({
    'k': list(k_range_sil),
    'WCSS': wcss[1:],  # Skip k=1
    'Silhouette Score': silhouette_scores
})

print("Comparison of Elbow Method and Silhouette Analysis:")
print(comparison_df)

# Determine optimal k
optimal_k_silhouette = k_range_sil[np.argmax(silhouette_scores)]
print(f"\nRecommended optimal k based on Silhouette Analysis: {optimal_k_silhouette}")
print(f"Silhouette Score at k={optimal_k_silhouette}: {max(silhouette_scores):.4f}")


**Optimal k Selection:**
- **Elbow Method:** The elbow point appears at k=[X] (visual inspection of the plot)
- **Silhouette Analysis:** Optimal k = [X] with highest silhouette score of [X]
- **Final Decision:** We will use k=[X] for clustering as it provides the best balance between cluster quality and interpretability
- This choice is justified by [explain reasoning based on results]
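
Visual inspection of the elbow plot can be cross-checked with a simple heuristic: the k at which the WCSS curve bends most sharply has the largest second difference. The sketch below assumes made-up WCSS values (`wcss_demo`) purely for illustration; it is one possible heuristic, not the method used in this notebook, and it should be read alongside the silhouette scores.

```python
import numpy as np

# Hypothetical WCSS values for k = 1..10 (NOT results from this analysis)
k_values = np.arange(1, 11)
wcss_demo = np.array([1000, 600, 350, 300, 260, 230, 205, 185, 170, 158],
                     dtype=float)

# Discrete second difference: w[k-1] - 2*w[k] + w[k+1], defined for k = 2..9
second_diff = wcss_demo[:-2] - 2 * wcss_demo[1:-1] + wcss_demo[2:]

# The largest second difference marks the sharpest bend in the curve
elbow_k = int(k_values[1:-1][np.argmax(second_diff)])

for k, d in zip(k_values[1:-1], second_diff):
    print(f"k={k}: second difference = {d:.0f}")
print("Elbow heuristic suggests k =", elbow_k)
```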


### Step 2.3: Apply Clustering Algorithms


In [None]:
# Use optimal k determined from silhouette analysis (calculated in previous cells)
# If optimal_k_silhouette was calculated, use it; otherwise default to a reasonable value
try:
    optimal_k = optimal_k_silhouette
    print(f"Using optimal k={optimal_k} from silhouette analysis")
except NameError:
    # Fallback: use a fixed default (k=3) if the silhouette result is not available
    # This should not happen if cells are run in order, but provides safety
    optimal_k = 3
    print(f"Using default k={optimal_k}. Please ensure silhouette analysis cell was executed.")

print(f"Applying clustering algorithms with k={optimal_k}")


#### Algorithm 1: K-Means Clustering


In [None]:
# Apply K-Means clustering
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
kmeans_centroids = kmeans.cluster_centers_

# Add cluster labels to original dataframe
df['KMeans_Cluster'] = kmeans_labels

print(f"K-Means clustering completed with k={optimal_k}")
print(f"Cluster distribution:")
print(df['KMeans_Cluster'].value_counts().sort_index())

# Display cluster centroids (in scaled space)
centroids_df = pd.DataFrame(kmeans_centroids, columns=features_for_clustering)
print("\nCluster Centroids (in scaled space):")
print(centroids_df)


#### Algorithm 2: Hierarchical Clustering (Agglomerative)


In [None]:
# Create dendrogram for hierarchical clustering visualization
# Using a random sample of up to 50 customers to keep the dendrogram readable
sample_size = min(50, len(X_scaled))
rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
sample_indices = rng.choice(len(X_scaled), sample_size, replace=False)
X_sample = X_scaled[sample_indices]

# Create linkage matrix
linkage_matrix = linkage(X_sample, method='ward')

# Plot dendrogram
plt.figure(figsize=(15, 8))
dendrogram(linkage_matrix, truncate_mode='level', p=5)
plt.title('Hierarchical Clustering Dendrogram (Sample)', fontsize=14, fontweight='bold')
plt.xlabel('Sample Index or (Cluster Size)', fontsize=12)
plt.ylabel('Distance', fontsize=12)
plt.tight_layout()
plt.show()


In [None]:
# Apply Agglomerative Clustering with optimal k
agg_clustering = AgglomerativeClustering(n_clusters=optimal_k, linkage='ward')
agg_labels = agg_clustering.fit_predict(X_scaled)

# Add cluster labels to dataframe
df['Agglomerative_Cluster'] = agg_labels

print(f"Agglomerative Clustering completed with k={optimal_k}")
print(f"Cluster distribution:")
print(df['Agglomerative_Cluster'].value_counts().sort_index())


#### Algorithm 3: DBSCAN (Optional - for comparison)


In [None]:
# DBSCAN doesn't require specifying number of clusters
# We need to tune eps and min_samples parameters
# Using eps = 0.5 and min_samples = 5 (the scikit-learn defaults) as a starting point; they are not tuned here

dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Add cluster labels
df['DBSCAN_Cluster'] = dbscan_labels

n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)

print(f"DBSCAN clustering completed")
print(f"Number of clusters found: {n_clusters_dbscan}")
print(f"Number of noise points: {n_noise}")
print(f"Cluster distribution:")
print(pd.Series(dbscan_labels).value_counts().sort_index())


### Step 2.4: Evaluate Clustering Quality


In [None]:
# Calculate evaluation metrics for each clustering algorithm
metrics_results = []

# K-Means metrics
kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)
kmeans_db = davies_bouldin_score(X_scaled, kmeans_labels)
kmeans_ch = calinski_harabasz_score(X_scaled, kmeans_labels)

metrics_results.append({
    'Algorithm': 'K-Means',
    'Silhouette Score': kmeans_silhouette,
    'Davies-Bouldin Index': kmeans_db,
    'Calinski-Harabasz Index': kmeans_ch
})

# Agglomerative Clustering metrics
agg_silhouette = silhouette_score(X_scaled, agg_labels)
agg_db = davies_bouldin_score(X_scaled, agg_labels)
agg_ch = calinski_harabasz_score(X_scaled, agg_labels)

metrics_results.append({
    'Algorithm': 'Agglomerative',
    'Silhouette Score': agg_silhouette,
    'Davies-Bouldin Index': agg_db,
    'Calinski-Harabasz Index': agg_ch
})

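# DBSCAN metrics (only if at least two clusters were found)
# Noise points (label -1) are not a cluster, so exclude them before scoring
if n_clusters_dbscan > 1:
    not_noise = dbscan_labels != -1
    dbscan_silhouette = silhouette_score(X_scaled[not_noise], dbscan_labels[not_noise])
    dbscan_db = davies_bouldin_score(X_scaled[not_noise], dbscan_labels[not_noise])
    dbscan_ch = calinski_harabasz_score(X_scaled[not_noise], dbscan_labels[not_noise])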

    metrics_results.append({
        'Algorithm': 'DBSCAN',
        'Silhouette Score': dbscan_silhouette,
        'Davies-Bouldin Index': dbscan_db,
        'Calinski-Harabasz Index': dbscan_ch
    })

# Create comparison table
metrics_df = pd.DataFrame(metrics_results)
print("Clustering Quality Metrics Comparison:")
print(metrics_df.round(4))


**Quality Metrics Interpretation:**
- **Silhouette Score:** Higher is better (range: -1 to 1). Measures how similar an object is to its own cluster vs other clusters.
- **Davies-Bouldin Index:** Lower is better. Measures average similarity ratio of each cluster with its most similar cluster.
- **Calinski-Harabasz Index:** Higher is better. Ratio of between-cluster dispersion to within-cluster dispersion.

**Assessment:** [Document which algorithm performs best based on metrics]
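
To make the silhouette definition concrete, the following sketch computes it by hand on a four-point toy dataset. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster; the score is `(b - a) / max(a, b)`. The function name `mean_silhouette` and the toy data are illustrative assumptions.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette score using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

score = mean_silhouette(X_toy, labels_toy)
print(f"Mean silhouette: {score:.4f}")
```

On this input the score is close to 0.90, reflecting compact and well-separated clusters. scikit-learn's `silhouette_score` with its default Euclidean metric should give the same value.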


### Step 2.5: Visualize Clustering Results


#### PCA Visualization


In [None]:
# Apply PCA for 2D visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Calculate variance explained
variance_explained = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance_explained)

print(f"Variance explained by PC1: {variance_explained[0]:.4f} ({variance_explained[0]*100:.2f}%)")
print(f"Variance explained by PC2: {variance_explained[1]:.4f} ({variance_explained[1]*100:.2f}%)")
print(f"Total variance explained: {cumulative_variance[1]:.4f} ({cumulative_variance[1]*100:.2f}%)")

# Visualize K-Means clusters in 2D
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# K-Means visualization
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='viridis', s=50, alpha=0.6)
axes[0].set_xlabel(f'PC1 ({variance_explained[0]*100:.2f}% variance)', fontsize=12)
axes[0].set_ylabel(f'PC2 ({variance_explained[1]*100:.2f}% variance)', fontsize=12)
axes[0].set_title('K-Means Clustering (PCA Visualization)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

# Agglomerative visualization
scatter2 = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=agg_labels, cmap='plasma', s=50, alpha=0.6)
axes[1].set_xlabel(f'PC1 ({variance_explained[0]*100:.2f}% variance)', fontsize=12)
axes[1].set_ylabel(f'PC2 ({variance_explained[1]*100:.2f}% variance)', fontsize=12)
axes[1].set_title('Agglomerative Clustering (PCA Visualization)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label='Cluster')

plt.tight_layout()
plt.show()


#### Cluster Characteristics Analysis


In [None]:
# Calculate mean values for each cluster (K-Means)
cluster_means_kmeans = df.groupby('KMeans_Cluster')[features_for_clustering].mean()
print("K-Means Cluster Characteristics (Mean Values):")
print(cluster_means_kmeans.round(2))

# Create heatmap for cluster characteristics
plt.figure(figsize=(12, 6))
sns.heatmap(cluster_means_kmeans.T, annot=True, fmt='.0f', cmap='YlOrRd',
            cbar_kws={'label': 'Mean Spending'})
plt.title('K-Means Cluster Characteristics Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Cluster', fontsize=12)
plt.ylabel('Product Category', fontsize=12)
plt.tight_layout()
plt.show()


In [None]:
# Calculate mean values for Agglomerative clusters
cluster_means_agg = df.groupby('Agglomerative_Cluster')[features_for_clustering].mean()
print("\nAgglomerative Cluster Characteristics (Mean Values):")
print(cluster_means_agg.round(2))

# Create heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(cluster_means_agg.T, annot=True, fmt='.0f', cmap='YlGnBu',
            cbar_kws={'label': 'Mean Spending'})
plt.title('Agglomerative Cluster Characteristics Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Cluster', fontsize=12)
plt.ylabel('Product Category', fontsize=12)
plt.tight_layout()
plt.show()


#### Pairwise Feature Plots


In [None]:
# Create pairwise scatter plots for key feature pairs
# Select pairs with high correlation for better visualization
key_pairs = [
    ('Grocery', 'Milk'),
    ('Grocery', 'Detergents_Paper'),
    ('Fresh', 'Frozen')
]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (feat1, feat2) in enumerate(key_pairs):
    scatter = axes[idx].scatter(df[feat1], df[feat2], c=kmeans_labels,
                               cmap='viridis', s=50, alpha=0.6)
    axes[idx].set_xlabel(feat1, fontsize=11)
    axes[idx].set_ylabel(feat2, fontsize=11)
    axes[idx].set_title(f'{feat1} vs {feat2}', fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
    plt.colorbar(scatter, ax=axes[idx], label='Cluster')

plt.tight_layout()
plt.show()


### Step 2.6: Cluster Interpretation


In [None]:
# Detailed cluster profiling for K-Means (the final algorithm choice is made in Step 2.7)
print("=== K-MEANS CLUSTER PROFILES ===\n")

for cluster_id in sorted(df['KMeans_Cluster'].unique()):
    cluster_data = df[df['KMeans_Cluster'] == cluster_id]
    cluster_size = len(cluster_data)

    print(f"\n{'='*60}")
    print(f"CLUSTER {cluster_id} (Size: {cluster_size} customers, {cluster_size/len(df)*100:.1f}%)")
    print(f"{'='*60}")

    # Mean values
    print("\nAverage Spending (Mean):")
    means = cluster_data[features_for_clustering].mean()
    for feature, value in means.items():
        print(f"  {feature:20s}: {value:10.2f}")

    # Median values
    print("\nMedian Spending:")
    medians = cluster_data[features_for_clustering].median()
    for feature, value in medians.items():
        print(f"  {feature:20s}: {value:10.2f}")

    # Channel and Region distribution
    if 'Channel' in df.columns:
        print(f"\nChannel Distribution:")
        print(cluster_data['Channel'].value_counts())
    if 'Region' in df.columns:
        print(f"\nRegion Distribution:")
        print(cluster_data['Region'].value_counts())


In [None]:
# Create summary table of cluster profiles
cluster_summary = []

for cluster_id in sorted(df['KMeans_Cluster'].unique()):
    cluster_data = df[df['KMeans_Cluster'] == cluster_id]

    # Calculate characteristics
    profile = {
        'Cluster': cluster_id,
        'Size': len(cluster_data),
        'Percentage': len(cluster_data)/len(df)*100
    }

    # Add mean values for each feature
    for feature in features_for_clustering:
        profile[f'{feature}_Mean'] = cluster_data[feature].mean()

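    # Identify dominant characteristics relative to the overall mean, so that
    # categories with high spending everywhere do not dominate every cluster
    means = cluster_data[features_for_clustering].mean()
    relative_means = means / df[features_for_clustering].mean()
    top_features = relative_means.nlargest(2).index.tolist()
    profile['Top_Features'] = ', '.join(top_features)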

    cluster_summary.append(profile)

cluster_summary_df = pd.DataFrame(cluster_summary)
print("Cluster Summary Table:")
print(cluster_summary_df[['Cluster', 'Size', 'Percentage', 'Top_Features']])


**Cluster Labels and Interpretations:**

Based on the cluster characteristics analysis, we can label each cluster:

- **Cluster 0:** [Label based on characteristics, e.g., "High-Volume Retailers"]
  - Characteristics: [Describe key features]
  
- **Cluster 1:** [Label, e.g., "Small Restaurants/Cafes"]
  - Characteristics: [Describe key features]
  
- **Cluster 2:** [Label, e.g., "Medium-Sized Grocery Stores"]
  - Characteristics: [Describe key features]

[Add more clusters if k > 3]


### Step 2.7: Select Best Clustering Solution


**Best Clustering Algorithm Selection:**

After comparing K-Means, Agglomerative Clustering, and DBSCAN based on:

1. **Quality Metrics:**
   - Silhouette Score: [Document which is highest]
   - Davies-Bouldin Index: [Document which is lowest]
   - Calinski-Harabasz Index: [Document which is highest]

2. **Interpretability:**
   - [Document which algorithm produces more interpretable clusters]

3. **Stability:**
   - [Document which algorithm is more stable/reproducible]

**Final Selection:** [K-Means / Agglomerative / DBSCAN]

**Justification:** [Provide clear reasoning for the selection]
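
One way to document the stability criterion is to rerun K-Means with different seeds and compare the resulting labelings with the adjusted Rand index (ARI), which ignores how clusters are numbered. The sketch below is a minimal example on synthetic blobs; `X_demo`, the blob centers, and the seed range are assumptions for illustration, and in the notebook `X_scaled` with the chosen k would take their place. Agglomerative clustering with Ward linkage is deterministic for a fixed input, so this check mainly concerns K-Means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic, well-separated blobs stand in for the scaled feature matrix
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=0.5,
                       random_state=0)

# Reference labeling from one seed
reference = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Compare labelings from other seeds against the reference
ari_scores = []
for seed in range(1, 6):
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X_demo)
    ari_scores.append(adjusted_rand_score(reference, labels))

print("ARI vs reference run:", np.round(ari_scores, 3))
print("Mean ARI:", round(float(np.mean(ari_scores)), 3))
```

An ARI near 1 across seeds indicates a reproducible partition; noticeably lower values would suggest that the solution depends on initialization.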


---

## Part 3: Interpretation and Insights

### Step 3.1: Key Findings Summary


The clustering analysis of the Wholesale Customers Dataset revealed [X] distinct customer segments, each characterized by unique purchasing patterns across product categories. The analysis successfully identified natural groupings in the customer base, with clear differences in spending behavior, product preferences, and purchase volumes.

Key insights include the identification of [describe main segments], where customers show distinct preferences for certain product combinations. For instance, [provide specific example of a pattern found]. The clustering solution achieved a silhouette score of [X], indicating [good/moderate] cluster separation and cohesion.

The analysis demonstrates that customer segmentation based on purchasing behavior provides actionable insights for business strategy. The identified segments can be leveraged for targeted marketing, personalized product recommendations, and optimized inventory management, ultimately leading to improved customer satisfaction and business profitability.


### Step 3.2: Practical Implications


**What do customer segments represent?**

The identified clusters represent distinct customer types in the wholesale market:
- [Segment 1 name]: Represents [description] with characteristics such as [key features]
- [Segment 2 name]: Represents [description] with characteristics such as [key features]
- [Segment 3 name]: Represents [description] with characteristics such as [key features]

**How can businesses use these segments?**

1. **Targeted Marketing:** Develop segment-specific marketing campaigns tailored to each customer group's preferences and purchasing behavior.

2. **Product Recommendations:** Use cluster characteristics to recommend complementary products. For example, customers in [segment] who buy [product A] are likely to be interested in [product B].

3. **Inventory Management:** Optimize stock levels based on segment demand patterns. High-volume segments may require different inventory strategies than low-volume segments.

4. **Pricing Strategies:** Implement segment-based pricing. High-volume segments might receive bulk discounts, while segments that buy specialized products might be priced at a premium.

5. **Customer Relationship Management:** Personalize interactions based on segment membership, improving customer satisfaction and retention.

**Marketing Strategies for Each Segment:**

- **[Segment 1]:** [Specific marketing strategy]
- **[Segment 2]:** [Specific marketing strategy]
- **[Segment 3]:** [Specific marketing strategy]


### Step 3.3: Limitations Discussion


**Dataset Limitations:**

1. **Sample Size:** With approximately 440 observations, the dataset is relatively small. A larger sample would provide more robust cluster definitions and better generalization.

2. **Feature Availability:** Apart from Channel and Region, the dataset includes only annual spending across product categories. Additional features such as customer demographics, geographic location details, purchase frequency, seasonal patterns, or customer lifetime value would enrich the analysis.

3. **Temporal Information:** The dataset lacks temporal information (e.g., purchase dates, trends over time). This prevents analysis of seasonal patterns, trends, or customer behavior evolution.

4. **Contextual Information:** Channel and Region are coded categorical variables with little accompanying context, which limits how far segment characteristics can be tied to real customer types.

**Methodology Limitations:**

1. **Distance-Based Assumptions:** K-Means favors compact, roughly spherical clusters of similar size, and agglomerative clustering behaves similarly under Ward linkage. Both may miss elongated or irregularly shaped groups in the data.

2. **Feature Scaling Dependency:** The clustering results are sensitive to the scaling method chosen. Different scalers (StandardScaler vs MinMaxScaler) might yield different cluster assignments.

3. **Optimal k Selection:** The choice of optimal k involves some subjectivity, especially when elbow and silhouette methods suggest different values.

4. **Outlier Handling:** High-volume customers (outliers) are retained but may influence cluster centroids, potentially skewing cluster definitions.
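
The scaling sensitivity noted above can be checked directly by clustering the same data under two scalers and comparing the partitions with the Adjusted Rand Index (ARI). This is a minimal sketch on synthetic right-skewed data, not the wholesale features. The helper `kmeans_labels`, the choice of `k=3`, and the lognormal parameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic right-skewed "spending" data standing in for the real features.
rng = np.random.default_rng(0)
X_spend = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 4))


def kmeans_labels(X, k=3, seed=42):
    """Fit K-Means with a fixed seed and return the cluster labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)


labels_standard = kmeans_labels(StandardScaler().fit_transform(X_spend))
labels_minmax = kmeans_labels(MinMaxScaler().fit_transform(X_spend))

# ARI is 1.0 for identical partitions (ignoring label names)
# and close to 0 for chance-level agreement.
ari = adjusted_rand_score(labels_standard, labels_minmax)
print(f"ARI between StandardScaler and MinMaxScaler partitions: {ari:.3f}")
```

An ARI well below 1 would suggest that segment definitions depend materially on the scaler, which is worth stating alongside the final cluster profiles.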

**Assumptions Made:**

1. All features are equally important for clustering (no feature weighting applied)
2. StandardScaler is appropriate for all features
3. Euclidean distance is suitable for measuring customer similarity
4. The optimal number of clusters remains stable over time

**Potential Biases:**

1. **Temporal Bias:** The spending figures are annual totals, so if they reflect a single year they may not represent customer behavior in other years or under different market conditions.
2. **Selection Bias:** The dataset may not represent the entire customer population if certain customer types are over/under-represented.
3. **Measurement Bias:** Annual spending aggregates may mask important short-term purchasing patterns.


### Step 3.4: Recommendations


**Business Strategies Based on Findings:**

1. **Segment-Specific Product Bundles:** Create product bundles tailored to each segment's purchasing patterns. For example, [specific recommendation based on cluster analysis].

2. **Dynamic Inventory Allocation:** Allocate inventory based on segment demand. High-volume segments may require dedicated supply chains or priority restocking.

3. **Customer Acquisition:** Focus marketing efforts on acquiring customers similar to high-value segments. Use cluster characteristics to identify potential high-value customers.

4. **Retention Strategies:** Develop segment-specific retention programs. For instance, [specific strategy for a segment].

**Customer Targeting Approaches:**

1. **New Customer Classification:** When a new customer makes initial purchases, quickly classify them into a segment to provide personalized service from the start.

2. **Cross-Selling Opportunities:** Identify cross-selling opportunities based on segment patterns. Customers in [segment X] who buy [product A] are likely candidates for [product B].

3. **Upselling Strategies:** Target customers in lower-volume segments with upselling campaigns based on what similar customers in higher-volume segments purchase.
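
Classifying a new customer amounts to applying the already fitted scaler and clustering model to that customer's spending vector. The sketch below shows one way to do this with a scikit-learn pipeline so the scaler and model stay paired. The two-feature synthetic data, the variable names, and `n_clusters=2` are illustrative assumptions and do not reflect the fitted model in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic existing customers: a low-spend group and a high-spend group.
rng = np.random.default_rng(1)
low_spenders = rng.normal(loc=[2000, 3000], scale=300, size=(50, 2))
high_spenders = rng.normal(loc=[15000, 12000], scale=1500, size=(50, 2))
X_existing = np.vstack([low_spenders, high_spenders])

# Bundle scaling and clustering so new data is scaled exactly like the training data.
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
segmenter.fit(X_existing)

# A hypothetical new customer with spending close to the high-spend group.
new_customer = np.array([[14000.0, 11000.0]])
new_segment = segmenter.predict(new_customer)[0]

# Reference points at each group's center, used to interpret the label.
high_segment = segmenter.predict(np.array([[15000.0, 12000.0]]))[0]
low_segment = segmenter.predict(np.array([[2000.0, 3000.0]]))[0]
print(f"New customer assigned to segment {new_segment}")
```

This works for K-Means because it exposes `predict`. Agglomerative clustering and DBSCAN in scikit-learn do not, so a separate classifier trained on the cluster labels would be needed for those.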

**Additional Data Recommendations:**

1. **Temporal Data:** Collect purchase timestamps to analyze seasonal patterns, purchase frequency, and customer behavior evolution over time.

2. **Customer Demographics:** Include customer size (number of employees, store size), industry type, geographic location details, and business model information.

3. **Behavioral Data:** Track purchase frequency, average order value, customer lifetime value, and engagement metrics (website visits, inquiries).

4. **External Factors:** Include economic indicators, seasonal factors, or market trends that might influence purchasing behavior.

**Future Analysis Directions:**

1. **Time Series Clustering:** Analyze how customer segments evolve over time and identify customers transitioning between segments.

2. **Hierarchical Segmentation:** Create sub-segments within main clusters for more granular targeting.

3. **Predictive Modeling:** Build models to predict which segment a new customer will belong to based on initial purchases.

4. **Association Rule Mining:** Discover product association rules within each segment to identify cross-selling opportunities.

5. **Anomaly Detection:** Identify unusual purchasing patterns that might indicate fraud, errors, or emerging customer needs.

**Model Improvements:**

1. **Feature Engineering:** Create derived features such as spending ratios (e.g., Grocery/Fresh ratio), total spending, or category diversity scores.

2. **Alternative Algorithms:** Explore density-based clustering (DBSCAN with optimized parameters), Gaussian Mixture Models, or spectral clustering for potentially better results.

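3. **Validation Methods:** Assess cluster stability with resampling, for example by re-clustering bootstrap samples and comparing the resulting assignments with the Adjusted Rand Index. Standard cross-validation does not apply directly because there are no ground-truth labels.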

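4. **Dimensionality Reduction:** Apply PCA before clustering to reduce noise and improve cluster quality, especially if more features are added. t-SNE is better reserved for visualizing the resulting clusters, because it does not preserve global distances or densities.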
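
The derived features mentioned under Feature Engineering can be sketched as follows. The data are invented, the `Frozen` column is assumed for illustration, and the diversity score is defined here as normalized Shannon entropy of spending shares, which is one possible choice and not a definition taken from this analysis.

```python
import numpy as np
import pandas as pd

# Invented spending values for three hypothetical customers.
toy = pd.DataFrame({
    "Fresh":   [12000.0, 3000.0, 500.0],
    "Grocery": [4000.0, 18000.0, 500.0],
    "Frozen":  [4000.0, 1000.0, 500.0],
})
spend_cols = ["Fresh", "Grocery", "Frozen"]

engineered = toy.copy()

# Total annual spend across all categories.
engineered["Total_Spend"] = toy[spend_cols].sum(axis=1)

# Spending ratio between two categories.
engineered["Grocery_Fresh_Ratio"] = toy["Grocery"] / toy["Fresh"]

# Category diversity: Shannon entropy of spending shares, scaled to [0, 1].
# Zero shares are replaced by 1.0 inside the log so they contribute nothing.
shares = toy[spend_cols].div(engineered["Total_Spend"], axis=0)
entropy = -(shares * np.log(shares.where(shares > 0, 1.0))).sum(axis=1)
engineered["Diversity"] = entropy / np.log(len(spend_cols))

print(engineered.round(3))
```

A diversity of 1 means spending is spread evenly across categories, while values near 0 mean it is concentrated in one category. Ratio features are undefined when the denominator is zero, so real data may need a guard for that case.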


---

## Conclusion

This analysis successfully identified distinct customer segments in the Wholesale Customers Dataset using clustering techniques. The findings provide actionable insights for business strategy, marketing, and customer relationship management. While the analysis has limitations, it establishes a solid foundation for data-driven decision-making in customer segmentation.

**Key Takeaways:**
- [X] distinct customer segments identified
- Clear purchasing patterns and preferences discovered
- Actionable business strategies recommended
- Framework established for future analysis and model improvements

---

**End of Assignment**
