
# Customer Segmentation Analysis

This notebook performs customer segmentation on a marketing campaign dataset. The dataset contains **2,240 customers** and 29 variables: customer demographics (year of birth, education, marital status, income), household information (number of kids and teens at home), the date of becoming a customer, recency (days since last purchase), and spending on product categories such as wines, meats, fish, sweets and gold products. It also records the number of purchases across web, catalogue and store channels, the number of deals used, web visits per month and whether the customer accepted previous campaigns. We engineer features, scale them and use **K‑means clustering** to discover distinct customer segments. DBSCAN was also tried for comparison.

**Steps:**
1. Load the TSV dataset and inspect its columns.
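2. Impute missing incomes with the median and engineer features such as age, children count and tenure (days between the earliest customer acquisition date in the data and the customer's own sign‑up date, so larger values mean more recent sign‑ups).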
3. One‑hot encode categorical variables (education and marital status).
4. Standardize numeric features and determine an appropriate number of clusters using the silhouette score.
5. Fit a K‑means model, add cluster labels to the data and compute summary statistics for each segment.
6. Visualize the clusters in two dimensions using PCA and plot the distribution of customers across segments.
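Step 1 (loading and inspecting the data) can be sketched on a tiny in‑memory sample before running it on the real file. The rows below are hypothetical stand‑ins; only the column names `ID`, `Year_Birth`, `Income` and `Dt_Customer` come from the dataset used in this notebook.

```python
import io

import pandas as pd

# Tiny tab-separated stand-in for the real file (hypothetical rows).
sample_tsv = (
    "ID\tYear_Birth\tIncome\tDt_Customer\n"
    "1\t1970\t58000\t04-09-2012\n"
    "2\t1985\t\t08-03-2014\n"
    "3\t1990\t42000\t21-08-2013\n"
)

# sep="\t" tells pandas the fields are tab-separated.
demo = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(demo.shape)   # number of rows and columns
print(demo.dtypes)  # inferred type of each column

# Count missing values per column to see what needs imputing.
missing = demo.isna().sum()
print(missing)
```

On the real file, the same `isna().sum()` check shows which columns need imputation before scaling.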



In [None]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset (tab-separated)
df = pd.read_csv('marketing_campaign.csv', sep='\t')

# Impute missing income values with the median
# (plain assignment avoids the deprecated chained inplace fillna)
df['Income'] = df['Income'].fillna(df['Income'].median())

# Derive new features
current_year = 2015  # approximate current year for age calculation
df['Age'] = current_year - df['Year_Birth']
df['Children'] = df['Kidhome'] + df['Teenhome']

df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y', errors='coerce')
min_date = df['Dt_Customer'].min()
df['Tenure'] = (df['Dt_Customer'] - min_date).dt.days

# Drop unnecessary columns
cols_to_drop = ['ID','Year_Birth','Kidhome','Teenhome','Dt_Customer','Z_CostContact','Z_Revenue']
df.drop(columns=cols_to_drop, inplace=True)

# One-hot encode categorical variables
categorical_cols = ['Education','Marital_Status']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Scale features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

# Determine optimal number of clusters via silhouette score
silhouette_scores = {}
for k in range(2,7):
    kmeans_test = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans_test.fit_predict(scaled_features)
    silhouette_scores[k] = silhouette_score(scaled_features, labels)

best_k = max(silhouette_scores, key=silhouette_scores.get)
print(f'Best k based on silhouette: {best_k}')

# Fit K-means with best_k clusters
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(scaled_features)

# Add cluster labels to the dataframe
df['Cluster'] = cluster_labels

# Summary statistics for each cluster
summary = df.groupby('Cluster').mean()
print(summary[['Age','Income','Recency','Children','Tenure']])

# PCA for visualization
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_features)

plt.figure(figsize=(7,5))
for cluster in range(best_k):
    idx = cluster_labels == cluster  # boolean NumPy mask for this cluster
    plt.scatter(components[idx,0], components[idx,1], s=10, label=f'Cluster {cluster}')
plt.title('Customer Segments (PCA 2D)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
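
The last part of step 6, the distribution of customers across segments, could be plotted along the following lines. This is a minimal sketch: `demo_labels` is a hypothetical stand‑in array so that the snippet runs on its own. In the notebook, `cluster_labels` would take its place.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in labels (hypothetical); in the notebook, use cluster_labels instead.
rng = np.random.default_rng(42)
demo_labels = rng.integers(0, 2, size=200)

# Count how many customers fall into each segment.
segments, counts = np.unique(demo_labels, return_counts=True)

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar([f"Cluster {s}" for s in segments], counts)
ax.set_title("Customers per Segment")
ax.set_ylabel("Number of customers")
fig.tight_layout()
plt.show()
```

A bar chart of counts makes it easy to spot a segment that is too small to act on.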




## Results and Interpretation

Among the values tested (k = 2 to 6), the silhouette analysis indicated that **2 clusters** produce the best separation among customers. The cluster summary shows that:

* **Cluster 0** consists of slightly older customers (average age ≈ 48) with **higher incomes (≈ \$72k)** and significantly larger spending on wines and meat products. They tend to have few or no children, make more catalogue and store purchases and have shorter tenures. These appear to be affluent individuals or couples with strong purchasing power and loyalty. **Marketing strategy:** promote premium products, exclusive offers, loyalty rewards and wine/meat subscription clubs to retain these high‑value customers.

* **Cluster 1** comprises slightly younger customers (average age ≈ 45) with **lower incomes (≈ \$39k)** and lower total spending across categories. They have more children on average, use deals more often and visit the website more frequently. **Marketing strategy:** offer targeted discounts, family‑oriented bundles and convenience‑focused promotions (e.g., online shopping incentives) to increase engagement and spending.
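
The interpretation above draws on spending and channel columns that the printed summary does not include. A per‑cluster profile of those columns could be computed roughly as follows. The rows are hypothetical, and the column names (`MntWines`, `MntMeatProducts`, `NumDealsPurchases`, `NumWebVisitsMonth`) are assumed to match the dataset's naming, so they may need adjusting to your file.

```python
import pandas as pd

# Hypothetical mini-sample; replace with the notebook's df after clustering.
demo = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "MntWines": [600, 800, 50, 30, 70],
    "MntMeatProducts": [400, 300, 20, 40, 30],
    "NumDealsPurchases": [1, 1, 3, 4, 2],
    "NumWebVisitsMonth": [2, 3, 7, 6, 8],
})

# Columns assumed to describe spending and channel behaviour.
profile_cols = [
    "MntWines",
    "MntMeatProducts",
    "NumDealsPurchases",
    "NumWebVisitsMonth",
]

# Mean of each behavioural column within each cluster.
profile = demo.groupby("Cluster")[profile_cols].mean().round(1)
print(profile)
```

Applied to the clustered data, `summary[profile_cols]` from the existing `summary` table would give the same view without recomputing the group means.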

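DBSCAN labelled all points as noise under its default parameters (`eps=0.5`, `min_samples=5`), plausibly because that radius is too small for standardized data with this many features. The K‑means approach, by contrast, provided meaningful segments that can guide tailored marketing campaigns.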
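
The DBSCAN behaviour can be reproduced on synthetic data. The sketch below is not the code used for the comparison above: it uses random standardized data with 30 features as a stand‑in, and it picks `eps` from the distance to the `min_samples`‑th nearest neighbour, a common heuristic. The 90th‑percentile cut‑off is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features (assumption).
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 30)))

# Default DBSCAN: eps=0.5 is tiny relative to distances in 30 dimensions.
default_labels = DBSCAN().fit_predict(X)
noise_share = np.mean(default_labels == -1)
print(f"Noise share with defaults: {noise_share:.2f}")

# k-distance heuristic: distance from each point to its min_samples-th
# neighbour (the point itself counts as the first neighbour).
min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = distances[:, -1]

# Assumed cut-off: 90th percentile of the k-distances.
eps_guess = float(np.percentile(k_dist, 90))

tuned_labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X)
tuned_noise_share = np.mean(tuned_labels == -1)
print(f"eps={eps_guess:.2f}, noise share: {tuned_noise_share:.2f}")
```

A larger `eps` reduces the noise share but does not guarantee useful segments: on data without density gaps, DBSCAN may simply merge everything into one cluster, which is one reason K‑means can remain the more practical choice here.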

