# Customer Segmentation - Unsupervised Machine Learning Project

## Project Description

The goal of this project is to analyze customer data from a retail store in order to better understand customer behaviors, needs, and purchasing patterns. By applying unsupervised machine learning techniques, we aim to group similar customers into meaningful segments that can support business decision-making and targeted marketing strategies.

We will work with a dataset containing demographic information, purchasing habits, and marketing activity of the customers.

## Tasks Overview

The following unsupervised learning methods will be applied:

- **Data Preparation**  
  Data cleaning, handling missing values, encoding categorical variables, outlier detection, feature scaling.

- **Clustering Approaches**  
  - **K-Means Clustering**
  - **Hierarchical Agglomerative Clustering**
  - **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**

- **Feature Engineering**  
  Creating new customer features to improve clustering quality.

- **Dimensionality Reduction**  
  Applying PCA (Principal Component Analysis) to reduce the dimensionality of the feature space.

- **Cluster Evaluation**  
  Using multiple evaluation metrics:
  - Silhouette Score
  - Davies-Bouldin Index
  - Calinski-Harabasz Index

- **Visualization & Interpretation**  
  Analyzing and interpreting the discovered customer segments to draw business conclusions.
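
As a rough illustration of how the steps listed above could fit together, the sketch below runs PCA, the three clustering algorithms, and the three evaluation metrics on synthetic data. Everything in it is an assumption made for illustration: the data comes from `make_blobs` rather than the customer dataset, and the number of clusters, `eps`, and `min_samples` are placeholder values, not tuned settings from this project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

# Synthetic stand-in for the scaled customer features (assumption: not the real data)
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: keep the first two principal components
X_pca = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

# Placeholder hyperparameters; real values would have to be chosen from the data
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_pca)
    mask = labels != -1  # DBSCAN labels noise points as -1; exclude them from scoring
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        # The metrics are undefined for fewer than two clusters
        results[name] = None
        continue
    results[name] = {
        "n_clusters": n_clusters,
        "silhouette": silhouette_score(X_pca[mask], labels[mask]),
        "davies_bouldin": davies_bouldin_score(X_pca[mask], labels[mask]),
        "calinski_harabasz": calinski_harabasz_score(X_pca[mask], labels[mask]),
    }

for name, scores in results.items():
    print(name, scores)
```

A higher Silhouette Score and Calinski-Harabasz Index indicate better-separated clusters, while a lower Davies-Bouldin Index is better. Scores for DBSCAN are computed without noise points here, so they are not directly comparable with the other two methods.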

---

This project is part of the final assignment for the course:  
**Advanced Machine Learning with Scikit-learn**

Instructor: Andrzej Bobyk  
Year: 2025



In [None]:
# Necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from datetime import datetime

In [None]:
# Loading the dataset
df = pd.read_csv("marketing_campaign.csv")

# Basic information about the data
print("Data information:")
df.info()
print("\nDescriptive statistics:")
print(df.describe(include='all'))
print("\nMissing values:")
print(df.isnull().sum())

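# Handling missing values: fill missing Income with the median.
# Assigning the result back avoids the chained in-place fillna,
# which is deprecated in recent pandas versions.
median_income = df['Income'].median()
df['Income'] = df['Income'].fillna(median_income)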

# Transforming CustomerFrom into CustomerSinceDays (format: DD-MM-YYYY)
df['CustomerFrom'] = pd.to_datetime(df['CustomerFrom'], format='%d-%m-%Y')
df['CustomerSinceDays'] = (datetime.today() - df['CustomerFrom']).dt.days

# Dropping the original date column
df.drop('CustomerFrom', axis=1, inplace=True)

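# Encoding categorical variables as integer codes.
# Note: LabelEncoder assigns codes in sorted order of the category names, which
# imposes an arbitrary ordering on nominal categories such as MaritalStatus.
categorical_cols = ['Education', 'MaritalStatus']
label_encoders = {}  # one fitted encoder per column, so codes can be mapped back later
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])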

# Initial outlier visualization
plt.figure(figsize=(12, 8))
sns.boxplot(data=df.select_dtypes(include='number'))  # all numeric dtypes, not only int64/float64
plt.xticks(rotation=90)
plt.title("Boxplot - searching for outliers")
plt.show()

# Scaling numeric data for further clustering
numeric_cols = df.select_dtypes(include='number').columns
scaler = StandardScaler()
# Keep the original index so df_scaled stays aligned with df
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index)

# Preview of the scaled data
print("\nSample of scaled data:")
print(df_scaled.head())
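
The boxplot above only shows outliers visually. One common way to count them is the 1.5 × IQR rule, sketched below on a small made-up DataFrame. The helper `iqr_outlier_mask`, the toy values, and the `Age` column are hypothetical and used only for illustration; the same function could be applied to the numeric columns of `df`, but whether flagged rows should be dropped or capped is a separate decision.

```python
import pandas as pd

def iqr_outlier_mask(frame, k=1.5):
    """Return a boolean DataFrame, True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] of its own column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    return (frame < q1 - k * iqr) | (frame > q3 + k * iqr)

# Toy data (assumption: invented values, not taken from marketing_campaign.csv)
toy = pd.DataFrame({
    "Income": [30000, 42000, 51000, 58000, 61000, 666666],
    "Age": [25, 31, 38, 44, 52, 60],
})

mask = iqr_outlier_mask(toy)
outlier_counts = mask.sum()  # number of flagged values per column
print(outlier_counts)
```

In this toy example only the extreme `Income` value is flagged; none of the `Age` values falls outside the bounds.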
