# 📘 Mall Customers: DBSCAN Clustering

### 🎯 Objective of This Project: DBSCAN Clustering on Mall Customers Dataset
##### The objective of this project is to perform density-based clustering on the Mall Customers dataset using DBSCAN.
##### Our goal is to:
##### 1- Discover natural customer segments based on their Annual Income and Spending Score.
##### 2- Automatically detect outliers, such as customers whose behavior is very different from others (e.g., unusually high or low spending).
##### 3- Compare DBSCAN's behavior to K-Means and Hierarchical Clustering, showing how density-based clustering:
#####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Finds clusters of any shape
#####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Does not require predefining the number of clusters
#####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Naturally identifies noise points
##### 4- Produce a 2D visual clustering map for easy interpretation.

##### This project will help us understand when DBSCAN is the better choice for segmentation, especially in datasets with irregular cluster shapes or noisy points.

### Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import seaborn as sns

### Load Dataset

In [None]:
df = pd.read_csv("./data/Mall_Customers.csv")
df.head()

### Basic Inspection

In [None]:
df.info()
df.describe()

### Select Features for Clustering
##### DBSCAN groups points by distance, so it needs numerical features on which distance is meaningful.
##### We will use:
##### - Annual Income (k$)
##### - Spending Score (1-100)

In [None]:
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

### Scale Features
##### DBSCAN applies a single distance threshold (eps) across all features, so the features should be put on a comparable scale first.

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
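
##### A minimal sketch of what the scaling step does, using a hypothetical four-row toy array (`X_demo`) rather than the Mall Customers data. After `StandardScaler`, each column has mean 0 and standard deviation 1, so an eps such as 0.3 is measured in standard deviations, not in k$ or score points.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows: [annual income in k$, spending score]
X_demo = np.array([
    [15.0, 39.0],
    [16.0, 81.0],
    [70.0, 50.0],
    [137.0, 18.0],
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# StandardScaler centers each column and divides by its population
# standard deviation, so both statistics can be checked directly.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)
print("column means:", col_means)
print("column stds: ", col_stds)
```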

### Try DBSCAN with Initial Parameters
##### Start with simple values:
##### - eps = 0.3
##### - min_samples = 5

In [None]:
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# List the distinct labels; -1 marks noise, the other values are cluster ids
np.unique(labels)
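
##### The starting eps above is a guess. One common heuristic for narrowing it down is the k-distance curve: sort every point's distance to its min_samples-th nearest neighbour and look for the value where the curve starts to rise sharply. The sketch below assumes synthetic blobs (`X_toy_scaled`) as a stand-in for `X_scaled`, so its numbers say nothing about the Mall Customers data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled (assumption: not the notebook's dataset)
X_toy, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=42)
X_toy_scaled = StandardScaler().fit_transform(X_toy)

min_samples = 5

# Querying the training points returns each point as its own first
# neighbour (distance 0), which matches DBSCAN counting the point itself
# towards min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy_scaled)
distances, _ = nn.kneighbors(X_toy_scaled)

# Distance to the min_samples-th neighbour, sorted ascending
k_distances = np.sort(distances[:, -1])

# Percentiles give a rough feel for where the curve bends
for q in (50, 75, 90, 95):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
```

##### Plotting `k_distances` with `plt.plot` shows the bend more clearly than the percentiles do; candidate eps values near that bend are then worth testing.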

### Visualize Initial Result

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(
    x=X["Annual Income (k$)"], 
    y=X["Spending Score (1-100)"],
    hue=labels,
    palette="tab10"
)
plt.title("Initial DBSCAN Clusters")
plt.show()

### Evaluate Silhouette Score (Ignoring Noise -1)

In [None]:
mask = labels != -1  # ignore noise
if len(np.unique(labels[mask])) > 1:
    score = silhouette_score(X_scaled[mask], labels[mask])
    print("Silhouette Score:", score)
else:
    print("Not enough clusters for silhouette evaluation.")
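
##### Because the silhouette score above is computed only on non-noise points, it helps to read it next to the share of points that were discarded as noise. A minimal sketch, assuming a hypothetical helper `summarize_clustering` and synthetic blobs in place of `X_scaled`:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def summarize_clustering(X, labels):
    """Hypothetical helper: cluster count, noise share, silhouette on non-noise points."""
    labels = np.asarray(labels)
    mask = labels != -1                      # True for clustered points
    n_clusters = len(np.unique(labels[mask]))
    noise_fraction = float(np.mean(~mask))
    # silhouette_score needs at least 2 clusters and fewer clusters than points
    if 1 < n_clusters < int(mask.sum()):
        sil = float(silhouette_score(X[mask], labels[mask]))
    else:
        sil = None
    return {"n_clusters": n_clusters, "noise_fraction": noise_fraction, "silhouette": sil}


# Synthetic stand-in for X_scaled (assumption: not the notebook's dataset)
X_toy, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
X_toy_scaled = StandardScaler().fit_transform(X_toy)

toy_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_toy_scaled)
summary = summarize_clustering(X_toy_scaled, toy_labels)
print(summary)
```

##### In the notebook, the same helper could be called as `summarize_clustering(X_scaled, labels)`.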


### Tuning eps (Neighborhood Size)
##### We test different eps values to find better structure.

In [None]:
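# Round to avoid floating-point artifacts such as 0.30000000000000004
eps_values = np.round(np.arange(0.1, 1.0, 0.1), 1)
results = []

for eps in eps_values:
    model = DBSCAN(eps=float(eps), min_samples=5)
    # Loop-local names keep the earlier `labels` and `mask` from being overwritten
    eps_labels = model.fit_predict(X_scaled)
    eps_mask = eps_labels != -1  # ignore noise

    if len(np.unique(eps_labels[eps_mask])) > 1:
        score = silhouette_score(X_scaled[eps_mask], eps_labels[eps_mask])
    else:
        score = -1  # fewer than two clusters: silhouette is undefined

    results.append((float(eps), score))

results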

### Visualize eps vs Silhouette

In [None]:
eps_list = [r[0] for r in results]
score_list = [r[1] for r in results]

plt.figure(figsize=(8,5))
plt.plot(eps_list, score_list, marker="o")
plt.xlabel("eps value")
plt.ylabel("Silhouette Score")
plt.title("DBSCAN eps tuning")
plt.show()

### Choose Best eps

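##### Select the eps with the highest silhouette score. Noise points are excluded from that score, so also check how many points each candidate eps labels as noise.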

In [None]:
best_eps = eps_list[np.argmax(score_list)]
best_eps
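
##### Picking eps by silhouette alone can favour small values that keep a few tight clusters and throw many points away as noise. One possible variation, sketched on synthetic blobs, records the cluster count and noise share for each eps and only considers candidates below a noise cap. The cap `max_noise = 0.15` and the names `rows`, `eligible` and `best` are assumptions made for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled (assumption: not the notebook's dataset)
X_toy, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.7, random_state=1)
X_toy_scaled = StandardScaler().fit_transform(X_toy)

max_noise = 0.15  # hypothetical cap on the share of points labelled -1
rows = []

for eps in np.round(np.linspace(0.1, 0.9, 9), 1):
    lab = DBSCAN(eps=float(eps), min_samples=5).fit_predict(X_toy_scaled)
    keep = lab != -1
    n_clusters = len(np.unique(lab[keep]))
    noise = float(np.mean(~keep))
    # Silhouette is only defined for two or more clusters
    if n_clusters > 1:
        score = float(silhouette_score(X_toy_scaled[keep], lab[keep]))
    else:
        score = float("nan")
    rows.append((float(eps), n_clusters, noise, score))

# Keep candidates with a defined score and an acceptable noise share
eligible = [r for r in rows if r[2] <= max_noise and not np.isnan(r[3])]
best = max(eligible, key=lambda r: r[3]) if eligible else None

for eps, n_clusters, noise, score in rows:
    print(f"eps={eps:.1f}  clusters={n_clusters}  noise={noise:.2f}  silhouette={score:.3f}")
print("selected:", best)
```

##### The right cap depends on how many outliers are plausible for the business question, so it should be treated as a tuning choice in its own right.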

### Final DBSCAN Model

In [None]:
dbscan_final = DBSCAN(eps=best_eps, min_samples=5)
labels_final = dbscan_final.fit_predict(X_scaled)
np.unique(labels_final)

### Final Cluster Visualization

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(
    x=df["Annual Income (k$)"],
    y=df["Spending Score (1-100)"],
    hue=labels_final,
    palette="tab20",
    s=70
)
plt.title(f"DBSCAN Final Clusters (eps={best_eps:.1f})")
plt.show()

### Show Number of Clusters & Noise

In [None]:
unique, counts = np.unique(labels_final, return_counts=True)
cluster_summary = dict(zip(unique, counts))
cluster_summary
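
##### Counts alone do not say what kind of customers each cluster holds. A short sketch of a per-cluster profile, using a hypothetical seven-row frame (`toy`) and hand-written labels (`toy_labels`) in place of `df` and `labels_final`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that reuse the notebook's column names
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 17, 80, 85, 90, 137],
    "Spending Score (1-100)": [80, 77, 90, 15, 20, 10, 50],
})
toy_labels = np.array([0, 0, 0, 1, 1, 1, -1])  # -1 marks noise

# Size and mean feature values per cluster label
profile = (
    toy.assign(cluster=toy_labels)
       .groupby("cluster")
       .agg(
           n_customers=("Annual Income (k$)", "size"),
           mean_income=("Annual Income (k$)", "mean"),
           mean_spending=("Spending Score (1-100)", "mean"),
       )
)
print(profile)
```

##### With the notebook's variables, the same pattern would start from `df.assign(cluster=labels_final)`.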

### 🔍 What This Means
##### 1. DBSCAN identified 7 real clusters (0 to 6)
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Some clusters are large (Cluster 1 → 78 customers)
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Some clusters are very small (Clusters 5 and 6 → 4–5 customers)
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Small clusters are normal because DBSCAN detects dense pockets in the data.
##### 2. 77 customers are labeled as -1 (noise)
##### These customers:
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Do not belong to any dense region
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Are isolated customers with unusual income/spending patterns
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Are treated as outliers under the chosen eps and min_samples
##### Outliers can be valuable in marketing segmentation:
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- They may represent VIP clients
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Or irregular spending behaviors
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Or customers who do not fit standard profiles

### 📊 Is This Good?

##### Yes, DBSCAN is doing what it's supposed to:

#### 👍 Strengths:
##### - Finds arbitrary-shaped clusters
##### - Identifies outliers (very useful in marketing data)
##### - Does not require specifying number of clusters

#### 👎 Weak Points:
##### - Some clusters are very small → may require adjusting parameters
##### - Many outliers might mean:
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- eps is too small
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- or data has natural noise
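
##### The comparison with K-Means and Hierarchical Clustering named in the objective can be sketched on a shape where the difference is easy to see. The example below assumes scikit-learn's synthetic two-moons data, not the Mall Customers data, and the settings (k = 2, eps = 0.3, min_samples = 5) are illustrative choices. It scores each method against the known moon labels with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: non-convex clusters with known labels
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
X_moons = StandardScaler().fit_transform(X_moons)

models = {
    "KMeans (k=2)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative (ward, k=2)": AgglomerativeClustering(n_clusters=2),
    "DBSCAN (eps=0.3)": DBSCAN(eps=0.3, min_samples=5),
}

ari = {}
for name, model in models.items():
    pred = model.fit_predict(X_moons)
    # Adjusted Rand index: 1.0 means the labels match the true moons exactly
    ari[name] = adjusted_rand_score(y_true, pred)
    n_noise = int(np.sum(pred == -1))
    print(f"{name:28s} ARI={ari[name]:.3f}  noise points={n_noise}")
```

##### K-Means and Ward linkage are told how many clusters to find and tend to prefer compact groups, while DBSCAN follows the dense curves and infers the cluster count. On compact, blob-like segments the ranking can differ, so this result should not be read as a verdict on the Mall Customers data.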