# Customer Segmentation

## Introduction to Customer Segmentation

Customer segmentation is a powerful technique in data science that enables businesses to categorize their customers into distinct groups based on shared characteristics. This approach is pivotal in understanding customer behavior, optimizing marketing strategies, and enhancing customer service. In this assignment, we will delve into the practical application of customer segmentation using machine learning algorithms.

- **Significance of Customer Segmentation**:
  - **Targeted Marketing**: Tailoring marketing campaigns to specific customer groups based on their purchasing behavior and preferences.
  - **Product Customization**: Developing products and services that cater to the specific needs and desires of different customer segments.
  - **Improved Customer Experience**: Delivering personalized experiences to customers, increasing satisfaction and loyalty.

The example code provided serves as a starting point for this exploration. It demonstrates the application of K-Means clustering, a popular technique in machine learning for grouping data. This algorithm partitions customers into clusters based on features like transaction amount, account balance, and transaction frequency.

- **Key Techniques and Concepts**:
  - **K-Means Clustering**: Understand and apply K-Means to segment customers.
  - **Data Standardization**: Learn the importance of scaling features for effective clustering.
  - **Cluster Visualization**: Gain skills in visualizing the clusters to extract meaningful insights.

We will expand upon this initial code by experimenting with different numbers of clusters, applying additional clustering techniques like Hierarchical Clustering and DBSCAN, and performing a thorough analysis of the clusters to understand their business implications. 


In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
data = pd.DataFrame({
    'TransactionAmount': np.random.uniform(10, 1000, 100),
    'AccountBalance': np.random.uniform(500, 5000, 100),
    'TransactionFrequency': np.random.poisson(5, 100)
})

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Choose the number of clusters (you may want to experiment with this)
num_clusters = 3

# Apply KMeans clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
data['Cluster'] = kmeans.fit_predict(scaled_data)

# Visualize the clusters (for two features)
plt.scatter(data['TransactionAmount'], data['AccountBalance'], c=data['Cluster'], cmap='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Transaction Amount')
plt.ylabel('Account Balance')
plt.show()

# Display the cluster centers, mapped back to the original feature space
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=data.columns[:-1])
print(cluster_centers_df)
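The cell above standardizes the features before clustering. As a minimal sketch of why that step matters, the snippet below compares the per-feature spread before and after `StandardScaler`. The data is synthetic and only loosely mirrors the ranges used above; the variable names are illustrative assumptions, not part of the assignment code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): amount, balance, frequency
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(10, 1000, 200),    # transaction amount
    rng.uniform(500, 5000, 200),   # account balance
    rng.poisson(5, 200),           # transaction frequency
])

# Spread of each feature before and after standardization
raw_std = X.std(axis=0)
scaled_std = StandardScaler().fit_transform(X).std(axis=0)

print("Std before scaling:", raw_std.round(2))
print("Std after scaling: ", scaled_std.round(2))
```

Before scaling, account balance varies on a scale hundreds of times larger than transaction frequency, so Euclidean distances (and therefore K-Means) would be driven almost entirely by balance. After scaling, every feature has unit spread and contributes comparably.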

In [None]:
# Load the Mall Customers dataset, fit K-Means, and build the linkage matrix for the dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the dataset
file_path = 'Mall_Customers.csv'  # Replace with the actual file path
data2 = pd.read_csv(file_path)

# Select relevant features for clustering
selected_features = data2[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the data
scaler = StandardScaler()
scaled_data2 = scaler.fit_transform(selected_features)

# Apply KMeans clustering
num_clusters = 5  # This can be adjusted based on experimentation
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
data2['Cluster'] = kmeans.fit_predict(scaled_data2)

# Prepare for Hierarchical Clustering
Z = linkage(scaled_data2, 'ward')

In [None]:
# Visualize the clusters (for two features)
plt.scatter(data2['Annual Income (k$)'], data2['Spending Score (1-100)'], c=data2['Cluster'], cmap='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()

# Display the cluster centers (in the original feature space)
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=selected_features.columns)
print(cluster_centers_df)

K-Means separates the customers into five visually distinct clusters, and the centroids are well spread across the income-spending plane. I excluded age from the analysis because including it produced an extra cluster in the center that overlapped cluster 0.

**Data Exploration and Analysis**

First, we look at the first five records.

In [None]:
print(data2.head())

Next, we examine the distribution of the variables. Annual Income is skewed right, while Spending Score is roughly normally distributed. The two features happen to have similar ranges, but it still makes sense to standardize them, because K-Means relies on Euclidean distances and is sensitive to feature scale.

In [None]:
# Generate histograms for the features
data2[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

### Introduction to Cluster Evaluation Techniques

In the realm of unsupervised machine learning, determining the optimal number of clusters is a pivotal decision that can significantly impact the outcomes of your model. Cluster evaluation techniques are essential tools that provide guidance in this decision-making process. Two of the most widely recognized methods for evaluating clustering results are the Elbow Method and the Silhouette Score.

#### Elbow Method
- **Explanation**: The Elbow Method is a heuristic for choosing the number of clusters in a data set. It plots a measure of fit, such as the explained variance or, as in the code below, the within-cluster sum of squares (WCSS), against the number of clusters, and picks the point beyond which adding another cluster no longer improves the fit significantly. This point is known as the 'elbow', where the graph bends.
- **Interpretation**: In the Elbow Method, one should look for a change in the gradient of the line plot; a sharp change like an elbow suggests the optimal number of clusters. The idea is that adding more clusters beyond this number does not provide much better modeling of the data.

#### Silhouette Score
- **Explanation**: The Silhouette Score is a metric for the quality of a clustering. For each point, it compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest neighboring cluster (b), giving s = (b - a) / max(a, b). The score ranges from -1 to +1, where a high value indicates that the points are well clustered.
- **Interpretation**: A Silhouette Score close to +1 indicates that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster.

Both methods provide different lenses through which to view the clustering results and can be used in conjunction to make a more informed decision. The Elbow Method gives us an insight into the variance within each cluster, whereas the Silhouette Score provides a measure of how similar an object is to its own cluster compared to others. The optimal number of clusters is often the one that balances between the two measures, subject to the specific context and use case of the analysis.
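To make the silhouette definition concrete, here is a minimal sketch that computes the silhouette value of a single point by hand and compares it with scikit-learn's `silhouette_samples`. The data comes from `make_blobs`, which is an assumption for illustration and not the mall dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Synthetic data (assumption): three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

i = 0  # index of the point to inspect
dists = np.linalg.norm(X - X[i], axis=1)  # distances from point i to every point
own = labels == labels[i]

# a: mean distance to the OTHER points in the same cluster (self-distance is 0)
a = dists[own].sum() / (own.sum() - 1)
# b: smallest mean distance to the points of any other cluster
b = min(dists[labels == c].mean() for c in set(labels) if c != labels[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X, labels)[i]
print(f"manual: {s_manual:.6f}, sklearn: {s_sklearn:.6f}")
```

The overall Silhouette Score reported by `silhouette_score` is the mean of these per-point values.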


In [None]:
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(scaled_data2)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


The elbow chart suggests that the optimum number of clusters is 6: beyond this point, adding clusters yields diminishing reductions in WCSS, the sum of squared distances between each point and the centroid of its cluster.
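Reading an elbow by eye is subjective. One common heuristic, sketched here on synthetic `make_blobs` data (an assumption, not the method used in this analysis), picks the point on the WCSS curve that lies farthest from the straight line joining its first and last points. The result depends on the range of k examined, so treat it as a cross-check rather than a definitive answer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data (assumption): five compact blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=42)

ks = np.arange(1, 11)
wcss = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
])

# Rescale both axes to [0, 1] so the curve runs from (0, 1) to (1, 0)
x = (ks - ks[0]) / (ks[-1] - ks[0])
y = (wcss - wcss[-1]) / (wcss[0] - wcss[-1])

# Perpendicular distance of each point from the line x + y = 1
dist = np.abs(x + y - 1) / np.sqrt(2)
elbow_k = int(ks[np.argmax(dist)])
print(f"Heuristic elbow at k = {elbow_k}")
```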

In [None]:
from sklearn.metrics import silhouette_score

silhouette_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(scaled_data2)
    score = silhouette_score(scaled_data2, kmeans.labels_)
    silhouette_scores.append(score)

plt.plot(range(2, 11), silhouette_scores)
plt.title('Silhouette Score for each number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()


The Silhouette Score points to 5 as the optimal number of clusters, one fewer than the elbow chart suggested, which supports our initial choice of 5 clusters.

**Advanced Implementation of Hierarchical Clustering**

In [None]:
# Plot the dendrogram with limited branches
plt.figure(figsize=(12, 6))
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=10,  # number of clusters to show
)
plt.title('Hierarchical Clustering Dendrogram (Truncated)')
plt.xlabel('Cluster size')
plt.ylabel('Distance')
plt.show()

Looking at the dendrogram, we can get most of the datapoints into clusters by drawing a line at distance 6. This identifies 5 clusters, which matches the number we settled on earlier. I truncated the dendrogram to the last 10 merged clusters to keep the plot readable; truncation affects only the display, not the clustering itself.
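Drawing a line across the dendrogram corresponds to cutting the linkage matrix into flat clusters. The sketch below shows two ways to do that with SciPy's `fcluster`: by requesting a maximum number of clusters, and by a distance threshold. It uses synthetic `make_blobs` data as an assumption, so the threshold here is derived from the merge heights rather than taken from the plot above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic data (assumption): five compact blobs
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)
Z = linkage(X, method='ward')

# Option 1: ask for at most 5 flat clusters
labels_k = fcluster(Z, t=5, criterion='maxclust')

# Option 2: cut at a height between the merge that goes from 6 to 5 clusters
# (Z[-5, 2]) and the merge that goes from 5 to 4 clusters (Z[-4, 2])
threshold = (Z[-5, 2] + Z[-4, 2]) / 2
labels_t = fcluster(Z, t=threshold, criterion='distance')

print("Clusters via maxclust: ", len(np.unique(labels_k)))
print("Clusters via threshold:", len(np.unique(labels_t)))
```

The resulting label arrays can be compared with the K-Means labels, for example with a cross-tabulation, to see how closely the two segmentations agree.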

**Comprehensive Cluster Analysis**

To go beyond the silhouette score, Gemini recommended two additional metrics as follows:

1. Calinski-Harabasz Index (Variance Ratio Criterion):

   * Measures the ratio of between-cluster dispersion to within-cluster dispersion.
   * A higher Calinski-Harabasz score generally indicates better-defined clusters.
   * It is calculated as the between-cluster sum of squares divided by the within-cluster sum of squares, with each term divided by its degrees of freedom (k - 1 and n - k, for k clusters and n points).

2. Davies-Bouldin Index:

   * Measures the average similarity of each cluster with the cluster that is most similar to it.
   * Similarity is a measure based on distances between centroids and the spread of points within clusters.
   * A lower Davies-Bouldin index indicates better clustering, with clusters that are more separated and less spread out.
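As a minimal sketch of the Calinski-Harabasz ratio, the snippet below computes it by hand and checks the result against scikit-learn's `calinski_harabasz_score`. The data is synthetic (`make_blobs`, an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Synthetic data (assumption): four blobs
X, _ = make_blobs(n_samples=200, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

n, k = len(X), len(np.unique(labels))
overall_mean = X.mean(axis=0)

between = 0.0  # between-cluster sum of squares
within = 0.0   # within-cluster sum of squares
for c in np.unique(labels):
    pts = X[labels == c]
    centroid = pts.mean(axis=0)
    between += len(pts) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((pts - centroid) ** 2)

# Each sum of squares is divided by its degrees of freedom
ch_manual = (between / (k - 1)) / (within / (n - k))
ch_sklearn = calinski_harabasz_score(X, labels)
print(f"manual: {ch_manual:.4f}, sklearn: {ch_sklearn:.4f}")
```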

In [None]:
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

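# Score the 5-cluster K-Means labels stored in data2['Cluster'];
# `kmeans` was last refit with 10 clusters in the silhouette loop above.
calinski_score = calinski_harabasz_score(scaled_data2, data2['Cluster'])
davies_bouldin = davies_bouldin_score(scaled_data2, data2['Cluster'])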

print(f"Calinski-Harabasz Index: {calinski_score}")
print(f"Davies-Bouldin Index: {davies_bouldin}")

As one would want, the Calinski-Harabasz Index is relatively high and the Davies-Bouldin Index is relatively low, indicating that the clusters are well separated and that the datapoints within each cluster are fairly similar to each other. Both indices are relative measures, so they are most informative when compared across different numbers of clusters.

**Final Analysis**

The cluster analysis reveals that the data space is divided into four corner quadrants, each containing a different combination of high and low annual income and high and low spending, plus a fifth group in the center.

*  For example, cluster 1 has high spending and high income. You might call these your *whales* - they are the customers that have both the ability and willingness to spend.  
*  Cluster 2, on the other hand, has low income but high spend.  You might call these the *fans*, because they spend a higher percentage of their income on your product.  
*  Cluster 3 has high income but low spending: these are your *targets,* because they have the capacity to be spending more than they currently are.  
*  Cluster 4 has both low income and low spend, so they are *strugglers.*  
*  Finally, cluster 0 is all those who are in the middle, having moderate income and moderate spending.  They might be *secondary targets*, because their spending ability is limited, but they could be doing more than they are.

Through this analysis, we can identify the customers we should be pitching offers to in an effort to get them to spend more with us, and avoid wasting time and money on customers that have little ability to spend more.
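Segment descriptions like the ones above can be backed by a per-cluster summary table. The sketch below builds one with `groupby`; the data is synthetic, and the column names `income` and `spending` are hypothetical stand-ins for the mall dataset columns.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers (assumption): four corner groups plus one central group
rng = np.random.default_rng(0)
centers = [(25, 20), (25, 80), (55, 50), (90, 20), (90, 80)]
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=['income', 'spending'])

# Cluster on standardized features, as in the main analysis
scaled = StandardScaler().fit_transform(df[['income', 'spending']])
df['cluster'] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scaled)

# One row per cluster: size and average income and spending
profile = df.groupby('cluster').agg(
    n_customers=('income', 'size'),
    mean_income=('income', 'mean'),
    mean_spending=('spending', 'mean'),
).round(1)
print(profile)
```

Reading the means row by row shows which cluster label corresponds to which segment, which is worth checking because K-Means assigns label numbers arbitrarily.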

- **DBSCAN Clustering**:
  - Implement DBSCAN and compare its segmentation with K-Means and Hierarchical clustering.
  - Analyze the clusters formed by DBSCAN for any unique characteristics.

- **Principal Component Analysis (PCA)**:
  - Apply PCA to the data and visualize the results.
  - Discuss how dimensionality reduction impacts the clustering results and its potential use in simplifying complex datasets.


In [None]:
from sklearn.cluster import DBSCAN

epsilon = 0.5  # I tried different values until I got a low number of noise points.
min_samples = 10 # Tried different values, this seemed like a good balance.

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
data2['DBSCAN_Cluster'] = dbscan.fit_predict(scaled_data2)

# The cluster labels in DBSCAN include -1 for noise points
# You can explore the number of clusters found and the number of noise points
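# Use .values for the membership test: `in` on a pandas Series checks the index, not the values
n_clusters_ = len(set(data2['DBSCAN_Cluster'])) - (1 if -1 in data2['DBSCAN_Cluster'].values else 0)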
n_noise_ = list(data2['DBSCAN_Cluster']).count(-1)

print(f"Estimated number of clusters: {n_clusters_}")
print(f"Estimated number of noise points: {n_noise_}")

# Visualize the DBSCAN clusters
plt.scatter(data2['Annual Income (k$)'], data2['Spending Score (1-100)'], c=data2['DBSCAN_Cluster'], cmap='viridis')
plt.title('DBSCAN Customer Segmentation')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()

The clustering with DBSCAN is similar to that with k-means, although DBSCAN classifies as noise some outlying points that k-means assigns to clusters. Adjusting epsilon up or down tends to reduce the number of clusters, so I kept this value, which preserves the five clusters we established earlier.
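Instead of tuning epsilon purely by trial and error, a common heuristic is to look at the sorted distance from each point to its `min_samples`-th nearest neighbor and choose epsilon near the bend in that curve. The sketch below computes those distances on synthetic `make_blobs` data (an assumption); it is a starting point for tuning, not a replacement for inspecting the resulting clusters.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): five compact blobs, standardized
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 10

# kneighbors(X) on the training data counts each point as its own first
# neighbor, which matches DBSCAN's min_samples (it includes the point itself)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to the min_samples-th neighbor for every point
k_dist = np.sort(distances[:, -1])

for q in (50, 75, 90, 95):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` shows the curve directly; points to the right of the bend are the ones DBSCAN is likely to label as noise at that epsilon.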

In [None]:
from sklearn.decomposition import PCA

# Apply PCA to reduce the data to 2 principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data2)

# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])

# Add the cluster labels to the PCA DataFrame (using the KMeans clusters for visualization)
pca_df['Cluster'] = data2['Cluster'] # You can use DBSCAN clusters here as well

# Visualize the data in the PCA space
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['Principal Component 1'], pca_df['Principal Component 2'], c=pca_df['Cluster'], cmap='viridis')
plt.title('Customer Segmentation (PCA Reduced)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

# You can also examine the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_
print(f"Explained variance ratio of principal components: {explained_variance_ratio}")

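Because we started with only two standardized features, PCA with two components simply rotates the data and discards no information. The roughly equal explained variance ratios indicate that the two features are only weakly correlated, so neither component captures much more variance than the other.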

The analysis remains the same as above.
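For two standardized features, the explained variance ratios from PCA are determined entirely by the correlation r between the features: they equal (1 + |r|) / 2 and (1 - |r|) / 2. The sketch below checks this on synthetic correlated data (an assumption for illustration, not the mall dataset).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two positively correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)
X = StandardScaler().fit_transform(np.column_stack([x, y]))

# Correlation between the two features
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# Explained variance ratios from PCA versus the closed-form values
ratios = PCA(n_components=2).fit(X).explained_variance_ratio_
expected = np.array([(1 + abs(r)) / 2, (1 - abs(r)) / 2])

print("correlation:", round(r, 3))
print("PCA ratios: ", ratios.round(3))
print("expected:   ", expected.round(3))
```

Ratios close to 0.5 each therefore signal weak correlation between the features, not equal importance for clustering.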