# Mall Customers Segmentation Analysis

This notebook performs customer segmentation using K-Means clustering on the provided `Mall_Customers.csv` dataset. The analysis includes data loading, exploratory data analysis, preprocessing, clustering, and visualization of the results.

## 1. Load Dataset

First, we import the necessary libraries and load the dataset into a pandas DataFrame.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans

# Load the dataset
df = pd.read_csv('Mall_Customers.csv')
df.head()

## 2. Explore Dataset

Let's explore the dataset to understand its structure and summary statistics.

In [None]:
# Display basic info and summary statistics.
# A cell renders only its last expression, so the summary is printed explicitly.
df.info()
print(df.describe())

df['Gender'].value_counts()

## 3. Data Preprocessing

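We check for missing values, encode the categorical `Gender` column as integers, and standardize the features. Scaling matters here because K-Means relies on Euclidean distances, so a feature with a larger numeric range would otherwise dominate the clustering.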

In [None]:
# Check for missing values
print(df.isnull().sum())

# Encode 'Gender' as integers (classes are numbered in sorted order; see le.classes_)
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# Select features for clustering
X = df[['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
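
To make the two preprocessing steps concrete, here is a minimal sketch on a tiny hand-made frame. The `toy` values are hypothetical and not taken from `Mall_Customers.csv`; the explicit mapping and the manual z-score are meant only to illustrate what `LabelEncoder` and `StandardScaler` compute, not to replace them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from Mall_Customers.csv)
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [20, 30, 40, 50],
})

# LabelEncoder numbers classes in sorted order, so this explicit
# mapping gives the same integers for these two labels
toy['Gender'] = toy['Gender'].map({'Female': 0, 'Male': 1})

# StandardScaler computes z = (x - mean) / std per column,
# using the population standard deviation (ddof=0)
values = toy.to_numpy(dtype=float)
z = (values - values.mean(axis=0)) / values.std(axis=0)

print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1 for every column
```

After standardization every column contributes on a comparable scale to the Euclidean distances that K-Means minimizes.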

## 4. Perform Analysis

We will use the K-Means clustering algorithm to segment the customers. First, we will determine the optimal number of clusters using the Elbow Method, then fit the K-Means model and assign cluster labels.
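
The quantity plotted by the Elbow Method is the within-cluster sum of squares (WCSS), which scikit-learn exposes as `inertia_`. The sketch below computes it by hand on four toy points to show why it drops sharply once k matches the natural grouping. The helper `wcss` and the toy arrays are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy 2-D points forming two obvious groups (hypothetical values)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: every point belongs to the single centroid at the overall mean
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = points.mean(axis=0, keepdims=True)

# k = 2: one centroid per group
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])

print(wcss(points, labels_k1, centroids_k1))  # 104.0
print(wcss(points, labels_k2, centroids_k2))  # 4.0
```

WCSS can only decrease as k grows, so the useful signal is the point where the decrease flattens out rather than the minimum itself.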

In [None]:
# Find the optimal number of clusters using the Elbow Method.
# n_init is set explicitly because its default changed across scikit-learn versions.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fit KMeans with the chosen number of clusters.
# k=5 is assumed here; confirm it against the elbow in the plot above.
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df['Cluster'] = clusters
df.head()
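
Because reading an elbow can be subjective, a second criterion can help when choosing k. The sketch below uses the silhouette score, a different technique from the Elbow Method used in this notebook, on synthetic blobs rather than the mall data. The names `X_demo`, `scores`, and `best_k` are illustrative; the same loop could be run on `X_scaled`, but no result for the real dataset is claimed here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the mall data)
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# The silhouette score needs at least two clusters, so start at k = 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher is better; values close to 1 indicate compact, well-separated clusters
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

If the silhouette score and the elbow plot point to different values of k, it is worth inspecting both candidate segmentations before settling on one.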

## 5. Visualize Results

Let's visualize the clusters to better understand the customer segments. We'll use a scatter plot of annual income against spending score, a count plot of cluster sizes, and a table of cluster centers in the original units.

In [None]:
# Visualize clusters by Annual Income and Spending Score
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x='Annual Income (k$)', y='Spending Score (1-100)',
    hue='Cluster', palette='Set1', data=df, s=60, alpha=0.8
)
plt.title('Customer Segments by Income and Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend(title='Cluster')
plt.show()

# Visualize cluster sizes
plt.figure(figsize=(6, 4))
sns.countplot(x='Cluster', data=df, palette='Set1')
plt.title('Number of Customers in Each Cluster')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()

# Optional: cluster centers converted back to the original units.
# The 'Gender' value of a center is an average of the 0/1 encoding, not a category.
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=X.columns)
centers_df
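
Another way to profile the segments, assuming the labels are stored in a `Cluster` column as in this notebook, is to average the original columns per cluster and add the cluster sizes. The sketch below uses a hypothetical four-row frame; for a converged K-Means fit, such per-cluster means should closely match the inverse-transformed centers.

```python
import pandas as pd

# Toy frame with already-assigned cluster labels (hypothetical values)
toy = pd.DataFrame({
    'Annual Income (k$)': [15, 17, 80, 90],
    'Spending Score (1-100)': [80, 90, 10, 20],
    'Cluster': [0, 0, 1, 1],
})

# Mean of every feature within each cluster, in original units
profile = toy.groupby('Cluster').mean()

# Cluster sizes, useful next to the means when segments are unbalanced
profile['Size'] = toy.groupby('Cluster').size()

print(profile)
```

On the real data, the equivalent call would group `df` by `Cluster` over the feature columns selected for `X`.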