<a href="https://colab.research.google.com/github/karthikasai1828/PRODIGY_ML_02/blob/main/Mall_customer_segmentation_with_Kmeans_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Clustering identifies similarities within an unlabeled dataset and partitions the data into distinct clusters. The clusters have no predefined labels; they must be interpreted after clustering, using domain knowledge of the dataset's real-world context. Once the cluster labels are trusted, they can serve as targets for a supervised learning model that classifies new data points.

**K-Means Clustering:**

The fundamental concept of K-Means Clustering is straightforward: each data point is assigned to exactly one cluster, the one with the nearest centroid.

**Steps of K-Means:**

**Choosing K (Number of Clusters):**
1. Select an initial value for K (usually K=2).
2. Fit a K-Means model and measure the Sum of Squared Distances (SSD) from each point to its cluster center.
3. Fit a new K-Means model with K+1 and measure SSD again.
4. Repeat this process, tracking SSD across various K values until observing diminishing returns, indicating that adding extra clusters doesn't significantly enhance cluster separation (Elbow method).

**K-Means Procedure:**

1. Randomly select K distinct data points as cluster centroids.
2. Assign each remaining point to the nearest cluster centroid.
3. Compute the mean of the points assigned to each cluster to determine the new center of that cluster.
4. Reassign each point to the nearest cluster center.
5. Iterate steps 3 and 4 until no further reassignments occur.
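
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this notebook; the helper name `kmeans_sketch` and the synthetic two-blob data are assumptions made up for the example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain K-Means following the listed steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 5: stop once no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated synthetic blobs (made-up data, not the mall dataset).
rng = np.random.default_rng(42)
demo_points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])
demo_labels, demo_centroids = kmeans_sketch(demo_points, k=2)
print(demo_labels)
print(demo_centroids)
```

With blobs this far apart, each blob should end up in its own cluster. Real data is rarely this clean, which is why library implementations add refinements such as several restarts.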

**Choosing K Value:**

Evaluation of goodness of fit involves tracking the reduction in SSD for different K values.
Theoretically, as K increases (up to the number of data points), SSD tends towards zero.
In practice, observe the rate of decline in SSD across various K values to discern the optimal number of clusters, avoiding excessive additions that don't significantly improve cluster clarity.
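
As a small illustration of the point that SSD tends towards zero as K approaches the number of data points, the sketch below fits scikit-learn's `KMeans` on six made-up points for every K from 1 to 6. The data and the variable names (`tiny_points`, `ssd_by_k`) are assumptions for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points forming two loose groups.
tiny_points = np.array([
    [1.0, 1.0], [2.0, 1.5], [3.0, 1.0],
    [10.0, 9.0], [11.0, 9.5], [12.0, 9.0],
])

ssd_by_k = {}
for k in range(1, len(tiny_points) + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(tiny_points)
    ssd_by_k[k] = model.inertia_  # sum of squared distances to the nearest center

for k, ssd in ssd_by_k.items():
    print(k, round(ssd, 3))
```

When K equals the number of points, every point is its own cluster center, so the SSD is zero even though such a clustering says nothing useful about the data.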

In [15]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

In [16]:
df= pd.read_csv("/content/drive/MyDrive/prodigy_data set/Mall_Customers.csv")
df.head()

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


**Data Overview:**

This project focuses on Mall Customer Segmentation. It uses an unsupervised machine learning technique, the KMeans Clustering Algorithm, to form customer segments, and then trains two supervised classifiers, the Gradient Boost Classifier and the Naive Bayes Classifier, on the resulting cluster labels, each in its simplest form.

**Content:**

The dataset contains essential customer information gathered through membership cards at a supermarket mall. Attributes include Customer ID, age, gender, annual income, and spending score. The spending score is assigned based on predefined parameters, such as customer behavior and purchasing data.

**Problem Statement:**

As the owner of the mall, the objective is to comprehend customer segments, particularly those easily targeted [Target Customers], enabling informed marketing strategies tailored to the identified segments.

**Inspiration:**

**Through this case study, the following questions will be addressed:**

1. How to perform customer segmentation with machine learning algorithms, specifically KMeans Clustering, in Python, using a straightforward approach.
2. How to identify target customers so that marketing strategies can be initiated effectively, focusing on ease of engagement.
3. How marketing strategies work in the real world and how effective they are.

By addressing these aspects, a comprehensive understanding of customer segmentation and its practical applications in marketing strategies will be attained.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [18]:
df.columns

Index(['CustomerID', 'Gender', 'Age', 'Annual Income (k$)',
       'Spending Score (1-100)'],
      dtype='object')

# **Data Cleaning**

In [19]:
df.corr(numeric_only=True)  # skip the non-numeric 'Gender' column, which cannot be converted to float

In [20]:
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)

In [None]:
df.isnull().sum()

There are no missing values.

Creating Dummy Variables

In [None]:
df.select_dtypes(include='object')

In [None]:
df_num= df.select_dtypes(exclude='object')
df_obj= df.select_dtypes(include='object')

In [None]:
df_num.info()

In [None]:
df_obj.info()

In [None]:
# Convert object columns to numbers by one-hot encoding
# (drop_first=True drops one redundant dummy column to avoid multicollinearity)
df_obj= pd.get_dummies(df_obj, drop_first=True)
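
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up four-row frame (the `toy` data is an assumption, not the mall dataset). Without `drop_first`, the two dummy columns always sum to 1, so one of them carries no extra information.

```python
import pandas as pd

# Made-up data with a single categorical column.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (alphabetically) dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced["Gender_Male"].astype(int).tolist())
```

In the reduced frame, a value of 1 in `Gender_Male` means male and 0 means female, which is the encoding the plots later in this notebook rely on.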

In [None]:
Final_df= pd.concat([df_num, df_obj], axis=1)
Final_df.head()

In [None]:
Final_df= Final_df.drop('CustomerID', axis=1)
Final_df

# **Exploratory Data Analysis**
EDA is especially important in unsupervised learning: with no labels to guide the analysis, it builds the domain knowledge needed to interpret the clusters.

In [None]:
plt.figure(figsize=(6,6))
sns.histplot(data=Final_df, x='Age')

**The age group ranging from 30 to 35 exhibits the highest shopping activity at this mall.**

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='Gender_Male', data=Final_df)

**After encoding dummy variables, where 0 represents female and 1 represents male, it indicates that females have exhibited higher shopping activity within this mall.**

In [None]:
plt.figure(figsize=(6, 6))
sns.histplot(data=Final_df, x='Annual Income (k$)')

**The majority of customers exhibit an annual income below $80,000.**

In [None]:
plt.figure(figsize=(6, 6))
sns.histplot(data=Final_df, x='Spending Score (1-100)')

**The majority of customers exhibit a spending score hovering around 50.**

In [None]:
plt.figure(figsize=(20, 15))
sns.countplot(data=Final_df, x='Spending Score (1-100)', hue='Gender_Male' )
plt.legend(loc=(1.1, 0.5))

**Women in their 40s exhibit a keen interest in shopping.**


In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x='Annual Income (k$)',y='Spending Score (1-100)',data=Final_df)

**The relationship between Annual Income and Spending Score is evident. Surprisingly, individuals with the highest income tend to spend the same amount or even less compared to those with average incomes.**

In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x='Age',y='Annual Income (k$)',data=Final_df)

**The age group of 40-year-olds exhibits the highest mean annual income (sns.barplot shows the mean by default).**

# **Scaling the features**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X= scaler.fit_transform(Final_df)
scaled_X
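
For reference, `StandardScaler` subtracts each column's mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance. The sketch below checks this on a few made-up values (`toy_X` is an assumption for illustration); this matters for K-Means because it relies on distances, and an unscaled feature with a large range would otherwise dominate.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales.
toy_X = np.array([[19.0, 15.0], [35.0, 60.0], [60.0, 120.0]])

toy_scaled = StandardScaler().fit_transform(toy_X)

# The same result computed by hand (population standard deviation, ddof=0).
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```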

# **K Means Clustering Algorithm**

In [None]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
cluster_labels= model.fit_predict(scaled_X)
cluster_labels

In [None]:
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(scaled_X[cluster_labels == 0,0],scaled_X[cluster_labels == 0,1],scaled_X[cluster_labels == 0,2],s = 40 , color = 'red', label = "cluster 1")
ax.scatter(scaled_X[ cluster_labels== 1,0],scaled_X[ cluster_labels== 1,1],scaled_X[ cluster_labels== 1,2],s = 40 , color = 'blue', label = "cluster 2")
ax.scatter(scaled_X[ cluster_labels== 2,0],scaled_X[ cluster_labels== 2,1],scaled_X[ cluster_labels== 2,2], s = 40 , color = 'green', label = "cluster 3")
ax.set_xlabel('Age of a customer-->')
ax.set_ylabel('Annual Income-->')
ax.set_zlabel('Spending Score-->')
ax.legend()
plt.show()

## **Choosing K values**

In [None]:
from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares (inertia) for each K
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init="k-means++", n_init=10, max_iter=300)
    kmeans.fit(scaled_X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss, '*--')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')

**It seems k=5 is a good choice because we see a significant drop in the curve.**

In [None]:
new_model = KMeans(n_clusters=5)
y_pred1= new_model.fit_predict(scaled_X)

In [None]:
print("shape of y_pred1 is:", y_pred1.shape)
y_pred1

In [None]:
Final_df['Cluster']=y_pred1
Final_df

In [None]:
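# Use Age, Annual Income and Spending Score so the columns match the axis labels in the plot below
X = Final_df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values
y = new_model.fit_predict(X)  # note: this refits the model on the unscaled features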


In [None]:
X[:5]

In [None]:
y

In [None]:
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[y == 0,0],X[y == 0,1],X[y == 0,2], s = 40 , color = 'red', label = "cluster 1")
ax.scatter(X[y == 1,0],X[y == 1,1],X[y == 1,2], s = 40 , color = 'blue', label = "cluster 2")
ax.scatter(X[y == 2,0],X[y == 2,1],X[y == 2,2], s = 40 , color = 'green', label = "cluster 3")
ax.scatter(X[y == 3,0],X[y == 3,1],X[y == 3,2], s = 40 , color = 'yellow', label = "cluster 4")
ax.scatter(X[y == 4,0],X[y == 4,1],X[y == 4,2], s = 40 , color = 'purple', label = "cluster 5")
ax.set_xlabel('Age of a customer-->')
ax.set_ylabel('Annual Income-->')
ax.set_zlabel('Spending Score-->')
ax.legend()
plt.show()


Upon analysis, selecting K=5 appears optimal for clustering.

Cluster 2 comprises individuals under 40 years with notably high annual incomes, correlating with their elevated spending scores. Consequently, incentivizing this demographic with enhanced offers is prudent to sustain their engagement.

Both Cluster 2 and Cluster 4 represent prime candidates for targeted promotional offers, aimed at fostering their patronage at the mall.

In [None]:
Final_df["Target"]= y

In [None]:
clustered_df = Final_df
clustered_df

In [None]:
X = clustered_df.iloc[:,0:4]
Y = clustered_df.iloc[:,-1]
X.head()

The first line, "X = clustered_df.iloc[:, 0:4]", selects all rows and the columns at positions 0 to 3 of clustered_df (the end position 4 is exclusive). These are the features used by the classifiers that follow.

The second line, "Y = clustered_df.iloc[:, -1]", selects all rows of the last column (-1), which holds the cluster labels used as the target variable.
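
The same two slices are shown here on a made-up two-row frame with the same column layout, as a quick way to check which columns each slice returns. The values and the names `toy_df`, `features` and `target` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the column order mirrors the frame built in this notebook.
toy_df = pd.DataFrame({
    "Age": [19, 21],
    "Annual Income (k$)": [15, 15],
    "Spending Score (1-100)": [39, 81],
    "Gender_Male": [1, 1],
    "Cluster": [0, 1],
    "Target": [2, 3],
})

features = toy_df.iloc[:, 0:4]  # positions 0, 1, 2, 3 (4 is excluded)
target = toy_df.iloc[:, -1]     # last column only

print(list(features.columns))
print(target.name)
```

Note that the `Cluster` column at position 4 is left out of the features, so the classifiers never see the earlier cluster assignment.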

# **Naive Bayes Classifier**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
X_train.head()

In [None]:
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

In [None]:
from sklearn.naive_bayes import GaussianNB
NB_Model = GaussianNB()
NB_Model.fit(X_train, y_train)
y_pred1 = NB_Model.predict(X_test)


In [None]:
prediction = pd.DataFrame({'Original Value': y_test, 'Predicted Value': y_pred1})

# Print the DataFrame
display(prediction)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score

# scikit-learn metrics expect the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1, average="weighted")

print("NB_Model_Accuracy is:", accuracy)
print("NB_Model_F1 Score is:", f1)
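
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy gives the same value either way, but the weighted F1 score weights each class by its count in the first argument, so swapping the arguments can change the result. A small sketch with made-up labels (`toy_true` and `toy_pred` are assumptions):

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for a two-class problem.
toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 1, 1]

acc_right = accuracy_score(toy_true, toy_pred)
acc_swapped = accuracy_score(toy_pred, toy_true)

f1_right = f1_score(toy_true, toy_pred, average="weighted")
f1_swapped = f1_score(toy_pred, toy_true, average="weighted")

print(acc_right, acc_swapped)  # identical
print(f1_right, f1_swapped)    # differ, because the class weights change
```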

In [None]:
labels = [0,1,2,3,4]
cm = confusion_matrix(y_test, y_pred1, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot();

# **Gradient Boosting Classifier**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

In [None]:
gbc_Model = GradientBoostingClassifier(n_estimators=50,random_state=5)

In [None]:
gbc_Model.fit(X_train,y_train)
y_pred2=gbc_Model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred2)
print("gbc_Model_accuracy is:", accuracy)

In [None]:
y_pred2=gbc_Model.predict(X_test)
y_pred2

In [None]:
prediction2 = pd.DataFrame({'Original Value': y_test, 'Predicted Value': y_pred2})

# Print the DataFrame
display(prediction2)