<a href="https://colab.research.google.com/github/pandharkardeep/ML_Mini_Project/blob/main/ML_MiniProj_Task2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Pandharkar  60009220220

## Importing the required Libraries and the Dataset

In [None]:
import pandas as pd
import plotly.express as px
from sklearn.preprocessing import StandardScaler

# K Means
from sklearn.cluster import KMeans

# Metrics
from sklearn.metrics import silhouette_score

In [None]:
df = pd.read_csv('Mall_Customers.csv')
df.head()

Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


### Removing Redundant Columns
The `CustomerID` column is only a row identifier and will not be used in building our ML model, so we remove it.

In [None]:
df.drop(columns=["CustomerID"], inplace=True)
df.head()

Unnamed: 0,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,Male,19,15,39
1,Male,21,15,81
2,Female,20,16,6
3,Female,23,16,77
4,Female,31,17,40


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Genre                   200 non-null    object
 1   Age                     200 non-null    int64 
 2   Annual Income (k$)      200 non-null    int64 
 3   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 6.4+ KB


In [None]:
df.rename(columns = {'Genre':'Gender'}, inplace = True)

## Plotting necessary charts and Data

In [None]:
pie_plot = px.pie(df, names="Gender", hole=0.4, title="Gender Distribution")
pie_plot.show()

The gender distribution shows a slight imbalance: approximately 56% of the samples are female and 44% are male.

This imbalance is small, only 6 percentage points from an even split in either direction. Its impact on the analysis is therefore expected to be minimal, and we proceed while keeping it in mind when comparing the two groups.
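
The gender shares discussed above can be computed directly instead of being read off the pie chart. The following is a minimal sketch on a small made-up sample; the `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame: 5 females, 4 males
toy = pd.DataFrame({"Gender": ["Female"] * 5 + ["Male"] * 4})

# normalize=True returns proportions instead of raw counts
gender_share = toy["Gender"].value_counts(normalize=True)

# Deviation of each group from an even 50/50 split, in percentage points
deviation_pp = (gender_share - 0.5) * 100

print(gender_share.round(3))
print(deviation_pp.round(1))
```

On the notebook's data the same call would be `df["Gender"].value_counts(normalize=True)`.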

In [None]:
age_hist = px.histogram(df, x="Age", title="Age Histogram")
age_hist.show()

age_box = px.box(df, x="Age", color="Gender", title="Age Box Plot")
age_box.show()

The histogram and box plots show a broad age distribution, from a minimum of 18 to a maximum of 70 years. The interquartile range (IQR), from the first quartile (Q1) to the third quartile (Q3), lies roughly between 27 and 51, so the data is concentrated on adults. The median age is 37 for males and slightly lower, 35, for females.

The age range also differs between genders. The maximum age for males is 70, while females have a slightly narrower range, with a maximum age of 68. The histogram appears roughly symmetrical, resembling a normal distribution curve, with the peak density of individuals in the 30 to 34 age range.

This shows that the 30 to 34 age bracket is the most common in our dataset. Overall, the analysis identifies the predominant age groups and how they are represented across genders.
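
The quartiles that a box plot draws can also be tabulated per gender. This is a sketch under the assumption of a tiny invented sample (`toy`); the ages are hypothetical and only chosen to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical ages; the real notebook would use df instead of toy
toy = pd.DataFrame({
    "Gender": ["Male"] * 5 + ["Female"] * 5,
    "Age": [18, 25, 37, 50, 70, 19, 28, 35, 47, 68],
})

# Min, quartiles and max of Age for each gender
age_summary = (
    toy.groupby("Gender")["Age"]
    .quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    .unstack()
)
age_summary.columns = ["min", "Q1", "median", "Q3", "max"]

# The IQR is the width of the box in a box plot
age_summary["IQR"] = age_summary["Q3"] - age_summary["Q1"]
print(age_summary)
```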

In [None]:
# The keys in `labels` must match the column name exactly for the relabeling to apply
annual_income_hist = px.histogram(df, x="Annual Income (k$)", title="Annual Income Histogram", labels={"Annual Income (k$)": "Annual Income"})
annual_income_hist.show()

annual_income_box = px.box(df, x="Annual Income (k$)", color="Gender", title="Annual Income Box Plot", labels={"Annual Income (k$)": "Annual Income"})
annual_income_box.show()

The histogram of annual income shows most individuals at low to middle income levels. The most prominent peak, around 72-79K, marks the most common income range (the mode) rather than the average; the median income is lower, at 61.5K. A tail extending to higher incomes shows a small number of individuals earning considerably more, mirroring real-world income disparities where only a few attain exceptionally high earnings.

Turning to the box plot, we observe a rightward (positive) skew in annual income: the interquartile range (IQR) sits toward the lower end of the overall range, with a longer tail toward high incomes. This skew is evident for both genders, though more pronounced among females, whose IQR is wider than that of males. The wider spread suggests greater income variation among females, although with only 200 samples and an uneven gender split this difference should be interpreted with caution.
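
The direction of skew can be checked numerically rather than judged by eye. In the sketch below, a positive sample skewness indicates a longer tail toward high values (right skew) and a negative one a longer tail toward low values (left skew). The income values are hypothetical, chosen only to include a few large earners.

```python
import pandas as pd

# Hypothetical incomes (k$) with a few large values forming a right tail
income = pd.Series([15, 20, 40, 54, 60, 62, 70, 78, 120, 137])

# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail
skewness = income.skew()

# In a right-skewed sample the mean is usually pulled above the median
mean, median = income.mean(), income.median()
print(f"skew={skewness:.2f}, mean={mean:.1f}, median={median:.1f}")
```

On the notebook's data the equivalent call would be `df["Annual Income (k$)"].skew()`.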

In [None]:
# The keys in `labels` must match the column name exactly for the relabeling to apply
spending_score_hist = px.histogram(df, x="Spending Score (1-100)", title="Spending Score Histogram", labels={"Spending Score (1-100)": "Spending Score"})
spending_score_hist.show()

spending_score_box = px.box(df, x="Spending Score (1-100)", color="Gender", title="Spending Score Box Plot", labels={"Spending Score (1-100)": "Spending Score"})
spending_score_box.show()

The spending scores, on a scale from 1 to 100, follow a roughly bell-shaped distribution with abrupt drops toward both extremes. Most individuals fall in the mid-range, approximately between 30 and 70, while notable numbers of both low and high scorers remain.

The box plot by gender shows that the interquartile range (IQR) for males is wider than for females, indicating greater variability among male spending scores. Female scores cluster somewhat more toward the upper end of the scale. Despite this difference in spread, both genders share the same median spending score of 50.

This suggests a modest gender difference in spending behavior: female scores lean higher, even though the median is shared.

In [None]:
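# Spearman rank correlation between the numeric features (Gender is categorical, so drop it)
corr = df.drop(columns=["Gender"]).corr(method="spearman").round(2)
corr_fig = px.imshow(corr, text_auto=True, title="Correlation Matrix")
corr_fig.show()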

The correlation matrix shows the pairwise relationships between the numeric features.

Between age and annual income, a correlation coefficient of 0.02 indicates a negligible, nearly nonexistent relationship. In this dataset, a higher age does not correspond to a higher annual income.

Age has a moderate negative correlation (-0.34) with spending score: as individuals age, their spending score tends to decrease. This is in line with the common expectation that older individuals spend more conservatively than younger ones.

Annual income shows no significant correlation with either age or spending score, which suggests that it varies largely independently of both within our dataset.

Because these coefficients capture only monotonic pairwise relationships, weak values here do not rule out group structure in the data, which motivates the clustering analysis that follows.
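
The matrix above uses `method="spearman"`, which is the Pearson correlation applied to the ranks of the values rather than to the values themselves. A small sketch of that equivalence follows; the `toy` frame and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical values: the score falls as age rises, but not linearly
toy = pd.DataFrame({
    "Age": [19, 23, 31, 40, 52, 67],
    "Score": [81, 77, 40, 42, 20, 14],
})

# Spearman correlation, as used for the correlation matrix
spearman = toy.corr(method="spearman").loc["Age", "Score"]

# Same number obtained by ranking each column and applying Pearson
pearson_on_ranks = toy.rank().corr(method="pearson").loc["Age", "Score"]

print(round(spearman, 3), round(pearson_on_ranks, 3))
```

Working on ranks makes the coefficient less sensitive to outliers and to non-linear but monotonic trends, which may be why it suits bounded scores such as the spending score.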

In [None]:
# trendline="ols" requires the statsmodels package to be installed
px.scatter(df, x="Age", y="Spending Score (1-100)", trendline="ols", title="Age vs Spending Score")

The correlation matrix showed a negative coefficient (-0.34) between spending score and age. This is also evident in the scatter plot.



In [None]:
px.scatter_3d(df, x="Age", y="Spending Score (1-100)", z="Annual Income (k$)", title="Age vs Spending Score vs Annual Income", color="Gender")

In the 3D scatter plot of annual income, spending score, and age, five clusters appear to be visible, the largest and most prominent of which occupies the central region of the plot.

Rather than one dominant cluster, the groups are spread across the plot, covering different combinations of age, annual income, and spending score. This suggests that the customers can be segmented into distinct groups.

The gender feature does not seem to affect the formation of these clusters.

## Applying Preprocessing on the Data

### Scaling the Data

In [None]:
scaler = StandardScaler()

# Apply Scaler
scaled_data = scaler.fit_transform(df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]])
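
`StandardScaler` rescales each column to zero mean and unit variance. K-Means relies on Euclidean distances, so without scaling a feature with a large numeric range would dominate the clustering. The sketch below reproduces the transformation by hand on a hypothetical two-column array (`X` is invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature sample on different scales
X = np.array([[19.0, 15.0], [35.0, 60.0], [50.0, 78.0], [70.0, 137.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```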

### Label Encoding the Data

In [None]:
df['Gender'] = df['Gender'].map({'Male':1,'Female':0})

In [None]:
df.head()

Unnamed: 0,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,19,15,39
1,1,21,15,81
2,0,20,16,6
3,0,23,16,77
4,0,31,17,40


## Applying `KMeans` Model on the Data

In [None]:
km = KMeans(
    n_clusters=5,
    n_init=25,
    random_state=42
)

# Fit the algorithm
km.fit(scaled_data)

# Evaluate
wcss = km.inertia_
print(f"Within-Cluster Sum of Squares (WCSS): {wcss}")

Within-Cluster Sum of Squares (WCSS): 168.24758017556837


In [None]:
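# Cast labels to strings so Plotly uses discrete colors instead of a continuous color scale
px.scatter_3d(df, x="Age", y="Spending Score (1-100)", z="Annual Income (k$)", title="K Means Clusters (5)", color=km.labels_.astype(str))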

## Evaluation -  Best `n` clusters

### Elbow Method

Here we vary the number of clusters and compute the WCSS (Within-Cluster Sum of Squares) for each value.

In [None]:
n_clusters = list(range(2, 15))

# Scores
wcsss = []

# Loop over all the clusters
for n in n_clusters:
    km_ = KMeans(n, n_init=25, random_state=42)
    km_.fit(scaled_data)
    wcsss.append(km_.inertia_)

# The Elbow method
line_chart = px.line(x=n_clusters, y=wcsss, labels={"y":"Inertia (WCSS)", "x":"No. Clusters"}, title="The Elbow Method")
line_chart.show()

Based on the elbow plot, there is some ambiguity between 4 and 6 clusters. Taking 4 as the elbow gives the clearer inflection point in the inertia curve; inertia is lower for 6 clusters, as expected when more clusters are added, but the curve shows no distinct elbow there. On the elbow method alone, 4 clusters appears to be the best choice.
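
The quantity plotted in the elbow chart, `inertia_`, is the sum of squared distances from each point to the centroid of its own cluster. The following sketch recomputes it by hand on a few hypothetical 2-D points (`X` and `km_toy` are illustrative names, not part of the notebook).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to
own_centroids = km_toy.cluster_centers_[km_toy.labels_]

# WCSS: total squared distance between points and their own centroids
manual_wcss = ((X - own_centroids) ** 2).sum()

print(manual_wcss, km_toy.inertia_)
```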

### Silhouette Method

In [None]:
scores = []

# Loop over all the clusters
for n in n_clusters:
    km_ = KMeans(n, n_init=25, random_state=42)
    km_.fit(scaled_data)
    scores.append(silhouette_score(scaled_data, km_.labels_))

# The Silhouette Method
line_chart = px.line(x=n_clusters, y=scores, labels={"y":"Silhouette Score", "x":"No. Clusters"}, title="The Silhouette Method")
line_chart.show()

Looking at this graph, the silhouette score is highest for 6 clusters. By this criterion, the best number of clusters is 6.
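
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster; `silhouette_score` averages this over all points. The sketch below runs the same scan as above on synthetic data with three well-separated groups, which is an assumption made purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Silhouette score for each candidate number of clusters
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores_by_k[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette is the preferred one
best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k, round(scores_by_k[best_k], 3))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping ones.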

## Fitting Model on Best Number of Clusters(6)

In [72]:
km_ = KMeans(6, n_init=25, random_state=42,init = 'k-means++')
km_.fit(scaled_data)
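# Cast labels to strings so Plotly uses discrete colors instead of a continuous color scale
px.scatter_3d(df, x="Age", y="Spending Score (1-100)", z="Annual Income (k$)", title="K Means Clusters (6)", color=km_.labels_.astype(str))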

### Trying to Predict using the Model

In [74]:
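sample = pd.DataFrame([[40, 75, 87]], columns=["Age", "Annual Income (k$)", "Spending Score (1-100)"])
print(km_.predict(scaler.transform(sample)))  # scale first: the model was fit on standardized features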

[4]


### Getting the PKL file of Model
This file will be used for deploying the model with Flask.

In [None]:
import joblib
joblib.dump(km_, "spend_model.pkl")

['spend_model.pkl']
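
Because the model was fit on standardized features, a deployed service also needs the fitted scaler to transform incoming values. One possible approach, sketched here under assumptions, is to bundle the scaler and the clusterer in a scikit-learn pipeline and persist that single object. The data, the number of clusters, and the file name `spend_pipeline.pkl` are hypothetical and not part of the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw (unscaled) rows: Age, Annual Income (k$), Spending Score
X_raw = np.array([[19, 15, 39], [21, 15, 81], [45, 80, 15],
                  [50, 85, 10], [30, 78, 88], [32, 82, 90]], dtype=float)

# Scaler and clusterer travel together, so callers can pass raw values
pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
pipe.fit(X_raw)

# Illustrative file name in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "spend_pipeline.pkl")
joblib.dump(pipe, path)

# The restored pipeline scales the input before assigning a cluster
restored = joblib.load(path)
new_label = restored.predict(np.array([[40.0, 75.0, 87.0]]))
print(new_label)
```

With this design the Flask side would only need to load one file and call `predict` on raw inputs.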

## Adding Clusters to our Data

In [None]:
# Add Clusters to our data
df["Clusters"] = km_.labels_

# Convert to one hot
df = pd.get_dummies(df, columns=["Gender"])
df.head()

Unnamed: 0,Age,Annual Income (k$),Spending Score (1-100),Clusters,Gender_0,Gender_1
0,19,15,39,0,False,True
1,21,15,81,0,False,True
2,20,16,6,5,True,False
3,23,16,77,0,True,False
4,31,17,40,5,True,False


In [None]:
df.agg({
    "Age":"median",
    "Annual Income (k$)":"median",
    "Spending Score (1-100)":"median",
    "Gender_0":"mean",
    "Gender_1":"mean"
})

Age                       36.00
Annual Income (k$)        61.50
Spending Score (1-100)    50.00
Gender_0                   0.56
Gender_1                   0.44
dtype: float64

In [None]:
df.groupby("Clusters").agg({
    "Age":"median",
    "Annual Income (k$)":"median",
    "Spending Score (1-100)":"median",
    "Gender_0":"mean",
    "Gender_1":"mean"
})

Clusters,Age,Annual Income (k$),Spending Score (1-100),Gender_0,Gender_1
0,23.0,24.0,77.0,0.565217,0.434783
1,54.0,54.0,49.0,0.577778,0.422222
2,43.0,86.0,16.0,0.424242,0.575758
3,26.0,60.0,50.0,0.641026,0.358974
4,32.0,79.0,83.0,0.538462,0.461538
5,46.0,25.0,15.0,0.619048,0.380952


Segmenting the data by cluster reveals distinct profiles. Cluster 0, with a median age of 23, is a predominantly young-adult group. Its median annual income of 24K indicates a relatively modest financial standing, yet its median spending score of 77 is the second highest among all clusters. The highest median spending score (83) belongs to cluster 4, which combines a median age of 32 with a high median income of 79K. These profiles show that spending behavior differs markedly across demographic segments.