# INTRODUCTION

Customer segmentation divides a customer base into smaller groups whose members share common characteristics. This is useful for many purposes, such as targeting marketing efforts or improving a product or service. Clustering is the technique used here: it groups data points according to how similar they are.

The goal of this project is to give the marketing team information about their target customers so that they can design their marketing strategy. To do this, customer segmentation and clustering are carried out in Python.

The first step is to load data on each customer's age, annual income, and mall spending score into a pandas DataFrame, which is the starting point for the analysis.

Next, the clustering features are chosen, which in this case are spending score, age, and annual income. Because these attributes are measured on different scales, the data may also need to be pre-processed through scaling or normalization.

Once the data is ready, a clustering algorithm divides the customers into segments according to how similar they are. Several clustering algorithms may be tried to see which one produces the most insightful segments.

Several libraries must be imported first. pandas handles data analysis and manipulation, while seaborn and matplotlib.pyplot handle visualization. The sklearn.cluster.KMeans class from scikit-learn groups data points according to how similar they are. The warnings.filterwarnings function from the warnings module suppresses warnings so that they do not clutter the notebook output.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("/kaggle/input/mall-customer-segmentation/mall-customer-segmentation.csv")

In [None]:
df.head()

# UNIVARIATE ANALYSIS

Univariate analysis examines one variable at a time. It involves studying the distribution of that variable, including its center, spread, and any unusual values.

In [None]:
df.describe()

In [None]:
sns.distplot(df['Annual Income (k$)']);

The distplot function draws a histogram and overlays a kernel density estimate (KDE), which is a smoothed approximation of the probability density function. The resulting figure shows the shape of the distribution, such as skew or multiple peaks. Note that distplot is deprecated in recent seaborn releases in favor of histplot and displot.

In [None]:
df.columns

In [None]:
columns = ['Age', 'Annual Income (k$)','Spending Score (1-100)']
for i in columns:
    plt.figure()
    sns.distplot(df[i])

This code uses seaborn to draw a distribution plot for each of three variables: Age, Annual Income (k$), and Spending Score (1-100). A for loop iterates over the column names, opens a new figure, and calls sns.distplot() on each one. The plots show the shape of each distribution and help spot outliers or irregularities in the data.

In [None]:
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(data=df, x='Annual Income (k$)', fill=True, hue='Gender');

The resulting plot shows the distribution of customers' annual income separately for men and women, so the two groups can be compared. It can be used to understand the shape of the income data and to spot outliers or irregularities.

In [None]:
columns = ['Age', 'Annual Income (k$)','Spending Score (1-100)']
for i in columns:
    plt.figure()
    # fill=True replaces the deprecated shade=True argument
    sns.kdeplot(data=df, x=i, fill=True, hue='Gender')

Each plot shows the distribution of one variable split by gender, which makes it easier to see where the male and female distributions differ and to spot outliers or irregularities.

In [None]:
columns = ['Age', 'Annual Income (k$)','Spending Score (1-100)']
for i in columns:
    plt.figure()
    sns.boxplot(data=df, x='Gender', y=i)

The sns.boxplot() function draws a box plot, a compact summary of a distribution. The box spans the first to third quartiles with a line at the median, and points beyond the whiskers are drawn individually as potential outliers.
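The quartiles and outlier limits that a box plot draws can also be computed directly. The following is a minimal sketch, assuming the common 1.5 × IQR whisker rule and a small made-up income sample rather than the mall data; the names `income`, `lower_fence`, and `upper_fence` are illustrative only.

```python
import pandas as pd

# Hypothetical income values in k$, made up for illustration only.
income = pd.Series([15, 20, 28, 40, 54, 60, 62, 70, 78, 137])

# Quartiles and interquartile range (IQR).
q1 = income.quantile(0.25)
median = income.quantile(0.50)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Assumed whisker rule: points more than 1.5 * IQR outside the box are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = income[(income < lower_fence) | (income > upper_fence)]
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Points beyond the whiskers:", outliers.tolist())
```

A flagged point is not necessarily an error; it is only far from the bulk of the data and worth a closer look.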

In [None]:
df['Gender'].value_counts(normalize=True)

# BIVARIATE ANALYSIS

Bivariate analysis examines two variables together to understand how they relate and to spot trends or patterns. It can be carried out with scatter plots, correlation coefficients, and regression analysis. It usually follows univariate analysis and is a step toward multivariate analysis, which considers several variables at once.

In [None]:
sns.scatterplot(data=df, x='Annual Income (k$)',y='Spending Score (1-100)' )

In [None]:
sns.pairplot(df,hue='Gender')

The plots are colored by the 'Gender' column. The resulting plots show the relationships between the variables, including any patterns or trends that may be present. These plots can be used to identify any correlations or trends between the variables and to understand how the variables differ between males and females.

In [None]:
# Select several columns with a list (double brackets); tuple-style selection fails in pandas 2.x
df.groupby('Gender')[['Age', 'Annual Income (k$)',
       'Spending Score (1-100)']].mean()

This groups the DataFrame by the 'Gender' column and computes the mean of three variables for each group: 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)'.

In [None]:
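# numeric_only=True skips the text column 'Gender', which pandas 2.x would otherwise reject
df.corr(numeric_only=True)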

The pandas method df.corr() computes the pairwise correlations between the numeric columns of a DataFrame and returns a DataFrame holding the correlation coefficient for each pair. The coefficient measures the strength and direction of a linear relationship between two variables and ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation). It can be used to find and understand relationships between variables in a DataFrame.
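To make the definition concrete, the Pearson coefficient can be computed by hand and compared with the pandas result. This is a sketch on made-up toy values, not on the mall dataset; the column names `age` and `spending` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the mall dataset.
toy = pd.DataFrame({
    "age": [22, 35, 47, 51, 63],
    "spending": [80, 66, 52, 41, 30],
})

x = toy["age"].to_numpy(dtype=float)
y = toy["spending"].to_numpy(dtype=float)

# Pearson r = sum of products of deviations / sqrt(product of summed squared deviations).
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas uses the Pearson method by default.
r_pandas = toy["age"].corr(toy["spending"])
print(round(r_manual, 4), round(r_pandas, 4))
```

A value near -1 here reflects that spending falls as age rises in the toy data; it says nothing about the real customers.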

In [None]:
sns.heatmap(df.corr(numeric_only=True),annot=True,cmap='coolwarm')

This code uses seaborn to draw a heat map of the correlations between the variables in the df DataFrame. The correlations are calculated with df.corr(), annot=True writes each coefficient inside its cell, and the 'coolwarm' color map shades negative values blue and positive values red.

# Clustering - Univariate, Bivariate, Multivariate

The goal of clustering is to identify patterns and relationships within the data that may not be evident when looking at each data point individually. The first step is to instantiate the algorithm.

In [None]:
clustering1 = KMeans(n_clusters=3)

This creates an instance of the KMeans class, a common unsupervised clustering algorithm. The n_clusters parameter is set to 3, so the algorithm will divide the data into three clusters. The clustering1 object can then be fitted to a dataset and used to predict cluster assignments for additional data points.
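The fit-then-predict workflow can be sketched as follows. The incomes are made-up values, and `random_state` and `n_init` are set only so the sketch is reproducible; neither is used in the notebook's own cells.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical incomes in k$ forming three well-separated groups.
incomes = np.array([[15], [18], [20], [60], [62], [65], [120], [125], [130]])

# random_state and n_init are assumptions made for reproducibility.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(incomes)

# Assign previously unseen customers to the nearest learned centroid.
new_incomes = np.array([[17], [63], [128]])
new_labels = model.predict(new_incomes)
print(new_labels)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful.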

In [None]:
clustering1.fit(df[['Annual Income (k$)']])

In [None]:
clustering1.labels_

In [None]:
df['Income Cluster'] = clustering1.labels_
df.head()

In [None]:
df['Income Cluster'].value_counts()

In [None]:
clustering1.inertia_

The inertia_ attribute stores the sum of squared distances between each data point and the centroid of its cluster. It can be used to assess how tightly KMeans has grouped a particular dataset and to compare results for different n_clusters values. A lower inertia_ suggests tighter clusters, but inertia always decreases as n_clusters grows, so it has to be judged relative to the number of clusters.
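The definition of inertia can be checked by recomputing it from the fitted labels and centroids. This is a sketch on synthetic two-blob data; the blob locations and the `random_state` and `n_init` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical two-feature data: two synthetic blobs, not the mall customers.
points = np.vstack([
    rng.normal(loc=[20, 80], scale=3, size=(30, 2)),
    rng.normal(loc=[80, 20], scale=3, size=(30, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each point to its own centroid.
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```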

In [None]:
intertia_scores=[]
for i in range(1,11):
    kmeans=KMeans(n_clusters=i)
    kmeans.fit(df[['Annual Income (k$)']])
    intertia_scores.append(kmeans.inertia_)

On each iteration of the loop, the KMeans algorithm is fit to the data with the current value of n_clusters and the value of the inertia_ attribute is appended to the intertia_scores list. The intertia_scores list will contain the values of inertia_ for n_clusters ranging from 1 to 10. This code is used to evaluate the performance of the KMeans algorithm for different values of n_clusters and to identify the optimal number of clusters.

In [None]:
intertia_scores

In [None]:
plt.plot(range(1,11),intertia_scores)

This creates a line plot of the values in the intertia_scores list using the plot() function from matplotlib.pyplot. The x-values are the n_clusters values from 1 to 10, and the y-values are the corresponding inertia_ values. The plot shows how inertia_ falls as n_clusters increases and helps determine an appropriate number of clusters.

In [None]:
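# Select several columns with a list (double brackets); tuple-style selection fails in pandas 2.x
df.groupby('Income Cluster')[['Age', 'Annual Income (k$)',
       'Spending Score (1-100)']].mean()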

# Bivariate Clustering

In [None]:
clustering2 = KMeans(n_clusters=5)
clustering2.fit(df[['Annual Income (k$)','Spending Score (1-100)']])
df['Spending and Income Cluster'] =clustering2.labels_
df.head()

In [None]:
intertia_scores2=[]
for i in range(1,11):
    kmeans2=KMeans(n_clusters=i)
    kmeans2.fit(df[['Annual Income (k$)','Spending Score (1-100)']])
    intertia_scores2.append(kmeans2.inertia_)
plt.plot(range(1,11),intertia_scores2)

On each iteration of the loop, the KMeans algorithm is fit to the df[['Annual Income (k$)','Spending Score (1-100)']] DataFrame with the current value of n_clusters and the value of the inertia_ attribute is appended to the intertia_scores2 list. After the loop has completed, a line plot is created with the values of n_clusters on the x-axis and the values of inertia_ on the y-axis.

In [None]:
centers =pd.DataFrame(clustering2.cluster_centers_)
centers.columns = ['x','y']

This generates a DataFrame in which the coordinates of the centroids of the clusters found by the KMeans method are saved. The centers DataFrame, for example, can be used to see the centroids' locations on a scatter plot.

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(x=centers['x'],y=centers['y'],s=100,c='black',marker='*')
sns.scatterplot(data=df, x ='Annual Income (k$)',y='Spending Score (1-100)',hue='Spending and Income Cluster',palette='tab10')
plt.savefig('clustering_bivariate.png')

This plots the data in the df DataFrame as a scatter plot, using the 'Annual Income (k$)' column on the x-axis and the 'Spending Score (1-100)' column on the y-axis. The scatter plot points are colored based on the values in the 'Spending and Income Cluster' column. The scatter plot also shows the positions of the cluster centroids as black stars.

In [None]:
pd.crosstab(df['Spending and Income Cluster'],df['Gender'],normalize='index')

In [None]:
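# Select several columns with a list (double brackets); tuple-style selection fails in pandas 2.x
df.groupby('Spending and Income Cluster')[['Age', 'Annual Income (k$)',
       'Spending Score (1-100)']].mean()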

In [None]:
# Multivariate clustering
from sklearn.preprocessing import StandardScaler

In [None]:
scale = StandardScaler()

The StandardScaler class is a preprocessing tool that can be used to standardize the features of a dataset by scaling them to have zero mean and unit variance.
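What standardization does to each column can be sketched on a few made-up rows. The values below are hypothetical, and the manual formula uses the population standard deviation, which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age and income (k$) columns on very different scales.
features = np.array([[19.0, 15.0], [35.0, 60.0], [52.0, 78.0], [64.0, 137.0]])

scaled = StandardScaler().fit_transform(features)

# Equivalent manual computation: z = (x - mean) / std, with population std (ddof=0).
manual = (features - features.mean(axis=0)) / features.std(axis=0)

# Each scaled column should now have mean 0 and standard deviation 1.
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Without this step, a distance-based method such as KMeans would be dominated by whichever feature has the largest numeric range.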

In [None]:
df.head()

In [None]:
dff = pd.get_dummies(df,drop_first=True)
dff.head()

This converts the categorical variables in the df DataFrame to dummy (indicator) variables and stores the result in a new DataFrame called dff. The indicator variables are created with pd.get_dummies(), and drop_first=True removes the first level of each categorical variable so that the remaining columns are not redundant. The head() method then displays the first five rows of dff.
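The effect of drop_first can be seen on a tiny made-up frame. The rows below are hypothetical, and `dtype=int` is an optional choice made here so the indicator prints as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical rows, made up for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 35, 52],
})

# 'Female' sorts first and is dropped, so a single Gender_Male column remains.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

With two categories, one indicator column carries all the information: a 0 in Gender_Male implies Female.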

In [None]:
dff = dff[['Age', 'Annual Income (k$)', 'Spending Score (1-100)','Gender_Male']]
dff.head()

In [None]:
# Standardize once and keep the column names for readability
dff = pd.DataFrame(scale.fit_transform(dff), columns=dff.columns)
dff.head()

In [None]:
intertia_scores3=[]
for i in range(1,11):
    kmeans3=KMeans(n_clusters=i)
    kmeans3.fit(dff)
    intertia_scores3.append(kmeans3.inertia_)
plt.plot(range(1,11),intertia_scores3)

This plot can be used to identify the optimal number of clusters by finding the "elbow" of the plot, which is the point at which the value of inertia_ starts to decrease more slowly.
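Reading the elbow off a plot is a judgment call. One rough way to automate it is to pick the point where the curve bends most sharply, measured by the second difference of the inertia values. The helper below, `elbow_by_second_difference`, is a hypothetical heuristic written for this sketch and not part of scikit-learn, and the inertia values are made up.

```python
import numpy as np

def elbow_by_second_difference(k_values, inertias):
    """Return the k where the inertia curve bends most sharply (heuristic)."""
    inertias = np.asarray(inertias, dtype=float)
    # Second difference: (drop before k) minus (drop after k), for interior points.
    second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # Offset by one because the first k has no preceding drop.
    return k_values[int(np.argmax(second_diff)) + 1]

# Hypothetical inertia values for k = 1..8, made up for illustration.
ks = list(range(1, 9))
example_inertias = [1000, 600, 350, 150, 120, 100, 85, 75]
best_k = elbow_by_second_difference(ks, example_inertias)
print(best_k)
```

The result should be treated as a suggestion to check against the plot, since noisy or smoothly decaying curves may have no clear elbow.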

In [None]:
df

In [None]:
df.to_csv('Clustering-output.csv')

# SUMMARY

The purpose of this project was to identify the most important customer groups for a marketing team to target using data analysis and machine learning approaches. The project's data consisted of information about mall customers, such as their age, gender, annual income, and spending score.

The data was imported into a Pandas DataFrame to begin the analysis, and several visualizations were created to explore the correlations between the variables. Univariate plots were used to depict the distribution of each individual variable, and bivariate plots were used to visualize the interactions between pairs of variables. The data was also summarized using statistical measures such as the mean and standard deviation.

The data was then clustered to identify groups of customers with similar characteristics. Customers were divided into clusters using the KMeans algorithm based on their annual income and spending score. The number of clusters was chosen by locating the "elbow" in a plot of the inertia_ values for various n_clusters values.

Finally, the clustering findings were visualized and analyzed in order to better understand the characteristics of the various clusters and to determine the most important consumer groups for the marketing team to target.