# Customer Segmentation Project.
# Market Basket Analysis.

# Case Study

Businesses have always needed to understand customer behaviour so that products and services can be tailored for maximum profitability. For this case study, we refer to a dataset of customer shopping data covering each customer’s gender, city, annual income, credit score, and spending score, found here. The data was collected in several cities in India, as will be seen in the dataset. We will use data visualization (in Python) to compare the different features of the dataset.

#  1. Defining the Goal of Customer Segmentation

Customer segmentation is the process of dividing a customer base into groups of individuals who are similar in ways that are relevant to marketing. Customer behaviour drives the segmentation: using clustering, companies such as malls, supermarkets and restaurants can identify segments of customers and target their potential user base. We will divide customers into groups according to common characteristics such as gender, city, annual income, credit score, and spending score. This gives us a deeper understanding of customer preferences and of what is required to discover the valuable segments that would help maximize profit for the company.
Secondly, it lets us plan marketing techniques more efficiently and reduce the risk of investment.
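
To make the idea of clustering concrete, here is a minimal sketch of k-means written with NumPy only. It is not the method used later in this project; the function name `kmeans`, the choice of two clusters, and the starting centroids are assumptions made for illustration. The ten points are the (annual income, spending score) pairs of the first five and last five customers in the data set.

```python
import numpy as np

# (annual income in k$, spending score) for customers 1-5 and 196-200.
points = np.array([
    [15, 39], [15, 81], [16, 6], [16, 77], [17, 40],
    [120, 79], [126, 28], [126, 74], [137, 18], [137, 83],
], dtype=float)


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm; assumes no cluster ever becomes empty."""
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        centroids = np.array(
            [points[labels == k].mean(axis=0) for k in range(len(centroids))]
        )
    return labels, centroids


# Assumed starting centroids: the first and the last customer.
labels, centroids = kmeans(points, points[[0, -1]])
print(labels)
print(centroids)
```

With these starting centroids the two groups split by income. In practice a library implementation with several random restarts would be a more robust choice.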

#  2. Get the Data

We need to understand the data set in detail, so we begin by developing a brief understanding of the data we will be working with: for example, how many features the data set has, how many unique labels there are, and how the labels are distributed.
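
As a rough sketch of those three questions, the snippet below builds a small frame from the first five rows of the data set and asks for its shape, the number of unique values per column, and the distribution of one label. The variable names are assumptions chosen for this example.

```python
import pandas as pd

# First five rows of the mall data set, typed in by hand for illustration.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# How many rows and features are there?
n_rows, n_features = sample.shape

# How many unique values does each column hold?
unique_counts = sample.nunique()

# How are the labels of a categorical column distributed?
gender_counts = sample["Gender"].value_counts()

print(n_rows, n_features)
print(unique_counts)
print(gender_counts)
```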

In [1]:
# We import the libraries for basic mathematical operations and tabular data that we intend to use in developing our model project.
import pandas as pd
import numpy as np
from pandas import plotting

#  Load the Data set.

In [2]:
#We now load the data set.

In [3]:
Data = pd.read_csv("C:\\Users\\nongaya\\Desktop\\Jenga-Project\\Mall_Customers.csv", header="infer")  # We now read the data.
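
The absolute Windows path above only works on the original machine. A possible alternative, sketched here under the assumption that the CSV is kept in a folder relative to the notebook, is to build the path with `pathlib` and fail early with a clear message when the file is missing. The helper name `load_customers` and the `data` folder are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_customers(csv_path):
    """Read the customer CSV, raising a clear error if the file is absent."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    # header="infer" is the pandas default, so it does not need to be passed.
    return pd.read_csv(csv_path)


# Hypothetical usage, assuming the file sits in a "data" folder:
# Data = load_customers(Path("data") / "Mall_Customers.csv")
```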

In [4]:
Data

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18


In [5]:
Data.head()  # Show the first five rows of the data set to understand the features.

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [6]:
Data.tail()  # Show the last five rows of the data set.

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18
199,200,Male,30,137,83


In [7]:
Data.tail().T  # Transpose the last five rows so that each feature becomes a row, which makes the column headers easy to check.

Unnamed: 0,195,196,197,198,199
CustomerID,196,197,198,199,200
Gender,Female,Female,Male,Male,Male
Age,35,45,32,32,30
Annual Income (k$),120,126,126,137,137
Spending Score (1-100),79,28,74,18,83


In [8]:
Data.sample(10)

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
129,130,Male,38,71,75
7,8,Female,23,18,94
139,140,Female,35,74,72
99,100,Male,20,61,49
35,36,Female,21,33,81
184,185,Female,41,99,39
112,113,Female,38,64,42
107,108,Male,54,63,46
21,22,Male,25,24,73
190,191,Female,34,103,23


In [9]:
Data.T  # Transpose the whole data set so that features are rows and customers are columns.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
CustomerID,1,2,3,4,5,6,7,8,9,10,...,191,192,193,194,195,196,197,198,199,200
Gender,Male,Male,Female,Female,Female,Female,Female,Female,Male,Female,...,Female,Female,Male,Female,Female,Female,Female,Male,Male,Male
Age,19,21,20,23,31,22,35,23,64,30,...,34,32,33,38,47,35,45,32,32,30
Annual Income (k$),15,15,16,16,17,17,18,18,19,19,...,103,103,113,113,120,120,126,126,137,137
Spending Score (1-100),39,81,6,77,40,76,6,94,3,72,...,23,69,8,91,16,79,28,74,18,83


# 3. Clean the Data.

In [10]:
Data.isnull().sum()  # Check whether there are any missing values.

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

In [11]:
# This shows that our data set does not have any missing values, which means it is clean.
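
For contrast, here is a small sketch of what the same checks report when a value is missing. The three-row frame `demo` is made up for this example and is not part of the data set.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Age value.
demo = pd.DataFrame({
    "Age": [19, np.nan, 20],
    "SpendingScore": [39, 81, 6],
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()

# True if at least one value anywhere in the frame is missing.
any_missing = demo.isnull().values.any()

print(missing_per_column)
print(any_missing)
```

A non-zero count in any column would mean that the data needs cleaning before the analysis continues.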

In [12]:
len(Data)  # Shows how many rows the data set contains.

200

In [13]:
Data.info()  # Displays all columns and their data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [14]:
Data.describe()  # Shows basic descriptive statistics for all numeric columns in the data set, including the count, mean, standard deviation, min, quartiles and max.

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0,200.0
mean,100.5,38.85,60.56,50.2
std,57.879185,13.969007,26.264721,25.823522
min,1.0,18.0,15.0,1.0
25%,50.75,28.75,41.5,34.75
50%,100.5,36.0,61.5,50.0
75%,150.25,49.0,78.0,73.0
max,200.0,70.0,137.0,99.0


# 4. Enrich the data set to obtain reports.

In [15]:
#We now check for the data types in the data set.
Data.dtypes

CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object

In [17]:
# We now list the columns in the data set.
Data.columns

Index(['CustomerID', 'Gender', 'Age', 'Annual Income (k$)',
       'Spending Score (1-100)'],
      dtype='object')

In [18]:
# We now rename the columns in the data set.

Data.rename(columns={'Annual Income (k$)':'AnnualIncome','Spending Score (1-100)':'SpendingScore'},inplace=True)

In [19]:
# We loop over the columns of the data set to print each column name with its position.
for i,col in enumerate(Data.columns):
    print((i+1),'. columns is :',col)

1 . columns is : CustomerID
2 . columns is : Gender
3 . columns is : Age
4 . columns is : AnnualIncome
5 . columns is : SpendingScore


In [20]:
# Let's count the rows and columns in our data set.
Data.shape

(200, 5)

In [21]:
# Check the count of null values in each column of our data set.
Data.isnull().sum()

CustomerID       0
Gender           0
Age              0
AnnualIncome     0
SpendingScore    0
dtype: int64

In [22]:
# Let's check each feature for null values. False means the feature has no missing values, so our data is clean.
print(list(Data.isnull().any()))

[False, False, False, False, False]


In [23]:
# We check whether the data set contains any null value at all.
Data.isnull().values.any()

False

In [24]:
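# Check the correlation between the numeric columns of our data set.
# numeric_only=True (pandas 1.5+) skips the text column Gender; pandas 2.0+ raises an error without it.
Data.corr(numeric_only=True)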

Unnamed: 0,CustomerID,Age,AnnualIncome,SpendingScore
CustomerID,1.0,-0.026763,0.977548,0.013835
Age,-0.026763,1.0,-0.012398,-0.327227
AnnualIncome,0.977548,-0.012398,1.0,0.009903
SpendingScore,0.013835,-0.327227,0.009903,1.0


In [25]:
Data.iloc[:, 1:].corr(numeric_only=True)  # Correlation between the features, leaving out CustomerID (column 0); numeric_only=True (pandas 1.5+) skips Gender.

Unnamed: 0,Age,AnnualIncome,SpendingScore
Age,1.0,-0.012398,-0.327227
AnnualIncome,-0.012398,1.0,0.009903
SpendingScore,-0.327227,0.009903,1.0
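
The figures above are Pearson correlation coefficients, which is the default method of `corr()`. As a sketch of what is being computed, the snippet below evaluates the Pearson formula by hand on the first five rows and compares it with the pandas result. The helper name `pearson` is an assumption for this example, and the value for five rows will differ from the value for all 200 rows.

```python
import numpy as np
import pandas as pd

# Age and spending score of the first five customers.
rows = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31],
    "SpendingScore": [39, 81, 6, 77, 40],
})


def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


manual = pearson(rows["Age"], rows["SpendingScore"])
builtin = rows["Age"].corr(rows["SpendingScore"])  # pandas default is Pearson
print(manual, builtin)
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates little or none.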


# 5. Find Insights and Visualize the Data set.

We intend to drop the customer ID from our data set, since it is only an identifier and carries no insight of its own. Its high correlation with AnnualIncome (0.977548) appears to reflect the order of the rows, which run from the lowest to the highest income, rather than a real relationship. We therefore end up with Age, AnnualIncome, SpendingScore and Gender in our data set.
We will use these features to visualize the data set.
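
One way to drop the column is sketched below on a five-row copy of the renamed data set. The name `features` is an assumption for this example. `drop` returns a new frame by default, so the original `Data` keeps its CustomerID column.

```python
import pandas as pd

# First five rows of the data set, using the renamed columns.
Data = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "AnnualIncome": [15, 15, 16, 16, 17],
    "SpendingScore": [39, 81, 6, 77, 40],
})

# Remove the identifier column and keep the remaining features.
features = Data.drop(columns=["CustomerID"])
print(list(features.columns))
```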