# **Project Name**    -



##### **Project Type**    - Unsupervised (Clustering)
##### **Contribution**    - Individual
##### **Contributor**    - Prashant Pratap Singh


# **Project Summary -**

In this project we work with an online retail dataset containing information about customer purchases. Our task is to extract as much information as we can about our customers to inform our business strategies. For this purpose we will divide the dataset into a number of groups (clusters), each containing customers with similar characteristics. These characteristics can include gender, age, monetary value of purchases, frequency of purchases, recency of purchase, etc. Clustering lets us make our marketing strategies more targeted and efficient, resulting in better business prospects. Each group (cluster) will have specific marketing strategies based on its characteristics. Our project will thus help improve the business of the company.

# **Problem Statement**


Our task here is to extract as much information as we can about our customers from the online retail dataset. For this purpose we have to divide our dataset into different clusters (groups) based on certain characteristics.
These groups will then be used to make targeted marketing strategies.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np # for numerical calculations
import pandas as pd # for data analysis
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
import datetime as dt #for date manipulation
import math #for mathematical calculations (numpy.math is not available in recent NumPy versions)
from sklearn.preprocessing import StandardScaler #for scaling the data
from sklearn.metrics import silhouette_score #for clustering evaluation
from sklearn.cluster import KMeans #for clustering

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("Online Retail.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

Our dataset has 541909 rows and 8 columns





### Dataset Information

In [None]:
# Dataset Info
df.info()

Description and CustomerID have null values



#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

Our dataset has 5268 duplicate entries

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(f"Total number of null values in the dataset is {df.isnull().sum().sum()}\n")
df_null = df.isnull().sum()
df_null

There are 136534 null values in total, with CustomerID having the most (135080)

In [None]:
# Visualizing the missing values
plt.figure(figsize = (10,10))
sns.barplot(x = df_null.index, y = df_null.values)
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.show()

### What did you know about your dataset?

Our dataset has -

1. 541909 rows , 8 columns
2. 5268 duplicate values
3. 136534 missing values

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

1. InvoiceNo - Invoice Number
2. StockCode - Stock Name Code
3. Description - Description of Product
4. Quantity - Quantity Purchased
5. InvoiceDate - Date of Purchase
6. UnitPrice - Price of one unit
7. CustomerID - Unique Id of Customer
8. Country - Country of Customer

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

We can check the unique values for each variable here. InvoiceNo has the most unique values followed by InvoiceDate

## 3. ***Data Wrangling***

### Data Wrangling Code

**1. Handling Missing Values**

In [None]:
#removing all rows with null values in certain columns

df.dropna(subset = ['InvoiceNo','Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'], inplace = True )

df.shape #checking number of rows after removal

Now there are only 406829 rows after removing all the null value rows

In [None]:
#checking number of null values in each variable
df.isnull().sum()

Now there are no variables with null values

**2. Removing Duplicate Entries**

In [None]:
# finding number of duplicate entries
df.duplicated().sum()

We have to now remove the duplicate entries

In [None]:
#removing duplicates
df.drop_duplicates(inplace = True)
df.duplicated().sum()

Now there are no duplicates left in the dataset

**3. Removing Outliers**

We will remove outliers in the Quantity and UnitPrice variables using the IQR method.
The IQR (interquartile range) is the difference between the 75th-percentile value and the 25th-percentile value. We took these percentile values from the output of the describe method above

In [None]:
#removing outliers in quantity variable
percentile_75 = 10
percentile_25 = 1

IQR = percentile_75 - percentile_25

UR = percentile_75 + 1.5*IQR
LR = percentile_25 - 1.5*IQR

df = df[(df["Quantity"]<=UR) & (df["Quantity"]>=LR) & (df["Quantity"]>0)]
df.shape


We calculated the upper range (UR) and lower range (LR) and removed all values outside this range, along with any non-positive quantities
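The quartiles above are typed in by hand from the `describe()` output. As a minimal sketch, the same bounds can be computed directly from the data with `Series.quantile`. The helper name `iqr_bounds` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) IQR fences for a numeric series (hypothetical helper)."""
    q1 = series.quantile(0.25)   # 25th percentile
    q3 = series.quantile(0.75)   # 75th percentile
    iqr = q3 - q1                # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Toy data, not taken from the retail dataset
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower, upper = iqr_bounds(toy)

# Keep only the values inside the fences
filtered = toy[(toy >= lower) & (toy <= upper)]
```

Computing the fences from the current dataframe keeps them in sync with the data if earlier cleaning steps change the quartiles.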

In [None]:
#removing outliers in Unit Price variable
percentile_75 = 4.13
percentile_25 = 1.25

IQR = percentile_75 - percentile_25

UR = percentile_75 + 1.5*IQR
LR = percentile_25 - 1.5*IQR

df["UnitPrice"] = df["UnitPrice"].astype(float)

df = df[(df["UnitPrice"]<=UR) & (df["UnitPrice"]>=LR) & (df["UnitPrice"]>0)]
df.shape


We calculated the upper range (UR) and lower range (LR) and removed all values outside this range, along with any non-positive unit prices

In [None]:
#checking for any anomaly in country
df["Country"].value_counts()

We didn't find any outliers in the other columns

**4. Creating New Columns**

In [None]:
#creating a new column total price
df["Total Price"] = df['Quantity']*df['UnitPrice']
df.head()

In [None]:
df.describe()

In [None]:
#removing outliers in Total Price
percentile_75 = 16.6
percentile_25 = 3.75

IQR = percentile_75 - percentile_25

UR = percentile_75 + 1.5*IQR
LR = percentile_25 - 1.5*IQR

df = df[(df["Total Price"]<=UR) & (df["Total Price"]>=LR) & (df["Total Price"]>0)]
df.shape


Successfully removed outliers in Total Price

In [None]:
#extracting year,month,day from invoice date

df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["InvoiceYear"] = df["InvoiceDate"].dt.year
df["InvoiceMonth"] = df["InvoiceDate"].dt.month_name()
df["InvoiceDay"] = df["InvoiceDate"].dt.day_name()
df.head()

### What all manipulations have you done and insights you found?

Manipulations done -

1. We have removed all rows having null values in 'InvoiceNo', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country'. This removed a total of 135,080 rows (541909 - 406829)

2. We have removed 5225 duplicate entries

3. We used interquartile range(IQR) method to remove 85511 rows having outliers in the columns quantity and unit price

4. We have made a new column Total Price to find the total monetary value of each purchase. We made this by multiplying Quantity and UnitPrice

5. Removed outliers in Total Price Variable using IQR method

6. We made new columns InvoiceYear, InvoiceMonth, InvoiceDay to find the year, month and day of purchase

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#plotting histogram of quantity
sns.histplot(data=df , x=df["Quantity"])
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is used to show the frequency distribution of the numeric variable Quantity

##### 2. What is/are the insight(s) found from the chart?

Customers mostly buy products in quantities from 1 to 12

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By knowing the quantity preferences of customers our clients can optimize their inventory thus increasing profits

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#plotting histogram of unitprice
sns.histplot(data=df , x=df["UnitPrice"])
plt.show()

##### 1. Why did you pick the specific chart?

Histogram is used to plot the frequency of a single continuous variable UnitPrice

##### 2. What is/are the insight(s) found from the chart?

Customers mostly buy products having unit price in the range of $ 0-6

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Our clients can optimize their inventory by preferring products in the most popular price range thus increasing their profit

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#plotting histogram of TotalPrice
sns.histplot(data=df , x=df["Total Price"])
plt.show()

##### 1. Why did you pick the specific chart?

Histogram is used to plot the frequency of a single continuous variable Total Price

##### 2. What is/are the insight(s) found from the chart?

Mostly customers purchase in the price range of $ 0-20

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By knowing the price sensitivity of customers our clients can optimize their marketing strategies and inventories thus increasing their business prospects

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#plotting barplot of top 10 buying countries
countries = df["Country"].value_counts() # series having count of countries in descending order
top_10_countries = countries.head(10) # series with top 10 buying countries
plt.figure(figsize = (14,8)) # adjusting figure size
sns.barplot(x = top_10_countries.index , y = top_10_countries.values )
plt.show()

##### 1. Why did you pick the specific chart?

Barplot is used to plot a categorical variable country against its counts in the dataset

##### 2. What is/are the insight(s) found from the chart?

United Kingdom far exceeds other countries in terms of purchases

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since most of the purchases come from the home country, marketing strategies should focus on it more than on other countries

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#plotting countplot of InvoiceYear
sns.countplot(x = df.InvoiceYear)
plt.show()

##### 1. Why did you pick the specific chart?

Countplot is chosen to plot the counts of categorical variable InvoiceYear

##### 2. What is/are the insight(s) found from the chart?

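The number of purchases recorded in 2011 is far higher than in 2010. Two yearly totals cannot establish an exponential trend, and the comparison is only fair if both years are covered equally by the dataset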

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business and marketing strategies of our client are working fine looking at the growth of the company. They should carry forward their strategies.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#plotting countplot of InvoiceMonth
plt.figure(figsize=(12,6))
sns.countplot(x = df.InvoiceMonth)
plt.show()


##### 1. Why did you pick the specific chart?

Countplot is chosen to plot the counts of categorical variable InvoiceMonth

##### 2. What is/are the insight(s) found from the chart?

The last 4 months of the year see the highest sales

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Inventories should be increased and marketing strategies directed towards the end of the year to increase profits

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#plotting country against average total price
data = df.groupby("Country")["Total Price"].mean() #grouping country and total price
data_sorted = data.sort_values(ascending=False).head(10) # sorting the prices in descending order and taking the top 10 countries
plt.figure(figsize=(15,6))
sns.barplot(x = data_sorted.index , y = data_sorted.values)
plt.ylabel("Mean Total Price")
plt.show()

##### 1. Why did you pick the specific chart?

Barplot is chosen to plot a categorical variable country against a continuous variable Mean Total Price

##### 2. What is/are the insight(s) found from the chart?

Barring Czech Republic, all the other top buying countries have a similar mean total price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Like the previous recommendation of focusing on the home country this time also we got the insight that foreign buying is not up to the mark.
It is again recommended to focus on the marketing strategies and inventories in the home country

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#plotting InvoiceYear against average total price
data = df.groupby("InvoiceYear")["Total Price"].mean()
data_sorted = data.sort_values(ascending=False).head(10)
sns.barplot(x = data_sorted.index , y = data_sorted.values)
plt.ylabel("Mean Total Price")
plt.show()

##### 1. Why did you pick the specific chart?

Barplot is chosen to plot a categorical variable InvoiceYear against a continuous variable Mean Total Price

##### 2. What is/are the insight(s) found from the chart?

Mean Total price is almost the same for the years 2010 and 2011

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Despite the sharp increase in total sales in 2011 compared to 2010, the average sale value is almost the same in both years. We advise our clients to recalibrate their marketing strategies so that customers buy more on average.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#plotting InvoiceMonth against mean total price
data = df.groupby("InvoiceMonth")["Total Price"].mean()
data_sorted = data.sort_values(ascending=False).head(10)
plt.figure(figsize = (10,6))
sns.barplot(x = data_sorted.index , y = data_sorted.values)
plt.ylabel("Mean Total Price")
plt.show()

##### 1. Why did you pick the specific chart?

Barplot is chosen as we have to plot a categorical variable against a continuous variable

##### 2. What is/are the insight(s) found from the chart?

Although most of the buying takes place in the last 4 months, the average price remains almost the same for every month

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Like the previous insight our client must change their strategies so as to increase the average buying by customers.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
#plotting InvoiceDay against total price
data = df.groupby("InvoiceDay")["Total Price"].sum()
data_sorted = data.sort_values(ascending=False).head(10)
plt.figure(figsize = (10,6))
sns.barplot(x = data_sorted.index , y = data_sorted.values)
plt.ylabel("Total Price")
plt.show()

##### 1. Why did you pick the specific chart?

Barplot is chosen to plot a categorical variable (InvoiceDay) against a continuous variable (Total Price)

##### 2. What is/are the insight(s) found from the chart?

Weekdays see more buying from customers as compared to weekends

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Unexpectedly, weekdays see more buying than weekends. Weekends should be the most productive time for business. If this issue is sorted out, then business prospects can be greatly increased

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#plotting unitprice with Quantity
sns.lineplot(x = df.Quantity, y = df.UnitPrice)
plt.show()

##### 1. Why did you pick the specific chart?

Linechart is chosen as both are continuous variables

##### 2. What is/are the insight(s) found from the chart?

Unit price of product decreases as the quantity of the product increases

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There should be a check on the drastic decrease in unit price of products as their quantity increases. It should be checked whether any discount offered on buying more products is impacting the business prospects of the company.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Plotting top 10 customers against total price
data = df.groupby("CustomerID")["Total Price"].sum()
data_sorted = data.sort_values(ascending=False).head(10)
plt.figure(figsize = (10,6))
sns.barplot(x = data_sorted.index , y = data_sorted.values)
plt.ylabel("Total Price")
plt.show()

##### 1. Why did you pick the specific chart?

Barplot is chosen to plot a categorical variable CustomerId against a continuous variable total price

##### 2. What is/are the insight(s) found from the chart?

Our high value customers are spending significant amount on purchases with some customers having spent as high as $ 30000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These high value customers have a high spending potential. Our client will benefit greatly from having special marketing strategies for them

#### Chart - 13

In [None]:
# Chart - 13 visualization code
#plotting InvoiceMonth against total prices for different years
plt.figure(figsize = (12,5))
sns.barplot(x = df["InvoiceMonth"] , y = df["Total Price"]  , hue = df.InvoiceYear , estimator = np.sum)
plt.show()

##### 1. Why did you pick the specific chart?

This chart is chosen to plot a categorical variable (InvoiceMonth) against continuous variable (Total Price)

##### 2. What is/are the insight(s) found from the chart?

We have plotted the sales of each month across the two years. December is the only month where sales in the two years are comparable. 2011 saw a sharp increase in sales

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Client should continue the policies they have implemented after 2010 as that has resulted in multiple times increase in sales

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
df_new = df[["Quantity" , "UnitPrice" , "Total Price"]] #creating a new dataframe with only continuous variables
sns.heatmap(df_new.corr(), cmap="YlGnBu", annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

Heatmap is used to plot the relationship between different continuous variables

##### 2. What is/are the insight(s) found from the chart?

There is a positive correlation between quantity and total price, and between unit price and total price. There is a negative correlation between quantity and unit price

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# pairplot creates its own figure, so a separate plt.figure call is not needed
sns.pairplot(df_new)
plt.show()

##### 1. Why did you pick the specific chart?

Pair plot is chosen to plot graphs between different continuous variables

##### 2. What is/are the insight(s) found from the chart?

1. Quantity decreases as unit price increases
2. Total Price increases as unit price increases
3. Total Price increases as quantity increases

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

We will perform hypothesis testing on three variables: Quantity, Unit Price and Total Price. The hypotheses we will test are discussed below

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Claim - Most of the product quantities bought by individual customers are less than or equal to 12

Null Hypothesis(Ho): p = 0.5

Alternate Hypothesis(Ha): p < 0.5

Here p is the proportion of entries with Quantity less than or equal to 12, and we take significance level = 0.05
If our p value comes below the significance level then we reject the null hypothesis and, with it, the claim

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
n = 1000 # number of samples taken

df_sample = df.sample(n=n) #taking a random sample of n rows from the dataset

df_sample_quantity_less_than_equal_to_12 = df_sample[df_sample["Quantity"]<=12] #finding entries with quantity less than or equal to 12 in df_sample

#finding proportion of entries with quantity less than or equal to 12 in df_sample
probability_quantity_less_than_equal_to_12 = len(df_sample_quantity_less_than_equal_to_12)/n

probability_quantity_less_than_equal_to_12

In [None]:
#finding test statistic z
p_sample  = probability_quantity_less_than_equal_to_12
p_claim = 0.5
q = 1- p_claim

z = (p_sample - p_claim)/np.sqrt(p_claim*q/n)
z

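For z values of 3.50 and higher, the cumulative area to the left of the test statistic is taken as 0.9999, so for this left-tailed test the p value is 0.9999
The p value is greater than 0.05, so we fail to reject the null hypothesis: the sample gives no evidence that the proportion is below 0.5, which is consistent with our claim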
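Instead of reading the value from a z-table, the left-tail area can be computed directly. The sketch below uses only the standard library. The sample proportion of 0.9 is an illustrative assumption, not a result from this notebook, and `left_tail_p_value` is a hypothetical helper name.

```python
import math

def left_tail_p_value(z):
    """P(Z <= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

n = 1000          # sample size used in the notebook
p_claim = 0.5     # proportion under the null hypothesis
p_sample = 0.9    # illustrative value only, not a measured result

# Same test statistic as in the cell above
z = (p_sample - p_claim) / math.sqrt(p_claim * (1 - p_claim) / n)

# Left-tailed p value, matching Ha: p < 0.5
p_value = left_tail_p_value(z)
```

A p value close to 1 in a left-tailed test means the sample proportion lies well above 0.5, so the null hypothesis is not rejected in favour of Ha.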

##### Which statistical test have you done to obtain P-Value?

We performed a one-sample z-test for a proportion. Since our test statistic z is greater than 3.50, we took the p value as 0.9999, which is the standard table value for such z values

##### Why did you choose the specific statistical test?

We chose this test because we have a sample proportion and a claimed proportion. We compute the test statistic z with the formula above and find the p value corresponding to it. Comparing the p value with the significance level (0.05) tells us whether to reject or fail to reject the null hypothesis

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Claim - Most of the products have unit price less than 5

Null Hypothesis(Ho): p = 0.5

Alternate Hypothesis(Ha): p < 0.5

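Here p is the proportion of entries with UnitPrice less than 5, and we take significance level = 0.05

If our p value comes below the significance level then we reject the null hypothesis and, with it, the claim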

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
n = 1000 # number of samples taken

df_sample = df.sample(n=1000) #taking sample of the dataset

df_sample_unitprice_less_than_5 = df_sample[df_sample["UnitPrice"]<5] #finding entries with unit price less than 5 in df_sample

#finding probability of entries with unit price less than 5 in df_sample
probability_unitprice_less_than_5 = len(df_sample_unitprice_less_than_5)/n

probability_unitprice_less_than_5

In [None]:
#finding test statistic z
p_sample  = probability_unitprice_less_than_5
p_claim = 0.5
q = 1- p_claim

z = (p_sample - p_claim)/np.sqrt(p_claim*q/n)
z

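For z values of 3.50 and higher, the cumulative area to the left of the test statistic is taken as 0.9999, so for this left-tailed test the p value is 0.9999. The p value is greater than 0.05, so we fail to reject the null hypothesis: the sample gives no evidence that the proportion is below 0.5, which is consistent with our claim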

##### Which statistical test have you done to obtain P-Value?

We performed a one-sample z-test for a proportion. Since our test statistic z is greater than 3.50, we took the p value as 0.9999, which is the standard table value for such z values

##### Why did you choose the specific statistical test?

We chose this test because we have a sample proportion and a claimed proportion. We compute the test statistic z with the formula above and find the p value corresponding to it. Comparing the p value with the significance level (0.05) tells us whether to reject or fail to reject the null hypothesis

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Claim - Most of the sales have total price less than 30

Null Hypothesis(Ho): p = 0.5

Alternate Hypothesis(Ha): p < 0.5

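Here p is the proportion of entries with Total Price less than 30, and we take significance level = 0.05

If our p value comes below the significance level then we reject the null hypothesis and, with it, the claim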

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
n = 1000 # number of samples taken

df_sample = df.sample(n=1000) #taking sample of the dataset

df_sample_totalprice_less_than_30 = df_sample[df_sample["Total Price"]<30] #finding entries with total price less than 30 in df_sample

#finding probability of entries with total price less than 30 in df_sample
probability_totalprice_less_than_30 = len(df_sample_totalprice_less_than_30)/n

probability_totalprice_less_than_30

In [None]:
#finding test statistic z
p_sample  = probability_totalprice_less_than_30
p_claim = 0.5
q = 1- p_claim

z = (p_sample - p_claim)/np.sqrt(p_claim*q/n)
z

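For z values of 3.50 and higher, the cumulative area to the left of the test statistic is taken as 0.9999, so for this left-tailed test the p value is 0.9999. The p value is greater than 0.05, so we fail to reject the null hypothesis: the sample gives no evidence that the proportion is below 0.5, which is consistent with our claim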

##### Which statistical test have you done to obtain P-Value?

We performed a one-sample z-test for a proportion. Since our test statistic z is greater than 3.50, we took the p value as 0.9999, which is the standard table value for such z values

##### Why did you choose the specific statistical test?

We chose this test because we have a sample proportion and a claimed proportion. We compute the test statistic z with the formula above and find the p value corresponding to it. Comparing the p value with the significance level (0.05) tells us whether to reject or fail to reject the null hypothesis

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

Already handled missing values in data wrangling section

#### What all missing value imputation techniques have you used and why did you use those techniques?

Removed all rows having missing values in columns 'InvoiceNo', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country', using the dropna function. No values were imputed; the rows were dropped because these variables are essential for our analysis

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

Already handled outliers in data wrangling section

##### What all outlier treatment techniques have you used and why did you use those techniques?

We removed outliers in the Quantity, UnitPrice and Total Price variables using the IQR method. The IQR is the difference between the 75th-percentile value and the 25th-percentile value. We found the upper range and lower range for our variables with the formulas below

Upper Range = 75th percentile value + 1.5 × IQR

Lower Range = 25th percentile value - 1.5 × IQR

We removed any value outside the upper and lower range

This method is used as it is a reliable method to remove outliers in continuous variables

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

There is no need for categorical encoding in our project

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In our analysis and ML implementation we have not found any need for textual preprocessing

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

We have created new features "Total Price" , "InvoiceYear" , "InvoiceMonth" , "InvoiceDay" for better understanding of customer behaviour

In [None]:
# Manipulate Features to minimize feature correlation and create new features


#### 2. Feature Selection

We will create new dataframes rfm_df and qp_df where we will create new features like "Recency", "Frequency","Monetary","Quantity"(Average), "Total Price"(Average). These features will be used in our model

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

We will take the log of our variables before feeding them to the algorithm. This is the only transformation we apply. It compresses the ranges of the variables and brings their distributions closer to a normal distribution
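As a small sketch of what the log does to a right-skewed variable, the example below uses made-up values rather than the retail data. The natural log is only defined for positive inputs, so non-positive values have to be removed (or shifted) before the transform is applied.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values; the zero stands in for a non-positive entry
values = pd.Series([0, 1, 10, 100, 1000], name="Monetary")

positive = values[values > 0]   # drop non-positive entries first
logged = np.log(positive)       # natural log compresses the large values

spread_before = positive.max() - positive.min()   # range on the original scale
spread_after = logged.max() - logged.min()        # range on the log scale
```

On these toy values the range shrinks from 999 to about 6.9, which is why the log helps when a few customers have very large totals.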

In [None]:
# Transform Your data


### 6. Data Scaling

In [None]:
# Scaling your data

We will use standard scaler to scale our data

##### Which method have you used to scale your data and why?

We will use the standard scaler because it rescales each feature to mean 0 and variance 1, which puts all features on a comparable scale. This matters for K-means, which relies on distances and would otherwise be dominated by the features with the largest ranges
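A quick check of this behaviour on toy numbers (not the RFM features) shows each column ending up with mean 0 and unit variance, whatever its original range:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

scaled = StandardScaler().fit_transform(toy)

col_means = scaled.mean(axis=0)   # approximately 0 for each column
col_stds = scaled.std(axis=0)     # approximately 1 for each column (population std)
```

After scaling, the two toy columns become identical even though one was 100 times larger than the other, so neither dominates a distance calculation.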

### 7. Dimensionality Reduction

We have not used any dimensionality reduction technique

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In clustering we don't need to split the data into training and test data

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

We don't need to handle imbalanced data

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

Our first model is RFM (Recency, Frequency, Monetary Value) model

Recency, frequency, monetary value is a marketing analysis tool used to identify a company's or an organization's best customers by measuring and analyzing spending habits.

The RFM model is based on three quantitative factors:

Recency: How recently a customer has made a purchase

Frequency: How often a customer makes a purchase

Monetary Value: How much money a customer spends on purchases
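To make the three definitions concrete, here is a minimal sketch on a made-up table of four purchases. The column names mirror this notebook; the reference date and the values are illustrative assumptions.

```python
import pandas as pd

# Made-up purchases for two customers
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08",
                                   "2011-11-10", "2011-11-30"]),
    "Total Price": [10.0, 15.0, 5.0, 20.0],
})

reference_date = pd.Timestamp("2011-12-10")  # assumed "today" for the example

rfm = toy.groupby("CustomerID").agg(
    # days between the reference date and the customer's latest purchase
    Recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    # number of purchase rows per customer
    Frequency=("InvoiceNo", "count"),
    # total amount spent per customer
    Monetary=("Total Price", "sum"),
)
```

Frequency here counts rows per customer. Using `"nunique"` on `InvoiceNo` instead would count distinct invoices, which is a different but equally common definition.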

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

# we take 2011-12-10 as the reference date, one day after the last invoice date (2011-12-09)
latest_date = dt.datetime(2011,12,10)

# calculating days since last purchase , frequency of purchases and total price spent by each customer
rfm_df = df.groupby("CustomerID").agg({"InvoiceDate" : lambda x : (latest_date - x.max()).days ,
                                       "InvoiceNo" : lambda x : len(x) , "Total Price" : lambda x : x.sum()})

# renaming columns
rfm_df.rename(columns = {"InvoiceDate" : "Recency" , "InvoiceNo" : "Frequency" , "Total Price" : "Monetary"}, inplace = True)

# changing datatype of recency
rfm_df["Recency"] = rfm_df["Recency"].astype(int)
rfm_df.reset_index().head()

In [None]:
# descriptive stats of rfm_df
rfm_df.describe()

In [None]:
# plotting distribution of rfm values
fig,axes = plt.subplots(2,2 , figsize = (20,13))
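# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(rfm_df["Recency"] , bins = 50 , kde = True , ax = axes[0,0])
sns.histplot(rfm_df["Frequency"] , bins = 50 , kde = True , ax = axes[0,1])
sns.histplot(rfm_df["Monetary"] , bins = 50 , kde = True , ax = axes[1,0])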

axes[0,0].set_title("Recency Distribution")
axes[0,1].set_title("Frequency Distribution")
axes[1,0].set_title("Monetary Distribution")
plt.show()


In [None]:
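# keeping only strictly positive values, since the log of zero or a negative number is undefined
rfm_df = rfm_df[(rfm_df["Recency"]>0) & (rfm_df["Frequency"]>0) & (rfm_df["Monetary"]>0)].copy()

# applying log transformation in recency, frequency, monetary variables
rfm_df["Recency_log"] = np.log(rfm_df["Recency"])
rfm_df["Frequency_log"] = np.log(rfm_df["Frequency"])
rfm_df["Monetary_log"] = np.log(rfm_df["Monetary"])
rfm_df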

We applied the log transformation to bring the ranges of the three variables closer together and to make their distributions closer to a normal distribution.
We will now visualize the log-transformed variables.

In [None]:
# keeping only rows with positive values (the log of zero or a negative number is undefined)
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later
rfm_df = rfm_df[(rfm_df["Recency"]>0) & (rfm_df["Frequency"]>0) & (rfm_df["Monetary"]>0)].copy()
rfm_df.head()

In [None]:
fig,axes = plt.subplots(2,2 , figsize = (20,13))
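# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(rfm_df["Recency_log"] , bins = 50 , kde = True , ax = axes[0,0])
sns.histplot(rfm_df["Frequency_log"] , bins = 50 , kde = True , ax = axes[0,1])
sns.histplot(rfm_df["Monetary_log"] , bins = 50 , kde = True , ax = axes[1,0])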

axes[0,0].set_title("Recency_log Distribution")
axes[0,1].set_title("Frequency_log Distribution")
axes[1,0].set_title("Monetary_log Distribution")
plt.show()

The log transformation has resulted in distributions that are much closer to normal, as we can see from the graphs.
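The visual impression can be backed by a number: skewness is far from 0 for a long-tailed variable and close to 0 for a roughly symmetric one. The sketch below uses synthetic log-normal data as a stand-in for a column such as Monetary (the `spend` series is an assumption, not data from this project) and compares the skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (an assumption; stands in for a column such as Monetary)
rng = np.random.default_rng(0)
spend = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

raw_skew = spend.skew()          # strongly positive for right-skewed data
log_skew = np.log(spend).skew()  # near 0 when the log values are roughly symmetric

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

The same two calls could be applied to `rfm_df["Monetary"]` and `rfm_df["Monetary_log"]` to check the real columns.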

In [None]:
# scaling the data

features = rfm_df[["Recency_log","Frequency_log","Monetary_log"]].values
scaler = StandardScaler()
X = scaler.fit_transform(features)

**IMPLEMENTING K MEANS CLUSTERING**

In [None]:
#elbow method to find out the best number of clusters

inertia = []

for k in range(1,10):
  kmeans = KMeans(n_clusters = k , random_state = 0).fit(X)
  inertia.append(kmeans.inertia_)

#plotting inertia with number of clusters
plt.figure(figsize=(8, 6))
sns.lineplot(x=range(1, 10),y=inertia, marker='o')
plt.title("Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.xticks(range(1, 10))
plt.show()

From the elbow method we got the optimal number of clusters as 3

We will now run the model with number of clusters as 3

In [None]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_km = kmeans.predict(X)

# Plot the clusters
plt.figure(figsize=(12, 8))
plt.title('Customer Segmentation based on Recency and Frequency')
plt.scatter(X[:,0], X[:,1], c=y_km, s=50, cmap='Set2', label='Clusters')

# Plot and annotate the centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:,1], c='red', s=210, alpha=0.6, marker='o')
for i, center in enumerate(centers):
    plt.annotate(f'Cluster {i}', (center[0], center[1]), textcoords="offset points", xytext=(0,10), ha='center')

plt.xlabel('Recency (log-transformed and scaled)')
plt.ylabel('Frequency (log-transformed and scaled)')
plt.legend()
plt.show()

We can observe the 3 clusters when the K-Means labels are shown on the plot of recency against frequency.

**Splitting the RFM values into 4 quantiles**

In [None]:
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

In [None]:
#function to create RFM segments

def RecencyScore(r):
  if r <= quantiles["Recency"][0.25]:
    return 1
  elif r <= quantiles["Recency"][0.50]:
    return 2
  elif r <= quantiles["Recency"][0.75]:
    return 3
  else:
    return 4



def FreqScore(r):
  if r <= quantiles["Frequency"][0.25]:
    return 4
  elif r <= quantiles["Frequency"][0.50]:
    return 3
  elif r <= quantiles["Frequency"][0.75]:
    return 2
  else:
    return 1


def MontScore(r):
  if r <= quantiles["Monetary"][0.25]:
    return 4
  elif r <= quantiles["Monetary"][0.50]:
    return 3
  elif r <= quantiles["Monetary"][0.75]:
    return 2
  else:
    return 1

In [None]:
# calculate RFM segment values for each record
rfm_df['R'] = rfm_df['Recency'].apply(RecencyScore)
rfm_df['F'] = rfm_df['Frequency'].apply(FreqScore)
rfm_df['M'] = rfm_df['Monetary'].apply(MontScore)
rfm_df.reset_index().head()
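The quartile scoring above can also be written with `pd.qcut`, which cuts a column into equal-sized bins and assigns a label to each bin. The sketch below uses a made-up `toy` table (an assumption, not the project data) and follows the same convention as the functions above, where a score of 1 is the best quartile.

```python
import pandas as pd

# Hypothetical RFM values for eight customers (not taken from the dataset)
toy = pd.DataFrame({
    "Recency":   [1, 2, 3, 4, 5, 6, 7, 8],
    "Frequency": [80, 70, 60, 50, 40, 30, 20, 10],
})

# Low recency is best, so the lowest quartile gets score 1
toy["R"] = pd.qcut(toy["Recency"], q=4, labels=[1, 2, 3, 4]).astype(int)

# High frequency is best, so the labels are reversed
toy["F"] = pd.qcut(toy["Frequency"], q=4, labels=[4, 3, 2, 1]).astype(int)

print(toy)
```

One caveat: `pd.qcut` raises an error when two quartile edges coincide, which can happen for columns with many tied values. In that case the explicit functions above are the safer choice.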

**Calculating RFM score from RFM segmentation**

In [None]:
rfm_df["RFMGroup"] = rfm_df["R"].map(str) + rfm_df["F"].map(str) + rfm_df["M"].map(str)

rfm_df["RFMScore"] = rfm_df["R"] + rfm_df["F"] + rfm_df["M"]

rfm_df.reset_index().head()

We got the RFM score for each datapoint. Now we will assign each datapoint to a cluster

In [None]:
# assigning datapoints to clusters
rfm_df["Cluster"] = kmeans.labels_
rfm_df.reset_index().head()

In [None]:
# calculate mean values of the numeric columns for each cluster (RFMGroup is a string column)
cluster_average = rfm_df.groupby('Cluster').mean(numeric_only=True)
cluster_average

**INTERPRETATION :**

CLUSTER 0 :

*   Recency - Moderate (103 days)
*   Frequency - Moderate (41)
*   Monetary - Moderate ($ 480)
*   Interpretation - These are potential loyalists who have moderate interaction with our client. They can be encouraged to increase their purchases from the company and become permanent customers.

CLUSTER 2 :

*   Recency - High (160 days)
*   Frequency - Low (8)
*   Monetary - Low ($ 106)
*   Interpretation - These are occasional buyers who need special attention from our client. They rarely make purchases from our company, so they need different marketing strategies.

CLUSTER 1 :

*   Recency - Low (17 days)
*   Frequency - High (179)
*   Monetary - High ($ 1874)
*   Interpretation - These are loyal, permanent customers who regularly make purchases from our client. They should be retained, as they form the base of our client's business.

NOTE - Cluster numbers are arbitrary labels and can change between runs when `random_state` is not fixed, so the cluster descriptions above should be matched to the position of each cluster in the graph and not to the number it carries.
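Because the numbering is arbitrary, one option is to renumber the clusters by the position of their centers, so that the same group always receives the same number. The sketch below is one possible approach and not part of the original analysis; `relabel_by_center` and the synthetic `X_demo` data are hypothetical names introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def relabel_by_center(model, feature=0):
    """Return labels renumbered so that cluster 0 has the smallest center on `feature`."""
    order = np.argsort(model.cluster_centers_[:, feature])  # old ids sorted by center value
    mapping = np.empty_like(order)
    mapping[order] = np.arange(len(order))                  # old id -> new id
    return mapping[model.labels_]

# Synthetic, well-separated groups (an assumption for the demo)
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in group_centers])

# Two fits with different seeds may number the clusters differently
labels_a = relabel_by_center(KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_demo))
labels_b = relabel_by_center(KMeans(n_clusters=3, n_init=10, random_state=2).fit(X_demo))

# After relabeling, the numbering agrees between the two runs
print((labels_a == labels_b).all())
```

For the RFM features, sorting on the first column would number the clusters from the most recent buyers to the least recent ones.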

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Here we have used the RFM (Recency, Frequency, Monetary) model for our clustering purposes.

1. Recency is the number of days passed since the customer made the last purchase

2. Frequency is the number of times the customer made purchases

3. Monetary is the total amount spent by the customer in making purchases

First we took the log of these three variables and scaled them with StandardScaler, so that no single variable dominates the distance calculation in our model.

The scatter plot above shows recency against frequency, while the clusters themselves were found on all three scaled features with the K-Means algorithm, which alternates between assigning each point to its nearest centroid and recomputing the centroids (a procedure similar to expectation maximization).

The elbow method suggested 3 as the most appropriate number of clusters for our task. We then calculated RFM scores and cluster averages to describe what the members of each of the 3 clusters have in common.

These 3 clusters of customers can be used by our client to target their marketing strategies, which should help increase their profit.

Now we will evaluate our model using the silhouette score.

In [None]:
# Visualizing evaluation Metric Score chart
#silhouette score visualization

from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_score

#calculating and visualizing silhouette scores for cluster numbers from 2-9
for n in range(2,10):
  kmeans = KMeans(n_clusters = n)
  y_pred = kmeans.fit_predict(X)

  score = silhouette_score(X, y_pred, metric = 'euclidean')

  print(f"Silhouette score for {n} clusters is {score}")

  visualizer = SilhouetteVisualizer(kmeans)
  visualizer.fit(X)
  visualizer.show() # show() replaces the deprecated poof() in yellowbrick 1.0+

The silhouette score ranges from -1 to 1. An average score close to 1 indicates compact, well-separated clusters, a score near 0 indicates overlapping clusters, and a negative score indicates that many points may have been assigned to the wrong cluster.
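For a single point, the silhouette coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes it by hand for four made-up points and compares the result with scikit-learn; the `pts` data and the `silhouette_by_hand` helper are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Four points on a line forming two tight, well-separated groups (toy data)
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(i):
    d = np.abs(pts[:, 0] - pts[i, 0])   # distances from point i to every point
    same = labels == labels[i]
    same[i] = False                      # exclude the point itself
    a = d[same].mean()                   # mean distance within its own cluster
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(i) for i in range(len(pts))])

print("by hand:     ", manual.round(4))
print("scikit-learn:", silhouette_samples(pts, labels).round(4))
print("average:     ", round(silhouette_score(pts, labels), 4))
```

The average printed on the last line is the quantity that `silhouette_score` reports for each number of clusters in the loop above.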

In the above charts the highest average silhouette score occurs for 2 clusters, while the elbow method we used earlier suggested 3. Since the second-highest average silhouette score occurs for 3 clusters, we have taken 3 as the most appropriate number of clusters.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

from sklearn.model_selection import KFold

# Number of clusters
num_clusters = 3

# Initialize K-Fold Cross-Validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize a list to store silhouette scores
silhouette_scores = []

# Iterate over the folds
for train_index, val_index in kf.split(X):
    # Split the data into training and validation sets
    X_train, X_val = X[train_index], X[val_index]

    # Initialize and fit the KMeans model
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(X_train)

    # Predict cluster labels for validation data
    val_labels = kmeans.predict(X_val)

    # Check if there is more than one unique cluster label
    if len(set(val_labels)) > 1:
        # Calculate silhouette score for the validation data
        score = silhouette_score(X_val, val_labels)
        silhouette_scores.append(score)

# Handle case where no valid silhouette scores were calculated
if silhouette_scores:
    avg_silhouette_score = sum(silhouette_scores) / len(silhouette_scores)
    print("Silhouette Scores:", silhouette_scores)
    print("Average Silhouette Score:", avg_silhouette_score)
else:
    print("No valid silhouette scores were calculated.")


##### Which hyperparameter optimization technique have you used and why?

We have used K-fold cross-validation (with shuffling, not stratification) to compute the silhouette score on 5 different training and validation splits, which gave us five silhouette scores. The number of clusters stays fixed at 3, so this checks the stability of the score rather than tuning a hyperparameter.

Averaging these 5 scores gives a more reliable estimate of the silhouette score than fitting the model only once.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There is only a marginal difference between our original silhouette score and the average score from K-fold cross-validation, so there is no improvement.

Original Silhouette Score = 0.3196801458213756

K fold Silhouette Score = 0.319018518855131

### ML Model - 2

Here we have used a QP (Quantity, Total Price) model to evaluate customers by the mean quantity of products they buy and the mean amount they spend on each transaction. It divides our customers according to the revenue they generate for the company, and marketing strategies can be optimized using these divisions.

In [None]:
# ML Model - 2 Implementation

# Fit the Algorithm

# Predict on the model

#creating a dataframe to map mean quantity and mean total price for each customer
qp_df = df.groupby("CustomerID").agg({"Quantity" : lambda x : x.mean(), "Total Price" : lambda x : x.mean()})
qp_df.reset_index().head()

In [None]:
#plotting density plots for quantity and total price
fig,axes = plt.subplots(1,2 , figsize = (12,5))
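# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(qp_df["Quantity"] , bins = 50 , kde = True , ax = axes[0])
sns.histplot(qp_df["Total Price"] , bins = 50 , kde = True , ax = axes[1])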


axes[0].set_title("Quantity Distribution")
axes[1].set_title("Total Price Distribution")
plt.show()


To bring the above distributions as close to a normal distribution as possible and to make the ranges of the variables comparable, we take their logarithms.

In [None]:
# taking log of variables
qp_df["Quantity_log"] = np.log(qp_df["Quantity"])
qp_df["TotalPrice_log"] = np.log(qp_df["Total Price"])
qp_df

In [None]:
# plotting the logarithm of variables
fig,axes = plt.subplots(1,2 , figsize = (12,5))
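# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(qp_df["Quantity_log"] , bins = 50 , kde = True , ax = axes[0])
sns.histplot(qp_df["TotalPrice_log"] , bins = 50 , kde = True , ax = axes[1])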


axes[0].set_title("Quantity_log Distribution")
axes[1].set_title("TotalPrice_log Distribution")
plt.show()

In [None]:
# scaling the logarithm of variables using standard scaler
features = qp_df[["Quantity_log","TotalPrice_log"]].values
scaler = StandardScaler()

X = scaler.fit_transform(features)


In [None]:
# applying elbow method to find appropriate number of clusters
inertia = []

for k in range(1,10):
  kmeans = KMeans(n_clusters = k , random_state = 0).fit(X)
  inertia.append(kmeans.inertia_)

#plotting inertia with number of clusters
plt.figure(figsize=(8, 6))
sns.lineplot(x=range(1, 10),y=inertia, marker='o')
plt.title("Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.xticks(range(1, 10))
plt.show()

We got 3 as the most appropriate number for the clusters

**Applying KMeans Algorithm**

In [None]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_km = kmeans.predict(X)

# Plot the clusters
plt.figure(figsize=(12, 8))
plt.title('Customer Segmentation based on Quantity and Total Price')
plt.scatter(X[:,0], X[:,1], c=y_km, s=50, cmap='Set2', label='Clusters')

# Plot and annotate the centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:,1], c='red', s=210, alpha=0.6, marker='o')
for i, center in enumerate(centers):
    plt.annotate(f'Cluster {i}', (center[0], center[1]), textcoords="offset points", xytext=(0,10), ha='center')

plt.xlabel('Quantity (log-transformed and scaled)')
plt.ylabel('Total Price (log-transformed and scaled)')
plt.legend()
plt.show()

We can clearly see the 3 clusters in the above graph of quantity against total price.

In [None]:
# dividing the dataset into 4 quantiles
quantiles = qp_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

In [None]:
# function to assign rank according to quantity of products bought by customers
def QuantityScore(r):
  if r <= quantiles["Quantity"][0.25]:
    return 4
  elif r <= quantiles["Quantity"][0.50]:
    return 3
  elif r <= quantiles["Quantity"][0.75]:
    return 2
  else:
    return 1


# function to assign rank according to total price spent by customers
def TotalPriceScore(r):
  if r <= quantiles["Total Price"][0.25]:
    return 4
  elif r <= quantiles["Total Price"][0.50]:
    return 3
  elif r <= quantiles["Total Price"][0.75]:
    return 2
  else:
    return 1

In [None]:
# creating separate columns to assign ranks according to quantity and total price
qp_df['Q'] = qp_df['Quantity'].apply(QuantityScore)
qp_df['P'] = qp_df['Total Price'].apply(TotalPriceScore)

qp_df.reset_index().head()

In [None]:
# creating two columns
qp_df["QP_Group"] = qp_df["Q"].map(str) + qp_df["P"].map(str) # column to group different ranks of customers
qp_df["QP_Score"] = qp_df["Q"] + qp_df["P"] # column to find sum of ranks
qp_df.head()

In [None]:
# assigning clusters to the customers
qp_df["Cluster"] = kmeans.labels_
qp_df.reset_index().head()

In [None]:
# finding average of the numeric variables for different clusters (QP_Group is a string column)
qp_df_averages = qp_df.groupby("Cluster").mean(numeric_only=True)
qp_df_averages

**Interpretation** -

Cluster 0:

Quantity: Moderate (approx 4)

Total Price: Moderate (approx $ 9)

Interpretation: These customers generate a moderate amount of revenue for our client because they buy products in moderate quantities and the prices of the products they buy are also moderate. They should be retained, and marketing strategies should focus on this as well as on encouraging them to buy more, as they have the potential to become high-value customers.



Cluster 1:

Quantity: Low (approx 2)

Total Price: Low (approx $ 4)

Interpretation: These customers generate low revenue, as they buy low quantities of products and the prices of those products are also low. They may be casual buyers, and marketing strategies should focus on retaining them and encouraging them to buy more, as they are not spending much. Special discounts can help in this case.


Cluster 2:

Quantity: High (approx 9)

Total Price: High (approx $ 17)

Interpretation: These are high-value customers who generate the most revenue for the company. They buy in larger quantities and the prices of the products they buy are also high. They are loyal customers who should be retained, and marketing strategies should focus on ensuring this.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

We have used the QP (Quantity - Total Price) model where we grouped the customers based on the average quantity of products they buy and the average price they pay on each transaction.

First we grouped the quantity of products and the total price by customer and calculated the mean of each. Then we took the log of these variables and scaled them with StandardScaler.

After scaling we used the elbow method to find the optimal number of clusters for our model, by plotting the inertia against the number of clusters.

We got 3 as the optimal number of clusters. Then we used the K-Means algorithm to cluster the two variables and plotted them to visualize the 3 clusters.

Then we found the QP group and QP score from the ranks we assigned based on the quantity and total price associated with each customer.

The characteristics of each cluster and the marketing impact it can have are described after that.

In [None]:
# Visualizing evaluation Metric Score chart

#silhouette score visualization

from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_score

#calculating and visualizing silhouette scores for cluster numbers from 2-9
for n in range(2,10):
  kmeans = KMeans(n_clusters = n)
  y_pred = kmeans.fit_predict(X)

  score = silhouette_score(X, y_pred, metric = 'euclidean')

  print(f"Silhouette score for {n} clusters is {score}")

  visualizer = SilhouetteVisualizer(kmeans)
  visualizer.fit(X)
  visualizer.show() # show() replaces the deprecated poof() in yellowbrick 1.0+

Based on the elbow method used previously and the silhouette scores calculated for different numbers of clusters, we found the optimal number of clusters to be 3.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

from sklearn.model_selection import KFold

# Number of clusters
num_clusters = 3

# Initialize K-Fold Cross-Validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize a list to store silhouette scores
silhouette_scores = []

# Iterate over the folds
for train_index, val_index in kf.split(X):
    # Split the data into training and validation sets
    X_train, X_val = X[train_index], X[val_index]

    # Initialize and fit the KMeans model
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(X_train)

    # Predict cluster labels for validation data
    val_labels = kmeans.predict(X_val)

    # Check if there is more than one unique cluster label
    if len(set(val_labels)) > 1:
        # Calculate silhouette score for the validation data
        score = silhouette_score(X_val, val_labels)
        silhouette_scores.append(score)

# Handle case where no valid silhouette scores were calculated
if silhouette_scores:
    avg_silhouette_score = sum(silhouette_scores) / len(silhouette_scores)
    print("Silhouette Scores:", silhouette_scores)
    print("Average Silhouette Score:", avg_silhouette_score)
else:
    print("No valid silhouette scores were calculated.")


We got average silhouette score to be 0.5777479124660421

##### Which hyperparameter optimization technique have you used and why?

We have used K-fold cross-validation (with shuffling, not stratification) to compute the silhouette score on 5 different training and validation splits, which gave us five silhouette scores. The number of clusters stays fixed at 3, so this checks the stability of the score rather than tuning a hyperparameter.

Averaging these 5 scores gives a more reliable estimate of the silhouette score than fitting the model only once.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There is only a marginal difference between our original silhouette score and the average score from K-fold cross-validation, so there is no improvement.

Original Silhouette Score = 0.5774089325283756

K fold Silhouette Score = 0.5777479124660421

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

By choosing the number of clusters with the elbow method and the silhouette score, we obtained a well-supported grouping of our customers. This grouping can help our client understand the customers better and plan marketing strategies accordingly.
This can help reduce marketing costs and increase profits at the same time.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The silhouette score gave us the best way to check the number of clusters suggested by the elbow method. This helped create well-separated groups for targeting marketing strategies.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The RFM (Recency, Frequency, Monetary) model is the better model for optimizing marketing strategies, as it takes overall customer behaviour into account, i.e. last purchase, number of purchases and total amount spent. It should therefore allow marketing costs to be targeted more effectively than the other model, which takes into account only the average quantity and average price for each customer.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

We have used 2 models here to segment our customers. The first is based on the customer's overall buying behaviour, i.e. last purchase, number of purchases and total value of purchases. The second is based on the customer's buying habits in terms of the average quantity of products purchased and the average price of each transaction, which gives us another way to segment customers. These 2 models can help in optimizing marketing costs and increasing profits.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
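The two steps above (saving and reloading the model) could look like the sketch below, which uses `joblib`. It runs on random stand-in data, and the names `features_demo`, `scaler_demo`, `model_demo` and the file name `rfm_kmeans.joblib` are assumptions for illustration; in the notebook the fitted `scaler` and `kmeans` objects of the chosen model would be saved instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption); in the notebook this would be the log-transformed RFM features
rng = np.random.default_rng(0)
features_demo = rng.normal(size=(200, 3))

scaler_demo = StandardScaler().fit(features_demo)
model_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    scaler_demo.transform(features_demo)
)

# Save the scaler together with the model, since new data must be scaled the same way
path = os.path.join(tempfile.mkdtemp(), "rfm_kmeans.joblib")  # hypothetical file name
joblib.dump({"scaler": scaler_demo, "kmeans": model_demo}, path)

# Load the file again and assign clusters to unseen rows as a sanity check
loaded = joblib.load(path)
new_rows = rng.normal(size=(5, 3))
new_labels = loaded["kmeans"].predict(loaded["scaler"].transform(new_rows))
print(new_labels)
```

New customers would need the same log transformation as the training data before they are passed to the loaded scaler.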

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project we have created 2 models to segment our customers. The models use different strategies to describe customer behaviour by grouping customers into segments, so that our client can plan their marketing and business strategies around them. We hope this project is successful in completing the task provided.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***