<a href="https://colab.research.google.com/github/navinsinghdo/capstone/blob/main/Online_Retail_Customer_Segmentation_Capstone_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Online Retail Customer Segmentation



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

Summary:
I received a data frame with the following features (columns):

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.

UnitPrice: Unit price. Numeric, product price per unit in sterling.

CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal, the name of the country where each customer resides.

The customers of any retail company fall into different categories and sub-groups. Those groups can be based on country, time of purchase, types of purchase, total purchase and so on. Segmenting customers into groups using ML models helps the company provide better service, keep good and long-standing customers happy, and target its advertising.

I started the project by dropping null values, as there was no good way to replace them. I also split the invoice timestamp into separate year, month, day and hour columns for better analysis, and added a new feature for the total amount a customer spends. Further, I analysed which countries have the highest number of customers, which months have the most orders, which products are most in demand, and so on. I also plotted graphs for clear visualization.

I also noticed that a number of customers cancelled orders, so I dropped the cancelled transactions (those with a negative total amount) before modelling. The company is advised to look into why some customers cancelled orders.

I then did RFM modelling, analysing the customers on the basis of Recency, Frequency and Monetary Value. In the last part I used different ML models to cluster the data into groups. According to the models, the following are the ideal numbers of clusters.

| Model Name | Optimal Number of Clusters |
| --- | --- |
| K-Means | 3 |
| K-Means with silhouette_score | 2 |
| K-Means with elbow method | 4 |
| Hierarchical clustering | 2 |
| Hierarchical clustering after cut-off | 3 |


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In this project, your task is to identify major customer segments on a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

import datetime as dt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
from yellowbrick.cluster import KElbowVisualizer
from pylab import rcParams
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch
from prettytable import PrettyTable 

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')

df = pd.read_excel('/content/drive/MyDrive/Online Retail.xlsx')

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

df.shape

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull().sum()

In [None]:
# Visualizing the missing values

import missingno as msno
msno.bar(df)

### What did you know about your dataset?

The data set contains information about online customers and their transactions, including invoice number, stock code, quantity, etc. (details in the section below).

Our data frame has 5268 duplicate rows and a number of NaN values in 2 columns: Description and CustomerID (shown in the graph above).

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe()

### Variables Description 

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.

UnitPrice: Unit price. Numeric, Product price per unit in sterling.

CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal, the name of the country where each customer resides.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#We have an invoice date column; checking its data type.
df.dtypes

In [None]:
#InvoiceDate is already a datetime object. Splitting it into year, month, day and hour for a better analysis of customers.
df['Invoice_Year'] = df['InvoiceDate'].dt.year
df['Invoice_Month'] = df['InvoiceDate'].dt.strftime('%B') 
df['Invoice_Day'] = df['InvoiceDate'].dt.strftime('%A') 
df['Invoice_Hour'] = df['InvoiceDate'].dt.hour

#Printing data frame
df.head()

In [None]:
#Adding a new column for total amount. Total amount = Quantity * Unit price.
df['Total_Amount'] = df['Quantity']*df['UnitPrice']

#printing new data frame
df.head()

In [None]:
df.dropna(axis = 0 , inplace = True)
df.drop_duplicates(inplace=True)

### What all manipulations have you done and insights you found?

I have done the following manipulations on the data frame:

Added 4 additional columns from the datetime feature: invoice year, month, day and hour.
Added one additional column for the total amount spent per transaction line (Quantity * UnitPrice).
Dropped rows with missing values and removed duplicate rows.
The new columns (features) will help us analyze the customers better.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

num_features= list(df.select_dtypes(['int64','float64']))

for col in num_features:
  fig = plt.figure(figsize = (10,6))
  ax = fig.gca()
  sns.distplot(df[col])
  feature = df[col]
  ax.axvline(feature.mean(), color= 'red')
  ax.axvline(feature.median(), color= 'Blue')
  ax.set_title(f'Histogram plot for {col}')

  plt.show()
     

##### 1. Why did you pick the specific chart?

Seaborn's distplot is a convenient way to visualize the distribution of a numerical column; the mean (red line) and median (blue line) are drawn on top for reference.

#### Chart - 2

Top Countries

In [None]:
# Chart - 2 visualization code

# value_counts() already sorts in descending order; naming the columns explicitly
# keeps this working regardless of how the pandas version labels value_counts() output.
top_countries = df['Country'].value_counts().rename_axis('Country').reset_index(name='Total Counts')
top_countries['Country %'] = top_countries['Total Counts']*100/df['Country'].count()
top_countries.head(5)

In [None]:
plt.figure(figsize= (30,10))
sns.barplot(x = 'Country' , y = 'Total Counts' , data = top_countries[:15] , palette=("YlOrBr")) #Limiting to the top 15, as there are many countries with very small counts.
plt.xlabel('Countries' , size = 15)
plt.ylabel('Total counts' , size  = 15)
plt.title('Country vs counts of CX')

##### 1. Why did you pick the specific chart?

Bar plot is a good and easy way to visualize here as we want to compare demand by country.

##### 2. What is/are the insight(s) found from the chart?

The UK has the highest demand, followed by Germany and France.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Management is advised to check how much is being spent on ads for the UK and Germany and compare the ROI, as demand in the UK is much higher than in Germany and France. Management should also review investment in countries such as Finland and Cyprus, which have the fewest customers. A healthy ratio of ROI to customer acquisition cost leads to positive business growth.

#### Chart - 3

Monthly consumption in top to bottom order

In [None]:
# Chart - 3 visualization code

# Naming the columns explicitly keeps this independent of the pandas version.
top_months_df = df['Invoice_Month'].value_counts().rename_axis('Invoice_Month').reset_index(name='Monthly Frequency')

top_months_df

In [None]:
plt.figure(figsize = (30,10))

sns.barplot(x = 'Invoice_Month' , y = 'Monthly Frequency' , data = top_months_df , palette=("YlOrBr"))
plt.xlabel('Invoice Month' , size = 15)
plt.ylabel('Monthly Frequency' , size = 15)
plt.title('Monthly Frequency')

plt.show()

##### 1. Why did you pick the specific chart?

Here I want to compare demand by month, and a bar plot is a good option for that. It compares monthly sales side by side.

##### 2. What is/are the insight(s) found from the chart?

November has the highest frequency of demand.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Management can consider investing more in ads and discounts to drive sales in months such as February and April. This can lead to positive business growth.

#### Chart - 4

Top Consumption by Days of Week

In [None]:
# Chart - 4 visualization code

# Naming the columns explicitly keeps this independent of the pandas version.
top_days_df = df['Invoice_Day'].value_counts().rename_axis('Invoice Day').reset_index(name='Frequency')

top_days_df

In [None]:
plt.figure(figsize=(20,10))

sns.barplot(x = 'Invoice Day' , y = 'Frequency' , data = top_days_df ,  palette=("YlOrBr"))
plt.xlabel('Day' , size = 15)
plt.ylabel('Frequency' , size = 15)
plt.title('Frequency by Days')

plt.show()

##### 1. Why did you pick the specific chart?

For comparison, a bar plot is a good option.

##### 2. What is/are the insight(s) found from the chart?

Thursday has the highest sales; Friday and Sunday are among the lowest.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Management can invest more in discounts on Sunday and Friday.

#### Chart - 5

Consumption by Hour of the Day

In [None]:
# Chart - 5 visualization code

# Naming the columns explicitly keeps this independent of the pandas version.
top_hrs_df = df['Invoice_Hour'].value_counts().rename_axis('Invoice Hour').reset_index(name='Frequency')

top_hrs_df

In [None]:
plt.figure(figsize=(20,10))

sns.barplot(x = 'Invoice Hour' , y = 'Frequency' , data = top_hrs_df ,  palette=("YlOrBr"))
plt.xlabel('Hour' , size = 15)
plt.ylabel('Frequency' , size = 15)
plt.title('Frequency by Hours')

plt.show()

##### 1. Why did you pick the specific chart?

To compare customer demand by hour, a bar plot is a good option.

##### 2. What is/are the insight(s) found from the chart?

Hours 12 and 13 (noon to 2 P.M.) have the highest demand.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Management can explore provisioning more server capacity during hours 12 and 13, and less during off-peak hours such as after 20:00 and before 09:00. If done, it can potentially lead to positive business growth.

#### Chart - 6

Most and the Least purchased products

In [None]:
# Chart - 6 visualization code

# Naming the columns explicitly keeps this independent of the pandas version.
product_desc_df = df['Description'].value_counts().rename_axis('Description').reset_index(name='Frequency')

product_desc_df

In [None]:
plt.figure(figsize=(30,10))

sns.barplot(x = 'Description' , y = 'Frequency' , data = product_desc_df[:10] ,  palette=("YlOrBr"))
plt.xlabel('Description' , size = 15)
plt.ylabel('Frequency' , size = 15)
plt.title('Frequency by Description')

plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is well suited for comparison.

##### 2. What is/are the insight(s) found from the chart?

WHITE HANGING HEART T-LIGHT HOLDER has the highest demand.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Management can promote low-selling items through ads if required.

#### Chart - 7

Total Amount

In [None]:
# Chart - 7 visualization code

plt.figure(figsize = (15,10))
sns.distplot(df['Total_Amount'])

The plot includes zero and negative total amounts, so I am removing them before plotting the distribution again.

In [None]:
total_amount_df = df[df['Total_Amount']>0]

# Distribution of Total amounts
plt.figure(figsize = (20,10))
sns.distplot(np.log1p(total_amount_df['Total_Amount']))
plt.title('Distribution of Total Amount')

##### 1. Why did you pick the specific chart?

Seaborn's distplot is well suited to visualizing a distribution.

##### 2. What is/are the insight(s) found from the chart?

This is a right-skewed distribution.
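To make the right-skew observation concrete, the sketch below measures skewness on a small made-up set of order totals before and after `np.log1p`, the transform used in the plot above. The `skewness` helper and the numbers are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def skewness(values):
    """Biased sample skewness: the mean of the cubed z-scores."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

# Hypothetical order totals: many small orders and a few very large ones.
totals = np.array([5, 8, 10, 12, 15, 20, 25, 40, 300, 1500], dtype=float)

raw_skew = skewness(totals)
log_skew = skewness(np.log1p(totals))

print(f"skewness before log1p: {raw_skew:.2f}")
print(f"skewness after  log1p: {log_skew:.2f}")
```

A positive value indicates a long right tail. On this toy data the log transform reduces the skewness but does not remove it.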

#### Chart - 8

Customers who cancel orders

The many negative total amounts arise because a number of customers cancel their orders, which is bad for business. I am writing a function to flag rows where the total amount is negative, and visualizing them.

In [None]:
# Chart - 8 visualization code

#Making a function to check if total amount is negative (means cancelled, thus a loss for the company)

def cancel_or_not(data):
  '''
  Return 'cancelled' if the total amount is negative, which implies that the order was cancelled,
  otherwise return 'Not cancelled'.
  '''
  if data < 0:
    return 'cancelled'
  return 'Not cancelled'

In [None]:
df_copy = df.copy() #making a copy as I do not want original data frame to change.
df_copy['Cancelling_insight'] = df['Total_Amount'].apply(cancel_or_not)
cancellation_df = pd.DataFrame(df_copy.groupby('Cancelling_insight' , sort= False).agg({'CustomerID': 'count'}))
cancellation_df

In [None]:
plt.figure(figsize = (15,10))

sns.barplot(x = cancellation_df.index , y = 'CustomerID' ,data = cancellation_df )

##### 1. Why did you pick the specific chart?

Bar plot is best for side by side comparison.

##### 2. What is/are the insight(s) found from the chart?

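8905 transaction rows have a negative total amount, which means they belong to cancelled orders. Note that this counts rows, not distinct customers.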

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Management is advised to look into the reasons for order cancellation. If the number increases, it might lead to negative business growth.
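When looking into cancellations, it can help to separate the number of cancelled rows from the number of distinct customers who cancelled. The sketch below does this on a small hypothetical table; the invoice numbers, customer IDs and prices are made up, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical transactions: invoice numbers starting with 'C' are cancellations.
toy = pd.DataFrame({
    "InvoiceNo":  ["536365", "C536379", "C536383", "536390", "C536391"],
    "CustomerID": [17850, 14527, 14527, 13047, 17548],
    "Quantity":   [6, -1, -2, 12, -24],
    "UnitPrice":  [2.55, 27.50, 4.65, 1.25, 0.29],
})
toy["Total_Amount"] = toy["Quantity"] * toy["UnitPrice"]

# Flag cancellations by invoice prefix rather than by the sign of the amount.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")

cancelled_rows = int(is_cancelled.sum())
cancelling_customers = int(toy.loc[is_cancelled, "CustomerID"].nunique())
cancellation_rate = cancelled_rows / len(toy)

print(cancelled_rows, cancelling_customers, round(cancellation_rate, 2))
```

In this toy table one customer cancels twice, so the row count is higher than the customer count.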

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize = (20,10))

# Restricting to numeric columns, as newer pandas versions raise an error when corr() meets text columns.
sns.heatmap(df.select_dtypes('number').corr() , annot = True , cmap=sns.diverging_palette(20, 220, n=200))
plt.title('Correlation Matrix')


##### 1. Why did you pick the specific chart?

Heatmaps are a good way to read and understand correlation. The more closely two columns move together, the closer their coefficient gets to 1. A negative value means they are inversely correlated.

##### 2. What is/are the insight(s) found from the chart?

Some features, such as total amount and quantity, are highly correlated.
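As a small worked example of how the coefficient behaves, the sketch below builds one column that rises with quantity and one that falls with it, then computes the Pearson correlation with `np.corrcoef`. The `unit_price` and `remaining_stock` values are hypothetical and only serve to illustrate the two extremes.

```python
import numpy as np

quantity = np.array([1, 2, 3, 4, 5], dtype=float)
unit_price = 2.5                        # hypothetical constant price
total_amount = quantity * unit_price    # rises in lockstep with quantity
remaining_stock = 100 - quantity        # hypothetical column that falls as quantity rises

r_positive = np.corrcoef(quantity, total_amount)[0, 1]
r_negative = np.corrcoef(quantity, remaining_stock)[0, 1]

print(f"quantity vs total_amount:    {r_positive:.2f}")
print(f"quantity vs remaining_stock: {r_negative:.2f}")
```

In real data the unit price varies from row to row, so the correlation between quantity and total amount is high but not exactly 1.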

## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

The missing values cannot be meaningfully replaced with the mean, median or mode, so I decided to drop the rows with missing values. Because less than 50% of the values are missing in each affected column, I drop the rows rather than the columns.
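The rule described above can be sketched as follows, on a tiny hypothetical frame. The 50% threshold comes from the text; the variable name `max_missing_share` and the sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Description and CustomerID.
toy = pd.DataFrame({
    "Description": ["MUG", None, "BAG", "LAMP", "BOX"],
    "CustomerID":  [17850.0, np.nan, np.nan, 13047.0, 12583.0],
    "Quantity":    [6, 1, 2, 3, 4],
})

missing_share = toy.isnull().mean()   # fraction of nulls per column
max_missing_share = 0.5               # assumed threshold, as stated in the text

# Drop a column only if more than half of it is missing, otherwise drop rows.
cols_to_drop = missing_share[missing_share > max_missing_share].index.tolist()
cleaned = toy.drop(columns=cols_to_drop).dropna(axis=0)

print(missing_share)
print(cleaned.shape)
```

Here no column crosses the threshold, so every column is kept and only the incomplete rows are removed.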

*RFM Model (Recency, Frequency, Monetary Value)*



Recency, frequency, monetary value is a marketing analysis tool used to identify a company's or an organization's best customers by using certain measures. The RFM model is based on three quantitative factors.

Recency: How recently a customer made a purchase.

Frequency: How often a customer makes a purchase.

Monetary Value: How much money a customer spends on purchases.

Performing RFM Segmentation and RFM Analysis, Step by Step

The first step in building an RFM model is to assign Recency, Frequency and Monetary values to each customer.

The second step is to divide the customer list into tiered groups for each of the three dimensions (R, F and M)

In [None]:
#Removing all cancelled invoices as it is not useful in customer segmentation.  
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

In [None]:
# creating column for only date
df['Invoice_Date'] = df['InvoiceDate'].dt.date
snapshot_date = max(df.InvoiceDate) + dt.timedelta(days=1)
# Creating dataframe to record RFM score
RFM_df = df.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'Total_Amount': 'sum'}).reset_index()
# Renaming columns
RFM_df.rename(columns = {'InvoiceDate': 'Recency',
                            'InvoiceNo': 'Frequency',
                            'Total_Amount': 'MonetaryValue'}, inplace=True)

In [None]:
RFM_df.head(10)

In [None]:
plt.figure(figsize = (20 , 10))
sns.distplot(RFM_df.Recency)
plt.title('Recency Distribution Plot')

In [None]:
plt.figure(figsize = (20,10))
sns.distplot(RFM_df.Frequency)
plt.title('Frequency Distribution Plot')

In [None]:
plt.figure(figsize = (20 , 10))
sns.distplot(RFM_df.MonetaryValue)
plt.title('Monetary Distribution Plot')

In [None]:
plt.figure(figsize = (20,10))
sns.heatmap(RFM_df.corr(), annot = True , cmap=sns.diverging_palette(20, 220, n=200))
plt.title('Correlation among RFM')

In [None]:
#Splitting into 25, 50, 75 : 4 quantiles.
quantiles = RFM_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()
quantiles 

In [None]:
# Functions to create R,F and M segments
def RScoring(x,p,d):
  if x <= d[p][0.25]:
    return 1
  elif x <= d[p][0.5]:
    return 2
  elif x <= d[p][0.75]:
    return 3
  else:
    return 4
def FnMScoring(x,p,d):
  if x <= d[p][0.25]:
    return 4
  elif x <= d[p][0.5]:
    return 3
  elif x <= d[p][0.75]:
    return 2
  else:
    return 1

In [None]:
# Calculate and add R, F and M segment value columns to the existing dataset
RFM_df['R'] = RFM_df['Recency'].apply(RScoring, args=('Recency',quantiles,))
RFM_df['F'] = RFM_df['Frequency'].apply(FnMScoring, args=('Frequency',quantiles,))
RFM_df['M'] = RFM_df['MonetaryValue'].apply(FnMScoring, args=('MonetaryValue',quantiles,))
RFM_df.head()
     

In [None]:
#Calculate and Add RFMScore value column showing total sum of RFMGroup values
RFM_df['RFMScore'] = RFM_df[['R', 'F', 'M']].sum(axis = 1)
RFM_df.head()

In [None]:
#Making a new column for RFM group
RFM_df['RFMGroup'] = RFM_df.R.map(str) + RFM_df.F.map(str) + RFM_df.M.map(str)
RFM_df.head()

**Interpretation**

RFMScore: The sum of the R, F and M values assigned by quartile.

RFMGroup: The R, F and M values concatenated into a three-digit code.

For example:

A customer last ordered 300 days ago, purchased only once and spent $10. In the RFM table this customer gets R = 4 (the last purchase was long ago), F = 4 (frequency is very low, with a single purchase) and M = 4 (monetary value is very low). The RFMGroup value is therefore 444, which marks the least valuable kind of customer and a low priority for further marketing effort.

The best customers are those who score R = 1, F = 1 or 2 and M = 1 or 2, which gives RFMGroup values such as 111, 112 and 121.
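The worked example above can be reproduced with a few lines of plain Python. The scoring logic mirrors `RScoring` and `FnMScoring`, but the quartile cut-offs below are hypothetical; in the notebook they come from `RFM_df.quantile()`.

```python
def r_score(value, q):
    """Fewer days since the last order is better, so the lowest quartile scores 1."""
    if value <= q[0.25]:
        return 1
    if value <= q[0.5]:
        return 2
    if value <= q[0.75]:
        return 3
    return 4

def fm_score(value, q):
    """Higher frequency or spend is better, so the top quartile scores 1."""
    if value <= q[0.25]:
        return 4
    if value <= q[0.5]:
        return 3
    if value <= q[0.75]:
        return 2
    return 1

# Hypothetical quartile cut-offs, for illustration only.
quantiles = {
    "Recency":       {0.25: 18, 0.5: 51, 0.75: 142},
    "Frequency":     {0.25: 17, 0.5: 41, 0.75: 98},
    "MonetaryValue": {0.25: 300.0, 0.5: 650.0, 0.75: 1600.0},
}

def rfm_group(recency, frequency, monetary):
    """Return the (RFMGroup, RFMScore) pair for one customer."""
    r = r_score(recency, quantiles["Recency"])
    f = fm_score(frequency, quantiles["Frequency"])
    m = fm_score(monetary, quantiles["MonetaryValue"])
    return f"{r}{f}{m}", r + f + m

lapsed = rfm_group(recency=300, frequency=1, monetary=10.0)
best = rfm_group(recency=3, frequency=250, monetary=5000.0)
print(lapsed, best)
```

With these cut-offs the lapsed one-time buyer lands in group 444 and a recent, frequent, high-spending customer lands in group 111.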

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

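#Using IQR-style fences. Note: Q1 and Q3 below are the 5th and 95th percentiles, not the quartiles, so only extreme values are removed.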
rfm = ['Recency','Frequency','MonetaryValue']
for col in rfm:
  Q1 = RFM_df[col].quantile(0.05)
  Q3 = RFM_df[col].quantile(0.95)
  IQR = Q3 - Q1
  RFM_df = RFM_df [(RFM_df[col] >= Q1 - 1.5*IQR) & (RFM_df[col] <= Q3 + 1.5*IQR)]

In [None]:
#Updated RFM. 
RFM_df.head()

##### What all outlier treatment techniques have you used and why did you use those techniques?

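I have used an IQR-style method to handle outliers. For each RFM column, the spread between the 5th and 95th percentiles is computed, and rows lying more than 1.5 times that spread below the 5th or above the 95th percentile are removed. Using these percentiles instead of the quartiles makes the fence wide, so only extreme values are dropped.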
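A minimal sketch of this fence on made-up data is shown below. The helper name `percentile_fence` and the sample values are assumptions; the percentiles and the 1.5 multiplier match the cell above.

```python
import numpy as np

def percentile_fence(values, low=0.05, high=0.95, k=1.5):
    """Return (lower, upper) bounds built from two percentiles and their spread."""
    v = np.asarray(values, dtype=float)
    q_low, q_high = np.quantile(v, [low, high])
    spread = q_high - q_low
    return q_low - k * spread, q_high + k * spread

# Hypothetical monetary values: 1..100 plus one extreme customer.
monetary = np.append(np.arange(1, 101, dtype=float), 10_000.0)

lower, upper = percentile_fence(monetary)
kept = monetary[(monetary >= lower) & (monetary <= upper)]

print(f"fence: [{lower:.1f}, {upper:.1f}], kept {len(kept)} of {len(monetary)}")
```

Only the single extreme value falls outside the fence here; everything between 1 and 100 is retained.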

**Plots**:

In [None]:
#Box plot:

plt.rcParams['figure.figsize']=(20,10)
ax = RFM_df[["Recency","Frequency","MonetaryValue"]].plot(kind='box', title='Boxplot', showmeans=True)
plt.show()

In [None]:
#Checking RFM distribution:

for col in rfm:
  plt.figure(figsize = (15,10))
  sns.distplot(RFM_df[col])
  plt.title(f'{col} Distribution Plot')

According to the RFM table, two groups of customers, of 370 and 797 customers respectively, are good for business.

All of the RFM distributions above are right-skewed.

Making a data frame to see the customers with the best (lowest) RFM scores.

In [None]:
# Top customers who frequent in all features
print(RFM_df[RFM_df['RFMScore'] == 3].sort_values('RFMScore', ascending = False).reset_index().head(10))
RFM_df[RFM_df['RFMScore'] == 3].shape

In [None]:
print(RFM_df[(RFM_df['RFMScore'] > 3) & (RFM_df['RFMScore'] <= 5)].sort_values('RFMGroup', ascending = False).reset_index().head(10))
RFM_df[(RFM_df['RFMScore'] > 3) &(RFM_df['RFMScore'] <= 5)].shape

### 3. Data Scaling

In [None]:
# Scaling your data

Log_rfm_Data = RFM_df[['Recency', 'Frequency', 'MonetaryValue']].apply(np.log1p, axis = 1)

In [None]:
rfm_features = ['Recency', 'Frequency', 'MonetaryValue']
final_rfm = Log_rfm_Data[rfm_features].values
sc = StandardScaler()
X = sc.fit_transform(final_rfm)

## ***7. ML Model Implementation***

### ML Model - 1

Implementing K-Means

In [None]:
# ML Model - 1 Implementation
model = KMeans(n_clusters=3,max_iter=1000, random_state=10)
# Fit the Algorithm
cluster_labels = model.fit_predict(X)
print(cluster_labels)

In [None]:
#Plot 

plt.figure(figsize = (20,10))
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, s=50 , cmap='viridis')

centers = model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.title('K-Means clustering with 3 clusters.',fontweight = 'bold')

In [None]:
model.cluster_centers_

In [None]:
#Assigning labels

RFM_df['Cluster_Id'] = model.labels_  #one label per row of X, so no hard-coded length is needed
print(RFM_df.head(6))

In [None]:
#Group by Cluster ID. 

RFM_df.groupby('Cluster_Id').mean(numeric_only=True)  #RFMGroup is text, so it is excluded from the mean

In [None]:
#Number of cx in each clusters:

RFM_df['Cluster_Id'].value_counts()

**Clusters details**

Cluster 0: Customers with poor recency, very low purchase frequency and very low spend.

Cluster 1: Customers who purchase infrequently, but more often than cluster 0, and who generate more revenue than cluster 0.

Cluster 2: Customers who visit more often, purchase more frequently and generate a lot of business.

Observations:

Cluster 1 has the most customers, followed by cluster 0 and cluster 2. The company needs to move more customers from clusters 1 and 0 towards cluster 2, and work on keeping cluster 1 satisfied.

### ML Model - 2

Implementing k-Means Clustering with Silhouette

In [None]:
#Setting range:

cluster_range = [2,3,4,5,7,8,10]

In [None]:
for n_clusters in cluster_range:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(16, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 1 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state= 1)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(f"For n_clusters = {n_clusters}, The average silhouette_score is :{silhouette_avg}.")

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title(f"The silhouette plot for the {n_clusters} clusters.",fontweight = 'bold')
    ax1.set_xlabel("The silhouette coefficient values",fontweight = 'bold')
    ax1.set_ylabel("Cluster label", fontweight = 'bold')

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Cluster label
    centers = clusterer.cluster_centers_
    # Drawing white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

   

    ax2.set_title("The visualization of the clustered data.",fontweight = 'bold')
    ax2.set_xlabel("Feature space for the 1st feature",fontweight = 'bold')
    ax2.set_ylabel("Feature space for the 2nd feature",fontweight = 'bold')

    plt.suptitle((f'Silhouette analysis for KMeans clustering on sample data with n_clusters = {n_clusters}'),
                 fontsize=14, fontweight='bold')

plt.show()

With n = 2, the silhouette score is 0.39. There is less chance of assigning customers to the wrong cluster here, as the clusters are far apart.
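For intuition about what the score measures, the sketch below computes the silhouette value s = (b - a) / max(a, b) by hand for four made-up one-dimensional points, where a is the mean distance to the other points in the same cluster and b is the mean distance to the nearest other cluster. The helper `silhouette_values` is a simplified illustration that assumes every cluster has at least two points; it is not the scikit-learn implementation.

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) for 1-D points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = (labels == own)
        same[i] = False                        # exclude the point itself
        a = np.abs(points[same] - p).mean()    # cohesion within its own cluster
        b = min(                               # separation from the nearest other cluster
            np.abs(points[labels == other] - p).mean()
            for other in set(labels.tolist()) if other != own
        )
        scores.append((b - a) / max(a, b))
    return np.array(scores)

# Two tight, well-separated hypothetical clusters.
pts = [0.0, 1.0, 10.0, 11.0]
lbl = [0, 0, 1, 1]

scores = silhouette_values(pts, lbl)
average = float(scores.mean())
print(scores.round(3), round(average, 3))
```

Values close to 1 mean a point sits much nearer to its own cluster than to any other, which is the situation described above for well-separated clusters.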

### ML Model - 3

Implementing K-Means with Elbow Method

In [None]:
model3 = KMeans()

In [None]:
#Finding best number of clusters
def elbow_method(X):
  ''' Displays elbow curves with different metrics '''
  
  metrics = ['distortion', 'calinski_harabasz', 'silhouette']
  
  for m in metrics:
    visualizer = KElbowVisualizer(model3, k = (2,10), metric = m)
    visualizer.fit(X)
    visualizer.show()  #show() replaces the deprecated poof() in recent yellowbrick releases

In [None]:
#Plot. 

elbow_method(Log_rfm_Data)

In [None]:
# within cluster sum of squares:
wcss = []

for i in range(1,11):
    kmeans=KMeans(n_clusters=i, init='k-means++',random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# plotting curve

plt.figure(figsize = (9,6))
plt.grid(True)
plt.plot(range(1,11),wcss)
plt.title('The Elbow curve',fontweight = 'bold')
plt.xlabel('Number of Clusters',fontweight = 'bold')
plt.ylabel('WCSS', fontweight = 'bold')
plt.show()
     

The best number of clusters is 4 according to the elbow method. Beyond that, WCSS decreases only slightly.
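WCSS (within-cluster sum of squares) is the quantity stored in `kmeans.inertia_`. The sketch below computes it by hand for fixed cluster assignments on four hypothetical points, to show why the curve flattens once the natural groups have been found. The `wcss` helper and the points are assumptions for illustration.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

# Hypothetical 2-D points forming two tight groups.
pts = np.array([[0, 0], [0, 2], [10, 0], [10, 2]], dtype=float)

wcss_k1 = wcss(pts, [0, 0, 0, 0])   # one cluster
wcss_k2 = wcss(pts, [0, 0, 1, 1])   # two clusters
wcss_k4 = wcss(pts, [0, 1, 2, 3])   # every point is its own cluster

print(wcss_k1, wcss_k2, wcss_k4)
```

The drop from one to two clusters is large, while going further gains very little. That bend is the elbow.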

### ML Model - 4

Implementing Hierarchical Clustering

In [None]:
model4 = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')  #ward linkage always uses Euclidean distance
y_hc = model4.fit_predict(X)

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'magenta', label = 'Cluster_1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'darkblue', label = 'Cluster_2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster_3')      

plt.title('CX Clusters', size = 20)
plt.xlabel('Recency (log-transformed, standardized)', fontweight = 'bold')
plt.ylabel('Frequency (log-transformed, standardized)', size = 15, fontweight = 'bold')
plt.legend()
plt.show()

In [None]:
#Dendrogram plot, to find the best number of clusters:

rcParams['figure.figsize'] = 15, 10

# max_d = cut-off/ Threshold value
max_d = 50

dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram',fontweight = 'bold')
plt.xlabel('Customers',fontweight = 'bold')
plt.ylabel('Euclidean Distances',fontweight = 'bold')
#Cutting it on threshold value, 50.
plt.axhline(y = max_d,c = 'k')
plt.show()

By choosing max_d = 50, the cut-off line intersects three vertical branches, which gives three clusters. Choosing a different max_d value gives a different number of clusters.
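The effect of the cut-off can also be checked numerically instead of reading it off the plot. The sketch below uses SciPy's `fcluster` with `criterion="distance"` on six hypothetical points; the helper `clusters_at` and the thresholds are assumptions chosen for this toy data, not for the RFM data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Three hypothetical pairs of points, spaced 10 units apart.
pts = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]], dtype=float)
Z = sch.linkage(pts, method="ward")

def clusters_at(max_d):
    """Number of flat clusters obtained by cutting the dendrogram at height max_d."""
    labels = sch.fcluster(Z, t=max_d, criterion="distance")
    return len(np.unique(labels))

n_low = clusters_at(5)     # cut below the merges that join the pairs
n_mid = clusters_at(20)    # cut between the last two merges
n_high = clusters_at(30)   # cut above every merge

print(n_low, n_mid, n_high)
```

Raising the threshold merges more branches, so the number of clusters can only stay the same or go down.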

In [None]:
final_table = PrettyTable(['Sr. No.',"Model_Name",'Data', "Optimal_Number_of_cluster"]) 
  
# Add rows 
final_table.add_row(['1','K-Means','RFM',"3"])
final_table.add_row(['2',"K-Means with silhouette_score ", "RFM", "2"]) 
final_table.add_row(['3',"K-Means with Elbow method  ", "RFM", "4"])
final_table.add_row(['4',"Hierarchical clustering  ", "RFM", "2"])
final_table.add_row(['5',"Hierarchical clustering after Cut-off ", "RFM", "3"])
print(final_table)

1. Which ML model did you choose from the above created models as your final prediction model and why?

I chose K-Means with 3 clusters as the final model, as it divides the customers into 3 clusters according to their purchasing behaviour. The 2 clusters suggested by hierarchical clustering are too few to separate the customers usefully, so 3 clusters will help in building a better business and advertising strategy.

Cluster 0, treated as new customers: Customers with poor recency, very low purchase frequency and very low spend.

Cluster 1, or casual customers: Customers who purchase infrequently, but more often than cluster 0, and who generate more revenue than cluster 0.

Cluster 2, or loyal customers: Customers who visit more often, purchase more frequently and generate a lot of business.
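Once the clusters have been interpreted, the numeric labels can be mapped to segment names for reporting. The sketch below assumes a frame with a `Cluster_Id` column like the one created earlier; the sample rows and the `segment_names` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer rows with K-Means labels already attached.
toy = pd.DataFrame({
    "CustomerID":    [1, 2, 3, 4, 5, 6],
    "MonetaryValue": [120.0, 90.0, 800.0, 950.0, 4000.0, 5200.0],
    "Cluster_Id":    [0, 0, 1, 1, 2, 2],
})

# Assumed mapping; K-Means label numbers are arbitrary, so it should be
# confirmed against the per-cluster means after every refit.
segment_names = {0: "New", 1: "Casual", 2: "Loyal"}
toy["Segment"] = toy["Cluster_Id"].map(segment_names)

mean_spend = toy.groupby("Segment")["MonetaryValue"].mean()
print(mean_spend)
```

Because a different random seed can permute the label numbers, checking the cluster means before applying the names avoids mislabelling a segment.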

# **Conclusion**

The following conclusions were drawn during EDA:

Top five countries: United Kingdom, Germany, France, Ireland and Spain.

Months which give maximum business: November, October, December, September and May.

Purchasing by day of the week: Thursday > Wednesday > Tuesday > Monday > Saturday > Friday.

Most customers purchase products between 10:00 A.M. and 3:00 P.M.

The company should make sure that the website server is up and running during those hours, and also invest in customer support during those hours.

WHITE HANGING HEART T-LIGHT HOLDER > REGENCY CAKESTAND 3 TIER > JUMBO BAG RED RETROSPOT were the 3 most frequently ordered products.
The company is advised to keep these products in stock at all times.

GLASS AND BEADS BRACELET IVORY, CROCHET LILAC/RED BEAR KEYRING and PINK BAROQUE FLOCK CANDLE HOLDER were ordered the least.

A number of customers also cancelled orders. The company should look into the reasons.

The following are the numbers of clusters suggested by the different ML algorithms:

K-Means: 3 clusters

K-Means with Silhouette: 2 clusters

K-Means with Elbow Method: 4 clusters

Hierarchical Clustering: 2 clusters

Hierarchical Clustering with cut-off: 3 clusters

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***