<a href="https://colab.research.google.com/github/naru289/Assignment-10-Paradigm-Of-ML/blob/main/K_Means_Clustering(Customer_Segmentation)_Ungraded.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement: Customer segmentation using K-Means Clustering

## Dataset

The dataset chosen for this problem is the Online Retail dataset. It is a transnational dataset that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Here our main goal is to cluster all the customers according to their attributes and gain more information about various customer patterns.

The dataset contains 541909 records, and each record is made up of 8 fields.

To learn more about the dataset, see its [UCI Machine Learning Repository page](https://archive.ics.uci.edu/ml/datasets/Online+Retail).

## Introduction

What is Customer Segmentation?

As the name suggests, customer segmentation means dividing customers into homogeneous groups based on attributes such as purchases, frequency of purchases, and type of products bought. In any business, it is important to analyse and retain existing customers as well as to explore and attract new customers.

To a certain extent, retaining customers is considered to deserve more effort than acquiring new ones, since existing customers are more likely to spend more on the products. Satisfying these customers helps build a large, reliable customer base and encourages repeat purchases of your products.

## Information

**Clustering** is the task of grouping together a set of objects so that the objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is a measure that reflects the strength of the relationship between two data objects.

K-Means is one of the most popular clustering algorithms. In this analysis, it is used to group similar data items.

In Retail and E-Commerce (B2C), and more broadly in B2B, one of the key elements shaping the business strategy of a firm is an understanding of customer behaviour. More specifically, this means understanding customers through different business metrics: how much they spend (revenue), how often they spend (frequency), whether they are new or existing customers, what their favourite products are, and so on. Such understanding in turn helps direct marketing, sales, account management and product teams to support customers on a personalized level and improve the product offering.

Furthermore, segmenting customers into different categories based on similar/cyclical buying pattern over a period of 1 year helps the retail shops manage their inventory better, thereby lowering costs and raising revenues by placing the orders in sync with the buying cycles.

## Problem Statement

Perform customer segmentation for an Online Retail dataset to segment the customers based on purchases and frequency using K-means Clustering.




In [None]:
#@title Run the cell to download the dataset
from warnings import filterwarnings
filterwarnings('ignore')
!wget https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/OnlineRetail.csv




### Importing required packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

### Loading the data

In [None]:
data = pd.read_csv('OnlineRetail.csv', encoding = 'unicode_escape')
data.head()

**Data Attributes**

**InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘C’, it indicates a cancellation.

**StockCode:** Product (item) code. As it is a wholesale retail store, each stock item has a unique identifier.

**Description:** Product (item) name. Nominal.

**Quantity:** The quantities of each product (item) per transaction. Numeric.

**InvoiceDate:** Invoice Date and time. Numeric, the day and time when each transaction was done by the customer.

**UnitPrice:** Unit price. Numeric, Product price per unit in sterling.

**CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

**Country:** Country name. Nominal, the name of the country where each customer resides.

In [None]:
data.shape

In [None]:
# dataframe info
data.info()

In [None]:
# data description
data.describe()

### Data Pre-processing

Explore the dataset by performing the following operations:

* There is a lot of redundant data. Check the data for duplicate entries and remove them.

* Most invoices appear as normal transactions with positive quantities and prices, but some are prefixed with "C" or "A", which denote different transaction types: an invoice starting with C represents a cancelled order, and one starting with A represents an adjustment. Check the negative values in the `Quantity` column for all cancelled orders.

* Handle the null values by dropping them or filling them with an appropriate value (such as the mean).


* Create a `DayOfWeek` column from `InvoiceDate`, using `pd.to_datetime()`.

**Note:** We perform all the above operations using a single function.

In [None]:
# original dataframe for backup
# .copy() makes an independent backup; plain assignment would only alias the same object
data_orig = data.copy()

In [None]:
# Check for the cancelled orders data
data[data.InvoiceNo.str[0] == 'C']

In [None]:
# Identify the cancelled orders
len(data[data.InvoiceNo.str[0] == 'C']), len(data[data.Quantity < 1 ])

In [None]:
# Check the null values
data.isna().sum()

In [None]:
# Function to perform all the pre-processing steps mentioned above
def pre_processing(df):

    # Drop the duplicates (returns a new frame, so the caller's data is untouched)
    df = df.drop_duplicates()

    # Remove the cancelled orders (InvoiceNo starting with 'C')
    df = df[~df.InvoiceNo.astype(str).str.startswith('C')]

    # Drop rows with null values; copy so the column assignments below are safe
    df = df.dropna().copy()

    # Convert 'InvoiceDate' values to pandas datetime format
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

    # Day of week (Monday = 0) and month name derived from the invoice date
    df['DayOfWeek'] = df['InvoiceDate'].dt.dayofweek
    df['MonthName'] = df['InvoiceDate'].dt.month_name()

    return df

In [None]:
# Call the above pre-processing function by passing the dataframe
data  = pre_processing(data_orig)

In [None]:
data.shape

In [None]:
data.head()

### Understanding new insights from the data (Exploratory Data Analysis)

1.  Are there any free items in the data? How many are there?

2.  Find the number of transactions per country and visualize using an appropriate plot

3.  What is the ratio of customers who are repeat purchasers vs single-time purchasers? Visualize the plot.

4. Find the top 10 customers who bought the largest number of items.

#### 1. Are there any free items in the data? How many are there?

In [None]:
data[data.UnitPrice == 0].count()

#### 2. Find the number of transactions per country and visualize

In [None]:
plt.figure(figsize=(10,7))
plt.barh(data.Country.value_counts().index, data.Country.value_counts())
plt.show()

#### 3. What is the ratio of customers who are repeat purchasers vs single-time purchasers? Visualize the plot

In [None]:
MostRepeat = data.groupby('CustomerID')['InvoiceNo'].nunique().sort_values(ascending=False)
rep = MostRepeat[MostRepeat > 1].values
nrep = MostRepeat[MostRepeat == 1].values
ser = pd.Series([len(rep)/ len(MostRepeat),len(nrep)/len(MostRepeat)], index=['Repeat Customers','One-time Customers'])
ser.plot(kind='pie', autopct='%.2f%%').set(ylabel='')
plt.suptitle('Repeat vs One-time Customers', fontsize=15)
plt.show()

#### 4. Find the top 10 customers who bought the largest number of items.

In [None]:
data['CustomerID'] = data['CustomerID'].astype(int)

In [None]:
Top10Customers = data.groupby('CustomerID').agg({"Quantity":"sum"}).sort_values('Quantity', ascending=False).iloc[:10]
print(Top10Customers)

### Feature Engineering and Transformation 

#### Create new features to uncover better insights and drop the unwanted columns

* Create a new column that represents the total amount (revenue) of each transaction line. We get this using the following formula: **Quantity * UnitPrice**

* Customer IDs are repeated across rows. Obtain one row per customer by grouping on the customer ID and aggregating all observations for that customer.



First, we compute the `Total_Amount` for each transaction line.

In [None]:
data['Total_Amount'] = data['Quantity'] * data['UnitPrice']

In [None]:
data.head()

Before going further, let us understand how we are going to analyse this dataset. There are many ways to do so; here we use RFM analysis, a long-established technique that plays a vital role in marketing. The three main variables in this analysis are:

**R (Recency):** The number of days between the customer's last purchase and the last date in the dataset. It tells us how recently a particular customer purchased from the store.

**F (Frequency):** The number of times each customer has made a purchase, obtained by counting the unique invoices on which the customer appears.

**M (Monetary):** The total amount spent by each customer.

Let’s calculate the RFM values.

The easiest to calculate is the monetary value (M). We will use the `Total_Amount` column that we created earlier.

In [None]:
m = data.groupby('CustomerID')['Total_Amount'].sum()
m = pd.DataFrame(m).reset_index()
m.head()

Looking at the first few rows of the new dataframe, we can see that we calculated the monetary value for each customer!

Now, let’s calculate the number of times each customer purchased from the store. We will be using the `CustomerID` and `InvoiceNo` columns.



In [None]:
# Count distinct invoices per customer; .count() would count every line item instead
f = data.groupby('CustomerID')['InvoiceNo'].nunique()
f = f.reset_index()
f.columns = ['CustomerID','Frequency']
f.head()

We were able to calculate the total number of times each customer purchased from the store.

Finally, let’s calculate the recency (R) value for each customer.


First, we need to find when the last purchase in the dataset was made.



In [None]:
last_day = max(data['InvoiceDate'])

Next, we compute how long before that last date each transaction took place.



In [None]:
data['difference'] = last_day - data['InvoiceDate']
data.head()

We need only the number of days as an integer, without the time component attached to it, so that it is easier to group by customer later on.

So we can use a separate function to return that integer.



In [None]:
def get_days(x):
    # x is a pandas Timedelta; .days is its whole-day component
    return x.days
data['difference'] = data['difference'].apply(get_days)

In [None]:
data.head()

Now we can group by `CustomerID` and take the smallest value of the `difference` column, which corresponds to each customer's most recent purchase.

In [None]:
r = data.groupby('CustomerID')['difference'].min()
r = r.reset_index()
r.columns = ['CustomerID','Recency']
r.head()

Now we have created three separate dataframes for recency (r), frequency (f) and monetary (m). Let’s combine these dataframes.

In [None]:
grouped_df = pd.merge(m, f, on = 'CustomerID',how = 'inner')
RFM_df = pd.merge(grouped_df, r, on ='CustomerID', how = 'inner')
RFM_df.columns = ['CustomerID','Monetary','Frequency','Recency']

Here we use inner joins on `CustomerID` to combine the three dataframes.
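As a cross-check on the steps above, the same three values can be sketched in a single `groupby` with named aggregation. The toy transactions, the `days_since` column and the `rfm_toy` name are invented for illustration, and Frequency is assumed here to mean the number of distinct invoices per customer.

```python
import pandas as pd

# Toy transactions; all values are invented for illustration only.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-01", "2011-12-08",
        "2011-11-01", "2011-11-20", "2011-10-09",
    ]),
    "Total_Amount": [10.0, 5.0, 20.0, 7.5, 2.5, 100.0],
})

# Days between each transaction and the last date in the data.
last_day = tx["InvoiceDate"].max()
tx["days_since"] = (last_day - tx["InvoiceDate"]).dt.days

# One row per customer: total spend, distinct invoices, days since last purchase.
rfm_toy = tx.groupby("CustomerID").agg(
    Monetary=("Total_Amount", "sum"),
    Frequency=("InvoiceNo", "nunique"),
    Recency=("days_since", "min"),
).reset_index()

print(rfm_toy)
```

Because everything is computed in one aggregation, no merge is needed and the column names are set explicitly rather than by position.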

Because k-means uses every data point when forming clusters, outliers can distort the clusters it finds. So let's first drop the outliers to get better clusters later on.



Let’s look at the box plot of each column.



In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 8))
sns.boxplot(ax=axes[0], x=RFM_df["Monetary"]);
sns.boxplot(ax=axes[1], x=RFM_df["Frequency"]);
sns.boxplot(ax=axes[2], x=RFM_df["Recency"]);

As each variable has outliers, let's drop them. Refer to the following [link](https://kanoki.org/2020/04/23/how-to-remove-outliers-in-python/) on how to remove outliers.



In [None]:
outlier_vars = ['Monetary','Recency','Frequency']
for column in outlier_vars:
    
    lower_quartile = RFM_df[column].quantile(0.25)
    upper_quartile = RFM_df[column].quantile(0.75)
    iqr = upper_quartile - lower_quartile
    iqr_extended = iqr * 1.5
    min_border = lower_quartile - iqr_extended
    max_border = upper_quartile + iqr_extended
    
    outliers = RFM_df[(RFM_df[column] < min_border) | (RFM_df[column] > max_border)].index
    print(f"{len(outliers)} outliers detected in column {column}")
    
    RFM_df.drop(outliers, inplace = True)

### Feature Scaling

Now we need to standardise the data, because features with larger values can otherwise dominate the clustering.

As the clustering algorithm is based on the distance between data points, we rescale each feature to have a mean of 0 and a standard deviation of 1. Note that this changes the scale of each feature, not the shape of its distribution.
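To make the rescaling concrete, here is a minimal sketch on an invented toy matrix. It checks that `StandardScaler` matches the manual formula `(x - mean) / std`, where the standard deviation is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with very different column scales (invented values).
X = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0], [1000.0, 10.0]])

scaled = StandardScaler().fit_transform(X)

# Manual standardisation; NumPy's std defaults to ddof=0, like StandardScaler.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns contribute on the same footing to Euclidean distances, even though the first column was originally a hundred times larger.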

In [None]:
scaled_df = RFM_df[['Monetary','Frequency','Recency']]
scale_standardisation = StandardScaler()
rfm_df_scaled = scale_standardisation.fit_transform(scaled_df)
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['monetary','frequency','recency']

In [None]:
rfm_df_scaled.head()

Now let's see how the k-means algorithm works.

#### Working of k-means algorithm

It starts by placing the centroids randomly (e.g., by picking k instances at random and using their locations as centroids). Then label the instances and update the centroids, then again label the instances and update the centroids, and so on until the centroids stop moving. The algorithm converges in a finite number of steps.

To update the labels of the instances, k-means computes the distance from each instance to every centroid and assigns the instance to the cluster whose centroid is closest. Each centroid is then updated to the mean of all instances assigned to its cluster, as shown in the figure below.
<br><br>
<center>
<img src="https://cdn.iisc.talentsprint.com/CDS/Images/Kmeans_cluster_update.JPG" width=650/>
</center>

The algorithm halts creating and optimizing clusters when either:

* The centroids have stabilized: their values no longer change between iterations.
* The defined maximum number of iterations has been reached.
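The label/update loop and the two stopping conditions described above can be sketched in plain NumPy. This is an illustrative helper (`kmeans_sketch` is a hypothetical name), not scikit-learn's implementation, and it uses simple random initialisation on invented toy data.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain label/update iterations; a hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct instances at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):  # stop 2: maximum number of iterations
        # Label step: assign each instance to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # stop 1: centroids have stabilised
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (invented data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids.round(2))
```

On data this cleanly separated, the loop should recover one centroid near each blob within a few iterations; on real data the result depends on the initial centroids, which is why scikit-learn repeats the run several times.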

So far, we do not know how many clusters the dataset contains, and this is hard to identify even with visualization.

#### Apply k-means algorithm to identify a specific number of clusters


* Fit the k-means model

* Extract and store the cluster centroids

Below are some helpful parameters of k-means:

**n_clusters** is the number of clusters to form

**init** is the centroid initialisation method; the default, `'k-means++'`, picks initial centroids that tend to be far apart from each other, which helps avoid the random initialisation trap

**max_iter** is the maximum number of iterations for a single k-means run

**n_init** is the number of times k-means is run with different initial centroids

**Note:** Refer to the following [link](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) on  k-means from sklearn
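As a minimal sketch of how these parameters are passed, the snippet below fits `KMeans` on invented toy blobs. The specific values (`n_init=10`, `max_iter=300`, `random_state=1`) are illustrative choices, not recommendations from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three toy blobs centred at 0, 3 and 6 (invented data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 3, 6)])

km = KMeans(
    n_clusters=3,       # number of clusters to form
    init="k-means++",   # spread-out initial centroids
    n_init=10,          # independent runs; the lowest-inertia run is kept
    max_iter=300,       # iteration cap for each run
    random_state=1,     # makes the result reproducible
)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # (3, 2)
print(km.n_iter_)                 # iterations used by the best run
```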

#### Find the optimal number of clusters (K) by using the Elbow method.

The **"Elbow Method"** is a technique that helps find an appropriate number of clusters for the clustering model.  

The elbow method gives us an idea of what a good number of clusters **k** would be, based on the **sum of squared errors (SSE), also called the within-cluster sum of squares (WCSS)**, between data points and their assigned clusters’ centroids. We pick **k** at the spot where the SSE starts to flatten out, forming an elbow. In scikit-learn, this SSE is available as the `inertia_` attribute of a fitted `KMeans` model. The `KMeans` class runs the algorithm `n_init` times and keeps the model with the lowest inertia.

The inertia is not a good performance metric when trying to choose $k$ because it keeps getting lower as we increase $k$. Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be. 
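To make the definition concrete, the sketch below recomputes the inertia by hand as the sum of squared distances from each instance to its assigned centroid, and compares it with `inertia_`. The data is random noise invented for the example, with no real cluster structure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unstructured toy data (invented): there are no true clusters here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

inertias, manual_sse = [], []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # Sum of squared distances from each instance to its assigned centroid.
    sse = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
    manual_sse.append(sse)
    inertias.append(km.inertia_)

print(np.allclose(manual_sse, inertias))
print(np.round(inertias, 1))
```

Even on data without clusters the inertia tends to fall as $k$ grows, which is why a low inertia alone says little about whether $k$ is a good choice.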

Let’s plot the inertia as a function of $k$

In [None]:
# Plot inertia by varying number of clusters
clusters = np.arange(1,10)
inertia = []
for c in clusters:
    kmeans = KMeans(n_clusters = c, random_state=1)
    kmeans.fit(rfm_df_scaled)
    inertia.append(kmeans.inertia_)
plt.plot(clusters, inertia, marker= '.')
plt.title('Inertia Plot')
plt.xlabel("$k$")
plt.ylabel("Inertia")
plt.show()

From the above plot we can see, the inertia drops very quickly as we increase $k$ up to 3, but then it decreases much more slowly as we keep increasing $k$. This curve has roughly the shape of an arm, and there is an **“elbow”** at $k = 3$. So, if we did not know better, $3$ would be a good choice.

This technique for choosing the best value for the number of clusters is rather coarse. A **more precise approach** but also more computationally expensive is to use the **silhouette score**, which is the mean silhouette coefficient over all the instances. 

An instance’s silhouette coefficient is given by $$s = \frac{b - a}{\max(a, b)}$$ where, 

$a$ is the mean distance to the other instances in the same cluster (i.e., the mean intra-cluster distance), and 

$b$ is the mean nearest-cluster distance (i.e., the mean distance to the instances of the next closest cluster, defined as the one that minimizes $b$, excluding the instance’s own cluster) as shown in the figure below. 

<center>
<img src="https://cdn.iisc.talentsprint.com/CDS/Images/Silhouette_coefficient.png" width="500"/>
</center>

The silhouette coefficient can vary between $-1$ and $+1$ as follows: 

* close to $+1$ means that the instance is well inside its own cluster and far from other clusters, 
* close to $0$ means that it is close to a cluster boundary, and  
* close to $-1$ means that the instance may have been assigned to the wrong cluster. 
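A small worked example may help. The sketch below computes the silhouette coefficient of one instance by hand, using the definitions of $a$ and $b$ above, and compares it with scikit-learn's `silhouette_samples`. The four 1-D points and their labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Four 1-D points in two clusters (invented values).
X = np.array([[0.0], [1.0], [4.0], [6.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette coefficient of the first point (x = 0), computed by hand.
a = abs(0.0 - 1.0)                           # mean distance to the rest of its own cluster
b = (abs(0.0 - 4.0) + abs(0.0 - 6.0)) / 2    # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)               # (5 - 1) / 5 = 0.8

s_all = silhouette_samples(X, labels)
print(s_manual, s_all[0])

# The silhouette score is the mean of the per-instance coefficients.
print(np.isclose(silhouette_score(X, labels), s_all.mean()))
```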

To compute the silhouette score, we can use Scikit-Learn’s [`silhouette_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) function, giving it all the instances in the dataset and the labels they were assigned:


In [None]:
# Plot Silhouette score plot
clusters = np.arange(2,10)
sil_score = []
for c in clusters:
    kmeans = KMeans(n_clusters = c, random_state=1)
    kmeans.fit(rfm_df_scaled)
    sil_coeff = silhouette_score(rfm_df_scaled, kmeans.labels_)
    sil_score.append(sil_coeff)
    print(f"cluster = {c}\t-->{sil_coeff}")
plt.plot(clusters, sil_score, marker= '.')
plt.title('Silhouette score plot')
plt.xlabel("$k$")
plt.ylabel("Silhouette score")
plt.show()

As we can see, the above visualization is much richer than the previous one: although it shows that $k = 2$ is a good choice, it also underlines the fact that $k = 3$ is quite good as well and much better than $k = 4$ or $5$. This was not visible when comparing inertias. 


So, from the above plot we will choose the number of clusters or customer groups to be 3.

In [None]:
# Fix random_state (as in the cells above) so the cluster labels are reproducible
kmeans = KMeans(n_clusters = 3, random_state=1)
kmeans.fit_predict(rfm_df_scaled)

In [None]:
# To understand the behavior of the customers in each cluster, print the respective centroid values
# (these are in standardised units, because the model was fitted on the scaled data)
cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = rfm_df_scaled.columns)
cluster_centers
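Centroids of a model fitted on standardised data are themselves in standardised units, which makes them hard to read as amounts or days. One option, sketched here on an invented toy table (the names `toy` and `centers_original` are hypothetical), is to map them back with `StandardScaler.inverse_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFM-like table with two obvious groups (invented values).
toy = pd.DataFrame({
    "Monetary": [100.0, 120.0, 110.0, 900.0, 950.0, 1000.0],
    "Frequency": [1, 2, 1, 10, 12, 11],
    "Recency": [200, 180, 220, 5, 10, 3],
})

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(toy_scaled)

# Undo the standardisation so the centroids are in the original units.
centers_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=toy.columns
)
print(centers_original.round(1))
```

Because standardisation is a linear rescaling, each back-transformed centroid equals the mean of its cluster's members in the original units.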

### Analyze the clusters

- Visualize the clusters with different colors using the predicted cluster centers.

  **Hint:** [3D plot](https://matplotlib.org/stable/gallery/mplot3d/scatter3d.html)

In [None]:
clusters = kmeans.labels_
# Work on a copy so that the scaled feature matrix is not modified
RFM = rfm_df_scaled.copy()
RFM['clusters'] = clusters
centroids = kmeans.cluster_centers_
fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(projection='3d')
ax.scatter(RFM["monetary"][RFM.clusters == 0], RFM["frequency"][RFM.clusters == 0], RFM["recency"][RFM.clusters == 0], c='blue', s=30)
ax.scatter(RFM["monetary"][RFM.clusters == 1], RFM["frequency"][RFM.clusters == 1], RFM["recency"][RFM.clusters == 1], c='red', s=30)
ax.scatter(RFM["monetary"][RFM.clusters == 2], RFM["frequency"][RFM.clusters == 2], RFM["recency"][RFM.clusters == 2], c='yellow', s=30)
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], s = 1100, c = 'k',  marker="*")

ax.view_init(30, 200)
plt.xlabel("Monetary")
plt.ylabel("Frequency")
ax.set_zlabel('Recency')
plt.show()

In the above plot, the centroids found by k-means are shown as black stars.

In [None]:
# The new 'clusters' column in the RFM dataframe holds the cluster assigned to each customer
RFM.head()

Let’s look at the analytics of RFM dataframe.

In [None]:
RFM_df['Clusters'] = kmeans.labels_

analysis = RFM_df.groupby('Clusters').agg({
    'Recency':['mean','max','min'],
    'Frequency':['mean','max','min'],
    'Monetary':['mean','max','min']})

In [None]:
print(analysis)

In [None]:
# Sample Analysis Below:

### First customer cluster: Frequent and minimum amount spent by the customer

### Second customers cluster: Recently visited the store with maximum frequency and spending the amount by customer

### Third Customers cluster: The last transaction was sometime back and minimum amount spent by the customer, 
#                             Average number of purchases by the customer