## K-Means Clustering

**Overview**<br>
<a href="https://archive.ics.uci.edu/ml/datasets/online+retail">Online retail is a transnational data set</a> that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

The steps are broadly:
1. Read and understand the data
2. Clean the data
3. Prepare the data for modelling
4. Modelling
5. Final analysis and recommendations

# 1. Read and visualise the data

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [22]:
# read the dataset
retail_df = pd.read_csv("Cricket.csv", sep=",", encoding="ISO-8859-1", header=0)
retail_df.head()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96


In [23]:
# basics of the df
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  79 non-null     object 
 1   Span    79 non-null     object 
 2   Mat     79 non-null     int64  
 3   Inns    79 non-null     int64  
 4   NO      79 non-null     int64  
 5   Runs    79 non-null     int64  
 6   HS      79 non-null     object 
 7   Ave     79 non-null     float64
 8   BF      79 non-null     int64  
 9   SR      79 non-null     float64
dtypes: float64(2), int64(5), object(3)
memory usage: 6.3+ KB
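
The `HS` column is read as `object` because not-out scores carry a trailing `*` (for example `200*`). A minimal sketch of one way to split it into a numeric score and a not-out flag, assuming the asterisk is the only non-numeric character in the column. The column names `HS_runs` and `HS_not_out` are hypothetical, and the rows are copied from the outputs in this notebook rather than read from `Cricket.csv`:

```python
import pandas as pd

# Small sample copied from the outputs shown in this notebook
sample = pd.DataFrame({
    "Player": [
        "SR Tendulkar (INDIA)",
        "KC Sangakkara (Asia/ICC/SL)",
        "Inzamam-ul-Haq (Asia/PAK)",
    ],
    "HS": ["200*", "169", "137*"],
})

# Flag not-out innings, then strip the marker and convert to integers
sample["HS_not_out"] = sample["HS"].str.endswith("*")   # hypothetical column name
sample["HS_runs"] = sample["HS"].str.rstrip("*").astype(int)  # hypothetical column name

print(sample.dtypes)
```

This only matters if `HS` is later used as a numeric feature; the clustering in this notebook uses `SR` and `Ave` only.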


# 2. Clean the data

In [24]:
retail_df.isnull()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False


In [25]:
retail_df.isnull().sum()

Player    0
Span      0
Mat       0
Inns      0
NO        0
Runs      0
HS        0
Ave       0
BF        0
SR        0
dtype: int64

In [26]:
# missing values
round(100*(retail_df.isnull().sum())/len(retail_df), 2)

Player    0.0
Span      0.0
Mat       0.0
Inns      0.0
NO        0.0
Runs      0.0
HS        0.0
Ave       0.0
BF        0.0
SR        0.0
dtype: float64

In [27]:
# drop all rows having missing values
retail_df = retail_df.dropna()
retail_df.shape

(79, 10)

In [28]:
retail_df.head()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96


# 3. Prepare the data for modelling

- R (Recency): Number of days since last purchase
- F (Frequency): Number of transactions
- M (Monetary): Total amount of transactions (revenue contributed)

# removing (statistical) outliers: method 1, used initially; outliers remained afterwards, so use the next method
Q1 = grouped_df.amount.quantile(0.05)
Q3 = grouped_df.amount.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.amount >= Q1 - 1.5*IQR) & (grouped_df.amount <= Q3 + 1.5*IQR)]

# outlier treatment for recency
Q1 = grouped_df.recency.quantile(0.05)
Q3 = grouped_df.recency.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.recency >= Q1 - 1.5*IQR) & (grouped_df.recency <= Q3 + 1.5*IQR)]

# outlier treatment for frequency
Q1 = grouped_df.frequency.quantile(0.05)
Q3 = grouped_df.frequency.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.frequency >= Q1 - 1.5*IQR) & (grouped_df.frequency <= Q3 + 1.5*IQR)]
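
The three blocks above repeat the same filter, so it can be wrapped in one function. The sketch below is a possible refactoring, not the notebook's own code: the helper name `drop_outliers` and the toy data are assumptions. Note that the fences are built from the 5th and 95th percentiles, so the spread is wider than a true interquartile range even though the original variables are called `Q1`, `Q3` and `IQR`.

```python
import pandas as pd

def drop_outliers(df, column, lower_q=0.05, upper_q=0.95, k=1.5):
    """Keep rows whose `column` value lies within the quantile-based fences."""
    low = df[column].quantile(lower_q)
    high = df[column].quantile(upper_q)
    spread = high - low  # 5th-to-95th percentile range, not a true IQR
    mask = (df[column] >= low - k * spread) & (df[column] <= high + k * spread)
    return df[mask]

# Toy data (hypothetical): 40 ordinary values plus one extreme value
toy = pd.DataFrame({"SR": [float(v) for v in range(60, 100)] + [400.0]})

filtered = drop_outliers(toy, "SR")
print(len(toy), "->", len(filtered))
```

The same call could be applied to each feature in turn, for example `drop_outliers(df, "Ave")`, on whichever frame is being clustered.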



In [29]:
# 2. rescaling
rfm_df = retail_df[['SR', 'Ave']]

# instantiate
scaler = StandardScaler()

# fit_transform
rfm_df_scaled = scaler.fit_transform(rfm_df)
rfm_df_scaled.shape

(79, 2)

In [30]:
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['SR', 'Ave']
rfm_df_scaled.head()

Unnamed: 0,SR,Ave
0,0.703152,1.072294
1,-0.044139,0.587725
2,0.110997,0.596226
3,1.207091,-1.047909
4,-0.034,-0.876185
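
As a sanity check on what `StandardScaler` does, the sketch below compares its output with a manual z-score, (x - mean) / std, computed with the population standard deviation (`ddof=0`). It uses only the five rows shown by `head()` earlier, so the scaled values differ from those above, which were fitted on all 79 rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five rows copied from the head() output earlier in this notebook
sample = pd.DataFrame({
    "SR": [86.23, 78.86, 80.39, 91.20, 78.96],
    "Ave": [44.83, 41.98, 42.03, 32.36, 33.37],
})

scaled = StandardScaler().fit_transform(sample)

# Manual z-score using the population standard deviation (ddof=0)
manual = (sample - sample.mean()) / sample.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and unit standard deviation, so neither `SR` nor `Ave` dominates the Euclidean distances that k-means relies on.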


In [31]:
# rfm_df_scaled.shape()
# raises TypeError: 'tuple' object is not callable, because shape is an attribute, not a method

# 4. Modelling

In [32]:
# k-means with an arbitrary k; important: the cluster assignments can change from run to run unless random_state is fixed
kmeans = KMeans(n_clusters=4, max_iter=50, random_state=100)
kmeans.fit(rfm_df_scaled)
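
The value `k=4` above is arbitrary. One common way to compare candidate values of k is to look at the inertia (within-cluster sum of squares) and the silhouette score for each. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_blobs` as a stand-in for the scaled `SR`/`Ave` matrix, and the range of k values is a hypothetical choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled (SR, Ave) matrix: 79 points, 2 features
X, _ = make_blobs(n_samples=79, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, max_iter=50, n_init=10, random_state=100)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: inertia={model.inertia_:.2f}, silhouette={scores[k]:.3f}")

# The k with the highest silhouette score is one reasonable candidate
best_k = max(scores, key=scores.get)
print("highest silhouette at k =", best_k)
```

To run this on the notebook's data, `X` would be replaced by `rfm_df_scaled`. Inertia always decreases as k grows, so it is usually read for an "elbow" rather than a minimum.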



In [33]:
kmeans.labels_

array([2, 2, 2, 3, 1, 2, 2, 2, 1, 2, 2, 2, 3, 0, 1, 0, 1, 2, 2, 2, 2, 1,
       1, 2, 3, 0, 2, 3, 1, 2, 1, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 1,
       1, 1, 2, 1, 1, 2, 3, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 3, 2, 2, 0, 2,
       2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1])
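
The integers in `labels_` are arbitrary identifiers: a different `random_state` can produce the same grouping with the ids permuted. A minimal sketch of how two runs might be compared, assuming well-separated synthetic data (not the cricket data) and using the adjusted Rand index, which ignores how the clusters are numbered:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic, well-separated stand-in data (an assumption, not the cricket data)
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=80, centers=centers, cluster_std=0.5, random_state=0)

labels_a = KMeans(n_clusters=4, n_init=10, random_state=100).fit_predict(X)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

# Raw ids may disagree even when the grouping itself is identical
same_ids = (labels_a == labels_b).mean()
ari = adjusted_rand_score(labels_a, labels_b)
print(f"share of identical ids: {same_ids:.2f}, adjusted Rand index: {ari:.2f}")
```

An adjusted Rand index close to 1 indicates that the two runs found essentially the same partition, regardless of how the clusters are numbered.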

In [34]:

# assign the label
retail_df['cluster_id'] = kmeans.labels_
retail_df.head()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,cluster_id
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,2
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,2
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,2
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,3
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,1
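
Once `cluster_id` is attached, a per-cluster summary helps interpret what each cluster represents. The sketch below uses only the five labelled rows shown above, so the numbers are illustrative rather than the full-data cluster profiles; the variable names `labelled` and `profile` are hypothetical.

```python
import pandas as pd

# Five rows copied from the head() output above, including the assigned cluster_id
labelled = pd.DataFrame({
    "Player": [
        "SR Tendulkar (INDIA)",
        "KC Sangakkara (Asia/ICC/SL)",
        "RT Ponting (AUS/ICC)",
        "ST Jayasuriya (Asia/SL)",
        "DPMD Jayawardene (Asia/SL)",
    ],
    "Ave": [44.83, 41.98, 42.03, 32.36, 33.37],
    "SR": [86.23, 78.86, 80.39, 91.20, 78.96],
    "cluster_id": [2, 2, 2, 3, 1],
})

# Mean strike rate and average per cluster, plus the cluster size
profile = (
    labelled.groupby("cluster_id")[["SR", "Ave"]]
    .agg(["mean", "count"])
    .round(2)
)
print(profile)
```

On the full frame the same call would be `retail_df.groupby("cluster_id")[["SR", "Ave"]].agg(["mean", "count"])`.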


In [35]:
import pandas as pd
pd.set_option("display.max_rows", None)   # show all rows
pd.set_option("display.max_columns", None)

# Increase display width so they don't wrap
pd.set_option("display.width", 1000)

print(retail_df.head(79))


                          Player       Span  Mat  Inns  NO   Runs    HS    Ave     BF      SR  cluster_id
0           SR Tendulkar (INDIA)  1989-2012  463   452  41  18426  200*  44.83  21367   86.23           2
1    KC Sangakkara (Asia/ICC/SL)  2000-2015  404   380  41  14234   169  41.98  18048   78.86           2
2           RT Ponting (AUS/ICC)  1995-2012  375   365  39  13704   164  42.03  17046   80.39           2
3        ST Jayasuriya (Asia/SL)  1989-2011  445   433  18  13430   189  32.36  14725   91.20           3
4     DPMD Jayawardene (Asia/SL)  1998-2015  448   418  39  12650   144  33.37  16020   78.96           1
5      Inzamam-ul-Haq (Asia/PAK)  1991-2007  378   350  53  11739  137*  39.52  15812   74.24           2
6         JH Kallis (Afr/ICC/SA)  1996-2014  328   314  53  11579   139  44.36  15885   72.89           2
7        SC Ganguly (Asia/INDIA)  1992-2007  311   300  23  11363   183  41.02  15416   73.70           2
8      R Dravid (Asia/ICC/INDIA)  1996-2011  3

In [16]:
# help(KMeans)