# Unsupervised learning - K-Means Clustering

Online Retail is a transnational data set that contains all transactions occurring between 01/12/2010 and 09/12/2011 (DD/MM/YYYY) for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

__Link to the dataset and the metadata:__ https://archive.ics.uci.edu/dataset/352/online+retail

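We will use RFM analysis, which describes each customer along three dimensions:
- Recency: how recently the customer made a purchase
- Frequency: how often the customer purchases
- Monetary: how much the customer spends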
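To make the three RFM dimensions concrete, here is a minimal sketch on a toy table. The values are hypothetical; only the column names follow the dataset. The specific definitions used here (days since the last purchase relative to the latest date in the data, number of distinct invoices, and sum of `Quantity * UnitPrice`) are one common choice and an assumption, not necessarily the exact definitions used later in this notebook.

```python
import pandas as pd

# Toy transactions (hypothetical values) using the dataset's column names.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-02-10", "2011-02-20", "2011-03-10",
    ]),
    "Quantity": [2, 1, 5, 3, 1, 10],
    "UnitPrice": [3.0, 10.0, 1.0, 2.0, 4.0, 0.5],
})

# Amount spent per transaction line.
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]

# Assumption: recency is measured against the latest date in the data.
snapshot = toy["InvoiceDate"].max()

rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Amount", "sum"),
)
print(rfm)
```

In this toy example, customer 1 ends up with a Recency of 9 days, a Frequency of 2 invoices, and a Monetary value of 16.0. A lower Recency means a more recent purchase.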

### Let's write down the steps that we will be following:

- Step 1: Read and visualize the data
- Step 2: Clean the data
- Step 3: Data prep for modelling
- Step 4: Modelling
- Step 5: Final analysis and business recommendation
  

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn import metrics


In [31]:
df = pd.read_csv('Online+Retail.csv', encoding="ISO-8859-1")
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01-12-2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 01-12-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01-12-2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01-12-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01-12-2010 08:26 | 3.39 | 17850.0 | United Kingdom |


In [33]:
df.shape

(541909, 8)

In [35]:
df.isnull().sum().any()

True

In [43]:
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [51]:
round(100*df.isnull().sum()/len(df), 3)

InvoiceNo       0.000
StockCode       0.000
Description     0.268
Quantity        0.000
InvoiceDate     0.000
UnitPrice       0.000
CustomerID     24.927
Country         0.000
dtype: float64
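The percentage calculation above can also be written with `isna().mean()`, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a toy frame (hypothetical values) showing that the two forms agree:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; 1 of 4 descriptions and
# 2 of 4 customer IDs are missing.
toy = pd.DataFrame({
    "Description": ["mug", None, "lamp", "card"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# Form used in this notebook: count of missing values over the row count.
pct_a = round(100 * toy.isnull().sum() / len(toy), 3)

# Equivalent form: mean of the boolean "is missing" mask.
pct_b = (toy.isna().mean() * 100).round(3)

print(pct_b)
```

Both expressions give 25.0 for `Description` and 50.0 for `CustomerID` on this toy frame.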

We cannot drop the CustomerID column even though about 25% of its values are missing. If we drop CustomerID, we lose the information that identifies each customer, and we will not be able to do customer segmentation.

So, instead of dropping the entire column, we drop the rows with missing values (about 25% of the data).
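The difference between dropping rows and dropping the column can be sketched on a toy frame. The values are hypothetical; only the column names follow the dataset. `dropna(subset=...)` is shown here as a more targeted alternative that removes only rows missing `CustomerID`; it is an illustration, not necessarily the call this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["mug", None, "lamp", "card"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# Drop only the rows where CustomerID is missing: the column survives.
rows_kept = toy.dropna(subset=["CustomerID"])

# Drop the column instead: all rows survive, but customers are no longer identifiable.
col_dropped = toy.drop(columns=["CustomerID"])

# Drop rows with a missing value in any column.
all_na_dropped = toy.dropna()

print(rows_kept.shape, col_dropped.shape, all_na_dropped.shape)
```

Here `rows_kept` and `all_na_dropped` contain the same two rows, because the only row with a missing `Description` is also missing `CustomerID`. When that is not the case, a plain `dropna()` removes more rows than the `subset` version.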

In [55]:
# Drop all rows that contain missing values.

In [57]:
df = df.dropna()

In [59]:
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [61]:
df.shape

(406829, 8)