---

<center><h1>Online Retail Customer Segmentation RFM Analysis</h1></center>

---

## Process Flow of Project

1. Understanding Problem Statement
2. Getting System Ready
3. Understanding the Data: Data Eyeballing & Data Description
4. Data Cleaning & Preprocessing I
5. Exploratory Data Analysis (EDA)
6. Insights from Data Visualization
7. Feature Engineering
8. Model Building & Evaluation
9. Selection of Best Model & Hyperparameter Tuning
10. Generating Pickle file

## 1) Understanding Problem Statement
---

### Problem Statement:
Online retailers have a vast customer base with diverse shopping behaviors. To enhance marketing strategies and customer engagement, there is a need to segment customers into distinct groups based on their recency, frequency and monetary (RFM) characteristics. This project aims to segment customers effectively to provide tailored marketing campaigns and improve overall business performance.

### Approach:
Because the data is already available, the approach is to preprocess the online retail transaction data, calculate RFM metrics, apply **K-Means** and **Hierarchical clustering** to segment customers, interpret and label the resulting clusters, and use those insights to optimize marketing strategies and enhance customer engagement.

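The RFM step in the approach above can be sketched on a toy table. This is a minimal illustration, not the project's final implementation: the `transactions` frame, the `Amount` column and the `snapshot_date` convention (one day after the latest invoice) are assumptions introduced here, and the column names simply mirror those of the retail data set.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the retail data set.
transactions = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2, 3],
    "Invoice": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-08", "2011-11-01",
        "2011-11-01", "2011-11-20", "2011-06-15",
    ]),
    "Quantity": [2, 1, 5, 3, 1, 10],
    "Price": [3.0, 10.0, 1.0, 2.0, 4.0, 0.5],
})

# Monetary value of each line item (assumed definition: quantity x unit price).
transactions["Amount"] = transactions["Quantity"] * transactions["Price"]

# Assumption: recency is counted from the day after the last recorded invoice.
snapshot_date = transactions["InvoiceDate"].max() + pd.Timedelta(days=1)

grouped = transactions.groupby("Customer ID")
rfm = pd.DataFrame({
    # Days since the customer's most recent purchase.
    "Recency": (snapshot_date - grouped["InvoiceDate"].max()).dt.days,
    # Number of distinct invoices per customer.
    "Frequency": grouped["Invoice"].nunique(),
    # Total amount spent per customer.
    "Monetary": grouped["Amount"].sum(),
})
print(rfm)
```

Each row of `rfm` describes one customer, which is the shape the clustering algorithms expect as input.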
### Objective:
To effectively segment online retail customers based on their RFM characteristics using clustering techniques (K-Means and Hierarchical) to improve targeted marketing, boost revenue, enhance customer retention, optimize inventory management and increase overall customer satisfaction.

### Benefits:
The benefits of this solution include:

- **Improved Customer Engagement:** Tailored marketing campaigns and personalized recommendations for each customer segment lead to higher engagement and conversion rates.

- **Increased Revenue:** Targeting high-value customer segments with the right offers can boost sales and revenue.

- **Customer Retention:** Identifying at-risk and churned customer segments allows for proactive retention efforts.

- **Resource Optimization:** Efficient allocation of resources in inventory management and marketing efforts based on customer preferences.

- **Enhanced Customer Satisfaction:** Providing customers with products and offers that match their preferences leads to higher satisfaction and loyalty.

- **Data-Driven Decision Making:** The project promotes data-driven decision-making, helping the company adapt to changing customer behaviors and market trends.

- **Competitive Advantage:** The ability to understand and cater to customer segments better can provide a competitive edge in the online retail industry.

## 2) Getting System Ready
---

### Import Required Packages
Import pandas, NumPy, Matplotlib and Seaborn, along with the clustering libraries and the `warnings` module.

In [2]:
# import required libraries for dataframe and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt


# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

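As a rough orientation, the clustering imports above are typically combined as follows. This is a sketch on synthetic data, not the analysis itself: the `features` array, the choice of two clusters and the `complete` linkage method are all assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, cut_tree

# Hypothetical data: two well-separated groups of points with three features.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 3)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 3)),
])

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# K-Means with an assumed k=2, scored with the silhouette coefficient.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(scaled)
score = silhouette_score(scaled, kmeans_labels)

# Hierarchical clustering: build the linkage matrix, then cut it into two clusters.
merges = linkage(scaled, method="complete", metric="euclidean")
hier_labels = cut_tree(merges, n_clusters=2).reshape(-1)

print("silhouette:", round(score, 3))
```

In practice the number of clusters would be chosen by comparing silhouette scores (or an elbow plot) over several values of `k` rather than fixed in advance.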
## 3) Understanding the Data: Data Eyeballing & Data Description
---

The Online Retail dataset is a transnational data set containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

[Dataset Source](https://archive.ics.uci.edu/dataset/352/online+retail)

#### The given dataset has two sheets: Year 2009-2010 and Year 2010-2011

In [3]:
retail = pd.read_excel('online_retail_II.xlsx', sheet_name='Year 2010-2011')

In [4]:
retail.head()

| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [5]:
print('The size of Dataframe is: ', retail.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
retail.info()
print('-'*100)

The size of Dataframe is:  (541910, 8)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541910 entries, 0 to 541909
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      541910 non-null  object        
 1   StockCode    541910 non-null  object        
 2   Description  540456 non-null  object        
 3   Quantity     541910 non-null  int64         
 4   InvoiceDate  541910 non-null  datetime64[ns]
 5   Price        541910 non-null  float64       
 6   Customer ID  406830 non-null  float64       
 7   Country      541910 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
----------------------------------------------------------------------------------------------------


In [6]:
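# Split columns by dtype: object columns are treated as categorical, all others as numerical
# (note that this places the datetime column InvoiceDate in the numerical list)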
numeric_features = [feature for feature in retail.columns if retail[feature].dtype != 'O']
categorical_features = [feature for feature in retail.columns if retail[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 4 numerical features : ['Quantity', 'InvoiceDate', 'Price', 'Customer ID']

We have 4 categorical features : ['Invoice', 'StockCode', 'Description', 'Country']


In [7]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=retail.isnull().sum().sort_values(ascending=False)
percent=(retail.isnull().sum()/retail.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

Missing Value Presence in different columns of DataFrame are as follows : 
----------------------------------------------------------------------------------------------------


| | Total | Percent |
|---|---|---|
| Customer ID | 135080 | 24.926648 |
| Description | 1454 | 0.26831 |
| Invoice | 0 | 0.0 |
| StockCode | 0 | 0.0 |
| Quantity | 0 | 0.0 |
| InvoiceDate | 0 | 0.0 |
| Price | 0 | 0.0 |
| Country | 0 | 0.0 |

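Roughly a quarter of the rows have no `Customer ID`, and RFM metrics are computed per customer, so such rows cannot be attributed to anyone. One possible treatment is to drop them; the sketch below shows that option on a hypothetical miniature frame (`sample` and its values are invented for illustration, and dropping is an assumed rule rather than a confirmed step of this project).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the retail frame with gaps in two columns.
sample = pd.DataFrame({
    "Invoice": ["536365", "536366", "536367", "536368"],
    "Description": ["LANTERN", None, "COAT HANGER", "HOT WATER BOTTLE"],
    "Customer ID": [17850.0, np.nan, 13047.0, np.nan],
})

# Share of missing values per column, in percent.
missing_pct = sample.isnull().mean() * 100

# Assumed cleaning rule: keep only rows that can be tied to a customer.
with_customer = sample.dropna(subset=["Customer ID"]).copy()

# With no NaN left, the identifier can be stored as an integer.
with_customer["Customer ID"] = with_customer["Customer ID"].astype(int)

print(missing_pct)
print(with_customer)
```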

In [8]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
retail.describe()

Summary Statistics of numerical features for DataFrame are as follows:
----------------------------------------------------------------------------------------------------


| | Quantity | InvoiceDate | Price | Customer ID |
|---|---|---|---|---|
| count | 541910.0 | 541910 | 541910.0 | 406830.0 |
| mean | 9.552234 | 2011-07-04 13:35:22.342307584 | 4.611138 | 15287.68416 |
| min | -80995.0 | 2010-12-01 08:26:00 | -11062.06 | 12346.0 |
| 25% | 1.0 | 2011-03-28 11:34:00 | 1.25 | 13953.0 |
| 50% | 3.0 | 2011-07-19 17:17:00 | 2.08 | 15152.0 |
| 75% | 10.0 | 2011-10-19 11:27:00 | 4.13 | 16791.0 |
| max | 80995.0 | 2011-12-09 12:50:00 | 38970.0 | 18287.0 |
| std | 218.080957 | | 96.759765 | 1713.603074 |

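The summary above shows negative minimum values for both `Quantity` and `Price`, which would distort any monetary total built from them. A simple way to isolate such rows for inspection is sketched below; the `sample` frame and its invoice labels are hypothetical, and treating non-positive values as invalid is an assumption rather than a rule stated in this notebook.

```python
import pandas as pd

# Hypothetical rows: two ordinary sales, one negative quantity, one negative price.
sample = pd.DataFrame({
    "Invoice": ["X1", "X2", "X3", "X4"],
    "Quantity": [6, -1, 12, 1],
    "Price": [2.55, 27.50, 1.25, -100.00],
})

# Assumed validity rule: both quantity and price must be strictly positive.
is_valid = (sample["Quantity"] > 0) & (sample["Price"] > 0)

suspect_rows = sample[~is_valid]   # rows to inspect before deciding how to treat them
clean_rows = sample[is_valid]      # rows that satisfy the assumed rule

print(suspect_rows)
```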

In [9]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
retail.describe(include= 'object')

Summary Statistics of categorical features for DataFrame are as follows:
----------------------------------------------------------------------------------------------------


| | Invoice | StockCode | Description | Country |
|---|---|---|---|---|
| count | 541910 | 541910 | 540456 | 541910 |
| unique | 25900 | 4070 | 4223 | 38 |
| top | 573585 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | United Kingdom |
| freq | 1114 | 2313 | 2369 | 495478 |


In [10]:
print('-'*125)
print('Checking records for mis-spell, typo-error etc.')
print('-'*125)

print("'Invoice' variable have {} unique category : \n{}\n".format(retail['Invoice'].nunique(), retail['Invoice'].unique()))
print('-'*125)

print("'StockCode' variable have {} unique category : \n{}\n".format(retail['StockCode'].nunique(), retail['StockCode'].unique()))
print('-'*125)

print("'Description' variable have {} unique category : \n{}\n".format(retail['Description'].nunique(), retail['Description'].unique()))
print('-'*125)

print("'Country' variable have {} unique category : \n{}\n".format(retail['Country'].nunique(), retail['Country'].unique()))
print('-'*125)

-----------------------------------------------------------------------------------------------------------------------------
Checking records for mis-spell, typo-error etc.
-----------------------------------------------------------------------------------------------------------------------------
'Invoice' variable have 25900 unique category : 
[536365 536366 536368 ... 581585 581586 581587]

-----------------------------------------------------------------------------------------------------------------------------
'StockCode' variable have 4070 unique category : 
['85123A' 71053 '84406B' ... '90214U' '47591b' 23843]

-----------------------------------------------------------------------------------------------------------------------------
'Description' variable have 4223 unique category : 
['WHITE HANGING HEART T-LIGHT HOLDER' 'WHITE METAL LANTERN'
 'CREAM CUPID HEARTS COAT HANGER' ... 'lost'
 'CREAM HANGING HEART T-LIGHT HOLDER' 'PAPER CRAFT , LITTLE BIRDIE']

------------------