<a href="https://colab.research.google.com/github/keerthana-narra/Online-Retail-Customer-Segmentation/blob/main/Capstone_Project_Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name - Online Retail Customer Segmentation**



##### **Project Type**    - Clustering
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**

In this project, the task is to identify major customer segments in a transnational data set that contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

**What is Customer Segmentation?**

Customer segmentation is the process of categorizing a customer base into distinct groups based on shared characteristics or behaviors. This allows businesses to tailor marketing strategies and services to meet the unique needs of each segment. Effective segmentation enhances personalization and improves overall customer satisfaction.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
from matplotlib import rcParams


from scipy.stats import *
import math

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier
from sklearn.tree import export_graphviz

#!pip install shap==0.40.0
import shap
import graphviz
sns.set_style('darkgrid')

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
#Mount drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Load the dataset
df=pd.read_excel('/content/drive/MyDrive/Almabetter/Masters/Foundation/Module6/Capstone/Online Retail.xlsx')

### Peek into data 👀

In [None]:
#First 5 rows
df.head()

In [None]:
#Last 5 rows
df.tail()

In [None]:
# Dataset Rows & Columns
print(f'Shape of original dataframe:  {df.shape}')

## **2. Understanding Data & Initial Preprocessing**

### **Understanding Data**

In [None]:
#Variables in the dataset
print(f'Variables in the dataset : {list(df.columns)}')

**Variables Description**

* **InvoiceNo   :**  Invoice number. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* **StockCode   :**  Item code. A 5-digit integral number uniquely assigned to each item.
* **Description :**  Item name.
* **Quantity    :**  The quantity of each product (item) per transaction.
* **InvoiceDate :**  The day and time when each transaction was generated.
* **UnitPrice   :**  Product price per unit.
* **CustomerID  :**  Customer ID. A 5-digit integral number unique to each customer.
* **Country     :**  Country name where each customer resides.
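
The cancellation rule in the `InvoiceNo` description can be checked directly. The following is a minimal sketch on made-up rows (not taken from the dataset); the `IsCancelled` column name is an assumption introduced only for illustration.

```python
import pandas as pd

# Toy rows mimicking the documented columns (values are made up for illustration)
sample = pd.DataFrame({
    'InvoiceNo': ['536365', 'C536379', '536381'],
    'StockCode': ['85123A', 'D', '22139'],
    'Quantity': [6, -1, 23],
    'UnitPrice': [2.55, 27.50, 4.25],
})

# An invoice is treated as a cancellation when its number starts with 'C'
# ('IsCancelled' is a hypothetical helper column, not part of the dataset)
sample['IsCancelled'] = sample['InvoiceNo'].astype(str).str.startswith('C')
print(sample[['InvoiceNo', 'Quantity', 'IsCancelled']])
```

Matching on the prefix rather than on any occurrence of the letter keeps the check aligned with the description above.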

In [None]:
# Dataset Info
df.info()

In [None]:
# Dataset Describe
df.describe(include='all')

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f'Count of duplicate rows : {len(df[df.duplicated()])}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)
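
Raw null counts are often easier to compare as percentages of the row count. The sketch below assumes a hypothetical helper named `missing_summary` and uses made-up toy data rather than the retail dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return per-column null counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values('missing_count', ascending=False)

# Toy frame with gaps in two columns (illustrative values only)
toy = pd.DataFrame({
    'CustomerID': [17850.0, np.nan, 13047.0, np.nan],
    'Description': ['LANTERN', None, 'MUG', 'CLOCK'],
    'Quantity': [6, 2, 8, 1],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataframe would give the same table for the real columns.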

### **Initial Preprocessing**
These steps before EDA make the data easier to work with in the rest of the analysis:
1. Remove duplicates
2. Drop rows with null values in identifiers (CustomerID here)
3. Convert data types
4. Drop rows that are cancelled transactions (not required for this analysis)
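
The four steps above can be sketched as a single function. This is an illustrative outline only: the function name `initial_preprocess` and the toy rows are assumptions, and the notebook itself applies the steps one cell at a time.

```python
import numpy as np
import pandas as pd

def initial_preprocess(raw):
    """Apply the four preprocessing steps to a copy of the raw transactions."""
    out = raw.drop_duplicates()                       # 1. remove duplicates
    out = out.dropna(subset=['CustomerID']).copy()    # 2. drop rows without a customer
    id_cols = ['InvoiceNo', 'StockCode', 'CustomerID']
    out[id_cols] = out[id_cols].astype(str)           # 3. identifiers as strings
    cancelled = out['InvoiceNo'].str.startswith('C')  # 4. cancellations start with 'C'
    return out[~cancelled].reset_index(drop=True)

# Toy data: one duplicate row, one cancellation, one missing customer
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', 'C536379', '536381', '536382'],
    'StockCode': ['85123A', '85123A', 'D', '22139', '21754'],
    'Quantity': [6, 6, -1, 23, 3],
    'CustomerID': [17850.0, 17850.0, 14527.0, np.nan, 13047.0],
})
clean = initial_preprocess(toy)
print(clean)
```

Returning a new dataframe instead of modifying the input keeps the raw data available for comparison.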


In [None]:
# Modify dataframe by dropping duplicates based on all columns
df = df.drop_duplicates()
print('Shape of dataset after dropping duplicates',df.shape)

In [None]:
# Drop rows with null values in customer ID
df = df.dropna(subset=['CustomerID'])
print('Shape of dataset after dropping rows with no customer ID',df.shape)

In [None]:
# Data type conversion
columns_to_convert = ['InvoiceNo', 'Description', 'StockCode', 'CustomerID', 'Country']
df[columns_to_convert] = df[columns_to_convert].astype(str)

In [None]:
# Drop cancelled transactions (invoice numbers that start with 'C')
is_cancelled = df['InvoiceNo'].str.startswith('C')
print("Total count of cancelled items are", is_cancelled.sum())
df = df[~is_cancelled]
print('Shape of dataset after dropping cancelled transactions',df.shape)

In [None]:
print("Total Customers:", df['CustomerID'].nunique())
print("Total unique Transactions:", df['InvoiceNo'].nunique())
print("Total distinct Items sold:", df['StockCode'].nunique())
print("Total Countries:", df['Country'].nunique())

## **3. Exploratory Data Analysis**

#### Chart-1 & Chart-2 Monthly Distribution of Transactions & Monthly Distribution of Revenue

In [None]:
# Monthly Distribution of Transactions Over Time
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.set_index('InvoiceDate')['InvoiceNo'].resample('M').nunique().plot(marker='o', color='blue')
plt.title('Monthly Distribution of Transactions')
plt.xlabel('Month')
plt.ylabel('Number of Transactions')

# Monthly Revenue Over Time
plt.subplot(1, 2, 2)
df['Revenue'] = df['Quantity'] * df['UnitPrice']
df = df[df['Revenue'] > 0]
df.set_index('InvoiceDate')['Revenue'].resample('M').sum().div(1000000).plot(marker='o', color='orange')  # Convert to Millions
plt.title('Monthly Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue (in Millions)')

# Add 'M' extension to y-axis ticks
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.1f}M'))

plt.tight_layout()
plt.show()

Chart -1 Distribution of Transactions Over Time

1. Why this chart? This line chart helps visualize the monthly distribution of transactions occurring over time.

2. Insights: Identify patterns or trends in transaction volume over time, detect seasonality, and assess the impact of time on business operations.

3. Business Impact: Understanding the temporal distribution of transactions can help optimize staffing, inventory management, and marketing strategies based on peak and off-peak periods.



Chart 2- Distribution of Monthly Revenue Over Time

1. Why this chart? A line chart illustrates the monthly revenue over time.

2. Insights: Identify revenue trends, seasonality, and potential growth or decline periods.

3. Business Impact: Helps in financial planning, budgeting, and adjusting strategies based on revenue performance.
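
The same monthly figures can also be read as a table, which makes month-over-month changes explicit. Below is a minimal sketch on made-up transactions; the column name `revenue_change_pct` is an assumption added for illustration.

```python
import pandas as pd

# Toy transactions spanning three months (illustrative values only)
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '3', '4'],
    'InvoiceDate': pd.to_datetime(
        ['2011-01-05', '2011-01-05', '2011-01-20', '2011-02-03', '2011-03-15']
    ),
    'Quantity': [2, 1, 5, 10, 4],
    'UnitPrice': [3.0, 4.0, 2.0, 1.5, 5.0],
})
toy['Revenue'] = toy['Quantity'] * toy['UnitPrice']

# Unique invoices and total revenue per calendar month
monthly = toy.groupby(toy['InvoiceDate'].dt.to_period('M')).agg(
    transactions=('InvoiceNo', 'nunique'),
    revenue=('Revenue', 'sum'),
)

# Month-over-month revenue change in percent (first month has no predecessor)
monthly['revenue_change_pct'] = monthly['revenue'].pct_change().mul(100).round(1)
print(monthly)
```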

#### Chart -3 Distribution of Customers Across Countries

In [None]:
# Customer Distribution Across Countries
# Top 5 countries by number of transaction rows
top_countries = df['Country'].value_counts().head(5).index

# Create a new column 'Country_Grouped' to represent the top 5 countries and 'Others'
df['Country_Grouped'] = df['Country'].apply(lambda x: x if x in top_countries else 'Others')

# Count unique customers (not transaction rows) in each country group
customers_by_group = df.groupby('Country_Grouped')['CustomerID'].nunique().sort_values(ascending=False)

# Plot the number of customers in each country group
plt.figure(figsize=(8, 5))
sns.barplot(x=customers_by_group.index, y=customers_by_group.values, palette='viridis')
plt.title('Customer Distribution across countries')
plt.xlabel('Country')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.show()

# 'Country_Grouped' is kept because Chart 10 reuses it
#df.drop('Country_Grouped', axis=1, inplace=True)

1. Why this chart? A bar chart provides a visual representation of the distribution of customers across different countries.

2. Insights: Identify the countries with the highest customer concentration.

3. Business Impact: Tailor marketing strategies, promotions, or product offerings based on the countries with the highest customer base to maximize impact and revenue.

#### Chart - 4 Distribution of quantity of item sold in a transaction

In [None]:
# Quantity of Items Sold Distribution

plt.figure(figsize=(8, 5))
# Quantity is heavily right-skewed, so plot its natural log
sns.histplot(np.log(df['Quantity']), bins=50, kde=True)
plt.title('Distribution of Quantity of Items Sold (log scale)')
plt.xlabel('log(Quantity)')
plt.ylabel('Frequency')
plt.show()

1. Why this chart? A histogram shows the distribution of quantities of items sold.

2. Insights: Understand the distribution of the quantity of items sold, identify outliers.

3. Business Impact: Helps in inventory management by identifying frequently sold quantities and outliers that may need special attention.
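
One common way to make "outliers" concrete is the interquartile-range (IQR) rule. The sketch below is an assumption about how outliers could be flagged, not the method used in this notebook; `iqr_outlier_mask` and the toy quantities are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy quantities with one bulk order (illustrative values only)
quantities = pd.Series([1, 2, 2, 3, 4, 6, 12, 12, 24, 4800], name='Quantity')
mask = iqr_outlier_mask(quantities)
print(quantities[mask])
```

The multiplier `k=1.5` is the conventional default; for a wholesale-heavy customer base a larger value may be more appropriate.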

#### Chart 5 - Top Selling Products - Popular

In [None]:
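# Total quantity sold per product, largest first
df_top = df.groupby('StockCode')['Quantity'].sum().reset_index().sort_values(by='Quantity', ascending=False)
top_products = df_top.head(10)
# Plot against StockCode so the x-axis shows product codes instead of row positions
top_products.plot(kind='bar', x='StockCode', y='Quantity', color='coral', legend=False, figsize=(8, 5))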
plt.title('Top 10 Selling Products')
plt.xlabel('Stock Code')
plt.ylabel('Quantity Sold')
plt.show()

1. Why this chart? A bar chart displays the top-selling products.

2. Insights: Identify the best-performing products in terms of sales volume.

3. Business Impact: Helps in inventory management, highlighting popular products that might need special attention or promotions.

#### Chart 6 - Top Revenue generating products

In [None]:
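# Total revenue per product, largest first
df_rev = df.groupby('StockCode')['Revenue'].sum().reset_index().sort_values(by='Revenue', ascending=False)
top_products = df_rev.head(10)
# Plot against StockCode so the x-axis shows product codes instead of row positions
top_products.plot(kind='bar', x='StockCode', y='Revenue', color='coral', legend=False, figsize=(8, 5))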
plt.title('Top 10 Revenue generating products')
plt.xlabel('Stock Code')
plt.ylabel('Revenue')
plt.show()

1. Why this chart? A bar chart displays the top revenue-generating products.

2. Insights: Identify the best-performing products in terms of revenue generation.

3. Business Impact: Highlights the products that contribute most to revenue, which can guide stocking and promotion decisions.

#### Chart 7 - Customer Purchase Frequency

In [None]:
plt.figure(figsize=(8, 5))
purchase_frequency = df.groupby('CustomerID')['InvoiceNo'].nunique()
sns.histplot(purchase_frequency, bins=50, kde=True)
plt.title('Customer Purchase Frequency')
plt.xlabel('Number of Purchases')
plt.ylabel('Number of Customers')
plt.show()

1. Why this chart? A histogram depicts the distribution of customer purchase frequencies.

2. Insights: Understand how often customers make purchases.

3. Business Impact: Target marketing efforts based on customer segments, such as frequent buyers or occasional shoppers.
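
To turn purchase frequency into named groups such as frequent buyers and occasional shoppers, the counts can be binned. This is a rough sketch: the cut-offs (1, 2 to 5, more than 5 purchases) and the labels are hypothetical and would need to be chosen from the actual distribution.

```python
import pandas as pd

# Toy invoices for three customers (illustrative values only)
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'InvoiceNo':  ['1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
})
purchase_frequency = toy.groupby('CustomerID')['InvoiceNo'].nunique()

# Hypothetical cut-offs: exactly 1 purchase, 2-5 purchases, more than 5
bins = [0, 1, 5, float('inf')]
labels = ['One-time', 'Occasional', 'Frequent']
frequency_segment = pd.cut(purchase_frequency, bins=bins, labels=labels)
print(frequency_segment)
```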

#### Chart 8 - Distribution of transaction by Weekday

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Extract weekday from 'InvoiceDate' and format as day names
df['Weekday'] = df['InvoiceDate'].dt.strftime('%A')

# Create a bar chart for weekday vs. transaction count
plt.figure(figsize=(8, 5))
df.groupby('Weekday')['InvoiceNo'].nunique().sort_values(ascending=True).plot(kind='bar', color='orange')
plt.title('Transaction Count by Weekday')
plt.xlabel('Weekday')
plt.ylabel('Number of Transactions')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal for readability
plt.show()

1. Why this chart?
To visualize the distribution of transaction counts across weekdays, providing insights into sales patterns and identifying peak days.

2. Insights:
The chart highlights varying transaction counts by weekday, offering insights into peak sales days and opportunities for targeted marketing. No transactions are recorded on Saturdays, which may indicate that the business does not operate on that day.

3. Business Impact:
Enables strategic resource allocation, staffing, and marketing efforts, optimizing business operations based on observed transaction patterns.
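
A weekday chart can be easier to read in calendar order, with days that have no transactions shown explicitly as zero. The sketch below uses made-up invoices; `weekday_order` is a name introduced only for illustration.

```python
import pandas as pd

# Toy invoices: Monday, Tuesday (x2), Thursday, Sunday (illustrative values only)
toy = pd.DataFrame({
    'InvoiceNo': ['1', '2', '3', '4', '5'],
    'InvoiceDate': pd.to_datetime(
        ['2011-01-03', '2011-01-04', '2011-01-04', '2011-01-06', '2011-01-09']
    ),
})

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Unique invoices per weekday, reindexed so missing days appear as 0
counts = (
    toy.groupby(toy['InvoiceDate'].dt.day_name())['InvoiceNo']
       .nunique()
       .reindex(weekday_order, fill_value=0)
)
print(counts)
```

With this ordering, a day without any sales shows up as an empty bar instead of being silently absent from the axis.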

#### Chart 9 - Top 15 Customers by Transactions

In [None]:
plt.figure(figsize=(10, 6))
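# Count unique invoices per customer (value_counts() on rows would count line items instead)
df.groupby('CustomerID')['InvoiceNo'].nunique().sort_values(ascending=False).head(15).plot(kind='bar')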
plt.title('Top 15 Customers by Transactions')
plt.xlabel('Customer ID')
plt.ylabel('Number of Transactions')
plt.show()

1. Why this chart? A bar chart displays the top 15 customers based on transaction count.

2. Insights: Identifies the most valuable customers in terms of transaction count.

3. Business Impact: Guides customer relationship management strategies and loyalty programs.

#### Chart 10 - Country vs. Invoice Date vs. Revenue Heatmap

In [None]:
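# Sum Revenue (Quantity * UnitPrice) per country group and month; summing UnitPrice alone would not give revenue
heatmap_data = df.groupby(['Country_Grouped', df['InvoiceDate'].dt.to_period('M')])['Revenue'].sum().unstack().fillna(0)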

plt.figure(figsize=(8, 5))
sns.heatmap(heatmap_data, cmap='viridis')
plt.title('Country vs. Invoice Date vs. Revenue')
plt.xlabel('Date')
plt.ylabel('Country')
plt.show()

1. Why this chart? A heatmap explores how revenue varies across different countries and dates.

2. Insights: Identifies patterns in revenue generation over time and in different regions.

3. Business Impact: Guides international marketing strategies and helps plan for revenue fluctuations.
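
Because one country can dominate the colour scale, it may help to look at each country's monthly share of its own revenue and read off the peak month. This is a minimal sketch on made-up data, assuming a `Revenue` column computed as `Quantity * UnitPrice`; `share` and `peak_month` are names introduced for illustration.

```python
import pandas as pd

# Toy revenue rows for two country groups (illustrative values only)
toy = pd.DataFrame({
    'Country_Grouped': ['United Kingdom', 'United Kingdom', 'France', 'France', 'France'],
    'InvoiceDate': pd.to_datetime(
        ['2011-10-10', '2011-11-12', '2011-10-01', '2011-10-20', '2011-11-05']
    ),
    'Revenue': [100.0, 300.0, 50.0, 30.0, 20.0],
})

# Revenue per country group and month, months as columns
revenue_by_month = (
    toy.groupby(['Country_Grouped', toy['InvoiceDate'].dt.to_period('M')])['Revenue']
       .sum()
       .unstack(fill_value=0)
)

# Each row divided by its own total, so every country sums to 1
share = revenue_by_month.div(revenue_by_month.sum(axis=1), axis=0)

# Month with the largest share for each country group
peak_month = share.idxmax(axis=1)
print(share.round(2))
print(peak_month)
```

Passing `share` to `sns.heatmap` instead of the raw totals would put all country groups on a comparable scale.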

### Overall Insights

1. Customers are most active in September, October, and November. In September and October, customers tend to buy higher-priced items than in November.
2. Most customers are expected to come from five countries: the UK, France, Germany, Spain, and EIRE.
3. Customer segmentation based on the quantity of items bought is feasible because the distribution is very wide.
4. Customer purchase frequency can be another basis for segmentation.
5. No transactions are recorded on Saturdays. Thursday appears to be the most popular day for purchases.
6. The top 15 customers by transactions, the top 10 products by quantity sold, and the top 10 products by revenue are identified.
7. The UK purchased mostly in November, France mostly in October, and EIRE mostly in March.