# Introduction

Today, we'll learn how to perform customer segmentation using **Python** while leveraging the **RFM** framework.

## Dataset

[Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail)

This Online Retail data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between `01-Dec-2010` and `09-Dec-2011`. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

Attribute Information:

- `InvoiceNo`: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- `StockCode`: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- `Description`: Product (item) name. Nominal.
- `Quantity`: The quantities of each product (item) per transaction. Numeric.
- `InvoiceDate`: Invoice date and time. Numeric. The day and time when a transaction was generated.
- `UnitPrice`: Unit price. Numeric. Product price per unit in sterling (£).
- `CustomerID`: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- `Country`: Country name. Nominal. The name of the country where a customer resides.

## RFM

**RFM** is commonly used to identify customers who are likely to buy more frequently, spend more, or return to the company.

RFM stands for the three dimensions:
- `Recency` – How recently did the customer purchase?
- `Frequency` – How often do they purchase?
- `Monetary` – How much do they spend?
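
To make the three dimensions concrete, here is a minimal sketch on a hypothetical toy table (the values and the `Amount` column are made up for illustration and are not taken from the Online Retail data). It assumes recency is measured in days from the latest invoice in the table.

```python
import pandas as pd

# Hypothetical toy transactions, for illustration only
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-01-20",
        "2011-02-01", "2011-02-15", "2011-03-10",
    ]),
    "Amount": [20.0, 35.0, 10.0, 5.0, 7.5, 12.5],
})

# Assumption: the anchor date is the latest invoice date in the table
anchor = toy["InvoiceDate"].max()

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (anchor - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                          # number of distinct invoices
    Monetary=("Amount", "sum"),                                  # total amount spent
)
print(toy_rfm)
```

Customer 3 bought most recently and most often, while customer 1 spent the most, which is exactly the kind of trade-off RFM scoring is meant to surface.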

# Setup

Import all necessary packages. Normally, you would import these packages at the top of the notebook.

|Packages|Purpose|
|:-|:-|
|<ul><li>`pandas`</li><li>`numpy`</li></ul>|Data wrangling and analysis|
|<ul><li>`matplotlib`</li><li>`plotly.express`</li><li>`seaborn`</li></ul>|Data visualisation|

In [None]:
# !pip install --upgrade plotly
# !pip install plotly-express

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# visualisation
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# to make subplots
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# better matplotlib plots
%matplotlib inline
%config InlineBackend.figure_format="svg"

# Load data

Let's download the dataset from the link provided above and save it in a `data` folder. You can also save it in a different location if you prefer. 

This is where I store my data.

![image](../../images/tree_directory.png)

Once the dataset is available on your local machine, let's load it using the `pd.read_csv` function, since the data is in **CSV** format.

```python
data = pd.read_csv("../../data/OnlineRetail.csv", encoding="ISO-8859-1")
```

**Exercise**

Can you use `shape` to check the shape of the data and print out the number of **rows** and **columns** in the dataset?

In [None]:
# [TODO]


**Exercise**

Normally, you would like to **LOOK** at the dataset before doing any work. How do you display **5** random rows from the dataset?

In [None]:
# [TODO]


**Exercise**

Can you use `.info` to print out the data types of each column?

Which columns have missing data?

In [None]:
# [TODO]


**Exercise**

Let's use `describe()` to print out the summary statistics of the dataset.

In [None]:
# [TODO]


In [None]:
# [TODO]


# EDA

**Exercise**

Which countries do the customers in the dataset come from?

**Hint**: Use `.groupby` to group the data by `Country`, select `CustomerID`, and then use `.nunique()` to count the number of unique customers per country.
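
As a rough sketch of the pattern in the hint, the snippet below counts unique customers per country on a hypothetical toy frame and turns the counts into percentages. It assumes each customer belongs to a single country; if a customer appears under several countries, the shares would need a different denominator.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France", "France"],
    "CustomerID": [1, 1, 2, 3, 3],
})

# Number of distinct customers in each country, largest first
customers_per_country = (
    toy.groupby("Country")["CustomerID"]
    .nunique()
    .sort_values(ascending=False)
)

# Share of customers per country, in percent
share = customers_per_country / customers_per_country.sum() * 100
print(share.round(2))
```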

In [None]:
# [TODO]


**Exercise**

Can you calculate the percentage of customers coming from the UK?

In [None]:
# [TODO]


Let's visualise the number of customers per country using a **bar plot**.

**Exercise**

What's the percentage of cancelled transactions out of total transactions?

**Hint**: Cancelled transactions are denoted by an `InvoiceNo` starting with the letter `c`. Convert the `InvoiceNo` column to a `string` and use `.str.startswith` (or `.str.contains`) to check whether the `InvoiceNo` starts with the letter `c`.
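
One possible way to build such a flag is sketched below on hypothetical invoice numbers. Lower-casing before the check is an assumption on our side, made so that both `c` and `C` prefixes are caught. The percentage here is computed over rows; counting unique invoices instead would be an equally reasonable reading of the question.

```python
import pandas as pd

# Hypothetical invoice numbers mixing integers and strings, for illustration only
toy = pd.DataFrame({"InvoiceNo": [536365, "C536379", "c536380", 536381]})

# Convert to text first so that the .str accessor works on every row
invoice_as_text = toy["InvoiceNo"].astype(str)

# Assumption: match the prefix regardless of letter case
is_cancelled = invoice_as_text.str.lower().str.startswith("c")

# The mean of a boolean mask is the fraction of True values
cancelled_pct = is_cancelled.mean() * 100
print(f"{cancelled_pct:.1f}% of rows are cancellations")
```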

In [None]:
# [TODO]


# Data Preprocessing

**Exercise**

What's the data type of the `InvoiceDate` column?

**Hint**: Use `.dtypes` to check the data types of all columns in the dataframe.

In [None]:
# [TODO]


Take 5 samples of the `InvoiceDate` column.

**Exercise**

How do we ensure that `InvoiceDate` is correctly of type `datetime`?

In addition, can you convert `CustomerID` from `float` to `int`? Missing values will be ignored and kept as is. We'll handle them later.

**Hint**: Use `pd.to_datetime` and `astype`
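
A minimal sketch of both conversions on a hypothetical toy frame follows. Two things are assumptions here: the date format string (check a few samples of your own `InvoiceDate` column first) and the choice of the nullable `Int64` dtype, which we use because a plain `int` column cannot hold missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "12/9/2011 12:50"],
    "CustomerID": [17850.0, np.nan],
})

# Assumption: dates are stored as month/day/year hour:minute
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

# Nullable integer dtype: 17850.0 becomes 17850, NaN becomes <NA>
toy["CustomerID"] = toy["CustomerID"].astype("Int64")

print(toy.dtypes)
```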

In [None]:
# [TODO]


**Exercise**

An `InvoiceNo` starting with `c` indicates a cancellation. How do we remove these records from the dataset?

In [None]:
# [TODO]
print(f"Before: {data.shape[0]} rows and {data.shape[1]} columns.")

# INSERT YOUR CODE HERE

print(f"After : {data.shape[0]} rows and {data.shape[1]} columns.")

**Exercise**

How do you drop the records with missing values in the dataset?

**NOTE**: There are many ways to deal with missing values in a data science project. Dropping the records with missing values is the simplest method.
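
For reference, here is a small sketch of how `dropna` behaves on a hypothetical toy frame. It shows two variants: dropping any row with a missing value, and dropping only rows where a chosen column (here `CustomerID`) is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP"],
    "CustomerID": [17850.0, 13047.0, np.nan],
})

# Drop every row that has at least one missing value
complete_rows = toy.dropna()

# Drop only the rows where CustomerID is missing
with_customer = toy.dropna(subset=["CustomerID"])

print(len(toy), len(complete_rows), len(with_customer))
```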

In [None]:
# [TODO]
print(f"Before: {data.shape[0]} rows and {data.shape[1]} columns.")

# INSERT YOUR CODE HERE

print(f"After : {data.shape[0]} rows and {data.shape[1]} columns.")

After removing cancelled transactions and records with a missing `CustomerID`, there are no longer any transactions with negative `Quantity`.

**Exercise**

Verify that there are no longer any missing values in the dataset.

In [None]:
# [TODO]


# RFM Analysis

## Recency

**Recency**: How recently did the customer purchase?

In order to answer this question, we need to have an anchor date. We'll take the last `InvoiceDate` as the anchor date.

**Hint**:
```python
LAST_INVOICE_DATE = data["InvoiceDate"].max()
```

**Recency** will be calculated as the number of days between the **anchor date** and the last `InvoiceDate` of each customer.
- Firstly, we need to find the last `InvoiceDate` of each customer.
- Secondly, we'll calculate the time difference between the **anchor date** and the last `InvoiceDate` of each customer. 
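
The two steps above can be sketched as follows on a hypothetical toy frame. The variable names `anchor_date`, `last_purchase` and `recency` are our own choices for illustration.

```python
import pandas as pd

# Hypothetical toy transactions, for illustration only
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-10-10"]),
})

# Anchor date: the last invoice date in the whole table
anchor_date = toy["InvoiceDate"].max()

# Step 1: last invoice date of each customer
last_purchase = toy.groupby("CustomerID")["InvoiceDate"].max()

# Step 2: days between the anchor date and each customer's last purchase
recency = (anchor_date - last_purchase).dt.days.rename("Recency")
print(recency)
```

The customer who placed the very last order always gets a recency of `0` with this choice of anchor date.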

In [None]:
# find the last InvoiceDate for each customer


# calculate the time difference between the anchor and the last InvoiceDate of each customer


In [None]:
# have a look at the recency we just calculated


In [None]:
# Check the shape of recency data


Let's take a look at the `recency` distribution. We'll visualise the distribution using both a **histogram** and a **box plot**.

## Frequency

**Exercise**

**Frequency** will be calculated as the number of purchases (invoices) each customer made.

**Hint**: It will be the count of unique `InvoiceNo` per `CustomerID`.
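
A minimal sketch of this count on a hypothetical toy frame is shown below. Note that one invoice usually spans several rows (one per product), which is why we count unique invoice numbers rather than rows.

```python
import pandas as pd

# Hypothetical toy rows: invoice 536365 contains two products
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
})

# Number of distinct invoices per customer
frequency = toy.groupby("CustomerID")["InvoiceNo"].nunique().rename("Frequency")
print(frequency)
```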

In [None]:
# [TODO]


In [None]:
# have a look at the frequency we just calculated


In [None]:
# Check the shape of frequency data


**Exercise**

You have learned how to visualise **recency** data above. Can you do the same for **frequency**?

In [None]:
# [TODO]


## Monetary

**Exercise**

**Monetary** will be the total amount of money the customer spent.

Can you calculate **monetary**?

**Hint**: 
- Use columns `Quantity` and `UnitPrice` to calculate the total amount of money spent by each customer.
- Use `groupby` to group the data by `CustomerID` and then use `sum` to calculate the total amount of money spent by each customer.
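
The two hints can be sketched on a hypothetical toy frame as follows. The intermediate column name `TotalPrice` is our own choice for illustration.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Quantity": [6, 2, 10],
    "UnitPrice": [2.55, 3.39, 1.25],
})

# Monetary value of each row: quantity times unit price
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Total amount spent by each customer
monetary = toy.groupby("CustomerID")["TotalPrice"].sum().rename("Monetary")
print(monetary)
```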

In [None]:
# [TODO]

# calculate monetary value of each purchase


# find the total amount spent for each customer


In [None]:
# Let's take a look at the monetary data we just calculated


In [None]:
# Check the shape of monetary data


**Exercise**

You have learned how to visualise **recency** data above. Can you do the same for **monetary**?

In [None]:
# [TODO]


## Combine RFM

**Exercise**

Can you combine **Recency**, **Frequency** and **Monetary** into a single dataframe?

**Hint**: Use `pd.concat`
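
As a sketch, `pd.concat` with `axis=1` places Series side by side and aligns them on their index. The three toy Series below are hypothetical and are indexed by customer ID.

```python
import pandas as pd

# Hypothetical per-customer Series, all indexed by customer ID
recency = pd.Series({1: 9, 2: 49}, name="Recency")
frequency = pd.Series({1: 2, 2: 1}, name="Frequency")
monetary = pd.Series({1: 55.0, 2: 10.0}, name="Monetary")

# axis=1 puts each Series in its own column, aligned on the index
toy_rfm = pd.concat([recency, frequency, monetary], axis=1)
toy_rfm.index.name = "CustomerID"
print(toy_rfm)
```

Because alignment happens on the index, this only works as intended when all three Series use `CustomerID` as their index.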

In [None]:
# [TODO]


In [None]:
# Let's take a look at the combined RFM data


In [None]:
# Check the shape of RFM data


**Recency**, **Frequency** and **Monetary** all appear to contain a lot of outliers. 

Therefore, we'll handle the outliers by flooring the values of **Recency**, **Frequency** and **Monetary** at the `10th` percentile and capping them at the `90th` percentile.
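
One way to do this, sketched on a hypothetical toy Series, is to compute the two percentile values with `quantile` and then limit the data with `clip`. This is only one possible approach; the constants mirror the `FLOORING` and `CAPPING` values used in this notebook.

```python
import pandas as pd

FLOORING = 0.1
CAPPING = 0.9

# Hypothetical values with one extreme outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], name="Monetary")

# Percentile thresholds computed from the data itself
lower = values.quantile(FLOORING)
upper = values.quantile(CAPPING)

# Values below `lower` are raised to it, values above `upper` are lowered to it
capped = values.clip(lower=lower, upper=upper)
print(capped.min(), capped.max())
```

The number of rows stays the same; only the extreme values are pulled towards the thresholds.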

In [None]:
# treating outliers
FLOORING = 0.1
CAPPING = 0.9



In [None]:
# Let's take a look at the combined RFM data without outliers


**Exercise**

Since we keep both the original and fixed RFM data, can you use a **box plot** to visualise the changes in the **Recency**, **Frequency** and **Monetary** columns?
The final graph should look like this:

![image](../../images/rfm_original_fixed.png)

In [None]:
# [TODO]


Uncomment the code below to look at another way to visualise the same thing using `seaborn`.

In [None]:
# fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 10), sharey=True)

# for r, key in enumerate(["Recency", "Frequency", "Monetary"]):
#     y_axis = axes[r, 0].axes.get_yaxis()
#     y_axis.set_visible(False)

#     axes[r, 0].set_title(f"Original: {key}")
#     sns.boxplot(ax=axes[r, 0], data=rfm_data[f"{key}_Original"], orient="h")

#     axes[r, 1].set_title(f"Fixed: {key}")
#     sns.boxplot(ax=axes[r, 1], data=rfm_data[key], orient="h", color="green")

# RFM Score

RFM scores are defined such that a score of `5` is the `best` and a score of `1` is the `worst`.
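
A possible way to assign such scores is to split each dimension into five equally sized groups with `pd.qcut`. The sketch below uses a hypothetical toy frame; the column names `RFMLabel` and `RFMScore` are our own choices for illustration. Ranking frequency with `method="first"` is one way to break ties so that `qcut` does not fail on duplicate bin edges.

```python
import pandas as pd

# Hypothetical toy RFM values, for illustration only
toy_rfm = pd.DataFrame({
    "Recency": [1, 5, 10, 20, 40, 60, 90, 120, 200, 300],
    "Frequency": [1, 1, 1, 1, 2, 2, 3, 5, 8, 13],
    "Monetary": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Lower recency is better, so the labels run from 5 down to 1
toy_rfm["RScore"] = pd.qcut(toy_rfm["Recency"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)

# Many customers share the same frequency, so rank first to break ties
toy_rfm["FScore"] = pd.qcut(
    toy_rfm["Frequency"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]
).astype(int)

# Higher spend is better
toy_rfm["MScore"] = pd.qcut(toy_rfm["Monetary"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Three-digit label such as "511" and the sum of the three scores
toy_rfm["RFMLabel"] = (
    toy_rfm["RScore"].astype(str) + toy_rfm["FScore"].astype(str) + toy_rfm["MScore"].astype(str)
)
toy_rfm["RFMScore"] = toy_rfm[["RScore", "FScore", "MScore"]].sum(axis=1)
print(toy_rfm)
```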

In [None]:
# most recent purchase should receive the highest score


# the more purchases a customer has made, the higher the score
# Some customers will have the same number of purchases, so we'll rank the frequency score based on the first appearance


# the more money a customer has spent, the higher the score


In [None]:
# Combine the scores into 1 single label column for each customer


# Calculate the total RFM Score for each customer


In [None]:
# best customers


In [None]:
# worst customers


# Segmentation with RFM

Refer to [RFM Segmentation - Business Use](https://documentation.bloomreach.com/engagement/docs/rfm-segmentation) for more information.

|No|Customer Segment|Activity|Actionable Tip|
|:-:|:-|:-|:-|
|1|Champion|Bought recently, order often and spend the most.|Reward them. Can be early adopters of new products. Will promote your brand. Most likely to send referrals.|
|2|Loyal|Orders regularly. Responsive to promotions.|Upsell higher value products. Ask for reviews.|
|3|Potential Loyalist|Recent customers, and spent a good amount.|Offer membership / loyalty program. Keep them engaged. Offer personalised recommendations.|
|4|New Customers|Bought most recently.|Provide on-boarding support, give them early access, start building relationship.|
|5|Promising|Potential loyalist a few months ago. Spends frequently and a good amount. But the last purchase was several weeks ago.|Offer coupons. Bring them back to the platform and keep them engaged. Offer personalised recommendations.|
|6|Core|Standard customers with not too long-ago purchase.|Make limited time offers.|
|7|Needs attention|Core customers whose last purchase happened more than one month ago.|Make limited time offers. Offer personalised recommendations.|
|8|Cannot Lose Them|Made the largest orders, and often. But haven’t returned for a long time.|Win them back via renewals or newer products, don’t lose them to competition. Talk to them if necessary. Spend time on highest possible personalisation.|
|9|At Risk|Similar to “Cannot Lose Them” but with smaller monetary and frequency value.|Provide helpful resources on the site. Send personalised emails.|
|10|Hibernating|Made their last purchase a long time ago but in the last 4 weeks either visited the site or opened an email.|Make subject lines of emails very personalised. Revive their interest by a specific discount on a specific product.|
|11|Lost|Made their last purchase a long time ago and didn’t engage at all in the last 4 weeks.|Revive interest with a reach-out campaign. Ignore otherwise.|

**Exercise**

Use the segmentation map below to perform customer segmentation.

```python
segmentation_map = {
    r'555|554|544|545|454|455|445': 'Champions',
    r'543|444|435|355|354|345|344|335': 'Loyal',
    r'553|551|552|541|542|533|532|531|452|451|442|441|431|453|433|432|423|353|352|351|342|341|333|323': 'Potential Loyalist',
    r'512|511|422|421|412|411|311': 'New Customers',
    r'525|524|523|522|521|515|514|513|425|424|413|414|415|315|314|313': 'Promising',
    r'535|534|443|434|343|334|325|324': 'Need Attention',
    r'331|321|312|221|213|231|241|251': 'About To Sleep',
    r'255|254|245|244|253|252|243|242|235|234|225|224|153|152|145|143|142|135|134|133|125|124': 'At Risk',
    r'155|154|144|214|215|115|114|113': 'Cannot Lose Them',
    r'332|322|233|232|223|222|132|123|122|212|211': 'Hibernating',
    r'111|112|121|131|141|151': 'Lost',
}
```

The segment should be stored in a new column `RFMSegment` in the dataframe.

```python
rfm_data['RFMSegment'] = ...
```
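
Since the keys of the map are regular expressions, one option is `Series.replace` with `regex=True`. The sketch below uses a shortened, hypothetical subset of the map and assumes the three-digit label is stored as a string in a column we call `RFMLabel` (a name chosen for illustration).

```python
import pandas as pd

# Shortened, hypothetical subset of the segmentation map, for illustration only
toy_map = {
    r"555|554": "Champions",
    r"111|112": "Lost",
}

# Hypothetical three-digit RFM labels stored as strings
toy_rfm = pd.DataFrame({"RFMLabel": ["555", "112", "554"]})

# Each regular expression key is replaced by its segment name
toy_rfm["RFMSegment"] = toy_rfm["RFMLabel"].replace(toy_map, regex=True)
print(toy_rfm)
```

Labels that match none of the patterns are left unchanged, so it is worth checking the result for leftover three-digit strings.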

In [None]:
# [TODO]


**Exercise**

Let's calculate the `median`, `mean` and `std` of **Recency**, **Frequency**, **Monetary** for each segment in `RFMSegment` column.
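
A minimal sketch of this aggregation on a hypothetical toy frame is shown below. Passing a list of function names to `agg` produces one column per combination of metric and statistic.

```python
import pandas as pd

# Hypothetical toy RFM data with segments, for illustration only
toy_rfm = pd.DataFrame({
    "RFMSegment": ["Champions", "Champions", "Lost", "Lost"],
    "Recency": [1, 3, 200, 300],
    "Frequency": [10, 14, 1, 1],
    "Monetary": [500.0, 700.0, 20.0, 40.0],
})

# One row per segment, with (metric, statistic) pairs as columns
segment_stats = (
    toy_rfm.groupby("RFMSegment")[["Recency", "Frequency", "Monetary"]]
    .agg(["median", "mean", "std"])
)
print(segment_stats)
```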

In [None]:
# [TODO]


# Visualisation

**Exercise**

The distributions of the Recency, Frequency, and Monetary scores need to follow the same direction, since we assign `1` to the worst customers and `5` to the best.

Can you verify this by plotting the distribution?

In [None]:
# [TODO]


Uncomment the code below to look at another way to visualise the same thing using `seaborn`.

In [None]:
# fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 10))

# for r, key in enumerate(["Recency", "Frequency", "Monetary"]):
#     sns.histplot(ax=axes[r, 0], data=rfm_data[key], bins=5, kde=True, legend=False)

#     y_axis_ax1 = axes[r, 1].axes.get_yaxis()
#     y_axis_ax1.set_visible(False)
#     axes[r, 1].set_xlabel(f"{key}")
#     sns.boxplot(ax=axes[r, 1], data=rfm_data[key], orient="h")

Let's see if Recency, Frequency and Monetary are correlated to each other using a **heatmap**.
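
The numbers behind such a heatmap come from a correlation matrix. Here is a small sketch on hypothetical toy values; the plotting call is left as a comment and assumes `seaborn` has been imported as `sns`.

```python
import pandas as pd

# Hypothetical toy RFM values, for illustration only
toy_rfm = pd.DataFrame({
    "Recency": [1, 10, 50, 100, 300],
    "Frequency": [12, 8, 4, 2, 1],
    "Monetary": [900.0, 600.0, 250.0, 120.0, 30.0],
})

# Pairwise Pearson correlation between the three columns
corr = toy_rfm.corr()
print(corr.round(2))

# Assumption: seaborn is available as `sns`
# sns.heatmap(corr, annot=True)
```

In this toy data, customers who bought long ago also bought rarely and spent little, so Recency is negatively correlated with the other two columns.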

Let's visualise the customers based on the RFM Segment on a 3D **scatter plot**.

Let's draw a **treemap** of the customers based on the RFM Segment.

**Exercise**

**VERY HARD** 🤯

Plot the distribution of **Monetary** for each combination of **Recency** and **Frequency** score. 

**NOTE**: Ideally, we are expecting to see a diagonal line having the green bar increasing from 1 to 5 for **Monetary**.

**Hint**: Use loops to iterate over the different combinations of R and F scores.

The final result should look like this:

![image](../../images/rfm_M_for_each_R_and_F.png)

In [None]:
# [TODO]


Uncomment the code below to look at another way to visualise the same thing using `seaborn`.

In [None]:
# fig, axes = plt.subplots(nrows=5, ncols=5,
#                          sharex=False, sharey=True,
#                          figsize=(10, 10))

# for r in range(1, 6):
#     for f in range(1, 6):
#         y = rfm_data[(rfm_data["RScore"] == r) & (rfm_data["FScore"] == f)]["MScore"].value_counts().sort_index()
#         x = y.index
#         ax = axes[r - 1, f - 1]
#         bars = ax.bar(x, y, color="grey")
#         if r == 5:
#             if f == 3:
#                 ax.set_xlabel(f"{f}\nF", va='top')
#             else:
#                 ax.set_xlabel(f"{f}\n", va='top')
#         if f == 1:
#             if r == 3:
#                 ax.set_ylabel(f"R\n{r}")
#             else:
#                 ax.set_ylabel(r)
#         ax.set_frame_on(False)
#         ax.tick_params(left=False, labelleft=False, bottom=False)
#         ax.set_xticks(x)
#         ax.set_xticklabels(x, fontsize=8)

#         for bar in bars:
#             value = bar.get_height()
#             if value == y.max():
#                 bar.set_color("green")
#             ax.text(bar.get_x() + bar.get_width() / 2,
#                     value,
#                     int(value),
#                     ha="center",
#                     va="bottom",
#                     color="k")
# fig.suptitle("M distribution for each F and R", fontsize=14)
# plt.tight_layout()
# plt.show()