# Introduction

This analysis presents a complete customer segmentation and predictive analytics workflow using transaction data.

The goal is to convert raw purchase data into actionable insights that inform marketing strategies, improve customer retention, and support revenue forecasting.


## 1. Dataset Source and Description

The dataset used in this project was obtained from the **UC Irvine Machine Learning Repository**:

**Online Retail II Dataset**  
Donated: September 20, 2019  
Source: UC Irvine Machine Learning Repository  

🔗 [https://archive.ics.uci.edu/dataset/502/online+retail+ii](https://archive.ics.uci.edu/dataset/502/online+retail+ii)

This dataset contains **two years of real online retail transactions** from a UK-based, non-store online retailer, covering the period **December 2009 to December 2011**. The company primarily sells unique, all-occasion gift products, with many customers being wholesalers.

The dataset is well-suited for **customer analytics and predictive modeling**, and supports tasks such as:

- Classification  
- Regression  
- Clustering  

It includes a mix of **transactional, temporal, and categorical features**, making it ideal for RFM analysis, customer segmentation, and CLV modeling.

#### Variable Description

| Field | Description |
|-------------|------------|
| InvoiceNo    | Unique invoice number for each transaction (appears as `Invoice` in the loaded data). If it starts with "C", it indicates a cancellation. |
| StockCode    | Unique product identifier. |
| Description  | Product name. |
| Quantity     | Number of items purchased in a transaction. |
| InvoiceDate  | Date and time when the transaction occurred. |
| UnitPrice    | Price per unit in British Pounds (£) (appears as `Price` in the loaded data). |
| CustomerID   | Unique customer identifier (appears as `Customer ID` in the loaded data). |
| Country      | Customer's country of residence. |


## 2. Imports

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yaml
import os
import openpyxl

## 3. Load Config

In [28]:
with open("../config.yaml","r") as file:
    config = yaml.safe_load(file)

data_path = os.path.join("..", config["paths"]["raw_data"])

## 4. Data Loading & Exploration (EDA)

### Dataset Overview:

The dataset contains two yearly transaction sheets, representing consecutive retail periods.

 - Year 2009-2010
 - Year 2010-2011

We merge the two sheets later, provided their fields match.

**Below is a breakdown of the content of each field**

| Column      | Notes                                                           |
| ----------- | --------------------------------------------------------------- |
| Invoice     | Object; includes normal, cancellation, and adjustment invoices  |
| StockCode   | Mixed formats                                                   |
| Description | Some missing values                                             |
| Quantity    | Contains negative values                                        |
| InvoiceDate | Proper datetime                                                 |
| Price       | Contains negative values                                        |
| Customer ID | ~243k missing                                                   |
| Country     | Mostly UK                                                       |

> **Loading the Data**

In [29]:
dfs = pd.read_excel(data_path, sheet_name=None)  # None = load all sheets
print("Sheets loaded:", list(dfs.keys()))

df1 = dfs[list(dfs.keys())[0]]  # first tab
df2 = dfs[list(dfs.keys())[1]]  # second tab

print(df1.head(3))
print(df2.head(3))

Sheets loaded: ['Year 2009-2010', 'Year 2010-2011']
  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   

          InvoiceDate  Price  Customer ID         Country  
0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  
1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
  Invoice StockCode                         Description  Quantity  \
0  536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER         6   
1  536365     71053                 WHITE METAL LANTERN         6   
2  536365    84406B      CREAM CUPID HEARTS COAT HANGER         8   

          InvoiceDate  Price  Customer ID         Country  
0 2010-12-01 08:26:00   2.55      17850.0  United Kingdom  
1 2010-12-01 08:26:00   3.39  

> **Merging Datasets**
 - Both years share identical structure, so we safely merge them into a single transactional dataset.

In [30]:
# Checking if the columns are the same then merging
columns_match = df1.columns.equals(df2.columns)
print("Columns Match:", columns_match)

if columns_match:
    df = pd.concat([df1, df2], axis=0, ignore_index=True)
    print(df.head(3))
else:
    print("Columns do not match. Please check the data.")

Columns Match: True
  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   

          InvoiceDate  Price  Customer ID         Country  
0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  
1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
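
If it later matters which sheet a row came from, the same merge can carry that label along. The following is a minimal sketch on toy frames, assuming a dict of DataFrames shaped like the `dfs` returned by `pd.read_excel(..., sheet_name=None)`; the `SourceSheet` column name is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
toy_sheets = {
    "Year 2009-2010": pd.DataFrame({"Invoice": ["489434"], "Quantity": [12]}),
    "Year 2010-2011": pd.DataFrame({"Invoice": ["536365", "536366"], "Quantity": [6, 8]}),
}

# Confirm every sheet has the same columns in the same order before stacking
first_cols = next(iter(toy_sheets.values())).columns
assert all(frame.columns.equals(first_cols) for frame in toy_sheets.values())

# Passing the dict to concat uses the sheet names as an outer index level,
# which is then moved into a regular column (SourceSheet is a hypothetical name)
merged = (
    pd.concat(toy_sheets, names=["SourceSheet", "row"])
    .reset_index(level="SourceSheet")
    .reset_index(drop=True)
)
print(merged)
```

Keeping the sheet label is optional for RFM, but it makes it easy to compare the two retail periods after the merge.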


#### Data Exploration:

**a) Data Overview**

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   Invoice      1067371 non-null  object        
 1   StockCode    1067371 non-null  object        
 2   Description  1062989 non-null  object        
 3   Quantity     1067371 non-null  int64         
 4   InvoiceDate  1067371 non-null  datetime64[ns]
 5   Price        1067371 non-null  float64       
 6   Customer ID  824364 non-null   float64       
 7   Country      1067371 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 65.1+ MB


**b) Numerical Summary**

- **Business Meaning** (issue : interpretation)
    - Negative Quantity : Returns / cancellations 
    - Negative Price    : Adjustments / accounting corrections
    - Extreme values    : Non-sales financial entries

In [32]:
df.describe() # descriptive statistics for numerical columns

Unnamed: 0,Quantity,InvoiceDate,Price,Customer ID
count,1067371.0,1067371,1067371.0,824364.0
mean,9.938898,2011-01-02 21:13:55.394028544,4.649388,15324.638504
min,-80995.0,2009-12-01 07:45:00,-53594.36,12346.0
25%,1.0,2010-07-09 09:46:00,1.25,13975.0
50%,3.0,2010-12-07 15:28:00,2.1,15255.0
75%,10.0,2011-07-22 10:23:00,4.15,16797.0
max,80995.0,2011-12-09 12:50:00,38970.0,18287.0
std,172.7058,,123.5531,1697.46445


**c). Categorical Summary**
 - 53,628 unique invoices
 - 5,305 unique stock codes
 - 43 countries
 - UK dominates the dataset (~92% of rows)

In [33]:
df.describe(include='O') # descriptive statistics for categorical columns

Unnamed: 0,Invoice,StockCode,Description,Country
count,1067371,1067371,1062989,1067371
unique,53628,5305,5698,43
top,537434,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,1350,5829,5918,981330


**d). Missing Customer IDs**
 - This matters because, without a Customer ID, a transaction cannot be attributed to a customer for RFM or CLV, so these records must be excluded from segmentation.

In [34]:
print("Null values in Customer ID column:", df["Customer ID"].isna().sum())
display(df[df["Customer ID"].isna()].head(5))

Null values in Customer ID column: 243007


Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
263,489464,21733,85123a mixed,-96,2009-12-01 10:52:00,0.0,,United Kingdom
283,489463,71477,short,-240,2009-12-01 10:52:00,0.0,,United Kingdom
284,489467,85123A,21733 mixed,-192,2009-12-01 10:53:00,0.0,,United Kingdom
470,489521,21646,,-50,2009-12-01 11:44:00,0.0,,United Kingdom
577,489525,85226C,BLUE PULL BACK RACING CAR,1,2009-12-01 11:49:00,0.55,,United Kingdom
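
Dropping these rows removes 243,007 of 1,067,371 records (roughly 23%), so it can help to check how much transaction value goes with them. The sketch below uses a small toy frame, not the real data; the `LineValue` column is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy rows: two with a customer ID, two without
toy = pd.DataFrame({
    "Quantity": [10, 2, 5, 1],
    "Price": [1.0, 5.0, 2.0, 10.0],
    "Customer ID": [13085.0, np.nan, 17850.0, np.nan],
})

toy["LineValue"] = toy["Quantity"] * toy["Price"]  # hypothetical helper column
missing = toy["Customer ID"].isna()

row_share = missing.mean()  # fraction of rows without an ID
value_share = toy.loc[missing, "LineValue"].sum() / toy["LineValue"].sum()

print(f"Rows without Customer ID: {row_share:.1%}")
print(f"Line value without Customer ID: {value_share:.1%}")
```

On the real data the same two ratios could be computed from `df` before the cleaning step, keeping in mind that returns and zero-priced rows distort the value share.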


**e). Negative Quantities:** 
 - These records are returns, cancellations, or corrections.

In [35]:
print("Negative values in Quantity column:", df["Quantity"].lt(0).sum())
display(df[df['Quantity'] < 0].head(3))

Negative values in Quantity column: 22950


Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
178,C489449,22087,PAPER BUNTING WHITE LACE,-12,2009-12-01 10:33:00,2.95,16321.0,Australia
179,C489449,85206A,CREAM FELT EASTER EGG BASKET,-6,2009-12-01 10:33:00,1.65,16321.0,Australia
180,C489449,21895,POTTING SHED SOW 'N' GROW SET,-4,2009-12-01 10:33:00,4.25,16321.0,Australia
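
The 22,950 negative-quantity rows outnumber the 19,494 cancellation rows counted in the invoice analysis, so not every negative quantity sits on a "C" invoice (the missing-ID rows shown earlier have negative quantities on regular invoice numbers). A minimal sketch of cross-tabulating the two signals, on toy rows; the invoice `A000001` is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "C489449", "489464", "A000001"],
    "Quantity": [12, -12, -6, -96, 1],
})

# Leading letter of the invoice number, or "none" for plain numeric invoices
prefix = (
    toy["Invoice"].astype(str)
    .str.extract(r"^([A-Za-z])", expand=False)
    .fillna("none")
)

# Label each row by the sign of its quantity
sign = toy["Quantity"].lt(0).map({True: "negative", False: "non-negative"})

table = pd.crosstab(prefix, sign, rownames=["InvoicePrefix"], colnames=["Quantity"])
print(table)
```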


#### Invoice Structure Analysis: 
 - Goal: 
    - Check whether any invoice number is something other than a 6-digit number.
    - Confirm that a leading "C" marks a cancellation.

- Observation: 
    - 19,500 rows have an invoice number longer than 6 characters.
    - Those invoice numbers all start with the letter C or A.

- Implication: 
     - Rows with a letter prefix represent financial corrections, not purchases:
        - None : Normal sales
        - C    : Sales cancellations (19,494 rows)
        - A    : Sales adjustments (6 rows)

In [36]:
df["Invoice"] = df["Invoice"].astype("str") # converting Invoice column to string

# looking at invoices with more than 6 digits
print("Invoices with more than 6 digits:", df["Invoice"].str.len().gt(6).sum())
print("Snapshot of Invoices with more than 6 digits:")
display(df[df["Invoice"].str.len() > 6].head(3))

Invoices with more than 6 digits: 19500
Snapshot of Invoices with more than 6 digits:


Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
178,C489449,22087,PAPER BUNTING WHITE LACE,-12,2009-12-01 10:33:00,2.95,16321.0,Australia
179,C489449,85206A,CREAM FELT EASTER EGG BASKET,-6,2009-12-01 10:33:00,1.65,16321.0,Australia
180,C489449,21895,POTTING SHED SOW 'N' GROW SET,-4,2009-12-01 10:33:00,4.25,16321.0,Australia


In [37]:
# Strip the digits to see which letter prefixes occur in invoice numbers
df["Invoice"].str.replace("[0-9]", "", regex=True).unique()

array(['', 'C', 'A'], dtype=object)

In [38]:
# Count of Records that are cancellations and adjustments
print("Count of Records that are cancellations and adjustments:", df[df["Invoice"].str.startswith(("C", "A"))].shape[0])
print("Count of Cancellation Records:", df[df["Invoice"].str.startswith("C")].shape[0])
print("Count of Adjustment Records:", df[df["Invoice"].str.startswith("A")].shape[0])

Count of Records that are cancellations and adjustments: 19500
Count of Cancellation Records: 19494
Count of Adjustment Records: 6
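
The prefix rules above can also be written as an explicit classification, which avoids relying on string length alone. This is a sketch on toy values, assuming invoice numbers are six digits with an optional single-letter prefix, as the outputs above suggest; the label names and the invoice `A000001` are hypothetical.

```python
import pandas as pd

invoices = pd.Series(["489434", "C489449", "A000001", "536365"])  # A000001 is made up

# Map the leading character to a label; anything else is treated as a normal sale
labels = {"C": "cancellation", "A": "adjustment"}
invoice_type = invoices.str[0].map(labels).fillna("sale")

# Sanity check: an optional capital letter followed by exactly six digits
well_formed = invoices.str.fullmatch(r"[A-Z]?\d{6}")

print(invoice_type.value_counts())
print("All well formed:", well_formed.all())
```

Filtering on `invoice_type == "sale"` would then express the same intent as the length check used in the cleaning step.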


#### StockCode Pattern Analysis

**StockCode Interpretation Table**

| Code             | Description            | Action  |
| ---------------- | ---------------------- | ------- |
| DCGS*            | Gift sets / bundles    | Exclude |
| D                | Discount               | Exclude |
| DOT              | Postage                | Exclude |
| M / m            | Manual entry           | Exclude |
| C2 / C3          | Carriage               | Exclude |
| BANK CHARGES / B | Bank fees              | Exclude |
| S                | Samples                | Exclude |
| TEST*            | Testing                | Exclude |
| gift_*           | Gift cards             | Exclude |
| PADS             | Padding product        | Include |
| SP1002           | Special product        | Exclude |
| AMAZONFEE        | Amazon fees            | Exclude |
| ADJUST*          | Accounting adjustments | Exclude |
| CRUK             | Charity donation       | Exclude |


**a). All StockCodes**

In [39]:
# Look at stock codes that are neither 5 digits nor 5 digits followed by letters
df["StockCode"] = df["StockCode"].astype("str")

stock_codes = df[~df["StockCode"].str.match(r"^\d{5}$") &
                 ~df["StockCode"].str.match(r"^\d{5}[a-zA-Z]+$")
                 ]["StockCode"].unique()

stock_codes

array(['POST', 'D', 'DCGS0058', 'DCGS0068', 'DOT', 'M', 'DCGS0004',
       'DCGS0076', 'C2', 'BANK CHARGES', 'DCGS0003', 'TEST001',
       'gift_0001_80', 'DCGS0072', 'gift_0001_20', 'DCGS0044', 'TEST002',
       'gift_0001_10', 'gift_0001_50', 'DCGS0066N', 'gift_0001_30',
       'PADS', 'ADJUST', 'gift_0001_40', 'gift_0001_60', 'gift_0001_70',
       'gift_0001_90', 'DCGSSGIRL', 'DCGS0006', 'DCGS0016', 'DCGS0027',
       'DCGS0036', 'DCGS0039', 'DCGS0060', 'DCGS0056', 'DCGS0059', 'GIFT',
       'DCGSLBOY', 'm', 'DCGS0053', 'DCGS0062', 'DCGS0037', 'DCGSSBOY',
       'DCGSLGIRL', 'S', 'DCGS0069', 'DCGS0070', 'DCGS0075', 'B',
       'DCGS0041', 'ADJUST2', '47503J ', 'C3', 'SP1002', 'AMAZONFEE',
       'DCGS0055', 'DCGS0074', 'DCGS0057', 'DCGS0073', 'DCGS0071',
       'DCGS0066P', 'DCGS0067', 'CRUK'], dtype=object)
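
One entry in this list, `'47503J '`, looks like an ordinary product code that failed the pattern only because of a trailing space. A minimal sketch of normalising whitespace before matching, shown on toy values rather than the full column:

```python
import pandas as pd

codes = pd.Series(["85048", "79323P", "47503J ", "POST", "DCGS0058"])

# Remove leading/trailing whitespace, then test for 5 digits plus optional letters
cleaned = codes.str.strip()
is_product = cleaned.str.fullmatch(r"\d{5}[a-zA-Z]*")

print(cleaned[~is_product].tolist())  # ['POST', 'DCGS0058']
```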

## 5. Data Cleaning for RFM & CLV

Tasks: 
- Remove rows without a Customer ID
    - A customer identifier is required for segmentation.
- Keep only real purchase invoices
    - Drop the cancellation and adjustment records.
- Remove non-positive quantity & price
    - Negative values are returns, cancellations, or corrections; zero-priced rows carry no revenue.
- Remove non-product stock codes
    - Based on the StockCode patterns identified above.
- Create & clean fields
    - Revenue (Total Amount)
    - InvoiceDate
    - Customer ID

Goal: 

| Metric     | Impact                                    |
| ---------- | ----------------------------------------- |
| Recency    | Uses only real purchase dates             |
| Frequency  | Counts only true transactions             |
| Monetary   | Uses valid purchase revenue               |
| CLV        | Model learns from clean spending behavior |
| Clustering | Segments reflect true customers           |


In [40]:
df_clean = df.copy()

`1. Removing rows without Customer ID`

In [41]:
df_clean = df_clean[~df_clean["Customer ID"].isna()] # removing null values

`2. Keeping only real purchase invoices`

In [42]:
df_clean = df_clean[df_clean["Invoice"].str.len() <= 6]

`3. Removing negative quantity & price`

In [43]:
df_clean = df_clean[
    (df_clean["Quantity"] > 0) &
    (df_clean["Price"] > 0)
]

`4. Removing non-product stock codes`

| Code             | Meaning            |
| ---------------- | ------------------ |
| POST / DOT       | Postage / delivery |
| BANK CHARGES     | Financial service  |
| ADJUST / ADJUST2 | Accounting         |
| AMAZONFEE        | Marketplace fees   |
| TEST / GIFT      | Non-sales items    |


In [44]:
exclude_codes = [
    'DOT','D','M','m','BANK CHARGES','B','S',
    'TEST001','TEST002','ADJUST','ADJUST2',
    'AMAZONFEE','SP1002','C2','C3'
]

df_clean = df_clean[~df_clean["StockCode"].isin(exclude_codes)]
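
The explicit list above names individual codes, whereas the interpretation table also marks whole prefix families (`DCGS*`, `gift_*`, `TEST*`, `ADJUST*`) and codes such as `CRUK` for exclusion. A pattern-based filter is one way to cover those families. The sketch below runs on toy codes; the grouping into `exact_codes` and `prefix_pattern` is an assumption read off the two tables, not the notebook's actual rule.

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["85123A", "DCGS0058", "gift_0001_80", "TEST001",
                  "POST", "PADS", "CRUK", "22087"]
})

# Assumed grouping: exact codes plus prefix families marked "Exclude" in the tables
exact_codes = {"D", "DOT", "POST", "M", "m", "C2", "C3", "BANK CHARGES",
               "B", "S", "SP1002", "AMAZONFEE", "CRUK"}
prefix_pattern = r"^(?:DCGS|TEST|gift_|ADJUST)"

code = toy["StockCode"].str.strip()
is_excluded = code.isin(exact_codes) | code.str.contains(prefix_pattern, regex=True)

print(toy[~is_excluded])  # keeps 85123A, PADS, 22087
```

Applying a filter like this to the real data would remove more rows than the explicit list, so the row counts in the outputs that follow would no longer match.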

`5. Create & clean fields` 
- Revenue (Total Amount)
- InvoiceDate
- Customer ID

In [45]:
df_clean["TotalAmount"] = df_clean["Quantity"] * df_clean["Price"]
df_clean["InvoiceDate"] = pd.to_datetime(df_clean["InvoiceDate"])
df_clean["Customer ID"] = df_clean["Customer ID"].astype(int)


**Final Check (Data Overview)**

In [46]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 804487 entries, 0 to 1067370
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      804487 non-null  object        
 1   StockCode    804487 non-null  object        
 2   Description  804487 non-null  object        
 3   Quantity     804487 non-null  int64         
 4   InvoiceDate  804487 non-null  datetime64[ns]
 5   Price        804487 non-null  float64       
 6   Customer ID  804487 non-null  int64         
 7   Country      804487 non-null  object        
 8   TotalAmount  804487 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 61.4+ MB
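
The cleaning steps take the data from 1,067,371 rows down to 804,487. To see how much each step contributes, the filters can be wrapped in a small logging helper. This is a sketch on toy rows; `apply_step` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489464", "489500", "489501"],
    "Quantity": [12, -12, 5, 3, 0],
    "Price": [6.95, 2.95, 1.00, 0.0, 2.10],
    "Customer ID": [13085.0, 16321.0, np.nan, 14000.0, 15000.0],
})

def apply_step(frame, mask, label, log):
    """Keep rows where mask is True and record how many rows the step removed."""
    kept = frame[mask]
    log.append({"step": label, "removed": len(frame) - len(kept), "remaining": len(kept)})
    return kept

log = []
out = apply_step(toy, toy["Customer ID"].notna(), "has Customer ID", log)
out = apply_step(out, out["Invoice"].str.len() <= 6, "regular invoice", log)
out = apply_step(out, (out["Quantity"] > 0) & (out["Price"] > 0),
                 "positive quantity and price", log)

print(pd.DataFrame(log))
```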


In [47]:
df_clean.describe()

Unnamed: 0,Quantity,InvoiceDate,Price,Customer ID,TotalAmount
count,804487.0,804487,804487.0,804487.0,804487.0
mean,13.29515,2011-01-02 11:29:29.632797184,2.991972,15332.183465,21.828834
min,1.0,2009-12-01 07:45:00,0.001,12346.0,0.001
25%,2.0,2010-07-07 12:21:00,1.25,13981.0,4.95
50%,5.0,2010-12-03 15:19:00,1.95,15271.0,11.85
75%,12.0,2011-07-28 14:03:00,3.75,16805.0,19.5
max,80995.0,2011-12-09 12:50:00,8142.75,18287.0,168469.6
std,143.703917,,10.266217,1696.827478,222.527815


In [48]:
df_clean.describe(include='O') # descriptive statistics for categorical columns

Unnamed: 0,Invoice,StockCode,Description,Country
count,804487,804487,804487,804487
unique,36705,4621,5272,41
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,541,5188,5181,724492


## 6. RFM Feature Engineering

### 1. Calculating R,F,M (Metrics) per customer

RFM is a method for analyzing customers based on their purchasing behavior. It helps a business understand who its best customers are in terms of: 
- R (Recency): How recently did the customer make a purchase?
    - Find the most recent purchase date for each customer.
        - Compare it to a "reference date" such as today. A smaller gap in days means a more recent purchase, which is better.
- F (Frequency): How often does the customer make purchases?
    - A straightforward count of how many purchases each customer made (volume).
- M (Monetary): How much money does the customer spend?
    - The sum of the total amount spent by the customer (value).

In other words, it acts like a customer score based on recency (days), loyalty (frequency), and value (amount spent on the platform or service).
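
To make the three definitions concrete, here is a small worked example on made-up transactions for two customers. It is only a sketch: the values are invented, and the reference date is taken as the day after the latest invoice in the toy data.

```python
import pandas as pd

# Customer 1 has two invoices (one with two line items); customer 2 has one
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["100", "100", "101", "102"],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-09"]),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5],
})

reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)  # 2011-12-09

toy_rfm = toy.groupby("Customer ID").agg(
    Recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    Frequency=("Invoice", "nunique"),   # distinct invoices, not line items
    Monetary=("TotalAmount", "sum"),
)
print(toy_rfm)
```

Customer 1 ends up with Recency 1, Frequency 2, and Monetary 35.0; customer 2 with Recency 30, Frequency 1, and Monetary 7.5.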

In [60]:
# Use the day after the last invoice as the snapshot (reference) date.
# Adding one day means a customer who bought on the last day gets a recency of 1 rather than 0.
snapshot_date = df_clean["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = df_clean.groupby("Customer ID").agg(
    TotalAmount=("TotalAmount", "sum"),
    LastInvoiceDate=("InvoiceDate", "max"),
    InvoiceCount=("Invoice", "nunique")
).reset_index()

rfm['snapshot_date'] = snapshot_date
rfm["Recency"] = (rfm["snapshot_date"] - rfm["LastInvoiceDate"]).dt.days
rfm["Frequency"] = rfm["InvoiceCount"]
rfm["Monetary"] = rfm["TotalAmount"]

display(rfm.head())

Unnamed: 0,Customer ID,TotalAmount,LastInvoiceDate,InvoiceCount,snapshot_date,Recency,Frequency,Monetary
0,12346,77352.96,2011-01-18 10:01:00,3,2011-12-10 12:50:00,326,3,77352.96
1,12347,5633.32,2011-12-07 15:52:00,8,2011-12-10 12:50:00,2,8,5633.32
2,12348,2019.4,2011-09-25 13:13:00,5,2011-12-10 12:50:00,75,5,2019.4
3,12349,4428.69,2011-11-21 09:51:00,4,2011-12-10 12:50:00,19,4,4428.69
4,12350,334.4,2011-02-02 16:01:00,1,2011-12-10 12:50:00,310,1,334.4


In [59]:
print("RFM DF with Customer ID, Recency, Frequency, Monetary")
rfm = rfm[["Customer ID", "Recency", "Frequency", "Monetary"]]
display(rfm.head())

RFM DF with Customer ID, Recency, Frequency, Monetary


Unnamed: 0,Customer ID,Recency,Frequency,Monetary
0,12346,326,3,77352.96
1,12347,2,8,5633.32
2,12348,75,5,2019.4
3,12349,19,4,4428.69
4,12350,310,1,334.4


**RFM statistics**

In [62]:
print("RFM Statistics:")
rfm.describe().round(2)
    
# # Save RFM data
# rfm.to_csv(RFM_DATA_FILE, index=False)
# print(f"RFM data saved to: {RFM_DATA_FILE}")

RFM Statistics:


Unnamed: 0,Customer ID,TotalAmount,LastInvoiceDate,InvoiceCount,snapshot_date,Recency,Frequency,Monetary
count,5853.0,5853.0,5853,5853.0,5853,5853.0,5853.0,5853.0
mean,15319.25,3000.34,2011-05-23 18:30:42.942081024,6.27,2011-12-10 12:50:00,200.24,6.27,3000.34
min,12346.0,2.95,2009-12-01 09:55:00,1.0,2011-12-10 12:50:00,1.0,1.0,2.95
25%,13837.0,348.59,2010-11-25 15:59:00,1.0,2011-12-10 12:50:00,25.0,1.0,348.59
50%,15320.0,896.66,2011-09-05 16:22:00,3.0,2011-12-10 12:50:00,95.0,3.0,896.66
75%,16802.0,2303.17,2011-11-14 12:55:00,7.0,2011-12-10 12:50:00,379.0,7.0,2303.17
max,18287.0,608821.65,2011-12-09 12:50:00,373.0,2011-12-10 12:50:00,739.0,373.0,608821.65
std,1715.14,14635.24,,12.79,,208.58,12.79,14635.24


**RFM Distributions**

In [None]:
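# COLORS and IMAGE_OUTPUT_FOLDER are not defined in the cells shown here; the
# fallbacks below are placeholder assumptions so that this cell can run on its own
COLORS = globals().get("COLORS", sns.color_palette("Set2", 4))
IMAGE_OUTPUT_FOLDER = globals().get("IMAGE_OUTPUT_FOLDER", "")

# Create a figure with 4 subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))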

# Plot 1: Recency Distribution
axes[0, 0].hist(rfm['Recency'], bins=30, color=COLORS[0], edgecolor='black', alpha=0.7)
axes[0, 0].set_title('How Recently Did Customers Purchase?', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Days Since Last Purchase')
axes[0, 0].set_ylabel('Number of Customers')
axes[0, 0].grid(True, alpha=0.3)

# Add average line
avg_recency = rfm['Recency'].mean()
axes[0, 0].axvline(avg_recency, color='red', linestyle='--', linewidth=2)
axes[0, 0].text(avg_recency*1.05, axes[0, 0].get_ylim()[1]*0.9, 
                f'Average: {avg_recency:.0f} days', color='red')

# Plot 2: Frequency Distribution
axes[0, 1].hist(rfm['Frequency'], bins=30, color=COLORS[1], edgecolor='black', alpha=0.7)
axes[0, 1].set_title('How Often Do Customers Purchase?', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Number of Purchases')
axes[0, 1].set_ylabel('Number of Customers')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Monetary Distribution
axes[1, 0].hist(rfm['Monetary'], bins=30, color=COLORS[2], edgecolor='black', alpha=0.7)
axes[1, 0].set_title('How Much Do Customers Spend?', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Total Amount Spent (£)')
axes[1, 0].set_ylabel('Number of Customers')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Relationship between Frequency and Monetary
scatter = axes[1, 1].scatter(rfm['Frequency'], rfm['Monetary'], 
                                alpha=0.6, s=30, color=COLORS[3])
axes[1, 1].set_title('Frequency vs Monetary Value', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Number of Purchases')
axes[1, 1].set_ylabel('Total Amount Spent (£)')
axes[1, 1].grid(True, alpha=0.3)

# Add trend line
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(rfm['Frequency'], rfm['Monetary'])
line = slope * rfm['Frequency'] + intercept
axes[1, 1].plot(rfm['Frequency'], line, 'r-', alpha=0.8, linewidth=2)
axes[1, 1].text(0.05, 0.95, f'Correlation: {r_value:.2f}', 
                transform=axes[1, 1].transAxes, fontsize=10,
                verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.savefig(f"{IMAGE_OUTPUT_FOLDER}rfm_distributions.png", dpi=300, bbox_inches='tight')
plt.show()

print(f"✅ RFM visualizations saved to: {IMAGE_OUTPUT_FOLDER}rfm_distributions.png")
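
The RFM statistics show a strongly right-skewed Monetary column (median 896.66 against a maximum of 608,821.65), so a plain 30-bin histogram may put almost every customer in the first bar. One common option is to plot a log-transformed copy instead. The sketch below only illustrates the effect of the transform, using the minimum, quartiles, and maximum from the statistics table as toy values.

```python
import numpy as np
import pandas as pd

# Min, quartiles, and max of Monetary taken from the RFM statistics output
monetary = pd.Series([2.95, 348.59, 896.66, 2303.17, 608821.65])

log_monetary = np.log1p(monetary)  # log(1 + x), also defined at zero

# The max-to-median ratio shows how much the transform compresses the tail
print("Raw max/median ratio:", round(monetary.max() / monetary.median(), 1))
print("Log max/median ratio:", round(log_monetary.max() / log_monetary.median(), 1))
```

In the plotting cell this would amount to passing `np.log1p(rfm['Monetary'])` to `hist` and adjusting the axis label accordingly; whether that is preferable depends on the audience for the chart.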

### 2. RFM Scoring

Usually, each R, F, and M value is turned into a score from 1 to 5.
 - 5 = best (most recent purchase, most frequent, highest spending)
 - 1 = worst (long ago, rare, low spending)
 - Scores are usually assigned by splitting customers into 5 equal groups (quintiles) or by setting specific cut-off thresholds.

Then, the 3 scores are combined into a single RFM score (like `545` for R=5, F=4, M=5) to quickly see how valuable a customer is.

In [51]:
# Recency score: fewer days = more recent = better, so the smallest values get the
# highest score (5); hence the reversed label order
rfm["R_Score"] = pd.qcut(rfm["Recency"], 5, labels=[5,4,3,2,1])

# Frequency & Monetary: larger values = more frequent or higher spending = better,
# so the largest values get the highest score (5); hence the 1-5 label order
# Frequency is ranked first (method="first") to break ties before splitting into
# quintiles, because many customers share the same frequency
# Monetary is not ranked because ties are much less likely
rfm["F_Score"] = pd.qcut(rfm["Frequency"].rank(method="first"), 5, labels=[1,2,3,4,5])
rfm["M_Score"] = pd.qcut(rfm["Monetary"], 5, labels=[1,2,3,4,5])

# Combine R, F, M scores into a single RFM code for each customer making it a 
# string thus easier to segment based on the overall value
rfm["RFM_Score"] = rfm["R_Score"].astype(str) + \
                   rfm["F_Score"].astype(str) + \
                   rfm["M_Score"].astype(str)

rfm.head()


Unnamed: 0,CustomerID,Recency,Frequency,Monetary,R_Score,F_Score,M_Score,RFM_Score
0,12346,326,3,77352.96,2,3,5,235
1,12347,2,8,5633.32,5,4,5,545
2,12348,75,5,2019.4,3,4,4,344
3,12349,19,4,4428.69,5,3,5,535
4,12350,310,1,334.4,2,1,2,212
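
The reason Frequency is ranked before `pd.qcut` can be seen on a small example. When many customers share the same value, several quantile edges coincide and `qcut` raises a `ValueError`; ranking with `method="first"` gives every row a distinct position. The values below are toy data chosen to trigger that case.

```python
import pandas as pd

# Six of ten customers have bought exactly once, so the lower quantile edges coincide
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 5, 8])

try:
    pd.qcut(frequency, 5, labels=[1, 2, 3, 4, 5])
except ValueError as err:
    print("qcut on raw values failed:", err)

# Ranking first assigns ranks 1..10, which split cleanly into five groups of two
scores = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(scores.value_counts().sort_index())
```

The trade-off is that customers with identical frequency can land in different score groups, depending only on their row order.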


## Next Section in Notebook

With the RFM features and scores in place, the notebook can move on to:

- 👉 Customer Segmentation
- 👉 CLV Prediction