In [None]:
# Import libraries
import pandas as pd
import numpy as np
from tabulate import tabulate

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

import plotly.express as px
import plotly.colors
import plotly.io as pio

pio.orca.config.save()
import plotly.graph_objects as go
pio.templates.default = "plotly_white"

from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

## **1. Load and Prepare Data**

In [None]:
# Load the dataset
df = pd.read_csv('../data/raw/online_retail.csv')
print(df.shape)
df.head(3)

In [None]:
# Inspect data
print("\nDataset Information:")
print(df.info())

print("\nDescriptive statistics:")
df.describe()

Up to this point, we can spot some columns with invalid values. `Quantity` and `UnitPrice` both contain negative values, which can seriously distort our analysis if left unchecked. Both columns are also highly skewed, which may partly be a result of these invalid values. We'll inspect both columns to see what we can uncover.

Let's start preprocessing by handling the invalid records in our data.

In [None]:
# Flag records with a UnitPrice below one cent (<= 0.009) or a non-positive Quantity
invalid_records = df[(df['UnitPrice'] <= 0.009) | (df['Quantity'] <= 0)]

# Drop the flagged rows; .copy() avoids SettingWithCopyWarning on later assignments
df_valid = df[~df.index.isin(invalid_records.index)].copy()
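
To make the filter concrete, here is a minimal sketch on a made-up five-row frame. The values are illustrative and not taken from the dataset; the two conditions are the same as in the cell above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the retail dataset)
toy = pd.DataFrame({
    "Quantity":  [6, -1, 3, 0, 2],
    "UnitPrice": [2.55, 1.25, 0.0, 4.95, 0.005],
})

# Same conditions as above: price below one cent, or non-positive quantity
is_invalid = (toy["UnitPrice"] <= 0.009) | (toy["Quantity"] <= 0)

# .copy() returns an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
toy_valid = toy[~is_invalid].copy()
print(toy_valid)
```

Only the first row passes both checks; each of the other four rows fails exactly one condition.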

In [None]:
# Data Cleaning
# Checking for missing values
print("\nMissing values per column:")
print(df_valid.isnull().sum())
df_valid = df_valid.dropna()

# Convert InvoiceDate to datetime format
df_valid['InvoiceDate'] = pd.to_datetime(df_valid['InvoiceDate'])

# Split into separate Date and Time columns
df_valid['Date'] = df_valid['InvoiceDate'].dt.date    # Extract the date
df_valid['Time'] = df_valid['InvoiceDate'].dt.time    # Extract the time

# Define the analysis date
# Using the day after the latest purchase date in the dataset
analysis_date = df_valid['InvoiceDate'].max() + pd.Timedelta(days=1)
print(f"\nAnalysis Date (day after the last purchase): {analysis_date}")

# Round UnitPrice to 2 decimal places
df_valid['UnitPrice'] = df_valid['UnitPrice'].round(2)

# Convert customer id to string
df_valid['CustomerID'] = df_valid['CustomerID'].astype(int) # Eliminate the decimal
df_valid['CustomerID'] = df_valid['CustomerID'].astype(str)

# Create Transaction amount column
df_valid['TransactionAmount'] = (df_valid['UnitPrice'] * df_valid['Quantity']).round(2)

# Final check on cleaned data
print("\nCleaned Dataset Info:")
print(df_valid.info())

In the cell above, we've done the following:

* Handled invalid entries in columns Quantity & UnitPrice
* Dropped null values
* Converted InvoiceDate to datetime format and created Date and Time column
* Set the analysis date/reference date for Recency measure (1 day after last purchase)
* Converted CustomerID column into string format
* Added TransactionAmount column

**Note**: Why keep both Date and Time columns? (*Just in case they are needed later*)
- The Date column allows for broader time-based grouping, such as daily or monthly sales analysis.
- The Time column allows for finer-grained analysis, such as identifying peak shopping hours or transaction patterns within a day.
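
As a rough illustration of the two granularities, the sketch below groups a few made-up invoices by calendar day and by hour of day. The timestamps and amounts are hypothetical.

```python
import pandas as pd

# Toy invoices (hypothetical timestamps and amounts)
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-12-01 09:15", "2011-12-01 12:40",
        "2011-12-02 12:05", "2011-12-02 12:55",
    ]),
    "TransactionAmount": [10.0, 20.0, 5.0, 15.0],
})

# Broad grouping: revenue per calendar day
daily_sales = toy.groupby(toy["InvoiceDate"].dt.date)["TransactionAmount"].sum()

# Fine-grained grouping: number of transactions per hour of day
hourly_counts = toy.groupby(toy["InvoiceDate"].dt.hour).size()

peak_hour = hourly_counts.idxmax()
print(daily_sales)
print(f"Peak hour: {peak_hour}:00")
```

Both groupings can be derived from `InvoiceDate` through the `.dt` accessor, so the separate columns are a convenience rather than a requirement.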

We'll run validation checks on the problematic columns to confirm the data is exactly how we want it.

In [None]:
df_valid.describe()

**Inspecting UnitPrice column**

We saw that `UnitPrice` had some invalid values, so we'll look for any rows that still contain them to confirm the cleaned data is sound. We won't inspect `TransactionAmount` separately, since it is a *derived column* computed from `UnitPrice` and `Quantity`.

In [None]:
# Rows where UnitPrice is zero or negative (expected: none after cleaning)
unitprice_zero = df_valid[df_valid['UnitPrice'] <= 0]
unitprice_zero

In [None]:
# Inspect UnitPrice values
df_valid['UnitPrice'].value_counts().sort_index(ascending=False)

**Inspecting Quantity Amount**

In [None]:
# Rows where Quantity is below 1 (expected: none after cleaning)
qty_invalid = df_valid[df_valid['Quantity'] < 1]
qty_invalid

In [None]:
# Inspect Quantity values
df_valid['Quantity'].value_counts().sort_index(ascending=False)

There is a very wide spread in item prices: after validation, the most expensive item costs around 8,142 whereas the cheapest is about 0.04. As for the invalid values, both the `UnitPrice` and `Quantity` checks returned zero rows, so there are no zero or negative values left in the cleaned dataset and we are set for the next step.
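
The visual checks above can also be turned into assertions that fail loudly if bad rows slip through. This is a minimal sketch; `check_clean` is a hypothetical helper name and the toy frame stands in for `df_valid`.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> None:
    """Raise AssertionError if invalid prices, quantities, or nulls remain."""
    assert (frame["UnitPrice"] > 0).all(), "non-positive UnitPrice found"
    assert (frame["Quantity"] > 0).all(), "non-positive Quantity found"
    assert frame.notna().all().all(), "missing values found"

# Toy frame standing in for df_valid (hypothetical values)
toy_clean = pd.DataFrame({"UnitPrice": [0.04, 2.55], "Quantity": [1, 12]})
check_clean(toy_clean)  # passes silently when the data is clean
```

Calling `check_clean(df_valid)` right before saving would guard the pickle file against regressions in the cleaning steps.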

In [None]:
# Save cleaned data
df_valid.to_pickle('../data/processed/cleaned_data.pkl')

In [None]:
# pkl_file = pd.read_pickle('../data/processed/cleaned_data.pkl')
# pkl_file.shape

## **2. Calculate RFM Metrics**

1. **Define the Reference Date**:

Set a date to calculate recency (e.g., the last date in the dataset + 1 day).

In [None]:
# Reference date
analysis_date

2. **Calculate R, F, and M**:

- **Recency**: Days since the customer’s last transaction.
- **Frequency**: Number of transactions per customer.
- **Monetary**: Total spending per customer.

In [None]:
# Calculate RFM metrics
rfm = df_valid.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (analysis_date - x.max()).days),  # Days since last purchase
    Frequency=('InvoiceNo', 'nunique'),# Number of unique invoices
    Monetary=('TransactionAmount', 'sum')  # Total transaction amount
).reset_index()


# Verify the RFM DataFrame
print("\nRFM Table (Preview):")
print(rfm.head())
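
To see what each metric means on data small enough to check by hand, here is a sketch with two made-up customers. All IDs, dates, and amounts are hypothetical.

```python
import pandas as pd

# Toy transactions (hypothetical customers and amounts)
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-12-05"]
    ),
    "TransactionAmount": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: one day after the last purchase in the data
ref_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (ref_date - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TransactionAmount", "sum"),
)
print(toy_rfm)
```

Customer A has two rows on invoice 1001, but Frequency counts that invoice once, which is why `nunique` is used rather than a row count. The most recent buyer gets a Recency of 1 rather than 0 because the reference date is shifted forward by a day.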

**Distributions of RFM Metrics**

In [None]:
# Apply log transformation to reduce skewness
rfm['Recency_log'] = np.log1p(rfm['Recency'])
rfm['Frequency_log'] = np.log1p(rfm['Frequency'])
rfm['Monetary_log'] = np.log1p(rfm['Monetary'])

# Plot histograms for log-transformed Recency, Frequency, and Monetary
rfm[['Recency_log', 'Frequency_log', 'Monetary_log']].hist(
    bins=20, 
    layout=(1, 3), 
    figsize=(12, 4), 
    color='skyblue'
)
plt.suptitle('Distribution of RFM Metrics (Log Transformed)')
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

In [None]:
# Checking skewness after transformation
print(rfm[['Recency_log', 'Frequency_log', 'Monetary_log']].skew())
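
For intuition on why the transformation helps, the sketch below compares skewness before and after `np.log1p` on a small made-up series with one extreme value. The numbers are hypothetical and only meant to mimic a long right tail.

```python
import numpy as np
import pandas as pd

# Toy right-skewed spend values (hypothetical): many small, one very large
spend = pd.Series([10, 12, 15, 20, 25, 30, 40, 60, 100, 5000], dtype=float)

skew_raw = spend.skew()
skew_log = np.log1p(spend).skew()  # log1p(x) = log(1 + x), defined at x = 0

print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```

The log compresses large values far more than small ones, so the outlier pulls the distribution less. It does not guarantee symmetry, which is why checking the skew afterwards is still worthwhile.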

In [None]:
# Plotting boxplots for transformed Recency, Frequency, and Monetary
sns.boxplot(data=rfm[['Recency_log','Frequency_log', 'Monetary_log']])
plt.title('Boxplot of Transformed Recency, Frequency, and Monetary')
plt.show()

## **3. RFM Scoring**

In [None]:
# Recency scoring (higher score for lower recency values - recent purchases)
rfm['Recency_score'] = pd.qcut(rfm['Recency_log'], 5, labels=[5, 4, 3, 2, 1])

# Frequency scoring (higher score for higher frequency values)
rfm['Frequency_score'] = pd.qcut(rfm['Frequency_log'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

# Monetary scoring (higher score for higher monetary values)
rfm['Monetary_score'] = pd.qcut(rfm['Monetary_log'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

# Combine the scores into a single RFM score
rfm['RFM_score'] = (
    rfm['Recency_score'].astype(int) + 
    rfm['Frequency_score'].astype(int) + 
    rfm['Monetary_score'].astype(int)
)

# Preview the RFM table with scores
print("\nRFM Table with Scores (Preview):")
rfm[['CustomerID', 'Recency_log', 'Frequency_log', 'Monetary_log', 'Recency_score', 'Frequency_score', 'Monetary_score', 'RFM_score']].head()
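
The Frequency and Monetary scores are computed on ranks rather than raw values. A likely reason is that heavily tied values produce duplicate quantile edges, which `pd.qcut` rejects by default. The sketch below reproduces this on a made-up frequency series.

```python
import pandas as pd

# Toy frequencies (hypothetical): most customers bought exactly once
freq = pd.Series([1, 1, 1, 1, 1, 1, 1, 2, 3, 10])

# Quantile edges on the raw values collide (several edges equal 1),
# so qcut cannot build five distinct bins
try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    raw_ok = True
except ValueError:
    raw_ok = False

# Ranking first breaks ties by order of appearance, giving unique values
scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

print(raw_ok)
print(scores.value_counts().sort_index())
```

The trade-off is that customers with identical values can land in different score bins, depending only on their row order.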

## **4. Value Segmentation**

In [None]:
# Create RFM segments based on the RFM score
segment_labels = ['Low-Value', 'Mid-Value', 'High-Value']
rfm['Value Segment'] = pd.qcut(rfm['RFM_score'], q=3, labels=segment_labels)
# Preview
rfm.head()

## **5. Behavioural Segmentation**

In [None]:
# Define segments based on ranges of the total RFM score
def aggregate_segment(row):
    if row['RFM_score'] >= 13:
        return 'Champions'
    elif row['RFM_score'] >= 10:
        return 'Loyal Customers'
    elif row['RFM_score'] >= 7:
        return 'Potential Loyalists'
    elif row['RFM_score'] >= 5:
        return 'At Risk'
    else:
        return 'Hibernating'

# Apply the segmentation function to each row
rfm['RFM Customer Segment'] = rfm.apply(aggregate_segment, axis=1)

# Check the segment distribution
print("\nSegment Distribution:")
print(rfm['RFM Customer Segment'].value_counts())

# Preview the RFM table with segments
print("\nRFM Table with Aggregated Score Segments (Preview):")
rfm[['CustomerID', 'RFM_score', 'RFM Customer Segment']].head()
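
The row-wise function above is easy to read. As an alternative sketch, the same thresholds can be expressed as bins for `pd.cut`, which avoids `apply` on large tables. The bin edges below are an assumption derived from the thresholds in `aggregate_segment`, given that each of the three scores ranges from 1 to 5 and the total therefore ranges from 3 to 15.

```python
import pandas as pd

# Every possible total score: three components, each scored 1 to 5
scores = pd.Series(range(3, 16), name="RFM_score")

# Right-closed bins: (2, 4], (4, 6], (6, 9], (9, 12], (12, 15]
bins = [2, 4, 6, 9, 12, 15]
labels = ["Hibernating", "At Risk", "Potential Loyalists",
          "Loyal Customers", "Champions"]

segments = pd.cut(scores, bins=bins, labels=labels)
print(pd.concat([scores, segments.rename("Segment")], axis=1))
```

Unlike the function, `pd.cut` returns an ordered categorical, so the segments sort from Hibernating to Champions rather than alphabetically.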


## **6. Overview of Segments (RFM Analysis)**

We'll start by understanding the distribution of customers across value and behavioural segments.

### **Distribution of Customers Across Value Segments**

In [None]:
# Value Segment Distribution
value_segment_count = rfm['Value Segment'].value_counts().reset_index()
# Assign column names
value_segment_count.columns = ['Value Segment', 'Count']

# Preview
print(value_segment_count)

pastel_colors = px.colors.qualitative.Pastel

# Create the bar chart
fig_segment_dist = px.bar(value_segment_count, x='Value Segment', y='Count', 
                          color='Value Segment', color_discrete_sequence=pastel_colors,
                          title='RFM Value Segment Distribution')

# Update the layout
fig_segment_dist.update_layout(xaxis_title='RFM Value Segment',
                              yaxis_title='Count',
                              showlegend=False)

# Show the figure
fig_segment_dist.show()
fig_segment_dist.write_image('../images/rfm_value_segment_dist.png', width=800, height=600)

- The largest portion of customers falls into the Low-Value segment (1683 customers, ~38.8%).
- The Mid-Value segment accounts for a slightly smaller share (1400 customers, ~32.3%).
- The High-Value segment represents the smallest group (1255 customers, ~28.9%).

### **Distribution of Customers Across Behavioural Segments**

In [None]:
pastel_colors = plotly.colors.qualitative.Pastel

segment_counts = rfm['RFM Customer Segment'].value_counts()

# Create a bar chart to compare segment counts
fig = go.Figure(data=[go.Bar(x=segment_counts.index, y=segment_counts.values,
                            marker=dict(color=pastel_colors))])

# Set the color of the Champions segment as a different color
champions_color = 'rgb(158, 202, 225)'
fig.update_traces(marker_color=[champions_color if segment == 'Champions' else pastel_colors[i]
                                for i, segment in enumerate(segment_counts.index)],
                  marker_line_color='rgb(8, 48, 107)',
                  marker_line_width=1.5, opacity=0.6)

# Update the layout
fig.update_layout(title='Comparison of Behavioural Segments',
                  xaxis_title='Behavioural Segments',
                  yaxis_title='Number of Customers',
                  showlegend=False)

fig.show()
fig.write_image('../images/rfm_segments_comparison.png', width=800, height=600)

1. **Top Segments**:

Potential Loyalists (1092 customers, ~25.2%) form the largest behavioral group, indicating a significant number of customers who are actively engaging but are yet to reach the highest loyalty tier.
Loyal Customers (1008 customers, ~23.2%) are a strong segment, showing consistent engagement and spending.

2. **Key Loyalty Segment**:

Champions (934 customers, ~21.5%) represent your most valuable and loyal customers who engage frequently and contribute significantly to revenue. This segment is smaller than Potential Loyalists and Loyal Customers, highlighting an opportunity for growth.

3. **At Risk Customers**:

At Risk (759 customers, ~17.5%) is a notable segment that shows a drop in engagement or spending. This group requires immediate attention to prevent churn.

4. **Low Engagement**:

Hibernating (545 customers, ~12.6%) indicates customers who are inactive or have not engaged recently. Reviving this segment could lead to additional revenue streams.

### **Distribution of Customer Segments in Value Segments**

In [None]:
segment_product_counts = rfm.groupby(['Value Segment', 'RFM Customer Segment']).size().reset_index(name='Count')

# Sort values for better visual structure
segment_product_counts = segment_product_counts.sort_values('Count', ascending=False)

# Filter out rows where Count is zero
segment_product_counts_filtered = segment_product_counts[segment_product_counts['Count'] > 0]

# Create the treemap
fig_treemap_segment_product = px.treemap(
    segment_product_counts_filtered,  # Use the filtered DataFrame
    path=['Value Segment', 'RFM Customer Segment'], 
    values='Count', 
    color='Count',  # Use Count for color intensity
    color_continuous_scale='Blues',  # Intensity-based color scale
    title='RFM Customer Segments by Value'
)

# Update the layout for improved aesthetics
fig_treemap_segment_product.update_layout(
    title_font_size=18,
    title_x=0.5,
    coloraxis_colorbar=dict(
        title="Customer Count",
        title_side="right"
    )
)

# Show the plot and save the image
fig_treemap_segment_product.show()
fig_treemap_segment_product.write_image('../images/figtree_segment_by_value.png', width=800, height=600)

- High-Value Customers:

Champions dominate the High-Value group with 934 customers; the remaining 321 of its 1255 customers are Loyal Customers.
At Risk, Hibernating, and Potential Loyalists do not appear in this group. This is expected rather than a finding about customer behaviour: both segmentations are cut from the same `RFM_score`, so high scores always map to the top behavioural segments.

- Mid-Value Customers:

This group is made up of Potential Loyalists (713) and Loyal Customers (687), customers who are actively engaging and show promise for future growth.
At Risk, Champions, and Hibernating do not appear here, again because their score ranges fall outside the Mid-Value band.

- Low-Value Customers:

At Risk is the largest segment within the Low-Value group with 759 customers, followed by Hibernating (545) and Potential Loyalists (379). These customers are either disengaged or need nurturing to boost their activity.
Champions and Loyal Customers do not appear in this group, since their scores are too high for the Low-Value band.
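
The empty combinations noted above can be examined with a cross-tabulation. The sketch below uses one hypothetical customer per possible total score and expresses the behavioural thresholds as `pd.cut` bins (an assumption equivalent to the thresholds in section 5). Because both labels are functions of the same score, many cells are empty no matter how customers behave.

```python
import pandas as pd

# One hypothetical customer per possible total score (3 to 15)
toy = pd.DataFrame({"RFM_score": range(3, 16)})

# Value segments: terciles of the score, as in section 4
toy["Value Segment"] = pd.qcut(
    toy["RFM_score"], q=3, labels=["Low-Value", "Mid-Value", "High-Value"]
)

# Behavioural segments: fixed thresholds on the same score
toy["RFM Customer Segment"] = pd.cut(
    toy["RFM_score"], bins=[2, 4, 6, 9, 12, 15],
    labels=["Hibernating", "At Risk", "Potential Loyalists",
            "Loyal Customers", "Champions"],
)

overlap = pd.crosstab(toy["Value Segment"], toy["RFM Customer Segment"])
print(overlap)
```

Running `pd.crosstab(rfm['Value Segment'], rfm['RFM Customer Segment'])` on the real table gives the same counts that the treemap displays.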

In [None]:
# Save RFM table
rfm.to_pickle('../data/processed/rfm_table.pkl')

### **Average RFM Scores by Customer Segment**

In [None]:
rfm['Recency_score'] = rfm['Recency_score'].astype(int)
rfm['Frequency_score'] = rfm['Frequency_score'].astype(int)
rfm['Monetary_score'] = rfm['Monetary_score'].astype(int)

# Calculate the average Recency, Frequency, and Monetary scores for each segment
segment_scores = rfm.groupby('RFM Customer Segment')[['Recency_score', 'Frequency_score', 'Monetary_score']].mean().round(1).reset_index()

# Create a grouped bar chart to compare segment scores
fig = go.Figure()

# Add bars for Recency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segment'],
    y=segment_scores['Recency_score'],
    name='Recency Score',
    marker_color='rgb(158,202,225)'
))

# Add bars for Frequency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segment'],
    y=segment_scores['Frequency_score'],
    name='Frequency Score',
    marker_color='rgb(94,158,217)'
))

# Add bars for Monetary score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segment'],
    y=segment_scores['Monetary_score'],
    name='Monetary Score',
    marker_color='rgb(32,102,148)'
))

# Update the layout
fig.update_layout(
    title='Comparison of RFM Segments based on Recency, Frequency, and Monetary Scores',
    xaxis_title='RFM Segments',
    yaxis_title='Score',
    barmode='group',
    showlegend=True
)

fig.show()
fig.write_image('../images/rfm_comparisons.png', width=800, height=600)

In [None]:
from tabulate import tabulate
# Define the data
data = [
    ["At Risk", 2.1, 1.7, 1.8, "Customers who have not made purchases recently, with low frequency and spending. Immediate re-engagement strategies are needed to retain them."],
    ["Champions", 4.5, 4.8, 4.7, "Your best customers who purchase frequently, spend the most, and are highly engaged. These customers are loyal and should be rewarded with exclusive offers or personalized experiences."],
    ["Hibernating", 1.1, 1.3, 1.2, "Customers who have not purchased in a long time, with minimal engagement and spending. Reactivation campaigns could help bring them back."],
    ["Loyal Customers", 3.6, 3.7, 3.7, "Regular customers with good spending and engagement. They appreciate the brand and are likely to respond to loyalty programs or upselling strategies."],
    ["Potential Loyalists", 2.8, 2.5, 2.6, "Customers showing promising behavior but not yet fully engaged. Targeted campaigns to encourage frequency and spending could move them into the 'Loyal Customers' or 'Champions' segments."]
]

# Create a table
headers = ["RFM Customer Segment", "Recency Score", "Frequency Score", "Monetary Score", "Interpretation"]
table = tabulate(data, headers=headers, tablefmt="fancy_grid")

# Print the table
print(table)


### **Revenue Contribution by Segment**

In [None]:
# Calculate total revenue per segment
segment_revenue = rfm.groupby('RFM Customer Segment')['Monetary'].sum().reset_index()
# Renaming resulting columns for clarity
segment_revenue.columns = ['RFM Customer Segment', 'Total Revenue']
# Add a column for percentage contribution
total_revenue = segment_revenue['Total Revenue'].sum()
segment_revenue['Revenue Percentage'] = ((segment_revenue['Total Revenue']/total_revenue)*100).round(2)
print(segment_revenue)

The resulting DataFrame has the following columns:

- `RFM Customer Segment`: The name of the segment.
- `Total Revenue`: Total revenue contributed by each segment.
- `Revenue Percentage`: Percentage of total revenue each segment contributes.

Visualizing each segment's percentage contribution in a pie chart:

In [None]:
# Create a pie chart
pie_chart = plt.figure(figsize=(8, 8))
plt.pie(
    segment_revenue['Total Revenue'],
    labels=segment_revenue['RFM Customer Segment'],
    autopct='%1.1f%%',
    startangle=140,
    colors=sns.color_palette('Blues', len(segment_revenue))
)

# Add a title
plt.title('Revenue Contribution by Customer Segment', fontsize=16)
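plt.tight_layout()
# Save before plt.show(); savefig has no width/height arguments,
# the pixel size comes from figsize and dpi (8 in x 100 dpi = 800 px)
pie_chart.savefig('../images/revenue_contribution.png', dpi=100)
plt.show()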

- 86.00% of Revenue Comes from Champions (70.19%) and Loyal Customers (15.81%):
Focus most of your resources on retaining these high-value groups.

- 12.82% of Revenue Comes from Potential Loyalists and At Risk:
These groups represent both growth opportunities and potential loss. Balance retention efforts here.

- Low Returns on Hibernating (1.18%):
Spending effort on this group might not be cost-effective unless they are a strategic priority.
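
Combined shares like these are sums of per-segment percentages. As a small sketch, a cumulative sum over the revenue percentages listed in the conclusion table shows how quickly revenue concentrates in the top segments.

```python
import pandas as pd

# Per-segment revenue percentages as listed in the conclusion table
revenue_pct = pd.Series({
    "Champions": 70.19,
    "Loyal Customers": 15.81,
    "Potential Loyalists": 9.90,
    "At Risk": 2.92,
    "Hibernating": 1.18,
})

# Running total from the largest contributor down
cumulative = revenue_pct.sort_values(ascending=False).cumsum().round(2)
print(cumulative)

# Combined share of the two middle segments
middle_share = round(revenue_pct["Potential Loyalists"] + revenue_pct["At Risk"], 2)
print(middle_share)
```

On the real data, the same running total can be taken from `segment_revenue['Revenue Percentage']` after sorting it in descending order.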

### **Conclusion:**

In [None]:
from IPython.display import display, Markdown
# Data for the table
# Customer proportions are segment counts divided by the 4338 customers in the RFM table
data = [
    ["Champions", "21.53%", "70.19%"],
    ["Loyal Customers", "23.24%", "15.81%"],
    ["Potential Loyalists", "25.17%", "9.90%"],
    ["At Risk", "17.50%", "2.92%"],
    ["Hibernating", "12.56%", "1.18%"]
]

# Column headers
headers = ["Segment", "Customer Proportion", "Revenue Contribution"]

# Generate table
comparison_table = tabulate(data, headers=headers, tablefmt="grid")

# Render the table as markdown
display(Markdown(f"```\n{comparison_table}\n```"))

# Save the table as a text file
with open("../images/segment_comparison_table.txt", "w") as file:
    file.write(comparison_table)
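
Hard-coded percentages are easy to get out of sync with the data. As a sketch, the customer proportions can be derived from the segment counts reported in the behavioural segment distribution; on the real table, `rfm['RFM Customer Segment'].value_counts(normalize=True)` would give the same shares as fractions.

```python
import pandas as pd

# Segment sizes as reported in the behavioural segment distribution
counts = pd.Series({
    "Champions": 934,
    "Loyal Customers": 1008,
    "Potential Loyalists": 1092,
    "At Risk": 759,
    "Hibernating": 545,
})

# Share of the customer base per segment, in percent
proportion = (counts / counts.sum() * 100).round(2)
print(proportion)
print(f"Total customers: {counts.sum()}")
```

A quick sanity check on any such table is that the proportions add up to roughly 100%.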

The RFM analysis highlights notable patterns in customer behavior and revenue contribution:

- **Champions Lead in Revenue**:
*Distribution*: 21.53% of the customer base (934 customers).
*Revenue Contribution*: 70.19% of total revenue.
This segment demonstrates high engagement and significant spending, underscoring its critical importance to the business.

- **Loyal Customers and Potential Loyalists Show Potential:**
Together, they make up 48.41% of customers (2100 customers) but contribute only 25.71% of revenue.
These segments offer substantial growth opportunities to increase revenue through targeted engagement.

- **At Risk Customers Have Moderate Representation but Low Revenue Impact:**
*Distribution*: 17.50% of the customer base (759 customers).
*Revenue Contribution*: 2.92%.
This segment requires intervention to prevent further disengagement.

- **Hibernating Customers Have Minimal Engagement:**
*Distribution*: 12.56% of customers (545 customers).
*Revenue Contribution*: 1.18%.
They represent the least valuable segment, warranting limited focus.

### **Recommendations:**

- `Strengthen Relationships with Champions`:
Focus on retention strategies like exclusive loyalty programs, early access to products, or premium services.
Ensure consistent and personalized communication to maintain their satisfaction and spending levels.

- `Nurture Loyal Customers and Potential Loyalists`:
Create targeted campaigns to increase engagement, such as offering tiered rewards for higher spending or incentivizing frequent purchases.
Educate these customers about additional products/services to encourage upselling and cross-selling.

- `Re-engage At Risk Customers`:
Deploy win-back campaigns, including special offers, personalized outreach, or feedback collection to understand their disengagement.
Focus on reactivating high-value customers within this segment.

- `Optimize Efforts for Hibernating Customers`:
Periodic reminders or seasonal promotions can re-engage this group, but prioritize resources on higher-value segments.

- `Monitor and Track Performance`:
Regularly update RFM scores and revenue contributions to identify changes in customer behavior.
Use these insights to refine marketing and operational strategies.
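
One way to track changes between scoring runs is to compare each customer's segment across two snapshots. The sketch below is illustrative only: the two frames, their column names, and the customers are hypothetical stand-ins for saved RFM tables from different dates.

```python
import pandas as pd

# Two hypothetical scoring runs for the same customers
previous = pd.DataFrame({
    "CustomerID": ["A", "B", "C", "D"],
    "Segment": ["Champions", "Loyal Customers", "At Risk", "At Risk"],
})
current = pd.DataFrame({
    "CustomerID": ["A", "B", "C", "D"],
    "Segment": ["Champions", "At Risk", "At Risk", "Hibernating"],
})

# Align the two runs on CustomerID and count transitions between segments
merged = previous.merge(current, on="CustomerID", suffixes=("_prev", "_curr"))
migration = pd.crosstab(merged["Segment_prev"], merged["Segment_curr"])

# Customers whose segment changed since the previous run
changed = merged[merged["Segment_prev"] != merged["Segment_curr"]]
print(migration)
print(changed["CustomerID"].tolist())
```

Rows of the migration table are the earlier segments and columns the later ones, so off-diagonal cells show who moved and in which direction.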

By focusing on high-value segments and nurturing growth opportunities, the business can enhance customer satisfaction, boost revenue, and build a more loyal customer base.