# Step-by-Step Guide to Predict Customer Lifetime Value (CLV)
Predicting Customer Lifetime Value (CLV) is a powerful way to understand the long-term value each customer brings to your business. This guide will walk you through the process—from preparing your data to building and validating a CLV prediction model.

---

## Identifying and Preparing Data Sources

Before diving into predictive modeling, it's crucial to gather high-quality data that captures customer behavior.

### Data Requirements
You'll need the following types of data:
- Transactional Data: Purchase frequency, transaction amounts, and timestamps.
- Demographic Data: Basic customer details like age, location, and acquisition source.
- Behavioral Data: Engagement metrics such as browsing history or product interaction.
- Historical CLV (optional): Use this as a benchmark to validate predictions.

### Data Preparation
Once your data is ready, follow these steps to clean and prepare it for analysis.

#### Acquire Data
In this example, we'll use the Online Retail dataset from the UCI Machine Learning Repository.

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv("Online Retail.csv", encoding="unicode_escape", parse_dates=['InvoiceDate'])

#### Feature Selection
Select relevant columns for analysis.

In [None]:
features = ['CustomerID', 'InvoiceNo', 'InvoiceDate', 'Quantity', 'UnitPrice']
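# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
data_clv = data[features].copy()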
data_clv.head()

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,Quantity,UnitPrice
0,17850.0,536365,2010-12-01 08:26:00,6,2.55
1,17850.0,536365,2010-12-01 08:26:00,6,3.39
2,17850.0,536365,2010-12-01 08:26:00,8,2.75
3,17850.0,536365,2010-12-01 08:26:00,6,3.39
4,17850.0,536365,2010-12-01 08:26:00,6,3.39


#### Feature Engineering
Derive new features, such as total sales per transaction.

In [None]:
data_clv['TotalSales'] = data_clv['Quantity'].multiply(data_clv['UnitPrice'])
data_clv.head()


Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,Quantity,UnitPrice,TotalSales
0,17850.0,536365,2010-12-01 08:26:00,6,2.55,15.3
1,17850.0,536365,2010-12-01 08:26:00,6,3.39,20.34
2,17850.0,536365,2010-12-01 08:26:00,8,2.75,22.0
3,17850.0,536365,2010-12-01 08:26:00,6,3.39,20.34
4,17850.0,536365,2010-12-01 08:26:00,6,3.39,20.34


## Data Quality Checks

Ensure the data is clean and reliable by addressing common issues:

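**Handle Invalid and Missing Data:** Drop rows with non-positive sales (returns, cancellations, and zero-priced lines) and rows with no `CustomerID`, since those rows cannot be attributed to a customer.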

In [None]:
data_clv = data_clv[data_clv['TotalSales'] > 0]
data_clv = data_clv[pd.notnull(data_clv['CustomerID'])]

**Remove Duplicates:** Eliminate duplicate entries to avoid bias.
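
A minimal sketch of the duplicate check follows, using a small hypothetical frame rather than the real dataset (the name `toy` and its rows are assumptions for illustration). One caveat: `data_clv` no longer carries a product identifier, so different products with the same quantity and price on one invoice look identical there. It is safer to deduplicate on the full set of columns (including a product code such as `StockCode`) before selecting features. The summary figures later in this guide appear to have been produced without this step, so applying it would change them.

```python
import pandas as pd

# Hypothetical toy rows: the second row is an exact repeat of the first,
# while rows 3 and 4 share quantity and price but are different products.
toy = pd.DataFrame({
    "CustomerID": [17850, 17850, 17850, 17850],
    "InvoiceNo": ["536365", "536365", "536365", "536365"],
    "StockCode": ["85123A", "85123A", "71053", "84029G"],
    "Quantity": [6, 6, 6, 6],
    "UnitPrice": [2.55, 2.55, 3.39, 3.39],
})

# Count rows that are identical to an earlier row across ALL columns
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_dupes} duplicate row(s); {len(deduped)} remain")
```

Without the `StockCode` column, rows 3 and 4 would also be treated as duplicates and one legitimate sale would be lost.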

### Quick Dataset Summary
After cleaning, summarize your data to verify its quality.

In [None]:
maxdate = data_clv['InvoiceDate'].dt.date.max()
mindate = data_clv['InvoiceDate'].dt.date.min()
unique_cust = data_clv['CustomerID'].nunique()
tot_quantity = data_clv['Quantity'].sum()
tot_sales = data_clv['TotalSales'].sum()

print(f"Time Range: {mindate} to {maxdate}")
print(f"Unique Customers: {unique_cust}")
print(f"Total Quantity Sold: {tot_quantity}")
print(f"Total Sales: {tot_sales}")

Time Range: 2010-12-01 to 2011-12-09
Unique Customers: 4338
Total Quantity Sold: 5167812
Total Sales: 8911407.904


## Building the CLV Prediction Model

### Why Use the BG/NBD and Gamma-Gamma Models?
The BG/NBD model predicts future transactions for each customer, while the Gamma-Gamma model estimates the monetary value of those transactions. Combined, these models calculate Customer Lifetime Value.

### Feature Engineering
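First, create an RFM table (recency, frequency, monetary value). In `lifetimes`, `frequency` is the number of repeat purchase days, `recency` is the customer's age at their most recent purchase, `T` is the customer's age at the end of the observation period, and `monetary_value` is the average value of the repeat purchases. With daily data, all durations are in days.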

In [None]:
! pip install lifetimes



In [None]:
import lifetimes

# Create summary data from transaction data
summary = lifetimes.utils.summary_data_from_transaction_data(
    data_clv, 'CustomerID', 'InvoiceDate', 'TotalSales')
summary = summary.reset_index()
summary.head()

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value
0,12346.0,0.0,0.0,325.0,0.0
1,12347.0,6.0,365.0,367.0,599.701667
2,12348.0,3.0,283.0,358.0,301.48
3,12349.0,0.0,0.0,18.0,0.0
4,12350.0,0.0,0.0,310.0,0.0
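
To make these columns concrete, here is a rough pandas-only sketch of how such a table can be derived by hand. The toy transactions, the `observation_end` date, and the variable names are assumptions for illustration. The logic follows the usual definitions (first purchase excluded from `frequency` and `monetary_value`) and is not the library's own code.

```python
import pandas as pd

# Hypothetical toy transactions
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-11", "2011-01-31", "2011-01-21"]),
    "TotalSales": [10.0, 20.0, 40.0, 5.0],
})
observation_end = pd.Timestamp("2011-02-10")  # assumed end of the data

# Collapse to one row per customer per day
daily = (tx.assign(day=tx["InvoiceDate"].dt.normalize())
           .groupby(["CustomerID", "day"], as_index=False)["TotalSales"].sum())

days = daily.groupby("CustomerID")["day"]
rfm = pd.DataFrame({
    "frequency": days.nunique() - 1,                   # repeat purchase days
    "recency": (days.max() - days.min()).dt.days,      # age at last purchase
    "T": (observation_end - days.min()).dt.days,       # age at end of data
})

# Average daily sales over repeat purchases only (first purchase day excluded)
first_day = daily.groupby("CustomerID")["day"].transform("min")
repeat = daily[daily["day"] > first_day]
rfm["monetary_value"] = (repeat.groupby("CustomerID")["TotalSales"].mean()
                               .reindex(rfm.index).fillna(0.0))
print(rfm)
```

A customer with a single purchase day, like customer 2 here, ends up with `frequency`, `recency`, and `monetary_value` all equal to zero, which matches the pattern in the table above.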


### Train the BG/NBD Model
Fit the model to predict the number of future transactions.

In [None]:
from lifetimes import BetaGeoFitter

bgf = BetaGeoFitter()
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

# Predict transactions for the next 30 days
t = 30
summary['pred_num_txn'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    t, summary['frequency'], summary['recency'], summary['T'])
summary.head()

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn
0,12346.0,0.0,0.0,325.0,0.0,0.062948
1,12347.0,6.0,365.0,367.0,599.701667,0.469643
2,12348.0,3.0,283.0,358.0,301.48,0.268666
3,12349.0,0.0,0.0,18.0,0.0,0.285282
4,12350.0,0.0,0.0,310.0,0.0,0.065439


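### Churn Probability
The BG/NBD model also estimates the probability that each customer is still "alive" (active); its complement can be read as a churn probability. Under this model, customers with no repeat purchases (`frequency` of 0) always get a probability of 1.0, so treat their churn estimate with caution.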

In [None]:
# Calculate the probability of being alive for each customer
summary['probability_alive'] = bgf.conditional_probability_alive(
    summary['frequency'], summary['recency'], summary['T']
)

# Calculate churn probability
summary['churn_probability'] = 1 - summary['probability_alive']

In [None]:
summary.head(10)

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn,probability_alive,churn_probability
0,12346.0,0.0,0.0,325.0,0.0,0.062948,1.0,0.0
1,12347.0,6.0,365.0,367.0,599.701667,0.469643,0.999698,0.000302
2,12348.0,3.0,283.0,358.0,301.48,0.268666,0.999177,0.000823
3,12349.0,0.0,0.0,18.0,0.0,0.285282,1.0,0.0
4,12350.0,0.0,0.0,310.0,0.0,0.065439,1.0,0.0
5,12352.0,6.0,260.0,296.0,368.256667,0.56085,0.999406,0.000594
6,12353.0,0.0,0.0,204.0,0.0,0.090856,1.0,0.0
7,12354.0,0.0,0.0,232.0,0.0,0.082402,1.0,0.0
8,12355.0,0.0,0.0,214.0,0.0,0.087644,1.0,0.0
9,12356.0,2.0,303.0,325.0,269.905,0.215146,0.999478,0.000522
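
The pattern in the table (exactly 1.0 whenever `frequency` is 0) follows from the closed-form P(alive) expression commonly given for BG/NBD. The sketch below restates that expression in NumPy. The function name and the parameter values `r`, `alpha`, `a`, `b` are made up for illustration and are not the fitted values from this guide.

```python
import numpy as np

def bgnbd_probability_alive(frequency, recency, T, r, alpha, a, b):
    """P(alive) under BG/NBD, given model parameters r, alpha, a, b."""
    frequency = np.asarray(frequency, dtype=float)
    recency = np.asarray(recency, dtype=float)
    T = np.asarray(T, dtype=float)
    # Rows with frequency 0 are overwritten below; clamp to keep the math defined
    x = np.maximum(frequency, 1.0)
    odds_dead = (a / (b + x - 1.0)) * ((alpha + T) / (alpha + recency)) ** (r + x)
    p = 1.0 / (1.0 + odds_dead)
    # The model cannot observe a dropout before the first repeat purchase
    return np.where(frequency == 0, 1.0, p)

# Hypothetical parameters, NOT fitted on the tutorial data
params = dict(r=0.8, alpha=70.0, a=0.01, b=2.5)

p_alive = bgnbd_probability_alive(
    frequency=[0, 6, 6],
    recency=[0, 365, 100],
    T=[325, 367, 367],
    **params,
)
print(p_alive)
```

The second and third customers have the same purchase count, but the one whose last purchase was long ago (`recency` of 100 out of `T` of 367) gets a lower probability of being alive.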


### Export and Load - BGF Model

In [None]:
import joblib

In [None]:
joblib.dump(bgf, '/content/bgf_model.pkl')

['/content/bgf_model.pkl']

In [None]:
loaded_bgf_model = joblib.load('/content/bgf_model.pkl')

In [None]:
# Predict transactions for the next 30 days
t = 30
summary['pred_num_txn'] = loaded_bgf_model.conditional_expected_number_of_purchases_up_to_time(
    t, summary['frequency'], summary['recency'], summary['T'])
summary.head()

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn
0,12346.0,0.0,0.0,325.0,0.0,0.062948
1,12347.0,6.0,365.0,367.0,599.701667,0.469643
2,12348.0,3.0,283.0,358.0,301.48,0.268666
3,12349.0,0.0,0.0,18.0,0.0,0.285282
4,12350.0,0.0,0.0,310.0,0.0,0.065439


### Train the Gamma-Gamma Model
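Fit a model to predict the monetary value of transactions. The Gamma-Gamma model needs at least one repeat purchase per customer, so customers with a `frequency` of 0 are dropped first.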

In [None]:
from lifetimes import GammaGammaFitter

# Keep only customers with repeat purchases; copy to avoid SettingWithCopyWarning
summary = summary[summary['frequency'] > 0].copy()
ggf = GammaGammaFitter()
ggf.fit(summary['frequency'], summary['monetary_value'])

# Calculate expected average sales per transaction
# (the method name says "profit", but monetary_value here is revenue)
summary['exp_avg_sales'] = ggf.conditional_expected_average_profit(
    summary['frequency'], summary['monetary_value'])
summary.head()


Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn,exp_avg_sales
1,12347.0,6.0,365.0,367.0,599.701667,0.469643,569.978836
2,12348.0,3.0,283.0,358.0,301.48,0.268666,333.784235
5,12352.0,6.0,260.0,296.0,368.256667,0.56085,376.175359
9,12356.0,2.0,303.0,325.0,269.905,0.215146,324.039419
11,12358.0,1.0,149.0,150.0,683.2,0.25017,539.907126
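
The Gamma-Gamma model assumes that purchase frequency and monetary value are roughly independent, so it is worth checking their correlation before relying on the fit. The sketch below runs the check on only the five rows printed above, which is far too few to draw conclusions from. On the real table the equivalent call would be `summary['frequency'].corr(summary['monetary_value'])`.

```python
import pandas as pd

# The five rows shown in the output above (illustration only)
rows = pd.DataFrame({
    "frequency": [6.0, 3.0, 6.0, 2.0, 1.0],
    "monetary_value": [599.701667, 301.48, 368.256667, 269.905, 683.2],
})

# Pearson correlation (the pandas default); values near 0 support the assumption
corr = rows["frequency"].corr(rows["monetary_value"])
print(f"Pearson correlation: {corr:.3f}")
```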


### Export and Load Gamma-Gamma Model

In [None]:
import joblib

In [None]:
joblib.dump(ggf, '/content/ggf_model.pkl')

['/content/ggf_model.pkl']

In [None]:
loaded_ggf_model = joblib.load('/content/ggf_model.pkl')

In [None]:
# Calculate expected average sales per transaction with the reloaded model
summary['exp_avg_sales'] = loaded_ggf_model.conditional_expected_average_profit(
    summary['frequency'], summary['monetary_value'])
summary.head()


Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn,exp_avg_sales
1,12347.0,6.0,365.0,367.0,599.701667,0.469643,569.978836
2,12348.0,3.0,283.0,358.0,301.48,0.268666,333.784235
5,12352.0,6.0,260.0,296.0,368.256667,0.56085,376.175359
9,12356.0,2.0,303.0,325.0,269.905,0.215146,324.039419
11,12358.0,1.0,149.0,150.0,683.2,0.25017,539.907126


### Calculating Customer Lifetime Value
Combine the predictions from the BG/NBD and Gamma-Gamma models to calculate CLV.

In [None]:
summary['predicted_clv'] = ggf.customer_lifetime_value(
    bgf, summary['frequency'], summary['recency'], summary['T'], summary['monetary_value'],
    time=1,  # Prediction horizon in months
    freq='D',  # Time unit of T in the summary data (days)
    discount_rate=0.01  # Monthly discount rate for future cash flows
)
summary.head()

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn,exp_avg_sales,predicted_clv
1,12347.0,6.0,365.0,367.0,599.701667,0.469643,569.978836,265.036192
2,12348.0,3.0,283.0,358.0,301.48,0.268666,333.784235,88.788717
5,12352.0,6.0,260.0,296.0,368.256667,0.56085,376.175359,208.889108
9,12356.0,2.0,303.0,325.0,269.905,0.215146,324.039419,69.025615
11,12358.0,1.0,149.0,150.0,683.2,0.25017,539.907126,133.731263


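You can also approximate CLV manually by multiplying the predicted number of transactions by the expected average sales. This version is not discounted, so it comes out slightly higher than `predicted_clv`: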

In [None]:
summary['manual_predicted_clv'] = summary['pred_num_txn'] * summary['exp_avg_sales']

In [None]:
summary.head()

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn,exp_avg_sales,predicted_clv,manual_predicted_clv
1,12347.0,6.0,365.0,367.0,599.701667,0.469643,569.978836,265.036192,267.686554
2,12348.0,3.0,283.0,358.0,301.48,0.268666,333.784235,88.788717,89.676604
5,12352.0,6.0,260.0,296.0,368.256667,0.56085,376.175359,208.889108,210.977999
9,12356.0,2.0,303.0,325.0,269.905,0.215146,324.039419,69.025615,69.715872
11,12358.0,1.0,149.0,150.0,683.2,0.25017,539.907126,133.731263,135.068576
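
The gap between the last two columns is consistent with discounting. With a one-month horizon and a monthly `discount_rate` of 0.01, the library value appears to equal the manual product divided by 1.01. A quick arithmetic check on the five rows shown above:

```python
import numpy as np

# Values copied from the table above
predicted_clv = np.array([265.036192, 88.788717, 208.889108, 69.025615, 133.731263])
manual_clv = np.array([267.686554, 89.676604, 210.977999, 69.715872, 135.068576])
discount_rate = 0.01  # same monthly rate passed to customer_lifetime_value

# Discount the manual figure by one month and compare
discounted_manual = manual_clv / (1 + discount_rate)
max_gap = float(np.max(np.abs(discounted_manual - predicted_clv)))
print(f"Largest remaining difference: {max_gap:.6f}")
```

The remaining differences are at the level of rounding in the printed table.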


### Model Validation
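As a sanity check, compare the library's CLV with the manual calculation. This confirms that the two computations agree up to discounting. It does not measure predictive accuracy, which requires comparing predictions with actual spending in a held-out period or with a historical CLV benchmark.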

#### Error Metrics
Use metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(summary['predicted_clv'], summary['manual_predicted_clv'])
rmse = np.sqrt(mean_squared_error(summary['predicted_clv'], summary['manual_predicted_clv']))
print(f"Mean Absolute Error (MAE) : {mae}")
print(f"Root Mean Square Error (RMSE) : {rmse}")

Mean Absolute Error (MAE) : 2.388740115190495
Root Mean Square Error (RMSE) : 8.100815583510194
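
For an out-of-sample check, a common approach is to split the transactions into a calibration period and a holdout period, fit on the first, and compare predictions with what happened in the second. The `lifetimes` package ships a `calibration_and_holdout_data` utility for this. The pandas-only sketch below shows the idea on hypothetical toy data. The split dates, the frame names, and the naive rate-based prediction are assumptions standing in for a BG/NBD model fitted on calibration data only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-02-01", "2011-03-01", "2011-04-15",
        "2011-01-10", "2011-04-20", "2011-02-05",
    ]),
})
tx["day"] = tx["InvoiceDate"].dt.normalize()

calibration_end = pd.Timestamp("2011-03-31")  # assumed split date
observation_end = pd.Timestamp("2011-04-30")  # assumed end of holdout
holdout_days = (observation_end - calibration_end).days

cal = tx[tx["day"] <= calibration_end]
hold = tx[(tx["day"] > calibration_end) & (tx["day"] <= observation_end)]

# Calibration-period summary per customer
cal_days = cal.groupby("CustomerID")["day"]
val = pd.DataFrame({
    "frequency_cal": cal_days.nunique() - 1,
    "T_cal": (calibration_end - cal_days.min()).dt.days,
})

# Actual purchase days observed in the holdout period
val["frequency_holdout"] = (hold.groupby("CustomerID")["day"].nunique()
                                .reindex(val.index, fill_value=0))

# Stand-in prediction: calibration purchase rate times holdout length.
# In practice this column would come from the model fitted on `cal` only.
val["predicted_holdout"] = val["frequency_cal"] / val["T_cal"] * holdout_days

errors = val["predicted_holdout"] - val["frequency_holdout"]
mae = float(np.abs(errors).mean())
rmse = float(np.sqrt((errors ** 2).mean()))
print(val)
print(f"Holdout MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```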


In [None]:
summary['predicted_clv'].mean()

238.8740115190493

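The mean predicted CLV is about 238.87, so the MAE of about 2.39 reported above is 1% of it, which matches the 0.01 discount rate applied over one month.

Note that the predicted CLV is based on sales (revenue), not profit. To express it as profit, multiply it by the profit margin.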

In [None]:
# CLV in terms of profit (assuming a 5% profit margin)
profit_margin = 0.05
summary['CLV'] = summary['predicted_clv'] * profit_margin
summary.head()

Unnamed: 0,CustomerID,frequency,recency,T,monetary_value,pred_num_txn,exp_avg_sales,predicted_clv,manual_predicted_clv,CLV
1,12347.0,6.0,365.0,367.0,599.701667,0.469643,569.978836,265.036192,267.686554,13.25181
2,12348.0,3.0,283.0,358.0,301.48,0.268666,333.784235,88.788717,89.676604,4.439436
5,12352.0,6.0,260.0,296.0,368.256667,0.56085,376.175359,208.889108,210.977999,10.444455
9,12356.0,2.0,303.0,325.0,269.905,0.215146,324.039419,69.025615,69.715872,3.451281
11,12358.0,1.0,149.0,150.0,683.2,0.25017,539.907126,133.731263,135.068576,6.686563


In [None]:
# Distribution of CLV for the business in the next 30 days
summary['CLV'].describe()

Unnamed: 0,CLV
count,2790.0
mean,11.943701
std,38.710017
min,1.422398
25%,3.681472
50%,6.159217
75%,10.595935
max,977.905041


Finally, we have predicted the CLV over the next 30 days for each customer with at least one repeat purchase.

The marketing team can now use this information to target customers and increase sales.

Targeting every customer individually is often impractical. If customer demographic data is available, you can first segment customers and then predict CLV for each segment. This segment-level information can then be used for personalized targeting. If demographic data is not available, a simple alternative is to build RFM segments and predict CLV for those segments.
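
As a rough sketch of the RFM route, the snippet below scores the five customers shown earlier on recency, frequency, and monetary value, groups them into two segments, and averages CLV per segment. The two-bin scoring, the `score` helper, the cutoff of 5, and the segment names are all assumptions for illustration. A real segmentation would use the full customer table and more bins.

```python
import numpy as np
import pandas as pd

# The five customers shown in the tables above
seg = pd.DataFrame({
    "CustomerID": [12347, 12348, 12352, 12356, 12358],
    "frequency": [6, 3, 6, 2, 1],
    "recency": [365, 283, 260, 303, 149],
    "T": [367, 358, 296, 325, 150],
    "monetary_value": [599.701667, 301.48, 368.256667, 269.905, 683.2],
    "CLV": [13.25181, 4.439436, 10.444455, 3.451281, 6.686563],
})

# lifetimes' recency is the age at last purchase, so the classic
# "days since last purchase" is T - recency
seg["days_since_last"] = seg["T"] - seg["recency"]

def score(values, ascending=True, bins=2):
    """Hypothetical helper: rank values, then cut the ranks into equal bins."""
    ranks = values.rank(method="first", ascending=ascending)
    labels = list(range(1, bins + 1))
    return pd.qcut(ranks, bins, labels=labels).astype(int)

seg["R"] = score(seg["days_since_last"], ascending=False)  # recent buyers score high
seg["F"] = score(seg["frequency"])
seg["M"] = score(seg["monetary_value"])
seg["rfm_total"] = seg[["R", "F", "M"]].sum(axis=1)

# Assumed cutoff: a total of 5 or more counts as the "high" segment
seg["segment"] = np.where(seg["rfm_total"] >= 5, "high", "low")

segment_clv = seg.groupby("segment")["CLV"].agg(["count", "mean"])
print(segment_clv)
```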