# Customer Lifetime Value Prediction in Python


This notebook predicts Customer Lifetime Value (CLV) using two probabilistic models: BG/NBD and Gamma-Gamma.

## 1. Import libraries

In [1]:
import lifetimes
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
from lifetimes import BetaGeoFitter # BG/NBD
from lifetimes import GammaGammaFitter # Gamma-Gamma Model
from lifetimes.plotting import plot_frequency_recency_matrix
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## 2. Read in the dataset

In [2]:
df = pd.read_excel('OnlineRetail.xlsx')


## 3. Understanding the dataset

In [3]:
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [5]:
df.describe()


| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |


Quantity ranges from -80,995 to 80,995 and UnitPrice from -11,062.06 to 38,970, so both columns contain extreme and even negative values.
Let’s clean our data.

## 4. Data Preprocessing
### Filtering Our Data

In [6]:
df = df[df['Quantity'] > 0]   # keep only rows with a positive quantity (drops zero and negative values)
df = df[df['UnitPrice'] > 0]  # keep only rows with a positive unit price
df = df[~df['InvoiceNo'].str.contains("C", na=False)]  # invoice numbers containing "C" are returns; drop them as well

### Checking for missing values

In [7]:
df.isnull().sum()


InvoiceNo           0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     132220
Country             0
dtype: int64

We see that there are missing values within CustomerID. Let’s remove any observation without CustomerID.

In [8]:
df.dropna(inplace=True)  # inplace=True modifies df directly instead of returning a new data frame
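
Note that `dropna()` without arguments drops a row if any column is missing. That works here because CustomerID is the only column with gaps after filtering, but a more targeted variant restricts the check to that column. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one row lacks a CustomerID, another lacks a Description.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "Description": ["LANTERN", None, "HOLDER"],
    "CustomerID": [17850.0, 13047.0, np.nan],
})

# Drop only rows where CustomerID is missing; a missing Description is tolerated.
toy_clean = toy.dropna(subset=["CustomerID"])
print(toy_clean)
```

With `subset`, a row is kept even if an unrelated column is empty, which avoids losing customers for reasons that do not matter to the CLV models.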

### Handling Outliers
We can also limit the influence of outliers by capping them at chosen quantiles (here the 5th and 95th percentiles). You may also use a different method or leave them as they are.

*Here is the function that I created to handle outliers by capping them.*

In [9]:
def find_boundaries(df, variable, q1=0.05, q2=0.95):
    # The boundaries are the q1 and q2 quantiles of the column
    lower_boundary = df[variable].quantile(q1)  # lower quantile
    upper_boundary = df[variable].quantile(q2)  # upper quantile
    return upper_boundary, lower_boundary

def capping_outliers(df, variable):
    # Replace values outside the boundaries with the nearest boundary (modifies df)
    upper_boundary, lower_boundary = find_boundaries(df, variable)
    df[variable] = np.where(df[variable] > upper_boundary, upper_boundary,
                            np.where(df[variable] < lower_boundary, lower_boundary, df[variable]))

**Capping Outliers for UnitPrice and Quantity**

In [10]:
capping_outliers(df,'UnitPrice')
capping_outliers(df,'Quantity')
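
The same quantile capping can also be written with pandas' built-in `Series.clip`. The sketch below is an alternative formulation, not the function used in this notebook; `cap_with_clip` is a hypothetical helper name and the input values are made up (the extreme value 80995 is borrowed from the `describe()` output earlier).

```python
import pandas as pd

def cap_with_clip(series: pd.Series, q_low: float = 0.05, q_high: float = 0.95) -> pd.Series:
    """Return a copy of `series` limited to its [q_low, q_high] quantile range."""
    lower = series.quantile(q_low)
    upper = series.quantile(q_high)
    return series.clip(lower=lower, upper=upper)

# Made-up quantities: 1..99 plus one extreme order.
quantities = pd.Series(list(range(1, 100)) + [80995])
capped = cap_with_clip(quantities)

print("max before:", quantities.max())
print("max after: ", capped.max())
```

Unlike `capping_outliers`, this version returns a new Series and leaves its input untouched, so it would be used as `df['Quantity'] = cap_with_clip(df['Quantity'])`.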

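Let’s see our new values for Quantity and UnitPrice. (The output also shows the Total Price column, which is created in a later step; the cell numbering suggests this cell was re-run afterwards.)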


In [24]:
df.describe()

| | Quantity | UnitPrice | CustomerID | Total Price |
|---|---|---|---|---|
| count | 397884.0 | 397884.0 | 397884.0 | 397884.0 |
| mean | 8.868022 | 2.675785 | 15294.423453 | 16.107655 |
| std | 9.523425 | 2.275053 | 1713.14156 | 20.717408 |
| min | 1.0 | 0.42 | 12346.0 | 0.42 |
| 25% | 2.0 | 1.25 | 13969.0 | 4.95 |
| 50% | 6.0 | 1.95 | 15159.0 | 11.25 |
| 75% | 12.0 | 3.75 | 16795.0 | 17.7 |
| max | 36.0 | 8.5 | 18287.0 | 306.0 |


## Preparing Our Dataset (RFM Dataset)

After preprocessing our data, the next step is to create a Recency, Frequency, T, Monetary data frame.

### What are they?

- Frequency represents the number of repeat purchases the customer has made. More precisely, it is the number of time periods in which the customer made a purchase, not counting the period of the first purchase. With days as the unit, it is the number of distinct days with a purchase, minus one.


- Recency represents the age of the customer when they made their most recent purchases. This is equal to the duration between a customer’s first purchase and their latest purchase. (Thus if they have made only 1 purchase, the recency is 0.)


- T represents the age of the customer in whatever time units are chosen (days, in this dataset). This is equal to the duration between a customer’s first purchase and the end of the period under study.


- Monetary Value represents the average value of a given customer’s purchases. This is equal to the sum of all a customer’s purchases divided by the total number of purchases. Note that the denominator here is different than the frequency described above.
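
To make the first three definitions concrete, here is a small hand computation for one hypothetical customer, using days as the time unit. It is only a sketch of the definitions above, not the lifetimes implementation; `rfm_summary` and the purchase dates are made up for illustration.

```python
from datetime import date

def rfm_summary(purchase_dates, observation_end):
    """Compute (frequency, recency, T) in days for a single customer.

    Several purchases on the same day count as one purchase period.
    """
    days = sorted(set(purchase_dates))   # distinct purchase days
    first, last = days[0], days[-1]
    frequency = len(days) - 1            # repeat purchase days only
    recency = (last - first).days        # first purchase -> latest purchase
    T = (observation_end - first).days   # first purchase -> end of the study
    return frequency, recency, T

# Hypothetical customer: two orders on 1 Dec 2010, then one in March and one in September 2011.
dates = [date(2010, 12, 1), date(2010, 12, 1), date(2011, 3, 1), date(2011, 9, 1)]
print(rfm_summary(dates, observation_end=date(2011, 12, 9)))  # (2, 274, 373)
```

The two orders on the first day collapse into a single purchase period, so three distinct days give a frequency of 2.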


### Creating Column ‘Total Price’

To calculate Monetary Value we need to create a new feature by multiplying UnitPrice and Quantity. This gives the Total Price of each invoice line, which is later aggregated per customer.

In [25]:
df['Total Price'] = df['UnitPrice'] * df['Quantity']

To create the RFM data frame, we will use *summary_data_from_transaction_data* from lifetimes.

In [26]:
clv = lifetimes.utils.summary_data_from_transaction_data(df,'CustomerID','InvoiceDate',
                                                         'Total Price',observation_period_end='2011-12-09')

In [27]:
clv = clv[clv['frequency'] > 1]  # keep only customers with at least 2 repeat purchases

## Frequency/Recency analysis using the BG/NBD model

Using BetaGeoFitter, we fit the BG/NBD model to our new data frame so that we can predict the number of purchases for each customer.

In [28]:
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(clv['frequency'], clv['recency'], clv['T'])

<lifetimes.BetaGeoFitter: fitted with 1916 subjects, a: 0.00, alpha: 109.98, b: 0.00, r: 2.35>

## Expected Number of Purchases within 6 Months


In [29]:
t = 180  # prediction horizon in days (about 6 months)
clv['expected_purc_6_months'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, clv['frequency'], clv['recency'], clv['T'])
clv.sort_values(by='expected_purc_6_months',ascending=False).head(5)

| CustomerID | frequency | recency | T | monetary_value | expected_purc_6_months |
|---|---|---|---|---|---|
| 14911.0 | 131.0 | 372.0 | 373.0 | 917.278855 | 49.698999 |
| 12748.0 | 112.0 | 373.0 | 373.0 | 257.314911 | 42.617895 |
| 17841.0 | 111.0 | 372.0 | 373.0 | 349.07964 | 42.245205 |
| 15311.0 | 89.0 | 373.0 | 373.0 | 421.881573 | 34.046032 |
| 14606.0 | 88.0 | 372.0 | 373.0 | 125.302955 | 33.673343 |
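
The fitted parameters `a` and `b` both print as 0.00. In the BG/NBD model these two parameters describe the dropout probability, so this fit seems to behave almost like a plain NBD model in which customers never drop out. Under that assumption (my reading of the output, not something lifetimes reports), the expected number of purchases in the next `t` days has a simple closed form, `t * (r + frequency) / (alpha + T)`. The sketch below checks it against the table above using the rounded `r` and `alpha`; `expected_purchases_nbd` is a hypothetical helper, not a lifetimes function.

```python
def expected_purchases_nbd(t, frequency, T, r, alpha):
    """Expected purchases in the next `t` days, assuming no customer dropout."""
    return t * (r + frequency) / (alpha + T)

# Rounded parameters as printed by the fitted BetaGeoFitter.
r, alpha = 2.35, 109.98

# (frequency, T) for the top two customers in the table.
for customer_id, frequency, T in [(14911, 131, 373), (12748, 112, 373)]:
    estimate = expected_purchases_nbd(180, frequency, T, r, alpha)
    print(customer_id, round(estimate, 2))
```

The results land very close to the reported 49.70 and 42.62, which supports the no-dropout reading. It also means the model treats every customer in this subset as still active, so the forecasts are probably best seen as optimistic.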


## Gamma-Gamma Model
After predicting the expected number of purchases of each customer, we need to bring in monetary value in order to predict CLV.

The Gamma-Gamma model estimates the expected average value per transaction for each customer.

### Assumptions for Gamma-Gamma Model
In order to use the Gamma-Gamma model, we need to make sure that there is no correlation between frequency and monetary value.

In [30]:
clv[['frequency','monetary_value']].corr()


| | frequency | monetary_value |
|---|---|---|
| frequency | 1.0 | 0.110771 |
| monetary_value | 0.110771 | 1.0 |


Since the correlation is weak (about 0.11), let’s build the Gamma-Gamma model to predict values.

In [31]:
ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(clv["frequency"],
        clv["monetary_value"])

<lifetimes.GammaGammaFitter: fitted with 1916 subjects, p: 3.79, q: 0.34, v: 3.72>
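
As far as I understand the Gamma-Gamma model, the expected average transaction value for a customer with `x` repeat purchases and observed average value `m` has the closed form `p * (v + x * m) / (p * x + q - 1)`. The sketch below evaluates it with the rounded parameters printed above for customer 14911 (frequency 131, monetary_value 917.278855). `expected_avg_order_value` is a hypothetical helper; in lifetimes the corresponding method is `ggf.conditional_expected_average_profit`.

```python
def expected_avg_order_value(frequency, monetary_value, p, q, v):
    """Gamma-Gamma conditional expectation of a customer's average transaction value."""
    return p * (v + frequency * monetary_value) / (p * frequency + q - 1)

# Rounded parameters as printed by the fitted GammaGammaFitter.
p, q, v = 3.79, 0.34, 3.72

estimate = expected_avg_order_value(131, 917.278855, p, q, v)
print(round(estimate, 2))
```

For a customer with this many purchases the estimate stays within a couple of units of the observed average; customers with fewer purchases are adjusted more strongly.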

## Predicting CLV for the Next 6 Months
Now we are ready to predict Customer Lifetime Value using the BG/NBD and Gamma-Gamma models.

In [32]:
clv['6_Months_CLV'] = ggf.customer_lifetime_value(bgf,
                                   clv["frequency"],
                                   clv["recency"],
                                   clv["T"],
                                   clv["monetary_value"],
                                   time=6,              # prediction horizon in months
                                   freq='D',            # T and recency are measured in days
                                   discount_rate=0.01)  # monthly discount rate
clv.sort_values('6_Months_CLV',ascending=False).head()

| CustomerID | frequency | recency | T | monetary_value | expected_purc_6_months | 6_Months_CLV |
|---|---|---|---|---|---|---|
| 14096.0 | 16.0 | 97.0 | 101.0 | 3012.454375 | 15.657745 | 46062.3147 |
| 14911.0 | 131.0 | 372.0 | 373.0 | 917.278855 | 49.698999 | 44093.511057 |
| 14646.0 | 44.0 | 353.0 | 354.0 | 2507.804091 | 17.982416 | 43732.700984 |
| 14156.0 | 42.0 | 362.0 | 371.0 | 1366.275476 | 16.598352 | 21996.982767 |
| 18102.0 | 25.0 | 367.0 | 367.0 | 2112.8432 | 10.322125 | 21214.008259 |


Now we can see the predicted CLV of each customer for the next 6 months.
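
To see how a number like 44,093.51 for customer 14911 comes about, the CLV can be approximated by hand: expected purchases per month, times expected average order value, discounted month by month at 1%. The sketch below rests on two assumptions of mine: that the 49.698999 expected purchases are spread evenly over the six months, and that the Gamma-Gamma expected order value is `p * (v + x * m) / (p * x + q - 1)` with the rounded parameters printed earlier. `discounted_clv` is a hypothetical helper, not part of lifetimes.

```python
def discounted_clv(purchases_per_month, avg_order_value, months=6, monthly_discount_rate=0.01):
    """Sum the expected monthly revenue, discounting month i by (1 + rate) ** i."""
    return sum(
        purchases_per_month * avg_order_value / (1 + monthly_discount_rate) ** i
        for i in range(1, months + 1)
    )

# Customer 14911: expected_purc_6_months from the table, spread evenly over 6 months.
purchases_per_month = 49.698999 / 6

# Assumed Gamma-Gamma estimate, using the rounded p = 3.79, q = 0.34, v = 3.72.
avg_order_value = 3.79 * (3.72 + 131 * 917.278855) / (3.79 * 131 + 0.34 - 1)

clv_14911 = discounted_clv(purchases_per_month, avg_order_value)
print(round(clv_14911))
```

The result agrees with the table to within rounding of the model parameters, which suggests the `discount_rate` is applied per month rather than per year.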

## Segmenting CLV into Different Groups
We can also segment our customers into different groups.

In [33]:
clv['Segment'] =  pd.qcut(clv['6_Months_CLV'],4,labels = ['Hibernating','Need Attention', 
                                                          'LoyalCustomers', 'Champions'])
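
`pd.qcut` splits the values at their quartiles, so each of the four segments receives roughly the same number of customers, regardless of how unevenly CLV is distributed. A minimal sketch on made-up CLV values (not taken from the dataset):

```python
import pandas as pd

# Made-up CLV values for eight customers.
toy_clv = pd.Series([120.0, 80.0, 950.0, 400.0, 2300.0, 610.0, 55.0, 1500.0])

labels = ["Hibernating", "Need Attention", "LoyalCustomers", "Champions"]
segments = pd.qcut(toy_clv, 4, labels=labels)

# Each quartile-based segment ends up with the same number of customers.
print(segments.value_counts().sort_index())
```

Because the cut points are relative, "Champions" always means the top quarter of the customers being segmented, not customers above a fixed CLV threshold.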

In [35]:
clv.sort_values('6_Months_CLV',ascending=False).head()

| CustomerID | frequency | recency | T | monetary_value | expected_purc_6_months | 6_Months_CLV | Segment |
|---|---|---|---|---|---|---|---|
| 14096.0 | 16.0 | 97.0 | 101.0 | 3012.454375 | 15.657745 | 46062.3147 | Champions |
| 14911.0 | 131.0 | 372.0 | 373.0 | 917.278855 | 49.698999 | 44093.511057 | Champions |
| 14646.0 | 44.0 | 353.0 | 354.0 | 2507.804091 | 17.982416 | 43732.700984 | Champions |
| 14156.0 | 42.0 | 362.0 | 371.0 | 1366.275476 | 16.598352 | 21996.982767 | Champions |
| 18102.0 | 25.0 | 367.0 | 367.0 | 2112.8432 | 10.322125 | 21214.008259 | Champions |


In [36]:
# Let’s group our dataset by the segment:

clv.groupby('Segment').mean()

| Segment | frequency | recency | T | monetary_value | expected_purc_6_months | 6_Months_CLV |
|---|---|---|---|---|---|---|
| Hibernating | 3.169102 | 220.565762 | 291.824635 | 148.661593 | 2.552501 | 369.465054 |
| Need Attention | 4.018789 | 239.442589 | 282.941545 | 270.869214 | 3.00403 | 763.560023 |
| LoyalCustomers | 5.682672 | 241.569937 | 273.110647 | 369.673664 | 3.834613 | 1276.982217 |
| Champions | 11.244259 | 264.48643 | 284.411273 | 636.264425 | 6.007552 | 3416.144496 |




***After segmenting our customers by CLV***

We can:

- Offer specific products to each segment.
- Create a marketing plan to increase CLV for the lower segments.
- Focus on the higher segments in order to decrease customer acquisition costs.

Let’s sum up everything we’ve done:

- Cleaned the data.
- Created summary data that includes Frequency, Recency, T and Monetary Value.
- Trained a BG/NBD model to predict the number of purchases of each customer.
- Fitted a Gamma-Gamma model to predict the average monetary value.
- Combined both models to predict the 6-month CLV and segmented customers by it.


### Conclusion

Customer Lifetime Value prediction is a great way to get valuable insights about your customer acquisition, marketing efforts, and your company’s financial future.

In this post, we predicted customer lifetime value using two probabilistic models, BG/NBD and Gamma-Gamma.

There are other methods to predict CLV:

- Machine Learning
- Cohort Analysis
- Aggregate Methods

You may also find the notebook and dataset on my GitHub!

Hope you found it helpful!
