## What’s a Customer Worth? 
#### Modelling Customer Lifetime Value (CLV)

Data Source: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/

In [1]:
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt 
import datetime as dt

In [2]:
df = pd.read_excel('data_set_1_online_retail.xlsx')
df['CustomerID'] = df['CustomerID'].astype(str)

df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   541909 non-null  object        
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 33.1+ MB


In [4]:
df.describe(include='all')

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| count | 541909.0 | 541909 | 540455 | 541909.0 | 541909 | 541909.0 | 541909.0 | 541909 |
| unique | 25900.0 | 4070 | 4223 | | 23260 | | 4373.0 | 38 |
| top | 573585.0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | | 2011-10-31 14:41:00 | | | United Kingdom |
| freq | 1114.0 | 2313 | 2369 | | 1114 | | 135080.0 | 495478 |
| first | | | | | 2010-12-01 08:26:00 | | | |
| last | | | | | 2011-12-09 12:50:00 | | | |
| mean | | | | 9.55225 | | 4.611114 | | |
| std | | | | 218.081158 | | 96.759853 | | |
| min | | | | -80995.0 | | -11062.06 | | |
| 25% | | | | 1.0 | | 1.25 | | |


As usual, there is some cleaning to do. We then reduce the dataframe to __CustomerID__ and __InvoiceDate__ (whose time component we will drop), and add a new column, __Sales__:

In [5]:
df = df[pd.notnull(df['CustomerID'])]
df = df[(df['Quantity']>0)]
df['Sales'] = df['Quantity'] * df['UnitPrice']
cols_of_interest = ['CustomerID', 'InvoiceDate', 'Sales']
df = df[cols_of_interest]

print(df.head())
print(df['CustomerID'].nunique())

  CustomerID         InvoiceDate  Sales
0    17850.0 2010-12-01 08:26:00  15.30
1    17850.0 2010-12-01 08:26:00  20.34
2    17850.0 2010-12-01 08:26:00  22.00
3    17850.0 2010-12-01 08:26:00  20.34
4    17850.0 2010-12-01 08:26:00  20.34
4340
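
One caveat about the cleaning step: `CustomerID` was cast to `str` when the file was loaded, and in the pandas version used here that turns missing IDs into the literal string `'nan'`. `pd.notnull` then no longer removes those rows, and `'nan'` can be counted as a customer by `nunique()`. A minimal sketch of the alternative ordering, on a made-up frame (the `toy` data is hypothetical, not the retail dataset), drops missing IDs before casting:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two real customers and two transactions without an ID.
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
    "Quantity": [6, 2, 8, 1],
})

# Casting first can hide missing values as the string 'nan'
# (the exact behaviour depends on the pandas version).
cast_first = toy["CustomerID"].astype(str)
n_hidden = int((cast_first == "nan").sum())

# Dropping missing IDs first, then casting, keeps only real customers.
clean = toy.dropna(subset=["CustomerID"]).copy()
clean["CustomerID"] = clean["CustomerID"].astype(int).astype(str)

print(n_hidden)
print(clean["CustomerID"].tolist())
```

Casting through `int` also removes the trailing `.0` from the IDs, which is a cosmetic choice rather than a requirement.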


## CLV Model Definition
For the CLV models, the following nomenclature is used:
* Frequency is the number of repeat purchases the customer has made, so it is one less than the total number of purchases.
* T is the age of the customer in the chosen time unit (days, in our dataset). It equals the duration between the customer’s first purchase and the end of the period under study.
* Recency is the age of the customer at their most recent purchase. It equals the duration between the customer’s first purchase and their latest purchase. (A customer with only one purchase therefore has a recency of 0.)
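
To make the three definitions concrete, here is a small worked example for a single customer using only the standard library. The purchase dates are invented for illustration, and the sketch assumes daily time units, so several orders on the same day count as one purchase.

```python
from datetime import date

# Hypothetical purchase dates for one customer (not taken from the dataset).
purchases = [date(2011, 1, 10), date(2011, 3, 1), date(2011, 3, 1), date(2011, 6, 15)]
period_end = date(2011, 12, 9)  # end of the observation period

# With daily time units, several orders on one day count once.
purchase_days = sorted(set(purchases))
first, last = purchase_days[0], purchase_days[-1]

frequency = len(purchase_days) - 1  # repeat purchases only
recency = (last - first).days       # customer's age at the latest purchase
T = (period_end - first).days       # customer's age at the end of the period

print(frequency, recency, T)  # 2 156 333
```

Recency can never exceed T, and both are measured from the customer's first purchase rather than from the start of the dataset.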

## Data Exploration

In [6]:
from lifetimes.plotting import *
from lifetimes.utils import *
from lifetimes.estimation import *

In [7]:
import lifetimes
print(dir(lifetimes))

['BetaGeoFitter', 'GammaGammaFitter', 'ModifiedBetaGeoFitter', 'ParetoNBDFitter', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'estimation', 'generate_data', 'plotting', 'utils', 'version']


In [8]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate']).dt.date  # keep the date, drop the time
df['Sales'] = df['Sales'].astype(int)  # note: truncates sales to whole currency units
df = df.sort_values('InvoiceDate')
# df already holds only CustomerID, InvoiceDate and Sales, so it is passed as-is
data = summary_data_from_transaction_data(
    df, customer_id_col='CustomerID', datetime_col='InvoiceDate',
    monetary_value_col='Sales', observation_period_end='2011-12-09')
data.head()

AttributeError: 'DataFrame' object has no attribute 'ix'
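
This `AttributeError` is most likely a version mismatch: `DataFrame.ix` was removed in pandas 1.0, and the installed `lifetimes` release apparently still calls it, so upgrading `lifetimes` is probably the simplest fix. If upgrading is not an option, the summary can be approximated with plain pandas. The sketch below is not the library's implementation: `rfm_summary` is a hypothetical helper, it assumes daily periods, and the monetary value convention (mean daily spend over repeat-purchase days, 0 when there are none) is an assumption about what the library reports.

```python
import pandas as pd

def rfm_summary(tx, customer_id_col, datetime_col, monetary_value_col,
                observation_period_end):
    """Hypothetical pandas-only frequency/recency/T summary (daily periods)."""
    tx = tx[[customer_id_col, datetime_col, monetary_value_col]].copy()
    tx[datetime_col] = pd.to_datetime(tx[datetime_col]).dt.normalize()
    end = pd.Timestamp(observation_period_end)
    tx = tx[tx[datetime_col] <= end]

    # Collapse to one row per customer per day.
    daily = tx.groupby([customer_id_col, datetime_col],
                       as_index=False)[monetary_value_col].sum()

    grouped = daily.groupby(customer_id_col)[datetime_col]
    first, last, n_days = grouped.min(), grouped.max(), grouped.count()

    summary = pd.DataFrame({
        "frequency": (n_days - 1).astype(float),
        "recency": (last - first).dt.days.astype(float),
        "T": (end - first).dt.days.astype(float),
    })

    # Assumed convention: average daily spend over repeat-purchase days only.
    is_first_day = daily[datetime_col] == daily[customer_id_col].map(first)
    repeat_mean = (daily[~is_first_day]
                   .groupby(customer_id_col)[monetary_value_col].mean())
    summary["monetary_value"] = repeat_mean.reindex(summary.index).fillna(0.0)
    return summary

# Made-up transactions, only to show the shape of the result.
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "InvoiceDate": ["2011-01-10", "2011-03-01", "2011-06-15", "2011-11-09"],
    "Sales": [10.0, 20.0, 40.0, 5.0],
})
summary = rfm_summary(toy, "CustomerID", "InvoiceDate", "Sales", "2011-12-09")
print(summary)
```

The result is indexed by customer ID with `frequency`, `recency`, `T` and `monetary_value` columns, so it can be inspected the same way as the `data` frame the cell was meant to produce.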

In [14]:
df2 = pd.read_excel('data_for_RFM_score_analysis.xlsx')
df2.head()

| | id | date | sales |
|---|---|---|---|
| 0 | 000173c5-978c-4b52-b7a4-5ebf974deb86 | 2020-08-13 | 1690.0 |
| 1 | 000173c5-978c-4b52-b7a4-5ebf974deb86 | 2020-08-14 | 6145.0 |
| 2 | 000173c5-978c-4b52-b7a4-5ebf974deb86 | 2020-08-15 | 4550.0 |
| 3 | 000173c5-978c-4b52-b7a4-5ebf974deb86 | 2020-08-17 | 1270.0 |
| 4 | 000173c5-978c-4b52-b7a4-5ebf974deb86 | 2020-08-20 | 3830.0 |


In [17]:
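# The signature is (transactions, customer_id_col, datetime_col, ...). Passing 'date' and 'id'
# in the opposite order makes lifetimes parse the IDs as dates and raises
# "ParserError: Unknown string format: 000173c5-978c-4b52-b7a4-5ebf974deb86".
summary_data_from_transaction_data(df2, customer_id_col='id', datetime_col='date', monetary_value_col='sales')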

In [16]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804659 entries, 0 to 804658
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   id      804659 non-null  object        
 1   date    804659 non-null  datetime64[ns]
 2   sales   804659 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 18.4+ MB


##### References
1. https://towardsdatascience.com/whats-a-customer-worth-8daf183f8a4f
2. https://medium.com/swlh/identify-your-high-value-customer-7b8868b65554
3. https://lifetimes.readthedocs.io/en/latest/Quickstart.html
4. https://www.mikulskibartosz.name/predicting-customer-lifetime-value-using-the-pareto-nbd-model-and-gamma-gamma-model/