    CUSTOMER LIFETIME VALUE

Normally, we invest in customers through acquisition costs, online and offline ads, and promotions in order to generate revenue and be profitable. These actions produce some customers with a high lifetime value, but there are always some customers who reduce profitability. We want to identify these customers' behavior patterns. This requires segmenting customers into different groups and acting accordingly.

First, we need to select a time window. It can be anything like 3, 6, 9, 12 or 24 months. We can use the equation below to compute the lifetime value for every customer in that specific time window:

        Lifetime value = Total Gross Revenues - Total Cost
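To make the formula concrete, here is a minimal sketch on made-up data. The `Cost` column and all the numbers are assumptions for illustration only; the code in this notebook works with revenue alone.

```python
import pandas as pd

# Hypothetical transactions inside one time window (all values are invented)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'Revenue':    [100, 50, 20, 30, 10],
    'Cost':       [30, 20, 40, 25, 5],
})

# Sum revenue and cost per customer, then apply
# Lifetime value = Total Gross Revenues - Total Cost
totals = toy.groupby('CustomerID')[['Revenue', 'Cost']].sum()
totals['LTV'] = totals['Revenue'] - totals['Cost']
print(totals)
```

In this toy data, customer 2 ends up with a negative lifetime value because the cost of serving them exceeds the revenue they bring in.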

The problem is that this value is historical: by the time we see that some customers have a very negative lifetime value, it could be too late to take action. At this point, we need to predict the future with machine learning.

These are some important steps:

1. Define an appropriate time frame for customer lifetime value
2. Determine the features we will use to forecast the future and create them
3. Compute lifetime value (LTV) for the purpose of training machine learning model
4. Build and execute the machine learning model
5. Check if the model is helpful

The right time frame depends on the industry, business model, strategy and more. For example, 1 year is a very short period for some industries, while for others it is a very long one. In our case, we will use a time frame of 6 months ahead.

We are going to calculate the RFM (recency, frequency, monetary value) score for each customer ID. To do this correctly, we need to split our dataset: we will take 3 months of data, compute RFM on it, and use the result to forecast the next 6 months.

In [None]:
#Import the libraries we need
from __future__ import division
from datetime import datetime, timedelta, date
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split

import xgboost as xgb

import plotly.offline as pyoff
import plotly.graph_objs as go

#initiate plotly
pyoff.init_notebook_mode()

In [None]:
#read data from csv and redo the data preparation we did before
df = pd.read_csv('X_Southeast_Asia_retail.csv')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df_Vietnam = df.query("Country=='Vietnam'").reset_index(drop=True)

We will create the 3-month and 6-month dataframes.

In [None]:
#create 3m and 6m dataframes
#compare against datetime objects, since InvoiceDate is a datetime64 column
df_3months = df_Vietnam[(df_Vietnam.InvoiceDate < datetime(2018,6,1)) & (df_Vietnam.InvoiceDate >= datetime(2018,3,1))].reset_index(drop=True)
df_6months = df_Vietnam[(df_Vietnam.InvoiceDate >= datetime(2018,6,1)) & (df_Vietnam.InvoiceDate < datetime(2018,12,1))].reset_index(drop=True)

In [None]:
#create df_user to hold the per-customer features and cluster labels
df_user = pd.DataFrame(df_3months['CustomerID'].unique())
df_user.columns = ['CustomerID']

We are going to create a function for ordering cluster numbers, so that a higher cluster number always means a better customer:

In [None]:
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final
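To see what order_cluster does, here is a small sketch on invented data. The function body is copied from the cell above so that the example runs on its own; the toy values and the name `toy` are assumptions for illustration. KMeans assigns cluster numbers arbitrarily, so the function renumbers them by the mean of the target column.

```python
import pandas as pd

def order_cluster(cluster_field_name, target_field_name, df, ascending):
    # Same logic as the notebook cell above, repeated here for a standalone run
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final

# Invented recency values with arbitrary cluster labels, as KMeans might return them
toy = pd.DataFrame({
    'Recency':         [5, 7, 50, 60, 20, 22],
    'recency_cluster': [0, 0, 2, 2, 1, 1],
})

# ascending=False: the cluster with the highest mean recency gets label 0,
# so the most recent buyers end up with the highest label
ordered = order_cluster('recency_cluster', 'Recency', toy, False)
print(ordered.sort_values('Recency'))
```

After the call, the customers who bought 5 and 7 days ago carry the highest cluster number, which is why recency is ordered with ascending=False while frequency and revenue use ascending=True.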

We will use KMeans clustering to compute Recency score:

In [None]:
#calculate recency score
df_max_purchase = df_3months.groupby('CustomerID').InvoiceDate.max().reset_index()
df_max_purchase.columns = ['CustomerID','max_purchase_date']
#recency = days between a customer's last purchase and the latest purchase in the window
df_max_purchase['Recency'] = (df_max_purchase['max_purchase_date'].max() - df_max_purchase['max_purchase_date']).dt.days
df_user = pd.merge(df_user, df_max_purchase[['CustomerID','Recency']], on='CustomerID')
#Initiate the KMeans method
kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['Recency']])
#Predict clusters
df_user['recency_cluster'] = kmeans.predict(df_user[['Recency']])
#lower recency is better, so order the clusters in descending order of recency
df_user = order_cluster('recency_cluster', 'Recency',df_user,False)

Likewise, we can use the same method to calculate the frequency score:

In [None]:
#calculate frequency score
df_frequency = df_3months.groupby('CustomerID').InvoiceDate.count().reset_index()
df_frequency.columns = ['CustomerID','Frequency']
df_user = pd.merge(df_user, df_frequency, on='CustomerID')

kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['Frequency']])
df_user['frequency_cluster'] = kmeans.predict(df_user[['Frequency']])

#higher frequency is better, so order the clusters in ascending order
df_user = order_cluster('frequency_cluster', 'Frequency',df_user,True)

Calculate the revenue score:

In [None]:
#calculate revenue score
df_3months['Revenue'] = df_3months['UnitPrice'] * df_3months['Quantity']
df_revenue = df_3months.groupby('CustomerID').Revenue.sum().reset_index()
df_user = pd.merge(df_user, df_revenue, on='CustomerID')

kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['Revenue']])
df_user['revenue_cluster'] = kmeans.predict(df_user[['Revenue']])
df_user = order_cluster('revenue_cluster', 'Revenue',df_user,True)

After calculating the three scores (recency, frequency and revenue), we are going to create an overall score out of them by adding them up.

We can name these scores as follows: 0-2: Low Value, 3-4: Mid Value, 5+: High Value

In [None]:
#overall scoring
df_user['overall_score'] = df_user['recency_cluster'] + df_user['frequency_cluster'] + df_user['revenue_cluster']
df_user['Segment'] = 'Low-Value'
df_user.loc[df_user['overall_score']>2,'Segment'] = 'Mid-Value'
df_user.loc[df_user['overall_score']>4,'Segment'] = 'High-Value'

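Now that our features are ready, we will compute the 6-month lifetime value for each customer ID, which we will use to train the model. Since this dataset carries no cost information, revenue is used directly as the lifetime value.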

In [None]:
df_6months['Revenue'] = df_6months['UnitPrice'] * df_6months['Quantity']
df_user_6m = df_6months.groupby('CustomerID')['Revenue'].sum().reset_index()
df_user_6m.columns = ['CustomerID','m6_Revenue']


#we can plot LTV histogram
plot_data = [go.Histogram(x=df_user_6m.query('m6_Revenue < 10000')['m6_Revenue'])]

plot_layout = go.Layout(title='6m Revenue')
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

If we run the code above, we will see that some customers have a negative lifetime value, and that there are some outliers too. Removing the outliers from this dataset makes sense if we want a sound machine learning model.

As the next step, we will merge our 3-month and 6-month dataframes to view the correlation between LTV and our features.

In [None]:
df_merge = pd.merge(df_user, df_user_6m, on='CustomerID', how='left')
df_merge = df_merge.fillna(0)

df_graph = df_merge.query("m6_Revenue < 30000")

#one scatter trace per segment: RFM score on the x axis, 6-month revenue on the y axis
plot_data = [
    go.Scatter(
        x=df_graph.query("Segment == 'Low-Value'")['overall_score'],
        y=df_graph.query("Segment == 'Low-Value'")['m6_Revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
    go.Scatter(
        x=df_graph.query("Segment == 'Mid-Value'")['overall_score'],
        y=df_graph.query("Segment == 'Mid-Value'")['m6_Revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
    go.Scatter(
        x=df_graph.query("Segment == 'High-Value'")['overall_score'],
        y=df_graph.query("Segment == 'High-Value'")['m6_Revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "6m LTV"},
        xaxis= {'title': "RFM Score"},
        title='LTV'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

If we execute the code above, we would observe a clear positive correlation: a high RFM score means a high LTV.

Before building the machine learning model, we need to determine what type of machine learning problem this is. LTV itself is a regression problem: a machine learning model can predict the value of the LTV. But here we want LTV segments, because they are more actionable and easier to communicate to non-technical users. By applying K-means clustering, we can explore our existing LTV groups and build segments on top of them.

The idea is to treat customers differently based on their predicted LTV. For instance, we will build 3 segments and then apply the result to our dataframe.

We will apply KMeans to create the segments and view their characteristics.

In [None]:
#We filter out outliers (copy so that new columns can be added safely)
df_merge = df_merge[df_merge['m6_Revenue']<df_merge['m6_Revenue'].quantile(0.99)].copy()

#creating 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(df_merge[['m6_Revenue']])
df_merge['ltv_cluster'] = kmeans.predict(df_merge[['m6_Revenue']])

#order cluster number based on LTV
df_merge = order_cluster('ltv_cluster', 'm6_Revenue',df_merge,True)

#creating a new cluster dataframe
df_cluster = df_merge.copy()

#we can see the descriptive statistics of each cluster
df_cluster.groupby('ltv_cluster')['m6_Revenue'].describe()

There are a few steps before training the machine learning model:

1. Convert categorical variables into numerical variables
2. Check the correlation of features versus our label, LTV clusters
3. Split the dataset into a train and a test set
4. Run the machine learning model to see its real performance.

In [None]:
#convert categorical variables to dummy variables
df_class = pd.get_dummies(df_cluster, drop_first=True)
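As a quick illustration of what this conversion does to a text column such as Segment, here is a sketch on a tiny invented dataframe (the name `toy` and its values are assumptions). With drop_first=True, pandas drops the first category in sorted order, so that category is represented by all remaining dummy columns being zero.

```python
import pandas as pd

# Invented rows: one numeric column and one categorical column
toy = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Segment': ['Low-Value', 'Mid-Value', 'High-Value', 'Low-Value'],
})

# Numeric columns pass through unchanged; Segment becomes indicator columns.
# 'High-Value' sorts first alphabetically, so its column is the one dropped.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

The High-Value customer (row 2) has both remaining indicator columns unset, so no information is lost by dropping the first column.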

In [None]:
#We can calculate the correlations and print them out
corr_matrix = df_class.corr()
print(corr_matrix['ltv_cluster'].sort_values(ascending=False))

#Create X and y
X = df_class.drop(['ltv_cluster','m6_Revenue'], axis=1)
y = df_class['ltv_cluster']

#Split train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56, stratify=y)

If we execute this code, the correlation output would show that Revenue, Frequency and the RFM score will be useful in our machine learning model.

We can use XGBoost to classify our customers. It is a multi-class classification model because we have 3 groups in total.

In [None]:
LTV_XGBOOST_MODEL = xgb.XGBClassifier(max_depth=5, learning_rate=0.1,
                                     objective='multi:softprob', n_jobs=-1).fit(X_train, y_train)

#We will print the accuracy of XGBoost
print(f'Accuracy of XGBoost Model on training set = {round(LTV_XGBOOST_MODEL.score(X_train, y_train),2)}')
print(f'Accuracy of XGBoost Model on testing set = {round(LTV_XGBOOST_MODEL.score(X_test, y_test),2)}')

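If we run this code, we can compare the model's accuracy with a simple baseline. The biggest cluster we have is cluster 0, which is 76.5% of the total, so a model that always predicts cluster 0 would already be 76.5% accurate.
Comparing 84% with 76.5% tells us that our machine learning model is useful, but it needs some improvement for sure.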
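The baseline comparison can be sketched as follows. The labels below are invented for illustration, so the numbers differ from the 76.5% quoted above; in the notebook the same call would be applied to the real label column.

```python
import pandas as pd

# Invented cluster labels: 7 customers in cluster 0, 2 in cluster 1, 1 in cluster 2
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 2])

# Share of each cluster; the largest share is the accuracy of a model
# that always predicts the most common cluster
shares = toy_labels.value_counts(normalize=True)
baseline_accuracy = shares.max()
print(shares)
print(f'Majority-class baseline accuracy = {baseline_accuracy:.2f}')
```

A classifier is only adding value to the extent that its test accuracy exceeds this baseline.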

We can see where the model falls short by looking at the classification report:

In [None]:
y_pred = LTV_XGBOOST_MODEL.predict(X_test)
print('Classification Report:')
print(classification_report(y_test, y_pred))

Precision and recall are acceptable for cluster 0. For example, for cluster 0 (Low LTV), if the model says that a customer belongs to cluster 0, 90 out of 100 such predictions will be correct (precision). And the model successfully identifies 94% of actual cluster 0 customers (recall). We really need to improve the model for the other clusters. For example, we detect barely 60% of Mid LTV customers.
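To show how these two numbers come out of the predictions, here is a small sketch that derives precision and recall for cluster 0 from a confusion matrix. The true and predicted labels are invented, so the values do not match the report above.

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted cluster labels for 10 customers
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 0, 1, 0, 0, 1, 2, 1]

# Rows are actual clusters, columns are predicted clusters
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Precision for cluster 0: of everything predicted as 0, how much really is 0
precision_0 = cm[0, 0] / cm[:, 0].sum()
# Recall for cluster 0: of everything that really is 0, how much was found
recall_0 = cm[0, 0] / cm[0, :].sum()

print(cm)
print(f'precision = {precision_0:.2f}, recall = {recall_0:.2f}')
```

Here 4 of the 6 customers predicted as cluster 0 are correct, and 4 of the 5 actual cluster 0 customers are found.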

In [None]:
#collect the predicted LTV cluster for each customer in the test set
y_pred_LTV = pd.DataFrame({'CustomerID':X_test['CustomerID'],'LTV_Prediction':y_pred})

In [None]:
y_pred_LTV.to_csv('y_pred_LTV.csv')

In [None]:
#prediction_LTV = pd.merge([y_pred_LTV['y_pred'], y_test])