# Objectives

Build a probabilistic model to predict customer lifetime value for a non-contractual business:

* Identify the customers who are most likely to buy again
* Estimate customer lifetime value
* Calculate the expected average profit per customer

# Data Source
Brazilian E-Commerce Public Dataset by Olist 
<br /> 100,000 Orders with product, customer and reviews info
<br /> https://www.kaggle.com/olistbr/brazilian-ecommerce?select=olist_order_items_dataset.csv

# Frequency & Recency Analysis using BG/NBD Model

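#### The Five Assumptions of the BG/NBD Model:
1. While active, the number of transactions made by a customer follows a Poisson process with transaction rate $\lambda$. This is equivalent to assuming that the time between transactions is exponentially distributed with transaction rate $\lambda$: $f(t_j \mid t_{j-1}; \lambda) = \lambda e^{-\lambda (t_j - t_{j-1})}$, $t_j > t_{j-1} \geq 0$.
2. Heterogeneity in $\lambda$ follows a gamma distribution with pdf $f(\lambda \mid r, \alpha) = \frac{\alpha^{r} \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)}$, $\lambda > 0$.
3. After any transaction, a customer becomes inactive with probability $p$. Therefore, the point at which the customer "drops out" is distributed across transactions according to a (shifted) geometric distribution with pmf $P(\text{inactive immediately after the } j\text{th transaction}) = p(1-p)^{j-1}$, $j = 1, 2, 3, \ldots$
4. Heterogeneity in $p$ follows a beta distribution with pdf $f(p \mid a, b) = \frac{p^{a-1}(1-p)^{b-1}}{B(a, b)}$, $0 \leq p \leq 1$, where $B(a, b)$ is the beta function.
5. The transaction rate $\lambda$ and the dropout probability $p$ vary independently across customers.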
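
To make the assumptions concrete, here is a minimal simulation sketch of the purchase process for individual customers, using only the standard library. The function name `simulate_bgnbd_customer` and the parameter values (`r`, `alpha` for the gamma distribution of the transaction rate, `a`, `b` for the beta distribution of the dropout probability) are illustrative assumptions, not values fitted to the Olist data.

```python
import random


def simulate_bgnbd_customer(r, alpha, a, b, T, rng):
    """Simulate one customer's repeat purchases over an observation window of length T.

    Time 0 is the customer's first purchase; only repeat purchases are generated.
    Returns (frequency, recency, T).
    """
    # Assumption 2: transaction rate ~ gamma(shape=r, rate=alpha).
    # random.gammavariate expects a scale parameter, hence 1 / alpha.
    lam = rng.gammavariate(r, 1.0 / alpha)
    # Assumption 4: dropout probability ~ beta(a, b).
    p = rng.betavariate(a, b)

    purchase_times = []
    t = 0.0
    while True:
        # Assumption 1: exponential waiting time between transactions.
        t += rng.expovariate(lam)
        if t > T:
            break  # the next purchase falls outside the observation window
        purchase_times.append(t)
        # Assumption 3: after each transaction the customer may drop out.
        if rng.random() < p:
            break

    frequency = len(purchase_times)
    recency = purchase_times[-1] if purchase_times else 0.0
    return frequency, recency, T


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(42)
sample = [simulate_bgnbd_customer(0.5, 4.0, 0.8, 2.5, 52.0, rng) for _ in range(5)]
for frequency, recency, T in sample:
    print(frequency, round(recency, 2), T)
```

Each simulated customer is summarized by the same three quantities the model is later fitted on: frequency, recency, and T.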

#### The following nomenclature is used:

**Frequency** represents the number of repeat purchases the customer has made, so it is one less than the total number of purchases. More precisely, it is based on the time periods in which the customer made a purchase: with days as the unit, it is the number of distinct days on which the customer purchased, minus one.

**T** represents the age of the customer in whatever time units are chosen (days, in this notebook). This is equal to the duration between a customer’s first purchase and the end of the period under study.

**Recency** represents the age of the customer when they made their most recent purchase. This is equal to the duration between a customer’s first purchase and their latest purchase. (Thus if they have made only 1 purchase, the recency is 0.)

**Monetary** represents the average value of a given customer’s purchases. This is equal to the sum of all a customer’s purchases divided by the total number of purchases. Note that the denominator here differs from the frequency described above: it includes the first purchase.
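
As a small worked example of these definitions, the sketch below derives the four quantities from a made-up transaction log. The column names, the customers "A" and "B", and the end date `period_end` are assumptions chosen for illustration and are not taken from the Olist data.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({
    "customer": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-11", "2018-01-05"]),
    "revenue": [10.0, 20.0, 60.0, 40.0],
})
period_end = pd.Timestamp("2018-01-31")  # assumed end of the study period

# Purchases made on the same day count as a single purchase day.
daily = tx.groupby(["customer", "date"], as_index=False)["revenue"].sum()

summary = daily.groupby("customer").agg(
    first=("date", "min"),
    last=("date", "max"),
    purchase_days=("date", "count"),
    total=("revenue", "sum"),
)
summary["frequency"] = summary["purchase_days"] - 1                     # repeat purchase days
summary["recency"] = (summary["last"] - summary["first"]).dt.days       # age at latest purchase
summary["T"] = (period_end - summary["first"]).dt.days                  # age at end of study
summary["avg_monetary"] = summary["total"] / summary["purchase_days"]   # denominator includes first purchase
summary = summary[["frequency", "recency", "T", "avg_monetary"]]
print(summary)
```

Customer A bought on two distinct days, so the frequency is 1, the recency is 10 days, T is 30 days, and the average value is 90 / 2 = 45. Customer B bought once, so both frequency and recency are 0.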



# Import Packages and Data Processing 


In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime as dt
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter("ignore")

In [75]:
customers_raw = pd.read_csv("../input/brazilian-ecommerce/olist_customers_dataset.csv")
orders_raw = pd.read_csv("../input/brazilian-ecommerce/olist_orders_dataset.csv", parse_dates = ['order_purchase_timestamp'])
orderItems_raw = pd.read_csv("../input/brazilian-ecommerce/olist_order_items_dataset.csv")

In [76]:
# Check the structure of the data
customers_raw.info()
# Inspect the data
customers_raw.describe(include='all')

In [77]:
customers = customers_raw[['customer_id', 'customer_unique_id']]
customers.head()

In [78]:
# Check the structure of the data
orders_raw.info()
# Inspect the data
orders_raw.describe(include='all')

Filter orders by `order_status`. Here we exclude only orders whose status is 'canceled'. In a real-life situation we would also have to consider fraud, but since this dataset gives little information about the other statuses, we keep all remaining orders.

We use `order_purchase_timestamp` as the purchase date.

In [79]:
orders_raw['order_status'].unique()

In [80]:
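# Take an explicit copy so that adding a column does not write to a view of the original frame
orders_raw = orders_raw[orders_raw.order_status != 'canceled'].copy()
orders_raw['date'] = pd.to_datetime(orders_raw['order_purchase_timestamp']).dt.date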

In [81]:
orders = orders_raw[['order_id', 'customer_id','date']]
orders.head()

In [82]:
orderItems_raw.info()
orderItems_raw.head()

In [83]:
orderItems_raw['order_id'].nunique()

In [84]:
orderTrans = orderItems_raw.groupby(['order_id']).agg({'price':'sum'}).reset_index()
orderTrans.rename(columns={'price':'revenue'}, inplace = True)
orderTrans

Combine the Orders dataset with the OrderItems dataset to get the revenue of each order.

In [85]:
orderDetails = pd.merge(orders, orderTrans, on = 'order_id', how = 'left')
orderDetails

In [86]:
orderCusDetails_raw = pd.merge(customers, orderDetails, on = 'customer_id', how = 'left')
orderCusDetails_raw

In [87]:
# Testing
orderCusDetails_raw[orderCusDetails_raw['customer_unique_id'] == '12f5d6e1cbf93dafd9dcc19095df0b3d']

In [88]:
orderCusDetails = orderCusDetails_raw.groupby(['customer_unique_id','date']).agg({'revenue':'sum'}).reset_index()
orderCusDetails

In [89]:
# Testing
orderCusDetails[orderCusDetails['customer_unique_id'] == '12f5d6e1cbf93dafd9dcc19095df0b3d']

In [90]:
orderCusDetails['customer_unique_id'].nunique()

In [91]:
df = orderCusDetails.copy()

In [92]:
print('Min : {}, Max : {}'.format(min(df['date']), max(df['date'])))
lastDate = max(df['date']) + dt.timedelta(1)
print(lastDate)

In [93]:
df1 = df.groupby('customer_unique_id').agg({'date': lambda x: (lastDate - x.min()).days}).reset_index()
df1.rename(columns = {'date':'T'}, inplace = True)
df1

In [94]:
df2 = df.groupby('customer_unique_id').agg({'date': lambda x:(x.max() - x.min()).days}).reset_index()
df2.rename(columns = {'date':'recency'}, inplace = True)
df2

In [95]:
# Testing
# df2[df2['customer_unique_id'] == '12f5d6e1cbf93dafd9dcc19095df0b3d']
df2[df2['recency'] != 0]

In [96]:
df3 = df.groupby(['customer_unique_id']).agg({'date': 'count',                                             
                                             'revenue':'sum'}).reset_index()
df3.rename(columns = {'date':'frequency','revenue':'total_monetary'}, inplace = True)
df3

In [97]:
df3['avg_monetary'] = df3['total_monetary'] / df3['frequency']
df3['frequency'] = df3['frequency'] - 1
df3 = df3.drop(columns=['total_monetary'])
df3

In [98]:
# Testing
# df3[df3['frequency'] != 0]
df3[df3['customer_unique_id'] == '02168ea18740a0fdaaa15f11bebba5db']

In [99]:
df_combined1 = pd.merge(df1, df2, on = 'customer_unique_id', how = 'outer')
df_combined2 = pd.merge(df_combined1, df3, on = 'customer_unique_id', how = 'outer')
df_combined2

In [100]:
# Testing
orderCusDetails[orderCusDetails['customer_unique_id'] == 'ff922bdd6bafcdf99cb90d7f39cea5b3']

In [101]:
# Testing
df_combined2[df_combined2['recency'] != 0]

# Basic Frequency/Recency Analysis

In [102]:
%pip install lifetimes

In [103]:
import lifetimes

In [104]:
from lifetimes import BetaGeoFitter

bgf = BetaGeoFitter(penalizer_coef = 0.001)
bgf.fit(df_combined2['frequency'], df_combined2['recency'], df_combined2['T'])
print(bgf)
bgf.summary

### Visualizing Frequency/Recency Matrix

The frequency/recency matrix shows the expected number of transactions an artificial customer will make in the next time period, given their recency and frequency.

#### Expected Number of Future Purchases for 1 Unit of Time

In [105]:
from lifetimes.plotting import plot_frequency_recency_matrix

plot_frequency_recency_matrix(bgf)

We can see that if a customer has bought 15 times from us, and their latest purchase was when they were 700 days old, then they are our best customer (bottom-right).

The area around (4, 350) represents customers who buy infrequently but whom we have seen recently. They might have become inactive, or they might just be between purchases.

#### Probability of Still Being Alive

In [106]:
from lifetimes.plotting import plot_probability_alive_matrix

plot_probability_alive_matrix(bgf)
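
For intuition about what this matrix displays, the sketch below evaluates the closed-form BG/NBD expression for the probability that a customer is still alive, given frequency `x`, recency `t_x`, and age `T`. The function name `probability_alive` and the parameter values are assumptions for illustration; in the notebook, the fitted values of `r`, `alpha`, `a`, and `b` would come from `bgf`.

```python
def probability_alive(x, t_x, T, r, alpha, a, b):
    """P(alive | x, t_x, T) under the BG/NBD model.

    A customer with no repeat purchases has had no opportunity to drop out,
    so the probability is 1 by construction.
    """
    if x == 0:
        return 1.0
    ratio = (alpha + T) / (alpha + t_x)
    return 1.0 / (1.0 + (a / (b + x - 1)) * ratio ** (r + x))


# Hypothetical parameters, not the fitted ones.
params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)

# Same purchase history (5 repeat purchases, last one at day 50),
# observed at increasing customer ages: the longer the silence,
# the lower the probability that the customer is still active.
for age in (50, 60, 100, 200):
    print(age, round(probability_alive(5, 50, age, **params), 4))
```

The longer a previously frequent customer stays silent after their last purchase, the faster this probability falls, which is why frequent buyers with an old last purchase appear as unlikely to be alive in the matrix.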

#### Customers Most Likely to Buy Again

In [107]:
t = 1
df_combined2['predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, df_combined2['frequency'], df_combined2['recency'], df_combined2['T'])
df_combined2.sort_values(by='predicted_purchases', ascending=False).head()

These are the customers who are probably going to buy again in the next period. 

#### Predicting Customer's Future Behavior

In [108]:
t = 100  # predict purchases over the next 100 days
individual = df_combined2.iloc[66666]
# The function below is an alias for `bgf.conditional_expected_number_of_purchases_up_to_time`
bgf.predict(t, individual['frequency'], individual['recency'], individual['T'])

# Estimating Customer Lifetime Value Using the Gamma-Gamma Model

The Gamma-Gamma model assumes that there is no relationship between the monetary value and the purchase frequency.
In practice, we need to check that the Pearson correlation between the two vectors is close to 0 before using this model.

In [109]:
returning_customers_summary = df_combined2[(df_combined2['frequency'] > 0) & (df_combined2['avg_monetary'] > 0)]
returning_customers_summary.head()

In [110]:
returning_customers_summary[['avg_monetary', 'frequency']].corr()

Since the correlation between monetary value and frequency is not strong, we can use the Gamma-Gamma model to predict the conditional expected average profit per customer.

In [111]:
from lifetimes import GammaGammaFitter

ggf = GammaGammaFitter(penalizer_coef = 0)
ggf.fit(returning_customers_summary['frequency'],
        returning_customers_summary['avg_monetary'])
print(ggf)

In [112]:
ggf.conditional_expected_average_profit(
         returning_customers_summary['frequency'],
         returning_customers_summary['avg_monetary']).head()
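
The conditional expectation computed above can be read as a weighted average of the customer's own observed mean and the population mean, with more weight on the customer's own data as the frequency grows. The sketch below writes out the closed form of the Gamma-Gamma model; the function name and the parameter values `p`, `q`, `v` are assumptions for illustration (the fitted ones would come from `ggf`), and the expression requires `q > 1`.

```python
def expected_average_profit(x, m_x, p, q, v):
    """E[average transaction value | frequency x, observed mean m_x] under Gamma-Gamma.

    Requires q > 1 so that the population mean is finite.
    """
    weight = p * x / (p * x + q - 1.0)   # weight on the customer's own mean
    population_mean = v * p / (q - 1.0)  # mean transaction value across customers
    return (1.0 - weight) * population_mean + weight * m_x


# Hypothetical parameters: the population mean is 15 * 6 / (4 - 1) = 30.
p, q, v = 6.0, 4.0, 15.0

# A customer whose observed average is 100: the estimate moves from the
# population mean toward 100 as more repeat purchases are observed.
for x in (0, 1, 5, 20):
    print(x, round(expected_average_profit(x, 100.0, p, q, v), 2))
```

This shrinkage is why a customer with a single large repeat purchase is not simply predicted to keep spending at that level.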

In [113]:
print("Expected conditional average profit: %s, Average profit: %s" % (
    ggf.conditional_expected_average_profit(
        df_combined2['frequency'],
        df_combined2['avg_monetary']
    ).mean(),
    df_combined2[df_combined2['frequency']>0]['avg_monetary'].mean()
))
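
To go from these two models to a customer lifetime value, the expected purchases from the BG/NBD model and the expected average profit from the Gamma-Gamma model can be combined in a discounted cash flow calculation. The sketch below shows the arithmetic with made-up inputs; the function name `discounted_clv` and all numbers are assumptions for illustration. The `lifetimes` package offers `GammaGammaFitter.customer_lifetime_value` for this purpose; its exact arguments should be checked against the documentation in the references.

```python
def discounted_clv(cumulative_purchases, avg_profit, monthly_discount_rate):
    """Discounted value of expected future purchases.

    cumulative_purchases[i] is the expected number of purchases up to the end
    of month i + 1 (for example, from the BG/NBD model).
    avg_profit is the expected average profit per purchase
    (for example, from the Gamma-Gamma model).
    """
    clv = 0.0
    previous = 0.0
    for month, cumulative in enumerate(cumulative_purchases, start=1):
        purchases_this_month = cumulative - previous
        clv += avg_profit * purchases_this_month / (1.0 + monthly_discount_rate) ** month
        previous = cumulative
    return clv


# Hypothetical customer: cumulative expected purchases over three months,
# an expected profit of 50 per purchase, and a 1% monthly discount rate.
print(discounted_clv([0.5, 0.9, 1.2], 50.0, 0.01))
```

With a discount rate of 0 the result reduces to expected purchases times average profit; a positive rate gives less weight to purchases expected further in the future.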


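# References:
1. lifetimes documentation, Quickstart: "Estimating customer lifetime value using the Gamma-Gamma model". https://lifetimes.readthedocs.io/en/latest/Quickstart.html#estimating-customer-lifetime-value-using-the-gamma-gamma-model
2. Fader, P. S., Hardie, B. G. S., and Lee, K. L. (2005). "Counting Your Customers" the Easy Way: An Alternative to the Pareto/NBD Model. Marketing Science, 24(2), 275-283. http://brucehardie.com/papers/018/fader_et_al_mksc_05.pdf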

