**PLEASE MAKE A COPY BEFORE CHANGING**

**Copyright** 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


<b>Important</b>
This content is intended for educational and informational purposes only.

## Introduction

**Objective**: The goal of this colab is to calculate the lifetime value (LTV) of your customer base. The calculation uses the BG/NBD model as described in this [paper](http://mktg.uni-svishtov.bg/ivm/resources/Counting_Your_Customers.pdf).



# Code Section

In [None]:
!pip install -q lifetimes
!pip install -q --upgrade git+https://github.com/HIPS/autograd.git@master 
!pip install -U -q PyDrive

In [None]:
from google.colab import auth
from googleapiclient.discovery import build
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials

from lifetimes import BetaGeoFitter
from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix
from lifetimes.plotting import plot_period_transactions
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
from lifetimes.plotting import plot_history_alive
from lifetimes import GammaGammaFitter

import datetime
import pandas as pd
import matplotlib.pyplot as plt

# Allow plots to be displayed inline.
%matplotlib inline

# Authenticate the user
auth.authenticate_user()

### Input your settings to run the model

*   **project_id**: The ID of the Google Cloud project where the query will run.
*   **table_name**: Name of the BigQuery table that holds the Google Analytics transactions.
*   **time_unit**: The granularity used to group transactions (weeks is usually the best choice).
*   **units_to_predict**: Number of periods to predict (with weeks as the time_unit, 52 predicts a year ahead).
*   **start_date** / **end_date**: First and last day of the data to include in the query.
*   **number_of_segments**: Number of segments to group the users into.
*   **id_type**: Which ID identifies a user, client_id or user_id. client_id is based on the GA cookie (it is not cross-device and can change over time). user_id can provide better accuracy.
*   **user_id_dimension_index**: The custom dimension index that holds the user ID value. If the id_type is user_id, this field is mandatory.
*   **data_import_key_index**: The custom dimension index that holds the user ID. If you are not sure, don't worry, it can be changed later.
*   **data_import_value_index**: The custom dimension index that holds the LTV segment. If you are not sure, don't worry, it can be changed later.




# New Section

In [None]:
project_id = 'my-project'#@param
table_name = 'bigquery-public-data.google_analytics_sample.ga_sessions_*'#@param
time_unit = 'weeks'#@param ['days', 'weeks', 'months']
units_to_predict = 52#@param
start_date = '2018-01-01'#@param{type:"date"}
end_date = '2019-01-01'#@param{type:"date"}
number_of_segments = 5#@param
id_type = 'client_id'#@param ['client_id', 'user_id']
user_id_dimension_index = 0#@param
data_import_key_index = 11#@param
data_import_value_index = 12#@param

dc = {}
dc['start_date'] = start_date.replace('-', '')
dc['end_date'] = end_date.replace('-', '')
dc['table_name'] = table_name
dc['id_type'] = id_type
dc['user_id_dimension_index'] = user_id_dimension_index

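# Number of days per time unit; the queries divide day counts by this value.
if time_unit == 'days':
    dc['time_unit'] = 1
elif time_unit == 'weeks' or time_unit == '':
    dc['time_unit'] = 7
else:
    dc['time_unit'] = 30  # months, approximated as 30 days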


### Get data from BigQuery

Query the Google Analytics 360 export in BigQuery to create the RFM matrix, which contains the following columns.


*   **user_id**: id of the user
*   **total_transactions**: Number of transactions during the period.
*   **average_order_value**: Sum of the transaction values divided by total_transactions.
*   **frequency**: Represents the number of repeat purchases the customer has made.
*   **recency**: Represents the age of the customer when they made their most recent purchase. This is equal to the duration between a customer’s first purchase and their latest purchase. (Thus if they have made only 1 purchase, the recency is 0.)
*   **T**: Represents the age of the customer in the chosen time unit (time_unit). This is equal to the duration between a customer’s first purchase and the end of the period under study.
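
To make the column definitions concrete, here is a minimal sketch that derives frequency, recency and T from a tiny purchase log using only the standard library. The purchase log, the `rfm` helper and `DAYS_PER_UNIT` are hypothetical names for illustration; the arithmetic mirrors the definitions above (one purchase per user per day, and one day added to T), not necessarily every detail of the BigQuery query.

```python
from datetime import date

# Hypothetical purchase log: (user_id, purchase_date), one row per purchase day.
purchases = [
    ("a", date(2018, 1, 1)),
    ("a", date(2018, 1, 15)),
    ("a", date(2018, 2, 12)),
    ("b", date(2018, 1, 8)),
]
DAYS_PER_UNIT = 7  # assumption: weekly time unit


def rfm(purchases, days_per_unit):
    """Returns {user_id: {frequency, recency, T}} in the chosen time unit."""
    # The end of the study period is the latest purchase date in the log.
    period_end = max(day for _, day in purchases)
    days_by_user = {}
    for user_id, day in purchases:
        days_by_user.setdefault(user_id, set()).add(day)
    rows = {}
    for user_id, days in days_by_user.items():
        first, last = min(days), max(days)
        rows[user_id] = {
            # Repeat purchases: every purchase day after the first one.
            "frequency": len(days) - 1,
            # Time between the first and the latest purchase.
            "recency": round((last - first).days / days_per_unit, 1),
            # Time between the first purchase and the end of the period.
            "T": round(((period_end - first).days + 1) / days_per_unit, 1),
        }
    return rows


table = rfm(purchases, DAYS_PER_UNIT)
print(table)
```

User `b` bought only once, so both frequency and recency are 0 while T still grows with the length of the observation period.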





In [None]:
q1 = """

WITH transactions as (
SELECT
  clientid AS user_id,
  PARSE_DATE('%Y%m%d', date) AS transaction_date,
  SUM(totals.totalTransactionRevenue) / 1000000 AS transaction_value
FROM 
  `{table_name}`
  WHERE
    _TABLE_SUFFIX BETWEEN '{start_date}'
    AND '{end_date}' AND totals.transactions > 0
    AND totals.totalTransactionRevenue > 0
GROUP BY
  1, 2)
  
SELECT
  user_id,
  COUNT(1) AS total_transactions,
  ROUND(SUM(transaction_value)/COUNT(1),1) AS average_order_value,
  (COUNT(1)-1) AS frequency,  
  ROUND (DATE_DIFF(MAX(transaction_date),
            MIN(transaction_date),
            DAY) / {time_unit} ,1) # divide by the number of days per time unit
    AS recency,

  ROUND((DATE_DIFF((SELECT MAX(transaction_date) FROM transactions),
             MIN(transaction_date),
             DAY)+1) / {time_unit} ,1) # divide by the number of days per time unit
    AS T
FROM
  transactions
GROUP BY
  1
  
""".format(**dc)


q2 = """

WITH transactions as(
SELECT
  cds.value as user_id,
  PARSE_DATE('%Y%m%d', date) AS transaction_date,
  SUM(totals.totalTransactionRevenue) / 1000000 AS transaction_value
FROM
  `{table_name}`,
  UNNEST(customdimensions) AS cds
WHERE
  _TABLE_SUFFIX BETWEEN '{start_date}'
  AND '{end_date}'
  AND cds.index = {user_id_dimension_index}
  AND totals.transactions > 0
GROUP BY
  1,2)
  
SELECT
  user_id,
  COUNT(1) AS total_transactions,
  ROUND(SUM(transaction_value)/COUNT(1),1) AS average_order_value,
  (COUNT(1)-1) AS frequency,  
  ROUND (DATE_DIFF(MAX(transaction_date),
            MIN(transaction_date),
            DAY) / {time_unit} ,1) # divide by the number of days per time unit
    AS recency,

  ROUND((DATE_DIFF((SELECT MAX(transaction_date) FROM transactions),
             MIN(transaction_date),
             DAY)+1) / {time_unit} ,1) # divide by the number of days per time unit
    AS T
FROM
  transactions
GROUP BY
  1
""".format(**dc)

if id_type == 'user_id' and user_id_dimension_index != 0:
  q = q2
else:
  q = q1


In [None]:
# The `verbose` argument was removed from read_gbq in recent pandas releases.
df = pd.read_gbq(q, project_id=project_id, dialect='standard')

In [None]:
df.head()

### Train the model
Using the transformed dataset, we fit the BetaGeoFitter model. Once the model is trained, we can plot the frequency/recency matrix and the probability-alive matrix.

Once the BetaGeoFitter model is trained, we can train a Gamma-Gamma model to estimate the future average order value.
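
As a rough illustration of what the probability-alive output represents, the sketch below evaluates the closed-form BG/NBD expression for P(alive) given a customer's frequency, recency and T. The parameter values `r`, `alpha`, `a` and `b` are made up for illustration; in the notebook they would come from the fitted model, and the library call remains the reference implementation.

```python
def probability_alive(frequency, recency, T, r, alpha, a, b):
    """BG/NBD probability that a customer is still active.

    frequency, recency and T are in the same time unit used to fit the model.
    """
    # Under BG/NBD a customer can only drop out right after a purchase,
    # so a customer with no repeat purchases is still alive by construction.
    if frequency == 0:
        return 1.0
    ratio = (a / (b + frequency - 1)) * (
        (alpha + T) / (alpha + recency)
    ) ** (r + frequency)
    return 1.0 / (1.0 + ratio)


# Hypothetical parameters, NOT fitted values.
params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)

# Two customers with the same number of repeat purchases and the same age:
recent_buyer = probability_alive(frequency=4, recency=30, T=31, **params)
lapsed_buyer = probability_alive(frequency=4, recency=5, T=31, **params)
print(recent_buyer, lapsed_buyer)
```

The customer whose last purchase is close to the end of the period gets a much higher probability than the one who has been silent for most of it, which is the pattern the probability-alive matrix visualizes.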

In [None]:
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(df['frequency'], df['recency'], df['T'])

In [None]:
plt.figure(figsize=(10,10))
# The trailing semicolon suppresses the text output of the returned axes object.
plot_frequency_recency_matrix(bgf);

In [None]:
plt.figure(figsize=(10,10))
# The trailing semicolon suppresses the text output of the returned axes object.
plot_probability_alive_matrix(bgf);

In [None]:
ggf = GammaGammaFitter(penalizer_coef = 0)
ggf.fit(df['total_transactions'],
        df['average_order_value'])

In [None]:
df['prob_alive'] = bgf.conditional_probability_alive(df['frequency'], df['recency'], df['T'])
df['predicted_transactions'] = bgf.conditional_expected_number_of_purchases_up_to_time(units_to_predict, df['frequency'], df['recency'], df['T'])
df['predicted_aov'] = ggf.conditional_expected_average_profit(df['total_transactions'], df['average_order_value'])
df['predicted_ltv'] = df['predicted_transactions'] * df['predicted_aov']


### Results

In this section we display the following by user:



*   **prob_alive**: The probability of a customer being alive
*   **predicted_transactions**: The predicted number of transactions in the predicted period
*   **predicted_aov**: The predicted average order value in the predicted period
*   **predicted_ltv**: The total customer lifetime value for the predicted period. (predicted_ltv = predicted_aov * predicted_transactions)
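
The relationship between these columns can be sketched with the Gamma-Gamma conditional expectation, which blends a customer's observed average order value with the population mean. The parameters `p`, `q` and `v` and the predicted transaction count below are hypothetical values chosen for illustration, not fitted results.

```python
def expected_average_order_value(n_transactions, observed_aov, p, q, v):
    """Gamma-Gamma conditional expected average order value (requires q > 1)."""
    population_mean = v * p / (q - 1)
    # The more transactions we have observed, the more we trust the
    # customer's own average over the population mean.
    weight = p * n_transactions / (p * n_transactions + q - 1)
    return (1 - weight) * population_mean + weight * observed_aov


# Hypothetical parameters, NOT fitted values. Population mean = 15 * 6 / 3 = 30.
p, q, v = 6.0, 4.0, 15.0

aov_one_order = expected_average_order_value(1, 100.0, p, q, v)
aov_ten_orders = expected_average_order_value(10, 100.0, p, q, v)

# predicted_ltv = predicted_aov * predicted_transactions
predicted_transactions = 3.2  # hypothetical model output
predicted_ltv = predicted_transactions * aov_ten_orders
print(aov_one_order, aov_ten_orders, predicted_ltv)
```

With a single observed order the estimate is pulled noticeably toward the population mean; with ten orders it stays close to the observed average.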






We also group the customers into N segments (N = number_of_segments), ordered by predicted LTV.
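
The idea behind the segmentation can be sketched without pandas: rank the customers by predicted LTV and cut the ranking into near-equal groups. The `assign_segments` helper and the sample values are hypothetical, and this rank-based version is a simplified approximation of quantile bucketing, not a re-implementation of `pd.qcut` (which derives bin edges from quantile values and treats ties differently).

```python
def assign_segments(values, n_segments):
    """Ranks values and splits them into n_segments groups (1 = lowest)."""
    # Indices of the values, ordered from the lowest to the highest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    segments = [0] * len(values)
    for rank, idx in enumerate(order):
        segments[idx] = rank * n_segments // len(values) + 1
    return segments


# Hypothetical predicted LTV values for ten customers.
predicted_ltv = [5.0, 120.0, 40.0, 0.5, 75.0, 300.0, 12.0, 60.0, 9.0, 150.0]
segments = assign_segments(predicted_ltv, 5)
print(segments)
```

Segment 1 holds the customers with the lowest predicted LTV and segment 5 the highest, matching the labels `1..number_of_segments` used in the notebook.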

In [None]:
df['segment'] = pd.qcut(df['predicted_ltv'], number_of_segments, labels=list(range(1,number_of_segments+1)))
summary = df.groupby('segment').agg({'prob_alive':'mean', 'predicted_transactions': 'mean', 
                           'predicted_aov': 'mean', 
                           'predicted_ltv': ['mean','sum']})

In [None]:
summary = summary.round(2)

In [None]:
df.head()

In [None]:
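# Build the Data Import column names from the custom dimension indexes.
data_import_key_index = 'ga:dimension{}'.format(data_import_key_index)
data_import_value_index = 'ga:dimension{}'.format(data_import_value_index)

# Keep only the user id and its segment, renamed to the Data Import schema.
df = df[['user_id', 'segment']] 
df.columns = [data_import_key_index, data_import_value_index]
df.head()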

# Optional: Save output to drive.

In [None]:
def output_to_googledrive(df, output_file_name='ltv.csv'):
    """Uploads df as a CSV file to Google Drive, prefixed with today's date."""
    date = str(datetime.datetime.today()).split()[0]  # YYYY-MM-DD
    file_name = date + '_' + output_file_name
    file_url = 'https://drive.google.com/open?id='
    # Reuse the credentials obtained by auth.authenticate_user().
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    uploaded = drive.CreateFile({'title': file_name})
    uploaded.SetContentString(df.to_csv(index=False))
    uploaded.Upload()
    print(file_name)
    print('Full File URL: {}{}'.format(file_url, uploaded.get('id')))

In [None]:
output_to_googledrive(df, output_file_name='full_ltv.csv')
output_to_googledrive(summary, output_file_name='summary_ltv.csv')


### Sending data back to Google Analytics 360

There are two approaches for uploading the results back to GA.


1.   **Measurement Protocol Hits**: An HTTP call to the Google Analytics collection servers, passing the LTV and segment for each user.
2.   **Data Import (Query Time)**: Upload a CSV into Google Analytics 360 using the Management API or the UI.
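
For the first approach, a minimal sketch of building a Measurement Protocol payload is shown below. It only constructs the request body and does not send anything. The `build_hit` helper, the tracking ID `UA-XXXXX-Y`, the client ID and the event category/action names are placeholders chosen for illustration; the custom dimension index would be the one configured in your property.

```python
from urllib.parse import urlencode

# Universal Analytics Measurement Protocol collection endpoint.
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_hit(tracking_id, client_id, ltv_segment, segment_dimension_index):
    """Builds the URL-encoded body of a non-interaction event hit."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID (placeholder in this sketch)
        "cid": client_id,    # GA client ID of the user
        "t": "event",        # hit type
        "ec": "ltv",             # event category (hypothetical name)
        "ea": "segment_update",  # event action (hypothetical name)
        "ni": "1",           # non-interaction hit, so bounce rate is unaffected
        # Custom dimension carrying the LTV segment, e.g. cd12.
        "cd{}".format(segment_dimension_index): str(ltv_segment),
    }
    return urlencode(params)


payload = build_hit("UA-XXXXX-Y", "555.12345", ltv_segment=4,
                    segment_dimension_index=12)
print(COLLECT_URL)
print(payload)
```

The payload would be sent as the body of a POST request to `COLLECT_URL`, one hit per user. For the second approach, the DataFrame prepared earlier already uses `ga:dimensionN` column names, which is the header format a Data Import CSV expects.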

