# DISCLAIMER
Copyright 2021 Google LLC. 

*This solution, including any related sample code or data, is made available on an “as is,” “as available,” and “with all faults” basis, solely for illustrative purposes, and without warranty or representation of any kind. This solution is experimental, unsupported and provided solely for your convenience. Your use of it is subject to your agreements with Google, as applicable, and may constitute a beta feature as defined under those agreements. To the extent that you make any data available to Google in connection with your use of the solution, you represent and warrant that you have all necessary and appropriate rights, consents and permissions to permit Google to use and process that data. By using any portion of this solution, you acknowledge, assume and accept all risks, known and unknown, associated with its usage, including with respect to your deployment of any portion of this solution in your systems, or usage in connection with your business, if at all.*

# Estimating lifetime value of users via Crystalvalue

**Crystalvalue** is a comprehensive, best-practice framework for running end-to-end LTV solutions using Google Cloud Vertex AI AutoML.

To illustrate how to use this library, this notebook uses the Online Retail II dataset from Kaggle, which contains all transactions for a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. More details on this dataset can be found at https://www.kaggle.com/mashlyn/online-retail-ii-uci.

# Set up - Getting access to the dataset

In order to use Kaggle's public API, you must first authenticate using an API token. You can create one by visiting your Kaggle account and selecting 'Create New API Token', which downloads a `kaggle.json` file (see https://www.kaggle.com/docs/api).

In [None]:
!pip install kaggle

Kaggle requires the `kaggle.json` token file to be in a specific folder called `.kaggle` in your home directory.

In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/kaggle.json
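
The same token setup can also be done from Python, which makes it easy to restrict the file permissions at the same time (the Kaggle CLI warns when the token is readable by other users). This is a minimal sketch using only the standard library; the helper name `install_kaggle_token` and its arguments are hypothetical and not part of Kaggle or Crystalvalue.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path="kaggle.json", kaggle_dir=None):
    """Copy the API token into the Kaggle config folder and restrict access.

    Hypothetical helper: by default it targets ~/.kaggle/kaggle.json, the
    location the Kaggle CLI reads the token from.
    """
    target_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    # Equivalent to `mkdir -p`: no error if the folder already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Owner read/write only, equivalent to `chmod 600`.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token()` with no arguments copies `kaggle.json` from the current working directory into `~/.kaggle/`.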

In [None]:
!kaggle datasets download -d mashlyn/online-retail-ii-uci

In [None]:
!sudo apt-get install -y unzip

In [None]:
!unzip online-retail-ii-uci.zip -d data/

This creates a CSV file which can be imported into BigQuery. The original dataset
is at an item (StockCode) - transaction (InvoiceNo) level. The Crystalvalue
pipeline requires a transaction-level BigQuery table, so the first step of
data preparation is to create an aggregated BigQuery table called
'online_retail_tx' (aggregated by CustomerID, InvoiceNo and InvoiceDate).
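
To make the aggregation step concrete, the sketch below rolls a few made-up item-level rows up to one row per transaction. It is illustrative only: the input rows are toy data, the calculation `Quantity * Price` for revenue is an assumption, and the output column names are chosen to match the `customer_id`, `invoice_datetime` and `revenue` columns used when initializing the pipeline later in this notebook. In practice the same aggregation would typically be run in BigQuery.

```python
from collections import defaultdict

# Toy item-level rows (one row per invoice line); values are made up.
item_rows = [
    {"CustomerID": "C1", "InvoiceNo": "1001",
     "InvoiceDate": "2009-12-01 07:45:00", "Quantity": 2, "Price": 1.5},
    {"CustomerID": "C1", "InvoiceNo": "1001",
     "InvoiceDate": "2009-12-01 07:45:00", "Quantity": 1, "Price": 4.0},
    {"CustomerID": "C2", "InvoiceNo": "1002",
     "InvoiceDate": "2009-12-01 09:10:00", "Quantity": 3, "Price": 2.0},
]


def aggregate_to_transactions(rows):
    """Sum item-level revenue per (CustomerID, InvoiceNo, InvoiceDate)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["CustomerID"], row["InvoiceNo"], row["InvoiceDate"])
        # Assumption: revenue of an invoice line is quantity times unit price.
        totals[key] += row["Quantity"] * row["Price"]
    return [
        {"customer_id": customer, "invoice_id": invoice,
         "invoice_datetime": timestamp, "revenue": round(revenue, 2)}
        for (customer, invoice, timestamp), revenue in sorted(totals.items())
    ]


transactions = aggregate_to_transactions(item_rows)
for transaction in transactions:
    print(transaction)
```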

# Installing dependencies and initializing Crystalvalue

In [None]:
%pip install --upgrade -q -r 'crystalvalue/requirements.txt'

In [None]:
from crystalvalue import crystalvalue
from google.cloud import bigquery

In [None]:
# Create BigQuery client for cloud authentication.
bigquery_client = bigquery.Client()

In [None]:
# Initiate the CrystalValue class.
pipeline = crystalvalue.CrystalValue(
    bigquery_client=bigquery_client,
    dataset_id='your_dataset_name',  # Dataset ID containing the table
    customer_id_column='customer_id',
    date_column='invoice_datetime',  # Date of transaction
    value_column='revenue',  # metric to use for LTV calculation
    window_date='2010-12-05'
)  #  If the window_date is not provided, the Crystalvalue feature engineering script automatically chooses a window date that is 365 days before the end of the dataset.

# Feature Engineering

Crystalvalue takes a transaction-level dataset and creates a machine
learning-ready dataset that can be ingested by AutoML. It will first create a
SQL query (and write it to the file path given by `write_executed_query_file`)
and then execute it. Data types will be automatically detected from the BigQuery
schema if `numerical_features` and `non_numerical_features` are not explicitly
configured. By default, the model will use features from between 2 and 1 years
ago to predict value from between 1 year ago and now, but this is configurable.
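
The default windows described above can be illustrated with plain date arithmetic. This is a sketch under stated assumptions, not Crystalvalue's implementation: the function names, the `days_lookback` and `days_lookahead` arguments, and the exact handling of the window boundaries are all hypothetical.

```python
from datetime import date, timedelta


def default_window_date(last_transaction_date, days_lookahead=365):
    """Window date placed `days_lookahead` days before the end of the data."""
    return last_transaction_date - timedelta(days=days_lookahead)


def feature_and_target_windows(window_date, days_lookback=365, days_lookahead=365):
    """Return (start, end) date pairs for the feature and target periods."""
    return {
        # Features are built from the period leading up to the window date.
        "features": (window_date - timedelta(days=days_lookback), window_date),
        # The value to predict is accumulated after the window date.
        "target": (window_date, window_date + timedelta(days=days_lookahead)),
    }


windows = feature_and_target_windows(date(2010, 12, 5))
print(windows["features"])
print(windows["target"])
```

With the `window_date` of 2010-12-05 used in this notebook, features come from the year starting 2009-12-05 and the target value from the year ending 2011-12-05.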

Note: Columns should not be nested.

In [None]:
# Perform feature engineering using BigQuery.

# CrystalValue automatically detects data types and applies transformations.
# CrystalValue by default will predict 1 year ahead (configurable) using data
# accumulated from 1 year before (configurable).

training_data = pipeline.feature_engineer(
    transaction_table_name='online_retail_tx',  # name of the table that contains transactions (created during data preparation above)
    query_template_train_file='crystalvalue/sql_templates/train_user_split_new.sql',  # path to SQL file that will create table with training data in BQ. sql_templates folder contains two template scripts.
    write_executed_query_file='crystalvalue/example_generated_crystalvalue_input.sql'  # File path to write the generated SQL query.
)  

# Model Training

Crystalvalue leverages AutoML Tabular models (Vertex AI), which require an
AutoML Dataset as an input. This is created as part of the 'training' step of
the pipeline. By default, Crystalvalue chooses the following (configurable):

* A predefined split with a random 15% of users in test, 15% in validation and 70% in training.
* Optimization objective of minimizing root-mean-squared error (RMSE). This is recommended but can be changed to MAE or RMSLE.
* 1 node hour of training (1000 milli node hours). It is recommended to modify this in line with the number of rows in the dataset.
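
A user-level split means that every transaction of a given customer lands in the same partition. One way to sketch a 15% / 15% / 70% split is to hash each customer ID into one of 100 buckets. The hashing approach, the function names and the split labels below are assumptions made for illustration; the SQL template that Crystalvalue actually uses may assign users differently.

```python
import hashlib


def split_for_bucket(bucket):
    """Map a bucket in 0..99 to a split: 15% test, 15% validation, 70% train."""
    if bucket < 15:
        return "test"
    if bucket < 30:
        return "validation"
    return "train"


def assign_split(customer_id):
    """Deterministically assign a customer to a split via a hash of the ID."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-random but stable bucket
    return split_for_bucket(bucket)


for customer_id in ["13085", "13078", "15362"]:
    print(customer_id, assign_split(customer_id))
```

Because the assignment depends only on the customer ID, re-running the feature engineering step keeps each customer in the same partition.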

In this example we keep all the default settings, so training the model is as
simple as calling `pipeline.train()`. For more details see
https://cloud.google.com/vertex-ai/docs/datasets/create-dataset-api and
https://cloud.google.com/vertex-ai/docs/training/automl-api
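
For reference, the three optimization objectives named in this section (RMSE, MAE and RMSLE) can be written out with their standard textbook definitions. The sketch below is standalone and does not call Vertex AI; the function names are hypothetical. Note that RMSLE assumes non-negative values, because it compares `log(1 + value)`.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: less sensitive to a few very large errors."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root-mean-squared log error: measures relative rather than absolute error."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


actual_ltv = [1.0, 3.0]
predicted_ltv = [2.0, 5.0]
print(rmse(actual_ltv, predicted_ltv), mae(actual_ltv, predicted_ltv),
      rmsle(actual_ltv, predicted_ltv))
```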

In [None]:
# Creates a Vertex AI Dataset and trains an AutoML model.
pipeline.train()
# Once this step completes you can check out your trained AutoML model in the Google Cloud Platform UI: https://console.cloud.google.com/vertex-ai/models

# Generating predictions

Once model training is done, you can generate predictions by calling the `.predict()` method on the `pipeline` object.

In [None]:
pipeline.predict(
    input_table_name='prediction_data',  # table that contains features to predict with.
    model_resource_name='4028856894775885824',  # The resource name of the Vertex AI model, printed on completion of the previous step or viewable in the Vertex AI dashboard (under 'Models' for the selected region).
)

There are 2 additional optional parameters:
* `model_name` - name of the model specified at the `train` step (default is 'crystalvalue_model').
* `destination_table` - name of the BigQuery table that will contain the predictions (default is 'predictions').

# Model Evaluation

To evaluate model goodness of fit, we use 3 criteria:

* Bin level Charts - predicted vs actual LTV by decile
* Spearman Correlation between predicted and actual LTV
* Normalized Gini coefficient

To ensure consistency and comparison across model runs, these outputs are sent to a BigQuery table that can capture changes in model performance over all iterations.
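
The Spearman correlation and the normalized Gini coefficient listed above can be sketched with the standard library alone. These are common formulations (rank-based Pearson correlation, and the Gini of the prediction ordering divided by the Gini of the perfect ordering); whether Crystalvalue computes them in exactly this way is an assumption, and the function names are hypothetical. The Gini sketch assumes the actual values have a positive total.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(actual, predicted):
    """Pearson correlation computed on the ranks of both inputs."""
    x, y = average_ranks(actual), average_ranks(predicted)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def gini(actual, predicted):
    """Gini of actual value accumulated in descending order of prediction."""
    order = sorted(range(len(actual)), key=lambda i: (-predicted[i], i))
    total = sum(actual)
    cumulative, area = 0.0, 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total
    n = len(actual)
    return (area - (n + 1) / 2) / n


def normalized_gini(actual, predicted):
    """Scale so that a perfect ranking of customers scores 1."""
    return gini(actual, predicted) / gini(actual, actual)


actual_ltv = [10.0, 20.0, 35.0, 300.0]
predicted_ltv = [12.0, 18.0, 50.0, 150.0]  # right order, wrong magnitudes
print(spearman(actual_ltv, predicted_ltv))
print(normalized_gini(actual_ltv, predicted_ltv))
```

Both metrics depend only on how well the predictions rank customers, so the example scores perfectly even though the predicted magnitudes are off; the bin-level charts complement them by showing calibration.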

In [None]:
pipeline.evaluate(predictions_table='predictions',  # Table that contains CrystalValue predictions.
                  model_display_name='crystalvalue_model',  # Model name; default: crystalvalue_model
                  table_evaluation_stats='Kaggle_crystalvalue_evaluation',  # BigQuery table to export evaluation statistics to; default: crystalvalue_evaluation
                  number_bins=10)  # Number of bins for plots; default: 10
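
The bin-level comparison controlled by `number_bins` can be pictured as follows: customers are sorted by predicted LTV, cut into equally sized bins, and the mean predicted and mean actual LTV are compared per bin. This is a minimal sketch of that idea with a hypothetical helper name; the binning that `pipeline.evaluate` performs internally may differ in detail.

```python
def bin_means(actual, predicted, number_bins=10):
    """Return (bin number, mean predicted, mean actual) per prediction bin."""
    # Sort customers from lowest to highest predicted value.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for b in range(number_bins):
        members = order[b * n // number_bins:(b + 1) * n // number_bins]
        if not members:
            continue  # fewer customers than bins
        mean_predicted = sum(predicted[i] for i in members) / len(members)
        mean_actual = sum(actual[i] for i in members) / len(members)
        rows.append((b + 1, mean_predicted, mean_actual))
    return rows


# Toy data: 20 customers whose predictions equal their actual values.
toy_values = [float(v) for v in range(1, 21)]
for row in bin_means(toy_values, toy_values, number_bins=10):
    print(row)
```

For a well-calibrated model the two means should be close in every bin, and both should increase from the first bin to the last.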

# TO DO
Elaborate once we are able to run the predictions on internal GCP projects