# DISCLAIMER
Copyright 2021 Google LLC. 

*This solution, including any related sample code or data, is made available on an “as is,” “as available,” and “with all faults” basis, solely for illustrative purposes, and without warranty or representation of any kind. This solution is experimental, unsupported and provided solely for your convenience. Your use of it is subject to your agreements with Google, as applicable, and may constitute a beta feature as defined under those agreements. To the extent that you make any data available to Google in connection with your use of the solution, you represent and warrant that you have all necessary and appropriate rights, consents and permissions to permit Google to use and process that data. By using any portion of this solution, you acknowledge, assume and accept all risks, known and unknown, associated with its usage, including with respect to your deployment of any portion of this solution in your systems, or usage in connection with your business, if at all.*

# Estimating lifetime value of users via Crystalvalue

**Crystalvalue** is a comprehensive best-practice framework for running end-to-end LTV solutions on Google Cloud Vertex AI AutoML.

To illustrate how to use this library, this notebook uses the Online Retail II dataset from Kaggle, which contains all the transactions of a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. More details on this dataset can be found at https://www.kaggle.com/mashlyn/online-retail-ii-uci.

# Set up - Getting access to the dataset

To use Kaggle's public API, you must first authenticate using an API token. To get one, visit your Kaggle account and click 'Create New API Token' (see https://www.kaggle.com/docs/api). This downloads an API token file called kaggle.json. Put this file in your working directory and run the following commands.

In [None]:
!pip install kaggle

Kaggle requires the JSON file to be in a folder called `.kaggle` in your home directory.

In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/kaggle.json
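
The two shell cells above can also be expressed in Python. The sketch below is only an illustration: `install_kaggle_token` and its `home_dir` argument are hypothetical names, not part of Kaggle or Crystalvalue. It additionally restricts the file to the current user, since the Kaggle CLI warns when the token is readable by other users.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copies a kaggle.json token into <home>/.kaggle (hypothetical helper)."""
    home = Path(home_dir) if home_dir else Path.home()
    target_dir = home / ".kaggle"
    # Equivalent to `mkdir -p`: no error if the folder already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Equivalent to `chmod 600`: read/write for the owner only.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target


# Example call, assuming kaggle.json is in the working directory:
# install_kaggle_token("kaggle.json")
```

Passing `home_dir` is only useful for trying the helper out without touching your real home directory.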

In [None]:
!kaggle datasets download -d mashlyn/online-retail-ii-uci

In [None]:
!sudo apt-get install unzip

In [None]:
!unzip online-retail-ii-uci.zip -d data/

This creates a CSV file which we will import into BigQuery in the next steps.

# Installing dependencies and initializing Crystalvalue

Note: this notebook assumes you are working from the crystalvalue folder's parent directory.

In [None]:
%pip install --upgrade -q -r './crystalvalue/requirements.txt'

In [None]:
import pandas as pd

from crystalvalue import crystalvalue
from google.cloud import bigquery

In [None]:
# Create BigQuery client for cloud authentication.
bigquery_client = bigquery.Client()

In [None]:
# Read the data and rename the columns to be BigQuery friendly.
data = pd.read_csv('./data/online_retail_II.csv')
data.columns = data.columns.str.replace(' ', '')
data.head()
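
The cell above only removes spaces, which is enough for this dataset (for example, a `Customer ID` header becomes `CustomerID`). For other datasets a slightly stricter clean-up may be useful. The following is a minimal sketch, assuming a conservative naming rule of letters, digits and underscores with no leading digit; `to_bigquery_column_name` is a hypothetical helper and is not part of Crystalvalue.

```python
import re


def to_bigquery_column_name(name):
    """Returns a conservative column name (hypothetical helper)."""
    # Drop all whitespace, as the notebook does with str.replace(' ', '').
    cleaned = re.sub(r"\s+", "", name)
    # Replace any remaining character that is not a letter, digit or underscore.
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", cleaned)
    # Avoid a leading digit by prefixing an underscore.
    if re.match(r"[0-9]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


example_headers = ["Invoice", "Customer ID", "unit-price", "2nd item"]
clean_headers = [to_bigquery_column_name(header) for header in example_headers]
print(clean_headers)
```

With a DataFrame this could be applied as `data.columns = [to_bigquery_column_name(c) for c in data.columns]`.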

In [None]:
# Load the data to BigQuery.
dataset_id = 'your_dataset'  # Make sure this dataset exists in your project.
table_name = 'online_retail_data'  # Name of the table that will be created.
location = 'europe-west1'  # Location of your dataset in BigQuery. Here we use 'europe-west1'.

bigquery_job = bigquery_client.load_table_from_dataframe(
      dataframe=data,
      destination=f'{bigquery_client.project}.{dataset_id}.{table_name}',
      location=location).result()

In [None]:
# Initialize the CrystalValue class with the relevant parameters.
pipeline = crystalvalue.CrystalValue(
  bigquery_client=bigquery_client,
  dataset_id=dataset_id,  # Dataset ID containing the table.
  customer_id_column='CustomerID',
  date_column='InvoiceDate',
  value_column='Price',  # Column to use for LTV calculation.
  days_lookback=90,  # How many days in the past to use for feature engineering.
  days_lookahead=365,  # How many days in the future to use for value prediction.
  location=location)  
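
To make the roles of `days_lookback` and `days_lookahead` concrete, the sketch below splits transactions around a single reference date: values inside the lookback window feed the features, and values inside the lookahead window form the value to predict. This is a simplified illustration in plain Python. Crystalvalue does this work in BigQuery, and the function name, the single reference date and the exact window boundaries used here are assumptions.

```python
from datetime import date, timedelta


def split_windows(transactions, reference_date, days_lookback=90, days_lookahead=365):
    """Sums value per customer before and after reference_date (illustrative)."""
    lookback_start = reference_date - timedelta(days=days_lookback)
    lookahead_end = reference_date + timedelta(days=days_lookahead)
    past_value, future_value = {}, {}
    for customer_id, transaction_date, value in transactions:
        if lookback_start <= transaction_date < reference_date:
            # Inside the lookback window: usable for feature engineering.
            past_value[customer_id] = past_value.get(customer_id, 0.0) + value
        elif reference_date <= transaction_date < lookahead_end:
            # Inside the lookahead window: part of the value to predict.
            future_value[customer_id] = future_value.get(customer_id, 0.0) + value
        # Transactions outside both windows are ignored.
    return past_value, future_value


example_transactions = [
    ("a", date(2010, 11, 1), 10.0),  # within 90 days before the reference date
    ("a", date(2011, 3, 1), 5.0),    # within 365 days after the reference date
    ("b", date(2010, 1, 1), 7.0),    # older than the lookback window
    ("b", date(2011, 6, 1), 3.0),    # within 365 days after the reference date
]
past, future = split_windows(example_transactions, reference_date=date(2010, 12, 1))
print(past, future)
```

A longer lookback gives the model more history per customer, while a longer lookahead requires more data after the reference date before a label can be computed.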

# Data Checks

Crystalvalue runs checks on your data to verify that it is suitable for LTV modelling and raises errors if it is not. It also writes a new BigQuery table called `summary_statistics` to your dataset with key information such as the number of customers, the number of transactions and the analysis time period. This information can be used to check for outliers or anomalies (e.g. negative prices).

In [None]:
summary_statistics = pipeline.run_data_checks(
    transaction_table_name=table_name)
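
As an illustration of the kind of information described above, the following sketch computes comparable statistics for a small in-memory list of transactions. It is not the query Crystalvalue runs, and the dictionary keys are hypothetical names rather than the columns of the `summary_statistics` table.

```python
from datetime import date


def summarize_transactions(transactions):
    """Returns basic statistics for (customer_id, date, value) tuples."""
    if not transactions:
        raise ValueError("No transactions to summarize.")
    dates = [transaction_date for _, transaction_date, _ in transactions]
    return {
        "number_of_customers": len({customer_id for customer_id, _, _ in transactions}),
        "number_of_transactions": len(transactions),
        "first_transaction_date": min(dates),
        "last_transaction_date": max(dates),
        # Negative values are a typical anomaly worth inspecting before modelling.
        "number_of_negative_values": sum(1 for _, _, value in transactions if value < 0),
    }


example_rows = [
    ("a", date(2010, 1, 5), 12.5),
    ("a", date(2010, 3, 9), -4.0),
    ("b", date(2011, 7, 1), 30.0),
]
example_summary = summarize_transactions(example_rows)
print(example_summary)
```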

# Feature Engineering

Crystalvalue takes a transaction-level dataset and creates a machine learning-ready dataset that can be ingested by AutoML. Data types are inferred automatically from the BigQuery schema unless the feature types are provided through the `feature_types` parameter of the `.feature_engineer()` method. Data transformations are applied automatically depending on the data type. The processing happens in BigQuery, and the executed script can optionally be written to your directory in case you want to look through it. By default, the features are created in a table called `training_data`.

In [None]:
training_data = pipeline.feature_engineer(
  transaction_table_name=table_name,
  write_executed_query_file='crystalvalue/executed_query.sql'  # (Optional) File path to write the executed SQL query.
)  
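
The idea of inferring a feature type from the schema, with an optional manual override, can be sketched as follows. This is an assumption-laden illustration: the mapping, the labels (`numeric`, `boolean`, `categorical`, `text`) and the function `infer_feature_types` are hypothetical and do not reproduce Crystalvalue's own logic or the format expected by its `feature_types` parameter.

```python
# BigQuery type names (standard and legacy spellings) treated as numeric or boolean.
NUMERIC_TYPES = {"INT64", "INTEGER", "FLOAT64", "FLOAT", "NUMERIC", "BIGNUMERIC"}
BOOLEAN_TYPES = {"BOOL", "BOOLEAN"}


def infer_feature_types(schema, overrides=None, ignore=()):
    """Maps column name -> feature type label (hypothetical labels)."""
    overrides = overrides or {}
    feature_types = {}
    for column, bigquery_type in schema.items():
        if column in ignore:
            continue
        if column in overrides:
            # A user-provided type always wins over the inferred one.
            feature_types[column] = overrides[column]
        elif bigquery_type.upper() in NUMERIC_TYPES:
            feature_types[column] = "numeric"
        elif bigquery_type.upper() in BOOLEAN_TYPES:
            feature_types[column] = "boolean"
        else:
            feature_types[column] = "categorical"
    return feature_types


example_schema = {
    "CustomerID": "FLOAT",
    "Quantity": "INTEGER",
    "Price": "FLOAT",
    "Country": "STRING",
    "Description": "STRING",
}
inferred = infer_feature_types(
    example_schema,
    overrides={"Description": "text"},
    ignore=("CustomerID",),
)
print(inferred)
```

An override is useful when the schema type is misleading, for example a numeric code that is really a category.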

# Model Training

Crystalvalue leverages Vertex AI AutoML Tabular models, which require an AutoML Dataset as input. Crystalvalue creates this dataset automatically as part of the training step of the pipeline. This step typically takes 2 hours or more to run. By default, Crystalvalue chooses the following (configurable) parameters:
*  Predefined split with a random 15% of users in test, 15% in validation and 70% in training.
*  Optimization objective: minimize root-mean-squared error (RMSE). This is recommended but can be changed to [MAE or RMSLE](https://cloud.google.com/automl-tables/docs/train#opt-obj).
*  1 node hour of training (1000 milli node hours). It is recommended to start with this training time. [Modify this in line with the number of rows](https://cloud.google.com/automl-tables/docs/train#training_a_model) in the dataset when you are ready to move to production. See [pricing](https://cloud.google.com/automl-tables/pricing) for cost information.
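
A user-level 70/15/15 split like the one in the first bullet can be sketched as below. This is one possible approach, not necessarily how Crystalvalue draws its split: it hashes the customer ID so that every row of a given customer always lands in the same split. The function name and the labels `train`, `validate` and `test` are illustrative assumptions.

```python
import hashlib


def assign_split(customer_id, test_percent=15, validate_percent=15):
    """Assigns a customer to a split based on a hash of its ID (illustrative)."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    # Map the hash to one of 100 roughly equally likely buckets.
    bucket = int(digest, 16) % 100
    if bucket < test_percent:
        return "test"
    if bucket < test_percent + validate_percent:
        return "validate"
    return "train"


example_splits = {customer_id: assign_split(customer_id) for customer_id in range(1000)}
split_counts = {}
for split in example_splits.values():
    split_counts[split] = split_counts.get(split, 0) + 1
print(split_counts)
```

Splitting by customer rather than by row keeps all transactions of one customer on the same side, which avoids leaking a customer's behaviour from training into test.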

In this example we keep all the default settings, so training the model is as simple as calling `pipeline.train()`. For more details see:  
https://cloud.google.com/vertex-ai/docs/datasets/create-dataset-api  
https://cloud.google.com/vertex-ai/docs/training/automl-api  

Once you start the training, you can view your model training progress here:  
https://console.cloud.google.com/vertex-ai/training/training-pipelines  
Once the training is finished, check out your trained AutoML model in the UI. Feature importance graphs and statistics on the data can be viewed here:  
https://console.cloud.google.com/vertex-ai/models

In [None]:
# Creates a Vertex AI Dataset and trains an AutoML model.
pipeline.train()

# Model Evaluation

To evaluate model goodness of fit, we use 3 criteria:

* Bin level Charts - predicted vs actual LTV by decile
* Spearman Correlation 
* Normalized Gini coefficient
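
For readers unfamiliar with the last two criteria, here is a minimal plain-Python sketch of Spearman correlation and the normalized Gini coefficient. It follows the common textbook definitions and is not taken from Crystalvalue's implementation, so details such as tie handling may differ. All function names are illustrative.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_correlation(actual, predicted):
    """Pearson correlation computed on the ranks of both series."""
    rank_a, rank_p = average_ranks(actual), average_ranks(predicted)
    mean_a = sum(rank_a) / len(rank_a)
    mean_p = sum(rank_p) / len(rank_p)
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(rank_a, rank_p))
    spread_a = sum((a - mean_a) ** 2 for a in rank_a)
    spread_p = sum((p - mean_p) ** 2 for p in rank_p)
    return covariance / math.sqrt(spread_a * spread_p)


def gini(actual, ranking_scores):
    """Gini of actual values when customers are ordered by ranking_scores."""
    n = len(actual)
    # Highest score first; the index breaks ties deterministically.
    order = sorted(range(n), key=lambda i: (-ranking_scores[i], i))
    running_total = 0.0
    cumulative_sum = 0.0
    for i in order:
        running_total += actual[i]
        cumulative_sum += running_total
    return (cumulative_sum / sum(actual) - (n + 1) / 2) / n


def normalized_gini(actual, predicted):
    """Gini of the model ordering divided by the Gini of a perfect ordering."""
    return gini(actual, predicted) / gini(actual, actual)


actual_ltv = [0.0, 10.0, 20.0, 100.0]
predicted_ltv = [1.0, 8.0, 30.0, 60.0]
print(spearman_correlation(actual_ltv, predicted_ltv))
print(normalized_gini(actual_ltv, predicted_ltv))
```

Both metrics only depend on how well the predictions rank customers, which is why they complement the decile charts that compare predicted and actual values directly.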

The following commands deploy your model (which is required for evaluation) and then perform the model evaluation. To ensure consistency and comparison across model runs, the outputs are sent to a BigQuery table (by default called `crystalvalue_evaluation`) that captures changes in model performance across iterations.

In [None]:
pipeline.deploy_model()

In [None]:
pipeline.evaluate_model(endpoint='23554634534545')  # Example ID; replace with the ID of your own deployed endpoint.

# Generating predictions

Once model training is done, you can generate predictions by running the `.predict()` method on the `pipeline` object.

In [None]:
pipeline.predict(
    input_table_name='prediction_data',  # table that contains features to predict with.
    model_resource_name='4028856894775885824',  # Resource name of the Vertex AI model. It is printed upon completion of the previous step and can also be viewed in the Vertex AI dashboard (under 'Models' for the selected region).
)

There are 2 additional optional parameters:
* `model_name` - name of the model specified at the `train` step (default is 'crystalvalue_model').
* `destination_table` - name of the BigQuery table that will contain the predictions (default is 'predictions').