# DISCLAIMER
Copyright 2021 Google LLC. 

*This solution, including any related sample code or data, is made available on an “as is,” “as available,” and “with all faults” basis, solely for illustrative purposes, and without warranty or representation of any kind. This solution is experimental, unsupported and provided solely for your convenience. Your use of it is subject to your agreements with Google, as applicable, and may constitute a beta feature as defined under those agreements. To the extent that you make any data available to Google in connection with your use of the solution, you represent and warrant that you have all necessary and appropriate rights, consents and permissions to permit Google to use and process that data. By using any portion of this solution, you acknowledge, assume and accept all risks, known and unknown, associated with its usage, including with respect to your deployment of any portion of this solution in your systems, or usage in connection with your business, if at all.*

# Crystalvalue Demo: Predictive Customer Lifetime Value for a Retail Store

Crystalvalue is a comprehensive, best-practice solution for running predictive customer lifetime value (LTV) models on Google Cloud Vertex AI.

This demo runs the Crystalvalue Python library in a notebook, from feature engineering to scheduling predictions. It uses the [Online Retail II data set from Kaggle](https://www.kaggle.com/mashlyn/online-retail-ii-uci), which contains transactions for a UK retail store between 2009 and 2011. Enable the [Vertex API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,storage-component.googleapis.com) before running this demo.

This notebook assumes that it is being run from within a [Google Cloud Platform AI Notebook](https://console.cloud.google.com/vertex-ai/notebooks/list/instances) with a Compute Engine default service account (the default setting when an AI Notebook is created) and TensorFlow backend. Ensure that the [Compute Engine default service account API](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com) is enabled. When running it on your own data, we recommend [setting up your own service account](https://cloud.google.com/vertex-ai/docs/pipelines/configure-project).

If you would like to share feedback about Crystalvalue, please email crystalvalue@google.com.

# Clone the Crystalvalue codebase

Start by cloning the Crystalvalue codebase and running a demo notebook from the root directory. To run Crystalvalue on your own data, change the parameters so that they match your data.

```git clone https://github.com/google/crystalvalue```

# Set up - Downloading the dataset

To use Kaggle's public API, you must first authenticate using an API token. To get one, visit your Kaggle account and click 'Create New API Token' (see https://www.kaggle.com/docs/api). This downloads an API token file called `kaggle.json`. Put this file in your working directory and run the following commands from your AI Notebook. Kaggle requires the token to be in a specific folder called `.kaggle` in your home directory.
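The token placement described above can also be sketched in plain Python. This is a minimal illustration, not part of Crystalvalue or the Kaggle client: the function name `install_kaggle_token` and its `home_dir` argument are hypothetical, and the owner-only file mode mirrors the `chmod 600` that the Kaggle command line tool suggests when the token is readable by other users.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    `home_dir` is a hypothetical override (handy for testing); by default
    the current user's home directory is used.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    source = Path(token_path)
    if not source.is_file():
        raise FileNotFoundError(f"API token not found: {source}")
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # Same effect as `mkdir -p`.
    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Owner read/write only, the equivalent of `chmod 600`.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token('kaggle.json')` from the working directory would place the token where the shell commands in the next cell expect it. The function can be run repeatedly without failing if the folder already exists.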

In [None]:
!pip install kaggle
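!mkdir -p ~/.kaggle  # -p avoids an error if the folder already exists.
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json  # Keep the token readable by the owner only.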
!kaggle datasets download -d mashlyn/online-retail-ii-uci
!sudo apt-get install unzip
!unzip online-retail-ii-uci.zip -d data/

This creates `online_retail_II.csv` in the `data/` directory, which we will import into BigQuery in the next steps.

# Installing dependencies and initializing Crystalvalue

First, create a dataset in [BigQuery](https://console.cloud.google.com/bigquery) for this analysis if you don't already have one. The dataset should be in a [location where Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions).

In [None]:
%pip install -q -r 'requirements.txt'

In [None]:
import pandas as pd

from src import crystalvalue

In [None]:
# Initialize the CrystalValue class with the relevant parameters.
pipeline = crystalvalue.CrystalValue(
  project_id='your_project_name',  # The GCP project id.
  dataset_id='your_dataset_name',  # The name of the pre-created dataset to work in. 
  customer_id_column='CustomerID',  # The customer ID column.
  date_column='InvoiceDate',  # The transaction date column.
  value_column='value',  #  Column to use for LTV calculation (i.e. profit or revenue).
  days_lookback=90,  #  How many days in the past to use for feature engineering.
  days_lookahead=365,  #  How many days in the future to use for value prediction.
  ignore_columns=['Invoice'],  #  A list of columns in your input table to ignore.
  location='europe-west4',  # This is the location of your dataset in BigQuery.
  write_parameters=True,  #  Write parameters to local file so they can be retrieved for prediction.
  credentials=None,  # The (optional) credentials to authenticate BigQuery and AIPlatform clients.
)  

In [None]:
# Read the data and rename the columns to be BigQuery friendly (no spaces).
data = pd.read_csv('./data/online_retail_II.csv')
data.columns = data.columns.str.replace(' ', '')
data['value'] = data['Price'] * data['Quantity']  # Calculate the value for transactions.
data.head()

In [None]:
# Load the data to BigQuery.
TABLE_NAME = 'online_retail_data'

pipeline.load_dataframe_to_bigquery(
    data=data,
    bigquery_table_name=TABLE_NAME)

# Data Checks (Optional)

CrystalValue runs checks to confirm that your data is suitable for LTV modelling and raises errors if it is not. It also writes a new BigQuery table in your dataset called `crystalvalue_data_statistics` with key information such as the number of customers, customer return rate, number of transactions and analysis time period. This information can be used to check for outliers or anomalies (e.g. negative prices).
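To make the kind of summary described above concrete, here is a small stdlib-only sketch. It is not Crystalvalue's implementation: the function name, the dictionary keys and the definition of return rate (share of customers who transacted on more than one date) are assumptions chosen for illustration, and they do not necessarily match the columns of `crystalvalue_data_statistics`.

```python
from datetime import date


def summarise_transactions(rows):
    """Summarise (customer_id, transaction_date, value) tuples."""
    rows = list(rows)
    if not rows:
        raise ValueError("No transactions to summarise.")
    # Collect the distinct transaction dates seen for each customer.
    dates_by_customer = {}
    for customer_id, transaction_date, _ in rows:
        dates_by_customer.setdefault(customer_id, set()).add(transaction_date)
    # Assumption: a "returning" customer transacted on more than one date.
    returning = sum(1 for dates in dates_by_customer.values() if len(dates) > 1)
    all_dates = [transaction_date for _, transaction_date, _ in rows]
    return {
        "number_of_customers": len(dates_by_customer),
        "number_of_transactions": len(rows),
        "customer_return_rate": returning / len(dates_by_customer),
        "min_date": min(all_dates),
        "max_date": max(all_dates),
        # Negative values often indicate refunds or data errors.
        "negative_value_transactions": sum(1 for _, _, v in rows if v < 0),
    }


sample = [
    ("a", date(2010, 1, 5), 20.0),
    ("a", date(2010, 3, 1), 35.0),
    ("b", date(2010, 2, 10), -4.5),
    ("c", date(2010, 2, 11), 12.0),
]
stats = summarise_transactions(sample)
```

In this toy sample only customer `a` returns, and one transaction has a negative value, which is the kind of anomaly the data checks are meant to surface.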

In [None]:
summary_statistics = pipeline.run_data_checks(
    transaction_table_name=TABLE_NAME)

If a custom data cleaning routine has to be implemented, use the `.run_query()` method. The example below removes transactions with zero or negative prices. This method could also be used to run custom feature engineering scripts instead of the automated `.feature_engineer()` method in the next step. This data cleaning routine can be scheduled as part of the pipeline that we will define later (for model training and prediction).

In [None]:
query = f"""
SELECT *
FROM `{pipeline.project_id}.{pipeline.dataset_id}.{TABLE_NAME}`
WHERE Price > 0
"""

pipeline.run_query(
    query_sql=query,
    destination_table_name=TABLE_NAME)

# Feature Engineering

Crystalvalue takes a transaction or browsing level dataset and creates a machine learning-ready dataset that can be ingested by AutoML. Data types are automatically inferred from the BigQuery schema unless the features are provided using the `feature_types` parameter in the `.feature_engineer()` method. Data transformations are applied automatically depending on the data type. The data crunching happens in BigQuery and the executed script can be optionally written to your directory. The features will be created in a BigQuery table called `crystalvalue_train_data` by default.
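As a rough mental model of what the lookback and lookahead windows do, the sketch below builds one training row for a single customer in plain Python. It is an illustration under stated assumptions, not the SQL that Crystalvalue executes: the function name, the feature names and the choice to count the cutoff date as the first day of the lookahead window are all hypothetical.

```python
from datetime import date, timedelta


def build_training_row(transactions, cutoff, days_lookback, days_lookahead):
    """Build features and target from (transaction_date, value) pairs.

    Features come from the lookback window before `cutoff`; the target is
    the total value in the lookahead window starting at `cutoff`.
    """
    lookback_start = cutoff - timedelta(days=days_lookback)
    lookahead_end = cutoff + timedelta(days=days_lookahead)
    past = [v for d, v in transactions if lookback_start <= d < cutoff]
    future = [v for d, v in transactions if cutoff <= d < lookahead_end]
    if not past:
        return None  # No activity in the lookback window, so no row.
    return {
        "transaction_count": len(past),
        "total_value": sum(past),
        "average_value": sum(past) / len(past),
        "future_value": sum(future),  # The quantity the model learns to predict.
    }


history = [
    (date(2010, 1, 1), 10.0),   # Too old for a 90 day lookback.
    (date(2010, 5, 20), 30.0),
    (date(2010, 6, 10), 50.0),
    (date(2010, 9, 1), 25.0),   # Falls in the lookahead window.
    (date(2011, 8, 1), 99.0),   # After the lookahead window ends.
]
row = build_training_row(history, date(2010, 7, 1), 90, 365)
```

With `days_lookback=90` and `days_lookahead=365`, only the May and June transactions feed the features and only the September transaction counts towards the target.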

In [None]:
crystalvalue_train_data = pipeline.feature_engineer(
  transaction_table_name=TABLE_NAME,
  write_executed_query_file='src/executed_train_query.sql'  # (Optional) File path to write the executed SQL query.
)  

# Model Training

Crystalvalue leverages [Vertex AI (Tabular) AutoML](https://cloud.google.com/vertex-ai/docs/training/automl-api), which requires a [Vertex AI Dataset](https://cloud.google.com/vertex-ai/docs/datasets/create-dataset-api) as an input. CrystalValue automatically creates a Vertex AI Dataset from your input table as part of the training step of the pipeline. The training process typically takes two or more hours to run. The Vertex AI Dataset will have the display name `crystalvalue_dataset`. The model will have the display name `crystalvalue_model`, but it will also receive a model ID, so even if you train multiple models they will not be overwritten and can be identified using these IDs. By default CrystalValue chooses the following parameters:
*  Predefined split with random 15% of users as test, 15% in validation and 70% in training.
*  Optimization objective as Minimize root-mean-squared error (RMSE).
*  1 node hour of training (1000 milli node hours), which we recommend starting with. [Modify this in line with the number of rows](https://cloud.google.com/automl-tables/docs/train#training_a_model) in the dataset when you are ready for productionising. See information here about [pricing](https://cloud.google.com/automl-tables/pricing).
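One way to picture a user-level 15/15/70 split is shown below. This is a sketch under assumptions and not how Crystalvalue assigns its split: it uses a hash of the customer ID rather than a random draw so that the result is reproducible, and the function name and split labels are illustrative. The point it demonstrates is that the split is made per customer, so all rows for one customer land in the same set.

```python
import hashlib


def assign_split(customer_id, test_share=0.15, validation_share=0.15):
    """Assign a customer to TEST, VALIDATE or TRAIN based on a hash of its ID."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_share:
        return "TEST"
    if bucket < test_share + validation_share:
        return "VALIDATE"
    return "TRAIN"


customer_ids = [f"customer-{i}" for i in range(10000)]
splits = [assign_split(customer_id) for customer_id in customer_ids]
shares = {name: splits.count(name) / len(splits)
          for name in ("TRAIN", "VALIDATE", "TEST")}
```

Because the assignment depends only on the ID, rerunning the split on a refreshed table keeps existing customers in the same set, which avoids leaking test customers into training.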

In this example we keep all the default settings so training the model is as simple as calling `pipeline.train_automl_model()`.

In order to make fast predictions later, you can deploy the model using the `.deploy_model()` method.

Once you start the training, you can view your model training progress here:  
https://console.cloud.google.com/vertex-ai/training/training-pipelines  
Once the training is finished, check out your Dataset (with statistics and distributions) and Model (with feature importance) in the UI:  
 https://console.cloud.google.com/vertex-ai/datasets   
 https://console.cloud.google.com/vertex-ai/models

In [None]:
model_object = pipeline.train_automl_model()

In [None]:
model_object = pipeline.deploy_model()

# Model Evaluation

To evaluate a model, we use the following criteria:

* The Spearman correlation, a measure of how well the model **ranked** the Lifetime value of customers in the test set. This is measured between -1 (worse) and 1 (better).
* The normalised Gini coefficient, another measure of how well the model **ranked** the Lifetime value of customers in the test set compared to random ranking. This is measured between 0 (worse) and 1 (better). 
* The normalised Mean Absolute Error (MAE%). This is a measure of the **error** of the model's predictions for Lifetime value in the test set. 
* top_x_percent_predicted_customer_value_share: The proportion of value (i.e. total profit or revenue) in the test set that is accounted for by the top x% model-predicted customers. 

These outputs are sent to a BigQuery table (by default called `crystalvalue_evaluation`). Subsequent model evaluations append model performance evaluation metrics to this table to allow for comparison across models.

In [None]:
metrics = pipeline.evaluate_model()

# Generating predictions

Once model training is done, you can generate predictions. Before prediction, features need to be engineered in exactly the same way as they were for model training. This is done using the `.feature_engineer()` method by setting the parameter `query_type='predict_query'`. The features will be created in a BigQuery table called `crystalvalue_predict_data` by default. The model will make predictions for all customers in the provided input table that have any activity during the lookback window. The pLTV predictions will be for the period starting from the last date in the input table (not today's date).  
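The sketch below illustrates which customers would be scored and which period the predictions cover, given the rules described above. It is an assumption-laden illustration rather than Crystalvalue's query: the function name is hypothetical, and whether the window boundary dates are inclusive is a guess.

```python
from datetime import date, timedelta


def prediction_scope(transactions, days_lookback, days_lookahead):
    """Return the prediction period and the customers active in the lookback.

    `transactions` is an iterable of (customer_id, transaction_date) pairs.
    The window is anchored on the last date in the data, not on today's date.
    """
    transactions = list(transactions)
    last_date = max(d for _, d in transactions)
    window_start = last_date - timedelta(days=days_lookback)
    active = sorted({c for c, d in transactions if d > window_start})
    return {
        "prediction_start": last_date,
        "prediction_end": last_date + timedelta(days=days_lookahead),
        "customers": active,
    }


scope = prediction_scope(
    [
        ("a", date(2011, 12, 9)),
        ("b", date(2011, 10, 1)),
        ("c", date(2011, 6, 1)),  # Inactive for more than 90 days.
    ],
    days_lookback=90,
    days_lookahead=365,
)
```

Customer `c` receives no prediction because it has no activity in the 90 days before the last transaction date. If the input table is stale, the prediction period is stale too, which is why scheduled runs should read from a regularly refreshed table.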

In [None]:
crystalvalue_predict_data = pipeline.feature_engineer(
    transaction_table_name=TABLE_NAME,  # An existing BigQuery table in your dataset id containing the data to predict with.
    query_type='predict_query')


predictions = pipeline.predict(
    input_table=crystalvalue_predict_data,
    destination_table='crystalvalue_predictions'  # The BigQuery table to append predictions to. It will be created if it does not exist yet.
    )  

# Scheduling daily predictions

Crystalvalue uses [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) to schedule and monitor machine learning predictions. It can also be used for model retraining. The example below demonstrates how to set up the model to automatically create predictions using new input data from the source BigQuery table every day at 1am. The frequency and timing of the schedule can be altered using the cron schedule below. Once this pipeline is set up, you can view it [here](https://console.cloud.google.com/vertex-ai/pipelines). For a tutorial on how to set up Vertex Pipelines, see [this guide](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline).
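For readers less familiar with cron syntax, the following sketch shows how a standard five-field expression such as the daily 1am schedule used in this demo breaks down. The helper names `describe_cron` and `daily_cron` are hypothetical and are not part of Crystalvalue or the Vertex AI client. They only split and build strings and do not validate every cron feature.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map each of the five cron fields to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"Expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


def daily_cron(hour, minute=0):
    """Build a cron expression that fires once a day at the given time."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"{minute} {hour} * * *"


schedule = daily_cron(hour=1)
fields = describe_cron(schedule)
```

Here `schedule` is `"0 1 * * *"`: minute 0 of hour 1, on every day of the month, every month and every day of the week. The time is interpreted in the time zone passed alongside the schedule.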

To use Vertex AI pipelines, we need a Cloud Storage bucket. Use the code below to create one. Note that you may have to grant the Storage Object Admin role to your service account to ensure the pipeline can run.

In [None]:
BUCKET_NAME = 'crystalvalue_bucket'
storage_bucket = pipeline.create_storage_bucket(bucket_name=BUCKET_NAME)

To use Vertex AI pipelines with Crystalvalue, we also need to create a Docker container image, which will be stored in Google Cloud Container Registry. The following code builds the image and pushes it to your [GCP Container Registry](https://cloud.google.com/container-registry).


In [None]:
!docker build -t crystalvalue .
!docker tag crystalvalue gcr.io/$pipeline.project_id/crystalvalue
!docker push gcr.io/$pipeline.project_id/crystalvalue

Kubeflow components are self-contained functions, which is why `pipeline_function` below imports `crystalvalue` inside its own body. Read more about [Kubeflow components](https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/).

In [None]:
from kfp import dsl
from kfp.v2 import compiler
from kfp.v2.dsl import component
from kfp.v2.google.client import AIPlatformClient


@component(base_image=f"gcr.io/{pipeline.project_id}/crystalvalue:latest")
def pipeline_function():  
  from src import crystalvalue
  parameters = crystalvalue.load_parameters_from_file()
  pipeline = crystalvalue.CrystalValue(**parameters)
  TRANSACTION_TABLE = 'online_retail_data'  # Add your input table name.
  pipeline.run_data_checks(transaction_table_name=TRANSACTION_TABLE)  
  features = pipeline.feature_engineer(transaction_table_name=TRANSACTION_TABLE,
                                       query_type='predict_query')
  pipeline.predict(features)


@dsl.pipeline(
    name="crystalvaluepipeline",
    pipeline_root=f"gs://{BUCKET_NAME}/pipeline_root",
)
def crystalvalue_pipeline():
    pipeline_function()
    
compiler.Compiler().compile(
  pipeline_func=crystalvalue_pipeline,
  package_path="crystalvaluepipeline.json"
)

# Choose a region compatible with Vertex Pipelines. 
# This doesn't have to be the same as your data location.
api_client = AIPlatformClient(
    project_id=pipeline.project_id,
    region=pipeline.location,
)


(Optional) Check that your pipeline runs by calling the following function:

```
api_client.create_run_from_job_spec(job_spec_path="crystalvaluepipeline.json")
```

Create the scheduled pipeline. Adjust time zone and cron schedule as necessary.

In [None]:
response = api_client.create_schedule_from_job_spec(
    job_spec_path="crystalvaluepipeline.json",
    schedule="0 1 * * *",
    time_zone="America/Los_Angeles")

You can view your running and scheduled pipelines at:
https://console.cloud.google.com/vertex-ai/pipelines

# (Optional) Get insights into the relationship between your features and customer LTV

To get insights into how your model makes predictions based on your features, use the [What-If Tool](https://pair-code.github.io/what-if-tool/). Check out an [online demo here](https://pair-code.github.io/what-if-tool/demos/age.html).

In [None]:
import numpy as np

from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget

In [None]:
features_with_predictions = pd.concat([
    crystalvalue_predict_data.iloc[:,7:],
    predictions['predicted_value']], axis=1)

In [None]:
config_builder = WitConfigBuilder(
    np.array(features_with_predictions[0:1000]).tolist(),
    list(features_with_predictions)
)
WitWidget(config_builder, height=1000)

# Clean Up

To clean up after this demo, delete the BigQuery tables that it created. All Vertex AI resources can be removed from the [Vertex AI console](https://console.cloud.google.com/vertex-ai). If you set up a Vertex Pipeline, also remove any relevant resources from [Cloud Storage](https://console.cloud.google.com/storage) and [Container Registry](https://console.cloud.google.com//gcr/images/).

In [None]:
pipeline.delete_table('crystalvalue_data_statistics')
pipeline.delete_table('crystalvalue_evaluation')
pipeline.delete_table('crystalvalue_train_data')
pipeline.delete_table('crystalvalue_predict_data')
pipeline.delete_table('crystalvalue_predictions')