# AutoML Propensity to Purchase with code

Use the Vertex AI Python client to recreate the no-code approach in Python. This notebook trains a custom model with AutoML and runs a batch prediction.

Based on the original source [vertex-ai-mlops/02b - Vertex AI - AutoML with clients (code).ipynb](https://github.com/statmike/vertex-ai-mlops/blob/main/02b%20-%20Vertex%20AI%20-%20AutoML%20with%20clients%20(code).ipynb) by fellow Googler Mike Henderson ([statmike](https://github.com/statmike)).

## Setup
Inputs:

In [None]:
PROJECT_ID = 'demos-vertex-ai'
REGION = 'us-central1'
DATANAME = 'propensity'
NOTEBOOK = 'automl-propensity-code'

# Resources
DEPLOY_COMPUTE = 'n1-standard-4'

# Model Training
VAR_TARGET = 'will_buy_on_return_visit'
VAR_OMIT = 'fullVisitorId' # add more variables to the string with space delimiters

packages:

In [None]:
from google.cloud import aiplatform
from datetime import datetime

from google.cloud import bigquery
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import json
import numpy as np

clients: 

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)
bq = bigquery.Client(project = PROJECT_ID)

parameters:

In [None]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
DIR = f"temp/{NOTEBOOK}"
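
The timestamp is appended to every resource display name later in the notebook, so it helps to see what the pattern produces. A minimal sketch, using a fixed, illustrative datetime instead of `datetime.now()` so the result is reproducible:

```python
from datetime import datetime

NOTEBOOK = "automl-propensity-code"
DATANAME = "propensity"

# Fixed datetime for illustration only; the notebook uses datetime.now().
example_time = datetime(2022, 3, 21, 14, 56, 17)
example_timestamp = example_time.strftime("%Y%m%d%H%M%S")

# Display name pattern used for the dataset, training job, model, and batch job.
display_name = f"{NOTEBOOK}_{DATANAME}_{example_timestamp}"

print(example_timestamp)  # 20220321145617
print(display_name)       # automl-propensity-code_propensity_20220321145617
```

Because the timestamp has one-second resolution, rerunning the parameters cell gives each run's resources a distinct name.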

environment:

In [None]:
!rm -rf {DIR}
!mkdir -p {DIR}

## Create BigQuery Dataset
List the BigQuery datasets in the project first to see what already exists:

In [None]:
query = f"""
SELECT schema_name
FROM `{PROJECT_ID}.INFORMATION_SCHEMA.SCHEMATA`
"""
bq.query(query = query).to_dataframe()

Create the dataset if it is missing:

In [None]:
query = f"""
CREATE SCHEMA IF NOT EXISTS `{PROJECT_ID}.{DATANAME}`
OPTIONS(
    location = 'US',
    labels = [('notebook','{NOTEBOOK}')]
)
"""
job = bq.query(query = query)
job.result()

In [None]:
(job.ended-job.started).total_seconds()

List the BigQuery datasets again to confirm creation:

In [None]:
query = f"""
SELECT schema_name
FROM `{PROJECT_ID}.INFORMATION_SCHEMA.SCHEMATA`
"""
bq.query(query = query).to_dataframe()

## Create BigQuery Table 

Submit a job that saves the query results to a table via Python; see [Writing query results  |  BigQuery  |  Google Cloud](https://cloud.google.com/bigquery/docs/writing-results#writing_query_results).

In [None]:
table_id = f"{PROJECT_ID}.{DATANAME}.{DATANAME}"

job_config = bigquery.QueryJobConfig(destination=table_id,
                                    write_disposition = 'WRITE_TRUNCATE')

sql = """
  SELECT
    fullVisitorId,
    bounces,
    time_on_site,
    will_buy_on_return_visit
  FROM (
        # select features
        SELECT
          fullVisitorId,
          IFNULL(totals.bounces, 0) AS bounces,
          IFNULL(totals.timeOnSite, 0) AS time_on_site
        FROM
          `data-to-insights.ecommerce.web_analytics`
        WHERE
          totals.newVisits = 1
        AND date BETWEEN '20160801' # train on first 9 months of data
        AND '20170430'
       )
  JOIN (
        SELECT
          fullvisitorid,
          IF (
              COUNTIF (
                       totals.transactions > 0
                       AND totals.newVisits IS NULL
                      ) > 0,
              1,
              0
             ) AS will_buy_on_return_visit
        FROM
          `bigquery-public-data.google_analytics_sample.*`
        GROUP BY
          fullvisitorid
       )
  USING (fullVisitorId)
  ORDER BY time_on_site DESC
"""

# Start the query, passing in the extra configuration.
query_job = bq.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.

print("Query results loaded to the table {}".format(table_id))
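
The label subquery marks a visitor as a positive example when at least one of their return visits (`totals.newVisits IS NULL`) includes a transaction. The same logic can be sketched in plain Python; the visit records below are invented for illustration and only the field names come from the query:

```python
from collections import defaultdict

# Hypothetical visit records; field names mirror the query, values are made up.
visits = [
    {"fullVisitorId": "A", "transactions": None, "newVisits": 1},
    {"fullVisitorId": "A", "transactions": 2, "newVisits": None},     # purchase on a return visit
    {"fullVisitorId": "B", "transactions": 1, "newVisits": 1},        # purchase on the first visit only
    {"fullVisitorId": "C", "transactions": None, "newVisits": None},  # return visit, no purchase
]

# COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL), grouped by visitor.
return_purchases = defaultdict(int)
for visit in visits:
    bought = visit["transactions"] is not None and visit["transactions"] > 0
    is_return_visit = visit["newVisits"] is None
    return_purchases[visit["fullVisitorId"]] += int(bought and is_return_visit)

# IF(count > 0, 1, 0) AS will_buy_on_return_visit
labels = {visitor: int(count > 0) for visitor, count in return_purchases.items()}
print(labels)  # {'A': 1, 'B': 0, 'C': 0}
```

Visitor B shows why the `newVisits IS NULL` condition matters: a purchase on the first visit does not count toward the label.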

# Create AutoML Dataset (link to BigQuery table)

In [None]:
dataset = aiplatform.TabularDataset.create(
    display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}', 
    bq_source = f'bq://{PROJECT_ID}.{DATANAME}.{DATANAME}',
    labels = {'notebook':f'{NOTEBOOK}'}
)

# Train Model with AutoML 

In [None]:
column_specs = list(set(dataset.column_names) - set(VAR_OMIT.split()) - set([VAR_TARGET, 'splits']))

In [None]:
column_specs = dict.fromkeys(column_specs, 'auto')

In [None]:
print(column_specs)
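
The three cells above remove the omitted columns, the target, and the `splits` column, then map every remaining feature to `'auto'`. A self-contained sketch of the same steps, assuming a hypothetical column list in place of `dataset.column_names`:

```python
# Hypothetical column list; in the notebook this comes from dataset.column_names.
column_names = ["fullVisitorId", "bounces", "time_on_site", "will_buy_on_return_visit"]

VAR_TARGET = "will_buy_on_return_visit"
VAR_OMIT = "fullVisitorId"  # space-delimited list of columns to leave out

excluded = set(VAR_OMIT.split()) | {VAR_TARGET, "splits"}
# Keep the original column order instead of relying on set ordering.
feature_columns = [name for name in column_names if name not in excluded]

# 'auto' asks AutoML to infer each column's transformation.
column_specs = dict.fromkeys(feature_columns, "auto")
print(column_specs)  # {'bounces': 'auto', 'time_on_site': 'auto'}
```

Filtering with a list comprehension rather than set subtraction keeps the feature order stable between runs, which makes the printed specification easier to compare.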

Define a Job:

* Consider Weighting
* Model Type
* Optimization Objective

https://googleapis.dev/python/aiplatform/latest/aiplatform.html#google.cloud.aiplatform.AutoMLTabularTrainingJob

In [None]:
tabular_classification_job = aiplatform.AutoMLTabularTrainingJob(
    display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    optimization_prediction_type = 'classification',
    optimization_objective = 'maximize-au-prc',
    column_specs = column_specs,
    labels = {'notebook':f'{NOTEBOOK}'}
)

In [None]:
model = tabular_classification_job.run(
    dataset = dataset,
    target_column = VAR_TARGET,
    # predefined_split_column_name = 'splits',
    #    training_fraction_split = 0.8,
    #    validation_fraction_split = 0.1,
    #    test_fraction_split = 0.1,
    budget_milli_node_hours = 1000,
    model_display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    disable_early_stopping = False,
    model_labels = {'notebook':f'{NOTEBOOK}'}
)
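
The `budget_milli_node_hours` argument is expressed in thousandths of a node hour, so the value 1000 above corresponds to one node hour. A small conversion helper (hypothetical, not part of the SDK) makes the unit explicit:

```python
def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours to milli node hours (hypothetical helper, not an SDK function)."""
    return int(round(node_hours * 1000))


print(node_hours_to_milli(1))    # 1000
print(node_hours_to_milli(2.5))  # 2500
```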

# Evaluation 
The model can be evaluated in two ways:

1. within the Cloud Console under [Vertex AI > Models](https://console.cloud.google.com/vertex-ai/models) 
2. via the API 


Set up a model client for the model created by this notebook:

In [None]:
# model = aiplatform.Model('projects/746038361521/locations/us-central1/models/298666940522561536')
model.resource_name

In [None]:
model_client = aiplatform.gapic.ModelServiceClient(
    client_options = {
        'api_endpoint' : f'{REGION}-aiplatform.googleapis.com'
    }
)

Retrieve the aggregate evaluation metrics for the model as a whole. First, use `.list_model_evaluations` to retrieve the evaluation id, then pass that id to `.get_model_evaluation`:

In [None]:
evaluations = model_client.list_model_evaluations(parent = model.resource_name)
evals = iter(evaluations)
eval_id = next(evals).name
geteval = model_client.get_model_evaluation(name = eval_id)

In [None]:
geteval.metrics['auPrc']

In [None]:
for i in range(len(geteval.metrics['confusionMatrix']['annotationSpecs'])):
    print('True Label = ', 
          geteval.metrics['confusionMatrix']['annotationSpecs'][i]['displayName'], 
          ' has Predicted labels = ', 
          geteval.metrics['confusionMatrix']['rows'][i])
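
Per-class precision and recall can be derived from a confusion matrix laid out the way the loop above reads it, with one row per true label and one column per predicted label. A minimal sketch on made-up counts (these numbers are not results from this model):

```python
# Made-up confusion matrix: rows are true labels, columns are predicted labels.
labels = ["0", "1"]
rows = [
    [90, 10],  # true "0": 90 predicted "0", 10 predicted "1"
    [5, 15],   # true "1": 5 predicted "0", 15 predicted "1"
]


def precision_recall(rows, index):
    """Return (precision, recall) for the class at position `index`."""
    true_positive = rows[index][index]
    predicted_positive = sum(row[index] for row in rows)  # column total
    actual_positive = sum(rows[index])                    # row total
    return true_positive / predicted_positive, true_positive / actual_positive


precision, recall = precision_recall(rows, labels.index("1"))
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```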


In [None]:
slices = model_client.list_model_evaluation_slices(parent = eval_id)
for eval_slice in slices:  # avoid shadowing the built-in `slice`
    print('Label = ', eval_slice.slice_.value, 'has auPrc = ', eval_slice.metrics['auPrc'])

# Batch Prediction

## Create sample batch input (BigQuery table)
Sample a few rows from the original dataset for a simplified demonstration.

In [None]:
table_id = f"{PROJECT_ID}.{DATANAME}.batch_01"

job_config = bigquery.QueryJobConfig(destination=table_id,
                                    write_disposition = 'WRITE_TRUNCATE')

sql = f"""
  SELECT * FROM `{PROJECT_ID}.{DATANAME}.{DATANAME}` WHERE RAND() < 10/555987
"""

# Start the query, passing in the extra configuration.
query_job = bq.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.

print("Query results loaded to the table {}".format(table_id))
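
The filter `RAND() < 10/555987` keeps each row independently with a small probability, so the number of rows in the batch table varies from run to run. A quick check of the arithmetic, assuming 555987 is the row count of the source table (the notebook does not state this):

```python
total_rows = 555987  # assumed row count of the source table
keep_probability = 10 / total_rows

# Expected number of rows that pass the filter.
expected_rows = total_rows * keep_probability

# Probability that the filter keeps no rows at all.
probability_empty = (1 - keep_probability) ** total_rows

print(round(expected_rows, 6))    # 10.0
print(probability_empty < 1e-4)   # True
```

Under that assumption the sample averages 10 rows and is almost never empty; if an exact count matters, `ORDER BY RAND() LIMIT 10` is a common alternative.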

## Batch Prediction from BigQuery source to BigQuery Destination, with Explanations

In [None]:
batch = aiplatform.BatchPredictionJob.create(
    job_display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    model_name = model.name,
    instances_format = "bigquery",
    predictions_format = "bigquery",
    bigquery_source = f'bq://{PROJECT_ID}.{DATANAME}.batch_01',
    bigquery_destination_prefix = f"{PROJECT_ID}",
    generate_explanation = True,
    labels = {'notebook':f'{NOTEBOOK}'}
)

## View batch prediction output table for downstream use

The predictions can feed downstream uses such as remarketing, email, or other outreach as part of a customer loyalty program.


Check the name of the BigQuery dataset that the batch prediction wrote to, then query it to view the results. See
[Get batch predictions  |  Vertex AI  |  Google Cloud](https://cloud.google.com/vertex-ai/docs/predictions/batch-predictions#tabular).


In [None]:
query = f"""
SELECT schema_name
FROM `{PROJECT_ID}.INFORMATION_SCHEMA.SCHEMATA`
"""
bq.query(query = query, location = 'us-central1').to_dataframe()

Query to view probability of purchase by user id `fullVisitorId`:

```sql
SELECT
  fullVisitorId,
  predicted_will_buy_on_return_visit.classes[OFFSET(1)] AS purchaseYN,
  predicted_will_buy_on_return_visit.scores[OFFSET(1)] AS purchasePropensity
FROM 
    `demos-vertex-ai.prediction_automl_propensity_code_propensity_20220321145617_2022_03_21T12_38_12_306Z.predictions_2022_03_21T12_38_12_306Z`
```
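
The query reads position 1 of the `classes` and `scores` arrays, which assumes the positive class is stored second. A sketch in Python of looking the class up by value instead, on a hypothetical prediction record shaped like the output column (the values are invented, and the class labels are assumed to be strings):

```python
# Hypothetical prediction record; the shape mirrors the
# predicted_will_buy_on_return_visit column, the numbers are invented.
prediction = {"classes": ["0", "1"], "scores": [0.97, 0.03]}


def score_for(prediction, label):
    """Return the score paired with `label`, wherever it sits in the arrays."""
    position = prediction["classes"].index(label)
    return prediction["scores"][position]


print(score_for(prediction, "1"))  # 0.03
```

Looking the label up by value gives the same result as `OFFSET(1)` for this record and still works if the class order differs.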

In [None]:
# The following two lines are only necessary to run once.
# Comment out otherwise for speed-up.
from google.cloud.bigquery import Client, QueryJobConfig
client = Client()

query = """SELECT
  fullVisitorId,
  predicted_will_buy_on_return_visit.classes[OFFSET(1)] AS purchaseYN,
  predicted_will_buy_on_return_visit.scores[OFFSET(1)] AS purchasePropensity
FROM 
    `demos-vertex-ai.prediction_automl_propensity_code_propensity_20220321145617_2022_03_21T12_38_12_306Z.predictions_2022_03_21T12_38_12_306Z`"""
job = client.query(query)
df = job.to_dataframe()
df

### Conclusion

Assuming a classification threshold of 0.5, none of these customers is predicted to purchase on their next visit.
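
Applying a threshold to the propensity scores can be sketched as follows. The visitor ids and scores are invented for illustration and are not taken from the notebook's output:

```python
# Invented propensity scores for illustration; not the notebook's results.
propensities = {"visitor_a": 0.03, "visitor_b": 0.12, "visitor_c": 0.41}

THRESHOLD = 0.5

# Visitors whose propensity meets the threshold.
likely_buyers = [visitor for visitor, score in propensities.items() if score >= THRESHOLD]
print(likely_buyers)  # []

# A lower, hypothetical threshold selects more visitors for outreach.
reachable = [visitor for visitor, score in propensities.items() if score >= 0.1]
print(reachable)  # ['visitor_b', 'visitor_c']
```

For outreach such as remarketing, ranking visitors by score or choosing a lower threshold may be more useful than a fixed 0.5 cut-off, since purchases are rare.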