# BigQuery ML - Google Analytics 4 Propensity to Churn (Simplified)

A somewhat simplified example based on the following official Google Cloud notebooks:

* [analytics-componentized-patterns/bqml\_ga4\_gaming\_propensity\_to\_churn.ipynb](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/blob/master/gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to_churn.ipynb)  - expanded data preparation for model training, batch prediction output 
* [training-data-analyst/lab\_exercise.ipynb](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/quests/vertex-ai/vertex-bqml/lab_exercise.ipynb)  - train and tune a BQML XGBoost propensity model to predict customer churn, online prediction (no longer required to manually export saved model from BQ ML to deploy)
* [devrel-demos/vertex-ai-first-model-deployed.ipynb](https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/ai-ml/vertex-ai-first-model-in-production/vertex-ai-first-model-deployed.ipynb) - simplified data preparation, model training, and deployment with Vertex AI Pipelines, online prediction output

### Summary

1. Setup 
2. Create training dataset
3. Train model with BQML (logistic regression)
4. Evaluate model
5. Create predictions (batch)
6. Export predictions from BigQuery to Cloud Storage 


## Setup

Set the variables below; they are used in the code throughout the notebook.

In [None]:
PROJECT_ID = "YOUR-PROJECT-ID"  # replace with your project id
REGION = 'US'

### Create Google Cloud Storage bucket

The bucket is used for exporting prediction results.

In [None]:
! gsutil mb -l us-central1 gs://$PROJECT_ID-bqmlga4-demo

## Create Training dataset

First we create 3 views for joining together into a training dataset:

1. Returning users
2. User Demographics
3. User Behavior 

### Create BigQuery dataset 

This dataset holds the training views and the ML model trained with BQML.

In [None]:
%%bigquery --project $PROJECT_ID

CREATE SCHEMA IF NOT EXISTS bqmlga4_demo
OPTIONS(
    location='US'
)

### returning users 

Query to obtain returning users based on first engagement date 

In [None]:
%%bigquery --project $PROJECT_ID
CREATE OR REPLACE VIEW bqmlga4_demo.returningusers AS(
    WITH firstlasttouch AS(
        SELECT
        user_pseudo_id,
        MIN(event_timestamp) AS user_first_engagement,
        MAX(event_timestamp) AS user_last_engagement
        FROM
      `firebase-public-project.analytics_153293282.events_*`
        WHERE event_name="user_engagement"
        GROUP BY
        user_pseudo_id

    )
    SELECT
    user_pseudo_id,
    user_first_engagement,
    user_last_engagement,
    EXTRACT(MONTH from TIMESTAMP_MICROS(user_first_engagement)) as month,
    EXTRACT(DAYOFYEAR from TIMESTAMP_MICROS(user_first_engagement)) as julianday,
    EXTRACT(DAYOFWEEK from TIMESTAMP_MICROS(user_first_engagement)) as dayofweek,

    -- add 24 hr to user's first touch
    (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,

    -- churned=1 if last touch is within 24 hr of first engagement, else 0
    IF(user_last_engagement < (user_first_engagement + 86400000000),
        1,
        0) AS churned,

    -- bounced=1 if last_touch within 10 min, else 0
    IF(user_last_engagement <= (user_first_engagement + 600000000),
        1,
        0) AS bounced,
    FROM
    firstlasttouch
    GROUP BY
    1, 2, 3
);

SELECT
*
FROM
bqmlga4_demo.returningusers
LIMIT 100

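The derived columns of the view can be mirrored in plain Python to check the labelling rules on a single user. This is only a sketch: `label_user` is a hypothetical helper that does not exist in the notebook, and it assumes timestamps are microseconds since the Unix epoch in UTC, as in the GA4 export.

```python
from datetime import datetime, timezone

MICROS_PER_DAY = 86_400_000_000     # same constant as in the view
MICROS_PER_10_MIN = 600_000_000     # same constant as in the view


def label_user(first_us: int, last_us: int) -> dict:
    """Mirror the view's derived columns for one user (hypothetical helper)."""
    first_dt = datetime.fromtimestamp(first_us // 1_000_000, tz=timezone.utc)
    return {
        "month": first_dt.month,
        "julianday": first_dt.timetuple().tm_yday,
        # BigQuery DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday)
        "dayofweek": first_dt.isoweekday() % 7 + 1,
        "ts_24hr_after_first_engagement": first_us + MICROS_PER_DAY,
        "churned": int(last_us < first_us + MICROS_PER_DAY),
        "bounced": int(last_us <= first_us + MICROS_PER_10_MIN),
    }


# A user whose last engagement is 5 minutes after the first one
first = 1529165922877000
labels = label_user(first, first + 5 * 60 * 1_000_000)
print(labels["churned"], labels["bounced"])  # 1 1
```

Note that every bounced user is also labelled as churned; the bounced flag exists so that these users can be filtered out of the training data.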

### demographics 

Query to get user demographic data 

In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4_demo.user_demographics AS(

    WITH first_values AS(
        SELECT
        user_pseudo_id,
        geo.country as country,
        device.operating_system as operating_system,
        device.language as language,
        ROW_NUMBER() OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS row_num
        FROM `firebase-public-project.analytics_153293282.events_*`
        WHERE event_name="user_engagement"
    )
    SELECT * EXCEPT(row_num)
    FROM first_values
    WHERE row_num=1
);

SELECT
*
FROM
bqmlga4_demo.user_demographics
LIMIT 10


### behavioral

Query for user behavior features


In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4_demo.user_aggregate_behavior AS(
    WITH
    events_first24hr AS(
        # select user data only from first 24 hr of using the app
        SELECT
        e.*
        FROM
        `firebase-public-project.analytics_153293282.events_*` e
        JOIN
        bqmlga4_demo.returningusers r
        ON
        e.user_pseudo_id=r.user_pseudo_id
        WHERE
        e.event_timestamp <= r.ts_24hr_after_first_engagement
    )
    SELECT
    user_pseudo_id,
    SUM(IF(event_name='user_engagement', 1, 0)) AS cnt_user_engagement,
    SUM(IF(event_name='level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,
    SUM(IF(event_name='level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,
    SUM(IF(event_name='level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,
    SUM(IF(event_name='level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,
    SUM(IF(event_name='post_score', 1, 0)) AS cnt_post_score,
    SUM(IF(event_name='spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,
    SUM(IF(event_name='ad_reward', 1, 0)) AS cnt_ad_reward,
    SUM(IF(event_name='challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,
    SUM(IF(event_name='completed_5_levels', 1, 0)) AS cnt_completed_5_levels,
    SUM(IF(event_name='use_extra_steps', 1, 0)) AS cnt_use_extra_steps,
    FROM
    events_first24hr
    GROUP BY
    1
);

SELECT
*
FROM
bqmlga4_demo.user_aggregate_behavior
LIMIT 10

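The aggregation in this view amounts to counting selected event names per user, restricted to events within the 24 hr cutoff. A minimal Python sketch of the same logic follows; `aggregate_behavior` and its input shapes are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter, defaultdict

# Event names counted by the view
TRACKED_EVENTS = (
    "user_engagement", "level_start_quickplay", "level_end_quickplay",
    "level_complete_quickplay", "level_reset_quickplay", "post_score",
    "spend_virtual_currency", "ad_reward", "challenge_a_friend",
    "completed_5_levels", "use_extra_steps",
)


def aggregate_behavior(events, cutoff_by_user):
    """Count tracked events per user within the first 24 hr.

    events: iterable of (user_pseudo_id, event_name, event_timestamp) tuples.
    cutoff_by_user: user_pseudo_id -> ts_24hr_after_first_engagement.
    """
    counts = defaultdict(Counter)
    for user, name, ts in events:
        cutoff = cutoff_by_user.get(user)
        # Inner join on returning users plus the 24 hr filter
        if cutoff is None or ts > cutoff:
            continue
        user_counts = counts[user]  # creates the row even for untracked events
        if name in TRACKED_EVENTS:
            user_counts[name] += 1
    return {
        user: {f"cnt_{event}": c[event] for event in TRACKED_EVENTS}
        for user, c in counts.items()
    }


events = [("u1", "user_engagement", 10), ("u1", "post_score", 20),
          ("u1", "post_score", 999)]
print(aggregate_behavior(events, {"u1": 100})["u1"]["cnt_post_score"])  # 1
```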

### training data 

Finally, join the three views created above into the model-ready training dataset.

In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4_demo.train AS(

    SELECT
    dem.*,
    IFNULL(beh.cnt_user_engagement, 0) AS cnt_user_engagement,
    IFNULL(beh.cnt_level_start_quickplay, 0) AS cnt_level_start_quickplay,
    IFNULL(beh.cnt_level_end_quickplay, 0) AS cnt_level_end_quickplay,
    IFNULL(beh.cnt_level_complete_quickplay, 0) AS cnt_level_complete_quickplay,
    IFNULL(beh.cnt_level_reset_quickplay, 0) AS cnt_level_reset_quickplay,
    IFNULL(beh.cnt_post_score, 0) AS cnt_post_score,
    IFNULL(beh.cnt_spend_virtual_currency, 0) AS cnt_spend_virtual_currency,
    IFNULL(beh.cnt_ad_reward, 0) AS cnt_ad_reward,
    IFNULL(beh.cnt_challenge_a_friend, 0) AS cnt_challenge_a_friend,
    IFNULL(beh.cnt_completed_5_levels, 0) AS cnt_completed_5_levels,
    IFNULL(beh.cnt_use_extra_steps, 0) AS cnt_use_extra_steps,
    ret.user_first_engagement,
    ret.month,
    ret.julianday,
    ret.dayofweek,
    ret.churned
    FROM
    bqmlga4_demo.returningusers ret
    LEFT OUTER JOIN
    bqmlga4_demo.user_demographics dem
    ON
    ret.user_pseudo_id=dem.user_pseudo_id
    LEFT OUTER JOIN
    bqmlga4_demo.user_aggregate_behavior beh
    ON
    ret.user_pseudo_id=beh.user_pseudo_id
    WHERE ret.bounced=0
);

SELECT
*
FROM
bqmlga4_demo.train
LIMIT 10


## Train model - logistic regression

The `MODEL_REGISTRY` option registers the resulting model in Vertex AI Model Registry, and `VERTEX_AI_MODEL_VERSION_ALIASES` sets aliases on the registered model version. See more details at the links below:

* https://cloud.google.com/blog/topics/developers-practitioners/mlops-bigquery-ml-vertex-ai-model-registry
* https://cloud.google.com/bigquery-ml/docs/managing-models-vertex 

In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE MODEL bqmlga4_demo.churn_logreg

OPTIONS(
    MODEL_TYPE="LOGISTIC_REG",
    INPUT_LABEL_COLS=["churned"],
    MODEL_REGISTRY="vertex_ai",
    VERTEX_AI_MODEL_VERSION_ALIASES=['logistic_reg', 'demo']
) AS

SELECT
*
FROM
bqmlga4_demo.train


## Create predictions 

For propensity modeling, the most important output is the probability of a behavior occurring. The following query returns the probability that the user will churn, i.e. not return after the first 24 hrs. The closer the probability is to 1, the more likely the user is predicted to churn; the closer it is to 0, the more likely the user is predicted to return.


In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE TABLE bqmlga4_demo.predictions AS(
    SELECT
    user_pseudo_id,
    churned,
    predicted_churned,
    predicted_churned_probs[OFFSET(0)].prob as probability_churned

    FROM
    ML.PREDICT(MODEL bqmlga4_demo.churn_logreg,
               (SELECT * FROM bqmlga4_demo.train)) -- can be replaced with a proper test dataset
);

SELECT
*
FROM
bqmlga4_demo.predictions


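The summary lists an evaluation step. BigQuery ML provides `ML.EVALUATE` for this; as a quick sanity check, basic metrics can also be computed from the `churned` and `predicted_churned` columns of the predictions table. The sketch below assumes the rows were already fetched as pairs of 0/1 values, and `confusion_metrics` is a hypothetical helper. Because the predictions above were made on the training data, metrics computed this way will tend to be optimistic.

```python
def confusion_metrics(rows):
    """Compute accuracy, precision and recall for the positive class (churned=1).

    rows: iterable of (churned, predicted_churned) pairs with 0/1 values.
    """
    tp = fp = tn = fn = 0
    for actual, predicted in rows:
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up example pairs, not real model output
print(confusion_metrics([(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]))
```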
## Export model predictions out of BigQuery

### Export predictions to GCS 

In [None]:
%%bigquery --project $PROJECT_ID

EXPORT DATA OPTIONS(
    uri='gs://YOUR-PROJECT-ID-bqmlga4-demo/churn_predictions*.csv',  -- replace YOUR-PROJECT-ID with your project id
    format='CSV',
    overwrite=TRUE
) AS
SELECT
*
FROM
bqmlga4_demo.predictions

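The export URI has to point at the bucket created during setup, whose name is derived from the project id. One way to keep the two in sync is to build the statement in Python from `PROJECT_ID`. This is a sketch: `build_export_sql` is a hypothetical helper, and the bucket naming simply follows the `gsutil mb` command used in the setup section.

```python
def build_export_sql(project_id: str, table: str = "bqmlga4_demo.predictions") -> str:
    """Return an EXPORT DATA statement targeting the demo bucket of project_id."""
    uri = f"gs://{project_id}-bqmlga4-demo/churn_predictions*.csv"
    return (
        "EXPORT DATA OPTIONS(\n"
        f"    uri='{uri}',\n"
        "    format='CSV',\n"
        "    overwrite=TRUE\n"
        ") AS\n"
        f"SELECT * FROM {table}"
    )


print(build_export_sql("YOUR-PROJECT-ID"))

# To run the statement (requires the google-cloud-bigquery package and credentials):
# from google.cloud import bigquery
# bigquery.Client(project=PROJECT_ID).query(build_export_sql(PROJECT_ID)).result()
```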
## Activate/Operationalize Model Predictions

Once you have the model predictions, there are different steps you can take based on your business objective.

In our analysis, we used `user_pseudo_id` as the user identifier. Ideally, however, your app should send its own `user_id` back to Google Analytics. This will help you to:

* join any first-party data you have as additional inputs for model predictions
* join the model predictions with your first-party data

Once you have this join capability, you can:

* Export the model predictions back into Google Analytics as a user attribute. This can be done using the [Data Import feature](https://support.google.com/analytics/answer/10071301) in Google Analytics 4.
  * Based on the prediction values, you can [create and edit audiences](https://support.google.com/analytics/answer/2611404) and also do [audience targeting](https://support.google.com/optimize/answer/6283435). For example, an audience can be users with prediction probability between 0.4 and 0.7, to represent users who are predicted to be "on the fence" between churning and returning.
* Adjust the user experience for targeted users within your app. For Firebase apps, you can use the [Import segments](https://firebase.google.com/docs/projects/import-segments) feature. You can tailor the user experience by targeting your identified users through Firebase services such as Remote Config, Cloud Messaging, and In-App Messaging. This involves importing the segment data from BigQuery into Firebase. After that, you can send notifications to the users, configure the app for them, or follow the user journeys across devices.
* Run targeted marketing campaigns via CRMs like Salesforce, e.g. send out reminder emails.

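As an illustration of the "on the fence" audience mentioned above, the following sketch selects users whose churn probability falls between 0.4 and 0.7. It assumes the exported CSV files carry a header row with the column names of the predictions table; `on_the_fence_users` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def on_the_fence_users(csv_text, low=0.4, high=0.7):
    """Return user ids whose churn probability lies in [low, high]."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["user_pseudo_id"]
        for row in reader
        if low <= float(row["probability_churned"]) <= high
    ]


# Made-up rows in the shape of the exported predictions table
sample = """user_pseudo_id,churned,predicted_churned,probability_churned
A1,0,0,0.12
B2,1,1,0.55
C3,0,1,0.70
D4,1,1,0.93
"""
print(on_the_fence_users(sample))  # ['B2', 'C3']
```

The thresholds are parameters so that the audience definition can be adjusted to the business objective.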

## Online Prediction 

We will deploy and perform an online prediction with the Python SDK for Vertex AI. 

### Initialize SDK

First we initialize the SDK:

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location="us-central1")

Next, we list the models to obtain the model `resource_name` needed for deployment:

In [None]:
models = aiplatform.Model.list(filter = "display_name=churn_logreg")
model = aiplatform.Model(model_name = models[0].resource_name)
model

### create endpoint 

In [None]:
endpoint = aiplatform.Endpoint.create(display_name='bqmlga4')

In [None]:
# deploy model to endpoint 
endpoint.deploy(model,
                min_replica_count=1,
                max_replica_count=5,
                machine_type='n1-standard-4')

In [None]:
# create a sample instance for online prediction
test_instance={
    "user_pseudo_id": "71746D859E4E6D655FEE8FFBC3B6C7E9",
    "country": "United States",
    "operating_system": "ANDROID",
    "language": "en-us",
    "cnt_user_engagement": 1,
    "cnt_level_start_quickplay": 0,
    "cnt_level_end_quickplay": 0,
    "cnt_level_complete_quickplay": 0,
    "cnt_level_reset_quickplay": 0,
    "cnt_post_score": 0,
    "cnt_spend_virtual_currency": 0,
    "cnt_ad_reward": 0,
    "cnt_challenge_a_friend": 0,
    "cnt_completed_5_levels": 0,
    "cnt_use_extra_steps": 0,
    "user_first_engagement": 1529165922877000,
    "month": 6,
    "julianday": 167,
    "dayofweek": 7

}

response = endpoint.predict([test_instance])

print('API response: ', response)

# https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-online-predictions#aiplatform_create_endpoint_sample-python
# https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular#format-for-prediction

In [None]:
# cleanup
endpoint.undeploy_all()
endpoint.delete()

## Appendix 

### Example set of online predictions 

Copy and paste the following into the UI/GCP console or use via REST (gcloud SDK + curl) or via Python

```json
{
  "instances": [{   
    "user_pseudo_id": "71746D859E4E6D655FEE8FFBC3B6C7E9",
    "country": "United States",
    "operating_system": "ANDROID",
    "language": "en-us",
    "cnt_user_engagement": 1,
    "cnt_level_start_quickplay": 0,
    "cnt_level_end_quickplay": 0,
    "cnt_level_complete_quickplay": 0,
    "cnt_level_reset_quickplay": 0,
    "cnt_post_score": 0,
    "cnt_spend_virtual_currency": 0,
    "cnt_ad_reward": 0,
    "cnt_challenge_a_friend": 0,
    "cnt_completed_5_levels": 0,
    "cnt_use_extra_steps": 0,
    "user_first_engagement": 1529165922877000,
    "month": 6,
    "julianday": 167,
    "dayofweek": 7
  }]
}
```
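
When sending several instances, it can help to build the request body programmatically and check that each instance carries every feature column used in the example above. This is a sketch only: `build_request` is a hypothetical helper, and the key list is copied from the sample instance rather than read from the deployed model.

```python
import json

# Feature columns taken from the sample instance above
FEATURE_KEYS = (
    "user_pseudo_id", "country", "operating_system", "language",
    "cnt_user_engagement", "cnt_level_start_quickplay",
    "cnt_level_end_quickplay", "cnt_level_complete_quickplay",
    "cnt_level_reset_quickplay", "cnt_post_score",
    "cnt_spend_virtual_currency", "cnt_ad_reward", "cnt_challenge_a_friend",
    "cnt_completed_5_levels", "cnt_use_extra_steps",
    "user_first_engagement", "month", "julianday", "dayofweek",
)


def build_request(instances):
    """Validate the instances and return the JSON request body as a string."""
    instances = list(instances)
    for index, instance in enumerate(instances):
        missing = [key for key in FEATURE_KEYS if key not in instance]
        if missing:
            raise ValueError(f"instance {index} is missing keys: {missing}")
    return json.dumps({"instances": instances}, indent=2)


# Placeholder values only; replace with real feature values
example = {key: 0 for key in FEATURE_KEYS}
print(build_request([example])[:40])
```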