In [None]:
# Copyright 2021 Google LLC.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 3. Model Training

This notebook demonstrates how to use BigQuery ML to build a propensity model.

## Requirements

TODO(): Update correct notebook cross-links.

Input features used for training need to be stored as a BigQuery table. This can be done using the [2. ML Data preparation notebook](2.ml_data_preparation.ipynb).

## Install and import required modules

In [None]:
# Install gps_building_blocks package if not installed
# !pip install gps_building_blocks

In [None]:
from utils import model
from gps_building_blocks.cloud.utils import bigquery as bigquery_utils

## Configuration

Before training the propensity model, let's set up the initial configuration used across the notebook.

### GCP configuration

Configure the following variables based on your GCP project:

In [None]:
# GCP Project ID
PROJECT_ID = 'project-id'
# BigQuery dataset name
DATASET_NAME = 'dataset'
# BigQuery table (name) containing features dataset
FEATURES_TABLE_NAME = 'features_table'
# Output model name to save in BigQuery
MODEL_NAME = 'propensity_model'

Next, let's configure modeling options.

### Model and features configuration

Model options can be configured in detail based on BigQuery ML specifications
listed in [The CREATE MODEL statement](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create).

**NOTE**: Propensity modeling supports only the following four model types available in BigQuery ML:
- LOGISTIC_REG
- [AUTOML_CLASSIFIER](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl)
- [BOOSTED_TREE_CLASSIFIER](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree)
- [DNN_CLASSIFIER](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models)
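
As a minimal sketch, the restriction above could be checked before any training query is sent. The helper `validate_model_type` and the constant `SUPPORTED_MODEL_TYPES` are hypothetical names introduced here for illustration; they are not part of `utils.model`, and whether `PropensityModel` performs a similar check is not shown in this notebook.

```python
# Hypothetical helper for illustration; not part of the notebook's utils.model.
SUPPORTED_MODEL_TYPES = frozenset({
    'LOGISTIC_REG',
    'AUTOML_CLASSIFIER',
    'BOOSTED_TREE_CLASSIFIER',
    'DNN_CLASSIFIER',
})


def validate_model_type(params):
    """Returns the upper-cased model type, or raises ValueError if unsupported."""
    model_type = str(params.get('model_type', '')).upper()
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(
            f'Unsupported model_type {model_type!r}; expected one of '
            f'{sorted(SUPPORTED_MODEL_TYPES)}.')
    return model_type
```

Failing early like this avoids waiting for BigQuery to reject a query that uses an unsupported model type.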

To use specific model options, add them to the configuration below exactly as they are listed in [The CREATE MODEL statement](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create). For example, to train an [AUTOML_CLASSIFIER](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl) with `BUDGET_HOURS=1`, specify:

```python
params = {
  'model_type': 'AUTOML_CLASSIFIER',
  'budget_hours': 1
}
```
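
To make the mapping from `params` to BigQuery ML options concrete, the sketch below renders a dictionary like the one above into a `CREATE MODEL` statement. It is an illustration under stated assumptions, not the query that `PropensityModel` actually generates: the helper names, the set of keys excluded from `OPTIONS()`, and the value formatting rules are all hypothetical.

```python
# Assumption: these keys describe the data and model location, so this sketch
# keeps them out of OPTIONS(). The real utils.model may treat them differently.
NON_OPTION_KEYS = (
    'model_path', 'features_table_path', 'feature_columns', 'target_column')


def format_option_value(value):
    """Formats a Python value as a BigQuery SQL literal."""
    if isinstance(value, bool):
        return 'TRUE' if value else 'FALSE'
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str) and value.upper() in ('TRUE', 'FALSE'):
        return value.upper()
    return f"'{value}'"


def build_create_model_sql(params):
    """Builds a CREATE MODEL statement string from a params dictionary."""
    options = [
        f'{key.upper()}={format_option_value(value)}'
        for key, value in params.items() if key not in NON_OPTION_KEYS
    ]
    options.append(f"INPUT_LABEL_COLS=['{params['target_column']}']")
    columns = ', '.join(params['feature_columns'] + [params['target_column']])
    return (
        f"CREATE OR REPLACE MODEL `{params['model_path']}`\n"
        f"OPTIONS({', '.join(options)})\n"
        f"AS SELECT {columns}\n"
        f"FROM `{params['features_table_path']}`"
    )


example_params = {
    'model_path': 'project-id.dataset.propensity_model',
    'features_table_path': 'project-id.dataset.features_table',
    'model_type': 'AUTOML_CLASSIFIER',
    'budget_hours': 1,
    'feature_columns': ['feature1', 'feature2'],
    'target_column': 'label',
}
print(build_create_model_sql(example_params))
```

Each key that is not reserved becomes an upper-cased option, which is why option names need to match the BigQuery ML documentation exactly.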

In [None]:
FEATURE_COLUMNS = [
  'feature1',
  'feature2'
]
TARGET_COLUMN = 'label'

params = {
  'model_path': f'{PROJECT_ID}.{DATASET_NAME}.{MODEL_NAME}',
  'features_table_path': f'{PROJECT_ID}.{DATASET_NAME}.{FEATURES_TABLE_NAME}',
  'model_type': 'LOGISTIC_REG',
  'data_split_method': 'AUTO_SPLIT',
  'auto_class_weights': 'TRUE',
  'feature_columns': FEATURE_COLUMNS,
  'target_column': TARGET_COLUMN
}

## Train the model

First, we initialize `PropensityModel` with config parameters.

In [None]:
bq_client = bigquery_utils.BigQueryUtils(project_id=PROJECT_ID)
propensity_model = model.PropensityModel(bq_client=bq_client, params=params)

The next cell triggers a model training job in BigQuery, which takes some time to finish depending on dataset size and model complexity. Set `verbose=True` if you want to inspect the training query details.

In [None]:
propensity_model.train(verbose=False)

The following cell shows detailed information about the input features used to train the model. It provides the following columns:
- input — The name of the column in the input training data.
- min — The sample minimum. This column is NULL for non-numeric inputs.
- max — The sample maximum. This column is NULL for non-numeric inputs.
- mean — The average. This column is NULL for non-numeric inputs.
- stddev — The standard deviation. This column is NULL for non-numeric inputs.
- category_count — The number of categories. This column is NULL for non-categorical columns.
- null_count — The number of NULLs.

For more details, refer to the [help page](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-feature).
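
The columns listed above match the output of BigQuery ML's `ML.FEATURE_INFO` function. The sketch below builds an equivalent query string that could be run directly in the BigQuery console. The helper name `build_feature_info_sql` is hypothetical, and how `get_feature_info()` issues its query internally is an assumption, since that code is not shown here.

```python
def build_feature_info_sql(model_path):
    """Builds an ML.FEATURE_INFO query for a fully qualified model path."""
    return f'SELECT * FROM ML.FEATURE_INFO(MODEL `{model_path}`)'


print(build_feature_info_sql('project-id.dataset.propensity_model'))
```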

In [None]:
propensity_model.get_feature_info()

## Evaluate the model

This section runs a quick model evaluation that returns the following metrics:
- precision
- recall
- accuracy
- f1_score
- log_loss
- roc_auc
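
To show what the threshold-dependent metrics in this list measure, the sketch below computes precision, recall, accuracy, and F1 score in plain Python from labels and predicted probabilities. It is illustrative only: the function name is hypothetical, the rule that a probability greater than or equal to the cutoff counts as positive is an assumption, and `log_loss` and `roc_auc` are left out for brevity.

```python
def classification_metrics(labels, probabilities, threshold=0.5):
    """Computes threshold-dependent metrics for binary labels (0 or 1)."""
    if not labels or len(labels) != len(probabilities):
        raise ValueError('labels and probabilities must be non-empty and equal in length.')
    # Assumption: probabilities at or above the threshold are predicted positive.
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    tn = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0)
    return {
        'precision': precision,
        'recall': recall,
        'accuracy': (tp + tn) / len(labels),
        'f1_score': f1_score,
    }


# Made-up example data, not results from any real model.
example_labels = [1, 0, 1, 1, 0, 0]
example_probabilities = [0.9, 0.4, 0.6, 0.3, 0.7, 0.1]
print(classification_metrics(example_labels, example_probabilities))
```

Raising the cutoff generally trades recall for precision, which is why these metrics are always reported for a specific threshold.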

Two **optional parameters** can be specified for evaluation:
1. `eval_table`: BigQuery table containing evaluation dataset
2. `threshold`: Custom threshold to be used for evaluation. Default value is `0.5`.

If neither option is specified, the model is evaluated on the evaluation split created during training, with the default threshold of 0.5.
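
The two optional parameters correspond to optional arguments of BigQuery ML's `ML.EVALUATE` function. The sketch below builds such a query string with and without an evaluation table. The helper `build_evaluate_sql` is a hypothetical name, and the exact query that `evaluate()` generates is an assumption, as it is not shown in this notebook.

```python
def build_evaluate_sql(model_path, eval_table=None, threshold=0.5):
    """Builds an ML.EVALUATE query; the TABLE argument is added only if given."""
    arguments = [f'MODEL `{model_path}`']
    if eval_table:
        arguments.append(f'TABLE `{eval_table}`')
    arguments.append(f'STRUCT({threshold} AS threshold)')
    return f"SELECT * FROM ML.EVALUATE({', '.join(arguments)})"


# Evaluate on the data split held out during training.
print(build_evaluate_sql('project-id.dataset.propensity_model'))
# Evaluate on a separate table with a custom threshold.
print(build_evaluate_sql(
    'project-id.dataset.propensity_model',
    eval_table='project-id.dataset.eval_table',
    threshold=0.7))
```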

TODO(zmtbnv): Update notebook links after notebooks are published.

**NOTE:** This evaluation provides basic model performance metrics. For a thorough evaluation, refer to the [5. Model evaluation notebook](5.model_evaluation_and_diagnostics.ipynb).

In [None]:
EVAL_TABLE_NAME = 'eval_table'

eval_params = {
  'eval_table':  f'{PROJECT_ID}.{DATASET_NAME}.{EVAL_TABLE_NAME}',
  'threshold': 0.5
}
propensity_model.evaluate(eval_params, verbose=False)

## Next

Use the [5. Model evaluation notebook](5.model_evaluation_and_diagnostics.ipynb) to get detailed performance metrics for the model and decide whether it actually solves the business problem.