In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Feature engineering for large scale recommenders with NVIDIA NVTabular and Vertex AI

# Overview

The focus of this guide is to compile prescriptive guidelines for developing and operationalizing data preprocessing and feature engineering workflows using Vertex AI and NVIDIA NVTabular.  

# Dataset

The dataset used for this tutorial is the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/).  

### From the Criteo website:

#### Overview
 - This dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.
 - This dataset contains 24 files, each one corresponding to one day of data.
 - Each row corresponds to a display ad served by Criteo and the first column indicates whether this ad has been clicked or not. The positive (clicked) and negative (non-clicked) examples have both been subsampled (but at different rates) in order to reduce the dataset size.
 - There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes.
 - The semantic of these features is undisclosed. Some features may have missing values.
 - The rows are chronologically ordered.

#### Data fields
 - Label - Target variable that indicates if an ad was clicked (1) or not (0).
 - I1-I13 - A total of 13 columns of integer features (mostly count features).
 - C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. 
 - The semantic of the features is undisclosed.  
 - When a value is missing, the field is empty.

#### Format
The columns are tab-separated with the following schema:  
`<label> <integer feature 1> … <integer feature 13> <categorical feature 1> … <categorical feature 26>`
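
As a rough illustration of this layout, the sketch below splits one tab-separated row into its label, 13 integer fields and 26 categorical fields, mapping empty fields to `None`. The function name `parse_criteo_line` and the sample row are hypothetical; the pipelines in this notebook read the files with NVTabular rather than with code like this.

```python
from typing import List, Optional, Tuple

NUM_INT_FEATURES = 13  # I1-I13
NUM_CAT_FEATURES = 26  # C1-C26


def parse_criteo_line(line: str) -> Tuple[int, List[Optional[int]], List[Optional[str]]]:
    """Split one tab-separated row into label, integer and categorical fields."""
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT_FEATURES + NUM_CAT_FEATURES
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])
    # An empty field denotes a missing value, so map it to None.
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT_FEATURES]]
    cats = [v if v else None for v in fields[1 + NUM_INT_FEATURES:]]
    return label, ints, cats


# Hypothetical row with made-up values (not taken from the dataset).
sample = "\t".join(["1"] + ["5", ""] + ["0"] * 11 + ["68fd1e64", ""] + ["0a1b2c3d"] * 24)
label, ints, cats = parse_criteo_line(sample)
print(label, ints[:2], cats[:2])
```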

# Objective
This notebook demonstrates how to do data preprocessing with NVIDIA NVTabular on Vertex AI Pipeline steps.  
You will create three different pipelines to understand possible alternatives on how to manipulate data.

*Pipeline 1*: Input CSV files from Google Cloud Storage (GCS for short)  
 - Read CSV files from GCS
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output to GCS

*Pipeline 2*: Input Parquet files exported from BigQuery
 - Export Parquet files from a table in BigQuery
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output to GCS

*Pipeline 3*: Input CSV files from GCS and output to Vertex AI Feature Store 
 - Read CSV files from Google Cloud Storage
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output to GCS
 - Load the transformed data to BigQuery
 - Create a Vertex AI Feature Store and load the data from BigQuery

The goal is to show how to use NVTabular to transform data on GPUs, along with different ways of reading input (GCS and BigQuery) and writing output (GCS, BigQuery, and Vertex AI Feature Store).

# Costs
This tutorial uses billable components of Google Cloud (GCP):
 - Vertex AI (Pipelines, FeatureStore, GPUs, Training)
 - Cloud Storage
 - BigQuery
 - Cloud Artifact Registry

Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

# Installation

## Set up your local development environment

If you are using Google Cloud Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:
 - The Google Cloud SDK
 - Git
 - Python 3
 - virtualenv
 - Jupyter notebook running in a virtual environment with Python 3
 - Docker

The Google Cloud guide to [Setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

 - [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)
 - [Install Python 3.](https://cloud.google.com/python/setup#installing_python)
 - [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.
 - To install Jupyter, run `pip install jupyter` on the command line in a terminal shell.
 - To launch Jupyter, run `jupyter notebook` on the command line in a terminal shell.
 - Open this notebook in the Jupyter Notebook Dashboard.

## Install additional packages

To build and run the pipelines, you need to install the Kubeflow Pipelines SDK, the Vertex AI SDK and NVTabular locally.

In [None]:
# Setup user flag specific for your environment

import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
# Install kfp, vertex and nvtabular SDKs
! pip install {USER_FLAG} --upgrade kfp google-cloud-aiplatform nvtabular

### Restart the kernel
After installing the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

Check the versions of the packages you installed. The KFP SDK version should be >=1.6.
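
If you prefer to check the minimum version programmatically instead of reading it off the printed output, a minimal sketch could look like the following. The helper `version_at_least` is hypothetical and only compares the leading numeric parts of each version string; the `packaging` library offers a more complete comparison if it is available in your environment.

```python
from typing import Tuple


def _numeric_prefix(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '1.8.2rc1' -> (1, 8, 2)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings by their numeric components."""
    a, b = _numeric_prefix(installed), _numeric_prefix(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.6' and '1.6.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(version_at_least("1.8.2", "1.6"))
```

In the notebook you would pass `kfp.__version__` as the first argument.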

In [None]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"

# Before you begin
This notebook does not require a GPU runtime, but the pipeline steps running on GCP will.

## Set up your Google Cloud project
### The following steps are required, regardless of your notebook environment.
 - [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.
 - [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).
 - [Enable the Vertex AI, Cloud Storage, BigQuery, Compute Engine, Cloud Build and Artifact Registry APIs.](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component,storage-component.googleapis.com,bigquery.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com)
 - Follow the "Configuring your project" instructions from the Vertex Pipelines documentation.
 - If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).
 - Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.

## Set your project ID
### If you don't know your project ID, you may be able to get your project ID using gcloud.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "<YOUR_GCP_PROJECT_HERE>"

## Timestamp
If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

## Authenticate your Google Cloud account
### **If you are using Google Cloud Notebooks or a GCE VM, your environment is already authenticated with the default service account. Skip this step.**

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

Otherwise, follow these steps:
 - In the Cloud Console, go to the [Create service account key page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).
 - Click **Create service account**.
 - In the **Service account** name field, enter a name, and click **Create**.
 - In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**. Type "BigQuery Admin" into the filter box, and select **BigQuery Admin**.
 - Click Create. A JSON file that contains your key downloads to your local environment.
 - Enter the path to your service account key as the **GOOGLE_APPLICATION_CREDENTIALS** variable in the cell below and run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

## Create a Cloud Storage bucket as necessary
You will need a Cloud Storage bucket for this example. If you don't have one that you want to use, you can make one now.  
The bucket will be used to store the following artifacts:
 - Temporary files
 - NVTabular workflow files
 - Transformed and preprocessed files

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may not use a Multi-Regional Storage bucket for training with Vertex AI.

In [None]:
BUCKET_NAME = "gs://<YOUR-BUCKET-NAME>" # <------- Change HERE
REGION = "us-central1"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://<YOUR-BUCKET-NAME>":
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Uncomment and run the following cell to create your Cloud Storage bucket.

In [None]:
# ! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_NAME

## Build the docker image
In order to execute each step in the pipeline, you need to build a docker image with the required packages and push it to Artifact Registry.  
Follow these steps to create an Artifact Registry for your docker images:

1) [Choose a shell](https://cloud.google.com/artifact-registry/docs/docker/quickstart) to execute the gcloud commands
2) Define the name of your registry in the next cell.

In [None]:
REG_NAME = 'quickstart-docker-repo' # <----- Change HERE

3) Run the following command to create a new Docker repository in the location us-central1 with the description "docker repository".

In [None]:
! gcloud artifacts repositories create {REG_NAME} --repository-format=docker \
--location=us-central1 --description="Docker repository"

4) Run the following command to verify that your repository was created.

In [None]:
! gcloud artifacts repositories list

5) **IMPORTANT**: [Configure Authentication](https://cloud.google.com/artifact-registry/docs/docker/quickstart#auth)
6) Create a Dockerfile with the next cell and build the image

In [None]:
%%file Dockerfile.test
FROM gcr.io/deeplearning-platform-release/base-cu110

WORKDIR /nvtabular

RUN conda install -c nvidia -c rapidsai -c numba -c conda-forge pynvml dask-cuda nvtabular=0.5.3 cudatoolkit=11.0
RUN pip install google-cloud-aiplatform kfp

ENV LD_LIBRARY_PATH /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION python

7) Define a tag for your docker image, build and push it

In [None]:
IMG_NAME = 'nvt-preprocessing' # <------ Change HERE

IMG_REGION = 'us-central1-docker.pkg.dev'
IMG_TAG = f'{IMG_REGION}/{PROJECT_ID}/{REG_NAME}/{IMG_NAME}'
# Build from the Dockerfile.test file written by the previous cell
! docker build -t $IMG_TAG -f Dockerfile.test .
! docker push $IMG_TAG

8) Make sure your image was pushed to the registry.

In [None]:
! gcloud artifacts docker images list $IMG_TAG

## Quotas

Make sure you have enough quota (GPU, CPU and Memory) for your environment before proceeding.  
On GCP console, visit [IAM & Admin > Quotas](https://console.cloud.google.com/iam-admin/quotas).

## Import libraries and define constants
Define some constants.

In [None]:
USER = '<YOUR-USERNAME-HERE>'  # <---CHANGE THIS
PIPELINE_ROOT = "{}/pipeline_root/{}".format(BUCKET_NAME, USER)

Do some imports:

In [None]:
# Standard
import json

# Google Cloud
from google.cloud import aiplatform

# Kubeflow Pipelines
from kfp.v2 import compiler

# NVTabular
import nvtabular as nvt
from nvtabular.ops import (
    Categorify,
    Clip,
    FillMissing,
    Normalize,
)

# Preprocessing with NVTabular: Vertex AI Pipelines definitions

### Pipeline Components
The components were defined as python functions in the file `kfp_components.py`.  
Each component is annotated with Inputs or Outputs (or both) to keep track of lineage metadata.

The BASE_IMAGE used to execute the components is the same docker image you built a few steps back.

### Pipeline step configuration
Most of the NVTabular preprocessing is executed as steps in the pipeline.  
Some steps require a more robust runtime configuration with more CPU, memory and GPU.  

Each pipeline definition has its own runtime configurations that can be set directly to the component execution.  
In Vertex AI Pipelines, you can set the amount of CPU, memory and GPU in the pipeline specification, like this:

```
    component_being_executed.set_cpu_limit("8") # Number of CPUs
    component_being_executed.set_memory_limit("32G") # Memory quantity
    component_being_executed.set_gpu_limit("1") # Number of GPUs
    component_being_executed.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-t4') # GPU type
```

### Execution order
Each pipeline specification presented in the following cells is self-contained, so you can choose the one that best fits your needs.  
If you want to experiment with Pipeline 2, there is no need to execute the first or the third.

# 1) Pipeline: Input CSV files from GCS

This pipeline will execute the following steps:

*Pipeline 1*: 
 - Read CSV files from GCS
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output to GCS

 Architecture overview:

<img src="../../images/pipeline_1.png" alt="Pipeline" style="height: 60%; width:60%;"/>

The original Criteo dataset is in CSV format, but the recommended format to run the NVTabular preprocessing step is in PARQUET.  
The first step of this pipeline reads the data from GCS, converts it from CSV to PARQUET using the `nvtabular.Dataset.to_parquet` method, and writes the converted data back to GCS.

The second step in the pipeline calculates statistics on the dataset based on the transformations defined in the `nvtabular.Workflow`.  
Finally, the last step is executed to transform the dataset and write back to GCS.

In [None]:
# Import components and pipeline definition
from pipeline_gcs import preprocessing_pipeline_gcs

To convert the CSV to PARQUET you need to pass a list with the column names and a dictionary mapping the column name to the type of the column.  
This data will be used to instantiate the class `nvtabular.Dataset` and then call its `.to_parquet` method.

This dataset has two different types: `int32` and `hex` (hexadecimal strings that should be converted to int32).
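
To make the `hex` type concrete, here is a minimal sketch of one way to turn a hashed hexadecimal string into a signed 32-bit integer. It assumes that values above the signed range wrap around in two's complement; how the conversion is actually performed in `kfp_components.py` is not shown in this notebook, and the helper name `hex_to_int32` is hypothetical.

```python
from typing import Optional


def hex_to_int32(value: Optional[str]) -> Optional[int]:
    """Convert a hexadecimal string of up to 8 digits to a signed 32-bit integer."""
    if value is None or value == "":
        return None  # Missing values stay missing.
    unsigned = int(value, 16)
    if unsigned >= 2**32:
        raise ValueError(f"{value!r} does not fit in 32 bits")
    # Assumption: reinterpret the 32-bit pattern as two's complement.
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned


print(hex_to_int32("0000000a"), hex_to_int32("ffffffff"))
```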

In [None]:
# Columns and dtypes definition
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
columns = ["label"] + cont_names + cat_names

# Specify column dtypes. Note that "hex" means that
# the values will be hexadecimal strings that should
# be converted to int32
cols_dtype = {}
cols_dtype["label"] = 'int32'
for x in cont_names:
    cols_dtype[x] = 'int32'
for x in cat_names:
    cols_dtype[x] = 'hex'

To transform the data using NVTabular you must create a DAG with the transformation steps.  
NVTabular implements several operators to process your dataset. They can be found [HERE](https://nvidia.github.io/NVTabular/v0.4.0/resources/api/ops/index.html).

In this example the DAG performs the following transformations:
 - Categorical features (columns that start with C): Apply [Categorify](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/categorify.html)
 - Continuous features (columns that start with I): Apply [FillMissing](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/fillmissing.html), [Clip](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/clip.html) and [Normalize](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/normalize.html) (in this order).

 The definition of the Workflow will be used as a guide to calculate the necessary statistics, and execute the data transformation.  
 It will be uploaded to GCS for future use.
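
To build intuition for why the pipeline separates a fit step from a transform step, the following plain-Python sketch mimics the semantics of the four operators on toy columns. It is an illustration only, not NVTabular's implementation: the function names are hypothetical, filling missing values with 0 and using the population standard deviation are assumptions, and the first-seen ordering of category IDs is an arbitrary choice.

```python
from statistics import fmean, pstdev


def fit_categorify(values):
    """Fit: give each distinct non-missing value an ID from 1; 0 means missing or unseen."""
    mapping = {}
    for v in values:
        if v is not None and v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def apply_categorify(values, mapping):
    """Transform: look up each value, falling back to 0."""
    return [mapping.get(v, 0) for v in values]


def _fill_and_clip(values, fill_value=0.0, min_value=0.0):
    # FillMissing followed by Clip(min_value=0), in that order.
    return [max(fill_value if v is None else v, min_value) for v in values]


def fit_continuous(values):
    """Fit: compute the mean and standard deviation after filling and clipping."""
    cleaned = _fill_and_clip(values)
    return fmean(cleaned), pstdev(cleaned)


def apply_continuous(values, mean, std):
    """Transform: fill, clip, then normalize with the fitted statistics."""
    return [(v - mean) / std if std > 0 else 0.0 for v in _fill_and_clip(values)]


cat_mapping = fit_categorify(["a", "b", None, "a"])
print(apply_categorify(["b", "z"], cat_mapping))

mean, std = fit_continuous([3, None, -1, 5])
print(apply_continuous([3, None, -1, 5], mean, std))
```

The statistics (`cat_mapping`, `mean`, `std`) are computed once on the training data and then reused unchanged on the validation data, which is what the fit step of each pipeline provides.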

In [None]:
# Transformation pipeline
num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']

# Create and save workflow
workflow = nvt.Workflow(features)
workflow.save('./saved_workflow')

In [None]:
# Upload to GCS
! gsutil cp -r ./saved_workflow $BUCKET_NAME

Next, let's define some variables necessary to execute the pipeline.

In [None]:
BUCKET = BUCKET_NAME[5:] # retrieve the bucket name only, without the "gs://" prefix

train_paths = ['workshop-datasets/criteo_small/day_0'] # Sample training CSV file to be converted
valid_paths = ['workshop-datasets/criteo_small/day_1'] # Sample validation CSV file to be converted
output_path = f'{BUCKET}/converted' # Where to write the converted PARQUET files
workflow_path = f'{BUCKET}/saved_workflow' # Location of the workflow json/pkl files
output_transformed = f'{BUCKET}/transformed_data' # Location to write the transformed data

sep = '\t' # Separator for the CSV file
gpus = '0' # Identifier of the GPU. As you will execute with only 1 GPU, only the first identifier is passed.
           # If you were to execute the pipeline with 4 GPUs, you should use '0,1,2,3'.

recursive = False # Whether the train/valid paths should be traversed recursively
shuffle = None # How to shuffle the dataset both in the conversion from CSV to PARQUET and during transformation.

NVTabular makes it possible to shuffle during dataset creation. This creates a uniformly shuffled dataset that allows the dataloader to load large contiguous chunks of data, which are already randomized across the entire dataset. NVTabular also makes it possible to control the number of chunks that are combined into a batch, providing flexibility when trading off between performance and true randomization. This mechanism is critical when dealing with datasets that exceed CPU memory and individual epoch shuffling is desired during training. Full shuffle of such a dataset can exceed training time for the epoch by several orders of magnitude.
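
The trade-off described above can be sketched in plain Python. The toy generator below shuffles the order of chunks, combines a configurable number of them into a buffer, and shuffles only within that buffer, so no step ever needs the whole dataset in memory. This is an illustration of the idea, not NVTabular's dataloader; `chunked_shuffle` and `chunks_per_buffer` are hypothetical names.

```python
import random
from typing import Iterator, List, Sequence


def chunked_shuffle(
    chunks: Sequence[Sequence[int]], chunks_per_buffer: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield buffers built from randomly chosen chunks, each shuffled internally."""
    rng = random.Random(seed)
    order = list(range(len(chunks)))
    rng.shuffle(order)  # Randomize which chunks end up together.
    for start in range(0, len(order), chunks_per_buffer):
        buffer: List[int] = []
        for idx in order[start:start + chunks_per_buffer]:
            buffer.extend(chunks[idx])
        rng.shuffle(buffer)  # Randomize rows within the combined chunks only.
        yield buffer


# Six contiguous chunks of four row IDs each.
chunks = [list(range(i * 4, (i + 1) * 4)) for i in range(6)]
buffers = list(chunked_shuffle(chunks, chunks_per_buffer=2, seed=42))
print(buffers)
```

Raising `chunks_per_buffer` moves the result closer to a full shuffle at the cost of more memory per buffer.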

In [None]:
# Create a dictionary with all the parameters defined so far
parameter_values = {
    'train_paths': json.dumps(train_paths),
    'valid_paths': json.dumps(valid_paths),
    'output_path': output_path,
    'columns': json.dumps(columns),
    'cols_dtype': json.dumps(cols_dtype),
    'output_transformed': output_transformed,
    'workflow_path': workflow_path,
    'sep': sep,
    'gpus': gpus,
    'recursive': json.dumps(recursive),
    'shuffle': json.dumps(shuffle)
}

In [None]:
# Compile the pipeline.
# This command will validate the pipeline and generate a JSON file with its specifications
PACKAGE_PATH = 'pipeline_gcs.json'
compiler.Compiler().compile(
       pipeline_func=preprocessing_pipeline_gcs,
       package_path=PACKAGE_PATH
)

In [None]:
# Initialize aiplatform SDK client
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=BUCKET_NAME
)

In [None]:
# Submit the job to Vertex AI Pipelines
pipeline_job = aiplatform.PipelineJob(
    display_name='nvt_convert_pipeline_gcs',
    template_path=PACKAGE_PATH,
    enable_caching=False,
    parameter_values=parameter_values,
)

pipeline_job.run()

At the end of the execution, the pipeline will look like this:

<img src="../../images/gcs_vertex_pipeline.png" alt="Pipeline" style="height: 15%; width:15%;"/>

# 2) Pipeline: Source data in BQ

This pipeline will execute the following steps:

*Pipeline 2*: Input Parquet files exported from BigQuery
 - Export Parquet files from a table in BigQuery
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output to GCS

Architecture overview:

 <img src="../../images/pipeline_2.png" alt="Pipeline" style="height: 60%; width:60%;"/>

In this scenario, the dataset is located in a table in BigQuery.  
The recommended format to run the NVTabular preprocessing step is in Parquet.  
The first step of this pipeline exports the data from BigQuery to PARQUET files in GCS.

The second step in the pipeline calculates statistics on the dataset based on the transformations defined in the `nvtabular.Workflow`.  
Finally, the last step is executed to transform the dataset and write back to GCS.
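
One way to express the export in the first step is BigQuery's `EXPORT DATA` statement. The sketch below only builds the SQL text; it assumes the destination is given as a bucket path without the `gs://` prefix, and the helper `build_export_query` is hypothetical. The component actually used by this pipeline is defined in `kfp_components.py` and may rely on a different mechanism, such as an extract job.

```python
def build_export_query(project: str, dataset_id: str, table_id: str, gcs_dir: str) -> str:
    """Build an EXPORT DATA statement that writes a table to Parquet files in GCS."""
    # The destination URI needs a single wildcard so BigQuery can shard the output.
    uri = f"gs://{gcs_dir.strip('/')}/{table_id}-*.parquet"
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='{uri}',\n"
        "  format='PARQUET',\n"
        "  overwrite=true\n"
        ") AS\n"
        f"SELECT * FROM `{project}.{dataset_id}.{table_id}`"
    )


# Hypothetical project and bucket names.
query = build_export_query("my-project", "criteo_pipeline", "train", "my-bucket/bq_converted")
print(query)
```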

In [None]:
# Import components and pipeline definition
from pipeline_bq import preprocessing_pipeline_bq

To transform the data using NVTabular you must create a DAG with the transformation steps.  
NVTabular implements several operators to process your dataset. They can be found [HERE](https://nvidia.github.io/NVTabular/v0.4.0/resources/api/ops/index.html).

In this example the DAG performs the following transformations:
 - Categorical features (columns that start with C): Apply [Categorify](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/categorify.html)
 - Continuous features (columns that start with I): Apply [FillMissing](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/fillmissing.html), [Clip](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/clip.html) and [Normalize](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/normalize.html) (in this order).

 The definition of the Workflow will be used as a guide to calculate the necessary statistics, and execute the data transformation.  
 It will be uploaded to GCS for future use.

In [None]:
# Columns and dtypes definition
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
columns = ["label"] + cont_names + cat_names

# Transformation pipeline
num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']

# Create and save workflow
workflow = nvt.Workflow(features)
workflow.save('./saved_workflow')

In [None]:
# Upload to GCS
! gsutil cp -r ./saved_workflow $BUCKET_NAME

Next, let's define some variables necessary to execute the pipeline.

In [None]:
BUCKET = BUCKET_NAME[5:] # retrieve the bucket name only, without the "gs://" prefix

output_path = f'{BUCKET}/bq_converted' # Where to write the converted PARQUET files
bq_project = PROJECT_ID # Project ID where the dataset is stored
bq_dataset_id = 'criteo_pipeline' # Dataset ID
bq_table_train = 'train' # Table ID for the training data
bq_table_valid = 'valid' # Table ID for the validation data
location = 'US' # Location to process the data in BigQuery

workflow_path = f'{BUCKET}/saved_workflow' # Location of the workflow json/pkl files
output_transformed = f'{BUCKET}/bq_transformed_data' # Location to write the transformed data
gpus = '0' # Identifier of the GPU. As you will execute with only 1 GPU, only the first identifier is passed.
           # If you were to execute the pipeline with 4 GPUs, you should use '0,1,2,3'.

recursive = False # Whether the train/valid paths should be traversed recursively
shuffle = None # How to shuffle the dataset both in the conversion from CSV to PARQUET and during transformation.

In [None]:
# Create a dictionary with all the parameters defined so far
parameter_values = {
    'bq_table_train': bq_table_train,
    'bq_table_valid': bq_table_valid,
    'output_path': output_path,
    'bq_project': bq_project,
    'bq_dataset_id': bq_dataset_id,
    'location': location,
    'gpus': gpus,
    'workflow_path': workflow_path,
    'output_transformed': output_transformed,
    'recursive': json.dumps(recursive),
    'shuffle': json.dumps(shuffle)
}

In [None]:
# Compile the pipeline.
# This command will validate the pipeline and generate a JSON file with its specifications
PACKAGE_PATH = 'pipeline_bq.json'
compiler.Compiler().compile(
       pipeline_func=preprocessing_pipeline_bq,
       package_path=PACKAGE_PATH
)

In [None]:
# Initialize aiplatform SDK client
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=BUCKET_NAME
)

In [None]:
# Submit the job to Vertex AI Pipelines
pipeline_job = aiplatform.PipelineJob(
    display_name='nvt_convert_pipeline_bq',
    template_path=PACKAGE_PATH,
    enable_caching=False,
    parameter_values=parameter_values,
)

pipeline_job.run()

At the end of the execution, the pipeline will look like this:

<img src="../../images/bq_vertex_pipeline.png" alt="Pipeline" style="height: 15%; width:15%;"/>

# 3) Pipeline: Source GCS and output to Feature Store

This pipeline will execute the following steps:

**Pipeline 3**: Input CSV files from GCS and output to Vertex AI Feature Store 
 - Read CSV files from Google Cloud Storage
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output to GCS
 - Load the transformed data to BigQuery
 - Create a Vertex AI Feature Store and load the data from BigQuery

 Architecture overview:

 <img src="../../images/pipeline_3.png" alt="Pipeline" style="height: 70%; width:70%;"/>

In [None]:
# Import components and pipeline definition
from pipeline_gcs_feat import preprocessing_pipeline_gcs_feat

The original Criteo dataset is in CSV format, but the recommended format to run the NVTabular preprocessing step is in PARQUET.  
The first step of this pipeline reads the data from GCS, converts it from CSV to PARQUET using the `nvtabular.Dataset.to_parquet` method, and writes the converted data back to GCS.

The second step in the pipeline calculates statistics on the dataset based on the transformations defined in the `nvtabular.Workflow`.  
Finally, the last step is executed to transform the dataset and write back to GCS.

In [None]:
# Columns and dtypes definition
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
columns = ["label"] + cont_names + cat_names

# Specify column dtypes. Note that "hex" means that
# the values will be hexadecimal strings that should
# be converted to int32
cols_dtype = {}
cols_dtype["label"] = 'int32'
for x in cont_names:
    cols_dtype[x] = 'int32'
for x in cat_names:
    cols_dtype[x] = 'hex'

To transform the data using NVTabular you must create a DAG with the transformation steps.  
NVTabular implements several operators to process your dataset. They can be found [HERE](https://nvidia.github.io/NVTabular/v0.4.0/resources/api/ops/index.html).

In this example the DAG performs the following transformations:
 - Categorical features (columns that start with C): Apply [Categorify](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/categorify.html)
 - Continuous features (columns that start with I): Apply [FillMissing](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/fillmissing.html), [Clip](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/clip.html) and [Normalize](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/normalize.html) (in this order).

 The definition of the Workflow will be used as a guide to calculate the necessary statistics, and execute the data transformation.  
 It will be uploaded to GCS for future use.

In [None]:
# Transformation pipeline
num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']

# Create and save workflow
workflow = nvt.Workflow(features)
workflow.save('./saved_workflow')

In [None]:
# Upload to GCS
! gsutil cp -r ./saved_workflow $BUCKET_NAME

Next, let's define some variables necessary to execute the pipeline.

In [None]:
BUCKET = BUCKET_NAME[5:] # retrieve the bucket name only, without the "gs://" prefix

train_paths = ['workshop-datasets/criteo_small/day_0'] # Sample training CSV file to be converted
valid_paths = ['workshop-datasets/criteo_small/day_1'] # Sample validation CSV file to be converted
output_path = f'{BUCKET}/converted' # Where to write the converted PARQUET files
workflow_path = f'{BUCKET}/saved_workflow' # Location of the workflow json/pkl files
output_transformed = f'{BUCKET}/transformed_data' # Location to write the transformed data

sep = '\t' # Separator for the CSV file
gpus = '0' # Identifier of the GPU. As you will execute with only 1 GPU, only the first identifier is passed.
           # If you were to execute the pipeline with 4 GPUs, you should use '0,1,2,3'.

recursive = False # Whether the train/valid paths should be traversed recursively
shuffle = None # How to shuffle the dataset both in the conversion from CSV to PARQUET and during transformation.

bq_project = PROJECT_ID # Project ID where BQ is located
bq_dataset_id = 'criteo_pipeline' # Dataset ID
bq_dest_table_id = 'transformed_train' # Table ID where the data will be loaded after transformation.

In [None]:
# Create a dictionary with all the parameters defined so far
parameter_values = {
    'train_paths': json.dumps(train_paths),
    'valid_paths': json.dumps(valid_paths),
    'output_path': output_path,
    'columns': json.dumps(columns),
    'cols_dtype': json.dumps(cols_dtype),
    'output_transformed': output_transformed,
    'workflow_path': workflow_path,
    'sep': sep,
    'gpus': gpus,
    'bq_project': bq_project,
    'bq_dataset_id': bq_dataset_id,
    'bq_dest_table_id': bq_dest_table_id,
    'recursive': json.dumps(recursive),
    'shuffle': json.dumps(shuffle)
}

In [None]:
# Compile the pipeline.
# This command will validate the pipeline and generate a JSON file with its specifications
PACKAGE_PATH = 'pipeline_nvt_gcs_feat.json'
compiler.Compiler().compile(
       pipeline_func=preprocessing_pipeline_gcs_feat,
       package_path=PACKAGE_PATH
)

In [None]:
# Initialize aiplatform SDK client
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=BUCKET_NAME
)

In [None]:
# Submit the job to Vertex AI Pipelines
pipeline_job = aiplatform.PipelineJob(
    display_name='pipeline_nvt_gcs_feat',
    template_path=PACKAGE_PATH,
    enable_caching=False,
    parameter_values=parameter_values,
)

pipeline_job.run()

At the end of the execution, the pipeline will look like this:

<img src="../../images/gcs_feat_vertex_pipeline.png" alt="Pipeline" style="height: 15%; width:15%;"/>