In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Using NVTabular for large scale feature engineering on CSV files in Google Cloud Storage

This notebook demonstrates how to do data preprocessing with NVIDIA NVTabular on Vertex AI Pipeline steps using Google Cloud Storage as the data source.  
You will create a pipeline with the following steps:
 - Read CSV files from Google Cloud Storage (GCS)
 - Convert these files to parquet format and write to GCS
 - Define the DAG with transformation steps and create a Workflow
 - Fit the dataset (calculate statistics necessary for data transformation)
 - Transform the data
 - Output transformed parquet files to GCS

The dataset used for this tutorial is the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/).  

Architecture overview:

<img src="./images/pipeline_1.png" alt="Pipeline" style="height: 60%; width:60%;"/>

## Setup

Before executing the preprocessing pipeline, complete the following steps:
 - Define variables: project ID, region, bucket name, and location
 - Build the container that will execute each preprocessing step
 - Import your pipeline and some libraries

### Define variables

In [1]:
# At this point, it is assumed that you already created your project and bucket in GCS.
# Fill the following variables with the information you collected from the previous notebook.
PROJECT_ID = 'renatoleite-mldemos' # ID of your project
REGION = 'us-central1' # Region where the pipeline will be executed
BUCKET = 'renatoleite-criteo-partial' # Bucket name to read the Criteo dataset
LOCATION = 'us' # Location of BigQuery resources
STAGING_BUCKET = f'gs://{BUCKET}/temp' # Temp location for Vertex AI Pipeline files

### Build container image and push to gcr.io (Google Container Registry)

In [2]:
# Docker image name and Dockerfile location
IMAGE_NAME = 'nvt_preprocessing'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
DOCKERFILE = 'src/preprocessing'

import os
os.environ['IMAGE_URI'] = IMAGE_URI

In [None]:
# Build can take up to 1 hour
! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} {DOCKERFILE}

### Import libraries

In [3]:
# Standard
import json
from datetime import datetime

# Google Cloud
from google.cloud import aiplatform

# Kubeflow Pipelines
from kfp.v2 import compiler

# Import components and pipeline definition
from src.pipelines.pipeline_preprocessing_gcs import preprocessing_pipeline_gcs

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

# Vertex AI Pipelines definitions

## Pipeline Components
A pipeline component is a self-contained set of code that performs one step in your ML workflow.  
The components are defined as python functions in the file `src/preprocessing/kfp_components.py`.  
Each component is annotated with Inputs and Outputs to keep track of lineage metadata.

The `base_image` used to execute the components is the same docker image you built a few steps back.

## Pipeline steps configuration
All the NVTabular preprocessing tasks are executed as steps in the pipeline.  
Some steps require a more robust runtime configuration with more CPU, memory, and GPU resources.  

Each pipeline definition has its own runtime configuration, which can be set directly on each component execution.  
In Vertex AI Pipelines, you can set the amount of CPU, memory, and GPU in the pipeline specification, like this:

```python
component_being_executed.set_cpu_limit("8") # Number of CPUs
component_being_executed.set_memory_limit("32G") # Memory quantity
component_being_executed.set_gpu_limit("1") # Number of GPUs
component_being_executed.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-t4') # GPU type
```

More information on how to set these parameters [HERE](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline#specify-machine-type).
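
When several steps need different resources, it can help to keep the settings in one place and apply them with a small helper. The following is a minimal sketch under stated assumptions: the step names, the `STEP_RESOURCES` dictionary, `apply_resources`, and `RecordingTask` are all hypothetical and are not part of this repository. `RecordingTask` is only a stand-in that records calls so the sketch runs without KFP installed; in a real pipeline you would pass the task object returned by the component call.

```python
# Hypothetical per-step resource settings (step names and values are illustrative).
STEP_RESOURCES = {
    "convert_csv_to_parquet": {
        "cpu": "8", "memory": "32G", "gpu": "1", "gpu_type": "nvidia-tesla-t4",
    },
    "export_metadata": {"cpu": "2", "memory": "8G"},
}

GPU_SELECTOR_LABEL = "cloud.google.com/gke-accelerator"


def apply_resources(task, resources):
    """Apply CPU, memory, and optional GPU settings to a pipeline task."""
    task.set_cpu_limit(resources["cpu"])
    task.set_memory_limit(resources["memory"])
    if resources.get("gpu"):
        task.set_gpu_limit(resources["gpu"])
        task.add_node_selector_constraint(GPU_SELECTOR_LABEL, resources["gpu_type"])
    return task


class RecordingTask:
    """Stand-in for a pipeline task: records every method call it receives."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args):
            self.calls.append((name, args))
            return self
        return record


gpu_task = apply_resources(RecordingTask(), STEP_RESOURCES["convert_csv_to_parquet"])
cpu_task = apply_resources(RecordingTask(), STEP_RESOURCES["export_metadata"])
for call in gpu_task.calls:
    print(call)
```

Keeping the GPU settings optional lets lightweight steps skip the accelerator entirely.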

# Pipeline steps

## Step 1) Convert CSV files to Parquet with NVTabular

The original Criteo dataset is in CSV format, but the recommended data format for running NVTabular preprocessing with the best possible performance is Parquet, a compressed, column-oriented file format. While NVTabular also supports reading from CSV files, doing so can be over twice as slow as reading from Parquet.  

This step in the pipeline will read the data from GCS, convert from CSV to PARQUET using the `nvtabular.Dataset.to_parquet` method, and write the converted data back to GCS.

To convert the CSV to PARQUET you need to pass a dictionary mapping each column name to its data type.  
This dictionary is used to instantiate the class `nvtabular.Dataset`, on which the `.to_parquet` method is then called.

This dataset has two different data types: `int32` and `hex` (hexadecimal strings that should be converted to int32).

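```python
import numpy as np


# The function name is illustrative; the original snippet showed only the body.
def get_criteo_col_dtypes():
    """Return a dictionary mapping each Criteo column name to its dtype."""
    # Specify column dtypes. Note that "hex" means that
    # the values will be hexadecimal strings that should
    # be converted to int32
    col_dtypes = {}

    col_dtypes["label"] = np.int32
    for x in ["I" + str(i) for i in range(1, 14)]:
        col_dtypes[x] = np.int32
    for x in ["C" + str(i) for i in range(1, 27)]:
        col_dtypes[x] = 'hex'

    return col_dtypes
```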
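
To make the `hex` type more concrete, here is a minimal pure-Python sketch of turning an 8-character hexadecimal string into a signed 32-bit integer. The helper `hex_to_int32` and the sample values are hypothetical. Wrapping values above 2^31 - 1 into the negative range is an assumption of this sketch; the actual reader used by NVTabular may handle such values differently.

```python
def hex_to_int32(value: str) -> int:
    """Parse a hexadecimal string and reinterpret it as a signed 32-bit integer."""
    unsigned = int(value, 16) & 0xFFFFFFFF  # keep only the low 32 bits
    if unsigned >= 0x80000000:
        # Assumption: values with the top bit set wrap to negative (two's complement)
        return unsigned - 0x100000000
    return unsigned


samples = ["0000000a", "68fd1e64", "ffffffff"]
converted = [hex_to_int32(s) for s in samples]
print(converted)
```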

## Step 2) Create Workflow DAG definition for statistics calculation and data transformation

To preprocess the data, first we need to define a transformation pipeline as a DAG (directed acyclic graph).  
Each transformation step in the transformation pipeline executes multiple calculations, called `ops`.  
An op is applied to a ColumnGroup with the overloaded `>>` operator, which returns a new ColumnGroup. A ColumnGroup is a list of column names given as strings.

Example:  
`features = [column_name, ...] >> op1 >> op2 >> ...`
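
The chaining syntax relies on Python operator overloading. The sketch below is a toy illustration of that mechanism only; `ToyColumnGroup` and `ToyOp` are hypothetical classes and do not reflect how NVTabular implements its column groups or ops internally.

```python
class ToyColumnGroup:
    """Toy stand-in: a list of column names plus the ops registered on them."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = tuple(ops)

    def __rshift__(self, op):
        # group >> op returns a NEW group with the op appended (nothing runs yet)
        return ToyColumnGroup(self.columns, self.ops + (op,))


class ToyOp:
    """Toy stand-in for an op; it only carries a name."""

    def __init__(self, name):
        self.name = name

    def __rrshift__(self, columns):
        # [col, ...] >> op lands here because a plain list does not define >>
        return ToyColumnGroup(columns) >> self


features = ["I1", "I2"] >> ToyOp("FillMissing") >> ToyOp("Clip") >> ToyOp("Normalize")
print(features.columns, [op.name for op in features.ops])
```

Each `>>` only registers an op; no data is touched at this point, which matches the idea of defining the DAG before executing it.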

Here are some examples of ops implemented in NVTabular:
 - Filtering outliers or missing values, or creating new features indicating that a value is missing;
 - Imputing and filling in missing data;
 - Discretization or bucketing of continuous features;
 - Creating features by splitting or combining existing features, for example, breaking down a date column into day-of-week, month-of-year, day-of-month features;
 - Normalizing numerical features to have zero mean and unit variance or applying transformations, for example with log transform;
 - Encoding discrete features using one-hot vectors or converting them to continuous integer indices.  
 
The list of all ops can be found [HERE](https://nvidia.github.io/NVTabular/main/api/ops/index.html).

In this example, the DAG performs the following transformations:
 - Categorical features (columns that start with C): Apply [Categorify](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/categorify.html)
 - Continuous features (columns that start with I): Apply [FillMissing](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/fillmissing.html), [Clip](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/clip.html) and [Normalize](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/normalize.html) (in this order).

 The definition of the `nvt.Workflow` will be used as a guide to calculate the necessary statistics, and execute the data transformation.  
 It will be uploaded to GCS for future use.
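
To illustrate what the continuous-feature chain computes, here is a minimal pure-Python sketch applied to a handful of values. The helper functions are hypothetical and only mimic the idea of each op. Filling missing values with 0 and normalizing with the population standard deviation are assumptions of this sketch, not a statement about NVTabular's defaults.

```python
import math


def fill_missing(values, fill_value=0.0):
    """Replace missing entries (None) with a constant."""
    return [fill_value if v is None else v for v in values]


def clip(values, min_value=0.0):
    """Raise every value below min_value up to min_value."""
    return [max(v, min_value) for v in values]


def normalize(values):
    """Shift and scale to zero mean and unit variance (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


raw = [3.0, None, -2.0, 5.0]
filled = fill_missing(raw)      # missing value replaced
clipped = clip(filled)          # negative value raised to 0
normalized = normalize(clipped)
print(filled, clipped, normalized)
```

The order matters: clipping before normalizing means the mean and standard deviation are computed on the already-clipped values.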

### Definition of the transformation pipeline

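```python
from nvtabular.ops import Categorify, Clip, FillMissing, Normalize

# Criteo column names: categorical C1..C26 and continuous I1..I13
cat_names = ["C" + str(i) for i in range(1, 27)]
cont_names = ["I" + str(i) for i in range(1, 14)]

# Transformation pipeline
num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']
```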

<img src="./images/dag_preprocessing.png" alt="Pipeline" style="height: 40%; width:40%;"/>

NVTabular is designed to minimize the number of passes through the data. This is achieved with a lazy execution strategy. Data operations are not executed until an explicit apply phase.  
In the first phase, these operations are only registered into the workflow. This allows NVTabular to optimize the collection of statistics that require iteration over the entire dataset.

When processing terabyte-scale datasets, it is critical to plan this statistics-gathering phase as well as transformation phase carefully in advance and avoid unnecessary passes through the data.  
NVTabular requires at most N passes through the data, where N is the number of chained operations. In practice the number is often lower, because lazy execution lets NVTabular plan and optimize the workflow.  
Other libraries, such as cuDF and pandas, due to their eager execution nature, do not allow workflow optimization and can iterate through the whole dataset as many times as the number of operations.
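
The difference between eager and planned statistics gathering can be sketched in plain Python. This is a toy illustration, not NVTabular's scheduler: `CountingDataset`, `eager_stats`, and `single_pass_stats` are hypothetical names, and the dataset is a small in-memory list.

```python
class CountingDataset:
    """Wraps a list and counts how many times it is iterated end to end."""

    def __init__(self, rows):
        self.rows = rows
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self.rows)


def eager_stats(dataset):
    """Each statistic triggers its own full pass over the data."""
    total = sum(x for x in dataset)
    count = sum(1 for _ in dataset)
    maximum = max(x for x in dataset)
    return {"mean": total / count, "max": maximum}


def single_pass_stats(dataset):
    """All statistics are known up front, so one pass can update them together."""
    total, count, maximum = 0.0, 0, float("-inf")
    for x in dataset:
        total += x
        count += 1
        maximum = max(maximum, x)
    return {"mean": total / count, "max": maximum}


eager_data = CountingDataset([1.0, 4.0, 2.0, 5.0])
planned_data = CountingDataset([1.0, 4.0, 2.0, 5.0])
eager_result = eager_stats(eager_data)
planned_result = single_pass_stats(planned_data)
print(eager_data.passes, planned_data.passes)
```

Both functions return the same statistics; only the number of passes differs, which is what matters when each pass reads terabytes from storage.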


The Workflow is first ``fit`` by calculating statistics on the dataset. Once fit, it can ``transform`` datasets by applying these statistics.
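
The fit-then-transform pattern can be illustrated with a toy categorical encoder. `ToyCategorify` is a hypothetical class written for this sketch. Reserving the id 0 for values that were not seen during `fit` is an assumption here and may not match the behavior of NVTabular's `Categorify`.

```python
class ToyCategorify:
    """Toy encoder: learns a vocabulary in fit() and applies it in transform()."""

    def __init__(self):
        self.vocab = {}

    def fit(self, values):
        # Statistics phase: assign contiguous ids starting at 1
        for value in sorted(set(values)):
            self.vocab[value] = len(self.vocab) + 1
        return self

    def transform(self, values):
        # Transformation phase: look up ids; unseen values map to 0 (assumption)
        return [self.vocab.get(value, 0) for value in values]


train = ["a1", "b2", "a1", "c3"]
valid = ["b2", "zz", "a1"]

encoder = ToyCategorify().fit(train)   # statistics come from the training data only
train_ids = encoder.transform(train)
valid_ids = encoder.transform(valid)
print(train_ids, valid_ids)
```

Fitting on the training data and reusing the same statistics for the validation data keeps both datasets encoded consistently.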

## Step 3) 

### Parameter values for pipeline execution

In [4]:
WORKFLOW_PATH = f'{BUCKET}/workflow' # Where to write the calculated workflow
OUTPUT_CONVERTED = f'{BUCKET}/converted' # Location to write the converted Parquet data
OUTPUT_TRANSFORMED = f'{BUCKET}/transformed_data' # Location to write the transformed data

train_paths = ['gs://workshop-datasets/criteo/day_0'] # Sample training CSV file to be converted to parquet
valid_paths = ['gs://workshop-datasets/criteo/day_1'] # Sample validation CSV file to be converted to parquet

sep = '\t' # Separator for the CSV file
recursive = False # Whether the train/valid paths should be navigated recursively

NVTabular provides an option to shuffle the dataset before writing it to disk.  
The uniformly shuffled dataset enables the data loader to read in contiguous chunks of data that are already randomized across the entire dataset.
NVTabular provides the option to control the number of chunks that are combined into a batch, allowing the end user flexibility when trading off between performance and true randomization.  
This mechanism is critical when dealing with datasets that exceed CPU memory and per-epoch shuffling is desired during training.  
Full shuffling of such a dataset can exceed training time for the epoch by several orders of magnitude.
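
The trade-off between shuffling inside chunks and shuffling everything can be sketched in plain Python. This is a loose illustration of the idea only: the two helper functions are hypothetical and do not reproduce NVTabular's shuffle modes, and the partitions are small in-memory lists.

```python
import random


def shuffle_within_partitions(partitions, seed=0):
    """Shuffle rows inside each partition; rows never leave their partition."""
    rng = random.Random(seed)
    shuffled = []
    for part in partitions:
        rows = list(part)
        rng.shuffle(rows)
        shuffled.append(rows)
    return shuffled


def full_shuffle(partitions, seed=0):
    """Gather every row in one place and shuffle across the whole dataset."""
    rng = random.Random(seed)
    rows = [row for part in partitions for row in part]
    rng.shuffle(rows)
    return rows


partitions = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
local = shuffle_within_partitions(partitions)
global_rows = full_shuffle(partitions)
print(local)
print(global_rows)
```

The partition-level version only needs one partition in memory at a time, while the full shuffle needs every row gathered first, which is why it becomes expensive for datasets larger than memory.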

In the next cell, choose between: PER_PARTITION, PER_WORKER, FULL. `None` will not shuffle the data.

In [5]:
shuffle = None # How to shuffle the dataset both in the conversion from CSV to PARQUET and during transformation.

### Create dictionary with parameter values

In [6]:
# Create a dictionary with all the parameters defined until now
parameter_values = {
    'train_paths': json.dumps(train_paths),
    'valid_paths': json.dumps(valid_paths),
    'output_converted': OUTPUT_CONVERTED,
    'output_transformed': OUTPUT_TRANSFORMED,
    'workflow_path': WORKFLOW_PATH,
    'sep': sep,
    'recursive': json.dumps(recursive),
    'shuffle': json.dumps(shuffle)
}
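
Lists, booleans, and `None` are passed through `json.dumps` so that every parameter reaches the pipeline as a plain string. The sketch below shows the round trip. The helper `decode_param` is hypothetical, and the assumption that the receiving component decodes with `json.loads` is ours, since the component code is not shown here.

```python
import json


def decode_param(encoded: str):
    """Hypothetical counterpart on the component side: string back to a Python value."""
    return json.loads(encoded)


example_paths = ['gs://workshop-datasets/criteo/day_0']

encoded_paths = json.dumps(example_paths)   # list    -> JSON array string
encoded_recursive = json.dumps(False)       # boolean -> 'false'
encoded_shuffle = json.dumps(None)          # None    -> 'null'

print(encoded_paths, encoded_recursive, encoded_shuffle)
print(decode_param(encoded_paths), decode_param(encoded_recursive), decode_param(encoded_shuffle))
```

Note that a string parameter such as `sep` is passed as-is, so only the JSON-encoded values need decoding.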

## Pipeline execution

### KFP pipeline compilation

In [7]:
# Compile the pipeline.
# This command will validate the pipeline and generate a JSON file with its specifications
PACKAGE_PATH = 'nvt_gcs_pipeline.json'
compiler.Compiler().compile(
    pipeline_func=preprocessing_pipeline_gcs,
    package_path=PACKAGE_PATH
)

### Initialize aiplatform SDK client

In [8]:
# Initialize aiplatform SDK client
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET
)

### Submit job to Vertex AI Pipelines

In [None]:
pipeline_job = aiplatform.PipelineJob(
    display_name=f'{TIMESTAMP}_nvt_gcs_pipeline',
    template_path=PACKAGE_PATH,
    enable_caching=False,
    parameter_values=parameter_values,
)

pipeline_job.run()