# Using NVTabular for large-scale feature engineering on CSV files in Cloud Storage

This notebook demonstrates how to preprocess CSV data files in Cloud Storage using NVIDIA NVTabular and Vertex AI. The data preprocessing is implemented using Vertex AI Pipelines, which covers the following steps.  

 1. Read CSV files from Cloud Storage.
 2. Convert the CSV files to Parquet format and write them to Cloud Storage.
 3. Fit a pre-defined NVTabular workflow to the training data split to calculate transformation statistics.
 4. Transform the training and validation data splits using the fitted workflow.
 5. Output transformed parquet files to Cloud Storage.


<img src="./images/pipeline_1.png" alt="Pipeline"/>

## NVTabular Overview

[Merlin NVTabular](https://developer.nvidia.com/nvidia-merlin/nvtabular) is a feature engineering and preprocessing library designed to effectively manipulate 
large datasets and significantly reduce data preparation time, as follows:

* Processes large datasets not bound by CPU or GPU memory.
* Accelerates data preprocessing computation on GPUs using the RAPIDS cuDF library.
* Supports multi-node and multi-GPU scaling with Dask-CUDA distributed parallelism.
* Supports tabular data formats, including comma-separated values (CSV) files, Apache Parquet, Apache ORC, and Apache Avro.
* Provides data loaders that are optimized for TensorFlow, PyTorch, and Merlin HugeCTR.
* Includes multi-hot categoricals and vector continuous passing support to ease feature engineering.


To preprocess the data, we need to define a transformation `workflow`.  
Each transformation step in the transformation pipeline executes multiple calculations, called `ops`. 
NVTabular provides a [set of ops](https://nvidia.github.io/NVTabular/main/api/ops/index.html), which include:

 - Filtering outliers or missing values, or creating new features indicating that a value is missing;
 - Imputing and filling in missing data;
 - Discretization or bucketing of continuous features;
 - Creating features by splitting or combining existing features, for example, breaking down a date column into day-of-week, month-of-year, day-of-month features;
 - Normalizing numerical features to have zero mean and unit variance or applying transformations, for example with log transform;
 - Encoding discrete features using one-hot vectors or converting them to continuous integer indices.  

NVTabular processes a dataset, given a pre-defined workflow, in two steps:

1. The `fit` step, where NVTabular computes the statistics required for transforming the data. This step requires at most `N` passes through the data, where `N` is the number of chained operations in the workflow.
2. The `apply` step (exposed as `transform` in the NVTabular workflow API), where NVTabular uses the fitted workflow to process the data.

NVTabular is designed to minimize the number of passes through the data. This is achieved with a lazy execution strategy. Data operations are not executed until an explicit apply phase. This allows NVTabular to optimize the workflow that requires iteration over the entire dataset.
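
The two-step pattern can be illustrated with a minimal pure-Python sketch. The `NormalizeSketch` class is a hypothetical stand-in, not NVTabular's implementation: `fit` makes a single pass over the training chunks to collect statistics, and `transform` reuses those statistics on any data split.

```python
import math
from typing import Iterable, List, Optional


class NormalizeSketch:
    """Toy stand-in for a normalization op (hypothetical, not the NVTabular code)."""

    def __init__(self) -> None:
        self.mean: Optional[float] = None
        self.std: Optional[float] = None

    def fit(self, chunks: Iterable[List[float]]) -> "NormalizeSketch":
        # Single pass over the data: accumulate count, sum and sum of squares.
        count, total, total_sq = 0, 0.0, 0.0
        for chunk in chunks:
            count += len(chunk)
            total += sum(chunk)
            total_sq += sum(x * x for x in chunk)
        self.mean = total / count
        variance = total_sq / count - self.mean ** 2
        self.std = math.sqrt(max(variance, 0.0))
        return self

    def transform(self, chunk: List[float]) -> List[float]:
        # Reuse the statistics computed during fit; no further pass over the training data.
        if self.mean is None or self.std is None:
            raise RuntimeError("fit must be called before transform")
        scale = self.std or 1.0  # guard against division by zero for constant columns
        return [(x - self.mean) / scale for x in chunk]


train_chunks = [[1.0, 2.0], [3.0, 4.0]]
valid_chunks = [[2.0, 6.0]]

op = NormalizeSketch().fit(train_chunks)                 # statistics come from training data only
train_out = [op.transform(c) for c in train_chunks]
valid_out = [op.transform(c) for c in valid_chunks]      # validation reuses the same statistics
```

Fitting on the training split only, and then applying the same statistics to the validation split, mirrors how the pipeline in this notebook separates the analyze and transform steps.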



## Preprocessing Criteo dataset

The Criteo dataset contains over four billion samples spanning 24 CSV files. Each record contains 40 columns: 13 columns are numerical, 26 columns are categorical, and 1 binary target column. See [00-dataset-management.ipynb](00-dataset-management.ipynb) for more details.


### NVTabular preprocessing Workflow for Criteo dataset

In this example, the preprocessing `nvt.Workflow` consists of the following operations:
 - [Categorify](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/categorify.html): applied to categorical columns (columns that start with C). 
 - [FillMissing](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/fillmissing.html): applied to continuous columns (columns that start with I).
 - [Clip](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/clip.html):  applied to continuous columns after FillMissing.
 - [Normalize](https://nvidia.github.io/NVTabular/v0.7.0/resources/api/ops/normalize.html): applied to continuous columns after Clip.
 
 <img src="./images/dag_preprocessing.png" alt="Pipeline" style="height: 50%; width:50%;"/>
 
 The `nvt.Workflow` is created in the `create_criteo_nvt_workflow` method, which can be found in the [src/preprocessing/etl.py](src/preprocessing/etl.py) module. 
 This `nvt.Workflow` is used as a guide to calculate the necessary statistics and to execute the data transformation.  
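
A workflow with these four operations could be sketched as follows. This is an illustration under assumptions, not the contents of `etl.py`: the column names (`I1`..`I13`, `C1`..`C26`, `label`), the function name `build_criteo_workflow_sketch`, and the `Clip(min_value=0)` setting are all hypothetical choices. NVTabular is imported inside the function because it requires a GPU environment.

```python
from typing import List

# Assumed column naming: continuous columns start with I, categorical columns with C.
CONTINUOUS_COLUMNS: List[str] = [f"I{i}" for i in range(1, 14)]
CATEGORICAL_COLUMNS: List[str] = [f"C{i}" for i in range(1, 27)]
LABEL_COLUMNS: List[str] = ["label"]  # assumed name of the binary target column


def build_criteo_workflow_sketch():
    """Chain the four ops described above into an nvt.Workflow (illustrative only)."""
    import nvtabular as nvt  # imported lazily; needs a GPU-enabled environment

    # Categorical columns are encoded as contiguous integer indices.
    cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify()

    # Continuous columns: fill missing values, clip, then normalize.
    cont_features = (
        CONTINUOUS_COLUMNS
        >> nvt.ops.FillMissing()
        >> nvt.ops.Clip(min_value=0)  # assumed lower bound
        >> nvt.ops.Normalize()
    )

    # The label column is passed through unchanged.
    return nvt.Workflow(cat_features + cont_features + LABEL_COLUMNS)
```

In a GPU environment, the returned workflow would typically be fitted with `workflow.fit(train_dataset)` and then applied with `workflow.transform(dataset)`.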
 
 
### Converting CSV files to Parquet with NVTabular

The Criteo dataset is provided in CSV format, but the recommended format for running the NVTabular preprocessing task with the best possible performance is [Parquet](http://parquet.apache.org/documentation/latest/), a compressed, column-oriented file format. While NVTabular also supports reading from CSV files, reading  
Parquet files can be 2x faster than reading CSV files.  

To convert the Criteo CSV data to Parquet, the following steps are performed:

1. Create an `nvt.Dataset` object from the CSV data using the `create_csv_dataset` method in [src/preprocessing/etl.py](src/preprocessing/etl.py).
2. Convert the CSV data to Parquet, and write it to Cloud Storage using the `convert_csv_to_parquet` method in [src/preprocessing/etl.py](src/preprocessing/etl.py).
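
The two steps above might look roughly like the following sketch. It is not the code in `etl.py`: the helper names `criteo_column_names` and `convert_csv_to_parquet_sketch`, the column order (label first, then `I1`..`I13`, then `C1`..`C26`), and the `preserve_files=True` choice are assumptions made for illustration.

```python
from typing import List, Optional


def criteo_column_names() -> List[str]:
    """Assumed column order: label, 13 continuous columns, 26 categorical columns."""
    continuous = [f"I{i}" for i in range(1, 14)]
    categorical = [f"C{i}" for i in range(1, 27)]
    return ["label"] + continuous + categorical


def convert_csv_to_parquet_sketch(
    csv_paths: List[str],
    output_path: str,
    sep: str = "\t",
    shuffle: Optional[object] = None,
) -> None:
    """Read headerless CSV files and write them back out as Parquet (illustrative only)."""
    import nvtabular as nvt  # imported lazily; needs a GPU-enabled environment

    # Step 1: create a dataset over the CSV files, supplying the column names explicitly.
    dataset = nvt.Dataset(
        csv_paths,
        engine="csv",
        sep=sep,
        names=criteo_column_names(),
    )

    # Step 2: write Parquet output, keeping one output file per input file.
    dataset.to_parquet(output_path, preserve_files=True, shuffle=shuffle)
```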

### Implementing the preprocessing pipeline using KFP

[src/pipelines/preprocessing_gcs.py](src/pipelines/preprocessing_gcs.py) defines the KFP pipeline to preprocess the Criteo CSV data. 
A pipeline component is a self-contained set of code that performs one step in your ML workflow.  
The pipeline uses the following components defined in [src/pipelines/components.py](src/pipelines/components.py):

1. `convert_csv_to_parquet_op`: this component converts raw CSV files to Parquet files, and stores them in Cloud Storage. 
2. `analyze_dataset_op`: this component creates a Criteo preprocessing `nvt.Workflow`, fits it to the training data split, and stores it in Cloud Storage.
3. `transform_dataset_op`: this component loads the fitted `nvt.Workflow` from Cloud Storage, uses it to transform an input data split, and stores the transformed data as Parquet files in Cloud Storage.

Each component is annotated with Inputs and Outputs to keep track of lineage metadata.
The `base_image` used to execute the components is defined in [Dockerfile.nvtabular](Dockerfile.nvtabular). 

Each step in the pipeline is configured with the required CPU, memory and GPU configurations, as follows:

```
component_being_executed.set_cpu_limit("8") # Number of CPUs
component_being_executed.set_memory_limit("32G") # Memory quantity
component_being_executed.set_gpu_limit("1") # Number of GPUs
component_being_executed.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-t4') # GPU type
```

See [Specify machine type for a pipeline step](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline#specify-machine-type) for more information.


You can configure the pipeline by setting the variables in the [config.py](config.py) module.


## Setup

In [1]:
PROJECT_ID = 'merlin-on-gcp' # Change to your project Id.
REGION = 'us-central1' # Change to your region.
DATASET_GCS_LOCATION = 'gs://workshop-datasets/criteo'
BUCKET =  'merlin-on-gcp' # Change to your bucket.

In [1]:
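import json
from datetime import datetime
from google.cloud import aiplatform
from kfp.v2 import compiler

# NOTE: in the recorded run, this import raised KeyError: 'IMAGE_URI'.
# The pipeline code appears to read the IMAGE_URI environment variable at import time,
# so set os.environ['IMAGE_URI'] (see the image build section below) before running it.
from src.pipelines.preprocessing_gcs import preprocessing_gcs

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")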

### Build container image and push to gcr.io (Google Container Registry)

In [None]:
# Docker image name and Dockerfile location
IMAGE_NAME = 'nvt_preprocessing'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
DOCKERFILE = 'src/preprocessing'

import os
os.environ['IMAGE_URI'] = IMAGE_URI

In [None]:
# Build can take up to 1 hour
! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} {DOCKERFILE}

### Parameters values for pipeline execution

In [None]:
WORKFLOW_PATH = f'{BUCKET}/workflow' # Where to write the calculated workflow
OUTPUT_CONVERTED = f'{BUCKET}/converted' # Location to write the converted Parquet data
OUTPUT_TRANSFORMED = f'{BUCKET}/transformed_data' # Location to write the transformed data

train_paths = ['gs://workshop-datasets/criteo/day_0'] # Sample training CSV file to be converted to parquet
valid_paths = ['gs://workshop-datasets/criteo/day_1'] # Sample validation CSV file to be converted to parquet

sep = '\t' # Separator for the CSV file
recursive = False # Whether the train/valid paths should be navigated recursively

NVTabular provides an option to shuffle the dataset before storing it to disk.  
A uniformly shuffled dataset enables the data loader to read contiguous chunks of data that are already randomized across the entire dataset.
NVTabular also lets you control the number of chunks that are combined into a batch, which gives flexibility when trading off performance against true randomization.  
This mechanism is critical when the dataset exceeds CPU memory and per-epoch shuffling is desired during training.  
Fully shuffling such a dataset can take several orders of magnitude longer than training for one epoch.

In the next cell, choose one of `PER_PARTITION`, `PER_WORKER`, or `FULL`. `None` leaves the data unshuffled.

In [None]:
shuffle = None # How to shuffle the dataset both in the conversion from CSV to PARQUET and during transformation.
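
The trade-off between performance and randomization described above can be illustrated with a small pure-Python sketch. It is not NVTabular's shuffling code; `chunked_shuffle` and `chunks_per_batch` are hypothetical names. Partitions are read in random order, a few of them are combined into one in-memory batch, and only that batch is shuffled.

```python
import random
from typing import List, Sequence


def chunked_shuffle(
    partitions: Sequence[Sequence[int]],
    chunks_per_batch: int,
    seed: int = 0,
) -> List[int]:
    """Approximate a full shuffle by shuffling groups of contiguous chunks."""
    rng = random.Random(seed)

    # Randomize the order in which partitions (contiguous chunks) are read.
    order = list(range(len(partitions)))
    rng.shuffle(order)

    shuffled: List[int] = []
    for start in range(0, len(order), chunks_per_batch):
        # Combine a few chunks into a batch that fits in memory...
        batch: List[int] = []
        for idx in order[start:start + chunks_per_batch]:
            batch.extend(partitions[idx])
        # ...and shuffle only within that batch.
        rng.shuffle(batch)
        shuffled.extend(batch)
    return shuffled


# Six partitions of four rows each; rows are identified by integers 0..23.
partitions = [list(range(i * 4, (i + 1) * 4)) for i in range(6)]
shuffled = chunked_shuffle(partitions, chunks_per_batch=2)
```

A larger `chunks_per_batch` moves the result closer to a true full shuffle at the cost of holding more rows in memory at once; setting it to the number of partitions shuffles everything in a single batch.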

### Create dictionary with parameter values

In [None]:
# Create a dictionary with all the parameters defined until now
parameter_values = {
    'train_paths': json.dumps(train_paths),
    'valid_paths': json.dumps(valid_paths),
    'output_converted': OUTPUT_CONVERTED,
    'output_transformed': OUTPUT_TRANSFORMED,
    'workflow_path': WORKFLOW_PATH,
    'sep': sep,
    'recursive': json.dumps(recursive),
    'shuffle': json.dumps(shuffle)
}
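
Lists, booleans, and `None` are serialized with `json.dumps` in the dictionary above, presumably so that they can be passed as plain string parameters. A receiving component would then need to decode them. The sketch below shows that round trip; `JSON_ENCODED_KEYS` and `decode_parameters` are hypothetical names and not part of the pipeline code.

```python
import json
from typing import Any, Dict

# Keys whose values were serialized with json.dumps (assumed from the dictionary above).
JSON_ENCODED_KEYS = ("train_paths", "valid_paths", "recursive", "shuffle")


def decode_parameters(params: Dict[str, str]) -> Dict[str, Any]:
    """Decode JSON-encoded values; leave plain strings such as paths untouched."""
    decoded: Dict[str, Any] = dict(params)
    for key in JSON_ENCODED_KEYS:
        if key in decoded:
            decoded[key] = json.loads(decoded[key])
    return decoded


encoded = {
    "train_paths": json.dumps(["gs://workshop-datasets/criteo/day_0"]),
    "recursive": json.dumps(False),  # serialized as the string 'false'
    "shuffle": json.dumps(None),     # serialized as the string 'null'
    "sep": "\t",                     # plain string, not JSON-encoded
}
decoded = decode_parameters(encoded)
```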

## Pipeline execution

### KFP pipeline compilation

In [None]:
# Compile the pipeline.
# This command will validate the pipeline and generate a JSON file with its specifications
PACKAGE_PATH = 'nvt_gcs_pipeline.json'
compiler.Compiler().compile(
       pipeline_func=preprocessing_gcs,
       package_path=PACKAGE_PATH
)
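
Before submitting the job, it can be useful to confirm that the compiled file exists and parses as JSON. The helper below is a minimal sketch with hypothetical names (`load_pipeline_spec`, `top_level_keys`); it makes no assumption about the structure of the compiled specification and is demonstrated on a placeholder file.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_pipeline_spec(package_path: str) -> Dict[str, Any]:
    """Load a compiled pipeline JSON file and check that it is a JSON object."""
    with open(package_path, "r", encoding="utf-8") as f:
        spec = json.load(f)
    if not isinstance(spec, dict):
        raise ValueError(f"{package_path} does not contain a JSON object")
    return spec


def top_level_keys(spec: Dict[str, Any]) -> List[str]:
    """Return the sorted top-level keys, as a quick overview of the file."""
    return sorted(spec)


# Demonstration on a placeholder file; the keys below are made up for the example.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pipeline.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"placeholder_section": {}, "another_section": {}}, f)
    demo_keys = top_level_keys(load_pipeline_spec(demo_path))
```

In the notebook, the same helper could be called as `top_level_keys(load_pipeline_spec(PACKAGE_PATH))` after compilation.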

### Initialize aiplatform SDK client

In [None]:
# Initialize aiplatform SDK client
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f'gs://{BUCKET}/temp' 
)

### Submit job to Vertex AI Pipelines

In [None]:
pipeline_job = aiplatform.PipelineJob(
    display_name=f'{TIMESTAMP}_nvt_gcs_pipeline',
    template_path=PACKAGE_PATH,
    enable_caching=False,
    parameter_values=parameter_values,
)

pipeline_job.run()