<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_hugectr_multi-gpu-offline-inference/nvidia_logo.png" style="width: 90px; float: right;">

# Multi-GPU Offline Inference

## Overview

In HugeCTR version 3.4.1, we provide Python APIs to perform multi-GPU offline inference.
This work leverages the [HugeCTR Hierarchical Parameter Server](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_core_features.html#hierarchical-parameter-server) and enables concurrent execution on multiple devices.
Multi-GPU offline inference currently supports the `Norm` and `Parquet` dataset formats.

This notebook explains how to perform multi-GPU offline inference with the HugeCTR Python APIs.
For more details about the API, see the [HugeCTR Python Interface](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#inference-api) documentation.

## Installation

### Get HugeCTR from NGC

The HugeCTR Python module is preinstalled in the 22.09 and later [Merlin Training Container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-hugectr): `nvcr.io/nvidia/merlin/merlin-hugectr:22.09`.

You can verify that the required libraries are available by running the following command after launching the container.

```bash
$ python3 -c "import hugectr"
```
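
The same check can be made from inside Python without raising on a missing module. The sketch below uses only the standard library; the helper name `module_available` is hypothetical and is not part of HugeCTR.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Modules imported by the cells in this notebook.
    for name in ("hugectr", "mpi4py", "numpy"):
        status = "found" if module_available(name) else "MISSING"
        print(f"{name}: {status}")
```

Locating a module does not import it, so this only confirms that the package is installed, not that it loads correctly on the current GPU setup.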

**Note**: This Python module contains both training APIs and offline inference APIs. For online inference with Triton Inference Server, refer to the [HugeCTR Backend](https://github.com/triton-inference-server/hugectr_backend) documentation.

> If you prefer to build HugeCTR from the source code instead of using the NGC container, refer to the [How to Start Your Development](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development) documentation.

## Data Generation

HugeCTR provides a tool to generate synthetic datasets.
The [Data Generator](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#data-generator-api) class is capable of generating datasets in different formats and with different distributions.
We will generate multi-hot Parquet datasets with a power-law distribution for this notebook:

```python
import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

data_generator_params = DataGeneratorParams(
  format = hugectr.DataReaderType_t.Parquet,
  label_dim = 2,
  dense_dim = 2,
  num_slot = 3,
  i64_input_key = True,
  nnz_array = [2, 1, 3],
  source = "./multi_hot_parquet/file_list.txt",
  eval_source = "./multi_hot_parquet/file_list_test.txt",
  slot_size_array = [10000, 10000, 10000],
  check_type = hugectr.Check_t.Non,
  dist_type = hugectr.Distribution_t.PowerLaw,
  power_law_type = hugectr.PowerLaw_t.Short,
  num_files = 32,
  eval_num_files = 8)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()
```

```
[HCTR][08:59:54.134][INFO][RK0][main]: Generate Parquet dataset
[HCTR][08:59:54.134][INFO][RK0][main]: train data folder: ./multi_hot_parquet, eval data folder: ./multi_hot_parquet, slot_size_array: 10000, 10000, 10000, nnz array: 2, 1, 3, #files for train: 32, #files for eval: 8, #samples per file: 40960, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][08:59:54.136][INFO][RK0][main]: ./multi_hot_parquet exist
[HCTR][08:59:54.140][INFO][RK0][main]: ./multi_hot_parquet/train/gen_0.parquet
[HCTR][08:59:55.615][INFO][RK0][main]: ./multi_hot_parquet/train/gen_1.parquet
[HCTR][08:59:55.850][INFO][RK0][main]: ./multi_hot_parquet/train/gen_2.parquet
[HCTR][08:59:56.078][INFO][RK0][main]: ./multi_hot_parquet/train/gen_3.parquet
[HCTR][08:59:56.311][INFO][RK0][main]: ./multi_hot_parquet/train/gen_4.parquet
[HCTR][08:59:56.534][INFO][RK0][main]: ./multi_hot_parquet/train/gen_5.parquet
[HCTR][08:59:56.770][INFO][RK0][main]: ./multi_hot_parquet/train/gen_6.parquet
[HCTR][08:59:56.959
```
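
To get a feel for the size of the generated dataset, the figures reported by the generator can be combined with a little arithmetic. This is only a sketch: the values are copied from the `DataGeneratorParams` call and the log line above, and the variable names are illustrative.

```python
# Values taken from the DataGeneratorParams call and the generator log.
num_files = 32            # training files
eval_num_files = 8        # evaluation files
samples_per_file = 40960  # reported in the log as "#samples per file"
nnz_array = [2, 1, 3]     # number of keys per slot in every sample

train_samples = num_files * samples_per_file
eval_samples = eval_num_files * samples_per_file
keys_per_sample = sum(nnz_array)

print("training samples:  ", train_samples)
print("evaluation samples:", eval_samples)
print("sparse keys/sample:", keys_per_sample)
```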

## Train from Scratch

We can train from scratch by performing the following steps with Python APIs:

1. Create the solver, reader and optimizer, then initialize the model.
2. Construct the model graph by adding input, sparse embedding and dense layers in order.
3. Compile the model and have an overview of the model graph.
4. Dump the model graph to a JSON file.
5. Fit the model. The model weights and optimizer states are saved implicitly at the snapshot interval.
6. Dump one batch of evaluation results to files.
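
The `leading_dim` values of the two `Reshape` layers in the training script below follow from the slot layout of the sparse inputs. The sketch below recomputes them; it assumes that the `sum` combiner produces one vector of length `embedding_vec_size` per slot, and the variable names are illustrative rather than HugeCTR identifiers.

```python
embedding_vec_size = 16
dense_dim = 2

# Number of slots per sparse input, as passed to DataReaderSparseParam.
slots = {"data1": 2, "data2": 1}

# One combined embedding vector per slot, flattened by the Reshape layer.
leading_dim = {name: n * embedding_vec_size for name, n in slots.items()}

# The Concat layer joins both reshaped embeddings with the dense features.
concat_width = sum(leading_dim.values()) + dense_dim

print(leading_dim)
print("concat1 width:", concat_width)
```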

```python
%%writefile multi_hot_train.py
import hugectr
from mpi4py import MPI

# 1. Create the solver, reader, and optimizer, then initialize the model.
solver = hugectr.CreateSolver(model_name = "multi_hot",
                              max_eval_batches = 1,
                              batchsize_eval = 131072,
                              batchsize = 16384,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = True,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["./multi_hot_parquet/file_list.txt"],
                                  eval_source = "./multi_hot_parquet/file_list_test.txt",
                                  check_type = hugectr.Check_t.Non,
                                  slot_size_array = [10000, 10000, 10000])
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)

# 2. Construct the model graph: input, sparse embeddings, then dense layers.
model.add(hugectr.Input(label_dim = 2, label_name = "label",
                        dense_dim = 2, dense_name = "dense",
                        data_reader_sparse_param_array =
                        [hugectr.DataReaderSparseParam("data1", [2, 1], False, 2),
                        hugectr.DataReaderSparseParam("data2", 3, False, 1),]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
                            workspace_size_per_gpu_in_mb = 4,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
                            workspace_size_per_gpu_in_mb = 2,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "data2",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=32))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=16))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "reshape2", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu1"],
                            top_names = ["fc2"],
                            num_output=2))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCrossEntropyLoss,
                            bottom_names = ["fc2", "label"],
                            top_names = ["loss"],
                            target_weight_vec = [0.5, 0.5]))

# 3. Compile the model and print an overview of the graph.
model.compile()
model.summary()

# 4. Dump the model graph to a JSON file.
model.graph_to_json("multi_hot.json")

# 5. Fit the model; weights and optimizer states are saved at the snapshot interval.
model.fit(max_iter = 1100, display = 200, eval_interval = 1000, snapshot = 1000, snapshot_prefix = "multi_hot")

# 6. Dump one batch of evaluation results to files.
model.export_predictions("multi_hot_pred_" + str(1000), "multi_hot_label_" + str(1000))
```

```
Overwriting multi_hot_train.py
```

```python
!python3 multi_hot_train.py
```

```
HugeCTR Version: 3.7
[HCTR][09:00:10.032][INFO][RK0][main]: Initialize model: multi_hot
[HCTR][09:00:10.032][INFO][RK0][main]: Global seed is 69819197
[HCTR][09:00:10.135][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][09:00:11.978][INFO][RK0][main]: Start all2all warmup
[HCTR][09:00:11.978][INFO][RK0][main]: End all2all warmup
[HCTR][09:00:11.979][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][09:00:11.980][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][09:00:11.985][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][09:00:11.985][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][09:00:12.176][INFO][RK0][main]: Vocabulary size: 30000
[HCTR][09:00:12.177][INFO][RK0][main]: max_vocabulary_size_per_gpu_=21845
[HCTR][09:00:12.179][INFO][RK0][main]: max_vocabulary_size_per_gpu_=10922
[HCTR][09:00:12.181][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][09:00:43.965][INFO][RK0][main]: gpu0 start to init e
```
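
The inference step that follows loads the files written by the snapshot at iteration 1000. The sketch below builds those file names from the snapshot prefix. The naming pattern is inferred from the file names used later in this notebook, not taken from the HugeCTR documentation, and the helper `snapshot_files` is hypothetical.

```python
def snapshot_files(prefix: str, iteration: int, num_embeddings: int):
    """Build the dense and sparse model file names for one snapshot.

    Assumption: the pattern matches the names used in this notebook,
    for example "multi_hot_dense_1000.model".
    """
    dense = f"{prefix}_dense_{iteration}.model"
    sparse = [
        f"{prefix}{i}_sparse_{iteration}.model" for i in range(num_embeddings)
    ]
    return dense, sparse


# Two sparse embeddings were added to the model, so two sparse files are expected.
dense_model_file, sparse_model_files = snapshot_files("multi_hot", 1000, 2)
print(dense_model_file)
print(sparse_model_files)
```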

## Multi-GPU Offline Inference

We can demonstrate multi-GPU offline inference by performing the following steps with Python APIs:

1. Configure the inference hyperparameters.
2. Initialize the inference model. The model is a collection of inference sessions deployed on multiple devices.
3. Make an inference from the evaluation dataset.
4. Check the correctness of the inference by comparing it with the dumped evaluation results.

**Note**: The `max_batchsize` configured within `InferenceParams` is the global batch size.
The value for `max_batchsize` should be divisible by the number of deployed devices.
The numpy array returned by `InferenceModel.predict` is of the shape `(max_batchsize * num_batches, label_dim)`.
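
The note above can be checked with a few lines of arithmetic before launching inference. This is a sketch using the values from the cell below; `per_device_batch` assumes the global batch is split evenly across devices, which is what the divisibility requirement suggests.

```python
max_batchsize = 16384
deployed_devices = [0, 1, 2, 3, 4, 5, 6, 7]
num_batches = 8
label_dim = 2

# The global batch size should be divisible by the number of devices.
if max_batchsize % len(deployed_devices) != 0:
    raise ValueError("max_batchsize is not divisible by the number of deployed devices")

per_device_batch = max_batchsize // len(deployed_devices)
pred_shape = (max_batchsize * num_batches, label_dim)

print("samples per device and batch:", per_device_batch)
print("expected prediction shape:   ", pred_shape)
```

With these settings the predictions cover 131072 samples, which equals `batchsize_eval * max_eval_batches` in the training script. That is why the result can be compared directly with the single evaluation batch dumped by `export_predictions`.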

```python
import hugectr
from hugectr.inference import InferenceModel, InferenceParams
import numpy as np
from mpi4py import MPI

# 1. Configure the inference hyperparameters.
model_config = "multi_hot.json"
inference_params = InferenceParams(
    model_name = "multi_hot",
    max_batchsize = 16384,
    hit_rate_threshold = 1.0,
    dense_model_file = "multi_hot_dense_1000.model",
    sparse_model_files = ["multi_hot0_sparse_1000.model", "multi_hot1_sparse_1000.model"],
    deployed_devices = [0, 1, 2, 3, 4, 5, 6, 7],
    use_gpu_embedding_cache = True,
    cache_size_percentage = 0.5,
    i64_input_key = True
)

# 2. Initialize the inference model on the deployed devices.
inference_model = InferenceModel(model_config, inference_params)

# 3. Run inference on 8 batches of the evaluation dataset.
pred = inference_model.predict(
    8,
    "./multi_hot_parquet/file_list_test.txt",
    hugectr.DataReaderType_t.Parquet,
    hugectr.Check_t.Non,
    [10000, 10000, 10000]
)

# 4. Compare the predictions with the evaluation results dumped after training.
ground_truth = np.loadtxt("multi_hot_pred_1000")
print("pred: ", pred)
print("ground_truth: ", ground_truth)
diff = pred.flatten() - ground_truth
mse = np.mean(diff * diff)
print("mse: ", mse)
```

```
[HCTR][09:01:06.072][INFO][RK0][main]: Global seed is 3072588155
[HCTR][09:01:06.222][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][09:01:23.763][INFO][RK0][main]: Start all2all warmup
[HCTR][09:01:23.996][INFO][RK0][main]: End all2all warmup
[HCTR][09:01:24.013][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][09:01:24.013][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][09:01:24.013][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][09:01:24.013][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][09:01:24.013][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][09:01:24.347][INFO][RK0][main]: Table: hps_et.multi_hot.sparse_embedding1; cached 19849 / 19849 embeddings in volatile database (PreallocatedHashMapBackend); load: 19849 

pred:  [[0.51329887 0.4888402 ]
 [0.55268604 0.62567735]
 [0.48302165 0.5015869 ]
 ...
 [0.52275413 0.46319592]
 [0.46984023 0.5436093 ]
 [0.48216432 0.48920953]]
ground_truth:  [0.513299 0.48884  0.552686 ... 0.543609 0.482164 0.48921 ]
mse:  8.482603947165404e-14
```
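
An MSE of this magnitude is consistent with rounding alone. The printed values suggest that the dumped predictions carry six decimal places, and rounding to six decimals introduces an error that is roughly uniform in `[-5e-7, 5e-7]`, with a mean squared value of `(1e-6)**2 / 12`, or about `8.3e-14`. The sketch below illustrates this with random numbers from the standard library; the six-decimal precision is an assumption read off the output above, not a documented property of `export_predictions`.

```python
import random

random.seed(0)
decimals = 6  # assumed precision of the dumped prediction file

# Stand-in "predictions": uniform random values in [0, 1).
values = [random.random() for _ in range(100_000)]

# Squared error introduced purely by rounding each value.
squared_errors = [(v - round(v, decimals)) ** 2 for v in values]
simulated_mse = sum(squared_errors) / len(squared_errors)

# Variance of a uniform error with width 10**-decimals.
theoretical_mse = (10 ** -decimals) ** 2 / 12

print(f"simulated MSE:   {simulated_mse:.3e}")
print(f"theoretical MSE: {theoretical_mse:.3e}")
```

If the MSE between multi-GPU predictions and the dumped results were far above this level, that would point to a genuine mismatch rather than a formatting effect.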
