<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_hugectr_hps-demo/nvidia_logo.png" style="width: 90px; float: right;">

# Hierarchical Parameter Server Demo

## Overview

In HugeCTR version 3.5, we provide Python APIs for embedding table lookup with the [HugeCTR Hierarchical Parameter Server (HPS)](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_core_features.html#hierarchical-parameter-server).
HPS supports different database backends and GPU embedding caches.

This notebook demonstrates how to use HPS with the HugeCTR Python APIs. As a representative example, the HPS APIs are used together with the ONNX Runtime APIs to build an ensemble inference model: HPS handles the embedding table lookup, and the ONNX model performs the feed-forward pass of the dense network.

1. [Inference with HPS & ONNX](#section-1)
2. [Lookup the Embedding Vector from DLPack](#section-2)
3. [Multi-process inference](#section-3)
4. [Redis Cluster deployment (without TLS/SSL)](#section-4)
5. [Redis Cluster deployment (with TLS/SSL)](#section-5)

## Setup

To set up the environment, refer to [HugeCTR Example Notebooks](../notebooks) and follow the instructions there before running the following cells.

## Data Generation

HugeCTR provides a tool to generate synthetic datasets. The [Data Generator](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#data-generator-api) is capable of generating datasets of different file formats and different distributions. We will generate one-hot Parquet datasets with power-law distribution for this notebook:

In [1]:
import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

data_generator_params = DataGeneratorParams(
  format = hugectr.DataReaderType_t.Parquet,
  label_dim = 1,
  dense_dim = 10,
  num_slot = 4,
  i64_input_key = True,
  nnz_array = [1, 1, 1, 1],
  source = "./data_parquet/file_list.txt",
  eval_source = "./data_parquet/file_list_test.txt",
  slot_size_array = [10000, 10000, 10000, 10000],
  check_type = hugectr.Check_t.Non,
  dist_type = hugectr.Distribution_t.PowerLaw,
  power_law_type = hugectr.PowerLaw_t.Short,
  num_files = 16,
  eval_num_files = 4,
  num_samples_per_file = 40960)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()

[HCTR][06:31:47.413][INFO][RK0][main]: Generate Parquet dataset
[HCTR][06:31:47.413][INFO][RK0][main]: train data folder: ./data_parquet, eval data folder: ./data_parquet, slot_size_array: 10000, 10000, 10000, 10000, nnz array: 1, 1, 1, 1, #files for train: 16, #files for eval: 4, #samples per file: 40960, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][06:31:47.416][INFO][RK0][main]: ./data_parquet exist
[HCTR][06:31:47.423][INFO][RK0][main]: ./data_parquet/train/gen_0.parquet
[HCTR][06:31:50.739][INFO][RK0][main]: ./data_parquet/train/gen_1.parquet
[HCTR][06:31:50.846][INFO][RK0][main]: ./data_parquet/train/gen_2.parquet
[HCTR][06:31:50.929][INFO][RK0][main]: ./data_parquet/train/gen_3.parquet
[HCTR][06:31:51.011][INFO][RK0][main]: ./data_parquet/train/gen_4.parquet
[HCTR][06:31:51.092][INFO][RK0][main]: ./data_parquet/train/gen_5.parquet
[HCTR][06:31:51.171][INFO][RK0][main]: ./data_parquet/train/gen_6.parquet
[HCTR][06:31:51.250][INFO][RK0][main]: ./data_parquet/train

## Train from Scratch

We can train from scratch by performing the following steps with Python APIs:

1. Create the solver, reader and optimizer, then initialize the model.
2. Construct the model graph by adding input, sparse embedding and dense layers in order.
3. Compile the model and have an overview of the model graph.
4. Dump the model graph to the JSON file.
5. Fit the model, save the model weights and optimizer states implicitly.
6. Dump one batch of evaluation results to files.

In [2]:
%%writefile train.py
import os
import hugectr
from mpi4py import MPI
import numpy as np
solver = hugectr.CreateSolver(model_name = "hps_demo",
                              max_eval_batches = 1,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = True,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["./data_parquet/file_list.txt"],
                                  eval_source = "./data_parquet/file_list_test.txt",
                                  check_type = hugectr.Check_t.Non,
                                  slot_size_array = [10000, 10000, 10000, 10000])
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 10, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", [1, 1], True, 2),
                        hugectr.DataReaderSparseParam("data2", [1, 1], True, 2)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 4,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 8,
                            embedding_vec_size = 32,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "data2",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=32))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=64))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "reshape2", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu1"],
                            top_names = ["fc2"],
                            num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["fc2", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.graph_to_json("hps_demo.json")
model.fit(max_iter = 1100, display = 200, eval_interval = 1000, snapshot = 1000, snapshot_prefix = "hps_demo")

ground_truth = model.check_out_tensor("fc2", hugectr.Tensor_t.Evaluate)
np.save("ground_truth.npy", ground_truth)

Writing train.py
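
The layer widths in `train.py` follow from the embedding sizes and the number of slots per table. The short sketch below only redoes that arithmetic in plain Python; it does not call HugeCTR, and the variable names are chosen for illustration.

```python
# Each DataReaderSparseParam in train.py covers two slots with one key per slot.
slots_per_table = 2
dense_dim = 10
embedding_vec_size = {"sparse_embedding1": 16, "sparse_embedding2": 32}

# A Reshape layer flattens (slots, vec_size) into slots * vec_size per sample.
leading_dim = {
    name: slots_per_table * size for name, size in embedding_vec_size.items()
}

# The Concat layer joins both reshaped embeddings with the dense features.
concat_width = sum(leading_dim.values()) + dense_dim

print(leading_dim)   # {'sparse_embedding1': 32, 'sparse_embedding2': 64}
print(concat_width)  # 106
```

These values match the `leading_dim=32` and `leading_dim=64` arguments of the two Reshape layers, and 106 is the input width seen by the first InnerProduct layer.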


In [3]:
!python3 train.py

HugeCTR Version: 23.8
[HCTR][06:32:11.556][INFO][RK0][main]: Initialize model: hps_demo
[HCTR][06:32:11.556][INFO][RK0][main]: Global seed is 2598678435
[HCTR][06:32:11.561][INFO][RK0][main]: Device to NUMA mapping:
[HCTR][06:32:11.642][INFO][RK0][main]:   GPU 0 ->  node 0
[HCTR][06:32:15.564][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 30.0886 
[HCTR][06:32:15.564][INFO][RK0][main]: Start all2all warmup
[HCTR][06:32:15.565][INFO][RK0][main]: End all2all warmup
[HCTR][06:32:15.566][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][06:32:15.567][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][06:32:15.636][INFO][RK0][main]: eval source ./data_parquet/file_list_test.txt max_row_group_size 40960
[HCTR][06:32:15.808][INFO][RK0][main]: train source ./data_parquet/file_list.txt max_row_group_size 40960
[HCTR][06:32:15.810][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][06:32:15.810][INFO][RK0][main]: num of DataReader workers for eval: 1
[HC

## Convert HugeCTR to ONNX

We will convert the saved HugeCTR models to ONNX using the HugeCTR to ONNX Converter. For more information about the converter, refer to the README in the [onnx_converter](https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/onnx_converter) directory of the repository.

To double-check correctness, we convert the model twice: once with the sparse embedding models included in the ONNX graph and once without them.

In [4]:
import hugectr2onnx
hugectr2onnx.converter.convert(onnx_model_path = "hps_demo_with_embedding.onnx",
                            graph_config = "hps_demo.json",
                            dense_model = "hps_demo_dense_1000.model",
                            convert_embedding = True,
                            sparse_models = ["hps_demo0_sparse_1000.model", "hps_demo1_sparse_1000.model"])

hugectr2onnx.converter.convert(onnx_model_path = "hps_demo_without_embedding.onnx",
                            graph_config = "hps_demo.json",
                            dense_model = "hps_demo_dense_1000.model",
                            convert_embedding = False)

[HUGECTR2ONNX][INFO]: Converting Data layer to ONNX
[HUGECTR2ONNX][INFO]: Converting DistributedSlotSparseEmbeddingHash layer to ONNX
[HUGECTR2ONNX][INFO]: Converting DistributedSlotSparseEmbeddingHash layer to ONNX
[HUGECTR2ONNX][INFO]: Converting Reshape layer to ONNX
[HUGECTR2ONNX][INFO]: Converting Reshape layer to ONNX
[HUGECTR2ONNX][INFO]: Converting Concat layer to ONNX
[HUGECTR2ONNX][INFO]: Converting InnerProduct layer to ONNX
[HUGECTR2ONNX][INFO]: Converting ReLU layer to ONNX
[HUGECTR2ONNX][INFO]: Converting InnerProduct layer to ONNX
[HUGECTR2ONNX][INFO]: Converting Sigmoid layer to ONNX
[HUGECTR2ONNX][INFO]: The model is checked!
[HUGECTR2ONNX][INFO]: The model is saved at hps_demo_with_embedding.onnx
[HUGECTR2ONNX][INFO]: Converting Data layer to ONNX
Skip sparse embedding layers in converted ONNX model
[HUGECTR2ONNX][INFO]: Converting DistributedSlotSparseEmbeddingHash layer to ONNX
Skip sparse embedding layers in converted ONNX model
[HUGECTR2ONNX][INFO]: Converting Dis

<a id="section-1"></a>
## 1. Inference with HPS & ONNX

We will make inference by performing the following steps with Python APIs:

1. Configure the HPS hyperparameters. Please refer to [HPS configuration](https://nvidia-merlin.github.io/HugeCTR/main/hugectr_parameter_server.html#inference-parameters-and-embedding-cache-configuration) for detailed configurations.
2. Initialize the HPS object, which is responsible for embedding table lookup.
3. Load the Parquet data.
4. Make inference with the HPS object and the ONNX inference session of `hps_demo_without_embedding.onnx`.
5. Check the correctness by comparing with dumped evaluation results.
6. Make inference with the ONNX inference session of `hps_demo_with_embedding.onnx` (double check).
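
Step 3 shifts each slot's local keys into one global key space before the lookup in step 4. The sketch below illustrates that data flow with NumPy only. The sample keys are hypothetical, and `toy_lookup` is a made-up stand-in that merely mimics the flat output shape that the notebook reshapes after `hps.lookup`; it is not the HPS API.

```python
import numpy as np

slot_size_array = [10000, 10000, 10000, 10000]
# Offset of each slot = total size of all slots before it.
key_offset = np.insert(np.cumsum(slot_size_array), 0, 0)[:-1]

# Hypothetical per-slot keys for two samples (columns = slots 0..3).
local_keys = np.array([[85, 28, 5, 47],
                       [0, 4, 4, 1]], dtype=np.int64)
global_keys = local_keys + key_offset  # broadcast over the slot axis

# Slots 0-1 feed the first embedding table, slots 2-3 the second.
table0_keys = global_keys[:, 0:2].reshape(-1, 2, 1)

def toy_lookup(keys, vec_size):
    """Illustrative stand-in: one flat block of vec_size floats per key."""
    return np.repeat(keys.astype(np.float32), vec_size)

# Flat keys in, flat vectors out, then restore (batch, slots, vec_size).
emb = toy_lookup(table0_keys.flatten(), 16).reshape(2, 2, 16)

print(key_offset)
print(global_keys)
print(emb.shape)
```

Because the offsets are cumulative, keys from different slots can never collide, which is what allows several slots to share one embedding table.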

In [5]:
from hugectr.inference import HPS, ParameterServerConfig, InferenceParams

import pandas as pd
import numpy as np

import onnxruntime as ort

slot_size_array = [10000, 10000, 10000, 10000]
key_offset = np.insert(np.cumsum(slot_size_array), 0, 0)[:-1]
batch_size = 1024

# 1. Configure the HPS hyperparameters
ps_config = ParameterServerConfig(
           emb_table_name = {"hps_demo": ["sparse_embedding1", "sparse_embedding2"]},
           embedding_vec_size = {"hps_demo": [16, 32]},
           max_feature_num_per_sample_per_emb_table = {"hps_demo": [2, 2]},
           inference_params_array = [
              InferenceParams(
                model_name = "hps_demo",
                max_batchsize = batch_size,
                hit_rate_threshold = 1.0,
                dense_model_file = "",
                sparse_model_files = ["hps_demo0_sparse_1000.model", "hps_demo1_sparse_1000.model"],
                deployed_devices = [0],
                use_gpu_embedding_cache = True,
                cache_size_percentage = 0.5,
                i64_input_key = True)
           ])

# 2. Initialize the HPS object
hps = HPS(ps_config)

# 3. Loading the Parquet data.
df = pd.read_parquet("data_parquet/val/gen_0.parquet")
dense_input_columns = df.columns[1:11]
cat_input1_columns = df.columns[11:13]
cat_input2_columns = df.columns[13:15]
dense_input = df[dense_input_columns].loc[0:batch_size-1].to_numpy(dtype=np.float32)
cat_input1 = (df[cat_input1_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[0:2]).reshape((batch_size, 2, 1))
cat_input2 = (df[cat_input2_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[2:4]).reshape((batch_size, 2, 1))

# 4. Make inference from the HPS object and the ONNX inference session of `hps_demo_without_embedding.onnx`.
embedding1 = hps.lookup(cat_input1.flatten(), "hps_demo", 0).reshape(batch_size, 2, 16)
embedding2 = hps.lookup(cat_input2.flatten(), "hps_demo", 1).reshape(batch_size, 2, 32)
sess = ort.InferenceSession("hps_demo_without_embedding.onnx")
res = sess.run(output_names=[sess.get_outputs()[0].name],
               input_feed={sess.get_inputs()[0].name: dense_input,
               sess.get_inputs()[1].name: embedding1,
               sess.get_inputs()[2].name: embedding2})
pred = res[0]

# 5. Check the correctness by comparing with dumped evaluation results.
ground_truth = np.load("ground_truth.npy").flatten()
print("ground_truth: ", ground_truth)

diff = pred.flatten()-ground_truth
mse = np.mean(diff*diff)
print("pred: ", pred)
print("mse between pred and ground_truth: ", mse)

# 6. Make inference with the ONNX inference session of `hps_demo_with_embedding.onnx` (double check).
sess_ref = ort.InferenceSession("hps_demo_with_embedding.onnx")
res_ref = sess_ref.run(output_names=[sess_ref.get_outputs()[0].name],
                   input_feed={sess_ref.get_inputs()[0].name: dense_input,
                   sess_ref.get_inputs()[1].name: cat_input1,
                   sess_ref.get_inputs()[2].name: cat_input2})
pred_ref = res_ref[0]
diff_ref = pred_ref.flatten()-ground_truth
mse_ref = np.mean(diff_ref*diff_ref)
print("pred_ref: ", pred_ref)
print("mse between pred_ref and ground_truth: ", mse_ref)

[HCTR][06:32:40.791][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][06:32:40.791][DEBUG][RK0][main]: Created blank database backend in local memory!
[HCTR][06:32:40.791][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][06:32:40.791][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][06:32:40.791][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][06:32:41.123][INFO][RK0][main]: Table: hps_et.hps_demo.sparse_embedding1; cached 18488 / 18488 embeddings in volatile database (HashMapBackend); load: 18488 / 18446744073709551615 (0.00%).
[HCTR][06:32:41.431][INFO][RK0][main]: Table: hps_et.hps_demo.sparse_embedding2; cached 18470 / 18470 embeddings in volatile database (HashMapBackend); load: 18470 / 18446744073709551615 (0.00%).
[HCTR][06:32:41.431][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][06:32:41.431][INFO][RK0][main]: Creating embedding cache in device 0.
[HCTR][06:32:41.437][INFO][RK0][main]: Model name: hps_

2023-09-20 06:32:41.566238532 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'key_to_indice_hash_all_tables'. It is not used by any node and should be removed from the model.


<a id="section-2"></a>
## 2. Lookup the Embedding Vector from DLPack

We also provide a `lookup_fromdlpack` interface that queries embedding keys stored on the CPU and returns the embedding vectors on either the GPU or the CPU. The workflow is as follows:

1. Create a PyTorch or TensorFlow tensor that stores the embedding keys.
2. Convert the embedding key tensor to a DLPack capsule with the framework's `to_dlpack` function.
3. Create an empty tensor as a buffer to store the embedding vectors.
4. Convert the buffer tensor to a DLPack capsule.
5. Look up the embedding vectors for the embedding keys directly through the `lookup_fromdlpack` interface, which writes them to the embedding vector buffer tensor.
6. If the output capsule is allocated on the GPU, specify a `device_id` in the `lookup_fromdlpack` interface to select the corresponding embedding cache. If it is not specified, device 0 is used by default.
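
The reason a pre-allocated buffer works in steps 3 to 5 is that DLPack shares the underlying memory instead of copying it. The sketch below shows this property with NumPy alone, assuming NumPy 1.22 or later for `np.from_dlpack`. It does not call HPS; the in-place assignment simply plays the role of a producer that fills the buffer.

```python
import numpy as np

# Pre-allocated output buffer, analogous to the empty tensor in step 3.
out_buffer = np.zeros(4, dtype=np.float32)

# Import the buffer through the DLPack protocol; no data is copied.
shared_view = np.from_dlpack(out_buffer)

# Simulate a producer that writes results into the buffer in place.
out_buffer[:] = [1.0, 2.0, 3.0, 4.0]

# The view observes the new values because both objects share one allocation.
print(np.shares_memory(out_buffer, shared_view))
print(shared_view.tolist())
```

The same idea applies to the framework tensors used in the cells below: the capsule passed to `lookup_fromdlpack` refers to the tensor's own storage.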

Note: Make sure that TensorFlow or PyTorch is installed correctly in the `merlin-hugectr` container:

```bash
pip install tensorflow
pip install torch
```

In [6]:
embedding1 = hps.lookup(cat_input1.flatten(), "hps_demo", 0).reshape(batch_size, 2, 16)
embedding2 = hps.lookup(cat_input2.flatten(), "hps_demo", 1).reshape(batch_size, 2, 32)

# 1. Look up from dlpack for Pytorch tensor on CPU
print(" Look up from dlpack for Pytorch tensor")
import torch.utils.dlpack
import os
print("************Look up from pytorch dlpack on CPU")
device = torch.device("cpu")
key = torch.tensor(cat_input1.flatten(),dtype=torch.int64, device=device)
out = torch.empty((1,cat_input1.flatten().shape[0]*16), dtype=torch.float32, device=device)
key_capsule = torch.utils.dlpack.to_dlpack(key)
print("The device type of embedding keys that lookup dlpack from hps interface for embedding table 0 of hps_demo: {}, the keys: {}".format(key.device, key))
out_capsule = torch.utils.dlpack.to_dlpack(out)
# Lookup the embedding vectors from dlpack
hps.lookup_fromdlpack(key_capsule, out_capsule,"hps_demo", 0)
out_put = torch.utils.dlpack.from_dlpack(out_capsule)
print("[The device type of embedding vectors that lookup dlpack from hps interface for embedding table 0 of hps_demo: {}, the vectors: {}\n".format(out_put.device, out_put))
diff = out_put-embedding1.reshape(1,cat_input1.flatten().shape[0]*16)
# Use squared differences so that errors of opposite sign cannot cancel out.
mse = float((diff * diff).mean())
if mse > 1e-4:
    raise RuntimeError("Too large mse between pytorch dlpack on cpu and native HPS lookup api: {}".format(mse))
else:
    print("Pytorch dlpack on cpu  results are consistent with native HPS lookup api, mse: {}".format(mse))
    

# 2. Look up from dlpack for Pytorch tensor on GPU
print("************Look up from pytorch dlpack on GPU")
# Note the call parentheses: the bare function object is always truthy.
cuda_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
key = torch.tensor(cat_input1.flatten(),dtype=torch.int64, device=device)
key_capsule = torch.utils.dlpack.to_dlpack(key)
out = torch.empty((cat_input1.flatten().shape[0]*16), dtype=torch.float32, device=cuda_device)
out_capsule = torch.utils.dlpack.to_dlpack(out)
hps.lookup_fromdlpack(key_capsule, out_capsule,"hps_demo", 0)
out_put = torch.utils.dlpack.from_dlpack(out_capsule)
print("The device type of embedding vectors that lookup dlpack from hps interface for embedding table 0 of hps_demo: {}, the vectors: {}\n\n".format(out_put.device, out_put))
diff = out_put.cpu()-embedding1.reshape(1,cat_input1.flatten().shape[0]*16)
# Use squared differences so that errors of opposite sign cannot cancel out.
mse = float((diff * diff).mean())
if mse > 1e-3:
    raise RuntimeError("Too large mse between pytorch dlpack on GPU and native HPS lookup api: {}".format(mse))
else:
    print("Pytorch dlpack on GPU results are consistent with native HPS lookup api, mse: {}".format(mse))

 Look up from dlpack for Pytorch tensor
************Look up from pytorch dlpack on CPU
The device type of embedding keys that lookup dlpack from hps interface for embedding table 0 of hps_demo: cpu, the keys: tensor([   85, 10028,     0,  ..., 10004,    10, 10000])
[The device type of embedding vectors that lookup dlpack from hps interface for embedding table 0 of hps_demo: cpu, the vectors: tensor([[-0.0307,  0.0264, -0.0294,  ...,  0.0151, -0.0281,  0.0088]])

Pytorch dlpack on cpu  results are consistent with native HPS lookup api, mse: 0.0
************Look up from pytorch dlpack on GPU
The device type of embedding vectors that lookup dlpack from hps interface for embedding table 0 of hps_demo: cuda:0, the vectors: tensor([-0.0307,  0.0264, -0.0294,  ...,  0.0151, -0.0281,  0.0088],
       device='cuda:0')


Pytorch dlpack on GPU results are consistent with native HPS lookup api, mse: 0.0


In [7]:
# 3. Look up from dlpack for tensorflow tensor on CPU
print("Look up from dlpack for Tensorflow tensor")
from tensorflow.python.dlpack import dlpack  
import tensorflow as tf
from tensorflow.python.eager import context
from tensorflow.python.framework import dtypes
print("***************Look up from tensorflow dlpack on CPU**********")
with tf.device('/CPU:0'):
    key_tensor = tf.constant(cat_input2.flatten(),dtype=tf.int64)
    out_tensor = tf.zeros([1, cat_input2.flatten().shape[0]*32],dtype=tf.float32)
    print("The device type of embedding keys that lookup dlpack from hps interface for embedding table 1 of hps_demo: {}, the keys: {}".format(key_tensor.device, key_tensor))
    key_capsule = tf.experimental.dlpack.to_dlpack(key_tensor)
    out_dlcapsule = tf.experimental.dlpack.to_dlpack(out_tensor)
hps.lookup_fromdlpack(key_capsule,out_dlcapsule, "hps_demo", 1)
out = tf.experimental.dlpack.from_dlpack(out_dlcapsule)
print("The device type of embedding vectors that lookup dlpack from hps interface for embedding table 1 of hps_demo: {}, the vectors: {}\n".format(out.device, out))
diff = out-embedding2.reshape(1,cat_input2.flatten().shape[0]*32)
# Use squared differences so that errors of opposite sign cannot cancel out.
mse = tf.reduce_mean(tf.square(diff))
if mse > 1e-3:
    raise RuntimeError("Too large mse between tensorflow dlpack on cpu and native HPS lookup api: {}".format(mse))
else:
    print("tensorflow dlpack on CPU results are consistent with native HPS lookup api, mse: {}".format(mse))
    
# 4. Look up from dlpack for tensorflow tensor on GPU
print("***************Look up from tensorflow dlpack on GPU**********")
with tf.device('/GPU:0'):
    key_tensor = tf.constant(cat_input2.flatten(),dtype=tf.int64)
    out_tensor = tf.zeros([cat_input2.flatten().shape[0]*32],dtype=tf.float32)
    key_capsule = tf.experimental.dlpack.to_dlpack(key_tensor)
    out_dlcapsule = tf.experimental.dlpack.to_dlpack(out_tensor)
hps.lookup_fromdlpack(key_capsule,out_dlcapsule, "hps_demo", 1)
out= tf.experimental.dlpack.from_dlpack(out_dlcapsule)
print("[HUGECTR][INFO] The device type of embedding vectors that lookup dlpack from hps interface for embedding table 1 of wdl: {}, the vectors: {}\n".format(out.device, out))
diff = out-embedding2.reshape(1,cat_input2.flatten().shape[0]*32)
# Use squared differences so that errors of opposite sign cannot cancel out.
mse = tf.reduce_mean(tf.square(diff))
if mse > 1e-3:
    raise RuntimeError("Too large mse between tensorflow dlpack on GPU and native HPS lookup api: {}".format(mse))
else:
    print("tensorflow dlpack on GPU results are consistent with native HPS lookup api, mse: {}".format(mse))

Look up from dlpack for Tensorflow tensor


2023-09-20 06:34:21.729218: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


***************Look up from tensorflow dlpack on CPU**********


2023-09-20 06:34:44.168630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30048 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
2023-09-20 06:34:44.170043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30184 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0
2023-09-20 06:34:44.171618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 30184 MB memory:  -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:0a:00.0, compute capability: 7.0
2023-09-20 06:34:44.173095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 30184 MB memory:  -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id

The device type of embedding keys that lookup dlpack from hps interface for embedding table 1 of hps_demo: /job:localhost/replica:0/task:0/device:CPU:0, the keys: [20005 30047 20004 ... 30001 20037 30001]
The device type of embedding vectors that lookup dlpack from hps interface for embedding table 1 of hps_demo: /job:localhost/replica:0/task:0/device:CPU:0, the vectors: [[ 0.02182689  0.01806355  0.01985828 ...  0.0136845  -0.01738386
  -0.00323257]]

tensorflow dlpack on CPU results are consistent with native HPS lookup api, mse: 0.0
***************Look up from tensorflow dlpack on GPU**********
[HUGECTR][INFO] The device type of embedding vectors that lookup dlpack from hps interface for embedding table 1 of wdl: /job:localhost/replica:0/task:0/device:GPU:0, the vectors: [ 0.02182689  0.01806355  0.01985828 ...  0.0136845  -0.01738386
 -0.00323257]

tensorflow dlpack on GPU results are consistent with native HPS lookup api, mse: 0.0
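
The consistency checks in this section compare two lookups of the same keys. As a minimal sketch of such a check, assuming plain NumPy arrays and made-up values, the example below contrasts a mean of signed differences with a mean squared error. The helper name `mean_squared_error` is chosen for illustration.

```python
import numpy as np

def mean_squared_error(a, b):
    """Mean of squared element-wise differences between two arrays."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(np.mean(diff * diff))

# Made-up reference vectors and a copy with errors of alternating sign.
reference = np.array([0.5, -0.5, 0.25, -0.25])
shifted = reference + np.array([0.1, -0.1, 0.1, -0.1])

signed_mean = float(np.mean(shifted - reference))  # errors cancel out
mse = mean_squared_error(shifted, reference)       # errors accumulate

print(signed_mean)
print(mse)
```

The signed mean stays near zero even though every element is off by 0.1, while the mean squared error is close to 0.01, so only the latter is a reliable threshold for a mismatch.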


<a id="section-3"></a>
## 3. Multi-process inference

A hash map database can be shared between multiple processes. The following example launches three processes that do this through the operating system's shared memory, which is located at `/dev/shm` on most Unix-like systems. We separate the processes into one primary and multiple secondary processes; only the primary process initializes the shared memory database, and the secondary processes wait until it has been fully initialized. Note that inter-process database access is guaranteed to be thread-safe, so it is also possible to implement more complicated initialization and refresh mechanisms for your use case.
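
The primary/secondary handshake can be sketched without HPS, using only the standard library. In the sketch below, `worker` and `run_workers` are illustrative names, the place where a real worker would construct its HPS object is marked with a comment, and the `fork` start method is an assumption that holds on Unix-like systems.

```python
import multiprocessing as mp
import time

def worker(name, initialized, num_workers, results):
    # Secondary workers wait until the primary has finished its setup.
    while name != "primary" and initialized.value == 0:
        time.sleep(0.01)

    # (A real worker would create its HPS object at this point.)

    # `+=` on a shared value is a read-modify-write, so guard it with the lock.
    with initialized.get_lock():
        initialized.value += 1

    # Stay attached until every worker has registered.
    while initialized.value != num_workers:
        time.sleep(0.01)
    results.put(name)

def run_workers(names):
    ctx = mp.get_context("fork")  # assumption: Unix-like operating system
    initialized = ctx.Value("i", 0)
    results = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(n, initialized, len(names), results))
        for n in names
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so that no worker blocks on a full pipe.
    finished = [results.get(timeout=30) for _ in procs]
    for p in procs:
        p.join()
    return initialized.value, sorted(finished)

if __name__ == "__main__":
    print(run_workers(["primary", "secondary-1", "secondary-2"]))
```

The counter serves two purposes: a value above zero tells the secondaries that the primary is ready, and a value equal to the number of workers tells every worker that it is safe to proceed.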

In [8]:
%%writefile multi_process_hps.py
import os
import time
import multiprocessing as mp
import pandas as pd
import numpy as np
import onnxruntime as ort
from hugectr import DatabaseType_t
from hugectr.inference import HPS, ParameterServerConfig, InferenceParams, VolatileDatabaseParams

slot_size_array = [10000, 10000, 10000, 10000]
key_offset = np.insert(np.cumsum(slot_size_array), 0, 0)[:-1]
batch_size = 1024

def create_hps(name, initialized, device_id, num_max_processes):
    print(f'subprocess：{name}（{os.getpid()}）launch...')
    
    # 1. Let secondary processes wait until shared memory is initialized.
    while name != 'primary' and initialized.value == 0:
        print(f'Subprocess {name} awaiting SHM initialization...')
        time.sleep(1)

    # 2. Configure the HPS hyperparameters
    ps_config = ParameterServerConfig(
           emb_table_name = {"hps_demo": ["sparse_embedding1", "sparse_embedding2"]},
           embedding_vec_size = {"hps_demo": [16, 32]},
           max_feature_num_per_sample_per_emb_table = {"hps_demo": [2, 2]},
           inference_params_array = [
              InferenceParams(
                model_name = "hps_demo",
                max_batchsize = batch_size,
                hit_rate_threshold = 1.0,
                dense_model_file = "",
                sparse_model_files = ["hps_demo0_sparse_1000.model", "hps_demo1_sparse_1000.model"],
                device_id=device_id,
                deployed_devices = [device_id],
                use_gpu_embedding_cache = True,
                cache_size_percentage = 0.5,
                i64_input_key = True)
           ],
           volatile_db = VolatileDatabaseParams(
                DatabaseType_t.multi_process_hash_map,  # Use /dev/shm instead of normal memory for storage.
                # Skips initializing model. If we run HPS in multiple processes, only one needs to initialize.
                initialize_after_startup = name == 'primary',
           ))

    # 3. Initialize the HPS object
    hps = HPS(ps_config)
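    # `+=` on a shared value is not atomic, so guard it with the value's lock.
    with initialized.get_lock():
        initialized.value += 1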
    print(f'Subprocess {name} initialized')
    
    # 4. In (1) the secondary processes wait until the primary process has completed initializing
    #    the shared memory. If the last process disconnects, the shared memory is erased.
    #    Therefore, if the processes that are currently attached to the shared memory complete
    #    their program before another process has attached, the contents of the shared memory are
    #    lost and the new process will instead construct an empty shared memory. To avoid this
    #    situation, we have multiple options.
    #
    #   a) Setting `shared_memory_auto_remove = False` in the `VolatileDatabaseParams`
    #      configuration [default: True]. This will prevent the deletion of the shared memory when
    #      the last process disconnects. In other words, revoking this flag allows you to preserve
    #      and use the state of a shared memory across multiple program restarts. However, while
    #      desirable in some situations, this is not the behavior we need here, because this
    #      notebook cell should be allowed to be executed repeatedly without relying on residual
    #      state.
    #
    #   b) Another approach is to ensure that all other processes that should attach have
    #      attached. Here we achieve this by simply monitoring the `initialized` cross-process
    #      counter variable that we used in (1). Once it hits `num_max_processes` we can be sure
    #      that each subprocess has properly connected.
    while initialized.value != num_max_processes:
        print(f'Subprocess {name} await other processes...')
        time.sleep(1)
    
    # 5. Load query data.
    df = pd.read_parquet("data_parquet/val/gen_0.parquet")
    dense_input_columns = df.columns[1:11]
    cat_input1_columns = df.columns[11:13]
    cat_input2_columns = df.columns[13:15]
    dense_input = df[dense_input_columns].loc[0:batch_size-1].to_numpy(dtype=np.float32)
    cat_input1 = (df[cat_input1_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[0:2]).reshape((batch_size, 2, 1))
    cat_input2 = (df[cat_input2_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[2:4]).reshape((batch_size, 2, 1))

    # 6. Make inference from the HPS object and the ONNX inference session of `hps_demo_without_embedding.onnx`.
    embedding1 = hps.lookup(cat_input1.flatten(), "hps_demo", 0,device_id).reshape(batch_size, 2, 16)
    embedding2 = hps.lookup(cat_input2.flatten(), "hps_demo", 1,device_id).reshape(batch_size, 2, 32)
    sess = ort.InferenceSession("hps_demo_without_embedding.onnx")
    res = sess.run(output_names=[sess.get_outputs()[0].name],
                   input_feed={sess.get_inputs()[0].name: dense_input,
                   sess.get_inputs()[1].name: embedding1,
                   sess.get_inputs()[2].name: embedding2})
    pred = res[0]

    # 7. Check the correctness by comparing with dumped evaluation results.
    ground_truth = np.load("ground_truth.npy").flatten()
    print(f'Subprocess {name}; ground_truth: {ground_truth}')
    diff = pred.flatten()-ground_truth
    mse = np.mean(diff*diff)
    print(f'Subprocess {name}; pred: {pred}')
    print(f'Subprocess {name}; mse between pred and ground_truth: {mse}')

    # 8. Make inference with the ONNX inference session of `hps_demo_with_embedding.onnx` (double check).
    sess_ref = ort.InferenceSession("hps_demo_with_embedding.onnx")
    res_ref = sess_ref.run(output_names=[sess_ref.get_outputs()[0].name],
                   input_feed={sess_ref.get_inputs()[0].name: dense_input,
                   sess_ref.get_inputs()[1].name: cat_input1,
                   sess_ref.get_inputs()[2].name: cat_input2})
    pred_ref = res_ref[0]
    diff_ref = pred_ref.flatten()-ground_truth
    mse_ref = np.mean(diff_ref*diff_ref)
    print(f'Subprocess {name}; pred_ref: {pred_ref}')
    print(f'Subprocess {name}; mse between pred_ref and ground_truth: {mse_ref}')

    print(f'Subprocess {name} exiting...')

if __name__ == '__main__':
    # Remove stale shared memory left over from a previous run (if any).
    try:
        os.remove('/dev/shm/hctr_mp_hash_map_database')
    except FileNotFoundError:
        pass
    
    initialized = mp.Value('i', 0)

    # Create sub processes.
    processes = [
        mp.Process(target=create_hps, args=('primary', initialized, 0, 3)),
        mp.Process(target=create_hps, args=('secondary', initialized, 1, 3)),
        mp.Process(target=create_hps, args=('secondary', initialized, 2, 3)),
    ]
    for p in processes:
        p.start()

    # Go to sleep until subprocesses are initialized.
    while initialized.value < len(processes):
        print(f'Main process; awaiting subprocess initialization... So far {initialized.value} initialized...')
        time.sleep(1)
        
    # Wait for subprocesses to exit.
    for i, p in enumerate(processes):
        print(f'Main process; awaiting subprocess {i} to exit...')
        p.join()
    print('Main process; exiting...')

Writing multi_process_hps.py
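
The script above synchronizes its subprocesses through a shared counter: each subprocess increments it once its own setup is done and then waits until the counter reaches the number of processes. The following is a minimal, HPS-free sketch of that pattern. The names `_worker`, `run_barrier_demo` and `passed` are hypothetical, and the explicit `fork` start method is an assumption that keeps the sketch short on Linux.

```python
import multiprocessing as mp
import time

# Assumption: the "fork" start method (Linux/Unix) is available.
_ctx = mp.get_context("fork")


def _worker(initialized, passed, num_processes):
    # Step 1: announce that this process has finished its own setup.
    with initialized.get_lock():
        initialized.value += 1
    # Step 2: poll until every process has announced itself.
    while initialized.value != num_processes:
        time.sleep(0.01)
    # Step 3: only now continue with the real work (here: just count).
    with passed.get_lock():
        passed.value += 1


def run_barrier_demo(num_processes=3):
    """Start the workers and return both counters after they exit."""
    initialized = _ctx.Value("i", 0)
    passed = _ctx.Value("i", 0)
    procs = [
        _ctx.Process(target=_worker, args=(initialized, passed, num_processes))
        for _ in range(num_processes)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return initialized.value, passed.value
```

Calling `run_barrier_demo(3)` returns both counters once all workers have exited. The standard library also offers `multiprocessing.Barrier` for the same purpose; a plain counter is used here because it additionally lets the main process report how many subprocesses are ready so far.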


In [9]:
!python3 multi_process_hps.py

subprocess：primary（1394）launch...
[HCTR][06:48:37.272][INFO][RK0][main]: Creating Multi-Process HashMap CPU database backend...
[HCTR][06:48:37.272][INFO][RK0][main]: Connecting to shared memory 'hctr_mp_hash_map_database'...
subprocess：secondary（1396）launch...
Subprocess secondary awaiting SHM initialization...
Main process; awaiting subprocess initialization... So far 0 initialized...
subprocess：secondary（1397）launch...
Subprocess secondary awaiting SHM initialization...
[HCTR][06:48:37.772][INFO][RK0][main]: Connected to shared memory 'hctr_mp_hash_map_database'; OS total = 270453215232 bytes, OS available = 269706559488 bytes, HCTR allocated = 17179869184 bytes, HCTR free = 17179868672 bytes; other processes connected = 0
[HCTR][06:48:37.773][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][06:48:37.773][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][06:48:37.773][DEBUG][RK0][main]: Created raw model loader in local memory!
Subprocess secondary awaiti

<a id="section-4"></a>
## 4. Redis Cluster deployment (without TLS/SSL)
HugeCTR can use Redis clusters as backing storage. The following steps show how to set up a mock Redis / HugeCTR deployment on a single machine. We assume that you have started this notebook in a HugeCTR docker container.

**Step 1: Get + build Redis**

In [10]:
!rm -f 7.0.8.tar.gz && wget https://github.com/redis/redis/archive/7.0.8.tar.gz
!rm -rf redis-7.0.8 && tar -xf 7.0.8.tar.gz && ln -sf redis-7.0.8 redis
!cd redis && make

--2023-09-20 06:49:01--  https://github.com/redis/redis/archive/7.0.8.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/redis/redis/tar.gz/refs/tags/7.0.8 [following]
--2023-09-20 06:49:01--  https://codeload.github.com/redis/redis/tar.gz/refs/tags/7.0.8
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘7.0.8.tar.gz’

7.0.8.tar.gz            [   <=>              ]   2.87M  5.50MB/s    in 0.5s    

2023-09-20 06:49:02 (5.50 MB/s) - ‘7.0.8.tar.gz’ saved [3011655]

cd src && make all
make[1]: Entering directory '/hugectr/notebooks/tmr/redis-7.0.8/src'
./mkreleasehdr.sh: line 2: echo: write error: Broken pipe
    [34

If you see the message `Hint: It's a good idea to run 'make test' ;)` followed by `make[1]: Leaving directory ...`, the compilation should have completed successfully.
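
Besides reading the build log, you can check that the binaries used in the later steps exist. The helper below is a small sketch; the function name `missing_redis_binaries` is hypothetical, and the default path `redis/src` follows the symlink created in the cell above.

```python
import os


def missing_redis_binaries(build_dir="redis/src", names=("redis-server", "redis-cli")):
    """Return the names that are absent or not executable in build_dir."""
    missing = []
    for name in names:
        path = os.path.join(build_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

An empty list from `missing_redis_binaries()` suggests that the build produced both `redis-server` and `redis-cli`.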

**Step 2: Configure a mock Redis cluster**

*WARNING: The following commands will erase all contents of the following directories: `redis-server-1`, `redis-server-2` and `redis-server-3`.*

In [11]:
!mkdir -p redis-server-1 redis-server-2 redis-server-3
!rm -f redis-server-1/* redis-server-2/* redis-server-3/*

!ln -sf $PWD/redis/src/redis-server redis-server-1/redis-server
!ln -sf $PWD/redis/src/redis-server redis-server-2/redis-server
!ln -sf $PWD/redis/src/redis-server redis-server-3/redis-server

In [12]:
%%writefile redis-server-1/redis.conf
daemonize yes
port 7000
cluster-enabled yes
cluster-config-file nodes.conf
appendonly no
save ""

Writing redis-server-1/redis.conf


In [13]:
%%writefile redis-server-2/redis.conf
daemonize yes
port 7001
cluster-enabled yes
cluster-config-file nodes.conf
appendonly no
save ""

Writing redis-server-2/redis.conf


In [14]:
%%writefile redis-server-3/redis.conf
daemonize yes
port 7002
cluster-enabled yes
cluster-config-file nodes.conf
appendonly no
save ""

Writing redis-server-3/redis.conf
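
The three configuration files differ only in the port number. If you want more nodes, the same text can be generated in a loop. This is a sketch with hypothetical helper names (`render_redis_conf`, `render_cluster_confs`); it only builds the file contents and leaves writing them to you.

```python
def render_redis_conf(port):
    """Return the text of a redis.conf like the ones written above."""
    lines = [
        "daemonize yes",
        f"port {port}",
        "cluster-enabled yes",
        "cluster-config-file nodes.conf",
        "appendonly no",
        'save ""',
    ]
    return "\n".join(lines) + "\n"


def render_cluster_confs(base_port=7000, num_nodes=3):
    """Map each config path to its contents, one entry per node."""
    return {
        f"redis-server-{i + 1}/redis.conf": render_redis_conf(base_port + i)
        for i in range(num_nodes)
    }
```

With the defaults, `render_cluster_confs()` yields three entries using ports 7000, 7001 and 7002, matching the cells above.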


**Step 3: Form Redis cluster**

*WARNING: The following command will shut down any processes named `redis-server` on the current system!*

In [15]:
# Shutdown existing cluster (if any).
!pkill redis-server

# Reset configuration and start 3 Redis servers.
!cd redis-server-1 && rm -f nodes.conf && ./redis-server redis.conf
!cd redis-server-2 && rm -f nodes.conf && ./redis-server redis.conf
!cd redis-server-3 && rm -f nodes.conf && ./redis-server redis.conf

# Form the cluster.
!redis/src/redis-cli \
    --cluster create 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 \
    --cluster-yes

>>> Performing hash slots allocation on 3 nodes...
Master[0] -> Slots 0 - 5460
Master[1] -> Slots 5461 - 10922
Master[2] -> Slots 10923 - 16383
M: fa9bb82124685a6438a696cc1562693ccc815ff0 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: c6d7ad6353bf568d17a147e65b8198ded9d65717 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
M: e26ae6cfbeea8a1e6367444445364d963ae17436 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
.
>>> Performing Cluster Check (using node 127.0.0.1:7000)
M: fa9bb82124685a6438a696cc1562693ccc815ff0 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: e26ae6cfbeea8a1e6367444445364d963ae17436 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
M: c6d7ad6353bf568d17a147e65b8198ded9d65717 127.0.0.1:7001
   slots:[5461-10
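
A Redis cluster has 16384 hash slots, and the output above shows them split into three nearly equal ranges. The sketch below reproduces those ranges with a simple rounding rule. The function name `split_slots` is hypothetical, and whether `redis-cli` uses exactly this rule for other node counts is an assumption.

```python
import math

TOTAL_SLOTS = 16384  # fixed number of hash slots in a Redis cluster


def split_slots(num_masters):
    """Split the slot range into contiguous (first, last) pairs."""
    per_node = TOTAL_SLOTS / num_masters
    ranges, first, cursor = [], 0, 0.0
    for i in range(num_masters):
        # Round half up to the nearest slot index.
        last = math.floor(cursor + per_node - 1 + 0.5)
        # The final master always takes whatever is left.
        if last > TOTAL_SLOTS - 1 or i == num_masters - 1:
            last = TOTAL_SLOTS - 1
        last = max(last, first)
        ranges.append((first, last))
        first = last + 1
        cursor += per_node
    return ranges
```

For three masters this gives `(0, 5460)`, `(5461, 10922)` and `(10923, 16383)`, which is why the second node holds 5462 slots while the other two hold 5461.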

**Step 4: Run HugeCTR**

In [16]:
import os
import time
import multiprocessing as mp
import pandas as pd
import numpy as np
import onnxruntime as ort
from hugectr import DatabaseType_t
from hugectr.inference import HPS, ParameterServerConfig, InferenceParams, VolatileDatabaseParams

slot_size_array = [10000, 10000, 10000, 10000]
key_offset = np.insert(np.cumsum(slot_size_array), 0, 0)[:-1]
batch_size = 1024

print('Launching...')

# 1. Configure the HPS hyperparameters.
ps_config = ParameterServerConfig(
       emb_table_name = {'hps_demo': ['sparse_embedding1', 'sparse_embedding2']},
       embedding_vec_size = {'hps_demo': [16, 32]},
       max_feature_num_per_sample_per_emb_table = {'hps_demo': [2, 2]},
       inference_params_array = [
          InferenceParams(
            model_name = 'hps_demo',
            max_batchsize = batch_size,
            hit_rate_threshold = 1.0,
            dense_model_file = '',
            sparse_model_files = ['hps_demo0_sparse_1000.model', 'hps_demo1_sparse_1000.model'],
            deployed_devices = [0],
            use_gpu_embedding_cache = True,
            cache_size_percentage = 0.5,
            i64_input_key = True)
       ],
       volatile_db = VolatileDatabaseParams(
            DatabaseType_t.redis_cluster,
            address = '127.0.0.1:7000',
            num_partitions = 15,
            num_node_connections = 5,
       ))

# 2. Initialize the HPS object.
hps = HPS(ps_config)
print('HPS initialized')

# 3. Load query data.
df = pd.read_parquet('data_parquet/val/gen_0.parquet')
dense_input_columns = df.columns[1:11]
cat_input1_columns = df.columns[11:13]
cat_input2_columns = df.columns[13:15]
dense_input = df[dense_input_columns].loc[0:batch_size-1].to_numpy(dtype=np.float32)
cat_input1 = (df[cat_input1_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[0:2]).reshape((batch_size, 2, 1))
cat_input2 = (df[cat_input2_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[2:4]).reshape((batch_size, 2, 1))

# 4. Make inference from the HPS object and the ONNX inference session of `hps_demo_without_embedding.onnx`.
embedding1 = hps.lookup(cat_input1.flatten(), 'hps_demo', 0).reshape(batch_size, 2, 16)
embedding2 = hps.lookup(cat_input2.flatten(), 'hps_demo', 1).reshape(batch_size, 2, 32)
sess = ort.InferenceSession('hps_demo_without_embedding.onnx')
res = sess.run(output_names=[sess.get_outputs()[0].name],
               input_feed={sess.get_inputs()[0].name: dense_input,
               sess.get_inputs()[1].name: embedding1,
               sess.get_inputs()[2].name: embedding2})
pred = res[0].flatten()

# 5. Check the correctness by comparing with dumped evaluation results.
ground_truth = np.load("ground_truth.npy").flatten()
print('-------------------------------------------------------------------------------')
print('                         HPS demo without embedding                            ')
print('-------------------------------------------------------------------------------')
print(f'Ground truth: {ground_truth.shape} = {ground_truth}')
print('-------------------------------------------------------------------------------')
print(f'Prediction without embedding: {pred.shape} = {pred}')

diff = pred - ground_truth
mse = np.mean(diff * diff)
print(f'MSE between prediction and ground_truth: {mse}')

# 6. Make inference with the ONNX inference session of `hps_demo_with_embedding.onnx` (double check).
sess_ref = ort.InferenceSession('hps_demo_with_embedding.onnx')
res_ref = sess_ref.run(output_names=[sess_ref.get_outputs()[0].name],
               input_feed={sess_ref.get_inputs()[0].name: dense_input,
               sess_ref.get_inputs()[1].name: cat_input1,
               sess_ref.get_inputs()[2].name: cat_input2})
pred_ref = res_ref[0].flatten()

print('-------------------------------------------------------------------------------')
print('                           HPS demo with embedding                             ')
print('-------------------------------------------------------------------------------')
print(f'Ground truth: {ground_truth.shape} = {ground_truth}')
print('-------------------------------------------------------------------------------')
print(f'Prediction with embedding: {pred_ref.shape} = {pred_ref}')

diff_ref = pred_ref.flatten() - ground_truth
mse_ref = np.mean(diff_ref * diff_ref)
print(f'MSE between prediction and ground_truth: {mse_ref}')

Launching...
HPS initialized
[HCTR][06:54:27.572][INFO][RK0][main]: Creating RedisCluster backend...
[HCTR][06:54:27.577][INFO][RK0][main]: RedisCluster: Connecting via 127.0.0.1:7000...
[HCTR][06:54:27.577][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][06:54:27.577][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][06:54:27.577][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][06:54:27.753][INFO][RK0][main]: Table: hps_et.hps_demo.sparse_embedding1; cached 18488 / 18488 embeddings in volatile database (RedisCluster); load: 18488 / 18446744073709551615 (0.00%).
[HCTR][06:54:27.873][INFO][RK0][main]: Table: hps_et.hps_demo.sparse_embedding2; cached 18470 / 18470 embeddings in volatile database (RedisCluster); load: 18470 / 18446744073709551615 (0.00%).
[HCTR][06:54:30.134][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][06:54:30.134][INFO][RK0][main]: Creating embedding cache in device 0.
[HCTR][06:54:30.140][INFO][RK0][main]: M

2023-09-20 06:54:30.230052244 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'key_to_indice_hash_all_tables'. It is not used by any node and should be removed from the model.
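
The `key_offset` array in the cell above shifts the keys of each slot so that the four slots do not collide in a shared key space. The arithmetic can be checked without NumPy. This is a sketch of the same exclusive prefix sum; `to_global_keys` and `sample_row` are hypothetical names used only for illustration.

```python
from itertools import accumulate

slot_size_array = [10000, 10000, 10000, 10000]

# Exclusive prefix sum: the offset added to the keys of each slot.
key_offset = [0] + list(accumulate(slot_size_array))[:-1]


def to_global_keys(row, offsets):
    """Shift one row of per-slot keys by the matching offsets."""
    return [key + off for key, off in zip(row, offsets)]


sample_row = [7, 7, 7, 7]  # hypothetical per-slot keys
global_keys = to_global_keys(sample_row, key_offset)
print(key_offset)   # [0, 10000, 20000, 30000]
print(global_keys)  # [7, 10007, 20007, 30007]
```

In the notebook, `key_offset[0:2]` is applied to the two slots of the first embedding table and `key_offset[2:4]` to the two slots of the second. Each lookup receives `batch_size * 2` flattened keys, which is why the results are reshaped to `(batch_size, 2, 16)` and `(batch_size, 2, 32)`.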


**Step 5: Shutdown Redis cluster**

In [17]:
!pkill redis-server

<a id="section-5"></a>
## 5. Redis Cluster deployment (with TLS/SSL)
When using Redis as backing storage, HugeCTR can make use of TLS/SSL to encrypt data transfers. In the following steps we set up a small Redis cluster and enable TLS/SSL for it.

**Step 1: Build a TLS/SSL capable distribution of Redis**

In [18]:
!rm -f 7.0.8.tar.gz && wget https://github.com/redis/redis/archive/7.0.8.tar.gz
!rm -rf redis-7.0.8 && tar -xf 7.0.8.tar.gz && ln -sf redis-7.0.8 redis
!cd redis && make BUILD_TLS=yes

--2023-09-20 06:55:14--  https://github.com/redis/redis/archive/7.0.8.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/redis/redis/tar.gz/refs/tags/7.0.8 [following]
--2023-09-20 06:55:14--  https://codeload.github.com/redis/redis/tar.gz/refs/tags/7.0.8
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘7.0.8.tar.gz’

7.0.8.tar.gz            [     <=>            ]   2.87M  3.24MB/s    in 0.9s    

2023-09-20 06:55:15 (3.24 MB/s) - ‘7.0.8.tar.gz’ saved [3011655]

cd src && make all
make[1]: Entering directory '/hugectr/notebooks/tmr/redis-7.0.8/src'
./mkreleasehdr.sh: line 2: echo: write error: Broken pipe
    [34

If you see the message `Hint: It's a good idea to run 'make test' ;)` followed by `make[1]: Leaving directory ...`, the compilation should have completed successfully.

**Step 2: Configure a mock Redis cluster**

Set up the TLS/SSL certificates. You can skip this part if encryption is not needed.

*WARNING: The following commands will erase all contents of the following directories: `test_certs`, `redis-server-1`, `redis-server-2` and `redis-server-3`.*

In [19]:
!mkdir -p test_certs
!rm -f test_certs/*

with open("test_certs/openssl.conf", "w") as f:
    f.write("""[ redis_server ]
keyUsage = digitalSignature, keyEncipherment

[ hugectr_client ]
keyUsage = digitalSignature, keyEncipherment
nsCertType = client""")
    
# Create private keys for CA, Redis server and HugeCTR client.
!openssl genrsa -out test_certs/ca-private.pem 4096
!openssl genrsa -out test_certs/redis-private.pem 4096
!openssl genrsa -out test_certs/hugectr-private.pem 4096

# Create public keys for CA, Redis server and HugeCTR client.
#!openssl rsa -pubout -in test_certs/ca-private.pem -out test_certs/ca-public.pem
#!openssl rsa -pubout -in test_certs/redis-private.pem -out test_certs/redis-public.pem
#!openssl rsa -pubout -in test_certs/hugectr-private.pem -out test_certs/hugectr-public.pem

# Form dummy CA.
!openssl req -new -nodes -sha256 -x509 -subj '/O=NVIDIA Merlin/CN=Certificate Authority' -days 365 \
    -key test_certs/ca-private.pem \
    -out test_certs/ca.crt
    
# Generate certificate for Redis server.
!openssl req -new -sha256 -subj "/O=NVIDIA Merlin/CN=Redis Server" \
    -key test_certs/redis-private.pem | \
        openssl x509 -req -sha256 \
            -CA test_certs/ca.crt \
            -CAkey test_certs/ca-private.pem \
            -CAserial test_certs/redis.ser \
            -CAcreateserial \
            -days 365 \
            -extfile test_certs/openssl.conf -extensions redis_server \
            -out test_certs/redis.crt

# Generate certificate for HugeCTR client.
!openssl req -new -sha256 -subj "/O=NVIDIA Merlin/CN=HugeCTR Redis Client" \
        -key test_certs/hugectr-private.pem | \
        openssl x509 \
            -req -sha256 \
            -CA test_certs/ca.crt \
            -CAkey test_certs/ca-private.pem \
            -CAserial test_certs/hugectr.ser \
            -CAcreateserial \
            -days 365 \
            -extfile test_certs/openssl.conf -extensions hugectr_client \
            -out test_certs/hugectr.crt

Certificate request self-signature ok
subject=O = NVIDIA Merlin, CN = Redis Server
Certificate request self-signature ok
subject=O = NVIDIA Merlin, CN = HugeCTR Redis Client


In [20]:
!mkdir -p redis-server-1 redis-server-2 redis-server-3
!rm -f redis-server-1/* redis-server-2/* redis-server-3/*

!ln -sf $PWD/redis/src/redis-server redis-server-1/redis-server
!ln -sf $PWD/redis/src/redis-server redis-server-2/redis-server
!ln -sf $PWD/redis/src/redis-server redis-server-3/redis-server

!ln -sf $PWD/test_certs/ca.crt redis-server-1/ca.crt
!ln -sf $PWD/test_certs/ca.crt redis-server-2/ca.crt
!ln -sf $PWD/test_certs/ca.crt redis-server-3/ca.crt

!ln -sf $PWD/test_certs/redis-private.pem redis-server-1/private.pem
!ln -sf $PWD/test_certs/redis-private.pem redis-server-2/private.pem
!ln -sf $PWD/test_certs/redis-private.pem redis-server-3/private.pem

!ln -sf $PWD/test_certs/redis.crt redis-server-1/redis.crt
!ln -sf $PWD/test_certs/redis.crt redis-server-2/redis.crt
!ln -sf $PWD/test_certs/redis.crt redis-server-3/redis.crt

In [21]:
%%writefile redis-server-1/redis.conf
daemonize yes
port 0
cluster-enabled yes
cluster-config-file nodes.conf
tls-port 7000
tls-ca-cert-file ca.crt
tls-cert-file redis.crt
tls-key-file private.pem
tls-cluster yes
appendonly no
save ""

Writing redis-server-1/redis.conf


In [22]:
%%writefile redis-server-2/redis.conf
daemonize yes
port 0
cluster-enabled yes
cluster-config-file nodes.conf
tls-port 7001
tls-ca-cert-file ca.crt
tls-cert-file redis.crt
tls-key-file private.pem
tls-cluster yes
appendonly no
save ""

Writing redis-server-2/redis.conf


In [23]:
%%writefile redis-server-3/redis.conf
daemonize yes
port 0
cluster-enabled yes
cluster-config-file nodes.conf
tls-port 7002
tls-ca-cert-file ca.crt
tls-cert-file redis.crt
tls-key-file private.pem
tls-cluster yes
appendonly no
save ""

Writing redis-server-3/redis.conf
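
Compared with the unencrypted setup, each configuration sets `port 0` to disable the plain-text port, moves the listener to `tls-port`, names the certificate files, and enables `tls-cluster` so that traffic between the nodes is encrypted as well. The sketch below checks a configuration text for these settings. The helper names `parse_redis_conf` and `tls_conf_problems` are hypothetical, and the parser only handles the simple `key value` lines used in this notebook.

```python
def parse_redis_conf(text):
    """Parse simple `key value` lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


def tls_conf_problems(text):
    """List the TLS-related settings that look missing or wrong."""
    s = parse_redis_conf(text)
    problems = []
    if s.get("port") != "0":
        problems.append("plain-text port is still enabled")
    if not s.get("tls-port", "").isdigit():
        problems.append("tls-port is missing")
    if s.get("tls-cluster") != "yes":
        problems.append("cluster bus is not encrypted")
    for key in ("tls-ca-cert-file", "tls-cert-file", "tls-key-file"):
        if not s.get(key):
            problems.append(f"{key} is missing")
    return problems
```

Passing the contents of one of the files written above, for instance `tls_conf_problems(open("redis-server-1/redis.conf").read())`, should return an empty list.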


**Step 3: Form Redis cluster**

*WARNING: The following command will shut down any processes named `redis-server` on the current system!*

In [24]:
# Shutdown existing cluster (if any).
!pkill redis-server

# Reset configuration and start 3 Redis servers.
!cd redis-server-1 && rm -f nodes.conf && ./redis-server redis.conf
!cd redis-server-2 && rm -f nodes.conf && ./redis-server redis.conf
!cd redis-server-3 && rm -f nodes.conf && ./redis-server redis.conf

# Form the cluster.
!redis/src/redis-cli \
    --cluster create 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 \
    --cluster-yes \
    --tls \
    --cacert test_certs/ca.crt \
    --cert test_certs/hugectr.crt \
    --key test_certs/hugectr-private.pem

>>> Performing hash slots allocation on 3 nodes...
Master[0] -> Slots 0 - 5460
Master[1] -> Slots 5461 - 10922
Master[2] -> Slots 10923 - 16383
M: a441806db5506b7600ee8ae794fa01dc31ac83c9 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: 6fa93392a396aa3c321736234b7eafc86bb1f979 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
M: 8e9cd68cc229fcb568a84d7358011201b4246046 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
..
>>> Performing Cluster Check (using node 127.0.0.1:7000)
M: a441806db5506b7600ee8ae794fa01dc31ac83c9 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: 8e9cd68cc229fcb568a84d7358011201b4246046 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
M: 6fa93392a396aa3c321736234b7eafc86bb1f979 127.0.0.1:7001
   slots:[5461-1

**Step 4: Run HugeCTR**

In [25]:
import os
import time
import multiprocessing as mp
import pandas as pd
import numpy as np
import onnxruntime as ort
from hugectr import DatabaseType_t
from hugectr.inference import HPS, ParameterServerConfig, InferenceParams, VolatileDatabaseParams

slot_size_array = [10000, 10000, 10000, 10000]
key_offset = np.insert(np.cumsum(slot_size_array), 0, 0)[:-1]
batch_size = 1024

print('Launching...')

# 1. Configure the HPS hyperparameters.
ps_config = ParameterServerConfig(
       emb_table_name = {'hps_demo': ['sparse_embedding1', 'sparse_embedding2']},
       embedding_vec_size = {'hps_demo': [16, 32]},
       max_feature_num_per_sample_per_emb_table = {'hps_demo': [2, 2]},
       inference_params_array = [
          InferenceParams(
            model_name = 'hps_demo',
            max_batchsize = batch_size,
            hit_rate_threshold = 1.0,
            dense_model_file = '',
            sparse_model_files = ['hps_demo0_sparse_1000.model', 'hps_demo1_sparse_1000.model'],
            deployed_devices = [0],
            use_gpu_embedding_cache = True,
            cache_size_percentage = 0.5,
            i64_input_key = True)
       ],
       volatile_db = VolatileDatabaseParams(
            DatabaseType_t.redis_cluster,
            address = '127.0.0.1:7000',
            num_partitions = 15,
            num_node_connections = 5,
            enable_tls = True,
            tls_ca_certificate = 'test_certs/ca.crt',
            tls_client_certificate = 'test_certs/hugectr.crt',
            tls_client_key = 'test_certs/hugectr-private.pem',
            tls_server_name_identification = 'redis.localhost',
       ))

# 2. Initialize the HPS object.
hps = HPS(ps_config)
print('HPS initialized')

# 3. Load query data.
df = pd.read_parquet('data_parquet/val/gen_0.parquet')
dense_input_columns = df.columns[1:11]
cat_input1_columns = df.columns[11:13]
cat_input2_columns = df.columns[13:15]
dense_input = df[dense_input_columns].loc[0:batch_size-1].to_numpy(dtype=np.float32)
cat_input1 = (df[cat_input1_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[0:2]).reshape((batch_size, 2, 1))
cat_input2 = (df[cat_input2_columns].loc[0:batch_size-1].to_numpy(dtype=np.int64) + key_offset[2:4]).reshape((batch_size, 2, 1))

# 4. Make inference from the HPS object and the ONNX inference session of `hps_demo_without_embedding.onnx`.
embedding1 = hps.lookup(cat_input1.flatten(), 'hps_demo', 0).reshape(batch_size, 2, 16)
embedding2 = hps.lookup(cat_input2.flatten(), 'hps_demo', 1).reshape(batch_size, 2, 32)
sess = ort.InferenceSession('hps_demo_without_embedding.onnx')
res = sess.run(output_names=[sess.get_outputs()[0].name],
               input_feed={sess.get_inputs()[0].name: dense_input,
               sess.get_inputs()[1].name: embedding1,
               sess.get_inputs()[2].name: embedding2})
pred = res[0].flatten()

# 5. Check the correctness by comparing with dumped evaluation results.
ground_truth = np.load("ground_truth.npy").flatten()
print('-------------------------------------------------------------------------------')
print('                         HPS demo without embedding                            ')
print('-------------------------------------------------------------------------------')
print(f'Ground truth: {ground_truth.shape} = {ground_truth}')
print('-------------------------------------------------------------------------------')
print(f'Prediction without embedding: {pred.shape} = {pred}')

diff = pred - ground_truth
mse = np.mean(diff * diff)
print(f'MSE between prediction and ground_truth: {mse}')

# 6. Make inference with the ONNX inference session of `hps_demo_with_embedding.onnx` (double check).
sess_ref = ort.InferenceSession('hps_demo_with_embedding.onnx')
res_ref = sess_ref.run(output_names=[sess_ref.get_outputs()[0].name],
               input_feed={sess_ref.get_inputs()[0].name: dense_input,
               sess_ref.get_inputs()[1].name: cat_input1,
               sess_ref.get_inputs()[2].name: cat_input2})
pred_ref = res_ref[0].flatten()

print('-------------------------------------------------------------------------------')
print('                           HPS demo with embedding                             ')
print('-------------------------------------------------------------------------------')
print(f'Ground truth: {ground_truth.shape} = {ground_truth}')
print('-------------------------------------------------------------------------------')
print(f'Prediction with embedding: {pred_ref.shape} = {pred_ref}')

diff_ref = pred_ref.flatten() - ground_truth
mse_ref = np.mean(diff_ref * diff_ref)
print(f'MSE between prediction and ground_truth: {mse_ref}')

Launching...
HPS initialized
[HCTR][07:00:07.643][INFO][RK0][main]: Creating RedisCluster backend...
[HCTR][07:00:07.644][INFO][RK0][main]: RedisCluster: Connecting via 127.0.0.1:7000...
[HCTR][07:00:07.667][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][07:00:07.667][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][07:00:07.667][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][07:00:07.894][INFO][RK0][main]: Table: hps_et.hps_demo.sparse_embedding1; cached 18488 / 18488 embeddings in volatile database (RedisCluster); load: 18488 / 18446744073709551615 (0.00%).
[HCTR][07:00:07.984][INFO][RK0][main]: Table: hps_et.hps_demo.sparse_embedding2; cached 18470 / 18470 embeddings in volatile database (RedisCluster); load: 18470 / 18446744073709551615 (0.00%).
[HCTR][07:00:07.984][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][07:00:07.984][INFO][RK0][main]: Creating embedding cache in device 0.
[HCTR][07:00:07.990][INFO][RK0][main]: M

2023-09-20 07:00:08.022623188 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'key_to_indice_hash_all_tables'. It is not used by any node and should be removed from the model.


**Step 5: Shutdown Redis cluster**

In [26]:
!pkill redis-server