<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# HugeCTR Continuous Training and Inference Demo (Part II)

## Inference using Triton

### 1.1 Generate related model folders

In [1]:
import os
from time import time
import re
import shutil
import glob
import warnings

BASE_DIR = "/wdl_infer"
model_folder  = os.path.join(BASE_DIR, "model")
wdl_model_repo= os.path.join(model_folder, "wdl")
wdl_version =os.path.join(wdl_model_repo, "1")

if os.path.isdir(model_folder):
    shutil.rmtree(model_folder)
os.makedirs(model_folder)

if os.path.isdir(wdl_model_repo):
    shutil.rmtree(wdl_model_repo)
os.makedirs(wdl_model_repo)

if os.path.isdir(wdl_version):
    shutil.rmtree(wdl_version)
os.makedirs(wdl_version)

### 1.2 Copy WDL model files and configuration to model repository

In [2]:
!cp -r wdl_0_sparse_model $wdl_version/
!cp -r wdl_1_sparse_model $wdl_version/
!cp  wdl_dense_0.model $wdl_version/
!cp wdl.json $wdl_version/
!ls -l $wdl_version

total 5840
-rw-r--r-- 1 root root    3628 Dec  3 03:36 wdl.json
drwxr-xr-x 2 root root    4096 Dec  3 03:36 wdl_0_sparse_model
drwxr-xr-x 2 root root    4096 Dec  3 03:36 wdl_1_sparse_model
-rw-r--r-- 1 root root 5963780 Dec  3 03:36 wdl_dense_0.model


### 1.3 Generate the Triton configuration for deploying WDL 

In [3]:
%%writefile $wdl_model_repo/config.pbtxt
name: "wdl"
backend: "hugectr"
max_batch_size:64,
input [
   {
    name: "DES"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "CATCOLUMN"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "ROWINDEX"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_GPU
    gpus:[0]
  }
]

parameters [
  {
  key: "config"
  value: { string_value: "/wdl_infer/model/wdl/1/wdl.json" }
  },
  {
  key: "gpucache"
  value: { string_value: "true" }
  },
  {
  key: "hit_rate_threshold"
  value: { string_value: "0.8" }
  },
  {
  key: "refresh_interval"
  value: { string_value: "20" }
  },
  {
  key: "refresh_delay"
  value: { string_value: "0.0" }
  },
  {
  key: "gpucacheper"
  value: { string_value: "0.5" }
  },
  {
  key: "label_dim"
  value: { string_value: "1" }
  },
  {
  key: "slots"
  value: { string_value: "27" }
  },
  {
  key: "cat_feature_num"
  value: { string_value: "28" }
  },
 {
  key: "des_feature_num"
  value: { string_value: "13" }
  },
  {
  key: "max_nnz"
  value: { string_value: "2" }
  },
  {
  key: "embedding_vector_size"
  value: { string_value: "16" }
  },
  {
  key: "embeddingkey_long_type"
  value: { string_value: "true" }
  }
]

Writing /wdl_infer/model/wdl/config.pbtxt


### 1.4 Make a directory for RocksDB

In [4]:
!mkdir /wdl_infer/rocksdb

mkdir: cannot create directory ‘/wdl_infer/rocksdb’: File exists


### 1.5 Generate the HugeCTR Backend parameter server configuration for deploying WDL

In [5]:
%%writefile /wdl_infer/model/ps.json
{
    "supportlonglong":true,
    "volatile_db": {
        "type": "redis_cluster",
        "address": "127.0.0.1:7000,127.0.0.1:7001,127.0.0.1:7002",
        "user_name": "default",
        "password": "",
        "num_partitions": 8,
        "max_get_batch_size": 100000,
        "max_set_batch_size": 100000,
        "overflow_policy": "evict_oldest",
        "overflow_margin": 10000000,
        "overflow_resolution_target": 0.8,
        "initial_cache_rate": 1.0,
        "update_filters": [ ".+" ]
    },
    "persistent_db": {
        "type": "rocksdb",
        "path": "/wdl_infer/rocksdb",
        "num_threads": 16,
        "read_only": false,
        "max_get_batch_size": 1,
        "max_set_batch_size": 10000,
        "update_filters": [ "^hps_.+$" ]
    },
    "update_source": {
        "type": "kafka",
        "brokers": "10.23.137.25:9093",
        "receive_buffer_size": 262144,
        "poll_timeout_ms": 500,
        "max_batch_size": 8192,
        "failure_backoff_ms": 50,
        "max_commit_interval": 32
    },
    "models":[
        {
            "model":"wdl",
            "sparse_files":["/wdl_infer/model/wdl/1/wdl_0_sparse_model", "/wdl_infer/model/wdl/1/wdl_1_sparse_model"],
            "dense_file":"/wdl_infer/model/wdl/1/wdl_dense_0.model",
            "network_file":"/wdl_infer/model/wdl/1/wdl.json",
            "num_of_worker_buffer_in_pool": 1,
            "num_of_refresher_buffer_in_pool": 1,
            "cache_refresh_percentage_per_iteration": 0.2,
            "deployed_device_list":[0],
            "max_batch_size":64,
            "default_value_for_each_table":[0.0,0.0],
            "hit_rate_threshold":0.9,
            "gpucacheper":0.5,
            "gpucache":true,
            "maxnum_des_feature_per_sample": 13,
			"maxnum_catfeature_query_per_table_per_sample" : [2,26],
			"embedding_vecsize_per_table" : [1,15],
			"slot_num":28
        }
    ]
}

Writing /wdl_infer/model/ps.json


### 2.1 Start Triton

**Please make sure you have started Redis cluster following the README before you start Triton.**

**Start the Triton server in a new terminal using the following command:**
```
tritonserver --model-repository=/wdl_infer/model/ --load-model=wdl --model-control-mode=explicit --backend-directory=/usr/local/hugectr/backends --backend-config=hugectr,ps=/wdl_infer/model/ps.json
```

### 2.2 Inference using Triton

In [6]:
%%writefile triton_infer.py

from tritonclient.utils import *
import tritonclient.http  as httpclient
import numpy as np
import pandas as pd
import sys

model_name = 'wdl'
CATEGORICAL_COLUMNS=["C1_C2","C3_C4"] + ["C" + str(x) for x in range(1, 27)]
CONTINUOUS_COLUMNS=["I" + str(x) for x in range(1, 14)]
LABEL_COLUMNS = ['label']
test_df=pd.read_csv("infer_data.csv",sep=',')

with httpclient.InferenceServerClient("localhost:8000") as client:
    dense_features = np.array([list(test_df[CONTINUOUS_COLUMNS].values.flatten())],dtype='float32')
    embedding_columns = np.array([list((test_df[CATEGORICAL_COLUMNS]).values.flatten())],dtype='int64')
    row_ptrs = np.array([list(range(0,11, 2)) + list(range(0,131))], dtype='int32')
    
    inputs = [
        httpclient.InferInput("DES", dense_features.shape,
                              np_to_triton_dtype(dense_features.dtype)),
        httpclient.InferInput("CATCOLUMN", embedding_columns.shape,
                              np_to_triton_dtype(embedding_columns.dtype)),
        httpclient.InferInput("ROWINDEX", row_ptrs.shape,
                              np_to_triton_dtype(row_ptrs.dtype)),

    ]

    inputs[0].set_data_from_numpy(dense_features)
    inputs[1].set_data_from_numpy(embedding_columns)
    inputs[2].set_data_from_numpy(row_ptrs)
    outputs = [
        httpclient.InferRequestedOutput("OUTPUT0")
    ]

    response = client.infer(model_name,
                            inputs,
                            request_id=str(1),
                            outputs=outputs)

    result = response.get_response()
    print(result)
    print("Prediction Result:")
    print(response.as_numpy("OUTPUT0"))

Overwriting triton_infer.py


In [7]:
!python3 triton_infer.py

{'id': '1', 'model_name': 'wdl', 'model_version': '1', 'parameters': {'NumSample': 5, 'DeviceID': 0}, 'outputs': [{'name': 'OUTPUT0', 'datatype': 'FP32', 'shape': [5], 'parameters': {'binary_data_size': 20}}]}
Prediction Result:
[0.01366859 0.00814866 0.06785329 0.00727612 0.01993068]


### 2.3 Continuous inference

**Send inference request again after [continous training](./Continuous_Training.ipynb) was done:**

In [8]:
!python3 triton_infer.py

{'id': '1', 'model_name': 'wdl', 'model_version': '1', 'parameters': {'NumSample': 5, 'DeviceID': 0}, 'outputs': [{'name': 'OUTPUT0', 'datatype': 'FP32', 'shape': [5], 'parameters': {'binary_data_size': 20}}]}
Prediction Result:
[0.00362184 0.00090019 0.05462332 0.00286225 0.00531276]
