In [1]:
# Copyright 2020 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_hugectr_movie-lens-example/nvidia_logo.png" style="width: 90px; float: right;">

# HugeCTR demo on Movie lens data

## Overview

HugeCTR is a recommender-specific framework that is capable of distributed training across multiple GPUs and nodes for click-through-rate (CTR) estimation.
HugeCTR is a component of NVIDIA Merlin ([documentation](https://nvidia-merlin.github.io/Merlin/main/README.html) | [GitHub](https://github.com/NVIDIA-Merlin/Merlin)).
Merlin which is a framework that accelerates the entire pipeline from data ingestion and training to deploying GPU-accelerated recommender systems.

### Learning objectives

* Training a deep-learning recommender model (DLRM) on the MovieLens 20M [dataset](https://grouplens.org/datasets/movielens/20m/).
* Walk through data preprocessing, training a DLRM model with HugeCTR, and then using the movie embedding to answer item similarity queries.


## Prerequisites

### Docker containers

Start the notebook inside a running 22.09 or later NGC Docker container: `nvcr.io/nvidia/merlin/merlin-hugectr:22.09`.
The HugeCTR Python interface is installed to the path `/usr/local/hugectr/lib/` and the path is added to the environment variable `PYTHONPATH`.
You can use the HugeCTR Python interface within the Docker container without any additional configuration.

### Hardware

This notebook requires a Pascal, Volta, Turing, Ampere or newer GPUs, such as P100, V100, T4 or A100.
You can view the GPU information with the `nvidia-smi` command:

In [1]:
!nvidia-smi

Mon Aug 15 07:05:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|       

## Data download and preprocessing

We first install a few extra utilities for data preprocessing.

In [2]:
print("Downloading and installing 'tqdm' package.")
!pip3 -q install torch tqdm

print("Downloading and installing 'unzip' command")
!apt-get update
!apt-get install -y zip

Downloading and installing 'tqdm' package.
Downloading and installing 'unzip' command
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease [1581 B]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]      
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages [663 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]                
Get:5 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [2087 kB]
Get:6 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [1461 kB]
Get:7 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [888 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]       
Get:9 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [27.5 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]     
Get:11 http://archive.ubuntu.com/ubuntu

Next, we download and unzip the MovieLens 20M [dataset](https://grouplens.org/datasets/movielens/20m/).

In [6]:
print("Downloading and extracting 'Movie Lens 20M' dataset.")
#!wget -nc http://files.grouplens.org/datasets/movielens/ml-20m.zip -P data -q --show-progress
!unzip -n data/ml-20m.zip -d data
!ls ./data

Downloading and extracting 'Movie Lens 20M' dataset.
Archive:  data/ml-20m.zip
   creating: data/ml-20m/
  inflating: data/ml-20m/genome-scores.csv  
  inflating: data/ml-20m/genome-tags.csv  
  inflating: data/ml-20m/links.csv   
  inflating: data/ml-20m/movies.csv  
  inflating: data/ml-20m/ratings.csv  
  inflating: data/ml-20m/README.txt  
  inflating: data/ml-20m/tags.csv    
ml-20m	ml-20m.zip


### MovieLens data preprocessing

In [1]:
import pandas as pd
import torch
import tqdm

MIN_RATINGS = 20
USER_COLUMN = 'userId'
ITEM_COLUMN = 'movieId'

  from .autonotebook import tqdm as notebook_tqdm


Next, we read the data into a Pandas dataframe and encode `userID` and `itemID` with integers.

In [2]:
df = pd.read_csv('./data/ml-20m/ratings.csv')
print("Filtering out users with less than {} ratings".format(MIN_RATINGS))
grouped = df.groupby(USER_COLUMN)
df = grouped.filter(lambda x: len(x) >= MIN_RATINGS)

print("Mapping original user and item IDs to new sequential IDs")
df[USER_COLUMN], unique_users = pd.factorize(df[USER_COLUMN])
df[ITEM_COLUMN], unique_items = pd.factorize(df[ITEM_COLUMN])

nb_users = len(unique_users)
nb_items = len(unique_items)

print("Number of users: %d\nNumber of items: %d"%(len(unique_users), len(unique_items)))

Filtering out users with less than 20 ratings
Mapping original user and item IDs to new sequential IDs
Number of users: 138493
Number of items: 26744


Next, we split the data into a train and test set.
The last movie each user has recently rated is used for the test set.

In [3]:
# Need to sort before popping to get the last item
df.sort_values(by='timestamp', inplace=True)
    
# clean up data
del df['rating'], df['timestamp']
df = df.drop_duplicates() # assuming it keeps order

df.head()

Unnamed: 0,userId,movieId
4182421,28506,3258
18950979,131159,23
18950936,131159,3
18950930,131159,630
12341178,85251,1867


In [4]:
# HugeCTR expect user ID and item ID to be different, so we will add nb_users to the movieId to prevent key range overlapping
df['movieId'] = df['movieId'] + nb_users

In [5]:
# now we have filtered and sorted by time data, we can split test data out
grouped_sorted = df.groupby(USER_COLUMN, group_keys=False)
test_data = grouped_sorted.tail(1).sort_values(by=USER_COLUMN)

# need to pop for each group
train_data = grouped_sorted.apply(lambda x: x.iloc[:-1])

In [6]:
train_data['target']=1
test_data['target']=1
train_data.head()

Unnamed: 0,userId,movieId,target
20,0,138513,1
19,0,138512,1
86,0,138579,1
61,0,138554,1
23,0,138516,1


Because the MovieLens data contains only positive examples, first we define a utility function to generate negative samples.

In [7]:
class _TestNegSampler:
    def __init__(self, train_ratings, nb_users, nb_items, nb_neg):
        self.nb_neg = nb_neg
        self.nb_users = nb_users 
        self.nb_items = nb_items 

        # compute unique ids for quickly created hash set and fast lookup
        ids = (train_ratings[:, 0] * self.nb_items) + train_ratings[:, 1]
        self.set = set(ids)

    def generate(self, batch_size=128*1024):
        users = torch.arange(0, self.nb_users).reshape([1, -1]).repeat([self.nb_neg, 1]).transpose(0, 1).reshape(-1)

        items = [-1] * len(users)

        random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
        print('Generating validation negatives...')
        for idx, u in enumerate(tqdm.tqdm(users.tolist())):
            if not random_items:
                random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
            j = random_items.pop()
            while u * self.nb_items + j in self.set:
                if not random_items:
                    random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
                j = random_items.pop()

            items[idx] = j
        items = torch.LongTensor(items)
        return items

Next, we generate the negative samples for training.

In [8]:
sampler = _TestNegSampler(df.values, nb_users, nb_items, 500)  # using 500 negative samples
train_negs = sampler.generate()
train_negs = train_negs.reshape(-1, 500)

sampler = _TestNegSampler(df.values, nb_users, nb_items, 100)  # using 100 negative samples
test_negs = sampler.generate()
test_negs = test_negs.reshape(-1, 100)

Generating validation negatives...


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69246500/69246500 [00:57<00:00, 1197676.54it/s]


Generating validation negatives...


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13849300/13849300 [00:11<00:00, 1181648.22it/s]


In [9]:
import numpy as np

# generating negative samples for training
train_data_neg = np.zeros((train_negs.shape[0]*train_negs.shape[1],3), dtype=int)
idx = 0
for i in tqdm.tqdm(range(train_negs.shape[0])):
    for j in range(train_negs.shape[1]):
        train_data_neg[idx, 0] = i # user ID
        train_data_neg[idx, 1] = train_negs[i, j] # negative item ID
        idx += 1
    
# generating negative samples for testing
test_data_neg = np.zeros((test_negs.shape[0]*test_negs.shape[1],3), dtype=int)
idx = 0
for i in tqdm.tqdm(range(test_negs.shape[0])):
    for j in range(test_negs.shape[1]):
        test_data_neg[idx, 0] = i
        test_data_neg[idx, 1] = test_negs[i, j]
        idx += 1

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 138493/138493 [06:23<00:00, 360.91it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 138493/138493 [01:16<00:00, 1804.65it/s]


In [10]:
train_data_np= np.concatenate([train_data_neg, train_data.values])
np.random.shuffle(train_data_np)

test_data_np= np.concatenate([test_data_neg, test_data.values])
np.random.shuffle(test_data_np)


### Write HugeCTR data files

After pre-processing, we write the data to disk using HugeCTR the [Norm](https://nvidia-merlin.github.io/HugeCTR/master/api/python_interface.html#norm) dataset format.

In [11]:
from ctypes import c_longlong as ll
from ctypes import c_uint
from ctypes import c_float
from ctypes import c_int

def write_hugeCTR_data(huge_ctr_data, filename='huge_ctr_data.dat'):
    print("Writing %d samples"%huge_ctr_data.shape[0])
    with open(filename, 'wb') as f:
        #write header
        f.write(ll(0)) # 0: no error check; 1: check_num
        f.write(ll(huge_ctr_data.shape[0])) # the number of samples in this data file
        f.write(ll(1)) # dimension of label
        f.write(ll(1)) # dimension of dense feature
        f.write(ll(2)) # long long slot_num
        for _ in range(3): f.write(ll(0)) # reserved for future use

        for i in tqdm.tqdm(range(huge_ctr_data.shape[0])):
            f.write(c_float(huge_ctr_data[i,2])) # float label[label_dim];
            f.write(c_float(0)) # dummy dense feature
            f.write(c_int(1)) # slot 1 nnz: user ID
            f.write(c_uint(huge_ctr_data[i,0]))
            f.write(c_int(1)) # slot 2 nnz: item ID
            f.write(c_uint(huge_ctr_data[i,1]))

#### Train data

In [12]:
def generate_filelist(filelist_name, num_files, filename_prefix):
    with open(filelist_name, 'wt') as f:
        f.write('{0}\n'.format(num_files));
        for i in range(num_files):
            f.write('{0}_{1}.dat\n'.format(filename_prefix, i))

In [13]:
!rm -rf ./data/hugeCTR
!mkdir ./data/hugeCTR

for i, data_arr in enumerate(np.array_split(train_data_np,10)):
    write_hugeCTR_data(data_arr, filename='./data/hugeCTR/train_huge_ctr_data_%d.dat'%i)

generate_filelist('./data/hugeCTR/train_filelist.txt', 10, './data/hugeCTR/train_huge_ctr_data')

Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 313062.86it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 314545.08it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 313687.26it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 316105.12it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 313179.63it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 314053.42it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 312377.54it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 313288.65it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 313456.87it/s]


Writing 8910827 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8910827/8910827 [00:28<00:00, 312600.20it/s]


#### Test data

In [14]:
for i, data_arr in enumerate(np.array_split(test_data_np,10)):
    write_hugeCTR_data(data_arr, filename='./data/hugeCTR/test_huge_ctr_data_%d.dat'%i)
    
generate_filelist('./data/hugeCTR/test_filelist.txt', 10, './data/hugeCTR/test_huge_ctr_data')

Writing 1398780 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398780/1398780 [00:04<00:00, 314708.42it/s]


Writing 1398780 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398780/1398780 [00:04<00:00, 313743.84it/s]


Writing 1398780 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398780/1398780 [00:04<00:00, 316072.53it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 315541.63it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 315705.03it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 315520.94it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 313371.66it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 314972.66it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 314166.37it/s]


Writing 1398779 samples


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1398779/1398779 [00:04<00:00, 315031.06it/s]


## HugeCTR DLRM training

In this section, we will train a DLRM network on the augmented movie lens data. First, we write the training Python script.

In [18]:
%%writefile hugectr_dlrm_movielens.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(max_eval_batches = 1000,
                              batchsize_eval = 65536,
                              batchsize = 65536,
                              lr = 0.1,
                              warmup_steps = 1000,
                              decay_start = 10000,
                              decay_steps = 40000,
                              decay_power = 2.0,
                              end_lr = 1e-5,
                              vvgpu = [[0, 1]],
                              repeat_dataset = True,
                              use_mixed_precision = True,
                              scaler = 1024)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ["./data/hugeCTR/train_filelist.txt"],
                                  eval_source = "./data/hugeCTR/test_filelist.txt",
                                  num_workers = 2,
                                  check_type = hugectr.Check_t.Non)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.SGD,
                                    update_type = hugectr.Update_t.Local,
                                    atomic_update = True)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 1, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", 1, True, 2)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 150,
                            embedding_vec_size = 64,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["dense"],
                            top_names = ["fc1"],
                            num_output=64))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["fc1"],
                            top_names = ["fc2"],
                            num_output=128))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["fc2"],
                            top_names = ["fc3"],
                            num_output=64))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Interaction,
                            bottom_names = ["fc3","sparse_embedding1"],
                            top_names = ["interaction1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["interaction1"],
                            top_names = ["fc4"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["fc4"],
                            top_names = ["fc5"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["fc5"],
                            top_names = ["fc6"],
                            num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.FusedInnerProduct,
                            bottom_names = ["fc6"],
                            top_names = ["fc7"],
                            num_output=256))                                                  
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["fc7"],
                            top_names = ["fc8"],
                            num_output=1))                                                                                           
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["fc8", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()
model.fit(max_iter = 50000, display = 1000, eval_interval = 3000, snapshot = 49000, snapshot_prefix = "./hugeCTR_saved_model_DLRM/")

Overwriting hugectr_dlrm_movielens.py


In [19]:
!rm -rf ./hugeCTR_saved_model_DLRM/
!mkdir ./hugeCTR_saved_model_DLRM/

In [20]:
!python3 hugectr_dlrm_movielens.py

HugeCTR Version: 3.8
[HCTR][08:52:43.612][INFO][RK0][main]: Global seed is 3027443801
[HCTR][08:52:43.615][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
[HCTR][08:52:46.789][INFO][RK0][main]: Start all2all warmup
[HCTR][08:52:46.796][INFO][RK0][main]: End all2all warmup
[HCTR][08:52:46.798][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][08:52:46.799][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][08:52:46.800][INFO][RK0][main]: Device 1: Tesla V100-SXM2-32GB
[HCTR][08:52:46.800][INFO][RK0][main]: num of DataReader workers for train: 2
[HCTR][08:52:46.800][INFO][RK0][main]: num of DataReader workers for eval: 2
[HCTR][08:52:46.809][INFO][RK0][main]: max_vocabulary_size_per_gpu_=614400
[HCTR][08:52:46.812][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][08:53:14.439][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][08:53:14.439][INFO][RK0][tid #140704957323008]: gpu1 start to init embedding
[HCTR][08:53:14.44

## Answer item similarity with DLRM embedding

In this section, we demonstrate how the output of HugeCTR training can be used to carry out simple inference tasks. Specifically, we will show that the movie embeddings can be used for simple item-to-item similarity queries. Such a simple inference can be used as an efficient candidate generator to generate a small set of candidates prior to deep learning model re-ranking. 

First, we read the embedding tables and extract the movie embeddings.

In [21]:
import struct 
import pickle
import numpy as np

key_type = 'I64'
key_type_map = {"I32": ["I", 4], "I64": ["q", 8]}

embedding_vec_size = 64

HUGE_CTR_VERSION = 2.21 # set HugeCTR version here, 2.2 for v2.2, 2.21 for v2.21

if HUGE_CTR_VERSION <= 2.2:
    each_key_size = key_type_map[key_type][1] + key_type_map[key_type][1] + 4 * embedding_vec_size
else:
    each_key_size = key_type_map[key_type][1] + 8 + 4 * embedding_vec_size

In [22]:
embedding_table = {}
        
with open("./hugeCTR_saved_model_DLRM/0_sparse_49000.model" + "/key", 'rb') as key_file, \
     open("./hugeCTR_saved_model_DLRM/0_sparse_49000.model" + "/emb_vector", 'rb') as vec_file:
    try:
        while True:
            key_buffer = key_file.read(key_type_map[key_type][1])
            vec_buffer = vec_file.read(4 * embedding_vec_size)
            if len(key_buffer) == 0 or len(vec_buffer) == 0:
                break
            key = struct.unpack(key_type_map[key_type][0], key_buffer)[0]
            values = struct.unpack(str(embedding_vec_size) + "f", vec_buffer)

            embedding_table[key] = values

    except BaseException as error:
        print(error)


In [24]:
# Create mapping between the MovieId and the keys in the embedding table
def mid_to_key(mid):
    return mid + nb_users

def key_to_mid(key):
    return key - nb_users

In [29]:
max_key = max(embedding_table.keys())
item_embedding = np.zeros((max_key + 1, embedding_vec_size), dtype='float')
for i in embedding_table.keys():
    item_embedding[i] = embedding_table[i]

### Answer nearest neighbor queries


In [30]:
from scipy.spatial.distance import cdist

def find_similar_movies(nn_movie_id, item_embedding, k=10, metric="euclidean"):
    #find the top K similar items according to one of the distance metric: cosine or euclidean
    sim = 1-cdist(item_embedding, item_embedding[nn_movie_id].reshape(1, -1), metric=metric)
   
    return sim.squeeze().argsort()[-k:][::-1]

In [31]:
import pandas as pd
movies = pd.read_csv("./data/ml-20m/movies.csv", index_col="movieId")

In [33]:
movies.index[:10]

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64', name='movieId')

In [39]:
item_embedding.shape

(165237, 64)

In [45]:
for movie_ID in movies.index[:10]:
    try:
        print("Query: ", movies.loc[movie_ID]["title"], movies.loc[movie_ID]["genres"])

        print("Similar movies: ")
        similar_movies = find_similar_movies(mid_to_key(movie_ID), item_embedding)

        for i in similar_movies[1:]:
            try:
                print(key_to_mid(i), movies.loc[key_to_mid(i)]["title"], movies.loc[key_to_mid(i)]["genres"])
            except Exception as e:
                pass
        print("=================================\n")
    except Exception as e:
        pass

Query:  Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
Similar movies: 
339 While You Were Sleeping (1995) Comedy|Romance
2549 Wing Commander (1999) Action|Sci-Fi

Query:  Jumanji (1995) Adventure|Children|Fantasy
Similar movies: 
511 Program, The (1993) Action|Drama
1897 High Art (1998) Drama|Romance
314 Secret of Roan Inish, The (1994) Children|Drama|Fantasy|Mystery
28 Persuasion (1995) Drama|Romance
194 Smoke (1995) Comedy|Drama
80 White Balloon, The (Badkonake sefid) (1995) Children|Drama
10 GoldenEye (1995) Action|Adventure|Thriller
1084 Bonnie and Clyde (1967) Crime|Drama
649 Cold Fever (Á köldum klaka) (1995) Comedy|Drama

Query:  Grumpier Old Men (1995) Comedy|Romance
Similar movies: 
626 Thin Line Between Love and Hate, A (1996) Comedy
952 Around the World in 80 Days (1956) Adventure|Comedy
1119 Drunks (1995) Drama
353 Crow, The (1994) Action|Crime|Fantasy|Thriller
791 Last Klezmer: Leopold Kozlowski, His Life and Music, The (1994) Documentary
1115 Sleepover (199