# Recomendacion utilizando Recbole

## Setup

Instalamos las librerias necesarias

In [1]:
!pip install git+https://github.com/RUCAIBox/RecBole.git@0.1.x

Collecting git+https://github.com/RUCAIBox/RecBole.git@0.1.x
  Cloning https://github.com/RUCAIBox/RecBole.git (to revision 0.1.x) to /tmp/pip-req-build-si44p1w1
  Running command git clone -q https://github.com/RUCAIBox/RecBole.git /tmp/pip-req-build-si44p1w1
  Running command git checkout -b 0.1.x --track origin/0.1.x
  Switched to a new branch '0.1.x'
  Branch '0.1.x' set up to track remote branch '0.1.x' from 'origin'.
Collecting tqdm>=4.48.2
[?25l  Downloading https://files.pythonhosted.org/packages/e9/4e/afa45872365fe2abd13c8022d39348c01808b8cfeea129937920d7bb2244/tqdm-4.54.0-py2.py3-none-any.whl (69kB)
[K     |████████████████████████████████| 71kB 6.5MB/s 
[?25hCollecting scikit_learn>=0.23.2
[?25l  Downloading https://files.pythonhosted.org/packages/5c/a1/273def87037a7fb010512bbc5901c31cfddfca8080bc63b42b26e3cc55b3/scikit_learn-0.23.2-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
[K     |████████████████████████████████| 6.8MB 19.9MB/s 
[?25hCollecting pyyaml>=5.1.0
[?25l  D

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Preprocesamiento de la data
En este cuaderno se utilizara la data del challenge de Mercado Libre. Para ello, hay que preprocesarla al formato que utiliza RecBole.

Importamos las librerias necesarias

In [3]:
import json
import gzip
import datetime
from tqdm import tqdm

Creamos una funcion para abrir el zip que contiene los datos

In [10]:
def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)

In [None]:
for k in parse('test_dataset.jl.gz'):
  print(k)
  break

{'user_history': [{'event_info': 1572239, 'event_timestamp': '2019-09-26T18:31:47.705-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:35:04.724-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:37:35.532-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:38:54.680-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:40:26.904-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:40:35.707-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:41:07.467-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T18:41:47.445-0400', 'event_type': 'view'}, {'event_info': 1572239, 'event_timestamp': '2019-09-26T19:03:34.256-0400', 'event_type': 'view'}, {'event_info': 1194894, 'event_timestamp': '2019-09-27T21:33:38.704-0400', 'event_type': 'view'}, {'

Con la siguiente función podemos procesar los datos. 

En primer lugar las columnas deben tener nombre del tipo NOMBRE_COLUMNA:TIPO_DATO. Como vamos a hacer recomendación secuencial, nos interesa tener un user_id, un item_id y un timestamp

Luego por cada usuario vamos recorriendo cada una de sus interacciones. Las interacciones pueden ser de tipo "View" o de tipo "Search", sin embargo las de tipo Search no poseen un item id, por lo que de momento solamente consideraremos las del tipo "View". La fecha se parsea como un timestamp

Luego de las interacciones, cada usuario tiene un id del item que compro. De momento no diferenciaremos entre comprar y ver, por lo que simplemente agregamos este item comprado con un timestamp mayor a todos los otros timestamps.

Todo esto lo escribimos en un archivo .inter. Este tipo de archivos es el que utiliza RecBole para representar datos de interacciones. Se le puede agregar todo tipo de parametros, pero los requeridos son siempre el userId y itemId

In [7]:
def create_recbole_atomic_file(path_input, path_output):
  # Crea un archivo .inter para usar con recbole
  with open(f"{path_output}.inter", 'w') as file:
    file.write('\t'.join(['user_id:token', 'item_id:token', 'timestamp:float']) + '\n')
    user_id = 1
    for l in tqdm(parse(path_input)):
      biggest_timestamp = 0
      history = []
      for event in l['user_history']:     
        if event['event_type'] == 'view':
          item_id = event['event_info']

          time = int(datetime.datetime.strptime(event['event_timestamp'], '%Y-%m-%dT%H:%M:%S.%f%z').timestamp())
          if time > biggest_timestamp:
            biggest_timestamp = time
          file.write('\t'.join([str(user_id), str(item_id), str(time)]) + '\n')
      if len(history) > 0:
        file.write('\t'.join([str(user_id), str(l['item_bought']), str(biggest_timestamp + 100)]) + '\n')
      user_id += 1
        


In [8]:
def test_atomic_file(path_input, path_train, path_output):
  # Primero tenemos que ver el ultimo user id para no repetirlos
  user_id = 1
  for _ in tqdm(parse(path_train)):
    user_id += 1
  # Ahora hacemos el mismo preprocesamiento que el de train pero sin la ultima cosa
  with open(f"{path_output}.inter", 'w') as file:
    file.write('\t'.join(['user_id:token', 'item_id:token', 'timestamp:float']) + '\n')
    for l in tqdm(parse(path_input)):
      count_events = 0
      for event in l['user_history']:
        if event['event_type'] == 'view':
          count_events += 1  
          item_id = event['event_info']
          time = int(datetime.datetime.strptime(event['event_timestamp'], '%Y-%m-%dT%H:%M:%S.%f%z').timestamp())
          file.write('\t'.join([str(user_id), str(item_id), str(time)]) + '\n')
      if count_events == 0:
        # Esto es para no perder info por si solo hay searches. La prediccion va a ser inutil pero bueno xd
        time = 100
        item_id = 1572239
        file.write('\t'.join([str(user_id), str(item_id), str(time)]) + '\n')
      user_id += 1

Ahora creamos los ficheros correspondientes y aplicamos la funcion de preprocesamiento. Con esto tenemos listo el dataset para utilizar

In [4]:
!mkdir ml_test

In [5]:
!mkdir ml

In [11]:
create_recbole_atomic_file('/content/drive/Shareddrives/RecSys/Datasets/train_dataset.jl.gz', 'ml/ml')

413163it [03:34, 1924.86it/s]


In [None]:
test_atomic_file('/content/drive/Shareddrives/RecSys/Datasets/test_dataset.jl.gz', '/content/drive/Shareddrives/RecSys/Datasets/train_dataset.jl.gz', 'ml_test/ml_test')

413163it [01:39, 4160.81it/s]
177070it [01:25, 2070.95it/s]


## Utilizando Recbole

### Configuración del modelo

RecBole funciona creando archivos de configuración para crear casi todas sus cosas

In [12]:
parameters_dict = {
    'data_path': './',
    'epochs': 10,
    'valid_metric': 'NDCG@10',
    'MAX_ITEM_LIST_LENGTH': 10,
    'train_batch_size': 128,
    'eval_batch_size': 128,
    'TIME_FIELD': 'timestamp'
}

In [13]:
def create_yaml_config(output_path):
  with open(output_path, 'w') as file:
    file.write('USER_ID_FIELD: user_id\nITEM_ID_FIELD: item_id\nTIME_FIELD: timestamp\n \nload_col:\n    inter: [user_id, item_id, timestamp]\n')
    # Aqui podemos escribir mas cosas que nos interesen


In [14]:
create_yaml_config('/content/data_config.yaml')

In [15]:
from recbole.model.sequential_recommender import SASRec
from recbole.trainer import Trainer
from logging import getLogger
from recbole.utils import init_seed, init_logger
from recbole.data import create_dataset, data_preparation
from recbole.config import Config

In [16]:
config = Config(model='SASRec',
                dataset='ml',
                config_file_list=['/content/data_config.yaml'],
                config_dict=parameters_dict)

In [17]:
config

General Hyper Parameters: 
gpu_id=0
use_gpu=True
seed=2020
state=INFO
reproducibility=True
data_path=./ml

Training Hyper Parameters: 
checkpoint_dir=saved
epochs=10
train_batch_size=128
learner=adam
learning_rate=0.001
training_neg_sample_num=1
eval_step=1
stopping_step=10

Evaluation Hyper Parameters: 
eval_setting=TO_LS,full
group_by_user=True
split_ratio=[0.8, 0.1, 0.1]
leave_one_num=2
real_time_process=True
metrics=['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk=[10]
valid_metric=NDCG@10
eval_batch_size=128

Dataset Hyper Parameters: 
field_separator=	
seq_separator= 
USER_ID_FIELD=user_id
ITEM_ID_FIELD=item_id
RATING_FIELD=rating
LABEL_FIELD=label
threshold=None
NEG_PREFIX=neg_
load_col={'inter': ['user_id', 'item_id', 'timestamp']}
unload_col=None
additional_feat_suffix=None
max_user_inter_num=None
min_user_inter_num=0
max_item_inter_num=None
min_item_inter_num=0
lowest_val=None
highest_val=None
equal_val=None
not_equal_val=None
drop_filter_field=True
fields_in_same_space=Non

In [18]:
init_seed(config['seed'], config['reproducibility'])
init_logger(config)
logger = getLogger()

In [19]:
dataset = create_dataset(config)
logger.info(dataset)

03 Dec 20:00    INFO ml
The number of users: 386391
Average actions of users: 15.347232071223376
The number of items: 1601278
Average actions of items: 3.7033049247569285
The number of inters: 5930017
The sparsity of the dataset: 99.99904156602715%
Remain Fields: ['user_id', 'item_id', 'timestamp']


In [20]:
train, val, test = data_preparation(config, dataset)

03 Dec 20:00    INFO Build [ModelType.SEQUENTIAL] DataLoader for [train] with format [InputType.POINTWISE]
03 Dec 20:00    INFO Evaluation Setting:
	Group by user_id
	Ordering: {'strategy': 'by', 'field': ['timestamp'], 'ascending': True}
	Splitting: {'strategy': 'loo', 'leave_one_num': 2}
	Negative Sampling: {'strategy': 'by', 'distribution': 'uniform', 'by': 1}
03 Dec 20:00    INFO batch_size = [[128]], shuffle = [True]

03 Dec 20:00    INFO Build [ModelType.SEQUENTIAL] DataLoader for [evaluation] with format [InputType.POINTWISE]
03 Dec 20:00    INFO Evaluation Setting:
	Group by user_id
	Ordering: {'strategy': 'by', 'field': ['timestamp'], 'ascending': True}
	Splitting: {'strategy': 'loo', 'leave_one_num': 2}
	Negative Sampling: {'strategy': 'full', 'distribution': 'uniform'}
03 Dec 20:00    INFO batch_size = [[128, 128]], shuffle = [False]



In [21]:
model = SASRec(config, train).to(config['device'])
logger.info(model)


03 Dec 20:00    INFO SASRec(
  (item_embedding): Embedding(1601278, 64, padding_idx=0)
  (position_embedding): Embedding(10, 64)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=

In [None]:

trainer = Trainer(config, model)

In [None]:
trainer.resume_checkpoint('/content/saved/SASRec-Nov-24-2020_18-50-34.pth')

RuntimeError: ignored

In [None]:
%%time
best_valid_score, best_valid_result = trainer.fit(train, val)

KeyboardInterrupt: ignored

In [None]:
test_result = trainer.evaluate(test)
print(test_result)

FileNotFoundError: ignored

## Evaluation

In [None]:
import numpy as np
import torch
from recbole.model.sequential_recommender import SASRec
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.data.interaction import Interaction

In [None]:
@torch.no_grad()
def get_scores(uid_series, model, test_data):
    """Calculate the scores of all items for each user in uid_series.
    
    Note:
        The score of [pad] and history items will be set into -inf.
    
    Args:
        uid_series (np.ndarray): User id series
        model (AbstractRecommender): Model to predict
        test_data (AbstractDataLoader): The test_data of model
    
    Returns:
        torch.Tensor: the scores of all items for each user in uid_series.
    """
    uid_field = test_data.dataset.uid_field
    iid_field = test_data.dataset.iid_field
    dataset = test_data.dataset
    
    # Get scores of all items
    input_interaction = Interaction({uid_field: torch.tensor(uid_series.repeat(dataset.item_num))})
    input_interaction.update(test_data.get_item_feature().repeat(len(uid_series)))
    #input_interaction.update(test_data.get)
    print(input_interaction)
    score = model.predict(input_interaction).view(len(uid_series), dataset.item_num)

    score[:, 0] = -np.inf  # set scores of [pad] to -inf

    # Get history items
    test_inter = test_data.dataset.inter_feat
    history_item_ids = []
    for uid in uid_series:
        pos_item_id = test_inter[iid_field][test_inter[uid_field] == uid].values
        used_item_id = test_data.sampler.used_ids[uid]
        history_item_ids.append(list(used_item_id - set(pos_item_id)))

    # set scores of history items to -inf
    for i, hist_iid in enumerate(history_item_ids):
        score[i, hist_iid] = -np.inf

    return score

In [None]:
def get_topk(uid_series, model, test_data, k):
    """Calculate the top-k items' scores and ids for each user in uid_series.
    
    Args:
        uid_series (np.ndarray): User id series
        model (AbstractRecommender): Model to predict
        test_data (AbstractDataLoader): The test_data of model
        k (int): The top-k items.
    
    Returns:
        tuple:
            - topk_scores (torch.Tensor): The scores of topk items.
            - topk_index (torch.Tensor): The index of topk items, which is also the internal ids of items.
    """
    score = get_scores(uid_series, model, test_data)
    return torch.topk(score, k)

In [None]:
parameters_dict_eval = {
    'data_path': '/content/',
    'epochs': 10,
    'valid_metric': 'NDCG@10',
    'MAX_ITEM_LIST_LENGTH': 20,
    'train_batch_size': 128,
    'eval_batch_size': 128,
    'TIME_FIELD': 'timestamp',
    'split_ratio':[0,0,1]
}

In [None]:
config_eval = Config(model='SASRec',
                dataset='ml_test',
                config_file_list=['/content/data_config.yaml'],
                config_dict=parameters_dict_eval)
config = Config(model='SASRec',
                dataset='ml',
                config_file_list=['/content/data_config.yaml'],
                config_dict=parameters_dict)

In [None]:
config

In [None]:
dataset = create_dataset(config)
dataset_eval = create_dataset(config_eval)

In [None]:
train_data, valid_data, test_data = data_preparation(config, dataset)
_, _, test_data_eval = data_preparation(config_eval, dataset_eval)

In [None]:
model = SASRec(config, train_data).to(config['device'])

In [None]:
print(model)

SASRec(
  (item_embedding): Embedding(1605656, 64, padding_idx=0)
  (position_embedding): Embedding(10, 64)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (drop

In [None]:
checkpoint = torch.load('/content/drive/Shareddrives/RecSys/TrainingCheckpoints/SASRec/SASRec-Nov-24-2020_18-50-34.pth')

In [None]:
model.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>

In [None]:
model.eval()

SASRec(
  (item_embedding): Embedding(1605656, 64, padding_idx=0)
  (position_embedding): Embedding(10, 64)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (drop

In [None]:
uid_series = dataset_eval.token2id(dataset_eval.uid_field, ['1'])

In [None]:
topk_score, topk_iid_list = get_topk(uid_series, model, test_data_eval, 10)
print(topk_score, topk_iid_list)

The batch_size of interaction: 1605656
    user_id, torch.Size([1605656]), cpu
    item_id, torch.Size([1605656]), cpu




KeyError: ignored

## Cuda memory queries

In [None]:
!nvidia-smi

Wed Nov 25 01:12:14 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    33W /  70W |  14431MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import torch
print(torch.cuda.empty_cache())

None


In [None]:
print(torch.cuda.memory_summary(device=None, abbreviated=False))

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 4            |        cudaMalloc retries: 4         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    6699 MB |   13551 MB |     893 TB |     893 TB |
|       from large pool |    6684 MB |   13536 MB |     888 TB |     888 TB |
|       from small pool |      15 MB |      50 MB |       4 TB |       4 TB |
|---------------------------------------------------------------------------|
| Active memory         |    6699 MB |   13551 MB |     893 TB |     893 TB |
|       from large pool |    6684 MB |   13536 MB |     888 TB |     888 TB |
|       from small pool |      15 MB |      50 MB |       4 TB |       4 TB |
|---------------------------------------------------------------