# Tutorial

This tutorial walks you through the full process of training a model in LibRecommender: **data processing -> feature engineering -> training -> evaluation -> save/load -> retraining**. We will use [Wide & Deep](https://arxiv.org/pdf/1606.07792.pdf) as the example algorithm.

First make sure LibRecommender is installed.

In [None]:
!pip install LibRecommender

For how to deploy a trained model in LibRecommender, see [Serving Guide](https://librecommender.readthedocs.io/en/latest/serving_guide/python.html).

**NOTE**: If you encounter errors like `Variables already exist, disallowed...`, just call `tf.compat.v1.reset_default_graph()` first.

## Load data

In this tutorial we will use the [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) dataset. The following code loads the data into a `pandas.DataFrame`. If the data does not exist locally, it is downloaded first.

In [None]:
import random
import warnings
import zipfile
from pathlib import Path

import pandas as pd
import tensorflow as tf
import tqdm
warnings.filterwarnings("ignore")

In [5]:
def load_ml_1m():
    # download and extract zip file
    tf.keras.utils.get_file(
        "ml-1m.zip", 
        "http://files.grouplens.org/datasets/movielens/ml-1m.zip", 
        cache_dir=".", 
        cache_subdir=".", 
        extract=True,
    )
    # read and merge data into same table
    cur_path = Path(".").absolute()
    # a multi-character separator such as "::" requires the python parsing engine
    ratings = pd.read_csv(
        cur_path / "ml-1m" / "ratings.dat",
        sep="::",
        usecols=[0, 1, 2, 3],
        names=["user", "item", "rating", "time"],
        engine="python",
    )
    users = pd.read_csv(
        cur_path / "ml-1m" / "users.dat",
        sep="::",
        usecols=[0, 1, 2, 3],
        names=["user", "sex", "age", "occupation"],
        engine="python",
    )
    items = pd.read_csv(
        cur_path / "ml-1m" / "movies.dat",
        sep="::",
        usecols=[0, 2],
        names=["item", "genre"],
        encoding="iso-8859-1",
        engine="python",
    )
    items[["genre1", "genre2", "genre3"]] = (
        items["genre"].str.split(r"|", expand=True).fillna("missing").iloc[:, :3]
    )
    items.drop("genre", axis=1, inplace=True)
    data = ratings.merge(users, on="user").merge(items, on="item")
    data.rename(columns={"rating": "label"}, inplace=True)
    # random shuffle data
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)
    return data

In [6]:
data = load_ml_1m()
print("data shape:", data.shape)

data shape: (1000209, 10)


In [7]:
data.iloc[random.choices(range(len(data)), k=10)]  # randomly select 10 rows

| | user | item | label | time | sex | age | occupation | genre1 | genre2 | genre3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 389991 | 3993 | 2119 | 4 | 965699325 | M | 25 | 0 | Horror | missing | missing |
| 155010 | 699 | 564 | 2 | 975558670 | M | 18 | 0 | Comedy | missing | missing |
| 737170 | 4121 | 2377 | 4 | 965357614 | M | 45 | 7 | Horror | Sci-Fi | missing |
| 78641 | 4330 | 2028 | 4 | 965242550 | F | 45 | 6 | Action | Drama | War |
| 629621 | 1065 | 597 | 3 | 974948100 | F | 25 | 0 | Comedy | Romance | missing |
| 787474 | 5831 | 708 | 4 | 957909659 | M | 25 | 1 | Comedy | Romance | missing |
| 344925 | 4092 | 3526 | 5 | 965420545 | F | 25 | 0 | Comedy | Drama | missing |
| 888858 | 3681 | 441 | 5 | 966356720 | M | 25 | 7 | Comedy | missing | missing |
| 417010 | 771 | 908 | 4 | 975440099 | F | 50 | 1 | Drama | Thriller | missing |
| 511404 | 2795 | 1179 | 3 | 972892354 | M | 50 | 13 | Crime | Drama | Film-Noir |


Now we have about 1 million records. In order to perform evaluation after training, we first need to split the data into train, eval and test sets. In this tutorial we will simply use `random_split`. For other ways of splitting data, see [Data Processing](https://librecommender.readthedocs.io/en/latest/user_guide/data_processing.html).

**For now, we will use only the first half of the data for training. Later we will use the remaining data to retrain the model.**

## Process Data & Features

In [None]:
from libreco.data import random_split

# split data into three folds for training, evaluating and testing
first_half_data = data[: (len(data) // 2)]
train_data, eval_data, test_data = random_split(first_half_data, multi_ratios=[0.8, 0.1, 0.1], seed=42)
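
To make the idea of a ratio-based random split concrete, here is a minimal pure-Python sketch. The helper `split_by_ratios` is hypothetical and written only for illustration; it is not LibRecommender's implementation, and `random_split` may apply additional handling that this sketch omits.

```python
import random


def split_by_ratios(rows, ratios, seed=42):
    """Shuffle row indices and cut them into consecutive folds by ratio."""
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("ratios must sum to 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds, start = [], 0
    for i, ratio in enumerate(ratios):
        # the last fold takes whatever remains, so rounding never drops a row
        is_last = i == len(ratios) - 1
        end = len(rows) if is_last else start + int(len(rows) * ratio)
        folds.append([rows[j] for j in indices[start:end]])
        start = end
    return folds


rows = list(range(1000))  # stand-in for the rows of a DataFrame
train, evaluation, test = split_by_ratios(rows, [0.8, 0.1, 0.1])
print(len(train), len(evaluation), len(test))
```

Fixing the seed makes the split reproducible, which is the same reason the tutorial passes `seed=42`.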

The data contains some categorical features such as "sex" and the genre columns, as well as a numerical feature "age". In LibRecommender we use `sparse_col` to represent categorical features and `dense_col` to represent numerical features. You therefore specify the column information first and then use the `DatasetFeat.build_*` functions to process the data.

In [9]:
print("first half data shape:", first_half_data.shape)

first half data shape: (500104, 10)


In [10]:
from libreco.data import DatasetFeat

sparse_col = ["sex", "occupation", "genre1", "genre2", "genre3"]
dense_col = ["age"]
user_col = ["sex", "age", "occupation"]
item_col = ["genre1", "genre2", "genre3"]

train_data, data_info = DatasetFeat.build_trainset(train_data, user_col, item_col, sparse_col, dense_col)
eval_data = DatasetFeat.build_evalset(eval_data)
test_data = DatasetFeat.build_testset(test_data)

`user_col` lists the features that belong to users, and `item_col` lists the features that belong to items. Note that the number of columns should match, i.e. `len(sparse_col) + len(dense_col) == len(user_col) + len(item_col)`.
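
The column-count rule can be checked before building the dataset. The sketch below uses a hypothetical helper, `check_feature_columns`, which is not part of LibRecommender; it also checks that the column names agree, which is a slightly stricter condition than the count rule stated above.

```python
def check_feature_columns(sparse_col, dense_col, user_col, item_col):
    """Return True when every feature column is assigned to exactly one side."""
    feature_cols = sparse_col + dense_col
    owner_cols = user_col + item_col
    same_count = len(feature_cols) == len(owner_cols)
    same_names = set(feature_cols) == set(owner_cols)
    no_overlap = not (set(user_col) & set(item_col))
    return same_count and same_names and no_overlap


sparse_col = ["sex", "occupation", "genre1", "genre2", "genre3"]
dense_col = ["age"]
user_col = ["sex", "age", "occupation"]
item_col = ["genre1", "genre2", "genre3"]

print(check_feature_columns(sparse_col, dense_col, user_col, item_col))
```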

In [11]:
print(data_info)

n_users: 6040, n_items: 3576, data density: 1.8523 %


In this example we treat every sample in the data as a positive sample and generate negative samples by sampling. Data of this kind is called "implicit data".

In [12]:
# sample negative items for each record
train_data.build_negative_samples(data_info)
eval_data.build_negative_samples(data_info)
test_data.build_negative_samples(data_info)

random neg item sampling elapsed: 0.338s
random neg item sampling elapsed: 0.043s
random neg item sampling elapsed: 0.043s
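
As a rough illustration of random negative sampling, the sketch below pairs each positive record with items the user has not interacted with. The function `sample_negatives` and its `num_neg` parameter are assumptions made for this example; they do not describe how `build_negative_samples` is implemented.

```python
import random
from collections import defaultdict


def sample_negatives(interactions, all_items, num_neg=1, seed=42):
    """Label observed (user, item) pairs 1 and add sampled unseen items as 0."""
    rng = random.Random(seed)
    consumed = defaultdict(set)
    for user, item in interactions:
        consumed[user].add(item)

    samples = []
    for user, item in interactions:
        samples.append((user, item, 1))
        candidates = [i for i in all_items if i not in consumed[user]]
        # a user who has consumed every item simply gets no negatives
        for neg_item in rng.sample(candidates, min(num_neg, len(candidates))):
            samples.append((user, neg_item, 0))
    return samples


interactions = [(1, 10), (1, 11), (2, 10)]  # toy (user, item) pairs
all_items = [10, 11, 12, 13]
samples = sample_negatives(interactions, all_items)
print(samples)
```

Scanning all items for every record is fine for a toy example but would be too slow for a dataset of this size; a real implementation would sample more efficiently.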


## Training the Model

Now with all the data and features prepared, we can start training the model! 

As its name suggests, the `Wide & Deep` algorithm has a wide part and a deep part, and the two parts use different optimizers. We therefore specify their learning rates separately with a dict, e.g. `{"wide": 0.01, "deep": 3e-4}`. For other model hyperparameters, see the API reference of [WideDeep](https://librecommender.readthedocs.io/en/latest/api/algorithms/wide_deep.html).

In [13]:
from libreco.algorithms import WideDeep

In [23]:
model = WideDeep(
    task="ranking",
    data_info=data_info,
    embed_size=16,
    n_epochs=2,
    loss_type="cross_entropy",
    lr={"wide": 0.005, "deep": 1e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)

model.fit(
    train_data,
    verbose=2,
    shuffle=True,
    eval_data=eval_data,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

Training start time: 2023-02-20 19:46:23
total params: 192,477 | embedding params: 165,109 | network params: 27,368


train: 100%|█████████████████████████████████████████████████| 391/391 [00:04<00:00, 95.63it/s]


Epoch 1 elapsed: 4.280s
	 train_loss: 8.1818


eval_pointwise: 100%|██████████████████████████████████████████| 13/13 [00:00<00:00, 92.43it/s]


	 eval log_loss: 5.8765
	 eval roc_auc: 0.7034


eval_listwise: 100%|██████████████████████████████████████| 2797/2797 [00:15<00:00, 184.03it/s]


	 eval precision@10: 0.0287
	 eval recall@10: 0.0413
	 eval ndcg@10: 0.1096


train: 100%|████████████████████████████████████████████████| 391/391 [00:03<00:00, 104.64it/s]


Epoch 2 elapsed: 3.925s
	 train_loss: 4.1555


eval_pointwise: 100%|█████████████████████████████████████████| 13/13 [00:00<00:00, 184.96it/s]


	 eval log_loss: 2.5937
	 eval roc_auc: 0.8073


eval_listwise: 100%|██████████████████████████████████████| 2797/2797 [00:15<00:00, 184.98it/s]


	 eval precision@10: 0.0311
	 eval recall@10: 0.0481
	 eval ndcg@10: 0.1240


We've trained the model for 2 epochs and evaluated the performance on the eval data during training. Next we can evaluate on the *independent* test data.

In [24]:
from libreco.evaluation import evaluate

evaluate(model=model, data=test_data, metrics=["loss", "roc_auc", "precision", "recall", "ndcg"])

eval_pointwise: 100%|█████████████████████████████████████████| 13/13 [00:00<00:00, 167.36it/s]
eval_listwise: 100%|██████████████████████████████████████| 1024/1024 [00:05<00:00, 185.17it/s]


{'loss': 2.59511655229092,
 'roc_auc': 0.8060875394289854,
 'precision': 0.028857421875000004,
 'recall': 0.046256462484535465,
 'ndcg': 0.12245855533553587}
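
For readers unfamiliar with the listwise metrics reported above, the following sketch computes precision@k, recall@k and NDCG@k for a single user using the common textbook definitions with binary relevance. The item IDs are toy values, and LibRecommender's own implementation may differ in details such as how results are averaged over users.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(recommended, relevant, k):
    """DCG of the top-k list divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


recommended = [10, 20, 30, 40, 50]  # toy ranked list for one user
relevant = {20, 50, 60}             # toy set of held-out positive items
print(precision_at_k(recommended, relevant, 5))
print(recall_at_k(recommended, relevant, 5))
print(ndcg_at_k(recommended, relevant, 5))
```

Unlike precision and recall, NDCG rewards placing relevant items near the top of the list, so two lists with the same hits can score differently.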

## Make Recommendation

Making recommendations is straightforward. You can generate recommendations for a single user or for a batch of users.

In [25]:
model.recommend_user(user=1, n_rec=3)

{1: array([ 260, 2858,  858])}

In [26]:
model.recommend_user(user=[1, 2, 3], n_rec=3)

{1: array([ 260, 2858,  858]),
 2: array([1617, 2987,  608]),
 3: array([1580, 1198,  480])}

## Save, Load and Inference

When saving the model, we should also save the `data_info` for feature information.

In [27]:
data_info.save("model_path", model_name="wide_deep")
model.save("model_path", model_name="wide_deep")

Then we can load the model and make recommendations again.

In [28]:
tf.compat.v1.reset_default_graph()  # need to reset graph in TensorFlow 1

In [29]:
from libreco.data import DataInfo

loaded_data_info = DataInfo.load("model_path", model_name="wide_deep")
loaded_model = WideDeep.load("model_path", model_name="wide_deep", data_info=loaded_data_info)
loaded_model.recommend_user(user=1, n_rec=3)

total params: 192,477 | embedding params: 165,109 | network params: 27,368


{1: array([ 260, 2858,  858])}

## Retrain the Model with New Data

Remember that we split the original `MovieLens 1M` data into two parts in the first place? We will treat the **second half** of the data as our new data and retrain the saved model with it. In real-world recommender systems, data may be generated every day, so it is inefficient to train the model from scratch every time we get some new data.

In [30]:
second_half_data = data[(len(data) // 2) :]
train_data, eval_data = random_split(second_half_data, multi_ratios=[0.8, 0.2])

In [31]:
print("second half data shape:", second_half_data.shape)

second half data shape: (500105, 10)


The data processing is similar, except that we should use the `merge_trainset()` and `merge_evalset()` functions of `DatasetFeat`.

The purpose of these functions is to combine information from the old data with information from the new data, in particular to handle new users and items that may appear in the new data. For more details, see [Model Retrain](https://librecommender.readthedocs.io/en/latest/user_guide/model_retrain.html).
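
One way to picture this merging step is as extending an ID-to-index mapping: IDs seen before keep their index, and unseen IDs get new indices appended at the end. The sketch below is only a conceptual illustration with a hypothetical helper, `merge_id_mapping`, and toy IDs; it makes no claim about how `DataInfo` stores or merges this information internally.

```python
def merge_id_mapping(old_mapping, new_ids):
    """Extend an ID -> index mapping with unseen IDs, keeping old indices fixed."""
    merged = dict(old_mapping)
    for raw_id in new_ids:
        if raw_id not in merged:
            # new IDs are appended, so existing indices never shift
            merged[raw_id] = len(merged)
    return merged


old_items = {260: 0, 2858: 1, 858: 2}  # toy mapping from the old data
new_batch = [2858, 3952, 260, 3953]    # toy item IDs seen in the new data
merged_items = merge_id_mapping(old_items, new_batch)
print(merged_items)
```

Keeping old indices stable is what would allow previously learned parameters to be reused for known users and items while only the new ones start from scratch.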

In [32]:
train_data = DatasetFeat.merge_trainset(train_data, loaded_data_info, merge_behavior=True)  # use loaded_data_info
eval_data = DatasetFeat.merge_evalset(eval_data, loaded_data_info)

train_data.build_negative_samples(loaded_data_info, seed=2022)  # use loaded_data_info
eval_data.build_negative_samples(loaded_data_info, seed=2222)

random neg item sampling elapsed: 0.346s
random neg item sampling elapsed: 0.091s


Then we construct a new model and call its `rebuild_model` method to assign the old variables to the new model.

In [33]:
tf.compat.v1.reset_default_graph()  # need to reset graph in TensorFlow 1

In [34]:
new_model = WideDeep(
    task="ranking",
    data_info=loaded_data_info,  # pass loaded_data_info
    embed_size=16,
    n_epochs=2,
    loss_type="cross_entropy",
    lr={"wide": 0.005, "deep": 1e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)

new_model.rebuild_model(path="model_path", model_name="wide_deep", full_assign=True)

total params: 194,228 | embedding params: 166,860 | network params: 27,368


Finally, the training and recommendation parts are the same as before.

In [35]:
new_model.fit(
    train_data, 
    verbose=2, 
    shuffle=True, 
    eval_data=eval_data,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

Training start time: 2023-02-20 19:48:06


train: 100%|█████████████████████████████████████████████████| 391/391 [00:04<00:00, 96.49it/s]


Epoch 1 elapsed: 4.258s
	 train_loss: 2.2819


eval_pointwise: 100%|█████████████████████████████████████████| 25/25 [00:00<00:00, 117.99it/s]


	 eval log_loss: 1.4160
	 eval roc_auc: 0.8295


eval_listwise: 100%|██████████████████████████████████████| 2981/2981 [00:16<00:00, 176.73it/s]


	 eval precision@10: 0.0938
	 eval recall@10: 0.0672
	 eval ndcg@10: 0.2890


train: 100%|████████████████████████████████████████████████| 391/391 [00:03<00:00, 102.77it/s]


Epoch 2 elapsed: 4.012s
	 train_loss: 1.4704


eval_pointwise: 100%|█████████████████████████████████████████| 25/25 [00:00<00:00, 142.56it/s]


	 eval log_loss: 0.9665
	 eval roc_auc: 0.8387


eval_listwise: 100%|██████████████████████████████████████| 2981/2981 [00:18<00:00, 162.72it/s]


	 eval precision@10: 0.0918
	 eval recall@10: 0.0649
	 eval ndcg@10: 0.2828


In [38]:
new_model.recommend_user(user=1, n_rec=3)

{1: array([2858, 1580,  260])}

In [39]:
new_model.recommend_user(user=[1, 2, 3], n_rec=3)

{1: array([2858, 1580,  260]),
 2: array([ 260, 1196, 1580]),
 3: array([1580, 2858,  589])}

**This completes our tutorial!**

+ For more examples, see the [examples](https://github.com/massquantity/LibRecommender/tree/master/examples) folder on GitHub. 

+ For more usage details, see the [User Guide](https://librecommender.readthedocs.io/en/latest/user_guide/index.html).

+ For serving a trained model, please head to [Python Serving Guide](https://librecommender.readthedocs.io/en/latest/serving_guide/python.html).