<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

<i>This notebook has been taken from Microsoft's recommender system library: [Source](https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/lightgcn_deep_dive.ipynb). It has been modified to fit the context of our study. The modifications include the addition of new evaluation methods, the ability to add new datasets, and a cluster validation process.</i>

# LightGCN - simplified GCN model for recommendation

This notebook serves as an introduction to LightGCN [1], which is a simple, linear and neat Graph Convolution Network (GCN) [3] model for recommendation.
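
To make "simple and linear" concrete, here is a minimal NumPy sketch of the propagation rule described in [1]: each layer multiplies the embeddings by the normalized adjacency matrix, with no weight matrices and no non-linearity, and the final embedding is the uniform average over all layers. The function name `lightgcn_propagate` and the toy graph are illustrative assumptions, not part of the `recommenders` library.

```python
import numpy as np

def lightgcn_propagate(norm_adj, emb0, n_layers):
    """Average the layer-0 embeddings with n_layers rounds of linear propagation."""
    layers = [emb0]
    for _ in range(n_layers):
        # Linear neighbourhood aggregation: no weight matrix, no activation.
        layers.append(norm_adj @ layers[-1])
    # Uniform layer weights 1 / (n_layers + 1), as suggested in [1].
    return np.mean(layers, axis=0)

# Toy graph (assumption): 2 users, 2 items; nodes ordered [u0, u1, i0, i1].
R = np.array([[1.0, 0.0],
              [1.0, 1.0]])
A = np.block([[np.zeros((2, 2)), R],
              [R.T, np.zeros((2, 2))]])
deg = A.sum(axis=1)
norm_adj = A / np.sqrt(np.outer(deg, deg))  # D^{-1/2} A D^{-1/2}

rng = np.random.default_rng(0)
emb0 = rng.normal(size=(4, 3))              # random 3-dimensional embeddings
final_emb = lightgcn_propagate(norm_adj, emb0, n_layers=3)
scores = final_emb[:2] @ final_emb[2:].T    # user-by-item ranking scores
```

Ranking scores are plain inner products between the final user and item embeddings, so the only trainable parameters in this sketch are the layer-0 embeddings.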

## 0 Global Settings and Imports

In [1]:
import sys
import os
import codecs
import scrapbook as sb
import pandas as pd
import numpy as np
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.utils.timer import Timer
from recommenders.models.deeprec.models.graphrec.lightgcn import LightGCN
from recommenders.models.deeprec.DataModel.ImplicitCF import ImplicitCF
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.utils.constants import SEED as DEFAULT_SEED
from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.evaluation.python_evaluation import (
    map_at_k,
    ndcg_at_k,
    precision_at_k,
    recall_at_k,
    serendipity,
    user_serendipity,
    user_item_serendipity,
    catalog_coverage
)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))
os.chdir('../')

System version: 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:28:38) [MSC v.1929 64 bit (AMD64)]
Pandas version: 1.5.3
Tensorflow version: 2.12.0


In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
EPOCHS = 50
BATCH_SIZE = 1024

SEED = DEFAULT_SEED  # Set None for non-deterministic results

yaml_file = "./models/lightgcn/config/lightgcn.yaml"
user_file = "./models/lightgcn/output/tests/user_embeddings.csv"
item_file = "./models/lightgcn/output/item_embeddings.csv"

In [3]:
ratio = 0.85
dataset = 'ml-100k'
dataset_path = os.path.join('datasets', dataset)

ratings_path = os.path.join(dataset_path, 'u.data')
# Read the tab-separated ratings file directly; the 'rU' open mode is deprecated and was removed in Python 3.11
df = pd.read_csv(ratings_path, sep='\t', names=('userID', 'itemID', 'rating', 'timestamp'), encoding='utf-8')

# Normal train/test split (random portion) 
# train, test = python_stratified_split(df, ratio=ratio)
train_df = pd.read_csv('output/exp-2/train.csv')
test_df = pd.read_csv('output/exp-2/test.csv')

In [4]:
clusters = pd.read_csv('./output/exp-2/group_clusters.csv', usecols=['user_id', 'group', 'group_group'])
train_clusters = train_df.reset_index().merge(clusters, left_on='userID', right_on='user_id').drop(columns=['user_id'])

# Target group cluster (we iterate over all of them in every run)
target_group = 9
target_group_df = train_clusters[train_clusters['group_group'] == target_group]
train = target_group_df[['userID', 'itemID', 'rating', 'timestamp']]

# Choose only ratings that can be predicted
users_in_train = list(set(train.userID.to_list()))
test = test_df[test_df.userID.isin(users_in_train)]

In [5]:
print("total users in main dataset:", len(list(set(df.userID.to_list()))))
print("total users in train dataset:", len(list(set(train.userID.to_list()))))
print("total users in test dataset:", len(list(set(test.userID.to_list()))))
# train = train.set_index('index')
# test = test.set_index('index')

total users in main dataset: 943
total users in train dataset: 95
total users in test dataset: 95


In [6]:
train

Unnamed: 0,userID,itemID,rating,timestamp
25483,293,77,2,888907210
25484,293,386,2,888908065
25485,293,1226,3,888905198
25486,293,566,3,888907312
25487,293,815,2,888905122
...,...,...,...,...
36055,387,10,4,886481228
36056,387,676,1,886480733
36057,387,180,4,886479737
36058,387,215,2,886483906


In [7]:
set(train_clusters[train_clusters['group_group'] == 0].group.to_list())

{0, 1, 2, 3, 5, 6, 9, 10, 12, 13, 15, 17, 18, 19}

### 2.2 Process data

`ImplicitCF` is a class that initializes and loads data for the training process. During initialization, user IDs and item IDs are reindexed, ratings greater than zero are converted into implicit positive interactions, and the adjacency matrix $R$ of the user-item graph is created. Some important methods of `ImplicitCF` are:

`get_norm_adj_mat` loads the normalized adjacency matrix of the user-item graph if it already exists in `adj_dir`; otherwise it calls `create_norm_adj_mat` to create the matrix, and saves it if `adj_dir` is not `None`. This method is called during the initialization of the LightGCN model.

`create_norm_adj_mat` creates the normalized adjacency matrix of the user-item graph by calculating $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D$ is the diagonal degree matrix of $A$ and $\mathbf{A}=\left(\begin{array}{cc}\mathbf{0} & \mathbf{R} \\ \mathbf{R}^{T} & \mathbf{0}\end{array}\right)$.

`train_loader` generates a batch of training data: it samples a batch of users and then samples one positive item and one negative item for each user. This method is called before each epoch of the training process.
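
The reindexing, implicit conversion, and normalization steps described above can be sketched on a toy example as follows. This is a dense NumPy illustration under stated assumptions, not the `ImplicitCF` implementation: the helper name `build_norm_adj` and the four sample interactions are hypothetical, and a real dataset would need sparse matrices.

```python
import numpy as np

def build_norm_adj(user_ids, item_ids, ratings):
    """Build D^{-1/2} A D^{-1/2} for the bipartite user-item graph (dense, toy-sized)."""
    # Reindex raw IDs to contiguous integers 0..n-1.
    users, u_idx = np.unique(user_ids, return_inverse=True)
    items, i_idx = np.unique(item_ids, return_inverse=True)
    n_users, n_items = len(users), len(items)

    # Ratings greater than zero become implicit positive interactions.
    R = np.zeros((n_users, n_items))
    positive = np.asarray(ratings) > 0
    R[u_idx[positive], i_idx[positive]] = 1.0

    # A = [[0, R], [R^T, 0]]
    A = np.zeros((n_users + n_items, n_users + n_items))
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T

    # D^{-1/2}, leaving isolated nodes (degree 0) at zero instead of dividing by zero.
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Hypothetical interactions: two users, three distinct items.
norm_adj = build_norm_adj(
    user_ids=[293, 293, 387, 387],
    item_ids=[77, 386, 77, 10],
    ratings=[2, 2, 4, 1],
)
```

Each non-zero entry equals $1/\sqrt{d_u d_i}$ for a connected user $u$ and item $i$, so the matrix stays symmetric and popular nodes are down-weighted.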


In [8]:
data = ImplicitCF(train=train, test=test, seed=SEED)

  df = train if test is None else train.append(test)


### 2.3 Prepare hyper-parameters

Important parameters of the `LightGCN` model are:

`data`, the initialized `ImplicitCF` data object.

`epochs`, number of epochs for training.

`n_layers`, number of layers of the model.

`eval_epoch`, if it is not None, evaluation metrics will be calculated on test set every "eval_epoch" epochs. In this way, we can observe the effect of the model during the training process.

`top_k`, the number of items to be recommended for each user when calculating ranking metrics.

A complete list of parameters can be found in `yaml_file`. We use `prepare_hparams` to read the YAML file and prepare a full set of parameters for the model. Keyword arguments passed to `prepare_hparams` override the corresponding YAML settings.
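
The override behaviour can be pictured as a dictionary merge in which explicitly passed values win over file defaults. The sketch below is only an illustration of that idea: `merge_hparams` and the default values are hypothetical and do not reproduce the internals of `prepare_hparams` or the contents of `lightgcn.yaml`.

```python
def merge_hparams(yaml_defaults, **overrides):
    """Flatten sectioned YAML-style defaults, then let keyword arguments win."""
    merged = {}
    for section in yaml_defaults.values():
        merged.update(section)
    # Explicit keyword arguments replace the values read from the file.
    merged.update(overrides)
    return merged

# Hypothetical defaults standing in for the contents of a YAML file.
yaml_defaults = {
    "model": {"n_layers": 2, "embed_size": 64},
    "train": {"epochs": 1000, "learning_rate": 0.001, "batch_size": 1024},
}
hparams_sketch = merge_hparams(yaml_defaults, n_layers=3, epochs=50, learning_rate=0.005)
print(hparams_sketch["n_layers"], hparams_sketch["epochs"], hparams_sketch["embed_size"])
# prints: 3 50 64
```

Settings that are not passed explicitly, such as `embed_size` here, keep the value from the file.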

In [9]:
hparams = prepare_hparams(
    yaml_file,
    n_layers=3,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    learning_rate=0.005,
    eval_epoch=5,
    top_k=TOP_K,
)

### 2.4 Create and train model

With data and parameters prepared, we can create the LightGCN model.

To train the model, we simply need to call the `fit()` method.

In [10]:
model = LightGCN(hparams, data, seed=SEED)

Already create adjacency matrix.
Already normalize adjacency matrix.
Using xavier initialization.


In [11]:
with Timer() as train_time:
    model.fit()

print("Took {} seconds for training.".format(train_time.interval))

Epoch 1 (train)0.3s: train loss = 0.68596 = (mf)0.68590 + (embed)0.00006
Epoch 2 (train)0.2s: train loss = 0.64831 = (mf)0.64822 + (embed)0.00009
Epoch 3 (train)0.2s: train loss = 0.55893 = (mf)0.55876 + (embed)0.00017
Epoch 4 (train)0.2s: train loss = 0.46079 = (mf)0.46050 + (embed)0.00029
Epoch 5 (train)0.2s + (eval)0.1s: train loss = 0.38865 = (mf)0.38821 + (embed)0.00044, recall = 0.14805, ndcg = 0.27333, precision = 0.21789, map = 0.07778
Epoch 6 (train)0.2s: train loss = 0.35513 = (mf)0.35455 + (embed)0.00058
Epoch 7 (train)0.2s: train loss = 0.32355 = (mf)0.32286 + (embed)0.00069
Epoch 8 (train)0.2s: train loss = 0.31360 = (mf)0.31282 + (embed)0.00078
Epoch 9 (train)0.2s: train loss = 0.29454 = (mf)0.29368 + (embed)0.00086
Epoch 10 (train)0.2s + (eval)0.0s: train loss = 0.27835 = (mf)0.27742 + (embed)0.00093, recall = 0.15259, ndcg = 0.28639, precision = 0.23158, map = 0.08051
Epoch 11 (train)0.2s: train loss = 0.26733 = (mf)0.26632 + (embed)0.00101
Epoch 12 (train)0.2s: train l

### 2.5 Recommendation and Evaluation

Recommendation and evaluation have been performed on the specified test set during training. After training, we can also use the model to perform recommendation and evaluation on other data. Here we still use `test` as test data, but `test` can be replaced by other data with a similar structure.

#### 2.5.1 Recommendation

We can call `recommend_k_items` to recommend k items for each user passed to this function. We set `remove_seen=True` to remove the items already seen by the user. The function returns a dataframe containing each user, the top k items recommended to them, and the corresponding ranking scores.
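
As a rough picture of what top-k selection with seen-item removal involves, the following NumPy sketch masks items a user has already interacted with and then keeps the k highest-scoring remaining items. The function `top_k_unseen` and the toy score matrix are assumptions made for illustration; this is not the code behind `recommend_k_items`.

```python
import numpy as np

def top_k_unseen(scores, seen, k):
    """Return (item indices, scores) of the k best unseen items per user."""
    masked = np.where(seen, -np.inf, scores)       # seen items can never be selected
    top_idx = np.argsort(-masked, axis=1)[:, :k]   # best score first
    top_scores = np.take_along_axis(masked, top_idx, axis=1)
    return top_idx, top_scores

# Toy data: 2 users x 4 items; True marks an item seen during training.
scores = np.array([[0.9, 0.1, 0.5, 0.7],
                   [0.2, 0.8, 0.6, 0.4]])
seen = np.array([[True, False, False, False],
                 [False, True, False, False]])
top_idx, top_scores = top_k_unseen(scores, seen, k=2)
# top_idx is [[3, 2], [2, 3]]: each user's best-scoring item is skipped because it was seen
```

Without the mask, both users would simply be recommended their highest-scoring training items again.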

In [12]:
topk_scores = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)
topk_scores.head()

Unnamed: 0,userID,itemID,prediction
0,293,186,6.825802
1,293,385,6.763488
2,293,89,6.660824
3,293,64,6.643375
4,293,12,6.426495


#### 2.5.2 Evaluation

With `topk_scores` predicted by the model, we can evaluate how LightGCN performs on this test set.
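
For intuition about the ranking metrics, here is a small pure-Python sketch of precision@k and recall@k under their common definitions: hits divided by k, and hits divided by the number of relevant test items, each averaged over users. The helper `precision_recall_at_k` and the toy dictionaries are hypothetical, and the library functions used in this notebook may differ in details such as tie handling.

```python
def precision_recall_at_k(relevant, recommended, k):
    """Average precision@k and recall@k over the users present in `relevant`."""
    precisions, recalls = [], []
    for user, truth in relevant.items():
        # A hit is a recommended item that also appears in the user's test items.
        hits = len(set(recommended.get(user, [])[:k]) & truth)
        precisions.append(hits / k)
        recalls.append(hits / len(truth))
    n = len(relevant)
    return sum(precisions) / n, sum(recalls) / n

# Toy data: user 1 has three relevant test items, user 2 has one.
relevant = {1: {10, 20, 30}, 2: {40}}
recommended = {1: [10, 50, 60, 20], 2: [70, 80, 90, 100]}
precision, recall = precision_recall_at_k(relevant, recommended, k=4)
# precision = (2/4 + 0/4) / 2 = 0.25, recall = (2/3 + 0/1) / 2 = 1/3
```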

In [13]:
import json

eval_map = map_at_k(test, topk_scores, k=TOP_K)
eval_ndcg = ndcg_at_k(test, topk_scores, k=TOP_K)
eval_precision = precision_at_k(test, topk_scores, k=TOP_K)
eval_recall = recall_at_k(test, topk_scores, k=TOP_K)
eval_serendipity = serendipity(train, topk_scores)
eval_coverage = catalog_coverage(train, topk_scores)

metric_results = {
    'MAP': eval_map,
    'NDCG': eval_ndcg,
    'Precision': eval_precision,
    'Recall': eval_recall,
    'User Serendipity': eval_serendipity,
    'Coverage': eval_coverage
}

print(json.dumps(metric_results, indent=4))
with open("./output/exp-2/metric_results.txt", "w") as fp:
    json.dump(metric_results, fp, indent=4)

{
    "MAP": 0.09923318545049116,
    "NDCG": 0.3346981388531298,
    "Precision": 0.2715789473684211,
    "Recall": 0.17059628342104388,
    "User Serendipity": 0.6721069444636192,
    "Coverage": 0.23121387283236994
}


In [14]:
# load clusters if not previously loaded
# clusters = pd.read_csv('./output/exp-2/group_clusters.csv', usecols=['user_id', 'group'])

# get per-user serendipity score
eval_serendipity = user_serendipity(train, topk_scores)

# calculate per-cluster serendipity score
eval_serendipity_clusters = clusters.merge(eval_serendipity, left_on='user_id', right_on='userID').drop(columns=['userID'])
cluster_serendipity = eval_serendipity_clusters.groupby('group').mean()
cluster_serendipity[['user_serendipity']].to_csv('./output/exp-2/cluster_serendipity.csv')

In [15]:
eval_serendipity.to_csv('./output/exp-2/user_serendipity.csv', index=False)

In [16]:
# `df` holds the raw ratings and has no `group` column; the group labels live in `clusters`
set(clusters.group.to_list())

### 2.6 Infer embeddings

With the `infer_embedding` method of the LightGCN model, we can export the embeddings of users and items in the training set to CSV files for future use.

In [None]:
model.infer_embedding(user_file, item_file)

## 3. Compare LightGCN with SAR and NCF

The table below compares the performance of LightGCN with [SAR](../00_quick_start/sar_movielens.ipynb) and [NCF](../00_quick_start/ncf_movielens.ipynb) on the MovieLens 100k and 1m datasets. The method of data loading and splitting is the same as that described above, and the GPU used was a GeForce GTX 1080Ti.

Settings common to the three models: `epochs=15, seed=42`.

Settings for LightGCN: `embed_size=64, n_layers=3, batch_size=1024, decay=0.0001, learning_rate=0.015`.

Settings for SAR: `similarity_type="jaccard", time_decay_coefficient=30, time_now=None, timedecay_formula=True`.

Settings for NCF: `n_factors=4, layer_sizes=[16, 8, 4], batch_size=1024, learning_rate=0.001`.

| Data Size | Model    | Training time | Recommending time | MAP@10   | nDCG@10  | Precision@10 | Recall@10 |
| --------- | -------- | ------------- | ----------------- | -------- | -------- | ------------ | --------- |
| 100k      | LightGCN | 27.8865       | 0.6445            | 0.129236 | 0.436297 | 0.381866     | 0.205816  |
| 100k      | SAR      | 0.4895        | 0.1144            | 0.110591 | 0.382461 | 0.330753     | 0.176385  |
| 100k      | NCF      | 116.3174      | 7.7660            | 0.105725 | 0.387603 | 0.342100     | 0.174580  |
| 1m        | LightGCN | 396.7298      | 1.4343            | 0.075012 | 0.377501 | 0.345679     | 0.128096  |
| 1m        | SAR      | 4.5593        | 2.8357            | 0.060579 | 0.299245 | 0.270116     | 0.104350  |
| 1m        | NCF      | 1601.5846     | 85.4567           | 0.062821 | 0.348770 | 0.320613     | 0.108121  |

From the above results, we can see that LightGCN achieves the highest MAP@10, nDCG@10, Precision@10, and Recall@10 on both data sizes, while SAR has by far the shortest training time.

### References: 
1. Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang & Meng Wang, LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation, 2020, https://arxiv.org/abs/2002.02126
2. LightGCN implementation [TensorFlow]: https://github.com/kuandeng/lightgcn
3. Thomas N. Kipf and Max Welling, Semi-Supervised Classification with Graph Convolutional Networks, ICLR, 2017, https://arxiv.org/abs/1609.02907
4. Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua, Neural Graph Collaborative Filtering, SIGIR, 2019, https://arxiv.org/abs/1905.08108
5. Y. Koren, R. Bell and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", in Computer, vol. 42, no. 8, pp. 30-37, Aug. 2009, doi: 10.1109/MC.2009.263.  url: https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf