<a href="https://colab.research.google.com/github/itsMaherrr/recommendation-project/blob/main/projet_recommandation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Last name:** MEBIROUK

**First name:** Maher

**Program:** MLSD FI

# Project: Recommender System with LM Embeddings
## Project Title
**DeepSemanticLightGCN: Hybrid Graph Recommendation with BERT Initialization**

## Objective
This notebook implements the course project solution: a **Hybrid Recommendation System** that leverages Language Model embeddings within the **Cornac** framework.
The approach bridges **Graph Neural Networks (LightGCN)** with **Natural Language Processing (BERT)**. To mitigate the "Cold Start" problem, each item's plot text is encoded with a BERT-based sentence encoder (`all-MiniLM-L6-v2`), compressed by an **Autoencoder** to the LightGCN embedding size, and used to initialize the item embeddings of the Collaborative Filtering graph.

## Scope & Deliverables
This work focuses on applied machine learning and fulfills the project requirements via:
- **Hybrid Modeling:** Combining ID-based learning with content-based LM features.
- **Advanced Architecture:** Custom implementation of `DeepSemanticLightGCN` initialized with semantic weights.
- **Comparative Analysis:** Benchmarking against a standard Matrix Factorization baseline and a plain LightGCN.
- **Required Metrics:** Evaluation focuses on **Recall@10**, **NDCG@10** and **Precision@10**; MAE, RMSE, AUC and MAP are also computed for reference.
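
To make the three headline metrics concrete, here is a minimal sketch of their textbook definitions for a single user with binary relevance. The function name `metrics_at_k` and the toy item IDs are illustrative assumptions; Cornac's own implementations may handle edge cases differently.

```python
import math

def metrics_at_k(ranked_items, relevant_items, k=10):
    """Return (precision, recall, ndcg) at cutoff k for a single user."""
    relevant = set(relevant_items)
    # 1 where a recommended item is relevant, 0 otherwise
    hits = [1 if item in relevant else 0 for item in ranked_items[:k]]
    num_hits = sum(hits)

    precision = num_hits / k
    recall = num_hits / len(relevant) if relevant else 0.0

    # DCG discounts each hit by log2(position + 1), positions starting at 1
    dcg = sum(hit / math.log2(pos + 2) for pos, hit in enumerate(hits))
    # ideal DCG: all relevant items (up to k) ranked first
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# toy example with hypothetical item IDs
precision, recall, ndcg = metrics_at_k(["a", "b", "c", "d"], {"a", "c", "z"}, k=4)
print(f"P@4={precision:.4f} R@4={recall:.4f} NDCG@4={ndcg:.4f}")
```

Precision rewards a clean top-k list, Recall rewards covering the user's relevant items, and NDCG additionally rewards placing hits near the top.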

## How to Run
1. **Environment:** Google Colab (Recommended) or Jupyter Notebook.
2. **Hardware:** **Important:** Select a GPU Runtime (e.g., T4 GPU) to accelerate BERT encoding and GNN training.
3. **Execution:** Run all cells sequentially (`Runtime → Run all`).
4. **Note:** The "Semantic Preparation" step is self-contained but may take a few minutes to encode text data.
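
Before running the heavier cells, it may help to confirm that a GPU runtime is actually active. The helper below is a small sketch (the name `pick_device` is an assumption, not part of the notebook) that falls back to the CPU when PyTorch is missing or no GPU is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet (e.g. before the setup cells have run)
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running on: {device}")
```

If this prints `cpu` on Colab, the runtime type can be changed from the `Runtime` menu before re-running the notebook.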

## Expected Outputs
Upon execution, the notebook produces:
- **Semantic Vector Space:** Compressed text embeddings via the custom Autoencoder.
- **Trained Hybrid Model:** The `DeepSemanticLightGCN` with pre-injected weights.
- **Comparative Leaderboard:** A table evaluating all models specifically on **Recall@10**, **Precision@10** and **NDCG@10**.
- **Visualizations:** Tables summarizing model performance and showing the impact of semantic injection.

## Setting up the environment

In [None]:
!pip uninstall -y dgl torch torchdata

Found existing installation: torch 2.9.0+cu126
Uninstalling torch-2.9.0+cu126:
  Successfully uninstalled torch-2.9.0+cu126
Found existing installation: torchdata 0.11.0
Uninstalling torchdata-0.11.0:
  Successfully uninstalled torchdata-0.11.0


In [None]:
# !pip install dgl==2.4.0 -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html

In [None]:
!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
!pip install torchdata==0.7.1
!pip install dgl==2.4.0 -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html
!pip3 install git+https://github.com/PreferredAI/cornac.git

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.5.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
Collecting torchvision==0.20.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp312-cp312-linux_x86_64.whl (7.3 MB)
Collecting torchaudio==2.5.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (3.4 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.5.1)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_nvrtc_cu

In [None]:
!pip uninstall -y sentence-transformers transformers huggingface-hub tokenizers
!pip install sentence-transformers==2.7.0 transformers==4.41.2

Found existing installation: sentence-transformers 5.2.0
Uninstalling sentence-transformers-5.2.0:
  Successfully uninstalled sentence-transformers-5.2.0
Found existing installation: transformers 4.57.3
Uninstalling transformers-4.57.3:
  Successfully uninstalled transformers-4.57.3
Found existing installation: huggingface-hub 0.36.0
Uninstalling huggingface-hub-0.36.0:
  Successfully uninstalled huggingface-hub-0.36.0
Found existing installation: tokenizers 0.22.2
Uninstalling tokenizers-0.22.2:
  Successfully uninstalled tokenizers-0.22.2
Collecting sentence-transformers==2.7.0
  Downloading sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
Collecting transformers==4.41.2
  Downloading transformers-4.41.2-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub>=0.15.1 (from sentence-transformers==2.7.0)
  Downloading huggingface_hub-1.3.2-py3-none-a

## Importing the necessary libraries

In [None]:
import cornac
from cornac.data import Dataset
from cornac.eval_methods import RatioSplit
from cornac.metrics import MAE, RMSE, Precision, Recall, NDCG, AUC, MAP
from cornac.models import Recommender, MF, LightGCN
import torch
import torch.nn as nn
import torch.optim as optim
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm, trange
from cornac.models.lightgcn.lightgcn import Model, construct_graph

import numpy as np
import pandas as pd

## Creating the custom model (LightGCN + BERT)

### Creating the autoencoder (the one we're going to use in the LightGCN model)

In [None]:
class Autoencoder(nn.Module):
    def __init__(self, input_dim=384, encoding_dim=64):
        super(Autoencoder, self).__init__()
        # encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, encoding_dim),
        )
        # decoder
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

### Creating the custom LightGCN model

In [None]:
class DeepSemanticLightGCN(LightGCN):
    def __init__(self, name="Deep_LightGCN_BERT", emb_size=64, num_epochs=50, id_to_text_map=None, **kwargs):
        super().__init__(name=name, emb_size=emb_size, num_epochs=num_epochs, verbose=True, **kwargs)
        self.id_to_text_map = id_to_text_map
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')
        self.target_dim = emb_size
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.ae = Autoencoder(input_dim=384, encoding_dim=self.target_dim).to(self.device)

    def compress_embeddings(self, bert_vectors):
        print(f"[{self.name}] Training Autoencoder to compress embeddings...")

        # convert to pytorch tensor
        data_tensor = torch.FloatTensor(bert_vectors).to(self.device)

        # initialize autoencoder
        optimizer = optim.Adam(self.ae.parameters(), lr=1e-3)
        criterion = nn.MSELoss()

        # quick training loop
        self.ae.train()
        for epoch in range(75):
            optimizer.zero_grad()
            encoded, decoded = self.ae(data_tensor)
            loss = criterion(decoded, data_tensor)
            loss.backward()
            optimizer.step()

        # extract the encoded vectors
        self.ae.eval()
        with torch.no_grad():
            compressed_vectors, _ = self.ae(data_tensor)

        print(f"[{self.name}] Compression complete. Loss: {loss.item():.4f}")
        return compressed_vectors.cpu().numpy()

    def fit(self, train_set, val_set=None, freeze_epochs=5):
        # prepare semantic embeddings
        print(f"[{self.name}] Preparing Deep Semantic Embeddings...")

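        # map Cornac's internal item indices back to raw item IDs, then look up
        # each item's plot; items without a plot fall back to an empty string
        text_map = self.id_to_text_map or {}
        idx_to_raw = {v: k for k, v in train_set.iid_map.items()}
        texts = [text_map.get(idx_to_raw.get(idx), "") for idx in range(train_set.num_items)]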

        bert_vectors = self.bert.encode(
            texts, convert_to_numpy=True, show_progress_bar=True, normalize_embeddings=True
        )

        node_features = self.compress_embeddings(bert_vectors)

        if node_features.shape != (train_set.num_items, self.emb_size):
            print(f"[{self.name}] Shape mismatch: {node_features.shape} != {(train_set.num_items, self.emb_size)}. Skipping injection.")
            pretrained_weights = None
        else:
            pretrained_weights = torch.tensor(node_features, dtype=torch.float32).to(self.device)
            print(f"[{self.name}] Semantic embeddings ready for injection.")

        # build LightGCN model
        if not self.trainable:
            return self

        # graph
        graph = construct_graph(train_set, train_set.num_users, train_set.num_items).to(self.device)

        # model initialization
        model = Model(
            graph,
            self.emb_size,
            self.num_layers,
            self.lambda_reg,
        ).to(self.device)

        self.num_users = train_set.num_users
        self.num_items = train_set.num_items
        self.min_rating = train_set.min_rating
        self.max_rating = train_set.max_rating

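        # inject semantic embeddings
        item_param_name = None  # stays None if nothing is injected, so the freeze step below is a no-op
        if pretrained_weights is not None:
            injected = False
            # the first parameter with shape (num_items, emb_size) is taken to be the item
            # embedding table; this would be ambiguous if num_users == num_items
            for name, param in model.named_parameters():
                if param.shape == pretrained_weights.shape:
                    with torch.no_grad():
                        param.copy_(pretrained_weights)
                    print(f"[{self.name}] Injected semantic weights into '{name}'")
                    injected = True
                    item_param_name = name  # save for freezing
                    break
            if not injected:
                print(f"[{self.name}] Could not find matching parameter for injection.")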

        # optimizer
        optimizer = torch.optim.Adam(model.parameters(), lr=self.learning_rate)

        # training loop
        pbar = trange(self.num_epochs, desc="Training", unit="epoch", leave=True, disable=not self.verbose)
        for epoch_idx in pbar:
            model.train()
            accum_loss = 0.0

            batch_iterator = train_set.uij_iter(batch_size=self.batch_size, shuffle=True)
            total_batches = train_set.num_batches(self.batch_size)

            for batch_u, batch_i, batch_j in tqdm(batch_iterator, total=total_batches, desc=f"Epoch {epoch_idx+1}", leave=False, disable=not self.verbose):
                batch_u = torch.tensor(batch_u).to(self.device)
                batch_i = torch.tensor(batch_i).to(self.device)
                batch_j = torch.tensor(batch_j).to(self.device)

                # freeze semantic embeddings for first few epochs
                if pretrained_weights is not None and epoch_idx < freeze_epochs:
                    for name, param in model.named_parameters():
                        if name == item_param_name:
                            param.requires_grad = False

                u_g_embeddings, pos_i_g_embeddings, neg_i_g_embeddings = model(graph, batch_u, batch_i, batch_j)

                batch_loss, _, _ = model.loss_fn(u_g_embeddings, pos_i_g_embeddings, neg_i_g_embeddings)
                accum_loss += batch_loss.cpu().item() * len(batch_u)

                optimizer.zero_grad()
                batch_loss.backward()
                optimizer.step()

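                # restore requires_grad after the optimizer step; the parameter is frozen again
                # at the start of the next batch while epoch_idx < freeze_epochs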
                if pretrained_weights is not None and epoch_idx < freeze_epochs:
                    for name, param in model.named_parameters():
                        if name == item_param_name:
                            param.requires_grad = True

            accum_loss /= len(train_set.uir_tuple[0])
            pbar.set_postfix(loss=accum_loss)

        # store final embeddings
        model.eval()
        with torch.no_grad():
            u_embs, i_embs, _ = model(graph)
            self.U = u_embs.cpu().detach().numpy().astype(np.float32)
            self.V = i_embs.cpu().detach().numpy().astype(np.float32)

        print(f"[{self.name}] Training finished. Embeddings saved.")
        return self
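
The training loop above feeds the model triplets of a user, an item the user interacted with, and a sampled item the user did not interact with. A common objective for such triplets is the pairwise BPR loss; the sketch below illustrates that general technique with plain dot-product scores and hand-picked toy vectors. It is an illustration under those assumptions, not Cornac's `loss_fn`, which also returns separate regularization terms.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """Pairwise loss: softplus(neg_score - pos_score) = -log(sigmoid(pos - neg))."""
    diff = dot(user_vec, neg_item_vec) - dot(user_vec, pos_item_vec)
    # numerically stable softplus: log(1 + exp(diff))
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))

user = [0.5, 1.0]
liked = [1.0, 1.0]      # item the user interacted with
sampled = [-1.0, 0.0]   # sampled item the user did not interact with

loss_good = bpr_loss(user, liked, sampled)   # positive scored above negative
loss_bad = bpr_loss(user, sampled, liked)    # ordering reversed
print(f"{loss_good:.4f} < {loss_bad:.4f}")
```

The loss shrinks as the positive item's score rises above the negative's, so the objective optimizes ordering directly rather than rating values.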


### Preparing the data

In [None]:
ratings = cornac.datasets.movielens.load_feedback(
    fmt="UIR",
    variant="1m"
)

Data from http://files.grouplens.org/datasets/movielens/ml-1m.zip
will be cached into /root/.cornac/ml-1m/ratings.dat



Unzipping ...
File cached!


In [None]:
plots, item_ids = cornac.datasets.movielens.load_plot()

Data from https://static.preferred.ai/cornac/datasets/movielens/ml_plot.zip
will be cached into /root/.cornac/movielens/ml_plot.dat



Unzipping ...
File cached!


In [None]:
id_to_plot = dict(zip(item_ids, plots))

print(f"Data Loaded: {len(ratings)} ratings and {len(id_to_plot)} plots")

Data Loaded: 1000209 ratings and 10076 plots
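
The custom model's `fit` method looks up each training item's plot by raw ID and falls back to an empty string when none exists. The sketch below reproduces that alignment on toy data and counts the items without text. All IDs and plots here are hypothetical stand-ins for `item_ids`, `plots` and `train_set.iid_map`.

```python
# hypothetical raw IDs and plots, standing in for `item_ids` and `plots`
id_to_plot = {"m1": "A heist goes wrong.", "m2": "Two friends travel."}
# hypothetical raw-ID -> internal-index map, playing the role of `train_set.iid_map`
iid_map = {"m2": 0, "m3": 1, "m1": 2}

# invert the map so that position idx in `texts` matches internal item index idx
idx_to_raw = {idx: raw for raw, idx in iid_map.items()}
texts = [id_to_plot.get(idx_to_raw[idx], "") for idx in range(len(iid_map))]
missing = [idx_to_raw[idx] for idx, text in enumerate(texts) if not text]

print(texts)
print(f"{len(missing)} item(s) without a plot: {missing}")
```

Running the same count on the real data would show how many items receive the empty-string fallback and therefore share one starting text vector.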


## Defining the evaluation metrics

In [None]:
metrics = [
    MAE(),
    RMSE(),
    Precision(k=10),
    Recall(k=10),
    NDCG(k=10),
    AUC(),
    MAP()
]

## Filtering the data before the train/test split (we keep only ratings of 4 or higher)

In [None]:
high_ratings = [r for r in ratings if r[2] >= 4.0]
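
The filter runs before the split, so both the training and the test data contain only ratings of 4 or higher, and with `rating_threshold=4.0` every test interaction counts as a positive. According to the split summary, 460224 + 115015 = 575239 of the 1000209 ratings remain (about 57.5%). A toy version of the same filter, with hypothetical tuples in the same (user, item, rating) layout:

```python
# hypothetical (user, item, rating) tuples in the same UIR layout as `ratings`
toy_ratings = [("u1", "i1", 5.0), ("u1", "i2", 2.0), ("u2", "i1", 4.0), ("u2", "i3", 3.0)]

threshold = 4.0
high = [r for r in toy_ratings if r[2] >= threshold]
kept_share = len(high) / len(toy_ratings)

print(high)
print(f"kept {kept_share:.0%} of the interactions")
```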

In [None]:
hrs = RatioSplit(
    data=high_ratings,
    test_size=0.2,
    rating_threshold=4.0,
    seed=123,
    verbose=True
)

rating_threshold = 4.0
exclude_unknowns = True
---
Training data:
Number of users = 6038
Number of items = 3494
Number of ratings = 460224
Max rating = 5.0
Min rating = 4.0
Global mean = 4.4
---
Test data:
Number of users = 6038
Number of items = 3494
Number of ratings = 115015
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 6038
Total items = 3494


**(skip the following section if you want to re-train the models from scratch)**

## Generate the results of a pre-trained checkpoint

In [None]:
!pip install -q gdown
!mkdir -p models
!gdown --folder "https://drive.google.com/drive/folders/1rl7ybV0DE529dZYSqKV1ToMJJCDymEM_" -O models -q


In [None]:
mf_load = MF.load('models/MF/')
gcn_load = LightGCN.load('models/LightGCN')
gcn_bert_load = LightGCN.load('models/LightGCN + Text')

  return torch.load(io.BytesIO(b))


In [None]:
import pandas as pd
import sys

pd.set_option('display.width', 10000)
pd.set_option('display.float_format', '{:.4f}'.format)

loaded_models = [mf_load, gcn_load, gcn_bert_load]
results = {}

for model in loaded_models:
    result, _ = hrs.evaluate(
        model=model,
        metrics=metrics,
        user_based=True
    )
    results[model.name] = result.metric_avg_results

df = pd.DataFrame(results).T
print(df)

sys.exit()  # intended to stop "Run all" here so the retraining cells below are skipped


[MF] Training started!

[MF] Evaluation started!

Rating:   0%|          | 0/115015 [00:00<?, ?it/s]

Ranking:   0%|          | 0/5991 [00:00<?, ?it/s]

[LightGCN] Training started!

[LightGCN] Evaluation started!

[LightGCN + Text] Training started!
[LightGCN + Text] Preparing Deep Semantic Embeddings...
[LightGCN + Text] Training Autoencoder to compress embeddings...
[LightGCN + Text] Compression complete. Loss: 0.0009
[LightGCN + Text] Semantic embeddings ready for injection.

[LightGCN + Text] Evaluation started!

                   MAE   RMSE    AUC    MAP  NDCG@10  Precision@10  Recall@10  Train (s)  Test (s)
MF              0.3987 0.4539 0.8062 0.0367   0.0559        0.0555     0.0385     0.0004   18.5099
LightGCN        0.5264 0.6893 0.9396 0.1809   0.2804        0.2289     0.1692     0.0003   13.3925
LightGCN + Text 0.5253 0.6886 0.9407 0.1842   0.2843        0.2325     0.1711     6.5114   13.1339


## Training the models from scratch

### Defining the models

In [None]:
mf = MF(
    k=64,
    max_iter=50,
    learning_rate=1e-3,
    use_bias=False
)

gcn_basic = LightGCN(
    name="LightGCN",
    emb_size=64,
    num_epochs=50,
    learning_rate=5e-3,
    verbose=True,
    seed=123
)

deep_model = DeepSemanticLightGCN(
    name="LightGCN + Text",
    emb_size=64,
    num_epochs=50,
    learning_rate=0.005,
    id_to_text_map=id_to_plot,
    seed=123
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Training and comparing the models

In [None]:
cornac.Experiment(
    eval_method=hrs,
    models=[mf, gcn_basic, deep_model],
    metrics=metrics,
    user_based=True,
    verbose=True
).run()


[MF] Training started!


  0%|          | 0/50 [00:00<?, ?it/s]

Optimization finished!

[MF] Evaluation started!


Rating:   0%|          | 0/115015 [00:00<?, ?it/s]

Ranking:   0%|          | 0/5991 [00:00<?, ?it/s]


[LightGCN] Training started!


Training:   0%|          | 0/50 [00:00<?, ?iter/s]

Epoch:   0%|          | 0/450 [00:00<?, ?it/s]



[LightGCN] Evaluation started!


Rating:   0%|          | 0/115015 [00:00<?, ?it/s]

Ranking:   0%|          | 0/5991 [00:00<?, ?it/s]


[LightGCN + Text] Training started!
[LightGCN + Text] Preparing Deep Semantic Embeddings...


Batches:   0%|          | 0/110 [00:00<?, ?it/s]

[LightGCN + Text] Training Autoencoder to compress embeddings...
[LightGCN + Text] Compression complete. Loss: 0.0018
[LightGCN + Text] Semantic embeddings ready for injection.
[LightGCN + Text] Injected semantic weights into 'feature_dict.item'


Training:   0%|          | 0/50 [00:00<?, ?epoch/s]

Epoch 1:   0%|          | 0/450 [00:00<?, ?it/s]


[LightGCN + Text] Training finished. Embeddings saved.

[LightGCN + Text] Evaluation started!


Rating:   0%|          | 0/115015 [00:00<?, ?it/s]

Ranking:   0%|          | 0/5991 [00:00<?, ?it/s]


TEST:
...
                |    MAE |   RMSE |    AUC |    MAP | NDCG@10 | Precision@10 | Recall@10 | Train (s) | Test (s)
--------------- + ------ + ------ + ------ + ------ + ------- + ------------ + --------- + --------- + --------
MF              | 0.3987 | 0.4539 | 0.8062 | 0.0367 |  0.0559 |       0.0555 |    0.0385 |    1.7069 |  16.5537
LightGCN        | 0.5264 | 0.6893 | 0.9396 | 0.1809 |  0.2804 |       0.2289 |    0.1692 |  729.2545 |  13.7777
LightGCN + Text | 0.5253 | 0.6886 | 0.9407 | 0.1842 |  0.2843 |       0.2325 |    0.1711 |  716.3847 |  13.1059



### Saving the models locally

In [None]:
path = 'models/'
mf_path = mf.save(path)
gcn_path = gcn_basic.save(path)
gcn_bert_path = deep_model.save(path)

print(f'Matrix Factorization model saved to {mf_path}')
print(f'LightGCN model saved to {gcn_path}')
print(f'LightGCN + Text model saved to {gcn_bert_path}')

MF model is saved to models/MF/2026-01-15_16-07-24-103238.pkl
LightGCN model is saved to models/LightGCN/2026-01-15_16-07-24-126231.pkl
LightGCN + Text model is saved to models/LightGCN + Text/2026-01-15_16-07-24-140692.pkl
Matrix Factorization model saved to models/MF/2026-01-15_16-07-24-103238.pkl
LightGCN model saved to models/LightGCN/2026-01-15_16-07-24-126231.pkl
LightGCN + Text model saved to models/LightGCN + Text/2026-01-15_16-07-24-140692.pkl
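
As the output above shows, each save writes a timestamped `.pkl` file under `models/<model name>/`. Because the timestamps sort lexicographically, the newest checkpoint in a folder can be found by sorting file names. The helper below is a hypothetical convenience, not part of Cornac, and the demo uses made-up file names in a temporary folder.

```python
import tempfile
from pathlib import Path

def latest_checkpoint(model_dir):
    """Return the newest `.pkl` in model_dir, relying on sortable timestamp names."""
    candidates = sorted(Path(model_dir).glob("*.pkl"))
    return candidates[-1] if candidates else None

# demo in a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "LightGCN + Text"
    folder.mkdir()
    for stamp in ("2026-01-14_09-00-00-000001", "2026-01-15_16-07-24-140692"):
        (folder / f"{stamp}.pkl").touch()
    newest = latest_checkpoint(folder).name

print(newest)
```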


### Saving the models to google drive (checkpoint)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!mkdir -p "/content/drive/MyDrive/Recommendation/MF"
!mkdir -p "/content/drive/MyDrive/Recommendation/LightGCN"
!mkdir -p "/content/drive/MyDrive/Recommendation/LightGCN + Text"

In [None]:
!cp -r models/MF/* "/content/drive/MyDrive/Recommendation/MF/"
!cp -r models/LightGCN/* "/content/drive/MyDrive/Recommendation/LightGCN/"
!cp -r "models/LightGCN + Text"/* "/content/drive/MyDrive/Recommendation/LightGCN + Text/"

## Analysis of Results

### Comparison with Standard Baseline (MF)
The most striking result is the large performance gap between standard **Matrix Factorization (MF)** and the graph-based models.
* **Recall@10:** Our hybrid model achieved **0.1711** compared to MF's **0.0385**, roughly 4.4 times higher (a relative improvement of about 344%).
* **NDCG@10:** Ranking quality rose from **0.0559** to **0.2843** (about 5.1 times higher), indicating that the graph architecture orders items far better for the user.
* *Interpretation:* Part of the gap likely comes from the training objective. Cornac's MF fits observed rating values, and after filtering every remaining rating lies between 4 and 5, so it has little signal for ranking unseen items (which is also consistent with its lower MAE and RMSE). LightGCN is instead trained on (user, positive item, negative item) triplets with a pairwise ranking loss and propagates signals through the user-item graph.

### Impact of Semantic Injection (Text)
Comparing **LightGCN** vs. **LightGCN + Text**, we observe a small but consistent improvement across the ranking metrics when initializing with BERT embeddings:
* **Gain:** Recall@10 improved from 0.1692 to **0.1711**, and NDCG@10 from 0.2804 to **0.2843** (relative gains of about 1.1% and 1.4%).
* **Significance:** The gap is far smaller than the jump from MF and comes from a single run with one seed, so it is consistent with, rather than proof of, the hypothesis that initializing the latent space with semantic knowledge (item descriptions) helps the model distinguish items better than ID embeddings alone.
* **Cold Start:** The split summary reports no unknown users or items in the test data, so cold-start items are not evaluated directly. The marginal gain only suggests that text embeddings may help "fill in the gaps" where structural information (the interaction graph) is missing or noisy.
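
The relative gains discussed in the two comparisons above can be recomputed directly from the results table. The snippet below is a small arithmetic check; the values are copied from the table and the helper name `relative_gain` is an illustrative assumption.

```python
# Recall@10 and NDCG@10 copied from the results table above
results = {
    "MF": {"Recall@10": 0.0385, "NDCG@10": 0.0559},
    "LightGCN": {"Recall@10": 0.1692, "NDCG@10": 0.2804},
    "LightGCN + Text": {"Recall@10": 0.1711, "NDCG@10": 0.2843},
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

hybrid = results["LightGCN + Text"]
gains = {
    baseline: {m: relative_gain(hybrid[m], results[baseline][m]) for m in hybrid}
    for baseline in ("MF", "LightGCN")
}
for baseline, per_metric in gains.items():
    for metric, gain in per_metric.items():
        print(f"{metric} vs {baseline}: {gain:+.1f}%")
```

The gain over MF is in the hundreds of percent, while the gain from adding text over plain LightGCN is between one and two percent on both metrics.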

## Conclusion
In this project, we successfully designed and benchmarked a **Hybrid Recommender System** combining **BERT** and **LightGCN** within the **Cornac** framework.

**Key Findings:**
1.  **Architecture:** On this benchmark, LightGCN substantially outperforms the Matrix Factorization baseline on all ranking metrics.
2.  **Semantics:** Injecting language model embeddings yields a small improvement in both retrieval (Recall) and ranking (NDCG), suggesting that text descriptions carry useful signal for recommendation.
3.  **Objectives Met:** The final model achieves a strong **Recall@10 of 17.11%** and **NDCG@10 of 28.43%**, successfully demonstrating the synergy between NLP and Collaborative Filtering.