# Introduction to PEARLM: Code Explanation



This notebook is designed as an introductory guide to help students understand the basic concepts needed to implement a Path Language Model using the Hopwise library. Through the example of the PEARLM (Path-Enhanced Autoregressive Language Model) [[1]](#r1) model, it provides a practical explanation of how to build an explainable recommendation system based on knowledge graphs.



The notebook is organized into two main sections:
- **1️⃣ Understanding the PEARLM Class: Code Workflow**: This section explains the implementation of the PEARLM class, including its core functions and architecture.
- **2️⃣ Decoding Stage: Knowledge Graph-Constrained Decoding (KGCD)**: This section describes the decoding process constrained by the knowledge graph.

### ⚙️ Setup Workspace

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1️⃣ Understanding the PEARLM Class: Code Workflow

This section outlines the core functions in the PEARLM model class.

<div style="background-color:#f0f4f8; border-left: 5px solid #4a90e2; padding:15px; margin:10px 0; border-radius:8px;">
<strong> PEARLM </strong> is a path-language-modeling recommender. It learns the sequence of entity-relation triplets from a knowledge graph as a next-token prediction task.</div>

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/pearlm-arch.png" alt="PEARLM Architecture" width="900" height="200">


**Table of Contents:**

**📦 [0. Packages](#S0)**
- Import necessary libraries and modules

**🧱 [1. Model Initialization (`__init__`)](#S1)**
- Initializes the PEARLM model as a GPT-2-style transformer, built from the distilGPT2 configuration with custom settings.
- Defines a fixed sequence template of token types (e.g., `[<SPECIAL>, <ENTITY>, <RELATION>, <ENTITY>, ..., <SPECIAL>]`).
- Defines the loss function for next-token prediction.

**🔁 [2. Forward Pass (`forward`)](#S2)**
- Defines how the model processes an input during:
    - `training` (when `labels` are given): it also calculates the cross-entropy loss.
    - `inference` (when generating predictions): it directly returns the predicted tokens.

**🔮 [3. Prediction Interface (`predict`)](#S3)**
- A convenience method for inference that internally calls `forward()`.


**✨ [4. Generate Interface (`generate`)](#S4)**
- Responsible for generating sequences of tokens based on the input provided, incorporating custom constraints like Knowledge Graph-Constrained Decoding (KGCD).


<a name="S0"></a>
## 📦 0. Packages

In [None]:
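# Standard library
from enum import Enum

# typing
from typing import Optional, Union

# NumPy (used by the KGCD logits processor in Section 2)
import numpy as np

# PyTorch
import torch
from torch import nn

# HuggingFace library
from transformers import AutoConfig, GPT2LMHeadModel, LogitsProcessor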
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

# Hopwise library
from hopwise.data import Interaction
from hopwise.model.abstract_recommender import ExplainablePathLanguageModelingRecommender
from hopwise.utils import PathLanguageModelingTokenType

<a id="S1"></a>
## 🧱 1. Model Initialization (`__init__`)

Internally, hopwise manages the token types used by language models through an enumeration defined as follows, which PEARLM uses to assign a unique ID to each token type:
- S => SPECIAL
- E => ENTITY (any KG node is labelled with this token type, including users and items)
- R => RELATION

In [None]:
class PathLanguageModelingTokenType(Enum):
    """Type of tokens in paths for Path Language Modeling.

    - ``SPECIAL``: Special tokens, like start and end of a path.
    - ``ENTITY``: Entity tokens.
    - ``RELATION``: Relation tokens.
    - ``USER``: User tokens.
    - ``ITEM``: Item tokens.
    """

    SPECIAL = ("S", 0)
    ENTITY = ("E", 1)
    RELATION = ("R", 2)
    USER = ("U", 3)
    ITEM = ("I", 4)

In [None]:

class PEARLM(ExplainablePathLanguageModelingRecommender, GPT2LMHeadModel):
    def __init__(self, config, dataset):

        #  Initialize the constructor of the parent class
        ExplainablePathLanguageModelingRecommender.__init__(self, config, dataset)

        # Load settings from the configuration parameters
        self.use_kg_token_types = config["use_kg_token_types"]

        # Load the pre-initialized tokenizer from the dataset
        tokenizer = dataset.tokenizer

        # Configure the causal language model (based on distilGPT2) with custom settings
        transformers_config = AutoConfig.from_pretrained(
            "distilgpt2",
            **{
                "vocab_size": self.n_tokens,  # Number of distinct tokens that `input_ids` can contain; shortcut for len(tokenizer)
                "n_ctx": config["context_length"],  # Maximum number of tokens the model can attend to in one sequence
                "n_positions": config["context_length"],  # Maximum number of positions (same as context length)
                "pad_token_id": tokenizer.pad_token_id,  # ID of the padding token
                "bos_token_id": tokenizer.bos_token_id,  # ID of the beginning-of-sequence token
                "eos_token_id": tokenizer.eos_token_id,  # ID of the end-of-sequence token
                "n_embd": config["embedding_size"],  # Embedding dimension
                "n_head": config["num_heads"], # Number of attention heads
                "n_layer": config["num_layers"], # Number of transformer layers
            },
        )

        # Initialize the model with the architecture defined in "transformers_config"
        GPT2LMHeadModel.__init__(self, transformers_config)

        # Move the model to the configured device (CPU or GPU)
        self.to(config["device"])

        # Path Language Modeling Token Type Template Definition
        # This block sets up a template to differentiate between SPECIAL, ENTITY, and RELATION tokens in the KG path.
        # It augments the tokenizer with three special tokens: <SPECIAL>, <ENTITY>, and <RELATION>,
        # which will later be used to assign semantic roles to each position in the input path sequence.

        # If you require the model to use token types
        if self.use_kg_token_types:

            # Save current vocabulary size before adding the new token types
            prev_vocab_size = len(dataset.tokenizer)

            # Unpack each enumeration value in type (str) and type_id (int) for the token type definition
            spec_type, spec_type_id = PathLanguageModelingTokenType.SPECIAL.value
            ent_type, ent_type_id = PathLanguageModelingTokenType.ENTITY.value
            rel_type, rel_type_id = PathLanguageModelingTokenType.RELATION.value

            # Define the tokens types to be added to the tokenizer: SPECIAL, ENTITY, RELATION
            # These are used to identify the semantic role of each token in the path (e.g., user, movie, acted_in)
            # Example: "<ENTITY>" might correspond to "Tom Hanks", "<RELATION>" to "acted_in"
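            # Iterate over the enum members (not the unpacked strings) so that `.name` is available
            token_types = [
                f"<{token_type.name}>"
                for token_type in (
                    PathLanguageModelingTokenType.SPECIAL,
                    PathLanguageModelingTokenType.ENTITY,
                    PathLanguageModelingTokenType.RELATION,
                )
            ]  # <SPECIAL>, <ENTITY>, <RELATION>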


            # Add the type tokens to the tokenizer so they can be referenced like normal tokens
            for token_type in token_types:
                dataset.tokenizer.add_tokens(token_type)

            # Update the vocabulary size registered by the attribute `n_tokens` after adding new tokens
            self.n_tokens = len(dataset.tokenizer)


            # Retrieve the corresponding integer token IDs of the added token types
            # Since we added them after the original vocabulary, we shift their integer type IDs accordingly
            spec_type, ent_type, rel_type = (
                spec_type_id + prev_vocab_size,
                ent_type_id + prev_vocab_size,
                rel_type_id + prev_vocab_size,
            )

            # Build a fixed type pattern (template) for a KG path:
            # e.g., With path_hop_length = 2, the structure would be:
            # [<SPECIAL>, <ENTITY>, <RELATION>, <ENTITY>, <RELATION>, <ENTITY>, <SPECIAL>]
            # This will be used to inform the model which kind of token is expected at each position
            self.token_type_ids = torch.LongTensor(
                [spec_type, ent_type] + [rel_type, ent_type] * dataset.path_hop_length + [spec_type]
            ).to(config["device"])

            # Resize the transformer's token embedding matrix to include the new token types we just defined
            self.transformer.resize_token_embeddings(len(dataset.tokenizer))

        # Define the loss function used during training: Cross Entropy for next-token prediction
        self.loss = nn.CrossEntropyLoss()

        # Final setup step required by HuggingFace models to complete module initialization
        self.post_init()



    def predict(self, input_ids, **kwargs):
        return self.forward(input_ids, **kwargs)


    def generate(self, inputs, **kwargs):
        kwargs["logits_processor"] = self.logits_processor_list
        kwargs["num_return_sequences"] = kwargs.pop("paths_per_user")
        return super(GPT2LMHeadModel, self).generate(**inputs, **kwargs)


#### Class Definition

In [None]:
class PEARLM(ExplainablePathLanguageModelingRecommender, GPT2LMHeadModel):
    def __init__(self, config, dataset):

This class inherits from:

- **`ExplainablePathLanguageModelingRecommender`**: An abstract class from the hopwise library that unifies the `PathLanguageModelingRecommender` and `ExplainableRecommender` interfaces.
  All the path-language-modeling models that generate explanations should implement this class.
  It provides methods and logic for:
  - modeling user-item interactions as paths in the graph
  - defining custom logit processors during inference
  - postprocessing path sequences to be fed into the evaluation pipeline
  - decoding token sequences into the explanation standard format
  - exposing the explainability interface

- **`GPT2LMHeadModel`**: the HuggingFace implementation of the GPT-2 language model, with a linear head on top for causal language modeling.  
  This enables the model to generate sequences token-by-token in an autoregressive manner ([Documentation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel)).

The `__init__` method of the `PEARLM` class is responsible for configuring and initializing the model.
It takes the following arguments:
- **`config`**: A configuration dictionary containing all necessary hyperparameters and settings for the model.
- **`dataset`**: A structured dataset object that includes the tokenizer, knowledge graph paths, and other relevant data required for training and inference.

----

#### Class Initialization

In [None]:
class PEARLM(ExplainablePathLanguageModelingRecommender, GPT2LMHeadModel):
    def __init__(self, config, dataset):

        # Load the pre-initialized tokenizer from the dataset
        tokenizer = dataset.tokenizer

        # Configure the causal language model (based on distilGPT2) with custom settings
        transformers_config = AutoConfig.from_pretrained("distilgpt2", **{...})

The line `transformers_config = AutoConfig.from_pretrained("distilgpt2", **{...})` creates a new model configuration that starts from the distilGPT2 configuration and overrides the settings listed below.

Args:

| Parameter              | Description |
|------------------------|-------------|
| `vocab_size`           | Number of unique tokens, including all KG entities, relations, and special tokens (like `[BOS]`, `[EOS]`, `[PAD]`). |
| `n_ctx`, `n_positions` | Maximum number of tokens in the input sequence (i.e., the maximum path length supported by the model). |
| `pad_token_id`         | ID of the padding token used to align sequences to the same length. |
| `bos_token_id`         | ID for the beginning-of-sequence token. It signals the start of a generated path. |
| `eos_token_id`         | ID for the end-of-sequence token. It marks the end of a reasoning path. |
| `n_embd`               | Dimensionality of the embeddings used for each token (i.e., the size of the hidden representation). |
| `n_head`               | Number of attention heads in each Transformer block. Determines how many attention subspaces are used. |
| `n_layer`              | Total number of Transformer layers (blocks) in the model. More layers = deeper model. |


---

#### Input Preparation: Token type sequence

This block sets up a **template** to differentiate between `SPECIAL`, `ENTITY`, and `RELATION` tokens in the KG path.

It augments the tokenizer with three special tokens: `<SPECIAL>`, `<ENTITY>`, and `<RELATION>`, which will later be used to assign semantic roles to each position in the input path sequence.

In [None]:
        # Path Language Modeling Token Type Template Definition
        # This block sets up a template to differentiate between SPECIAL, ENTITY, and RELATION tokens in the KG path.
        # It augments the tokenizer with three special tokens: <SPECIAL>, <ENTITY>, and <RELATION>,
        # which will later be used to assign semantic roles to each position in the input path sequence.

        # If you require the model to use token types
        if self.use_kg_token_types:

            # Save current vocabulary size before adding the new token types
            prev_vocab_size = len(dataset.tokenizer)

            # Unpack each enumeration value in type (str) and type_id (int) for the token type definition
            spec_type, spec_type_id = PathLanguageModelingTokenType.SPECIAL.value
            ent_type, ent_type_id = PathLanguageModelingTokenType.ENTITY.value
            rel_type, rel_type_id = PathLanguageModelingTokenType.RELATION.value

            # Define the tokens types to be added to the tokenizer: SPECIAL, ENTITY, RELATION
            # These are used to identify the semantic role of each token in the path (e.g., user, movie, acted_in)
            # Example: "<ENTITY>" might correspond to "Tom Hanks", "<RELATION>" to "acted_in"
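            # Iterate over the enum members (not the unpacked strings) so that `.name` is available
            token_types = [
                f"<{token_type.name}>"
                for token_type in (
                    PathLanguageModelingTokenType.SPECIAL,
                    PathLanguageModelingTokenType.ENTITY,
                    PathLanguageModelingTokenType.RELATION,
                )
            ]  # <SPECIAL>, <ENTITY>, <RELATION>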


            # Add the type tokens to the tokenizer so they can be referenced like normal tokens
            for token_type in token_types:
                dataset.tokenizer.add_tokens(token_type)

            # Update the vocabulary size registered by the attribute `n_tokens` after adding new tokens
            self.n_tokens = len(dataset.tokenizer)


            # Retrieve the corresponding integer token IDs of the added token types
            # Since we added them after the original vocabulary, we shift their integer type IDs accordingly
            spec_type, ent_type, rel_type = (
                spec_type_id + prev_vocab_size,
                ent_type_id + prev_vocab_size,
                rel_type_id + prev_vocab_size,
            )

            # Build a fixed type pattern (template) for a KG path:
            # e.g., With path_hop_length = 2, the structure would be:
            # [<SPECIAL>, <ENTITY>, <RELATION>, <ENTITY>, <RELATION>, <ENTITY>, <SPECIAL>]
            # This will be used to inform the model which kind of token is expected at each position
            self.token_type_ids = torch.LongTensor(
                [spec_type, ent_type] + [rel_type, ent_type] * dataset.path_hop_length + [spec_type]
            ).to(config["device"])

            # Resize the transformer's token embedding matrix to include the new token types we just defined
            self.transformer.resize_token_embeddings(len(dataset.tokenizer))
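
To make the ID arithmetic concrete, here is a minimal sketch in plain Python. The vocabulary size (`prev_vocab_size = 100`) and the hop length (`path_hop_length = 2`) are made-up values, `TokenType` is a toy copy of the first three enum members, and a plain list stands in for the `torch.LongTensor` used by the model.

```python
from enum import Enum


class TokenType(Enum):
    """Toy copy of the first three members of PathLanguageModelingTokenType."""

    SPECIAL = ("S", 0)
    ENTITY = ("E", 1)
    RELATION = ("R", 2)


prev_vocab_size = 100  # assumed tokenizer size before adding the type tokens
path_hop_length = 2    # assumed number of hops per path

# The three type tokens are appended to the vocabulary in enum order,
# so each one receives the ID prev_vocab_size + its integer type_id.
type_token_ids = {f"<{t.name}>": prev_vocab_size + t.value[1] for t in TokenType}

spec = type_token_ids["<SPECIAL>"]
ent = type_token_ids["<ENTITY>"]
rel = type_token_ids["<RELATION>"]

# Same pattern as in __init__: S, E, then (R, E) once per hop, then S.
token_type_template = [spec, ent] + [rel, ent] * path_hop_length + [spec]

print(type_token_ids)
print(token_type_template)
```

With these assumed values the template has `2 * path_hop_length + 3 = 7` positions, one per token of a two-hop path including the two special tokens.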

----
<a name="S2"></a>

## 🔁 2. Forward Pass (`forward`)

This function defines how the model processes an input during:

- `training` (when `labels` are given): it also calculates the cross-entropy loss.
- `inference` (when generating predictions): it directly returns the predicted tokens.

In [None]:
# PEARLM extends the GPT2LMHeadModel class.
# The "forward" method is overridden to include custom logic for handling type embeddings (ENTITY, RELATION, SPECIAL).
def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None, # Indices of input sequence tokens in the vocabulary.
        past_key_values: Optional[tuple[tuple[torch.Tensor]]] = None, # Cached (key,value) pairs to speed up autoregressive decoding
        attention_mask: Optional[torch.FloatTensor] = None, # Prevents the model from giving weight to padding tokens (PAD)
        token_type_ids: Optional[torch.LongTensor] = None, # Token type of each position; in PEARLM, the KG token types (SPECIAL, ENTITY, RELATION)
        position_ids: Optional[torch.LongTensor] = None,  # Positional indices for each token
        head_mask: Optional[torch.FloatTensor] = None, # Disable specific attention heads in the model
        inputs_embeds: Optional[torch.FloatTensor] = None, # Embedded representation of the input_ids
        encoder_hidden_states: Optional[torch.Tensor] = None, # Used in encoder-decoder settings: whether or not to provide the encoder output to the decoder
        encoder_attention_mask: Optional[torch.FloatTensor] = None, # Attention mask for encoder output (if used)
        labels: Optional[torch.LongTensor] = None, # Target tokens for next-token prediction during training
        use_cache: Optional[bool] = None, # Whether to cache (key, value) pairs to speed up generation
        output_attentions: Optional[bool] = None, # Whether to return attention weights (for interpretability/debugging)
        output_hidden_states: Optional[bool] = None, # Whether or not to return the hidden states of all layers. (for interpretability/debugging)
        return_dict: Optional[bool] = None, # Whether to return a `ModelOutput` object instead of a tuple
        **kwargs,  # Additional arguments for compatibility with HuggingFace Trainer
    ) -> Union[tuple, CausalLMOutputWithCrossAttentions]: # specifies the type of return

        # Handle custom input format: if input_ids is an Interaction object, extract its fields
        # Hopwise uses the Interaction class to manage and represent input data in a structured way.
        if isinstance(input_ids, Interaction):
            token_type_ids = input_ids["token_type_ids"]   # Type information: ENTITY, RELATION, SPECIAL
            attention_mask = input_ids["attention_mask"]   # Indicates valid tokens (non-PAD)
            input_ids = input_ids["input_ids"]             # Actual token indices

        # Align token_type_ids with the current length of input_ids
        # During generation (inference), input_ids grows one token at a time.
        # Since self.token_type_ids is a fixed-length template, we index it to match the current input length.
        # Example: if input_ids has length 4, we take the first 3 token types from the template and append the template's last one (<SPECIAL>).
        # This ensures token_type_ids has the same length as input_ids, preventing shape mismatches.
        # We then expand token_type_ids across the batch dimension.
        if self.use_kg_token_types:
            token_type_ids = self.token_type_ids[[*range(input_ids.shape[1] - 1), -1]]
            token_type_ids = token_type_ids.expand(input_ids.shape[0], -1)

        # Pass inputs to the GPT2 transformer
        transformer_outputs = self.transformer(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache and labels is None,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict or self.config.use_return_dict,
        )

        # Extract the last hidden states (one vector per token)
        sequence_output = transformer_outputs[0]

        # Apply the language modeling head to generate prediction scores (logits) for each token
        prediction_scores = self.lm_head(sequence_output)

        # Prepare to compute training loss if labels are provided
        lm_loss = None
        if labels is not None:

            # Shift logits and labels for next-token prediction (autoregressive)
            # Shift prediction scores and input ids by one
            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
            labels = labels[:, 1:].contiguous()

            # Compute cross-entropy loss between predicted logits and actual next tokens
            lm_loss = self.loss(
                shifted_prediction_scores.view(-1, self.config.vocab_size),
                labels.view(-1)
            )


        # Handle output formatting (tuple or dictionary) based on return_dict flag

        # --- return_dict=False ----
        # Return a tuple
        # training-mode: (loss, logits, hidden_states, attentions)
        # inference-mode: (logits, hidden_states, attentions)
        if not return_dict:
            output = (prediction_scores,) + transformer_outputs[2:]
            return ((lm_loss,) + output) if lm_loss is not None else output

        # --- return_dict=True ----
        # Return standard HuggingFace output object
        return CausalLMOutputWithCrossAttentions(
            loss=lm_loss,
            logits=prediction_scores,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
            cross_attentions=transformer_outputs.cross_attentions,
        )
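
The token-type alignment inside `forward` can be tried in isolation. The sketch below is a plain-Python stand-in for the tensor indexing `self.token_type_ids[[*range(n - 1), -1]]`; the template IDs (100, 101, 102) are made-up values and `align_token_types` is a hypothetical helper, not part of hopwise.

```python
# Assumed template for path_hop_length = 2 with made-up type-token IDs:
# 100 = <SPECIAL>, 101 = <ENTITY>, 102 = <RELATION>
template = [100, 101, 102, 101, 102, 101, 100]


def align_token_types(template, seq_len):
    """Return seq_len token types: the first seq_len - 1 template entries
    followed by the template's last entry."""
    indices = [*range(seq_len - 1), -1]
    return [template[i] for i in indices]


for seq_len in (2, 4, 7):
    print(seq_len, align_token_types(template, seq_len))
```

For a full-length input the result equals the template itself; for shorter inputs the last position always carries the final template type.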

#### Method Overriding

PEARLM extends the GPT2LMHeadModel class.

The `forward` method is overridden to include custom logic for handling type embeddings (`ENTITY`, `RELATION`, `SPECIAL`).

In [None]:
def forward(...Args...) -> Union[tuple, CausalLMOutputWithCrossAttentions]:

Args:

| Parameter               | Description |
|-------------------------|-------------|
| `input_ids`             | Indices of input sequence tokens in the vocabulary. |
| `past_key_values`       | Cached (key, value) pairs to speed up autoregressive decoding. |
| `attention_mask`        | Prevents the model from giving weight to padding tokens (PAD). |
| `token_type_ids`        | Token type of each position; in PEARLM, the KG token types (`SPECIAL`, `ENTITY`, `RELATION`). |
| `position_ids`          | Positional indices for each token. |
| `head_mask`             | Disables specific attention heads in the model. |
| `inputs_embeds`         | Embedded representation of the `input_ids`. |
| `encoder_hidden_states` | Used in encoder-decoder settings to provide the encoder output to the decoder. |
| `encoder_attention_mask`| Attention mask for the encoder output (if used). |
| `labels`                | Target tokens for next-token prediction during training. |
| `use_cache`             | Whether to use caching to speed up generation. |
| `output_attentions`     | Whether to return attention weights (useful for interpretability/debugging). |
| `output_hidden_states`  | Whether to return hidden states of all layers (for interpretability/debugging). |
| `return_dict`           | Whether to return a `ModelOutput` object instead of a tuple. |


#### Output Method

In [None]:
def forward(

    #--Args --#

    return_dict: Optional[bool] = None,

) -> Union[tuple, CausalLMOutputWithCrossAttentions]:

    #-- Other Code --#

    # Handle output formatting (tuple or dictionary) based on return_dict flag

    # --- return_dict=False ----
    # Return a tuple
    # training-mode: (loss, logits, hidden_states, attentions)
    # inference-mode: (logits, hidden_states, attentions)
    if not return_dict:
        output = (prediction_scores,) + transformer_outputs[2:]
        return ((lm_loss,) + output) if lm_loss is not None else output

    # --- return_dict=True ----
    # Return standard HuggingFace output object
    return CausalLMOutputWithCrossAttentions(
        loss=lm_loss,
        logits=prediction_scores,
        past_key_values=transformer_outputs.past_key_values,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
        cross_attentions=transformer_outputs.cross_attentions,
    )

#### Pass inputs to the Causal Language Model

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/pearlm-pass-input.png" alt="PEARLM Pass Input" width="450" height="250">


In [None]:
# Pass inputs to the GPT2 transformer
transformer_outputs = self.transformer(
    input_ids,
    past_key_values=past_key_values,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    encoder_hidden_states=encoder_hidden_states,
    encoder_attention_mask=encoder_attention_mask,
    use_cache=use_cache and labels is None,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict or self.config.use_return_dict,
)

#### Prediction score generation

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/pearlm-prediction-score-generation.png" alt="PEARLM Prediction Score Generation" width="450" height="250">


In [None]:
# Extract the last hidden states (one vector per token)
sequence_output = transformer_outputs[0]

# Apply the language modeling head to generate prediction scores (logits) for each token
prediction_scores = self.lm_head(sequence_output)

#### Training phase: Loss Calculation

In [None]:
# Prepare to compute training loss if labels are provided
lm_loss = None
if labels is not None:

    # Shift logits and labels for next-token prediction (autoregressive)
    # Shift prediction scores and input ids by one
    shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()  # Remove last token (usually EOS)
    labels = labels[:, 1:].contiguous()  # Remove first token (usually BOS)

    # Compute cross-entropy loss between predicted logits and actual next tokens
    lm_loss = self.loss(
        shifted_prediction_scores.view(-1, self.config.vocab_size),
        labels.view(-1)
    )

Suppose we have the following path sequence:

```python
input_ids = ["BOS", "Brad_Pitt", "acted_in", "Fight_Club", "EOS"]
```

The token at position `i` must predict the token at position `i+1`.

| Token in Input | Should Predict |
|----------------|----------------|
| BOS            | Brad_Pitt      |
| Brad_Pitt      | acted_in       |
| acted_in       | Fight_Club     |
| Fight_Club     | EOS            |

So the correct `labels` (i.e., the tokens to be predicted) are:
```python
labels    = ["Brad_Pitt", "acted_in", "Fight_Club", "EOS"]  # without BOS
```


No position predicts `[BOS]`, so we drop the first token of the labels, which shifts them one position to the left:
```python
labels = labels[:, 1:].contiguous()  # Keep the tokens from position 1 onward
```

Likewise, nothing follows `[EOS]`, so the prediction made at the last position has no target and we drop it from the prediction scores:
```python
shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
```
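
The shift can be checked end to end with a small sketch in plain Python. It uses a toy five-token vocabulary and all-zero logits (a uniform prediction), both made up for illustration, and a hand-written `cross_entropy` helper that stands in for `nn.CrossEntropyLoss` with its default mean reduction.

```python
import math

tokens = ["BOS", "Brad_Pitt", "acted_in", "Fight_Club", "EOS"]
vocab = {token: idx for idx, token in enumerate(tokens)}  # toy 5-token vocabulary
input_ids = [vocab[token] for token in tokens]

# One row of logits per input position; all zeros means a uniform prediction.
prediction_scores = [[0.0] * len(vocab) for _ in input_ids]

# Drop the last position of the scores and the first token of the labels.
shifted_scores = prediction_scores[:-1]
shifted_labels = input_ids[1:]


def cross_entropy(scores, labels):
    """Mean negative log-likelihood of the labels under softmax(scores)."""
    total = 0.0
    for row, label in zip(scores, labels):
        log_normalizer = math.log(sum(math.exp(score) for score in row))
        total += log_normalizer - row[label]
    return total / len(labels)


# Pair each input position with the token it is trained to predict.
pairs = [(tokens[i], tokens[label]) for i, label in enumerate(shifted_labels)]
loss = cross_entropy(shifted_scores, shifted_labels)
print(pairs)
print(loss)
```

The printed pairs reproduce the table above, and with uniform logits the loss equals `log(5)`, the cross-entropy of guessing among five tokens.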

---
<a id="S3"></a>
## 🔮 3. Prediction Interface (`predict`)

- A convenience method for inference that internally calls `forward()`.

In [None]:
# method to do inference
def predict(self, input_ids, **kwargs):
    return self.forward(input_ids, **kwargs)

---

<a id="S4"></a>
## ✨ 4. Generate Interface (`generate`)

The `generate` method is responsible for producing sequences of tokens based on the input provided. It overrides the HuggingFace `GPT2LMHeadModel`'s `generate` method by incorporating custom logic specific to the PEARLM model.

Key features of the `generate` method:
- **Logits Processor**: Applies a list of custom constraints (`logits_processor_list`) to the model's logits during generation. This ensures that the generated tokens adhere to specific rules, such as knowledge graph constraints. In PEARLM, the default logits processor is KGCD, implemented by the class `ConstrainedLogitsProcessorWordLevel`.
- **Paths per User**: Specifies the number of sequences to generate for each user during recommendation tasks.



In [None]:
def generate(self, inputs, **kwargs):
    kwargs["logits_processor"] = self.logits_processor_list
    kwargs["num_return_sequences"] = kwargs.pop("paths_per_user")
    return super(GPT2LMHeadModel, self).generate(**inputs, **kwargs)
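
How the keyword arguments are rewritten before delegation can be seen with stub classes. In this sketch `_BaseGenerator` and `ToyPathModel` are hypothetical stand-ins (not hopwise or HuggingFace classes), and the base `generate` simply returns the arguments it receives instead of generating anything.

```python
class _BaseGenerator:
    """Hypothetical stand-in for the parent class that implements generation."""

    def generate(self, **kwargs):
        # A real implementation would decode tokens; here we just echo the arguments.
        return kwargs


class ToyPathModel(_BaseGenerator):
    logits_processor_list = ["kgcd-placeholder"]  # placeholder for the real processors

    def generate(self, inputs, **kwargs):
        # Always inject the model's own logits processors.
        kwargs["logits_processor"] = self.logits_processor_list
        # Rename the recommender-level option to the name the parent expects.
        kwargs["num_return_sequences"] = kwargs.pop("paths_per_user")
        return super().generate(**inputs, **kwargs)


received = ToyPathModel().generate(
    {"input_ids": [[1, 2]]}, paths_per_user=3, max_length=7
)
print(received)
```

Because `kwargs.pop("paths_per_user")` has no default, calling `generate` without `paths_per_user` raises a `KeyError`.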

---

<a id="S5"></a>

# 2️⃣ Decoding Stage: Knowledge Graph-Constrained Decoding (KGCD)

Decoding in causal language models (CLM), such as GPT-2, generates a sequence one token at a time. In PEARLM,
this process is enhanced with the Knowledge Graph-Constrained Decoding (KGCD) method.

<div style="background-color: #fff8e1; border-left: 5px solid #ffc107; padding: 16px 20px; margin: 15px 0; border-radius: 8px; box-shadow: 0 2px 5px rgba(0,0,0,0.05); font-family: sans-serif; line-height: 1.6;">
  <p style="margin: 0;">
    <b>💡</b> The objective of the <b>Knowledge Graph-Constrained Decoding </b> in PEARLM is to ensure that the generated explanation paths strictly adhere to the actual structure of the Knowledge Graph.
  </p>
</div>

Specifically, KGCD works by modifying the decoding step in causal language models so that only valid next tokens (i.e., entities or relations that are logically and structurally reachable according to the KG) are considered during generation. If a token is not valid in the KG context, its logit (score) is set to negative infinity, so its probability after the softmax is zero and it cannot be selected.

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/pearlm-kgcd-2.png" alt="PEARLM KGCD 2" width="600" height="250">
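
The masking idea can be sketched without any library. The toy knowledge graph, vocabulary, and logits below are invented for illustration, and `mask_invalid_tokens` is a hypothetical helper; it is not the hopwise implementation, which operates on batched tensors.

```python
import math

# Toy vocabulary and KG (all names are made up for illustration).
vocab = ["user123", "watched", "liked", "Fight_Club", "Se7en", "Titanic"]
token_id = {token: idx for idx, token in enumerate(vocab)}

# (head entity, relation) -> set of reachable tail entities
kg = {
    ("user123", "watched"): {"Fight_Club", "Se7en"},
    ("user123", "liked"): {"Titanic"},
}


def mask_invalid_tokens(head, relation, logits):
    """Set the logit of every token not reachable via (head, relation) to -inf."""
    valid_ids = {token_id[tail] for tail in kg.get((head, relation), set())}
    return [
        score if idx in valid_ids else -math.inf
        for idx, score in enumerate(logits)
    ]


def softmax(logits):
    """Softmax over a list with at least one finite entry; -inf maps to 0."""
    max_logit = max(logits)
    exps = [math.exp(score - max_logit) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5, 0.3, 0.1, 3.0]  # made-up scores; "Titanic" scores highest
masked = mask_invalid_tokens("user123", "watched", logits)
probs = softmax(masked)
best = vocab[probs.index(max(probs))]
print(masked)
print(best)
```

Although "Titanic" has the highest raw score, it is not reachable from `(user123, watched)` in the toy KG, so the masked distribution can only select "Fight_Club" or "Se7en".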


The KGCD is implemented by the class [ConstrainedLogitsProcessorWordLevel](https://github.com/tail-unica/hopwise/blob/main/hopwise/model/logits_processor.py) declared in `hopwise/hopwise/model/logits_processor.py`.

In [None]:
class ConstrainedLogitsProcessorWordLevel(LogitsProcessor):
    # Extends HuggingFace's LogitsProcessor to implement KG-Constrained Decoding (KGCD),
    # which filters the logits based on the structure of the Knowledge Graph (KG)

    # ...........


    #######
    ## Some auxiliary functions
    ######


    # ...........

    # Main method
    def __call__(self, input_ids, scores):
        # input_ids: partial sequences generated so far.
        # scores: model logits for the next token.

        current_len = input_ids.shape[-1]  # current length of the sequence
        has_bos_token = self.is_bos_token_in_input(input_ids)  # True if the first token in the sequence is <bos>

        # generation step → apply KG-based constraints to the logits

        # Initially, consider all sequences generated so far (e.g., from beam search)
        # In the following steps (for recommendation tasks), only sequences with unique end-contexts will be selected

        ## Example: input_ids might contain 5 partial sequences like
        # [ [user123, watched], [user123, watched], [user123, liked], [user456, watched], [user123, watched] ]
        # In this case, the first, second, and last are identical in their last tokens
        unique_input_ids = input_ids

        # If we are in the recommendation task and not yet at the last generation step,
        # apply deduplication to optimize the computation of KG constraints
        if self.task == self.RECOMMENDATION_TASK and current_len < self.max_sequence_length - 1 - has_bos_token:

            # Determine whether the next token to generate is a relation or an entity:
            # - If the next token is a relation → only the last entity is needed (1 token) → (e.g., [user123]) → last_n_tokens = 1
            # - If the next token is an entity → the last 2 tokens are needed (entity, relation) → (e.g., [user123, watched]) → last_n_tokens = 2
            last_n_tokens = 2 if self.is_next_token_entity(input_ids) else 1

            # Apply deduplication: select unique sequences based only on the relevant last tokens (1 or 2)
            # This avoids recomputing the same mask for sequences that share the same context

            # np.unique (with return_index and return_inverse) returns:
            # - input_ids_indices → index of the first occurrence of each unique context
            # - input_ids_inv → for each original row, the index of the unique context it corresponds to
            #
            # Example (with last_n_tokens=2):
            #   input_ids[:, -2:] → [ [user123, watched], [user123, watched], [user123, liked], [user456, watched], [user123, watched] ]
            #   np.unique(...) → keeps only 3 unique ones: [ [user123, watched], [user123, liked], [user456, watched] ]
            #   input_ids_indices = [0, 2, 3]
            #   input_ids_inv = [0, 0, 1, 2, 0] → maps each original row to its corresponding unique row

            _, input_ids_indices, input_ids_inv = np.unique(
                input_ids.cpu().numpy()[:, -last_n_tokens:],  # relevant tokens for KG constraint
                axis=0,
                return_index=True,   # positions of unique sequences
                return_inverse=True  # maps each original row to its deduplicated sequence
            )

            # Update the sequences for which masks will be computed: only those with unique contexts
            # Example: unique_input_ids = input_ids[[0, 2, 3]]
            unique_input_ids = input_ids[input_ids_indices]

        # Initialize the token mask matrix: shape [#unique_sequences, vocab_size]
        full_mask = np.zeros((unique_input_ids.shape[0], len(self.tokenizer)), dtype=bool)


        # For each sequence, identify valid and invalid tokens for the next step
        # invalid tokens = True, valid tokens = False
        #
        # Example unique_input_ids:
        # [
        #   [user123, watched],     # idx = 0
        #   [user123, liked],       # idx = 1
        #   [user456, watched]      # idx = 2
        # ]
        # For each sequence, we query the KG to get the valid tokens to generate next
        for idx in range(unique_input_ids.shape[0]):
            if self.task == self.RECOMMENDATION_TASK:
                # Extracts the "key" from the context (last 1 or 2 tokens, e.g., [user123, watched])
                # Queries the KG to find which tokens are valid as next steps
                # Example:
                #   key = (user123, watched)
                #   candidate_tokens = [movieA, movieB]
                key, candidate_tokens = self.process_scores_rec(unique_input_ids, idx)

            elif self.task == self.LINK_PREDICTION_TASK:
                # Alternative case for link prediction (Under Development)
                key, candidate_tokens = self.process_scores_lp(unique_input_ids, idx)


            # Creates a boolean mask as long as the vocabulary:
            # - tokens NOT in candidate_tokens are set to True (invalid)
            # - tokens in candidate_tokens are set to False (valid)
            #
            # Example:
            #   vocab = [watched, liked, movieA, movieB, movieC]
            #   candidate_tokens = [movieA, movieB]
            #   banned_mask = [True, True, False, False, True]
            banned_mask = self.get_banned_mask(key, candidate_tokens)

            # Stores the mask in the corresponding row of full_mask
            # full_mask becomes a [unique_sequences x vocab] matrix with True for disallowed tokens
            full_mask[idx] = banned_mask


        # Apply the mask to the logits, setting banned tokens to -inf.
        # This prevents the model from generating tokens that are invalid according to the KG.
        # - If deduplication was applied (recommendation task, before the last generation step),
        #   each original sequence is remapped to its deduplicated mask row via input_ids_inv
        # - Otherwise, full_mask already has one row per sequence and is applied directly
        #
        # Example:
        # - input_ids has 5 original sequences
        # - unique_input_ids has 3 unique contexts
        # - input_ids_inv = [0, 0, 1, 2, 0] → maps each original row to its row in full_mask
        # - full_mask = [
        #     [True, True, False, False, True],   # context (user123, watched)
        #     [True, True, True, True, False],    # context (user123, liked)
        #     [True, True, False, True, True]     # context (user456, watched)
        #   ]
        # → scores[full_mask[input_ids_inv]] = -inf → mask applied to each original sequence
        if self.task == self.RECOMMENDATION_TASK and current_len < self.max_sequence_length - 1 - has_bos_token:
            scores[full_mask[input_ids_inv]] = -math.inf
        else:
            scores[full_mask] = -math.inf

        return scores
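
The deduplication and remapping steps in the comments above can be reproduced on their own. The following is a minimal sketch under stated assumptions: the integer token ids and the hand-written `full_mask` are hypothetical, and `scores` is a NumPy array here, whereas the class receives the model's logits. It only shows how `np.unique` with `return_index` and `return_inverse` lets three masks serve five sequences.

```python
import numpy as np

# Hypothetical integer ids standing in for tokens (not Hopwise's real vocabulary).
USER123, USER456, WATCHED, LIKED = 10, 11, 0, 1

# Five partial sequences; rows 0, 1, and 4 share the same (entity, relation) context.
input_ids = np.array([
    [USER123, WATCHED],
    [USER123, WATCHED],
    [USER123, LIKED],
    [USER456, WATCHED],
    [USER123, WATCHED],
])

# Deduplicate on the last two tokens of each row.
unique_rows, first_idx, inverse = np.unique(
    input_ids[:, -2:], axis=0, return_index=True, return_inverse=True
)
inverse = inverse.reshape(-1)  # guard for NumPy versions that return a non-flat inverse

# One mask row per unique context (True = banned), written by hand for this toy case.
# Columns: [watched, liked, movieA, movieB, movieC]
full_mask = np.array([
    [True, True, False, False, True],   # context (user123, watched)
    [True, True, True, True, False],    # context (user123, liked)
    [True, True, False, True, True],    # context (user456, watched)
])

# Remap the three masks onto the five original rows and ban the masked tokens.
scores = np.zeros((input_ids.shape[0], full_mask.shape[1]))
scores[full_mask[inverse]] = -np.inf

print(first_idx)  # [0 2 3]
print(inverse)    # [0 0 1 2 0]
print(scores[0])  # [-inf -inf   0.   0. -inf]
```

Only three KG lookups are needed instead of five, and `full_mask[inverse]` expands the result back to one mask row per original sequence.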


# References

<a name="r1">[1]</a> Balloccu, G., Boratto, L., Cancedda, C., Fenu, G., & Marras, M. (2023). Faithful Path Language Modelling for Explainable Recommendation over Knowledge Graph. arXiv preprint arXiv:2310.16452. https://arxiv.org/abs/2310.16452