# Knowledge Graph Embeddings

In this notebook, you'll learn how to compute knowledge graph embeddings using the Hopwise library.

What you’ll do:

* 1️⃣ Train a TransE knowledge graph embedding model

* 2️⃣ Visualize and interpret the learned embeddings




### ⚙️ Setup Workspace

1. Import the module needed to access Google Drive from Colab and mount your Google Drive in the Colab environment. This lets you access files and folders stored in your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

2. Install the hopwise library

In [None]:
%%capture
!uv pip install hopwise[cli,tsne]

3. Install the openTSNE library

In [None]:
!uv pip install openTSNE

4.  To view the installed libraries in the right sidebar, run the following command:

In [None]:
!ln -s /usr/local/lib/python3.11/dist-packages /content/dist-packages

5. To check if you are using the GPU, run the following code:

In [None]:
import torch
if torch.cuda.is_available():
    device_id = torch.cuda.current_device()
    device_name = torch.cuda.get_device_name(device_id)
    print(f"CUDA Device ID: {device_id}")
    print(f"CUDA Device Name: {device_name}")
else:
    print("No CUDA device is available.")

### 📦 0 - Packages

Let's import the required libraries.

In [None]:
import os
import pandas as pd
import torch

# Import required components from the hopwise library
from hopwise.quick_start import run_hopwise
from hopwise.config import Config
from hopwise.data import create_dataset
from hopwise.data.dataset import KnowledgeBasedDataset

###  📊 1 - Dataset Overview

To train the TransE model, we'll use the MovieLens 100k (ml-100k) dataset, which is integrated directly into the Hopwise library.


**🔎 About MovieLens 100k**


The MovieLens 100k dataset is a widely used benchmark for developing and evaluating recommender systems. It was collected by the GroupLens Research Project at the University of Minnesota and is recognized for its compact size and rich information.

There are multiple versions of the MovieLens dataset (e.g., ml-1m, ml-10m), but for the purposes of this notebook, we adopt the [ml-100k version](https://grouplens.org/datasets/movielens/100k/).

This version contains 100,000 ratings (on a 1–5 scale), contributed by 943 users on 1,682 movies, along with user demographic data.

**📊 Dataset Statistics**


| Name    | Dates   | Users | Movies | Ratings  |
|---------|---------|--------|--------|----------|
| ML 100K | '97–'98 | 943    | 1,682  | 100,000  |
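
As a quick sanity check on these figures, the sparsity of the rating matrix can be worked out by hand. The sketch below uses only the numbers from the table; the variable names are illustrative and not part of Hopwise.

```python
# Figures taken from the statistics table above.
num_users = 943
num_movies = 1_682
num_ratings = 100_000

# Every (user, movie) pair is a potential rating.
possible_pairs = num_users * num_movies

# Density is the fraction of pairs that were actually rated.
density = num_ratings / possible_pairs
sparsity = 1.0 - density

print(f"Possible user-movie pairs: {possible_pairs:,}")
print(f"Density:  {density:.4f}")
print(f"Sparsity: {sparsity:.4f}")
```

Roughly 6% of all possible user-movie pairs carry a rating, so about 94% of the matrix is empty.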


**📁 Dataset Files in Hopwise**

In the Hopwise repository, the dataset is located at: [hopwise/dataset_example/ml-100k](https://github.com/tail-unica/hopwise/tree/main/hopwise/dataset_example/ml-100k)

It consists of several key files:


| **File name** | **Content**                        | **Columns**                                |
|------------|------------------------------------|---------------------------------------------------|
| `.inter`   | User-item interaction              | `user_id, item_id, rating, timestamp`     |
| `.user`    | User features                       | `user_id, age, gender, occupation, zip_code`                            |
| `.item`    | Item features                       | `item_id, movie_title, release_year, class`                               |
| `.kg`      | Triplets in a knowledge graph      | `head_entity, relation, tail_entity`              |
| `.link`    | Item-entity linkage data           | `item_id, entity_id`                                 |


**⚙️ Dataset Loading and Summary**

In this section, we configure and load the **MovieLens 100k (ml-100k)** dataset. We also print key statistics about the dataset, including knowledge graph details.

In [None]:
# --------------------------------------------
# Define the configuration
# --------------------------------------------

# This configuration dictionary specifies:
# - The dataset to load ('ml-100k')
# - The KGE model to use ('TransE')
# - Which columns to load from the interaction and item files
config = {
    "dataset": 'ml-100k',
    "model": 'TransE',
    "load_col": {
        "inter": ["user_id", "item_id", "rating", "timestamp"],
        "item": ["item_id", "movie_title"]
    },
}

# Convert dictionary to a Config object required by hopwise
config = Config(config_dict=config)

In [None]:
# --------------------------------------------
# Load the dataset
# --------------------------------------------

# Create and load the dataset using the specified configuration
dataset = create_dataset(config)

# --------------------------------------------
# Print dataset-level statistics
# --------------------------------------------

print(f"📊 Dataset: {dataset.dataset_name}")
print(f"Number of users: {dataset.user_num}")
print(f"Number of items: {dataset.item_num}")
print(f"Number of interactions: {dataset.inter_num}")
# print(f"Number of movie titles: {dataset.num('movie_title')}")
# print(f"Sparsity: {dataset.sparsity:.6f}")

In [None]:
# -------------------------
# Knowledge Graph
# -------------------------
if isinstance(dataset, KnowledgeBasedDataset):
    print("\n📘 Knowledge Graph Statistics:")
    print(f"Number of entities: {dataset.entity_num}")
    print(f"Number of relations: {dataset.relation_num}")
    print(f"Number of KG triplets: {len(dataset.kg_feat)}")

    # Display raw triplets with IDs
    print("\n🔢 Sample KG Triplets (IDs):")
    kg_ids = dataset.kg_feat[['head_id', 'relation_id', 'tail_id']].head(5)
    display(kg_ids)

    # Decode and display triplets with string names
    decoded_kg = kg_ids.copy()
    decoded_kg['head_name'] = decoded_kg['head_id'].apply(lambda x: dataset.id2token('head_id', x))
    decoded_kg['relation_name'] = decoded_kg['relation_id'].apply(lambda x: dataset.id2token('relation_id', x))
    decoded_kg['tail_name'] = decoded_kg['tail_id'].apply(lambda x: dataset.id2token('tail_id', x))

    print("\n🧾 Sample KG Triplets (Names):")
    display(decoded_kg[['head_name', 'relation_name', 'tail_name']])

### 📚 2 - Introduction to Knowledge Graph Embeddings in Hopwise

The Hopwise library integrates **14 state-of-the-art Knowledge Graph Embedding (KGE) methods**, enabling reasoning over structured knowledge.

✅ Supported KGE methods:
- **Translational Distance Models**: `TransE`, `TransH`, `TransD`, `TransR`, `TorusE`  
- **Semantic Matching Models**: `RESCAL`, `DistMult`, `ComplEx`, `HolE`, `Analogy`, `TuckER`  
- **Neural Network-based Models**: `ConvE`, `ConvKB`, `RotatE`

These models embed the entities and relations of a knowledge graph into continuous vector spaces, supporting tasks such as **recommendation** and **link prediction**.




The implementation and management of these **embedding models** are handled by the **model layer**.

The implementation file for each model is located in the folder: 📁 [`hopwise/model/knowledge_graph_embedding_recommender`](https://github.com/tail-unica/hopwise/tree/main/hopwise/model/knowledge_graph_embedding_recommender)



The **configuration** of models is managed through the **config layer**, located at: 📁 [`hopwise/config`](https://github.com/tail-unica/hopwise/tree/main/hopwise/config)

The config layer supports three flexible formats for defining model behavior:

1. Configuration files (`.yaml`)
2. Command-line arguments
3. Python parameter dictionaries



While users can provide their own configurations, **default settings** for each model are automatically loaded from:

📁 [`hopwise/properties/model`](https://github.com/tail-unica/hopwise/tree/main/hopwise/properties/model)

These default properties ensure that each model has a working configuration out of the box, which can then be overridden or extended as needed.
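
The override behavior described above can be pictured with plain dictionaries. This is only an illustration of the idea, not Hopwise's actual merging logic; the default values shown are hypothetical placeholders rather than the contents of any real properties file.

```python
# Hypothetical defaults, standing in for values loaded from a model's YAML file.
default_settings = {
    "embedding_size": 64,
    "epochs": 300,
    "show_progress": True,
}

# Settings supplied by the user, e.g. through a Python dictionary.
user_settings = {
    "epochs": 1,
    "eval_step": 50,
}

# Later entries win: user values override defaults, new keys are added,
# and untouched defaults are kept as they are.
effective_settings = {**default_settings, **user_settings}

for key, value in effective_settings.items():
    print(f"{key}: {value}")
```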

### 🕸️ 3 - Training TransE Model

In this section, we'll walk through training the **TransE** model.

The TransE model belongs to the family of **translational distance models**.

<div style="background-color:#f0f4f8; border-left: 5px solid #4a90e2; padding:15px; margin:10px 0; border-radius:8px;">
  <p><b>Translational distance models:</b><br><br>
    Model a relation as a translation from the <b>head entity</b> to the <b>tail entity</b> and define the <b>scoring function</b> by the resulting distance [1].
  </p>
</div>

#### 📐 Definition



<div style="background-color:#fff1d7; border-left: 5px solid #d6a76f; padding:15px; margin:10px 0; border-radius:8px;">
  <p><strong>TransE model</strong><br>
is a representative <em>translational distance model</em> that represents <strong>entities</strong> and <strong>relations</strong> as vectors in the same semantic space ℝ<sup>d</sup>, where <em>d</em> is the dimensionality of the embedding space.</p>
</div>

A fact is represented as a triplet **(h, r, t)** where:
- **h** = head entity
- **r** = relation
- **t** = tail entity

The relation **r** is interpreted as a **translation vector** connecting **h** and **t**. The model is trained so that the vector **h + r** lies close to the embedding of **t**. That is:


$$ \mathbf{h} + \mathbf{r} \approx \mathbf{t} $$

👉 **Intuition**: the idea is that if a head entity and a relation are known, adding them should point to the correct tail entity.

**Scoring Function**

TransE performs a linear translation and uses a **scoring function** based on the distance between **h + r** and **t**. The score **f(h, r, t)** is defined as:

$$ f(h, r, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \| $$

TransE uses the L<sub>2</sub> or L<sub>1</sub> distance between vectors to measure how plausible a triple is. The lower the distance, the more likely the triplet is to be true.
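
The scoring function can be sketched in a few lines of NumPy. This is a minimal illustration with hand-picked toy vectors, not the Hopwise implementation; the function name `transe_score` and all values are assumptions made for the example.

```python
import numpy as np

def transe_score(h, r, t, norm=2):
    """Distance between the translated head (h + r) and the tail t.

    norm=2 gives the L2 distance, norm=1 the L1 distance.
    """
    return np.linalg.norm(h + r - t, ord=norm)

# Toy 3-dimensional embeddings (hypothetical values).
h = np.array([0.2, 0.1, -0.3])
r = np.array([0.5, -0.1, 0.4])
t_plausible = np.array([0.7, 0.0, 0.1])     # chosen to equal h + r
t_implausible = np.array([-0.6, 0.9, 0.8])  # far from h + r

score_plausible = transe_score(h, r, t_plausible)
score_implausible = transe_score(h, r, t_implausible)

print(f"Plausible triplet score:   {score_plausible:.4f}")
print(f"Implausible triplet score: {score_implausible:.4f}")
```

The plausible triplet scores close to zero, while the implausible one receives a clearly larger distance.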




The model is trained to **minimize the distance** for valid triplets and **maximize** it for corrupted (invalid) ones, enabling the embeddings to capture the structure and semantics of the graph.
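
One common way to express this objective is a margin-based ranking loss over pairs of valid and corrupted triplets. The sketch below assumes that loss, an L2 distance, and a deliberately simplified corruption rule; the loss used by Hopwise may differ, and all names and sizes here are hypothetical.

```python
import numpy as np

def transe_distance(h, r, t):
    """L2 distance between h + r and t, computed row by row."""
    return np.linalg.norm(h + r - t, axis=-1)

def margin_ranking_loss(pos_dist, neg_dist, margin=1.0):
    """Hinge loss that is zero once each corrupted triplet is at least
    `margin` farther away than its valid counterpart."""
    return float(np.mean(np.maximum(0.0, margin + pos_dist - neg_dist)))

rng = np.random.default_rng(0)
num_entities, dim = 6, 4
entity_emb = rng.normal(size=(num_entities, dim))
relation_emb = rng.normal(size=(2, dim))

# A small batch of valid triplets given as index arrays.
heads = np.array([0, 1, 2])
relations = np.array([0, 1, 0])
tails = np.array([3, 4, 5])

# Simplified corruption: shift each tail id by one. A real sampler would
# draw random replacements and filter out triplets that are actually true.
corrupted_tails = (tails + 1) % num_entities

pos_dist = transe_distance(entity_emb[heads], relation_emb[relations], entity_emb[tails])
neg_dist = transe_distance(entity_emb[heads], relation_emb[relations], entity_emb[corrupted_tails])

loss = margin_ranking_loss(pos_dist, neg_dist)
print(f"Margin ranking loss: {loss:.4f}")
```

Minimizing this quantity pulls **h + r** toward **t** for valid triplets and pushes it away from the corrupted tails until the margin is satisfied.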

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/transE.png" alt="TransE Diagram" width="600">


#### ⚙️ Configuration

To train the TransE model, we first need to define the configuration parameters for both the model architecture and the training pipeline.

Using Hopwise, you can configure the model in one of the following ways:

- **Python Dictionary configuration**: provide the settings directly using a Python dictionary.
- **YAML configuration**: load settings from a `.yaml` file.
- **Default configuration**: use the predefined per-model configuration files available in [hopwise/properties/model](https://github.com/tail-unica/hopwise/tree/main/hopwise/properties/model).

In [None]:
# Define the model and dataset to use
model = 'TransE'
dataset = 'ml-100k'

# Custom configuration dictionary
config_dict = {
    # 'gpu_id': 0,  # GPU device ID to use
    'epochs': 1,  # Number of training epochs
    'eval_step': 50,  # Evaluation frequency (every N epochs)
    'data_path': '/content/drive/MyDrive/XAIKGRLGM/hands-on-session/data/dataset/',  # Path to the dataset
    'checkpoint_dir': '/content/drive/MyDrive/XAIKGRLGM/hands-on-session/data/checkpoint/',  # Directory to save model checkpoints
    'show_progress': True,  # Toggle progress bar visibility during training/evaluation
}

#### 🔁 Train

Now we are ready to train the model using the following command.

⚠️ A **precomputed TransE** checkpoint is already available, so you **don't need** to run this command now.

In [None]:
run_hopwise(model=model,
            dataset=dataset,
            config_dict=config_dict)

You should see output similar to the following:

```text
{'best_valid_score': {<KnowledgeEvaluationType.REC: 1>: 0.1018},
 'valid_score_bigger': True,
 'best_valid_result': {<KnowledgeEvaluationType.REC: 1>: OrderedDict([('recall@10',
                0.0903),
               ('mrr@10', 0.1924),
               ('ndcg@10', 0.1018),
               ('hit@10', 0.4772),
               ('precision@10', 0.0785)])},
 'test_result': OrderedDict([('recall@10', 0.1044),
              ('mrr@10', 0.2241),
              ('ndcg@10', 0.1225),
              ('hit@10', 0.4995),
              ('precision@10', 0.0914)])}
```

💾 **Saving Model Outputs**

By default, the trained model checkpoint will be saved in the `hopwise/saved` directory as a `.pth` file. The filename includes the model name and a timestamp, following this format:

`TransE-Month-day-year_timestamp.pth`
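
To make the naming pattern concrete, the sketch below rebuilds a filename from a model name and a point in time. The `strftime` pattern is inferred from the checkpoint name used later in this notebook (`TransE-Jul-08-2025_19-42-10.pth`) and is an assumption, not taken from the Hopwise source; `checkpoint_filename` is a hypothetical helper.

```python
from datetime import datetime

def checkpoint_filename(model_name, when):
    """Build a name of the form <Model>-<Mon>-<day>-<year>_<HH>-<MM>-<SS>.pth.

    The pattern is inferred from an example filename, not from library code.
    """
    stamp = when.strftime("%b-%d-%Y_%H-%M-%S")
    return f"{model_name}-{stamp}.pth"

example_name = checkpoint_filename("TransE", datetime(2025, 7, 8, 19, 42, 10))
print(example_name)
```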




If you need to change the output directory, you have two options:

1. **Override it dynamically** using the `checkpoint_dir` parameter in your configuration dictionary.
2. **Modify the default path** in the global config file:  
   📄 [`hopwise/properties/overall.yaml`](https://github.com/tail-unica/hopwise/blob/main/hopwise/properties/overall.yaml)


📥 **Loading Pretrained TransE Embeddings**


To reuse previously trained TransE embeddings, you can simply load the saved model checkpoint. This checkpoint is stored as a `.pth` file and includes:

- The model's learned parameters (i.e., the embeddings)
- Training metadata (e.g., epoch, best validation score)
- The full configuration used during training

We use `torch.load()` to deserialize the checkpoint file and inspect its contents.

In [None]:
from pathlib import Path
BASE_DIR = Path('/home/kt/code/courses/bsu-2025-ai-summer-workshop-kgs')

In [None]:
# Path to the saved checkpoint (update with your actual filename).
# On Colab with Google Drive mounted, use this path instead:
# checkpoint_name = "/content/drive/MyDrive/XAIKGRLGM/hands-on-session/data/checkpoint/TransE-Jul-08-2025_19-42-10.pth"
checkpoint_name = BASE_DIR / 'data/checkpoint/TransE-Jul-08-2025_19-42-10.pth'

# Load the checkpoint
checkpoint = torch.load(checkpoint_name, weights_only=False)

The structure of the checkpoint is defined within the `Trainer` class, specifically in the `_save_checkpoint()` method located in the [`hopwise/trainer/trainer.py`](https://github.com/tail-unica/hopwise/blob/main/hopwise/trainer/trainer.py) file.

This method is responsible for saving the model's state during training.

```python
state = {
    'config': self.config,  # configuration settings used for the model and training process
    'epoch': epoch,  # current epoch number at the time of saving
    'cur_step': self.cur_step,  # current training step
    'best_valid_score': self.best_valid_score,  # best validation score achieved so far
    'state_dict': self.model.state_dict(),  # model's parameters
    'optimizer': self.optimizer.state_dict(),  # optimizer state, including learning rates and momentum buffers
}
```

This structure ensures that all necessary information is preserved, allowing for training to be resumed easily or for the model to be evaluated later.

Print the checkpoint structure:

In [None]:
# Display top-level keys in the checkpoint dictionary
checkpoint.keys()

The learned parameters are stored in the `state_dict`. Let's visualize which embeddings have been saved:

In [None]:
# Display which weights (learned embeddings) were saved in the 'state_dict'
checkpoint["state_dict"].keys()

The three `.weight` entries in the `state_dict` (`user_embedding.weight`, `entity_embedding.weight`, and `relation_embedding.weight`) are the **embedding matrices** learned during training of the TransE model.

Each matrix contains one dense vector per element of a given type in the knowledge graph:

- `user_embedding.weight`: embeddings for all users  
- `entity_embedding.weight`: embeddings for all entities (items, tags, etc.)  
- `relation_embedding.weight`: embeddings for all relation types (e.g., "watched", "belongs_to")

These embeddings capture the **semantic and structural role** of each node or relation in the graph, based on the TransE training objective.

Each embedding matrix has shape `(num_rows, embedding_dim)`, where `num_rows` is the number of users, entities, or relations, respectively.

For example:
- If you have 10,000 entities and the embedding size is 100, then:  
  `entity_embedding.weight` → shape = `(10000, 100)`
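
The sketch below shows how such matrices can be inspected and how a single embedding is looked up by its integer ID. It uses random NumPy arrays as stand-ins for the checkpoint contents; the sizes are hypothetical and do not reflect the actual ml-100k checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for checkpoint["state_dict"]; sizes are hypothetical.
state_dict = {
    "user_embedding.weight": rng.normal(size=(50, 8)),
    "entity_embedding.weight": rng.normal(size=(200, 8)),
    "relation_embedding.weight": rng.normal(size=(10, 8)),
}

# One row per user, entity, or relation; one column per embedding dimension.
for name, matrix in state_dict.items():
    num_rows, embedding_dim = matrix.shape
    print(f"{name}: {num_rows} rows x {embedding_dim} dimensions")

# The embedding of a single entity is the row at its integer ID.
entity_id = 42
entity_vector = state_dict["entity_embedding.weight"][entity_id]
print(f"Entity {entity_id} vector shape: {entity_vector.shape}")
```

The same `.shape` attribute and row indexing also work on the PyTorch tensors stored in the real checkpoint.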

### 🔍 4 - Visualizing TransE Embeddings with t-SNE


In the previous steps, we trained a **TransE model** to learn embeddings for three types of elements in a knowledge graph:
- **Entities** (e.g., movies, genres)
- **Users**
- **Relations** (e.g., "watched", "belongs_to")

These embeddings exist in a high-dimensional space (e.g., 100 or 200 dimensions), which makes them difficult to interpret or visualize directly.

To gain insight into how these embeddings are structured, we want to **project them into a 2D space** while preserving their relative distances.

This can help us identify patterns such as clustering, separation between types, or outliers.

To achieve this, we use the **t-distributed Stochastic Neighbor Embedding (t-SNE)** algorithm.


**🧠 What is t-SNE?**

**t-SNE** is a nonlinear dimensionality reduction technique that:
- Converts high-dimensional similarities between data points into conditional probabilities.
- Projects the data into a lower-dimensional space (typically 2D or 3D).
- Optimizes the layout to preserve local structures (i.e., nearby points remain close).
- Is particularly effective for visualizing embeddings and clustering behavior.

Unlike linear methods like PCA, t-SNE focuses on maintaining neighborhood relationships, making it well-suited for exploring how different embeddings relate to one another.
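
The first step in the list above, turning distances into conditional probabilities, can be sketched as follows. This simplified version uses Euclidean distance and one fixed Gaussian bandwidth `sigma` for every point; actual t-SNE implementations tune a separate bandwidth per point to match the chosen perplexity, and this notebook later uses cosine distance instead.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Compute p(j | i): how likely point i is to pick point j as a neighbor.

    Simplification: a single fixed sigma is shared by all points.
    """
    # Squared Euclidean distance between every pair of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # A point is never its own neighbor.
    np.fill_diagonal(affinities, 0.0)
    # Normalize each row so it forms a probability distribution.
    return affinities / affinities.sum(axis=1, keepdims=True)

# Two nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probabilities(X)
print(np.round(P, 4))
```

Nearly all of the probability mass of the first point goes to its close neighbor, which is the local structure that t-SNE then tries to reproduce in 2D.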


To perform this visualization, we use the [`openTSNE`](https://opentsne.readthedocs.io/en/stable/tsne_algorithm.html) library [2], a fast and extensible implementation of the t-SNE algorithm.

We will:

* 1️⃣ Run t-SNE separately on user, entity, and relation embeddings;
* 2️⃣ Visualize each type in its own scatter plot;
* 3️⃣ Finally, combine all embeddings into a single 2D plot to explore their global structure together.

In [None]:
# === Import the required libraries ===

import os
import numpy as np
import pandas as pd
import torch

import plotly.express as px
from openTSNE import TSNE
from hopwise.utils import init_seed

In [None]:
# Define the path to the saved TransE checkpoint (.pth file).
# This file contains the model's configuration and all trained parameters (weights).

# On Colab with Google Drive mounted, use this path instead:
# checkpoint_name = "/content/drive/MyDrive/XAIKGRLGM/hands-on-session/data/checkpoint/TransE-Jul-08-2025_19-42-10.pth"
checkpoint_name = BASE_DIR / 'data/checkpoint/TransE-Jul-08-2025_19-42-10.pth'

In [None]:
# === Load Pretrained TransE Model and Prepare for Visualization ===

# Load the checkpoint
checkpoint = torch.load(checkpoint_name, weights_only=False)


# Extract and apply the stored configuration to ensure consistent random seed
# and reproducibility settings for visualization (especially important for t-SNE).
config = checkpoint["config"]
init_seed(config["seed"], config["reproducibility"])


# Convert all model weights in the checkpoint from GPU tensors to CPU NumPy arrays.
# This is necessary because visualization tools (e.g., t-SNE, Plotly) work with NumPy.
for weight in checkpoint["state_dict"].keys():
    checkpoint["state_dict"][weight] = checkpoint["state_dict"][weight].to(torch.device("cpu")).numpy()

In [None]:
# === Define a custom function to generate interactive 2D scatter plots using Plotly ===

# Points are currently drawn in a single color.
# To color them by ID (darker purple for lower IDs, brighter yellow for higher IDs),
# uncomment the `ids` and `color=ids` lines below.

def plot_fn(embeddings, desc="Entity"):
    #ids = list(range(embeddings.shape[0]))
    fig = px.scatter(
        x=embeddings[:, 0],
        y=embeddings[:, 1],
        #color=ids,
        labels={"x": "Embedding Dimension 1", "y": "Embedding Dimension 2", "color": f"{desc} ID"},
        title=f"{config['model']} {desc} Embeddings",
        width=1024,
        height=1024,
        template="plotly_white",
    )

    fig.show()

**Setup t-SNE for Dimensionality Reduction**

Initialize the t-SNE algorithm from openTSNE to reduce high-dimensional embeddings (e.g., 100D) to 2D for visualization.

In [None]:
# Initialize t-SNE for dimensionality reduction:
# - perplexity=30: balances local/global structure, defines how many neighbors are considered (typical range is 5–50)
# - n_jobs=8: enables parallel processing across 8 CPU threads for faster computation
# - initialization="random": starts with a random 2D layout (instead of PCA), introducing randomness
# - metric="cosine": uses cosine distance instead of Euclidean, better suited for high-dimensional embeddings
# - random_state=config["seed"]: fixes the random seed to ensure reproducible results
# - verbose=True: prints progress updates during optimization

tsne = TSNE(
    perplexity=30,
    n_jobs=8,
    initialization="random",
    metric="cosine",
    random_state=config["seed"],
    verbose=True,
)

Now we are ready to run t-SNE on each embedding type (`user`, `entity`, `relation`).

**Plot Users**

In [None]:
# Extract the pretrained user embedding weights from the checkpoint.
user_weights = checkpoint["state_dict"]["user_embedding.weight"]

# Apply t-SNE to reduce the user embeddings from high-dimensional space to 2D.
tsne_embeddings_users = tsne.fit(user_weights)

The result `tsne_embeddings_users` is also a NumPy array of shape `(N, 2)`, where `N` is the number of users.

The array is ready to be plotted, as shown below:

In [None]:
plot_fn(tsne_embeddings_users, "User")

* The points appear fairly evenly distributed across the plane, without significant concentrations or visible clusters. This is likely because, for educational purposes, we trained the model for only one epoch, so the resulting embeddings are not very informative.

**Plot entities**

In [None]:
# Extract the pretrained entity embedding weights from the checkpoint.
entity_weights = checkpoint["state_dict"]["entity_embedding.weight"]

# Apply t-SNE to reduce the entity embeddings from high-dimensional space to 2D.
tsne_embeddings_entities = tsne.fit(entity_weights)

The result `tsne_embeddings_entities` is also a NumPy array of shape `(N, 2)`, where `N` is the number of entities.

The array is ready to be plotted, as shown below:

In [None]:
plot_fn(tsne_embeddings_entities, "Entity")

* Several tight clusters can be seen, especially in the bottom left and around the periphery.
* These likely represent semantically similar entities (e.g., movies of the same genre, or related items).
* The central region is densely compacted, suggesting that many entities are relatively similar.

**Plot Relations**

In [None]:
# Extract the pretrained relation embedding weights from the checkpoint.
relation_weights = checkpoint["state_dict"]["relation_embedding.weight"]

# Apply t-SNE to reduce the relation embeddings from high-dimensional space to 2D.
tsne_embeddings_relations = tsne.fit(relation_weights)

The result `tsne_embeddings_relations` is also a NumPy array of shape `(N, 2)`, where `N` is the number of relations.

The array is ready to be plotted, as shown below:

In [None]:
plot_fn(tsne_embeddings_relations, "Relation")

* The plot contains only ~25 points, corresponding to the total number of relation types in the dataset.

* The distance between points may suggest semantic dissimilarity (e.g., “rated” might be far from “belongs_to”).
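
Because distances in a t-SNE plot are only a rough guide, it can help to compare relations directly in the original embedding space. The sketch below computes pairwise cosine distances with NumPy on a random stand-in matrix; the function name, the matrix sizes, and the data are hypothetical, and in the notebook you would pass `relation_weights` instead.

```python
import numpy as np

def pairwise_cosine_distance(E):
    """Return the matrix of cosine distances (1 - cosine similarity) between rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = E / np.clip(norms, 1e-12, None)
    return 1.0 - unit @ unit.T

rng = np.random.default_rng(0)
relation_weights_demo = rng.normal(size=(25, 100))  # hypothetical stand-in

distances = pairwise_cosine_distance(relation_weights_demo)

# Ignore self-distances, then find the closest other relation for each row.
np.fill_diagonal(distances, np.inf)
nearest = distances.argmin(axis=1)
print("Nearest relation ID for the first five relations:", nearest[:5])
```

Cosine distance matches the `metric="cosine"` setting used for t-SNE above, so the two views rest on the same notion of similarity.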


----


We applied t-SNE separately to each type of embedding (`users`, `entities`, and `relations`) and visualized them in individual plots.

Now, we'll combine all of these embeddings into a single plot to compare their overall layouts side by side. Because each projection was fitted independently, positions are comparable within a type but not across types.

**Combine all embeddings into one plot**

In [None]:
def combine_embeddings(**kwargs):
    embeddings_list = []
    identifiers_list = []

    for embeddings_name, embeddings in kwargs.items():
        embeddings_list.append(embeddings)
        identifiers_list.extend([f"{embeddings_name} {idx}" for idx in range(embeddings.shape[0])])
        print(f"[+] {embeddings_name}: {embeddings.shape}")

    embeddings_list = np.concatenate(embeddings_list, axis=0)

    combined_df = pd.DataFrame(
        {
            "x": embeddings_list[:, 0],
            "y": embeddings_list[:, 1],
            "type": [identifier.split(" ")[0] for identifier in identifiers_list],
            "identifier": identifiers_list,
        }
    )

    fig = px.scatter(
        combined_df,
        x="x",
        y="y",
        color="type",
        hover_data=["identifier"],
        labels={"x": "Embedding Dimension 1", "y": "Embedding Dimension 2", "type": "Embedding Type"},
        title=f"Visualising Combined Embeddings {checkpoint_name}",
        width=1024,
        height=1024,
        template="plotly_white",
    )
    fig.show()

In [None]:
# Combine user, entity, and relation t-SNE embeddings into one visualization.
# This allows us to see how these different types of embeddings are distributed
# in the same 2D space and whether they cluster distinctly or overlap.

combine_embeddings(
    user=tsne_embeddings_users,
    entity=tsne_embeddings_entities,
    relation=tsne_embeddings_relations
)

# References

[1] Q. Yan, J. Fan, M. Li, G. Qu and Y. Xiao, "A Survey on Knowledge Graph Embedding," 2022 7th IEEE International Conference on Data Science in Cyberspace (DSC), Guilin, China, 2022, pp. 576-583, doi: 10.1109/DSC55868.2022.00086.

[2] P. G. Poličar, M. Stražar and B. Zupan, "openTSNE: A Modular Python Library for t-SNE Dimensionality Reduction and Embedding," Journal of Statistical Software, vol. 109, 2024, pp. 1-30.