# Node2vec

> [!WARNING]
> This tutorial is experimental and requires `mlx_cluster` to be installed, which
> currently requires mlx 0.18.

**Goal:** This tutorial will guide you through using node2vec to generate vector embeddings for the nodes of an undirected graph.

**Concepts:** `MLX`, `Node2vec`


In [1]:
from collections import defaultdict
import mlx.core as mx
from mlx_graphs.datasets import PlanetoidDataset

## Dataset

For this first tutorial, we will use the [PlanetoidDataset](https://chrsmrrs.github.io/datasets/docs/datasets/) collection, which comprises the citation networks `Cora`, `Pubmed` and `CiteSeer`.

We will use the `Cora` dataset, which consists of **2708** nodes and **10,556** edges. The dataset can be accessed through the `PlanetoidDataset` class.

In [2]:
dataset = PlanetoidDataset("Cora")

Loading cora data ... Done


In [3]:
dataset

cora(num_graphs=1)

We can access dataset properties directly from the `dataset` object.

In [4]:
# Some useful properties
print("Dataset attributes")
print("-" * 20)
print(f"Number of graphs: {len(dataset)}")
print(f"Number of node features: {dataset.num_node_features}")
print(f"Number of edge features: {dataset.num_edge_features}")
print(f"Number of graph features: {dataset.num_graph_features}")
print(f"Number of graph classes to predict: {dataset.num_graph_classes}\n")

# Statistics of the dataset
stats = defaultdict(list)
for g in dataset:
    stats["Mean node degree"].append(g.num_edges / g.num_nodes)
    stats["Mean num of nodes"].append(g.num_nodes)
    stats["Mean num of edges"].append(g.num_edges)

print("Dataset stats")
print("-" * 20)
for k, v in stats.items():
    mean = mx.mean(mx.array(v)).item()
    print(f"{k}: {mean:.2f}")

Dataset attributes
--------------------
Number of graphs: 1
Number of node features: 1433
Number of edge features: 0
Number of graph features: 0
Number of graph classes to predict: 0

Dataset stats
--------------------
Mean node degree: 3.90
Mean num of nodes: 2708.00
Mean num of edges: 10556.00


In [5]:
dataset[0]

GraphData(
	edge_index(shape=(2, 10556), int32)
	node_features(shape=(2708, 1433), float32)
	node_labels(shape=(2708,), int32)
	train_mask(shape=(2708,), bool)
	val_mask(shape=(2708,), bool)
	test_mask(shape=(2708,), bool))
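
The edge count and mean degree reported above are easier to read once the storage convention is clear. The sketch below uses a toy graph and assumes (this is an assumption about the dataset, not something the output states) that each undirected edge is stored once per direction in `edge_index`. The variable names are illustrative only.

```python
# Toy undirected graph with 4 nodes and 3 undirected edges: 0-1, 1-2, 2-3.
undirected_edges = [(0, 1), (1, 2), (2, 3)]
num_nodes = 4

# Assumed storage convention: every undirected edge appears once per direction,
# so the edge index has 2 * len(undirected_edges) columns.
sources = [u for u, v in undirected_edges] + [v for u, v in undirected_edges]
targets = [v for u, v in undirected_edges] + [u for u, v in undirected_edges]
edge_index = [sources, targets]

# Same formula as the statistics cell: number of stored edges / number of nodes.
num_directed_edges = len(edge_index[0])
mean_degree = num_directed_edges / num_nodes

print(num_directed_edges)  # 6
print(mean_degree)         # 1.5
```

Under that assumption, the 10556 stored edges of `Cora` would correspond to 5278 undirected edges, and 10556 / 2708 ≈ 3.90 is the average number of neighbours per node.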

## Creating a simple neural network using node2vec

In [6]:
from mlx_graphs.algorithms import Node2Vec

## Specify hyperparameters for node2vec

The most important hyperparameters for node2vec are `p` and `q`. The remaining ones control the embedding size and how walks are sampled:
1. `p`: the return parameter, which controls the likelihood of revisiting the previous node in the walk. When `p` is low, the walk is more likely to take a step back.
2. `q`: the in-out parameter, which controls the likelihood of exploring nodes that are further away from the source. When `q` is low, the walk is more likely to move outward; when `q` is high, it tends to stay in the local neighbourhood of the source.
3. `embedding_dim`: the dimension of each node embedding.
4. `walk_length`: the number of nodes to consider in a walk.
5. `context_size`: the actual context size that is considered for positive samples. This parameter increases the effective sampling rate by reusing samples across different source nodes.
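
To make the roles of `p` and `q` concrete, here is a minimal sketch of the unnormalized second-order transition weights from the original node2vec formulation, for an unweighted graph. The helper names (`transition_weights`, `normalize`) and the toy graph are hypothetical, and the sampler used by `mlx_graphs` may differ in detail.

```python
def transition_weights(prev, curr, neighbors, p, q):
    """Unnormalized weights for the next step from `curr`, having arrived from `prev`.

    `neighbors` maps each node to the set of its neighbors (unweighted graph).
    """
    weights = {}
    for nxt in neighbors[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # step back to the previous node
        elif nxt in neighbors[prev]:
            weights[nxt] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q  # move further away from the previous node
    return weights


def normalize(weights):
    """Turn unnormalized weights into probabilities."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy graph with edges 0-1, 0-2, 1-2, 1-3. The walk has just moved from 0 to 1.
neighbors = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

uniform = normalize(transition_weights(0, 1, neighbors, p=1.0, q=1.0))
local = normalize(transition_weights(0, 1, neighbors, p=1.0, q=4.0))
```

With `p = q = 1.0`, all three candidates are equally likely, so the walk reduces to a uniform random walk. Raising `q` to 4.0 lowers the probability of moving to node 3, the only candidate that is not adjacent to the previous node.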

In [7]:
embedding_dim = 128
walk_length = 50
context_size = 10
walks_per_node = 10
num_negative_samples = 1
p = 1.0
q = 1.0
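
The relationship between `walk_length` and `context_size` can be sketched as a sliding window over each walk. This is an illustration under the assumption that positive samples are taken from contiguous windows of a walk; `context_windows` is a hypothetical helper, not part of `mlx_graphs`.

```python
def context_windows(walk, context_size):
    """Split one walk into all contiguous windows of length `context_size`."""
    return [walk[i:i + context_size] for i in range(len(walk) - context_size + 1)]


walk = list(range(8))  # a toy walk that visits nodes 0..7
windows = context_windows(walk, context_size=3)

print(len(windows))  # 6
print(windows[0])    # [0, 1, 2]
```

A walk of `n` nodes yields `n - context_size + 1` windows, so a single long walk provides many training samples. Whether `walk_length` counts nodes or steps depends on the implementation.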

Next, we train the model with a simple training loop.

## Benchmark on CPU and GPU

In [8]:
import mlx.nn as nn
import mlx.optimizers as optim
import time
optimizer = optim.Adam(learning_rate=0.001)

In [9]:
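def training_loop(model, optimizer, num_epochs=10, batch_size=32):
    """Train `model` for `num_epochs` epochs and print the loss after each one."""
    # Build the loss-and-gradient function once instead of once per batch
    loss_and_grad_fn = nn.value_and_grad(model, model.loss)
    for epoch in range(num_epochs):
        total_loss = 0
        dataloader = model.dataloader(batch_size=batch_size)
        for pos, neg in dataloader:
            loss, grad = loss_and_grad_fn(pos, neg)
            optimizer.update(model, grad)
            # MLX is lazy: force evaluation of the updated parameters and optimizer state
            mx.eval(model.parameters(), optimizer.state)
            total_loss += loss.item()
        # Note: the summed loss is divided by `batch_size`, not by the number of
        # batches, so the printed value is a scaled total rather than a per-batch mean
        print(f"Epoch {epoch}: batch loss = {total_loss/batch_size:.5f}")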
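
For intuition about what the positive and negative samples are used for, the sketch below implements a skip-gram loss with negative sampling in plain Python, the objective commonly used for node2vec. Whether `model.loss` computes exactly this quantity is an assumption; `sgns_loss` and its helpers are hypothetical names.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_loss(source, positives, negatives):
    """Pull `source` toward positive contexts and push it away from negative ones."""
    pos = -sum(log_sigmoid(dot(source, v)) for v in positives) / len(positives)
    # log(1 - sigmoid(x)) equals log(sigmoid(-x))
    neg = -sum(log_sigmoid(-dot(source, v)) for v in negatives) / len(negatives)
    return pos + neg


# Uninformative embeddings: every score is 0, so each term equals log(2).
untrained = sgns_loss([0.0, 0.0], [[0.0, 0.0]], [[0.0, 0.0]])

# Embeddings that agree with the samples give a much smaller loss.
trained = sgns_loss([2.0, 0.0], [[2.0, 0.0]], [[-2.0, 0.0]])
```

The loss decreases as embeddings of nodes that co-occur in walks become similar and embeddings of randomly sampled nodes become dissimilar.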

In [10]:
def benchmark_training(model_creation_fn, optimizer_creation_fn, device='gpu', num_epochs=10, batch_size=32):
    """
    Benchmark training on the specified device.

    Args:
        model_creation_fn: Function that takes `use_gpu` and returns a new model instance
        optimizer_creation_fn: Function that takes a model and returns an optimizer
        device: 'gpu' or 'cpu'
        num_epochs: Number of training epochs to run
        batch_size: Batch size passed to the training loop

    Returns:
        Total training time in seconds
    """
    print(f"🚀 {device.upper()} Benchmark")
    print("-" * 40)

    # Select the device: anything other than 'cpu' runs on the GPU
    use_gpu = device != 'cpu'

    # Create a fresh model and optimizer for this benchmark
    model = model_creation_fn(use_gpu=use_gpu)
    optimizer = optimizer_creation_fn(model)

    # Time only the training loop; stop the clock before releasing the model
    start_time = time.perf_counter()
    training_loop(model, optimizer, num_epochs, batch_size)
    total_time = time.perf_counter() - start_time
    del model

    print(f"🎯 {device.upper()} Total Time: {total_time:.3f} seconds")
    print(f"📈 Average per epoch: {total_time/num_epochs:.3f} seconds")

    return total_time
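
A single timed run can be noisy. As a sketch of one way to make such measurements more robust, the hypothetical helper below (not part of this tutorial's code) repeats a callable several times so that the median duration can be reported instead of a single value.

```python
import statistics
import time


def time_runs(fn, repeats=3):
    """Call `fn` `repeats` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; in practice this would be one call to a training function.
durations = time_runs(lambda: sum(i * i for i in range(100_000)), repeats=5)
median_time = statistics.median(durations)
```

When timing MLX code this way, the timed callable has to force evaluation of its results (for example through `loss.item()`), because MLX builds computations lazily.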

In [11]:
def create_model(use_gpu=False):
    """Create Node2Vec model"""
    model = Node2Vec(
        edge_index=dataset[0].edge_index,
        num_nodes=dataset[0].num_nodes,
        embedding_dim=embedding_dim,
        walk_length=walk_length,
        context_size=context_size,
        walks_per_node=walks_per_node,
        num_negative_samples=num_negative_samples,
        p=p,
        q=q,
        use_gpu=use_gpu,
    )
    return model

In [12]:
def create_optimizer(model):
    """Create an Adam optimizer. `model` is unused but kept for a uniform interface."""
    optimizer = optim.Adam(learning_rate=0.001)
    return optimizer

In [13]:
print("Starting benchmarks...\n")

# GPU benchmark
gpu_time = benchmark_training(create_model, create_optimizer, num_epochs=100, device='gpu')

print("\n" + "="*50 + "\n")

# CPU benchmark  
cpu_time = benchmark_training(create_model, create_optimizer, num_epochs=100, device='cpu')

# Summary
print("\n📊 BENCHMARK SUMMARY")
print("=" * 50)
print(f"GPU Time:     {gpu_time:.3f}s")
print(f"CPU Time:     {cpu_time:.3f}s")
print(f"GPU Speedup:  {cpu_time/gpu_time:.2f}x faster")

Starting benchmarks...

🚀 GPU Benchmark
----------------------------------------
Epoch 0: batch loss = 2.68540
Epoch 1: batch loss = 2.25251
Epoch 2: batch loss = 2.22112
Epoch 3: batch loss = 2.20850
Epoch 4: batch loss = 2.20118
Epoch 5: batch loss = 2.19510
Epoch 6: batch loss = 2.19153
Epoch 7: batch loss = 2.19094
Epoch 8: batch loss = 2.18868
Epoch 9: batch loss = 2.18604
Epoch 10: batch loss = 2.18575
Epoch 11: batch loss = 2.18490
Epoch 12: batch loss = 2.18312
Epoch 13: batch loss = 2.18215
Epoch 14: batch loss = 2.18156
Epoch 15: batch loss = 2.17968
Epoch 16: batch loss = 2.17803
Epoch 17: batch loss = 2.17907
Epoch 18: batch loss = 2.17829
Epoch 19: batch loss = 2.17767
Epoch 20: batch loss = 2.17855
Epoch 21: batch loss = 2.17641
Epoch 22: batch loss = 2.17812
Epoch 23: batch loss = 2.17572
Epoch 24: batch loss = 2.17765
Epoch 25: batch loss = 2.17583
Epoch 26: batch loss = 2.17600
Epoch 27: batch loss = 2.17441
Epoch 28: batch loss = 2.17587
Epoch 29: batch loss = 2.17451