# DGL API — Graph Construction, Node Features, Edge Weights, GraphSAGE & Evaluation

**Goal:** Demonstrate the native DGL API (heterographs, edge data, GraphSAGE) and our thin wrapper (`dgl_utils.py`), using a subsample of MovieLens when `data/ratings.csv` is present and a tiny toy dataset otherwise. This notebook covers:
- Bipartite user–movie graph with **edge weights** (ratings)
- **Node features** via movie genres
- GraphSAGE encoder (homogeneous view)
- **Link prediction** metrics: Precision@K / Recall@K
- **Rating RMSE** via a fast regressor on frozen GNN embeddings
- **Top-N recommendations** for a user

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os, logging, random
import numpy as np
import pandas as pd
import torch
import dgl

import dgl_utils as du

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("DGL_API")

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

print("torch:", torch.__version__, "| dgl:", dgl.__version__)


In [None]:
DATA_DIR = "data"  # put ratings.csv and movies.csv here if available
RATINGS_CSV = os.path.join(DATA_DIR, "ratings.csv")
MOVIES_CSV  = os.path.join(DATA_DIR, "movies.csv")

USE_REAL = os.path.exists(RATINGS_CSV)
print("Use real MovieLens data:", USE_REAL)

In [None]:
if USE_REAL:
    data = du.load_movielens(RATINGS_CSV, MOVIES_CSV if os.path.exists(MOVIES_CSV) else None)
    ratings = data["ratings"]
    movies  = data.get("movies", None)

    # Optional: randomly subsample ratings (edges) to keep the API demo fast.
    # No activity-based filtering of users or movies is applied here.
    # Raise max_edges if your machine can handle a larger graph.
    max_edges = 100_000  # roughly a quick-demo size
    if len(ratings) > max_edges:
        ratings = ratings.sample(n=max_edges, random_state=SEED).reset_index(drop=True)
    print(ratings.head(), "\nrows:", len(ratings))
    if movies is not None:
        print(movies.head(), "\nrows:", len(movies))
else:
    # Toy fallback (runs anywhere):
    ratings = pd.DataFrame({
        "userId":   [10,10,11,12,12,13,13,13],
        "movieId":  [100,101,100,102,103,100,102,104],
        "rating":   [4.0,5.0,3.0,4.5,2.5,4.0,3.5,5.0],
        "timestamp":[1,2,3,4,5,6,7,8],
    })
    movies = pd.DataFrame({
        "movieId": [100,101,102,103,104],
        "title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Waiting to Exhale", "Father of the Bride Part II"],
        "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
                   "Adventure|Children|Fantasy",
                   "Comedy|Romance",
                   "Comedy|Drama|Romance",
                   "Comedy"]
    })
    print("Using toy data.")

In [None]:
df2, maps = du.remap_ids(ratings, "userId", "movieId")
num_users = df2["u"].nunique()
num_movies = df2["v"].nunique()
num_edges  = len(df2)
num_users, num_movies, num_edges
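
The internals of `du.remap_ids` are not shown in this notebook. As a rough illustration only, contiguous ID remapping can be sketched in plain Python as below. The function name `remap_ids_sketch`, the `user_map` key, and the raw-ID-to-index direction of the maps are assumptions; only the `u`/`v` columns and the `item_map` key appear in the cells above.

```python
def remap_ids_sketch(rows, user_key="userId", item_key="movieId"):
    """Map raw user/item IDs to contiguous 0-based indices (hypothetical helper)."""
    user_map, item_map = {}, {}
    remapped = []
    for row in rows:
        # setdefault assigns the next free index the first time an ID is seen.
        u = user_map.setdefault(row[user_key], len(user_map))
        v = item_map.setdefault(row[item_key], len(item_map))
        remapped.append({**row, "u": u, "v": v})
    return remapped, {"user_map": user_map, "item_map": item_map}


toy_rows = [
    {"userId": 10, "movieId": 100, "rating": 4.0},
    {"userId": 10, "movieId": 101, "rating": 5.0},
    {"userId": 11, "movieId": 100, "rating": 3.0},
]
remapped, id_maps = remap_ids_sketch(toy_rows)
print([(r["u"], r["v"]) for r in remapped])  # [(0, 0), (0, 1), (1, 0)]
```

Contiguous indices matter because DGL node IDs run from 0 to `num_nodes - 1` within each node type.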

In [None]:
g = du.build_bipartite_graph(df2, num_users, num_movies, rating_col="rating")

print("Node types:", g.ntypes)
print("Edge types:", g.etypes)
print("Users:", g.num_nodes("user"), " Movies:", g.num_nodes("movie"))
print("Edges (rates):", g.num_edges(("user","rates","movie")))
print("Edge data keys:", g.edges[('user','rates','movie')].data.keys())
print("First few ratings:", g.edges[('user','rates','movie')].data["rating"][:5])


In [None]:
movie_feat_tensor = None
genre_vocab = []
if movies is not None:
    movie_feat_tensor, genre_vocab = du.build_movie_genre_onehot(movies, maps["item_map"])
    print("Movie genre feature matrix:", tuple(movie_feat_tensor.shape), " #genres:", len(genre_vocab))
else:
    print("No movies.csv available -> skipping movie genre features.")
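
For intuition, genre features of this kind can be sketched with NumPy as follows. This is not the code behind `du.build_movie_genre_onehot`; the name `genre_multi_hot_sketch` and the alphabetical vocabulary order are assumptions. Since a movie can carry several genres, each row is strictly a multi-hot vector.

```python
import numpy as np


def genre_multi_hot_sketch(genre_strings):
    """Turn pipe-separated genre strings into a 0/1 matrix (hypothetical helper)."""
    # Vocabulary: every distinct genre, sorted for a stable column order.
    vocab = sorted({g for s in genre_strings for g in s.split("|")})
    col = {g: j for j, g in enumerate(vocab)}
    feats = np.zeros((len(genre_strings), len(vocab)), dtype=np.float32)
    for i, s in enumerate(genre_strings):
        for g in s.split("|"):
            feats[i, col[g]] = 1.0
    return feats, vocab


toy_genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
]
feats, vocab = genre_multi_hot_sketch(toy_genres)
print(feats.shape, vocab)
```

In the real pipeline the rows would also need to follow the remapped movie indices, which is presumably why the wrapper takes `maps["item_map"]`.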


In [None]:
splits = du.make_edge_splits(g, etype=("user","rates","movie"), test_size=0.1, val_size=0.1, seed=SEED)
# Print explicitly: a bare expression in the middle of a cell is not displayed.
print({k: tuple(v.shape) for k, v in splits.items()})

train_pairs = du.eids_to_pairs(g, splits["train_eids"])
val_pairs   = du.eids_to_pairs(g, splits["val_eids"])
test_pairs  = du.eids_to_pairs(g, splits["test_eids"])

# Also capture ratings per set for RMSE later.
r = g.edges[('user','rates','movie')].data["rating"].numpy()
train_ratings = [r[i] for i in splits["train_eids"].tolist()]
val_ratings   = [r[i] for i in splits["val_eids"].tolist()]
test_ratings  = [r[i] for i in splits["test_eids"].tolist()]

len(train_pairs), len(val_pairs), len(test_pairs)
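
A random edge split like the one requested above can be sketched by permuting edge IDs. This is only an illustration under assumptions: `split_edge_ids_sketch` is a hypothetical name, and `du.make_edge_splits` may round sizes or order the splits differently.

```python
import numpy as np


def split_edge_ids_sketch(num_edges, test_size=0.1, val_size=0.1, seed=42):
    """Shuffle edge IDs and cut them into train/val/test (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_edges)
    n_test = int(round(num_edges * test_size))
    n_val = int(round(num_edges * val_size))
    return {
        "test_eids": perm[:n_test],
        "val_eids": perm[n_test:n_test + n_val],
        "train_eids": perm[n_test + n_val:],
    }


toy_splits = split_edge_ids_sketch(100)
print({name: len(eids) for name, eids in toy_splits.items()})
```

Because the three slices come from one permutation, they are disjoint and together cover every edge exactly once.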


In [None]:
embeddings = du.train_link_prediction(
    g, splits,
    embed_dim=32,
    epochs=2,          # kept small for the API demo; the example notebook can train longer
    lr=1e-3,
    device="cpu",
    movie_feat_tensor=movie_feat_tensor  # None if we have no movies.csv
)
user_emb  = embeddings["user_emb"]
movie_emb = embeddings["movie_emb"]
user_emb.shape, movie_emb.shape

In [None]:
metrics_k10 = du.evaluate_precision_recall_at_k(user_emb, movie_emb, test_pairs, k=10)
metrics_k5  = du.evaluate_precision_recall_at_k(user_emb, movie_emb, test_pairs, k=5)
print("P@10/R@10:", metrics_k10)
print("P@5/R@5 :", metrics_k5)
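
To make the metric concrete, here is a minimal NumPy sketch of Precision@K / Recall@K from dot-product scores. It assumes the embeddings are NumPy arrays and averages per user over users with at least one held-out edge. `precision_recall_at_k_sketch` is a hypothetical name, and the wrapper may differ, for example by masking training items before ranking.

```python
import numpy as np


def precision_recall_at_k_sketch(user_emb, item_emb, test_pairs, k=10):
    """Average per-user P@K and R@K using dot-product scores (hypothetical helper)."""
    # Group held-out items by user.
    held_out = {}
    for u, v in test_pairs:
        held_out.setdefault(u, set()).add(v)

    precisions, recalls = [], []
    for u, relevant in held_out.items():
        scores = item_emb @ user_emb[u]      # one score per item
        top_k = np.argsort(-scores)[:k]      # indices of the k highest scores
        hits = sum(1 for v in top_k if int(v) in relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    return {
        "precision@k": float(np.mean(precisions)),
        "recall@k": float(np.mean(recalls)),
    }


toy_user_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_item_emb = np.array([[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.5]])
toy_test_pairs = [(0, 0), (0, 2), (1, 2)]
toy_metrics = precision_recall_at_k_sketch(toy_user_emb, toy_item_emb, toy_test_pairs, k=2)
print(toy_metrics)  # {'precision@k': 0.5, 'recall@k': 0.75}
```

On small graphs, P@K is capped by how few held-out edges each user has, so low absolute values are expected in the toy setting.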

In [None]:
# Fit on train edges, evaluate on held-out test edges: fast and interpretable.
reg = du.fit_edge_regressor_ridge(user_emb, movie_emb, train_pairs, train_ratings, alpha=1.0)
rmse_test = du.rmse_from_regressor(reg, user_emb, movie_emb, test_pairs, test_ratings)
print("Rating RMSE (test):", rmse_test)
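
The idea of a ridge regressor on frozen embeddings can be sketched in closed form with NumPy. Everything here is an assumption for illustration: the feature layout (user and movie embeddings concatenated, plus a bias column), the helper names, and the choice to regularize the bias too. The actual `du.fit_edge_regressor_ridge` may build features differently or delegate to a library.

```python
import numpy as np


def pair_features_sketch(user_emb, item_emb, pairs):
    """Concatenate [user embedding, item embedding, 1] for each (u, v) pair."""
    u_idx = [u for u, _ in pairs]
    v_idx = [v for _, v in pairs]
    bias = np.ones((len(pairs), 1))
    return np.hstack([user_emb[u_idx], item_emb[v_idx], bias])


def fit_ridge_sketch(X, y, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y; note the bias column is penalized too."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)


def rmse_sketch(pred, target):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic check: targets that are exactly linear in the features are recovered.
rng = np.random.default_rng(0)
toy_users = rng.normal(size=(5, 3))
toy_items = rng.normal(size=(6, 3))
toy_pairs = [(u, v) for u in range(5) for v in range(6)]
X_toy = pair_features_sketch(toy_users, toy_items, toy_pairs)
y_toy = X_toy @ rng.normal(size=X_toy.shape[1])
w_toy = fit_ridge_sketch(X_toy, y_toy, alpha=1e-8)
print("train RMSE:", rmse_sketch(X_toy @ w_toy, y_toy))
```

Keeping the GNN frozen means this step measures how much rating signal the link-prediction embeddings already carry, rather than training a second model end to end.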

In [None]:
user_seen = du.build_user_seen_map(g, splits["train_eids"])
len(user_seen), list(next(iter(user_seen.values())))[0:5]

In [None]:
sample_user = 0  # change it to explore
topn = du.recommend_topk_for_user(
    sample_user,
    user_emb,
    movie_emb,
    seen_items=user_seen.get(sample_user, set()),
    k=10,
)
title_lookup = du.id_maps_to_title_lookup(movies, maps.get("item_map"))
recs = [(mid, title_lookup.get(mid, f"movie_{mid}")) for mid in topn]
recs[:10]
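
Top-N recommendation from embeddings amounts to scoring every movie for one user, dropping already-seen movies, and keeping the best `k`. A minimal sketch, assuming NumPy embeddings and dot-product scoring (`recommend_top_k_sketch` is a hypothetical name, not the wrapper's implementation):

```python
import numpy as np


def recommend_top_k_sketch(user, user_emb, item_emb, seen_items=(), k=10):
    """Rank items by dot-product score, skipping items the user has already seen."""
    scores = item_emb @ user_emb[user]
    seen = set(seen_items)
    order = np.argsort(-scores)  # best-scoring item first
    return [int(i) for i in order if int(i) not in seen][:k]


toy_user_emb = np.array([[1.0, 0.0]])
toy_item_emb = np.array([[0.9, 0.0], [0.5, 0.0], [0.7, 0.0], [0.1, 0.0]])
toy_recs = recommend_top_k_sketch(0, toy_user_emb, toy_item_emb, seen_items={0}, k=2)
print(toy_recs)  # [2, 1]
```

Item 0 has the highest score but is filtered out because it is in the seen set, so the list starts with item 2.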

In [None]:
# Popularity = most-rated movies in *train* split.
u_tr, v_tr = g.find_edges(splits["train_eids"], etype=("user","rates","movie"))
counts = np.bincount(v_tr.numpy(), minlength=g.num_nodes("movie"))
pop_rank = np.argsort(-counts)  # descending
# Exclude seen items for the same user:
pop_recs = [m for m in pop_rank.tolist() if m not in user_seen.get(sample_user, set())][:10]
[(mid, title_lookup.get(mid, f"movie_{mid}")) for mid in pop_recs]

In [None]:
summary = {
    "num_users": user_emb.shape[0],
    "num_movies": movie_emb.shape[0],
    "num_edges": g.num_edges(("user","rates","movie")),
    "P@10": metrics_k10["precision@k"],
    "R@10": metrics_k10["recall@k"],
    "P@5":  metrics_k5["precision@k"],
    "R@5":  metrics_k5["recall@k"],
    "RMSE": rmse_test,
}
summary


## Notes (for API notebook)
- **Edge weights** (ratings) are stored on `("user","rates","movie")`.
- **Node features**: if `movies.csv` is available, movie genre one-hot features are fused with learnable embeddings via a linear projector.
- **Encoder**: homogeneous GraphSAGE for clarity; the example notebook can explore hetero modules & neighbor sampling.
- **Metrics**:
  - **P@K/R@K** from dot-product link scores (implicit rec quality).
  - **RMSE** from a fast ridge regressor on frozen embeddings (explicit rating quality).
- **Top-N**: demo recs excluding training items, with title mapping.
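
For reference, the metrics listed above are commonly defined as follows. The notation is an assumption, not taken from `dgl_utils.py`: $U$ is the set of users with at least one held-out edge, $T_u$ the held-out movies of user $u$, $R_u^K$ the top-$K$ ranked movies for $u$, and $\hat r_{uv}$ the predicted rating. The wrapper's exact averaging may differ.

$$
\mathrm{P@}K = \frac{1}{|U|}\sum_{u \in U}\frac{|R_u^K \cap T_u|}{K},
\qquad
\mathrm{R@}K = \frac{1}{|U|}\sum_{u \in U}\frac{|R_u^K \cap T_u|}{|T_u|}
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|E_{\text{test}}|}\sum_{(u,v)\in E_{\text{test}}}\left(\hat r_{uv} - r_{uv}\right)^2}
$$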

### Next steps (for the Example notebook)
- Longer training (more epochs), neighbor sampling, or hetero GNN layers.
- Popularity/MF baselines and detailed comparisons.
- Add side information (directors/actors) to a **heterogeneous** graph.
- Explore **temporal** splits and time-aware evaluation.
