# 09 - Link Prediction with Node2Vec and DeepWalk Embeddings

This notebook uses the embeddings generated by **Node2Vec** and **DeepWalk** for the **link prediction** task, reusing the edge split created in Notebook 08.

Workflow:

1. Load the edge splits (`08_edge_splits.csv`).
2. Load the embeddings from `node2vec_embeddings.csv` and `deepwalk_embeddings.csv`.
3. Build edge features from the embeddings (Hadamard product + |difference|).
4. Train a logistic regression classifier to predict links.
5. Evaluate on train/val/test (AUC, AP) and save the results to CSV.
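
The edge-feature step can be illustrated on toy vectors. The sketch below builds the Hadamard + |difference| feature for a single node pair; `z_u` and `z_v` are made-up values, not taken from the real embedding files.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for two nodes (illustrative values only).
z_u = np.array([1.0, -2.0, 0.5])
z_v = np.array([0.5, 1.0, 0.5])

hadamard = z_u * z_v          # element-wise product
abs_diff = np.abs(z_u - z_v)  # element-wise absolute difference

# The edge feature vector has twice the embedding dimension.
edge_features = np.concatenate([hadamard, abs_diff])
print(edge_features.shape)
print(edge_features)
```

Both operators are symmetric in `u` and `v`, so the feature vector does not depend on the order in which an undirected edge is written.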


In [1]:
!pip install node2vec


In [2]:
print("Running Node2Vec + DeepWalk generator from Jupyter notebook...")
!python ../analysis/node2vec_deepwalk.py

print("\nVerifying files:\n")
!ls -lh /workspace/results | grep embeddings

Running Node2Vec + DeepWalk generator from Jupyter notebook...
EDGELIST_PATH: /workspace/data/collaboration.edgelist.txt exists= True
Saving results in: /workspace/results
   source  target
0       0    1680
1       0    6918
2       0   19642
3       1    4131
4       1    5645

=== GRAPH OVERVIEW ===
nodes: 23133
edges: 93439

Computing transition probabilities: 100%|█| 23133/23133 [00:05<00:00, 4304.25it/
Generating walks (CPU: 1): 100%|██████████████████| 5/5 [00:06<00:00,  1.21s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Saved: /workspace/results/node2vec_embeddings.csv

Computing transition probabilities: 100%|█| 23133/23133 [00:05<00:00, 4206.55it/


In [3]:
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
import matplotlib.pyplot as plt

plt.style.use("default")

# === Automatically detect the project root directory ===
NOTEBOOK_DIR = Path.cwd()                      # /workspaces/.../notebooks
PROJECT_DIR = NOTEBOOK_DIR.parent              # /workspaces/.../
DATA_DIR = PROJECT_DIR / "data"
RESULTS_DIR = PROJECT_DIR / "results"
FIG_DIR = RESULTS_DIR / "figures"

RESULTS_DIR.mkdir(exist_ok=True, parents=True)
FIG_DIR.mkdir(exist_ok=True, parents=True)

# === Paths to files produced by earlier steps ===
SPLITS_PATH    = RESULTS_DIR / "08_edge_splits.csv"
NODE2VEC_CSV   = RESULTS_DIR / "node2vec_embeddings.csv"
DEEPWALK_CSV   = RESULTS_DIR / "deepwalk_embeddings.csv"

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("PROJECT_DIR :", PROJECT_DIR)
print("DATA_DIR    :", DATA_DIR)
print("RESULTS_DIR :", RESULTS_DIR)
print("NODE2VEC   :", NODE2VEC_CSV, "| exists =", NODE2VEC_CSV.exists())
print("DEEPWALK   :", DEEPWALK_CSV, "| exists =", DEEPWALK_CSV.exists())
print("SPLITS     :", SPLITS_PATH, "| exists =", SPLITS_PATH.exists())

if not SPLITS_PATH.exists():
    raise FileNotFoundError("File 08_edge_splits.csv not found; run Notebook 08 first.")

NOTEBOOK_DIR: /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/notebooks
PROJECT_DIR : /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo
DATA_DIR    : /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/data
RESULTS_DIR : /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results
NODE2VEC   : /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/node2vec_embeddings.csv | exists = True
DEEPWALK   : /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/deepwalk_embeddings.csv | exists = True
SPLITS     : /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/08_edge_splits.csv | exists = True


## 1. Load the edge splits (same as in Notebook 08)


In [4]:
df_edges_all = pd.read_csv(SPLITS_PATH)
print(df_edges_all.head())
print("Splits:")
print(df_edges_all["split"].value_counts())
print("Labels:")
print(df_edges_all["label"].value_counts())

df_train = df_edges_all[df_edges_all["split"] == "train"].reset_index(drop=True)
df_val = df_edges_all[df_edges_all["split"] == "val"].reset_index(drop=True)
df_test = df_edges_all[df_edges_all["split"] == "test"].reset_index(drop=True)

print("Train shape:", df_train.shape)
print("Val shape:", df_val.shape)
print("Test shape:", df_test.shape)


       u      v  label  split
0  14097  19856      1  train
1  16449  10738      1  train
2  10217  12337      1  train
3   8516  12628      1  train
4  10310  18193      1  train
Splits:
split
train    130814
test      28034
val       28030
Name: count, dtype: int64
Labels:
label
1    93439
0    93439
Name: count, dtype: int64
Train shape: (130814, 4)
Val shape: (28030, 4)
Test shape: (28034, 4)
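
Since every model in this notebook reuses this split, it can be worth confirming that the labels are balanced within each split and that no node pair appears in more than one split. The sketch below does this on a tiny hand-made frame with the same columns (`u`, `v`, `label`, `split`); the helper name `check_splits` and the toy rows are assumptions for illustration.

```python
import pandas as pd

def check_splits(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-split label counts; raise if an undirected pair appears in two splits."""
    # Canonical (min, max) ordering treats (u, v) and (v, u) as the same undirected pair.
    lo = df[["u", "v"]].min(axis=1)
    hi = df[["u", "v"]].max(axis=1)
    pairs = pd.DataFrame({"lo": lo, "hi": hi, "split": df["split"]})
    n_splits = pairs.groupby(["lo", "hi"])["split"].nunique()
    if (n_splits > 1).any():
        raise ValueError("Some node pairs appear in more than one split.")
    return df.groupby(["split", "label"]).size().unstack(fill_value=0)

# Toy split table (illustrative only).
toy = pd.DataFrame({
    "u": [0, 1, 2, 3],
    "v": [1, 2, 3, 0],
    "label": [1, 0, 1, 0],
    "split": ["train", "train", "test", "test"],
})
counts = check_splits(toy)
print(counts)
```

With the real file, the equivalent call would be `check_splits(df_edges_all)`.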


## 2. Generic functions for loading embeddings and building edge features


In [5]:
def load_embeddings(path: Path) -> pd.DataFrame:
    """Load an embeddings CSV and normalize its id column to an integer `node` column."""
    print("Loading:", path)

    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")

    df = pd.read_csv(path)

    # Try to detect the node-id column automatically
    possible_cols = ["node", "id", "vertex", "index"]
    found = next((c for c in possible_cols if c in df.columns), None)

    if found is None:
        raise ValueError(f"No id column found in {path}. "
                         f"Expected one of: {possible_cols}")

    df = df.rename(columns={found: "node"})
    df["node"] = df["node"].astype(int)

    return df



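def build_emb_dict(df_emb: pd.DataFrame) -> tuple[dict, int]:
    """Return a {node_id: embedding vector} dict and the embedding dimension."""
    embed_cols = [c for c in df_emb.columns if c.startswith("dim_")]
    emb_dim = len(embed_cols)
    print("Embedding dimension:", emb_dim)
    # Convert once to a NumPy matrix instead of slicing each row with iterrows()
    vectors = df_emb[embed_cols].to_numpy(dtype=float)
    emb_dict = {int(node): vec for node, vec in zip(df_emb["node"], vectors)}
    return emb_dict, emb_dim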


def edge_to_features(u: int, v: int, emb_dict: dict, emb_dim: int) -> np.ndarray:
    if u not in emb_dict or v not in emb_dict:
        raise KeyError(f"Node {u} or {v} not found in embeddings.")
    z_u = emb_dict[u]
    z_v = emb_dict[v]
    had = z_u * z_v
    diff = np.abs(z_u - z_v)
    return np.concatenate([had, diff], axis=0)


def build_Xy(df_edges: pd.DataFrame, emb_dict: dict, emb_dim: int):
    feat_dim = emb_dim * 2
    X = np.zeros((len(df_edges), feat_dim), dtype=float)
    y = df_edges["label"].to_numpy(dtype=int)
    # Use positional indices so the function does not depend on a 0..n-1 DataFrame index
    for i, (u, v) in enumerate(zip(df_edges["u"], df_edges["v"])):
        X[i] = edge_to_features(int(u), int(v), emb_dict, emb_dim)
    return X, y


def evaluate_link_prediction(emb_name: str, df_train, df_val, df_test, emb_dict: dict, emb_dim: int):
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, average_precision_score

    print(f"\n=== Evaluating embeddings: {emb_name} ===")

    X_train, y_train = build_Xy(df_train, emb_dict, emb_dim)
    X_val, y_val = build_Xy(df_val, emb_dict, emb_dim)
    X_test, y_test = build_Xy(df_test, emb_dict, emb_dim)

    print("X_train shape:", X_train.shape, "y_train shape:", y_train.shape)

    clf = LogisticRegression(max_iter=1000, class_weight="balanced", n_jobs=-1)
    clf.fit(X_train, y_train)

    def eval_split(name, X, y):
        y_scores = clf.predict_proba(X)[:, 1]
        auc = roc_auc_score(y, y_scores)
        ap = average_precision_score(y, y_scores)
        print(f"{emb_name} | {name} - AUC: {auc:.4f}, AP: {ap:.4f}")
        return {
            "embedding": emb_name,
            "split": name,
            "auc": auc,
            "ap": ap,
            "num_samples": len(y),
            "pos_samples": int(y.sum()),
            "neg_samples": int((y == 0).sum()),
        }

    results = []
    results.append(eval_split("train", X_train, y_train))
    results.append(eval_split("val", X_val, y_val))
    results.append(eval_split("test", X_test, y_test))

    return pd.DataFrame(results)
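
The row-by-row loop in `build_Xy` is easy to follow but can be slow on roughly 130k training pairs. A vectorized alternative is sketched below, assuming the embeddings are held in a matrix plus a node-id-to-row lookup. The names `build_Xy_vectorized`, `emb_matrix` and `node_to_row` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def build_Xy_vectorized(df_edges, emb_matrix, node_to_row):
    """Hadamard + |difference| features for all edges at once."""
    # Map node ids to row positions in the embedding matrix (KeyError if a node is missing).
    u_idx = np.array([node_to_row[u] for u in df_edges["u"]])
    v_idx = np.array([node_to_row[v] for v in df_edges["v"]])
    z_u, z_v = emb_matrix[u_idx], emb_matrix[v_idx]
    X = np.hstack([z_u * z_v, np.abs(z_u - z_v)])
    y = df_edges["label"].to_numpy(dtype=int)
    return X, y

# Toy data: three nodes with 2-dimensional embeddings (illustrative values only).
node_ids = [10, 20, 30]
emb_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
node_to_row = {n: i for i, n in enumerate(node_ids)}
edges = pd.DataFrame({"u": [10, 20], "v": [30, 30], "label": [1, 0]})

X, y = build_Xy_vectorized(edges, emb_matrix, node_to_row)
print(X.shape, y)
```

The feature layout matches `edge_to_features`: the first half of each row is the Hadamard product, the second half the absolute difference.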


## 3. Evaluate Node2Vec


In [6]:
df_node2vec = load_embeddings(NODE2VEC_CSV)
print("Node2Vec embeddings shape:", df_node2vec.shape)
emb_dict_n2v, emb_dim_n2v = build_emb_dict(df_node2vec)

df_results_node2vec = evaluate_link_prediction(
    emb_name="node2vec",
    df_train=df_train,
    df_val=df_val,
    df_test=df_test,
    emb_dict=emb_dict_n2v,
    emb_dim=emb_dim_n2v,
)

n2v_results_path = RESULTS_DIR / "09_node2vec_link_prediction_results.csv"
df_results_node2vec.to_csv(n2v_results_path, index=False)
print("Saved Node2Vec link prediction results to:", n2v_results_path)

display(df_results_node2vec)


Loading: /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/node2vec_embeddings.csv
Node2Vec embeddings shape: (23133, 65)
Embedding dimension: 64

=== Evaluating embeddings: node2vec ===
X_train shape: (130814, 128) y_train shape: (130814,)
node2vec | train - AUC: 0.9994, AP: 0.9992
node2vec | val - AUC: 0.9996, AP: 0.9996
node2vec | test - AUC: 0.9995, AP: 0.9995
Saved Node2Vec link prediction results to: /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/09_node2vec_link_prediction_results.csv


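|   | embedding | split | auc | ap | num_samples | pos_samples | neg_samples |
|---|-----------|-------|-----|----|-------------|-------------|-------------|
| 0 | node2vec | train | 0.999385 | 0.999237 | 130814 | 65407 | 65407 |
| 1 | node2vec | val | 0.99958 | 0.999562 | 28030 | 14015 | 14015 |
| 2 | node2vec | test | 0.999527 | 0.999473 | 28034 | 14017 | 14017 |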
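
As a reminder of what the AUC column measures: it is the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The NumPy-only sketch below computes it by brute force on four toy scores. `pairwise_auc` is a hypothetical helper for illustration; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # All positive-vs-negative score differences; O(n_pos * n_neg) memory, so toy sizes only.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy labels and scores (illustrative only).
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```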


## 4. Evaluate DeepWalk (Node2Vec with p = q = 1)
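
The equivalence in the heading refers to the random walk. Node2Vec biases each step with a return parameter `p` and an in-out parameter `q`. With `p = q = 1`, every neighbor gets the same weight, which is the uniform walk DeepWalk uses. The sketch below computes these second-order step weights on a toy graph; the function name and the graph are made up, and it covers only the walk, not the skip-gram training.

```python
def node2vec_step_weights(adj, prev, curr, p=1.0, q=1.0):
    """Unnormalized weights for moving from `curr` to each neighbor, having arrived from `prev`."""
    weights = {}
    for nxt in adj[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p      # step back to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0          # neighbor shared with the previous node
        else:
            weights[nxt] = 1.0 / q      # move farther away from the previous node
    return weights

# Toy undirected graph as an adjacency dict (illustrative only).
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

uniform = node2vec_step_weights(adj, prev=0, curr=1, p=1.0, q=1.0)
biased = node2vec_step_weights(adj, prev=0, curr=1, p=4.0, q=0.5)
print(uniform)
print(biased)
```

In the biased call, a large `p` discourages returning to node 0 and a small `q` favors the step outward to node 3.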


In [7]:
df_deepwalk = load_embeddings(DEEPWALK_CSV)
print("DeepWalk embeddings shape:", df_deepwalk.shape)
emb_dict_dw, emb_dim_dw = build_emb_dict(df_deepwalk)

df_results_deepwalk = evaluate_link_prediction(
    emb_name="deepwalk",
    df_train=df_train,
    df_val=df_val,
    df_test=df_test,
    emb_dict=emb_dict_dw,
    emb_dim=emb_dim_dw,
)

dw_results_path = RESULTS_DIR / "09_deepwalk_link_prediction_results.csv"
df_results_deepwalk.to_csv(dw_results_path, index=False)
print("Saved DeepWalk link prediction results to:", dw_results_path)

display(df_results_deepwalk)


Loading: /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/deepwalk_embeddings.csv
DeepWalk embeddings shape: (23133, 33)
Embedding dimension: 32

=== Evaluating embeddings: deepwalk ===
X_train shape: (130814, 64) y_train shape: (130814,)
deepwalk | train - AUC: 0.9971, AP: 0.9971
deepwalk | val - AUC: 0.9973, AP: 0.9975
deepwalk | test - AUC: 0.9977, AP: 0.9978
Saved DeepWalk link prediction results to: /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/09_deepwalk_link_prediction_results.csv


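|   | embedding | split | auc | ap | num_samples | pos_samples | neg_samples |
|---|-----------|-------|-----|----|-------------|-------------|-------------|
| 0 | deepwalk | train | 0.997065 | 0.997082 | 130814 | 65407 | 65407 |
| 1 | deepwalk | val | 0.997323 | 0.9975 | 28030 | 14015 | 14015 |
| 2 | deepwalk | test | 0.997668 | 0.997758 | 28034 | 14017 | 14017 |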


## 5. Compare with previous results (GNN + heuristics)

In this cell, we load:

- `08_link_prediction_results.csv` (GNN + GNN embeddings + logistic regression);
- `baseline_heuristics_results.csv` (CN, Jaccard, AA, etc.);
- and combine them with the Node2Vec / DeepWalk results.


In [8]:
gnn_results_path = RESULTS_DIR / "08_link_prediction_results.csv"
baseline_results_path = RESULTS_DIR / "baseline_heuristics_results.csv"

if not gnn_results_path.exists():
    print("⚠ 08_link_prediction_results.csv not found. Run Notebook 08 to include the GNN in the comparison.")
    df_gnn = pd.DataFrame()
else:
    df_gnn = pd.read_csv(gnn_results_path)
    df_gnn["embedding"] = "gnn"

if not baseline_results_path.exists():
    print("⚠ baseline_heuristics_results.csv not found. Run baseline_heuristics.py to include the heuristics in the comparison.")
    df_baseline = pd.DataFrame()
else:
    df_baseline = pd.read_csv(baseline_results_path)
    # rename for compatibility: heuristic -> embedding
    df_baseline = df_baseline.rename(columns={"heuristic": "embedding"})

df_all_results = []
df_all_results.append(df_results_node2vec)
df_all_results.append(df_results_deepwalk)
if not df_gnn.empty:
    df_all_results.append(df_gnn)
if not df_baseline.empty:
    df_all_results.append(df_baseline)

df_all_results = pd.concat(df_all_results, ignore_index=True)

comparison_path = RESULTS_DIR / "09_link_prediction_comparison.csv"
df_all_results.to_csv(comparison_path, index=False)
print("Saved combined results to:", comparison_path)

display(df_all_results[df_all_results["split"] == "test"])


⚠ baseline_heuristics_results.csv not found. Run baseline_heuristics.py to include the heuristics in the comparison.
Saved combined results to: /workspaces/upe-ppgec-netsci-2025-1-projeto-icbvo/results/09_link_prediction_comparison.csv


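|   | embedding | split | auc | ap | num_samples | pos_samples | neg_samples |
|---|-----------|-------|-----|----|-------------|-------------|-------------|
| 2 | node2vec | test | 0.999527 | 0.999473 | 28034 | 14017 | 14017 |
| 5 | deepwalk | test | 0.997668 | 0.997758 | 28034 | 14017 | 14017 |
| 8 | gnn | test | 0.919111 | 0.909965 | 28034 | 14017 | 14017 |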
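
To read the comparison at a glance, the test rows can be indexed by method and sorted by AUC. The sketch below retypes the three test rows from the table above by hand rather than reading `09_link_prediction_comparison.csv`; the variable names are illustrative.

```python
import pandas as pd

# Test-split metrics copied from the comparison table above.
df_test_results = pd.DataFrame({
    "embedding": ["node2vec", "deepwalk", "gnn"],
    "split": ["test", "test", "test"],
    "auc": [0.999527, 0.997668, 0.919111],
    "ap": [0.999473, 0.997758, 0.909965],
})

# One row per method, best AUC first.
summary = (
    df_test_results
    .set_index("embedding")[["auc", "ap"]]
    .sort_values("auc", ascending=False)
)
print(summary.round(4))
```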


In [9]:
print('Notebook 09 finished successfully. 🚀')

Notebook 09 finished successfully. 🚀
