# 05 — Link Prediction (CV ↔ Job)

## Objective
Predict missing links between CVs and Jobs.

We compare two families of methods:
1) **Structural methods (graph-based)**:
   - 2-hop Common Neighbors score (on the bipartite graph)
   - Adamic-Adar (via the CV–CV or Job–Job projection)
2) **Semantic methods (LLM / embeddings)**:
   - embeddings of the CV and Job texts
   - scoring by cosine similarity

## Evaluation
We hold out a fraction of the edges (the test edges), score candidate links,
and compute **Precision@K**:
- among the top-K predicted links, what fraction are actually in the test set?
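
As a minimal illustration of the metric, the sketch below computes Precision@K on a toy ranking. The function name `precision_at_k_toy` and the node IDs are hypothetical and chosen only for this example; they are not taken from the dataset.

```python
def precision_at_k_toy(ranked_pairs, held_out, k):
    """Fraction of the first k ranked pairs that appear in the held-out set."""
    top = ranked_pairs[:k]
    hits = sum(1 for pair in top if pair in held_out)
    return hits / k

# Hypothetical ranking, best score first.
ranked = [("CV_A", "JOB_1"), ("CV_B", "JOB_2"), ("CV_A", "JOB_3"), ("CV_C", "JOB_1")]
# Hypothetical held-out (test) edges.
held_out = {("CV_A", "JOB_1"), ("CV_C", "JOB_1")}

p_at_2 = precision_at_k_toy(ranked, held_out, k=2)  # 1 hit among the top 2
p_at_4 = precision_at_k_toy(ranked, held_out, k=4)  # 2 hits among the top 4
print(p_at_2, p_at_4)
```

Both values are 0.5 here: one of the first two pairs and two of the first four pairs are held-out edges.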

## Inputs
- `results/graph_bipartite.graphml`
- `data/processed/cv_clean.csv`
- `data/processed/job_clean.csv`
- `data/processed/edges.csv`

## Outputs
- `results/linkpred_structural.csv`
- `results/linkpred_semantic.csv`
- `results/linkpred_metrics.json`

In [1]:
import os
import json
import random
import numpy as np
import pandas as pd

import networkx as nx
from networkx.algorithms import bipartite

from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
G = nx.read_graphml("results/graph_bipartite.graphml")
cv = pd.read_csv("data/processed/cv_clean.csv")
job = pd.read_csv("data/processed/job_clean.csv")
edges = pd.read_csv("data/processed/edges.csv")

cv_nodes = [n for n, d in G.nodes(data=True) if d.get("node_type") == "cv"]
job_nodes = [n for n, d in G.nodes(data=True) if d.get("node_type") == "job"]

print("Graph:", G.number_of_nodes(), "nodes |", G.number_of_edges(), "edges")
print("CV nodes:", len(cv_nodes), "Jobs nodes:", len(job_nodes))
print("edges table:", edges.shape)

Graph: 390 nodes | 1200 edges
CV nodes: 300 Jobs nodes: 90
edges table: (1200, 2)


## 2) Train/test split on the edges

We hold out 20% of the edges to measure prediction quality.
We build a training graph `G_train` (without the test edges).

In [3]:
all_edges = list(G.edges())
train_edges, test_edges = train_test_split(all_edges, test_size=0.2, random_state=42)

G_train = G.copy()
G_train.remove_edges_from(test_edges)

test_set = set((u, v) if u < v else (v, u) for (u, v) in test_edges)  # canonical

print("Train edges:", len(train_edges))
print("Test edges:", len(test_edges))

Train edges: 960
Test edges: 240


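## 3) Structural method 1: 2-hop Common Neighbors score (bipartite)

Score(CV, Job) = number of length-2 paths between them,
i.e. the number of neighbors they share in the bipartite graph.

Interpretation:
- the more 2-hop paths there are, the closer the CV and the Job are structurally.

Caveat: in a strictly bipartite graph, the neighbors of a CV are all Jobs and the neighbors of a Job are all CVs, so the two neighbor sets cannot overlap. This score is therefore 0 for every CV-Job pair, which is what the output of cell `In [6]` shows.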
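
A CV and a Job sit on opposite sides of a bipartite graph, so the shortest indirect connection between them has length 3 (CV → other Job → other CV → Job). The sketch below counts such paths as one possible bipartite-aware alternative; it is not the scoring used in this notebook. The function name `three_hop_paths`, the plain-dict adjacency `adj`, and the toy node IDs are assumptions made for illustration. With networkx, `set(G_train.neighbors(n))` would play the role of `adj[n]`.

```python
def three_hop_paths(adj, cv_id, job_id):
    """Count paths cv -> other job -> other cv -> job in a bipartite adjacency dict."""
    cvs_of_target = adj[job_id]
    count = 0
    for other_job in adj[cv_id]:
        if other_job == job_id:
            continue  # skip the direct edge, if present
        for other_cv in adj[other_job]:
            if other_cv != cv_id and other_cv in cvs_of_target:
                count += 1
    return count

# Hypothetical toy graph: node -> set of neighbors (both directions stored).
adj = {
    "CV_A": {"JOB_1"},
    "CV_B": {"JOB_1", "JOB_2"},
    "CV_C": {"JOB_2"},
    "CV_D": {"JOB_3"},
    "JOB_1": {"CV_A", "CV_B"},
    "JOB_2": {"CV_B", "CV_C"},
    "JOB_3": {"CV_D"},
}

score_a_2 = three_hop_paths(adj, "CV_A", "JOB_2")  # path CV_A -> JOB_1 -> CV_B -> JOB_2
score_d_1 = three_hop_paths(adj, "CV_D", "JOB_1")  # CV_D shares no job with anyone
print(score_a_2, score_d_1)
```

Unlike a direct neighbor-set intersection, this count can be non-zero for an unlinked CV-Job pair, because it goes through a CV that shares a Job with the source CV.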

In [4]:
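def two_hop_cn(Gb, cv_id, job_id):
    # neighbors(cv) are jobs and neighbors(job) are CVs, so on a strictly
    # bipartite graph these two sets are disjoint and the count is always 0.
    cv_neighbors = set(Gb.neighbors(cv_id))
    job_neighbors = set(Gb.neighbors(job_id))
    return len(cv_neighbors & job_neighbors)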

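### Generating negative candidates (unknown pairs)

Scoring every CV×Job pair becomes impractical when the graph is large,
so we sample candidate pairs that are not linked in `G_train`.
Held-out test edges are absent from `G_train`, so they can be sampled too; they are the positives that Precision@K counts.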
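
Drawing pairs independently can return the same pair more than once, and duplicates take up several slots in a top-K list. The sketch below shows one way to sample distinct unlinked pairs. The helper name `sample_unique_candidates` and its arguments are hypothetical, and `linked_pairs` is assumed to be a set of `(cv, job)` tuples.

```python
import random

def sample_unique_candidates(cv_nodes, job_nodes, linked_pairs, n, seed=42):
    """Draw up to n distinct (cv, job) pairs that are not in linked_pairs."""
    rng = random.Random(seed)
    # Cap n at the number of unlinked pairs so the loop always terminates.
    max_pairs = len(cv_nodes) * len(job_nodes) - len(linked_pairs)
    n = min(n, max_pairs)
    seen = set()
    while len(seen) < n:
        pair = (rng.choice(cv_nodes), rng.choice(job_nodes))
        if pair not in linked_pairs:
            seen.add(pair)  # the set silently drops repeated draws
    return sorted(seen)  # sorted for a reproducible order

# Hypothetical toy inputs.
toy_cvs = ["CV_A", "CV_B", "CV_C"]
toy_jobs = ["JOB_1", "JOB_2"]
toy_linked = {("CV_A", "JOB_1")}

toy_candidates = sample_unique_candidates(toy_cvs, toy_jobs, toy_linked, n=4)
print(toy_candidates)
```

Rejection sampling like this slows down when `n` approaches the number of unlinked pairs; in that regime, enumerating all unlinked pairs and sampling from the list may be simpler.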

In [5]:
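random.seed(42)

N_CANDIDATES = 20000  # adjust to the graph size; 20k is a starting point
candidates = []

# Rejection sampling with replacement: the same (cv, job) pair can be drawn more than once.
while len(candidates) < N_CANDIDATES:
    c = random.choice(cv_nodes)
    j = random.choice(job_nodes)
    if not G_train.has_edge(c, j):  # keep only pairs not linked in the training graph
        candidates.append((c, j))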

print("Candidates sampled:", len(candidates))

Candidates sampled: 20000


In [6]:
struct_scores = []
for c, j in candidates:
    struct_scores.append((c, j, two_hop_cn(G_train, c, j)))

df_struct = pd.DataFrame(struct_scores, columns=["cv_id", "job_id", "score_cn"])
df_struct.sort_values("score_cn", ascending=False).head(10)

| | cv_id | job_id | score_cn |
|---|---|---|---|
| 19983 | CV_0063 | JOB_0021 | 0 |
| 19982 | CV_0194 | JOB_0047 | 0 |
| 19981 | CV_0135 | JOB_0045 | 0 |
| 19980 | CV_0126 | JOB_0010 | 0 |
| 19979 | CV_0125 | JOB_0017 | 0 |
| 19978 | CV_0218 | JOB_0036 | 0 |
| 19977 | CV_0047 | JOB_0065 | 0 |
| 19976 | CV_0098 | JOB_0007 | 0 |
| 19975 | CV_0053 | JOB_0087 | 0 |
| 19974 | CV_0270 | JOB_0015 | 0 |


## 4) Evaluation: Precision@K (structural)

We take the top-K predictions and count how many are actually in `test_edges`.
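
To read a Precision@K value, it helps to know what a random ranking would achieve. The sketch below estimates that baseline from the counts printed by cells `In [2]` and `In [3]`. It assumes the graph has no duplicate edges and that every edge joins a CV to a Job; the variable names are hypothetical.

```python
n_cv, n_job = 300, 90        # node counts printed by cell In [2]
n_train, n_test = 960, 240   # edge counts printed by cell In [3]

# Pairs eligible as candidates: every CV x Job pair that is not linked in G_train.
n_unlinked = n_cv * n_job - n_train

# Expected Precision@K of a ranker that orders the candidates at random:
# the share of held-out edges among all unlinked pairs.
random_baseline = n_test / n_unlinked
print(f"{n_unlinked} unlinked pairs, random baseline = {random_baseline:.4f}")
```

Under these assumptions the baseline is a little below 1%, so a method only adds value to the extent that its Precision@K exceeds that level.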

In [7]:
def precision_at_k(pred_pairs, test_edge_set, k=200):
    top = pred_pairs[:k]
    hits = 0
    for (c, j) in top:
        e = (c, j) if c < j else (j, c)
        if e in test_edge_set:
            hits += 1
    return hits / k

# Top predictions
df_struct_sorted = df_struct.sort_values("score_cn", ascending=False)
top_pairs_struct = list(zip(df_struct_sorted["cv_id"], df_struct_sorted["job_id"]))

p_at_200_struct = precision_at_k(top_pairs_struct, test_set, k=200)
p_at_500_struct = precision_at_k(top_pairs_struct, test_set, k=500)

print("Precision@200 (struct CN):", p_at_200_struct)
print("Precision@500 (struct CN):", p_at_500_struct)

Precision@200 (struct CN): 0.015
Precision@500 (struct CN): 0.014


## 5) Semantic method: Embeddings (sentence-transformers)

We encode the `clean_text` column of the CVs and Jobs, then score each pair by cosine similarity.
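
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The minimal pure-Python sketch below shows it on toy vectors; the function name `cosine` and the vectors are illustrative and are not embeddings produced by the model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score (up to rounding) 1 and perpendicular vectors score 0, regardless of their lengths, which is why the measure suits embeddings whose magnitude carries little meaning.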



In [8]:
try:
    from sentence_transformers import SentenceTransformer
    print("✅ sentence-transformers available")
except ModuleNotFoundError:
    !pip -q install sentence-transformers
    from sentence_transformers import SentenceTransformer
    print("✅ Installed sentence-transformers")

MODEL_NAME = "all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

✅ sentence-transformers available



In [9]:
# Text lists aligned with cv/job tables
cv_texts = cv["clean_text"].astype(str).tolist()
job_texts = job["clean_text"].astype(str).tolist()

cv_ids = cv["cv_id"].tolist()
job_ids = job["job_id"].tolist()

cv_emb = model.encode(cv_texts, batch_size=32, show_progress_bar=True)
job_emb = model.encode(job_texts, batch_size=32, show_progress_bar=True)

print("cv_emb:", cv_emb.shape, "job_emb:", job_emb.shape)


cv_emb: (300, 384) job_emb: (90, 384)


### Scoring candidates by cosine similarity
We reuse the same `candidates` list (pairs not linked in `G_train`).
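
Calling a pairwise similarity function once per candidate is correct but slow for tens of thousands of pairs. The sketch below shows a vectorized alternative with NumPy: normalize each embedding matrix once, then take row-wise dot products for the selected index pairs. The helper name `cosine_for_pairs` and the toy arrays are hypothetical, and rows are assumed to have non-zero norm.

```python
import numpy as np

def cosine_for_pairs(left_emb, right_emb, left_idx, right_idx):
    """Cosine similarity for selected (row of left_emb, row of right_emb) pairs."""
    # Normalize every row to unit length (assumes no all-zero rows).
    left_unit = left_emb / np.linalg.norm(left_emb, axis=1, keepdims=True)
    right_unit = right_emb / np.linalg.norm(right_emb, axis=1, keepdims=True)
    # Row-wise dot product of the selected unit vectors.
    return np.einsum("ij,ij->i", left_unit[left_idx], right_unit[right_idx])

# Hypothetical tiny embeddings standing in for cv_emb and job_emb.
cv_toy = np.array([[1.0, 0.0], [1.0, 1.0]])
job_toy = np.array([[2.0, 0.0], [0.0, 3.0]])

# Index pairs (cv row, job row): (0, 0), (0, 1), (1, 1).
scores = cosine_for_pairs(cv_toy, job_toy, np.array([0, 0, 1]), np.array([0, 1, 1]))
print(scores)
```

In this notebook, the index arrays could be built from `cv_index` and `job_index`, for example `[cv_index[c] for c, _ in candidates]` and `[job_index[j] for _, j in candidates]`.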

In [10]:
# Build index for fast lookup
cv_index = {cid: i for i, cid in enumerate(cv_ids)}
job_index = {jid: i for i, jid in enumerate(job_ids)}

sem_scores = []
for c, j in candidates:
    i = cv_index[c]
    k = job_index[j]
    s = cosine_similarity(cv_emb[i].reshape(1, -1), job_emb[k].reshape(1, -1))[0, 0]
    sem_scores.append((c, j, float(s)))

df_sem = pd.DataFrame(sem_scores, columns=["cv_id", "job_id", "score_cosine"])
df_sem.sort_values("score_cosine", ascending=False).head(10)

| | cv_id | job_id | score_cosine |
|---|---|---|---|
| 17617 | CV_0229 | JOB_0004 | 0.851541 |
| 19359 | CV_0229 | JOB_0004 | 0.851541 |
| 10726 | CV_0197 | JOB_0076 | 0.843132 |
| 7817 | CV_0198 | JOB_0004 | 0.841745 |
| 1436 | CV_0229 | JOB_0076 | 0.841396 |
| 15993 | CV_0049 | JOB_0088 | 0.840429 |
| 11819 | CV_0229 | JOB_0088 | 0.839434 |
| 6813 | CV_0246 | JOB_0004 | 0.836336 |
| 9576 | CV_0246 | JOB_0004 | 0.836336 |
| 16423 | CV_0294 | JOB_0004 | 0.836133 |


## 6) Evaluation: Precision@K (semantic)

In [11]:
df_sem_sorted = df_sem.sort_values("score_cosine", ascending=False)
top_pairs_sem = list(zip(df_sem_sorted["cv_id"], df_sem_sorted["job_id"]))

p_at_200_sem = precision_at_k(top_pairs_sem, test_set, k=200)
p_at_500_sem = precision_at_k(top_pairs_sem, test_set, k=500)

print("Precision@200 (semantic):", p_at_200_sem)
print("Precision@500 (semantic):", p_at_500_sem)

Precision@200 (semantic): 0.05
Precision@500 (semantic): 0.076


## 7) Saving results and metrics

In [12]:
os.makedirs("results", exist_ok=True)

df_struct_sorted.to_csv("results/linkpred_structural.csv", index=False)
df_sem_sorted.to_csv("results/linkpred_semantic.csv", index=False)

metrics = {
    "precision_at_200_struct_cn": p_at_200_struct,
    "precision_at_500_struct_cn": p_at_500_struct,
    "precision_at_200_semantic": p_at_200_sem,
    "precision_at_500_semantic": p_at_500_sem,
    "n_candidates_scored": int(N_CANDIDATES),
    "test_edges": int(len(test_edges))
}

with open("results/linkpred_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

print("✅ Saved:")
print("- results/linkpred_structural.csv")
print("- results/linkpred_semantic.csv")
print("- results/linkpred_metrics.json")
metrics

✅ Saved:
- results/linkpred_structural.csv
- results/linkpred_semantic.csv
- results/linkpred_metrics.json


{'precision_at_200_struct_cn': 0.015,
 'precision_at_500_struct_cn': 0.014,
 'precision_at_200_semantic': 0.05,
 'precision_at_500_semantic': 0.076,
 'n_candidates_scored': 20000,
 'test_edges': 240}

# Conclusion

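We compared:
- a **structural** approach (2-hop common neighbors)
- a **semantic** approach (embeddings + cosine similarity)

Both were evaluated with **Precision@K** on a train/test edge split. On the 20,000 sampled candidates, the semantic score reached Precision@200 = 0.05 and Precision@500 = 0.076, against 0.015 and 0.014 for the structural score, which assigned 0 to every candidate.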
