# 06 — Node Classification (CV / Job)

## Objective
Build features from the graph (plus communities and embeddings), then train a classification model.

### CV classification (main task)
- predict `level`: {junior, intermediate, senior}
- predict `profile_type`: {specialized, polyvalent}
- predict `domain`

### Candidate features
1) Structure:
- node degree (in the bipartite graph)
- degree centrality on the CV–CV projection
2) Community:
- `community_id` (Louvain on the CV–CV projection)
3) Semantics:
- embeddings of the `clean_text` column (if available)
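The degree centrality on the CV–CV projection is listed as a candidate feature but is not computed in the cells of this notebook. A minimal sketch of how it could be derived with networkx, assuming a toy bipartite graph with hypothetical node IDs (`CV_0`, `JOB_0`, ...) rather than the real `graph_bipartite.graphml`:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph (hypothetical IDs): CVs on one side, jobs on the other.
B = nx.Graph()
cv_nodes = ["CV_0", "CV_1", "CV_2"]
job_nodes = ["JOB_0", "JOB_1"]
B.add_nodes_from(cv_nodes, bipartite=0)
B.add_nodes_from(job_nodes, bipartite=1)
B.add_edges_from([
    ("CV_0", "JOB_0"),
    ("CV_1", "JOB_0"),
    ("CV_1", "JOB_1"),
    ("CV_2", "JOB_1"),
])

# Two CVs are linked in the projection when they share at least one job;
# the edge weight counts the shared jobs.
P = bipartite.weighted_projected_graph(B, cv_nodes)

# Degree centrality = degree / (n - 1), computed on the projection.
centrality = nx.degree_centrality(P)
print(centrality)
```

In the notebook, the resulting dictionary could be mapped onto the CV table in the same way as the bipartite degree, for example into a column named `degree_centrality_cvcv` (a hypothetical name).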

## Inputs
- `results/graph_bipartite.graphml`
- `data/processed/cv_clean.csv`
- `data/processed/job_clean.csv`
- `data/processed/edges.csv`
- `results/cv_with_communities.csv`
- embedding results from Notebook 05

## Outputs
- `results/cv_classification_report.txt`
- `results/cv_confusion_matrix.csv`
- metrics as JSON

In [1]:
import os
import json
import numpy as np
import pandas as pd

import networkx as nx
from networkx.algorithms import bipartite

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [2]:
G = nx.read_graphml("results/graph_bipartite.graphml")
cv = pd.read_csv("data/processed/cv_clean.csv")
job = pd.read_csv("data/processed/job_clean.csv")
edges = pd.read_csv("data/processed/edges.csv")

print("Graph:", G.number_of_nodes(), "nodes |", G.number_of_edges(), "edges")
print("cv:", cv.shape, "| job:", job.shape, "| edges:", edges.shape)

Graph: 390 nodes | 1200 edges
cv: (300, 6) | job: (90, 5) | edges: (1200, 2)


## 2) Structural features (bipartite degree)

We add each CV node's degree in the bipartite graph.

In [3]:
degree_dict = dict(G.degree())
cv["degree_bipartite"] = cv["cv_id"].map(degree_dict).fillna(0).astype(int)

cv[["cv_id", "degree_bipartite"]].head(3)

,cv_id,degree_bipartite
0,CV_0000,4
1,CV_0001,4
2,CV_0002,4
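Because `map(...).fillna(0)` silently assigns a degree of 0 to any `cv_id` that is not a node of the graph, and because the preview shows the same degree for every row, a quick sanity check can be worthwhile before training. The sketch below uses small hypothetical stand-ins (`cv_demo`, `degree_demo`) for `cv` and `degree_dict`:

```python
import pandas as pd

# Hypothetical stand-ins for `cv` and `degree_dict` from the cell above.
cv_demo = pd.DataFrame({"cv_id": ["CV_0000", "CV_0001", "CV_0002", "CV_0003"]})
degree_demo = {"CV_0000": 4, "CV_0001": 4, "CV_0002": 4, "JOB_0000": 12}

mapped = cv_demo["cv_id"].map(degree_demo)

# IDs absent from the graph would silently become 0 after fillna(0).
missing_ids = cv_demo.loc[mapped.isna(), "cv_id"].tolist()

# A feature with a single distinct value cannot help a classifier.
degree = mapped.fillna(0).astype(int)
is_constant = degree.nunique() == 1

print("IDs missing from the graph:", missing_ids)
print("Degree is constant:", is_constant)
```

If the real `degree_bipartite` column turns out to be constant, it carries no signal for the classifier, which is worth keeping in mind when reading the ablation results.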


## 3) Add the community (if available)

We load `results/cv_with_communities.csv` if Notebook 04 has been run.
Otherwise we set `community_id = -1`.

In [7]:
# Path to the file generated in Notebook 04
comm_path = "results/cv_with_communities.csv"

# Drop any previous community_id so that re-running this cell does not
# create community_id_x / community_id_y columns during the merge
cv = cv.drop(columns=["community_id"], errors="ignore")

if os.path.exists(comm_path):
    cv_comm = pd.read_csv(comm_path)

    # Check that the expected columns are present
    if "cv_id" in cv_comm.columns and "community_id" in cv_comm.columns:
        cv = cv.merge(
            cv_comm[["cv_id", "community_id"]],
            on="cv_id",
            how="left"
        )

        # Replace any NaN left by the merge (CVs without a community)
        cv["community_id"] = cv["community_id"].fillna(-1).astype(int)
        print("✅ Communities successfully merged.")
    else:
        print("⚠️ 'cv_id' or 'community_id' column not found in file. Setting default -1.")
        cv["community_id"] = -1
else:
    print("⚠️ Community file not found. Setting community_id = -1.")
    cv["community_id"] = -1

# Preview
cv[["cv_id", "degree_bipartite", "community_id"]].head(5)

⚠️ Community file not found. Setting community_id = -1.


,cv_id,degree_bipartite,community_id
0,CV_0000,4,-1
1,CV_0001,4,-1
2,CV_0002,4,-1
3,CV_0003,4,-1
4,CV_0004,4,-1
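When the community file is missing, as in the run above, every CV receives `community_id = -1` and the feature is uninformative. One possible fallback is to recompute communities directly with Louvain on the CV–CV projection. The sketch below assumes a networkx version that ships `louvain_communities` and uses a toy graph with hypothetical IDs; it is not necessarily the procedure followed in Notebook 04.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite
from networkx.algorithms.community import louvain_communities

# Toy bipartite graph (hypothetical IDs): two groups of CVs that share no job.
B = nx.Graph()
B.add_edges_from([
    ("CV_0", "JOB_0"), ("CV_1", "JOB_0"),
    ("CV_2", "JOB_1"), ("CV_3", "JOB_1"),
])
cv_nodes = [n for n in B if n.startswith("CV_")]

# Project onto CVs, then run Louvain on the weighted projection.
P = bipartite.weighted_projected_graph(B, cv_nodes)
communities = louvain_communities(P, weight="weight", seed=42)

# Flatten the list of node sets into a cv_id -> community_id table.
cv_comm = pd.DataFrame(
    [(node, cid) for cid, nodes in enumerate(communities) for node in nodes],
    columns=["cv_id", "community_id"],
)
print(cv_comm.sort_values("cv_id"))
```

A table of this shape has the two columns that the merge cell expects (`cv_id`, `community_id`), so it could be written to `results/cv_with_communities.csv` and picked up on the next run.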


In [8]:
USE_EMBEDDINGS = True  # set to False for a faster run

if USE_EMBEDDINGS:
    try:
        from sentence_transformers import SentenceTransformer
    except ModuleNotFoundError:
        !pip -q install sentence-transformers
        from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    cv_texts = cv["clean_text"].astype(str).tolist()
    cv_emb = model.encode(cv_texts, batch_size=32, show_progress_bar=True)

    # Simple dimensionality reduction (optional): keep 32 dims
    from sklearn.decomposition import PCA
    pca_dim = 32
    pca = PCA(n_components=min(pca_dim, cv_emb.shape[1]), random_state=42)
    cv_emb_red = pca.fit_transform(cv_emb)

    for k in range(cv_emb_red.shape[1]):
        cv[f"emb_{k}"] = cv_emb_red[:, k]

    print("✅ Embeddings added:", cv_emb_red.shape)
else:
    print("Embeddings disabled.")

✅ Embeddings added: (300, 32)
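In the cell above, PCA is fitted on all 300 CVs before the train/test split, so the test rows influence the projection. The effect is probably small for an unsupervised step, but a stricter variant fits PCA on the training rows only. The sketch below illustrates this with random vectors standing in for the sentence embeddings (hypothetical data; only the shapes and class counts mirror this notebook):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Random stand-in for sentence embeddings: 300 texts x 384 dimensions.
rng = np.random.default_rng(42)
emb = rng.normal(size=(300, 384))
y = np.array(["junior"] * 144 + ["intermediate"] * 116 + ["senior"] * 40)

emb_train, emb_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the projection on the training rows only, then apply it to both splits.
pca = PCA(n_components=32, random_state=42)
emb_train_red = pca.fit_transform(emb_train)
emb_test_red = pca.transform(emb_test)

print(emb_train_red.shape, emb_test_red.shape)
```

An alternative with the same effect is to place the `PCA` step inside the scikit-learn `Pipeline`, so that it is refitted on whatever data the pipeline is trained on.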


In [9]:
target_col = "level"  # change to "profile_type" or "domain" if needed

if target_col not in cv.columns:
    raise ValueError(f"Target column '{target_col}' not found in cv table. Available columns: {list(cv.columns)}")

print("Target distribution:")
display(cv[target_col].value_counts())

Target distribution:


level,count
junior,144
intermediate,116
senior,40
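The distribution is imbalanced: `senior` accounts for 40 of the 300 CVs. One common way to compensate is to weight classes inversely to their frequency. As a sketch, the "balanced" weights that scikit-learn would derive from the counts above can be computed explicitly (the variable names are illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the distribution printed above.
counts = {"junior": 144, "intermediate": 116, "senior": 40}
y = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.unique(y)
# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)
```

`RandomForestClassifier` accepts `class_weight="balanced"` directly, which applies the same formula on the training labels. Whether it helps on this dataset would need to be checked empirically.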


In [10]:
# Features
feature_cols = ["degree_bipartite", "community_id"] + [c for c in cv.columns if c.startswith("emb_")]

X = cv[feature_cols].copy()
y = cv[target_col].astype(str)

# Identify types
numeric_features = [c for c in feature_cols if c.startswith("emb_") or c == "degree_bipartite"]
categorical_features = ["community_id"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    random_state=42
)

clf = Pipeline(steps=[("preprocess", preprocess), ("model", model)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

report = classification_report(y_test, pred)
print(report)

              precision    recall  f1-score   support

intermediate       0.96      1.00      0.98        23
      junior       1.00      0.97      0.98        29
      senior       1.00      1.00      1.00         8

    accuracy                           0.98        60
   macro avg       0.99      0.99      0.99        60
weighted avg       0.98      0.98      0.98        60
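These scores come from a single split with 60 test rows, 8 of which are `senior`, so they may vary noticeably with the random seed. Stratified k-fold cross-validation gives a more stable estimate. The sketch below runs on synthetic data from `make_classification` (a stand-in, not the CV table) and names the splitter `cv_splitter` to avoid shadowing the `cv` DataFrame:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X, y): 300 rows, 34 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=34, n_informative=10,
    n_classes=3, random_state=42,
)

model = RandomForestClassifier(n_estimators=100, max_depth=12, random_state=42)
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One weighted-F1 score per fold; the spread shows how stable the estimate is.
scores = cross_val_score(model, X_demo, y_demo, cv=cv_splitter, scoring="f1_weighted")
print(f"f1_weighted: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In the notebook, the full `clf` pipeline together with `X` and `y` could be passed to `cross_val_score` in place of the synthetic objects.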



## 7) Confusion matrix + saving

In [11]:
labels = sorted(y.unique())
cm = confusion_matrix(y_test, pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)

display(cm_df)

os.makedirs("results", exist_ok=True)

with open("results/cv_classification_report.txt", "w") as f:
    f.write(report)

cm_df.to_csv("results/cv_confusion_matrix.csv", index=True)

print("✅ Saved:")
print("- results/cv_classification_report.txt")
print("- results/cv_confusion_matrix.csv")

true / predicted,intermediate,junior,senior
intermediate,23,0,0
junior,1,28,0
senior,0,0,8


✅ Saved:
- results/cv_classification_report.txt
- results/cv_confusion_matrix.csv
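The planned outputs also mention metrics as JSON, which the cell above does not write. A minimal sketch of one way to do it, using `classification_report(..., output_dict=True)`; the labels are small hypothetical stand-ins for `y_test` and `pred`, and the file name `cv_metrics.json` is an assumption:

```python
import json
import os
import tempfile

from sklearn.metrics import classification_report

# Small hypothetical labels standing in for y_test and pred.
y_true_demo = ["junior", "junior", "intermediate", "senior", "intermediate"]
y_pred_demo = ["junior", "intermediate", "intermediate", "senior", "intermediate"]

# output_dict=True returns nested dicts of numbers that json can serialize.
metrics = classification_report(
    y_true_demo, y_pred_demo, output_dict=True, zero_division=0
)

# A temporary directory keeps the sketch free of side effects; in the
# notebook the target would be the results/ folder.
out_dir = tempfile.mkdtemp()
metrics_path = os.path.join(out_dir, "cv_metrics.json")  # hypothetical file name
with open(metrics_path, "w") as f:
    json.dump(metrics, f, indent=2)

with open(metrics_path) as f:
    print(json.load(f)["accuracy"])
```

The resulting file contains per-class precision, recall, F1 and support, plus the overall accuracy and the macro and weighted averages.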


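## 8) Ablation study (quick)

We compare 3 configurations:
1) degree only
2) degree + community
3) degree + community + embeddings (if available)

This isolates the contribution of each feature family: graph structure, communities, and semantics. In the run recorded here the community file was not found, so `community_id` is -1 for every CV and configurations 1 and 2 carry the same information.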

In [12]:
def run_experiment(use_comm=True, use_emb=True):
    cols = ["degree_bipartite"]
    if use_comm:
        cols += ["community_id"]
    if use_emb:
        cols += [c for c in cv.columns if c.startswith("emb_")]

    X = cv[cols].copy()
    y = cv[target_col].astype(str)

    numeric_features = [c for c in cols if c.startswith("emb_") or c == "degree_bipartite"]
    categorical_features = ["community_id"] if "community_id" in cols else []

    transformers = []
    if numeric_features:
        transformers.append(("num", "passthrough", numeric_features))
    if categorical_features:
        transformers.append(("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features))

    preprocess = ColumnTransformer(transformers=transformers)

    model = RandomForestClassifier(n_estimators=300, max_depth=12, random_state=42)

    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", model)])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)

    rep = classification_report(y_test, pred, output_dict=True)
    return rep["accuracy"], rep["weighted avg"]["f1-score"]

experiments = []
experiments.append(("degree_only",) + run_experiment(use_comm=False, use_emb=False))
experiments.append(("degree+community",) + run_experiment(use_comm=True, use_emb=False))

if any(c.startswith("emb_") for c in cv.columns):
    experiments.append(("degree+community+emb",) + run_experiment(use_comm=True, use_emb=True))
else:
    experiments.append(("degree+community+emb", None, None))

exp_df = pd.DataFrame(experiments, columns=["setting", "accuracy", "f1_weighted"])
display(exp_df)

exp_df.to_csv("results/cv_ablation_results.csv", index=False)
print("✅ Saved: results/cv_ablation_results.csv")

UndefinedMetricWarning (scikit-learn, emitted 6 times): precision is ill-defined for labels with no predicted samples and is set to 0.0.


,setting,accuracy,f1_weighted
0,degree_only,0.483333,0.314981
1,degree+community,0.483333,0.314981
2,degree+community+emb,0.983333,0.983364


✅ Saved: results/cv_ablation_results.csv
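The two graph-only rows share the same scores, and those scores match what a model predicting `junior` for every test CV would obtain. This can be checked from the test-set supports in the classification report (29 junior, 23 intermediate, 8 senior). The snippet below is a verification sketch, not part of the original pipeline:

```python
from sklearn.metrics import accuracy_score, f1_score

# Test-set supports from the classification report: 29 / 23 / 8.
y_true = ["junior"] * 29 + ["intermediate"] * 23 + ["senior"] * 8

# A model with no usable signal falls back to the majority class.
y_pred = ["junior"] * len(y_true)

acc = float(accuracy_score(y_true, y_pred))
f1w = float(f1_score(y_true, y_pred, average="weighted", zero_division=0))
print(round(acc, 6), round(f1w, 6))  # 0.483333 0.314981
```

Accuracy is 29/60 and the weighted F1 is (29/60) × (58/89), the same values as in the `degree_only` and `degree+community` rows. In this run the graph-only configurations therefore behave like a majority-class baseline, and the whole gain comes from the embeddings.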


# Conclusion

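We have:
- built structural features (degree)
- integrated communities (`community_id`)
- (optional) integrated embeddings (semantics)
- trained a baseline model (RandomForest)
- evaluated it with a classification report and a confusion matrix
- measured each feature family's impact via ablation: in this run only the embeddings improved on the graph-only configurations (accuracy 0.48 vs 0.98), and `community_id` was fixed at -1 because the community file was missing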

