# Multi-Stage Job Advertisement Analysis: Training a BERT Zone Identification Model

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mansamoussa/llm-skill-extractor/blob/main/notebooks/02_train_bert.ipynb)

---

### Objective
Train a **multilingual BERT token classification model** that predicts zone labels for each token in a job advertisement, using the preprocessed datasets generated in *01_data_preparation.ipynb*.

This notebook will:
1. Load:
   - The preprocessed `train_dataset` and `test_dataset`
   - The `id2label.json` and `label2id.json` mappings  
2. Initialize a `bert-base-multilingual-cased` model for token classification  
3. Configure and run the full training loop:
   - Optimizer (AdamW)
   - Learning rate scheduler  
   - Weighted loss function to handle class imbalance  
   - Periodic validation  
4. Save artifacts:
   - The best-performing model checkpoint (`best_model.pt`)
   - TensorBoard logs for visualization  
5. Evaluate model performance using **seqeval** metrics:
   - Precision  
   - Recall  
   - F1-score  

### Input Data
- `processed_data/train_dataset.pt`: tokenized, labeled training chunks  
- `processed_data/test_dataset.pt`: tokenized, labeled evaluation chunks  
- `model/id2label.json`: mapping from label IDs → label names  
- `model/label2id.json`: mapping from label names → label IDs  

### Output
- **`model/best_model.pt`** — best model checkpoint based on validation loss  
- **TensorBoard logs** stored under `logs/train/`  
- **Evaluation results** including seqeval classification report

In [1]:
!pip install -q transformers seqeval tensorboard


In [5]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

from transformers import (
    BertForTokenClassification,
    BertTokenizerFast,
    get_linear_schedule_with_warmup
)

import json
from pathlib import Path
from sklearn.utils.class_weight import compute_class_weight
from seqeval.metrics import classification_report, f1_score
import numpy as np

from torch.utils.tensorboard import SummaryWriter
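
The imports above include `get_linear_schedule_with_warmup`, which scales the learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The pure-Python sketch below computes that factor for illustration. The function name `linear_warmup_factor`, the base learning rate, and the step counts are hypothetical; the warmup and training step counts this notebook actually uses are not shown in this section.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5  # assumption: illustrative value only
factors = [linear_warmup_factor(s, 10, 100) for s in (0, 5, 10, 55, 100)]
for s, f in zip((0, 5, 10, 55, 100), factors):
    print(f"step {s:3d}: lr = {base_lr * f:.2e}")
```

With 10 warmup steps out of 100, the factor is 0.0 at step 0, peaks at 1.0 at step 10, is back to 0.5 at step 55, and reaches 0.0 at step 100.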

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os

PROJECT_ROOT = "/content/drive/MyDrive/GroupWork_GEN03"

# Define paths
train_dataset_path = f"{PROJECT_ROOT}/processed_data/train_dataset.pt"
test_dataset_path  = f"{PROJECT_ROOT}/processed_data/test_dataset.pt"
id2label_path      = f"{PROJECT_ROOT}/model/id2label.json"
label2id_path      = f"{PROJECT_ROOT}/model/label2id.json"
model_save_path    = f"{PROJECT_ROOT}/model/best_model.pt"

paths = {
    "train_dataset.pt": train_dataset_path,
    "test_dataset.pt": test_dataset_path,
    "id2label.json": id2label_path,
    "label2id.json": label2id_path,
}

# Validate all paths
missing = [name for name, p in paths.items() if not os.path.exists(p)]

if missing:
    raise FileNotFoundError(
        "❌ Missing required input files:\n" +
        "\n".join(f" - {name}: {paths[name]}" for name in missing) +
        "\n\nPlease check where notebook 01 exported its outputs."
    )
else:
    print("✅ All required files found.")

✅ All required files found.
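
The check above uses `os.path.exists`, which also returns `True` for directories. Since `pathlib.Path` is already imported, a stricter variant could test for regular files instead. The sketch below is a self-contained illustration using a temporary directory; the helper name `find_missing` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_missing(paths):
    """Return the names whose path does not point to an existing regular file."""
    return [name for name, p in paths.items() if not Path(p).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create only one of the two expected files.
    (root / "id2label.json").write_text("{}", encoding="utf-8")
    demo_paths = {
        "id2label.json": root / "id2label.json",
        "train_dataset.pt": root / "train_dataset.pt",
    }
    missing_demo = find_missing(demo_paths)

print(missing_demo)  # ['train_dataset.pt']
```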


In [7]:
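# PyTorch 2.6 changed the default of torch.load to weights_only=True, which
# refuses to unpickle arbitrary objects such as TensorDataset.
from torch.utils.data import TensorDataset

# Allowlist TensorDataset for weights_only=True loading. This has no effect on
# the weights_only=False calls below; it only matters if they are switched back.
torch.serialization.add_safe_globals([TensorDataset])

# Load the datasets as full pickled objects. weights_only=False can execute
# arbitrary code during unpickling, so use it only on files you created yourself.
train_dataset = torch.load(train_dataset_path, weights_only=False)
test_dataset  = torch.load(test_dataset_path,  weights_only=False)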

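# Load id2label mapping. JSON object keys are always strings, so this first
# attempt yields {"0": "O", ...}; the next cell converts the keys to int.
with open(id2label_path, "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Built from string keys, so the IDs here are strings too (fixed in the next cell)
label2id = {v: k for k, v in id2label.items()}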
num_labels = len(label2id)

id2label, label2id, num_labels


({'0': 'O',
  '1': 'Fähigkeiten und Inhalte',
  '2': 'Abschlüsse',
  '3': 'Anstellung',
  '4': 'Erfahrung',
  '5': 'Challenges',
  '6': 'Bewerbungsprozess',
  '7': 'Firmenbeschreibung',
  '8': 'Benefits',
  '9': 'Arbeitsumfeld',
  '10': 'Firmenkundenbeschreibung'},
 {'O': '0',
  'Fähigkeiten und Inhalte': '1',
  'Abschlüsse': '2',
  'Anstellung': '3',
  'Erfahrung': '4',
  'Challenges': '5',
  'Bewerbungsprozess': '6',
  'Firmenbeschreibung': '7',
  'Benefits': '8',
  'Arbeitsumfeld': '9',
  'Firmenkundenbeschreibung': '10'},
 11)

In [8]:
# Load id2label mapping (keys are strings → convert to int)
with open(id2label_path, "r", encoding="utf-8") as f:
    id2label_raw = json.load(f)

# Convert: {"0": "O"} → {0: "O"}
id2label = {int(k): v for k, v in id2label_raw.items()}

# Create label2id: {"O": 0, ...}
label2id = {v: k for k, v in id2label.items()}

num_labels = len(label2id)

id2label, label2id, num_labels


({0: 'O',
  1: 'Fähigkeiten und Inhalte',
  2: 'Abschlüsse',
  3: 'Anstellung',
  4: 'Erfahrung',
  5: 'Challenges',
  6: 'Bewerbungsprozess',
  7: 'Firmenbeschreibung',
  8: 'Benefits',
  9: 'Arbeitsumfeld',
  10: 'Firmenkundenbeschreibung'},
 {'O': 0,
  'Fähigkeiten und Inhalte': 1,
  'Abschlüsse': 2,
  'Anstellung': 3,
  'Erfahrung': 4,
  'Challenges': 5,
  'Bewerbungsprozess': 6,
  'Firmenbeschreibung': 7,
  'Benefits': 8,
  'Arbeitsumfeld': 9,
  'Firmenkundenbeschreibung': 10},
 11)
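
The two cells above differ only in the type of the label IDs: JSON stores object keys as strings, so a mapping saved with integer keys comes back with string keys. The sketch below reproduces that round trip on a toy mapping and adds a sanity check that the inverted mapping loses no entries. The names `id2label_demo`, `restored`, `id2label_fixed`, and `label2id_fixed` are illustrative and not part of the notebook.

```python
import json

# Toy mapping with integer keys, as a model config would expect.
id2label_demo = {0: "O", 1: "Benefits"}

# Saving to JSON and loading again turns the integer keys into strings.
restored = json.loads(json.dumps(id2label_demo))
print(restored)  # {'0': 'O', '1': 'Benefits'}

# Convert the keys back to int, then invert to get label names → IDs.
id2label_fixed = {int(k): v for k, v in restored.items()}
label2id_fixed = {v: k for k, v in id2label_fixed.items()}

# If two IDs shared a label name, the inverted dict would be smaller.
if len(label2id_fixed) != len(id2label_fixed):
    raise ValueError("Duplicate label names: label2id would drop entries.")
```

Integer IDs matter downstream because predicted class indices are integers, so a lookup such as `id2label_fixed[1]` works while `restored[1]` would raise a `KeyError`.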