# Multi-Stage Job Advertisement Analysis: Training the BERT Zone Identification Model

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mansamoussa/llm-skill-extractor/blob/main/notebooks/02_train_bert.ipynb)

---

### Objective
Train a **multilingual BERT token classification model** that predicts zone labels for each token in a job advertisement, using the preprocessed datasets generated in *01_data_preparation.ipynb*.

This notebook will:
1. Load:
   - The preprocessed `train_dataset` and `test_dataset`
   - The `id2label.json` and `label2id.json` mappings  
2. Initialize a `bert-base-multilingual-cased` model for token classification  
3. Configure and run the full training loop:
   - Optimizer (AdamW)
   - Learning rate scheduler  
   - Weighted loss function to handle class imbalance  
   - Periodic validation  
4. Save artifacts:
   - The best-performing model checkpoint (`best_model.pt`)
   - TensorBoard logs for visualization  
5. Evaluate model performance using **seqeval** metrics:
   - Precision  
   - Recall  
   - F1-score  
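
The weighted loss in step 3 needs one weight per zone label. A minimal sketch of one common scheme, inverse class frequency, is shown here. It assumes that labels are integer IDs and that ignored positions (padding, special tokens) carry the label `-100`; the function name `inverse_frequency_weights` is hypothetical, and the notebook may well compute its weights differently.

```python
from collections import Counter
from typing import Iterable, List

IGNORE_INDEX = -100  # assumption: label id used for padding / special tokens


def inverse_frequency_weights(
    label_sequences: Iterable[Iterable[int]], num_labels: int
) -> List[float]:
    """Return one weight per label id, proportional to 1 / class frequency.

    weight_c = total_tokens / (num_labels * count_c), so a perfectly
    balanced dataset yields a weight of 1.0 for every class.
    """
    counts = Counter(
        label
        for sequence in label_sequences
        for label in sequence
        if label != IGNORE_INDEX
    )
    total = sum(counts.values())

    weights = []
    for label_id in range(num_labels):
        count = counts.get(label_id, 0)
        # A class that never occurs gets weight 0.0 to avoid dividing by zero.
        weights.append(total / (num_labels * count) if count else 0.0)
    return weights


# Toy example: label 0 dominates, label 2 is rare.
weights = inverse_frequency_weights([[0, 0, 0, 1, -100], [0, 1, 2, -100]], num_labels=3)
```

In a PyTorch training loop, such a list would typically be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, together with `ignore_index=-100`.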

### Input Data
- `data/train_dataset.pt` — tokenized, labeled training chunks  
- `data/test_dataset.pt` — tokenized, labeled evaluation chunks  
- `model/id2label.json` — mapping from label IDs → label names  
- `model/label2id.json` — mapping from label names → label IDs  
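
Since JSON object keys are always strings, the IDs in `id2label.json` have to be converted back to integers after loading. The sketch below loads both mappings and checks that they are inverses of each other. It assumes each file holds a flat JSON object; the helper name `load_label_maps` is hypothetical and not taken from the notebook.

```python
import json
from pathlib import Path
from typing import Dict, Tuple


def load_label_maps(model_dir: str = "model") -> Tuple[Dict[int, str], Dict[str, int]]:
    """Load id2label / label2id and verify that they agree with each other."""
    base = Path(model_dir)

    with open(base / "id2label.json", encoding="utf-8") as f:
        # JSON keys are strings, so "0" must become the integer 0.
        id2label = {int(key): name for key, name in json.load(f).items()}

    with open(base / "label2id.json", encoding="utf-8") as f:
        label2id = {name: int(value) for name, value in json.load(f).items()}

    # Every id -> name entry must map back to the same id.
    mismatched = [
        name for label_id, name in id2label.items() if label2id.get(name) != label_id
    ]
    if mismatched or len(id2label) != len(label2id):
        raise ValueError(f"Inconsistent label mappings: {mismatched}")

    return id2label, label2id
```

The resulting dictionaries can be passed as `id2label` and `label2id` when the token classification model is initialized, so that predictions are reported with zone names rather than raw IDs.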

### Output
- **`model/best_model.pt`** — best model checkpoint based on validation loss  
- **TensorBoard logs** stored under `logs/train/`  
- **Evaluation results**, including the seqeval classification report
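
Selecting the best checkpoint by validation loss comes down to saving only when the loss improves on the lowest value seen so far. The following is a framework-free sketch of that rule; `track_best_checkpoint` and the `save_fn` callback are hypothetical names used for illustration, and the actual training loop may interleave this logic with training steps rather than iterate over a finished list of losses.

```python
from typing import Callable, Iterable, Optional, Tuple


def track_best_checkpoint(
    val_losses: Iterable[float],
    save_fn: Callable[[int, float], None],
) -> Tuple[Optional[int], float]:
    """Call save_fn(epoch, loss) whenever the validation loss reaches a new minimum.

    Returns the (1-based) epoch with the lowest validation loss and that loss.
    """
    best_epoch: Optional[int] = None
    best_loss = float("inf")

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            save_fn(epoch, loss)  # overwrite the previous best checkpoint

    return best_epoch, best_loss


# Toy run: record which epochs would have triggered a save.
saved_epochs = []
best_epoch, best_loss = track_best_checkpoint(
    [0.9, 0.7, 0.8, 0.65],
    save_fn=lambda epoch, loss: saved_epochs.append(epoch),
)
```

With PyTorch, `save_fn` would typically wrap a call such as `torch.save(model.state_dict(), "model/best_model.pt")`, so the file always holds the weights from the epoch with the lowest validation loss.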