# Résumé Atlas Classification Benchmark

This Colab notebook trains or fine‑tunes five text‑classification models on the same stratified 70/10/20 train/validation/test splits of the [Résumé Atlas](https://huggingface.co/datasets/ahmedheakl/resume-atlas) dataset that were used in your previous experiment, and reports the same metrics for each model so the results are directly comparable.

**Models**
- TF‑IDF + SVM
- FastText (pretrained `cc.en.300` vectors) + Logistic Regression
- CareerBERT‑base
- CareerBERT‑large
- RoBERTa‑DA

At the end, you’ll get a concise table with Top‑k accuracy (k = 1, 3, 5, 10), F1‑macro, Precision‑macro, and Recall‑macro for head‑to‑head comparison.

## 🔧 Install libraries

In [9]:
# 🔧  Install exactly what is needed
!pip install -U pip setuptools wheel

!pip install -U "numpy==1.26.4" "fasttext==0.9.2" "sentencepiece>=0.1.99"



In [1]:
!pip install peft==0.10.0


In [2]:
!pip -q install -U "transformers>=4.41" "datasets>=2.19" "evaluate>=0.4" \
                  "sentencepiece" "accelerate>=0.31" \
                  "nltk>=3.9"


In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

## 📚 Imports & helpers

In [5]:
import numpy as np, pandas as pd, torch, re, string
#import fasttext
from datasets import load_dataset, DatasetDict, concatenate_datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from tqdm.auto import tqdm
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
import random, os
from pathlib import Path
tqdm.pandas()

def set_all_seeds(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
set_all_seeds()

## 📑 Load & preprocess Résumé Atlas

In [6]:
RAW = load_dataset('ahmedheakl/resume-atlas')
full = RAW['train'] if 'train' in RAW else concatenate_datasets(list(RAW.values()))

TEXT_COLS  = ['text','resume_text','ocr_text','content']
text_col   = next(c for c in full.column_names if c.lower() in TEXT_COLS)
label_col  = 'Category'

# 70/10/20 stratified split identical to original notebook
y = np.array(full[label_col]); idx = np.arange(len(full))
tr, tmp, y_tr, y_tmp = train_test_split(idx, y, test_size=0.3,
                                        stratify=y, random_state=42)
val, test, _, _ = train_test_split(tmp, y_tmp, test_size=2/3,
                                   stratify=y_tmp, random_state=42)
splits = DatasetDict(train=full.select(tr.tolist()),
                     validation=full.select(val.tolist()),
                     test=full.select(test.tolist()))

label_list = sorted(set(splits['train'][label_col]))
label2id = {l:i for i,l in enumerate(label_list)}
id2label = {i:l for l,i in label2id.items()}
num_labels = len(label_list)

def add_numeric_label(example):
    example['label'] = label2id[example[label_col]]
    return example
splits = splits.map(add_numeric_label, remove_columns=[])
print(splits)


DatasetDict({
    train: Dataset({
        features: ['Category', 'Text', 'label'],
        num_rows: 9372
    })
    validation: Dataset({
        features: ['Category', 'Text', 'label'],
        num_rows: 1339
    })
    test: Dataset({
        features: ['Category', 'Text', 'label'],
        num_rows: 2678
    })
})
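
The row counts printed above can be checked by hand. The sketch below is a plain-Python approximation of how the two chained `train_test_split` calls size their outputs, assuming scikit-learn rounds a fractional `test_size` up and gives the remainder to the training side; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out share up.

    Assumption: this mirrors scikit-learn's sizing rule for a float test_size.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 13389                              # rows reported for train.csv
n_train, n_tmp = split_sizes(n_total, 0.3)   # first call: 70% train, 30% held out
n_val, n_test = split_sizes(n_tmp, 2 / 3)    # second call: held-out part split 1/3 vs 2/3

print(n_train, n_val, n_test)
```

Under that assumption the three sizes agree with the `num_rows` values in the `DatasetDict` printout, which is roughly a 70/10/20 split.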


## 📐 Metric helpers (Top‑k, F1‑macro, …)

In [7]:
def topk(prob: np.ndarray, y: np.ndarray, ks=(1,3,5,10)):
    idx = np.argsort(-prob, 1)
    return {f'Top-{k}': float((y[:,None] == idx[:,:k]).any(1).mean()) for k in ks}

def compute_metrics(probs: np.ndarray, y_true: np.ndarray):
    preds = probs.argmax(1)
    top_metrics = topk(probs, y_true)
    acc = accuracy_score(y_true, preds)
    pma, rma, f1ma, _ = precision_recall_fscore_support(
        y_true, preds, average='macro', zero_division=0)
    return {**top_metrics,
            'Top-1': float(acc),
            'F1-macro': float(f1ma),
            'Precision-macro': float(pma),
            'Recall-macro': float(rma)}
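
To make the Top‑k definition concrete, here is a small dependency-free restatement of the same idea on toy data. It is only an illustration: `topk_accuracy`, `toy_probs`, and `toy_y` are hypothetical names, and the notebook's own `topk` helper remains the one used for the reported numbers.

```python
def topk_accuracy(probs, y_true, k):
    """Share of rows whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(probs, y_true):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += label in ranked[:k]
    return hits / len(y_true)

# Three samples, four classes; scores within a row are distinct to avoid ties.
toy_probs = [
    [0.10, 0.60, 0.25, 0.05],   # true class 1 is ranked first
    [0.50, 0.30, 0.15, 0.05],   # true class 1 is ranked second
    [0.20, 0.25, 0.50, 0.05],   # true class 3 is ranked last
]
toy_y = [1, 1, 3]

scores = {f"Top-{k}": topk_accuracy(toy_probs, toy_y, k) for k in (1, 2, 4)}
print(scores)
```

Top‑1 counts only the first sample as a hit, Top‑2 also counts the second, and the third is recovered only once k covers all four classes. Top‑k accuracy can therefore never decrease as k grows.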

## 📝 TF‑IDF + Linear SVM

In [6]:
# Unigram + bigram TF-IDF features, capped at the 50k most frequent terms
vec = TfidfVectorizer(stop_words='english', max_features=50_000, ngram_range=(1,2))
svm = LinearSVC()
clf = CalibratedClassifierCV(svm)  # LinearSVC has no predict_proba; the calibration wrapper adds it
X_train = vec.fit_transform(splits['train'][text_col])
y_train = splits['train']['label']
clf.fit(X_train, y_train)

X_test = vec.transform(splits['test'][text_col])
probs = clf.predict_proba(X_test)
metrics_svm = compute_metrics(probs, np.array(splits['test']['label']))
print(metrics_svm)



{'Top-1': 0.8293502613890963, 'Top-3': 0.9469753547423451, 'Top-5': 0.9719940253920837, 'Top-10': 0.988050784167289, 'F1-macro': 0.8194839742539888, 'Precision-macro': 0.8300514031524281, 'Recall-macro': 0.8157612484682367}


## 🏃‍♂️ FastText (cc.en.300) + Logistic Regression

In [7]:
# Download pretrained English fastText vectors (cc.en.300)
import fasttext
ft_path = 'cc.en.300.bin'
if not Path(ft_path).exists():
    !wget -q https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
    !gunzip cc.en.300.bin.gz
ft_model = fasttext.load_model(ft_path)

def embed_ft(texts):
    # get_sentence_vector raises on strings that contain '\n', so flatten newlines first
    return np.vstack([ft_model.get_sentence_vector(t.replace('\n', ' ')) for t in texts])

X_train_ft = embed_ft(splits['train'][text_col])
X_test_ft  = embed_ft(splits['test'][text_col])

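# The default lbfgs solver already fits a multinomial (softmax) model for multiclass targets,
# so the deprecated multi_class argument is not needed.
lr = LogisticRegression(max_iter=1000, n_jobs=-1)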
lr.fit(X_train_ft, y_train)
probs_ft = lr.predict_proba(X_test_ft)
metrics_ft = compute_metrics(probs_ft, np.array(splits['test']['label']))
print(metrics_ft)



{'Top-1': 0.5007468259895445, 'Top-3': 0.7221807318894697, 'Top-5': 0.8170276325616131, 'Top-10': 0.90739357729649, 'F1-macro': 0.41740864759210117, 'Precision-macro': 0.46947918399848326, 'Recall-macro': 0.45741714745998796}
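
For intuition about what the Logistic Regression receives as input: a fastText sentence vector is, roughly, the average of the length-normalised vectors of the words in the text. The sketch below illustrates that averaging with made-up 2‑dimensional vectors. It is a simplification under stated assumptions: `sentence_vector` and `toy_vectors` are hypothetical, unknown words are simply skipped here, whereas the real model builds vectors for them from character n-grams and also includes an end-of-sentence token.

```python
import math

def sentence_vector(tokens, word_vectors):
    """Average the L2-normalised vectors of the tokens found in word_vectors."""
    dim = len(next(iter(word_vectors.values())))
    total, count = [0.0] * dim, 0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:          # simplification: ignore out-of-vocabulary tokens
            continue
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:            # zero vectors carry no direction, skip them
            continue
        total = [t + x / norm for t, x in zip(total, vec)]
        count += 1
    return [t / count for t in total] if count else total

# Made-up vectors purely for illustration.
toy_vectors = {"data": [3.0, 4.0], "engineer": [0.0, 2.0]}
sent = sentence_vector("senior data engineer".split(), toy_vectors)
print(sent)
```

Because word order and most document structure are averaged away, a single 300‑dimensional vector per résumé is a fairly coarse representation, which is one plausible reason this baseline trails the TF‑IDF model.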


## 🤖 Helper to fine‑tune transformer models

In [9]:
def finetune_transformer(model_ckpt: str, output_dir: str, epochs: int = 2, batch: int = 8,
                         lr: float = 2e-5):
    tok = AutoTokenizer.from_pretrained(model_ckpt)
    def tokenize(batch):
        return tok(batch[text_col], truncation=True, padding='max_length', max_length=256)
    tok_splits = splits.map(tokenize, batched=True)
    tok_splits.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    model = AutoModelForSequenceClassification.from_pretrained(
        model_ckpt, num_labels=num_labels, id2label=id2label, label2id=label2id)

    args = TrainingArguments(
        output_dir=output_dir, eval_strategy='epoch',
        learning_rate=lr, per_device_train_batch_size=batch,
        per_device_eval_batch_size=batch, num_train_epochs=epochs,
        weight_decay=0.01, logging_steps=100, save_strategy='no')
    trainer = Trainer(model=model, args=args,
                      train_dataset=tok_splits['train'],
                      eval_dataset=tok_splits['validation'])
    trainer.train()

    preds = trainer.predict(tok_splits['test'])
    probs = torch.softmax(torch.tensor(preds.predictions), dim=-1).numpy()
    return compute_metrics(probs, preds.label_ids)
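
The helper converts the raw logits returned by `trainer.predict` into probabilities with a softmax before computing metrics. A minimal plain-Python sketch of that step is shown below; `softmax` and `toy_logits` are illustrative names, and the notebook itself uses `torch.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)                       # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 0.5, -1.0]
toy_softmax = softmax(toy_logits)
print(toy_softmax)
```

Softmax preserves the ranking of the logits, so the Top‑k metrics would be identical if they were computed on the logits directly. The conversion mainly keeps the transformer outputs on the same probability scale as the `predict_proba` outputs of the other models.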

### CareerBERT‑base

In [10]:
careerbert_base_ckpt = 'lwolfrum2/careerbert-g'  # replace with correct base checkpoint if different
metrics_cb_base = finetune_transformer(careerbert_base_ckpt, 'careerbert_base')
print(metrics_cb_base)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at lwolfrum2/careerbert-g and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss
1,1.2205,1.065324
2,0.8723,0.919237


{'Top-1': 0.7740851381628081, 'Top-3': 0.8651979088872293, 'Top-5': 0.8909634055265123, 'Top-10': 0.938386855862584, 'F1-macro': 0.7627657815088293, 'Precision-macro': 0.7826034752844017, 'Recall-macro': 0.7595941135465849}


### CareerBERT‑large

In [11]:
careerbert_large_ckpt = 'lwolfrum2/careerbert-jg'  # replace with correct large checkpoint
metrics_cb_large = finetune_transformer(careerbert_large_ckpt, 'careerbert_large')
print(metrics_cb_large)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at lwolfrum2/careerbert-jg and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,1.1864,1.027066
2,0.8019,0.883645


{'Top-1': 0.7893950709484691, 'Top-3': 0.8775205377147125, 'Top-5': 0.9032860343539956, 'Top-10': 0.9391336818521284, 'F1-macro': 0.781290366061841, 'Precision-macro': 0.7942181453680057, 'Recall-macro': 0.7800726744204052}


### RoBERTa‑DA

In [15]:
# --- 1. Correct checkpoint identifier ---
roberta_da_ckpt = "mediabiasgroup/da-roberta-babe-ft"

# --- 2. Allow the classification head to be replaced ---
def finetune_transformer(model_ckpt, output_dir, epochs=2, batch=8, lr=2e-5):
    tok = AutoTokenizer.from_pretrained(model_ckpt)

    def tokenize(batch):
        return tok(batch[text_col],
                   truncation=True, padding='max_length', max_length=256)

    tok_splits = splits.map(tokenize, batched=True)
    tok_splits.set_format(type='torch',
                          columns=['input_ids', 'attention_mask', 'label'])

    model = AutoModelForSequenceClassification.from_pretrained(
        model_ckpt,
        num_labels=num_labels,
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True   # <-- the key addition
    )

    args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy='epoch',
        learning_rate=lr,
        per_device_train_batch_size=batch,
        per_device_eval_batch_size=batch,
        num_train_epochs=epochs,
        weight_decay=0.01,
        logging_steps=100,
        save_strategy='no',
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=tok_splits['train'],
                      eval_dataset=tok_splits['validation'])
    trainer.train()

    preds = trainer.predict(tok_splits['test'])
    probs = torch.softmax(torch.tensor(preds.predictions), dim=-1).numpy()
    return compute_metrics(probs, preds.label_ids)


In [16]:
#roberta_da_ckpt = 'Datadave09/DA-RoBERTa'
metrics_roberta_da = finetune_transformer(roberta_da_ckpt, 'roberta_da')
print(metrics_roberta_da)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at mediabiasgroup/da-roberta-babe-ft and are newly initialized because the shapes did not match:
- classifier.out_proj.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([43]) in the model instantiated
- classifier.out_proj.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([43, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,1.8599,1.50678
2,1.0746,0.95674


{'Top-1': 0.7871545929798357, 'Top-3': 0.8786407766990292, 'Top-5': 0.9171023151605676, 'Top-10': 0.9529499626587006, 'F1-macro': 0.7436502854277351, 'Precision-macro': 0.7426062801768375, 'Recall-macro': 0.7573154108037358}
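
## 📊 Comparison table

The per-model dictionaries printed above can be collected into one table. The sketch below is one possible way to do that in plain Python, using the test-set values from the cell outputs rounded to four decimals; the `results` layout and the sort by Top‑1 are illustrative choices, not something the notebook prescribes.

```python
# Test-set metrics copied from the cell outputs above, rounded to four decimals.
COLUMNS = ["Top-1", "Top-3", "Top-5", "Top-10",
           "F1-macro", "Precision-macro", "Recall-macro"]
results = {
    "TF-IDF + SVM":     [0.8294, 0.9470, 0.9720, 0.9881, 0.8195, 0.8301, 0.8158],
    "FastText + LR":    [0.5007, 0.7222, 0.8170, 0.9074, 0.4174, 0.4695, 0.4574],
    "CareerBERT-base":  [0.7741, 0.8652, 0.8910, 0.9384, 0.7628, 0.7826, 0.7596],
    "CareerBERT-large": [0.7894, 0.8775, 0.9033, 0.9391, 0.7813, 0.7942, 0.7801],
    "RoBERTa-DA":       [0.7872, 0.8786, 0.9171, 0.9529, 0.7437, 0.7426, 0.7573],
}

# Sort models from best to worst Top-1 accuracy (first column).
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)

print(f"{'Model':<18}" + "".join(f"{c:>17}" for c in COLUMNS))
for name, row in ranked:
    print(f"{name:<18}" + "".join(f"{v:>17.4f}" for v in row))
```

Inside the notebook, the same table can be built directly from the metric dictionaries, for example with `pd.DataFrame.from_dict({...}, orient="index")`. On these splits the TF‑IDF + SVM baseline has the highest Top‑1 accuracy and F1‑macro, ahead of all three transformer models fine‑tuned for two epochs at a maximum length of 256 tokens.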
