# PoC (extended): Fine-tuning a bi-encoder on (text, demand_desc) pairs + classification by max cosine similarity

This notebook:
1) Loads data from CSV/XLSX (`text`, `demand_id`, `demand_desc`)
2) Builds a `demand_id -> demand_desc` map (class prototypes)
3) Splits the labeled data into train/val/test
4) **Fine-tunes** a bi-encoder (Sentence-Transformers) on pairs:
   - `query: <text>` ↔ `passage: <demand description>`
   - loss: `MultipleNegativesRankingLoss` (in-batch negatives)
5) Predicts the class: for each sentence, picks the label whose description embedding has the highest cosine similarity to the sentence embedding
6) Evaluates (macro/micro F1) and saves predictions for the whole dataset
7) (Optionally) pseudo-labels the unlabeled rows and runs a second, short round of fine-tuning

**No RAG, no OpenAI.**
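
To make step 5 concrete, here is a minimal sketch of nearest-prototype classification in plain Python. The 3-dimensional vectors and the labels `D001`/`D002` are made up for illustration; in the notebook the vectors come from the fine-tuned bi-encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(text_vec, prototypes):
    """Return (label, score) of the prototype most similar to text_vec."""
    scored = {label: cosine(text_vec, vec) for label, vec in prototypes.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Hypothetical embeddings, for illustration only.
toy_prototypes = {
    "D001": [1.0, 0.0, 0.0],
    "D002": [0.0, 1.0, 0.0],
}
toy_label, toy_score = classify([0.9, 0.1, 0.0], toy_prototypes)
print(toy_label, round(toy_score, 3))
```

The score returned next to the label is the same quantity the notebook later stores as `pred_score`.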

In [None]:
!pip -q install -U "sentence-transformers>=3.0.0" "datasets>=2.20.0" "transformers>=4.41.0" \
  "accelerate>=0.30.0" "scikit-learn>=1.4.0" "pandas>=2.0.0" "numpy>=1.24.0" "tqdm" "openpyxl"

## Imports + configuration

In [None]:
import os
import random
import numpy as np
import pandas as pd
import torch

from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report

from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.util import cos_sim
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

## Loading the data (CSV or XLSX)

In [None]:
DATA_PATH = "dataset.csv"  # or: "dataset.xlsx"

if DATA_PATH.lower().endswith(".xlsx"):
    df = pd.read_excel(DATA_PATH)
else:
    df = pd.read_csv(DATA_PATH)

required_cols = {"text", "demand_id", "demand_desc"}
missing = required_cols - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {missing}")

df["text"] = df["text"].astype(str)
df["demand_id"] = df["demand_id"].replace("", np.nan)

print("Shape:", df.shape)
print("Labeled:", df["demand_id"].notna().sum(), "Unlabeled:", df["demand_id"].isna().sum())
df.head(3)

## Map: demand_id -> demand_desc (class prototypes)

In [None]:
labeled = df[df["demand_id"].notna()].copy()
unlabeled = df[df["demand_id"].isna()].copy()

label_desc = (
    labeled[["demand_id", "demand_desc"]]
    .dropna()
    .drop_duplicates(subset=["demand_id"])
    .set_index("demand_id")["demand_desc"]
    .to_dict()
)

all_labels = sorted(label_desc.keys())
print("Unique labeled classes:", len(all_labels))

missing_desc = set(labeled["demand_id"].unique()) - set(label_desc.keys())
print("Labels missing desc:", len(missing_desc))
if missing_desc:
    print("Examples:", list(missing_desc)[:10])

## Split train/val/test

If stratification fails (`train_test_split` raises `ValueError` when a class has too few members, which happens with rare classes), we fall back to a non-stratified split.

In [None]:
X = labeled["text"].values
y = labeled["demand_id"].values

def safe_stratified_split(X, y, test_size, seed):
    try:
        return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)
    except ValueError:
        return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=None)

X_train, X_tmp, y_train, y_tmp = safe_stratified_split(X, y, test_size=0.30, seed=SEED)
X_val, X_test, y_val, y_test = safe_stratified_split(X_tmp, y_tmp, test_size=0.50, seed=SEED)

print("Train/Val/Test:", len(X_train), len(X_val), len(X_test))

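## Base model (bi-encoder)

Recommended starting point:
- `intfloat/multilingual-e5-base` (PL/EN)

Alternatives:
- `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`

The `query:` / `passage:` prefixes used below are an E5 convention; if you switch to a non-E5 model, drop them.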
In [None]:
BASE_MODEL = "intfloat/multilingual-e5-base"
model = SentenceTransformer(BASE_MODEL, device=device)

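## Building the fine-tuning dataset: (text, demand_desc) pairs

For each example we train the model to place `text` close to the description of its correct class.
Loss: `MultipleNegativesRankingLoss`. The descriptions of the other rows in the batch act as negatives, which tends to work well for multi-class classification. Two rows in the same batch that share a class description become false negatives for each other, so larger batches help only as long as duplicates are kept out of a batch.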
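
The in-batch-negatives idea can be sketched in plain Python. This is an illustration, not the library's implementation: it assumes a precomputed matrix of cosine similarities in which entry `[i][j]` compares text `i` with description `j`, and it uses a scale factor of 20, which to my knowledge matches the library default. The similarity values are made up.

```python
import math

def mnr_loss(sim_matrix, scale=20.0):
    """Mean cross-entropy where the correct column for row i is column i."""
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical similarities for a batch of 3 (text, description) pairs.
well_separated = [[0.9, 0.1, 0.2], [0.0, 0.8, 0.1], [0.2, 0.1, 0.9]]
confused = [[0.5, 0.6, 0.2], [0.0, 0.8, 0.1], [0.2, 0.1, 0.9]]
print(mnr_loss(well_separated), mnr_loss(confused))
```

In the second matrix the first text is closer to the wrong description than to its own, so the loss is higher; training pushes the diagonal up relative to the rest of each row.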

In [None]:
def build_pairs(texts, labels, label_desc_map):
    pairs = []
    skipped = 0
    for t, lab in zip(texts, labels):
        desc = label_desc_map.get(lab)
        if not isinstance(desc, str) or not desc.strip():
            skipped += 1
            continue
        # E5: query/passage
        pairs.append(InputExample(texts=[f"query: {t}", f"passage: {desc}"]))
    return pairs, skipped

train_pairs, sk_tr = build_pairs(X_train, y_train, label_desc)
val_pairs, sk_va = build_pairs(X_val, y_val, label_desc)
test_pairs, sk_te = build_pairs(X_test, y_test, label_desc)

print("Train pairs:", len(train_pairs), "skipped:", sk_tr)
print("Val pairs:", len(val_pairs), "skipped:", sk_va)
print("Test pairs:", len(test_pairs), "skipped:", sk_te)

## Fine-tuning

Practical notes:
- with about 3k labeled rows, 2–4 epochs are usually enough
- batch size 16–64, depending on GPU memory
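
To see what these choices mean in optimizer steps, here is a rough back-of-the-envelope helper. It assumes a single device, no gradient accumulation and no dropped batches; the figure of 2,100 training pairs is an assumption (70% of 3,000 labeled rows), and `training_steps` is a hypothetical helper, not a library function.

```python
import math

def training_steps(n_pairs, batch_size, epochs, warmup_ratio):
    """Return (steps per epoch, total steps, warmup steps)."""
    steps_per_epoch = math.ceil(n_pairs / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

# Assumed size: 3,000 labeled rows with a 70% train share -> 2,100 pairs.
for bs in (16, 32, 64):
    print(bs, training_steps(2100, bs, epochs=3, warmup_ratio=0.1))
```

At batch size 32 this gives 66 steps per epoch and 198 steps in total, so an evaluation or checkpoint interval of 250 steps would never trigger on a dataset of this size.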

In [None]:
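from datasets import Dataset
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.training_args import BatchSamplers

OUTPUT_DIR = "./outputs/sbert_demand_desc_finetuned"

def pairs_to_dataset(pairs):
    # SentenceTransformerTrainer expects a datasets.Dataset, not a list of InputExample.
    return Dataset.from_dict({
        "anchor": [p.texts[0] for p in pairs],
        "positive": [p.texts[1] for p in pairs],
    })

train_ds = pairs_to_dataset(train_pairs)
val_ds = pairs_to_dataset(val_pairs)

args = SentenceTransformerTrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=torch.cuda.is_available(),
    # Rows sharing a class description would be false negatives within one batch.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy="steps",  # called `evaluation_strategy` before transformers 4.41
    eval_steps=50,          # ~2.1k pairs at batch 32 is only ~200 steps in total
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    logging_steps=50,
    seed=SEED,
)

loss = MultipleNegativesRankingLoss(model)

# Evaluator: does each validation text retrieve its own class description?
# (EmbeddingSimilarityEvaluator needs graded similarity scores, which these pairs lack.)
val_items = [(str(i), t, lab) for i, (t, lab) in enumerate(zip(X_val, y_val)) if lab in label_desc]
evaluator = InformationRetrievalEvaluator(
    queries={qid: f"query: {t}" for qid, t, _ in val_items},
    corpus={str(lab): f"passage: {label_desc[lab]}" for lab in all_labels},
    relevant_docs={qid: {str(lab)} for qid, _, lab in val_items},
    name="val_retrieval",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    loss=loss,
    evaluator=evaluator,
)

trainer.train()
model.save(OUTPUT_DIR)
print("Saved fine-tuned model to:", OUTPUT_DIR)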

## Classification by similarity to label descriptions (after fine-tuning)

In [None]:
@torch.no_grad()
def embed_texts(st_model, texts, batch_size=128):
    embs = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch = texts[i:i+batch_size]
        embs.append(st_model.encode(batch, convert_to_tensor=True, normalize_embeddings=True))
    return torch.cat(embs, dim=0)

label_texts = [f"passage: {label_desc[l]}" for l in all_labels]
label_emb = embed_texts(model, label_texts, batch_size=128)

@torch.no_grad()
def predict_labels(st_model, texts, label_emb, all_labels, batch_size=128):
    preds = []
    scores = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Predict"):
        batch = [f"query: {t}" for t in texts[i:i+batch_size]]
        text_emb = st_model.encode(batch, convert_to_tensor=True, normalize_embeddings=True)
        sims = cos_sim(text_emb, label_emb)  # [B, L]
        best = torch.argmax(sims, dim=1).cpu().numpy()
        best_scores = torch.max(sims, dim=1).values.cpu().numpy()
        preds.extend([all_labels[j] for j in best])
        scores.extend(best_scores.tolist())
    return np.array(preds), np.array(scores)

val_pred, _ = predict_labels(model, X_val, label_emb, all_labels)
test_pred, _ = predict_labels(model, X_test, label_emb, all_labels)

print("VAL macro F1 :", f1_score(y_val, val_pred, average="macro"))
print("VAL micro F1 :", f1_score(y_val, val_pred, average="micro"))
print("TEST macro F1:", f1_score(y_test, test_pred, average="macro"))
print("TEST micro F1:", f1_score(y_test, test_pred, average="micro"))

### Report (can be long for 150 classes)

In [None]:
print(classification_report(y_test, test_pred, digits=3, zero_division=0))

## (Optional) Pseudo-labeling + second round of fine-tuning (self-training)

Goal: make use of the 30k unlabeled rows, but keep only very confident predictions.
Mechanism:
- compute the top-1 and top-2 similarities
- select rows where:
  - top1 >= THRESH
  - and (top1 - top2) >= MARGIN (reduces confusion between similar classes)
- append these pairs to the training set and fine-tune for 1 more epoch

Skip this section if you don't want self-training.
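
The selection rule above can be illustrated on a toy example. The rows and similarity scores below are made up, and `select_confident` is a hypothetical helper; the thresholds match the values used in the next cell.

```python
def select_confident(rows, thresh, margin):
    """Keep rows whose best score clears thresh and beats the runner-up by margin."""
    kept = []
    for text, sims in rows:
        ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
        (best_label, top1), (_, top2) = ranked[0], ranked[1]
        if top1 >= thresh and (top1 - top2) >= margin:
            kept.append((text, best_label))
    return kept

# Hypothetical similarity scores, for illustration only.
toy_rows = [
    ("row A", {"D001": 0.62, "D002": 0.31, "D003": 0.10}),  # confident
    ("row B", {"D001": 0.50, "D002": 0.48, "D003": 0.12}),  # ambiguous: margin 0.02
    ("row C", {"D001": 0.30, "D002": 0.12, "D003": 0.05}),  # too weak: top1 < 0.45
]
print(select_confident(toy_rows, thresh=0.45, margin=0.06))
# prints [('row A', 'D001')]
```

Row B shows why the margin matters: its top score passes the threshold, but the two best classes are nearly tied, so the pseudo-label would be close to a coin flip.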

In [None]:
THRESH = 0.45
MARGIN = 0.06

@torch.no_grad()
def predict_top2(st_model, texts, label_emb, all_labels, batch_size=128):
    top1_label, top1_score, top2_score = [], [], []
    for i in tqdm(range(0, len(texts), batch_size), desc="Predict top2"):
        batch = [f"query: {t}" for t in texts[i:i+batch_size]]
        text_emb = st_model.encode(batch, convert_to_tensor=True, normalize_embeddings=True)
        sims = cos_sim(text_emb, label_emb)  # [B, L]
        vals, idxs = torch.topk(sims, k=2, dim=1)
        vals = vals.cpu().numpy()
        idxs = idxs.cpu().numpy()
        top1_label.extend([all_labels[j] for j in idxs[:,0]])
        top1_score.extend(vals[:,0].tolist())
        top2_score.extend(vals[:,1].tolist())
    return np.array(top1_label), np.array(top1_score), np.array(top2_score)

pseudo_pairs = []
if len(unlabeled) > 0:
    ul_texts = unlabeled["text"].astype(str).tolist()
    ul_pred, ul_s1, ul_s2 = predict_top2(model, ul_texts, label_emb, all_labels)
    keep = (ul_s1 >= THRESH) & ((ul_s1 - ul_s2) >= MARGIN)

    accepted = int(keep.sum())
    print("Pseudo accepted:", accepted, "out of", len(unlabeled))

    if accepted > 0:
        pseudo_texts = unlabeled.loc[keep, "text"].astype(str).values
        pseudo_labels = ul_pred[keep]

        pseudo_pairs, sk = build_pairs(pseudo_texts, pseudo_labels, label_desc)
        print("Pseudo pairs built:", len(pseudo_pairs), "skipped:", sk)
else:
    print("No unlabeled rows found.")

### Second round of fine-tuning (1 epoch)

In [None]:
if pseudo_pairs:
    from datasets import Dataset
    from sentence_transformers.training_args import BatchSamplers

    def to_dataset(pairs):
        # SentenceTransformerTrainer expects a datasets.Dataset, not a list of InputExample.
        return Dataset.from_dict({
            "anchor": [p.texts[0] for p in pairs],
            "positive": [p.texts[1] for p in pairs],
        })

    train_pairs_round2 = train_pairs + pseudo_pairs

    args2 = SentenceTransformerTrainingArguments(
        output_dir=OUTPUT_DIR + "_round2",
        num_train_epochs=1,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=64,
        learning_rate=1e-5,
        warmup_ratio=0.05,
        fp16=torch.cuda.is_available(),
        # Rows sharing a class description would be false negatives within one batch.
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        eval_strategy="steps",  # called `evaluation_strategy` before transformers 4.41
        eval_steps=250,
        save_strategy="steps",
        save_steps=250,
        save_total_limit=2,
        logging_steps=50,
        seed=SEED,
    )

    trainer2 = SentenceTransformerTrainer(
        model=model,
        args=args2,
        train_dataset=to_dataset(train_pairs_round2),
        eval_dataset=to_dataset(val_pairs),
        loss=MultipleNegativesRankingLoss(model),
        evaluator=evaluator,
    )
    trainer2.train()
    model.save(args2.output_dir)
    print("Saved round2 model to:", args2.output_dir)

    # Recompute label embeddings with the updated model
    label_emb = embed_texts(model, label_texts, batch_size=128)

    # Re-evaluate
    val_pred, _ = predict_labels(model, X_val, label_emb, all_labels)
    test_pred, _ = predict_labels(model, X_test, label_emb, all_labels)

    print("After self-training")
    print("VAL macro F1 :", f1_score(y_val, val_pred, average="macro"))
    print("VAL micro F1 :", f1_score(y_val, val_pred, average="micro"))
    print("TEST macro F1:", f1_score(y_test, test_pred, average="macro"))
    print("TEST micro F1:", f1_score(y_test, test_pred, average="micro"))
else:
    print("No pseudo pairs -> skipping round2.")

## Saving predictions for the whole dataset (labeled + unlabeled)

In [None]:
all_texts = df["text"].astype(str).tolist()
pred_all, score_all = predict_labels(model, all_texts, label_emb, all_labels)

out = df.copy()
out["pred_demand_id"] = pred_all
out["pred_score"] = score_all

OUT_PATH = "predictions.parquet"
out.to_parquet(OUT_PATH, index=False)

print("Saved:", OUT_PATH)
out.head(3)

## Tuning tips

- If self-training hurts the results: raise `THRESH` / `MARGIN` or skip that section.
- If too few pseudo-labels are accepted: lower `THRESH` or `MARGIN`, but keep an eye on F1.
- If your labels are very similar to each other: increase the batch size (more negatives per batch) and consider 4 epochs.
- If your texts are long: consider trimming them to 1–2 sentences first (though you mentioned that cleaning/formatting is already done).
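
One way to choose `THRESH` and `MARGIN` less blindly is to sweep them on the validation split, where the true labels are known, and check how many rows pass and how often the accepted predictions are correct. The sketch below is a hypothetical helper on made-up numbers, not part of the notebook.

```python
def sweep(top1, top2, correct, thresholds, margins):
    """For each (thresh, margin): number of accepted rows and their precision."""
    results = []
    for t in thresholds:
        for m in margins:
            passed = [c for s1, s2, c in zip(top1, top2, correct)
                      if s1 >= t and (s1 - s2) >= m]
            precision = sum(passed) / len(passed) if passed else float("nan")
            results.append((t, m, len(passed), precision))
    return results

# Hypothetical validation scores and correctness flags, for illustration only.
toy_top1 = [0.70, 0.50, 0.46, 0.30]
toy_top2 = [0.20, 0.47, 0.30, 0.10]
toy_correct = [True, False, True, False]

for row in sweep(toy_top1, toy_top2, toy_correct, thresholds=[0.45], margins=[0.0, 0.06]):
    print(row)
```

In the notebook the inputs could come from `predict_top2` run on `X_val`, with `correct` being the element-wise comparison of the predicted labels against `y_val`. Precision on validation is only a proxy for the unlabeled rows, which may be distributed differently.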