## ❓ Question Classification: Context Needed or Not

This notebook presents a **question classification task**: given a question, the goal is to determine whether it **requires additional context** or can be understood on its own.

### 📂 Project Overview
- **Dataset**: `labeled_data.csv` – custom labeled dataset created for this task.  
- **Main Model**: [allegro/herbert-base-cased](https://huggingface.co/allegro/herbert-base-cased) – a Polish BERT-based model fine-tuned for this classification problem.  
- **Saved Model**: trained weights are stored in `./model_classification/`.  
- **Other Models**: For comparison, additional classical ML models are implemented in `classify_question_other_model.ipynb`:  
  - Naive Bayes (`MultinomialNB`)  
  - Logistic Regression (`LogisticRegression`)  
  - Linear Support Vector Classifier (`LinearSVC`)  
  - Stochastic Gradient Descent Classifier (`SGDClassifier`)  

### 🎯 Goal
- Train and evaluate different models for **binary classification** (context needed vs. not needed).  
- Compare transformer-based (HerBERT) and classical ML approaches.  
- Analyze the performance trade-offs between the two approaches.  
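The classical baselines listed above live in a separate notebook, so their exact setup is not shown here. As a rough sketch of what such a baseline could look like, the snippet below trains `LogisticRegression` on TF-IDF features. The TF-IDF step and the toy questions are assumptions made for illustration, not the contents of `labeled_data.csv` or of `classify_question_other_model.ipynb`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples; 0 = "bez kontekstu", 1 = "kontekst".
texts = [
    "Jaka jest stolica Polski?",
    "Ile nóg ma pająk?",
    "Kto napisał Pana Tadeusza?",
    "Czy on przyjdzie jutro?",
    "Ile to kosztuje?",
    "Kiedy ona wróci?",
]
labels = [0, 0, 0, 1, 1, 1]

# TF-IDF features feeding a linear classifier (assumed setup).
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

preds = baseline.predict(["Czy ona to kupi?", "Jaka jest stolica Francji?"])
print(preds)
```

Swapping `LogisticRegression()` for `MultinomialNB()`, `LinearSVC()`, or `SGDClassifier()` keeps the rest of the pipeline unchanged, which makes a like-for-like comparison straightforward.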

In [14]:
import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

Loading data

In [15]:
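# Semicolon-separated file; passing names= treats the first row as data (no header row).
df = pd.read_csv("labeled_data.csv", encoding="utf-8", sep=";", names=["text", "label"])
# Polish labels: "bez kontekstu" = no context needed, "kontekst" = context needed.
label_mapping = {"bez kontekstu": 0, "kontekst": 1}
id_mapping = {0: "bez kontekstu", 1: "kontekst"}

# Map text labels to integer ids; astype(int) raises if any label is not in the mapping.
df["label"] = df["label"].map(label_mapping).astype(int)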
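Because `.map(...)` turns any unexpected label into `NaN`, the `astype(int)` call fails on a typo in the label column or on a header row read as data. The sketch below checks for unmapped labels before converting. The three rows are hypothetical and stand in for `labeled_data.csv`.

```python
import pandas as pd

label_mapping = {"bez kontekstu": 0, "kontekst": 1}

# Hypothetical rows, including one label with a typo.
df = pd.DataFrame({
    "text": ["Jaka jest stolica Polski?", "Czy on przyjdzie?", "Kiedy to się stało?"],
    "label": ["bez kontekstu", "kontekst", "konteks"],
})

mapped = df["label"].map(label_mapping)

# Rows whose label is missing from the mapping come back as NaN.
unmapped = df[mapped.isna()]
if not unmapped.empty:
    print(f"{len(unmapped)} row(s) with unknown labels:")
    print(unmapped)

# Keep only the rows that mapped cleanly and convert them to integers.
valid = mapped.notna()
clean = df[valid].assign(label=mapped[valid].astype(int))
```

Whether to drop such rows or fix them by hand depends on the dataset; the point is to surface them before the conversion raises.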

Preparing the model

In [None]:
model_name = "allegro/herbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

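def tokenize_function(examples):
    # Pad or truncate every question to exactly 128 tokens.
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

dataset = Dataset.from_pandas(df)
dataset = dataset.map(tokenize_function, batched=True)
# Hold out 20% of the examples for evaluation; the seed makes the split reproducible.
dataset = dataset.train_test_split(test_size=0.2, seed=42)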

Map: 100%|██████████| 1000/1000 [00:00<00:00, 14424.12 examples/s]
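To make the effect of `padding="max_length"`, `truncation=True`, and `max_length=128` concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: the token ids and `pad_id` are hypothetical, and a real tokenizer also keeps its special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Force a list of token ids to exactly max_length entries."""
    ids = token_ids[:max_length]               # truncation: drop everything past the limit
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    ids = ids + [pad_id] * n_pad               # padding: fill up to the limit
    return {"input_ids": ids, "attention_mask": attention_mask}

short = pad_or_truncate([5, 17, 42, 8, 3])        # hypothetical ids for a short question
long = pad_or_truncate(list(range(1, 201)))       # hypothetical 200-token input

print(len(short["input_ids"]), sum(short["attention_mask"]))
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

Short questions spend most of their 128 positions on padding, so a smaller `max_length` or dynamic padding per batch is a common way to cut training time.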


Training

In [17]:
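def compute_metrics(p):
    """Compute accuracy and support-weighted precision, recall, and F1."""
    preds = p.predictions.argmax(-1)  # predicted class id for each example
    labels = p.label_ids
    # average='weighted' averages per-class scores, weighted by each class's support.
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }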
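The `average='weighted'` option can be reproduced by hand, which helps when reading the reported numbers. The sketch below computes per-class precision, recall, and F1 and averages them by class support. The labels and predictions are hypothetical.

```python
from collections import Counter

def weighted_prf(labels, preds):
    """Support-weighted precision, recall, and F1, computed by hand."""
    support = Counter(labels)
    total = len(labels)
    precision = recall = f1 = 0.0
    for c in sorted(support):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support[c] / total  # share of true examples in class c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Hypothetical ground truth and predictions for 10 questions.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
preds = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

precision, recall, f1 = weighted_prf(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(precision, recall, f1, accuracy)
```

Support-weighted recall always equals accuracy, which is why those two values coincide in the evaluation output. For an imbalanced dataset, per-class scores (`average=None`) or `average='macro'` show more about the minority class.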

In [18]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_mapping))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

training_args = TrainingArguments(
    output_dir="./results",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()




TrainOutput(global_step=400, training_loss=0.3437187194824219, metrics={'train_runtime': 900.1214, 'train_samples_per_second': 3.555, 'train_steps_per_second': 0.444, 'total_flos': 210488844288000.0, 'train_loss': 0.3437187194824219, 'epoch': 4.0})
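The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

n_examples = 1000        # from the Map progress output
test_size = 0.2          # from train_test_split
batch_size = 8           # per_device_train_batch_size
epochs = 4               # num_train_epochs
train_runtime = 900.1214 # seconds, from TrainOutput

n_test = round(n_examples * test_size)
n_train = n_examples - n_test

# One optimizer step per batch (assumes one device, no gradient accumulation).
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = round(n_train * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(n_train, steps_per_epoch, total_steps)
print(samples_per_second, steps_per_second)
```

The results agree with the reported `global_step=400`, `train_samples_per_second` of 3.555, and `train_steps_per_second` of 0.444. If the run used the library's default logging interval of 500 steps (an assumption; the notebook does not set `logging_steps`), a 400-step run would finish before the first training-loss entry is logged.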

In [23]:
trainer.save_model("./model_classification")
tokenizer.save_pretrained("./model_classification")

('./model_classification\\tokenizer_config.json',
 './model_classification\\special_tokens_map.json',
 './model_classification\\vocab.json',
 './model_classification\\merges.txt',
 './model_classification\\added_tokens.json',
 './model_classification\\tokenizer.json')

Model metrics

In [20]:
metrics = trainer.evaluate(eval_dataset=dataset["test"])
print(metrics)



{'eval_loss': 0.6084314584732056, 'eval_accuracy': 0.865, 'eval_f1': 0.86225411640398, 'eval_precision': 0.8635590725064409, 'eval_recall': 0.865, 'eval_runtime': 10.6526, 'eval_samples_per_second': 18.775, 'eval_steps_per_second': 2.347, 'epoch': 4.0}
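The evaluation numbers translate into concrete counts as follows. The test-set size of 200 follows from the 20% split of 1000 examples; the evaluation batch size of 8 is an assumption (the library default), since the notebook does not set it.

```python
eval_accuracy = 0.865
eval_runtime = 10.6526  # seconds, from the evaluation output
n_eval = 200            # 20% of the 1000 examples
eval_batch_size = 8     # assumption: library default, not set in the notebook

n_correct = round(eval_accuracy * n_eval)
n_wrong = n_eval - n_correct

eval_steps = -(-n_eval // eval_batch_size)  # ceiling division
samples_per_second = round(n_eval / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)

print(n_correct, n_wrong)
print(samples_per_second, steps_per_second)
```

An accuracy of 0.865 therefore corresponds to 173 correct and 27 misclassified questions, and the throughput figures match the reported `eval_samples_per_second` and `eval_steps_per_second`. With only 200 test questions, a few predictions either way move accuracy by whole percentage points, so small differences between models should be read with care.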


In [21]:
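def classify_question(question):
    """Return the predicted label and its softmax probability for one question."""
    model.eval()  # make sure dropout is disabled at inference time
    inputs = tokenizer(question, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        output = model(**inputs)

    # A single question gives logits of shape (1, num_labels); take the first row.
    probs = torch.nn.functional.softmax(output.logits, dim=-1)[0]
    labels_pred = torch.argmax(probs)
    confidence = torch.max(probs)

    return id_mapping[labels_pred.item()], confidence.item()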
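The last three steps of `classify_question` (softmax, argmax, max) can be followed by hand on a pair of logits. The sketch below uses plain Python and hypothetical logit values; only `id_mapping` is taken from the notebook.

```python
import math

id_mapping = {0: "bez kontekstu", 1: "kontekst"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.0, 3.0]  # hypothetical model output for one question
probs = softmax(logits)

# argmax picks the class id; its probability is reported as the confidence.
pred_id = max(range(len(probs)), key=probs.__getitem__)
label, confidence = id_mapping[pred_id], probs[pred_id]
print(f"{label} ({confidence * 100:.2f}%)")
```

The reported confidence is simply the largest softmax probability. Fine-tuned classifiers often produce values close to 100% even when they are wrong, so it is better read as a ranking signal than as a calibrated probability.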

Classifying an example question

In [None]:
question = "Czy ona zakwalifikuje się do turnieju?"  # "Will she qualify for the tournament?"
label, confidence = classify_question(question)
confidence = confidence * 100  # convert to a percentage
print(f"{label} (Pewność: {confidence:.2f}%)")  # "Pewność" = confidence

kontekst (Pewność: 99.27%)
