## Question Classification with Classical Machine Learning Models

This notebook trains several **classical machine learning classifiers** for **question classification**: deciding whether a question **requires additional context** or can be understood on its own.

### 📂 Models Implemented
- Naive Bayes (`MultinomialNB`)  
- Logistic Regression (`LogisticRegression`)  
- Linear Support Vector Classifier (`LinearSVC`)  
- Stochastic Gradient Descent Classifier (`SGDClassifier`)  
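
Each of the models above consumes TF-IDF features. The following is a minimal sketch, not the notebook's own code, of how one of them could be chained with the vectorizer in a scikit-learn pipeline so that TF-IDF is fit on the training split only. The toy questions, the variable names, and the use of `stratify` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy questions; 0 = does not need context, 1 = needs context.
texts = [
    "What is the capital of Poland?", "Who wrote Pan Tadeusz?",
    "How many days are in a leap year?", "What is the boiling point of water?",
    "And what about him?", "Why did she do that?",
    "Where is it now?", "What happened after that?",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_tr, y_tr)
predictions = pipeline.predict(X_te)
```

Swapping `LogisticRegression` for `MultinomialNB`, `LinearSVC`, or `SGDClassifier` leaves the rest of the sketch unchanged.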

### 📊 Dataset
- **`labeled_data.csv`** – custom labeled dataset with two classes:  
  - **Needs context**  
  - **Does not need context**

### 🎯 Goal
- Train and evaluate baseline models for comparison with the transformer-based approach (HerBERT).  
- Provide insights into the performance trade-offs between traditional ML models and modern transformer models.  

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

Loading data

In [2]:
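# The file has no header row; each line is "<question>;<label>".
df = pd.read_csv("labeled_data.csv", encoding="utf-8", sep=";", names=["text", "label"])
# Polish labels: "bez kontekstu" (no context needed) -> 0, "kontekst" (needs context) -> 1.
label_mapping = {"bez kontekstu": 0, "kontekst": 1}

# Any label missing from label_mapping becomes NaN, and astype(int) then raises.
df["label"] = df["label"].map(label_mapping).astype(int)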
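
Because `map` turns any label it does not recognize into `NaN`, a quick check before the integer cast can catch stray whitespace or capitalization in the label column. Below is a minimal sketch under stated assumptions: the in-memory CSV is a hypothetical stand-in for `labeled_data.csv`, and the normalization step (`strip` plus `lower`) is an illustrative addition, not part of the original loading code.

```python
import io
import pandas as pd

# Hypothetical stand-in for labeled_data.csv (semicolon-separated, no header row).
sample_csv = io.StringIO(
    "Kto napisał Lalkę?;bez kontekstu\n"
    "A ile on ma lat?;kontekst\n"
    "Co to jest?;Kontekst \n"
)
sample_df = pd.read_csv(sample_csv, sep=";", names=["text", "label"])
label_mapping = {"bez kontekstu": 0, "kontekst": 1}

# Mapping the raw labels leaves NaN wherever the spelling differs from the keys.
raw_mapped = sample_df["label"].map(label_mapping)
raw_unmapped_count = int(raw_mapped.isna().sum())

# Assumed normalization: trim whitespace and lowercase before mapping.
normalized = sample_df["label"].str.strip().str.lower()
mapped = normalized.map(label_mapping)
unmapped_rows = sample_df[mapped.isna()]

# Class distribution after mapping; useful for spotting imbalance early.
class_counts = mapped.value_counts().sort_index()
print(f"unmapped before normalization: {raw_unmapped_count}")
print(class_counts)
```

If `unmapped_rows` is not empty, those rows need a corrected label before `astype(int)` can succeed.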

In [3]:
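# Hold out 20% of the questions for testing; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=42)

# Fit TF-IDF on the training questions only, then reuse the same vocabulary for the test set.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)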

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "SGDClassifier": SGDClassifier(),
}

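# Train each model on the same TF-IDF features and report per-class metrics on the test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f"=== {name} ===")
    # target_names follows the 0/1 order defined in label_mapping.
    print(classification_report(y_test, y_pred, target_names=list(label_mapping.keys())))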

=== Naive Bayes ===
               precision    recall  f1-score   support

bez kontekstu       0.71      1.00      0.83       134
     kontekst       1.00      0.18      0.31        66

     accuracy                           0.73       200
    macro avg       0.86      0.59      0.57       200
 weighted avg       0.81      0.73      0.66       200

=== Logistic Regression ===
               precision    recall  f1-score   support

bez kontekstu       0.71      0.99      0.82       134
     kontekst       0.86      0.18      0.30        66

     accuracy                           0.72       200
    macro avg       0.78      0.58      0.56       200
 weighted avg       0.76      0.72      0.65       200

=== LinearSVC ===
               precision    recall  f1-score   support

bez kontekstu       0.74      0.85      0.79       134
     kontekst       0.56      0.38      0.45        66

     accuracy                           0.69       200
    macro avg       0.65      0.61      0.62       200

In [4]:
print("\nFinal results:", results)


Final results: {'Naive Bayes': 0.73, 'Logistic Regression': 0.72, 'LinearSVC': 0.695, 'SGDClassifier': 0.73}
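
Accuracy alone hides how unevenly the two classes are handled: Naive Bayes reaches 0.73 accuracy while recalling only 18% of the `kontekst` questions. The sketch below compares accuracy with balanced accuracy and macro F1 for that case. The prediction arrays are an assumption: they are reconstructed from the rounded Naive Bayes report (134 of 134 `bez kontekstu` and 12 of 66 `kontekst` questions correct), not taken from the notebook's actual `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Assumed confusion counts, inferred from the rounded Naive Bayes report:
# all 134 "bez kontekstu" questions correct, 12 of 66 "kontekst" questions correct.
y_true = np.array([0] * 134 + [1] * 66)
y_pred = np.array([0] * 134 + [1] * 12 + [0] * 54)

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the mean of per-class recall, so both classes weigh equally.
balanced_acc = balanced_accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Accuracy of always predicting the majority class, for reference.
majority_baseline = np.bincount(y_true).max() / len(y_true)

print(
    f"accuracy={accuracy:.2f} balanced_accuracy={balanced_acc:.2f} "
    f"macro_f1={macro_f1:.2f} majority_baseline={majority_baseline:.2f}"
)
```

Under these assumed counts, the balanced accuracy and macro F1 agree with the `macro avg` row of the Naive Bayes report (0.59 and 0.57), and the 0.73 accuracy sits only a few points above the 0.67 obtained by always predicting `bez kontekstu`.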
