# Classifier

The goal of this notebook is to train and evaluate a simple baseline classifier for the problem of unsafe prompt detection. 

## Setup

In this section, we will install the dependencies required to run the code in this notebook.

In [None]:
import sys
import os

# Add project root to path
sys.path.append(os.path.abspath(".."))

In [None]:
from src.utils.dataset import get_project_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from datasets.arrow_dataset import Column

## Model training

In this section, we train a 

In [None]:
dataset = get_project_dataset()

X_train, y_train = dataset["train"]["text"], dataset["train"]["label"]
X_test, y_test = dataset["test"]["text"], dataset["test"]["label"]

In [None]:
# Create a pipeline that first converts raw text into TF-IDF vectors,
#  then trains a logistic regression classifier on those vectors.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()), 
    ("logreg", LogisticRegression())
])

In [None]:
clf.fit(X_train, y_train)

In [None]:
def evaluate_classifier(
    model: Pipeline,
    X_train: Column,
    y_train: Column,
    X_test: Column,
    y_test: Column,
    digits: int = 4
) -> None:
    """Evaluate and print classification reports for train and test sets."""

    y_train_pred = model.predict(X_train)
    print("--- Train set ---")
    print(classification_report(y_train, y_train_pred, digits=digits))
    
    y_test_pred = model.predict(X_test)
    print("--- Test set ---")
    print(classification_report(y_test, y_test_pred, digits=digits))

In [None]:
evaluate_classifier(model=clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)


This is a very strong baseline. Due to the similiatiy between train and test matrix metrics, no meaninful overfitting of the training data.

For safety applications, we should try and push recall for unsafe prompts higher, even if it costs some precision

## Weight tuning

In our dataset exploration, we found a class imbalance: approximately 70% of examples are safe prompts, while only 30% are unsafe. This imbalance is also need in the 'support' column classification report. In this section, we try to increase recall for unsafe prompts by tuning class weights, to assign more importance to the unsafe classe.

In [None]:
# To address the 70/30 class imbalance, let's adjusts weights inversely proportional to class frequencies
clf = Pipeline([
    ("tfidf", TfidfVectorizer()), 
    ("logreg", LogisticRegression(class_weight="balanced"))
])

In [None]:
clf.fit(X_train, y_train)

In [None]:
evaluate_classifier(model=clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)

Giving fair importance to all classes leads to a more robust and accurate model, let's further explore for custom weightings.

In [None]:
def train_and_evaluate(
    X_train: Column,
    y_train: Column,
    X_test: Column,
    y_test: Column,
    class_weights: dict[int, float],
    digits: int = 4
) -> None:
    """Train and evaluate logistic regression with given class weights."""

    clf = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("logreg", LogisticRegression(class_weight=class_weights))
    ])

    clf.fit(X_train, y_train)

    evaluate_classifier(
        model=clf,
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        digits=digits,
    )

In [None]:
class_weights={0: 1, 1: 5}
train_and_evaluate(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, class_weights=class_weights)

A weighting ratio of about `1:5` is the maximum before recall stops improving and precision and accuracy begin to decline.