# Classical ML Benchmark
In this notebook I want to test some classical machine learning models on the cleaned tweets.  
The goal is to see how Logistic Regression, SVM, Naive Bayes, and Random Forest perform on detecting different types of cyberbullying.  
Later I’ll compare these results with transformer-based models.  

In [None]:
#imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt



df = pd.read_csv("clean_tweets.csv")



## Preprocessing

Here we convert the tweets into numerical features and encode the target labels as integers.
TF-IDF (unigrams and bigrams, capped at 5,000 features) creates the input features, and LabelEncoder makes the categories numeric.  
After this cell, `x` and `y` are ready for training.  


In [3]:
df = df.dropna(subset=['tweet_text'])

# convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
x = vectorizer.fit_transform(df['tweet_text'])

# encode the labels
le = LabelEncoder()
y = le.fit_transform(df['cyberbullying_type'])
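The cell above fits the vectorizer on every tweet before the split, so the IDF weights also see the test tweets. One possible alternative, which is not what this notebook does, is to wrap the vectorizer and a classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training split only. The sketch below uses a handful of made-up tweets (`toy_texts` and `toy_labels` are hypothetical) purely to show the wiring.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, only to illustrate the wiring
toy_texts = [
    "you are too old for this", "grandpa go home",
    "old people cannot tweet", "kids in school bully me",
    "nice weather today", "i love this song",
    "great game last night", "coffee time",
]
toy_labels = ["age"] * 4 + ["not_cyberbullying"] * 4

# Split the raw text first, so the test tweets never reach the vectorizer's fit
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

# The pipeline fits TF-IDF and the classifier on the training texts only
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=500),
)
pipe.fit(texts_train, labels_train)

# At prediction time the already-fitted vectorizer only transforms the test texts
toy_pred = pipe.predict(texts_test)
print(len(toy_pred))
```

With this layout the reported scores may differ slightly from the ones in this notebook, since the features would no longer be computed from the full dataset.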

## Splitting data into training and testing sets

An 80/20 train/test split, stratified on the label so that both sets keep the same class distribution.

In [6]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)

print("Training feature matrix shape:", x_train.shape)
print("Training labels shape:", y_train.shape)

print("Testing feature matrix shape:", x_test.shape)
print("Testing labels shape:", y_test.shape)

Training feature matrix shape: (35643, 5000)
Training labels shape: (35643,)
Testing feature matrix shape: (8911, 5000)
Testing labels shape: (8911,)
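To see what `stratify=y` does, the class shares of the two splits can be compared. This is a minimal sketch on made-up labels (`toy_y` and `toy_x` are hypothetical, not the tweet data); the same check could be run on `y_train` and `y_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
toy_y = np.array([0] * 60 + [1] * 30 + [2] * 10)
toy_x = np.arange(len(toy_y)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Share of each class in the training and testing labels
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share)
print("test: ", test_share)
```

The two share vectors should be close to each other; without `stratify`, small classes can drift between the splits.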


## Logistic Regression

In [18]:
lr = LogisticRegression(max_iter=500)
lr.fit(x_train, y_train)

y_pred_lr = lr.predict(x_test)

print("Logistic Regression Results:")
print(classification_report(y_test, y_pred_lr, target_names=le.classes_))

Logistic Regression Results:
                     precision    recall  f1-score   support

                age       0.95      0.97      0.96      1588
          ethnicity       0.98      0.98      0.98      1501
             gender       0.91      0.83      0.87      1527
  not_cyberbullying       0.54      0.51      0.52      1322
other_cyberbullying       0.59      0.67      0.62      1379
           religion       0.94      0.94      0.94      1594

           accuracy                           0.83      8911
          macro avg       0.82      0.82      0.82      8911
       weighted avg       0.83      0.83      0.83      8911



- Strong performance on attribute-specific bullying (age/ethnicity/religion: F1 ≥ 0.94; gender is lower at 0.87).
- Weaker performance on non/vague bullying (not_cyberbullying and other_cyberbullying: F1 0.52–0.62).
- Confusion matrix highlights misclassifications between not_cyberbullying and other_cyberbullying.
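A confusion matrix like the one mentioned in the last point can be built with `confusion_matrix` and labelled with the class names. The sketch below uses a few invented predictions (`toy_true`, `toy_pred`, and `toy_classes` are hypothetical) rather than the model outputs.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, only to show the layout
toy_classes = ["gender", "not_cyberbullying", "other_cyberbullying"]
toy_true = ["gender", "not_cyberbullying", "not_cyberbullying",
            "other_cyberbullying", "other_cyberbullying", "gender"]
toy_pred = ["gender", "other_cyberbullying", "not_cyberbullying",
            "not_cyberbullying", "other_cyberbullying", "gender"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=toy_classes)
cm_df = pd.DataFrame(cm, index=toy_classes, columns=toy_classes)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"
print(cm_df)
```

In this notebook the equivalent would presumably be `confusion_matrix(y_test, y_pred_lr)` with `le.classes_` as the row and column names, and `sns.heatmap(cm_df, annot=True, fmt="d")` to plot it.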

## SVM

In [9]:
svm = SVC(kernel='linear')
svm.fit(x_train, y_train)

y_pred_svm = svm.predict(x_test)

print("SVM Results:")
print(classification_report(y_test, y_pred_svm, target_names=le.classes_))

SVM Results:
                     precision    recall  f1-score   support

                age       0.97      0.98      0.97      1588
          ethnicity       0.98      0.98      0.98      1501
             gender       0.91      0.84      0.87      1527
  not_cyberbullying       0.56      0.50      0.53      1322
other_cyberbullying       0.59      0.73      0.65      1379
           religion       0.96      0.93      0.95      1594

           accuracy                           0.84      8911
          macro avg       0.83      0.82      0.83      8911
       weighted avg       0.84      0.84      0.84      8911



- SVM slightly outperforms logistic regression (84% vs. 83% accuracy).
- Strong on attribute-specific classes (age/ethnicity/religion: F1 ≥ 0.95).
- Improved recall for other_cyberbullying (0.73 vs. 0.67 for logistic regression).
- Still weak on not_cyberbullying (F1 0.53).

## Multinomial Naive Bayes

In [11]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

y_pred_nb = nb.predict(x_test)

print("Naive Bayes Results:")
print(classification_report(y_test, y_pred_nb, target_names=le.classes_))

Naive Bayes Results:
                     precision    recall  f1-score   support

                age       0.83      0.94      0.88      1588
          ethnicity       0.90      0.91      0.90      1501
             gender       0.86      0.78      0.82      1527
  not_cyberbullying       0.53      0.37      0.44      1322
other_cyberbullying       0.57      0.55      0.56      1379
           religion       0.78      0.95      0.86      1594

           accuracy                           0.76      8911
          macro avg       0.74      0.75      0.74      8911
       weighted avg       0.75      0.76      0.75      8911



- Decent performance on attribute-specific bullying (F1 0.82–0.90); age and religion show high recall (0.94–0.95) but moderate precision (0.78–0.83).
- Poor on non/vague bullying (F1 0.44–0.56), especially low recall for not_cyberbullying (0.37).
- Underperforms logistic regression and SVM (76% accuracy vs. 83% and 84%).

## Random Forest

In [14]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)

y_pred_rf = rf.predict(x_test)

print("Random Forest Results:")
print(classification_report(y_test, y_pred_rf, target_names=le.classes_))

Random Forest Results:
                     precision    recall  f1-score   support

                age       0.96      0.98      0.97      1588
          ethnicity       0.98      0.98      0.98      1501
             gender       0.90      0.82      0.86      1527
  not_cyberbullying       0.52      0.42      0.47      1322
other_cyberbullying       0.53      0.66      0.59      1379
           religion       0.95      0.95      0.95      1594

           accuracy                           0.82      8911
          macro avg       0.81      0.80      0.80      8911
       weighted avg       0.82      0.82      0.81      8911



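- Excellent on age/ethnicity/religion (F1 0.95–0.98) with very high precision and recall; gender is lower (F1 0.86).
- Struggles with non/vague bullying (F1 0.47–0.59).
- Close to logistic regression (82% vs. 83% accuracy), but precision on not_cyberbullying and other_cyberbullying (0.52, 0.53) is slightly lower than Naive Bayes (0.53, 0.57).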

## Saving Results

In [19]:
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

# collect metrics for each model
results = []

def add_results(name, y_true, y_pred):
    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1-macro": f1_score(y_true, y_pred, average='macro'),
        "F1-weighted": f1_score(y_true, y_pred, average='weighted'),
        "Precision-macro": precision_score(y_true, y_pred, average='macro'),
        "Precision-weighted": precision_score(y_true, y_pred, average='weighted'),
        "Recall-macro": recall_score(y_true, y_pred, average='macro'),
        "Recall-weighted": recall_score(y_true, y_pred, average='weighted')
    })

# add all models
add_results("Logistic Regression", y_test, y_pred_lr)
add_results("SVM", y_test, y_pred_svm)
add_results("Naive Bayes", y_test, y_pred_nb)
add_results("Random Forest", y_test, y_pred_rf)

results_df = pd.DataFrame(results)

# save results
results_df.to_csv("classical_models_metrics.csv", index=False)

# display results (must be the last expression in the cell to be rendered)
results_df
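The saved table holds only averaged scores, while the notes under each model rely on per-class F1. One way to collect those side by side is `f1_score` with `average=None`. This is a sketch on invented predictions (`toy_true`, `toy_preds`, and `toy_classes` are hypothetical stand-ins for `y_test`, the `y_pred_*` arrays, and `le.classes_`).

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical encoded labels and predictions from two made-up models
toy_classes = ["age", "not_cyberbullying"]
toy_true = [0, 0, 1, 1, 1, 0]
toy_preds = {
    "model_a": [0, 0, 1, 0, 1, 0],
    "model_b": [0, 1, 1, 1, 1, 0],
}

# average=None returns one F1 value per class, in sorted label order
per_class = pd.DataFrame(
    {name: f1_score(toy_true, pred, average=None) for name, pred in toy_preds.items()},
    index=toy_classes,
)
print(per_class.round(2))
```

Each column is a model and each row a class, which makes it easier to see where the models diverge on the weaker classes.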
