## Naive Bayes

In [2]:
import torch
import pandas as pd
import numpy as np
import re
# from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, classification_report
torch.set_default_device('cuda')
from gensim.parsing.preprocessing import remove_stopwords

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
train_data = pd.read_csv("/content/drive/MyDrive/ML_Project/train.csv")
test_x = pd.read_csv("/content/drive/MyDrive/ML_Project/test.csv")
test_y = pd.read_csv("/content/drive/MyDrive/ML_Project/test_labels.csv")


In [4]:
def lowercase(txt):

    return txt.lower()

def remove_punctuation(txt):

    return re.sub(r"[^\w\s\d]", "", txt)

def remove_numbers(txt):

    return re.sub(r"\d", "", txt)

def normalize_sentence(txt):
    '''
    Aggregates all the above functions to normalize/clean a sentence
    '''
    txt = lowercase(txt)
    txt = remove_punctuation(txt)
    txt = remove_stopwords(txt)
    txt = remove_numbers(txt)
    return txt

Next, we cleaned the text: lowercasing it and removing punctuation, common words such as "the" and "and" (stopwords), and digits. This leaves the informative words more prominent, which helps the classifier sort the comments correctly.

We then separated each dataset into inputs and labels. For the test set, 'test_x' holds the comments we want the model to classify and 'test_y' holds their correct categories, after dropping the rows whose labels are not available (marked -1). The training data is split the same way into 'train_x' (the comments for training) and 'train_y' (the matching categories).
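
To make the cleaning steps concrete, here is a minimal sketch that traces one sentence through the same sequence (lowercase, punctuation, stopwords, digits). The `STOPWORDS` set and the sample sentence are assumptions made for illustration; the notebook itself relies on gensim's `remove_stopwords`, whose word list is much longer.

```python
import re

# Hypothetical stopword subset, for illustration only.
STOPWORDS = {"the", "and", "is", "a", "of", "to", "it"}

def normalize_sentence(txt):
    txt = txt.lower()                                   # lowercase
    txt = re.sub(r"[^\w\s]", "", txt)                   # drop punctuation
    txt = " ".join(w for w in txt.split() if w not in STOPWORDS)
    txt = re.sub(r"\d", "", txt)                        # drop digits
    return txt

sample = "The admin is a FOOL, and the 2nd reply proves it!!!"
cleaned = normalize_sentence(sample)
print(cleaned)  # admin fool nd reply proves
```

Because digits are removed last, a token such as "2nd" survives as "nd". That is a side effect of the step order rather than a deliberate feature.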

In [5]:
# Drop the rows whose labels are not available

test_data = pd.concat([test_x, test_y], axis=1)
test_data = test_data[(test_data['toxic'] != -1) & (test_data['severe_toxic'] != -1) & (test_data['obscene'] != -1) & (test_data['threat'] != -1) & (test_data['insult'] != -1) & (test_data['identity_hate'] != -1)]
test_data['comment_text']=test_data['comment_text'].apply(normalize_sentence)

test_data.shape

(63978, 9)

In [6]:
# Separate test_x and test_y from test_data

test_x = np.array(test_data["comment_text"])
test_y = np.array(test_data[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]])
# Index of the first positive label per row (0 when a row has no labels); not used below
test_labels = np.argmax(test_y, axis=1)


print(f"test_x {test_x.shape}  ;  test_y {test_y.shape}  ;  test_labels {test_labels.shape}")

test_x (63978,)  ;  test_y (63978, 6)  ;  test_labels (63978,)


In [7]:
# Separate train_x and train_y from train_data

train_data = train_data[(train_data['toxic'] != -1) & (train_data['severe_toxic'] != -1) & (train_data['obscene'] != -1) & (train_data['threat'] != -1) & (train_data['insult'] != -1) & (train_data['identity_hate'] != -1)]

train_data['comment_text']=train_data['comment_text'].apply(normalize_sentence)

train_data_x = np.array(train_data["comment_text"])
train_data_y = np.array(train_data[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]])

print(f"train_x {train_data_x.shape}  ;  train_y {train_data_y.shape}")

train_x (159571,)  ;  train_y (159571, 6)


In the training step, we used CountVectorizer, which turns each comment into a vector of word counts. We paired it with a MultiOutputClassifier wrapping a MultinomialNB classifier, which fits one Naive Bayes model per label. This setup suits comments that can belong to more than one category at once.
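
As a rough illustration of what a count representation looks like, the sketch below builds a small document-term matrix by hand. The three documents are made up, and the token rule (lowercased runs of two or more word characters) is meant to mirror CountVectorizer's default settings rather than reproduce the library.

```python
import re
from collections import Counter

# Toy documents, assumed for illustration.
docs = ["you are a fool", "fool fool go away", "thanks for the edit"]

# Runs of 2+ word characters, as in CountVectorizer's default token pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

# Vocabulary: every distinct token, in sorted order (one column per token).
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one count per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocab])

print(vocab)
for row in matrix:
    print(row)
```

The single-character token "a" never enters the vocabulary, and the second row holds a 2 in the "fool" column because the word occurs twice in that document.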

We also built a second classifier using TfidfVectorizer. It starts from the same word counts but down-weights words that appear in many comments. For instance, common words like "is" or "of" occur often, but they say little about a particular comment, so TF-IDF shifts weight toward rarer but more telling words. We combined it with the same MultiOutputClassifier wrapping a MultinomialNB classifier.
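
The weighting can be sketched by hand. The example below assumes the formula scikit-learn documents for TfidfVectorizer's defaults (raw term count times a smoothed inverse document frequency, then L2 normalization of each row); the toy documents are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized documents, assumed for illustration.
docs = [["fool", "go", "away"], ["fool", "fool", "idiot"], ["thanks", "fool"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(tok for doc in docs for tok in set(doc))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    counts = Counter(doc)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}   # L2-normalized row

weights = tfidf(docs[0])
print(weights)
```

In this toy corpus "fool" appears in every document, so its IDF is exactly 1 and it ends up with a smaller weight than "go" or "away", which each appear in a single document.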

The core of our model is the Naive Bayes classifier. It assumes that, given the category, each word in a comment contributes independently to the likelihood that the comment belongs to that category.
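
The independence assumption means a comment's score for a class is the class prior multiplied by one probability per word. A minimal sketch of a multinomial Naive Bayes scorer for a single binary label follows, assuming add-one (Laplace) smoothing, which matches MultinomialNB's default `alpha=1.0`. The training examples and the helper names `log_posterior` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training set for one label (1 = toxic, 0 = not toxic), assumed for illustration.
train = [
    (["you", "fool"], 1),
    (["stupid", "fool"], 1),
    (["thanks", "for", "help"], 0),
    (["good", "edit", "thanks"], 0),
]

class_docs = Counter(label for _, label in train)   # documents per class
word_counts = defaultdict(Counter)                  # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {tok for tokens, _ in train for tok in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(class) + sum over words of log P(word | class)
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for tok in tokens:
        if tok not in vocab:
            continue  # words unseen in training are skipped
        p = (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(tokens):
    return max(class_docs, key=lambda c: log_posterior(tokens, c))

print(predict(["you", "stupid", "fool"]))   # 1
print(predict(["thanks", "good", "help"]))  # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. With six labels, the same procedure would be repeated once per label.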

In [11]:
# Using raw word counts
naive_bayes = Pipeline([
    ('count', CountVectorizer()),
    ('clf', MultiOutputClassifier(MultinomialNB()))
])
naive_bayes.fit(train_data_x, train_data_y)


y_pred = naive_bayes.predict(test_x)
print("Using Count Vectorizer:\n")
print("Accuracy",accuracy_score(test_y, y_pred))
print(classification_report(test_y,y_pred))


#using tfidf
naive_bayes_2 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultiOutputClassifier(MultinomialNB()))
])

# Train the model
naive_bayes_2.fit(train_data_x, train_data_y)

# Predict on the test set
y_pred = naive_bayes_2.predict(test_x)

# Print the accuracy and classification report
print("Using TF-IDF Vectorization:\n")
print("Accuracy:", accuracy_score(test_y, y_pred))
print(classification_report(test_y, y_pred))

Using Count Vectorizer:

Accuracy 0.8796461283566226
              precision    recall  f1-score   support

           0       0.58      0.64      0.61      6090
           1       0.15      0.44      0.23       367
           2       0.57      0.58      0.57      3691
           3       0.01      0.02      0.02       211
           4       0.51      0.49      0.50      3427
           5       0.15      0.16      0.16       712

   micro avg       0.50      0.55      0.53     14498
   macro avg       0.33      0.39      0.35     14498
weighted avg       0.52      0.55      0.53     14498
 samples avg       0.05      0.05      0.05     14498



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Using TF-IDF Vectorization:

Accuracy: 0.9032323611241364
              precision    recall  f1-score   support

           0       0.92      0.20      0.32      6090
           1       0.00      0.00      0.00       367
           2       0.97      0.11      0.19      3691
           3       0.00      0.00      0.00       211
           4       0.93      0.04      0.08      3427
           5       0.00      0.00      0.00       712

   micro avg       0.93      0.12      0.21     14498
   macro avg       0.47      0.06      0.10     14498
weighted avg       0.85      0.12      0.20     14498
 samples avg       0.02      0.01      0.01     14498



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


With Count Vectorization, the model reaches an accuracy of about 88%. However, this figure doesn't give us the full picture. For multi-label targets, accuracy_score reports subset accuracy: a comment counts as correct only when all six labels match. Because most comments carry no label at all, the figure can be high even when the positive labels are handled poorly, so it does not reflect per-label performance, particularly when some labels are far more frequent than others. Labels 0 to 5 follow the column order toxic, severe_toxic, obscene, threat, insult, identity_hate. For labels 0, 2, and 4, precision, recall, and F1-scores fall between roughly 0.5 and 0.6, which is acceptable for these categories. Labels 1, 3, and 5 have much lower precision (0.15 or below) and mostly lower recall, giving F1-scores of 0.23 or less. The model has difficulty with these rarer labels, often either missing them (false negatives) or applying them mistakenly (false positives).
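
To see how a high accuracy can coexist with weak per-label scores, the sketch below uses a made-up set of ten comments and a model that never assigns any label. It assumes the metric is subset accuracy, which is what scikit-learn documents for accuracy_score on multi-label targets; the helpers `subset_accuracy` and `label_recall` are hypothetical stand-ins written for this example.

```python
def subset_accuracy(y_true, y_pred):
    # A row counts as correct only if every label matches.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def label_recall(y_true, y_pred, j):
    # Fraction of true positives for label j that the model recovers.
    preds_on_positives = [p[j] for t, p in zip(y_true, y_pred) if t[j] == 1]
    return sum(preds_on_positives) / len(preds_on_positives)

# Toy data: 8 comments with no labels, 2 with at least one label.
y_true = [[0] * 6 for _ in range(8)] + [[1, 0, 1, 0, 1, 0], [1, 0, 0, 0, 0, 0]]
# A model that never predicts any label.
y_pred = [[0] * 6 for _ in range(10)]

print(subset_accuracy(y_true, y_pred))   # 0.8
print(label_recall(y_true, y_pred, 0))   # 0.0
```

The all-negative model scores 80% here while finding none of the labelled comments, which is why the per-label precision and recall are the more informative numbers on an unbalanced dataset.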

With TF-IDF Vectorization, accuracy rises to 90.32%. But this is again the share of comments whose six labels are all predicted correctly, so it doesn't give us the full story on an unbalanced dataset with several label types. For labels 1, 3, and 5, precision, recall, and F1 are all zero, showing the model fails to correctly identify any of these labels. For labels 0, 2, and 4, precision is high (0.92 to 0.97) but recall is very low (0.04 to 0.20), so the F1-scores stay low. The model is cautious and assigns these labels only in very clear cases, which leaves many true labels missed (false negatives).
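
The reason high precision cannot rescue a low recall is that F1 is the harmonic mean of the two. The short sketch below applies that standard formula to the rounded precision and recall values from the reports above; since the inputs are rounded to two decimals, the results can differ from the reported F1 in the last digit.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# TF-IDF, label 4: very high precision, very low recall.
print(round(f1(0.93, 0.04), 2))  # 0.08
# Count Vectorizer, label 0: moderate precision and recall.
print(round(f1(0.58, 0.64), 2))  # 0.61
```

The harmonic mean stays close to the smaller of the two values, so a label with 0.93 precision and 0.04 recall ends up with an F1 below 0.1.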

The two text representations we use, TF-IDF and Count Vectorization, lead to different results. TF-IDF gives more weight to rare words, which could lead it to underweight important common words. This might explain its low recall across the classes. Count Vectorization, on the other hand, gives more even results across the classes, with a micro-averaged F1 of 0.53 against 0.21 for TF-IDF. So although TF-IDF scores higher on overall accuracy (about 90% versus 88%), it falls short in how it identifies each specific class, especially in recognizing every example in a class.