# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation, and test sets provided below.
- Use Multinomial Naive Bayes to train a model that classifies a sentence as hate speech or non-hate speech.
- A sentence with a label of zero (0) is non-hate speech.
- A sentence with a label of one (1) is hate speech.

Apply text pre-processing techniques such as:
- Converting to lowercase
- Stop word removal
- Removal of digits and special characters
- Stemming or lemmatization, but not both
- Count Vectorizer or TF-IDF Vectorizer, but not both

Evaluate your model by:
- Providing your own input sentences
- Creating a confusion matrix
- Calculating the accuracy, precision, recall, and F1-score

In [19]:
import pandas as pd
import re
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the necessary resources for nltk (if needed)
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [21]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

In [22]:
df_train.head()

   text                                                label
0  Presidential candidate Mar Roxas implies that ...   1
1  Parang may mali na sumunod ang patalastas ng N...   1
2  Bet ko. Pula Ang Kulay Ng Posas                     1
3  [USERNAME] kakampink                                0
4  Bakit parang tahimik ang mga PINK about Doc Wi...   1


In [23]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB


In [24]:
# Check for missing values in each column
print("Missing data in each column:\n" + str(df_train.isnull().sum()))

Missing data in each column:
text     0
label    0
dtype: int64


In [25]:
df_train['label'].value_counts()

label  count
1      10994
0      10779


**Validation Set**

Use this set to evaluate your model

In [26]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

**Test Set**
  
Use this set to test your model

In [27]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

In [28]:
# Preprocessing Function for Tagalog Text with Extended Stop Words and Lemmatization
def preprocess_text(text):

    # Converting to lowercase
    text = text.lower()

    # Removing special characters and digits
    text = re.sub(r'\d+', '', text)  # remove digits
    text = re.sub(r'[^\w\s]', '', text)  # remove special characters

    # Extended list of stopwords (Tagalog + English)
    stopwords = set([
        "akin", "aking", "ako", "alin", "am", "amin", "aming", "ang", "ano", "anumang",
        "apat", "at", "atin", "ating", "bababa", "bago", "bakit", "bawat",
        "dahil", "dalawa", "dapat", "din", "dito", "gagawin", "gayunman", "ginagawa",
        "ginawa", "ginawang", "gumawa", "habang", "hanggang", "hindi", "iba",
        "ibaba", "ibabaw", "ibig", "ikaw", "ilagay", "ilalim", "ilan", "inyong", "isa", "isang",
        "itaas", "ito", "iyo", "iyon", "ka", "kahit", "kailangan", "kailanman", "kami",
        "kanila", "kanilang", "kanino", "kanya", "kanyang", "kapag", "kapwa", "karamihan",
        "katiyakan", "katulad", "kaya", "kaysa", "ko", "kong", "kulang", "kumuha",
        "laban", "lahat", "lamang", "likod", "maaari", "maaaring", "maging", "mahusay",
        "makita", "marami", "marapat", "masyado", "may", "mayroon", "mga", "minsan", "mismo",
        "mula", "na", "nabanggit", "naging", "nagkaroon", "nais", "nakita", "namin",
        "napaka", "narito", "nasaan", "ng", "ngayon", "ni", "nila", "nilang", "nito", "niya",
        "niyang", "o", "pa", "paano", "pababa", "paggawa", "pagitan", "pagkakaroon",
        "pagkatapos", "palabas", "pamamagitan", "panahon", "pangalawa", "para", "paraan",
        "pareho", "pataas", "pero", "pumunta", "sa", "saan", "sabi", "sabihin",
        "sarili", "sila", "sino", "tatlo", "tayo", "tulad", "una", "walang",
        "myself", "which", "your", "too", "and", "his", "we", "be", "both", "a", "because",
        "below", "just", "can", "between", "is", "after", "those", "down", "where", "against",
        "same", "don", "been", "what", "so", "into", "does", "are", "on", "an", "yourselves",
        "more", "during", "to", "or", "any", "yourself", "do", "he", "now", "as", "me",
        "further", "over", "few", "whom", "this", "above", "not", "it", "ourselves", "you",
        "her", "very", "once", "than", "about", "him", "only", "doing", "these", "but",
        "there", "here", "itself", "she", "most", "yours", "up", "until", "was", "with",
        "being", "off", "t", "will", "has", "own", "should", "nor", "i", "our", "out",
        "again", "then", "under", "all", "for", "why", "each", "if", "having", "s",
        "theirs", "from", "such", "while", "how", "my", "by", "had", "they", "its",
        "were", "did", "in", "no", "herself", "their", "through", "when"
    ])
    # Drop stopwords, then lemmatize the remaining words. WordNetLemmatizer
    # targets English, so Tagalog words generally pass through unchanged.
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stopwords])

    return text
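
One detail of the regular expressions in `preprocess_text`: `\w` matches underscores as well as letters and digits, so `[^\w\s]` leaves `_` in place. The sketch below uses only the standard `re` module to show the effect on a made-up sample string. The helper names `strip_basic` and `strip_with_underscores` are hypothetical and are not part of the notebook.

```python
import re

def strip_basic(text):
    """Same three steps as preprocess_text: lowercase, drop digits, drop punctuation."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

def strip_with_underscores(text):
    """Variant (an assumption, not the notebook's code) that also drops underscores."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]|_', '', text)
    return text

# Made-up sample string for illustration only
sample = "[USERNAME] Boto_Ko 2022!!"
basic = strip_basic(sample)
strict = strip_with_underscores(sample)
print(repr(basic))   # underscore survives
print(repr(strict))  # underscore removed
```

Whether this matters depends on the data. If underscores are rare in the tweets, the original pattern is likely fine as it is.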

In [29]:
# Applying preprocessing to the datasets
df_train['cleaned_text'] = df_train['text'].apply(preprocess_text)
df_validation['cleaned_text'] = df_validation['text'].apply(preprocess_text)
df_test['cleaned_text'] = df_test['text'].apply(preprocess_text)

In [30]:
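# Vectorization: fit the vocabulary and IDF weights on the training text only,
# then reuse the fitted vectorizer for the validation and test sets
vectorizer = TfidfVectorizer()  # or CountVectorizer()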
X_train = vectorizer.fit_transform(df_train['cleaned_text'])
y_train = df_train['label']
X_validation = vectorizer.transform(df_validation['cleaned_text'])
y_validation = df_validation['label']
X_test = vectorizer.transform(df_test['cleaned_text'])
y_test = df_test['label']
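
To make the vectorization step more concrete, here is a minimal pure-Python sketch of what a TF-IDF transform computes. It assumes scikit-learn's documented defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and simplifies tokenization to whitespace splitting, so it is an illustration rather than a reimplementation of `TfidfVectorizer`. The functions `fit_idf` and `tfidf_vector` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Weight term counts by IDF and L2-normalize; unseen terms are ignored."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy training documents (hypothetical, already "cleaned")
train_docs = ["galing mo", "galing talaga"]
idf = fit_idf(train_docs)

# "bago" never appeared in training, so it contributes nothing
vec = tfidf_vector("galing mo bago", idf)
print(vec)
```

The rarer term (`mo`, seen in one training document) ends up with a larger weight than the common one (`galing`, seen in both). Words that never appeared in training are dropped at transform time, which is why the vectorizer is fitted once on the training set and only applied to the other two.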

In [31]:
model = MultinomialNB()
model.fit(X_train, y_train)
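
For intuition about what `MultinomialNB` fits, the following is a minimal sketch of multinomial Naive Bayes with Laplace (add-one) smoothing, which corresponds to scikit-learn's default `alpha=1.0`. It works on raw word counts rather than the TF-IDF weights used in this notebook, and the function names `train_nb` and `predict_nb` as well as the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods from word counts."""
    vocab = {t for d in docs for t in d.split()}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc.split() if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy data (hypothetical): 1 = hate speech, 0 = non-hate speech
docs = ["tanga ka", "tanga talaga", "galing mo", "salamat galing"]
labels = [1, 1, 0, 0]
log_prior, log_lik = train_nb(docs, labels)

print(predict_nb("tanga mo", log_prior, log_lik))
print(predict_nb("galing mo", log_prior, log_lik))
```

The smoothing term keeps a word that was never seen in one class from driving that class's probability to zero.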

In [32]:
#Training
y_pred_train = model.predict(X_train)
print("Training Accuracy: ", accuracy_score(y_train, y_pred_train))

Training Accuracy:  0.901345703394112


In [33]:
# Validation
y_pred_val = model.predict(X_validation)
print("Validation Accuracy: ", accuracy_score(y_validation, y_pred_val))

Validation Accuracy:  0.8360714285714286


In [34]:
# Test
y_pred_test = model.predict(X_test)
print("Test Accuracy: ", accuracy_score(y_test, y_pred_test))

Test Accuracy:  0.8359430604982206


In [35]:
# Confusion matrix on the test set
# (rows: true labels 0 and 1; columns: predicted labels 0 and 1)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

Confusion Matrix:
[[1125  287]
 [ 174 1224]]
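
The exercise also asks for precision, recall, and F1-score. As a sketch, these can be derived directly from the confusion matrix printed above, treating label 1 (hate speech) as the positive class. The four counts are copied from that output; the variable names are illustrative.

```python
# Counts copied from the test-set confusion matrix
# (rows are true labels, columns are predicted labels)
tn, fp = 1125, 287   # true label 0: predicted 0, predicted 1
fn, tp = 174, 1224   # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the sentences flagged as hate speech, how many were
recall = tp / (tp + fn)      # of the actual hate speech sentences, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

For the hate speech class this works out to roughly 0.81 precision, 0.88 recall, and 0.84 F1, and the accuracy agrees with the test accuracy reported earlier. In the notebook itself, `classification_report(y_test, y_pred_test)`, which is already imported, reports the same quantities per class.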


In [36]:
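# Clean the input with the same preprocessing that was applied to the training data
new_text = pd.Series(["tanga ang galing mo"]).apply(preprocess_text)
new_text_transform = vectorizer.transform(new_text)
prediction = model.predict(new_text_transform)

# Interpret the prediction result (predict returns an array, so take the first element)
if prediction[0] == 1: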
    print("The sentence is a hate speech.")
else:
    print("The sentence is a non-hate speech.")


The sentence is a hate speech.
