<a href="https://colab.research.google.com/github/lawrenceguelos/CCMACLRL_EXERCISES_COM222-ML/blob/main/Exercise7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation and test sets provided below.
- Use Multinomial Naive Bayes to train a model that classifies a sentence as hate speech or non-hate speech
- A sentence with a label of zero (0) is classified as non-hate speech
- A sentence with a label of one (1) is classified as hate speech

Apply text pre-processing techniques such as:
- Converting to lowercase
- Stop word removal
- Removal of digits and special characters
- Stemming or lemmatization, but not both
- Count Vectorizer or TF-IDF Vectorizer, but not both

Evaluate your model by:
- Providing your own input
- Creating a Confusion Matrix
- Calculating the Accuracy, Precision, Recall and F1-Score

In [62]:
import pandas as pd
import re
import string
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download nltk resources if not already installed
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [63]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [64]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

**Validation Set**

Use this set to evaluate your model

In [65]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

**Test Set**
  
Use this set to test your model

In [66]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

## A. Understanding your training data

1. Check the first 10 rows of the training dataset

In [67]:
print(df_train.head(10))

                                                text  label
0  Presidential candidate Mar Roxas implies that ...      1
1  Parang may mali na sumunod ang patalastas ng N...      1
2                    Bet ko. Pula Ang Kulay Ng Posas      1
3                               [USERNAME] kakampink      0
4  Bakit parang tahimik ang mga PINK about Doc Wi...      1
5  "Ang sinungaling sa umpisa ay sinungaling hang...      1
6                                          Leni Kiko      0
7  Nahiya si Binay sa Makati kaya dito na lang sa...      1
8                            Another reminderHalalan      0
9  [USERNAME] Maybe because VP Leni Sen Kiko and ...      0


2. Check how many rows and columns are in the training dataset using `.info()`

In [68]:
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB
None


3. Check for NaN values

In [69]:
print(df_train.isna().sum())

text     0
label    0
dtype: int64


4. Check for duplicate rows

In [70]:
print(df_train.duplicated().sum())

0


5. Check how many rows belong to each class

In [71]:
print(df_train['label'].value_counts())

label
1    10994
0    10779
Name: count, dtype: int64
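
The two classes are close to balanced, which matters when reading the accuracy figures later on. As a rough sketch using only the counts printed above (the variable names here are illustrative and not part of the exercise), the share of each class and the accuracy of a naive "always predict the majority class" baseline can be worked out as follows:

```python
# Counts copied from the value_counts() output above.
label_counts = {1: 10994, 0: 10779}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# A classifier that always predicts the most frequent label would reach this accuracy.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"label {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should clear this baseline by a comfortable margin to be considered useful.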


## B. Text pre-processing

6. Remove duplicate rows

In [72]:
df_train.drop_duplicates(inplace=True)

7. Remove rows with NaN values

In [73]:
df_train.dropna(inplace=True)

8. Convert all text to lowercase

In [74]:
df_train['text'] = df_train['text'].str.lower()

9. Remove digits, URLs and special characters

In [75]:
df_train['text'] = df_train['text'].apply(lambda x: re.sub(r'\d+', '', x))  # Remove digits
df_train['text'] = df_train['text'].apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))  # Remove URLs (dot escaped so it matches a literal ".")
df_train['text'] = df_train['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))  # Remove punctuation
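
The same cleaning steps are needed again for the validation set, the test set, and any hand-written input. One possible way to avoid repeating them is a single helper function. The sketch below is an assumption about how that could look, not the notebook's own code: `clean_text`, `URL_PATTERN` and `PUNCT_TABLE` are hypothetical names, and it strips URLs before digits and punctuation so that the URL pattern still sees an intact link.

```python
import re
import string

# Hypothetical helper names; the regular expressions mirror the steps above.
URL_PATTERN = re.compile(r"http\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text: str) -> str:
    """Lowercase, then strip URLs, digits and ASCII punctuation from one string."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)      # remove URLs while they are still intact
    text = re.sub(r"\d+", "", text)       # remove digits
    text = text.translate(PUNCT_TABLE)    # remove punctuation
    return " ".join(text.split())         # collapse leftover whitespace

print(clean_text("Check https://example.com/page1 NOW!!! 2022 na"))
```

With pandas, such a helper could be applied as `df_train['text'].apply(clean_text)`, and in the same way to the validation and test columns.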

10. Remove stop words

In [76]:
stop_words = set(stopwords.words('english'))
df_train['text'] = df_train['text'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))
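
The English stop word list only covers part of this dataset, since many sentences are written in Filipino. A minimal sketch of how the list could be extended is shown below. Everything in it is an assumption for illustration: `ENGLISH_SAMPLE` is a tiny stand-in for `stopwords.words('english')` so the snippet runs without NLTK, and `TAGALOG_SAMPLE` is a hand-picked handful of common Tagalog function words, not an official list.

```python
# Tiny stand-in for set(stopwords.words('english')), so the sketch runs without NLTK.
ENGLISH_SAMPLE = {"the", "is", "a", "of", "and", "to"}
# Hypothetical, hand-picked Tagalog function words; not an official stop word list.
TAGALOG_SAMPLE = {"ang", "ng", "sa", "mga", "na", "at", "ay", "si", "ni"}

combined_stop_words = ENGLISH_SAMPLE | TAGALOG_SAMPLE

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "nahiya si binay sa makati"
print(remove_stop_words(sample, ENGLISH_SAMPLE))        # English list leaves the sentence untouched
print(remove_stop_words(sample, combined_stop_words))   # combined list drops "si" and "sa"
```

Whether a longer stop word list actually improves the scores is an empirical question that the validation set can answer.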

11. Use Stemming or Lemmatization

In [77]:
lemmatizer = WordNetLemmatizer()
df_train['text'] = df_train['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x)]))

## C. Training your model

12. Put all text training data in variable **X_train**

In [78]:
X_train = df_train['text']

13. Put all training data labels in variable **y_train**

In [79]:
y_train = df_train['label']

14. Use `CountVectorizer()` or `TfidfVectorizer()` to convert text data to its numerical form.

Store the converted data in the **X_train_transformed** variable

In [80]:
vectorizer = TfidfVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
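
The vectorizer learns its vocabulary only from the data passed to `fit_transform()`. Later data should go through `transform()` alone, so that it is mapped onto the same columns. The toy example below is a sketch with made-up sentences (`train_texts`, `new_texts` and the `toy_` names are assumptions) that shows a word unseen during fitting being dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy sentences, used only to illustrate fit_transform() vs transform().
train_texts = ["kind helpful friend", "rude awful insult"]
new_texts = ["kind stranger"]

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(train_texts)  # learns the vocabulary and IDF weights
X_toy_new = toy_vectorizer.transform(new_texts)          # reuses them; nothing is re-learned

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(X_toy_train.shape, X_toy_new.shape)
```

Both matrices have the same number of columns, and "stranger" contributes nothing because it was not in the training vocabulary.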

15. Create an instance of `MultinomialNB()`

In [81]:
model = MultinomialNB()

16. Train the model using `.fit()`

In [82]:
model.fit(X_train_transformed, y_train)
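
To give some intuition for what `.fit()` and `.predict()` do, here is a simplified pure-Python sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, and `train_nb`, `predict_nb` and the toy sentences are hypothetical names and data chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a toy Multinomial Naive Bayes on whitespace-tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: all tokens in the class plus alpha per vocabulary word.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the label with the highest log prior plus summed word log-likelihoods."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in doc.split():
            # Words never seen in training are skipped, much like a fitted vectorizer drops them.
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy data: 0 = non-hate, 1 = hate.
toy_docs = ["kind helpful friend", "good kind neighbor", "rude awful insult", "awful rude enemy"]
toy_labels = [0, 0, 1, 1]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "kind friend"))
print(predict_nb(toy_model, "rude enemy"))
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.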

## D. Evaluate your model

17. Use `.predict()` to generate model predictions using the **validation dataset**


- Put all text validation data in **X_validation** variable

- Convert **X_validation** to its numerical form.

- Store the converted data in **X_validation_transformed**

- Put all predictions in **y_validation_pred** variable

In [83]:
X_validation = df_validation['text']
X_validation = X_validation.str.lower()
X_validation = X_validation.apply(lambda x: re.sub(r'\d+', '', x))
X_validation = X_validation.apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))
X_validation = X_validation.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
X_validation = X_validation.apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))
X_validation = X_validation.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x)]))

# Convert the validation text data to its numerical form
X_validation_transformed = vectorizer.transform(X_validation)

18. Get the Accuracy, Precision, Recall and F1-Score of the model using the **validation dataset**

- Put all validation data labels in **y_validation** variable

In [84]:
y_validation = df_validation['label']

# Predict on validation data
y_validation_pred = model.predict(X_validation_transformed)

# Evaluate the model
accuracy = accuracy_score(y_validation, y_validation_pred)
precision = precision_score(y_validation, y_validation_pred)
recall = recall_score(y_validation, y_validation_pred)
f1 = f1_score(y_validation, y_validation_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.8342857142857143
Precision: 0.8097719869706841
Recall: 0.8784452296819788
F1 Score: 0.8427118644067797


19. Create a confusion matrix using the **validation dataset**

In [85]:
cm = confusion_matrix(y_validation, y_validation_pred)
print(f"Confusion Matrix:\n{cm}")

Confusion Matrix:
[[1093  292]
 [ 172 1243]]
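
The four scores from step 18 can be recovered directly from this matrix. scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, with rows for true labels and columns for predicted labels. The sketch below (the function name `metrics_from_confusion_matrix` is an assumption) applies the standard formulas to the validation matrix above.

```python
def metrics_from_confusion_matrix(cm):
    """Derive binary metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of everything predicted as hate, how much really was
    recall = tp / (tp + fn)      # of all real hate speech, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Values copied from the validation confusion matrix above.
validation_cm = [[1093, 292], [172, 1243]]
validation_metrics = metrics_from_confusion_matrix(validation_cm)

for name, value in validation_metrics.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the model produces more false positives (292) than false negatives (172), which is why recall is higher than precision.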


20. Use `.predict()` to generate the model predictions using the **test dataset**


- Put all text test data in **X_test** variable

- Convert **X_test** to its numerical form.

- Store the converted data in **X_test_transformed**

- Put all predictions in **y_test_pred** variable

In [86]:
X_test = df_test['text']
X_test = X_test.str.lower()
X_test = X_test.apply(lambda x: re.sub(r'\d+', '', x))
X_test = X_test.apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))
X_test = X_test.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
X_test = X_test.apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))
X_test = X_test.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x)]))

# Convert the test text data to its numerical form
X_test_transformed = vectorizer.transform(X_test)

21. Get the Accuracy, Precision, Recall and F1-Score of the model using the **test dataset**

- Put all test data labels in **y_test** variable



In [87]:
y_test = df_test['label']

# Predict on test data
y_test_pred = model.predict(X_test_transformed)

# Evaluate the model
accuracy_test = accuracy_score(y_test, y_test_pred)
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)

print(f"Test Accuracy: {accuracy_test}")
print(f"Test Precision: {precision_test}")
print(f"Test Recall: {recall_test}")
print(f"Test F1 Score: {f1_test}")

Test Accuracy: 0.8352313167259786
Test Precision: 0.8045602605863192
Test Recall: 0.8834048640915594
Test F1 Score: 0.8421411524036823


22. Create a confusion matrix using the **test dataset**

In [88]:
cm_test = confusion_matrix(y_test, y_test_pred)
print(f"Test Confusion Matrix:\n{cm_test}")

Test Confusion Matrix:
[[1112  300]
 [ 163 1235]]


## E. Test the model

23. Test the model by providing a non-hate speech input. The model should predict it as 0

In [91]:
input_non_hate = ["sana all"]
input_non_hate_transformed = vectorizer.transform(input_non_hate)
prediction_non_hate = model.predict(input_non_hate_transformed)
print(f"Prediction for non-hate input: {prediction_non_hate[0]}")

Prediction for non-hate input: 0


24. Test the model by providing a hate speech input. The model should predict it as 1

In [92]:
input_hate = ["kupal"]
input_hate_transformed = vectorizer.transform(input_hate)
prediction_hate = model.predict(input_hate_transformed)
print(f"Prediction for hate input: {prediction_hate[0]}")

Prediction for hate input: 1
