<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project - Identifying Offensive Tweets

# Information

This is my capstone project for the General Assembly Data Science Immersive course.

This is the final notebook of this project.

This notebook covers a single step:

    1. Obtaining the test scores of the models

**CONTENT WARNING: This project includes content that is sensitive and may be offensive to some viewers. It includes mentions (many of them negative) of, and slurs about, race, religion, and gender.**

**NOTE: All text used in this project is taken directly from the source websites and does not reflect what I believe in. All tags (whether a tweet is racist, sexist, or neither) are taken as is from the source.**

For the purpose of this project, the offensive tweets of interest are those that are racist or sexist.

Racist tweets are defined as those that express antagonistic sentiments toward religious figures, members of a religious group, and/or individuals or groups of a particular race. Because the 'classified_tweets' dataset does not separate racist tweets from sacrilegious/blasphemous (anti-religious) ones, the 'racist' tag is applied to both categories.

Sexist tweets are defined as those that have misogynistic, homophobic, and/or transphobic sentiments.

# Background

Twitter is a micro-blogging social media platform with 217.5 million daily active users globally. With 500 million new tweets (posts) daily, the topics of these tweets vary widely: k-pop, politics, financial news… you name it! Individuals use it for news, entertainment, and discussion, while corporations use it as a marketing tool to reach a wide audience. The freedom Twitter accords its users can provide a conducive environment for productive discourse, but it can also be abused, manifesting as racism and sexism.

# Problem Statement

With a significant share of Twitter’s income coming from advertisers, it is imperative that Twitter keeps a substantial user base. At the same time, Twitter should maintain a safe space for users and provide some level of checks on the tweets users put out into the public space. The first step is to identify tweets that espouse racist or sexist ideologies. Twitter can then direct the authors to appropriate sources of information about the community they offended or about their subconscious biases, so that they become more aware of their racist/sexist tendencies. To balance these two aims, Twitter has to accurately separate inappropriate tweets from innocuous ones and identify the kind of inappropriateness of each flagged tweet (tag: racist or sexist).

The F1-score will be the primary metric because it combines precision and recall, which are sensitive to false positives (FPs) and false negatives (FNs) respectively. It is also a popular metric for imbalanced data, as is the case with the dataset used.

For the purpose of explanation, racist tweets are used as the ‘positive’ case.

In this context, FPs are cases where the model erroneously flags a tweet as racist when it is actually innocuous or sexist. FNs are cases where the model erroneously labels a tweet as innocuous or sexist when it is actually racist.

There is a need to balance the identification of an offensive tweet when it is indeed offensive and the need to maintain a high level of user experience (something that would be jeopardized when the model erroneously flags innocuous tweets as offensive).

Thus, the F1-score is the preferred metric for assessing model performance, with a higher score indicating a better model.
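
As a minimal sketch of how the metric combines the two error types, the snippet below computes precision, recall, and F1 from raw counts for a single 'positive' class (racist tweets, as in the explanation above). The function name `f1_from_counts` and the counts are hypothetical, chosen only for illustration; they are not taken from this project's results.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for one 'positive' class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 racist tweets caught (TPs), 20 innocuous/sexist
# tweets wrongly flagged as racist (FPs), 40 racist tweets missed (FNs).
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
# precision=0.800, recall=0.667, f1=0.727
```

Because F1 is the harmonic mean of the two, it sits closer to the weaker of precision and recall, so a model cannot score well by minimizing only one kind of error.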

# Importing Libraries

In [1]:
# Standard libraries
import numpy as np
import pandas as pd

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For NLP data cleaning and preprocessing
import re, string, nltk, itertools
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Pickle to save model
import pickle

# For NLP Machine Learning processes
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Pipeline
from imblearn.pipeline import Pipeline

# Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# Random Forest
from sklearn.ensemble import RandomForestClassifier

# Support Vector Machine
from sklearn.svm import SVC

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Transformers library for BERT
import transformers
from transformers import BertModel
from transformers import BertTokenizer
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW  # the AdamW class in transformers is deprecated

# Evaluation Metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# RocCurveDisplay and ConfusionMatrixDisplay replace them
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Changing display settings
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 100)

# Importing Dataset

In [19]:
data = pd.read_csv('D:/General Assembly/GA-projects/Capstone - Revisited/data/data_v1.csv')

In [20]:
data.head()

| Unnamed: 0 | tag | set | text |
|---|---|---|---|
| 0 | 2 | train | way insult direct man unflattering hat worn predominantly men meant ins… |
| 1 | 1 | train | ordinary muslim idiot person like know make sure qur’an muslim nothing claim jihad come back tal... |
| 2 | 1 | train | give buildup sweeden government idiotu behave like allrounder see reach god know many fake news ... |
| 3 | 0 | train | sure pot cooked hot mkr killerblondes abarmezh86 |
| 4 | 1 | train | christian part palestinian kill driven palestinian muslim |


# General Preprocessing of Test Dataset

In [21]:
# Splitting the dataset into train and test sets using the 'set' column
train = data.loc[data['set'] == 'train']
test = data.loc[data['set'] == 'test']

In [22]:
# Looking for missing values
test.isna().sum()

tag       0
set       0
text    228
dtype: int64

In [23]:
# Dropping rows with missing values
test = test.dropna()
test.isna().sum()

tag     0
set     0
text    0
dtype: int64

In [24]:
# Splitting test dataset into X and y columns
X = test['text']
y = test['tag']

# Loading Trained Models

The following models are loaded:
1. Multinomial Naive Bayes using CountVectorizer
2. Random Forest using TfidfVectorizer
3. Support Vector Machine using CountVectorizer
4. BERT model

For each of the first three models, the vectorizer listed is the one that gave the higher validation score of the two vectorizers tried (CountVectorizer and TfidfVectorizer).

In [43]:
# Loading cvec Multinomial NB model
with open("D:/General Assembly/GA-projects/Capstone - Revisited/data/cvec_multinomial.pkl", "rb") as f:
    multi_nb_model = pickle.load(f)

# Loading tvec Random Forest model
with open("D:/General Assembly/GA-projects/Capstone - Revisited/data/tvec_rf.pkl", "rb") as f:
    rf_model = pickle.load(f)

# Loading cvec SVM model
with open("D:/General Assembly/GA-projects/Capstone - Revisited/data/cvec_svm.pkl", "rb") as f:
    svm_model = pickle.load(f)

# Retrieving Test Score

### Multinomial Naive Bayes (CountVectorizer)

In [25]:
# Predicting tags on test set
multi_nb_pred = multi_nb_model.predict(X)

In [26]:
# Showing confusion matrix
confusion_matrix(y, multi_nb_pred)

array([[9281, 5530, 2225],
       [  64,  869,   12],
       [ 579,  209,  937]], dtype=int64)

In [27]:
# Calculating F1-score
multi_nb_f1 = f1_score(y, multi_nb_pred, average='weighted')
multi_nb_f1

0.6397351961194994
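
To make the `average='weighted'` setting concrete, the sketch below recomputes the score from the confusion matrix printed above: each class gets its own F1, and the per-class scores are averaged with weights proportional to each class's number of true instances. The helper `weighted_f1_from_confusion` is a hypothetical name written in plain Python for illustration; it is not part of scikit-learn.

```python
def weighted_f1_from_confusion(cm):
    """Support-weighted mean of per-class F1; rows = true, columns = predicted."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # true instances of class k
        predicted = sum(cm[i][k] for i in range(n))  # predictions of class k
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (support + predicted)
        f1_k = 2 * tp / (support + predicted) if (support + predicted) else 0.0
        score += f1_k * support / total
    return score


# Confusion matrix of the Multinomial Naive Bayes model on the test set
nb_cm = [[9281, 5530, 2225],
         [64, 869, 12],
         [579, 209, 937]]
print(round(weighted_f1_from_confusion(nb_cm), 4))  # 0.6397
```

The result agrees with the `f1_score` output above. Since class 0 accounts for 17,036 of the 19,706 test tweets, its per-class F1 dominates the weighted average.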

### Random Forest (TF-IDF Vectorizer)

In [28]:
# Predicting tags on test set
rf_pred = rf_model.predict(X)

In [29]:
# Showing confusion matrix
confusion_matrix(y, rf_pred)

array([[15252,   681,  1103],
       [  125,   800,    20],
       [  533,    14,  1178]], dtype=int64)

In [30]:
# Calculating F1-score
rf_f1 = f1_score(y, rf_pred, average='weighted')
rf_f1

0.8831018640291008

### Support Vector Machine (CountVectorizer)

In [31]:
# Predicting tags on test set
svm_pred = svm_model.predict(X)

In [32]:
# Showing confusion matrix
confusion_matrix(y, svm_pred)

array([[15347,   607,  1082],
       [  208,   722,    15],
       [  550,    14,  1161]], dtype=int64)

In [33]:
# Calculating F1-score
svm_f1 = f1_score(y, svm_pred, average='weighted')
svm_f1

0.8819736691263095

### BERT

Unlike the other models, BERT requires its own preprocessing of the test data.

In [35]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [36]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [37]:
# Tokenize test tweets
encoded_tweets = [tokenizer.encode(sent, add_special_tokens=True) for sent in X]

# Find the longest tokenized tweet
max_len = max([len(sent) for sent in encoded_tweets])
print('Max length: ', max_len)

Max length:  114


In [38]:
# Setting max length to the longest text in test set
MAX_LEN = max_len

In [39]:
# Defining a custom tokenizer function using the loaded tokenizer.
def bert_tokenizer(data):
    input_ids = []
    attention_masks = []
    for sent in data:
        encoded_sent = tokenizer.encode_plus(
            text=sent,
            add_special_tokens=True,        # Add `[CLS]` and `[SEP]` special tokens
            max_length=MAX_LEN,             # Max length to truncate/pad to
            padding='max_length',           # Pad sentence to max length (replaces deprecated `pad_to_max_length`)
            truncation=True,                # Truncate sentences longer than max length
            return_attention_mask=True      # Return attention mask
            )
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks
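
For intuition, the padding and attention-mask logic that the tokenizer applies can be sketched in plain Python. The helper `pad_with_mask` and the token IDs below are made up for illustration (the real IDs come from the BERT vocabulary); a pad ID of 0 is assumed, which is the `[PAD]` ID in `bert-base-uncased`.

```python
def pad_with_mask(token_ids, max_len, pad_id=0):
    """Truncate/pad a list of token IDs to max_len and build its attention mask.

    Simplification: the real tokenizer keeps the final [SEP] when truncating;
    this sketch just cuts the list.
    """
    ids = token_ids[:max_len]
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask


# Hypothetical encoded tweet: 101 = [CLS], 102 = [SEP]; the IDs in between are made up
input_ids, attention_mask = pad_with_mask([101, 2054, 2003, 102], max_len=6)
print(input_ids)       # [101, 2054, 2003, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Padding every tweet to the same length lets the inputs be stacked into one tensor, while the mask tells BERT to ignore the padded positions.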

In [41]:
test_inputs, test_masks = bert_tokenizer(X)

In [44]:
y = y.to_numpy()

test_labels = torch.from_numpy(y)

In [45]:
# Create the DataLoader for our test set
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=8)

In [48]:
# Creating BERT class
class Bert_Classifier(nn.Module):
    def __init__(self, freeze_bert=False):
        super(Bert_Classifier, self).__init__()
        # Specify hidden size of BERT, hidden size of the classifier, and number of labels
        n_input = 768
        n_hidden = 50
        # 3 n_output because there are 3 categories of tweets ('none': 0, 'sexism': 1, 'racism': 2)
        n_output = 3
        # Instantiate BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Add dense layers to perform the classification
        self.classifier = nn.Sequential(
            nn.Linear(n_input,  n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_output)
        )
        # Optionally freeze BERT's weights so that only the classifier head is trained
        # (freezing instead of fine-tuning BERT usually leads to worse results)
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        # Feed input data to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

In [47]:
# Creating function for BERT to predict
def bert_predict(model, test_dataloader):
    
    # Define empty list to host the predictions
    preds_list = []
    
    # Put the model into evaluation mode
    model.eval()
    
    for batch in test_dataloader:
        batch_input_ids, batch_attention_mask = tuple(t.to(device) for t in batch)[:2]
        
        # Avoid gradient calculation of tensors by using "no_grad()" method
        with torch.no_grad():
            logit = model(batch_input_ids, batch_attention_mask)
        
        # Get index of highest logit
        pred = torch.argmax(logit,dim=1).cpu().numpy()
        # Append predicted class to list
        preds_list.extend(pred)

    return preds_list
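
The prediction step picks, for each tweet, the class whose logit is largest. A dependency-free sketch of that step follows, using made-up logits for three tweets; the helper name `argmax_rows` is hypothetical and stands in for `torch.argmax(logit, dim=1)`.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up logits for a batch of three tweets over the three classes (0, 1, 2)
batch_logits = [[2.1, -0.3, 0.4],
                [-1.0, 0.2, 3.5],
                [0.1, 1.7, -0.8]]
print(argmax_rows(batch_logits))  # [0, 2, 1]
```

No softmax is needed for prediction, because softmax preserves the ordering of the logits.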

In [49]:
# Importing BERT model
with open("D:/General Assembly/GA-projects/Capstone - Revisited/data/bert50.pkl", "rb") as f:
    bert50_model = pickle.load(f)

In [50]:
# Predicting tags on test set
bert50_pred = bert_predict(bert50_model, test_dataloader)

In [51]:
# Showing confusion matrix
confusion_matrix(y, bert50_pred)

array([[14341,   948,  1747],
       [   90,   844,    11],
       [  387,    24,  1314]], dtype=int64)

In [52]:
# Calculating F1-score
bert_f1 = f1_score(y, bert50_pred, average='weighted')
bert_f1

0.8556958214702586
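
Collecting the four weighted F1-scores reported above makes the comparison easier to read. The sketch below simply ranks them; the dictionary keys are labels chosen here for readability.

```python
# Weighted F1-scores on the test set, copied from the cell outputs above
test_f1 = {
    "Multinomial NB (CountVectorizer)": 0.6397351961194994,
    "Random Forest (TF-IDF)": 0.8831018640291008,
    "SVM (CountVectorizer)": 0.8819736691263095,
    "BERT": 0.8556958214702586,
}

# Rank models from highest to lowest weighted F1
ranking = sorted(test_f1.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<34}{score:.4f}")
```

On this test set the Random Forest and SVM models are nearly tied (0.883 vs 0.882), with BERT slightly behind and Multinomial Naive Bayes well below the rest.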