# Background

Twitter is a micro-blogging social media platform with 217.5 million daily active users globally. With 500 million new tweets (posts) daily, the topics of these tweets vary widely: k-pop, politics, financial news… you name it! Individuals use it for news, entertainment, and discussions, while corporations use it as a marketing tool to reach a wide audience. The freedom Twitter accords its users can provide a conducive environment for productive discourse, but this freedom can also be abused, manifesting in forms such as racism and sexism.

# Problem Statement

With Twitter’s significant income stream coming from advertisers, it is imperative that Twitter keeps a substantial user base. At the same time, Twitter should maintain a safe space for users and provide some level of checks on the tweets users put out into the public space. The first step is to identify tweets that espouse racist or sexist ideologies. Twitter can then direct their authors to appropriate sources of information where they can learn more about the community they offended or about their subconscious biases, so that they become more aware of their racist/sexist tendencies. To balance these two goals, Twitter has to be accurate both in separating inappropriate tweets from innocuous ones and in identifying the kind of inappropriateness of each flagged tweet (tag: racist or sexist).

The F1-score will be the primary metric because it combines precision and recall, which account for false positives (FPs) and false negatives (FNs) respectively, and because it is a popular metric for imbalanced data, as is the case with the dataset used.

For the purpose of explanation, racist tweets are used as the ‘positive’ case.

In this context, FPs are cases where the model erroneously flags a tweet as racist when it is actually innocuous or sexist. FNs are cases where the model erroneously flags a tweet as innocuous or sexist when it is actually racist.

Both kinds of error lower the F1-score, so higher F1-scores are preferred.
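To make these definitions concrete, the sketch below computes precision, recall and F1 by hand with racist tweets as the positive case. The labels are made-up toy data rather than rows from the dataset, and `binary_scores` is a hypothetical helper name used only for illustration.

```python
def binary_scores(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (assumption: illustrative only, not taken from the dataset)
y_true = ["racist", "racist", "racist", "none", "none", "sexist", "none", "sexist"]
y_pred = ["racist", "racist", "none", "racist", "none", "sexist", "none", "none"]

precision, recall, f1 = binary_scores(y_true, y_pred, positive="racist")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here one racist tweet is missed (an FN) and one innocuous tweet is wrongly flagged (an FP), so precision and recall both drop below 1 and pull the F1-score down with them.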

# Importing Libraries

In [63]:
# Standard libraries
import numpy as np
import pandas as pd

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For NLP data cleaning and preprocessing
import re, string, nltk, itertools
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Pickle to save model
import pickle

# For NLP Machine Learning processes
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Pipeline
from imblearn.pipeline import Pipeline

# Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# Random Forest
from sklearn.ensemble import RandomForestClassifier

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier, plot_importance

# Support Vector Machine
from sklearn.svm import SVC

# PyTorch LSTM
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Tokenization for LSTM
from collections import Counter
from gensim.models import Word2Vec

# Transformers library for BERT
import transformers
from transformers import BertModel
from transformers import BertTokenizer
from transformers import AdamW, get_linear_schedule_with_warmup

# Evaluation Metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score, roc_curve, plot_roc_curve, RocCurveDisplay
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, plot_confusion_matrix

In [64]:
# Setting seed for reproducibility
import random

seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)

In [65]:
# Changing display settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

# Importing Dataset

In [66]:
twitter = pd.read_csv('../Capstone/data/twitter_char_4_gram_lemm_text.csv')

In [67]:
twitter.columns

Index(['Annotation', 'Text_lemm_char_4_gram'], dtype='object')

In [68]:
twitter.head()

|   | Annotation | Text_lemm_char_4_gram |
|---|---|---|
| 0 | 0 | read cont onte ntex text extn xtno chan hang ange mean eani anin ning hist isto stor tory isla slam lami amic slav lave aver very |
| 1 | 0 | idio diot clai laim peop eopl ople stop beco ecom come terr erro rror rori oris rist make terr erro rror rori oris rist isla slam lami amic mica ical call ally brai rain dead |
| 2 | 1 | call sexi exis xist auto plac lace woul ould rath athe ther talk |
| 3 | 2 | wron rong foll ollo llow exam xamp ampl mple moha oham hamm amme mmed qura uran exac xact actl ctly |
| 4 | 0 | saud audi prea reac each ache cher tort ortu rtur ture five year earo arol rold daug augh ught ghte hter deat eath rele elea leas ease |
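The `Text_lemm_char_4_gram` column appears to hold overlapping 4-character windows taken within each lemmatized word. The preprocessing code is not part of this notebook, so the sketch below is only an assumption about how such a column could be produced. `char_ngrams` is a hypothetical helper, and dropping words shorter than four characters is a guess.

```python
def char_ngrams(text, n=4):
    """Split each whitespace-separated word into overlapping character n-grams."""
    grams = []
    for word in text.split():
        if len(word) < n:
            # Assumption: words shorter than n characters are dropped
            continue
        # Slide a window of width n across the word, one character at a time
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return " ".join(grams)


example = char_ngrams("history islamic")
print(example)
```

Because the n-grams are already separated by spaces, a word-level `CountVectorizer` with `ngram_range=(1, 1)` can count them directly as tokens.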


# Train/Test Split

In [69]:
# Creating the X and y columns from the character 4-gram (lemmatized text) dataset
X, y = twitter['Text_lemm_char_4_gram'], twitter['Annotation']

In [70]:
# Conducting a stratified train/validation split (default test_size of 0.25)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify = y, random_state = seed_value)
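Passing `stratify = y` keeps the class proportions roughly equal in the training and validation sets, which matters when some classes are rare. The pure-Python sketch below illustrates the idea on made-up labels; it is not scikit-learn's implementation, and `stratified_split` and the class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Split indices so each class contributes the same fraction to validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize within each class
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Made-up imbalanced labels: 80 of class 0, 12 of class 1, 8 of class 2
labels = [0] * 80 + [1] * 12 + [2] * 8
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```

Without stratification, a random split of such data could leave the validation set with very few examples of the rarest class, making per-class scores unstable.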

# Random Forest Model

### Random Forest Hyperparameter Tuning

In [22]:
# Set up a pipeline:
# 1. Instantiating CountVectorizer
# 2. SMOTE sampling - due to imbalance of classes
# 3. Random Forest

pipe_rf = Pipeline([
        ('cvec', CountVectorizer()),
        ('sampling', SMOTE()),
        ('rf', RandomForestClassifier())
    ])
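SMOTE balances the classes by creating synthetic minority samples: each new point lies on the line segment between a minority sample and one of its nearest minority-class neighbors. Because the sampler sits inside an imblearn `Pipeline`, it is applied only while fitting, not while predicting. The sketch below shows just the interpolation step on made-up count vectors; `smote_like_sample` is a hypothetical helper, and the neighbor search that the real algorithm performs is omitted.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point between `x` and one of its neighbors."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Made-up token-count vectors for two minority-class tweets (assumption)
x = [2.0, 0.0, 1.0]
neighbor = [4.0, 1.0, 1.0]

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

One consequence for count features is that synthetic rows generally contain non-integer values, since they are blends of two real count vectors.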

In [25]:
# Set up hyperparameters tuning                    
pipe_rf_params = {
    'cvec__max_features': [2_000, 3_000, 4_000],       
    'cvec__min_df': [3, 4, 5],                           
    'cvec__max_df': [0.5, 0.6, 0.7],                           
    'cvec__ngram_range': [(1,1)],
    'rf__n_estimators': [50, 70, 90],
    # To match the metric used for the BERT model
    'rf__criterion': ['entropy']
}

### GridSearch CV

In [26]:
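# Instantiate GridSearchCV.
# No `scoring` argument is passed, so candidates are ranked by the
# classifier's default score (accuracy), not by F1.
gs_rf = GridSearchCV(pipe_rf,
                     param_grid = pipe_rf_params, # Parameter values we are searching over
                     cv = 5, # 5-fold cross-validation
                     n_jobs = -1) # Use all available CPU cores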
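The cost of this search follows directly from the size of the grid: every combination of parameter values is fitted once per cross-validation fold. The short sketch below counts the combinations for the grid defined above, copying the dictionary so the block is self-contained.

```python
from itertools import product

# Same grid as pipe_rf_params above, repeated so this block runs on its own
pipe_rf_params = {
    'cvec__max_features': [2_000, 3_000, 4_000],
    'cvec__min_df': [3, 4, 5],
    'cvec__max_df': [0.5, 0.6, 0.7],
    'cvec__ngram_range': [(1, 1)],
    'rf__n_estimators': [50, 70, 90],
    'rf__criterion': ['entropy'],
}

n_folds = 5
n_candidates = len(list(product(*pipe_rf_params.values())))
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV also refits the best candidate once on the full training set after the search completes.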

In [29]:
%%time
# Fitting Random Forest model
gs_rf.fit(X_train, y_train)

CPU times: total: 46 s
Wall time: 57min 55s


In [30]:
# Making predictions
rf_predictions = gs_rf.predict(X_val)

In [31]:
# Calculating F1 score of Random Forest model
rf_f1_score = f1_score(y_val, rf_predictions, average = 'weighted')

In [32]:
rf_f1_score

0.7286691209604766

The tuned Random Forest model has a weighted F1 score of 0.729 on the validation set.
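The score is computed with `average = 'weighted'`, which takes the F1 score of each class and averages them using each class's share of the true labels as the weight. The sketch below reproduces that calculation by hand on made-up labels; `f1_for_class` and `weighted_f1` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """One-vs-rest F1 for `label`, using F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        count / total * f1_for_class(y_true, y_pred, label)
        for label, count in support.items()
    )


# Toy labels (assumption: illustrative only, not taken from the dataset)
y_true = ["racist", "racist", "racist", "none", "none", "sexist", "none", "sexist"]
y_pred = ["racist", "racist", "none", "racist", "none", "sexist", "none", "none"]

score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.3f}")
```

Because the weights follow class frequency, the majority class has the largest influence on the weighted score; a macro average would weight all classes equally instead.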

### Saving Tuned Random Forest Model

In [33]:
# Saving (the context manager closes the file once the model is written)
with open("../Capstone/rf_model.pkl", "wb") as f:
    pickle.dump(gs_rf, f)

In [71]:
# Loading
with open("../Capstone/rf_model.pkl", "rb") as f:
    rf_model = pickle.load(f)

In [72]:
# Predicting tags on train set
rf_model_pred = rf_model.predict(X_train)

In [73]:
# Calculating f1-score on train set
rf_train_f1 = f1_score(y_train, rf_model_pred, average = 'weighted')

In [74]:
rf_train_f1

0.9808416305860602

Given the large difference between the F1 score on the train set (0.981) and the validation set (0.729), it can be observed that the Random Forest model overfits the training dataset.

# Support Vector Machine (SVM)

In [48]:
# Set up a pipeline:
# 1. Instantiating CountVectorizer
# 2. SMOTE sampling - due to imbalance of classes
# 3. SVM

svm_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('sampling', SMOTE(n_jobs = -1)),
        ('svc', SVC())
    ])

In [56]:
# Set up hyperparameters tuning
svm_params = {
    'sampling__sampling_strategy': ['minority'],
    'sampling__k_neighbors': [3, 5],
    # gamma is not used by the linear kernel, so it does not affect the fitted SVC
    'svc__gamma': ['auto', 'scale'],
    'svc__kernel': ['linear'],
    'svc__random_state': [42],
    'svc__probability': [True],
    'svc__shrinking': [True, False],
    'svc__class_weight': ['balanced'],
}

### GridSearch CV

In [57]:
# Instantiate GridSearchCV.
gs_svm = GridSearchCV(svm_pipe, 
                  param_grid = svm_params, # Parameters values we are searching over
                  cv=5, # 5-fold cross-validation
                  n_jobs = -1) 
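The SVM grid is much smaller than the Random Forest one, and part of it is redundant: `gamma` is a coefficient for the `rbf`, `poly` and `sigmoid` kernels only, so varying it with a linear kernel repeats the same SVC configuration. The sketch below counts the candidates with and without `svc__gamma`, copying the grid so the block is self-contained.

```python
from itertools import product

# Same grid as svm_params above, repeated so this block runs on its own
svm_params = {
    'sampling__sampling_strategy': ['minority'],
    'sampling__k_neighbors': [3, 5],
    'svc__gamma': ['auto', 'scale'],
    'svc__kernel': ['linear'],
    'svc__random_state': [42],
    'svc__probability': [True],
    'svc__shrinking': [True, False],
    'svc__class_weight': ['balanced'],
}

# Every combination GridSearchCV would try
all_candidates = [
    dict(zip(svm_params, values)) for values in product(*svm_params.values())
]

# Combinations that remain distinct once the unused gamma setting is ignored
distinct = {
    tuple(sorted((k, v) for k, v in cand.items() if k != 'svc__gamma'))
    for cand in all_candidates
}

n_folds = 5
print(f"{len(all_candidates)} candidates ({len(all_candidates) * n_folds} fits)")
print(f"{len(distinct)} distinct configurations ({len(distinct) * n_folds} fits)")
```

Dropping `svc__gamma` from the grid, or adding a non-linear kernel for it to act on, would avoid spending half of the search on duplicate configurations.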

In [58]:
%%time
# Fitting the SVM model
svm_model = gs_svm.fit(X_train, y_train)

CPU times: total: 4min 7s
Wall time: 53min 49s


In [59]:
# Making predictions
svm_predictions = svm_model.predict(X_val)

In [60]:
# Calculating F1 score of SVM model
svm_f1_score = f1_score(y_val, svm_predictions, average = 'weighted')

In [61]:
svm_f1_score

0.7496347637745211

The tuned SVM model has a weighted F1 score of 0.750 on the validation set.

### Saving SVM Model

In [62]:
# Saving (the context manager closes the file once the model is written)
with open("../Capstone/svm_model.pkl", "wb") as f:
    pickle.dump(svm_model, f)

In [76]:
# Loading
with open("../Capstone/svm_model.pkl", "rb") as f:
    svm_model = pickle.load(f)

In [77]:
# Predicting tags on train set
svm_model_pred = svm_model.predict(X_train)

In [78]:
# Calculating f1-score on train set
svm_train_f1 = f1_score(y_train, svm_model_pred, average = 'weighted')

In [79]:
svm_train_f1

0.9407568055256157

Given the large difference between the F1 score on the train set (0.941) and the validation set (0.750), it can be observed that the SVM model overfits the training dataset.
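The degree of overfitting in the two models can be compared through the gap between train and validation scores. The sketch below uses the weighted F1 scores copied from the cell outputs above; the `scores` dictionary layout is an illustrative choice.

```python
# Weighted F1 scores copied from the notebook outputs above
scores = {
    "Random Forest": {"train": 0.9808416305860602, "val": 0.7286691209604766},
    "SVM": {"train": 0.9407568055256157, "val": 0.7496347637745211},
}

# Train-validation gap per model: a larger gap suggests more overfitting
gaps = {name: s["train"] - s["val"] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: validation F1 = {scores[name]['val']:.3f}, gap = {gap:.3f}")
```

On these figures the SVM has both the higher validation score and the smaller gap, although both models still fit the training data far better than the validation data.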