
**Combating Hate Speech Using NLP and Machine Learning**



**Objective:** Use NLP and ML to build a model that identifies hate speech on Twitter.



**Problem Statement:** Twitter is one of the biggest platforms where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate.



You are a data scientist at Twitter, and you will help Twitter identify tweets containing hate speech and remove them from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and build a robust model.



**Domain:** Social media 



**Analysis to be done:** Clean up the tweets and build a classification model using NLP techniques, cleanup specific to tweet data, regularization, and hyperparameter tuning with stratified K-fold cross-validation to get the best model.



**Content:**



**Id:** identifier number of the tweet



**Label:** 0 (non-hate) / 1 (hate)



**Tweet**: the text in the tweet

**Date:** 15-11-22

# **Importing the required libraries**

In [None]:
import pandas as pd   # To work with data frames
import numpy as np    # advanced math library
import os, re         # os: helps in working with file paths
                      # re: library to work with regular expressions

import nltk                              # Suite of libraries for symbolic and statistical natural language processing
from nltk.tokenize import TweetTokenizer # Tokenizer designed for tweets
nltk.download('stopwords')               # Downloads the stop word lists
from nltk.corpus import stopwords        # Importing the stop words
from string import punctuation           # All ASCII punctuation characters

# For lemmatization
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

from collections import Counter          # From the collection module we are importing the class Counter 

from sklearn.model_selection import train_test_split # For dividing the data into train and test

from sklearn.linear_model import LogisticRegression # importing the required models 
from sklearn.ensemble import RandomForestClassifier # importing the random Forest model
from sklearn.metrics import classification_report, accuracy_score,confusion_matrix,matthews_corrcoef # For importing all the required metrics
from sklearn.model_selection import GridSearchCV, StratifiedKFold # GridSearchCV: grid search for the optimum parameters
                                                                  # StratifiedKFold: stratified K-fold cross-validation

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Importing the dataset**

In [None]:
from google.colab import files

uploaded = files.upload()

Saving Tweets_USA.csv to Tweets_USA (1).csv


In [None]:
inp_tweets0 = pd.read_csv("Tweets_USA.csv")

# **Exploring the dataset**

In [None]:
# Observing the first 10 observations
# There are three columns in the data:
# the first is the id, the second is the label (1 for hate speech, 0 for non-hate tweets),
# and the third is the tweet text
inp_tweets0.head(10) 

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


In [None]:
# Checking the proportion of tweets with and without hate speech.
# The data is highly imbalanced: only about 7% of the tweets are labelled as hate speech
inp_tweets0['label'].value_counts(normalize = True)

0    0.929854
1    0.070146
Name: label, dtype: float64
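With roughly 93% of the tweets labelled 0, accuracy alone can be misleading. As a small illustration (the toy label list and the helper name `majority_baseline` are made up for this sketch, not taken from the dataset), a classifier that always predicts the majority class already scores a high accuracy while catching no hate speech at all:

```python
from collections import Counter

def majority_baseline(labels, positive=1):
    """Accuracy and positive-class recall of always predicting the most common label."""
    majority = Counter(labels).most_common(1)[0][0]
    preds = [majority] * len(labels)
    accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
    positives = sum(y == positive for y in labels)
    true_pos = sum(y == positive and p == positive for y, p in zip(labels, preds))
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall

# Hypothetical labels that roughly mirror the 93/7 split seen above
toy_labels = [0] * 93 + [1] * 7
acc, rec = majority_baseline(toy_labels)
print(f"accuracy={acc:.2f}, hate-speech recall={rec:.2f}")
```

This is why the later sections look at recall and the F1 score of class 1 rather than accuracy alone.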

In [None]:
# Observing one of the tweets
# .sample() on the series selects one random tweet and returns a series
# .values converts that series to a numpy array
# Indexing the array with [0] (commented out here) would return the tweet string itself
inp_tweets0['tweet'].sample(random_state = 102).values#[0]

array(['you believe your miscegenation genocide will stop the "breeding" of black and white ppl off face of our eah, @user   @user'],
      dtype=object)

# **Data Cleaning**

# Step 1: Get the tweets into a list, for easy text clean up and manipulations

In [None]:
# To extract out tweet column alone, this would give a numpy array
tweets0 = inp_tweets0.tweet.values

In [None]:
# Number of tweets in the dataset
len(tweets0)

31962

In [None]:
# Observing the first five tweets
tweets0[:5]

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty',
       '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
       ' factsguide: society now    #motivation'], dtype=object)

# Step 2: Converting all the tweets to lower case. This avoids treating differently cased versions of the same word as separate terms

In [None]:
tweets_lower = [twt.lower() for twt in tweets0]

In [None]:
tweets_lower[7]

"the next school year is the year for exams.ð\x9f\x98¯ can't think about that ð\x9f\x98\xad #school #exams   #hate #imagine #actorslife #revolutionschool #girl"

# Step 3: Using the regular expression library to remove unnecessary characters, which would confuse our machine learning model.

1. Remove user handles, which begin with @

In [None]:
# Checking which regular expression removes the '@name_of_the_handle'
# A raw string (r"...") keeps \w from being read as a string escape
re.sub(r"@\w+","", "@Jose you are amazing!!, https//www.google.com")

' you are amazing!!, https//www.google.com'

In [None]:
# Using list comprehension to create a new list containing the tweets with no handle names
# Applying the RE on all the tweets
tweets_nouser = [re.sub(r"@\w+","",twt) for twt in tweets_lower]

In [None]:
tweets_nouser[:3]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty']

2. Remove URLs

In [None]:
# Checking which regular expression removes the URLs
re.sub(r"\w+://\S+","","@Jose you are amazing https://google.com")

'@Jose you are amazing '

In [None]:
# Now using the regular expression to remove all the URLs from the tweets
tweets_nourl = [re.sub(r"\w+://\S+","",twt) for twt in tweets_nouser]

In [None]:
# all the URLs in the tweets have been removed
tweets_nourl[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

3. Remove non-ASCII characters

In [None]:
re.sub(r'[^\x00-\x7f]',"",' â #ireland consumer price index (mom) climbed from previous 0.2% to 0.5% in may   #blog #silver #gold #forex')

'  #ireland consumer price index (mom) climbed from previous 0.2% to 0.5% in may   #blog #silver #gold #forex'

In [None]:
# Starting from tweets_nourl so that the URL removal from the previous step is kept
tweets_ASCII = [re.sub(r'[^\x00-\x7f]',"",twt) for twt in tweets_nourl]

In [None]:
tweets_ASCII[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in ur!!!   ',
 ' factsguide: society now    #motivation']

In [None]:
tweets_ASCII[28]

"happy father's day    "
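The three regex steps, together with the lower-casing from Step 2, can also be chained in one helper so that no step is skipped by accident. The following is a minimal sketch that reuses the same patterns as the cells above; the function name `clean_tweet` and the sample string are assumptions made for illustration.

```python
import re

HANDLE_RE = re.compile(r"@\w+")              # user handles such as @user
URL_RE = re.compile(r"\w+://\S+")            # URLs with an explicit scheme (http://, https://, ...)
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")   # anything outside 7-bit ASCII

def clean_tweet(text):
    """Lowercase a tweet, then strip handles, URLs and non-ASCII characters."""
    text = text.lower()
    text = HANDLE_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = NON_ASCII_RE.sub("", text)
    return text

sample = "@Jose you are AMAZING https://google.com ð"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

Applied to the whole dataset this would read `[clean_tweet(twt) for twt in tweets0]`. Leftover whitespace is not a concern because the tokenizer in the next step discards it.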

# Step 4: Applying word tokenizing on the tweets, for further cleaning.

Breaking a sentence into its parts allows a machine to understand the parts as well as the whole. This will help the program understand each of the words by themselves, as well as how they function in the larger text. This is especially important for larger amounts of text as it allows the machine to count the frequencies of certain words as well as where they frequently appear. This is important for later steps in natural language processing.

In [None]:
tkn = TweetTokenizer() # Creating an instance of the class

In [None]:
# This is how tokenization works
# We use the tokenize method of the class TweetTokenizer on the first tweet
print(tkn.tokenize(tweets_ASCII[0])) 

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [None]:
# Applying the same to the entire array of tweets
tweet_token = [tkn.tokenize(sent) for sent in tweets_ASCII]

In [None]:
# This is the last tweet in the dataset; each word in the tweet is now a separate element,
# which makes further operations such as stop word and punctuation removal easy to apply.
print(tweet_token[31961])

['thank', 'you', 'for', 'you', 'follow']


# Step 5: Remove punctuation, stop words and other redundant terms like 'rt', 'amp' and '#'

In [None]:
stop_nltk = stopwords.words("english") # Listing all the stop words in a list
stop_punct = list(punctuation)         # listing all the punctuation in a list

In [None]:
stop_punct

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [None]:
stop_punct.extend(['...','```',"''",'..']) # adding more expressions into the list of punctuations

In [None]:
stop_context = ['rt','amp']  # Adding a few Twitter-specific terms to the stop words

In [None]:
# Finally creating a list with all the expressions to be removed
stop_final = stop_nltk + stop_punct + stop_context 

**Function to**

**Remove stop words from a single tokenize sentence**

**remove # tags**

**remove terms with length = 1**

In [None]:
def del_stop(sent):
    # Keep terms that are not stop terms and are longer than one character, then strip '#'
    return [re.sub("#","",term) for term in sent if ((term not in stop_final) and (len(term)>1))]

In [None]:
# Trying this function on one of the tweets to check that it works properly
del_stop(tweet_token[4])

['factsguide', 'society', 'motivation']
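Membership tests against a list scan every element, so on a large corpus it can help to keep the stop terms in a set. The sketch below shows the same filtering logic with a set; the tiny stop list and the name `del_stop_fast` are hypothetical and stand in for `stop_final` and `del_stop`.

```python
from string import punctuation

# Hypothetical, tiny stop list for illustration; the notebook uses NLTK's list instead
toy_stop_words = {"the", "is", "a", "rt", "amp"} | set(punctuation)

def del_stop_fast(tokens, stop_set):
    """Drop stop terms and single-character tokens, then strip '#' from what remains."""
    return [tok.replace("#", "") for tok in tokens
            if tok not in stop_set and len(tok) > 1]

tokens = ["rt", "the", "weather", "is", "great", "!", "#sunny", "a"]
result = del_stop_fast(tokens, toy_stop_words)
print(result)
```

With the notebook's own data, `set(stop_final)` could be passed as `stop_set` to get the same result as `del_stop`.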

In [None]:
# Applying the function on all tweets in the list
tweets_clean = [del_stop(tweet) for tweet in tweet_token]

In [None]:
# Checking another cleaned tweet to confirm that the function works
tweets_clean[5]

['2/2',
 'huge',
 'fan',
 'fare',
 'big',
 'talking',
 'leave',
 'chaos',
 'pay',
 'disputes',
 'get',
 'allshowandnogo']

# Step 6: Applying lemmatization to the tweets

Lemmatization is a text normalization technique used in Natural Language Processing (NLP) that reduces a word to its base (dictionary) form, called the lemma. It groups the different inflected forms of a word, which share the same meaning, under that single form.

In [None]:
lemmatizer = WordNetLemmatizer()  # Creating an instance of the class

In [None]:
# Creating a function to apply lemmatization to the tweets
def apply_lemmetization(sent):
    return [lemmatizer.lemmatize(term) for term in sent]  

In [None]:
tweets_clean[5]

['2/2',
 'huge',
 'fan',
 'fare',
 'big',
 'talking',
 'leave',
 'chaos',
 'pay',
 'disputes',
 'get',
 'allshowandnogo']

To check whether the function works, we apply it to one of the tweets.

**Note:** The word 'disputes' has been replaced with 'dispute'.

In [None]:
apply_lemmetization(tweets_clean[5])

['2/2',
 'huge',
 'fan',
 'fare',
 'big',
 'talking',
 'leave',
 'chaos',
 'pay',
 'dispute',
 'get',
 'allshowandnogo']

In [None]:
# Applying it to the whole dataset
tweets_clean = [apply_lemmetization(sent) for sent in tweets_clean]

# To check out the top terms in the tweets

In [None]:
# Creating a new list and adding each of the words in all the tweets to a list term_list
# Here, we use .extend(), and not append, check the below cell to understand the reason
term_list = []
for tweet in tweets_clean:
  term_list.extend(tweet)

In [None]:
term_list[:15]

['father',
 'dysfunctional',
 'selfish',
 'drag',
 'kid',
 'dysfunction',
 'run',
 'thanks',
 'lyft',
 'credit',
 "can't",
 'use',
 'cause',
 'offer',
 'wheelchair']

In [None]:
lis = [['a'],['b'],['c']]
ls = []
lss = []
# Using append
for x in lis:
  ls.append(x)
print(ls)

# Using extend
for x in lis:
  lss.extend(x)
print(lss)

[['a'], ['b'], ['c']]
['a', 'b', 'c']
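The same flattening can also be written without an explicit loop. As a small sketch using only the standard library, `itertools.chain.from_iterable` joins the inner lists in order (the `nested` list here is a made-up example):

```python
from itertools import chain

# Made-up nested list, including an empty inner list
nested = [["a"], ["b", "c"], []]

# chain.from_iterable walks each inner list in turn and yields its elements
flat = list(chain.from_iterable(nested))
print(flat)
```

For the tweets this would be `list(chain.from_iterable(tweets_clean))`, which gives the same list as the `extend` loop.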


In [None]:
# This creates a new instance of the class Counter. It works like a dictionary, with the words as keys and their counts in the list as values
res = Counter(term_list) 
print(len(res))
print(type(res))

38345
<class 'collections.Counter'>


In [None]:
# Finding out the top 10 words which have been used in all the tweets combined
res.most_common(10)

[('love', 2863),
 ('day', 2808),
 ('happy', 1696),
 ('time', 1252),
 ('life', 1244),
 ('like', 1090),
 ('today', 1037),
 ("i'm", 1017),
 ('get', 999),
 ('new', 998)]
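The overall top terms are dominated by the much larger non-hate class. To see which terms are frequent within each class, the same `Counter` approach can be applied per label. This is only a sketch: the helper `top_terms_by_label` and the toy data are assumptions, and in the notebook the inputs would be the token lists in `tweets_clean` and the values of `inp_tweets0['label']`.

```python
from collections import Counter

def top_terms_by_label(token_lists, labels, n=3):
    """Return the n most common terms for each label."""
    counters = {}
    for tokens, label in zip(token_lists, labels):
        counters.setdefault(label, Counter()).update(tokens)
    return {label: counter.most_common(n) for label, counter in counters.items()}

# Hypothetical tokenized tweets and labels
toy_tokens = [["love", "day"], ["love", "happy"], ["love", "day"],
              ["angry", "rant"], ["angry"]]
toy_labels = [0, 0, 0, 1, 1]

top = top_terms_by_label(toy_tokens, toy_labels, n=2)
print(top)
```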

# Step 7: Converting the tokenized words back into sentences
Joining the tokens back into strings

In [None]:
tweets_clean[30000]

['never', 'msg', 'first', 'dun', 'msg', 'first', 'disappointed']

In [None]:
# Previously each tweet was a list of tokens (a list within a list); now tweets_clean is a flat list in which each element is one tweet as a string (without stop words or punctuation)
tweets_clean = [" ".join(tweet) for tweet in tweets_clean]

In [None]:
tweets_clean[30000]

'never msg first dun msg first disappointed'

In [None]:
len(tweets_clean)

31962

# **Separate X and Y and perform train test split, 70-30**
Now after the data cleaning we divide the data into test and train.

In [None]:
len(inp_tweets0['label'])

31962

In [None]:
X = tweets_clean  # List containing all the tweets
y = inp_tweets0.label.values # A numpy array containing the true labels

In [None]:
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)
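Because the classes are heavily imbalanced, one option is to pass `stratify` to `train_test_split` so that both parts keep the same share of hate tweets. The sketch below uses made-up toy arrays (`X_toy`, `y_toy`) rather than the notebook's data, and the results reported in this notebook were produced without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 10% positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the class proportions equal in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```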

## **Document term matrix using TfIdf**
Textual data must be converted to numerical data before it can be fed to a model; this is called vectorization. There are various vectorization techniques, such as:

- One-hot Encoding (OHE)
- Count Vectorizer
- Bag-of-Words (BOW)
- N-grams
- Term Frequency-Inverse Document Frequency (TF-IDF)

Here, we use TF-IDF, which is the product of two measures, the term frequency of term t in document d and the inverse document frequency of t in the corpus D:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
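To make the product concrete, here is a hand-rolled sketch of the textbook definition on a made-up three-document corpus. It assumes tf is the relative frequency of the term in the document and idf is the natural log of (number of documents / number of documents containing the term). Note that scikit-learn's `TfidfVectorizer` uses a slightly different variant by default (raw counts for tf, a smoothed idf, and L2 normalization of each row), so its numbers will not match this sketch exactly.

```python
import math

def tf(term, doc_tokens):
    """Relative frequency of term in one tokenized document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log of (number of documents / documents containing term); term must occur at least once."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical tokenized corpus
corpus = [["love", "day"], ["happy", "day"], ["hate", "rant"]]

score_love = tfidf("love", corpus[0], corpus)   # 'love' appears in one document
score_day = tfidf("day", corpus[0], corpus)     # 'day' appears in two documents
print(score_love, score_day)
```

The term that appears in fewer documents ('love') gets the higher weight, which is the behaviour TF-IDF is chosen for.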

In [None]:
# Create a document term matrix using the TF-IDF vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # Using the tfidf for vectorization

In [None]:
vectorizer = TfidfVectorizer(max_features = 7000) # Keep only the 7000 most frequent terms across the corpus instead of every unique word

In [None]:
len(X_train), len(X_test) # This is the length of the train and test dataset

(22373, 9589)

In [None]:
# Vectorizing the X_train data and X_test_data
X_train_bow = vectorizer.fit_transform(X_train) # Applying fit and transform
X_test_bow = vectorizer.transform(X_test)       # Applying transform only

In [None]:
X_train_bow.shape, X_test_bow.shape   

((22373, 7000), (9589, 7000))

# **Model building**

Now since we have numeric data, we can fit our model

**MODEL 1:** Giving equal weights to all the samples

In [None]:
logreg = LogisticRegression() # Creating an instance of the model

In [None]:
logreg.fit(X_train_bow, y_train) # Fitting the model

LogisticRegression()

In [None]:
y_train_pred = logreg.predict(X_train_bow) # Predicting the y_train

In [None]:
y_test_pred = logreg.predict(X_test_bow)   # Predicting the y_test

In [None]:
# Printing the confusion matrix
print(confusion_matrix(y_test,y_test_pred)) 

[[8877   28]
 [ 466  218]]


In [None]:
# Accuracy of the model on the training set
accuracy_score(y_train,y_train_pred) 

0.9563759889152103

In [None]:
# Accuracy of the model on the test set
accuracy_score(y_test,y_test_pred) 

0.9484826363541558

In [None]:
# Printing the classification report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      8905
           1       0.89      0.32      0.47       684

    accuracy                           0.95      9589
   macro avg       0.92      0.66      0.72      9589
weighted avg       0.95      0.95      0.94      9589



**Interpretation:** Although the accuracy is high (about 95%), the model misses most of the hate speech: 466 of the 684 hate tweets in the test set are classified as non-hate (false negatives), while only 28 non-hate tweets are flagged as hate (false positives). This is evident from the low recall (sensitivity) of 0.32 and the F1 score of 0.47 for class 1.
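The class 1 figures in the report can be re-derived by hand from the test confusion matrix of Model 1. This is a small sketch assuming the `[[TN, FP], [FN, TP]]` layout that `confusion_matrix` returns for labels 0 and 1; the helper name `binary_metrics` is made up for illustration.

```python
def binary_metrics(cm):
    """Precision, recall and F1 of class 1 from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Test-set confusion matrix of Model 1, copied from the output above
cm_model1 = [[8877, 28], [466, 218]]
precision, recall, f1 = binary_metrics(cm_model1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The 466 false negatives in the recall denominator are what pull recall, and with it the F1 score, down.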

# **MODEL 2:** Having unequal weights for the samples

Here, new weights are assigned as:
 
 weight_class j = total_records / (No. of classes * no. of samples of class j)
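As a worked example of this formula, the sketch below plugs in the training class counts, read off the rows of the training confusion matrix reported for this model (19917 + 898 non-hate tweets and 33 + 1525 hate tweets). The helper name `balanced_class_weights` is an assumption for illustration.

```python
def balanced_class_weights(class_counts):
    """weight_j = total_records / (number_of_classes * samples_in_class_j)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}

# Training-set class counts taken from the rows of the training confusion matrix
train_counts = {0: 19917 + 898, 1: 33 + 1525}
weights = balanced_class_weights(train_counts)
print(weights)
```

Each hate tweet therefore counts roughly 13 times as much as a non-hate tweet when the model is fitted.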

In [None]:
logreg = LogisticRegression(class_weight = "balanced") # Creating an instance of the model

In [None]:
logreg.fit(X_train_bow, y_train) # Fitting the model

LogisticRegression(class_weight='balanced')

In [None]:
y_train_pred = logreg.predict(X_train_bow) # Predicting the training set using the fitted model
y_test_pred = logreg.predict(X_test_bow)   # Predicting the test set using the fitted model

In [None]:
confusion_matrix(y_train,y_train_pred)  # Printing the confusion matrix

array([[19917,   898],
       [   33,  1525]])

In [None]:
accuracy_score(y_train, y_train_pred)

0.9583873418853082

In [None]:
accuracy_score(y_test,y_test_pred)

0.93617686932944

In [None]:
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.95      0.97      8905
           1       0.54      0.78      0.63       684

    accuracy                           0.94      9589
   macro avg       0.76      0.86      0.80      9589
weighted avg       0.95      0.94      0.94      9589



**Conclusion:** Giving a higher weight to the minority class improves the detection of hate speech: test recall for class 1 rises from 0.32 to 0.78, so there are more true positives and fewer false negatives. The cost is an increase in false positives, with precision falling from 0.89 to 0.54. Overall, the F1 score for class 1 improves by 0.16, from 0.47 to 0.63.

# **MODEL 3:** Tuning the inverse regularization parameter C (C = 1/lambda, so the lower the C, the stronger the regularization)

In [None]:
# Create the parameter grid of C values to search over
param_grid = {
    'C': [0.01,0.05,0.1,0.5,0.8]
}

In [None]:
classifier_lr = LogisticRegression(class_weight= "balanced") # Creating an instance of the model

In [None]:
# Initiating the grid search model
grid_search = GridSearchCV(estimator = classifier_lr,
                           param_grid = param_grid,
                           cv = StratifiedKFold(4),
                           n_jobs = -1, verbose = 1,
                           scoring = "recall")

In [None]:
grid_search.fit(X_train_bow,y_train) # Fitting the model

Fitting 4 folds for each of 5 candidates, totalling 20 fits


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.05, 0.1, 0.5, 0.8]}, scoring='recall',
             verbose=1)

In [None]:
grid_search.best_estimator_ # This gives the estimator refitted with the best parameters

LogisticRegression(C=0.5, class_weight='balanced')
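A common variation is to put the vectorizer and the classifier into one `Pipeline` and run the grid search over that, so the TF-IDF vocabulary is re-fitted on the training part of each fold only. The sketch below is an assumption-laden illustration, not what this notebook ran: the toy corpus, its labels and the shortened C grid are made up, and real use would pass the raw `X_train` strings and `y_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels
toy_texts = ["love this day", "happy day today", "great time life", "new day love",
             "hate rant angry", "angry hate speech", "so much hate", "angry rant again"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced")),
])

# Step parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 0.5]},
    cv=StratifiedKFold(n_splits=2),
    scoring="recall",
)
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```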

In [None]:
# Using the best estimator to make predictions on the train and test sets

In [None]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow) # Predicting the test dataset

In [None]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow) # Predicting the training dataset

In [None]:
accuracy_score(y_train, y_train_pred) # The accuracy of the model in predicting train dataset

0.9519957091136638

In [None]:
accuracy_score(y_test,y_test_pred) # The accuracy of the model in predicting test dataset

0.9338825737824591

In [None]:
confusion_matrix(y_test,y_test_pred) # The confusion Matrix

array([[8414,  491],
       [ 143,  541]])

In [None]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      8905
           1       0.52      0.79      0.63       684

    accuracy                           0.93      9589
   macro avg       0.75      0.87      0.80      9589
weighted avg       0.95      0.93      0.94      9589



**CONCLUSION:** Strengthening the L2 regularization brought no real improvement: the F1 score for class 1 stays at 0.63, and recall moves only from 0.78 to 0.79. The grid search, which optimizes recall, selected C = 0.5.

# **MODEL 4:** Random Forest

In [None]:
model = RandomForestClassifier(n_estimators = 500,class_weight = "balanced") # Creating an instance of the class

In [None]:
model.fit(X_train_bow,y_train) # Fitting the model

RandomForestClassifier(class_weight='balanced', n_estimators=500)

In [None]:
predict_train = model.predict(X_train_bow) # Predicting the train dataset

In [None]:
predict_test = model.predict(X_test_bow) # Predicting the test dataset

In [None]:
accuracy_score(y_train, predict_train) # The accuracy of the model in predicting the train dataset

0.9993295490099674

In [None]:
accuracy_score(y_test, predict_test) # The accuracy of the model in predicting the test dataset

0.9589112524767963

In [None]:
confusion_matrix(y_test, predict_test) # The confusion matrix

array([[8816,   89],
       [ 305,  379]])

In [None]:
print(classification_report(y_test, predict_test)) # Printing the classification report

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      8905
           1       0.81      0.55      0.66       684

    accuracy                           0.96      9589
   macro avg       0.89      0.77      0.82      9589
weighted avg       0.96      0.96      0.96      9589



**Conclusion:** Random Forest gives the highest class 1 F1 score so far (0.66, against 0.63 for the weighted logistic regression) and the highest test accuracy (0.96). It is more precise on hate speech (0.81) but recalls fewer hate tweets (0.55 against 0.79). Because the data is heavily imbalanced, the F1 score is a better measure of model performance than accuracy.

# Changing Threshold

In [None]:
# Based on the fitted model, predicting the probabilities
pred_prob = model.predict_proba(X_test_bow)

In [None]:
pred_prob[0][1] # Predicted probabilities of 1s

0.006

In [None]:
# Rather than using the default cut-off value of 0.5, choosing 0.425
pred_class = [1 if x[1] >= 0.425 else 0 for x in pred_prob]

In [None]:
len(pred_class)

9589
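The cut-off of 0.425 above is picked by hand. One way to choose it more systematically is to sweep a few candidate thresholds and keep the one with the best F1 score, ideally on a separate validation set rather than the test set. The sketch below is pure Python with made-up labels and probabilities; the helper names `f1_at_threshold` and `best_threshold` are assumptions for illustration.

```python
def f1_at_threshold(y_true, prob_pos, threshold):
    """F1 score of class 1 when predicting 1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in prob_pos]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, prob_pos, candidates):
    """Candidate threshold with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, prob_pos, t))

# Hypothetical validation labels and predicted probabilities for class 1
y_val = [0, 0, 0, 0, 1, 1, 1, 0]
p_val = [0.05, 0.10, 0.30, 0.45, 0.40, 0.60, 0.80, 0.20]
candidates = [0.3, 0.4, 0.5, 0.6]

chosen = best_threshold(y_val, p_val, candidates)
print(chosen)
```

With the notebook's model, `prob_pos` would be a column such as `model.predict_proba(...)[:, 1]` computed on held-out validation data.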

In [None]:
# New confusion Matrix
print(confusion_matrix(y_test,pred_class))

[[8752  153]
 [ 263  421]]


In [None]:
# Printing the classification report
print(classification_report(y_test,pred_class))

              precision    recall  f1-score   support

           0       0.97      0.98      0.98      8905
           1       0.73      0.62      0.67       684

    accuracy                           0.96      9589
   macro avg       0.85      0.80      0.82      9589
weighted avg       0.95      0.96      0.95      9589



In [None]:
matthews_corrcoef(y_test,pred_class)

0.6491373714056502
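The Matthews correlation coefficient can be checked by hand from the confusion matrix printed for the 0.425 threshold. It ranges from -1 to 1, with 0 corresponding to random predictions. This sketch assumes the `[[TN, FP], [FN, TP]]` layout; the helper name `mcc_from_confusion` is made up for illustration.

```python
import math

def mcc_from_confusion(cm):
    """Matthews correlation coefficient from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Confusion matrix obtained with the 0.425 threshold, copied from the output above
cm_threshold = [[8752, 153], [263, 421]]
mcc = mcc_from_confusion(cm_threshold)
print(round(mcc, 4))
```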

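**Conclusion:** Lowering the threshold to 0.425 improves the model slightly: class 1 recall rises from 0.55 to 0.62 and the F1 score from 0.66 to 0.67. The Matthews correlation coefficient of 0.65 is well above 0, the value expected for random predictions, so the predictions are clearly better than chance.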