#### Help Twitter Combat Hate Speech Using NLP and Machine Learning

Twitter is one of the largest platforms where anyone can have their views heard. Some of these voices spread hate and negativity, and Twitter is wary of its platform being used as a medium to spread hate.

The task here is to identify hate tweets using NLP techniques.

###### Analysis to be done:

* Clean up the tweets.
* Build a classification model using NLP.
* Apply cleanup steps specific to tweet data (handles, URLs, hashtags).
* Regularization and hyperparameter tuning using stratified k-fold.
* Cross-validation.


Content: 

* id: identifier number of the tweet
* label: 0 (non-hate) / 1 (hate)
* tweet: the text of the tweet

In [67]:
import pandas as pd
import numpy as np
import os, re

In [68]:
# Load the tweets file using read_csv function from Pandas package.

dt_tweets = pd.read_csv("TwitterHate.csv")
dt_tweets.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [69]:
# Check the distribution of the labels (1 or 0).

dt_tweets.label.value_counts()

0    29720
1     2242
Name: label, dtype: int64

In [70]:
dt_tweets.label.value_counts(normalize=True)

0    0.929854
1    0.070146
Name: label, dtype: float64

About 93% of the tweets are labeled 0 (non-hate); only about 7% are labeled 1 (hate).

In [71]:
print(dt_tweets.tweet)

0         @user when a father is dysfunctional and is s...
1        @user @user thanks for #lyft credit i can't us...
2                                      bihday your majesty
3        #model   i love u take with u all the time in ...
4                   factsguide: society now    #motivation
                               ...                        
31957    ate @user isz that youuu?ðððððð...
31958      to see nina turner on the airwaves trying to...
31959    listening to sad songs on a monday morning otw...
31960    @user #sikh #temple vandalised in in #calgary,...
31961                     thank you @user for you follow  
Name: tweet, Length: 31962, dtype: object


In [72]:
tweets = dt_tweets.tweet.values

In [73]:
print(tweets)

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked"
 '  bihday your majesty' ...
 'listening to sad songs on a monday morning otw to work is sad  '
 '@user #sikh #temple vandalised in in #calgary, #wso condemns  act  '
 'thank you @user for you follow  ']


###### Clean up the tweets in the following 8 steps:

* Normalize the case.
* Using regular expressions, remove user handles that begin with '@'.
* Using regular expressions, remove URLs.
* Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
* Remove stop words.
* Remove redundant terms like ‘amp’, ‘rt’, etc.
* Remove ‘#’ symbols from the tweet while retaining the term.
* Remove terms with a length of 1.
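
The eight steps above can be sketched as one function. This is a minimal, standard-library-only illustration under stated assumptions: a simple regex stands in for NLTK's TweetTokenizer, `STOP_WORDS` is a tiny hypothetical list rather than NLTK's English stop words, and the '#' is stripped before filtering. Its output will therefore not always match the notebook's cells.

```python
import re

# Hypothetical, tiny stop-word list for illustration; the notebook uses NLTK's list.
STOP_WORDS = {"a", "is", "and", "so", "he", "his", "into", "when", "the", "for", "i"}
REDUNDANT = {"rt", "amp"}

def clean_tweet(text):
    text = text.lower()                            # 1. normalize case
    text = re.sub(r"@\w+", "", text)               # 2. remove user handles
    text = re.sub(r"\w+://\S+", "", text)          # 3. remove URLs
    tokens = re.findall(r"#?\w+(?:'\w+)?", text)   # 4. naive tokenizer (stand-in for TweetTokenizer)
    tokens = [t.lstrip("#") for t in tokens]       # 7. drop '#', keep the term
    return [t for t in tokens
            if t not in STOP_WORDS                 # 5. stop words
            and t not in REDUNDANT                 # 6. redundant terms
            and len(t) > 1]                        # 8. single-character terms

sample = ("@user when a father is dysfunctional and is so selfish "
          "he drags his kids into his dysfunction.   #run")
print(clean_tweet(sample))
# ['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']
```

Keeping the steps in one function makes it easier to apply exactly the same cleanup to new tweets at prediction time.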

In [74]:
# Step 1: Normalize the case.

tweets_lower = [twt.lower() for twt in tweets]

In [76]:
# Step 2: Using regular expressions, remove user handles that begin with '@'.
# First test the regular expression on a sample tweet.

re.sub(r"@\w+","", "@NLP course rocks! http://johndoe.com/mk")

' course rocks! http://johndoe.com/mk'

In [77]:
# Apply the Twitter-handle regular expression to all tweets.

tweets_nouser = [re.sub(r"@\w+","", twt) for twt in tweets_lower]
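
One caveat about this pattern: it removes any '@' followed by word characters, including the middle of an email address. The sketch below compares it with a stricter variant that uses a lookbehind. The stricter pattern is an assumption for illustration, not something the notebook uses.

```python
import re

text = "mail john@example.com or ping @user"

# Pattern used in the notebook: removes every '@' followed by word characters.
loose = re.sub(r"@\w+", "", text)
print(repr(loose))   # 'mail john.com or ping '

# Hypothetical stricter variant: only match '@' that is not preceded by a word character.
strict = re.sub(r"(?<!\w)@\w+", "", text)
print(repr(strict))  # 'mail john@example.com or ping '
```

For this dataset the handles are already anonymized as `@user`, so the simpler pattern is probably sufficient.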

In [78]:
# Step 3: Test the regular expression that removes URLs.

re.sub(r"\w+://\S+","", "@John this course rocks! http://johndoe.com/mk")

'@John this course rocks! '

In [79]:
# Remove URLs from all tweets.
tweets_nourl = [re.sub(r"\w+://\S+","", twt) for twt in tweets_nouser]

In [80]:
# Step 4: Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

from nltk.tokenize import TweetTokenizer
tkn = TweetTokenizer()

tweet_token = [tkn.tokenize(sent) for sent in tweets_nourl]
print(tweet_token[0])

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [81]:
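# Step 5: Remove stop words. Build the stop-word and punctuation lists first.
# Requires the NLTK stop-word corpus: nltk.download('stopwords')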

from nltk.corpus import stopwords
from string import punctuation

stop_nltk = stopwords.words("english")
stop_punct = list(punctuation)

In [83]:
# Step 6: Remove redundant terms like ‘amp’, ‘rt’, etc.

# Adding some specific punctuation from the data:
stop_punct.extend(['...','``',"''",".."])
stop_context = ['rt', 'amp']

In [84]:
# Steps 7 & 8: Remove stop words from each tokenized tweet, strip '#' symbols, and drop terms of length 1.
# Combine stop_nltk, stop_punct, and stop_context from steps 5 & 6.

stop_final = stop_nltk + stop_punct + stop_context

def del_stop(sent):
    return [re.sub("#","",term) for term in sent if term not in stop_final and len(term) > 1]

tweets_clean = [del_stop(tweet) for tweet in tweet_token]
tweets_clean[0]

['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']
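
Note that `del_stop` strips the '#' only after the stop-word and length checks, so a hashtag such as '#the' or '#a' passes both filters and survives as 'the' or 'a'. The sketch below contrasts that order with a variant that strips '#' first. `stop_words` is a hypothetical three-word stand-in for `stop_final`, and `del_stop_hash_first` is an illustrative name, not part of the notebook.

```python
import re

stop_words = ["the", "a", "is"]  # hypothetical stand-in for stop_final

def del_stop(sent):
    # Notebook order: filter first, then strip '#'.
    return [re.sub("#", "", term) for term in sent
            if term not in stop_words and len(term) > 1]

def del_stop_hash_first(sent):
    # Variant: strip '#' first so hashtags face the same filters as plain terms.
    terms = [term.replace("#", "") for term in sent]
    return [term for term in terms if term not in stop_words and len(term) > 1]

tokens = ["#the", "father", "is", "#a", "#run"]
print(del_stop(tokens))             # ['the', 'father', 'a', 'run']
print(del_stop_hash_first(tokens))  # ['father', 'run']
```

Which order is preferable depends on whether hashtagged stop words carry any signal for the classifier.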

In [85]:
#Adding all terms to one big list.

term_list = []
for tweet in tweets_clean:
    term_list.extend(tweet)

In [86]:
# Use counter and find the 10 most common terms.

from collections import Counter
res = Counter(term_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]

In [87]:
# Join the tokens back to form strings.  This will be required for the vectorizers.

tweets_clean = [" ".join(tweet) for tweet in tweets_clean]

In [88]:
# Features and label.

X = tweets_clean
y = dt_tweets.label.values

In [89]:
#Split test and train

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)

In [90]:
# Use the TF-IDF values of the terms as features to build a vector space model.
# Instantiate the vectorizer with a maximum of 5000 terms in the vocabulary.
# Fit it on the train set only, then apply the same transformation to the test set.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 5000)
X_train_bow = vectorizer.fit_transform(X_train)

X_test_bow = vectorizer.transform(X_test)
X_train_bow.shape, X_test_bow.shape

((22373, 5000), (9589, 5000))
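
As a rough illustration of what the vectorizer computes, the sketch below reimplements TF-IDF in plain Python. It assumes scikit-learn's documented defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document) and ignores tokenization details and `max_features`. The helper names `fit_idf` and `transform` are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def transform(doc, idf):
    """Map one document to L2-normalized TF-IDF weights; unseen terms are dropped."""
    tf = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train = ["love happy day", "happy life", "hate speech"]
idf = fit_idf(train)
vec = transform("happy hate unseen", idf)
print(vec)
```

The vocabulary and IDF weights come from the training documents alone, which mirrors calling `fit_transform` on the train set and only `transform` on the test set. The rarer term 'hate' ends up with a larger weight than the more common 'happy'.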

Model building: Ordinary Logistic Regression
* Instantiate LogisticRegression from sklearn with default parameters.
* Fit on the train data.
* Make predictions for the train and the test set.


In [91]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_bow, y_train)

y_train_pred = logreg.predict(X_train_bow)
y_test_pred = logreg.predict(X_test_bow)

Model evaluation: Accuracy, recall, and f1_score
* Report the accuracy on the train set.
* Report the recall on the train set: decent, high, or low?
* Get the f1_score on the train set.


In [92]:
from sklearn.metrics import accuracy_score, classification_report
scr1 = accuracy_score(y_train, y_train_pred)
print(scr1)

0.9560184150538595



Accuracy is 95.6%. Given the heavy class imbalance (0 -> 0.929854 and 1 -> 0.070146), this figure can be misleading: a model that always predicts 0 would already reach about 93%, so accuracy says little about how well the 1s are captured.
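
To see why accuracy alone can mislead on data this imbalanced, consider a trivial "model" that always predicts non-hate. The labels below are synthetic and only mirror the roughly 93% / 7% split of the dataset.

```python
# Synthetic labels mirroring the 93% / 7% class split.
y_true = [0] * 93 + [1] * 7
y_pred = [0] * len(y_true)  # a "model" that always predicts non-hate

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_1 = true_pos / sum(t == 1 for t in y_true)

print(accuracy)  # 0.93
print(recall_1)  # 0.0
```

The accuracy looks strong even though not a single hate tweet is caught, which is why recall and f1_score for class 1 are the more informative metrics here.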

So let’s look at the classification report: 

In [93]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373



Recall for class 1 is just 39%, which is poor: the model misses most hate tweets because it leans toward the majority class 0. The class imbalance needs to be addressed, and this can be done within the Logistic Regression model itself.
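
One way to adjust for the imbalance is scikit-learn's `class_weight="balanced"` option, which the documentation describes as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula by hand to the class counts from the training report above, so it shows the arithmetic rather than calling the library.

```python
# Class counts taken from the training-set classification report.
counts = {0: 20815, 1: 1558}

n_samples = sum(counts.values())
n_classes = len(counts)

# Documented formula for class_weight="balanced":
# n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(label, round(weight, 3))
# 0 0.537
# 1 7.18
```

Under these weights, an error on a hate tweet costs roughly 13 times as much as an error on a non-hate tweet, which pushes the model toward catching more of the minority class.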

In [95]:
# Instantiate with class_weight="balanced" to counter the class imbalance.

# Train again with the adjustment and evaluate:
# 1. Train the model on the train set.
# 2. Evaluate the predictions on the train set: accuracy, recall, and f1_score.

logreg = LogisticRegression(class_weight="balanced")
logreg.fit(X_train_bow, y_train)

LogisticRegression(class_weight='balanced')

In [96]:
# Evaluating the train set:
y_train_pred = logreg.predict(X_train_bow)
accuracy_score(y_train, y_train_pred)

0.9535153980244044

Accuracy score is 95.35%.  It is a little lower than the previous model, but the classification report will give better details.

In [97]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     20815
           1       0.60      0.97      0.74      1558

    accuracy                           0.95     22373
   macro avg       0.80      0.96      0.86     22373
weighted avg       0.97      0.95      0.96     22373



Recall for class 1 is now 97%, a large improvement over 39%. The f1_score is also better at 0.74, although precision for class 1 dropped from 0.96 to 0.60.
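
The f1_score is the harmonic mean of precision and recall, so it can be checked directly from the class-1 figures in the two training reports. The inputs below are the rounded values from those reports, so the results are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class-1 precision and recall as rounded in the two training reports.
f1_default = round(f1(0.96, 0.39), 2)   # default model
f1_balanced = round(f1(0.60, 0.97), 2)  # class_weight="balanced"

print(f1_default)   # 0.55
print(f1_balanced)  # 0.74
```

Because the harmonic mean is dominated by the smaller value, the low recall held the first model's f1_score down, while the balanced model trades some precision for a much higher recall.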

Regularization and Hyperparameter tuning:
* Import GridSearchCV and StratifiedKFold (stratification is needed because of the class imbalance).
* Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.
* Use a balanced class weight while instantiating the logistic regression.

In [98]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [100]:
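# Create the parameter grid for the regularization strength 'C' and the penalty type.
# Note: scikit-learn's default 'lbfgs' solver does not support 'l1', so those candidates
# can only be evaluated with a solver such as 'liblinear' or 'saga'.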

param_grid = {
    'C': [0.01,0.1,1,10,100],
    'penalty': ["l1","l2"]
}
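
A grid search tries every combination of the listed values. The sketch below enumerates the candidates with the standard library only, to show how many model fits the grid implies. `n_folds = 4` is taken from the 4-fold scheme used later in the notebook.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
}
n_folds = 4

# Every combination of the listed values is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(candidates))            # 10
print(len(candidates) * n_folds)  # 40
print(candidates[0])              # {'C': 0.01, 'penalty': 'l1'}
```

The number of fits grows multiplicatively with each added parameter, so it is worth keeping the grid small.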

In [101]:
# Instantiating the logistic regression model with a balanced class weight.

classifier_lr = LogisticRegression(class_weight="balanced")

Find the parameters with the best recall in cross-validation.
* Choose ‘recall’ as the metric for scoring.
* Choose a stratified 4-fold cross-validation scheme.
* Fit it on the train set.

Stratified k-fold is supplied as the cv strategy to GridSearchCV. Stratification is needed because of the heavy class imbalance in the dataset: it keeps the share of hate tweets roughly equal across folds.
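
The idea behind stratification can be sketched in a few lines: split the sample indices by class, then deal each class out across the folds in turn. This is a simplified illustration on synthetic labels, not scikit-learn's exact algorithm, and `stratified_folds` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing out each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Synthetic labels with roughly the notebook's 93% / 7% imbalance.
labels = [0] * 93 + [1] * 7
folds = stratified_folds(labels, 4)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Every fold receives at least one hate example and a similar class ratio, so the recall computed on each validation fold is meaningful. With a plain unshuffled split of these ordered labels, all seven hate examples would land in the last fold.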

In [102]:
grid_search = GridSearchCV(estimator = classifier_lr, param_grid = param_grid, 
                          cv = StratifiedKFold(4), n_jobs = -1, verbose = 1, scoring = "recall" )

In [103]:
# Fitting on the train data to get the best parameters:
grid_search.fit(X_train_bow, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    3.6s finished


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']},
             scoring='recall', verbose=1)

In [104]:
grid_search.best_estimator_

LogisticRegression(C=1, class_weight='balanced')

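From cross-validation, the best parameters are: C = 1, penalty = “l2”. These match scikit-learn's defaults, so the selected model is the same as the balanced model trained above.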

Predict and evaluate using the best estimator.
* Use the best estimator from the grid search to make predictions on the test set.
* What is the recall on the test set for the hate tweets?
* What is the f1_score?

In [105]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      8905
           1       0.49      0.77      0.60       684

    accuracy                           0.93      9589
   macro avg       0.73      0.85      0.78      9589
weighted avg       0.95      0.93      0.93      9589



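On the test set, the f1_score for class 1 is 0.60 and the recall is 0.77, at a precision of 0.49. Recall is well below the 0.97 seen on the train set, which suggests some overfitting, but the model still catches about three quarters of the hate tweets.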