# Help Twitter Combat Hate Speech Using NLP and Machine Learning

## DESCRIPTION

Using NLP and ML, build a model that identifies hate speech (racist or sexist tweets) on Twitter.

## Problem Statement:  

Twitter is a major platform where anyone can have their views heard. Some of these voices spread hate and negativity, and Twitter is wary of its platform being used as a medium to spread hate.

You are a data scientist at Twitter, and you will help Twitter identify tweets that contain hate speech and remove them from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and build a robust model.

## Domain: Social Media

Analysis to be done: Clean up the tweets and build a classification model using NLP techniques and tweet-specific cleanup. Apply regularization and tune hyperparameters with stratified k-fold cross-validation to select the best model.

## Content: 

* id: identifier number of the tweet

* label: 0 (non-hate) / 1 (hate)

* tweet: the text of the tweet

In [1]:
import re
import collections
import string
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [2]:
# 1. Load the tweets file using read_csv function from Pandas package.
data = pd.read_csv('./data/TwitterHate.csv')
data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [3]:
# 2. Get the tweets into a list for easy text cleanup and manipulation.
tweets = data['tweet'].values.tolist()
len(tweets)

31962

In [4]:
#3. To cleanup: 
#   1. Normalize the casing.
#   2. Using regular expressions, remove user handles. These begin with '@'.
#   3. Using regular expressions, remove URLs.
#   4. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
#   5. Remove stop words.
#   6. Remove redundant terms like ‘amp’, ‘rt’, etc.
#   7. Remove ‘#’ symbols from the tweet while retaining the term.

def clean_text(df, text_field):
    df[text_field] = df[text_field].str.lower() #1
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"@[A-Za-z0-9]+", "", elem)) #2
    # Match URLs anywhere in the tweet, not only at the start of the string.
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"https?://\S+", "", elem)) #3
    return df

#3. #1,2,3
clean_data = clean_text(data.copy(), 'tweet')
clean_data.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so sel...
1,2,0,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
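
The three string-level steps above (lowercasing, handle removal, URL removal) can be tried on a single string before they are applied to the whole column. The following is a minimal sketch using only the standard library; the function name `clean_tweet`, the sample tweet, and the exact patterns are illustrative assumptions, not the notebook's own code. It also shows that a URL pattern anchored with `^` leaves a URL in the middle of a tweet untouched.

```python
import re

# Illustrative patterns (assumptions): handles are '@' plus word characters,
# URLs are 'http://' or 'https://' followed by non-whitespace characters.
HANDLE_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def clean_tweet(text):
    """Lowercase a tweet, drop user handles and URLs, and collapse whitespace."""
    text = text.lower()
    text = HANDLE_RE.sub("", text)
    text = URL_RE.sub("", text)
    return " ".join(text.split())

sample = "@User Loving the new phone https://example.com/x #Happy"
cleaned = clean_tweet(sample)
print(cleaned)  # loving the new phone #happy

# A pattern anchored with '^' only matches a URL at the very start of the string.
anchored = re.sub(r"^https?://\S+", "", "see https://example.com/x now")
print(anchored)  # see https://example.com/x now
```

The hashtag survives at this stage because the `#` symbol is only stripped later, at the token level.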


In [5]:
#3. #4
tk = TweetTokenizer()
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: tk.tokenize(t))
clean_data.head()

Unnamed: 0,id,label,tweet
0,1,0,"[when, a, father, is, dysfunctional, and, is, ..."
1,2,0,"[thanks, for, #lyft, credit, i, can't, use, ca..."
2,3,0,"[bihday, your, majesty]"
3,4,0,"[#model, i, love, u, take, with, u, all, the, ..."
4,5,0,"[factsguide, :, society, now, #motivation]"


In [6]:
#3. #5 
stop_words = set(stopwords.words('english'))
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: [w for w in t if w not in stop_words])
clean_data.head()

Unnamed: 0,id,label,tweet
0,1,0,"[father, dysfunctional, selfish, drags, kids, ..."
1,2,0,"[thanks, #lyft, credit, can't, use, cause, off..."
2,3,0,"[bihday, majesty]"
3,4,0,"[#model, love, u, take, u, time, urð, , , ±,..."
4,5,0,"[factsguide, :, society, #motivation]"


In [7]:
#3. #6 #7
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: [w for w in t if w not in ['rt', 'amp']]) #6
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: [re.sub(r'#', '', w) for w in t]) #7
clean_data.head()

Unnamed: 0,id,label,tweet
0,1,0,"[father, dysfunctional, selfish, drags, kids, ..."
1,2,0,"[thanks, lyft, credit, can't, use, cause, offe..."
2,3,0,"[bihday, majesty]"
3,4,0,"[model, love, u, take, u, time, urð, , , ±, ..."
4,5,0,"[factsguide, :, society, motivation]"


In [8]:
# 4. Extra cleanup by removing terms with a length of 1.
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: [w for w in t if len(w) > 1])

# Further cleanup: drop punctuation tokens, then keep only purely alphabetic tokens.
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: [w for w in t if w not in string.punctuation])
clean_data['tweet'] = clean_data['tweet'].apply(lambda t: [w for w in t if w.isalpha()])

clean_data.head()

Unnamed: 0,id,label,tweet
0,1,0,"[father, dysfunctional, selfish, drags, kids, ..."
1,2,0,"[thanks, lyft, credit, use, cause, offer, whee..."
2,3,0,"[bihday, majesty]"
3,4,0,"[model, love, take, time, urð]"
4,5,0,"[factsguide, society, motivation]"
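
The token-level filters from steps 3.5 to 3.7 and step 4 can be chained in one small function, which makes their order and effect easier to inspect. This is a sketch under stated assumptions: `filter_tokens` is a hypothetical helper, and `STOP_WORDS` is a tiny hand-written stand-in for NLTK's English stop word list so that the example runs without any downloads.

```python
# Tiny stand-in stop list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"i", "a", "the", "is", "and", "with"}
REDUNDANT = {"rt", "amp"}

def filter_tokens(tokens):
    """Apply the token-level cleanup steps in the same order as the notebook."""
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    tokens = [t for t in tokens if t not in REDUNDANT]   # drop 'rt' and 'amp'
    tokens = [t.replace("#", "") for t in tokens]        # keep the hashtag text
    tokens = [t for t in tokens if len(t) > 1]           # drop one-character terms
    return [t for t in tokens if t.isalpha()]            # keep alphabetic tokens only

tokens = ["#model", "i", "love", "u", "rt", "amp", ":", "can't", "time"]
filtered = filter_tokens(tokens)
print(filtered)  # ['model', 'love', 'time']
```

Note that the `isalpha()` filter also discards contractions such as "can't", which is consistent with that token disappearing from the second row of the output above.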


In [9]:
# 5. Check out the top terms in the tweets:
#   1. First, get all the tokenized terms into one large list.
#   2. Use the counter and find the 10 most common terms.

tweet_words = clean_data['tweet'].values.tolist()
tweet_words = [val for sublist in tweet_words for val in sublist]

word_counter = collections.Counter(tweet_words)
for word, count in word_counter.most_common(10):
    print(word, ": ", count)

love :  2748
day :  2276
happy :  1684
time :  1131
life :  1118
like :  1047
today :  1013
new :  994
thankful :  946
positive :  931
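
The flatten-then-count pattern used above can also be written with `itertools.chain`, which avoids building the intermediate flat list by hand. The token lists below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from itertools import chain

# Made-up token lists standing in for the cleaned tweets.
token_lists = [["love", "day"], ["love", "happy"], ["day", "love"]]

# chain.from_iterable walks every token of every list without copying them first.
counts = Counter(chain.from_iterable(token_lists))
top_two = counts.most_common(2)
print(top_two)  # [('love', 3), ('day', 2)]
```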


In [10]:
# 6. Data formatting for predictive modeling:
#   1. Join the tokens back to form strings. This will be required for the vectorizers.
#   2. Assign x and y.
#   3. Perform train_test_split using sklearn.

# 6. 1.
clean_data['tweet'] = clean_data['tweet'].str.join(' ')

In [11]:
# 6. 2.
X = clean_data[['tweet']]
y = clean_data[['label']]

In [12]:
# 6. 3.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test), len(y_train), len(y_test))

23971 7991 23971 7991


In [13]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(23971, 1)
(7991, 1)
(23971, 1)
(7991, 1)
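
The split above is a plain random split. With an imbalanced label, a stratified split keeps the hate/non-hate ratio the same in both partitions; scikit-learn's `train_test_split` supports this through its `stratify` argument. The standard-library sketch below illustrates the idea on made-up labels. `stratified_split` is a hypothetical helper and is not something the notebook uses.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 92 non-hate (0) and 8 hate (1).
labels = [0] * 92 + [1] * 8
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)  # 75 25 2
```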


In [14]:
# 7. We’ll use TF-IDF values for the terms as features to build a vector space model.
#   1. Import the TF-IDF vectorizer from sklearn.
#   2. Instantiate with a maximum of 5000 terms in your vocabulary.
#   3. Fit and apply on the train set.
#   4. Apply on the test set.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vect.fit_transform(X_train['tweet'])

X_test_tfidf = tfidf_vect.transform(X_test['tweet'])
X_test_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
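
To make the fit-on-train, apply-on-test distinction concrete, here is a small standard-library sketch of TF-IDF weighting. It follows scikit-learn's documented defaults for the weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), but it splits on whitespace, which is a simplifying assumption and differs from the vectorizer's default token pattern. The helpers `fit_idf` and `transform` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents."""
    tokenized = [doc.split() for doc in train_docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform(doc, idf):
    """Map one document to an L2-normalized {term: weight} dict."""
    tf = Counter(t for t in doc.split() if t in idf)  # terms unseen in training are ignored
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["love happy day", "love time", "love life"]
idf = fit_idf(train_docs)

# 'weekend' never appeared in training, so it gets no weight in the test vector.
test_vec = transform("love happy weekend", idf)
print(idf["love"], test_vec)
```

Because "love" occurs in every training document, its IDF is the minimum value of 1.0, so the rarer term "happy" receives the larger weight in the test vector.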

In [15]:
#8. Model building: Ordinary Logistic Regression
#   1. Instantiate Logistic Regression from sklearn with default parameters.
#   2. Fit on the train data.
#   3. Make predictions for the train and the test set.

#9. Model evaluation: Accuracy, recall, and f_1 score.
#   1. Report the accuracy on the train set.
#   2. Report the recall on the train set: decent, high, or low.
#   3. Get the f1 score on the train set.

lr = LogisticRegression()
lr_model = lr.fit(X_train_tfidf, np.ravel(y_train))

y_pred_train = lr.predict(X_train_tfidf)
y_pred = lr_model.predict(X_test_tfidf)

print("--------------On Train data----------------")
print("Accuracy: ", accuracy_score(y_train, y_pred_train))  
print("Recall: ", recall_score(y_train,y_pred_train))
print("F1 Score: ", f1_score(y_train,y_pred_train))  
print("--------------On Test data----------------")
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Recall: ", recall_score(y_test,y_pred))
print("F1 Score: ", f1_score(y_test,y_pred)) 

--------------On Train data----------------
Accuracy:  0.9551124275165825
Recall:  0.3778966131907308
F1 Score:  0.5417376490630322
--------------On Test data----------------
Accuracy:  0.9504442497810036
Recall:  0.32737030411449014
F1 Score:  0.48031496062992124
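
The gap between high accuracy and low recall is typical of imbalanced data. The sketch below computes the three metrics by hand on made-up labels (not the notebook's data) to show how a model that misses most positives can still score 95% accuracy. `binary_metrics` is a hypothetical helper that treats label 1 as the positive class.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, recall, f1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1

# Made-up labels: 93 negatives and 7 positives; only 2 positives are found.
y_true = [0] * 93 + [1] * 7
y_pred = [0] * 93 + [1] * 2 + [0] * 5
accuracy, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, recall, f1)
```

Here accuracy is 0.95 while recall is only 2/7, which is why recall and F1 are the more informative metrics for the hate class.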


In [16]:
#10. Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.
#   1. Adjust the appropriate parameter (class_weight) in the LogisticRegression model.

#11. Train again with the adjustment and evaluate.
#   1. Train the model on the train set.
#   2. Evaluate the predictions on the train set: accuracy, recall, and f_1 score.

lr = LogisticRegression(class_weight="balanced")
lr_model = lr.fit(X_train_tfidf, np.ravel(y_train))

y_pred_train = lr.predict(X_train_tfidf)
y_pred = lr_model.predict(X_test_tfidf)

print("--------------On Train data----------------")
print("Accuracy: ", accuracy_score(y_train, y_pred_train))  
print("Recall: ", recall_score(y_train,y_pred_train))
print("F1 Score: ", f1_score(y_train,y_pred_train))  
print("--------------On Test data----------------")
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Recall: ", recall_score(y_test,y_pred))
print("F1 Score: ", f1_score(y_test,y_pred))  

--------------On Train data----------------
Accuracy:  0.9468941637812357
Recall:  0.9714795008912656
F1 Score:  0.7197886858903808
--------------On Test data----------------
Accuracy:  0.9241646852709298
Recall:  0.7906976744186046
F1 Score:  0.5932885906040268
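
The `class_weight="balanced"` setting reweights each class by `n_samples / (n_classes * count_of_class)`, as described in the scikit-learn documentation. The sketch below reproduces that formula with the standard library on made-up labels; `balanced_class_weights` is a hypothetical helper for illustration only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up labels: 93 non-hate (0) and 7 hate (1).
labels = [0] * 93 + [1] * 7
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss (50 out of 100 here), so errors on the rare class are no longer drowned out by the majority class.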


In [17]:
#12. Regularization and Hyperparameter tuning:
#   1. Import GridSearchCV and StratifiedKFold (stratified folds because of the class imbalance).
#   2. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.
#   3. Use a balanced class weight while instantiating the logistic regression.

#13. Find the parameters with the best recall in cross validation.
#   1. Choose ‘recall’ as the metric for scoring.
#   2. Choose a stratified 4-fold cross-validation scheme.
#   3. Fit on the train set.

#14. What are the best parameters?

In [18]:
X = X_train_tfidf
y = np.ravel(y_train)

param_grid = [{'C': [10, 5.0, 1.0, 0.1], 'penalty': ['l1', 'l2']}]

best = {}
bestrecall = 0
i = 1
# Outer loop: 20 stratified folds of the training data.
kf = StratifiedKFold(n_splits=20, random_state=1, shuffle=True)
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    xtr, xvl = X[train_index], X[test_index]
    ytr, yvl = y[train_index], y[test_index]
    # Inner grid search over C and penalty, scored on recall (GridSearchCV's default cv).
    model = GridSearchCV(LogisticRegression(class_weight="balanced", solver='liblinear', max_iter=1000), param_grid, scoring='recall', n_jobs=-1)
    model.fit(xtr, ytr)
    print(model.best_params_)
    pred = model.predict(xvl)
    recall = recall_score(yvl, pred)
    print("Recall: ", recall)
    # Keep the parameters from the fold with the highest held-out recall.
    if recall > bestrecall:
        bestrecall = recall
        best = model.best_params_
    i += 1

print("Best Params: ", best) #14


1 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.7976190476190477

2 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.7023809523809523

3 of kfold 20
{'C': 0.1, 'penalty': 'l2'}
Recall:  0.7738095238095238

4 of kfold 20
{'C': 0.1, 'penalty': 'l2'}
Recall:  0.75

5 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.6785714285714286

6 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.7023809523809523

7 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.8452380952380952

8 of kfold 20
{'C': 0.1, 'penalty': 'l2'}
Recall:  0.7976190476190477

9 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.8470588235294118

10 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.8470588235294118

11 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.7764705882352941

12 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.8571428571428571

13 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.75

14 of kfold 20
{'C': 1.0, 'penalty': 'l2'}
Recall:  0.7261904761904762

15 of kfold 20
{'C': 1.0, 'p
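
The plan in steps 12 and 13 asks for a stratified 4-fold scheme scored on recall. One way to express that directly, offered as a sketch and not as the notebook's own code, is to pass a `StratifiedKFold` splitter to `GridSearchCV` through its `cv` parameter. The example assumes scikit-learn is installed and uses synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `X_fit`, `X_hold`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data (assumption): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

param_grid = {"C": [10, 5.0, 1.0, 0.1], "penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

# liblinear supports both the l1 and l2 penalties.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear", max_iter=1000),
    param_grid,
    scoring="recall",
    cv=cv,
)
search.fit(X_fit, y_fit)

print("Best params:", search.best_params_)
print("Mean CV recall:", search.best_score_)

# best_estimator_ is refit on all of X_fit with the winning parameters.
hold_recall = recall_score(y_hold, search.best_estimator_.predict(X_hold))
print("Hold-out recall:", hold_recall)
```

Selecting on the mean recall across folds, as `best_score_` does, is less sensitive to one lucky fold than picking the parameters from the single fold with the highest recall.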

In [19]:
#15. Predict and evaluate using the best estimator.
#   1. Use the best estimator from the grid search to make predictions on the test set.
#   2. What is the recall on the test set for the hate tweets?
#   3. What is the f_1 score?

In [22]:
lr = LogisticRegression(C=0.1, penalty='l2', class_weight="balanced")
lr_model = lr.fit(X_train_tfidf, np.ravel(y_train))

y_pred_train = lr.predict(X_train_tfidf)
y_pred = lr_model.predict(X_test_tfidf)

print("--------------On Train data----------------")
print("Accuracy: ", accuracy_score(y_train, y_pred_train))  
print("Recall: ", recall_score(y_train,y_pred_train))
print("F1 Score: ", f1_score(y_train,y_pred_train))  
print("--------------On Test data----------------")
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Recall: ", recall_score(y_test,y_pred))
print("F1 Score: ", f1_score(y_test,y_pred))  

--------------On Train data----------------
Accuracy:  0.9307496558341329
Recall:  0.9073083778966132
F1 Score:  0.6478574459058125
--------------On Test data----------------
Accuracy:  0.914779126517332
Recall:  0.7817531305903399
F1 Score:  0.5620578778135048
