DESCRIPTION

Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

Problem Statement:
Twitter is a large platform where anyone can have their views heard. Some of these voices spread hate and negativity, and Twitter is wary of its platform being used as a medium to spread hate.
You are a data scientist at Twitter, and you will help identify tweets that contain hate speech so they can be removed from the platform. You will apply NLP techniques, perform tweet-specific cleanup, and build a robust model.

Domain: Social Media

Analysis to be done: Clean up the tweets and build a classification model using NLP techniques and tweet-specific cleanup. Apply regularization and tune hyperparameters with stratified k-fold cross-validation to select the best model.
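
To make the stratified k-fold idea concrete, here is a minimal, dependency-free sketch of how samples can be assigned to folds so that each fold keeps roughly the same share of hate and non-hate labels. The function names (`stratified_folds`, `train_test_splits`) and the toy labels are illustrative assumptions, not part of the notebook; in practice scikit-learn's `StratifiedKFold` would normally do this job.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds, keeping the class ratio similar in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share of it
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

def train_test_splits(labels, n_splits=5, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, n_splits, seed)
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx

# Toy labels (hypothetical): 15 non-hate (0) and 5 hate (1) samples
labels = [0] * 15 + [1] * 5
for train_idx, test_idx in train_test_splits(labels, n_splits=5):
    hate_in_test = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), hate_in_test)
```

With an imbalanced label column, a plain random split can leave a fold with very few positive examples; stratifying avoids that, which makes cross-validated scores for each hyperparameter setting easier to compare.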

Content: 

id: identifier number of the tweet

label: 0 (non-hate) / 1 (hate)

tweet: the text of the tweet

In [1]:
# load data
import pandas as pd
data = pd.read_csv('TwitterHate.csv')
data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [2]:
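# text clean up
import re
import string
import warnings

import nltk
import numpy as np
# TweetTokenizer works like word_tokenize but keeps hashtags such as #lyft as single tokens
from nltk import word_tokenize, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
# Note: word_tokenize and WordNetLemmatizer also rely on the 'punkt'
# ('punkt_tab' in newer NLTK releases) and 'wordnet' resources being installed
warnings.filterwarnings('ignore')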

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
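lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

def clean_tweet(text):
    # Replace punctuation and underscores with spaces before tokenizing
    text = re.sub(r'([^\s\w]|_)+', ' ', str(text))
    # Lowercase each token, drop stop words (the list is lowercase, so compare
    # the lowercased token), and lemmatize what remains
    return ' '.join(lemmatizer.lemmatize(word.lower())
                    for word in word_tokenize(text)
                    if word.lower() not in stop_words)

data['cleaned_tweet'] = data['tweet'].apply(clean_tweet)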

In [4]:
data[['cleaned_tweet', 'tweet', 'label']]

Unnamed: 0,cleaned_tweet,tweet,label
0,user father dysfunctional selfish drag kid dys...,@user when a father is dysfunctional and is s...,0
1,user user thanks lyft credit use cause offer w...,@user @user thanks for #lyft credit i can't us...,0
2,bihday majesty,bihday your majesty,0
3,model love u take u time urð ð ð ð ð ð ð ð,#model i love u take with u all the time in ...,0
4,factsguide society motivation,factsguide: society now #motivation,0
...,...,...,...
31957,ate user isz youuu ð ð ð ð ð ð ð ð ð â ï,ate @user isz that youuu?ðððððð...,0
31958,see nina turner airwave trying wrap mantle gen...,to see nina turner on the airwaves trying to...,0
31959,listening sad song monday morning otw work sad,listening to sad songs on a monday morning otw...,0
31960,user sikh temple vandalised calgary wso condem...,"@user #sikh #temple vandalised in in #calgary,...",1
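
The cleaned output above still carries tweet-specific noise: the anonymized `user` handle and non-ASCII debris left over from emoji. The following is a minimal sketch of a tweet-specific pre-cleaning step using only the standard `re` module. The function name `strip_tweet_noise` and the choice to drop all non-ASCII characters are assumptions made for illustration; the notebook itself does not do this, and dropping non-ASCII text would also remove legitimate non-English words.

```python
import re

URL = re.compile(r'https?://\S+')          # web links
MENTION = re.compile(r'@\w+')              # @user handles
NON_ASCII = re.compile(r'[^\x00-\x7f]+')   # emoji debris and other non-ASCII runs
PUNCT = re.compile(r'([^\s\w]|_)+')        # same punctuation pattern as the notebook

def strip_tweet_noise(text):
    """Remove links, mentions, and non-ASCII debris; keep the words of hashtags."""
    text = URL.sub(' ', str(text))
    text = MENTION.sub(' ', text)
    text = text.replace('#', ' ')          # '#lyft' becomes 'lyft'
    text = NON_ASCII.sub(' ', text)
    text = PUNCT.sub(' ', text)
    return ' '.join(text.lower().split())  # collapse repeated whitespace

print(strip_tweet_noise("@user @user thanks for #lyft credit i can't use"))
```

Running this before stop-word removal and lemmatization would keep `user` out of the vocabulary, so it no longer dominates term counts.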


In [5]:
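# remove terms with a length of 1
def drop_single_char_terms(text):
    # \b\w\b matches a word character standing alone between word boundaries
    text = re.sub(r'\b\w\b', ' ', str(text))
    # Tokens are lowercased and lemmatized again, as in the previous pass
    return ' '.join(lemmatizer.lemmatize(word.lower())
                    for word in word_tokenize(text))

data['extra_cleanup_tweet'] = data['cleaned_tweet'].apply(drop_single_char_terms)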

In [6]:
data[['extra_cleanup_tweet', 'cleaned_tweet', 'tweet', 'label']]

Unnamed: 0,extra_cleanup_tweet,cleaned_tweet,tweet,label
0,user father dysfunctional selfish drag kid dys...,user father dysfunctional selfish drag kid dys...,@user when a father is dysfunctional and is s...,0
1,user user thanks lyft credit use cause offer w...,user user thanks lyft credit use cause offer w...,@user @user thanks for #lyft credit i can't us...,0
2,bihday majesty,bihday majesty,bihday your majesty,0
3,model love take time urð,model love u take u time urð ð ð ð ð ð ð ð,#model i love u take with u all the time in ...,0
4,factsguide society motivation,factsguide society motivation,factsguide: society now #motivation,0
...,...,...,...,...
31957,ate user isz youuu,ate user isz youuu ð ð ð ð ð ð ð ð ð â ï,ate @user isz that youuu?ðððððð...,0
31958,see nina turner airwave trying wrap mantle gen...,see nina turner airwave trying wrap mantle gen...,to see nina turner on the airwaves trying to...,0
31959,listening sad song monday morning otw work sad,listening sad song monday morning otw work sad,listening to sad songs on a monday morning otw...,0
31960,user sikh temple vandalised calgary wso condem...,user sikh temple vandalised calgary wso condem...,"@user #sikh #temple vandalised in in #calgary,...",1
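
The single-character removal can be checked in isolation. This small sketch, which uses hypothetical helper names and only the standard library, applies the pattern `\b\w\b` to one of the rows shown above and compares it with a plain length filter on whitespace-separated tokens.

```python
import re

SINGLE_CHAR = re.compile(r'\b\w\b')  # one word character with a boundary on each side

def drop_single_chars(text):
    """Remove one-character terms with a regex, then normalize whitespace."""
    return ' '.join(SINGLE_CHAR.sub(' ', text).split())

def drop_single_chars_by_length(text):
    """Same idea without a regex: keep tokens longer than one character."""
    return ' '.join(tok for tok in text.split() if len(tok) > 1)

row = 'model love u take u time ur\u00f0 \u00f0 \u00f0 \u00f0'
print(drop_single_chars(row))
print(drop_single_chars(row) == drop_single_chars_by_length(row))
```

In Python 3, `\w` matches Unicode letters, which is why the stray `ð` characters are removed when they stand alone but survive when attached to another token, as in `urð`.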


In [7]:
df_tweet_list = data['extra_cleanup_tweet'].values.tolist()
print(df_tweet_list)



In [8]:
token_list = ' '.join(df_tweet_list)  # a single long string at this point; it is tokenized into a list in a later cell
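
Joining every tweet into one string and tokenizing it again works, but the cleaned tweets are already space-separated. As an alternative sketch (the helper name `flatten_tokens` is an assumption), the tokens can be collected directly with `itertools.chain`. Note that `word_tokenize` applies extra rules, for example splitting some contractions such as `gonna`, so its counts may differ slightly from a plain whitespace split.

```python
from itertools import chain

def flatten_tokens(tweets):
    """Split each cleaned tweet on whitespace and flatten into one token list."""
    return list(chain.from_iterable(tweet.split() for tweet in tweets))

sample_tweets = ['bihday majesty', 'factsguide society motivation', '']
print(flatten_tokens(sample_tweets))
```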

In [9]:
print(token_list)



In [10]:
# Tokenize using NLTK's word_tokenize function
from textblob import TextBlob
def get_text_tokens(text):
    tokens = word_tokenize(text)
    return tokens
get_text_tokens(token_list)

['user',
 'father',
 'dysfunctional',
 'selfish',
 'drag',
 'kid',
 'dysfunction',
 'run',
 'user',
 'user',
 'thanks',
 'lyft',
 'credit',
 'use',
 'cause',
 'offer',
 'wheelchair',
 'van',
 'pdx',
 'disapointed',
 'getthanked',
 'bihday',
 'majesty',
 'model',
 'love',
 'take',
 'time',
 'urð',
 'factsguide',
 'society',
 'motivation',
 'huge',
 'fan',
 'fare',
 'big',
 'talking',
 'leave',
 'chaos',
 'pay',
 'dispute',
 'get',
 'allshowandnogo',
 'user',
 'camping',
 'tomorrow',
 'user',
 'user',
 'user',
 'user',
 'user',
 'user',
 'user',
 'dannyâ',
 'next',
 'school',
 'year',
 'year',
 'exam',
 'think',
 'school',
 'exam',
 'hate',
 'imagine',
 'actorslife',
 'revolutionschool',
 'girl',
 'love',
 'land',
 'allin',
 'cavs',
 'champion',
 'cleveland',
 'clevelandcavaliers',
 'user',
 'user',
 'welcome',
 'gr8',
 'ireland',
 'consumer',
 'price',
 'index',
 'mom',
 'climbed',
 'previous',
 'may',
 'blog',
 'silver',
 'gold',
 'forex',
 'selfish',
 'orlando',
 'standwithorlando',
 

In [11]:
from collections import Counter
token_list = get_text_tokens(token_list)
counter = Counter(token_list)
counter.most_common(10)

[('user', 17582),
 ('day', 2804),
 ('love', 2801),
 ('amp', 1776),
 ('happy', 1687),
 ('time', 1249),
 ('life', 1225),
 ('today', 1090),
 ('like', 1082),
 ('get', 1000)]
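
The counts above are dominated by `user`, the anonymized mention placeholder, and also include `amp`, which is most likely what remains of the HTML entity `&amp;` once punctuation is stripped. A minimal sketch, assuming these two tokens should be treated as additional stop words (the notebook does not do this, and `EXTRA_STOP_WORDS` and `top_terms` are hypothetical names):

```python
from collections import Counter

# Assumed noise tokens: the @mention placeholder and the '&amp;' remnant
EXTRA_STOP_WORDS = {'user', 'amp'}

def top_terms(tokens, n=10, extra_stop_words=EXTRA_STOP_WORDS):
    """Return the n most common tokens, ignoring the given noise tokens."""
    counts = Counter(tok for tok in tokens if tok not in extra_stop_words)
    return counts.most_common(n)

sample_tokens = ['user', 'love', 'day', 'user', 'amp', 'love', 'happy', 'user']
print(top_terms(sample_tokens, n=2))
```

Applied to the notebook's token list, a call such as `top_terms(token_list)` would surface content words like `day` and `love` at the top instead of the placeholder.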