# Hate Speech Detection with Machine Learning

* [Data Preparation](#explore)
  * [Tweet samples in each class](#samples)
  * [Cleaning and preprocessing](#cleaning)
* [Exploratory Data Analysis](#eda)

In [1]:
import pandas as pd
import numpy as np
import random
from IPython.display import display
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

<a id='explore'></a>
## Data Preparation

In [2]:
df_original = pd.read_csv('labeled_data.csv')
df = df_original.copy()
df.drop(columns='Unnamed: 0', inplace=True)
display(df.head())
print(df.shape)

# get value counts in each class; label rows by class id so the labels
# cannot drift out of sync with the counts
class_names = {0: '0 - hate speech', 1: '1 - offensive language', 2: '2 - neither'}
counts = df['class'].value_counts().rename(index=class_names)
display(counts.to_frame('count'))

# sample some tweets in each class to display them
hate_sample = df[df['class'] == 0].sample(5, random_state=6).tweet.tolist()
offensive_sample = df[df['class'] == 1].sample(5, random_state=2).tweet.tolist()
neither_sample = df[df['class'] == 2].sample(5, random_state=2).tweet.tolist()

| | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


(24783, 6)


| | count |
|---|---|
| 1 - offensive language | 19190 |
| 2 - neither | 4163 |
| 0 - hate speech | 1430 |
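
The counts above show a strong class imbalance. As a rough illustration, the share of each class can be computed directly from those numbers. The snippet below is a minimal sketch in plain Python; the names `class_counts` and `class_share` are introduced here for illustration and are not part of the notebook.

```python
# Class counts copied from the table above.
class_counts = {
    "0 - hate speech": 1430,
    "1 - offensive language": 19190,
    "2 - neither": 4163,
}

# Total number of tweets; this matches the 24783 rows reported by df.shape.
total = sum(class_counts.values())

# Share of each class as a percentage of all tweets.
class_share = {label: 100 * n / total for label, n in class_counts.items()}

for label, pct in class_share.items():
    print(f"{label}: {pct:.1f}%")
```

Offensive language makes up roughly three quarters of the data while hate speech is under 6%, which is worth keeping in mind when choosing evaluation metrics later on.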


<a id='samples'></a>
### Sample of regular tweets

In [3]:
for tweet in neither_sample:
    print(tweet, '\n')

RT @BlissTabitha: Homeowner Shoots Armed Suspect Who Attacked and Pistol Whipped Him http://t.co/VLuEwUGclg 

A pharaohs nd his angry bird http://t.co/jMC2eR32Ka 

&#8220;@DIYcookery: Oreo cookies&#128523; http://t.co/8zcjWW56Zz&#8221;bruh &#128525;&#128539;&#128553; 

RT @NDFootball: Winston connects with Greene on the slant across the middle for the score. Ball game is tied, 17-17. 

RT @alicia_garcia3: "That's why God made brownies. To replace boys." 



### Sample of offensive tweets

In [4]:
for tweet in offensive_sample:
    print(tweet, '\n')

u know I came to stunt 
so drop that pussy bitch 

RT @BrwnSkinGirly: Soulja boy bitch so Aggravating &#128553;&#128553;&#128553;all that nagging &amp; shit .. I hope Ion sound like her &#128514;&#128514;&#128564; 

Damn...payback is a bitch 

RT @SavageJayy: Bruh some of yall bitches lucky and do not realize it TISK TISK 

Lil bitch ass 



### Sample of hate tweets

In [5]:
for tweet in hate_sample:
    print(tweet, '\n')

"You ol trout mouth ass bitch" 
DEEEEAAAADD 

Every spic cop should be killed. Those pigs only got their job to 'get even' but white people never did anything to them. 

I hate fat bitches 

@joh2guy I LOVE RACISM! "Blame whitey" for EVERYTHING!
Now that you mention it! I bet the white man put the Desert tortoise on #BundyRanch 

RT @occdissent: Ann Coulter admits on FOX News that America has a negro problem. #crime #tcot #gettingreal 



<a id="cleaning"></a>
### Cleaning and preprocessing

In [6]:
# some regex patterns
twitter_handle_re = r'@\w*'
url_re = r'\w+://[\w\-\./]+'
extra_whitespace_re = r'\s+'

# make a new column for the clean tweets: first remove twitter handles (e.g. @SavageJayy)
# note: regex=True is required because pandas >= 2.0 treats the pattern as a literal string by default
df['clean_tweet'] = df.tweet.str.replace(twitter_handle_re, '', regex=True)

# remove urls
df.clean_tweet = df.clean_tweet.str.replace(url_re, '', regex=True)

# replace special characters, punctuation and numbers with spaces,
# i.e. keep only letters (lower and upper case) and hashtags
df.clean_tweet = df.clean_tweet.str.replace(r'[^a-zA-Z#]', ' ', regex=True)

# collapse excess whitespace, make all words lower case
df.clean_tweet = df.clean_tweet.str.replace(extra_whitespace_re, ' ', regex=True)
df.clean_tweet = df.clean_tweet.str.lower()

# make a new column for tokenized tweets
# (word_tokenize needs the NLTK 'punkt' tokenizer data to be downloaded)
df['tokenized_tweet'] = df.clean_tweet.apply(word_tokenize)

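# build the stopword set once (needs the NLTK 'stopwords' corpus);
# a set gives constant-time membership tests
STOPS = set(stopwords.words('english'))

def remove_stops(tokenized_tweet):
    '''
    Remove stopwords from a tokenized tweet.
    '''
    no_stops = [t for t in tokenized_tweet if t not in STOPS]

    return no_stops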

def stem_tweet(tokenized_tweet):
    '''
    Stem each token in a tweet, i.e. strip suffixes from words. For example,
    'player', 'plays', 'played', 'playing' are all variations of the word 'play'.
    '''
    stemmer = PorterStemmer()
    stemmed_tweet = [stemmer.stem(t) for t in tokenized_tweet]
    
    return stemmed_tweet

def remove_lone_hashtags(tokenized_tweet):
    '''
    There are lots of hashtags just sitting everywhere by themselves.
    Get rid of them.
    '''
    no_hashtags = [t for t in tokenized_tweet if t != '#']
    
    return no_hashtags
    
df.tokenized_tweet = df.tokenized_tweet.apply(remove_stops)
# stemming is left disabled, so the tokens below are unstemmed; uncomment to apply it
# df.tokenized_tweet = df.tokenized_tweet.apply(stem_tweet)
df.tokenized_tweet = df.tokenized_tweet.apply(remove_lone_hashtags)


def sample_tweets(df, col, size=3, random_state=0):
    '''
    Sample and print some tweets in each class.
    '''
    hate = df[df['class'] == 0].sample(size, random_state=random_state)[col].tolist()
    offensive = df[df['class'] == 1].sample(size, random_state=random_state)[col].tolist()
    neither = df[df['class'] == 2].sample(size, random_state=random_state)[col].tolist()

    print('**REGULAR TWEETS:\n')
    for tweet in neither:
        print(tweet, '\n')
        
    print('**OFFENSIVE TWEETS:\n')
    for tweet in offensive:
        print(tweet, '\n')
        
    print('**HATE TWEETS:\n')
    for tweet in hate:
        print(tweet, '\n')
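
To see what the cleaning steps do to a single string, here is a minimal sketch that applies the same patterns, in the same order, with the standard `re` module instead of pandas. The helper `clean_text` and the example tweet are hypothetical, made up for illustration only.

```python
import re

# Same patterns as in the cleaning cell above.
twitter_handle_re = r'@\w*'
url_re = r'\w+://[\w\-\./]+'
extra_whitespace_re = r'\s+'

def clean_text(text):
    """Apply the cleaning steps to one string, in the same order as the notebook."""
    text = re.sub(twitter_handle_re, '', text)     # drop twitter handles
    text = re.sub(url_re, '', text)                # drop URLs
    text = re.sub(r'[^a-zA-Z#]', ' ', text)        # keep only letters and '#'
    text = re.sub(extra_whitespace_re, ' ', text)  # collapse runs of whitespace
    return text.lower()

# Hypothetical tweet, invented for this example.
example = "RT @some_user: Check this out!!! http://t.co/abc123 #MachineLearning &amp; more"
print(repr(clean_text(example)))
# → 'rt check this out #machinelearning amp more'
```

Note that the HTML entity `&amp;` survives as the token `amp`, because only its punctuation is stripped.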

In [7]:
display(df.head())

| | count | hate_speech | offensive_language | neither | class | tweet | clean_tweet | tokenized_tweet |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... | rt as a woman you shouldn t complain about cl... | [rt, woman, complain, cleaning, house, amp, ma... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... | rt boy dats cold tyga dwn bad for cuffin dat ... | [rt, boy, dats, cold, tyga, dwn, bad, cuffin, ... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... | rt dawg rt you ever fuck a bitch and she star... | [rt, dawg, rt, ever, fuck, bitch, start, cry, ... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... | rt she look like a tranny | [rt, look, like, tranny] |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... | rt the shit you hear about me might be true o... | [rt, shit, hear, might, true, might, faker, bi... |
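
The `amp` token in row 0 is most likely a remnant of the HTML entity `&amp;`, which appears in the raw tweets. One possible way to avoid such stray tokens, which is not part of the original pipeline, is to decode HTML entities before stripping non-letter characters. The sketch below uses the standard library's `html.unescape`; the helper name `unescape_then_strip` and the input string are assumptions made for illustration.

```python
import html
import re

def unescape_then_strip(text):
    """Decode HTML entities first, then replace non-letter characters with spaces."""
    decoded = html.unescape(text)                  # '&amp;' becomes '&', '&#8220;' becomes a curly quote
    letters_only = re.sub(r'[^a-zA-Z#]', ' ', decoded)
    return re.sub(r'\s+', ' ', letters_only).strip().lower()

# Hypothetical input containing a named and two numeric HTML entities.
raw = "cleaning &amp; cooking &#8220;quoted&#8221;"
print(unescape_then_strip(raw))
# → cleaning cooking quoted
```

Numeric entities such as `&#8220;` already vanish under the existing cleaning step because digits are removed; it is the named entities like `&amp;` that leave a leftover word behind.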


In [8]:
sample_tweets(df, 'clean_tweet', random_state=10)

**REGULAR TWEETS:

 thuggery cheating talking trash who s doing that besides jameis and isn t nd in the middle of cheating issues  

charlie sheen 

i hope charlie brought the lube for this test  

**OFFENSIVE TWEETS:

my momma keep talking to me like bitch gtf i m on twitter  

the way i fuck her you would think i love this bitch 

rt #jerryspringer on #raw the white trash side of me is applauding  

**HATE TWEETS:

i swear these anon fags go to protests just to take pictures to post to twitter look i was there like me  

#iowa is full of white trash 

rt this nigga is a fuckin faggot  



In [9]:
sample_tweets(df, 'tokenized_tweet', random_state=10)

**REGULAR TWEETS:

['thuggery', 'cheating', 'talking', 'trash', 'besides', 'jameis', 'nd', 'middle', 'cheating', 'issues'] 

['charlie', 'sheen'] 

['hope', 'charlie', 'brought', 'lube', 'test'] 

**OFFENSIVE TWEETS:

['momma', 'keep', 'talking', 'like', 'bitch', 'gtf', 'twitter'] 

['way', 'fuck', 'would', 'think', 'love', 'bitch'] 

['rt', 'jerryspringer', 'raw', 'white', 'trash', 'side', 'applauding'] 

**HATE TWEETS:

['swear', 'anon', 'fags', 'go', 'protests', 'take', 'pictures', 'post', 'twitter', 'look', 'like'] 

['iowa', 'full', 'white', 'trash'] 

['rt', 'nigga', 'fuckin', 'faggot'] 



<a id="eda"></a>
## Exploratory Data Analysis