# Hate Speech Detection with Machine Learning

* [Data Preparation](#explore)
  * [Tweet samples in each class](#samples)
  * [Some cleaning and preprocessing](#cleaning)

In [53]:
import pandas as pd
import numpy as np
import random
from IPython.display import display
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

<a id='explore'></a>
## Data Preparation

In [8]:
df_original = pd.read_csv('labeled_data.csv')
df = df_original.copy()
df.drop(columns='Unnamed: 0', inplace=True)
display(df.head())
print(df.shape)

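# get value counts in each class, labelling each row by its class id
# (value_counts() sorts by frequency, so the labels are looked up rather than hard-coded in order)
class_names = {0: '0 - hate speech', 1: '1 - offensive language', 2: '2 - neither'}
counts = df['class'].value_counts()
display(pd.DataFrame({'count': counts.values},
                     index=[class_names[c] for c in counts.index]))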

# sample some tweets in each class to display them
hate_sample = df[df['class'] == 0].sample(5, random_state=6).tweet.tolist()
offensive_sample = df[df['class'] == 1].sample(5, random_state=2).tweet.tolist()
neither_sample = df[df['class'] == 2].sample(5, random_state=2).tweet.tolist()

| | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


(24783, 6)


| | count |
|---|---|
| 1 - offensive language | 19190 |
| 2 - neither | 4163 |
| 0 - hate speech | 1430 |
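
The counts above point to a strongly imbalanced label distribution. As a rough sketch, the share of each class can be derived directly from the three printed counts. The `class_counts` dictionary below is a hand-copied stand-in for `df['class'].value_counts()` (an assumption made so the snippet runs without `labeled_data.csv`); it is not part of the original notebook.

```python
# Counts copied from the output above; a hypothetical stand-in for
# df['class'].value_counts(), so the snippet runs without the CSV file.
class_counts = {
    '1 - offensive language': 19190,
    '2 - neither': 4163,
    '0 - hate speech': 1430,
}

# Total number of tweets; this matches the row count in df.shape.
total = sum(class_counts.values())

# Fraction of the dataset that falls in each class.
class_shares = {label: n / total for label, n in class_counts.items()}

for label, share in class_shares.items():
    print(f'{label}: {share:.1%}')
```

With these counts, offensive language accounts for roughly 77% of the tweets, while hate speech accounts for roughly 6%.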


<a id='samples'></a>
### Sample of regular tweets

In [9]:
for tweet in neither_sample:
    print(tweet, '\n')

RT @BlissTabitha: Homeowner Shoots Armed Suspect Who Attacked and Pistol Whipped Him http://t.co/VLuEwUGclg 

A pharaohs nd his angry bird http://t.co/jMC2eR32Ka 

&#8220;@DIYcookery: Oreo cookies&#128523; http://t.co/8zcjWW56Zz&#8221;bruh &#128525;&#128539;&#128553; 

RT @NDFootball: Winston connects with Greene on the slant across the middle for the score. Ball game is tied, 17-17. 

RT @alicia_garcia3: "That's why God made brownies. To replace boys." 



### Sample of offensive tweets

In [10]:
for tweet in offensive_sample:
    print(tweet, '\n')

u know I came to stunt 
so drop that pussy bitch 

RT @BrwnSkinGirly: Soulja boy bitch so Aggravating &#128553;&#128553;&#128553;all that nagging &amp; shit .. I hope Ion sound like her &#128514;&#128514;&#128564; 

Damn...payback is a bitch 

RT @SavageJayy: Bruh some of yall bitches lucky and do not realize it TISK TISK 

Lil bitch ass 



### Sample of hate tweets

In [11]:
for tweet in hate_sample:
    print(tweet, '\n')

"You ol trout mouth ass bitch" 
DEEEEAAAADD 

Every spic cop should be killed. Those pigs only got their job to 'get even' but white people never did anything to them. 

I hate fat bitches 

@joh2guy I LOVE RACISM! "Blame whitey" for EVERYTHING!
Now that you mention it! I bet the white man put the Desert tortoise on #BundyRanch 

RT @occdissent: Ann Coulter admits on FOX News that America has a negro problem. #crime #tcot #gettingreal 



<a id="cleaning"></a>
### Some cleaning and preprocessing

In [51]:
# some regex patterns
twitter_handle_re = r'@\w*'
url_re = r'\w+://[\w\-\./]+'
extra_whitespace_re = r'\s+'

# make a new column for the clean tweets: first remove twitter handles (e.g. @SavageJayy)
# regex=True is passed explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default
df['clean_tweet'] = df.tweet.str.replace(twitter_handle_re, '', regex=True)

# remove urls
df.clean_tweet = df.clean_tweet.str.replace(url_re, '', regex=True)

# remove special characters, punctuation and numbers, replacing them with spaces.
# Basically, remove everything but letters (both lower and upper case) and the hashtag symbol
df.clean_tweet = df.clean_tweet.str.replace(r'[^a-zA-Z#]', ' ', regex=True)

# remove excess whitespace, make all words lower case
df.clean_tweet = df.clean_tweet.str.replace(extra_whitespace_re, ' ', regex=True)
df.clean_tweet = df.clean_tweet.str.lower()

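# tokenize each tweet
df.clean_tweet = df.clean_tweet.apply(word_tokenize)
stops = set(stopwords.words('english'))

def tokenize_remove_stops(tokenized_tweet):
    '''
    Remove stopwords from an already tokenized tweet (a list of tokens).
    This function is not applied to df.clean_tweet in this cell.
    '''
    # assumed body (the original cell leaves the function empty): keep only non-stopword tokens
    return [token for token in tokenized_tweet if token not in stops]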

def sample_tweets(df, col, size=3, random_state=0):
    '''
    Sample and print some tweets in each class.
    '''
    hate = df[df['class'] == 0].sample(size, random_state=random_state)[col].tolist()
    offensive = df[df['class'] == 1].sample(size, random_state=random_state)[col].tolist()
    neither = df[df['class'] == 2].sample(size, random_state=random_state)[col].tolist()

    print('**REGULAR TWEETS:\n')
    for tweet in neither:
        print(tweet, '\n')
        
    print('**OFFENSIVE TWEETS:\n')
    for tweet in offensive:
        print(tweet, '\n')
        
    print('**HATE TWEETS:\n')
    for tweet in hate:
        print(tweet, '\n')
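
The cleaning steps in the cell above can be tried on a single string with only the standard library. The sketch below reuses the same regex patterns; `clean_text` and its `decode_entities` flag are hypothetical names introduced here for illustration. Decoding HTML entities first is an added assumption, not part of the original pipeline: the raw tweets contain entities such as `&#8220;`, and because the character filter keeps `#`, each undecoded entity leaves a lone `#` behind. The input string is a made-up example, not a row from the dataset.

```python
import html
import re

# Same patterns as in the cleaning cell above.
TWITTER_HANDLE_RE = r'@\w*'
URL_RE = r'\w+://[\w\-\./]+'
NON_LETTER_RE = r'[^a-zA-Z#]'
EXTRA_WHITESPACE_RE = r'\s+'


def clean_text(text, decode_entities=True):
    """Hypothetical helper: apply the notebook's cleaning steps to one string."""
    if decode_entities:
        # Assumption, not in the original pipeline: turn '&#128523;' and
        # similar entities into real characters before filtering.
        text = html.unescape(text)
    text = re.sub(TWITTER_HANDLE_RE, '', text)     # drop @handles
    text = re.sub(URL_RE, '', text)                # drop URLs
    text = re.sub(NON_LETTER_RE, ' ', text)        # keep only letters and '#'
    text = re.sub(EXTRA_WHITESPACE_RE, ' ', text)  # collapse runs of whitespace
    # strip() is an extra step added here to trim leading/trailing spaces.
    return text.lower().strip()


# Made-up example string combining a handle, an entity, a URL and a hashtag.
raw = 'RT @SavageJayy: Oreo cookies&#128523; http://t.co/8zcjWW56Zz #rude'

print(clean_text(raw))                         # entity decoded, then filtered out
print(clean_text(raw, decode_entities=False))  # entity leaves a lone '#'
```

With `decode_entities=False` the function mirrors the cell above, and the `#` left over from the entity shows up as a separate token once the text is tokenized.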

In [44]:
sample_tweets(df, 'tweet', random_state=11)

**REGULAR TWEETS:

Weekend is here. What an amazing week this has been. Let's use this extended weekend to celebrate our successes my fellow queer folk. 

Yeah that's where you're supposed to put the trash can.... #rude http://t.co/molWF0py 

AND #NorthKorea called #Barack a monkey. LOL! 
RT! 
*If the shoe fits! 'Cause it sure looks like he married one! 
http://t.co/F0pGDalfaA&#8221; 

**OFFENSIVE TWEETS:

@whiteangelss84 @fields_devante bitch stfu u livin off our tax money too, we pay shit just like u so don't get that white power shit to head 

bitches please in napa fewer than 100 people were injured and literally 2 of those were reported as critical 

RT @FightCIubs: Resisting the urge to smack a bitch http://t.co/kwe5j13O9b 

**HATE TWEETS:

@SoftestMuffin @_tee13 @TorahBlaze Best believe We aint no christian slave brainwash black spooks miss white man, unlike yoself, she devil 

Typically hateful, anti-Christian, mentally ill and ugly dyke trash pig couple Jennifer McCarthy and M

In [52]:
sample_tweets(df, 'clean_tweet', random_state=11)

**REGULAR TWEETS:

['weekend', 'is', 'here', 'what', 'an', 'amazing', 'week', 'this', 'has', 'been', 'let', 's', 'use', 'this', 'extended', 'weekend', 'to', 'celebrate', 'our', 'successes', 'my', 'fellow', 'queer', 'folk'] 

['yeah', 'that', 's', 'where', 'you', 're', 'supposed', 'to', 'put', 'the', 'trash', 'can', '#', 'rude'] 

['and', '#', 'northkorea', 'called', '#', 'barack', 'a', 'monkey', 'lol', 'rt', 'if', 'the', 'shoe', 'fits', 'cause', 'it', 'sure', 'looks', 'like', 'he', 'married', 'one', '#'] 

**OFFENSIVE TWEETS:

['bitch', 'stfu', 'u', 'livin', 'off', 'our', 'tax', 'money', 'too', 'we', 'pay', 'shit', 'just', 'like', 'u', 'so', 'don', 't', 'get', 'that', 'white', 'power', 'shit', 'to', 'head'] 

['bitches', 'please', 'in', 'napa', 'fewer', 'than', 'people', 'were', 'injured', 'and', 'literally', 'of', 'those', 'were', 'reported', 'as', 'critical'] 

['rt', 'resisting', 'the', 'urge', 'to', 'smack', 'a', 'bitch'] 

**HATE TWEETS:

['best', 'believe', 'we', 'aint', 'no', 'c