# Hate speech on Twitter: Cleaning data from Kaggle
## Team 8: 
 - Meera Whitson whitson.m@northeastern.edu
 - Anthony Bernardi bernardi.an@northeastern.edu


In [1]:
import pandas as pd
import numpy as np
from collections import Counter

After building the pipeline and scraping data ourselves (as shown in data_collection_twitter_scraping.ipynb), we concluded that we need labeled data to apply ML techniques successfully. We are therefore no longer using the Tweets we scraped from the Twitter API and instead use [this dataset](https://www.kaggle.com/arkhoshghalb/detecting-hate-tweets?select=train.csv) from Kaggle. As discussed in the report, each row of this dataframe contains only a string with the Tweet body and a label, so we must construct our own features from the Tweet body field.

## Loading the data

In [2]:
# load up the dataframe
df_hate = pd.read_csv('hate_speech_train.csv', index_col='id')
df_hate.head()

id,label,tweet
1,0,@user when a father is dysfunctional and is s...
2,0,@user @user thanks for #lyft credit i can't us...
3,0,bihday your majesty
4,0,#model i love u take with u all the time in ...
5,0,factsguide: society now #motivation


## Hate-only dataset
Here, we create a dataframe containing only the Tweets that are labeled as hateful.

In [3]:
series_bool = df_hate['label'] == 1
df_hate_only = df_hate.loc[series_bool, :]

df_hate_only.head()

id,label,tweet
14,1,@user #cnn calls #michigan middle school 'buil...
15,1,no comment! in #australia #opkillingbay #se...
18,1,retweet if you agree!
24,1,@user @user lumpy says i am a . prove it lumpy.
35,1,it's unbelievable that in the 21st century we'...


## Non-hate-only dataset
Here, we create a dataframe containing only the Tweets that are labeled as not hateful.

In [4]:
series_bool = df_hate['label'] == 0
df_not_hate = df_hate.loc[series_bool, :]

df_not_hate.head()

id,label,tweet
1,0,@user when a father is dysfunctional and is s...
2,0,@user @user thanks for #lyft credit i can't us...
3,0,bihday your majesty
4,0,#model i love u take with u all the time in ...
5,0,factsguide: society now #motivation
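
The two filters above can be sketched together on a toy dataframe. This is only an illustration: the rows in `toy` are made up, and negating the mask assumes the `label` column holds nothing but 0 and 1.

```python
import pandas as pd

# Toy stand-in for df_hate (made-up rows, same column layout)
toy = pd.DataFrame(
    {"label": [0, 1, 0, 1, 0],
     "tweet": ["good morning", "weasel go home", "happy monday",
               "go away weasel", "nice day"]},
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

# One boolean mask is enough: the non-hate rows are its negation
# (assumption: every label is either 0 or 1)
is_hate = toy["label"] == 1
toy_hate_only = toy.loc[is_hate]
toy_not_hate = toy.loc[~is_hate]

# Checking the class balance shows how skewed the two classes are
class_counts = toy["label"].value_counts()
print(class_counts)
```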


## Count word frequencies for hate vs. non-hate
We want to construct counters for both classes (hate and non-hate) to keep track of how often each individual word occurs in each class. We can later use these counts to build features for each class.

In [5]:
# set up dictionary for the words that occur only in hate speech
unique_words = dict()

# do hate and not hate words
hate_words = dict()
not_hate_words = dict()

# for every tweet that is hate speech
for tweet in df_hate_only['tweet']:
    # get the words in the tweet
    words = tweet.split()
    
    # for each word in the tweet
    for word in words:
        # count it in the dictionary
        if word in hate_words:
            hate_words[word] += 1
        else:
            hate_words[word] = 1
            
# for every tweet that is not hate speech
for tweet in df_not_hate['tweet']:
    # get the words in the tweet
    words = tweet.split()
    
    # for each word in the tweet
    for word in words:
        # count it in the dictionary
        if word in not_hate_words:
            not_hate_words[word] += 1
        else:
            not_hate_words[word] = 1

# for each (word, count) pair in hate speech: if the word never occurs
# in non-hate speech, add it to the hate-only words
for k, v in hate_words.items():
    if k not in not_hate_words:
        unique_words[k] = v
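
The per-class counting loops above can also be written with `collections.Counter`, which the notebook already imports. The sketch below is one possible shorthand, not the code we ran: `count_words` is a hypothetical helper and the example Tweets are made up.

```python
from collections import Counter

def count_words(tweets):
    """Count whitespace-separated tokens across an iterable of Tweet strings."""
    counts = Counter()
    for tweet in tweets:
        # Counter.update adds one count per token in the list
        counts.update(tweet.split())
    return counts

# Made-up Tweets standing in for df_hate_only['tweet'] and df_not_hate['tweet']
hate_counts = count_words(["weasel go home", "go away weasel"])
not_hate_counts = count_words(["go team", "happy monday"])

print(hate_counts["weasel"], not_hate_counts["go"])
```

A `Counter` returns 0 for a word it has never seen, so no `if word in ...` check is needed while counting.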

## Get the most common words that only occur in hate speech
We found that many words are very common in both classes, so they would not be good features for distinguishing them. These are mostly words that are common in general, such as "the", "a", and "I". Removing every word that also appears in non-hateful Tweets from the words that appear in hateful Tweets leaves us with a set of about 4000 words that are decent indicators of whether a Tweet is hateful or not.

In [6]:
# make a counter out of the hate-only words
unique_words_counter = Counter(unique_words)

# request up to 5000 of the most common words
# (most_common returns fewer if the counter holds fewer)
most_common = unique_words_counter.most_common(5000)

# make a new df to not mess things up
df_words = df_hate.copy()

# get the x_feat_list
x_feat_list = [x[0] for x in most_common]

# for each word, add it to the dataframe as a column of 0's
# (skip words that would overwrite the existing 'tweet' and 'label' columns)
for word in x_feat_list:
    if word != 'tweet' and word != 'label':
        df_words[word] = np.zeros(len(df_hate['tweet']))


df_words.head()

id,label,tweet,#allahsoil,#sjw,if...,#miamiâ¦,#sikh,#bigot,vandalised,"#calgary,",...,mall.,omfg,offended!,#mailboxpride,#liberalisme,weasel,tony..,dipshit.,anybody?,....god
1,0,@user when a father is dysfunctional and is s...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,@user @user thanks for #lyft credit i can't us...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,bihday your majesty,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,#model i love u take with u all the time in ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0,factsguide: society now #motivation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
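
The selection of hate-only words can be illustrated on two small counters. The counts below are made up for the example; only the filtering logic mirrors the cells above.

```python
from collections import Counter

# Made-up per-class word counts
hate_counts = Counter({"weasel": 3, "the": 5, "#bigot": 2, "go": 1})
not_hate_counts = Counter({"the": 40, "go": 7, "happy": 9})

# Keep a word only if it never occurs in the non-hate class
hate_only_counts = Counter(
    {word: count for word, count in hate_counts.items()
     if word not in not_hate_counts}
)

# most_common(n) returns at most n entries, sorted by descending count
top_features = [word for word, _ in hate_only_counts.most_common(5000)]
print(top_features)  # ['weasel', '#bigot']
```

Note that this is stricter than `Counter` subtraction (`hate_counts - not_hate_counts`), which would keep any word that is merely more frequent in hateful Tweets. The filter drops a word as soon as it appears even once in the other class.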


In [7]:
# turn this into a dictionary so we can search for things in O(1) time
most_common_dict = dict()

for k, v in most_common:
    most_common_dict[k] = v

# optional manual removal of uninformative tokens (currently disabled)
# del most_common_dict['@user']
# del most_common_dict['the']
# del most_common_dict['to']
# del most_common_dict['a']
# del most_common_dict['tweet']

## For each Tweet: mark each word as '1' if it occurs in the Tweet
For a given word, a Tweet will have a value of '1' if that word does occur in the Tweet and '0' if it does not.

In [8]:
# make a new dataframe so we don't mess things up
df_words_copy = df_words.copy()

# for each tweet
for index in df_words_copy.index:
    # get the words in the tweet
    tweet = df_words_copy.loc[index, 'tweet']
    words = tweet.split()
    
    # for each word
    for word in words:
        # if it is one of the features, make it 1 for this row
        if word in most_common_dict:
            df_words_copy.loc[index, word] = 1
            
df_words_copy

id,label,tweet,#allahsoil,#sjw,if...,#miamiâ¦,#sikh,#bigot,vandalised,"#calgary,",...,mall.,omfg,offended!,#mailboxpride,#liberalisme,weasel,tony..,dipshit.,anybody?,....god
1,0,@user when a father is dysfunctional and is s...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,@user @user thanks for #lyft credit i can't us...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,bihday your majesty,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,#model i love u take with u all the time in ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0,factsguide: society now #motivation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31958,0,ate @user isz that youuu?ðððððð...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31959,0,to see nina turner on the airwaves trying to...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31960,0,listening to sad songs on a monday morning otw...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31961,1,"@user #sikh #temple vandalised in in #calgary,...",0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
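
Writing one cell at a time with `.loc` is slow on a dataframe of this size. A possible alternative, sketched below, builds all indicator columns in one pass and attaches them with a single `pd.concat`. The helper `add_indicator_columns` and the toy rows are hypothetical, and the sketch assumes the vocabulary does not contain the names `label` or `tweet`.

```python
import pandas as pd

def add_indicator_columns(df, vocabulary, text_col="tweet"):
    """Return df plus one 0/1 column per vocabulary word.

    Assumes no vocabulary word collides with an existing column name.
    """
    vocab = list(vocabulary)
    vocab_set = set(vocab)
    rows = []
    for text in df[text_col]:
        # Words of this Tweet that are also features
        present = set(text.split()) & vocab_set
        rows.append([1 if word in present else 0 for word in vocab])
    indicators = pd.DataFrame(rows, index=df.index, columns=vocab, dtype="uint8")
    return pd.concat([df, indicators], axis=1)

# Made-up rows with the same layout as df_hate
toy = pd.DataFrame(
    {"label": [0, 1], "tweet": ["good morning all", "#bigot weasel weasel"]},
    index=pd.Index([1, 2], name="id"),
)
toy_features = add_indicator_columns(toy, ["#bigot", "weasel", "omfg"])
print(toy_features)
```

A word that occurs several times in a Tweet still yields 1, matching the presence/absence encoding described above.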


## Saving the cleaned dataframe with added columns for features

In [9]:
df_words_copy.to_csv('hate_speech_train_with_features.csv')

This dataframe has 31962 rows and 4020 columns, so it is very large. The CSV is 516.8 MB, so we will not be submitting it with the report.
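
A back-of-the-envelope estimate suggests where most of that file size comes from. The sketch assumes that 4018 of the 4020 columns are word features (all but `label` and `tweet`) and that each feature cell is written as a three-character float such as `0.0` followed by a one-byte separator.

```python
n_rows = 31962
n_feature_cols = 4020 - 2          # assumption: all columns except label and tweet
bytes_per_cell = len("0.0") + 1    # three characters plus a comma or newline

estimated_bytes = n_rows * n_feature_cols * bytes_per_cell
estimated_mb = estimated_bytes / 1e6
print(f"{estimated_mb:.1f} MB")    # 513.7 MB
```

Under these assumptions the feature cells alone account for roughly 514 MB, with the Tweet text, labels, ids, and header making up the rest. Storing the indicators as integers would write `0` instead of `0.0` and so would likely shrink the file to about half that size.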