# Hate Speech Detection - Part I: Data Gathering
In this notebook we gather data to build a general hate speech model that predicts whether a tweet conveys hate speech. We save the train and test splits of the generated dataset in CSV files (see the end of the notebook).

We want to train a BERTweet model. This model requires specific tweet normalization:
 - Tokenize the English tweets using `TweetTokenizer` from the NLTK toolkit (Bird et al., 2009).
 - Use the `emoji` package to translate emotion icons into text strings (each icon is treated as a word token).
 - Normalize the tweets by converting user mentions and web/URL links into the special tokens `@USER` and `HTTPURL`, respectively.

# 0. Setup

### Imports

In [1]:
# for reading csv files
import csv

# for shuffling data
import random

# for preprocessing
import re # (finding @user for example)
import string # (for finding printable chars)
import emoji # (translating emojis into text)
from nltk.tokenize import TweetTokenizer

# tokenizer shared by the preprocessing code
tokenizer = TweetTokenizer()

# for creating csv
import pandas as pd

# 1. Collecting Data
We will construct a dataset from two pre-existing Kaggle datasets (see “<i>Twitter Sentiment Analysis</i>” [here](https://www.kaggle.com/arkhoshghalb/twitter-sentiment-analysis-hatred-speech?select=train.csv) and “<i>Hate Speech and Offensive Language Dataset</i>” [here](https://www.kaggle.com/mrmorj/hate-speech-and-offensive-language-dataset)) of roughly 30 000 tweets each.
<br>
The result is a labeled (Hate / Not Hate) dataset of 14 172 tweets with the following distribution [<b>NOTE</b> that these distributions can be tweaked if needed]:
- Hate: 3 672 tweets (~ 26 % of the data).
- Not Hate: 10 500 tweets (~ 74 % of the data). This Not Hate subset contains:
    - 1 500 Offensive Language tweets (~ 14 % of Not Hate data).
    - 9 000 “normal” tweets (~ 86 % of Not Hate data).
    
Note that we will preprocess the data after constructing the data set.

## 1.1. <i>Twitter Sentiment Analysis</i> Dataset
First, let's get the necessary data from this dataset.

### 1.1.1 Read <i>Twitter Sentiment Analysis</i> Dataset

In [2]:
with open("../input/twitter-sentiment-analysis-hatred-speech/train.csv", "r", encoding="utf8") as f:
    # each row becomes a plain dict mapping column name to string value
    dataset1 = [dict(row) for row in csv.DictReader(f, skipinitialspace=True)]

Let's take a look at what this dataset looks like:

In [3]:
dataset1[0]

{'id': '1',
 'label': '0',
 'tweet': '@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'}

Label 0 means Not Hate and label 1 means Hate.

### 1.1.2. Gather Hate Data from <i>Twitter Sentiment Analysis</i> Data set

In [4]:
hate_data1 = [row["tweet"] for row in dataset1 if row["label"] == "1"]

Let's see an example of Hate speech:

In [5]:
hate_data1[10]

'@user why not @user mocked obama for being black.  @user @user @user @user #brexit'

Let's also see how much hate speech data we've got from this data set:

In [6]:
len(hate_data1)

2242

### 1.1.3. Gather Non-Hate Data from <i>Twitter Sentiment Analysis</i> Data set
We only want to keep 5000 examples of Non-Hate speech from this dataset (the rest will be from the other dataset to keep things balanced).

In [7]:
nonhate_data1 = [row["tweet"] for row in dataset1 if row["label"] == "0"][:5000]
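
Note that the slice above keeps the first 5000 Non-Hate rows in file order. If file order might correlate with topic or time, a seeded random sample is one possible alternative. The sketch below is only illustrative: `sample_rows`, the seed value and the toy tweets are hypothetical and not part of the original notebook.

```python
import random

def sample_rows(rows, k, seed=0):
    """Return up to k rows drawn without replacement, reproducibly."""
    rng = random.Random(seed)   # local RNG, leaves the global random state untouched
    k = min(k, len(rows))       # avoid a ValueError when fewer than k rows exist
    return rng.sample(rows, k)

# toy stand-in for the list of Non-Hate tweets
toy_tweets = [f"tweet {i}" for i in range(10)]
picked = sample_rows(toy_tweets, k=4, seed=0)
```

Whether sampling is preferable to slicing depends on how the source file is ordered, which this notebook does not inspect.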

## 1.2. <i>Hate Speech and Offensive Language Dataset</i>
Now let's get the necessary data from this dataset.

### 1.2.1. Read <i>Hate Speech and Offensive Language Dataset</i>

In [8]:
with open("../input/hate-speech-and-offensive-language-dataset/labeled_data.csv", "r", encoding="utf8") as f:
    # each row becomes a plain dict mapping column name to string value
    dataset2 = [dict(row) for row in csv.DictReader(f, skipinitialspace=True)]

Let's take a look to see what this dataset looks like:

In [9]:
dataset2[0]

{'': '0',
 'count': '3',
 'hate_speech': '0',
 'offensive_language': '0',
 'neither': '3',
 'class': '2',
 'tweet': "!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out..."}

- Class 0: hate speech 
- Class 1: offensive language
- Class 2: neither
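
Since the final dataset is binary, these three classes collapse into two labels: class 0 becomes Hate (1), while classes 1 and 2 both become Not Hate (0). A minimal sketch of that mapping follows; `to_binary_label` is a hypothetical helper used only for illustration (the notebook itself filters rows by class instead).

```python
def to_binary_label(class_value):
    """Map a class string from the dataset ("0", "1" or "2") to a binary label."""
    # class "0" is hate speech; offensive language ("1") and neither ("2")
    # are both treated as Not Hate
    return 1 if class_value == "0" else 0

example_rows = [{"class": "0"}, {"class": "1"}, {"class": "2"}]
binary_labels = [to_binary_label(row["class"]) for row in example_rows]
```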

### 1.2.2. Gather Hate Data from <i>Hate Speech and Offensive Language Dataset</i> Data set

In [10]:
hate_data2 = [row["tweet"] for row in dataset2 if row["class"] == "0"]

Let's see how much hate speech data we get from this dataset:

In [11]:
len(hate_data2)

1430

### 1.2.3. Gather Non-Hate Data from <i>Hate Speech and Offensive Language Dataset</i> Data set
Unlike for the first dataset, here we take both offensive-language tweets and "normal" tweets to construct the Non-Hate subset. We add offensive language to our set to try to teach the model that being offensive doesn’t always translate to hate.
As stated earlier, we want 1500 offensive-language tweets; we also keep 4000 "normal" tweets.

In [12]:
offensive_data2 = [row["tweet"] for row in dataset2 if row["class"] == "1"][:1500]
normal_data2 = [row["tweet"] for row in dataset2 if row["class"] == "2"][:4000]

Now we can combine these to get all Non-Hate Data for dataset 2:

In [13]:
nonhate_data2 = offensive_data2 + normal_data2

## 1.3. Combine both Datasets
Now that we've got Hate and Non-Hate speech from both datasets, we can construct our final dataset.

### 1.3.1 Concatenate all data

In [14]:
# initialise final dataset
data = []

# concatenate all hate data
hate_data = hate_data1 + hate_data2

# concatenate all non-hate data
nonhate_data = nonhate_data1 + nonhate_data2

# add hate data to final dataset
for row in hate_data:
    data.append({"label": 1, "tweet": row})
    
# add non-hate data to final dataset
for row in nonhate_data:
    data.append({"label": 0, "tweet": row})

### 1.3.2. Shuffle data
We shuffle the data so that, for example, all the hate speech examples do not end up next to each other.

In [15]:
random.shuffle(data)
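
As written, `random.shuffle` produces a different order on every run, so the resulting dataset order is not reproducible. If reproducibility matters, one option is to shuffle with a seeded generator. This is a sketch under that assumption; the seed value 42 is arbitrary and the toy rows are placeholders.

```python
import random

toy_data = [{"label": i % 2, "tweet": f"tweet {i}"} for i in range(6)]

first = toy_data[:]
random.Random(42).shuffle(first)    # seeded RNG, independent of the global one

second = toy_data[:]
random.Random(42).shuffle(second)   # same seed, so the same order is produced
```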

Finally, let's see what our final dataset looks like:

In [16]:
print(len(data))
data[:10]

14172


[{'label': 0,
  'tweet': "&#8220;@Benkasso: I'll beat the pussy up, that's a hook right thur&#8221; &#128064;"},
 {'label': 1,
  'tweet': "Michael Sam being cut further proves that we fags don't belong in football or #NFL, it is a heterosexual sport for heterosexuals."},
 {'label': 0,
  'tweet': 'why #men get   - #emotions #masculinity #progressive #religion '},
 {'label': 0,
  'tweet': 'felt good to get to a meeting tonight! #aa #recovery #sobriety #grateful #sober   #blessed'},
 {'label': 1,
  'tweet': "here's what ignorance &amp;  looks like. it ain't all swastikas &amp; burning crosses... "},
 {'label': 1, 'tweet': 'dull british . '},
 {'label': 0,
  'tweet': 'happy bihday my gorgeous thing! today is all about you so soak it up you superstar!  â\x80¦ '},
 {'label': 0,
  'tweet': '#decors   buffalo simulation: buffalo for you to take in the vicinity of their homes to do. in this way, you '},
 {'label': 0,
  'tweet': 'RT @OnionSports: Yankees Unveil Beautiful Derek Jeter Cage In Monu
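
The printed size can be cross-checked against the subset sizes from the earlier cells: 2242 and 1430 Hate tweets, plus slices of 5000, 1500 and 4000 Not Hate tweets. The sketch below only redoes that arithmetic. It assumes each slice was completely filled, which is consistent with the total of 14172 printed above; `subset_sizes` and `label_shares` are illustrative names.

```python
# subset sizes taken from the earlier cells of this notebook
subset_sizes = {
    "hate_dataset1": 2242,       # len(hate_data1)
    "hate_dataset2": 1430,       # len(hate_data2)
    "nonhate_dataset1": 5000,    # slice limit for nonhate_data1
    "offensive_dataset2": 1500,  # slice limit for offensive_data2
    "normal_dataset2": 4000,     # slice limit for normal_data2
}

def label_shares(sizes):
    """Return totals and class shares for the combined dataset."""
    hate = sizes["hate_dataset1"] + sizes["hate_dataset2"]
    nonhate = (sizes["nonhate_dataset1"]
               + sizes["offensive_dataset2"]
               + sizes["normal_dataset2"])
    total = hate + nonhate
    return {
        "total": total,
        "hate": hate,
        "nonhate": nonhate,
        "hate_share": hate / total,
        "nonhate_share": nonhate / total,
    }

summary = label_shares(subset_sizes)
```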

# 2. Preprocessing Data
Some tweets are messy (see the previous output for examples), so we clean them up as much as possible. For this we define a function `clean_tweet` that tokenizes the tweet, replaces links with `HTTPURL` and mentions with `@USER`, and removes retweet markers (RT), leftover HTML entities and unprintable characters.

## 2.1. `clean_tweet` function

In [17]:
def clean_tweet(tweet):
    
    # tokenize tweet and re-join the tokens with single spaces
    tokenized_tweet = " ".join(tokenizer.tokenize(tweet))

    # remove unprintable characters (ð\x9f\x98\x81ð\x9f\x98\x81 for ex)
    printable_chars = set(string.printable)
    printable_tweet = "".join(ch for ch in tokenized_tweet if ch in printable_chars)
    
    # replace links (https://www. for ex) with HTTPURL
    linkless_tweet = re.sub(r"http\S+", "HTTPURL", printable_tweet)
    
    # remove retweet markers (RT) as standalone tokens only,
    # so that words containing "RT" (e.g. "SMART") are left intact
    retweetless_tweet = re.sub(r"\bRT\b", "", linkless_tweet)
    
    # replace mentions (@user for example) with @USER
    mentionless_tweet = re.sub(r"@\S+", "@USER", retweetless_tweet)
    
    # remove other odd stuff (&... for ex)
    other = re.sub(r"&\S+", "", mentionless_tweet)
    
    # replace emojis with str
    # (note: emoji characters are not in string.printable, so they were already
    # dropped above; to keep them as text this step would have to run first)
    emojiless_tweet = emoji.demojize(other)
    
    # collapse runs of whitespace (ex "Hello        World" => "Hello World")
    whitespaceless = re.sub(r"\s+", " ", emojiless_tweet)
    
    return whitespaceless

Example:

In [18]:
clean_tweet("@user is that the name of any upcoming new track? ð\x9f\x98\x81ð\x9f\x98\x81 #2pm #kpop https:www.google.com")

'@USER is that the name of any upcoming new track ? #2pm #kpop HTTPURL'
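
One detail of the retweet step is worth checking in isolation. A bare `RT` pattern matches those two letters anywhere, including inside words, whereas the word-boundary form `\bRT\b` only matches `RT` as a standalone token. A small stdlib-only comparison (the sample string is made up):

```python
import re

sample = "RT @USER : SMART move"

# bare pattern: also strips the "RT" inside "SMART"
bare = re.sub(r"RT", "", sample)

# word-boundary pattern: only the standalone retweet marker is removed
bounded = re.sub(r"\bRT\b", "", sample)
```

Both results keep the leading space left behind by the removed marker, as in the cleaned tweets shown in this notebook.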

## 2.2. Clean all data

In [19]:
clean_data = [{"label": row["label"], "tweet": clean_tweet(row["tweet"])} for row in data]
clean_data[:10]

[{'label': 0,
  'tweet': " @USER : I'll beat the pussy up , that's a hook right thur "},
 {'label': 1,
  'tweet': "Michael Sam being cut further proves that we fags don't belong in football or #NFL , it is a heterosexual sport for heterosexuals ."},
 {'label': 0,
  'tweet': 'why #men get - #emotions #masculinity #progressive #religion'},
 {'label': 0,
  'tweet': 'felt good to get to a meeting tonight ! #aa #recovery #sobriety #grateful #sober #blessed'},
 {'label': 1,
  'tweet': "here's what ignorance & looks like . it ain't all swastikas & burning crosses ..."},
 {'label': 1, 'tweet': 'dull british .'},
 {'label': 0,
  'tweet': 'happy bihday my gorgeous thing ! today is all about you so soak it up you superstar ! '},
 {'label': 0,
  'tweet': '#decors buffalo simulation : buffalo for you to take in the vicinity of their homes to do . in this way , you'},
 {'label': 0,
  'tweet': ' @USER : Yankees Unveil Beautiful Derek Jeter Cage In Monument Park HTTPURL HTTPURL'},
 {'label': 1,
  'twe

# 3. Splitting Data
Now we split our data into train and test splits: <b>15% test</b> and <b>85% train/dev</b>.

In [20]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(clean_data, test_size = 0.15)
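
The split above is random and unseeded, so the share of Hate tweets in the test split can drift slightly from the overall share, and the split changes between runs. In scikit-learn this can be addressed with the `stratify` and `random_state` arguments of `train_test_split`. To show the idea without depending on scikit-learn, here is a stdlib-only sketch; `stratified_split` is a hypothetical helper and the toy rows are placeholders.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_size=0.15, seed=0):
    """Split rows into (train, test), keeping each label's share roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)  # per-label test count
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])

    # mix the labels again so rows are not grouped by class
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

toy_rows = ([{"label": 1, "tweet": f"hate {i}"} for i in range(20)]
            + [{"label": 0, "tweet": f"ok {i}"} for i in range(80)])
train_rows, test_rows = stratified_split(toy_rows, test_size=0.15, seed=0)
```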

# 4. Save Data to csv for further use.
Store train and test splits in csv files.

In [21]:
# convert the lists of dictionaries to pandas DataFrames
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# convert to csv
train_df.to_csv("hatespeech_train.csv", index = False)
test_df.to_csv("hatespeech_test.csv", index = False)

Let's check the content of these csv files:

In [22]:
!head -n 3 hatespeech_test.csv

label,tweet
1,@USER here comes a #supermistict douchebag who can only poke his nose as sme of all . #burninhell
0,""" my father gave me the greatest gift anyone could give another person - he believed in me "" #fathersday to all the dads out there "
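
The doubled quotes in the last row are standard CSV escaping rather than a problem in the data: a field containing a quote character is wrapped in quotes and each inner quote is doubled. The stdlib `csv` module follows the same convention, so the round trip can be sketched without pandas (the sample tweet is made up):

```python
import csv
import io

rows = [{"label": 0, "tweet": '" he believed in me " #fathersday'}]

# write one row to an in-memory CSV file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "tweet"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# read it back: the quoting is undone and the original tweet is recovered
restored = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that `csv.DictReader` returns every value as a string, so the label comes back as `"0"` rather than `0`.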
