## Part 1: getting the data from URL

In [1]:
from urllib.request import urlopen
import pandas as pd

### Part 1.1: Fortuna et al dataset
This dataset can be downloaded from the b2share.eudat.eu website.


*Reference for datasets: https://hatespeechdata.com/*

In [2]:
link = "https://b2share.eudat.eu/api/files/792b86e1-e676-4a0d-971f-b41a1ffb9b18/dataset_dummy_classes.csv"

# Download the raw CSV bytes; the context manager closes the connection.
with urlopen(link) as f:
    dataset_fortuna = f.read()

# Save a local copy (the datasets/ folder must already exist).
with open('datasets/dataset_fortuna.csv', 'wb') as csv_file:
    csv_file.write(dataset_fortuna)

In [3]:
dataset_fortuna = pd.read_csv("datasets/dataset_fortuna.csv")
dataset_fortuna.head()

Unnamed: 0,tweet_id,Hate.speech,Sexism,Body,Racism,Ideology,Homophobia,Origin,Religion,Health,...,Thin.women,Arabic,East.europeans,Africans,South.Americans,Brazilians,Migrants,Homossexuals,Thin.people,Ageing
0,6061792000.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,6061983000.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6062596000.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,6062667000.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,6062752000.0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Part 1.2: De Pelle et al dataset
This dataset is available on GitHub. In this case, we just downloaded the file OffComBR2.arff, renamed it to dataset_depelle.csv, and made it available in our datasets folder.

GitHub page of the project: https://github.com/rogersdepelle/OffComBR
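The notebook does not show how the ARFF header was handled before the file was read as CSV. A minimal sketch of one way to extract the rows, assuming the file has a single `@data` marker followed by comma-separated rows with single-quoted text; the function name `read_arff_rows` and the sample content are hypothetical, and unlike the table below this version strips the surrounding quotes:

```python
import csv
import io


def read_arff_rows(text):
    """Return the rows that follow the @data marker of an ARFF document."""
    lines = text.splitlines()
    # Everything after the (case-insensitive) @data marker is row data.
    start = next(i for i, line in enumerate(lines) if line.strip().lower() == "@data")
    data_lines = [
        line for line in lines[start + 1:]
        if line.strip() and not line.startswith("%")  # skip blanks and comments
    ]
    # ARFF quotes strings with single quotes, so use that as the CSV quote char.
    reader = csv.reader(io.StringIO("\n".join(data_lines)), quotechar="'")
    return [row for row in reader]


# Hypothetical sample in the assumed layout (not taken from OffComBR2.arff).
sample = """@relation demo
@attribute class {no,yes}
@attribute document string
@data
yes,'first example comment'
no,'second, with a comma'
"""

rows = read_arff_rows(sample)
print(rows)
```

The resulting list of `[class, text]` pairs can be passed to `pd.DataFrame(rows, columns=['class', 'data'])`. Escaped quotes inside the text are not handled by this sketch.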

In [4]:
dataset_depelle = pd.read_csv("datasets/dataset_depelle.csv")
dataset_depelle.head()

Unnamed: 0,class,data
0,yes,'Votaram no PEZAO Agora tomem no CZAO'
1,no,'cuidado com a poupanca pessoal Lembram o que ...
2,no,'Sabe o que eu acho engracado os nossos govern...
3,yes,'os cariocas tem o que merecem um pessoal que ...
4,no,'Podiam retirar dos lucros dos bancos '


## Part 2: putting all the data together
As we can see, the two datasets have different formats.
Fortuna's dataset identifies each tweet by its ID rather than by its text. It also records whether the tweet is considered hate speech (0 or 1), along with the categories of that hate. We'll use only the tweet_id and Hate.speech columns.
De Pelle's dataset stores the text itself, with the class given as yes/no.

We'll join the two datasets into one that keeps only the hate speech label as 0/1 and the data as text.

### Part 2.1: getting the tweet text from tweet ID with Tweepy library

We'll do this with Tweepy, a Python library for accessing the Twitter API.
We noticed that some tweets or accounts had already been deleted (e.g., element 308).
We therefore wrapped the call in a try/except that stores an 'NA' placeholder for those tweets. We'll delete them afterwards.
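The point of the placeholder is that the list of texts stays the same length as the list of IDs, so it can be assigned back as a column. A minimal offline sketch of that pattern, with the Twitter call replaced by a hypothetical in-memory lookup (`fake_store`, `fetch_texts` and the IDs are made up for illustration):

```python
def fetch_texts(ids, fetch, placeholder="NA"):
    """Fetch one text per ID, storing a placeholder when the lookup fails."""
    texts = []
    for item_id in ids:
        try:
            texts.append(fetch(item_id))
        except LookupError:
            # Missing item: keep the slot so positions still match the IDs.
            texts.append(placeholder)
    return texts


# Hypothetical stand-in for the remote API: ID 2 has been "deleted".
fake_store = {1: "first text", 3: "third text"}

texts = fetch_texts([1, 2, 3], fake_store.__getitem__)
print(texts)  # ['first text', 'NA', 'third text']
```

Indexing a dict with a missing key raises `KeyError`, a subclass of `LookupError`, which plays the role of the API error here.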

In [5]:
import os

import tweepy
import tqdm

# Credentials are read from environment variables (the variable names are our choice)
# so that no secrets are stored in the notebook.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']

access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [6]:
list_tweet_id = dataset_fortuna['tweet_id']
list_tweet = []

In [7]:
for tweet_id in tqdm.tqdm(list_tweet_id):
    try:
        tweet = api.get_status(int(tweet_id))
        list_tweet.append(tweet.text)
    except tweepy.TweepError:
        # Deleted or unavailable tweet: keep a placeholder so the list stays aligned with the IDs.
        list_tweet.append('NA')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 5668/5668 [1:34:38<00:00,  1.00s/it]


In [20]:
dataset_fortuna['data'] = list_tweet
dataset_fortuna.describe()

Unnamed: 0,Hate.speech
count,5668.0
mean,0.216655
std,0.412002
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [33]:
dataset_fortuna

Unnamed: 0,data,Hate.speech
0,"""não come mel, morde marimbondo""",0
1,"não tem pinto, tem orgulho !",0
2,Não vê essa merda de Crepúsculo! Pra isso temo...,0
3,"não da tapa na bundinha, da cotovelada nas cos...",0
4,o diminutivo INHO não acompanha a trajetória d...,1
...,...,...
5663,,1
5664,,1
5665,,1
5666,,1


In [36]:
# Drop the tweets that are no longer available (marked with 'NA' above).
# Since we can't check their content, we preferred to drop them.
dataset_fortuna.loc[dataset_fortuna["data"] == 'NA', "data"] = pd.NA
dataset_fortuna = dataset_fortuna.dropna()
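The placeholder rows can also be removed with a boolean mask, which looks only at the data column and leaves missing values in other columns alone. A minimal sketch on a toy frame (the rows are made up; only the column names follow the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "data": ["first tweet", "NA", "third tweet", "NA"],
    "Hate.speech": [0, 1, 0, 1],
})

# Keep only the rows whose text is not the 'NA' placeholder.
cleaned = toy[toy["data"] != "NA"]

print(cleaned.index.tolist())  # [0, 2]
```

The surviving rows keep their original index labels, which is why the index of the cleaned dataset has gaps.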

In [37]:
dataset_fortuna["data"]

0                        "não come mel, morde marimbondo"
1                            não tem pinto, tem orgulho !
2       Não vê essa merda de Crepúsculo! Pra isso temo...
3       não da tapa na bundinha, da cotovelada nas cos...
4       o diminutivo INHO não acompanha a trajetória d...
                              ...                        
5646    chateada q o meu fechamento é vc mozao tá sain...
5648    RT @eviesramos: eu amo a indústria sapatao bra...
5658             Várias sapatão linda aqui na escola kero
5660    اااااه يالوقاحه كانوا يبي يحولونه هيئة التحقيق...
5661    @bielvolei @paullinh4 Basta frequentar uns min...
Name: data, Length: 1514, dtype: object

In [38]:
dataset_fortuna.describe()

Unnamed: 0,Hate.speech
count,1514.0
mean,0.195509
std,0.396723
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


### Part 2.2: Change dataset_depelle to same format as dataset_fortuna

In [13]:
cols = dataset_depelle.columns.tolist()
cols = cols[::-1]  # reverse the column order: ['class', 'data'] -> ['data', 'class']
dataset_depelle = dataset_depelle[cols]
dataset_depelle['class'] = dataset_depelle['class'].map({'yes': 1, 'no': 0})
dataset_depelle.columns = ['data', 'Hate.speech']
dataset_depelle.head()

Unnamed: 0,data,Hate.speech
0,'Votaram no PEZAO Agora tomem no CZAO',1
1,'cuidado com a poupanca pessoal Lembram o que ...,0
2,'Sabe o que eu acho engracado os nossos govern...,0
3,'os cariocas tem o que merecem um pessoal que ...,1
4,'Podiam retirar dos lucros dos bancos ',0
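One thing worth checking after a `map` like this is that every label was covered, because values missing from the dictionary silently become NaN. A minimal sketch on a toy frame, where the 'maybe' label is invented to trigger the case:

```python
import pandas as pd

toy = pd.DataFrame({
    "class": ["yes", "no", "maybe"],
    "data": ["a", "b", "c"],
})

mapped = toy["class"].map({"yes": 1, "no": 0})

# Labels that were not in the mapping come back as NaN; list them.
unmapped = toy.loc[mapped.isna(), "class"].tolist()
print(unmapped)  # ['maybe']
```

An empty list here means every row received a 0/1 label.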


### Part 2.3: Bringing it all together and saving all the work

In [39]:
full_dataset = pd.concat([dataset_fortuna, dataset_depelle], ignore_index=True)
full_dataset

Unnamed: 0,data,Hate.speech
0,"""não come mel, morde marimbondo""",0
1,"não tem pinto, tem orgulho !",0
2,Não vê essa merda de Crepúsculo! Pra isso temo...,0
3,"não da tapa na bundinha, da cotovelada nas cos...",0
4,o diminutivo INHO não acompanha a trajetória d...,1


In [40]:
full_dataset.describe()

Unnamed: 0,Hate.speech
count,2764.0
mean,0.258683
std,0.43799
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
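The counts and means reported by `describe()` can be cross-checked with a little arithmetic. The sketch below uses only the numbers printed in the outputs above and assumes that every De Pelle row survived the concatenation:

```python
# Values copied from the describe() outputs above.
n_fortuna, mean_fortuna = 1514, 0.195509
n_full, mean_full = 2764, 0.258683

# Rows contributed by the De Pelle dataset.
n_depelle = n_full - n_fortuna

# For a 0/1 column, mean * count is the number of positive labels.
pos_fortuna = round(n_fortuna * mean_fortuna)
pos_full = round(n_full * mean_full)
pos_depelle = pos_full - pos_fortuna

print(n_depelle, pos_fortuna, pos_full, pos_depelle)  # 1250 296 715 419
```

Under that assumption, the De Pelle portion has 1250 rows, 419 of them labelled as hate speech, which explains why the combined mean is higher than the Fortuna mean alone.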


In [42]:
full_dataset.to_csv('datasets/full_dataset.csv', index=False)
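A quick way to confirm that the saved file reads back unchanged is a round trip on a small frame. This is a minimal sketch with made-up rows, written to a temporary folder rather than to the real datasets folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "data": ['"quoted", with a comma', "plain text"],
    "Hate.speech": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    # pathlib joins path parts with the right separator on every OS.
    out_path = Path(tmp) / "full_dataset.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())  # ['data', 'Hate.speech']
```

With `index=False` the row index is not written, so the reloaded frame has exactly the two columns and no extra `Unnamed: 0` column.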