## Part 1: Getting the data from a URL

In [1]:
from urllib.request import urlopen
import pandas as pd

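### Part 1.1: Fortuna et al. dataset
This dataset is published on the b2share.eudat.eu website. The cell below fetches the CSV from the author's GitHub repository and caches it in our datasets folder.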


*Reference for datasets: https://hatespeechdata.com/*

In [2]:
try:
    # Use the cached local copy when it exists
    dataset_fortuna = pd.read_csv("datasets/dataset_fortuna.csv")
except FileNotFoundError:
    # No local copy yet: download the raw bytes and cache them under datasets/
    url = "https://raw.githubusercontent.com/paulafortuna/Portuguese-Hate-Speech-Dataset/master/2019-05-28_portuguese_hate_speech_hierarchical_classification.csv"
    with urlopen(url) as f:
        raw_bytes = f.read()
    with open("datasets/dataset_fortuna.csv", "wb") as csv_file:
        csv_file.write(raw_bytes)
    dataset_fortuna = pd.read_csv("datasets/dataset_fortuna.csv")


In [3]:
dataset_fortuna.head()

Unnamed: 0,text,Hate.speech,Sexism,Body,Racism,Ideology,Homophobia,Origin,Religion,Health,...,Thin.women,Arabic,East.europeans,Africans,South.Americans,Brazilians,Migrants,Homossexuals,Thin.people,Ageing
0,"""não come mel, morde marimbondo""",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"não tem pinto, tem orgulho !",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Não vê essa merda de Crepúsculo! Pra isso temo...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"não da tapa na bundinha, da cotovelada nas cos...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,o diminutivo INHO não acompanha a trajetória d...,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
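
The download-and-cache step above could be factored into a small helper so the same logic serves every dataset. The following is a minimal sketch, not the notebook's actual code: the name load_cached_csv is hypothetical, and it also creates the target folder if it is missing, which the original cell assumes already exists.

```python
from pathlib import Path
from urllib.request import urlopen

import pandas as pd


def load_cached_csv(local_path, url):
    """Read local_path if it exists; otherwise download url to it first."""
    path = Path(local_path)
    if not path.exists():
        # Create the cache folder on first use (an addition to the original cell)
        path.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url) as response:
            path.write_bytes(response.read())
    return pd.read_csv(path)


# Example call (performs a network request on the first run only):
# dataset_fortuna = load_cached_csv("datasets/dataset_fortuna.csv", url)
```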


### Part 1.2: De Pelle et al. dataset
This dataset is available on GitHub. In this case, we downloaded the file OffComBR3.arff, renamed it to dataset_depelle_3.csv and made it available in our datasets folder.

Github page of the project: https://github.com/rogersdepelle/OffComBR
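
An .arff file is not a plain CSV with a header row: it starts with @relation and @attribute lines, and the comma-separated rows only begin after an @data line. As a rough sketch of one way to turn such a file into a DataFrame, assuming a simple dense ARFF layout: the helper arff_to_dataframe and the toy sample are made up for illustration, comment lines starting with % are not handled, and this version strips the single quotes around each comment.

```python
import io

import pandas as pd


def arff_to_dataframe(arff_text):
    """Parse a simple dense ARFF string into a DataFrame (illustrative only)."""
    names = []
    lines = arff_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        lower = stripped.lower()
        if lower.startswith("@attribute"):
            # Second token is the attribute name (assumes no spaces in names)
            names.append(stripped.split()[1])
        elif lower.startswith("@data"):
            body = "\n".join(lines[i + 1:])
            # ARFF strings are single-quoted, so use ' as the quote character
            return pd.read_csv(io.StringIO(body), names=names, quotechar="'")
    raise ValueError("no @data section found")


# Invented sample, not taken from the real OffComBR files
sample = """@relation toy
@attribute class {no,yes}
@attribute data string
@data
yes,'first comment'
no,'second, with a comma'
"""
toy_arff = arff_to_dataframe(sample)
print(toy_arff)
```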

In [4]:
try:
    # Use the cached local copy when it exists
    dataset_depelle_3 = pd.read_csv("datasets/dataset_depelle_3.csv")
except FileNotFoundError:
    # No local copy yet: download the raw bytes and cache them under datasets/
    url = "https://raw.githubusercontent.com/rogersdepelle/OffComBR/master/OffComBR3.arff"
    with urlopen(url) as f:
        raw_bytes = f.read()
    with open("datasets/dataset_depelle_3.csv", "wb") as csv_file:
        csv_file.write(raw_bytes)
    dataset_depelle_3 = pd.read_csv("datasets/dataset_depelle_3.csv")

In [5]:
dataset_depelle_3.head()

Unnamed: 0,class,data
0,yes,'Votaram no PEZAO Agora tomem no CZAO'
1,no,'cuidado com a poupanca pessoal Lembram o que ...
2,no,'Sabe o que eu acho engracado os nossos govern...
3,no,'Podiam retirar dos lucros dos bancos '
4,no,'CADE O GALVAO PRA NARRAR AGORA FALIIIIUUUUU...


## Part 2: putting all the data together
As we can see, the two datasets have different formats.
In Fortuna's dataset, each row has a flag saying whether the text is considered hate speech (0 or 1), followed by columns that classify the kind of hate. We'll use only the first two columns (text and Hate.speech).
In De Pelle's dataset, the text is in the data column and the label is in the class column as yes/no.

We'll join the two datasets into one that keeps only the text and a 0/1 hate_speech label.
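
On toy data, the target schema looks like this. The two small frames below are invented stand-ins for the real datasets; only the column names follow the notebook, and renaming by column name is one possible approach rather than the only one.

```python
import pandas as pd

# Invented rows that mimic the two source layouts
toy_fortuna = pd.DataFrame(
    {"text": ["a", "b"], "Hate.speech": [0, 1], "Sexism": [0, 1]}
)
toy_depelle = pd.DataFrame({"class": ["yes", "no"], "data": ["c", "d"]})

# Fortuna layout: keep the text and the binary flag, then rename the flag
left = toy_fortuna[["text", "Hate.speech"]].rename(
    columns={"Hate.speech": "hate_speech"}
)

# De Pelle layout: rename by column name, encode yes/no as 1/0, reorder
right = toy_depelle.rename(columns={"data": "text", "class": "hate_speech"})
right["hate_speech"] = right["hate_speech"].map({"yes": 1, "no": 0})
right = right[["text", "hate_speech"]]

combined = pd.concat([left, right], ignore_index=True)
print(combined)
```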

In [6]:
dataset_fortuna = dataset_fortuna[['text', 'Hate.speech']]
dataset_fortuna.columns = ['text', 'hate_speech']

In [7]:
dataset_fortuna

Unnamed: 0,text,hate_speech
0,"""não come mel, morde marimbondo""",0
1,"não tem pinto, tem orgulho !",0
2,Não vê essa merda de Crepúsculo! Pra isso temo...,0
3,"não da tapa na bundinha, da cotovelada nas cos...",0
4,o diminutivo INHO não acompanha a trajetória d...,1
...,...,...
5663,Na minha sala só tem viado e sapatão e a cois...,1
5664,PARABENS SAPATÃO SDDS @attomiter https://t.co/...,1
5665,RT @toquedeveludo: Agora um poema:\r\nEu sou s...,1
5666,O mundo das sapatao é mais ligado do que eu im...,1


### Part 2.2: Convert dataset_depelle_3 to the same format as dataset_fortuna

In [8]:
# Put the text column first so the layout matches dataset_fortuna;
# .copy() avoids a SettingWithCopyWarning on the assignment below
dataset_depelle_3 = dataset_depelle_3[['data', 'class']].copy()
# Encode the yes/no labels as 1/0
dataset_depelle_3['class'] = dataset_depelle_3['class'].map({'yes': 1, 'no': 0})
dataset_depelle_3.columns = ['text', 'hate_speech']
dataset_depelle_3

Unnamed: 0,text,hate_speech
0,'Votaram no PEZAO Agora tomem no CZAO',1
1,'cuidado com a poupanca pessoal Lembram o que ...,0
2,'Sabe o que eu acho engracado os nossos govern...,0
3,'Podiam retirar dos lucros dos bancos ',0
4,'CADE O GALVAO PRA NARRAR AGORA FALIIIIUUUUU...,0
...,...,...
1028,'Cruz so tem agilidade mesmo poder de nocaute ...,0
1029,'Meus caros amigos enigmaticosNao deveriam com...,0
1030,'Ele chamava pra atras da escola e sentava nos...,1
1031,'Jhalim Rabei ate fiquei assustado comecei a l...,0
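
One thing to watch when encoding labels with Series.map and a dictionary: any value that is not a key in the dictionary becomes NaN instead of raising an error. A quick check along these lines, shown here on toy labels rather than the real data, would surface such cases before the datasets are merged.

```python
import pandas as pd

# "Yes" (capitalised) is a deliberately unexpected value
labels = pd.Series(["yes", "no", "Yes"])
encoded = labels.map({"yes": 1, "no": 0})

# Unmapped values come back as NaN, which also makes the column float
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
print(encoded.dtype)
```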


### Part 2.3: Bringing it all together and saving all the work

In [9]:
full_dataset_3 = pd.concat([dataset_fortuna, dataset_depelle_3], ignore_index=True)
full_dataset_3

Unnamed: 0,text,hate_speech
0,"""não come mel, morde marimbondo""",0
1,"não tem pinto, tem orgulho !",0
2,Não vê essa merda de Crepúsculo! Pra isso temo...,0
3,"não da tapa na bundinha, da cotovelada nas cos...",0
4,o diminutivo INHO não acompanha a trajetória d...,1
...,...,...
6696,'Cruz so tem agilidade mesmo poder de nocaute ...,0
6697,'Meus caros amigos enigmaticosNao deveriam com...,0
6698,'Ele chamava pra atras da escola e sentava nos...,1
6699,'Jhalim Rabei ate fiquei assustado comecei a l...,0
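
The ignore_index=True argument is what gives the combined frame one continuous index. A small sketch with invented frames shows the difference: without it, each frame keeps its own labels and the index contains duplicates.

```python
import pandas as pd

first = pd.DataFrame({"text": ["a", "b"], "hate_speech": [0, 1]})
second = pd.DataFrame({"text": ["c", "d"], "hate_speech": [1, 0]})

# Default behaviour: the original row labels are kept, so 0 and 1 repeat
kept = pd.concat([first, second])
# With ignore_index=True the rows are relabelled 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```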


In [10]:
full_dataset_3.isnull().sum()

text           0
hate_speech    0
dtype: int64

In [11]:
# Forward slashes work on Windows, Linux and macOS alike
full_dataset_3.to_csv('datasets/full_dataset_3.csv', index=False)
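
Because the file is written with index=False, it holds only the two data columns, so reading it back should reproduce the frame. A minimal round-trip sketch on toy data, using pathlib so the path separator is handled per platform; the temporary folder is a stand-in for the datasets folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"text": ["a", "b"], "hate_speech": [0, 1]})

# A temporary folder stands in for the notebook's datasets folder
out_dir = Path(tempfile.mkdtemp()) / "datasets"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "full_dataset_3.csv"

# Write without the index column, then read the file back
toy.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```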