### Bibliography
https://github.com/fmplaza/OffendES

### Dataset summary

Focusing on young influencers from the well-known social platforms Twitter, Instagram, and YouTube, the Department of Computer Science, Advanced Studies Center in ICT (CEATIC) collected a corpus of Spanish comments manually labeled with pre-defined offensive categories. From the full corpus, 30,416 posts were selected for public release. The posts are labeled with the following categories:

- Offensive, the target is a person (**OFP**). Offensive text targeting a specific individual.
- Offensive, the target is a group of people or collective (**OFG**). Offensive text targeting a group of people belonging to the same ethnic group, gender or sexual orientation, political ideology, religious belief, or other common characteristics.
- Non-offensive, but with expletive language (**NOE**). A text that contains rude words, blasphemes, or swearwords but without the aim of offending, and usually with a positive connotation.
- Non-offensive (**NO**). Text that is neither offensive nor contains expletive language.
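
For downstream code it can help to keep the four label codes and their meanings in one place. The following is a minimal sketch; the names `LABEL_DESCRIPTIONS` and `is_offensive` are hypothetical helpers introduced here for illustration and are not part of the dataset or its repository.

```python
# Hypothetical helper: maps each label code listed above to its short description.
LABEL_DESCRIPTIONS = {
    "OFP": "Offensive, the target is a person",
    "OFG": "Offensive, the target is a group of people or collective",
    "NOE": "Non-offensive, but with expletive language",
    "NO": "Non-offensive",
}


def is_offensive(label: str) -> bool:
    """Return True for the two offensive categories (OFP and OFG)."""
    if label not in LABEL_DESCRIPTIONS:
        raise ValueError(f"Unknown label: {label!r}")
    return label in {"OFP", "OFG"}


for code, description in LABEL_DESCRIPTIONS.items():
    print(f"{code}: {description} (offensive: {is_offensive(code)})")
```

Collapsing the four classes into a binary offensive/non-offensive flag is only one possible simplification; whether it is appropriate depends on the task.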

### Data instances

Each instance contains a string for the comment id, the influencer, the comment text, the offensive label, the influencer's gender, and the social media platform. More examples appear in the dataset previews later in this document.

{'comment_id': '8003',
 'influencer': 'dalas',
 'comment': 'Estupido aburrido',
 'label': 'NO',
 'influencer_gender': 'man',
 'media': 'youtube'
 }

### Data fields

- comment_id: a string to identify the comment
- influencer: a string containing the influencer associated with the comment
- comment: a string containing the text of the comment
- label: a string containing the offensive gold label
- influencer_gender: a string containing the gender of the influencer
- media: a string containing the social media platform from which the comment was retrieved
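
The fields above can be mirrored in a small typed record, which makes it easier to catch malformed rows early. This is only a sketch under stated assumptions: the class name `Comment` and the constant `VALID_LABELS` are hypothetical, and only the label is validated because the four label codes are the only value set this document defines.

```python
from dataclasses import dataclass

# The four gold labels described in the dataset summary.
VALID_LABELS = {"OFP", "OFG", "NOE", "NO"}


@dataclass(frozen=True)
class Comment:
    """Hypothetical record type mirroring the six documented fields."""

    comment_id: str
    influencer: str
    comment: str
    label: str
    influencer_gender: str
    media: str

    def __post_init__(self):
        # Reject rows whose label is not one of the documented codes.
        if self.label not in VALID_LABELS:
            raise ValueError(f"Unexpected label: {self.label!r}")


# The example instance shown under "Data instances".
example = Comment(
    comment_id="8003",
    influencer="dalas",
    comment="Estupido aburrido",
    label="NO",
    influencer_gender="man",
    media="youtube",
)
print(example)
```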

### Reading TSV files

In [1]:
import pandas as pd
from tabulate import tabulate  # to print prettier, formatted tables

# Read each TSV split with pandas.read_csv(), using a tab separator.
# Returns the test, training, and validation DataFrames, in that order.
def read_csv():
    test = pd.read_csv('test_set.tsv', sep='\t')
    train = pd.read_csv('training_set.tsv', sep='\t')
    val = pd.read_csv('dev_set.tsv', sep='\t')
    return test,train,val

In [2]:
# Reading the .tsv files
test, train, validation = read_csv()

In [3]:
# Counting the number of occurrences of each label in the different DataFrames.
test_counts = test['label'].value_counts()
train_counts = train['label'].value_counts()
validation_counts = validation['label'].value_counts()

# Showing the three counts concatenated side by side with the tabulate library
occurrences = pd.concat([train_counts, test_counts, validation_counts], axis=1)
headers = ['class', 'train', 'test', 'validation']
print(tabulate(occurrences, headers, tablefmt="rounded_grid"))

╭─────────┬─────────┬────────┬──────────────╮
│ class   │   train │   test │   validation │
├─────────┼─────────┼────────┼──────────────┤
│ NO      │   13212 │   9651 │           64 │
├─────────┼─────────┼────────┼──────────────┤
│ OFP     │    2051 │   2340 │           22 │
├─────────┼─────────┼────────┼──────────────┤
│ NOE     │    1235 │   1404 │           10 │
├─────────┼─────────┼────────┼──────────────┤
│ OFG     │     212 │    211 │            4 │
╰─────────┴─────────┴────────┴──────────────╯
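
The raw counts are easier to compare across splits as proportions. The sketch below copies the numbers from the table above into a plain dictionary (so it runs without the TSV files) and prints each class's share per split; the helper name `class_shares` is hypothetical.

```python
# Counts copied from the table above (rows: class, columns: split).
counts = {
    "train": {"NO": 13212, "OFP": 2051, "NOE": 1235, "OFG": 212},
    "test": {"NO": 9651, "OFP": 2340, "NOE": 1404, "OFG": 211},
    "validation": {"NO": 64, "OFP": 22, "NOE": 10, "OFG": 4},
}


def class_shares(split_counts):
    """Return each label's fraction of the comments in one split."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


for split, split_counts in counts.items():
    total = sum(split_counts.values())
    shares = class_shares(split_counts)
    formatted = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{split} ({total} comments): {formatted}")

# Sum of all three splits.
grand_total = sum(sum(split_counts.values()) for split_counts in counts.values())
print(f"total: {grand_total}")
```

The three splits add up to 30,416 comments, which matches the size of the public release stated in the dataset summary. In every split the NO class is by far the largest and OFG the smallest, so the classes are strongly imbalanced.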


### Dataset creation
### Source data

Twitter, YouTube, and Instagram.

### Who are the annotators?


Amazon Mechanical Turk workers.

### Taking a first look at the entire datasets

In [4]:
test

| Unnamed: 0 | comment_id | comment | influencer | influencer_gender | media | label |
|---|---|---|---|---|---|---|
| 0 | 54745 | Lacasito moreno | wismichu | man | instagram | NO |
| 1 | 5595 | Yo pensaba que celopan era gay | miare | woman | youtube | NO |
| 2 | 53477 | la bruja del 77 | miare | woman | instagram | NO |
| 3 | 7385 | Se va a liar bien gorda | wildhater | man | youtube | NO |
| 4 | 551 | Y Ami que chucha me importa boliviano hijodeputa | dalas | man | twitter | OFP |
| ... | ... | ... | ... | ... | ... | ... |
| 13601 | 34118 | No es por nada pero te vas para el infierno a ... | dulceida | woman | youtube | NO |
| 13602 | 1878 | Dalas: "¡No soy un psicópata!" Also Dalas: \*L... | dalas | man | twitter | NO |
| 13603 | 32921 | A mi esa señora me da puto mal royo desde que ... | soyunapringada | woman | youtube | NOE |
| 13604 | 8750 | En todas las familias hay un dalas 😂 creo que ... | dalas | man | youtube | NO |


In [5]:
train

| Unnamed: 0 | comment_id | comment | influencer | influencer_gender | media | label |
|---|---|---|---|---|---|---|
| 0 | 52564 | En vez de la magia de mi melena, la magia de m... | dalas | man | instagram | NO |
| 1 | 32984 | A ver, los milenials y la gente normal necesit... | soyunapringada | woman | youtube | NO |
| 2 | 58447 | Me encanta todo el contenido que haces se nota... | wildhater | man | instagram | NO |
| 3 | 10341 | a Laura sige así que vales mucho más que 10 o ... | lauraescane | woman | youtube | NO |
| 4 | 53087 | Y si no mes gusta Dalas, que hacen aquí,lárgue... | dalas | man | instagram | NO |
| ... | ... | ... | ... | ... | ... | ... |
| 16705 | 57470 | Hijo de tu puta madre estoy mamadisimo 😎 | dalas | man | instagram | OFP |
| 16706 | 35 | yo que hace 4 años lo veía, ahora me doy cuent... | dalas | man | twitter | OFP |
| 16707 | 18564 | Esta re blanco el wismi | wismichu | man | youtube | OFP |
| 16708 | 46485 | algo que no veo en esa botella rosada es que s... | windygirk | woman | youtube | OFP |


In [6]:
validation

| Unnamed: 0 | comment_id | comment | influencer | influencer_gender | media | label |
|---|---|---|---|---|---|---|
| 0 | 27341 | Me encanta el videooo porciento aidii he subid... | dulceida | woman | youtube | NO |
| 1 | 21310 | Ropa cara?veo dulceida shop, Zara.. y de todas... | lauraescane | woman | youtube | NO |
| 2 | 23809 | Y la perra seguia y seguia.jpg :v | windygirk | woman | youtube | OFP |
| 3 | 14532 | Malditas drogas | wismichu | man | youtube | NO |
| 4 | 51651 | perdona el spam , es la primera vez que trato ... | dalas | man | instagram | NO |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 58740 | Me alegra dalas . Soy nuevo seguidor y me aleg... | dalas | man | instagram | NO |
| 96 | 6796 | En resumen Dalas le callo la puta boca. | nauterplay | man | youtube | NOE |
| 97 | 53415 | Genial!! Que te lleven a Alcaser, a ver las Ca... | soyunapringada | woman | instagram | NO |
| 98 | 49449 | Hola dalas quiero que sepáis que tienes mi tot... | dalas | man | instagram | NO |
