# Abusive Language Online

Exploratory data analysis of the **[Wikipedia Talk Datasets](https://figshare.com/projects/Wikipedia_Talk/16731)** collection, which contains three types of annotated data:
  * aggression
  * attack
  * toxicity

**Dataset layout:**
```bash
$ tree data
data
├── aggression_annotated_comments.tsv
├── aggression_annotations.tsv
├── attack_annotated_comments.tsv
├── attack_annotations.tsv
├── toxicity_annotated_comments.tsv
└── toxicity_annotations.tsv

0 directories, 6 files
```

In [1]:
import os
import pandas as pd

# Paths to the six TSV files listed in the dataset layout.
DATA_PATH = 'data'
DATA_AGGRESSION_COMMENTS = os.path.join(DATA_PATH, 'aggression_annotated_comments.tsv')
DATA_AGGRESSION_ANNOTATIONS = os.path.join(DATA_PATH, 'aggression_annotations.tsv')
DATA_ATTACK_COMMENTS = os.path.join(DATA_PATH, 'attack_annotated_comments.tsv')
DATA_ATTACK_ANNOTATIONS = os.path.join(DATA_PATH, 'attack_annotations.tsv')
DATA_TOXIC_COMMENTS = os.path.join(DATA_PATH, 'toxicity_annotated_comments.tsv')
DATA_TOXIC_ANNOTATIONS = os.path.join(DATA_PATH, 'toxicity_annotations.tsv')

In [2]:
aggression_comments_df = pd.read_csv(DATA_AGGRESSION_COMMENTS, sep='\t', index_col=0)
aggression_annotations_df = pd.read_csv(DATA_AGGRESSION_ANNOTATIONS, sep='\t', index_col=False)

aggression_annotations_df.head()

,rev_id,worker_id,aggression,aggression_score
0,37675,1362,1.0,-1.0
1,37675,2408,0.0,1.0
2,37675,1493,0.0,0.0
3,37675,1439,0.0,0.0
4,37675,170,0.0,0.0


In [3]:
# Majority vote: a comment counts as aggressive when more than half of its annotators flagged it.
aggression_labels = aggression_annotations_df.groupby('rev_id')['aggression'].mean() > 0.5
len(aggression_labels)

115864
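
The majority-vote step above can be illustrated on a toy frame. This is a minimal sketch with made-up annotation values (the `rev_id` and `worker_id` numbers below are hypothetical, not taken from the dataset); it only shows how the per-comment mean and the `> 0.5` threshold interact.

```python
import pandas as pd

# Toy annotations (hypothetical values): three workers per comment.
annotations = pd.DataFrame({
    "rev_id": [1, 1, 1, 2, 2, 2],
    "worker_id": [10, 11, 12, 10, 11, 12],
    "aggression": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Share of annotators who flagged each comment as aggressive.
share = annotations.groupby("rev_id")["aggression"].mean()

# Strict majority: a share of exactly 0.5 is treated as not aggressive.
labels = (share > 0.5).astype(int)
print(labels.to_dict())  # {1: 1, 2: 0}
```

Because the comparison is strict, a comment with an even split between annotators ends up in the negative class.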

In [4]:
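# Both objects are keyed by rev_id, so the assignment aligns labels to comments by index.
aggression_comments_df['label'] = aggression_labels.astype(int)
# Recode non-aggressive comments from 0 to 100; aggressive comments keep label 1.
aggression_comments_df.loc[aggression_comments_df['label'] == 0, 'label'] = 100
aggression_comments_df[aggression_comments_df['label'] == 1].head()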

rev_id,comment,year,logged_in,ns,sample,split,label
694840,`NEWLINE_TOKENNEWLINE_TOKEN:Click on my ``Anno...,2003,True,user,random,train,1
801279,Iraq is not good ===NEWLINE_TOKENNEWLINE_TO...,2003,True,article,random,train,1
1450441,`NEWLINE_TOKENNEWLINE_TOKENBuddha - ``Some sug...,2003,True,article,random,train,1
2702703,NEWLINE_TOKENNEWLINE_TOKEN____NEWLINE_TOKENfuc...,2004,True,user,random,train,1
4632658,"i have a dick, its bigger than yours! hahaha",2004,True,article,blocked,train,1
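
The comments in the output above still carry the literal `NEWLINE_TOKEN` placeholder. One possible cleaning step, not part of this notebook, is sketched below; the `comment_clean` column name and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy comments (hypothetical rows) that mimic the NEWLINE_TOKEN placeholder.
comments = pd.DataFrame(
    {"comment": ["NEWLINE_TOKENNEWLINE_TOKEN== renault ==NEWLINE_TOKENhello",
                 "no tokens here"]},
    index=pd.Index([1, 2], name="rev_id"),
)

comments["comment_clean"] = (
    comments["comment"]
    .str.replace("NEWLINE_TOKEN", " ", regex=False)  # placeholder -> space
    .str.replace(r"\s+", " ", regex=True)            # collapse runs of whitespace
    .str.strip()                                      # trim leading/trailing space
)
print(comments["comment_clean"].tolist())
```

Replacing the token with a space rather than an empty string keeps words on either side of a line break from being fused together.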


In [5]:
import sys
sys.path.insert(0, '/home/daedalus/abusive-language-online/alo')
from alo import dataset

In [6]:
%%time

wiki = dataset.WikiTalk()
data = wiki.load(stacked=False)

CPU times: user 29 µs, sys: 2 µs, total: 31 µs
Wall time: 16.5 µs


In [7]:
a, b, c = data

In [8]:
print(len(a), len(b), len(c), sum([len(a), len(b), len(c)]))

115864 115864 159686 391414


In [9]:
%%time

wiki = dataset.WikiTalk()
data = wiki.load(stacked=True)

CPU times: user 5.86 s, sys: 348 ms, total: 6.21 s
Wall time: 4.2 s


In [10]:
len(data)

391414
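
The stacked length equals the sum of the three per-type lengths (115864 + 115864 + 159686 = 391414), which suggests the frames are simply concatenated. The implementation of `dataset.WikiTalk` is not shown here, so the following is only a sketch of how such stacking could work. The `stack` helper, the `is_abusive` column, and the use of 0 for non-abusive rows are all assumptions; the codes 1, 2 and 3 mirror the `label` values that appear with each `abusive_type` in the stacked frame.

```python
import pandas as pd

# Label codes as they appear in the stacked frame (attack=1, aggression=2, toxicity=3).
LABEL_CODES = {"attack": 1, "aggression": 2, "toxicity": 3}


def stack(frames):
    """Concatenate per-type frames, tagging each row with its type and code.

    `frames` maps an abusive type to a DataFrame that has a boolean
    `is_abusive` column (a hypothetical name used for this sketch).
    """
    parts = []
    for abusive_type, df in frames.items():
        part = df.copy()
        part["abusive_type"] = abusive_type
        # Abusive rows get the type's code; other rows get 0 (placeholder assumption).
        part["label"] = part.pop("is_abusive").astype(int) * LABEL_CODES[abusive_type]
        parts.append(part)
    return pd.concat(parts)


# Toy input (hypothetical rows), indexed by a stand-in for rev_id.
toy = {
    "attack": pd.DataFrame({"comment": ["a", "b"], "is_abusive": [True, False]}, index=[1, 2]),
    "aggression": pd.DataFrame({"comment": ["a", "b"], "is_abusive": [True, True]}, index=[1, 2]),
    "toxicity": pd.DataFrame({"comment": ["c"], "is_abusive": [True]}, index=[3]),
}
stacked = stack(toy)
print(len(stacked))               # 5
print(stacked["label"].tolist())  # [1, 0, 2, 2, 3]
```

With this layout the same `rev_id` can appear more than once, once per abusive type, so the index of the stacked frame is not unique.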

In [11]:
data[data['label'] == 1].head()

rev_id,comment,year,logged_in,ns,sample,split,abusive_type,label
801279.0,Iraq is not good ===NEWLINE_TOKENNEWLINE_TO...,2003,False,article,random,train,attack,1
2702703.0,NEWLINE_TOKENNEWLINE_TOKEN____NEWLINE_TOKENfuc...,2004,False,user,random,train,attack,1
4632658.0,"i have a dick, its bigger than yours! hahaha",2004,False,article,blocked,train,attack,1
6545332.0,NEWLINE_TOKENNEWLINE_TOKEN== renault ==NEWLINE...,2004,True,user,blocked,train,attack,1
6545351.0,NEWLINE_TOKENNEWLINE_TOKEN== renault ==NEWLINE...,2004,True,user,blocked,test,attack,1


In [12]:
data[data['label'] == 2].head()

rev_id,comment,year,logged_in,ns,sample,split,abusive_type,label
694840.0,`NEWLINE_TOKENNEWLINE_TOKEN:Click on my ``Anno...,2003,True,user,random,train,aggression,2
801279.0,Iraq is not good ===NEWLINE_TOKENNEWLINE_TO...,2003,True,article,random,train,aggression,2
1450441.0,`NEWLINE_TOKENNEWLINE_TOKENBuddha - ``Some sug...,2003,True,article,random,train,aggression,2
2702703.0,NEWLINE_TOKENNEWLINE_TOKEN____NEWLINE_TOKENfuc...,2004,True,user,random,train,aggression,2
4632658.0,"i have a dick, its bigger than yours! hahaha",2004,True,article,blocked,train,aggression,2


In [13]:
data[data['label'] == 3].head()

rev_id,comment,year,logged_in,ns,sample,split,abusive_type,label
597212.0,`NEWLINE_TOKENNEWLINE_TOKENAfter the wasted bi...,2003,False,article,random,test,toxicity,3
1266286.0,NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE...,2003,True,user,random,test,toxicity,3
1502668.0,"BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOB...",2003,True,user,blocked,test,toxicity,3
2187425.0,```Nazi filth`` is impolite NEWLINE_TOKENNEWL...,2004,True,article,random,train,toxicity,3
3129678.0,"Prior to Quickpolls, he would have been perma...",2004,True,user,random,train,toxicity,3
