# Building Datasets

In this notebook we construct positive and negative datasets from [Wiki-Detox's](https://meta.wikimedia.org/wiki/Research:Detox) labeled [toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973).

We will use `pandas` to read and query the data, then split it into two files: `toxic_comments.csv` (positives) and `non_toxic_comments.csv` (negatives).

In [1]:
import pandas as pd

In [2]:
comments = pd.read_csv('datasets/toxicity_annotated_comments.tsv', delimiter='\t', index_col='rev_id', encoding='utf-8')

In [3]:
comments.head()

rev_id,comment,year,logged_in,ns,sample,split
2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train
4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train
8953.0,Elected or Electoral? JHK,2002,False,article,random,test
26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train
28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test


In [4]:
annotations = pd.read_csv('datasets/toxicity_annotations.tsv', delimiter='\t', encoding='utf-8')

In [5]:
annotations.head()

,rev_id,worker_id,toxicity,toxicity_score
0,2232.0,723,0,0.0
1,2232.0,4000,0,0.0
2,2232.0,3989,0,1.0
3,2232.0,3341,0,0.0
4,2232.0,1574,0,1.0


In [6]:
# Label a comment toxic when a strict majority of its annotators marked it toxic.
comment_annotations = annotations.groupby('rev_id')['toxicity'].mean() > .5

In [7]:
comment_annotations.head()

rev_id
2232.0     False
4216.0     False
8953.0     False
26547.0    False
28959.0    False
Name: toxicity, dtype: bool
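
The majority-vote labeling in cell 6 can be illustrated on a toy frame. This is a minimal sketch: the `toy` values below are made up for illustration and are not taken from the real annotations file.

```python
import pandas as pd

# Toy annotations (hypothetical values, not from the real dataset).
toy = pd.DataFrame({
    'rev_id':   [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0],
    'toxicity': [1,   1,   0,   0,   0,   1,   1,   0],
})

# Fraction of annotators who marked each comment as toxic.
toxic_fraction = toy.groupby('rev_id')['toxicity'].mean()

# Strict majority: an even split (rev_id 3.0) is labeled non-toxic.
toy_labels = toxic_fraction > .5
print(toy_labels)
```

Because the comparison is strict, a comment whose annotators are evenly split ends up in the non-toxic set.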

In [8]:
assert len(comments) == len(comment_annotations)

In [9]:
comments['toxic'] = comment_annotations

In [10]:
comments.head()

rev_id,comment,year,logged_in,ns,sample,split,toxic
2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train,False
4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train,False
8953.0,Elected or Electoral? JHK,2002,False,article,random,test,False
26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train,False
28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test,False
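
Assigning a Series to a DataFrame column aligns on the index, so the length check in cell 8 would not by itself catch mismatched `rev_id` values. The sketch below uses hypothetical toy data to show how a missing label surfaces as `NaN` and how one might test for it.

```python
import pandas as pd

# Toy comments indexed by rev_id (hypothetical values).
toy_comments = pd.DataFrame(
    {'comment': ['a', 'b', 'c']},
    index=pd.Index([10.0, 20.0, 30.0], name='rev_id'),
)

# Labels in a different order, with rev_id 20.0 missing.
toy_labels = pd.Series(
    [True, False],
    index=pd.Index([30.0, 10.0], name='rev_id'),
)

# Assignment matches rows by rev_id, not by position.
toy_comments['toxic'] = toy_labels

# Rows without a matching label are NaN; counting them is a stricter
# check than comparing lengths alone.
missing = int(toy_comments['toxic'].isna().sum())
print(missing)
```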


In [11]:
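# Replace the dataset's placeholder tokens with spaces. regex=True is needed in
# pandas >= 2.0, where str.replace treats the pattern as a literal string by default.
comments['comment'] = comments['comment'].str.replace('NEWLINE_TOKEN|TAB_TOKEN', ' ', regex=True)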

In [12]:
comments.head()

rev_id,comment,year,logged_in,ns,sample,split,toxic
2232.0,This: :One can make an analogy in mathematical...,2002,True,article,random,train,False
4216.0,` :Clarification for you (and Zundark's righ...,2002,True,user,random,train,False
8953.0,Elected or Electoral? JHK,2002,False,article,random,test,False
26547.0,`This is such a fun entry. Devotchka I once...,2002,True,article,random,train,False
28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test,False
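
The token cleanup in cell 11 relies on the pattern being read as a regular expression (`A|B` meaning "either token"). A small sketch with made-up strings, passing `regex=True` explicitly so the behavior does not depend on the installed pandas version:

```python
import pandas as pd

# Hypothetical strings containing the dataset's placeholder tokens.
toy_text = pd.Series(['fooNEWLINE_TOKENbar', 'aTAB_TOKENb', 'plain'])

# With regex=True the alternation matches either token.
cleaned = toy_text.str.replace('NEWLINE_TOKEN|TAB_TOKEN', ' ', regex=True)

# With regex=False the whole pattern is searched for literally,
# so nothing is replaced.
untouched = toy_text.str.replace('NEWLINE_TOKEN|TAB_TOKEN', ' ', regex=False)
print(cleaned.tolist())
```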


In [13]:
comments.query('toxic').head()

rev_id,comment,year,logged_in,ns,sample,split,toxic
597212.0,"` After the wasted bit on his sexuality, I ha...",2003,False,article,random,test,True
1266286.0,"Erik, for crying out loud. You legally can...",2003,True,user,random,test,True
1502668.0,"BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOB...",2003,True,user,blocked,test,True
2187425.0,```Nazi filth`` is impolite `,2004,True,article,random,train,True
3129678.0,"Prior to Quickpolls, he would have been perma...",2004,True,user,random,train,True


In [14]:
toxic = comments.query('toxic')
non_toxic = comments.query('~toxic')

In [15]:
assert len(toxic) + len(non_toxic) == len(comments)

In [16]:
toxic.to_csv('datasets/toxic_comments.csv', encoding='utf-8')
non_toxic.to_csv('datasets/non_toxic_comments.csv', encoding='utf-8')
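
One way to sanity-check the export is to read a file back and compare it with the frame that was written. The sketch below is an illustration only: it uses an in-memory buffer and hypothetical toy rows instead of the real `datasets/` files.

```python
import io

import pandas as pd

# Toy frame with commas and quotes in the text (hypothetical values).
toy = pd.DataFrame(
    {'comment': ['hello, world', 'line "quoted"'], 'toxic': [False, True]},
    index=pd.Index([1.0, 2.0], name='rev_id'),
)

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, restoring rev_id as the index.
restored = pd.read_csv(buffer, index_col='rev_id')
print(restored)
```

Comments containing commas or quotes survive the round trip because `to_csv` quotes such fields by default.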