# Building Datasets

In this notebook we construct positive and negative datasets from [Wiki-Detox's](https://meta.wikimedia.org/wiki/Research:Detox) labeled [toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973).

We use `pandas` to read the comments dataset and the comment annotations (moderations) dataset. We then group the moderations by comment and derive a label from the mean of the votes: a comment is considered unhealthy when more than 50% of its moderators marked it as toxic.

In [1]:
import os
import urllib.request
import pandas as pd

In [2]:
# download the toxicity dataset (comments and annotations)
comments_url = 'https://ndownloader.figshare.com/files/7394542'
annotations_url = 'https://ndownloader.figshare.com/files/7394539'

comments_file = 'datasets/toxicity_annoated_comments.tsv'
annotations_file = 'datasets/toxicity_annotations.tsv'

# make sure the target directory exists before downloading
os.makedirs('datasets', exist_ok=True)

# avoid re-downloading if this has already been run
# (urlretrieve lives in urllib.request on Python 3)
if not os.path.isfile(comments_file):
    urllib.request.urlretrieve(comments_url, comments_file)
if not os.path.isfile(annotations_file):
    urllib.request.urlretrieve(annotations_url, annotations_file)

In [3]:
# comments dataset
comments = pd.read_csv(comments_file, delimiter='\t', index_col='rev_id', encoding='utf-8')

In [4]:
comments.head()

rev_id,comment,year,logged_in,ns,sample,split
2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train
4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train
8953.0,Elected or Electoral? JHK,2002,False,article,random,test
26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train
28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test


In [5]:
# moderations dataset
moderations = pd.read_csv(annotations_file, delimiter='\t', encoding='utf-8')

In [6]:
moderations.head()

,rev_id,worker_id,toxicity,toxicity_score
0,2232.0,723,0,0.0
1,2232.0,4000,0,0.0
2,2232.0,3989,0,1.0
3,2232.0,3341,0,0.0
4,2232.0,1574,0,1.0


In [7]:
# True if more than 50% of the moderators voted yes, False otherwise
comment_labels = moderations.groupby('rev_id')['toxicity'].mean() > .5

In [8]:
comment_labels.head()

rev_id
2232.0     False
4216.0     False
8953.0     False
26547.0    False
28959.0    False
Name: toxicity, dtype: bool
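
To make the strict-majority rule concrete, here is a minimal sketch on made-up annotations. The `rev_id` values and votes are invented for illustration and are not taken from the dataset. It assumes the same `groupby(...).mean() > .5` expression used above, and shows that a comment with exactly half of its votes marked toxic is labeled `False`, because the comparison is strictly greater than 0.5.

```python
import pandas as pd

# Toy annotations (hypothetical values, not from the Wiki-Detox data)
toy_moderations = pd.DataFrame({
    'rev_id':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'toxicity': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Fraction of moderators who marked each comment as toxic
toy_fraction = toy_moderations.groupby('rev_id')['toxicity'].mean()

# Strict majority: 2/3 -> True, 1/2 -> False (tie), 1/4 -> False
toy_labels = toy_fraction > .5
print(toy_labels)
```

If ties should count as unhealthy instead, `>=` would be the comparison to use; the notebook keeps the strict version.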

In [9]:
assert len(comments) == len(comment_labels)

In [10]:
comments['label'] = comment_labels
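
Assigning a Series to a DataFrame column matches rows by index (`rev_id` here), not by position. The length assertion above does not by itself guarantee that both objects contain the same ids, so a stricter check may be worth adding. The following is a small sketch with hypothetical ids showing how a missing label turns into `NaN` rather than raising an error.

```python
import pandas as pd

# Toy frames (hypothetical ids); the labels are deliberately out of order
toy_comments = pd.DataFrame(
    {'comment': ['a', 'b', 'c']},
    index=pd.Index([10, 20, 30], name='rev_id'),
)
toy_labels = pd.Series(
    [True, False],
    index=pd.Index([20, 10], name='rev_id'),
)

# Assignment aligns on the index, not on row position
toy_comments['label'] = toy_labels

# rev_id 30 has no label, so it is filled with NaN
missing = toy_comments['label'].isna()
print(toy_comments)
print('unlabeled rows:', int(missing.sum()))
```

On the real data, something like `assert not comments['label'].isna().any()` after the assignment would confirm that every comment received a label.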

In [11]:
comments.head()

rev_id,comment,year,logged_in,ns,sample,split,label
2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train,False
4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train,False
8953.0,Elected or Electoral? JHK,2002,False,article,random,test,False
26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train,False
28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test,False


In [12]:
healthy = comments.query('~label')
unhealthy = comments.query('label')
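
The two `query` calls select rows by the boolean `label` column. As a sketch on a toy frame (hypothetical ids and comments), the same split can be written with boolean-mask indexing, which some readers may find more familiar:

```python
import pandas as pd

# Toy labeled comments (hypothetical values)
toy = pd.DataFrame(
    {'comment': ['a', 'b', 'c'], 'label': [False, True, False]},
    index=pd.Index([10, 20, 30], name='rev_id'),
)

# The query form used in this notebook
toy_healthy = toy.query('~label')
toy_unhealthy = toy.query('label')

# Equivalent boolean-mask form
mask_healthy = toy[~toy['label']]
mask_unhealthy = toy[toy['label']]

print(toy_healthy.equals(mask_healthy), toy_unhealthy.equals(mask_unhealthy))
```

Both forms rely on `label` being a true boolean column with no missing values.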

In [13]:
healthy.head()

rev_id,comment,year,logged_in,ns,sample,split,label
2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train,False
4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train,False
8953.0,Elected or Electoral? JHK,2002,False,article,random,test,False
26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train,False
28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test,False


In [14]:
unhealthy.head()

rev_id,comment,year,logged_in,ns,sample,split,label
597212.0,`NEWLINE_TOKENNEWLINE_TOKENAfter the wasted bi...,2003,False,article,random,test,True
1266286.0,NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE...,2003,True,user,random,test,True
1502668.0,"BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOB...",2003,True,user,blocked,test,True
2187425.0,```Nazi filth`` is impolite NEWLINE_TOKENNEWL...,2004,True,article,random,train,True
3129678.0,"Prior to Quickpolls, he would have been perma...",2004,True,user,random,train,True


In [15]:
assert len(healthy) + len(unhealthy) == len(comments)

In [16]:
comments.to_csv('datasets/labeled_comments.csv', encoding='utf-8')
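
A later notebook would presumably load this file again. The sketch below round-trips a toy frame through an in-memory buffer instead of the real file (the buffer and the toy rows are assumptions made for illustration) to show that passing `index_col='rev_id'` restores the index and that the boolean labels survive the trip.

```python
import io
import pandas as pd

# Toy labeled comments (hypothetical rows mimicking the real columns)
toy = pd.DataFrame(
    {'comment': ['hello', 'go away'], 'label': [False, True]},
    index=pd.Index([2232.0, 597212.0], name='rev_id'),
)

# Write to an in-memory buffer in place of 'datasets/labeled_comments.csv'
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, restoring rev_id as the index
restored = pd.read_csv(buffer, index_col='rev_id')
print(restored)
```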