# Bias in Wikipedia hostile speech analysis

The purpose of this notebook is to identify potential biases in human-annotated data and to describe the implications of those biases. <br>
To do so, it analyzes several hostile speech datasets that Wikimedia Research and Jigsaw use for labelling.

## Table of Contents

1. [Import packages and datasets](#Import-packages-and-datasets)
2. [Exploratory data analysis](#Exploratory-data-analysis)
3. [Analysis #1: Words most associated with hostile speech](#Analysis-1:-words-most-associated-with-hostile-speech)
   1. [Research question](#Research-question:)
   2. [Generate toxicity label](#Generate-binary-toxicity-label)
   3. [Clean the comment column](#Clean-the-comment-column)
   4. [Join label with comments](#Join-toxicity-label-and-comments)
   5. [Count the most frequent words in toxic posts using sklearn's CountVectorizer](#Apply-sklearn's-CountVectorizer-to-get-a-bag-of-token-words-and-a-matrix-of-token-counts)
   6. [Observations for toxic comments](#Observations-of-toxic-comments)
   7. [Rerun the analysis on aggressive comments](#Rerun-the-analysis-for-aggressive-comments)
   8. [Observations for aggressive comments](#Observations-of-aggressive-comments)
   9. [Implications of analysis 1](#Implications-of-analysis-1)
   
4. [Analysis #2: Demographic of crowdsourcing workers](#Analysis-2:-demographic-information-from-the-crowdsourcing-workers-for-toxicity-analysis)
   1. [Research question](#Research-question:-Are-the-annotators-truly-representative-of-the-general-public?)
   2. [Check number of workers](#Check-number-of-unique-workers)
   3. [Check the distribution of demographic info](#Check-the-distribution-of-demographic-columns)
   4. [Observations of demographic analysis](#Observations-of-demographic-analysis)
   5. [Implications of demographic analysis](#Implications-of-demographic-analysis)
5. [Discuss further implications](#Discuss-further-implications)

## Import packages and datasets

In [2]:
import json
import csv
import requests
import os
import pandas as pd
import numpy as np
import datetime
import re
from sklearn.feature_extraction.text import CountVectorizer
from functools import reduce 
pd.set_option('display.max_rows', 100)
%matplotlib inline
import matplotlib.pyplot as plt

ROOT_PATH = os.getcwd()
DATA_PATH = os.path.join(ROOT_PATH, "data")

In [3]:
os.listdir(DATA_PATH)

['toxicity_annotated_comments.tsv',
 'aggression_annotations.tsv',
 'toxicity_annotations.tsv',
 'aggression_annotated_comments.tsv',
 'toxicity_worker_demographics.tsv',
 'aggression_worker_demographics.tsv']

In [4]:
toxic_comments = pd.read_csv(os.path.join(DATA_PATH, 'toxicity_annotated_comments.tsv'), sep = '\t')
toxic_annotations = pd.read_csv(os.path.join(DATA_PATH, 'toxicity_annotations.tsv'), sep = '\t')
toxic_demographics = pd.read_csv(os.path.join(DATA_PATH, 'toxicity_worker_demographics.tsv'), sep = '\t')

agg_comments = pd.read_csv(os.path.join(DATA_PATH, 'aggression_annotated_comments.tsv'), sep = '\t')
agg_annotations = pd.read_csv(os.path.join(DATA_PATH, 'aggression_annotations.tsv'), sep = '\t')
agg_demographics = pd.read_csv(os.path.join(DATA_PATH, 'aggression_worker_demographics.tsv'), sep = '\t')

## Exploratory data analysis

Check what variables are included in each dataset

In [9]:
toxic_comments.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split
0,2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train
1,4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train
2,8953.0,Elected or Electoral? JHK,2002,False,article,random,test
3,26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train
4,28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test


In [11]:
toxic_annotations.head()

Unnamed: 0,rev_id,worker_id,toxicity,toxicity_score
0,2232.0,723,0,0.0
1,2232.0,4000,0,0.0
2,2232.0,3989,0,1.0
3,2232.0,3341,0,0.0
4,2232.0,1574,0,1.0


In [7]:
toxic_demographics.head()

Unnamed: 0,worker_id,gender,english_first_language,age_group,education
0,85,female,0,18-30,bachelors
1,1617,female,0,45-60,bachelors
2,1394,female,0,,bachelors
3,311,male,0,30-45,bachelors
4,1980,male,0,45-60,masters


## Analysis 1: words most associated with hostile speech

### Research question: 
- What are the most common words associated with comments labelled as hostile speech? 
- Are there any differences between the most frequent words for the two types of hostile speech (toxicity and aggression)? 

### Generate binary toxicity label
#### Check the distribution of the toxicity score

In [12]:
toxic_annotations.groupby('rev_id')['toxicity'].mean().describe()

count    159686.000000
mean          0.145049
std           0.253866
min           0.000000
25%           0.000000
50%           0.000000
75%           0.200000
max           1.000000
Name: toxicity, dtype: float64

#### Use 0.5 as the threshold for denoting a comment as toxic

In [17]:
def generate_label(df, key_column, label_column, new_name):
    avg_score = df.groupby(key_column)[label_column].mean().reset_index(name = new_name)
    avg_score["".join([new_name, "_bool"])] = (avg_score[new_name] > 0.5) * 1
    return avg_score

In [18]:
avg_toxicity_score = generate_label(toxic_annotations, 'rev_id', 'toxicity', 'toxic')
avg_toxicity_score

Unnamed: 0,rev_id,toxic,toxic_bool
0,2232.0,0.1,0
1,4216.0,0.0,0
2,8953.0,0.0,0
3,26547.0,0.0,0
4,28959.0,0.2,0
...,...,...,...
159681,699848324.0,0.0,0
159682,699851288.0,0.0,0
159683,699857133.0,0.0,0
159684,699891012.0,0.4,0
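
To make the threshold concrete, here is a minimal sketch on a hypothetical set of annotations (the `toy_annotations` values are invented for illustration and are not from the dataset). It applies the same logic as `generate_label` and shows that, because the comparison is a strict `> 0.5`, a comment on which annotators split evenly is labelled non-toxic.

```python
import pandas as pd

# Hypothetical annotations: three comments, four annotators each.
toy_annotations = pd.DataFrame({
    "rev_id":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "toxicity": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
})

# Same logic as generate_label: average per comment, then a strict > 0.5 cut.
toy_labels = (
    toy_annotations.groupby("rev_id")["toxicity"]
    .mean()
    .reset_index(name="toxic")
)
toy_labels["toxic_bool"] = (toy_labels["toxic"] > 0.5).astype(int)

print(toy_labels)
```

Comment 2 has an average of exactly 0.5 and therefore gets `toxic_bool = 0`. Using `>=` instead would label it toxic, so the choice of operator matters for evenly split comments.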


### Clean the comment column

#### Remove newline, tab tokens, and special characters

In [19]:
def clean_comment_col(df):
    df_copy = df.copy()
    df_copy['comment'] = df_copy['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
    df_copy['comment'] = df_copy['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
    df_copy['comment'] = df_copy['comment'].apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))
    
    return df_copy

In [22]:
toxic_comments_rm_special_char = clean_comment_col(toxic_comments)
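
The effect of the cleaning step can be checked on a single string. The sketch below uses a hypothetical helper `clean_text` that applies the same three replacements as `clean_comment_col` to one made-up comment. It shows one side effect worth keeping in mind: the apostrophe in "don't" is replaced by a space, so the word is split into "don" and "t".

```python
import re

def clean_text(text):
    """Apply the same replacements as clean_comment_col to one string."""
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    # Every run of non-letter characters becomes a single space.
    return re.sub("[^A-Za-z]+", " ", text)

sample = "I don't agree.NEWLINE_TOKENSee rule #3!"  # made-up comment
cleaned = clean_text(sample)

print(cleaned.split())
```

CountVectorizer's default token pattern keeps only tokens of two or more characters, so after this cleaning "t" is dropped and "don" is counted as a word of its own.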

### Join toxicity label and comments

In [23]:
def join_label_with_text(label_df, comment_df, key_column, bool_column):
    combined_df = comment_df.join(label_df[[key_column, bool_column]].set_index(key_column), on = key_column)
    
    return combined_df

In [25]:
comment_with_label = join_label_with_text(avg_toxicity_score, toxic_comments_rm_special_char, 'rev_id', 'toxic_bool')
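
The join above silently leaves `NaN` in the label column for any comment without annotations. As a sketch of one way to check for this, assuming hypothetical toy frames rather than the real data, `DataFrame.merge` can validate key uniqueness and flag unmatched rows:

```python
import pandas as pd

# Hypothetical frames: comment 3 has no label.
toy_comments = pd.DataFrame({"rev_id": [1, 2, 3], "comment": ["a", "b", "c"]})
toy_labels = pd.DataFrame({"rev_id": [1, 2], "toxic_bool": [0, 1]})

merged = toy_comments.merge(
    toy_labels,
    on="rev_id",
    how="left",
    validate="one_to_one",  # raises if rev_id is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

unlabeled = merged[merged["_merge"] == "left_only"]
print(len(unlabeled), "comment(s) without a label")
```

If `unlabeled` is empty on the real data, the join is complete and the later filter on `toxic_bool == 1` does not lose comments by accident.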

### Apply sklearn's CountVectorizer to get a bag of token words and a matrix of token counts

#### Extract rows with label toxic

In [30]:
def extract_target_hostile_column(combined_df, bool_column):
    return combined_df[combined_df[bool_column] == 1]

In [31]:
toxic_comment = extract_target_hostile_column(comment_with_label, 'toxic_bool')
toxic_comment.shape

(15362, 8)

#### Get tokens and counts

In [32]:
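def get_tokens_and_counts(combined_df):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(combined_df['comment'])
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from version 1.0
    word_token = vectorizer.get_feature_names_out()
    # dense matrix: one row per comment, one column per token
    count_matrix = X.toarray()
    
    return word_token, count_matrix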

In [33]:
word_tokens, counts_matrix = get_tokens_and_counts(toxic_comment)
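
The dense matrix has one row per comment and one column per token. With about 15,000 comments and over 30,000 tokens, that is on the order of a few gigabytes. Since only the column totals are needed later, one possible alternative is to sum the sparse matrix directly. The sketch below uses a made-up three-document corpus and is not the code used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you are wrong", "you you you", "that is wrong"]  # made-up corpus

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix

# Column sums taken on the sparse matrix, without building a dense copy.
sparse_totals = np.asarray(X.sum(axis=0)).ravel()

# Dense version for comparison only (this is what toarray() would give).
dense_totals = X.toarray().sum(axis=0)

# vocabulary_ maps token -> column index; sort tokens by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
token_totals = dict(zip(vocab, sparse_totals.tolist()))
print(token_totals)
```

Both routes give the same totals, so `words_freqs_sorted` could be fed the summed vector instead of the full matrix if memory becomes a problem.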

#### Calculate frequency of each token and get the most frequent words for toxic comments

In [55]:
def words_freqs_sorted(tokens, matrix):
    freq_for_each_token = np.sum(matrix, axis = 0)
    token_tuples = list(zip(tokens,freq_for_each_token))
    word_freq_df = pd.DataFrame(token_tuples, columns=['Word','frequency'])\
                    .sort_values(by = 'frequency', ascending=False)
    word_freq_df['rank'] = list(range(1, len(tokens)+1))
    return word_freq_df

In [56]:
word_freq_df = words_freqs_sorted(word_tokens, counts_matrix)

In [57]:
word_freq_df[:50]

Unnamed: 0,Word,frequency,rank
31517,you,39353,1
27870,the,20718,2
1060,and,15881,3
28277,to,15800,4
14674,is,12720,5
19523,of,11500,6
11101,fuck,9959,7
31545,your,9171,8
1464,are,8486,9
27863,that,8295,10


### Observations of toxic comments

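Apart from common English function words, the top 50 most frequent words in comments with an average toxicity annotation above 0.5 include 'fuck', 'nigger', 'shit', 'suck', 'fucking', 'ass', 'faggot', 'hate', and 'don'. The token 'don' is most likely a remnant of "don't": the cleaning step replaces the apostrophe with a space, and CountVectorizer's default tokenizer drops single-character tokens such as 't'.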
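
Function words such as 'you' and 'the' dominate the top of the frequency table. One possible way to filter them out, sketched here on a made-up corpus, is CountVectorizer's built-in `stop_words="english"` option. This was not used in the analysis above, and the built-in list is a general-purpose one that may not suit every corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you hate the rule", "that is the rule", "you are the problem"]  # made-up

plain = CountVectorizer()
plain.fit(docs)

filtered = CountVectorizer(stop_words="english")
filtered.fit(docs)

# vocabulary_ holds the tokens that each vectorizer kept.
print("without filter:", sorted(plain.vocabulary_))
print("with filter:   ", sorted(filtered.vocabulary_))
```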

### Rerun the analysis for aggressive comments

In [59]:
agg_annotations.groupby('rev_id')['aggression'].mean().describe()

count    115864.000000
mean          0.185828
std           0.271089
min           0.000000
25%           0.000000
50%           0.100000
75%           0.250000
max           1.000000
Name: aggression, dtype: float64

In [61]:
# generate label column
avg_aggressive_score = generate_label(agg_annotations, 'rev_id', 'aggression', 'agg')

In [63]:
# clean comment column
agg_comments_rm_special_char = clean_comment_col(agg_comments)

In [64]:
# join label with comment
agg_comment_with_label = join_label_with_text(avg_aggressive_score, agg_comments_rm_special_char, 
                                              'rev_id', 'agg_bool')

In [66]:
agg_comment = extract_target_hostile_column(agg_comment_with_label, 'agg_bool')
agg_comment.shape

(14782, 8)

In [67]:
# get tokens and counts
word_tokens_agg, counts_matrix_agg = get_tokens_and_counts(agg_comment)

In [73]:
word_freq_agg_df = words_freqs_sorted(word_tokens_agg, counts_matrix_agg)
word_freq_agg_df[:50]

Unnamed: 0,Word,frequency,rank
32230,you,49535,1
28540,the,23445,2
28912,to,18309,3
1105,and,17886,4
15050,is,15309,5
11374,fuck,14587,6
20050,of,13046,7
32254,your,10661,8
1507,are,9199,9
28531,that,9073,10


### Observations of aggressive comments

For comments with an average aggression score above 0.5, the top 50 most frequent words include 'fuck', 'nigger', 'shit', 'suck', 'ass', 'faggot', 'hate', 'fucking', 'fat', 'penis', 'die', 'don', and 'gay'. 
All of the hostile words listed for toxic comments also appear here, and they rank higher. Words such as 'penis', 'fat', and 'gay' are not in the top 50 for toxic comments and may be used as personal attacks. 
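
The rank comparison can be made explicit by joining the two frequency tables on the word. The sketch below uses only the top 10 rows shown in the two outputs above, and the frame names are hypothetical. With the full tables, the same merge could be applied to `word_freq_df` and `word_freq_agg_df`.

```python
import pandas as pd

# Top 10 words and ranks as shown in the two outputs above.
toxic_top10 = pd.DataFrame({
    "Word": ["you", "the", "and", "to", "is", "of", "fuck", "your", "are", "that"],
    "rank": list(range(1, 11)),
})
agg_top10 = pd.DataFrame({
    "Word": ["you", "the", "to", "and", "is", "fuck", "of", "your", "are", "that"],
    "rank": list(range(1, 11)),
})

comparison = toxic_top10.merge(
    agg_top10, on="Word", suffixes=("_toxic", "_agg")
).set_index("Word")

# Positive shift: the word ranks higher (closer to 1) in aggressive comments.
comparison["rank_shift"] = comparison["rank_toxic"] - comparison["rank_agg"]
moved = comparison[comparison["rank_shift"] != 0]
print(moved)
```

Within the top 10, 'fuck' moves up one place (from rank 7 to rank 6) in the aggressive comments.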

### Implications of analysis 1

- A model built on n-gram features might generate many false alarms, so it is not ideal for identifying hostile speech.
  - Reasoning: While words like 'fuck', 'nigger', 'shit', and 'hate' can be clear indicators of hostile speech, they might also appear in an article that discusses related cultural or social issues. Training a model on n-grams and word frequencies ignores context, so its predictions are not fully representative of real-world online discussions.
- An LSTM, which takes the relationships between words into account, might be a better model.
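
The false-alarm concern can be illustrated with a small sketch on two made-up sentences: one uses 'hate' as an attack, the other discusses hate speech as a topic. A unigram bag-of-words representation gives the word the same count in both, so a model that relies on that feature alone cannot tell them apart.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "i hate you",                                # hostile use (made up)
    "the article discusses hate speech policy",  # neutral discussion (made up)
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Column index of the token 'hate', then its count in each document.
hate_col = vectorizer.vocabulary_["hate"]
hate_counts = X[:, hate_col].toarray().ravel().tolist()
print(hate_counts)
```

The two documents differ only in their other tokens, which is the context that a word-frequency model largely ignores.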

## Analysis 2: demographic information from the crowdsourcing workers for toxicity analysis

### Research question: Are the annotators truly representative of the general public? 

### Check number of unique workers

In [71]:
toxic_demographics.worker_id.nunique()

3591

### Check the distribution of demographic columns

In [74]:
demo_columns = list(toxic_demographics.columns)
demo_columns.remove('worker_id')
for column in demo_columns:
    print(toxic_demographics[[column]].value_counts())
    print('\n')

gender
male      2327
female    1263
other        1
dtype: int64


english_first_language
0                         2925
1                          666
dtype: int64


age_group
18-30        1862
30-45        1247
45-60         296
Under 18       79
Over 60        30
dtype: int64


education   
bachelors       1441
hs              1026
masters          546
professional     441
some              93
doctorate         41
none               3
dtype: int64




In [73]:
toxic_demographics[['gender','age_group']].value_counts()

gender  age_group
male    18-30        1265
        30-45         790
female  18-30         597
        30-45         457
male    45-60         154
female  45-60         142
male    Under 18       61
female  Over 60        21
        Under 18       17
male    Over 60         9
other   Under 18        1
dtype: int64

In [70]:
toxic_demographics[['english_first_language', 'education']].value_counts()

english_first_language  education   
0                       bachelors       1148
                        hs               842
                        masters          449
                        professional     383
1                       bachelors        293
                        hs               184
                        masters           97
0                       some              77
1                       professional      58
0                       doctorate         25
1                       some              16
                        doctorate         16
                        none               2
0                       none               1
dtype: int64

In [77]:
toxic_demographics[['gender','age_group', 'education', 'english_first_language']].value_counts()

gender  age_group  education     english_first_language
male    18-30      bachelors     0                         438
                   hs            0                         347
        30-45      bachelors     0                         226
female  18-30      bachelors     0                         215
male    30-45      hs            0                         155
female  30-45      bachelors     0                         153
male    30-45      masters       0                         145
female  18-30      hs            0                         132
male    30-45      professional  0                         122
        18-30      professional  0                         116
                   masters       0                         112
                   bachelors     1                         106
female  30-45      hs            0                          77
        18-30      masters       0                          75
        30-45      masters       0                          59

### Observations of demographic analysis
- Gender: 
  - 64.8% are male
  - 35.2% are female
- English as the first language:
  - 81.5% no
  - 18.5% yes
- Age group:
  - 51.9% 18-30 years old
  - 34.7% 30-45 years old
- Education:
  - 40.1% bachelors
  - 28.6% high school
  
The largest group is men aged 18-30 with a bachelor's degree whose first language is not English (438 of 3,591 workers, 12.2%).  <br> 
The second largest group is men aged 18-30 with a high school education whose first language is not English (347 workers, 9.7%).
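
The percentages above can be reproduced from the counts printed earlier. The sketch below re-enters those counts by hand (the variable names are hypothetical) and also shows a detail that is easy to miss: the age counts sum to 3,514, so 77 of the 3,591 workers have no age group, and the share of 18-30 year olds depends on which denominator is used.

```python
import pandas as pd

n_workers = 3591  # unique worker_id values reported above

# Counts copied from the value_counts() outputs above.
gender_counts = pd.Series({"male": 2327, "female": 1263, "other": 1})
age_counts = pd.Series(
    {"18-30": 1862, "30-45": 1247, "45-60": 296, "Under 18": 79, "Over 60": 30}
)

gender_share = (gender_counts / n_workers * 100).round(1)

# Share of all workers (the figures quoted above) ...
age_share_all = (age_counts / n_workers * 100).round(1)
# ... versus share of workers who reported an age group.
age_share_known = (age_counts / age_counts.sum() * 100).round(1)
missing_age = n_workers - age_counts.sum()

print(gender_share, age_share_all, age_share_known, missing_age, sep="\n\n")
```

On the real frame, `toxic_demographics['age_group'].value_counts(normalize=True)` would give the second variant, because `value_counts` drops missing values unless `dropna=False` is passed.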

To check whether this distribution is representative of the population, I compare these ratios with the demographic data of Wikipedia editors from ["Wikipedia:Who writes Wikipedia?"](https://en.wikipedia.org/wiki/Wikipedia:Who_writes_Wikipedia%3F#:~:text=28%25%20editors%20are%20aged%2040%2B.&text=59%25%20of%20the%20editors%20are%20aged%2017%20to%2040.&text=The%20English%20Wikipedia%20currently%20has,contributors%20participate%20in%20community%20discussions). The statistics show that, as of 2008, 84% of English Wikipedia editors were male. Given that the comment data spans 2001 to 2016, the 64.8% male share among annotators points in the same direction as the gender gap between men and women editors, although it is less pronounced. <br>
For age, the statistics show that 40% of editors are aged 18-30, which is lower than the 51.9% in the dataset. The annotator pool may therefore over-represent people aged 18-30. <br>
The source does not provide enough information about education or first language for a comparison. 
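
The comparison amounts to two subtractions, sketched below in percentage points. The reference figures are the editor statistics cited above, and the dictionary names are hypothetical.

```python
# Annotator shares from this dataset (percent).
observed = {"male": 64.8, "aged 18-30": 51.9}
# Wikipedia editor shares cited above (percent).
reference = {"male": 84.0, "aged 18-30": 40.0}

# Positive gap: the group is more common among annotators than among editors.
gap = {group: round(observed[group] - reference[group], 1) for group in observed}
print(gap)
```

Men are about 19 percentage points less common among annotators than among editors, while 18-30 year olds are about 12 points more common.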

### Implications of demographic analysis

- Although the toxicity demographic dataset might be representative of the demographic profile of Wikipedia editors, these ratios differ from the distribution of global internet users or of the global population. 
- A model trained on these labels may therefore be overly influenced by the dominant groups and under-influenced by minority groups.   
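
Whether the dominant groups actually label differently is not tested in this notebook. One possible follow-up, sketched here on hypothetical toy data, is to join the annotations with the worker demographics on `worker_id` and compare the average label per group.

```python
import pandas as pd

# Hypothetical data: one comment rated by four workers.
toy_annotations = pd.DataFrame({
    "rev_id":    [1, 1, 1, 1],
    "worker_id": [10, 11, 12, 13],
    "toxicity":  [1, 1, 0, 0],
})
toy_demographics = pd.DataFrame({
    "worker_id": [10, 11, 12, 13],
    "gender":    ["male", "male", "female", "female"],
})

# An inner join drops annotations from workers without demographic data.
by_group = (
    toy_annotations.merge(toy_demographics, on="worker_id", how="inner")
    .groupby("gender")["toxicity"]
    .mean()
)
print(by_group)
```

On the real data the same pattern would apply to `toxic_annotations` and `toxic_demographics`. Large differences between groups would support the concern above, while similar rates would weaken it.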

## Discuss further implications

- Which, if any, of these demo applications would you expect the Perspective API to perform poorly in? Why?
  - I think that most of these demo applications, such as the toxicity timeline or the comment slider, will give a biased picture of hostile speech. As the first analysis implies, some comments can be classified as toxic when in fact they are not. On the other hand, my second analysis suggests that the model is trained to be more sensitive to certain demographic groups and might miss subtle hostile speech by minority groups. 
  
- What are some kinds of hostile speech that would be difficult to accurately detect using the approach used to train the Perspective API models? 
  - Speech that avoids obviously hostile keywords and instead uses subtler wording, while its overall tone is still toxic. For example, "we believe that women naturally have a relatively weak sense of direction and they should not drive on the road." 
  
- Imagine you are one of the Google data scientists who maintains the Perspective API. If you wanted to improve the model or the API itself to make it work better for any of these purposes, how should they go about doing that?
  - I would try to make the distribution of crowdsourcing workers more representative of the general population of internet users. 
  - I would use an LSTM model rather than a logistic model, since the context of words is essential.
  - I would source comments and posts not just from online discussion forums but also from social media like Facebook or Twitter. They provide a large amount of information about online discussions.