# Assignment 2: Exploring sources of bias in data
## Introduction

**All datasets have limitations. Any of these limitations can be a potential source of bias.** In the context of data, a "bias" is really just a divergence between what a dataset is supposed to capture about the world--according to the *dataset designer's intention*, or according to the *end-user's expectations*--and what's actually represented in the data. 

So biases in data are often highly *contextual*, which can make them subtle and hard to spot. Similarly, it's hard to predict ahead of time what the *consequences* of those biases might be, because it depends on what the data is being used for.

Nevertheless, some sources of "bias" in data show up over and over again. Some of these are:
- **duplicated data:** the same data shows up multiple times
- **incomplete data:** some data is missing from the dataset
- **misleading data:** it looks like a piece of data means one thing, but it actually means something different
- **unrepresentative data:** the dataset doesn't represent the population it's gathered from OR the population it's intended to model


Any of these sources of bias, unless they're properly documented (with a data statement, etc.), can cause a researcher who is using the data to reach incorrect conclusions. They can also cause a machine learning model that is trained on that dataset to make classification errors.

The nature or scale of these consequences can be hard to predict. That's why any time you prepare to use data that you didn't gather yourself, it pays to spend some time exploring the dataset and thinking critically about the limitations you discover, and how they might affect your analysis or your machine learning model.

For this assignment, you'll be working with the Wikipedia Talk Corpus. You created a data statement for this dataset in class, so you're already an 'expert' on it compared to most people! For more background on the Wikipedia Talk Corpus, please review the background information doc. LINKME

In assignment 1, you learned how to process and analyze a dataset created by someone else. In that assignment, we assumed that the data was "true". In other words, we assumed that there were no major errors: that the dataset accurately captured all the bike and pedestrian traffic on the Burke-Gilman trail over a period of time.

In this assignment, we'll also be analyzing a dataset, but this time we won't assume that the data is complete and correct. Instead, we will try to identify several different ways in which the data might be WRONG, and what that might mean for how we should use this data. 

You will load a dataset into your copy of this Jupyter Notebook and identify some of the potential limitations of that dataset. Then you'll be asked to perform some analysis on that data, and reflect on whether any of the sources of bias you've identified might cause a machine learning model that was trained on this data to be biased, and what that bias might look like. 

The homework questions are listed at the bottom of this notebook, and you are expected to write your responses in the notebook itself, and submit a link to it for grading.

## Part 1: Cleaning and analyzing the annotator demographic data

For the first part of this assignment, we're going to prepare one set of Wikipedia Talk data--the annotator demographics files--for analysis. In the process we'll perform a few "sanity checks" to make sure we understand what the data means, and know any limitations.

This sort of "[data wrangling](https://en.wikipedia.org/wiki/Data_wrangling)" is a critical, if sometimes tedious, first step for any data analysis project.

### 1.1 Load the data into the notebook

According to the documentation, the worker demographic data for the Wikipedia Talk Corpus is spread across three files:
- ``toxicity_worker_demographics.tsv``
- ``aggression_worker_demographics.tsv``
- ``attack_worker_demographics.tsv``

We will need to combine the data in these three files to come up with our canonical list of workers.

First we'll load each of the annotator datafiles into our Notebook and save them into a data structure that's easy to work with. In this case, I'm choosing to save each of these files as a list-of-dictionaries, since that's fairly standard, and it makes it easy to check your work as you go.

By the way: ``.tsv`` stands for "tab-separated values", and it means that this file is organized into rows and columns, like a spreadsheet, and the data values for each column are separated by "tab" characters.
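To make the format concrete, here is a minimal sketch that parses a tiny tab-separated table held in memory. The two sample rows are made up for illustration (they only borrow column names from the schema), and `io.StringIO` stands in for a real file.

```python
import csv
import io

# A made-up, two-row table in tab-separated form ("\t" is the tab character).
sample_tsv = "worker_id\tgender\tage_group\n1\tfemale\t18-30\n2\tmale\t30-45\n"

# csv.DictReader uses the first row as column headers
# and returns one dictionary per remaining row.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = [dict(r) for r in reader]

print(rows[0])
print(len(rows))
```

Every value comes back as a string, including numbers like `worker_id`, which is why the IDs later in this notebook appear in quotes.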

In [58]:
#import the csv module, a little code toolkit for working with spreadsheet-style data files
import csv

The function below will load in a tab-separated (.tsv) file and convert it into a list-of-dictionaries. 

If you don't have much experience with Python, this (and some of the other code in this notebook) might be hard to understand. That's okay! For now, it's most important that you know what it does.

If you have a ***lot*** of experience with Python, the code in this notebook might seem really, really primitive. That's also okay! Remember: in this course we're primarily interested in data, not code. Code is just one of the many tools we use to ask and answer questions about data.

In [59]:
def prepare_datasets(file_path):
    """ 
    Accepts: path to a tab-separated plaintext file
    Returns: a list containing a dictionary for every row in the file, 
        with the file column headers as keys
    """
    
    with open(file_path) as infile:
        reader = csv.DictReader(infile, delimiter='\t')
        list_of_dicts = [dict(r) for r in reader]
        
    return list_of_dicts

### 1.2 Identifying duplicated datafiles
Let's load our three .tsv files into Python and store them as three variables with relevant names, so that we know which is which. 

Once we've created these three lists-of-dicts, we will do two things to check our work so far: 
- we will print the first annotator's demographic data (list index ``[0]``) so that we know what the format looks like
- we will print the length of each list (the ``len`` function), to see how many rows are in each file. Each row should correspond to one crowdworker/annotator.

***Note:*** for the cell below to run, your version of these datafiles and folders will need to have the same names as the ones below, and your version of this Notebook will need to be stored in the same directory as the folders that hold the datafiles.

In [60]:
#load the data from the flat files into three lists-of-dictionaries
toxicity_annotators = prepare_datasets("Wikipedia_Talk_Labels_Toxicity_4563973/toxicity_worker_demographics.tsv")
print(toxicity_annotators[0])
print(len(toxicity_annotators))

attack_annotators = prepare_datasets("Wikipedia_Talk_Labels_Personal_Attacks_4054689/attack_worker_demographics.tsv")
print(attack_annotators[0])
print(len(attack_annotators))

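#NOTE: the path below points at the attack file, not aggression_worker_demographics.tsv.
#Change it to wherever the aggression file lives in your copy, so that the comparison below is meaningful.
aggression_annotators = prepare_datasets("Wikipedia_Talk_Labels_Personal_Attacks_4054689/attack_worker_demographics.tsv")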
print(aggression_annotators[0])
print(len(aggression_annotators))



{'worker_id': '85', 'gender': 'female', 'english_first_language': '0', 'age_group': '18-30', 'education': 'bachelors'}
3591
{'worker_id': '833', 'gender': 'female', 'english_first_language': '0', 'age_group': '45-60', 'education': 'bachelors'}
2190
{'worker_id': '833', 'gender': 'female', 'english_first_language': '0', 'age_group': '45-60', 'education': 'bachelors'}
2190


Interesting! This tells us a few things that we didn't know before:
- looks like the demographic data matches what's listed in [the schema](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release#Schema_for_{attack/aggression/toxicity}_worker_demographics.tsv), which is great!
- the "toxicity" dataset was annotated by a lot more people (3,591) than the "attack" or "aggression" datasets (2,190)
- ``aggression_worker_demographics.tsv`` and ``attack_worker_demographics.tsv`` seem to contain the same number of workers, and the worker at the beginning of each list has the same ID and demographic data.

Let's dig a little deeper here. Are the ***last*** entries in both of these lists also identical?

In [61]:
print(attack_annotators[-1]) # "-1" tells Python to find the last item in any list
print(aggression_annotators[-1])


{'worker_id': '3876', 'gender': 'female', 'english_first_language': '1', 'age_group': '30-45', 'education': 'bachelors'}
{'worker_id': '3876', 'gender': 'female', 'english_first_language': '1', 'age_group': '30-45', 'education': 'bachelors'}


Yes, the last rows in these two lists are also identical! 

And in fact, if you open the two files in a text editor or spreadsheet program, you'll find that the *aggression and attack .tsv files contain exactly the same data.* By the way, it doesn't say anywhere in the dataset documentation that these two files are identical. 
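Rather than eyeballing the first and last rows, you can also ask Python to compare every row at once. Here is a minimal sketch, using small made-up lists in place of `attack_annotators` and `aggression_annotators`: two lists-of-dictionaries are equal only if they hold equal dictionaries in the same order.

```python
# Made-up stand-ins for two loaded files (hypothetical values).
list_a = [
    {"worker_id": "1", "gender": "female"},
    {"worker_id": "2", "gender": "male"},
]
list_b = [
    {"worker_id": "1", "gender": "female"},
    {"worker_id": "2", "gender": "male"},
]
list_c = [
    {"worker_id": "1", "gender": "female"},
    {"worker_id": "2", "gender": "other"},
]

# == compares the lists element by element, in order.
same_ab = list_a == list_b
same_ac = list_a == list_c

print(same_ab)
print(same_ac)
```

In the notebook, the equivalent check would be `attack_annotators == aggression_annotators`.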

That brings us to our first lesson about bias: watch out for duplicate data! 

1. ***What would have happened if we had just combined these three files and then analyzed the worker demographics?***
 
Fortunately, now that we know there's duplicate data we can work around it. Since two files are identical, we only need to use one of them. So from now on, we will ignore ``aggression_annotators`` entirely. 

Since we want to remember that ``attack_annotators`` really refers to both "attack" and "aggression" annotators, we can just rename the variable we're using to store that dataset.

In [63]:
attack_aggro_annotators = attack_annotators

Okay, that looks good. Now we can continue getting our data ready for analysis--while keeping an eye out for additional duplicate data!

### 1.3 Understanding the properties of your data

Whenever you are working with data that you didn't create, it's very useful to perform some basic sanity-checking to make sure the data actually means what you think it means. 

Let's look at the ``worker_id`` field. 
In this case, since we know that in part 3 of this assignment we will want to compare ***characteristics of the annotators*** with ***characteristics of the annotated data***, we know that one thing we need to do is understand how these two datasets relate to one another. The first thing to check is what this ``worker_id`` field actually means. 

The [schema](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release#Schema_for_{attack/aggression/toxicity}_worker_demographics.tsv) says that the ``worker_id`` field contains an "anonymized crowd-worker id" and that this ID is meant to join the worker demographics datafiles with the annotator comments datafiles. 

That makes sense, but it doesn't tell us everything we need to know before we start our analysis. 

1. ***Do identical worker IDs in ``toxicity_annotators`` and ``attack_aggro_annotators`` refer to the same worker?*** If we are going to create a clean list that contains demographic information about all of the annotators, we need to know if ``worker_id`` in one file corresponds to the same ``worker_id`` in another file. In other words, if the same crowdworker participated in labelling the "toxicity" dataset and the "aggression" dataset, are they identified by the same ID? 
2. ***Do we have demographic data for all workers?*** The schema says that this demographic data "was obtained by an **optional** demographic survey administered after the labelling task". If only a small number of the workers who annotated the data filled out the survey, then we might not end up with a very representative picture of who these workers were.


However, if we want to analyze worker demographics, we still need to know: ***does ``worker_id`` mean the same thing ACROSS different datasets?*** 

In other words, if we see that ``worker_id`` "1234" appears in both .tsv files, does that mean that worker 1234 participated in both the toxicity and the attack/aggression labelling campaigns?

If so, we would need to de-duplicate our combined worker list in order to pull accurate demographic data about the workers.

To check this, let's combine our two ``worker_id`` lists and then run ``determine_dupes`` to see if there's any overlap.


If ``worker_id`` is unique...
- it will be easy for us to see whether any of the same workers annotated both the "toxicity" dataset and the "aggression and personal attacks" datasets
- we can pool the two worker datasets and look for overall demographic trends among the workers
- we can examine patterns in worker behavior, such as whether male-identified workers tended to identify more "bad" (toxic, aggressive, attacky) behavior than female-identified workers

### 1.4 Checking for duplicate worker IDs

If we want to see if worker_id means the same thing across datasets, the first step is to see if any of the same worker_ids exist across the two datasets.

To check this, let's first pull all the worker IDs out of each dataset and combine them into a single list. Then we can check whether that list contains any duplicate values. If it does, we know that there is at least 1 value for ``worker_id`` that appears in both datasets.

**Note:** *We're assuming that there are no duplicate values for ``worker_id`` within each dataset. (There aren't, I checked). It's a pretty safe assumption though, because worker_id is intended to be a unique key that links the ``worker_demographics.tsv`` files and the ``annotated_comments.tsv`` datasets. Can you explain why it would be an issue if there were duplicate values for ``worker_id`` within the individual datasets?*
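If you'd like to verify that assumption yourself, one quick way is to compare the number of IDs with the number of *distinct* IDs. This is a sketch on made-up rows; `ids_are_unique` is a hypothetical helper, not part of the notebook, and you could call it on `toxicity_annotators` or `attack_aggro_annotators`.

```python
def ids_are_unique(list_of_dicts, key="worker_id"):
    """Return True if no value of `key` appears more than once."""
    ids = [row[key] for row in list_of_dicts]
    # A set drops repeats, so the lengths match only when every id is distinct.
    return len(ids) == len(set(ids))

# Made-up rows for illustration.
unique_rows = [{"worker_id": "1"}, {"worker_id": "2"}, {"worker_id": "3"}]
repeated_rows = [{"worker_id": "1"}, {"worker_id": "2"}, {"worker_id": "1"}]

print(ids_are_unique(unique_rows))
print(ids_are_unique(repeated_rows))
```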

In [10]:
#pull the worker ids out of the individual files
tox_w_ids = [item['worker_id'] for item in toxicity_annotators]
aa_w_ids = [item['worker_id'] for item in attack_aggro_annotators]


#create a new list to hold all the ids
all_w_ids = []

#combine the two worker_id lists into the new list
all_w_ids.extend(tox_w_ids)
all_w_ids.extend(aa_w_ids)

In [15]:
#how many worker ids do we have, total? 
#this number should match the total count of toxicity_annotators and attack_aggro_annotators (3591 + 2190 = 5781)
print(len(all_w_ids))

5781


In [14]:
#what does our new list look like? Let's print the first five values
print(all_w_ids[0:5])

['85', '1617', '1394', '311', '1980']


Below is our duplicate-checker function. You pass it a list of values, and it will return "True" if it finds at least 1 duplicate value in that list. Can you figure out how it works?

In [11]:
def determine_dupes(w_ids):
    set_of_ids = set()
    
    found_a_dupe = False
    
    for w in w_ids:
        if w in set_of_ids:
            found_a_dupe = True
        else:
            set_of_ids.add(w)

    return found_a_dupe

In [12]:
#check for duplicates in the combined worker id list
has_dupes = determine_dupes(all_w_ids)

#if this prints 'True', that means we found at least one duplicate ID
print(has_dupes)

True


Hmmm... So it looks like there's at least 1 duplicate! Okay, we'll need to do some additional verification before we can decide what to do with that information. 

The next thing we'll do is check how many duplicates there are. We'll write a short script that reads through ``all_w_ids`` and every time it finds a value that appears more than once, it adds that value to a new list.

Can you figure out how the script below works?

In [19]:
#create an empty list to hold any duplicate worker_id values we find
dupes = []

#look through the data, if you encounter any value more than once, add it to our 'dupes' list
for w in all_w_ids:
    if all_w_ids.count(w) > 1:
        dupes.append(w)

#how many values were added to 'dupes'?        
print(len(dupes))

#how many worker_ids are present twice in the dataset?
print(len(dupes)/2)

3716
1858.0


Huh, so it looks like 1,858 of the ``worker_ids`` in our merged dataset are present more than once. That's a lot of duplication, since our list was only 5,781 rows in the first place, including the dupes! 

We will definitely need to account for this duplication before we start analyzing worker demographics.
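As a cross-check on that count, here is a sketch of a different technique: `collections.Counter` tallies every value in one pass, and a set intersection finds the IDs shared by two lists. The ID lists below are made up; in the notebook you would use `tox_w_ids` and `aa_w_ids`. The two results agree only under the assumption stated earlier, that no ID repeats within a single dataset.

```python
from collections import Counter

# Made-up id lists standing in for tox_w_ids and aa_w_ids.
ids_one = ["1", "2", "3", "4"]
ids_two = ["3", "4", "5"]

# Count how often each id occurs in the combined list.
counts = Counter(ids_one + ids_two)
repeated_ids = sorted(w for w, n in counts.items() if n > 1)

# ids present in both lists, found with a set intersection.
shared_ids = sorted(set(ids_one) & set(ids_two))

print(repeated_ids)
print(len(repeated_ids))
print(shared_ids == repeated_ids)
```

Unlike calling ``count`` inside a loop, this approach reads through the list only once, which matters more as datasets get larger.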

### 1.5 Check for duplicate worker demographic metadata
Let's run some spot-checks to see if it's just the value for ``worker_id`` that is duplicated across the two datasets (meaning that these duplicate IDs correspond to different workers with different demographics), or if ``worker_id`` really corresponds to the same people across ``toxicity_annotators`` and ``attack_aggro_annotators``.

For each spot check, we'll visually compare the demographic data of workers with duplicate IDs, to see if they look like the same worker or not.

First, we'll extract 10 random ids from our dupe set.

In [21]:
#handy Python library that lets you select things randomly from a list
import random

In [22]:
#store our random sample of dupes in its own list
#bonus: can you explain why we created "dupeset" rather than just grabbing 10 random values from "dupes"?
dupeset = set(dupes)

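#random.sample needs a sequence (passing a set raises a TypeError in Python 3.11+),
#so we sort the set into a list first
dupe_sample = random.sample(sorted(dupeset), 10)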

#print to confirm everything looks how we expect it to
print(dupe_sample)

['619', '1930', '884', '786', '3310', '2637', '2828', '981', '1951', '3861']
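Because the sample is random, each run of the cell above picks different IDs. If you want the same sample every time (for example, so that your written answers match your output), one option is a seeded random generator. This is a sketch with made-up IDs; the seed value 42 is arbitrary.

```python
import random

# Made-up duplicate ids standing in for `dupes` (note the repeats).
example_dupes = ["10", "11", "12", "13", "14", "10", "11", "12", "13", "14"]

# Remove repeats, then sort so the population has a fixed order.
population = sorted(set(example_dupes))

# Two generators created with the same seed draw the same sample.
sample_one = random.Random(42).sample(population, 3)
sample_two = random.Random(42).sample(population, 3)

print(sample_one == sample_two)
print(len(set(sample_one)))
```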


Now, we'll use this ``dupe_sample`` list that we created to pull the corresponding worker demographics from each of our two datasets, using the function below. See if you can figure out how the function works!

In [24]:
def worker_id_lookup(annotator_list, dupe_id):
    """
    Takes a list of dictionaries 
    and a single 'worker_id' value that is known 
    to be duplicated across datasets.
    
    Prints the complete demographic data for every 
    worker in the list whose 'worker_id' matches that value.
    
    """
    
    for a in annotator_list:
        if a['worker_id'] == dupe_id:
            print(a)
            
#loop through the duplicate sample list and call our worker_id_lookup function
# to check each dataset for corresponding worker demographic data
for d in dupe_sample:
    worker_id_lookup(toxicity_annotators, d)
    worker_id_lookup(attack_aggro_annotators, d)

{'worker_id': '619', 'gender': 'male', 'english_first_language': '0', 'age_group': '18-30', 'education': 'professional'}
{'worker_id': '619', 'gender': 'male', 'english_first_language': '0', 'age_group': '30-45', 'education': 'masters'}
{'worker_id': '1930', 'gender': 'male', 'english_first_language': '0', 'age_group': '18-30', 'education': 'masters'}
{'worker_id': '1930', 'gender': 'female', 'english_first_language': '0', 'age_group': '30-45', 'education': 'bachelors'}
{'worker_id': '884', 'gender': 'female', 'english_first_language': '0', 'age_group': '18-30', 'education': 'bachelors'}
{'worker_id': '884', 'gender': 'female', 'english_first_language': '1', 'age_group': '30-45', 'education': 'hs'}
{'worker_id': '786', 'gender': 'male', 'english_first_language': '0', 'age_group': '18-30', 'education': 'hs'}
{'worker_id': '786', 'gender': 'male', 'english_first_language': '0', 'age_group': '45-60', 'education': 'hs'}
{'worker_id': '3310', 'gender': 'female', 'english_first_language': '0

Oh noes! Well, it looks like few if any of these ``worker_id`` values match the same demographic info across the two datasets. So based on the data we have, we should assume that these datasets were labelled by two entirely different sets of crowdworkers.

**Note:** *This kind of thing happens a lot, and often isn't called out in dataset documentation. This is another gotcha that trips people up.* 
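A spot check of 10 IDs is suggestive, but the same comparison can be run over every shared ID. The sketch below uses made-up rows in place of `toxicity_annotators` and `attack_aggro_annotators`, and `count_matching_workers` is a hypothetical helper, not part of the notebook. It indexes each list by `worker_id`, then counts the shared IDs whose demographic fields are identical.

```python
def count_matching_workers(list_one, list_two, key="worker_id"):
    """Return (number of shared ids, number of shared ids with identical rows)."""
    # Index each list by id; this assumes ids are unique within each list.
    by_id_one = {row[key]: row for row in list_one}
    by_id_two = {row[key]: row for row in list_two}

    shared = set(by_id_one) & set(by_id_two)
    identical = sum(1 for w in shared if by_id_one[w] == by_id_two[w])
    return len(shared), identical

# Made-up rows for illustration.
rows_one = [
    {"worker_id": "1", "gender": "male", "age_group": "18-30"},
    {"worker_id": "2", "gender": "female", "age_group": "30-45"},
    {"worker_id": "3", "gender": "female", "age_group": "18-30"},
]
rows_two = [
    {"worker_id": "1", "gender": "male", "age_group": "18-30"},
    {"worker_id": "2", "gender": "male", "age_group": "45-60"},
    {"worker_id": "9", "gender": "female", "age_group": "18-30"},
]

n_shared, n_identical = count_matching_workers(rows_one, rows_two)
print(n_shared, n_identical)
```

Keep in mind what this can and can't tell you: identical demographics don't prove that two rows describe the same person, but a large share of mismatches is good evidence that a shared ID does not mean a shared worker.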

### 1.6 Final preparation of the dataset

Before we start our analysis of worker demographics, let's do two more things:

1. Since we know that worker ID isn't unique, let's give each worker in our dataset a **truly** unique ID, so that we don't forget and start treating ``worker_id`` like it's unique.
2. While we're at it, let's also add a new field to each worker's demographic dictionary that lists which dataset the worker worked on (toxicity or attack/aggression).

In [26]:
def add_dataset_id(list_of_dicts, dataset_ref):
        
    for w in list_of_dicts:
        w.update({"dataset" : dataset_ref})
        
    return list_of_dicts

In [27]:
toxicity_annotators = add_dataset_id(toxicity_annotators, "toxicity")
attack_aggro_annotators = add_dataset_id(attack_aggro_annotators, "attack and aggression")

In [29]:
#did it work?
print(toxicity_annotators[0])
print(attack_aggro_annotators[0])

{'worker_id': '85', 'gender': 'female', 'english_first_language': '0', 'age_group': '18-30', 'education': 'bachelors', 'dataset': 'toxicity'}
{'worker_id': '833', 'gender': 'female', 'english_first_language': '0', 'age_group': '45-60', 'education': 'bachelors', 'dataset': 'attack and aggression'}


Great. Now we'll combine the two datasets into one, and then assign a sequential ID to each worker. It doesn't matter what this ID is, as long as it's unique within the dataset.

In [38]:
def finalize_dataset(list1, list2):
    
    #combine the two lists into a new one
    combined_list = list1 + list2
    
    #initialize our counter variable
    counter = 1
    
    for w in combined_list:
        #add the new sequential id field, and populate with the current value of 'counter'
        w.update({"unique_id" : str(counter)})
        
        #increment the counter variable by 1, so that the next ID will be one number higher
        counter = counter + 1
        
    return combined_list

In [37]:
all_annotators = finalize_dataset(toxicity_annotators, attack_aggro_annotators)
print(len(all_annotators))
print(all_annotators[0])

5781
{'worker_id': '85', 'gender': 'female', 'english_first_language': '0', 'age_group': '18-30', 'education': 'bachelors', 'dataset': 'toxicity', 'unique_id': '1'}
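The counter pattern in `finalize_dataset` can also be written with Python's built-in `enumerate`, which pairs each item with a running number. Here is a sketch on made-up rows. Unlike `finalize_dataset`, this version builds new dictionaries instead of modifying the originals, which is a design choice rather than a requirement.

```python
# Made-up rows for illustration.
list1 = [{"worker_id": "85"}, {"worker_id": "1617"}]
list2 = [{"worker_id": "85"}]

combined = []
# enumerate(..., start=1) yields (1, first row), (2, second row), ...
for number, row in enumerate(list1 + list2, start=1):
    new_row = dict(row)  # copy, so the input rows stay unchanged
    new_row["unique_id"] = str(number)
    combined.append(new_row)

print(combined)
```

Notice that the two rows with ``worker_id`` "85" end up with different ``unique_id`` values, which is exactly what we want.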


## Analyzing worker demographics

Now that we have our worker demographic data de-duplicated, combined and identified, we can start using it to ask and answer research questions. You are welcome to do this in Python, here in this notebook. But if you aren't super comfortable with Python, you can export this dataset to a spreadsheet application like Excel or Google Sheets and do the analysis there.

Goal: create tables or graphs that display the breakdown of workers by:
- gender
- first language
- age group
- education
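If you do the analysis in Python, one simple way to build this kind of breakdown is to tally a field with `collections.Counter`. This is a sketch on made-up rows; `field_breakdown` is a hypothetical helper, and in the notebook you would pass it `all_annotators`. The empty string in the example is only an assumption about what an unanswered survey question might look like, so check how missing answers actually appear in the real files.

```python
from collections import Counter

def field_breakdown(list_of_dicts, field):
    """Return {value: (count, percent of rows)} for one field."""
    counts = Counter(row[field] for row in list_of_dicts)
    total = sum(counts.values())
    return {value: (n, round(100 * n / total, 1)) for value, n in counts.items()}

# Made-up rows; the empty string stands in for an unanswered question (an assumption).
example_rows = [
    {"gender": "female"},
    {"gender": "male"},
    {"gender": "male"},
    {"gender": ""},
]

breakdown = field_breakdown(example_rows, "gender")
print(breakdown)
```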

In [39]:
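#newline='' is what the csv module recommends when writing; without it you can get blank rows on Windows
with open('a1_all_annotator_demographics.csv', 'w', encoding='utf-8', newline='') as f: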
    writer = csv.writer(f)
    #write a header row
    writer.writerow(('unique_id',
                     'worker_id', 
                     'gender', 
                     'english_first_language', 
                     'age_group',
                     'education',
                     'dataset',))
    #loop through our dataset and write it to the file, row by row
    for a in all_annotators:
        writer.writerow((a['unique_id'], a['worker_id'], a['gender'], a['english_first_language'], a['age_group'], a['education'], a['dataset']))


1. What are the demographics of the annotators?
2. There is missing data. What is missing, and where?
3. What would have happened if we hadn't cleaned the data?
4. How do these demographics compare to those of internet users (the intended audience)?
5. Given what you know, what bias would there be if this data were used to train a machine learning model?
6. Take on at least one of the challenge questions below.


## Challenges: going further

Here are some additional questions that you now have the tools you need to answer, based on what you've done today. Answer at least 2.

- (code) How consistent are labelling behaviors among workers with different demographic profiles? For example, what proportion of comments are labelled as "toxic" or "very toxic" by female-identified vs. male-identified crowdworkers?
- (no code) Take a look at the instructions given to the crowdworkers. How might these have introduced bias into the data?
- (code) What percent of all crowdworkers who labelled at least one comment in the "toxicity" dataset ALSO filled out the demographic survey?
- (code) What proportion of all comments in toxicity_comments were labelled by male-identified crowdworkers? What proportion were labelled by crowdworkers for whom English was not their first language?