## AGR Gender Language Audit ##

*This lab is based on, and builds upon, the research of Os Keyes in [this paper](https://ironholds.org/resources/papers/agr_paper.pdf)*

In this lab, we will examine and (hopefully) build upon the work of Os Keyes in their paper, "[Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition](https://ironholds.org/resources/papers/agr_paper.pdf)." 

Keyes's paper seeks to answer the following two research questions:

1. How does Automated Gender Recognition research operationalise gender, and what are the possible consequences of this should it be widely deployed?
2. How does HCI research interacting with AGR operationalise gender and contextualise any gendered assumptions of AGR software?

They do so via content analysis (hand-coding the papers and then counting the results). 

Keyes was kind enough to share the metadata from the papers used to answer the first question with us, so that's what we'll be using as the basis of the lab.

### Getting Started

First, let's load the libraries we'll need:

In [None]:
# Making sure we have all the right libraries
import pandas as pd
import numpy as np


Next, let's load the metadata for the AGR paper analysis. Note that this is Os's own research data, so please don't post it publicly:

In [None]:
AGR_metadata = pd.read_csv('AGR-metadata.csv', index_col=0)

AGR_metadata.head()

Recall that Keyes uses the following definitions:

1. **binary**: Consisting of only two categories. 
2. **immutable**: Impossible to change once defined.  
3. **physiological**: Rooted in external, biological features. 
4. **gender_focus**: Is the paper explicitly focused on developing AGR, or is it just using AGR to test a more general recognition algorithm?

Note that something wonky is going on with the **datasets** column, and we'll need to fix that later if we want to use it. (We might not; let's see.)

### Part 1: Replication 

To begin our analysis, let's replicate the counts that Keyes obtains:

**Using the ``value_counts()`` function, show how many papers were published in each venue:**
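A minimal sketch of the `value_counts()` pattern on a small made-up DataFrame. The column name `venue` is an assumption (check `AGR_metadata.columns` for the real name), and the rows are invented placeholders, not Keyes's data.

```python
import pandas as pd

# Hypothetical stand-in for AGR_metadata; "venue" is an assumed column name
toy = pd.DataFrame({"venue": ["Venue A", "Venue B", "Venue A", "Venue C", "Venue A", "Venue B"]})

# value_counts() tallies each distinct value, sorted from most to least frequent
venue_counts = toy["venue"].value_counts()
print(venue_counts)
```

On the real data this would be something like `AGR_metadata["venue"].value_counts()`, assuming that is the column's name.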

**Now let's see if we can replicate the binary/immutable/physiological percentages that Keyes finds**

In [None]:
# First, create a smaller dataframe to work with, since we only care about a few of the categories.
# We can use the filter() function to do some of the work for us.

gender_analysis = AGR_metadata.filter(["binary","immutable","physiology","gender_focus"], axis=1)

gender_analysis.head()


In [None]:
# Next, we need to replace some of the values, since we are counting explicit and implicit mentions together.
# We can use the replace() function for this. 

gender_analysis = gender_analysis.replace(to_replace = ["implicit", "explicit"], value="yes")

gender_analysis.head()



In [None]:
## Now do the same thing to replace "unmentioned" with "no"
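`replace()` also accepts a dictionary that maps old values to new ones, so all three recodings can happen in a single call. A minimal sketch on invented rows (the values mirror the `explicit`/`implicit`/`unmentioned` coding; the rows themselves are made up):

```python
import pandas as pd

# Made-up rows using the same coding scheme as the metadata
toy = pd.DataFrame({
    "binary": ["explicit", "implicit", "unmentioned", "explicit"],
    "gender_focus": ["yes", "no", "yes", "yes"],
})

# A flat dict maps each old value to its new value across every column
recoded = toy.replace({"explicit": "yes", "implicit": "yes", "unmentioned": "no"})
print(recoded)
```

Values that are not keys in the dictionary (such as the existing `yes`/`no` entries) are left unchanged.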




In [None]:
## Now we can use crosstab() to get the share of papers in each category (multiply by 100 for percentages)

pd.crosstab(gender_analysis["gender_focus"], gender_analysis["binary"], normalize="index", margins=True) 


In [None]:
## Calculate the same for "immutable" and "physiology" (the column name used in filter() above)
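Because every coded column needs the same crosstab, a small loop avoids repeating the call. This sketch runs on invented rows that are already recoded to `yes`/`no`. With `normalize="index"`, each row (including the `All` margin row) sums to 1, so multiplying by 100 gives percentages. The column names follow the `filter()` call above; the data are made up.

```python
import pandas as pd

# Made-up, already-recoded rows standing in for gender_analysis
toy = pd.DataFrame({
    "gender_focus": ["yes", "yes", "no", "no"],
    "binary":       ["yes", "yes", "yes", "no"],
    "immutable":    ["yes", "no", "yes", "yes"],
    "physiology":   ["yes", "yes", "no", "no"],
})

tables = {}
for col in ["binary", "immutable", "physiology"]:
    # Row-normalised crosstab: within each gender_focus group (and overall,
    # in the "All" row), what percentage of papers fall in each category?
    tables[col] = pd.crosstab(toy["gender_focus"], toy[col],
                              normalize="index", margins=True) * 100
    print(tables[col].round(1), "\n")
```

On the real data, swap `toy` for `gender_analysis` once `unmentioned` has been recoded to `no`.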

### Why are these percentages a problem? 

### Is fixing AGR the solution? Why or why not?

### What is the argument for...

#### Avoiding implementing AGR?
#### Examining gender with inclusive methods (e.g., self-disclosure vs. assignation)?
#### Framing gender explicitly and with trans-inclusivity at the start of any research project?
#### Making resources available for gender-aware HCI (and ML / data science)?
#### Designing replacement methodologies?
#### Digging deeper into AGR at the level of datasets, codebases, researchers' perspectives, and so on?

So let's try to do a little bit of work towards that final aim. We're going to dig deeper into the language employed in the papers that Keyes studies. To do so, we're going to use some very basic text analysis techniques: namely, counting and sorting words. 

### Exploring the language of AGR research

So let's get started. To begin, we need to read in the text of each of the papers, which I've assembled as a dataset for you. 

We'll store all of the papers as a list, ``all_papers``, with the text of each paper stored as a single item. 

In [None]:
import os

base_dir = "./AGR-text/" 

all_papers = [] # our list which will store the text of each paper; empty for now

papers = sorted(os.listdir(base_dir)) # list the files, sorted by name so the order is predictable

for paper in papers: # iterate through the files
    if not paper.startswith('.'): # skip hidden files such as .DS_Store
        # ISO-8859-1 can decode any byte sequence, which avoids UnicodeDecodeErrors across platforms
        with open(os.path.join(base_dir, paper), "r", encoding="ISO-8859-1") as file:
            text = file.read() # read in the file as a single text string
            all_papers.append(text) # append it to the all_papers list

# lastly, just take a look at the last list item to be sure it worked

all_papers[-1]
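A somewhat more defensive version of the loading loop, sketched with `pathlib`: it builds paths with `Path` objects instead of string concatenation (so a missing trailing slash in `base_dir` doesn't matter), keeps only `*.txt` files (an assumption about how the corpus files are named), skips hidden files, and returns texts in sorted filename order so list positions stay predictable. The helper name `load_texts` is hypothetical.

```python
from pathlib import Path

def load_texts(base_dir, encoding="ISO-8859-1"):
    """Read every non-hidden .txt file in base_dir, in sorted filename order."""
    # pathlib's "*" also matches dotfiles, so hidden files are filtered explicitly
    paths = sorted(p for p in Path(base_dir).glob("*.txt") if not p.name.startswith("."))
    return [p.read_text(encoding=encoding) for p in paths]
```

Usage would look like `all_papers = load_texts("./AGR-text")`, after which `len(all_papers)` should match the number of rows in `AGR_metadata` if each paper has one metadata row.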

Today, we'll be using a library called [TextBlob](https://textblob.readthedocs.io/), which is a simplified text processing library that sits on top of [NLTK](http://www.nltk.org/). It works like this:

In [None]:
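import sys
!{sys.executable} -m pip install textblob
# TextBlob's sentence and word tokenizers rely on NLTK data; this downloads the corpora it uses
!{sys.executable} -m textblob.download_corpora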

In [None]:
from textblob import TextBlob

# make one giant string, with a newline between papers so the last word of
# one paper doesn't run into the first word of the next
all_papers_text = "\n".join(all_papers)

# convert giant string into a single TextBlob object
all_text = TextBlob(all_papers_text)

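With TextBlob, counting words is very easy (if a little slow). Note that `word_counts` lowercases words before counting, so "Male" and "male" are counted together. You do it like this: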

In [None]:
all_text.word_counts['male']

**Can you find the counts for the word "female"? What about "men" vs. "women"? "Man" vs. "woman"? Anything interesting in there?**
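If you want to compare several pairs at once, a loop over a list of pairs keeps the output tidy. The sketch below uses only the standard library: it lowercases the text and tokenizes on runs of letters with a regular expression, which is a swap-in for TextBlob's tokenizer, so its counts may differ slightly from `all_text.word_counts`. The helper name `pair_counts` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def pair_counts(text, pairs):
    """Return {(a, b): (count_a, count_b)} for each word pair, case-insensitively."""
    # Tokenize on runs of letters (allowing one internal apostrophe, e.g. "women's")
    counts = Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return {(a, b): (counts[a], counts[b]) for a, b in pairs}

pairs = [("male", "female"), ("men", "women"), ("man", "woman")]
sample = "A man and a woman. Female faces, male faces, more male faces. Women and men."
print(pair_counts(sample, pairs))
```

Because whole tokens are counted, "female" is not also counted as "male". On the real corpus you would call `pair_counts(all_papers_text, pairs)`.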

Another thing that's easy to do using TextBlob is to segment (or "tokenize") by sentence and word, as in this example, which prints out all of the sentences that contain the word "woman".

In [None]:
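for sentence in all_text.sentences:
    for word in sentence.words:
        if word.lower() == "woman": # lowercase so a sentence-initial "Woman" also matches
            print(str(sentence))
            break # stop after the first match so each sentence prints only once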




You can also calculate a sentiment ("polarity") score for any particular sentence, as well as a measure called "subjectivity." With TextBlob's default analyzer, polarity ranges from -1.0 (negative) to 1.0 (positive), and subjectivity from 0.0 (very objective) to 1.0 (very subjective). Both of these scores are very rough approximations of the thing they purport to measure, and they're not very accurate in many contexts. But they can be fun to play around with. For example, here are the polarity and subjectivity scores for the "woman" sentences.

In [None]:
for sentence in all_text.sentences:
    for word in sentence.words:
        if word.lower() == "woman":
            polarity = round(sentence.sentiment.polarity, 3)
            subjectivity = round(sentence.sentiment.subjectivity, 3)
            print(f"A woman sentence! Polarity: {polarity}. Subjectivity: {subjectivity}")
            break

**Are there ways that you can think of using word counts alongside the metadata we have about the AGR papers to see if we can learn anything more about how the researchers are framing their work?**
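One possible approach, sketched on invented rows: count a term in each paper separately, attach the counts to the metadata as a new column, and compare averages across the `binary` coding. Counting per paper (rather than in one giant string) keeps a single long paper from dominating. The term `"gender"` and the toy rows are placeholders, and the regex `\b` word boundaries are a simplification of proper tokenization.

```python
import pandas as pd

# Made-up papers with the same "binary" coding as the metadata
toy = pd.DataFrame({
    "binary": ["explicit", "implicit", "explicit", "unmentioned"],
    "text": [
        "Gender is male or female. We classify gender.",
        "Faces were labelled by gender.",
        "Binary gender labels: male, female.",
        "We study face recognition.",
    ],
})

term = "gender"
# Count whole-word, case-insensitive matches of the term in each paper
toy["n_" + term] = toy["text"].str.lower().str.count(r"\b" + term + r"\b")

# Average per-paper count within each binary category
means = toy.groupby("binary")["n_" + term].mean()
print(means)
```

On the real data you could build the `text` column with `AGR_metadata.assign(text=all_papers)`, assuming the metadata rows and `all_papers` are in the same order.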

One nice thing about the ``word_counts`` object is that it's actually a Python dictionary (specifically a `defaultdict`), so it can be sorted like this:

In [None]:
from collections import OrderedDict
import operator 

sorted_word_counts = OrderedDict(sorted(all_text.word_counts.items(), key=operator.itemgetter(1),reverse=True))

top_10 = dict(list(sorted_word_counts.items())[0: 10])

print(top_10)
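Because `word_counts` is a dictionary, the standard library's `collections.Counter` offers a shorter route to the same ranking: `Counter(...).most_common(n)` returns the `n` highest-count `(word, count)` pairs. A sketch on a small made-up dictionary:

```python
from collections import Counter

# Made-up counts standing in for all_text.word_counts
toy_word_counts = {"the": 50, "gender": 12, "face": 30, "of": 41, "female": 7}

# most_common(n) returns (word, count) pairs, highest counts first
top_3 = Counter(toy_word_counts).most_common(3)
print(top_3)
```

On the real data, `Counter(all_text.word_counts).most_common(10)` should give the same top 10 as the `OrderedDict` approach above.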

Here we see a common problem in text analysis: the top words are the ones that are frequent in almost any English text (articles, prepositions, and the like), so they tell us little about AGR research specifically. 

One common approach to this problem is to filter out what are called "stopwords": a list of the most common words in a particular language or context. 

The code below filters `sorted_word_counts` and produces another OrderedDict, `filtered_wc`:

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
filtered_wc = OrderedDict()

for key, value in sorted_word_counts.items():
    if key not in stop_words:
        filtered_wc[key] = value # keep the word; insertion order preserves the sorting

top_10 = dict(list(filtered_wc.items())[0: 10])

print(top_10)


These top words look a little better!
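Academic papers also contain frequent words that are not in NLTK's general English list (citation and layout words, for example). One option is to extend the stopword set with a hand-picked, corpus-specific list. The extra words and counts below are hypothetical; inspect your own top-word output to decide what belongs in such a list.

```python
from collections import OrderedDict

# A tiny hand-written stand-in for NLTK's English stopword set
toy_base_stop_words = {"the", "of", "and", "a", "in"}
# Hypothetical corpus-specific additions
extra_stop_words = {"et", "al", "fig", "table"}
toy_stop_words = toy_base_stop_words | extra_stop_words

# Made-up counts, already sorted from most to least frequent
toy_sorted_counts = OrderedDict([("the", 90), ("al", 40), ("gender", 35),
                                 ("et", 33), ("face", 30), ("fig", 12)])

# Keep only words outside the combined stopword set, preserving the order
filtered_counts = OrderedDict((word, n) for word, n in toy_sorted_counts.items()
                              if word not in toy_stop_words)
print(list(filtered_counts.items())[:2])
```

With the real data, the combined set would be `set(stopwords.words("english")) | extra_stop_words`.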

Let's compare the top words used in papers that frame the gender binary explicitly vs. implicitly vs. not at all (the "unmentioned" category). To do that, we'll need to make three separate TextBlobs, one for each set of texts. 

In [None]:
# start w/ empty lists of paper texts
explicit = []
implicit = []
unmentioned = []

# go through each row of the metadata df and check the binary category;
# depending on that category, add the corresponding text to the correct list.
# NOTE: this assumes the metadata index runs 0, 1, 2, ... in the same order
# as the sorted files in all_papers
for index, row in AGR_metadata.iterrows():
    if row["binary"] == "explicit":
        explicit.append(all_papers[index])
    elif row["binary"] == "implicit":
        implicit.append(all_papers[index])
    elif row["binary"] == "unmentioned":
        unmentioned.append(all_papers[index])

# now join each list into one newline-separated string and convert it to a TextBlob
explicit = TextBlob("\n".join(explicit))
implicit = TextBlob("\n".join(implicit))
unmentioned = TextBlob("\n".join(unmentioned))

In [None]:
# You do the rest from here...
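One way to start: write a helper that returns the top non-stopword words for any text, then call it on each group. The sketch below is standard-library only, so it swaps TextBlob's tokenizer for a simple lowercase regex and takes the stopword set as a parameter; the helper name `top_words`, the tiny stopword set, and the sample strings are hypothetical. With the TextBlobs above, you could instead filter `explicit.word_counts` (and the other two) the same way as `filtered_wc`.

```python
import re
from collections import Counter

def top_words(text, n, stop_words):
    """Return the n most frequent lowercase words in text that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

demo_stop = {"the", "of", "a", "and", "is"}
groups = {
    "explicit": "The gender of a face is male or female. Gender labels.",
    "unmentioned": "The face and the image. A face detector.",
}
for name, text in groups.items():
    print(name, top_words(text, 2, demo_stop))
```

Raw counts favour whichever group has the most text, so when comparing groups of different sizes it may help to divide each count by that group's total word count.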



### What else might be interesting to look for, count, or compare?