# Gender and TED talks

One avenue of research in this project examines what -- if any -- relationship exists between the gender of a TED speaker and the talk that they give. To explore this line of thinking, we would need to have access to the gender that each speaker identifies with. 

One limitation of the data on the TED webpages is that there is no demographic information about the speakers such as age, race, or gender. In this notebook, we craft a method for detecting speakers' genders from the existing data within the collated speaker information from the TED website, including `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. 

There are two parts to this method: 1) an automated gender detector, and 2) a non-automated procedure. For the first step, we extract the number and kinds of pronouns used in each of the three speaker variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We then take a conservative approach to guessing a speaker's gender, in which the vast majority of pronouns must be either stereotypically male or stereotypically female. For any speaker who does not meet this condition, we determine their self-identified gender by reading their speaker information and consulting other media about them. Wherever we needed such additional resources, they are noted in the resulting gender file. 


## Importing needed modules and data 

In [1]:
import pandas as pd
import csv
import string

In [2]:
df_only = pd.read_csv('../data/Release_v0/TEDonly_speakers_final.csv')
df_plus = pd.read_csv('../data/Release_v0/TEDplus_speakers_final.csv')

print("df_only = ", df_only.shape, "\n" + "df_plus = ", df_plus.shape)

df_only =  (992, 27) 
df_plus =  (755, 27)


In [20]:
# Find the position of the first speaker column ("speaker_1") in the column list
# https://stackoverflow.com/questions/9542738/python-find-in-list
col_names = df_only.columns
col_list = list(col_names)
sind = col_list.index("speaker_1")
print(sind)

11


In [21]:
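# Note: `test` refers to the same DataFrame object as `df_only`,
# so the in-place drop below also removes these columns from `df_only`
test = df_only
# Drop every column that comes before "speaker_1"
test.drop(test.columns[0:sind], axis=1, inplace=True)
print(list(test.columns))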

['speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']
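
Since the pronoun search described in the introduction draws on three text fields per speaker, it can help to gather them into one string first. The following is a minimal sketch, not the notebook's own code: the helper name `speaker_text` and the example row are made up for illustration, and the only thing taken from the data is the `speakerN_occupation` / `speakerN_introduction` / `speakerN_profile` column naming shown in the output above.

```python
def speaker_text(row, n):
    """Join the occupation, introduction, and profile text for speaker n.

    `row` can be a dict or a pandas Series (both support .get).
    Missing values (None, NaN) and empty strings are skipped.
    """
    fields = ('occupation', 'introduction', 'profile')
    parts = []
    for field in fields:
        value = row.get(f'speaker{n}_{field}')
        # NaN is a float, so the isinstance check filters it out
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return ' '.join(parts)


# Made-up example row, for illustration only
example_row = {
    'speaker_1': 'Example Speaker',
    'speaker1_occupation': 'Biologist',
    'speaker1_introduction': 'She studies coral reefs.',
    'speaker1_profile': float('nan'),
}
print(speaker_text(example_row, 1))
```

Applied to the real data, something like `test.apply(lambda r: speaker_text(r, 1), axis=1)` would produce one combined string per row for the first speaker slot.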


## Distilling pronouns from prose

The challenge with this extraction is balancing efficiency (and automation) against the risk of incorrectly identifying any speaker's gender. To this end, we classify speakers in two steps. First, we use an automatic classifier that very conservatively assigns binary genders using pronoun ratios. Second, we hand-check the information on each speaker and, if needed, consult additional media about the speaker.

### Step 1 - Automatic gender detection

In this step, we attempt to automatically detect the self-identification of speakers by finding pronouns within each speaker's variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We compute three ratios: 1) male pronouns to all pronouns, 2) female pronouns to all pronouns, and 3) non-binary pronouns to all pronouns. If either of the first two ratios is very high, we classify the speaker as male or female, respectively. We are very conservative in applying these labels, as we do not want to misidentify any speakers. To aid in the next step (and for further investigations regarding the impact of gender), we store the counts for 1) male pronouns, 2) female pronouns, and 3) non-binary pronouns. 

In [8]:
# Pronoun lists
male_pronouns = {'he', 'him', 'his', 'himself'}
female_pronouns = {'she', 'her', 'hers', 'herself'}

nonbinary_pronouns = {'they', 'them', 'their', 'theirs', 'themself', 
                    'e', 'ey', 'em', 'eir', 'eirs', 'eirself', 
                    'fae', 'faer', 'faers', 'faerself', 
                    'per', 'pers', 'perself',
                    've', 'ver', 'vis', 'verself',
                    'xe', 'xem', 'xyr', 'xyrs', 'xemself',
                    'ze', 'zie', 'hir', 'hirs', 'hirself', 
                    'sie', 'zir', 'zis', 'zim', 'zieself', 
                    'emself', 'tey', 'ter', 'tem', 'ters', 'terself'} 
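
The counting and ratio test described at the start of this step could be sketched as follows. This is an illustration under stated assumptions rather than the notebook's actual classifier: the function names `count_pronouns` and `guess_gender`, the `threshold=0.9` cutoff, and the `min_pronouns=3` minimum are all hypothetical, since the text only says the ratio must be "very high". The sets below are shortened stand-ins so the block runs on its own; the full sets defined above can be passed in instead.

```python
import re

# Shortened stand-ins for the pronoun sets defined above
MALE = {'he', 'him', 'his', 'himself'}
FEMALE = {'she', 'her', 'hers', 'herself'}
NONBINARY = {'they', 'them', 'their', 'theirs', 'themself'}


def count_pronouns(text, male=MALE, female=FEMALE, nonbinary=NONBINARY):
    """Count male, female, and non-binary pronouns in a piece of text."""
    # Lowercase, then keep runs of letters so punctuation does not hide a pronoun
    tokens = re.findall(r'[a-z]+', text.lower())
    return {
        'male': sum(tok in male for tok in tokens),
        'female': sum(tok in female for tok in tokens),
        'nonbinary': sum(tok in nonbinary for tok in tokens),
    }


def guess_gender(counts, threshold=0.9, min_pronouns=3):
    """Return 'male' or 'female' only when one ratio is very high.

    Returns None when there are too few pronouns or no clear majority,
    which flags the speaker for the manual check in the second step.
    The threshold and minimum are assumed values, not taken from the source.
    """
    total = sum(counts.values())
    if total < min_pronouns:
        return None
    if counts['male'] / total >= threshold:
        return 'male'
    if counts['female'] / total >= threshold:
        return 'female'
    return None


# Made-up text, for illustration only
sample = "She is a biologist. Her work on coral reefs earned her a prize. She lives by the sea."
sample_counts = count_pronouns(sample)
print(sample_counts, guess_gender(sample_counts))
```

Returning `None` rather than a third label keeps the automatic step conservative: anything short of a clear majority is deferred to hand-checking, while the stored counts remain available for that later review.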