# Gender and TED talks

One avenue of research in this project examines what -- if any -- relationship exists between the gender of a TED speaker and the talk that they give. To explore this line of thinking, we would need to have access to the gender that each speaker identifies with. 

One limitation of the data on the TED webpages is that there is no demographic information about the speakers such as age, race, or gender. In this notebook, we craft a method for detecting speakers' genders from the existing data within the collated speaker information from the TED website, including `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. 

There are two parts to this method: 1) an automated gender detector, and 2) a non-automated procedure. For the first step, we extract the number and kinds of pronouns used in each of the three speaker variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We then take a conservative approach to guessing a speaker's gender, requiring that the vast majority of pronouns be either stereotypically male or stereotypically female. For any speaker who does not meet this condition, we determine their self-identified gender by reading their speaker information and referencing other media about the person. Wherever we needed such additional resources, they are noted in the resulting gender file.


## Importing needed modules and data 

In [1]:
import pandas as pd
import csv
import string

In [2]:
df_only = pd.read_csv('../data/Release_v0/TEDonly_speakers_final.csv')
df_plus = pd.read_csv('../data/Release_v0/TEDplus_speakers_final.csv')

Now that we have imported our data files, we have to extract just the data about the speakers. Noting that the speaker information comes after the talk information, we start by finding the first column related to `speaker_1`.

In [3]:
col_names = df_only.columns
col_list = list(col_names) 
sind = col_list.index("speaker_1")

# https://stackoverflow.com/questions/9542738/python-find-in-list
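
As a minimal sketch on a toy frame (every column name other than `speaker_1` is made up for illustration), the same position can also be read directly from the column index with `Index.get_loc`, which skips the intermediate list:

```python
import pandas as pd

# Toy stand-in for df_only; only "speaker_1" mirrors the real data.
toy = pd.DataFrame(columns=["talk_title", "talk_views",
                            "speaker_1", "speaker_1_occupation"])

# Position of the first speaker column, found two ways
pos_list = list(toy.columns).index("speaker_1")  # approach used in this notebook
pos_loc = toy.columns.get_loc("speaker_1")       # pandas-native lookup

print(pos_list, pos_loc)
```

Both lookups assume the column name is unique; with duplicated column names `get_loc` no longer returns a single integer.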

Now we remove all the columns associated with the talk information.

In [4]:
# Work on a copy so that df_only itself is left untouched by the in-place drop
speaker_info = df_only.copy()
speaker_info.drop(speaker_info.columns[0:sind], axis=1, inplace=True)

Now we extract information for each of the potentially four speakers per talk. We will rename the columns in each of these four data frames to be identical. We will also remove any rows that are completely empty and check for duplicates before merging all of the pieces together in one cohesive list of all the speakers. 

In [5]:
speak1 = speaker_info.iloc[:,0:4]
speak1.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak1.shape)
speak1 = speak1.dropna(how='all')
speak1 = speak1.drop_duplicates()
print(speak1.shape)

(992, 4)
(852, 4)


In [6]:
speak2 = speaker_info.iloc[:,4:8]
speak2.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak2.shape)
speak2 = speak2.dropna(how='all')
print(speak2.shape)
speak2 = speak2.drop_duplicates()
print(speak2.shape)

(992, 4)
(26, 4)
(25, 4)


In [7]:
speak3 = speaker_info.iloc[:,8:12]
speak3.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak3.shape)
speak3 = speak3.dropna(how='all')
print(speak3.shape)

(992, 4)
(1, 4)


In [8]:
speak4 = speaker_info.iloc[:,12:16]
speak4.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak4.shape)
speak4 = speak4.dropna(how='all')
print(speak4.shape)

(992, 4)
(1, 4)


In [9]:
# Above from https://stackoverflow.com/questions/47060980/...
#        renaming-the-column-names-of-pandas-dataframe-is-not-working-as-expected-pytho
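
The four cells above repeat the same steps for each speaker slot. As a hedged sketch, assuming (as in the cells above) that every slot occupies exactly four adjacent columns, the same extraction could be written as a loop. The toy frame and its values are invented for illustration and only two slots are shown:

```python
import numpy as np
import pandas as pd

# Toy frame with two speaker slots of four columns each (hypothetical data)
toy = pd.DataFrame({
    "speaker_1": ["A", "B", "A"],
    "speaker_1_occupation": ["x", "y", "x"],
    "speaker_1_introduction": ["i", "j", "i"],
    "speaker_1_profile": ["p", "q", "p"],
    "speaker_2": [np.nan, "C", np.nan],
    "speaker_2_occupation": [np.nan, "z", np.nan],
    "speaker_2_introduction": [np.nan, "k", np.nan],
    "speaker_2_profile": [np.nan, "r", np.nan],
})

names = ["speaker", "occupation", "introduction", "profile"]
blocks = []
for start in range(0, toy.shape[1], 4):
    # Take one slot, give it the shared column names, then clean it
    block = toy.iloc[:, start:start + 4].copy()
    block.columns = names
    block = block.dropna(how="all").drop_duplicates()
    blocks.append(block)

print([b.shape for b in blocks])
```

Unlike the cells above, this loop also removes duplicates from every slot, which is harmless because duplicates are dropped again after the pieces are merged.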

At this point, we have 4 smaller data frames corresponding to the four possible speakers for each talk. Now we will put them back together as one data frame and adjust the indices. 

*Note* - In this data frame the indices are the row numbers. To avoid creating "duplicate indices", we will have a few instances below where we use flags from `pandas` to reset the indices to match the row numbers exactly. 

In [10]:
speak_all = pd.concat([speak1, speak2,speak3, speak4], ignore_index = True) # <-- Sarah M. Brown

Even though we dropped the duplicates from each of the smaller data frames, this does not prevent a speaker from showing up as a first speaker in one talk and the third speaker in another talk. So again, we have to check for and remove any repeats.

In [11]:
# Example of a remaining duplicate
speak_all[speak_all["speaker"] =="Rhiannon Giddens"]

Unnamed: 0,speaker,occupation,introduction,profile
783,Rhiannon Giddens,Musician,With a rich voice and an equally rich sense of...,Why you should listen\nSinger-songwriter Rhian...
870,Rhiannon Giddens,Musician,With a rich voice and an equally rich sense of...,Why you should listen\nSinger-songwriter Rhian...


In [12]:
speak_all.drop_duplicates(inplace=True)
speak_all.reset_index(drop=True,inplace=True)

In [13]:
# drop_duplicates() from 
# http://www.datasciencemadesimple.com/get-unique-values-rows-dataframe-python-pandas/

# Extra flags from Sarah M. Brown
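
To make the effect of the index flags concrete, here is a small sketch on two invented one-column frames. It shows what `ignore_index=True` and `reset_index(drop=True)` do to the row labels; the frame contents are placeholders, not TED data:

```python
import pandas as pd

a = pd.DataFrame({"speaker": ["A", "B"]})
b = pd.DataFrame({"speaker": ["B", "C"]})

kept = pd.concat([a, b])                      # labels repeat: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # labels renumbered: 0, 1, 2, 3

deduped = fresh.drop_duplicates()             # second "B" removed, leaving a gap
renumbered = deduped.reset_index(drop=True)   # labels match row numbers again

print(list(kept.index), list(fresh.index),
      list(deduped.index), list(renumbered.index))
```

Without `drop=True`, `reset_index` would keep the old labels as an extra column instead of discarding them.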

## Distilling pronouns from prose

The challenge with this extraction is balancing efficiency (and automation) with not incorrectly identifying any speaker's gender. To this end, we classify speakers in two steps. First, we use an automatic classifier that very conservatively assigns binary genders using pronoun ratios. The second step involves hand-checking the information on each speaker who did not receive a classification from the previous step and -- if needed -- checking additional media about the speaker.

### Step 1 - Automatic gender detection

In this step, we attempt to automatically detect the self-identification of speakers by finding pronouns within each speaker's variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We compute the ratios of 1) male pronouns compared to all pronouns, 2) female pronouns compared to all pronouns, and 3) non-binary pronouns compared to all pronouns. If either of the first two ratios is very high, then we classify the speaker as male or female, respectively. We are very conservative in applying these labels, as we do not want to misidentify any speakers. To aid in the next step (and for further investigations regarding the impact of gender), we store the counts for 1) male pronouns, 2) female pronouns, and 3) non-binary pronouns. 

### Step 1.0 - Prepare the data frame 

We start by adding the columns that we intend to fill in: `Gender_auto`, `MaleScore`, `FemaleScore`, and `NonBinaryScore`. 

In [14]:
header_list = ['speaker', 'occupation', 'introduction', 'profile', 
               'Gender_auto', 'MaleScore', 'FemaleScore', 'NonBinaryScore']

speak_all = speak_all.reindex(columns = header_list)
speak_all.head()


#https://stackoverflow.com/questions/16327055/how-to-add-an-empty-column-to-a-dataframe

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,,,,
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,,,,
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,,,,
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,,,,
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,,,,
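
As a small illustration of why the new columns start out empty, the sketch below applies `reindex(columns=...)` to an invented one-row frame. Columns named in the list but absent from the frame are created and filled with `NaN`:

```python
import pandas as pd

# Invented one-row frame; only the column names echo the notebook
toy = pd.DataFrame({"speaker": ["A"], "occupation": ["x"]})

wider = toy.reindex(columns=["speaker", "occupation", "Gender_auto", "MaleScore"])

# Number of missing values per column
print(wider.isna().sum().to_dict())
```

Note that `reindex` also drops any existing column that is left out of the list, so the list needs to name every column that should be kept.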


In [18]:
# Pronoun lists
male_pronouns = {'he', 'him', 'his', 'himself'}
female_pronouns = {'she', 'her', 'hers', 'herself'}

nonbinary_pronouns = {'they', 'them', 'their', 'theirs', 'themself', 
                    'e', 'ey', 'em', 'eir', 'eirs', 'eirself', 
                    'fae', 'faer', 'faers', 'faerself', 
                    'per', 'pers', 'perself',
                    've', 'ver', 'vis', 'verself',
                    'xe', 'xem', 'xyr', 'xyrs', 'xemself',
                    'ze', 'zie', 'hir', 'hirs', 'hirself', 
                    'sie', 'zir', 'zis', 'zim', 'zieself', 
                    'emself', 'tey', 'ter', 'tem', 'ters', 'terself'} 

We next define a function that finds and counts the various pronouns used in the text of the `speaker_occupation`, `speaker_introduction`, and `speaker_profile` variables. 
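
The counting relies on splitting the text on whitespace and stripping punctuation from the ends of each word. The sketch below runs that step on an invented sentence and compares it with a letters-only regular expression. The regular expression is an assumed alternative for comparison, not part of this notebook's method:

```python
import re
import string

text = "He's an author; his book (and hers!) sold well."

# Approach used in this notebook: split on whitespace, strip outer punctuation
stripped = [w.strip(string.punctuation) for w in text.lower().split()]

# Assumed alternative: keep only runs of letters, which also splits contractions
lettered = re.findall(r"[a-z]+", text.lower())

print(stripped)
print(lettered)
```

Because `strip` only touches the ends of a word, a contraction such as "he's" stays in one piece and is not counted as "he". This makes the counts slightly lower than a letters-only tokenizer would give.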

In [19]:
def find_gender(input_description):

	# Initialize score variables
	male_score = 0
	female_score = 0
	nonbinary_score = 0

	# Lowercase the description and isolate every word
	des_lst = input_description.lower().split()
	for word in des_lst:
		cleanword = word.strip(string.punctuation)

		# Add to the appropriate score
		if cleanword in male_pronouns:
			male_score = male_score + 1
		elif cleanword in female_pronouns:
			female_score = female_score + 1
		elif cleanword in nonbinary_pronouns:
			nonbinary_score = nonbinary_score + 1

	total = male_score + female_score + nonbinary_score

	if total == 0: # Only happens if there are no pronouns
		gender = 'no pronouns'
	# elif (nonbinary_score <= (.1)*total):
	# Note: The above line is too harsh. 
	
	# If two of the three pronoun counts are zero
	elif (male_score == 0) and (female_score == 0):
		gender = 'nonbinary'
	elif (male_score == 0) and (nonbinary_score == 0):
		gender = 'female'
	elif (female_score == 0) and (nonbinary_score == 0):
		gender = 'male'

	# Otherwise at most one pronoun count is zero; with at most one
	# non-binary pronoun, compare only the female and male counts
	elif (nonbinary_score <= 1):
		score = (female_score - male_score) / (female_score + male_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'male'
		else:
			gender = 'undetected'
	elif (male_score == 0):
		score = (female_score - nonbinary_score) / (female_score + nonbinary_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'nonbinary'
		else:
			gender = 'undetected'
	elif (female_score == 0):
		score = (male_score - nonbinary_score) / (male_score + nonbinary_score)
		if score > 0.3:
			gender = 'male'
		elif score < -0.3:
			gender = 'nonbinary'
		else:
			gender = 'undetected'
	else:
		gender = 'last case'

	return (gender, male_score, female_score, nonbinary_score)
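
To read the 0.3 cut-off in terms of shares: for two counts `a` and `b`, the score `(a - b) / (a + b)` equals `2 * a / (a + b) - 1`, so a score above 0.3 means that `a` makes up more than 65% of the two counts. The helper names below are hypothetical and only restate that arithmetic:

```python
def balance_score(a, b):
    """Normalized difference as used in find_gender; ranges from -1 to 1."""
    return (a - b) / (a + b)


def share(a, b):
    """Fraction of the two counts that belongs to the first one."""
    return a / (a + b)


# (female, male) count pairs: a clear majority, the boundary case, and a tie
for female, male in [(11, 1), (13, 7), (2, 2)]:
    print(female, male,
          round(balance_score(female, male), 3),
          round(share(female, male), 3))
```

The pair (13, 7) sits on the boundary, with a 65% share and a score of 0.3 up to floating-point rounding, so it is not the kind of clear majority the conservative rule is looking for.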

In [29]:
for row in speak_all.itertuples():
    inds = row.Index
    # Cast every field to str so that a missing (NaN) value cannot break the join
    des_list = [str(row.occupation), str(row.introduction), str(row.profile)]
    descript = " ".join(des_list)
    #auto_gender, male_score, female_score, nonbinary_score = find_gender(descript)
    speak_all.loc[inds,['Gender_auto', 'MaleScore', 'FemaleScore', 'NonBinaryScore']] = find_gender(descript)

# https://cmdlinetips.com/2018/12/how-to-loop-through-pandas-rows-or-how-to-iterate-over-pandas-rows/
# https://www.programiz.com/python-programming/methods/string/join
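
As a possible alternative to the row-by-row loop, the sketch below builds the joined description for all rows at once and then expands the returned tuples into columns. It assumes a simplified, hypothetical stand-in named `toy_find_gender` (binary pronouns only, simple majority) and an invented two-row frame, so it illustrates only the mechanics, not the classification rules above:

```python
import string

import pandas as pd


def toy_find_gender(text):
    """Hypothetical, simplified stand-in for find_gender (binary pronouns only)."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    male = sum(w in {"he", "him", "his", "himself"} for w in words)
    female = sum(w in {"she", "her", "hers", "herself"} for w in words)
    if male > female:
        label = "male"
    elif female > male:
        label = "female"
    else:
        label = "undetected"
    return (label, male, female)


toy = pd.DataFrame({
    "occupation": ["Musician", None],
    "introduction": ["She sings.", "He codes."],
    "profile": ["Her voice carries.", "His tools help him."],
})

text_cols = ["occupation", "introduction", "profile"]
# Missing values become empty strings before the three fields are joined
descriptions = toy[text_cols].fillna("").agg(" ".join, axis=1)

out_cols = ["Gender_auto", "MaleScore", "FemaleScore"]
scores = pd.DataFrame(descriptions.map(toy_find_gender).tolist(),
                      columns=out_cols, index=toy.index)
toy = toy.join(scores)
print(toy[out_cols])
```

Replacing missing values with empty strings, rather than casting them with `str`, keeps the literal text "nan" out of the description. That makes no difference to the pronoun counts, since "nan" is not a pronoun.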

In [28]:
#speak_all.loc[617,:]
#speak_all.loc[485,:]

speaker                                                 Gary Kovacs
occupation                                                      NaN
introduction      Gary Kovacs is a technologist and the former C...
profile           Why you should listen\nGary Kovacs is the chie...
Gender_auto                                                    male
MaleScore                                                         4
FemaleScore                                                       0
NonBinaryScore                                                    0
Name: 485, dtype: object

('male', 5, 0, 0)
('male', 5, 0, 0)
('female', 1, 11, 0)
('male', 8, 0, 2)
('male', 13, 0, 2)