# Gender and TED talks

One avenue of research in this project examines what -- if any -- relationship exists between the gender of a TED speaker and the talk that they give. To explore this line of thinking, we would need to have access to the gender that each speaker identifies with. 

One limitation of the data on the TED webpages is that there is no demographic information about the speakers such as age, race, or gender. In this notebook, we craft a method for detecting speakers' genders from the existing data within the collated speaker information from the TED website, including `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. 

There are two parts to this method: 1) an automated gender detector, and 2) a non-automated procedure. For the first step, we extract the number and kinds of pronouns used in each of the three speaker variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We then take a conservative approach to guessing a speaker's gender, in which the vast majority of pronouns must be either stereotypically male or stereotypically female. For any person who does not meet this condition, we determine their self-identified gender by reading their speaker information and referencing other media about the person. Any speaker for whom we needed additional resources is noted in the resulting gender file. 


## Importing needed modules and data 

In [1]:
import pandas as pd
import csv
import string

In [2]:
df_only = pd.read_csv('../data/Release_v0/TEDonly_speakers_final.csv')
df_plus = pd.read_csv('../data/Release_v0/TEDplus_speakers_final.csv')

Now that we have imported our data files, we have to extract just the data about the speakers. Noting that the speaker information comes after the talk information, we start by finding the first column related to `speaker_1`.

In [3]:
col_names = df_only.columns
col_list = list(col_names) 
sind = col_list.index("speaker_1")

# https://stackoverflow.com/questions/9542738/python-find-in-list
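
The same position can also be looked up directly on the column index, without converting it to a list first. A minimal sketch on a toy frame; every column name other than `speaker_1` is hypothetical and only stands in for the real talk and speaker columns:

```python
import pandas as pd

# Toy frame: two talk columns followed by speaker columns (names are illustrative).
toy = pd.DataFrame(
    columns=["talk_title", "talk_views", "speaker_1", "speaker_1_occupation"]
)

# Index.get_loc returns the integer position of a unique column label.
sind_toy = toy.columns.get_loc("speaker_1")

# Everything from that position onward is speaker information.
speaker_cols = toy.columns[sind_toy:]
print(sind_toy, list(speaker_cols))
```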

Now we remove all the columns associated with the talk information. 

In [4]:
# Drop the talk columns; drop() returns a new frame, so df_only is left untouched
speaker_info = df_only.drop(columns=df_only.columns[:sind])

Now we extract the information for each of the up to four speakers per talk. We rename the columns in each of these four data frames to be identical. We also remove any rows that are completely empty and check for duplicates before merging all of the pieces into one cohesive list of all the speakers. 

In [5]:
speak1 = speaker_info.iloc[:,0:4]
speak1.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak1.shape)
speak1 = speak1.dropna(how='all')
speak1 = speak1.drop_duplicates()
print(speak1.shape)

(992, 4)
(852, 4)


In [6]:
speak2 = speaker_info.iloc[:,4:8]
speak2.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak2.shape)
speak2 = speak2.dropna(how='all')
print(speak2.shape)
speak2 = speak2.drop_duplicates()
print(speak2.shape)

(992, 4)
(26, 4)
(25, 4)


In [7]:
speak3 = speaker_info.iloc[:,8:12]
speak3.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak3.shape)
speak3 = speak3.dropna(how='all')
print(speak3.shape)

(992, 4)
(1, 4)


In [8]:
speak4 = speaker_info.iloc[:,12:16]
speak4.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak4.shape)
speak4 = speak4.dropna(how='all')
print(speak4.shape)

(992, 4)
(1, 4)


In [9]:
# Above from https://stackoverflow.com/questions/47060980/...
#        renaming-the-column-names-of-pandas-dataframe-is-not-working-as-expected-pytho
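
The four cells above repeat the same slice, rename, and drop steps. As a sketch only, the pattern can be written as a loop; the toy frame `toy_info` below is a hypothetical stand-in for `speaker_info` with two speaker slots instead of four:

```python
import pandas as pd

names = ["speaker", "occupation", "introduction", "profile"]

# Toy stand-in for speaker_info: two speaker slots of four columns each.
toy_info = pd.DataFrame(
    [
        ["A", "Writer", "intro A", "bio A", None, None, None, None],
        ["B", "Chemist", "intro B", "bio B", "C", "Artist", "intro C", "bio C"],
        ["A", "Writer", "intro A", "bio A", None, None, None, None],
    ]
)

pieces = []
for start in range(0, toy_info.shape[1], len(names)):
    # Take one block of four columns and give it the shared column names.
    block = toy_info.iloc[:, start:start + len(names)].copy()
    block.columns = names
    # Keep only rows where at least one field is filled in.
    pieces.append(block.dropna(how="all"))

# Stack the blocks, then remove repeated speakers and renumber the rows.
toy_speakers = (
    pd.concat(pieces, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(toy_speakers)
```

Taking a `.copy()` of each slice before renaming avoids modifying a view of the original frame.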

At this point, we have 4 smaller data frames corresponding to the four possible speakers for each talk. Now we will put them back together as one data frame and adjust the indices. 

*Note* - In this data frame, the indices are the row numbers. To avoid creating "duplicate indices", there are a few places below where we use flags from `pandas` to reset the indices so that they match the row numbers exactly. 

In [10]:
speak_all = pd.concat([speak1, speak2,speak3, speak4], ignore_index = True) # <-- Sarah M. Brown
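
To illustrate the note about duplicate indices, here is a small sketch on two toy frames (`a` and `b` are hypothetical) showing what `ignore_index` changes:

```python
import pandas as pd

a = pd.DataFrame({"speaker": ["A", "B"]})
b = pd.DataFrame({"speaker": ["C"]})

# Without ignore_index, each piece keeps its own row labels.
kept = pd.concat([a, b])
# With ignore_index=True, the result is renumbered from zero.
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist(), kept.index.is_unique)
print(fresh.index.tolist(), fresh.index.is_unique)
```

With repeated labels, a lookup such as `kept.loc[0]` returns more than one row, which is why the indices are reset here.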

Even though we dropped the duplicates from each of the smaller data frames, this does not prevent a speaker from showing up as a first speaker in one talk and the third speaker in another talk. So again, we have to check for and remove any repeats.

In [11]:
# Example of remaining double
speak_all[speak_all["speaker"] =="Rhiannon Giddens"]

Unnamed: 0,speaker,occupation,introduction,profile
783,Rhiannon Giddens,Musician,With a rich voice and an equally rich sense of...,Why you should listen\nSinger-songwriter Rhian...
870,Rhiannon Giddens,Musician,With a rich voice and an equally rich sense of...,Why you should listen\nSinger-songwriter Rhian...


In [13]:
speak_all.drop_duplicates(inplace=True)
speak_all.reset_index(drop=True,inplace=True)

In [14]:
# drop_duplicates() from 
# http://www.datasciencemadesimple.com/get-unique-values-rows-dataframe-python-pandas/

# Extra flags from Sarah M. Brown

In [32]:
speak_all.shape

(872, 4)

In [37]:
speak_all = speak_all[~speak_all.duplicated(['speaker', 'occupation', 'introduction'])]
# From https://stackoverflow.com/questions/40059994/pandas-get-a-list-of-index-from-dataframe-loc

In [39]:
speak_all.shape

(870, 4)

In [40]:
speak_all.reset_index(drop=True,inplace=True)
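
The drop from 872 to 870 rows shows that two rows matched on `speaker`, `occupation`, and `introduction` without being identical across all four columns. The notebook does not show what differs; the sketch below uses toy data, with a hypothetical difference in `profile`, to show how a full-row check and a subset check can disagree:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "speaker": ["A", "A", "B"],
        "occupation": ["Writer", "Writer", "Chemist"],
        "introduction": ["intro A", "intro A", "intro B"],
        "profile": ["bio A", "bio A (revised)", "bio B"],
    }
)

# Full-row comparison: the two "A" rows differ in profile, so both survive.
full_row = toy.drop_duplicates()

# Subset comparison: the two "A" rows match on the three listed columns.
by_subset = toy[~toy.duplicated(["speaker", "occupation", "introduction"])]

print(len(full_row), len(by_subset))
```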

In [42]:
speak_all[speak_all["speaker"] == "Bill Gates"]

Unnamed: 0,speaker,occupation,introduction,profile
249,Bill Gates,Philanthropist,"A passionate techie and a shrewd businessman, ...",Why you should listen\nBill Gates is the found...


## Distilling pronouns from prose

The challenge with this extraction is balancing efficiency (and automation) against the risk of misidentifying any speaker's gender. To this end, we classify speakers in two steps. First, we use an automatic classifier that very conservatively assigns binary genders using pronoun ratios. Second, we hand check the information on each speaker who did not receive a classification in the first step and, if needed, consult additional media about the speaker.

### Step 1 - Automatic gender detection

In this step, we attempt to automatically detect the self-identification of speakers by finding pronouns within each speaker's variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We count 1) male pronouns, 2) female pronouns, and 3) non-binary pronouns, and compare the counts against one another. If male or female pronouns clearly dominate, then we classify the speaker as male or female, respectively. We are very conservative in applying these labels, as we do not want to misidentify any speakers. To aid in the next step (and for further investigations regarding the impact of gender), we store the three counts alongside the label. 
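
When two kinds of pronouns both appear, the detector defined further down (`find_gender`) compares them with a normalized difference and a cutoff of 0.3. A minimal sketch of just that comparison; the helper name `compare_counts` and its return labels are hypothetical and not part of the notebook:

```python
def compare_counts(first, second, threshold=0.3):
    """Return the normalized difference of two counts and which side, if any, wins.

    Assumes first + second > 0; the caller must handle the all-zero case.
    """
    score = (first - second) / (first + second)
    if score > threshold:
        return score, "first"
    if score < -threshold:
        return score, "second"
    return score, "undetected"

# Counts taken from rows displayed later in this notebook.
print(compare_counts(11, 1))  # female vs. male counts for Majora Carter
print(compare_counts(4, 5))   # female vs. non-binary counts for Mena Trott
```

The score ranges from -1 to 1, so a cutoff of 0.3 means the larger count must make up more than 65% of the two counts combined.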

### Step 1.0 - Prepare the data frame 

We start by adding the columns that we intend to fill in: `Gender_auto`, `MaleScore`, `FemaleScore`, and `NonBinaryScore`. 

In [15]:
header_list = ['speaker', 'occupation', 'introduction', 'profile', 
               'Gender_auto', 'MaleScore', 'FemaleScore', 'NonBinaryScore']

speak_all = speak_all.reindex(columns = header_list)
speak_all.head()


#https://stackoverflow.com/questions/16327055/how-to-add-an-empty-column-to-a-dataframe

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,,,,
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,,,,
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,,,,
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,,,,
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,,,,


### Step 1.1 - Create necessary systems

We create the _global_ sets of pronouns and the function that conservatively assigns genders. 

In [16]:
# Pronoun lists
male_pronouns = {'he', 'him', 'his', 'himself'}
female_pronouns = {'she', 'her', 'hers', 'herself'}

nonbinary_pronouns = {'they', 'them', 'their', 'theirs', 'themself', 
                    'e', 'ey', 'em', 'eir', 'eirs', 'eirself', 
                    'fae', 'faer', 'faers', 'faerself', 
                    'per', 'pers', 'perself',
                    've', 'ver', 'vis', 'verself',
                    'xe', 'xem', 'xyr', 'xyrs', 'xemself',
                    'ze', 'zie', 'hir', 'hirs', 'hirself', 
                    'sie', 'zir', 'zis', 'zim', 'zieself', 
                    'emself', 'tey', 'ter', 'tem', 'ters', 'terself'} 

We next define a function that finds and counts the various pronouns used in the text of the `speaker_occupation`, `speaker_introduction`, and `speaker_profile` variables. 

In [17]:
def find_gender(input_description):
	global male_pronouns, female_pronouns, nonbinary_pronouns

	# Initialize score variables
	male_score = 0
	female_score = 0
	nonbinary_score = 0

	# Lowercase the description and isolate every word
	des_lst = input_description.lower().split()
	for word in des_lst:
		cleanword = word.strip(string.punctuation)

		# Add to the appropriate score
		if cleanword in male_pronouns:
			male_score = male_score + 1
		elif cleanword in female_pronouns:
			female_score = female_score + 1
		elif cleanword in nonbinary_pronouns:
			nonbinary_score = nonbinary_score + 1

	total = male_score + female_score + nonbinary_score

	if total == 0: # Only happens if there are no pronouns
		gender = 'no pronouns'
	# elif (nonbinary_score <= (.1)*total):
	# Note: The above line is too harsh. 
	
	# If two kinds of pronouns have a count of zero
	elif (male_score == 0) and (female_score == 0):
		gender = 'non-binary'
	elif (male_score == 0) and (nonbinary_score == 0):
		gender = 'female'
	elif (female_score == 0) and (nonbinary_score == 0):
		gender = 'male'

	# From here on, at most one kind of pronoun has a count of zero.
	# A lone non-binary pronoun is ignored; female and male counts are compared.
	elif (nonbinary_score <= 1):
		score = (female_score - male_score) / (female_score + male_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'male'
		else:
			gender = 'undetected'
	elif (male_score == 0):
		score = (female_score - nonbinary_score) / (female_score + nonbinary_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'non-binary'
		else:
			gender = 'undetected'
	elif (female_score == 0):
		score = (male_score - nonbinary_score) / (male_score + nonbinary_score)
		if score > 0.3:
			gender = 'male'
		elif score < -0.3:
			gender = 'non-binary'
		else:
			gender = 'undetected'
	else:
		gender = 'last case'

	return (gender, male_score, female_score, nonbinary_score)
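
To make the word-cleaning step at the top of the loop concrete, here is a small sketch on a made-up sentence (the sample text is hypothetical and not taken from any speaker profile):

```python
import string

sample = "She founded the lab; her students say the work is hers. They agree!"

# Same cleaning as in find_gender: lowercase, split on whitespace,
# then trim punctuation from both ends of each word.
tokens = [word.strip(string.punctuation) for word in sample.lower().split()]
print(tokens)

female = {"she", "her", "hers", "herself"}
female_count = sum(token in female for token in tokens)
print(female_count)
```

Because `strip` only trims the ends of a word, a contraction such as `he's` is left intact and is not counted as a pronoun.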

Now we loop over each row, count the number of times each kind of pronoun comes up, and use the above detector to conservatively assign gender groups. 

In [18]:
for row in speak_all.itertuples():
    inds = row.Index
    des_list = [str(row.occupation), str(row.introduction), str(row.profile)]
    descript = " ".join(des_list)
    #auto_gender, male_score, female_score, nonbinary_score = find_gender(descript)
    speak_all.loc[inds,['Gender_auto', 'MaleScore', 'FemaleScore', 'NonBinaryScore']] = find_gender(descript)

# https://cmdlinetips.com/2018/12/how-to-loop-through-pandas-rows-or-how-to-iterate-over-pandas-rows/
# https://www.programiz.com/python-programming/methods/string/join
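
An alternative to assigning row by row is to collect one result per row and attach them all at once. This is a sketch under stated assumptions, not the notebook's method: `count_pronouns` is a hypothetical, simplified stand-in for `find_gender` that returns only two counts, and the toy frame has only two text columns.

```python
import string
import pandas as pd

def count_pronouns(text):
    """Hypothetical stand-in for find_gender: returns (male, female) counts only."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    male = sum(w in {"he", "him", "his", "himself"} for w in words)
    female = sum(w in {"she", "her", "hers", "herself"} for w in words)
    return male, female

toy = pd.DataFrame(
    {
        "occupation": ["Writer", None],
        "profile": ["She writes about her garden.", "He builds his own robots."],
    }
)

# Missing fields are replaced by empty strings so they add no words.
text = toy["occupation"].fillna("") + " " + toy["profile"].fillna("")

# One tuple per row, collected into a frame that shares toy's index.
scores = pd.DataFrame(
    [count_pronouns(t) for t in text],
    columns=["MaleScore", "FemaleScore"],
    index=toy.index,
)
toy = toy.join(scores)
print(toy[["MaleScore", "FemaleScore"]])
```

Building the scores as a separate frame keeps the counts as integers, whereas filling pre-existing empty columns leaves them as floats.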

In [19]:
#speak_all.loc[617,:]
#speak_all.loc[485,:]

('male', 5, 0, 0)
('male', 5, 0, 0)
('female', 1, 11, 0)
('male', 8, 0, 2)
('male', 13, 0, 2)

In [20]:
results = speak_all['Gender_auto'].value_counts()

#https://stackoverflow.com/questions/38309729/count-unique-values-with-pandas-per-groups/38309823

In [21]:
results

male           545
female         211
undetected      50
non-binary      41
no pronouns     18
last case        7
Name: Gender_auto, dtype: int64

In [22]:
speak_all.loc[speak_all["Gender_auto"] == 'undetected']

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore
21,Mena Trott,"Blogger; cofounder, Six Apart",Mena Trott and her husband Ben founded Six Apa...,Why you should listen\nTime's 2006 Person of t...,undetected,0.0,4.0,5.0
22,Eve Ensler,"Playwright, activist","Eve Ensler created the ground-breaking ""Vagina...",Why you should listen\nInspired by intimate co...,undetected,0.0,7.0,4.0
41,Phil Borges,Photographer,Dentist-turned-photographer Phil Borges docume...,Why you should listen\nSpurred by the rapid di...,undetected,5.0,0.0,5.0
45,Neil Gershenfeld,"Physicist, personal fab pioneer",As Director of MIT’s Center for Bits and Atoms...,Why you should listen\nMIT's Neil Gershenfeld ...,undetected,2.0,0.0,2.0
58,Sheila Patek,"Biologist, biomechanics researcher",Biologist Sheila Patek is addicted to speed — ...,"Why you should listen\nSheila Patek, a UC Berk...",undetected,0.0,7.0,5.0
68,Jeff Bezos,Online commerce pioneer,"As founder and CEO of Amazon.com, Jeff Bezos d...",Why you should listen\nJeff Bezos didn't inven...,undetected,5.0,0.0,5.0
75,Chris Anderson,TED Curator,After a long career in journalism and publishi...,Why you should listen\nChris Anderson is the C...,undetected,8.0,0.0,6.0
95,Will Wright,Game designer,Will Wright invented a genre of computer game ...,Why you should listen\nA technical virtuoso wi...,undetected,8.0,0.0,5.0
100,Theo Jansen,Artist,Theo Jansen is a Dutch artist who builds walki...,Why you should listen\nDutch artist Theo Janse...,undetected,4.0,0.0,3.0
109,Hod Lipson,Roboticist,Hod Lipson works at the intersection of engine...,Why you should listen\nTo say that Hod Lipson ...,undetected,3.0,0.0,4.0


## Step 2 - Hand checking



In [23]:
header_list = ['speaker', 'occupation', 'introduction', 'profile', 
               'Gender_auto', 'MaleScore', 'FemaleScore', 'NonBinaryScore', 'Gender_handcheck']

speak_all = speak_all.reindex(columns = header_list)
speak_all.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,male,5.0,0.0,0.0,
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,male,5.0,0.0,0.0,
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,female,1.0,11.0,0.0,
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,male,8.0,0.0,2.0,
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,male,13.0,0.0,2.0,


In [24]:
speak_all.loc[speak_all["Gender_auto"] == 'male','Gender_handcheck'] = "male"
speak_all.loc[speak_all["Gender_auto"] == 'female','Gender_handcheck'] = "female"
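
The two assignments above copy only the confident automatic labels into `Gender_handcheck` and leave every other row empty for hand checking. The same effect can be sketched in one step with `isin` and `where`, shown here on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gender_auto": ["male", "undetected", "female", "no pronouns"]}
)

# True only for labels the automatic step is allowed to carry forward.
confident = toy["Gender_auto"].isin(["male", "female"])

# where() keeps values where the mask is True and leaves NaN elsewhere.
toy["Gender_handcheck"] = toy["Gender_auto"].where(confident)
print(toy)
```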

In [25]:
speak_all.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,male,5.0,0.0,0.0,male
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,male,5.0,0.0,0.0,male
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,female,1.0,11.0,0.0,female
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,male,8.0,0.0,2.0,male
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,male,13.0,0.0,2.0,male


In [26]:
speak_all.loc[speak_all["Gender_auto"] == 'undetected']

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
21,Mena Trott,"Blogger; cofounder, Six Apart",Mena Trott and her husband Ben founded Six Apa...,Why you should listen\nTime's 2006 Person of t...,undetected,0.0,4.0,5.0,
22,Eve Ensler,"Playwright, activist","Eve Ensler created the ground-breaking ""Vagina...",Why you should listen\nInspired by intimate co...,undetected,0.0,7.0,4.0,
41,Phil Borges,Photographer,Dentist-turned-photographer Phil Borges docume...,Why you should listen\nSpurred by the rapid di...,undetected,5.0,0.0,5.0,
45,Neil Gershenfeld,"Physicist, personal fab pioneer",As Director of MIT’s Center for Bits and Atoms...,Why you should listen\nMIT's Neil Gershenfeld ...,undetected,2.0,0.0,2.0,
58,Sheila Patek,"Biologist, biomechanics researcher",Biologist Sheila Patek is addicted to speed — ...,"Why you should listen\nSheila Patek, a UC Berk...",undetected,0.0,7.0,5.0,
68,Jeff Bezos,Online commerce pioneer,"As founder and CEO of Amazon.com, Jeff Bezos d...",Why you should listen\nJeff Bezos didn't inven...,undetected,5.0,0.0,5.0,
75,Chris Anderson,TED Curator,After a long career in journalism and publishi...,Why you should listen\nChris Anderson is the C...,undetected,8.0,0.0,6.0,
95,Will Wright,Game designer,Will Wright invented a genre of computer game ...,Why you should listen\nA technical virtuoso wi...,undetected,8.0,0.0,5.0,
100,Theo Jansen,Artist,Theo Jansen is a Dutch artist who builds walki...,Why you should listen\nDutch artist Theo Janse...,undetected,4.0,0.0,3.0,
109,Hod Lipson,Roboticist,Hod Lipson works at the intersection of engine...,Why you should listen\nTo say that Hod Lipson ...,undetected,3.0,0.0,4.0,


In [27]:
speak_all.to_csv("gender_step1.csv", sep = ',', index=False)

In [2]:
speak_all = pd.read_csv("gender_step1.csv", sep = ",")

In [4]:
speak_all.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,male,5.0,0.0,0.0,male
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,male,5.0,0.0,0.0,male
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,female,1.0,11.0,0.0,female
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,male,8.0,0.0,2.0,male
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,male,13.0,0.0,2.0,male


In [5]:
speak_all.loc[speak_all["Gender_auto"] == 'undetected']


Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
21,Mena Trott,"Blogger; cofounder, Six Apart",Mena Trott and her husband Ben founded Six Apa...,Why you should listen\nTime's 2006 Person of t...,undetected,0.0,4.0,5.0,
22,Eve Ensler,"Playwright, activist","Eve Ensler created the ground-breaking ""Vagina...",Why you should listen\nInspired by intimate co...,undetected,0.0,7.0,4.0,
41,Phil Borges,Photographer,Dentist-turned-photographer Phil Borges docume...,Why you should listen\nSpurred by the rapid di...,undetected,5.0,0.0,5.0,
45,Neil Gershenfeld,"Physicist, personal fab pioneer",As Director of MIT’s Center for Bits and Atoms...,Why you should listen\nMIT's Neil Gershenfeld ...,undetected,2.0,0.0,2.0,
58,Sheila Patek,"Biologist, biomechanics researcher",Biologist Sheila Patek is addicted to speed — ...,"Why you should listen\nSheila Patek, a UC Berk...",undetected,0.0,7.0,5.0,
68,Jeff Bezos,Online commerce pioneer,"As founder and CEO of Amazon.com, Jeff Bezos d...",Why you should listen\nJeff Bezos didn't inven...,undetected,5.0,0.0,5.0,
75,Chris Anderson,TED Curator,After a long career in journalism and publishi...,Why you should listen\nChris Anderson is the C...,undetected,8.0,0.0,6.0,
95,Will Wright,Game designer,Will Wright invented a genre of computer game ...,Why you should listen\nA technical virtuoso wi...,undetected,8.0,0.0,5.0,
100,Theo Jansen,Artist,Theo Jansen is a Dutch artist who builds walki...,Why you should listen\nDutch artist Theo Janse...,undetected,4.0,0.0,3.0,
109,Hod Lipson,Roboticist,Hod Lipson works at the intersection of engine...,Why you should listen\nTo say that Hod Lipson ...,undetected,3.0,0.0,4.0,


In [6]:
speak_ud = speak_all.loc[speak_all["Gender_auto"] == 'undetected']
speak_ud.to_csv("gender_undected.csv", sep = ',', index=False)

### Hand evaluation

A human reader can usually determine a TED presenter's gender from a sentence or two of their bio. So for any speaker who was not given a binary gender label by the conservative auto-labeling function above, we read their bio and add a human label for their gender. 

We did this first for the speakers whose gender was `undetected`. We then repeat the process for those labeled `non-binary`, `no pronouns`, and `last case`. 

In [2]:
speak_ud_check = pd.read_csv("gender_undected.csv", sep = ',')

In [4]:
speak_ud_check.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Mena Trott,"Blogger; cofounder, Six Apart",Mena Trott and her husband Ben founded Six Apa...,Why you should listen\nTime's 2006 Person of t...,undetected,0,4,5,female
1,Eve Ensler,"Playwright, activist","Eve Ensler created the ground-breaking ""Vagina...",Why you should listen\nInspired by intimate co...,undetected,0,7,4,female
2,Phil Borges,Photographer,Dentist-turned-photographer Phil Borges docume...,Why you should listen\nSpurred by the rapid di...,undetected,5,0,5,male
3,Neil Gershenfeld,"Physicist, personal fab pioneer",As Director of MIT’s Center for Bits and Atoms...,Why you should listen\nMIT's Neil Gershenfeld ...,undetected,2,0,2,male
4,Sheila Patek,"Biologist, biomechanics researcher",Biologist Sheila Patek is addicted to speed — ...,"Why you should listen\nSheila Patek, a UC Berk...",undetected,0,7,5,female


In [6]:
speak_all = pd.read_csv("gender_step1.csv", sep = ",")


In [7]:
speak_nb = speak_all.loc[speak_all["Gender_auto"] == 'non-binary']
speak_np = speak_all.loc[speak_all["Gender_auto"] == 'no pronouns']
speak_lc = speak_all.loc[speak_all["Gender_auto"] == 'last case']

In [8]:
speak_all_checks = pd.concat([speak_nb, speak_np, speak_lc], ignore_index = True)

In [9]:
speak_all_checks

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Richard Baraniuk,Education visionary,Richard Baraniuk founded Connexions — now call...,Why you should listen\nRice University profess...,non-binary,0.0,0.0,2.0,
1,Janine Benyus,"Science writer, innovation consultant, conserv...","A self-proclaimed nature nerd, Janine Benyus' ...",Why you should listen\nIn the world envisioned...,non-binary,0.0,1.0,2.0,
2,Sergey Brin,"Computer scientist, entrepreneur and philanthr...",Sergey Brin is half of the team that founded G...,Why you should listen\nSergey Brin and Larry P...,non-binary,0.0,0.0,1.0,
3,Stewart Brand,"Environmentalist, futurist","Since the counterculture '60s, Stewart Brand h...",Why you should listen\nWith biotech accelerati...,non-binary,0.0,0.0,1.0,
4,Tierney Thys,Marine biologist,Tierney Thys is a marine biologist and science...,Why you should listen\nMarine biologist Tierne...,non-binary,0.0,3.0,6.0,
5,Gever Tulley,Tinkerer,"The founder of the Tinkering School, Gever Tul...",Why you should listen\nGever Tulley writes the...,non-binary,0.0,0.0,1.0,
6,Raspyni Brothers,Jugglers,Unapologetic vaudevillians Barry Friedman and ...,Why you should listen\nThe Raspyni Brothers' i...,non-binary,0.0,0.0,4.0,
7,They Might Be Giants,Band,John Linnell and John Flansburgh are They Migh...,Why you should listen\nGeek-pop maestros John ...,non-binary,0.0,0.0,12.0,
8,Zach Kaplan,Inventor,"Zach Kaplan is the CEO of Inventables, a compa...",Why you should listen\nZach Kaplan founded Inv...,non-binary,0.0,0.0,1.0,
9,José Antonio Abreu,Maestro,José Antonio Abreu founded El Sistema in 1975 ...,"Why you should listen\nIn Venezuela, the gulf ...",non-binary,1.0,0.0,2.0,


In [10]:
speak_all_checks.to_csv("gender_check2.csv", sep = ',', index=False)

### Creating one complete speaker file

In the previous section, we created two files that needed to be hand checked. For completeness and clarity, we saved the results as two files, `gender_check2_checked.csv` and `gender_undected_checked.csv`, which contain all of the fully checked speakers.

In this section, we will create one file that contains all the speakers with their hand-checked genders included. Specifically, we are adding the hand-checked genders from `gender_check2_checked.csv` and `gender_undected_checked.csv` to `gender_step1.csv`.

In [6]:
# Load gender_step1.csv
speak_all = pd.read_csv("gender_step1.csv", sep = ",")

# Load gender_check2_checked.csv and gender_undected_checked.csv 
handcheck2 = pd.read_csv("gender_check2_checked.csv", sep = ",")
handcheck2b = pd.read_csv("gender_undected_checked.csv", sep = ",")

# Merge based on name and occupation. Use loc? 
# https://stackoverflow.com/questions/49928463/python-pandas-update-a-dataframe-value-from-another-dataframe
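
One possible way to bring the hand-checked labels back into the main frame is a left merge on the key columns, followed by filling in the empty labels. This is a sketch under stated assumptions rather than the approach the notebook settles on: the toy frames `main` and `checked` are hypothetical, and it assumes that `speaker` plus `occupation` identifies each hand-checked speaker uniquely.

```python
import pandas as pd

keys = ["speaker", "occupation"]

main = pd.DataFrame(
    {
        "speaker": ["A", "B", "C"],
        "occupation": ["Writer", "Chemist", "Artist"],
        "Gender_handcheck": ["male", None, None],
    }
)
checked = pd.DataFrame(
    {
        "speaker": ["B", "C"],
        "occupation": ["Chemist", "Artist"],
        "Gender_handcheck": ["female", "male"],
    }
)

# Left merge keeps every row of main; validate raises an error if a key
# appears more than once in the hand-checked frame.
merged = main.merge(
    checked[keys + ["Gender_handcheck"]],
    on=keys,
    how="left",
    suffixes=("", "_checked"),
    validate="many_to_one",
)

# Fill only the labels that are still empty, then drop the helper column.
merged["Gender_handcheck"] = merged["Gender_handcheck"].fillna(
    merged["Gender_handcheck_checked"]
)
merged = merged.drop(columns="Gender_handcheck_checked")
print(merged)
```

Unlike concatenating and dropping duplicates, a left merge cannot change the number of rows in the main frame as long as the keys on the right are unique.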

In [5]:
speak_all.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,male,5.0,0.0,0.0,male
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,male,5.0,0.0,0.0,male
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,female,1.0,11.0,0.0,female
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,male,8.0,0.0,2.0,male
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,male,13.0,0.0,2.0,male


In [7]:
handcheck2.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Janine Benyus,"Science writer, innovation consultant, conserv...","A self-proclaimed nature nerd, Janine Benyus' ...",Why you should listen\nIn the world envisioned...,non-binary,0,1,2,female
1,Tierney Thys,Marine biologist,Tierney Thys is a marine biologist and science...,Why you should listen\nMarine biologist Tierne...,non-binary,0,3,6,female
2,Jacqueline Novogratz,Investor and advocate for moral leadership,Jacqueline Novogratz works to enable human flo...,Why you should listen\nJacqueline Novogratz wr...,non-binary,0,1,7,female
3,Cheryl Hayashi,Spider silk scientist,Cheryl Hayashi studies the delicate but terrif...,Why you should listen\nBiologist Cheryl Hayash...,non-binary,0,2,6,female
4,Miranda Wang,Science fair winner,Miranda Wang and Jeanny Yao have identified a ...,Why you should listen\nAfter a visit to a plas...,non-binary,0,0,1,female


In [8]:
handcheck2b.head()

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
0,Mena Trott,"Blogger; cofounder, Six Apart",Mena Trott and her husband Ben founded Six Apa...,Why you should listen\nTime's 2006 Person of t...,undetected,0,4,5,female
1,Eve Ensler,"Playwright, activist","Eve Ensler created the ground-breaking ""Vagina...",Why you should listen\nInspired by intimate co...,undetected,0,7,4,female
2,Phil Borges,Photographer,Dentist-turned-photographer Phil Borges docume...,Why you should listen\nSpurred by the rapid di...,undetected,5,0,5,male
3,Neil Gershenfeld,"Physicist, personal fab pioneer",As Director of MIT’s Center for Bits and Atoms...,Why you should listen\nMIT's Neil Gershenfeld ...,undetected,2,0,2,male
4,Sheila Patek,"Biologist, biomechanics researcher",Biologist Sheila Patek is addicted to speed — ...,"Why you should listen\nSheila Patek, a UC Berk...",undetected,0,7,5,female


In [18]:
test = handcheck2b.iloc[0].loc[["speaker","occupation"]]

In [24]:
# Trying code from here
# https://stackoverflow.com/questions/49928463/python-pandas-update-a-dataframe-value-from-another-dataframe
# pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')

test = pd.concat([speak_all,handcheck2]).drop_duplicates(["speaker","occupation"],keep='last')

In [23]:
speak_all.shape

(872, 9)

In [25]:
test.shape

(870, 9)

In [26]:
test = pd.concat([speak_all,handcheck2])

In [27]:
test.shape

(938, 9)

In [28]:
handcheck2.shape

(66, 9)

In [30]:
test2 = handcheck2.drop_duplicates(["speaker","occupation"],keep='last')

In [31]:
test2.shape

(66, 9)

In [34]:
test3 = speak_all.drop_duplicates(['speaker', 'occupation', 'introduction'],keep='last')

In [35]:
test3.shape

(870, 9)

In [38]:
speak_all.loc[speak_all.duplicated(['speaker', 'occupation', 'introduction'])]

Unnamed: 0,speaker,occupation,introduction,profile,Gender_auto,MaleScore,FemaleScore,NonBinaryScore,Gender_handcheck
384,David Byrne,"Musician, artist, writer",David Byrne builds an idiosyncratic world of m...,"Why you should listen\nMusician, author, filmm...",male,10.0,0.0,2.0,male
583,Bill Gates,Philanthropist,"A passionate techie and a shrewd businessman, ...",Why you should listen\nBill Gates is the found...,male,7.0,0.0,2.0,male


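The two repeated speakers shown above (David Byrne and Bill Gates) still need to be removed. How did this happen? The `Gender_auto` counts reported earlier sum to 872, and `gender_step1.csv` has 872 rows, so the deduplication on `['speaker', 'occupation', 'introduction']` (In [37], which reduced the frame to 870 rows) was evidently not applied to the frame that was scored and saved. The cell numbers (In [32] through In [42] sit between In [14] and In [15]) suggest that those cells were run out of order. 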
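
One way to investigate is to list every copy of the repeated rows and see which column differs between them. A sketch on toy data; the trailing space in `profile` is a hypothetical cause and is not confirmed for the rows above:

```python
import pandas as pd

subset = ["speaker", "occupation", "introduction"]

toy = pd.DataFrame(
    {
        "speaker": ["A", "B", "A"],
        "occupation": ["Writer", "Chemist", "Writer"],
        "introduction": ["intro A", "intro B", "intro A"],
        "profile": ["bio A", "bio B", "bio A "],  # trailing space in the last row
    }
)

# keep=False flags every member of a duplicated group, not just the repeats.
groups = toy[toy.duplicated(subset, keep=False)].sort_values(subset)
print(groups)

# Count distinct values per remaining column within each group;
# a count above 1 marks a column where the copies disagree.
differing = groups.groupby(subset).nunique()
print(differing)
```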