# Gender and TED talks

One avenue of research in this project examines what relationship, if any, exists between the gender of a TED speaker and the talk that they give. To explore this question, we need to know the gender that each speaker identifies with.

One limitation of the data on the TED webpages is that they include no demographic information about the speakers, such as age, race, or gender. In this notebook, we craft a method for inferring speakers' genders from the speaker information collated from the TED website, namely `speaker_occupation`, `speaker_introduction`, and `speaker_profile`.

There are two parts to this method: 1) an automated gender detector, and 2) a non-automated procedure. For the first step, we extract the number and kinds of pronouns used in each of the three speaker variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We then take a conservative approach to guessing a speaker's gender, in which the vast majority of pronouns must be either stereotypically male or stereotypically female. For any speaker who does not meet this condition, we determine their self-identified gender by reading their speaker information and consulting other media about the person. Speakers for whom we needed additional resources are noted in the resulting gender file.


## Importing needed modules and data 

In [2]:
import pandas as pd
import csv
import string

In [3]:
df_only = pd.read_csv('../data/Release_v0/TEDonly_speakers_final.csv')
df_plus = pd.read_csv('../data/Release_v0/TEDplus_speakers_final.csv')

Now that we have imported our data files, we have to extract just the data about the speakers. Noting that the speaker information comes after the talk information, we start by finding the first column related to `speaker_1`.

In [4]:
col_names = df_only.columns
col_list = list(col_names) 
sind = col_list.index("speaker_1")

# https://stackoverflow.com/questions/9542738/python-find-in-list
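
As an alternative sketch, pandas can return a column's position directly through `Index.get_loc`, which avoids building an intermediate list. The toy frame and its column names below are hypothetical; they only mimic the layout described above (talk columns first, then speaker columns).

```python
import pandas as pd

# Hypothetical toy frame: talk columns first, then speaker columns
toy = pd.DataFrame({
    "title": ["A talk"],
    "views": [100],
    "speaker_1": ["Jane Doe"],
    "speaker_1_occupation": ["Engineer"],
})

# Integer position of the first speaker column (label must be unique)
sind_toy = toy.columns.get_loc("speaker_1")

# Select the speaker columns without modifying the original frame
speaker_cols = toy.iloc[:, sind_toy:]
print(sind_toy, list(speaker_cols.columns))
```

Selecting with `iloc` rather than dropping in place keeps the source frame available for later checks.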

Now we remove all the columns associated with the talk information.

In [5]:
# Keep only the speaker columns; .copy() leaves df_only unchanged
speaker_info = df_only.iloc[:, sind:].copy()

Now we extract information for each of the potentially four speakers per talk. We will rename the columns in each of these four data frames to be identical. We will also remove any rows that are completely empty and check for duplicates before merging all of the pieces together in one cohesive list of all the speakers. 

In [45]:
speak1 = speaker_info.iloc[:,0:4]
speak1.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak1.shape)
speak1 = speak1.dropna(how='all')
speak1 = speak1.drop_duplicates()
print(speak1.shape)

(992, 4)
(852, 4)


In [62]:
speak1.head()

Unnamed: 0,speaker,occupation,introduction,profile
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...


In [61]:
speak2.head()

Unnamed: 0,speaker,occupation,introduction,profile
79,Larry Page,CEO of Google,"Larry Page is the CEO and cofounder of Google,...",Why you should listen\nLarry Page and Sergey B...
151,Julia Sweeney,"Actor, comedian, playwright",Julia Sweeney creates comedic works that tackl...,Why you should listen\nJulia Sweeney is a writ...
154,Curtis Wong,Researcher,Curtis Wong is manager of Next Media Research ...,Why you should listen\nCurtis Wong is manager ...
171,Dan Ellsey,Musician,Dan Ellsey uses Hyperscore music software and ...,Why you should listen\nA resident of Tewksbury...
211,Rufus Cappadocia,Cellist,"Globe-trotting, genre-hopping cellist Rufus Ca...",Why you should listen\nRufus Cappadocia uses h...


In [59]:
speak3.head()

Unnamed: 0,speaker,occupation,introduction,profile
642,Peter Gabriel,"Musician, activist","Peter Gabriel writes incredible songs but, as ...",Why you should listen\nPeter Gabriel was a fou...


In [60]:
speak4.head()

Unnamed: 0,speaker,occupation,introduction,profile
642,Vint Cerf,Computer scientist,"Vint Cerf, now the chief Internet evangelist a...",Why you should listen\nTCP/IP. You may not kno...


In [21]:
speak2 = speaker_info.iloc[:,4:8]
speak2.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak2.shape)
speak2 = speak2.dropna(how='all')
print(speak2.shape)
speak2 = speak2.drop_duplicates()
print(speak2.shape)

(992, 4)
(26, 4)
(25, 4)


In [8]:
speak3 = speaker_info.iloc[:,8:12]
speak3.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak3.shape)
speak3 = speak3.dropna(how='all')
print(speak3.shape)

(992, 4)
(1, 4)


In [9]:
speak4 = speaker_info.iloc[:,12:16]
speak4.columns=['speaker', 'occupation', 'introduction', 'profile']
print(speak4.shape)
speak4 = speak4.dropna(how='all')
print(speak4.shape)

(992, 4)
(1, 4)


In [10]:
# Above from https://stackoverflow.com/questions/47060980/...
#        renaming-the-column-names-of-pandas-dataframe-is-not-working-as-expected-pytho
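
The four cells above repeat the same steps for each speaker slot. One possible way to condense them is a loop over the column groups, sketched below under the assumption that each speaker occupies four consecutive columns in the order speaker, occupation, introduction, profile. The helper name `stack_speakers` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

FIELDS = ["speaker", "occupation", "introduction", "profile"]

def stack_speakers(wide, n_speakers=4):
    """Stack the per-speaker column groups of `wide` into one long frame."""
    width = len(FIELDS)
    pieces = []
    for k in range(n_speakers):
        # Take the k-th group of columns and give it the shared names
        block = wide.iloc[:, k * width:(k + 1) * width].copy()
        block.columns = FIELDS
        # Discard rows where this speaker slot is entirely empty
        pieces.append(block.dropna(how="all"))
    stacked = pd.concat(pieces, ignore_index=True)
    return stacked.drop_duplicates().reset_index(drop=True)

# Hypothetical toy data with two speaker slots per talk
toy = pd.DataFrame(
    [
        ["Ann", "Poet", "intro A", "profile A", None, None, None, None],
        ["Bob", "Chef", "intro B", "profile B", "Cy", "Pilot", "intro C", "profile C"],
        ["Ann", "Poet", "intro A", "profile A", None, None, None, None],
    ],
    columns=[f"s{k}_{f}" for k in (1, 2) for f in FIELDS],
)
toy_speakers = stack_speakers(toy, n_speakers=2)
print(toy_speakers["speaker"].tolist())
```

With the notebook's data, the equivalent call would be `stack_speakers(speaker_info)`, assuming `speaker_info` holds exactly the sixteen speaker columns.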

In [68]:
speak_all = pd.concat([speak1, speak2, speak3, speak4], ignore_index=True) # <-- Sarah M. Brown
speak_all

Unnamed: 0,speaker,occupation,introduction,profile
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...
5,Tony Robbins,Life coach; expert in leadership psychology,Tony Robbins makes it his business to know why...,Why you should listen\nTony Robbins might have...
6,Joshua Prince-Ramus,Architect,Joshua Prince-Ramus is best known as architect...,Why you should listen\nWith one of the decade'...
7,Julia Sweeney,"Actor, comedian, playwright",Julia Sweeney creates comedic works that tackl...,Why you should listen\nJulia Sweeney is a writ...
8,Rick Warren,"Pastor, author",Pastor Rick Warren is the author of The Purpos...,Why you should listen\nPastor Rick Warren is o...
9,Dan Dennett,"Philosopher, cognitive scientist",Dan Dennett thinks that human consciousness an...,Why you should listen\nOne of our most importa...


In [71]:
speak_all[speak_all["speaker"] =="Rhiannon Giddens"]

Unnamed: 0,speaker,occupation,introduction,profile
783,Rhiannon Giddens,Musician,With a rich voice and an equally rich sense of...,Why you should listen\nSinger-songwriter Rhian...
870,Rhiannon Giddens,Musician,With a rich voice and an equally rich sense of...,Why you should listen\nSinger-songwriter Rhian...


In [73]:
speak_all.drop_duplicates(inplace=True)
speak_all.reset_index(drop=True,inplace=True)
speak_all

Unnamed: 0,speaker,occupation,introduction,profile
0,Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...
1,David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...
2,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...
3,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...
4,Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...
5,Tony Robbins,Life coach; expert in leadership psychology,Tony Robbins makes it his business to know why...,Why you should listen\nTony Robbins might have...
6,Joshua Prince-Ramus,Architect,Joshua Prince-Ramus is best known as architect...,Why you should listen\nWith one of the decade'...
7,Julia Sweeney,"Actor, comedian, playwright",Julia Sweeney creates comedic works that tackl...,Why you should listen\nJulia Sweeney is a writ...
8,Rick Warren,"Pastor, author",Pastor Rick Warren is the author of The Purpos...,Why you should listen\nPastor Rick Warren is o...
9,Dan Dennett,"Philosopher, cognitive scientist",Dan Dennett thinks that human consciousness an...,Why you should listen\nOne of our most importa...


In [None]:
# drop_duplicates() from 
# http://www.datasciencemadesimple.com/get-unique-values-rows-dataframe-python-pandas/
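
Before dropping duplicates, it can help to look at which rows are affected. The sketch below uses `DataFrame.duplicated(keep=False)`, which flags every member of a duplicated group rather than only the repeats. The toy frame is hypothetical; in the notebook the same calls would be applied to `speak_all`.

```python
import pandas as pd

# Hypothetical toy frame with one repeated speaker
toy = pd.DataFrame({
    "speaker": ["Ann", "Bob", "Ann"],
    "occupation": ["Poet", "Chef", "Poet"],
})

# keep=False marks all rows that have an identical twin
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Remove the repeats and renumber the rows
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```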

## Distilling pronouns from prose

The challenge with this extraction is balancing efficiency (and automation) against the risk of misidentifying any speaker's gender. To this end, we classify speakers in two steps. First, we use an automatic classifier that very conservatively assigns binary genders using pronoun ratios. Second, we hand-check the information on each speaker and, if needed, consult additional media about the speaker.

### Step 1 - Automatic gender detection

In this step, we attempt to automatically detect the self-identification of speakers by finding pronouns within each speaker's variables: `speaker_occupation`, `speaker_introduction`, and `speaker_profile`. We compute the ratios of 1) male pronouns to all pronouns, 2) female pronouns to all pronouns, and 3) non-binary pronouns to all pronouns. If either of the first two ratios is very high, we classify the speaker as male or female, respectively. We are very conservative in applying these labels, as we do not want to misidentify any speakers. To aid in the next step (and for further investigations regarding the impact of gender), we store the counts for 1) male pronouns, 2) female pronouns, and 3) non-binary pronouns.
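
To make the three ratios concrete, here is a minimal sketch that computes them from pronoun counts. The helper name `pronoun_ratios` and the example counts are hypothetical and only illustrate the arithmetic described above.

```python
def pronoun_ratios(male, female, nonbinary):
    """Return each pronoun kind's share of all pronouns, or None if there are none."""
    total = male + female + nonbinary
    if total == 0:
        return None
    return {
        "male": male / total,
        "female": female / total,
        "non-binary": nonbinary / total,
    }

# Hypothetical counts: 1 male, 8 female, and 1 non-binary pronoun
ratios = pronoun_ratios(male=1, female=8, nonbinary=1)
print(ratios)
```

In this example female pronouns make up eight of the ten pronouns found, so the female ratio dominates the other two.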

In [8]:
# Pronoun lists
male_pronouns = {'he', 'him', 'his', 'himself'}
female_pronouns = {'she', 'her', 'hers', 'herself'}

nonbinary_pronouns = {'they', 'them', 'their', 'theirs', 'themself', 
                    'e', 'ey', 'em', 'eir', 'eirs', 'eirself', 
                    'fae', 'faer', 'faers', 'faerself', 
                    'per', 'pers', 'perself',
                    've', 'ver', 'vis', 'verself',
                    'xe', 'xem', 'xyr', 'xyrs', 'xemself',
                    'ze', 'zie', 'hir', 'hirs', 'hirself', 
                    'sie', 'zir', 'zis', 'zim', 'zieself', 
                    'emself', 'tey', 'ter', 'tem', 'ters', 'terself'} 

We next define a function that finds and counts the various pronouns used in the text of the `speaker_occupation`, `speaker_introduction`, and `speaker_profile` variables. 

In [None]:
def find_gender(input_description):
	"""Count pronouns in a description; return (gender, male_score, female_score, nonbinary_score)."""

	# Initialize score variables
	male_score = 0
	female_score = 0
	nonbinary_score = 0

	# Lowercase the description and split it into words
	des_lst = input_description.lower().split()
	for word in des_lst:
		cleanword = word.strip(string.punctuation)

		# Add to the appropriate score
		if cleanword in male_pronouns:
			male_score = male_score + 1
		elif cleanword in female_pronouns:
			female_score = female_score + 1
		elif cleanword in nonbinary_pronouns:
			nonbinary_score = nonbinary_score + 1

	total = male_score + female_score + nonbinary_score

	if total == 0: # Only happens if there are no pronouns
		gender = 'no pronouns'
	# elif (nonbinary_score <= (.1)*total):
	# Note: The above line is too harsh. 
	
	# If two of the three pronoun counts are zero
	elif (male_score == 0) and (female_score == 0):
		gender = 'nonbinary'
	elif (male_score == 0) and (nonbinary_score == 0):
		gender = 'female'
	elif (female_score == 0) and (nonbinary_score == 0):
		gender = 'male'

	# Mixed pronoun kinds: compare two counts with a normalized difference.
	# With at most one non-binary pronoun, compare female and male counts only.
	elif (nonbinary_score <= 1):
		score = (female_score - male_score) / (female_score + male_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'male'
		else:
			gender = 'undetected'
	elif (male_score == 0):
		score = (female_score - nonbinary_score) / (female_score + nonbinary_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'nonbinary'
		else:
			gender = 'undetected'
	elif (female_score == 0):
		score = (male_score - nonbinary_score) / (male_score + nonbinary_score)
		if score > 0.3:
			gender = 'male'
		elif score < -0.3:
			gender = 'nonbinary'
		else:
			gender = 'undetected'
	else:
		# All three pronoun kinds are present, with at least two non-binary pronouns
		gender = 'last case'

	return (gender, male_score, female_score, nonbinary_score)
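
One way such a function might be applied to the combined speaker table is sketched below. The helper `classify_frame` and the stand-in `toy_classifier` are hypothetical; the stand-in exists only so the block runs on its own, and in the notebook `find_gender` would be passed instead. The sketch assumes the three text fields are joined into a single description per speaker.

```python
import string
import pandas as pd

TEXT_COLS = ["occupation", "introduction", "profile"]

def classify_frame(speakers, classifier):
    """Apply `classifier` to the joined text fields of every speaker row."""
    # Missing fields become empty strings so they can be joined safely
    combined = speakers[TEXT_COLS].fillna("").apply(lambda row: " ".join(row), axis=1)
    results = combined.apply(classifier)
    out = pd.DataFrame(results.tolist(), index=speakers.index,
                       columns=["gender", "male", "female", "nonbinary"])
    return speakers.join(out)

def toy_classifier(text):
    """Simplified stand-in for find_gender (hypothetical, binary pronouns only)."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    m = sum(w in {"he", "him", "his", "himself"} for w in words)
    f = sum(w in {"she", "her", "hers", "herself"} for w in words)
    label = "female" if f > m else "male" if m > f else "undetected"
    return (label, m, f, 0)

# Hypothetical toy speakers
toy = pd.DataFrame({
    "speaker": ["Ann", "Bob"],
    "occupation": ["Poet", "Chef"],
    "introduction": ["She writes.", "He cooks; his food is famous."],
    "profile": [None, "Why you should listen"],
})
labeled = classify_frame(toy, toy_classifier)
print(labeled[["speaker", "gender", "male", "female", "nonbinary"]])
```

For reference, the thresholds of 0.3 and -0.3 in `find_gender` act on the normalized difference `(a - b) / (a + b)` of two counts. Writing `p = a / (a + b)`, the difference equals `2p - 1`, so a score above 0.3 means the first count makes up more than 65% of the two kinds being compared.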