# Gendering the talks 

One of our areas of interest is whether TED talks by men and women differ. If they do, we would like to see whether we can detect the gender of a speaker from the words in a TED talk transcript. To pursue these questions, we begin by "gendering" talks when possible. 

In this notebook, we use the genders of the speakers to "gender" the TED talks themselves. A talk with one speaker inherits the gender of that speaker. For a talk with two speakers, if the genders of the speakers are the same, we proceed as if the talk had one speaker. Talks with two speakers whose genders are not the same are set aside. 
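
The rule above can be written as a small plain-Python function. This is only an illustrative sketch: the name `talk_gender` and the choice to represent an unknown gender as `None` are assumptions made for this example, not part of the notebook's pipeline.

```python
from typing import Optional, Sequence


def talk_gender(speaker_genders: Sequence[Optional[str]]) -> Optional[str]:
    """Return the gender shared by all speakers of a talk.

    Returns None when the talk should be set aside: no speakers,
    an unknown gender (assumed to be None here), or mixed genders.
    """
    if not speaker_genders or any(g is None for g in speaker_genders):
        return None
    first = speaker_genders[0]
    # The talk inherits a gender only if every speaker shares it.
    return first if all(g == first for g in speaker_genders) else None


print(talk_gender(["male"]))              # one speaker
print(talk_gender(["female", "female"]))  # two speakers, same gender
print(talk_gender(["male", "female"]))    # two speakers, different genders
```

The rest of the notebook applies the same logic with pandas merges rather than a row-by-row function.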

In [9]:
import pandas as pd
import csv
import string

In [10]:
# Load the gendered speaker file:
speakers = pd.read_csv("speakers_with_gender.csv")

In [11]:
ted_only = pd.read_csv('../data/Release_v0/TEDonly_final.csv')
ted_plus = pd.read_csv('../data/Release_v0/TEDplus_final.csv')

In [12]:
ted_only.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4
0,0,0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,
1,1,1,7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,,,
2,2,2,53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,,,
3,3,3,66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,,,
4,4,4,92,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,,,


In [13]:
# Set the talk ID as the index. Dropping the leftover "Unnamed" columns is left commented out: 
ted_only = ted_only.set_index('Talk_ID')
# ted_only = ted_only.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1'])

In [14]:
ted_only.head()

Talk_ID,Unnamed: 0,Unnamed: 0.1,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4
1,0,0,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,
7,1,1,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,,,
53,2,2,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,,,
66,3,3,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,,,
92,4,4,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,,,


In [15]:
ted_slice = ted_only[["speaker_1","speaker_2","speaker_3","speaker_4"]]

In [16]:
ted_slice.head()

Talk_ID,speaker_1,speaker_2,speaker_3,speaker_4
1,Al Gore,,,
7,David Pogue,,,
53,Majora Carter,,,
66,Ken Robinson,,,
92,Hans Rosling,,,


First, we want to know which talks have only one speaker. To do this, we select only the rows that have `NaN` for the second speaker. 

In [17]:
talk1speak = ted_slice[ted_slice['speaker_2'].isnull()]

# https://stackoverflow.com/questions/43831539/how-to-select-rows-with-nan-in-particular-column

In [18]:
talk1speak.shape

(966, 4)

In [19]:
talk1 = talk1speak[["speaker_1"]]

In [20]:
talk1.head()


Talk_ID,speaker_1
1,Al Gore
7,David Pogue
53,Majora Carter
66,Ken Robinson
92,Hans Rosling


We have 966 talks that have only one speaker. 


To figure out which talks have exactly two speakers, we select the rows that have a second speaker, but **not** a third one:  

In [21]:
s2temp = ted_slice[ted_slice['speaker_3'].isnull()]
talk2speak = s2temp[~s2temp['speaker_2'].isnull()]

In [22]:
talk2speak.shape

(25, 4)

In [23]:
talk2 = talk2speak[["speaker_1","speaker_2"]]

We have 25 talks that have two speakers. 


To figure out which talks have three speakers, we select the rows that have a third speaker, but **not** a fourth one:  

In [24]:
s3temp = ted_slice[ted_slice['speaker_4'].isnull()]
talk3speak = s3temp[~s3temp['speaker_3'].isnull()]

In [25]:
talk3speak.shape

(0, 4)

There are no talks with exactly three speakers. 


To figure out which talks have four speakers, we select the rows that have a fourth speaker:  

In [26]:
talk4speak = ted_slice[~ted_slice['speaker_4'].isnull()]


In [27]:
talk4speak.shape

(1, 4)

In [28]:
talk4 = talk4speak[["speaker_1","speaker_2","speaker_3","speaker_4"]]

As a quick check, we verify that the total number of rows in the `talkNspeak` data frames equals the number of rows in `ted_only`:

In [29]:
ted_only.shape[0] == talk1speak.shape[0] + talk2speak.shape[0] + talk3speak.shape[0] + talk4speak.shape[0]

True
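
The same split can also be obtained in one pass by counting the non-missing speaker cells in each row. The sketch below uses a small made-up frame (`toy`, with placeholder names) in place of `ted_slice`, and, like the filters above, it assumes that speakers fill the columns from left to right without gaps.

```python
import pandas as pd

# Toy stand-in for `ted_slice`; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "speaker_1": ["A", "B", "C", "D"],
        "speaker_2": [None, "E", None, "F"],
        "speaker_3": [None, None, None, "G"],
        "speaker_4": [None, None, None, "H"],
    },
    index=pd.Index([1, 2, 3, 4], name="Talk_ID"),
)

# Count the non-missing speaker cells in each row.
n_speakers = toy.notna().sum(axis=1)

# Number of talks for each speaker count.
counts = n_speakers.value_counts().sort_index()
print(counts)

# Talks with exactly one speaker, selected with a boolean mask.
one_speaker = toy[n_speakers == 1]
print(one_speaker)
```

Because every row receives exactly one count, the groups cannot overlap or miss a talk, which is the property the equality check above confirms for the separate data frames.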

## Adding gender to the talks

Now that we have separated the talks into smaller data frames, one for each number of speakers, we can "gender" the talks. 

In [30]:
gender_slice = speakers[["speaker","Gender_handcheck"]]
gender_slice = gender_slice.rename(columns={"Gender_handcheck": "gender"})

In [31]:
gender_slice.head()

Unnamed: 0,speaker,gender
0,Al Gore,male
1,David Pogue,male
2,Majora Carter,female
3,Ken Robinson,male
4,Hans Rosling,male


In [32]:
talk1.head()

Talk_ID,speaker_1
1,Al Gore
7,David Pogue
53,Majora Carter
66,Ken Robinson
92,Hans Rosling


In [33]:
talk1test = talk1.reset_index().merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")

In [34]:
talk1test.head()

Unnamed: 0,Talk_ID,speaker_1,speaker,gender
0,1,Al Gore,Al Gore,male
1,7,David Pogue,David Pogue,male
2,53,Majora Carter,Majora Carter,female
3,66,Ken Robinson,Ken Robinson,male
4,92,Hans Rosling,Hans Rosling,male


### Gendering talks with one speaker

In [35]:
# We want to take speaker gender from speakers and put it into the talks
# Remove the extra speaker column; the gender column is renamed to "talk_gender" in a later cell
talk1tmp = (talk1.reset_index()
            .merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")
            .drop(columns = ['speaker']))

# We use `reset_index()` to preserve the `Talk_ID` column

In [36]:
talk1tmp.head()

Unnamed: 0,Talk_ID,speaker_1,gender
0,1,Al Gore,male
1,7,David Pogue,male
2,53,Majora Carter,female
3,66,Ken Robinson,male
4,92,Hans Rosling,male


In [37]:
# Reset the index back to `Talk_ID` 
talk1tmp.set_index("Talk_ID", inplace = True)

In [44]:
talk1tmp.head()

Talk_ID,speaker_1,gender
1,Al Gore,male
7,David Pogue,male
53,Majora Carter,female
66,Ken Robinson,male
92,Hans Rosling,male


In [45]:
talk1tmp.rename(columns={"gender": "talk_gender"}, inplace = True)

In [46]:
talk1tmp.head()

Talk_ID,speaker_1,talk_gender
1,Al Gore,male
7,David Pogue,male
53,Majora Carter,female
66,Ken Robinson,male
92,Hans Rosling,male
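
A left merge keeps every talk, but it can silently add rows if a speaker appears more than once in the gender table, and it leaves `NaN` where a speaker is not found. As a hedged sketch, the `validate` and `indicator` parameters of `DataFrame.merge` can surface both cases. The frames `talks` and `genders` below are made-up stand-ins for `talk1` and `gender_slice`.

```python
import pandas as pd

# Toy stand-ins for `talk1` and `gender_slice`; values are illustrative only.
talks = pd.DataFrame(
    {"speaker_1": ["A", "B", "C"]},
    index=pd.Index([1, 7, 53], name="Talk_ID"),
)
genders = pd.DataFrame({"speaker": ["A", "B"], "gender": ["male", "female"]})

merged = (
    talks.reset_index()
    .merge(
        genders,
        left_on="speaker_1",
        right_on="speaker",
        how="left",
        validate="many_to_one",  # raises MergeError if a speaker is duplicated in `genders`
        indicator=True,          # adds a `_merge` column: "both" or "left_only"
    )
    .drop(columns=["speaker"])
    .set_index("Talk_ID")
)

# Talks whose speaker was not found in the gender table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty and the row count equals that of the input, every talk received exactly one gender label.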


### Gendering talks with multiple speakers

Now that the talks with just one speaker have been "gendered," we turn our attention to the talks with multiple speakers. If all the speakers for a talk have the same gender, then we label the talk with that gender. 

We start with those that have two speakers. We will add the gender for the first speaker and then for the second speaker. Our procedure is similar to the above process for just one speaker. 

In [48]:
print(talk2.shape)

talk2.head()

(25, 2)


Talk_ID,speaker_1,speaker_2
118,Sergey Brin,Larry Page
222,Jill Sobule,Julia Sweeney
224,Roy Gould,Curtis Wong
246,Tod Machover,Dan Ellsey
322,Bruno Bowden,Rufus Cappadocia


We're going to make a copy of our talks with two speakers.

In [73]:
# We want to take speaker gender from speakers and put it into the talks
# Remove the extra speaker columns and rename the merged gender columns to "gender_1" and "gender_2"
talk2tmp = (talk2.reset_index()
            .merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .merge(gender_slice, left_on = "speaker_2", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .rename(columns={"gender_x": "gender_1", "gender_y": "gender_2"})
            .set_index("Talk_ID"))
# We use `reset_index()` to preserve the `Talk_ID` column

In [71]:
talk2tmp.shape

(25, 5)

In [66]:
talk2tmp.head()

Unnamed: 0,Talk_ID,speaker_1,speaker_2,gender_1,gender_2
0,118,Sergey Brin,Larry Page,male,male
1,222,Jill Sobule,Julia Sweeney,female,female
2,224,Roy Gould,Curtis Wong,male,male
3,246,Tod Machover,Dan Ellsey,male,male
4,322,Bruno Bowden,Rufus Cappadocia,male,male


Now we check which rows have the same genders for both speakers:

In [68]:
talk2tmp[talk2tmp["gender_1"] == talk2tmp["gender_2"]].shape

(18, 5)

In [69]:
talk2tmp[talk2tmp["gender_1"] != talk2tmp["gender_2"]].shape

(7, 5)
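
One possible next step, sketched here under assumptions, is to give the same-gender talks a `talk_gender` column and to set the rest aside. The frame `toy2` is a made-up stand-in for `talk2tmp`, and the names `matched` and `set_aside` are hypothetical. Note that missing values never compare as equal in pandas, so a talk with an unknown gender for either speaker lands in the set-aside group in this sketch.

```python
import pandas as pd

# Toy stand-in for `talk2tmp`; names and values are illustrative only.
toy2 = pd.DataFrame(
    {
        "speaker_1": ["A", "C", "E", "G"],
        "speaker_2": ["B", "D", "F", "H"],
        "gender_1": ["male", "female", "male", float("nan")],
        "gender_2": ["male", "female", "female", float("nan")],
    },
    index=pd.Index([118, 222, 224, 246], name="Talk_ID"),
)

# True where both speakers have the same, known gender.
same = toy2["gender_1"] == toy2["gender_2"]

# Same-gender talks inherit the shared gender as `talk_gender`.
matched = toy2[same].assign(talk_gender=lambda d: d["gender_1"])

# Mixed-gender talks, and talks with a missing gender, are set aside.
set_aside = toy2[~same]

print(matched)
print(set_aside)
```

Whether talks with a missing gender belong with the mixed-gender talks or deserve their own group is a design choice that the counts above do not settle.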