# Gendering the talks 

One of our areas of interest is whether TED talks by men and women differ. If they do, we would like to see whether we can detect the gender of a speaker from the words in a TED talk transcript. To pursue these questions, we begin by "gendering" talks when possible.

In this notebook, we use the genders of the speakers to "gender" the TED talks themselves. A talk with one speaker inherits the gender of that speaker. For a talk with two speakers, if the genders of the speakers are the same, then we proceed as if the talk had one speaker. For talks with two speakers where the genders of the speakers are not the same, we set the talks aside.
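
The rule above can be sketched as a small function. This is only an illustration: the name `gender_of_talk` is hypothetical and is not used elsewhere in the notebook, and the label "No one gender" is the one we assign later to talks whose speakers differ.

```python
def gender_of_talk(speaker_genders):
    """Return the shared gender label, or "No one gender" if the speakers differ."""
    distinct = set(speaker_genders)
    if len(distinct) == 1:
        # Every speaker has the same gender, so the talk inherits it.
        return distinct.pop()
    return "No one gender"


print(gender_of_talk(["male"]))              # male
print(gender_of_talk(["female", "female"]))  # female
print(gender_of_talk(["female", "male"]))    # No one gender
```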

In [1]:
import pandas as pd
import csv
import string

In [2]:
# Load the gendered speaker file:
speakers = pd.read_csv("speakers_with_gender.csv")

In [3]:
ted_only = pd.read_csv('../data/Release_v0/TEDonly_final.csv')
ted_plus = pd.read_csv('../data/Release_v0/TEDplus_final.csv')

In [4]:
ted_only.head()

,Unnamed: 0,Unnamed: 0.1,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4
0,0,0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,
1,1,1,7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,,,
2,2,2,53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,,,
3,3,3,66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,,,
4,4,4,92,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,,,


In [5]:
# Set the talk ID as the index. The two leftover `Unnamed` columns stay for now
# (the drop below is commented out); they are removed when we build the final data frame.
ted_only = ted_only.set_index('Talk_ID')
# ted_only = ted_only.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1'])

In [6]:
ted_only.head()

Talk_ID,Unnamed: 0,Unnamed: 0.1,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4
1,0,0,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,
7,1,1,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,,,
53,2,2,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,,,
66,3,3,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,,,
92,4,4,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,,,


### Slicing data for just what we need

In this notebook, we combine information from our gendered-speakers file and our TED talks file. In the rest of this file, we take an SQL-flavored approach: we carry only the views of the data that we need in order to add the gender of our speakers to the talks that they give.

With this view in mind, we organize our thoughts regarding our data into `PRIMARY_KEY` and `FOREIGN_KEY`:
* The `PRIMARY_KEY` for `ted_only` is the `Talk_ID`, while the four `speaker_` columns can each be thought of as a `FOREIGN_KEY`.
* The `PRIMARY_KEY` for `speakers` is the `speaker` column. Here there is no foreign key. There is also an index column that we could use as the primary key, but as we will shortly see, it will not matter which we use. 

As we proceed, we will select only the columns that are relevant to our work of gendering the talks. For `ted_only`, we will keep `Talk_ID` and the four speaker columns. For `speakers`, we will keep the name of the speaker (`speaker`), the gender (`Gender_handcheck`, labeled according to work in a previous notebook) and the index column.
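
Treating `speaker` as a primary key only works if each name appears once in the speakers file. Below is a minimal sketch of how that could be checked. It uses a toy frame rather than the real file, and the duplicated row is invented purely to show what a failure looks like.

```python
import pandas as pd

# Toy stand-in for the speakers file; the repeated "Al Gore" row is invented.
toy_speakers = pd.DataFrame({
    "speaker": ["Al Gore", "Majora Carter", "Al Gore"],
    "Gender_handcheck": ["male", "female", "male"],
})

# A primary key must be unique: one row per speaker name.
key_is_unique = toy_speakers["speaker"].is_unique

# Collect the names that appear more than once so they can be inspected.
is_repeated = toy_speakers["speaker"].duplicated(keep=False)
repeated_names = toy_speakers.loc[is_repeated, "speaker"].unique().tolist()

print(key_is_unique)   # False
print(repeated_names)  # ['Al Gore']
```

If the key were not unique, a left merge on `speaker` would return one row per matching speaker row, so a talk could appear more than once.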

In [7]:
ted_slice = ted_only[["speaker_1","speaker_2","speaker_3","speaker_4"]]

In [8]:
ted_slice.head()

Talk_ID,speaker_1,speaker_2,speaker_3,speaker_4
1,Al Gore,,,
7,David Pogue,,,
53,Majora Carter,,,
66,Ken Robinson,,,
92,Hans Rosling,,,


#### Dividing our talks by number of speakers

We will now divide the `ted_slice` data frame into up to four data frames, one for each number of speakers (a talk has at most four speakers).
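
As a cross-check on the split that follows, the number of speakers per talk can also be counted directly. The sketch below uses a toy frame with names borrowed from this notebook; the variable names `toy_slice` and `n_speakers` are ours. Like the split itself, it assumes the speaker columns are filled from left to right.

```python
import pandas as pd

# Toy version of ted_slice: one talk each with one, two, and four speakers.
toy_slice = pd.DataFrame(
    {
        "speaker_1": ["Al Gore", "Sergey Brin", "Diana Reiss"],
        "speaker_2": [None, "Larry Page", "Neil Gershenfeld"],
        "speaker_3": [None, None, "Peter Gabriel"],
        "speaker_4": [None, None, "Vint Cerf"],
    },
    index=pd.Index([1, 118, 1786], name="Talk_ID"),
)

# Count the non-missing speaker cells in each row.
n_speakers = toy_slice.notna().sum(axis=1)

# How many talks have 1, 2, 3, or 4 speakers?
print(n_speakers.value_counts().sort_index())
```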

##### One Speaker

First, we want to know which talks have only one speaker. To do this, we select only the rows that have `NaN` for the second speaker. We will then keep only the first speaker and talk ID columns.

In [9]:
talk1speak = ted_slice[ted_slice['speaker_2'].isnull()]

# https://stackoverflow.com/questions/43831539/how-to-select-rows-with-nan-in-particular-column

In [10]:
talk1speak.shape

(966, 4)

In [11]:
# Just keep the first speaker (as the others are all NaN)

talk1 = talk1speak[["speaker_1"]]

In [12]:
talk1.head()


Talk_ID,speaker_1
1,Al Gore
7,David Pogue
53,Majora Carter
66,Ken Robinson
92,Hans Rosling


We have 966 talks that have only one speaker. 

##### Two Speakers

To figure out which talks have exactly two speakers, we select the rows that have a second speaker, but **not** a third one. We will then only keep the first two speakers and talk ID columns. 

In [13]:
s2temp = ted_slice[ted_slice['speaker_3'].isnull()]
talk2speak = s2temp[~s2temp['speaker_2'].isnull()]

In [14]:
talk2speak.shape

(25, 4)

In [15]:
talk2 = talk2speak[["speaker_1","speaker_2"]]

We have 25 talks that have two speakers. 

##### Three Speakers

To figure out which talks have three speakers, we select the rows that have a third speaker, but **not** a fourth one. We will then keep the first three speakers and talk ID columns.

In [16]:
s3temp = ted_slice[ted_slice['speaker_4'].isnull()]
talk3speak = s3temp[~s3temp['speaker_3'].isnull()]

In [17]:
talk3speak.shape

(0, 4)

There are no talks with exactly three speakers. 

##### Four Speakers

To figure out which talks have four speakers, we select the rows that have a fourth speaker. For this final frame we keep all four speaker columns as well as the talk ID column. 

In [18]:
talk4speak = ted_slice[~ted_slice['speaker_4'].isnull()]


In [19]:
talk4speak.shape

(1, 4)

In [20]:
talk4 = talk4speak[["speaker_1","speaker_2","speaker_3","speaker_4"]]

As a quick check, we confirm that the total number of rows in the `talkNspeak` data frames equals the number of rows in `ted_only`:

In [21]:
ted_only.shape[0] == talk1speak.shape[0] + talk2speak.shape[0] + talk3speak.shape[0] + talk4speak.shape[0]

True

## Adding gender to the talks

Now that we have separated the talks into smaller data frames, one for each number of speakers, we will work to "gender" the talks.

As with `ted_only`, we are only going to carry the columns that we need for this work, that is, just the `speaker` and `gender` columns. We also have an index column, which is fine.

In [22]:
gender_slice = speakers[["speaker","Gender_handcheck"]]
gender_slice = gender_slice.rename(columns={"Gender_handcheck": "gender"})

# Check that we have the columns that we want, by looking at just the beginning of the data frame
gender_slice.head()

,speaker,gender
0,Al Gore,male
1,David Pogue,male
2,Majora Carter,female
3,Ken Robinson,male
4,Hans Rosling,male


In [23]:
talk1test = talk1.reset_index().merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")

In [24]:
talk1test.head()

,Talk_ID,speaker_1,speaker,gender
0,1,Al Gore,Al Gore,male
1,7,David Pogue,David Pogue,male
2,53,Majora Carter,Majora Carter,female
3,66,Ken Robinson,Ken Robinson,male
4,92,Hans Rosling,Hans Rosling,male
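
Because this is a left merge, a speaker who is missing from the gender file keeps their row but receives `NaN` for `gender`. A hedged sketch of how such rows could be surfaced is shown below. It uses toy frames, and the "Unlisted Speaker" row and the talk ID 9999 are invented. `indicator` and `validate` are standard parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy frames; "Unlisted Speaker" (talk 9999) is invented to force a miss.
toy_talks = pd.DataFrame({
    "Talk_ID": [1, 53, 9999],
    "speaker_1": ["Al Gore", "Majora Carter", "Unlisted Speaker"],
})
toy_genders = pd.DataFrame({
    "speaker": ["Al Gore", "Majora Carter"],
    "gender": ["male", "female"],
})

checked = toy_talks.merge(
    toy_genders,
    left_on="speaker_1",
    right_on="speaker",
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises MergeError if a name repeats in toy_genders
)

# Speakers that found no match in the gender table.
unmatched = checked.loc[checked["_merge"] == "left_only", "speaker_1"].tolist()
print(unmatched)  # ['Unlisted Speaker']
```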


### Gendering talks with one speaker

To gender our talks, we will use five methods from `pandas`: 
1. `reset_index()`
2. `merge()`
3. `drop()`
4. `set_index()`
5. `rename()`

For the talks with just one speaker, we break the work into smaller steps, so that it is clear what each method is doing. In the later pieces, we will chain all of these methods together.

First, let us remind ourselves what we are starting with:

In [25]:
talk1.head()

Talk_ID,speaker_1
1,Al Gore
7,David Pogue
53,Majora Carter
66,Ken Robinson
92,Hans Rosling


The method `reset_index()` moves the current index (`Talk_ID`) into an ordinary column and replaces it with a default index that is just the row's number. We need to do this so that we don't lose the `Talk_ID` column when we use `merge()` in the next step:

In [26]:
talk1_step1 = talk1.reset_index()
talk1_step1.head()

,Talk_ID,speaker_1
0,1,Al Gore
1,7,David Pogue
2,53,Majora Carter
3,66,Ken Robinson
4,92,Hans Rosling
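
To see why the reset matters, the sketch below compares a merge with and without it. The frames are toy data with names borrowed from the tables above. When `merge()` joins columns on columns, pandas ignores the existing indexes and gives the result a fresh integer index, so a `Talk_ID` that lives only in the index is dropped.

```python
import pandas as pd

toy_talks = pd.DataFrame(
    {"speaker_1": ["Al Gore", "Majora Carter"]},
    index=pd.Index([1, 53], name="Talk_ID"),
)
toy_genders = pd.DataFrame(
    {"speaker": ["Al Gore", "Majora Carter"], "gender": ["male", "female"]}
)

# Without reset_index(): Talk_ID lives only in the index and is discarded.
lost = toy_talks.merge(
    toy_genders, left_on="speaker_1", right_on="speaker", how="left"
)

# With reset_index(): Talk_ID is an ordinary column and survives the merge.
kept = toy_talks.reset_index().merge(
    toy_genders, left_on="speaker_1", right_on="speaker", how="left"
)

print("Talk_ID" in lost.columns)  # False
print("Talk_ID" in kept.columns)  # True
```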


In [27]:
# Take each speaker's gender from `gender_slice` and attach it to the talks,
# then remove the duplicate `speaker` column. (The rename to "talk_gender" happens in a later cell.)
talk1tmp = (talk1.reset_index()
            .merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")
            .drop(columns = ['speaker']))

# We use `reset_index()` to preserve the `Talk_ID` column

In [28]:
talk1tmp.head()

,Talk_ID,speaker_1,gender
0,1,Al Gore,male
1,7,David Pogue,male
2,53,Majora Carter,female
3,66,Ken Robinson,male
4,92,Hans Rosling,male


In [29]:
# Reset the index back to `Talk_ID` 
talk1tmp.set_index("Talk_ID", inplace = True)

In [30]:
talk1tmp.head()

Talk_ID,speaker_1,gender
1,Al Gore,male
7,David Pogue,male
53,Majora Carter,female
66,Ken Robinson,male
92,Hans Rosling,male


In [31]:
talk1tmp.rename(columns={"gender": "talk_gender"}, inplace = True)

In [32]:
talk1tmp.head()

Talk_ID,speaker_1,talk_gender
1,Al Gore,male
7,David Pogue,male
53,Majora Carter,female
66,Ken Robinson,male
92,Hans Rosling,male


### Gendering talks with two speakers

Now that the talks with just one speaker have been "gendered," we turn our attention to the talks with multiple speakers. If the multiple speakers for a talk have the same gender, then we label the talk with that gender.

We start with those that have two speakers. We will add the gender for the first speaker and then for the second speaker. Our procedure is similar to the above process for just one speaker. 

In [33]:
print(talk2.shape)

talk2.head()

(25, 2)


Talk_ID,speaker_1,speaker_2
118,Sergey Brin,Larry Page
222,Jill Sobule,Julia Sweeney
224,Roy Gould,Curtis Wong
246,Tod Machover,Dan Ellsey
322,Bruno Bowden,Rufus Cappadocia


We're going to build a working copy of our talks with two speakers that also carries the gender of each speaker.

In [34]:
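# Attach the gender of each speaker in turn, removing the extra `speaker` column after each merge.
# The second merge collides with the existing `gender` column, so pandas suffixes the two
# columns as "gender_x" and "gender_y"; we rename them to "gender_1" and "gender_2".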
talk2tmp = (talk2.reset_index()
            .merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .merge(gender_slice, left_on = "speaker_2", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .rename(columns={"gender_x": "gender_1", "gender_y": "gender_2"})
            .set_index("Talk_ID"))
# We use `reset_index()` to preserve the `Talk_ID` column

# Add empty column for the talk's gender: 
talk2tmp["talk_gender"] = ""

In [35]:
talk2tmp.shape

(25, 5)

In [36]:
talk2tmp.head()

Talk_ID,speaker_1,speaker_2,gender_1,gender_2,talk_gender
118,Sergey Brin,Larry Page,male,male,
222,Jill Sobule,Julia Sweeney,female,female,
224,Roy Gould,Curtis Wong,male,male,
246,Tod Machover,Dan Ellsey,male,male,
322,Bruno Bowden,Rufus Cappadocia,male,male,


Now we check which rows have the same genders for both speakers. We will assign those talks to have the gender of the speakers (which in this case is the same). 

In [37]:
talk2tmp.loc[talk2tmp["gender_1"] == talk2tmp["gender_2"], "talk_gender"] = talk2tmp["gender_1"]

Next, for the rows where the speakers' genders do not match, we give the talk the label "No one gender". We then check our resulting data slice:

In [38]:
talk2tmp.loc[talk2tmp["gender_1"] != talk2tmp["gender_2"], "talk_gender"] = "No one gender"

In [39]:
talk2tmp

Talk_ID,speaker_1,speaker_2,gender_1,gender_2,talk_gender
118,Sergey Brin,Larry Page,male,male,male
222,Jill Sobule,Julia Sweeney,female,female,female
224,Roy Gould,Curtis Wong,male,male,male
246,Tod Machover,Dan Ellsey,male,male,male
322,Bruno Bowden,Rufus Cappadocia,male,male,male
385,Zach Kaplan,Keith Schacht,male,male,male
481,Pattie Maes,Pranav Mistry,female,male,No one gender
881,Stewart Brand,Mark Z. Jacobson,male,male,male
988,David Byrne,Thomas Dolby,male,male,male
1156,Robert Gupta,Joshua Roman,male,male,male
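
One subtlety of the two assignments above is how they treat a missing gender. In pandas, `NaN == NaN` is `False` and `NaN != NaN` is `True`, so a talk whose speakers were not found in the gender file would also be labeled "No one gender". The sketch below replays the same two statements on toy data. The talk with ID 9999 and the `needs_review` column are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "gender_1": ["male", "female", np.nan],
        "gender_2": ["male", "male", np.nan],
    },
    index=pd.Index([118, 481, 9999], name="Talk_ID"),
)

# The same two assignments as in the cells above.
toy["talk_gender"] = ""
toy.loc[toy["gender_1"] == toy["gender_2"], "talk_gender"] = toy["gender_1"]
toy.loc[toy["gender_1"] != toy["gender_2"], "talk_gender"] = "No one gender"

# Hypothetical extra column: flag talks where a speaker's gender is missing.
toy["needs_review"] = toy[["gender_1", "gender_2"]].isna().any(axis=1)

print(toy)
```

Whether such talks should share a label with mixed-gender talks is a judgment call; the flag simply makes them easy to find.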


### Gendering talks with four speakers

Now that the talks with one and two speakers have been "gendered," we turn our attention to the remaining talks. As with talks with two speakers, if the multiple speakers for a talk have the same gender, then we label the talk with that gender.

Since there was no talk with three speakers, we move right on to the only talk with four speakers. For consistency, we will use the same procedure as for talks with two speakers. 

In [40]:
# Attach the gender of each of the four speakers, two at a time.
# Each pair of merges produces "gender_x" and "gender_y", which we rename
# to "gender_1"/"gender_2" and then to "gender_3"/"gender_4".
talk4tmp = (talk4.reset_index()
            .merge(gender_slice, left_on = "speaker_1", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .merge(gender_slice, left_on = "speaker_2", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .rename(columns={"gender_x": "gender_1", "gender_y": "gender_2"})
            .merge(gender_slice, left_on = "speaker_3", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .merge(gender_slice, left_on = "speaker_4", right_on = "speaker", how = "left")
            .drop(columns = ['speaker'])
            .rename(columns={"gender_x": "gender_3", "gender_y": "gender_4"})
            .set_index("Talk_ID"))
# We use `reset_index()` to preserve the `Talk_ID` column



In [41]:
talk4tmp

Talk_ID,speaker_1,speaker_2,speaker_3,speaker_4,gender_1,gender_2,gender_3,gender_4
1786,Diana Reiss,Neil Gershenfeld,Peter Gabriel,Vint Cerf,female,male,male,male


What is interesting about this talk is that three of the speakers are male, while one is female. In keeping with our work on two speakers, we will label this as "No one gender". But for future experiments, it may be useful to remember that the majority of these speakers are male. 

In [42]:
# Add a column for the talk's gender, labeled "No one gender" because the speakers differ:
talk4tmp["talk_gender"] = "No one gender"

In [43]:
talk4tmp

Talk_ID,speaker_1,speaker_2,speaker_3,speaker_4,gender_1,gender_2,gender_3,gender_4,talk_gender
1786,Diana Reiss,Neil Gershenfeld,Peter Gabriel,Vint Cerf,female,male,male,male,No one gender
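
For future experiments that want to keep track of the majority gender, the gender columns can be summarized row by row. This is a sketch only: the toy frame repeats the four-speaker talk above, and the names `n_distinct` and `male_share` are ours.

```python
import pandas as pd

# Toy copy of the gender columns for the four-speaker talk above.
talk4_genders = pd.DataFrame(
    {
        "gender_1": ["female"],
        "gender_2": ["male"],
        "gender_3": ["male"],
        "gender_4": ["male"],
    },
    index=pd.Index([1786], name="Talk_ID"),
)

# Number of distinct genders among the speakers of each talk.
n_distinct = talk4_genders.nunique(axis=1)

# Fraction of each talk's speakers labeled "male".
male_share = talk4_genders.eq("male").mean(axis=1)

print(n_distinct.loc[1786])  # 2
print(male_share.loc[1786])  # 0.75
```

A talk with `n_distinct` equal to 1 is the single-gender case handled earlier; anything larger falls under "No one gender".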


## Concatenating the Genders

Now that we have gendered all the talks in several pieces, we need to recombine these pieces. Again, in keeping with the SQL approach, as we do these combinations, we only need two columns: `Talk_ID` and `talk_gender`. 

We begin by only selecting the necessary columns. We then concatenate these pieces into one column. 

In [44]:
t1add = talk1tmp["talk_gender"]
t2add = talk2tmp["talk_gender"]
t4add = talk4tmp["talk_gender"]

In [45]:
talk_just_gender = pd.concat([t1add,t2add,t4add])
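
The merge in the next step joins on `Talk_ID` and picks up the column name from the concatenated Series, so it is worth confirming that both names survive the concatenation and that no talk ID appears twice. Below is a minimal sketch on toy Series. The values are taken from the tables above, and the variable names are ours.

```python
import pandas as pd

t1 = pd.Series(
    ["male", "female"],
    index=pd.Index([1, 53], name="Talk_ID"),
    name="talk_gender",
)
t2 = pd.Series(
    ["male", "No one gender"],
    index=pd.Index([118, 481], name="Talk_ID"),
    name="talk_gender",
)

combined = pd.concat([t1, t2])

print(combined.index.is_unique)  # True: each talk is labeled once
print(combined.name)             # talk_gender
print(combined.index.name)       # Talk_ID
```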

Now we merge the combined result of all the talks' genders with the main dataframe: 

In [46]:
ted_gender = (ted_only.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1'])
              .reset_index()
              .merge(talk_just_gender, on = "Talk_ID", how = "left")
              .set_index("Talk_ID"))

In [47]:
ted_gender

Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4,talk_gender
1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male
7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,,,,male
53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,,,,female
66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,,,,male
92,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,,,,male
96,https://www.ted.com/talks/tony_robbins_asks_wh...,Why we do what we do,"Tony Robbins discusses the ""invisible forces"" ...",TED2006,0:21:45,6/27/06,"entertainment,goal-setting,potential,psycholog...",22368699,Thank you. I have to tell you I'm both chall...,Tony Robbins,,,,male
49,https://www.ted.com/talks/joshua_prince_ramus_...,Behind the design of Seattle's library,Architect Joshua Prince-Ramus takes the audien...,TED2006,0:19:58,7/10/06,"library,architecture,design,culture,collaboration",1042335,I'm going to present three projects in rapid...,Joshua Prince-Ramus,,,,male
86,https://www.ted.com/talks/julia_sweeney_on_let...,Letting go of God,When two young Mormon missionaries knock on Ju...,TED2006,0:16:32,7/10/06,"atheism,Christianity,religion,God,comedy,humor...",3903747,"On September 10, the morning of my seventh b...",Julia Sweeney,,,,female
71,https://www.ted.com/talks/rick_warren_on_a_lif...,A life of purpose,"Pastor Rick Warren, author of ""The Purpose-Dri...",TED2006,0:21:02,7/18/06,"Christianity,philanthropy,religion,God,happine...",3361934,"I'm often asked, ""What surprised you about t...",Rick Warren,,,,male
94,https://www.ted.com/talks/dan_dennett_s_respon...,Let's teach religion -- all religion -- in sch...,Philosopher Dan Dennett calls for religion -- ...,TED2006,0:24:45,7/18/06,"atheism,consciousness,evolution,philosophy,rel...",2751013,It's wonderful to be back. I love this wonde...,Dan Dennett,,,,male


## Splitting the talks by gender

Now we can split the talks into three groups by gender: male, female, and no one gender. By splitting our talks into these groups, we can investigate various gender dimensions of TED talk transcripts.

In [50]:
m_talks = ted_gender[ted_gender["talk_gender"] == "male"]
f_talks = ted_gender[ted_gender["talk_gender"] == "female"]
ng_talks = ted_gender[ted_gender["talk_gender"] == "No one gender"]
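
Before working with the three groups, it can help to tally the labels, including any talks that ended up with no label at all. The sketch below uses a toy column. The unlabeled talk with ID 9999 is invented, and the counts do not reflect the real data.

```python
import pandas as pd

toy_gendered = pd.DataFrame(
    {"talk_gender": ["male", "female", "male", "No one gender", None]},
    index=pd.Index([1, 53, 66, 481, 9999], name="Talk_ID"),
)

# dropna=False keeps missing labels in the tally instead of silently dropping them.
counts = toy_gendered["talk_gender"].value_counts(dropna=False)
print(counts)

# Talks that belong to none of the three groups.
unlabeled = toy_gendered[toy_gendered["talk_gender"].isna()]
print(len(unlabeled))  # 1
```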