# Merging talks to their speakers

This stage of the data cleaning has two phases:
* First, we split the speakers of each talk into their own columns. A few steps along the way are completed manually.
* Second, we merge the metadata of the talks with the descriptions of the speakers.

#### This file handles TEDplus and is a near copy of the one that worked on TEDonly

### Step 0 - Importing packages

In [1]:
# Set of imports
import pandas as pd
import csv
import string
import numpy as np

In [2]:
# Import talk file 
talks = pd.read_csv('TEDplus.csv', encoding='utf-8')
speakers = pd.read_csv('speakers.csv',encoding = 'utf-8')

### Step 1 - Begin splitting and cleaning of the talks file

In [3]:
# Drop the first two columns (stray unnamed columns)
talks.drop(talks.columns[0:2], axis=1, inplace=True)


In [4]:
# Break speakers into two columns
splitList = r' \+ | , | and '
#https://stackoverflow.com/questions/37543724/python-regex-for-finding-all-words-in-a-string

splitSpeakers = talks['speaker_name'].str.split(splitList, expand=True).rename(columns=lambda x: f"speaker_{x+1}")

# Join the speakers to the talks dataframe and drop the original speakers:
splitTalks = talks.join(splitSpeakers)
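
To see what the split does, here is a minimal sketch on a few made-up `speaker_name` values (the names are illustrative and are not rows from `TEDplus.csv`). Because the pattern is longer than one character, pandas treats it as a regular expression.

```python
import pandas as pd

# Hypothetical speaker strings, for illustration only
names = pd.Series(["Alex Kim", "Alex Kim + Robin Poe", "Sandy Ortiz and Lee Wu"])

# Same separators as in the cell above: " + ", " , " and " and " (spaces included)
split_pattern = r' \+ | , | and '

# expand=True returns one column per piece; talks with one speaker get a missing value
demo_split = names.str.split(split_pattern, expand=True).rename(
    columns=lambda x: f"speaker_{x+1}")
print(demo_split)
```

Since every separator includes its surrounding spaces, the "and" inside "Sandy" is not treated as a separator. For the same reason, ` , ` only matches a comma with a space on both sides, so a list written as `A, B` would not be split by this pattern.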



In [6]:
splitTalks.drop(["speaker_name"], axis=1, inplace=True)

In [7]:
splitTalks.head()

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2
0,37,https://www.ted.com/talks/jimmy_wales_on_the_b...,The birth of Wikipedia,"Jimmy Wales recalls how he assembled ""a ragtag...",TEDGlobal 2005,0:20:01,8/21/06,"wikipedia,open-source,media,invention,culture,...",1187730,"Charles Van Doren, who was later a senior ed...",Jimmy Wales,
1,47,https://www.ted.com/talks/david_deutsch_on_our...,Chemical scum that dream of distant quasars,Legendary scientist David Deutsch puts theoret...,TEDGlobal 2005,0:19:00,9/12/06,"cosmos,physics,global issues,climate change,un...",1182698,We've been told to go out on a limb and say ...,David Deutsch,
2,98,https://www.ted.com/talks/richard_dawkins_on_o...,Why the universe seems so strange,"Biologist Richard Dawkins makes a case for ""th...",TEDGlobal 2005,0:21:56,9/12/06,"cosmos,evolution,physics,astronomy,psychology,...",3036253,"My title: ""Queerer than we can suppose: the ...",Richard Dawkins,
3,93,https://www.ted.com/talks/barry_schwartz_on_th...,The paradox of choice,Psychologist Barry Schwartz takes aim at a cen...,TEDGlobal 2005,0:19:37,9/26/06,"choice,happiness,potential,psychology,economic...",11110916,I'm going to talk to you about some stuff th...,Barry Schwartz,
4,39,https://www.ted.com/talks/aubrey_de_grey_says_...,A roadmap to end aging,Cambridge researcher Aubrey de Grey argues tha...,TEDGlobal 2005,0:22:45,10/2/06,"biotech,engineering,aging,health care,disease,...",3467757,18 minutes is an absolutely brutal time limi...,Aubrey de Grey,


#### Check for non-ASCII characters in both the talks and the speakers

In [8]:
# Flag names that contain any character outside code points 32-122
# (accented letters, control characters, and so on)
test_inds = splitTalks["speaker_1"].apply(lambda x: len([True for i in str(x) if (ord(i) < 32 or ord(i) > 122)]) > 0)
# https://stackoverflow.com/questions/36340627/removing-non-ascii-characters-and-replacing-with-spaces-from-pandas-data-frame
# https://blog.teamtreehouse.com/python-single-line-loops

In [9]:
test_inds2 = splitTalks["speaker_2"].apply(lambda x: len([True for i in str(x) if (ord(i) < 32 or ord(i) > 122)]) > 0)
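
The same test can be written as a small named function, which makes it easier to see which characters get flagged. This is only a sketch with made-up names; `has_unusual_chars` is a hypothetical helper and is not part of the original notebook.

```python
import pandas as pd

def has_unusual_chars(value):
    """Return True if any character falls outside code points 32-122."""
    return any(ord(ch) < 32 or ord(ch) > 122 for ch in str(value))

# Hypothetical names: plain ASCII, an accented letter, a control character, a missing value
demo_names = pd.Series(["Alex Kim", "Zoë Li", "Tab\tName", None])
flags = demo_names.apply(has_unusual_chars)
print(demo_names[flags])
```

Note that the range 32 to 122 excludes not only accented letters but also the ASCII characters `{`, `|`, `}` and `~`, so the check is slightly broader than a strict non-ASCII test.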

In [10]:
splitTalks[test_inds2]

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2
486,2131,https://www.ted.com/talks/vincent_moon_and_nan...,Hidden music rituals around the world,Vincent Moon travels the world with a backpack...,TEDGlobal 2014,0:24:13,11/14/14,"jazz,travel,film,live music,creativity,music",1050973,"Vincent Moon: How can we use computers, came...",Vincent Moon,Naná¡ Vasconcelos


In [11]:
sinds = speakers["name"].apply(lambda x: len([True for i in str(x) if (ord(i) < 32 or ord(i) > 122)]) > 0)

In [12]:
# Check for matches across the rows with special characters in talks and 
# the rows with special characters in speakers

st1 = splitTalks[test_inds][["speaker_1"]]
st2 = splitTalks[test_inds2][["speaker_2"]]

st1.rename(columns={'speaker_1':'speaker'}, inplace=True)
st2.rename(columns={'speaker_2':'speaker'}, inplace=True)

st = pd.concat([st1,st2], axis = 0)

speaks = speakers[sinds][["name"]]

name_test = pd.merge(st, speaks, how = "outer", left_on = "speaker", right_on = "name", indicator = True)
# https://stackoverflow.com/questions/20375561/joining-pandas-dataframes-by-column-names
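
As a rough illustration of what `indicator = True` adds, the sketch below runs the same kind of outer merge on two tiny made-up tables. The names and the garbled spelling are invented for the example.

```python
import pandas as pd

# Hypothetical names: one matches exactly, one was garbled on the talks side
talk_names = pd.DataFrame({"speaker": ["Ana Núñez", "ZoÃ« Li"]})
speaker_names = pd.DataFrame({"name": ["Ana Núñez", "Zoë Li"]})

demo_merge = pd.merge(talk_names, speaker_names, how="outer",
                      left_on="speaker", right_on="name", indicator=True)

# "both" means the name matched; "left_only" and "right_only" are the
# unmatched spellings from the talks side and the speakers side
merge_counts = demo_merge["_merge"].value_counts()
print(merge_counts)
```

Each garbled name shows up twice in the result: once as a `left_only` row (the spelling in the talks) and once as a `right_only` row (the spelling in the speakers), which is what makes the manual pairing possible.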

In [13]:
# This is the number of special character rows that are properly merged:
len(name_test[name_test["_merge"] == "both"])

10

In [14]:
len(name_test[name_test["_merge"] == "left_only"])

7

In [34]:
# Save the comparison so that the unmatched names can be edited manually
name_test.to_csv('./speakers_work/plus_manual_name_edits_old.csv', sep = ',')

In [35]:
# Save the talks file where it is. 
splitTalks.to_csv('./speakers_work/data/TEDplus_splitSpeakers.csv',sep = ',')

At this point, we do a manual edit, fixing the names that were parsed strangely due to the HTML interpreter. The file `plus_manual_name_edits.csv` shows which names have accents and/or special characters and whether they are matched correctly between the talks and speakers files. During this cleaning, the talks file was renamed `TEDplus_splitSpeakers_clean.csv`.

In [3]:
# reload the splitTalks:
clean_talks = pd.read_csv('./speakers_work/TEDplus_splitSpeakers_clean.csv',encoding = 'utf-8')
clean_talks.drop(clean_talks.columns[0], axis=1, inplace=True)

#Edits were made to the speaker file, so we need to re-import that 
speakers = pd.read_csv('speakers.csv',encoding = 'utf-8')

In [5]:
clean_talks

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2
0,37,https://www.ted.com/talks/jimmy_wales_on_the_b...,The birth of Wikipedia,"Jimmy Wales recalls how he assembled ""a ragtag...",TEDGlobal 2005,0:20:01,8/21/06,"wikipedia,open-source,media,invention,culture,...",1187730,"Charles Van Doren, who was later a senior ed...",Jimmy Wales,
1,47,https://www.ted.com/talks/david_deutsch_on_our...,Chemical scum that dream of distant quasars,Legendary scientist David Deutsch puts theoret...,TEDGlobal 2005,0:19:00,9/12/06,"cosmos,physics,global issues,climate change,un...",1182698,We've been told to go out on a limb and say ...,David Deutsch,
2,98,https://www.ted.com/talks/richard_dawkins_on_o...,Why the universe seems so strange,"Biologist Richard Dawkins makes a case for ""th...",TEDGlobal 2005,0:21:56,9/12/06,"cosmos,evolution,physics,astronomy,psychology,...",3036253,"My title: ""Queerer than we can suppose: the ...",Richard Dawkins,
3,93,https://www.ted.com/talks/barry_schwartz_on_th...,The paradox of choice,Psychologist Barry Schwartz takes aim at a cen...,TEDGlobal 2005,0:19:37,9/26/06,"choice,happiness,potential,psychology,economic...",11110916,I'm going to talk to you about some stuff th...,Barry Schwartz,
4,39,https://www.ted.com/talks/aubrey_de_grey_says_...,A roadmap to end aging,Cambridge researcher Aubrey de Grey argues tha...,TEDGlobal 2005,0:22:45,10/2/06,"biotech,engineering,aging,health care,disease,...",3467757,18 minutes is an absolutely brutal time limi...,Aubrey de Grey,
5,79,https://www.ted.com/talks/iqbal_quadir_says_mo...,How mobile phones can fight poverty,Iqbal Quadir tells how his experiences as a ki...,TEDGlobal 2005,0:15:52,10/10/06,"microfinance,alternative energy,transportation...",529470,I'll just take you to Bangladesh for a minut...,Iqbal Quadir,
6,91,https://www.ted.com/talks/jacqueline_novogratz...,Invest in Africa's own solutions,Jacqueline Novogratz applauds the world's heig...,TEDGlobal 2005,0:12:53,10/10/06,"microfinance,philanthropy,investment,poverty,g...",771282,"I want to start with a story, a la Seth Godi...",Jacqueline Novogratz,
7,3,https://www.ted.com/talks/ashraf_ghani_on_rebu...,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,TEDGlobal 2005,0:18:45,10/18/06,"policy,investment,global issues,poverty,global...",849545,"A public, Dewey long ago observed, is consti...",Ashraf Ghani,
8,75,https://www.ted.com/talks/sasa_vucinic_invests...,Why we should invest in a free press,"A free press -- papers, magazines, radio, TV, ...",TEDGlobal 2005,0:18:00,10/18/06,"philanthropy,investment,global issues,media,cu...",599655,Video: Narrator: An event seen from one poin...,Sasa Vucinic,
9,67,https://www.ted.com/talks/peter_donnelly_shows...,How juries are fooled by statistics,Oxford mathematician Peter Donnelly reveals th...,TEDGlobal 2005,0:21:20,11/8/06,"statistics,science,genetics,culture,technology",1092979,"As other speakers have said, it's a rather d...",Peter Donnelly,


In [6]:
# Merge clean_talks with the speakers file, then clean up the column names.
result1 = pd.merge(clean_talks, speakers, 
                   how = 'left', left_on = 'speaker_1', right_on = 'name')
result1.drop(['name'], axis=1, inplace=True)
result1.rename(columns={'occupation':'speaker1_occupation', 
                        'introduction':'speaker1_introduction', 
                        'profile':'speaker1_profile'}, inplace=True)

# https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe/35321983
# pop off the speaker_2 column and put at the end of the dataframe
cols = list(result1.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('speaker_2'))
result1 = result1[cols+['speaker_2']]
result1.head()

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker1_occupation,speaker1_introduction,speaker1_profile,speaker_2
0,37,https://www.ted.com/talks/jimmy_wales_on_the_b...,The birth of Wikipedia,"Jimmy Wales recalls how he assembled ""a ragtag...",TEDGlobal 2005,0:20:01,8/21/06,"wikipedia,open-source,media,invention,culture,...",1187730,"Charles Van Doren, who was later a senior ed...",Jimmy Wales,Founder of Wikipedia,"With a vision for a free online encyclopedia, ...",Why you should listen\nJimmy Wales went from b...,
1,47,https://www.ted.com/talks/david_deutsch_on_our...,Chemical scum that dream of distant quasars,Legendary scientist David Deutsch puts theoret...,TEDGlobal 2005,0:19:00,9/12/06,"cosmos,physics,global issues,climate change,un...",1182698,We've been told to go out on a limb and say ...,David Deutsch,Quantum physicist,"David Deutsch's 1997 book ""The Fabric of Reali...",Why you should listen\nDavid Deutsch will forc...,
2,98,https://www.ted.com/talks/richard_dawkins_on_o...,Why the universe seems so strange,"Biologist Richard Dawkins makes a case for ""th...",TEDGlobal 2005,0:21:56,9/12/06,"cosmos,evolution,physics,astronomy,psychology,...",3036253,"My title: ""Queerer than we can suppose: the ...",Richard Dawkins,Evolutionary biologist,Oxford professor Richard Dawkins has helped st...,Why you should listen\nAs an evolutionary biol...,
3,93,https://www.ted.com/talks/barry_schwartz_on_th...,The paradox of choice,Psychologist Barry Schwartz takes aim at a cen...,TEDGlobal 2005,0:19:37,9/26/06,"choice,happiness,potential,psychology,economic...",11110916,I'm going to talk to you about some stuff th...,Barry Schwartz,Psychologist,Barry Schwartz studies the link between econom...,Why you should listen\nIn his 2004 book The Pa...,
4,39,https://www.ted.com/talks/aubrey_de_grey_says_...,A roadmap to end aging,Cambridge researcher Aubrey de Grey argues tha...,TEDGlobal 2005,0:22:45,10/2/06,"biotech,engineering,aging,health care,disease,...",3467757,18 minutes is an absolutely brutal time limi...,Aubrey de Grey,Crusader against aging,"Aubrey de Grey, British researcher on aging, c...","Why you should listen\nA true maverick, Aubre...",
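
The merge, rename and column reordering above can be checked on a toy example. The sketch below uses made-up talks and speakers with a single metadata column, and it moves `speaker_2` to the end with a list comprehension rather than `pop`; both are choices made for this illustration only.

```python
import pandas as pd

# Hypothetical talks and speakers
demo_talks = pd.DataFrame({"headline": ["Talk A", "Talk B"],
                           "speaker_1": ["Alex Kim", "Robin Poe"],
                           "speaker_2": [None, "Lee Wu"]})
demo_speakers = pd.DataFrame({"name": ["Alex Kim", "Robin Poe"],
                              "occupation": ["Chemist", "Pianist"]})

# Left merge keeps every talk, even if no speaker record matches
demo_result = pd.merge(demo_talks, demo_speakers, how="left",
                       left_on="speaker_1", right_on="name")
demo_result = demo_result.drop(columns=["name"]).rename(
    columns={"occupation": "speaker1_occupation"})

# Move speaker_2 to the end so speaker 2 metadata can be appended next to it
ordered = [c for c in demo_result.columns if c != "speaker_2"] + ["speaker_2"]
demo_result = demo_result[ordered]
print(demo_result.columns.tolist())
```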


In [9]:
result2 = pd.merge(result1, speakers, 
                   how = 'left', left_on = 'speaker_2', right_on = 'name')
result2.rename(columns={'occupation':'speaker2_occupation', 
                        'introduction':'speaker2_introduction', 
                        'profile':'speaker2_profile'}, inplace=True)
result2.drop(['name'], axis=1, inplace=True)
result2

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker1_occupation,speaker1_introduction,speaker1_profile,speaker_2,speaker2_occupation,speaker2_introduction,speaker2_profile
0,37,https://www.ted.com/talks/jimmy_wales_on_the_b...,The birth of Wikipedia,"Jimmy Wales recalls how he assembled ""a ragtag...",TEDGlobal 2005,0:20:01,8/21/06,"wikipedia,open-source,media,invention,culture,...",1187730,"Charles Van Doren, who was later a senior ed...",Jimmy Wales,Founder of Wikipedia,"With a vision for a free online encyclopedia, ...",Why you should listen\nJimmy Wales went from b...,,,,
1,47,https://www.ted.com/talks/david_deutsch_on_our...,Chemical scum that dream of distant quasars,Legendary scientist David Deutsch puts theoret...,TEDGlobal 2005,0:19:00,9/12/06,"cosmos,physics,global issues,climate change,un...",1182698,We've been told to go out on a limb and say ...,David Deutsch,Quantum physicist,"David Deutsch's 1997 book ""The Fabric of Reali...",Why you should listen\nDavid Deutsch will forc...,,,,
2,98,https://www.ted.com/talks/richard_dawkins_on_o...,Why the universe seems so strange,"Biologist Richard Dawkins makes a case for ""th...",TEDGlobal 2005,0:21:56,9/12/06,"cosmos,evolution,physics,astronomy,psychology,...",3036253,"My title: ""Queerer than we can suppose: the ...",Richard Dawkins,Evolutionary biologist,Oxford professor Richard Dawkins has helped st...,Why you should listen\nAs an evolutionary biol...,,,,
3,93,https://www.ted.com/talks/barry_schwartz_on_th...,The paradox of choice,Psychologist Barry Schwartz takes aim at a cen...,TEDGlobal 2005,0:19:37,9/26/06,"choice,happiness,potential,psychology,economic...",11110916,I'm going to talk to you about some stuff th...,Barry Schwartz,Psychologist,Barry Schwartz studies the link between econom...,Why you should listen\nIn his 2004 book The Pa...,,,,
4,39,https://www.ted.com/talks/aubrey_de_grey_says_...,A roadmap to end aging,Cambridge researcher Aubrey de Grey argues tha...,TEDGlobal 2005,0:22:45,10/2/06,"biotech,engineering,aging,health care,disease,...",3467757,18 minutes is an absolutely brutal time limi...,Aubrey de Grey,Crusader against aging,"Aubrey de Grey, British researcher on aging, c...","Why you should listen\nA true maverick, Aubre...",,,,
5,79,https://www.ted.com/talks/iqbal_quadir_says_mo...,How mobile phones can fight poverty,Iqbal Quadir tells how his experiences as a ki...,TEDGlobal 2005,0:15:52,10/10/06,"microfinance,alternative energy,transportation...",529470,I'll just take you to Bangladesh for a minut...,Iqbal Quadir,"Founder, GrameenPhone",Iqbal Quadir is an advocate of business as a h...,Why you should listen\nAs a kid in rural Bangl...,,,,
6,91,https://www.ted.com/talks/jacqueline_novogratz...,Invest in Africa's own solutions,Jacqueline Novogratz applauds the world's heig...,TEDGlobal 2005,0:12:53,10/10/06,"microfinance,philanthropy,investment,poverty,g...",771282,"I want to start with a story, a la Seth Godi...",Jacqueline Novogratz,Investor and advocate for moral leadership,Jacqueline Novogratz works to enable human flo...,Why you should listen\nJacqueline Novogratz wr...,,,,
7,3,https://www.ted.com/talks/ashraf_ghani_on_rebu...,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,TEDGlobal 2005,0:18:45,10/18/06,"policy,investment,global issues,poverty,global...",849545,"A public, Dewey long ago observed, is consti...",Ashraf Ghani,President-elect of Afghanistan,"Ashraf Ghani, Afghanistan’s new president-elec...",Why you should listen\nAshraf Ghani became Afg...,,,,
8,75,https://www.ted.com/talks/sasa_vucinic_invests...,Why we should invest in a free press,"A free press -- papers, magazines, radio, TV, ...",TEDGlobal 2005,0:18:00,10/18/06,"philanthropy,investment,global issues,media,cu...",599655,Video: Narrator: An event seen from one poin...,Sasa Vucinic,Nonprofit venture capitalist,Sasa Vucinic's Media Development Loan Fund app...,"Why you should listen\n""We want to create a ne...",,,,
9,67,https://www.ted.com/talks/peter_donnelly_shows...,How juries are fooled by statistics,Oxford mathematician Peter Donnelly reveals th...,TEDGlobal 2005,0:21:20,11/8/06,"statistics,science,genetics,culture,technology",1092979,"As other speakers have said, it's a rather d...",Peter Donnelly,Mathematician; statistician,Peter Donnelly is an expert in probability the...,Why you should listen\nPeter Donnelly applies ...,,,,


In [10]:
print(clean_talks.shape)
print(result2.shape)

(755, 12)
(758, 18)


In [11]:
result2.to_csv('./speakers_work/TEDplus_speakers_doubles.csv', sep = ',')

In [12]:
dup_first = result2["headline"].duplicated(keep='first')
dup_second= result2["headline"].duplicated(keep='last')
double_talks = pd.concat([result2[dup_first], result2[dup_second]], axis = 0)

In [13]:
double_talks

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker1_occupation,speaker1_introduction,speaker1_profile,speaker_2,speaker2_occupation,speaker2_introduction,speaker2_profile
148,955,https://www.ted.com/talks/chris_anderson_how_w...,How web video powers global innovation,TED's Chris Anderson says the rise of web vide...,TEDGlobal 2010,0:18:53,9/14/10,"web,online video,global issues,innovation,scie...",1398397,"If nothing else, at least I've discovered wh...",Chris Anderson,TED Curator,After a long career in journalism and publishi...,Why you should listen\nChris Anderson is the C...,,,,
485,2134,https://www.ted.com/talks/michael_green_what_t...,What the Social Progress Index can reveal abou...,The term Gross Domestic Product is often talke...,TEDGlobal 2014,0:14:56,11/11/14,"policy,global issues,statistics,economics",1171473,"On January 4, 1934, a young man delivered a ...",Michael Green,Architect,Michael Green wants to solve architecture’s bi...,Why you should listen\nMichael Green is callin...,,,,
566,2348,https://www.ted.com/talks/michael_green_how_we...,How we can make the world a better place by 2030,"Can we end hunger and poverty, halt climate ch...",TEDGlobal>London,0:14:39,10/12/15,"policy,big problems,global issues,goal-setting...",1318712,Do you think the world is going to be a bett...,Michael Green,Architect,Michael Green wants to solve architecture’s bi...,Why you should listen\nMichael Green is callin...,,,,
147,955,https://www.ted.com/talks/chris_anderson_how_w...,How web video powers global innovation,TED's Chris Anderson says the rise of web vide...,TEDGlobal 2010,0:18:53,9/14/10,"web,online video,global issues,innovation,scie...",1398397,"If nothing else, at least I've discovered wh...",Chris Anderson,Drone maker,Chris Anderson is an authority on emerging tec...,Why you should listen\nBefore Chris Anderson t...,,,,
484,2134,https://www.ted.com/talks/michael_green_what_t...,What the Social Progress Index can reveal abou...,The term Gross Domestic Product is often talke...,TEDGlobal 2014,0:14:56,11/11/14,"policy,global issues,statistics,economics",1171473,"On January 4, 1934, a young man delivered a ...",Michael Green,"Economist, social progress expert",Michael Green is part of the team that has cre...,Why you should listen\nIn his book Philanthroc...,,,,
565,2348,https://www.ted.com/talks/michael_green_how_we...,How we can make the world a better place by 2030,"Can we end hunger and poverty, halt climate ch...",TEDGlobal>London,0:14:39,10/12/15,"policy,big problems,global issues,goal-setting...",1318712,Do you think the world is going to be a bett...,Michael Green,"Economist, social progress expert",Michael Green is part of the team that has cre...,Why you should listen\nIn his book Philanthroc...,,,,


In [15]:
double_talks.to_csv('./speakers_work/doubles_plus.csv', sep = ',')

Again we need to do a manual step here. There are two speakers named Chris Anderson and two named Michael Green in the speakers file. Because the merge matches on name alone, Python cannot tell which one is the correct speaker to attach to a talk, so it duplicates the rows of the talks given by people with these names (this is why `result2` has 758 rows while `clean_talks` has 755). After removing the incorrect rows by hand, we call the new file `TEDplus_speakers.csv`; it is the near-final data file for the TEDplus talks.
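
The row duplication can be reproduced on a toy example. The sketch below uses invented names and occupations, and it also shows two ways such cases might be caught: `duplicated(keep=False)`, which flags every copy of a repeated headline in one pass, and the `validate` argument of `pd.merge`, which raises an error when the speaker names are not unique. Neither is used in the notebook itself.

```python
import pandas as pd

demo_talks = pd.DataFrame({"headline": ["Talk A", "Talk B"],
                           "speaker_1": ["Sam Doe", "Alex Kim"]})
# Hypothetical speakers file in which two different people share a name
demo_speakers = pd.DataFrame({"name": ["Sam Doe", "Sam Doe", "Alex Kim"],
                              "occupation": ["Architect", "Economist", "Chemist"]})

merged = pd.merge(demo_talks, demo_speakers, how="left",
                  left_on="speaker_1", right_on="name")
print(len(demo_talks), len(merged))  # the merged frame has one extra row

# Every row whose headline occurs more than once needs a manual decision
ambiguous = merged[merged["headline"].duplicated(keep=False)]

# Optional guard: ask pandas to insist that each name appears once on the right
try:
    pd.merge(demo_talks, demo_speakers, how="left",
             left_on="speaker_1", right_on="name", validate="many_to_one")
    names_unique = True
except pd.errors.MergeError:
    names_unique = False
```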

## Check that all speakers and speaker meta information are in place

In [19]:
ts_final = pd.read_csv('./speakers_work/TEDplus_speakers.csv')
ts_final.drop(ts_final.columns[0], axis=1, inplace=True)
ts_final

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker1_occupation,speaker1_introduction,speaker1_profile,speaker_2,speaker2_occupation,speaker2_introduction,speaker2_profile
0,37,https://www.ted.com/talks/jimmy_wales_on_the_b...,The birth of Wikipedia,"Jimmy Wales recalls how he assembled ""a ragtag...",TEDGlobal 2005,0:20:01,8/21/06,"wikipedia,open-source,media,invention,culture,...",1187730,"Charles Van Doren, who was later a senior ed...",Jimmy Wales,Founder of Wikipedia,"With a vision for a free online encyclopedia, ...",Why you should listen\nJimmy Wales went from b...,,,,
1,47,https://www.ted.com/talks/david_deutsch_on_our...,Chemical scum that dream of distant quasars,Legendary scientist David Deutsch puts theoret...,TEDGlobal 2005,0:19:00,9/12/06,"cosmos,physics,global issues,climate change,un...",1182698,We've been told to go out on a limb and say ...,David Deutsch,Quantum physicist,"David Deutsch's 1997 book ""The Fabric of Reali...",Why you should listen\nDavid Deutsch will forc...,,,,
2,98,https://www.ted.com/talks/richard_dawkins_on_o...,Why the universe seems so strange,"Biologist Richard Dawkins makes a case for ""th...",TEDGlobal 2005,0:21:56,9/12/06,"cosmos,evolution,physics,astronomy,psychology,...",3036253,"My title: ""Queerer than we can suppose: the ...",Richard Dawkins,Evolutionary biologist,Oxford professor Richard Dawkins has helped st...,Why you should listen\nAs an evolutionary biol...,,,,
3,93,https://www.ted.com/talks/barry_schwartz_on_th...,The paradox of choice,Psychologist Barry Schwartz takes aim at a cen...,TEDGlobal 2005,0:19:37,9/26/06,"choice,happiness,potential,psychology,economic...",11110916,I'm going to talk to you about some stuff th...,Barry Schwartz,Psychologist,Barry Schwartz studies the link between econom...,Why you should listen\nIn his 2004 book The Pa...,,,,
4,39,https://www.ted.com/talks/aubrey_de_grey_says_...,A roadmap to end aging,Cambridge researcher Aubrey de Grey argues tha...,TEDGlobal 2005,0:22:45,10/2/06,"biotech,engineering,aging,health care,disease,...",3467757,18 minutes is an absolutely brutal time limi...,Aubrey de Grey,Crusader against aging,"Aubrey de Grey, British researcher on aging, c...","Why you should listen\nA true maverick, Aubre...",,,,
5,79,https://www.ted.com/talks/iqbal_quadir_says_mo...,How mobile phones can fight poverty,Iqbal Quadir tells how his experiences as a ki...,TEDGlobal 2005,0:15:52,10/10/06,"microfinance,alternative energy,transportation...",529470,I'll just take you to Bangladesh for a minut...,Iqbal Quadir,"Founder, GrameenPhone",Iqbal Quadir is an advocate of business as a h...,Why you should listen\nAs a kid in rural Bangl...,,,,
6,91,https://www.ted.com/talks/jacqueline_novogratz...,Invest in Africa's own solutions,Jacqueline Novogratz applauds the world's heig...,TEDGlobal 2005,0:12:53,10/10/06,"microfinance,philanthropy,investment,poverty,g...",771282,"I want to start with a story, a la Seth Godi...",Jacqueline Novogratz,Investor and advocate for moral leadership,Jacqueline Novogratz works to enable human flo...,Why you should listen\nJacqueline Novogratz wr...,,,,
7,3,https://www.ted.com/talks/ashraf_ghani_on_rebu...,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,TEDGlobal 2005,0:18:45,10/18/06,"policy,investment,global issues,poverty,global...",849545,"A public, Dewey long ago observed, is consti...",Ashraf Ghani,President-elect of Afghanistan,"Ashraf Ghani, Afghanistan’s new president-elec...",Why you should listen\nAshraf Ghani became Afg...,,,,
8,75,https://www.ted.com/talks/sasa_vucinic_invests...,Why we should invest in a free press,"A free press -- papers, magazines, radio, TV, ...",TEDGlobal 2005,0:18:00,10/18/06,"philanthropy,investment,global issues,media,cu...",599655,Video: Narrator: An event seen from one poin...,Sasa Vucinic,Nonprofit venture capitalist,Sasa Vucinic's Media Development Loan Fund app...,"Why you should listen\n""We want to create a ne...",,,,
9,67,https://www.ted.com/talks/peter_donnelly_shows...,How juries are fooled by statistics,Oxford mathematician Peter Donnelly reveals th...,TEDGlobal 2005,0:21:20,11/8/06,"statistics,science,genetics,culture,technology",1092979,"As other speakers have said, it's a rather d...",Peter Donnelly,Mathematician; statistician,Peter Donnelly is an expert in probability the...,Why you should listen\nPeter Donnelly applies ...,,,,


In [38]:
# Rows where speaker 1 has no occupation, i.e. no speaker meta information was merged in
s1cut = ts_final[ts_final['speaker1_occupation'].isnull()]

# Rows that have a second speaker but no meta information for that speaker
has_s2 = ts_final['speaker_2'].notnull()
s2cut = ts_final[has_s2 & ts_final['speaker2_occupation'].isnull()]

In [53]:
s_missing = pd.concat([s1cut,s2cut], axis = 0)
s_missing = s_missing.drop_duplicates(keep = "first")
s_missing.to_csv('./speakers_work/missing_meta_plus.csv', sep = ',')

The file `missing_meta_plus.csv` contains the rows of our dataset that are missing the speaker meta information. In this last step we will do one final manual set of additions to fill these missing cells in. This final file will be called `TEDplus_speakers_final.csv`. (As an intermediate step, we created a helper file called `missing_meta_plus-filled.csv` to aid in this process.)
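
A talk can be missing meta information for both of its speakers, in which case it is selected twice before the two selections are stacked. The sketch below, on invented rows, shows why `drop_duplicates` is needed after the `concat`.

```python
import pandas as pd

# Hypothetical rows: Talk B lacks speaker 1 metadata; Talk C lacks metadata for both speakers
demo = pd.DataFrame({
    "headline": ["Talk A", "Talk B", "Talk C"],
    "speaker_1": ["Alex Kim", "Robin Poe", "Sam Doe"],
    "speaker1_occupation": ["Chemist", None, None],
    "speaker_2": [None, None, "Lee Wu"],
    "speaker2_occupation": [None, None, None],
})

s1cut_demo = demo[demo["speaker1_occupation"].isnull()]
s2cut_demo = demo[demo["speaker_2"].notnull() & demo["speaker2_occupation"].isnull()]

# Talk C is in both selections, so it appears twice until duplicates are dropped
stacked = pd.concat([s1cut_demo, s2cut_demo], axis=0)
demo_missing = stacked.drop_duplicates(keep="first")
print(demo_missing["headline"].tolist())
```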

To ensure that all of the cleaning steps are carried back to the earlier files, we make `TEDplus_final.csv`, which includes everything in `TEDplus_speakers_final.csv` except the meta information about the speakers.

In [18]:
ts = pd.read_csv('./speakers_work/TEDplus_speakers_final.csv',sep = ',')

In [19]:
# Remove speaker meta information except the names: keep every column up to
# and including speaker_1, plus the other speaker name columns
# (speaker_3 and speaker_4 were presumably added during the manual edits)
cols = list(ts.columns.values)
s1 = cols.index('speaker_1')
new_cols = cols[0:s1+1] + ['speaker_2', 'speaker_3', 'speaker_4']

In [20]:
just_talks = ts[new_cols]
just_talks.to_csv('./speakers_work/TEDplus_final.csv', sep = ',')
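
The column selection can be tried on a one-row toy frame. In this sketch the column names mirror the ones above, the values are invented, and the `if c in cols` guard is an addition for the example: the notebook itself assumes that `speaker_2`, `speaker_3` and `speaker_4` all exist.

```python
import pandas as pd

# Hypothetical final file with one talk and two speakers
demo_ts = pd.DataFrame({
    "headline": ["Talk A"],
    "views": [100],
    "speaker_1": ["Alex Kim"],
    "speaker1_occupation": ["Chemist"],
    "speaker_2": ["Lee Wu"],
    "speaker2_occupation": ["Pianist"],
})

cols = list(demo_ts.columns)
# Everything up to and including speaker_1 is talk information
keep = cols[: cols.index("speaker_1") + 1]
# Add the remaining speaker name columns, skipping any that are absent
keep += [c for c in ["speaker_2", "speaker_3", "speaker_4"] if c in cols]

demo_just_talks = demo_ts[keep]
print(demo_just_talks.columns.tolist())
```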

*Note*: The above code was adjusted after the data cleaning was complete because the directories within this GitHub repository were reorganized.