# Merging talks to their speakers

There are a few phases to this stage of the data cleaning.
* First, we split the speakers of the talks into their own columns. Along the way, a few steps are completed manually.
* Second, we merge the metadata of the talks with the descriptions of the speakers.
* Finally, we run a gender detector. This step also requires some hand-coding and checking to ensure that we have not misgendered individuals.

#### This file handles TEDplus and is a near copy of the one that worked on TEDonly

### Step 0 - Importing packages

In [10]:
# Set of imports
import pandas as pd
import csv
import string
import numpy as np

In [11]:
# Import talk file 
talks = pd.read_csv('../../tedtalks/data/TEDplus.csv', encoding='utf-8')
speakers = pd.read_csv('../../tedtalks/data/speakers.csv',encoding = 'utf-8')

### Step 1 - Begin splitting and cleaning of the talks file

In [12]:
# Drop the first two columns, including the weird unnamed one
talks.drop(talks.columns[0:2], axis=1, inplace=True)


In [13]:
# Break speakers into two columns
splitList = r' \+ | , | and '

# WATCH THE EXTRA SPACES!!!!!!!!!!!!!!!!
# SO MUCH PAIN FOR A SPACE?!?!

#https://stackoverflow.com/questions/37543724/python-regex-for-finding-all-words-in-a-string

# Melvin helped here:
splitSpeakers = talks['speaker_name'].str.split(splitList, expand=True).rename(columns=lambda x: f"speaker_{x+1}")

# Join the split speaker columns to the talks dataframe (speaker_name is dropped a few cells below):
splitTalks = talks.join(splitSpeakers)
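
The spaces in the pattern matter because they are consumed together with the separator. A minimal sketch of the difference, using `re.split` from the standard library as a stand-in for `Series.str.split` and one speaker string from this dataset; the looser pattern is the one tried in the draft cells of this notebook:

```python
import re

tight = r' \+ | , | and '   # pattern used above: separators include their spaces
loose = r'\+| , | and'      # draft pattern: '+' and ' and' without the extra spaces

name = "Jill Sobule + Julia Sweeney"

tight_parts = re.split(tight, name)
loose_parts = re.split(loose, name)

print(tight_parts)  # ['Jill Sobule', 'Julia Sweeney']
print(loose_parts)  # ['Jill Sobule ', ' Julia Sweeney']  (stray spaces remain)
```

A name with stray spaces would not match the `name` column of the speakers file in a later merge, so the tighter pattern is the safer choice here.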



In [89]:
# Fix incorrect splitting of this talk's speakers
splitTalks.at[289,"speaker_1"] = splitTalks.at[289,"speaker_name"]
splitTalks.at[289,"speaker_2"] = None
splitTalks.iloc[289]

Talk_ID                                                       466
public_url      https://www.ted.com/talks/astonishing_performa...
speaker_name    Gustavo Dudamel and the Teresa Carreño Youth O...
headline                         El Sistema's top youth orchestra
description     The Teresa Carreño Youth Orchestra contains th...
event                                                     TED2009
duration                                                  0:17:06
published                                                 2/18/09
tags            conducting,TED Prize,entertainment,children,li...
views                                                     2165588
text              Chris Anderson: And now we go live to Caraca...
speaker_1       Gustavo Dudamel and the Teresa Carreño Youth O...
speaker_2                                                    None
Name: 289, dtype: object

In [14]:
splitTalks.drop(["speaker_name"], axis=1, inplace=True)

In [15]:
splitTalks.head()

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2
0,37,https://www.ted.com/talks/jimmy_wales_on_the_b...,The birth of Wikipedia,"Jimmy Wales recalls how he assembled ""a ragtag...",TEDGlobal 2005,0:20:01,8/21/06,"wikipedia,open-source,media,invention,culture,...",1187730,"Charles Van Doren, who was later a senior ed...",Jimmy Wales,
1,47,https://www.ted.com/talks/david_deutsch_on_our...,Chemical scum that dream of distant quasars,Legendary scientist David Deutsch puts theoret...,TEDGlobal 2005,0:19:00,9/12/06,"cosmos,physics,global issues,climate change,un...",1182698,We've been told to go out on a limb and say ...,David Deutsch,
2,98,https://www.ted.com/talks/richard_dawkins_on_o...,Why the universe seems so strange,"Biologist Richard Dawkins makes a case for ""th...",TEDGlobal 2005,0:21:56,9/12/06,"cosmos,evolution,physics,astronomy,psychology,...",3036253,"My title: ""Queerer than we can suppose: the ...",Richard Dawkins,
3,93,https://www.ted.com/talks/barry_schwartz_on_th...,The paradox of choice,Psychologist Barry Schwartz takes aim at a cen...,TEDGlobal 2005,0:19:37,9/26/06,"choice,happiness,potential,psychology,economic...",11110916,I'm going to talk to you about some stuff th...,Barry Schwartz,
4,39,https://www.ted.com/talks/aubrey_de_grey_says_...,A roadmap to end aging,Cambridge researcher Aubrey de Grey argues tha...,TEDGlobal 2005,0:22:45,10/2/06,"biotech,engineering,aging,health care,disease,...",3467757,18 minutes is an absolutely brutal time limi...,Aubrey de Grey,


#### Check for non-ASCII characters in both the talks and the speakers

In [16]:
test_inds = splitTalks["speaker_1"].apply(lambda x: len([True for i in str(x) if (ord(i) < 32 or ord(i) > 122)]) > 0)
#https://stackoverflow.com/questions/36340627/removing-non-ascii-characters-and-replacing-with-spaces-from-pandas-data-frame

#https://blog.teamtreehouse.com/python-single-line-loops
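
The lambda above flags any value containing a character whose code point is below 32 or above 122. A small sketch of the same test as a named function (`has_special_chars` is a hypothetical helper name, not part of the notebook) makes the behaviour easier to check, including the fact that a few plain ASCII symbols such as `|` (code point 124) are flagged too:

```python
def has_special_chars(value):
    """Return True if any character falls outside code points 32..122."""
    return any(ord(ch) < 32 or ord(ch) > 122 for ch in str(value))

print(has_special_chars("Jimmy Wales"))   # False: plain letters and a space
print(has_special_chars("Carl Honoré"))   # True: 'é' is code point 233
print(has_special_chars("a|b"))           # True: '|' is ASCII but above 122
print(has_special_chars(None))            # False: str(None) is just 'None'
```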

In [24]:
test_inds2 = splitTalks["speaker_2"].apply(lambda x: len([True for i in str(x) if (ord(i) < 32 or ord(i) > 122)]) > 0)

In [26]:
splitTalks[test_inds2]

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2
486,2131,https://www.ted.com/talks/vincent_moon_and_nan...,Hidden music rituals around the world,Vincent Moon travels the world with a backpack...,TEDGlobal 2014,0:24:13,11/14/14,"jazz,travel,film,live music,creativity,music",1050973,"Vincent Moon: How can we use computers, came...",Vincent Moon,Naná¡ Vasconcelos


In [17]:
sinds = speakers["name"].apply(lambda x: len([True for i in str(x) if (ord(i) < 32 or ord(i) > 122)]) > 0)

In [31]:
# Check for matches across the rows with special characters in talks and 
# the rows with special characters in speakers

st1 = splitTalks[test_inds][["speaker_1"]]
st2 = splitTalks[test_inds2][["speaker_2"]]

st1.rename(columns={'speaker_1':'speaker'}, inplace=True)
st2.rename(columns={'speaker_2':'speaker'}, inplace=True)

st = pd.concat([st1,st2], axis = 0)

speaks = speakers[sinds][["name"]]

name_test = pd.merge(st, speaks, how = "outer", left_on = "speaker", right_on = "name", indicator = True)
# https://stackoverflow.com/questions/20375561/joining-pandas-dataframes-by-column-names

In [32]:
# This is the number of special character rows that are properly merged:
len(name_test[name_test["_merge"] == "both"])

9

In [33]:
len(name_test[name_test["_merge"] == "left_only"])

8
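
The `_merge` column produced by `indicator=True` labels each name as found in both tables (`both`), only in the talks (`left_only`), or only in the speakers (`right_only`). The same classification can be sketched with plain sets; the names below are illustrative sample values, not the actual contents of either file:

```python
# Illustrative sample values only (assumed, not read from the real files)
talk_names = {"Carl Honoré", "Naná¡ Vasconcelos"}
speaker_names = {"Carl Honoré", "Naná Vasconcelos"}

both = talk_names & speaker_names        # matched in both tables
left_only = talk_names - speaker_names   # in the talks but not the speakers
right_only = speaker_names - talk_names  # in the speakers but not the talks

print(sorted(both))
print(sorted(left_only))
print(sorted(right_only))
```

Names that land in `left_only` are the candidates for manual editing, since they will not find a partner in the speakers file.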

In [34]:
# Save the ones that we need to manually edit - 
name_test.to_csv('../../tedtalks/data/plus_manual_name_edits_old.csv', sep = ',')

In [35]:
# Save the talks file where it is. 
splitTalks.to_csv('../../tedtalks/data/TEDplus_splitSpeakers.csv',sep = ',')

At this point, we do a manual edit, fixing the names that were parsed strangely by the HTML interpreter. The file `plus_manual_name_edits.csv` shows which names have accents and/or special characters and whether they match between the talks and speakers files. After this cleaning, we saved the corrected talks file as `TEDplus_splitSpeakers_clean.csv`.
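
The corrections in this notebook were made by hand in the CSV file. If a scripted, repeatable version were wanted, one option (an assumption, not what was done here) is a lookup table from garbled to corrected names. The two garbled strings below appear in this dataset, and `fix_name` is a hypothetical helper:

```python
# Hypothetical correction table: garbled name -> corrected name
name_fixes = {
    "Naná¡ Vasconcelos": "Naná Vasconcelos",
    "Blaise Agá¼era y Arcas": "Blaise Agüera y Arcas",
}

def fix_name(name):
    """Return the corrected name if one is listed, otherwise the name unchanged."""
    return name_fixes.get(name, name)

print(fix_name("Blaise Agá¼era y Arcas"))  # Blaise Agüera y Arcas
print(fix_name("Jimmy Wales"))             # Jimmy Wales (no entry, unchanged)
```

With pandas, the same table could be applied to a whole column with `Series.replace(name_fixes)`.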

In [99]:
# reload the splitTalks:
clean_talks = pd.read_csv('../../tedtalks/data/TEDplus_splitSpeakers_clean.csv',encoding = 'utf-8')
clean_talks.drop(clean_talks.columns[0], axis=1, inplace=True)

In [100]:
# Merge clean_talks with the speakers file. Then clean up the column names. 
result1 = pd.merge(clean_talks, speakers, 
                   how = 'left', left_on = 'speaker_1', right_on = 'name')
result1.drop(['name'], axis=1, inplace=True)
result1.rename(columns={'occupation':'speaker1_occupation', 
                        'introduction':'speaker1_introduction', 
                        'profile':'speaker1_profile'}, inplace=True)

# https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe/35321983
# pop off the speaker_2 column and put at the end of the dataframe
cols = list(result1.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('speaker_2'))
result1 = result1[cols+['speaker_2']]
result1.head()

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker1_occupation,speaker1_introduction,speaker1_profile,speaker_2
0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,
1,7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,
2,53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,
3,66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,
4,92,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,


In [113]:
result2 = pd.merge(result1, speakers, 
                   how = 'left', left_on = 'speaker_2', right_on = 'name')
result2.rename(columns={'occupation':'speaker2_occupation', 
                        'introduction':'speaker2_introduction', 
                        'profile':'speaker2_profile'}, inplace=True)
result2.drop(['name'], axis=1, inplace=True)
result2

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker1_occupation,speaker1_introduction,speaker1_profile,speaker_2,speaker2_occupation,speaker2_introduction,speaker2_profile
0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,Climate advocate,Nobel Laureate Al Gore focused the world’s att...,Why you should listen\nFormer Vice President A...,,,,
1,7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,Technology columnist,David Pogue is the personal technology columni...,Why you should listen\nWhich cell phone to cho...,,,,
2,53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,If you're here today — and I'm very happy th...,Majora Carter,Activist for environmental justice,Majora Carter redefined the field of environme...,Why you should listen\nMajora Carter is a visi...,,,,
3,66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,Author/educator,Creativity expert Sir Ken Robinson challenges ...,Why you should listen\nWhy don't we get the be...,,,,
4,92,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,"About 10 years ago, I took on the task to te...",Hans Rosling,Global health expert; data visionary,"In Hans Rosling’s hands, data sings. Global tr...",Why you should listen\nEven the most worldly a...,,,,
5,96,https://www.ted.com/talks/tony_robbins_asks_wh...,Why we do what we do,"Tony Robbins discusses the ""invisible forces"" ...",TED2006,0:21:45,6/27/06,"entertainment,goal-setting,potential,psycholog...",22368699,Thank you. I have to tell you I'm both chall...,Tony Robbins,Life coach; expert in leadership psychology,Tony Robbins makes it his business to know why...,Why you should listen\nTony Robbins might have...,,,,
6,49,https://www.ted.com/talks/joshua_prince_ramus_...,Behind the design of Seattle's library,Architect Joshua Prince-Ramus takes the audien...,TED2006,0:19:58,7/10/06,"library,architecture,design,culture,collaboration",1042335,I'm going to present three projects in rapid...,Joshua Prince-Ramus,Architect,Joshua Prince-Ramus is best known as architect...,Why you should listen\nWith one of the decade'...,,,,
7,86,https://www.ted.com/talks/julia_sweeney_on_let...,Letting go of God,When two young Mormon missionaries knock on Ju...,TED2006,0:16:32,7/10/06,"atheism,Christianity,religion,God,comedy,humor...",3903747,"On September 10, the morning of my seventh b...",Julia Sweeney,"Actor, comedian, playwright",Julia Sweeney creates comedic works that tackl...,Why you should listen\nJulia Sweeney is a writ...,,,,
8,71,https://www.ted.com/talks/rick_warren_on_a_lif...,A life of purpose,"Pastor Rick Warren, author of ""The Purpose-Dri...",TED2006,0:21:02,7/18/06,"Christianity,philanthropy,religion,God,happine...",3361934,"I'm often asked, ""What surprised you about t...",Rick Warren,"Pastor, author",Pastor Rick Warren is the author of The Purpos...,Why you should listen\nPastor Rick Warren is o...,,,,
9,94,https://www.ted.com/talks/dan_dennett_s_respon...,Let's teach religion -- all religion -- in sch...,Philosopher Dan Dennett calls for religion -- ...,TED2006,0:24:45,7/18/06,"atheism,consciousness,evolution,philosophy,rel...",2751013,It's wonderful to be back. I love this wonde...,Dan Dennett,"Philosopher, cognitive scientist",Dan Dennett thinks that human consciousness an...,Why you should listen\nOne of our most importa...,,,,


In [117]:
result2.to_csv('../../tedtalks/data/TEDonly_speakers_doubles.csv', sep = ',')

In [114]:
print(clean_talks.shape)
print(result2.shape)

(992, 12)
(997, 18)


In [115]:
dup_first = result2["headline"].duplicated(keep='first')
dup_second= result2["headline"].duplicated(keep='last')
double_talks = pd.concat([result2[dup_first], result2[dup_second]], axis = 0)

In [116]:
double_talks.to_csv('../../tedtalks/data/doubles.csv', sep = ',')

Again, we need a manual step here. There are two speakers named Chris Anderson and two named Michael Green. The merge cannot tell which speaker record belongs to a given talk, so it matches every record with that name and duplicates the rows for talks given by people with these names. This is why `result2` has 997 rows while `clean_talks` has 992. We resolve these duplicates by hand and call the resulting file `TEDonly_speakers.csv`; this is the final data file for the talks at the main TED event (i.e. those called TED YYYY).
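
The row duplication comes from how a left join treats repeated keys: every talk is paired with every speaker record that shares its name. A plain-Python sketch of that behaviour follows (a stand-in for `pd.merge(..., how='left')`; the headlines and occupations are placeholders, not real data):

```python
# Placeholder rows for illustration only
sample_talks = [
    {"headline": "Talk A", "speaker_1": "Chris Anderson"},
    {"headline": "Talk B", "speaker_1": "Al Gore"},
]
sample_speakers = [
    {"name": "Chris Anderson", "occupation": "placeholder occupation 1"},
    {"name": "Chris Anderson", "occupation": "placeholder occupation 2"},
    {"name": "Al Gore", "occupation": "placeholder occupation 3"},
]

def left_join(left_rows, right_rows, left_key, right_key):
    """Pair each left row with every matching right row; keep unmatched left rows."""
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if r[right_key] == left[left_key]]
        if not matches:
            joined.append({**left, "occupation": None})
        for right in matches:
            joined.append({**left, "occupation": right["occupation"]})
    return joined

joined_rows = left_join(sample_talks, sample_speakers, "speaker_1", "name")
print(len(sample_talks), len(joined_rows))  # 2 3: "Talk A" appears twice
```

In pandas, passing `validate='many_to_one'` to `pd.merge` raises an error when the right-hand keys are not unique, which can flag this situation before any rows are duplicated.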

# Notes Below this block 

In [18]:
talks_s1.loc[~talks_s1['speaker_2'].isnull()]

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker1_occupation,speaker1_introduction,speaker1_profile
80,118,https://www.ted.com/talks/sergey_brin_and_larr...,The genesis of Google,Google co-founders Larry Page and Sergey Brin ...,TED2004,0:20:33,5/3/07,"web,design,Google,culture,business,technology,...",1529641,Sergey Brin: I want to discuss a question I ...,Sergey Brin,Larry Page,,,
153,222,https://www.ted.com/talks/the_jill_and_julia_show,The Jill and Julia Show,"Two TED favorites, Jill Sobule and Julia Sween...",TED2007,0:06:14,2/20/08,"entertainment,comedy,humor,storytelling,collab...",507130,♫ Jill Sobule: At a conference in Monterey b...,Jill Sobule,Julia Sweeney,,,
156,224,https://www.ted.com/talks/roy_gould_and_curtis...,A preview of the WorldWide Telescope,Educator Roy Gould and researcher Curtis Wong ...,TED2008,0:06:42,2/27/08,"telescopes,demo,astronomy,universe,science,tec...",1043036,"Roy Gould: Less than a year from now, the wo...",Roy Gould,Curtis Wong,,,
173,246,https://www.ted.com/talks/tod_machover_and_dan...,Inventing instruments that unlock new music,Tod Machover of MIT's Media Lab is devoted to ...,TED2008,0:20:41,4/15/08,"demo,entertainment,writing,live music,health c...",519734,The first idea I'd like to suggest is that w...,Tod Machover,Dan Ellsey,,,
213,322,https://www.ted.com/talks/bruno_bowden_folds_w...,Blindfold origami and cello,After Robert Lang's talk on origami at TED2008...,TED2008,0:02:58,8/1/08,"origami,entertainment,cello,music",384129,Hello everyone. And so the two of us are her...,Bruno Bowden,Rufus Cappadocia,,,
247,385,https://www.ted.com/talks/toys_from_the_future,Toys and materials from the future,"The Inventables guys, Zach Kaplan and Keith Sc...",TED2005,0:15:46,10/30/08,"toy,smell,industrial design,design,creativity,...",420887,Zach Kaplan: Keith and I lead a research tea...,Zach Kaplan,Keith Schacht,,,
291,466,https://www.ted.com/talks/astonishing_performa...,El Sistema's top youth orchestra,The Teresa Carreño Youth Orchestra contains th...,TED2009,0:17:06,2/18/09,"conducting,TED Prize,entertainment,children,li...",2165588,Chris Anderson: And now we go live to Caraca...,Gustavo Dudamel,the Teresa Carreño Youth Orchestra,,,
302,481,https://www.ted.com/talks/pattie_maes_demos_th...,Meet the SixthSense interaction,"This demo -- from Pattie Maes' lab at MIT, spe...",TED2009,0:08:42,3/10/09,"demo,interface design,design,technology",9912033,I've been intrigued by this question of whet...,Pattie Maes,Pranav Mistry,,,
439,881,https://www.ted.com/talks/debate_does_the_worl...,Debate: Does the world need nuclear energy?,Nuclear power: the energy crisis has even die-...,TED2010,0:22:59,6/10/10,"nuclear weapons,wind energy,green,climate chan...",1362908,Chris Anderson: We're having a debate. The d...,Stewart Brand,Mark Z. Jacobson,,,
452,988,https://www.ted.com/talks/david_byrne_sings_no...,"""(Nothing But) Flowers"" with string quartet","David Byrne sings the Talking Heads' 1988 hit,...",TED2010,0:03:15,10/22/10,"garden,future,music,performance,society",665679,(Music) ♫ Here we stand ♫ ♫ Like an Ad...,"David Byrne, Ethel",Thomas Dolby,,,


In [20]:
person1 = talks_s1.iloc[153]["speaker_2"]
person1

' Julia Sweeney'

In [63]:
# First break speakers into two columns

splitList = r'\+| , | and'

#https://stackoverflow.com/questions/37543724/python-regex-for-finding-all-words-in-a-string

# Melvin helped here:
splitSpeakers = talks['speaker_name'].str.split(splitList, expand=True).rename(columns=lambda x: f"speaker_{x+1}")

# Check splits
# splitSpeakers

# Join the speakers to the talks dataframe and clear out the join speakers:
talks = talks.join(splitSpeakers)
talks.drop(["speaker_name"], axis=1, inplace=True)

# =-=-=-= Drafts here - 

#result2 = (result.iloc[emptyjobs]['speaker_name'].str.split('\+', expand=True).rename(columns=lambda x: f"speaker_name_{x+1}"))
#result2 = (result['speaker_name'].str.split(', ', expand=True).rename(columns=lambda x: f"speaker_name_{x+1}"))

#result = result.join(result2)
#result["speaker_name"] = result["speaker_name"].str.split('\+')

#result2['speaker_name'].str.contains(' + ')
#result2[result2['speaker_name'].str.contains('Sergey')]

In [2]:
# Pronoun lists
male_pronouns = {'he', 'him', 'his', 'himself'}
female_pronouns = {'she', 'her', 'hers', 'herself'}
#nonbinary_pronouns = {'they', 'them', 'their', 'theirs', 'themself'}

nonbinary_pronouns = {'they', 'them', 'their', 'theirs', 'themself', 
                    'e', 'ey', 'em', 'eir', 'eirs', 'eirself', 
                    'fae', 'faer', 'faers', 'faerself', 
                    'per', 'pers', 'perself',
                    've', 'ver', 'vis', 'verself',
                    'xe', 'xem', 'xyr', 'xyrs', 'xemself',
                    'ze', 'zie', 'hir', 'hirs', 'hirself', 
                    'sie', 'zir', 'zis', 'zim', 'zieself', 
                    'emself', 'tey', 'ter', 'tem', 'ters', 'terself'} 

In [3]:
# This function finds the gender within a few parameters. 
def find_gender(input_description):
	global male_pronouns, female_pronouns, nonbinary_pronouns

	# Initialize score variables
	male_score = 0
	female_score = 0
	nonbinary_score = 0

	# Lowercase and isolate every word of the description
	des_lst = input_description.lower().split()
	for word in des_lst:
		cleanword = word.strip(string.punctuation)

		# Add to the appropriate score
		if cleanword in male_pronouns:
			male_score = male_score + 1
		elif cleanword in female_pronouns:
			female_score = female_score + 1
		elif cleanword in nonbinary_pronouns:
			nonbinary_score = nonbinary_score + 1

	total = male_score + female_score + nonbinary_score

	if total == 0: # Only happens if there are no pronouns
		gender = 'no pronouns'
	# elif (nonbinary_score <= (.1)*total):
	# Note: The above line is too harsh. 
	
	# If two kinds of pronouns are zero
	elif (male_score == 0) and (female_score == 0):
		gender = 'non-binary'
	elif (male_score == 0) and (nonbinary_score == 0):
		gender = 'female'
	elif (female_score == 0) and (nonbinary_score == 0):
		gender = 'male'

	# If at most one kind of pronoun is zero
	elif (nonbinary_score <= 1):
		score = (female_score - male_score) / (female_score + male_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'male'
		else:
			gender = 'undetected'
	elif (male_score == 0):
		score = (female_score - nonbinary_score) / (female_score + nonbinary_score)
		if score > 0.3:
			gender = 'female'
		elif score < -0.3:
			gender = 'nonbinary'
		else:
			gender = 'undetected'
	elif (female_score == 0):
		score = (male_score - nonbinary_score) / (male_score + nonbinary_score)
		if score > 0.3:
			gender = 'male'
		elif score < -0.3:
			gender = 'nonbinary'
		else:
			gender = 'undetected'
	else:
		gender = 'last case'

	return (gender, male_score, female_score, nonbinary_score)
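
To make the thresholds concrete, here is a condensed sketch of the counting and scoring for the male/female case only. `count_pronouns` is a hypothetical helper that mirrors the loop in `find_gender`, and the sample sentence is invented for illustration:

```python
import string

male_pronouns = {'he', 'him', 'his', 'himself'}
female_pronouns = {'she', 'her', 'hers', 'herself'}

def count_pronouns(text):
    """Count male and female pronouns after lowercasing and stripping punctuation."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    male = sum(w in male_pronouns for w in words)
    female = sum(w in female_pronouns for w in words)
    return male, female

# Invented sample text, not taken from the speakers file
sample = "She founded the lab in 2010. Her work, and that of his team, is widely cited."
male, female = count_pronouns(sample)

# Same ratio as in find_gender: +1 is all female, -1 is all male
score = (female - male) / (female + male)
label = 'female' if score > 0.3 else 'male' if score < -0.3 else 'undetected'
print(male, female, round(score, 3), label)  # 1 2 0.333 female
```

A single stray pronoun referring to someone else, as with "his team" here, lowers the score but does not flip the label unless it pulls the ratio inside the ±0.3 band.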

In [4]:
# Read in the talks, speakers, and new speakers
talks = pd.read_csv('../../tedtalks/data/tedtalks2018.csv')

#Drop the weird last column when importing the speakers file
speakers = pd.read_csv('../../speakersGenderTest.csv', encoding='latin-1')
speakers.drop(speakers.columns[len(speakers.columns)-1], axis=1, inplace=True)
#https://stackoverflow.com/questions/20517650/how-to-delete-the-last-column-of-data-of-a-pandas-dataframe

# Rename the columns to match the talks file
speakers.rename(columns={'Name':'speaker_name'}, inplace=True)

# Import the new speakers (no weird rows... yet) 
# Rename the column files to match speakers
# newspeakers = pd.read_csv('../../tedtalks/data/speakers/speakers_raw.csv')
# newspeakers.rename(columns={'name':'speaker_name',
#                             'introduction':'ShortDescription',
#                             'profile':'LongDescription'}, inplace=True)# 

In [5]:
# In this block, we are performing the gender find on the new speakers
# and we create a dictionary that we then save as a new CSV 

# Open the CSV file as a dictionary 
with open('../../tedtalks/data/speakers/speakers2.csv') as des_file:
    des_data = csv.DictReader(des_file)
    
    # Create an empty list of dictionaries
    name_lst = []
    for row in des_data:
        # Pull the description
        row_des = row['profile']
        short_des = row['introduction']
        both_des = row_des + short_des
        found_gender, ms, fs, ns = find_gender(both_des)
        
        if (found_gender == "male") or (found_gender == "female"):
            row_dict = {'speaker_name':row['name'], 
                        'Occupation':row['occupation'],
                        'ShortDescription':short_des,
                        'LongDescription': row_des, 
                        'Gender': found_gender,
                        'MaleScore': ms,
                        'FemaleScore': fs,
                        'NonBinaryScore': ns,
                        'Gender + hand codes': found_gender}
        else: 
            row_dict = {'speaker_name':row['name'], 
                        'Occupation':row['occupation'],
                        'ShortDescription':short_des,
                        'LongDescription': row_des, 
                        'Gender': found_gender,
                        'MaleScore': ms,
                        'FemaleScore': fs,
                        'NonBinaryScore': ns}
        name_lst.append(row_dict)


with open('../../tedtalks/data/speakers/speakers_gender_test.csv', 'w', newline='') as csvfile:
    fields = ['speaker_name', 'Occupation', 'ShortDescription', 'LongDescription', 'Gender',
              'MaleScore', 'FemaleScore', 'NonBinaryScore', 'Gender + hand codes']
    writer = csv.DictWriter(csvfile, fieldnames=fields)
    writer.writeheader()

    # Rows without a 'Gender + hand codes' key are written with an empty field
    writer.writerows(name_lst)
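
Rows built in the `else` branch above have no `'Gender + hand codes'` key. `csv.DictWriter` fills any missing field with its `restval` (an empty string by default), so those rows end with an empty field. A small sketch with an in-memory buffer and made-up rows:

```python
import csv
import io

fields = ['speaker_name', 'Gender', 'Gender + hand codes']

# Made-up rows for illustration; the second one has no hand-code key
rows = [
    {'speaker_name': 'Speaker A', 'Gender': 'female', 'Gender + hand codes': 'female'},
    {'speaker_name': 'Speaker B', 'Gender': 'undetected'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines[2])  # Speaker B,undetected,
```

pandas reads such empty fields back as missing values by default, which is what the `isnull()` checks in the following cells count.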

In [6]:
sgt = pd.read_csv('../../tedtalks/data/speakers/speakers_gender_test.csv')

In [7]:
# Add the new speakers to the old set
add_speakers = pd.merge(speakers, sgt, how="outer")

add_speakers = add_speakers.drop_duplicates(subset=["speaker_name"], keep='first')

In [8]:
sgt.shape

(2569, 9)

In [9]:
# Check the number of unlocated genders
len(add_speakers.loc[add_speakers['Gender + hand codes'].isnull()])

89

In [10]:
# Merge the speaker information into the talks file
result = pd.merge(talks, add_speakers, on="speaker_name", how="left")
#https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/

In [11]:
#Check to see how many genders are missing
len(result.loc[result['Gender + hand codes'].isnull()])

154

In [12]:
emptyjobs = result[result['Occupation'].isnull()].index

# 181206 - Thoughts: Need to deal with the "+" signs in speaker_name
#                    Also the letters that came in oddly coded
#                    Do we want the "A TED Original Podcast"? No right? 


Unnamed: 0,rowID,Talk_ID,public_url,speaker_name,headline,description,event,duration,published,tags,views,text,Occupation,ShortDescription,LongDescription,Gender,MaleScore,FemaleScore,NonBinaryScore,Gender + hand codes
61,61,73,https://www.ted.com/talks/carl_honore_praises_...,Carl Honoré,In praise of slowness,Journalist Carl Honore believes the Western wo...,TEDGlobal 2005,0:19:15,2/28/07,"choice,happiness,potential,psychology,health,p...",2632619,What I'd like to start off with is an observ...,,,,,,,,
100,100,118,https://www.ted.com/talks/sergey_brin_and_larr...,Sergey Brin + Larry Page,The genesis of Google,Google co-founders Larry Page and Sergey Brin ...,TED2004,0:20:33,5/3/07,"web,design,Google,culture,business,technology,...",1529641,Sergey Brin: I want to discuss a question I ...,,,,,,,,
108,108,129,https://www.ted.com/talks/blaise_aguera_y_arca...,Blaise Agá¼era y Arcas,How PhotoSynth can connect the world's images,Blaise Aguera y Arcas leads a dazzling demo of...,TED2007,0:07:30,5/27/07,"microsoft,virtual reality,demo,software,visual...",4909579,"What I'm going to show you first, as quickly...",,,,,,,,
151,151,184,https://www.ted.com/talks/vilayanur_ramachandr...,VS Ramachandran,3 clues to understanding your brain,Vilayanur Ramachandran tells us what brain dam...,TED2007,0:23:34,10/21/07,"consciousness,illusion,brain,illness,science,c...",4229924,"Well, as Chris pointed out, I study the huma...",,,,,,,,
190,190,222,https://www.ted.com/talks/the_jill_and_julia_show,Jill Sobule + Julia Sweeney,The Jill and Julia Show,"Two TED favorites, Jill Sobule and Julia Sween...",TED2007,0:06:14,2/20/08,"entertainment,comedy,humor,storytelling,collab...",507130,♫ Jill Sobule: At a conference in Monterey b...,,,,,,,,
194,194,224,https://www.ted.com/talks/roy_gould_and_curtis...,Roy Gould + Curtis Wong,A preview of the WorldWide Telescope,Educator Roy Gould and researcher Curtis Wong ...,TED2008,0:06:42,2/27/08,"telescopes,demo,astronomy,universe,science,tec...",1043036,"Roy Gould: Less than a year from now, the wo...",,,,,,,,
212,212,246,https://www.ted.com/talks/tod_machover_and_dan...,Tod Machover + Dan Ellsey,Inventing instruments that unlock new music,Tod Machover of MIT's Media Lab is devoted to ...,TED2008,0:20:41,4/15/08,"demo,entertainment,writing,live music,health c...",519734,The first idea I'd like to suggest is that w...,,,,,,,,
266,266,322,https://www.ted.com/talks/bruno_bowden_folds_w...,Bruno Bowden + Rufus Cappadocia,Blindfold origami and cello,After Robert Lang's talk on origami at TED2008...,TED2008,0:02:58,8/1/08,"origami,entertainment,cello,music",384129,Hello everyone. And so the two of us are her...,,,,,,,,
315,315,385,https://www.ted.com/talks/toys_from_the_future,Zach Kaplan + Keith Schacht,Toys and materials from the future,"The Inventables guys, Zach Kaplan and Keith Sc...",TED2005,0:15:46,10/30/08,"toy,smell,industrial design,design,creativity,...",420887,Zach Kaplan: Keith and I lead a research tea...,,,,,,,,
394,394,481,https://www.ted.com/talks/pattie_maes_demos_th...,Pattie Maes + Pranav Mistry,Meet the SixthSense interaction,"This demo -- from Pattie Maes' lab at MIT, spe...",TED2009,0:08:42,3/10/09,"demo,interface design,design,technology",9912033,I've been intrigued by this question of whet...,,,,,,,,


In [None]:
result.iloc[emptyjobs]

In [90]:
# Melvin
#splitList = ['\+', ', ', ' and ']

splitList = r'\+| , | and'

#https://stackoverflow.com/questions/37543724/python-regex-for-finding-all-words-in-a-string

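# NOTE: result.iloc[381]['speaker_name'] is a single string, not a Series, so it has no
# .str accessor (hence the AttributeError below); result.iloc[[381]]['speaker_name']
# would keep it a Series.
result2 = (result.iloc[381]['speaker_name'].str.split(splitList, expand=True).rename(columns=lambda x: f"speaker_name_{x+1}"))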

#result2 = (result.iloc[emptyjobs]['speaker_name'].str.split('\+', expand=True).rename(columns=lambda x: f"speaker_name_{x+1}"))
#result2 = (result['speaker_name'].str.split(', ', expand=True).rename(columns=lambda x: f"speaker_name_{x+1}"))

#result = result.join(result2)
#result["speaker_name"] = result["speaker_name"].str.split('\+')

#result2['speaker_name'].str.contains(' + ')
#result2[result2['speaker_name'].str.contains('Sergey')]

AttributeError: 'str' object has no attribute 'str'

In [91]:
result2

Unnamed: 0,speaker_name_1,speaker_name_2
61,Carl Honoré,
100,Sergey Brin,Larry Page
108,Blaise Agá¼era y Arcas,
151,VS Ramachandran,
178,Bernie Dunlap,
190,Jill Sobule,Julia Sweeney
194,Roy Gould,Curtis Wong
212,Tod Machover,Dan Ellsey
266,Bruno Bowden,Rufus Cappadocia
315,Zach Kaplan,Keith Schacht


In [50]:
add_speakers = add_speakers.drop_duplicates(subset=["speaker_name"], keep='first')

In [40]:
add_speakers.loc[add_speakers["Gender + hand codes"].isnull()]

Unnamed: 0,speaker_name,Occupation,ShortDescription,LongDescription,Gender,MaleScore,FemaleScore,NonBinaryScore,Gender + hand codes
1836,John Gable,"Technologist, activist",John Gable is the founder and CEO of AllSides....,Why you should listen\nJohn Gable offers a uni...,undetected,3,0,3,
1857,Katie Hinde,Lactation researcher,Katie Hinde is studying breast milk’s status a...,Why you should listen\nDid you know mother's m...,undetected,0,3,3,
1860,Raj Panjabi,Physician,A billion people around the world lack access ...,Why you should listen\nRaj Panjabi was nine wh...,undetected,5,0,3,
1871,Stuart Duncan,Web developer,"Stuart Duncan is the creator of AutCraft, the ...","Why you should listen\nIn 2013, Stuart ""Autism...",undetected,2,0,2,
1922,Supasorn Suwajanakorn,Computer scientist,Supasorn Suwajanakorn works on ways to reconst...,Why you should listen\nCan we create a digital...,undetected,3,0,3,
1927,Sydney Chaffee,Educator,Sydney Chaffee believes that teachers and stud...,Why you should listen\nAs the 2017 National Te...,no pronouns,0,0,0,


In [43]:
result.loc[result.duplicated(['headline'])]
#https://stackoverflow.com/questions/45262134/inner-join-merge-in-pandas-dataframe-give-more-rows-than-left-dataframe

Unnamed: 0,rowID,Talk_ID,public_url,speaker_name,headline,description,event,duration,published,tags,...,text,Occupation,ShortDescription,LongDescription,Gender,MaleScore,FemaleScore,NonBinaryScore,Gender + hand codes,_merge
13,12,58,https://www.ted.com/talks/larry_brilliant_want...,Larry Brilliant,My wish: Help me stop pandemics,"Accepting the 2006 TED Prize, Dr. Larry Brilli...",TED2006,0:25:50,7/25/06,"TED Prize,ebola,global issues,health,disease,s...",...,I'm the luckiest guy in the world. I got to ...,"Epidemiologist, philanthropist",TED Prize winner Larry Brilliant has spent his...,Why you should listen\nLarry Brilliant's caree...,male,17.0,0.0,2.0,,both
48,46,23,https://www.ted.com/talks/peter_gabriel_fights...,Peter Gabriel,Fight injustice with raw video,Musician and activist Peter Gabriel shares his...,TED2006,0:14:08,12/6/06,"global issues,film,activism,storytelling,art,c...",...,"I love trees, and I'm very lucky, because we...","Musician, activist","Peter Gabriel writes incredible songs but, as ...",Why you should listen\nPeter Gabriel was a fou...,male,3.0,0.0,0.0,,both
52,49,26,https://www.ted.com/talks/rives_controls_the_i...,Rives,If I controlled the Internet,"How many poets could cram eBay, Friendster and...",TEDSalon 2006,0:04:07,12/14/06,"entertainment,philosophy,love,poetry,culture,p...",...,I wrote this poem after hearing a pretty wel...,"Performance poet, multimedia artist",Performance artist and storyteller Rives has b...,"Why you should listen\nPart poet, part storyte...",male,6.0,0.0,0.0,,both
69,65,5,https://www.ted.com/talks/chris_bangle_says_gr...,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,TED2002,0:20:04,4/5/07,"industrial design,transportation,cars,art,desi...",...,"What I want to talk about is, as background,...",Car designer,Car design is a ubiquitous but often overlooke...,Why you should listen\nAmerican designer Chris...,male,8.0,0.0,0.0,,both
79,74,77,https://www.ted.com/talks/sheila_patek_clocks_...,Sheila Patek,The shrimp with a kick!,Biologist Sheila Patek talks about her work me...,TED2004,0:16:25,4/5/07,"biomechanics,online video,oceans,biology,scien...",...,If you'd like to learn how to play the lobst...,"Biologist, biomechanics researcher",Biologist Sheila Patek is addicted to speed — ...,"Why you should listen\nSheila Patek, a UC Berk...",undetected,0.0,7.0,5.0,,both
84,78,35,https://www.ted.com/talks/james_watson_on_how_...,James Watson,How we discovered DNA,Nobel laureate James Watson opens TED2005 with...,TED2005,0:20:11,4/5/07,"DNA,storytelling,history,science,invention,gen...",...,"Well, I thought there would be a podium, so ...","Biologist, Nobel laureate",Nobel laureate James Watson took part in one o...,Why you should listen\nJames Watson has led a ...,male,8.0,0.0,0.0,,both
95,88,104,https://www.ted.com/talks/william_mcdonough_on...,William McDonough,Cradle to cradle design,Green-minded architect and designer William Mc...,TED2005,0:20:05,4/6/07,"cities,china,global issues,architecture,design...",...,"In 1962, with Rachel Carson's ""Silent Spring...",Architect,Architect William McDonough believes green des...,Why you should listen\nArchitect William McDon...,male,7.0,0.0,3.0,,both
99,91,108,https://www.ted.com/talks/rives_remixes_ted2006,Rives,A mockingbird remix of TED2006,Rives recaps the most memorable moments of TED...,TED2006,0:04:11,4/9/07,"entertainment,memory,storytelling,poetry,spoke...",...,Mockingbirds are badass. (Laughter) They ...,"Performance poet, multimedia artist",Performance artist and storyteller Rives has b...,"Why you should listen\nPart poet, part storyte...",male,6.0,0.0,0.0,,both
107,98,72,https://www.ted.com/talks/chris_anderson_of_wi...,Chris Anderson,Technology's long tail,"Chris Anderson, then the editor of Wired, expl...",TED2004,0:14:18,4/27/07,"entertainment,marketing,economics,culture,busi...",...,"I'd like to speak about technology trends, w...",TED Curator,After a long career in journalism and publishi...,Why you should listen\rChris Anderson is the C...,undetected,8.0,0.0,5.0,male,both
132,122,148,https://www.ted.com/talks/rives_on_4_a_m,Rives,The 4 a.m. mystery,"Poet Rives does 8 minutes of lyrical origami, ...",TED2007,0:09:12,7/17/07,"entertainment,poetry,spoken word",...,This is a recent comic strip from the Los An...,"Performance poet, multimedia artist",Performance artist and storyteller Rives has b...,"Why you should listen\nPart poet, part storyte...",male,6.0,0.0,0.0,,both
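
The Stack Overflow question linked above deals with merges that return more rows than the left frame, which happens when the join key is repeated on the right side. A minimal sketch of that effect, and of using the `validate` argument of `pd.merge` to catch it early, follows. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical talks: Speaker A gave two talks
toy_talks = pd.DataFrame({
    "speaker_name": ["Speaker A", "Speaker A", "Speaker B"],
    "headline": ["Talk 1", "Talk 2", "Talk 3"],
})
# Hypothetical speakers file in which Speaker A appears twice
toy_speakers = pd.DataFrame({
    "speaker_name": ["Speaker A", "Speaker A", "Speaker B"],
    "Occupation": ["Poet", "Poet", "Educator"],
})

# Each of A's talks matches both of A's speaker rows, so the result grows
inflated = pd.merge(toy_talks, toy_speakers, on="speaker_name", how="left")

# validate="many_to_one" raises if the right-hand key is not unique
try:
    pd.merge(toy_talks, toy_speakers, on="speaker_name", how="left",
             validate="many_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True

# After de-duplicating the speakers, the row count matches the talks again
clean = pd.merge(
    toy_talks,
    toy_speakers.drop_duplicates(subset=["speaker_name"], keep="first"),
    on="speaker_name", how="left", validate="many_to_one",
)
print(len(toy_talks), len(inflated), len(clean))
```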


In [38]:
result.loc[result['_merge'] == "left_only"]
#https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas

Unnamed: 0,rowID,Talk_ID,public_url,speaker_name,headline,description,event,duration,published,tags,...,text,Occupation,ShortDescription,LongDescription,Gender,MaleScore,FemaleScore,NonBinaryScore,Gender + hand codes,_merge
64,61,73,https://www.ted.com/talks/carl_honore_praises_...,Carl Honoré,In praise of slowness,Journalist Carl Honore believes the Western wo...,TEDGlobal 2005,0:19:15,2/28/07,"choice,happiness,potential,psychology,health,p...",...,What I'd like to start off with is an observ...,,,,,,,,,left_only
109,100,118,https://www.ted.com/talks/sergey_brin_and_larr...,Sergey Brin + Larry Page,The genesis of Google,Google co-founders Larry Page and Sergey Brin ...,TED2004,0:20:33,5/3/07,"web,design,Google,culture,business,technology,...",...,Sergey Brin: I want to discuss a question I ...,,,,,,,,,left_only
117,108,129,https://www.ted.com/talks/blaise_aguera_y_arca...,Blaise Agá¼era y Arcas,How PhotoSynth can connect the world's images,Blaise Aguera y Arcas leads a dazzling demo of...,TED2007,0:07:30,5/27/07,"microsoft,virtual reality,demo,software,visual...",...,"What I'm going to show you first, as quickly...",,,,,,,,,left_only
162,151,184,https://www.ted.com/talks/vilayanur_ramachandr...,VS Ramachandran,3 clues to understanding your brain,Vilayanur Ramachandran tells us what brain dam...,TED2007,0:23:34,10/21/07,"consciousness,illusion,brain,illness,science,c...",...,"Well, as Chris pointed out, I study the huma...",,,,,,,,,left_only
192,178,208,https://www.ted.com/talks/ben_dunlap_talks_abo...,Bernie Dunlap,The life-long learner,Wofford College president Bernie Dunlap tells ...,TED2007,0:19:08,1/23/08,"entertainment,library,literature,storytelling,...",...,"""Jó napot, pacák"" Which, as somebody here mu...",,,,,,,,,left_only
205,190,222,https://www.ted.com/talks/the_jill_and_julia_show,Jill Sobule + Julia Sweeney,The Jill and Julia Show,"Two TED favorites, Jill Sobule and Julia Sween...",TED2007,0:06:14,2/20/08,"entertainment,comedy,humor,storytelling,collab...",...,♫ Jill Sobule: At a conference in Monterey b...,,,,,,,,,left_only
209,194,224,https://www.ted.com/talks/roy_gould_and_curtis...,Roy Gould + Curtis Wong,A preview of the WorldWide Telescope,Educator Roy Gould and researcher Curtis Wong ...,TED2008,0:06:42,2/27/08,"telescopes,demo,astronomy,universe,science,tec...",...,"Roy Gould: Less than a year from now, the wo...",,,,,,,,,left_only
227,212,246,https://www.ted.com/talks/tod_machover_and_dan...,Tod Machover + Dan Ellsey,Inventing instruments that unlock new music,Tod Machover of MIT's Media Lab is devoted to ...,TED2008,0:20:41,4/15/08,"demo,entertainment,writing,live music,health c...",...,The first idea I'd like to suggest is that w...,,,,,,,,,left_only
284,266,322,https://www.ted.com/talks/bruno_bowden_folds_w...,Bruno Bowden + Rufus Cappadocia,Blindfold origami and cello,After Robert Lang's talk on origami at TED2008...,TED2008,0:02:58,8/1/08,"origami,entertainment,cello,music",...,Hello everyone. And so the two of us are her...,,,,,,,,,left_only
335,315,385,https://www.ted.com/talks/toys_from_the_future,Zach Kaplan + Keith Schacht,Toys and materials from the future,"The Inventables guys, Zach Kaplan and Keith Sc...",TED2005,0:15:46,10/30/08,"toy,smell,industrial design,design,creativity,...",...,Zach Kaplan: Keith and I lead a research tea...,,,,,,,,,left_only
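
Several of the unmatched rows above are joint talks whose `speaker_name` holds two names joined by ` + `, so they cannot match a single row in the speakers file. A small sketch of how the `_merge` indicator can be used to isolate such rows is below. The toy frames are hypothetical, and a literal (non-regex) match on ` + ` is an assumption about how one might flag them.

```python
import pandas as pd

# Hypothetical talks, one of them with two speakers in a single field
toy_talks = pd.DataFrame({
    "speaker_name": ["Speaker A", "Speaker B + Speaker C"],
})
toy_speakers = pd.DataFrame({
    "speaker_name": ["Speaker A", "Speaker B", "Speaker C"],
    "Occupation": ["Poet", "Engineer", "Engineer"],
})

merged = pd.merge(toy_talks, toy_speakers, on="speaker_name",
                  how="left", indicator=True)

# Rows that found no partner in the speakers frame
unmatched = merged.loc[merged["_merge"] == "left_only", "speaker_name"]

# Of those, the ones that look like joint talks (literal ' + ', not a regex)
joint = unmatched[unmatched.str.contains(" + ", regex=False)]
print(list(joint))
```

Unmatched rows that are not joint talks would need a different fix, for example correcting a spelling or encoding difference between the two files.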


In [52]:
result = pd.merge(talks, add_speakers, on="speaker_name", how="left", indicator=True)

In [53]:
print(result.shape)
print(talks.shape)
print(add_speakers.shape)

(2656, 21)
(2656, 12)
(1842, 9)


In [59]:
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2664 entries, 0 to 2663
Data columns (total 21 columns):
rowID                  2664 non-null int64
Talk_ID                2664 non-null int64
public_url             2664 non-null object
speaker_name           2662 non-null object
headline               2664 non-null object
description            2664 non-null object
event                  2664 non-null object
duration               2664 non-null object
published              2664 non-null object
tags                   2664 non-null object
views                  2664 non-null int64
text                   2664 non-null object
Occupation             2083 non-null object
ShortDescription       2086 non-null object
LongDescription        2087 non-null object
Gender                 2087 non-null object
MaleScore              2087 non-null float64
FemaleScore            2087 non-null float64
NonBinaryScore         2087 non-null float64
Gender + hand codes    2087 non-null object
_merge       

In [50]:
speakers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 9 columns):
speaker_name           1800 non-null object
Occupation             1796 non-null object
ShortDescription       1799 non-null object
LongDescription        1800 non-null object
Gender                 1800 non-null object
MaleScore              1800 non-null int64
FemaleScore            1800 non-null int64
NonBinaryScore         1800 non-null int64
Gender + hand codes    1800 non-null object
dtypes: int64(3), object(6)
memory usage: 126.6+ KB


In [62]:
# Can't run just yet: MaleScore is still NaN for talks with no matching speaker,
# so astype(int) raises a ValueError
# result['MaleScore'] = result['MaleScore'].astype(int)
# https://stackoverflow.com/questions/41590884/change-data-type-of-a-specific-column-of-a-pandas-dataframe

ValueError: Cannot convert NA to integer

In [65]:
result.loc[result['MaleScore'].isnull()].shape

(577, 21)
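
The conversion fails because a NumPy integer column cannot hold missing values, and the count above shows that some rows still have no `MaleScore`. One possible workaround, assuming a pandas version that provides the nullable `"Int64"` extension dtype (0.24 or later), is sketched below on a toy column. The frame `toy` and the new column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy score column with one missing value, as produced by a left merge
toy = pd.DataFrame({"MaleScore": [17.0, np.nan, 3.0]})

# Count the gaps first, as in the cell above
missing = int(toy["MaleScore"].isnull().sum())

# The nullable integer dtype (capital "I") keeps the gaps as <NA>
toy["MaleScore_nullable"] = toy["MaleScore"].astype("Int64")
print(missing)
print(toy.dtypes)
```

The alternative is to fill in or hand-code the missing speakers first and only then cast with `astype(int)`.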