## Getting the Pages that List Speakers

After some rooting around on the TED website and examining the Speakers home page, it looks like there are 86 pages that list TED speakers, with URLs like this: `https://www.ted.com/speakers?page=1`. My plan is to:

1. create a list of pages in a text file,
2. download the 86 pages,
3. parse the speakers out,
4. create a second list of pages with the `ted.com/speakers/speaker_name` format,
5. download all those pages,
6. parse them into a CSV for KK.

Oh, I hope this works.

There are going to be a lot of files created in this process, so it's important to remember where I am:

In [5]:
% pwd

'/Users/c00253218/Code/tedtalks/data'

In [12]:
# As always, the collected imports for this notebook are at the top:

import re, csv, os
from bs4 import BeautifulSoup

### Step 1: Create a list of pages

In [3]:
# This is just proof of concept. I actually used range(1,87) to get
# the list I needed and pasted it into a text document.
for i in range(1,5):
    print("https://www.ted.com/speakers?page=" + str(i))

https://www.ted.com/speakers?page=1
https://www.ted.com/speakers?page=2
https://www.ted.com/speakers?page=3
https://www.ted.com/speakers?page=4
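
The copy-and-paste step could be skipped by writing the list straight to a file. What follows is a minimal sketch, not what I actually ran: the function names `index_page_urls` and `write_index_urls` are made up for illustration, and the count of 86 pages is the one observed above.

```python
from pathlib import Path

BASE_URL = "https://www.ted.com/speakers?page="


def index_page_urls(last_page=86):
    """Return the URLs of speaker index pages 1..last_page (inclusive)."""
    return [BASE_URL + str(i) for i in range(1, last_page + 1)]


def write_index_urls(path, last_page=86):
    """Write one URL per line, which is the format `wget -i` reads."""
    urls = index_page_urls(last_page)
    Path(path).write_text("\n".join(urls) + "\n")
    return urls
```

Calling `write_index_urls("speaker_index_pages.txt")` would then produce the text file used in Step 2.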


### Step 2: Download the pages

Okay, the text file is `speaker_index_pages.txt` in the `data/speakers/` directory. I used the following to download all 86 speaker pages to the `indices` directory:

    wget -w 2 -i ../speaker_index_pages.txt

Then, being lazy and not wanting to figure out how to parse all the files into one list (see below), I simply concatenated all the files into one, with the plan to use `BeautifulSoup` to run through it.

    cat indices/* > speakers_all_pages.txt

### Step 3: Parse the Speaker URLs Out of the Pages

Inside the HTML, each speaker's profile can be found in the following line:

    <a class="results__result media media--sm-v m4" href="/speakers/ellen_t_hoen">

So we need the `href` attribute for the `results__result` class.

In [4]:
with open('./speakers/speakers_all_pages.txt', 'r') as myfile:
    data = myfile.read()

In [5]:
the_soup = BeautifulSoup(data, "lxml")

In [8]:
type(the_soup)

bs4.BeautifulSoup

In [9]:
speaker_suffix = the_soup.find('a', {'class':'results__result media media--sm-v m4'})
print(speaker_suffix)

<a class="results__result media media--sm-v m4" href="/speakers/ellen_t_hoen">
<div class="media__image media__image--thumb">
<span class="thumb thumb--square"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" play="false" src="https://pi.tedcdn.com/r/pe.tedcdn.com/images/ted/bcd3208945bf8b311418b64f917b1f38e84a24e8_800x600.jpg?h=191&amp;w=254"/><span class="thumb__aligner"></span></span></span></span>
</div>
<div class="media__message">
<h4 class="h7 m5">
Ellen<br/>'t Hoen</h4>
<p class="p4">
<strong>Medicine law expert</strong>
</p>
</div>
</a>


In [11]:
speaker_suffix = the_soup.find('a', {'class':'results__result'})['href']
print(type(speaker_suffix), speaker_suffix)

<class 'str'> /speakers/ellen_t_hoen


In [13]:
speaker_suffixes = the_soup.findAll('a', {'class':'results__result media media--sm-v m4'})
len(speaker_suffixes)

30

Why is this only returning 30 items? Using search in a plain text editor (Atom) on the same files above, there are 2572 occurrences of **`results__result`**.
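
One possible explanation, which I have not confirmed: 30 is roughly 2572 / 86, so the parser may be reading only the first of the 86 complete HTML documents in the concatenated file and ignoring everything after the first closing `</html>`. If that is the cause, parsing each downloaded page separately would sidestep it. The sketch below assumes that; the function names are hypothetical, and it uses the standard-library `html.parser` backend rather than `lxml` so that it runs without extra dependencies.

```python
import os
from bs4 import BeautifulSoup


def hrefs_from_pages(pages):
    """Collect speaker hrefs from an iterable of HTML strings, one per page."""
    hrefs = []
    for html in pages:
        soup = BeautifulSoup(html, "html.parser")
        # class_ matches any <a> whose class list includes results__result
        for link in soup.find_all("a", class_="results__result"):
            href = link.get("href")
            if href:
                hrefs.append(href)
    return hrefs


def read_pages(directory):
    """Yield the text of every file in `directory`, in sorted name order."""
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            yield f.read()
```

With the layout used here, `hrefs_from_pages(read_pages("./speakers/indices"))` would be the call to try.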

```python
# Archived code that was meant to generate URLs from the soup above
suffixes = [i.attrs["href"] for i in speaker_suffixes]
urls = [str("https://www.ted.com"+suffix) for suffix in suffixes]
```

### Using regex to get the speaker suffix

Okay, I'm still not sure what's going on with the truncated input above, but what I am going to try to do below is read the file in as a string and then use regex to grab the speaker suffixes.

In [15]:
suffixes = re.findall(r'/speakers/(.*?)\'>', data)
len(suffixes)

2572
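
The pattern above only matches when the `href` value is closed by a single quote followed directly by `>`. A slightly more defensive sketch accepts either quote style. This is an alternative I did not run against the real data, so its count may differ from 2572 (it would also pick up any other links of the form `/speakers/<name>` on the page); `SPEAKER_HREF` and `speaker_slugs` are names invented for the example.

```python
import re

# href="/speakers/<slug>" or href='/speakers/<slug>': the opening quote is
# captured in group 1 and must be repeated (\1) to close the attribute.
SPEAKER_HREF = re.compile(r"""href=(["'])/speakers/([^"'/?#]+)\1""")


def speaker_slugs(html):
    """Return the speaker slugs found in `html`, in document order."""
    return [match.group(2) for match in SPEAKER_HREF.finditer(html)]
```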

In [16]:
urls = [str("https://www.ted.com/speakers/"+suffix) for suffix in suffixes]

In [17]:
urls[0:5]

['https://www.ted.com/speakers/ellen_t_hoen',
 'https://www.ted.com/speakers/sandra_aamodt',
 'https://www.ted.com/speakers/trevor_aaronson',
 'https://www.ted.com/speakers/chris_abani',
 'https://www.ted.com/speakers/yassmin_abdel_magied']

### Step 4: Create a second list with the speaker URLs

In [18]:
with open('./speakers/speaker_urls.txt', 'w') as f:
    for item in urls:
        f.write("%s\n" % item)
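
As a precaution, the list could be de-duplicated before it is written, so that `wget` never fetches the same profile twice. I have not checked whether the 2572 URLs actually contain repeats; this is only a sketch, and `unique_in_order` is a hypothetical helper.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each.

    dict preserves insertion order, so the original ordering survives.
    """
    return list(dict.fromkeys(items))
```

Comparing `len(urls)` with `len(unique_in_order(urls))` would show at a glance whether any speaker is listed more than once.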

**Step 5** was done using **`wget`**: 

```bash
wget -w 2 -i ../speaker_urls.txt
```

And here's the eventual report:

```
FINISHED --2018-11-15 22:05:00--
Total wall clock time: 2h 0m 43s
Downloaded: 2569 files, 91M in 7.0s (13.1 MB/s)
```

### Step 6: Parse the speaker profiles

In [6]:
% cd speakers

/Users/c00253218/Code/tedtalks/data/speakers


In [7]:
% ls

indices/                  speaker_urls.txt          speakers_raw.csv
profiles/                 speakers_all_pages.txt
speaker_index_pages.txt   speakers_gender_test.csv


Here's the HTML where things are found:

* **Name**: `<meta name="author" content="Aala El-Khani" />` or `<h1 class="h2 profile-header__name">`
* **Occupation**: `<div class="p2 profile-header__summary">`
* **Intro**: `<div class="profile-intro">`
* **Profile**: `<div class="section section--minor">`

These are all converted into **`BS4`** searches below. (Fingers crossed that these work.)

In [24]:
def parsethis(soup):
    if (soup.find('h1', {'class' : 'h2 profile-header__name'})) is not None:
        name = soup.find('h1', {'class' : 'h2 profile-header__name'}).text.strip('\n')
        occupation = soup.find('div', {'class' : 'p2 profile-header__summary'}).text.strip('\n')
        intro = soup.find('div', {'class' : 'profile-intro'}).text.strip('\n')
        profile = soup.find('div', {'class' : 'section section--minor'}).text.strip('\n')
        return name, occupation, intro, profile
    else:
        return 'undetected','undetected','undetected','undetected'
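
`parsethis` only checks that the name is present, so a page that has a name but lacks one of the other three elements would raise an `AttributeError`. A more forgiving variant could check each field on its own. The sketch below is an assumption-laden alternative, not the code used for the CSV: `parse_profile` and `FIELDS` are hypothetical names, it matches on a single class per element instead of the full class string, and it strips all surrounding whitespace rather than only newlines.

```python
from bs4 import BeautifulSoup

# (tag, class) for each column, taken from the HTML notes above
FIELDS = {
    "name": ("h1", "profile-header__name"),
    "occupation": ("div", "profile-header__summary"),
    "intro": ("div", "profile-intro"),
    "profile": ("div", "section--minor"),
}


def parse_profile(soup, missing="undetected"):
    """Return (name, occupation, intro, profile), filling gaps with `missing`."""
    row = []
    for tag, css_class in FIELDS.values():
        node = soup.find(tag, class_=css_class)
        row.append(node.get_text().strip() if node else missing)
    return tuple(row)
```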

In [25]:
def to_csv(pth, out_path):
    # open the output file; newline="" lets csv.writer control line endings
    with open(out_path, "w", newline="") as out:
        # create csv.writer
        wr = csv.writer(out)
        # write headers
        wr.writerow(["name", "occupation", "introduction", "profile"])
        # get all our html files
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                print(html)  # print each file name as progress output
                # parse the file and write the data to a row.
                wr.writerow(parsethis(BeautifulSoup(f, "lxml")))

In [26]:
# This is the ACTION:
to_csv("./profiles/","speakers2.csv")

gary_haugen
rabbi_lord_jonathan_sacks
joe_kowan
allan_adams
peter_tyack
heather_brooke
reed_hastings
oscar_schwartz
marwa_al_sabouni
steven_addis
nizar_ibrahim
ian_bremmer
lisa_harouni
shashi_tharoor
sergey_brin
candy_chang
susan_solomon
jk_rowling
diane_kelly
thelma_golden
john_kasaona
lee_smolin
elon_musk
yossi_vardi
david_logan
christiane_amanpour
ndidi_nwuneli
jonas_gahr_store
larry_brilliant
zaria_forman
marc_abrahams
adam_driver
halla_tomasdottir
sarah_donnelly
courtney_e_martin
joshua_klein
john_gable
john_la_grou
christine_sun_kim
deborah_gordon
bart_knols
einstein_the_parrot
mike_degruy
peter_gabriel
jill_heinerth
haas_hahn
benjamin_wallace
ethan_zuckerman
marco_tempest
pam_warhurst
neri_oxman
cynthia_kenyon
chimamanda_ngozi_adichie
jonathan_marks
hannah_brencher
birke_baehr
abraham_verghese
katie_hinde
sheila_patek
nirmalya_kumar
raj_panjabi
kate_stafford
lauren_sallan
sian_leah_beilock
jimmy_lin
elizabeth_lindsey
manwar_ali
peter_fankhauser
sue_desmond_hellman
anindya_kundu


kevin_njabo
ola_rosling
sebastian_deterding
richard_branson
isaac_lidsky
ines_hercovich
miriam_zoila_perez
maurizio_seracini
max_tegmark
sandrine_thuret
soyapi_mumba
robin_ince
boyd_varty
ben_ambridge
sheryl_wudunn
katherine_kuchenbecker
michael_pemberton
sarah_murray
bel_pesce
chris_hadfield
andrew_mwenda
amy_webb
david_kelley
david_keith
marina_abramovic
anant_agarwal
emily_oster
bhu_srinivasan
paul_knoepfler
robert_ballard
sophie_andrews
bruce_feiler
angelo_vermeulen
elise_legrow
dan_goldstein
vilayanur_ramachandran
nick_bostrom
ilona_stengel
mary_maker
ibeyi
alyssa_monks
ajit_narayanan
alexis_charpentier
james_patten
hugh_herr
paul_debevec
sanford_biggers
justin_hall_tipping
daan_roosegaarde
iain_mcgilchrist
olafur_eliasson
vincent_moon
garry_kasparov
kiran_sethi
sheila_nirenberg
rachel_botsman
jose_antonio_abreu
jim_yong_kim
philip_evans
ismael_nazario
chris_burkard
latoya_ruby_frazier
alex_steffen
shimon_schocken
amanda_palmer
peter_haas
soka_moses
janine_shepherd
matthieu_ricard

maya_penn_1
sisonke_msimang
barbara_block
mohamad_jabara
tiffany_watt_smith
glen_henry
david_eagleman
john_lloyd
helen_czerski
jason_shen
dawn_wacek
wendy_woods
ilona_szabo_de_carvalho
the_teresa_carreno_youth_orchestra
stella_young
catherine_bracy
danielle_feinberg
zachary_r_wood
peter_ouko
paul_tudor_jones_ii
janine_benyus
emilie_wapnick
david_agus
chris_milk
dianna_cohen
samantha_nutt
steve_silberman
eddi_reader
harry_cliff
jane_mcgonigal
sirena_huang
thomas_insel
tshering_tobgay
oren_yakobovich
clifford_stoll
dimitar_sasselov
mark_ronson
ari_wallach
kristen_marhaver
maryn_mckenna
aaron_huey
ramanan_laxminarayan
sue_austin
ramona_pierson
leah_chase
mac_barnett
sheena_iyengar
tony_wyss_coray
dave_debronkart
the_soul_rebels
natalie_merchant
markus_fischer
bunker_roy
john_cary
suzanne_simard
sean_carroll
nilay_kulkarni
david_grady
sandra_fisher_martins
joseph_redmon
antony_gormley
mick_mountz
will_macaskill
mitch_resnick
phil_borges
fredy_peccerelli
alex_honnold
michael_pawlyn
george_t

cyndi_stivers
r_luke_dubois
philippa_neave
amos_winter
regina_dugan
jennifer_doudna
howard_rheingold
sara_menker
tan_le
quyen_nguyen
azim_n_khamisa
julia_shaw
lera_boroditsky
shea_hembrey
toby_eccles
erik_schlangen
scott_mccloud
edith_widder
anjan_chatterjee
ann_morgan
rutger_bregman
noah_wilson_rich
chris_abani
liz_diller
nancy_lublin
yochai_benkler
james_balog
gary_flake
robert_sapolsky
arianna_huffington
philip_rosedale
carolyn_steel
annie_murphy_paul
bobby_ghosh
eric_whitacre
david_rockwell
steven_strogatz
helen_pearson
kim_gorgens
nadia_al_sakkaf
heather_knight
ethan_nadelmann
tina_seelig
stephen_deberry
caleb_harper
doris_kim_sung
craig_venter
anne_lamott
geraldine_hamilton
michael_rain
casey_brown
kimberle_crenshaw
susan_cain
mae_jemison
ray_dalio
celeste_headlee
teresa_bajan
io_tillett_wright
mina_bissell
hasan_elahi
joy_sun
sakena_yacoobi
ursus_wehrli
dan_buettner
isaac_mizrahi
brian_dettmer
cheyenne_cochrane
rajiv_maheswaran
jose_bowen
ole_scheeren
madeleine_albright
kristie_

joshua_smith
black
sanjay_pradhan
marc_bamuthi_joseph
michael_biddle
arthur_potts_dawson
dr_hawa_abdi_dr_deqo_mohamed
jennifer_golbeck
alice_dreger
rocio_lorenzo
skylar_tibbits
joseph_kim
hamish_jolly
graham_hill
jared_ficklin
john_maeda
aris_venetikidis
jonathan_trent
jessica_pryce
andy_yen
thomas_peschak
stephen_ritz
stefan_wolff
richard_feynman
his_holiness_the_17th_karmapa
caitlin_doughty
joan_halifax
khadija_gbla
will_noel
camille_brown
gary_kovacs
sean_gourley
robb_willer
michael_pritchard
clay_shirky
harvey_fineberg
ed_yong
adam_grosser
john_bohannon
peter_calthorpe
charles_hazlewood
rob_harmon
jehane_noujaim
vinay_shandal
adam_foss
joshua_prager
daniel_schnitzer
shukla_bose
tracee_ellis_ross
mike_matas
wendy_troxel
preston_reed
richard_sears
eric_berridge
anand_agarawala
charles_c_mann
will_marshall
jessica_jackley
jd_schramm
david_pogue
jessa_gamble
tasso_azevedo
taryn_simon
serena_williams
alastair_gray
becci_manson
chera_kowalski
hasini_jayatilaka
pope_francis
kai_fu_lee
ruf

In [2]:
import pandas as pd

df = pd.read_csv('speakers.csv')
print(df.shape)

df.head()

(134, 4)


Unnamed: 0,name,occupation,introduction,profile
0,Gary Haugen,Human rights attorney,"As founder of International Justice Mission, G...",While a member of the 1994 United Nations te...
1,Rabbi Lord Jonathan Sacks,Religious leader,"In a world violently polarized by extremists, ...",Rabbi Lord Sacks is one of Judaism's spiritua...
2,Joe Kowan,Musician and graphic designer,"By day he's a graphic designer, and by night J...",Joe Kowan is a Boston-based musician and grap...
3,Allan Adams,Theoretical physicist,Allan Adams is a theoretical physicist working...,Allan Adams is a theoretical physicist workin...
4,Peter Tyack,Behavioral ecologist,Peter Tyack studies the social behavior and ac...,"Peter Tyack, a senior scientist in biology at..."


In [None]:
# Scratch copy of the to_csv body; paths are the ones used in the call above.
pth, out_path = "./profiles/", "speakers2.csv"
with open(out_path, "w", newline="") as out:
    # create csv.writer
    wr = csv.writer(out)
    # write headers
    wr.writerow(["name", "occupation", "introduction", "profile"])
    # get all our html files
    for html in os.listdir(pth):
        with open(os.path.join(pth, html)) as f:
            # print(html) # prints off name as it goes?
            # parse the file and write the data to a row.
            wr.writerow(parsethis(BeautifulSoup(f, "lxml")))

## The Problem with Eric Haseltine

We have one speaker whose page kept throwing an error in the `parsethis` function, and for whom we wrote the `else` branch that returns `undetected`. His name is Eric Haseltine. We turned the TED website upside down and shook it, hard, and it looks like he has been removed from the website: his speaker page sends you to a recommendations page.

Let's load the big CSV, `tedtalks2018.csv`, and see if we can find the row for his talk(s), to decide whether we need to create a line in the speakers CSV for him.

In [1]:
import pandas as pd

df = pd.read_csv('tedtalks2018.csv')
print(df.shape)

(2656, 12)


In [3]:
df.head()

Unnamed: 0,name,occupation,introduction,profile
0,Gary Haugen,Human rights attorney,"As founder of International Justice Mission, G...",While a member of the 1994 United Nations te...
1,Rabbi Lord Jonathan Sacks,Religious leader,"In a world violently polarized by extremists, ...",Rabbi Lord Sacks is one of Judaism's spiritua...
2,Joe Kowan,Musician and graphic designer,"By day he's a graphic designer, and by night J...",Joe Kowan is a Boston-based musician and grap...
3,Allan Adams,Theoretical physicist,Allan Adams is a theoretical physicist working...,Allan Adams is a theoretical physicist workin...
4,Peter Tyack,Behavioral ecologist,Peter Tyack studies the social behavior and ac...,"Peter Tyack, a senior scientist in biology at..."


In [6]:
df.loc[df['name'] == "Eric"]

Unnamed: 0,name,occupation,introduction,profile


He has been removed: his name appearing in our system is entirely a function of his name still appearing in TED's speaker list. We hand-edited the `speakers2.csv` file to remove the line with `undetected` in it.
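
One caveat about the lookup above: `df['name'] == "Eric"` matches only rows whose name is exactly `Eric`, so an empty result does not by itself rule out a row for `Eric Haseltine`. A substring search would be a stricter check. The sketch below assumes the frame has a text column called `name` (the columns of `tedtalks2018.csv` are not shown here, so that is a guess), and `find_speaker` is a hypothetical helper.

```python
import pandas as pd


def find_speaker(df, fragment, column="name"):
    """Return the rows whose `column` contains `fragment`, ignoring case.

    regex=False treats the fragment as plain text; na=False skips
    missing values instead of propagating them into the mask.
    """
    mask = df[column].str.contains(fragment, case=False, na=False, regex=False)
    return df.loc[mask]
```

Something like `find_speaker(df, "Haseltine")` would then show whether any row still mentions him.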