## Getting the Pages that List Speakers

After some rooting around on the TED website and examining the Speakers home page, it looks like there are 86 pages that list TED speakers, with URLs like this: `https://www.ted.com/speakers?page=1`. My plan is to:

1. create a list of pages in a text file,
2. download the 86 pages,
3. parse the speakers out,
4. create a second list of pages with the `ted.com/speakers/speaker_name` format,
5. download all those pages,
6. parse them into a CSV for KK.

Oh, I hope this works.

There are going to be a lot of files created in this process, so it's important to remember where I am:

In [42]:
% pwd

'/Users/john/Code/tedtalks/data'

### Step 1: Create a list of pages

In [5]:
# This is just proof of concept. I actually used range(1,87) to get
# the list I needed and pasted it into a text document.
for i in range(1,5):
    print("https://www.ted.com/speakers?page=" + str(i))

https://www.ted.com/speakers?page=1
https://www.ted.com/speakers?page=2
https://www.ted.com/speakers?page=3
https://www.ted.com/speakers?page=4
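
Rather than pasting the printed list into a text document by hand, the same loop could write the file directly. This is only a sketch: the function name `write_index_urls` and its parameters are invented for illustration, while the URL pattern and the count of 86 pages come from the notes above.

```python
from pathlib import Path

BASE_URL = "https://www.ted.com/speakers?page="


def write_index_urls(out_path, n_pages=86):
    """Write one speaker-index URL per line and return the list of URLs."""
    # Pages are numbered from 1, so the range has to end at n_pages + 1.
    urls = [f"{BASE_URL}{i}" for i in range(1, n_pages + 1)]
    Path(out_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls
```

Calling `write_index_urls("speaker_index_pages.txt")` would then produce the list in one go, with the output path being whatever location suits the project.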


### Step 2: Download the pages

Okay, the text file is `speaker_index_pages.txt` in the `data/speakers/` directory. From inside the `indices` directory, I used the following to download all 86 speaker pages:

    wget -w 2 -i ../speaker_index_pages.txt
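
For reference, `-i` tells `wget` to read its URLs from a file and `-w 2` makes it wait two seconds between retrievals. If `wget` isn't available, roughly the same thing can be done from Python. The sketch below is a stand-in under stated assumptions, not what I actually ran: `download_pages`, the `page_NNN.html` file names, and the swappable `fetch` argument are all made up for illustration.

```python
import time
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Return the raw bytes at url (plain urllib, no retries)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_pages(url_file, out_dir, delay=2.0, fetch=fetch_url):
    """Download every URL listed in url_file into out_dir, pausing between requests."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Skip blank lines so a trailing newline in the list is harmless.
    lines = Path(url_file).read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    saved = []
    for n, url in enumerate(urls, start=1):
        target = out_dir / f"page_{n:03d}.html"  # hypothetical naming scheme
        target.write_bytes(fetch(url))
        saved.append(target)
        if n < len(urls):
            time.sleep(delay)  # pause between requests, like wget -w 2
    return saved
```

Passing a different `fetch` function makes it possible to try the loop out without touching the network.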

Then, being lazy and not wanting to figure out how to parse all the files into one list (see below), I simply concatenated all the files into one, with the plan to use `BeautifulSoup` to run through it.

    cat indices/* > speakers_all_pages.txt

### Step 3: Parse the Speaker URLs Out of the Pages

Inside the HTML, the link to each speaker's profile appears in an anchor tag like this one:

    <a class="results__result media media--sm-v m4" href="/speakers/ellen_t_hoen">

So we need the `href` attribute of every `a` tag that has the `results__result` class.

In [8]:
from bs4 import BeautifulSoup

with open("./speakers/speakers_all_pages.txt", "r") as the_file:
    the_soup = BeautifulSoup(the_file, "lxml")
speaker_suffix = the_soup.find('a', {'class':'results__result'})
print(speaker_suffix)

<a class="results__result media media--sm-v m4" href="/speakers/ellen_t_hoen">
<div class="media__image media__image--thumb">
<span class="thumb thumb--square"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" play="false" src="https://pi.tedcdn.com/r/pe.tedcdn.com/images/ted/bcd3208945bf8b311418b64f917b1f38e84a24e8_800x600.jpg?h=191&amp;w=254"/><span class="thumb__aligner"></span></span></span></span>
</div>
<div class="media__message">
<h4 class="h7 m5">
Ellen<br/>'t Hoen</h4>
<p class="p4">
<strong>Medicine law expert</strong>
</p>
</div>
</a>


In [10]:
with open("./speakers/speakers_all_pages.txt", "r") as the_file:
    the_soup = BeautifulSoup(the_file, "lxml")
speaker_suffix = the_soup.find('a', {'class':'results__result'})['href']
print(speaker_suffix)

/speakers/ellen_t_hoen


In [15]:
type(speaker_suffix)

str

In [16]:
with open("./speakers/speakers_all_pages.txt", "r") as the_file:
    the_soup = BeautifulSoup(the_file, "lxml")
speaker_suffixes = the_soup.find_all('a', {'class':'results__result'})

In [41]:
len(speaker_suffixes)

30

Why is this only returning 30 items?
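
A likely explanation, though I haven't verified it, is that the parser treats the concatenated file as a single HTML document and stops paying attention after the first page's closing `</html>`, which would leave only the speakers from one index page. One way around that is to parse each downloaded page on its own and pool the results. The sketch below makes several assumptions: the class and function names are hypothetical, and it uses the standard library's `html.parser` instead of `BeautifulSoup` only so that it runs with no extra installs.

```python
from html.parser import HTMLParser
from pathlib import Path


class SpeakerLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list includes results__result."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        classes = (attributes.get("class") or "").split()
        if "results__result" in classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def speaker_hrefs(html_text):
    """Return the speaker hrefs found in one chunk of HTML."""
    parser = SpeakerLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


def collect_speaker_hrefs(index_dir):
    """Parse each file in index_dir separately and pool the hrefs in file-name order."""
    hrefs = []
    for path in sorted(Path(index_dir).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            hrefs.extend(speaker_hrefs(text))
    return hrefs
```

Assuming the index pages live in `./speakers/indices`, `len(collect_speaker_hrefs("./speakers/indices"))` should come out well above 30 if this explanation is right.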

In [22]:
print(speaker_suffixes[6])

<a class="results__result media media--sm-v m4" href="/speakers/marc_abrahams">
<div class="media__image media__image--thumb">
<span class="thumb thumb--square"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" play="false" src="https://pi.tedcdn.com/r/pe.tedcdn.com/images/ted/6ce52170e488499f44d4145454d54b247325bc8b_800x600.jpg?h=191&amp;w=254"/><span class="thumb__aligner"></span></span></span></span>
</div>
<div class="media__message">
<h4 class="h7 m5">
Marc<br/>Abrahams</h4>
<p class="p4">
<strong>Science humorist</strong>
</p>
</div>
</a>


In [28]:
suffixes = [i.attrs["href"] for i in speaker_suffixes[0:3]]
print(suffixes)

['/speakers/ellen_t_hoen', '/speakers/sandra_aamodt', '/speakers/trevor_aaronson']


In [30]:
suffixes = [i.attrs["href"] for i in speaker_suffixes]

In [40]:
len(suffixes)

30

In [34]:
urls = [str("https://www.ted.com"+suffix) for suffix in suffixes[0:2]]
print(urls)

['https://www.ted.com/speakers/ellen_t_hoen', 'https://www.ted.com/speakers/sandra_aamodt']


In [36]:
urls = [str("https://www.ted.com"+suffix) for suffix in suffixes]

In [39]:
len(urls)

30

In [38]:
with open('./speakers/speaker_urls.txt', 'w') as f:
    for item in urls:
        f.write("%s\n" % item)
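
Plain string concatenation works here because every suffix starts with `/speakers/`. If the list ever picks up duplicates or links that are already absolute, something like the following might be safer. It is a sketch only: `build_speaker_urls` is a hypothetical helper, and whether duplicates actually occur in the TED pages is something I haven't checked.

```python
from urllib.parse import urljoin


def build_speaker_urls(suffixes, base="https://www.ted.com"):
    """Join each suffix onto base, dropping repeats while keeping the original order."""
    seen = set()
    urls = []
    for suffix in suffixes:
        url = urljoin(base, suffix)  # leaves already-absolute URLs untouched
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The result can be written out with the same loop as above.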

In [None]:
import re
import csv
import os
from bs4 import BeautifulSoup

# name: <h1 class="h2 profile-header__name">
# occupation: <div class="p2 profile-header__summary">
# intro: <div class="profile-intro">
# profile: <div class="section section--minor">


def parse(soup):
    # each field lives in its own tag on the speaker's profile page.
    name = soup.find('h1', {'class' : "h2 profile-header__name"}).text.strip('\n')
    occupation = soup.find('div', {'class' : "p2 profile-header__summary"}).text.strip('\n')
    intro = soup.find('div', {'class' : "profile-intro"}).text.strip('\n')
    profile = soup.find('div', {'class' : "section section--minor"}).text.strip('\n')
    return name, occupation, intro, profile

def to_csv(pth, out):
    # open file to write to; newline="" lets the csv module control line endings.
    with open(out, "w", newline="") as out_file:
        # create csv.writer.
        wr = csv.writer(out_file)
        # write our headers, in the same order that parse() returns its values.
        wr.writerow(["name", "occupation", "intro", "profile"])
        # get all our html files.
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                print(html)
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

# This is the ACTION:
to_csv("./html_files/speakers/","speakers.csv")