Capstone

Scrape this page to get a list of TED talk URLs:

https://www.ted.com/talks?language=en&page=1&sort=newest

Another notebook will then import the CSV file of URLs and scrape the actual transcripts.

In [8]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

In [9]:
# Page containing a grid of links to TED talks
#https://www.ted.com/talks?language=en&page=1&sort=newest

base_url = 'https://www.ted.com/talks'
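
For reference, here is a minimal sketch of how the query parameters combine with base_url to give the listing URL shown at the top. The helper name listing_url is hypothetical and not part of the notebook; in the scraping loop, requests does this encoding itself through its params argument.

```python
from urllib.parse import urlencode

base_url = 'https://www.ted.com/talks'

def listing_url(page_number):
    """Build the URL of one listing page (hypothetical helper, for illustration only)."""
    query = urlencode({'language': 'en', 'page': page_number, 'sort': 'newest'})
    return f'{base_url}?{query}'

print(listing_url(1))
```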


In [10]:
list(range(1, 3))

[1, 2]

Create a record for each talk containing:
* Title
* Speaker
* URL
* Date (stored as separate month and year fields)


In [13]:
all_talks = []

# There are 122 pages that contain the links and titles to each TED talk.
for page_number in range(1, 123):

    # Query string parameters for the listing page
    params = {'language': 'en', 'page': page_number, 'sort': 'newest'}

    resp = requests.get(base_url, params=params)
    
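    # Stop at the first non-200 response (for example, 429 when rate limited).
    # Talks collected so far remain in all_talks.
    if resp.status_code != 200:
        print(f'Get failed. Status Code: {resp.status_code} Page: {page_number} ')
        break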
    
    soup = BeautifulSoup(resp.text, 'html.parser')


    # Each talk's information is contained in a <div class="media__message"> tag
    for talk_container in soup.find_all('div', class_='media__message'):

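        # The first <h4> holds the speaker; the next sibling <h4> holds the title link
        speaker_tag = talk_container.find('h4')
        speaker = speaker_tag.text.strip()
        talk_info = speaker_tag.find_next_sibling('h4')

        # Strip surrounding whitespace, including newline characters
        title = talk_info.text.strip()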

        # extract relative url for later scraping
        url = talk_info.find('a')['href']

        # Every talk has a date in a string format that looks like 'Jan 2021'
        talk_date_string = talk_container.find('span', class_='meta__val').text.strip('\n')

        # Split 'Jan 2021' into month 'Jan' and year '2021'
        month, year = talk_date_string.split()

        talk_row = {
            'title': title,
            'speaker': speaker,
            'url': url,
            'month': month,
            'year': year,
        }

        all_talks.append(talk_row)

    print(f'Page {page_number} done.')

    # TED's rate-limit (429) response carries 'Retry-After': '5',
    # so wait 6 seconds between requests to be on the safe side.
    time.sleep(6)

        

Page 1 done.
Page 2 done.
Page 3 done.
Page 4 done.
Page 5 done.
Page 6 done.
Page 7 done.
Page 8 done.
Page 9 done.
Page 10 done.
Page 11 done.
Page 12 done.
Page 13 done.
Page 14 done.
Page 15 done.
Page 16 done.
Page 17 done.
Page 18 done.
Page 19 done.
Page 20 done.
Page 21 done.
Page 22 done.
Page 23 done.
Page 24 done.
Page 25 done.
Page 26 done.
Get failed. Status Code: 429 Page: 27 


In [15]:
resp.content

b'\n<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html>\n  <head>\n    <title>429 Rate Limited too many requests.</title>\n  </head>\n  <body>\n    <h1>Error 429 Rate Limited too many requests.</h1>\n    <p>Rate Limited too many requests.</p>\n    <h3>Guru Meditation:</h3>\n    <p>XID: 2430258188</p>\n    <hr>\n    <p>Varnish cache server</p>\n  </body>\n</html>\n'

In [19]:
resp.headers

{'Connection': 'keep-alive', 'Content-Length': '455', 'Content-Type': 'text/html; charset=utf-8', 'Retry-After': '5', 'Accept-Ranges': 'bytes', 'Age': '0', 'Strict-Transport-Security': 'max-age=31536001', 'Date': 'Thu, 11 Feb 2021 06:57:36 GMT', 'Via': '1.1 varnish', 'X-Served-By': 'cache-bwi5147-BWI, cache-lax10639-LGB', 'X-Cache': 'MISS, MISS', 'X-Cache-Hits': '0, 0', 'Set-Cookie': '_nu=1613026656; Expires=Tue, 10 Feb 2026 06:57:36 GMT; path=/, _abby=VvuVLzliF7ix6C8; Expires=Tue, 10 Feb 2026 06:57:36 GMT; Path=/; Domain=.ted.com, _abby_hero_form=b; Expires=Thu, 25 Feb 2021 06:57:36 GMT; Path=/'}
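
Since the 429 response carries a Retry-After header, one option is to wait and retry instead of stopping the whole scrape. The following is only a sketch of that idea, not what this notebook ran: get_with_retry, its parameters, and the one-second margin are assumptions. The fetch argument is any zero-argument callable that returns an object with status_code and headers, for example lambda: requests.get(base_url, params=params).

```python
import time

def get_with_retry(fetch, max_attempts=5, default_wait=5.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or attempts run out.

    Hypothetical helper: fetch is a zero-argument callable returning an
    object with .status_code and .headers (such as a requests response).
    """
    resp = None
    for attempt in range(max_attempts):
        resp = fetch()
        if resp.status_code != 429:
            return resp
        # Retry-After may be missing or not a plain number of seconds;
        # fall back to default_wait in either case.
        try:
            wait = float(resp.headers.get('Retry-After', default_wait))
        except ValueError:
            wait = default_wait
        # Only sleep if another attempt follows; add one second of margin.
        if attempt < max_attempts - 1:
            sleep(wait + 1)
    # Still rate limited after max_attempts; the caller decides what to do.
    return resp
```

Inside the page loop this could replace the direct requests.get call, with the existing status check kept as a final safeguard. Whether the server keeps honoring its own Retry-After value over a long run is not something this notebook verifies.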

In [12]:
talks_df = pd.DataFrame(all_talks)
talks_df.head()

Unnamed: 0,title,speaker,url,month,year
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021
3,The artist who won a Nobel Prize... in medicine,Melanie E. Peffer,/talks/melanie_e_peffer_the_artist_who_won_a_n...,Feb,2021
4,A concrete idea to reduce carbon emissions,Karen Scrivener,/talks/karen_scrivener_a_concrete_idea_to_redu...,Feb,2021
...,...,...,...,...,...
67,Why is pneumonia so dangerous?,Eve Gaus and Vanessa Ruiz,/talks/eve_gaus_and_vanessa_ruiz_why_is_pneumo...,Nov,2020
68,The city planting a million trees in two years,Yvonne Aki-Sawyerr,/talks/yvonne_aki_sawyerr_the_city_planting_a_...,Nov,2020
69,"How to come out at work, about anything",Micah Eames,/talks/micah_eames_how_to_come_out_at_work_abo...,Nov,2020
70,How reverse mentorship can help create better ...,Patrice Gordon,/talks/patrice_gordon_how_reverse_mentorship_c...,Nov,2020
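
The month and year columns can be recombined into a real date when needed. Below is a small sketch using only the standard library; parse_talk_date is a hypothetical helper, and it assumes English month abbreviations such as 'Feb' and the default locale.

```python
from datetime import datetime

def parse_talk_date(month, year):
    """Turn ('Feb', '2021') into a date set to the first day of that month."""
    return datetime.strptime(f'{month} {year}', '%b %Y').date()

print(parse_talk_date('Feb', '2021'))
```

On the DataFrame itself, pd.to_datetime with format='%b %Y' applied to the concatenated month and year strings should give the same result, assuming both columns are stored as strings.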


In [20]:
# Save to csv - this path exists on my Google Colab
#talks_df.to_csv('./ted/talk_list.csv', index=False)
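
The url column holds relative paths, so the transcript notebook will need to turn them into absolute URLs after reading the CSV. Here is a minimal sketch of that step, assuming the column names above; absolute_talk_urls, SITE_ROOT, and the sample row are hypothetical and used only for illustration.

```python
import csv
import io
from urllib.parse import urljoin

SITE_ROOT = 'https://www.ted.com'  # assumed site root for the relative paths

def absolute_talk_urls(csv_file):
    """Read talk rows from an open CSV file and return absolute talk URLs."""
    reader = csv.DictReader(csv_file)
    return [urljoin(SITE_ROOT, row['url']) for row in reader]

# Made-up sample row; real data would come from the saved talk_list.csv
sample = io.StringIO(
    'title,speaker,url,month,year\n'
    'Example talk,Example Speaker,/talks/example_speaker_example_talk,Feb,2021\n'
)
print(absolute_talk_urls(sample))
```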