Capstone

The list of TED talk URLs was scraped from this page:

https://www.ted.com/talks?language=en&page=1&sort=newest

This notebook imports that CSV list of URLs and scrapes the actual transcripts.

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import re
import requests
import time

In [3]:
# Load the CSV of talks, which contains one url for each TED talk
list_of_talks = pd.read_csv('talk_list.csv')
list_of_talks.head()

Unnamed: 0,title,speaker,url,month,year
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021
3,The artist who won a Nobel Prize... in medicine,Melanie E. Peffer,/talks/melanie_e_peffer_the_artist_who_won_a_n...,Feb,2021
4,A concrete idea to reduce carbon emissions,Karen Scrivener,/talks/karen_scrivener_a_concrete_idea_to_redu...,Feb,2021


In [4]:
list_of_talks.shape

(4384, 5)

In [5]:
# Each TED talk transcript is a url in the form of:
#https://www.ted.com/talks/rahwa_ghirmatzion_and_zelalem_adefris_community_powered_solutions_to_the_climate_crisis/transcript?language=en

base_url = 'https://www.ted.com'

list_of_talks.shape[0]

4384

In [6]:
# Current url needs to be transformed a bit to get it into the right format to retrieve each separate transcript
list_of_talks.iloc[row_index]['url'] 

'/talks/jack_dangermond_how_a_geospatial_nervous_system_could_help_us_design_a_better_future?language=en'

In [7]:
# Add /transcript to the end of each url but insert it before ?language=en
list_of_talks['url'] = list_of_talks['url'].str.replace('?language', '/transcript?language', regex=False)
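
The URL rewrite above can be checked on a small sample before running it over all rows. This is a minimal sketch: the two paths below are made-up placeholders in the same shape as the `url` column, not real talks. It shows why `regex=False` matters: without it, `?` would be read as a regex quantifier instead of a literal character.

```python
import pandas as pd

# Hypothetical sample paths shaped like the 'url' column (not real talks)
urls = pd.Series([
    '/talks/example_speaker_example_title?language=en',
    '/talks/another_speaker_another_title?language=en',
])

# regex=False treats '?language' as a literal string, not a pattern
transcript_urls = urls.str.replace('?language', '/transcript?language', regex=False)

# Prepend the site root to get the full transcript address
base_url = 'https://www.ted.com'
full_urls = (base_url + transcript_urls).tolist()
print(full_urls[0])
```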

In [21]:
#for row_index in range(list_of_talks.shape[0]):
for row_index in range(641,642):
    # If this transcript was already found, skip it before making a request - this way I can re-run
    # this code to try to fill in blank transcripts that might have failed the first time.
    # The column check avoids a KeyError on the first run, before 'transcript' exists.
    if 'transcript' in list_of_talks.columns and pd.notna(list_of_talks.at[row_index, 'transcript']):
        continue

    each_talk_url = list_of_talks.iloc[row_index]['url']

    resp = requests.get(base_url + each_talk_url)
    
    if resp.status_code != 200:
        print(f'Get failed. Status Code: {resp.status_code}')
        print(f'URL failed: {base_url + each_talk_url}')
        continue
    
    soup = BeautifulSoup(resp.text, 'html.parser')
    #print (soup.title)

    #text_block = soup.select('div', class_ = ['d:n', 'f-w:700', 'f:.9', 'f:1@xxl', 'c:white']) 
    
    full_transcript = ''
    
    full_transcript = ''
    
    # Find tags for this talk, which are stored in a script tag with the format <script data-spec="q">
    
    tags = soup.body.find("script", {'data-spec': "q"})
    
    script_text = str(tags)      # convert from Tag object to str, so we can slice it up and extract list of talk tag values
    tags_pos = script_text.find('"tags":')
    
    # find() returns -1 when the script tag or the "tags" key is missing; skip the slice in that case
    if tags_pos != -1:
        script_start = tags_pos + 8   # add 8 to get past "tags":[ 
        script_end = script_text.find(']', script_start)   # list of tags ends with a ]
        tag_list = script_text[script_start:script_end].replace('"', "")  # remove double quotes from string
        #tag_list = tag_list.split(',')  # convert to list of values

        list_of_talks.at[row_index, 'tags'] = tag_list
    
    for tag in soup.find_all('div'):
        
        # if tag has all matching attributes of class 
        if tag.has_attr( "class" ):
            c = tag['class']
            
            # this could work for talk views
            if 'f-w:700' in c and 'f:.9' in c and 'f:1@xxl' in c:
                temp_views = tag.text
                
                temp_views = temp_views.replace('\n', ' ')
                temp_views = temp_views.replace('\t', '')
                
                temp_views = " ".join(temp_views.split())
                
                list_of_talks.at[row_index, 'views'] = temp_views
    
#                 print (f'Div name: {tag.name} ')
#                 print(f"Class name: {tag['class']}")
#                 print(f'Text:{tag.text[0:200]} END OF TEXT')
                
                
            # Opening paragraph in transcript
#             if  'p:2' in c and 'p-t:4@md' in c:
#                 print (f'Transcript Open Div name: {tag.name} ')
#                 print(f" Transcript Open Class name: {tag['class']}")
#                 print(f'Transcript Open Text:{tag.text[0:200]} END OF TEXT')
                
            # every paragraph of main transcript text
            if  'Grid__cell' in c and 'flx-s:1' in c and 'p-r:4' in c:
                # Talk paragraphs found in this <div>
                # class=" Grid__cell flx-s:1 p-r:4 "
                               
                print (f'Transcript Div name: {tag.name} ')
                print(f" Transcript Class name: {tag['class']}")
                print(f'Transcript Text:{tag.text[0:200]} END OF TEXT')
                
                full_transcript = full_transcript + tag.text
    
    # Strip out tabs and new line characters
    full_transcript = full_transcript.replace('\n', ' ')
    full_transcript = full_transcript.replace('\t', '')
    
    
    full_transcript = " ".join(full_transcript.split())
    
    list_of_talks.at[row_index, 'transcript'] = full_transcript
    
    if row_index % 500 == 0:
        print (f'{row_index} transcripts downloaded.')
        
    # TED's servers have a 'Retry-After': '5' value, so let's wait 6 seconds in between requests to be on the safe side.
    # Wait 6 sec before next request
    time.sleep(6)
        

        
print (f'{row_index + 1} transcripts downloaded. End of downloading.')
        

  



642 transcripts downloaded. End of downloading.
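
The tag extraction in the loop slices the script text by character offsets. An alternative sketch is shown below, assuming the script text contains a JSON array of strings under a `"tags"` key, as the slicing code implies. `extract_tags` is a hypothetical helper, and the sample string is invented for illustration rather than copied from a TED page.

```python
import json
import re

def extract_tags(script_text):
    """Return the list under the first "tags":[...] key, or [] if it is absent."""
    # Non-greedy match up to the first ']', the same stopping rule as the slicing code
    match = re.search(r'"tags"\s*:\s*(\[.*?\])', script_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The matched text was not valid JSON, so report no tags
        return []

# Invented sample text, only shaped like the data the loop expects
sample = '{"id": 1, "tags":["history","TED-Ed"], "title": "x"}'
print(extract_tags(sample))    # ['history', 'TED-Ed']
print(extract_tags('None'))    # []
```

Returning a list keeps tag names that contain spaces intact. As with the slicing code, a tag that itself contains `]` would cut the match short.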


In [13]:
list_of_talks

Unnamed: 0,title,speaker,url,month,year,tags,views,transcript
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021,,,
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021,,,
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021,,,
3,The artist who won a Nobel Prize... in medicine,Melanie E. Peffer,/talks/melanie_e_peffer_the_artist_who_won_a_n...,Feb,2021,,,
4,A concrete idea to reduce carbon emissions,Karen Scrivener,/talks/karen_scrivener_a_concrete_idea_to_redu...,Feb,2021,,,
...,...,...,...,...,...,...,...,...
4379,The best stats you've ever seen,Hans Rosling,/talks/hans_rosling_the_best_stats_you_ve_ever...,Jun,2006,,,
4380,Do schools kill creativity?,Sir Ken Robinson,/talks/sir_ken_robinson_do_schools_kill_creati...,Jun,2006,,,
4381,Greening the ghetto,Majora Carter,/talks/majora_carter_greening_the_ghetto/trans...,Jun,2006,,,
4382,Simplicity sells,David Pogue,/talks/david_pogue_simplicity_sells/transcript...,Jun,2006,,,


In [18]:
list_of_talks.iloc[641]

title                          The first and last king of Haiti
speaker                                            Marlene Daut
url           /talks/marlene_daut_the_first_and_last_king_of...
month                                                       Oct
year                                                       2019
tags          animation,TED-Ed,history,world cultures,educat...
views                                      430,476 views • 4:48
transcript    The royal couple of Haiti rode into their coro...
Name: 641, dtype: object
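
The `views` value above holds both the view count and the talk length in one string. A possible way to split it into numbers is sketched here. `parse_views` is a hypothetical helper, and it assumes every value follows the `'<count> views • <m:ss>'` shape seen in row 641, which has not been checked against the other rows.

```python
import re

def parse_views(views_text):
    """Split a string like '430,476 views • 4:48' into (view_count, duration_seconds)."""
    # View count: digits and thousands separators right before the word 'views'
    count_match = re.search(r'([\d,]+)\s+views', views_text)
    views = int(count_match.group(1).replace(',', '')) if count_match else None

    # Duration: m:ss or h:mm:ss at the end of the string, folded into seconds
    duration = None
    time_match = re.search(r'(\d+(?::\d{2})+)\s*$', views_text)
    if time_match:
        seconds = 0
        for part in time_match.group(1).split(':'):
            seconds = seconds * 60 + int(part)
        duration = seconds

    return views, duration

print(parse_views('430,476 views • 4:48'))   # (430476, 288)
```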

In [10]:
list_of_talks.head(11)

Unnamed: 0,title,speaker,url,month,year,tags,views,transcript
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021,,,
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021,,,
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021,,,
3,The artist who won a Nobel Prize... in medicine,Melanie E. Peffer,/talks/melanie_e_peffer_the_artist_who_won_a_n...,Feb,2021,,,
4,A concrete idea to reduce carbon emissions,Karen Scrivener,/talks/karen_scrivener_a_concrete_idea_to_redu...,Feb,2021,,,
5,How a green economy could work for you,Angela Francis,/talks/angela_francis_how_a_green_economy_coul...,Feb,2021,,,
6,"Why didn't this 2,000 year old body decompose?",Carolyn Marshall,/talks/carolyn_marshall_why_didn_t_this_2_000_...,Feb,2021,,,
7,How technology changes our sense of right and ...,Juan Enriquez,/talks/juan_enriquez_how_technology_changes_ou...,Feb,2021,,,
8,"TikTok, Instagram, Snapchat — and the rise of ...",Qiuqing Tai,/talks/qiuqing_tai_tiktok_instagram_snapchat_a...,Feb,2021,,,
9,What if every satellite suddenly disappeared?,Moriba Jah,/talks/moriba_jah_what_if_every_satellite_sudd...,Feb,2021,,,


In [11]:
#list_of_talks['tags'] = list_of_talks['tags'].str.split(',')
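
Splitting the comma-separated `tags` strings needs the `.str` accessor, since a pandas Series has no `split` method of its own. A minimal sketch on made-up values follows; the sample tags are placeholders, and rows without tags are assumed to be missing values as in the table above.

```python
import pandas as pd

# Made-up tag strings in the same comma-separated format, with one missing row
tags = pd.Series(['animation,TED-Ed,history', None, 'climate change,science'])

# .str.split works element by element and leaves missing rows missing
tag_lists = tags.str.split(',')
print(tag_lists.tolist())

# explode() gives one row per tag, which makes counting straightforward
tag_counts = tag_lists.explode().value_counts()
print(tag_counts)
```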

Create a record for each talk containing:
* Title
* Speaker
* URL
* Date


In [12]:
# Save to csv - this path exists on my Google Colab
list_of_talks.to_csv('./transcripts.csv', index=False)

PermissionError: [Errno 13] Permission denied: './transcripts.csv'
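
The cause of the `PermissionError` above is not shown in the notebook. It may be a read-only working directory or a file that is open in another program. One way to avoid losing the scraped data in that case is to fall back to a writable location. The sketch below is an assumption-laden example: `save_with_fallback` is a hypothetical helper, and the temp directory is just one possible fallback.

```python
import os
import tempfile

import pandas as pd

def save_with_fallback(df, preferred_path, fallback_dir=None):
    """Try preferred_path first; on PermissionError write to fallback_dir instead."""
    try:
        df.to_csv(preferred_path, index=False)
        return preferred_path
    except PermissionError:
        # Assumed fallback: the system temp directory, which is normally writable
        fallback_dir = fallback_dir or tempfile.gettempdir()
        fallback_path = os.path.join(fallback_dir, os.path.basename(preferred_path))
        df.to_csv(fallback_path, index=False)
        return fallback_path

# Tiny placeholder frame standing in for list_of_talks
demo = pd.DataFrame({'title': ['Example talk'], 'transcript': ['Example text']})
saved_path = save_with_fallback(demo, os.path.join(tempfile.gettempdir(), 'transcripts_demo.csv'))
print(saved_path)
```

The function returns the path it actually wrote to, so a later cell can read the file back from the right place.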

In [None]:
list_of_talks.isnull().sum()