# TEDx Scraper


This scraper downloads more than 3,300 talks from www.ted.com
to create a dataset for research purposes.

The main dataset is composed of the following attributes:
- unique id
- details
- posted
- main_speaker
- event
- title
- num_views
- url

The tags dataset is composed of the following attributes (linked 1-n with the main dataset):
- unique id
- tag
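
The 1-n link between a talk and its tags can be illustrated with a small sketch. The sample rows are made up, the key name `idx` follows the code cells further down, and `tags_by_talk` is a hypothetical helper that is not part of the notebook.

```python
from collections import defaultdict

# Made-up sample rows, shaped like the main dataset and the tags dataset
talks = [{"idx": "a1", "title": "First talk"}, {"idx": "b2", "title": "Second talk"}]
tags = [
    {"idx": "a1", "tag": "science"},
    {"idx": "a1", "tag": "art"},
    {"idx": "b2", "tag": "music"},
]

def tags_by_talk(tag_rows):
    """Group the tag rows by talk id: one talk id maps to many tags."""
    grouped = defaultdict(list)
    for row in tag_rows:
        grouped[row["idx"]].append(row["tag"])
    return dict(grouped)

grouped = tags_by_talk(tags)
for talk in talks:
    print(talk["title"], grouped.get(talk["idx"], []))
```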

The "watch next" dataset is composed of the following attributes:
- id
- watch_next_id


The notebook is organized into the following sections:

- Setup of the env (install libraries, set up variables and credentials, ...)
- Download of the index (with Selenium and Chrome Browser libraries)
- Parse the DOM of the web pages and download each individual TEDx
- Store the data in CSV files

## Setup of the env

Install and import the Python libraries.

In [21]:
!pip3 install selenium
!pip3 install pandas



In [22]:
import requests
import pandas as pd
import time
from selenium import webdriver as wd
import selenium
import json
import hashlib 

Set the following variables to control the download:

- max_page: maximum number of index pages to loop over (set to -1 to download all the pages)
- sleep_time: number of seconds to wait between requests, to be polite with TEDx
- log: set to True to print the URL of each talk while it is scraped
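
The `-1` convention for `max_page` could be expressed as a small iterator. This is only a sketch: `page_numbers` is a hypothetical helper that does not exist in the notebook.

```python
import itertools

def page_numbers(max_page):
    """Return an iterator over page numbers: 0..max_page-1, or endless if max_page is -1."""
    if max_page == -1:
        return itertools.count()
    return iter(range(max_page))

# With a limit, the iterator stops by itself
print(list(page_numbers(3)))
# Without a limit, the caller decides when to stop (here: after 3 pages)
print(list(itertools.islice(page_numbers(-1), 3)))
```

In both calls the pages visited are 0, 1 and 2. In the unlimited case the scraping loop has to stop on its own, for example when the index has no "next" link.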




In [23]:
max_page = 142
sleep_time = 1
log = False

This notebook uses Chrome Driver to simulate user interaction with TEDx.
To set up Chrome Driver on your laptop please refer to https://chromedriver.chromium.org/downloads

The notebook is tested with
`ChromeDriver 79.0.3945.36`

Please set `chromedriver_path` to the path of your Chrome Driver executable.
For example:

~~~~~
chromedriver_path =  '/Users/mauropelucchi/Downloads/chromedriver2'
~~~~~
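
Before starting the browser, it may help to check that the configured path really points to an executable file. The sketch below uses only the standard library. `driver_path_ok` is a hypothetical helper and the path in the example is made up.

```python
import os

def driver_path_ok(path):
    """Return True if path is an existing file that the current user can execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)

# Made-up path, for illustration only
if not driver_path_ok('/no/such/folder/chromedriver'):
    print('Chrome Driver not found: check chromedriver_path')
```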

In [24]:
chromedriver_path =  '/Users/lussi/Desktop/chromedriver.exe'

In [25]:
# from https://github.com/MatthewChatham/glassdoor-review-scraper/blob/master/main.py

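def get_browser():
    """Start a Chrome session driven by the Chrome Driver at chromedriver_path."""
    chrome_options = wd.ChromeOptions()
    # Reduce Chrome's console logging
    chrome_options.add_argument('log-level=3')
    # Selenium 3 style call; Selenium 4 passes the driver path through a Service object
    browser = wd.Chrome(chromedriver_path, options=chrome_options)
    return browser

browser = get_browser()

def talks_page():
    """Open the talk index and wait for the page to settle."""
    url = 'https://www.ted.com/talks'
    print(f'Navigate to {url}')
    browser.get(url)
    time.sleep(sleep_time * 4)
#    cookie_btn = browser.find_element_by_id('_evidon-accept-button')
#    cookie_btn.click()
    time.sleep(sleep_time)

talks_page()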

Navigate to https://www.ted.com/talks


## Get TEDx data

The `get_tedx_list` function scrapes the index page currently open in the browser and returns a list of dicts, one per talk, such as:

~~~~
{'main_speaker': 'Alexandra Auer',
  'url': 'https://www.ted.com/talks/alexandra_auer_the_intangible_effects_of_walls_apr_2020',
  'idx': '<MD5 hex digest of the url>',
  ...
}
~~~~

To scrape only the first page of the index:
~~~~
my_tedx_list = get_tedx_list(0)
~~~~

To download all the data:
~~~~
my_tedx_list = get_tedx_all()
~~~~
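
The unique id of each talk (`idx` in the code) is the MD5 hex digest of its URL, so the same URL always maps to the same id. A minimal sketch of that step follows; `talk_idx` is a hypothetical wrapper, since the notebook calls `hashlib` inline.

```python
import hashlib

def talk_idx(url):
    """Return the MD5 hex digest of a talk URL, usable as a stable identifier."""
    return hashlib.md5(url.encode()).hexdigest()

url = 'https://www.ted.com/talks/alexandra_auer_the_intangible_effects_of_walls_apr_2020'
idx = talk_idx(url)
print(len(idx))  # 32 hexadecimal characters
```

Because the id depends only on the URL, the "watch next" links can be matched to talks without any shared counter.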

In [26]:
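def get_tedx(my_tedx):
    """Open a talk page and add its details, tags, views and watch-next links."""
    if log:
        print("Current url: " + my_tedx['url'])
    try:
        browser.get(my_tedx['url'])
        # Description and keywords come from the page's <meta> tags
        my_tedx['details'] = browser.find_element_by_xpath("//meta[@name='description']").get_attribute('content')
        my_tedx['tags'] = browser.find_element_by_xpath("//meta[@name='keywords']") \
            .get_attribute('content') \
            .split(", ")
        try:
            my_tedx['num_views'] = browser.find_elements_by_css_selector(".css-1uodv95")[0].text
        except Exception:
            # View counter not found on the page
            my_tedx['num_views'] = 0
        links = browser.find_elements_by_css_selector(".react-tabs__tab-panel--selected a")
        watch_next = []
        for rel in links:
            url = rel.get_attribute('href')
            # Same MD5-of-URL identifier used for the talks themselves
            idx = hashlib.md5(url.encode()).hexdigest()
            watch_next.append({"url": url, "idx": idx})
        my_tedx['watch_next'] = watch_next
    except Exception as e:
        # Keep going on failure, but report the error when logging is on
        if log:
            print(f"Failed to scrape {my_tedx['url']}: {e}")
    # Make sure later cells always find these keys, even after a failure
    my_tedx.setdefault('tags', [])
    my_tedx.setdefault('watch_next', [])
    return my_tedx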
    
def get_tedx_list(step=0):
    """Return one dict per talk listed on the page currently open in the browser."""
    if step == 0:
        # First step: start from the beginning of the talk index
        url = 'https://www.ted.com/talks'
        browser.get(url)
    print("Current url: " + browser.current_url)
    print(f"Current step: {step}")
    tedxs = browser.find_elements_by_css_selector('#browse-results .col')
    tedxs_number = len(tedxs)
    print(f"Total number of TEDx in this page: {tedxs_number}")
    my_tedx_list = []
    for d_tedx in tedxs:
        my_tedx = {"main_speaker": "", "url": "", "posted": ""}
        my_tedx['main_speaker'] = d_tedx.find_elements_by_css_selector(".talk-link__speaker")[0].text
        # The second .ga-link element of a card carries the title and the talk URL
        link = d_tedx.find_elements_by_css_selector(".ga-link")[1]
        my_tedx['title'] = link.text
        my_tedx['url'] = link.get_attribute('href')
        # The MD5 of the URL serves as the unique id (idx)
        my_tedx['idx'] = hashlib.md5(my_tedx['url'].encode()).hexdigest()
        my_tedx['posted'] = d_tedx.find_elements_by_css_selector(".meta__item")[0].text
        try:
            my_tedx['duration'] = d_tedx.find_elements_by_css_selector(".thumb__duration")[0].text
        except Exception:
            my_tedx['duration'] = 0
        my_tedx_list.append(my_tedx)
        time.sleep(sleep_time)
    return my_tedx_list


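def get_tedx_all():
    """Scrape the index pages, then visit every talk page for its details."""
    my_tedx_list = []
    page = 0
    # max_page == -1 means: keep going until there is no "next" link
    while max_page == -1 or page < max_page:
        my_tedx_list.extend(get_tedx_list(page))
        page += 1
        next_links = browser.find_elements_by_css_selector(".pagination__next")
        get_next = next_links[0].get_attribute('href') if next_links else None
        # Stop on the last index page or once max_page pages have been read
        if not get_next or page == max_page:
            break
        browser.get(get_next)
    my_tedx_list_final = []
    for my_tedx in my_tedx_list:
        my_tedx_list_final.append(get_tedx(my_tedx))
        time.sleep(sleep_time)
    return my_tedx_list_final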
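
A page that fails to load is skipped by the functions above. One possible alternative is to retry a few times before giving up. The sketch below is not part of the notebook: `with_retries` and `flaky` are hypothetical names, and `flaky` only simulates a request that succeeds on the third attempt.

```python
import time

def with_retries(action, attempts=3, delay=1.0):
    """Call action() until it succeeds; re-raise the last error when attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            # Wait before trying again, in the same spirit as sleep_time
            time.sleep(delay)

# Simulated request that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = with_retries(flaky, attempts=3, delay=0)
print(result, calls["n"])
```

In the notebook, the wrapped action would be the part of a function that loads and parses one page.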

## Download and store the data in CSV files



In [27]:
my_tedx_list = get_tedx_all()



In [28]:
print(len(my_tedx_list))

5096


In [29]:
df = pd.DataFrame(my_tedx_list)
# to_csv returns None when writing to a file, so its result is not assigned back to df
df.to_csv('tedx_dataset.csv', columns=['idx','main_speaker','title', 'details','posted', 'url', 'num_views', 'duration'], index=False)

In [30]:
# One row per (talk, tag) pair; talks without tags contribute no rows
tags_dataset = []
for o in my_tedx_list:
    for t in o.get('tags', []):
        tags_dataset.append({"idx": o['idx'], "tag": t})
tags_df = pd.DataFrame(tags_dataset)
tags_df.to_csv('tags_dataset.csv', index=False)

In [31]:
# One row per (talk, suggested talk) pair; talks without suggestions contribute no rows
watch_next_dataset = []
for o in my_tedx_list:
    for t in o.get('watch_next', []):
        watch_next_dataset.append({"idx": o['idx'], "url": t['url'], "watch_next_idx": t['idx']})
watch_next_df = pd.DataFrame(watch_next_dataset)
watch_next_df.to_csv('watch_next_dataset.csv', index=False)
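
Some "watch next" links may point to talks that were never scraped, for example when `max_page` limits the download. A quick consistency check could look like the sketch below. The sample rows are made up and `dangling_links` is a hypothetical helper; the key names follow the cells above.

```python
def dangling_links(talk_rows, watch_next_rows):
    """Return the watch-next rows whose target talk is not among talk_rows."""
    known = {row["idx"] for row in talk_rows}
    return [row for row in watch_next_rows if row["watch_next_idx"] not in known]

# Made-up sample rows
talks = [{"idx": "a1"}, {"idx": "b2"}]
watch_next = [
    {"idx": "a1", "watch_next_idx": "b2"},
    {"idx": "b2", "watch_next_idx": "zz"},  # "zz" was never scraped
]
missing = dangling_links(talks, watch_next)
print(missing)
```

Rows read back from the CSV files with `csv.DictReader` have the same shape and can be passed to the function directly.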