Capstone

Scrape the following page to get a list of TED talk URLs:

https://www.ted.com/talks?language=en&page=1&sort=newest

A separate notebook then imports the CSV list of URLs and scrapes the actual transcripts.
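The listing URL above is paginated through its `page` query parameter. A minimal sketch of generating the listing-page URLs to visit, assuming the other query parameters stay fixed; the helper name `listing_page_url` and the page range are illustrative and not taken from the scraping notebook:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ted.com/talks"


def listing_page_url(page: int) -> str:
    """Build the URL of one listing page (hypothetical helper)."""
    query = urlencode({"language": "en", "page": page, "sort": "newest"})
    return f"{BASE_URL}?{query}"


# First three listing pages only; the real number of pages is not known here.
page_urls = [listing_page_url(page) for page in range(1, 4)]
print(page_urls[0])
```

Each URL in `page_urls` would then be fetched and parsed for talk links; that step depends on the page's HTML and is left out here.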

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MultiLabelBinarizer

In [4]:
# Load list of urls that will contain a url for each TED talk
# list_of_talks = pd.read_csv('talk_list.csv')
# list_of_talks.head()

In [5]:
# Load transcripts file 
transcripts = pd.read_csv('transcripts.csv')
transcripts.head()

Unnamed: 0,title,speaker,url,month,year,tags,views,transcript
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021,"climate change,Countdown,activism,community,so...","472,619 views • 4:32",Don Cheadle: Home. It's where we celebrate our...
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021,"goal-setting,finance,self,money",0 views • 2:41,Everyone's heard of the tired old adage of pay...
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021,"TED-Ed,education,human body,animation,science,...",0 views • 4:51,"Here in this abundant forest, Malassezia is eq..."
3,The artist who won a Nobel Prize... in medicine,Melanie E. Peffer,/talks/melanie_e_peffer_the_artist_who_won_a_n...,Feb,2021,"animation,education,TED-Ed,history,science,bra...","92,822 views • 4:49","In the late 1860s, scientists believed they we..."
4,A concrete idea to reduce carbon emissions,Karen Scrivener,/talks/karen_scrivener_a_concrete_idea_to_redu...,Feb,2021,"Countdown,materials,climate change,innovation,...","605,375 views • 4:26",Concrete is the second most used substance on ...


In [6]:
transcripts.shape

(4384, 8)

In [7]:
transcripts['tags'].nunique()

4250

In [8]:
# Replace NaN with an empty string so that .split() works on every row
transcripts['tags'] = transcripts['tags'].fillna('')

In [9]:
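# Split the comma-separated tag string into a list of tags.
# Note: ''.split(',') returns [''], so an untagged talk gets a list of length 1.
transcripts['tags'] = [tag.split(',') for tag in transcripts['tags']]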

In [10]:
transcripts.iloc[2600]

title         Why giving away our wealth has been the most s...
speaker                                  Bill and Melinda Gates
url           /talks/bill_and_melinda_gates_why_giving_away_...
month                                                       Apr
year                                                       2014
tags                  [activism, business, money, philanthropy]
views                                   4,906,975 views • 25:00
transcript    Chris Anderson: So, this is an interview with ...
Name: 2600, dtype: object

In [11]:
transcripts.isnull().sum()

title           0
speaker         0
url             0
month           0
year            0
tags            0
views         103
transcript    103
dtype: int64

In [12]:
transcripts[transcripts['tags'].map(len) == 1].head()

Unnamed: 0,title,speaker,url,month,year,tags,views,transcript
628,An app that helps incarcerated people stay con...,Marcus Bullock,/talks/marcus_bullock_an_app_that_helps_incarc...,Oct,2019,[],,
632,Revelations from a lifetime of dance,Judith Jamison and members of the Alvin Ailey ...,/talks/judith_jamison_and_members_of_the_alvin...,Oct,2019,[],,
643,Reducing corruption takes a specific kind of i...,Efosa Ojomo,/talks/efosa_ojomo_reducing_corruption_takes_a...,Oct,2019,[],,
664,We need to track the world's water like we tra...,Sonaar Luthra,/talks/sonaar_luthra_we_need_to_track_the_worl...,Sep,2019,[],,
735,Ancient Rome's most notorious doctor,Ramon Glazov,/talks/ramon_glazov_ancient_rome_s_most_notori...,Jul,2019,[],,


Of the 4,384 scraped TED talks, 109 have a tag list of length 1, and 103 of those are also missing both views and transcript. The HTML of those pages probably differs slightly from the rest. These rows account for about 2.5% of the dataset, so they are dropped rather than debugging the scraper.
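The `len == 1` filter works because splitting an empty string still returns a one-element list. A small sketch on made-up rows (the toy data is illustrative only) shows this. It also shows that the length filter would drop a talk that genuinely has a single tag, so filtering on the missing transcript is a possible alternative:

```python
import pandas as pd

# Toy stand-in for transcripts.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "tags": ["science,technology", None, "education"],
    "transcript": ["some text", None, "more text"],
})

# Same preparation as in the notebook: NaN -> '' -> list of tags.
toy["tags"] = [tag.split(",") for tag in toy["tags"].fillna("")]

# ''.split(',') returns [''], so the untagged row has a list of length 1.
tag_list_lengths = toy["tags"].map(len).tolist()

# The length filter also removes talk "C", which has one real tag.
by_length = toy[toy["tags"].map(len) > 1]

# Alternative (an assumption, not the notebook's method):
# drop only the rows whose transcript is missing.
by_transcript = toy.dropna(subset=["transcript"])

print(tag_list_lengths, list(by_length["title"]), list(by_transcript["title"]))
```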

In [13]:
transcripts[transcripts['tags'].map(len) == 1].shape[0]

109

In [14]:
# Keep only talks with more than one tag; this drops the 109 rows counted above
transcripts = transcripts[transcripts['tags'].map(len) > 1]

In [15]:
transcripts.isnull().sum()

title         0
speaker       0
url           0
month         0
year          0
tags          0
views         0
transcript    0
dtype: int64

---
After removing the rows with missing data and converting the **tags** column to lists of tags, create dummy (one-hot) columns so the tags can be counted, for example to see which tags are most popular.
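As a rough illustration of the dummy-column step, here is a sketch on three invented tag lists. It assumes scikit-learn's `MultiLabelBinarizer` as imported at the top of the notebook, and adds a cross-check with pandas `explode` and `value_counts`, which is not part of the original workflow:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Invented tag lists standing in for transcripts['tags'].
tags = pd.Series([
    ["science", "technology"],
    ["science", "education"],
    ["culture", "technology", "science"],
])

mlb = MultiLabelBinarizer()
# One row per talk, one 0/1 column per distinct tag (classes_ is sorted).
dummies = pd.DataFrame(
    mlb.fit_transform(tags), columns=mlb.classes_, index=tags.index
)

# Column sums give the number of talks carrying each tag.
counts = dummies.sum(axis=0).sort_values(ascending=False)

# Cross-check without scikit-learn: one row per (talk, tag), then count.
check = tags.explode().value_counts()

print(counts.to_dict())
```

Both approaches should give the same per-tag totals; the dummy columns are mainly useful if the 0/1 matrix is needed later, for instance as model features.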

In [16]:
mlb = MultiLabelBinarizer()
mlb.fit_transform(transcripts['tags'])

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [17]:
print(f'There are {len(mlb.classes_)} unique tags (categories) associated with the TED talks. ')

There are 459 unique tags (categories) associated with the TED talks. 


A sample of the talk tags looks like this:

In [18]:
list(mlb.classes_)[0:10]

['3D printing',
 'AI',
 'AIDS',
 'Africa',
 "Alzheimer's",
 'Antarctica',
 'Anthropocene',
 'Asia',
 'Audacious Project',
 'Autism spectrum disorder']

In [19]:
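# Reuse the index of transcripts so tags_df stays aligned with it after the row filtering
tags_df = pd.DataFrame(mlb.fit_transform(transcripts['tags']), columns=mlb.classes_, index=transcripts.index)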
tags_df.head()

Unnamed: 0,3D printing,AI,AIDS,Africa,Alzheimer's,Antarctica,Anthropocene,Asia,Audacious Project,Autism spectrum disorder,...,wikipedia,wind energy,women,women in business,work,work-life balance,world cultures,writing,wunderkind,youth
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
tags_counts = pd.DataFrame(tags_df.sum(axis=0), columns=['count'])
tags_counts.head()

Unnamed: 0,count
3D printing,12
AI,82
AIDS,16
Africa,166
Alzheimer's,10


In [21]:
tags_counts.sort_values(by='count', ascending=False, inplace=True)
top_20_tags = tags_counts.head(20)
top_20_tags

Unnamed: 0,count
science,1045
technology,1001
culture,700
TED-Ed,687
animation,616
TEDx,613
global issues,579
society,572
social change,554
education,532
