# TED Data Cleaning
Nathan Walker - February, 2018
Springboard Capstone 1

TED is a nonprofit that holds multiple conferences about Technology, Entertainment, and Design every year [1]. The process of deciding who speaks at a TED event is rather complicated [2-3] and the success of a chosen speaker is not guaranteed. Those who perform well enough have a video of their talk posted to TED's official website. 

After users watch these videos, they can use up to three words to describe the video, some of which are positive (beautiful, courageous, fascinating, funny, informative, ingenious, inspiring, jaw-dropping, persuasive) and some negative (confusing, longwinded, obnoxious, ok, unconvincing). Users are given a search option to view videos that are highly rated in the positive categories (i.e., “Show me videos that are: beautiful” gives a list of the talks that rate high in “beautiful”). “OK” is considered a “negative” rating because it is not featured in this search list.

Because quality content drives the success of TED's official site, being able to predict the success or failure of a TED talk would be quite valuable. TED employs some unknown system to decide which videos should go on their official site, but even among these hand-selected videos, not all receive overwhelmingly positive reviews. This hand-selection process naturally creates some bias against would-be failures, so unsuccessful videos will be under-represented in the data.

__The aim of this project is to create a model that can predict whether a new video will be successful or not, based on some metadata and its transcript.__

Although TED has their own definitions of "success" for videos, for the purposes of this project, videos will be considered "unsuccessful" if they meet one or more of the following three criteria:

1. It has more negative than positive ratings.  
2. The most common rating is negative.  
3. The second-most common rating is negative and no more than 30% smaller than the first.  
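
As a rough illustration of how the three criteria combine, here is a minimal pandas-free sketch. The helper name `is_unsuccessful` and the sample counts are made up for this example, and the comparison in criterion 1 follows the wording of the list above.

```python
# Hypothetical helper: the name and the sample counts are illustrative only.
NEGATIVE = {"confusing", "longwinded", "obnoxious", "ok", "unconvincing"}


def is_unsuccessful(counts):
    """Return True if a talk meets at least one of the three criteria."""
    negatives = sum(n for name, n in counts.items() if name in NEGATIVE)
    positives = sum(n for name, n in counts.items() if name not in NEGATIVE)

    # rating names ordered from most to least common
    ranked = sorted(counts, key=counts.get, reverse=True)
    first, second = ranked[0], ranked[1]

    if negatives > positives:  # criterion 1
        return True
    if first in NEGATIVE:      # criterion 2
        return True
    # criterion 3: runner-up is negative and within 30% of the leader
    return second in NEGATIVE and counts[second] >= 0.7 * counts[first]


well_received = {"inspiring": 100, "funny": 60, "ok": 10, "confusing": 5}
close_call = {"informative": 100, "unconvincing": 75, "funny": 20}

print(is_unsuccessful(well_received))  # False
print(is_unsuccessful(close_call))     # True: 75 is within 30% of 100
```

In the second example only criterion 3 fires: positives still outnumber negatives and the top rating is positive, but the negative runner-up is close behind.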

The model will be created using a combination of user ratings, video metadata, and the video transcript.

The data was procured by a user on Kaggle [4] through scraping TED's website [5] for all of their videos on September 21, 2017. The data consists of two files. One consists of the video metadata (views, speaker, video length, date, ratings, etc.) and the other has a transcript of each video.

To prepare the data for modeling, some preprocessing must take place first. This notebook outlines that process.



## Table of Contents
### File 1: Video Metadata
1. Library and Data Import
2. Data Cleaning

### File 2: Video Transcripts
1. Library and Data Import
2. Data Cleaning

[1] https://www.ted.com/about/our-organization  
[2] http://speaker-nominations.ted.com/  
[3] https://www.ted.com/about/conferences/speaking-at-ted  
[4] https://www.kaggle.com/rounakbanik/ted-talks  
[5] https://www.ted.com/talks

## Metadata: 1. Library and Data Import

The data is in CSV format, so we can load it in and explore with Pandas.  

Let's see what data types we're dealing with and get a preview of the data itself:

In [1]:
# 1. Library and Data Import

# import libraries
import pandas as pd              # CSV loading, dataframes
import matplotlib.pyplot as plt  # Plots
import seaborn as sns            # Pretty Plots
import json                      # Convert ratings column
import re                        # Regex

# import data
df_stats = pd.read_csv('ted_main.csv')

# data preview
print(df_stats.info())
df_stats.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
None


Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292


We can see that we have 2550 videos with 17 columns of information each. There are 6 missing values in the __speaker_occupation__ column. The data types from .info() also show that some _object_ columns should be converted to the _category_ data type and some _int64_ columns to Pandas _Timestamp_.
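
For reference, converting an _object_ column to _category_ might look like the sketch below. The toy dataframe is a stand-in for `df_stats`, and choosing `event` as the column to convert is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for df_stats; the rows are made up.
toy = pd.DataFrame({"event": ["TED2006", "TED2006", "TEDGlobal 2012"]})

# Repeated string labels are stored once and referenced by integer codes
toy["event"] = toy["event"].astype("category")

print(toy["event"].dtype)
print(list(toy["event"].cat.categories))
```

Category columns tend to pay off when a column has few distinct values relative to its length.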

---

## Metadata: 2. Data Cleaning

We've got a few things to fix here, but the data looks fairly clean already.

A. Identify and discard duplicates.  
B. Fill in missing values in __speaker_occupation__.  
C. __related_talks__ is not related to our overall goal, so we'll delete it to reduce clutter.  
D. The __film_date__ and __published_date__ columns are in a Unix format, so we should convert them to Pandas Timestamp.  
E. __tags__ is just a list, so we should convert it from a string to a list.  
F. __event__ has too many options, so we need to summarize and group them.  
G. Create separate columns for each __rating__ category, plus "positive" and "negative" total count columns.  
H. Create "success" column.

Let's take these one at a time:

In [2]:
# A. Identify and discard duplicates.

# identify duplicates based on name
dup_list = df_stats.duplicated(subset='name')

# print index of duplicates (if any)
print("Index of duplicates:", list(dup_list[dup_list == True].index), '\n')
print("Number of duplicates:", len(dup_list[dup_list == True]))

Index of duplicates: [] 

Number of duplicates: 0


In [3]:
# B. Fill in missing data in speaker_occupation.

df_stats.loc[df_stats.speaker_occupation.isna(), 'speaker_occupation'] = 'No Occupation Given'
print("Number of missing rows in speaker_occupation:", sum(df_stats.speaker_occupation.isna()))

Number of missing rows in speaker_occupation: 0


In [4]:
# C. Remove columns: related_talks

to_remove = ['related_talks']

df_stats.drop(columns=to_remove, inplace=True)

# check each removed name against the remaining columns
in_columns = any(col in df_stats.columns for col in to_remove)

print("True/False, related_talks column is in the dataframe:", in_columns)

True/False, related_talks column is in the dataframe: False


In [5]:
# D. Change dates from Unix (seconds from epoch) to Pandas Timestamp

df_stats['film_date'] = pd.to_datetime(df_stats['film_date'], unit='s')
df_stats['published_date'] = pd.to_datetime(
    df_stats['published_date'], unit='s')

print("Updated film_date data type:", type(df_stats.iloc[0].loc['film_date']))
print("Updated film_date example:", df_stats.iloc[0].loc['film_date'], '\n')
print("Updated published_date data type:", type(df_stats.iloc[0].loc['published_date']))
print("Updated published_date example:", df_stats.iloc[0].loc['published_date'])

Updated film_date data type: <class 'pandas._libs.tslib.Timestamp'>
Updated film_date example: 2006-02-25 00:00:00 

Updated published_date data type: <class 'pandas._libs.tslib.Timestamp'>
Updated published_date example: 2006-06-27 00:11:00
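
The conversion can be cross-checked with the standard library alone. This sketch uses the two Unix values from the first row of the preview above and assumes they are seconds since the epoch in UTC, which matches `unit='s'`.

```python
from datetime import datetime, timezone

# Values copied from the first row of the metadata preview
film_date = datetime.fromtimestamp(1140825600, tz=timezone.utc)
published_date = datetime.fromtimestamp(1151367060, tz=timezone.utc)

print(film_date.isoformat())              # 2006-02-25T00:00:00+00:00
print(published_date.isoformat())         # 2006-06-27T00:11:00+00:00
print((published_date - film_date).days)  # 122
```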


In [6]:
# E. Convert tags from string to list

def tag_split(tags):
    """
    Converts single-quote list of strings to an actual list.
    """
    tags = tags.replace("'", "")  # remove single quotes
    # ignore square brackets and split on comma-space (creates list)
    tags = tags[1:-1].split(', ')

    return tags


# Apply to df
df_stats.loc[:, 'tags'] = df_stats.loc[:, 'tags'].apply(tag_split)

# Check type
print("'tag' data type:", type(df_stats.iloc[0].loc['tags']))

'tag' data type: <class 'list'>
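
An alternative worth considering is `ast.literal_eval`, which parses the string as a Python literal instead of stripping quote characters, so apostrophes inside a tag survive. This is a sketch, not the method used above; `parse_tags` is a hypothetical name and the second sample tag is made up.

```python
import ast


def parse_tags(raw):
    """Parse the string form of a Python list into a real list of strings."""
    tags = ast.literal_eval(raw)
    if not isinstance(tags, list):
        raise ValueError(f"expected a list, got {type(tags).__name__}")
    return [str(tag) for tag in tags]


print(parse_tags("['children', 'creativity', 'culture']"))
# A tag containing an apostrophe keeps it (made-up example)
print(parse_tags('["women\'s rights", \'culture\']'))
```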


In [7]:
# F. Regroup event column

# F1. Find all events
df_stats.event.value_counts()

TED2014                                 84
TED2009                                 83
TED2016                                 77
TED2013                                 77
TED2015                                 75
TED2011                                 70
TEDGlobal 2012                          70
TED2007                                 68
TED2010                                 68
TEDGlobal 2011                          68
TED2017                                 67
TEDGlobal 2013                          66
TED2012                                 65
TEDGlobal 2009                          65
TED2008                                 57
TEDGlobal 2010                          55
TEDGlobal 2014                          51
TED2006                                 45
TED2005                                 37
TEDIndia 2009                           35
TED2003                                 34
TEDWomen 2010                           34
TEDSummit                               34
TED2004    

There are __355__ different events to begin with, but there are quite a few patterns.  

After searching through the un-shortened version of the above list, a few things become clear:

1. There are non-TED events listed here. Because this study is interested in TED talks, these non-TED talks should be excluded. According to TED's website [1], all TED events have "TED" in the name, except the "Mission Blue" series. It should also be noted that TED-Ed videos are actually animated video shorts created for online consumption only [2] and should also be excluded.
2. A good list of categories looks like: TED[year], TEDGlobal, TED@[country], TEDx, TEDMED, TEDWomen, TEDYouth, TED Live, Mission Blue, Salon, Summit, and Other.

[1] https://www.ted.com/about/conferences/past-teds  
[2] https://www.ted.com/about/programs-initiatives

In [8]:
# F2. Categorize

def event_categorizer(event):
    """
    Returns event type based on event text.
    """
    import re
    
    if re.search('TED[0-9]{4}', event):
        return 'TED Yearly'
    elif re.search('TEDGl', event):
        return 'TED Global'
    elif re.search('TEDx', event):
        return 'TEDx'
    elif re.search('TED@|TEDIndia|TEDNYC', event):
        return 'TED@'
    elif re.search('Salon', event):
        return 'TED Salon'
    elif re.search('Women', event):
        return 'TED Women'
    elif re.search('TEDYouth', event):
        return 'TED Youth'
    elif re.search('TEDMED', event):
        return 'TEDMED'
    elif re.search('Live', event):
        return 'TED Live'
    elif re.search('Mission Blue', event):
        return 'Mission Blue'
    elif re.search('TEDSummit', event):
        return 'Summit'
    elif re.search('TED-Ed', event):
        return 'Exclude'
    elif re.search('TED', event):
        return 'Other'
    else:
        return 'Exclude'

# apply to df
df_stats['event_type'] = df_stats['event'].apply(event_categorizer)

In [9]:
# F3. Look at new category counts
df_stats.event_type.value_counts()

TED Yearly      978
TEDx            471
TED Global      464
TED@            141
TED Women        96
Exclude          94
TED Salon        79
TEDMED           68
Other            59
Summit           34
Mission Blue     26
TED Live         21
TED Youth        19
Name: event_type, dtype: int64
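
Because the categorizer returns on the first matching pattern, the order of the checks matters: 'TED-Ed' has to be tested before the generic 'TED' fallback. One way to make that ordering explicit is a table of rules. The sketch below uses a shortened subset of the patterns from F2, and `EVENT_RULES` and `categorize` are hypothetical names.

```python
import re

# Ordered (pattern, label) pairs; the first match wins.
# Shortened subset of the rules in F2, for illustration only.
EVENT_RULES = [
    (r"TED[0-9]{4}", "TED Yearly"),
    (r"TEDGl", "TED Global"),
    (r"TEDx", "TEDx"),
    (r"TED-Ed", "Exclude"),  # must come before the generic 'TED' rule
    (r"TED", "Other"),
]


def categorize(event, rules=EVENT_RULES):
    """Return the label of the first rule whose pattern is found in event."""
    for pattern, label in rules:
        if re.search(pattern, event):
            return label
    return "Exclude"  # no 'TED' in the name at all


for name in ["TED2014", "TEDGlobal 2012", "TED-Ed", "Some Other Conference"]:
    print(name, "->", categorize(name))
```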

In [10]:
# F4. Remove non-TED videos

df_stats = df_stats.loc[df_stats.event_type != "Exclude", :].copy()

In [12]:
# G. Convert each 'rating' count into its own column of counts

# G1. Convert 'rating' from string to dictionary

def rating_to_dict(rating):
    """
    Converts string representation of list of dictionaries/JSON items into dictionary based on 'name'.
    """
    import json  # easy load-in from string
    import re

    # ignore square brackets, split by curly braces
    # (raw string so the backslashes reach the regex engine unchanged)
    match = re.findall(r"\{.*?\}", rating[1:-1])
    rate_dict = dict()  # empty dict for adding

    for i in match:
        i = i.replace("'", "\"")  # json requires double quotes
        i = json.loads(i)  # curly braced string to dict
        # {rating_name:count} as key:value pair
        rate_dict[i['name']] = i['count']

    return rate_dict

# Apply to df
df_stats.loc[:, 'ratings'] = df_stats.loc[:, 'ratings'].apply(rating_to_dict)

# Check type
print("ratings data type:", type(df_stats.iloc[0].loc['ratings']))

ratings data type: <class 'dict'>

In [13]:
# G2. Find all ratings

all_ratings = set()

for i in df_stats.loc[:, 'ratings']:  # iterate through rows
    k = list(i.keys())  # return list of dictionary keys (categories)
    all_ratings.update(k)  # add new entries, if any

print("Number of ratings:", len(all_ratings))
all_ratings

Number of ratings: 14


{'Beautiful',
 'Confusing',
 'Courageous',
 'Fascinating',
 'Funny',
 'Informative',
 'Ingenious',
 'Inspiring',
 'Jaw-dropping',
 'Longwinded',
 'OK',
 'Obnoxious',
 'Persuasive',
 'Unconvincing'}

We have 14 different ratings. Let's take each one and give it its own column. Let's also split them into "positive" and "negative" ratings:  

__positive__: Beautiful, Courageous, Fascinating, Funny, Informative, Ingenious, Inspiring, Jaw-dropping, Persuasive  
__negative__: Confusing, Longwinded, Obnoxious, OK, Unconvincing  

"OK" is ambiguous, but is considered "negative" because it does not show up in TED's search options for "show me a video that is..." assumedly because nobody says, "I want to watch something that's just OK."

In [14]:
# G3. Create columns for each rating

for category in list(all_ratings):  # loop through categories in ratings
    df_stats[category.lower()] = df_stats['ratings'].apply(lambda x: x.get(category))
    # for each category, create a new column
    # for each row, return the 'value' of that row's dictionary's 'key' (category), if it exists

# Create column for positives
pos_rating = list([
    'beautiful', 'courageous', 'fascinating', 'funny', 'informative',
    'ingenious', 'inspiring', 'jaw-dropping', 'persuasive'
])

df_stats['positives'] = df_stats[pos_rating].sum(axis=1)

# Create column for negatives
neg_rating = list(['confusing', 'longwinded', 'obnoxious', 'ok', 'unconvincing'])

df_stats['negatives'] = df_stats[neg_rating].sum(axis=1)

df_stats.drop(columns=['ratings'], inplace=True)

df_stats.loc[:, 'ingenious':'negatives'].head(3)

Unnamed: 0,ingenious,obnoxious,inspiring,unconvincing,ok,positives,negatives
0,6073,209,24924,300,1174,91538,2312
1,56,131,413,258,203,2169,767
2,183,142,230,104,146,2327,497
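
The per-row logic behind G1 to G3 can also be sketched without pandas. The version below uses `ast.literal_eval` in place of the regex-plus-JSON approach, assumes each rating name appears at most once per row, and runs on a made-up sample string (including the `id` values).

```python
import ast

POSITIVE = {"Beautiful", "Courageous", "Fascinating", "Funny", "Informative",
            "Ingenious", "Inspiring", "Jaw-dropping", "Persuasive"}
NEGATIVE = {"Confusing", "Longwinded", "Obnoxious", "OK", "Unconvincing"}


def summarize_ratings(raw):
    """Return ({name: count}, positive_total, negative_total) for one row."""
    counts = {item["name"]: item["count"] for item in ast.literal_eval(raw)}
    positives = sum(n for name, n in counts.items() if name in POSITIVE)
    negatives = sum(n for name, n in counts.items() if name in NEGATIVE)
    return counts, positives, negatives


# Made-up row in the same shape as the ratings column
sample = "[{'id': 7, 'name': 'Funny', 'count': 12}, {'id': 3, 'name': 'OK', 'count': 4}]"
counts, positives, negatives = summarize_ratings(sample)
print(counts, positives, negatives)  # {'Funny': 12, 'OK': 4} 12 4
```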


In [15]:
# H. Create "Success" Column

ratings = neg_rating + pos_rating  # full list

# test for success for each row
def success_test(x):
    """
    Returns False (unsuccessful) if the row meets any of these conditions, otherwise True:
        1. It has more negative than positive ratings (a tie also counts as unsuccessful).  
        2. The most common rating is negative.  
        3. The second-most common rating is negative and no more than 30% smaller than the first.  
    """
    
    # list of ratings sorted descending by count
    top_list = list(x.loc[ratings].sort_values(ascending=False).keys())
    
    # ratio of the #2 rating's count to the #1 rating's count
    top_pct = x.loc[top_list[1]] / x.loc[top_list[0]]
    
    # at least as many negative as positive ratings
    if x['negatives'] >= x['positives']:
        return False
    # most common rating is negative
    elif top_list[0] in neg_rating:
        return False
    # second-most common rating is negative and no more than 30% smaller than first
    elif top_list[1] in neg_rating and top_pct >= .7:
        return False
    # success!
    else:
        return True
    
# apply function to each row of df
df_stats['success'] = df_stats.apply(success_test, axis=1)

print("Success counts:")
print(df_stats.loc[:, 'success'].value_counts())
df_stats.loc[df_stats['success'] == False, ratings + ['negatives', 'positives', 'success']].head(10)

Success counts:
True     2372
False      84
Name: success, dtype: int64


Unnamed: 0,confusing,longwinded,obnoxious,ok,unconvincing,beautiful,courageous,fascinating,funny,informative,ingenious,inspiring,jaw-dropping,persuasive,negatives,positives,success
22,40,119,41,87,67,54,33,34,106,82,9,86,3,23,354,430,False
60,94,31,6,53,22,4,4,84,3,98,89,48,23,12,206,365,False
74,69,61,18,44,60,24,11,53,7,59,42,91,15,2,252,304,False
79,88,173,66,108,273,20,66,119,15,151,78,108,19,47,708,623,False
82,53,89,57,89,80,95,7,65,13,21,56,43,16,0,368,316,False
105,159,10,44,115,261,24,30,58,10,189,25,53,15,49,589,453,False
113,0,64,19,26,60,11,12,12,3,37,11,29,9,4,169,128,False
178,30,161,36,139,45,49,14,68,100,90,23,127,12,8,411,491,False
185,57,42,40,88,157,7,61,76,6,162,136,114,28,123,384,713,False
202,11,76,9,35,7,19,9,31,40,32,16,41,10,10,138,208,False


Now that our metadata dataset df_stats is cleaned, we can move on to preparing our transcripts for Natural Language Processing.

## Transcripts: 1. Library and Data Import

Our texts are in CSV, so we can load them in through Pandas and explore from there.

Let's see what data types we're dealing with and get a preview of the data itself:

In [16]:
# 1. Library and Data Import

import spacy.lang.en  # word analysis

# load language models
#nlp = spacy.load('en')
#nlp2 = spacy.load('en_core_web_lg')

# import data
df_trans = pd.read_csv('transcripts.csv')
df_trans.info()
df_trans.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2467 entries, 0 to 2466
Data columns (total 2 columns):
transcript    2467 non-null object
url           2467 non-null object
dtypes: object(2)
memory usage: 38.6+ KB


Unnamed: 0,transcript,url
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...


We can see that our data consists of just two columns: __transcript__ and __url__. Unfortunately, we only have transcripts for 2467 of our 2550 videos. We'll use the URL to match each available transcript to its respective metadata in a future step. Because the transcript text file is only ~28 MB, we can load it all into memory at once without worrying about overloading the system.

## Transcripts: 2. Data Cleaning

Our goals here will unfold as we explore issues in the text, but our general outline is:  

A. Identify and discard duplicates  
B. Join datasets  
C. Identify and fill missing data, where possible  
D. Parse and clean text

In [17]:
# A. Identify and discard duplicates

# boolean mask of duplicates by URL
dup_trans_u = df_trans.duplicated(subset='url')

# boolean mask of duplicates by transcript
dup_trans_t = df_trans.duplicated(subset='transcript')

# check to make sure all duplicates are exact across columns
print("Are duplicate entries the same?")
print(df_trans[dup_trans_u] == df_trans[dup_trans_t])

# discard duplicates
df_trans = df_trans[~dup_trans_u]

# re-check for duplicates
duplicates = df_trans.duplicated(subset='url')
# count rows not flagged as duplicates against the current row total
print("\n\nNon-Duplicate Rows:", (~duplicates).sum(), "of", len(df_trans), "rows")

# glance at data
df_trans.head()

Are duplicate entries the same?
      transcript   url
1114        True  True
1115        True  True
1116        True  True


Non-Duplicate Rows: 2464 of 2464 rows


Unnamed: 0,transcript,url
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...


In [18]:
# B. Join datasets

# join on 'url'
df_ted = df_stats.join(df_trans.set_index('url'), on='url')

# get new dataframe info
df_ted.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2456 entries, 0 to 2549
Data columns (total 34 columns):
comments              2456 non-null int64
description           2456 non-null object
duration              2456 non-null int64
event                 2456 non-null object
film_date             2456 non-null datetime64[ns]
languages             2456 non-null int64
main_speaker          2456 non-null object
name                  2456 non-null object
num_speaker           2456 non-null int64
published_date        2456 non-null datetime64[ns]
speaker_occupation    2456 non-null object
tags                  2456 non-null object
title                 2456 non-null object
url                   2456 non-null object
views                 2456 non-null int64
event_type            2456 non-null object
courageous            2456 non-null int64
confusing             2456 non-null int64
longwinded            2456 non-null int64
fascinating           2456 non-null int64
persuasive            2456

We can see that our dataset has been successfully joined, but that 47 transcripts are missing. Let's take a look at the videos with missing transcripts.
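
One way to count unmatched rows directly is a left merge with `indicator=True`, which adds a `_merge` column recording where each row came from. This is a sketch on toy frames, not the join used above; the URLs and values are placeholders.

```python
import pandas as pd

# Toy frames standing in for df_stats and df_trans
stats = pd.DataFrame({"url": ["u1", "u2", "u3"], "views": [10, 20, 30]})
trans = pd.DataFrame({"url": ["u1", "u3"], "transcript": ["text a", "text c"]})

merged = stats.merge(trans, on="url", how="left", indicator=True)

# '_merge' is 'both' when a transcript was found and 'left_only' otherwise
missing = merged.loc[merged["_merge"] == "left_only", "url"].tolist()
print(missing)  # ['u2']
```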

In [19]:
# C. Identify and fill missing data, where possible

# C1. find missing data
df_ted.loc[df_ted['transcript'].isna(), ['name', 'description', 'event_type', 'event', 'duration', 'success']].sort_values('event_type')

Unnamed: 0,name,description,event_type,event,duration,success
135,"Vusi Mahlasela: ""Woza""",After Vusi Mahlasela's 3-song set at TEDGlobal...,TED Global,TEDGlobal 2007,299,True
209,"Rokia Traore: ""M'Bifo""","Rokia Traore sings the moving ""M'Bifo,"" accomp...",TED Global,TEDGlobal 2007,419,True
237,"Rokia Traore: ""Kounandi""","Singer-songwriter Rokia Traore performs ""Kouna...",TED Global,TEDGlobal 2007,386,True
696,Sophie Hunger: Songs of secrets and city lights,"This haunting, intimate performance by Europea...",TED Global,TEDGlobal 2009,1384,True
547,Matthew White: The modern euphonium,"The euphonium, with its sweet brass sound, is ...",TED Global,TEDGlobal 2009,141,False
58,"Pilobolus: A dance of ""Symbiosis""","Two Pilobolus dancers perform ""Symbiosis."" Doe...",TED Yearly,TED2005,825,True
2407,"Silk Road Ensemble: ""Turceasca""",Grammy-winning Silk Road Ensemble display thei...,TED Yearly,TED2016,389,True
512,Vishal Vaid: Hypnotic South Asian improv music,Vishal Vaid and his band explore a traditional...,TED Yearly,TED2006,814,True
2418,"Sō Percussion: ""Music for Wood and Strings""",Sō Percussion creates adventurous compositions...,TED Yearly,TED2016,609,False
446,Eric Lewis: Chaos and harmony on piano,Eric Lewis explores the piano's expressive pow...,TED Yearly,TED2009,294,False


Looking at the above snippets of names, descriptions, events, and durations, it seems a couple of things are going on in the data. 

1. Some of the "talks" are actually some kind of performance (music, dance, poetry, etc.).
2. The only non-performance talks with no transcripts are from TEDx (independent TED events).

Because this project deals with "talks," all of the videos with no actual "talk" should be removed, whether or not they have a transcript. In instances where there are both a "talk" and a performance, the transcript should be included with the lyrics removed (if given). 

It is unclear whether these transcripts were left out because they were unavailable or because of an automation error. All URLs must be visited to see whether a transcript is (now) available and imported where appropriate.
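
As a sketch of what removing lyrics could look like, the snippet below assumes that sung passages are wrapped in paired ♫ markers. That convention is an assumption about the transcripts, `strip_lyrics` is a hypothetical helper, and the sample text is made up.

```python
import re

# Assumption: lyrics sit between a pair of musical-note markers
LYRICS = re.compile(r"♫.*?♫", flags=re.DOTALL)


def strip_lyrics(text):
    """Drop note-delimited passages and collapse the leftover whitespace."""
    without_lyrics = LYRICS.sub(" ", text)
    return re.sub(r"\s+", " ", without_lyrics).strip()


sample = "Thank you. ♫ La la la ♫ Now, about the data."
print(strip_lyrics(sample))  # Thank you. Now, about the data.
```

An unpaired marker is left in place, so any transcript that still contains ♫ after this step would need a manual look.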

In [82]:
# C2. Manually check for transcript availability

# print each URL, then visit it
# (a plain loop avoids the Series of None values that .apply(print) returns)
for url in df_ted.loc[df_ted['transcript'].isna(), 'url']:
    print(url)

https://www.ted.com/talks/pilobolus_perform_symbiosis

https://www.ted.com/talks/ethel_performs_blue_room

https://www.ted.com/talks/vusi_mahlasela_s_encore_at_tedglobal2007

https://www.ted.com/talks/rokia_traore_sings_m_bifo

https://www.ted.com/talks/rokia_traore_sings_kounandi

https://www.ted.com/talks/sxip_shirey_at_the_breathing_place

https://www.ted.com/talks/eric_lewis_strikes_chords_to_rock_the_jazz_world

https://www.ted.com/talks/eric_lewis_plays_chaos_and_harmony

https://www.ted.com/talks/qi_zhang_s_electrifying_organ_performance

https://www.ted.com/talks/vishal_vaid_s_hypnotic_song

https://www.ted.com/talks/matthew_white_gives_the_euphonium_a_new_voice

https://www.ted.com/talks/sivamani_rhythm_is_everything_everywhere

https://www.ted.com/talks/sophie_hunger_plays_songs_of_secrets_city_lights

https://www.ted.com/talks/paul_lewis_crowdsourcing_the_news

https://www.ted.com/talks/roger_mcnamee_six_ways_to_save_the_internet

https://www.ted.com/talks/michael_nielsen_op

Ten transcripts were either missed by the scrape or have been added since September 21, 2017. Because they are now available, they were copied into a text file and can be incorporated into the dataset.

In [83]:
# C3. Import New Transcripts

with open('raw_transcripts.txt', 'r') as file:
    text = file.readlines()  # read each line in as a list
    
text_dict = dict()  # initialize dict for talks
talk = ""
for line in text:
    if line == '\n':  # skip newlines
        continue
    elif line[0:4] == 'http':  # talks are identified by URL
        talk = line  # create new dict key for each URL
        text_dict[talk] = ""  # set dict key as URL and value as blank string
    elif re.match(r'\d\d', line):  # skip timestamps (lines starting with two digits)
        continue
    else:
        text_dict[talk] += line  # append each line to the last
        
# test that it worked -- first 500 characters
print(list(text_dict.keys())[0], text_dict[list(text_dict.keys())[0]][0:500])

https://www.ted.com/talks/giles_duley_when_a_reporter_becomes_the_story
 Good morning, everyone. When I was first asked to do a TED Talk, I Googled to try and find out a little bit more about, you know, how it felt to be giving one. And one of the first things I read was a speaker in the States saying that she felt fine until she came onstage, and then she saw the timer ticking down.
(Laughter)
And it reminded her of a bomb. I was thinking, "That's the last thing I need."
(Laughter)
(Applause)
Anyway, it's a great privilege to be here. I think it's a bit of a joke fo
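
The same parsing logic can be wrapped in a function and exercised on a small in-memory sample. The sketch assumes the layout implied by the cell above (a URL line, optional timestamp lines, then transcript lines); the function name and the sample lines are made up.

```python
import re


def parse_raw_transcripts(lines):
    """Group transcript lines under the URL line that precedes them."""
    talks = {}
    current = None
    for line in lines:
        if line == "\n":               # blank separator
            continue
        if line.startswith("http"):    # a URL line opens a new talk
            current = line             # kept verbatim, newline included, as in the cell above
            talks[current] = ""
        elif re.match(r"\d\d", line):  # timestamp line such as '00:04'
            continue
        elif current is not None:      # ignore stray text before the first URL
            talks[current] += line
    return talks


sample = [
    "https://example.com/talks/a\n",
    "00:04\n",
    "Good morning.\n",
    "\n",
    "(Laughter)\n",
    "https://example.com/talks/b\n",
    "Hello.\n",
]
talks = parse_raw_transcripts(sample)
print(len(talks))                                    # 2
print(repr(talks["https://example.com/talks/a\n"]))  # 'Good morning.\n(Laughter)\n'
```

One limitation of the two-digit timestamp test is that a transcript line that happens to start with two digits would also be skipped.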


In [84]:
# C4. Add New Transcripts

df_ted.set_index('url', inplace=True)  # index by URL
for url, text in text_dict.items():  # iterate through dictionary
    df_ted.loc[url, 'transcript'] = text  # from dictionary: {URL: transcript text}
df_ted.reset_index(inplace=True)  # put URL back as a column and re-index by arbitrary integers

# check that it worked
for url, text in text_dict.items():
    print(df_ted.loc[df_ted['url'] == url, 'transcript'])
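
One caveat with assigning through `.loc` is that a label missing from the index is added as a new row rather than raising an error. A guarded version might look like the sketch below, which runs on a toy frame with placeholder URLs.

```python
import pandas as pd

# Toy frame indexed by URL; values are placeholders
df = pd.DataFrame(
    {"transcript": ["text a", None]},
    index=pd.Index(["u1", "u2"], name="url"),
)
new_transcripts = {"u2": "text b", "u9": "orphan text"}

unmatched = []
for url, text in new_transcripts.items():
    if url in df.index:
        df.loc[url, "transcript"] = text
    else:
        unmatched.append(url)  # .loc would otherwise append a new, mostly empty row

print(len(df), unmatched)  # 2 ['u9']
```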

In [93]:
# C5. Remove talks with no transcript
drop_indices = df_ted.loc[df_ted['transcript'].isna(), :].index  # indices to drop
df_ted.drop(drop_indices, inplace=True)

df_ted.info() # check to see if it worked

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2420 entries, 0 to 2455
Data columns (total 34 columns):
url                   2420 non-null object
comments              2420 non-null int64
description           2420 non-null object
duration              2420 non-null int64
event                 2420 non-null object
film_date             2420 non-null datetime64[ns]
languages             2420 non-null int64
main_speaker          2420 non-null object
name                  2420 non-null object
num_speaker           2420 non-null int64
published_date        2420 non-null datetime64[ns]
speaker_occupation    2420 non-null object
tags                  2420 non-null object
title                 2420 non-null object
views                 2420 non-null int64
event_type            2420 non-null object
courageous            2420 non-null int64
confusing             2420 non-null int64
longwinded            2420 non-null int64
fascinating           2420 non-null int64
persuasive            2420

In [None]:
"""def has_notes(x):
    if "♫" in x.loc['transcript']:
        print(x.loc['transcript'], "\n\n")
    
df_ted.apply(has_notes, axis=1)
"""