### Ted Talks Text Generator

All talks have been downloaded and are stored in a CSV file called `transcripts.csv`.

The tags column contains all of a talk's tag values in one long string. To analyze the tags, each string must be converted to a list, which is then used to create dummy columns, one for each tag. The dummy columns can then be summed to show which tags are the most popular during Exploratory Data Analysis (EDA).
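The string-to-list-to-dummy-columns idea can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual EDA code: the three talks and their tags are made up, and `MultiLabelBinarizer` is used only because it is imported below.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the transcripts DataFrame (hypothetical data)
toy = pd.DataFrame({
    "title": ["Talk A", "Talk B", "Talk C"],
    "tags": ["science,education", "science,money", "money"],
})

# Convert each comma-separated string into a list of tags
toy["tags"] = [tag.split(",") for tag in toy["tags"]]

# Build one 0/1 dummy column per distinct tag
mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(
    mlb.fit_transform(toy["tags"]),
    columns=mlb.classes_,
    index=toy.index,
)

# Summing each column gives the number of talks carrying that tag
tag_counts = dummies.sum().sort_values(ascending=False)
print(tag_counts)
```

Keeping the original index on `dummies` means the dummy columns can later be joined back onto the talks they describe.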

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import re
from sklearn.preprocessing import MultiLabelBinarizer

In [31]:
# Load transcripts file 
transcripts = pd.read_csv('../data/transcripts.csv')
transcripts.head(3)

Unnamed: 0,title,speaker,url,month,year,tags,views,transcript
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021,"climate change,Countdown,activism,community,so...","472,619 views • 4:32",Don Cheadle: Home. It's where we celebrate our...
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021,"goal-setting,finance,self,money",0 views • 2:41,Everyone's heard of the tired old adage of pay...
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021,"TED-Ed,education,human body,animation,science,...",0 views • 4:51,"Here in this abundant forest, Malassezia is eq..."


In [32]:
transcripts.shape

(4384, 8)

In [33]:
#replace nan with empty string
transcripts['tags'] = ['' if pd.isna(tag) else tag for tag in transcripts['tags'] ]

Convert each row's tags value from a long string to a list of tags

In [34]:
transcripts['tags'] = [tag.split(',') for tag in transcripts['tags']]

Spot check one row's tags

In [35]:
transcripts.iloc[2600]['tags']

['activism', 'business', 'money', 'philanthropy']

In [36]:
transcripts.isnull().sum()

title           0
speaker         0
url             0
month           0
year            0
tags            0
views         103
transcript    103
dtype: int64

The scraping functions failed on a few talks, as indicated by a missing tags value. These missing tags were converted from NaN to the empty string.

In [37]:
transcripts[transcripts['tags'].map(len) == 1]['tags'].values[0:10]

array([list(['']), list(['']), list(['']), list(['']), list(['']),
       list(['']), list(['']), list(['']), list(['']), list([''])],
      dtype=object)

Of the 4,384 TED talks scraped, 103 came back without a transcript. The HTML of those pages may be slightly different. Talks with empty transcripts account for about 2% of the dataset, so they are dropped rather than debugged.

Talks with empty transcripts can be identified by a tags list containing `''` as its only value.
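The reason these rows have a one-element list rather than an empty one is that splitting an empty string still returns a single empty string. The sketch below shows this on made-up data and builds a boolean mask from it. The names `toy`, `is_blank`, and `kept` are illustrative only.

```python
import pandas as pd

# Splitting an empty string yields a one-element list, not an empty list
print("".split(","))  # ['']

# Toy tags column (hypothetical data); the middle row mimics a failed scrape
toy = pd.DataFrame({"tags": ["activism,money", "", "science"]})
toy["tags"] = [tag.split(",") for tag in toy["tags"]]

# True where the tags list is exactly ['']
is_blank = toy["tags"].map(lambda tags: tags == [""])
print(is_blank.sum())

# Keep only rows with real tags; .copy() gives an independent DataFrame
kept = toy[~is_blank].copy()
print(kept)
```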

In [38]:
transcripts[transcripts['tags'].map(lambda tags: tags[0] == '')].shape

(103, 8)

In [39]:
# Drop rows with a blank tag value.
# .copy() makes the result an independent DataFrame, so adding columns later
# does not raise a SettingWithCopyWarning.
transcripts = transcripts[transcripts['tags'].map(lambda tags: tags[0] != '')].copy()

In [40]:
transcripts.isnull().sum()

title         0
speaker       0
url           0
month         0
year          0
tags          0
views         0
transcript    0
dtype: int64

In [41]:
transcripts.shape

(4281, 8)

#### Extract run time value from views column

In [42]:
transcripts.iloc[2600]['views']

'1,363,651 views • 6:29'

In [43]:
transcripts['run_time'] = [data.split(' ')[-1] for data in transcripts['views']]

#### Extract the number of views, which is the first token in the views column

In [44]:
transcripts['views'] = [data.split(' ')[0] for data in transcripts['views']]

Then strip out commas and convert to integer

In [45]:
transcripts['views'] = transcripts['views'].str.replace(',', '').astype(int)
transcripts.head()

Unnamed: 0,title,speaker,url,month,year,tags,views,transcript,run_time
0,Community-powered solutions to the climate crisis,Rahwa Ghirmatzion and Zelalem Adefris,/talks/rahwa_ghirmatzion_and_zelalem_adefris_c...,Feb,2021,"[climate change, Countdown, activism, communit...",472619,Don Cheadle: Home. It's where we celebrate our...,4:32
1,A simple 2-step plan for saving more money,Wendy De La Rosa,/talks/wendy_de_la_rosa_a_simple_2_step_plan_f...,Feb,2021,"[goal-setting, finance, self, money]",0,Everyone's heard of the tired old adage of pay...,2:41
2,"What causes dandruff, and how do you get rid o...",Thomas L. Dawson,/talks/thomas_l_dawson_what_causes_dandruff_an...,Feb,2021,"[TED-Ed, education, human body, animation, sci...",0,"Here in this abundant forest, Malassezia is eq...",4:51
3,The artist who won a Nobel Prize... in medicine,Melanie E. Peffer,/talks/melanie_e_peffer_the_artist_who_won_a_n...,Feb,2021,"[animation, education, TED-Ed, history, scienc...",92822,"In the late 1860s, scientists believed they we...",4:49
4,A concrete idea to reduce carbon emissions,Karen Scrivener,/talks/karen_scrivener_a_concrete_idea_to_redu...,Feb,2021,"[Countdown, materials, climate change, innovat...",605375,Concrete is the second most used substance on ...,4:26
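As an alternative sketch, both values can be pulled out of the raw views string in one pass with a named-group regular expression. This assumes every value follows the `N views • M:SS` layout seen above. A row that does not match would come back as NaN and make the integer conversion fail. The sample strings are taken from the outputs shown earlier, and `raw`, `pattern`, and `parsed` are illustrative names.

```python
import pandas as pd

# Sample of the raw views column as scraped
raw = pd.Series([
    "472,619 views • 4:32",
    "0 views • 2:41",
    "1,363,651 views • 6:29",
])

# views: digits and commas; run_time: digits and colons
pattern = r"^(?P<views>[\d,]+) views • (?P<run_time>[\d:]+)$"
parsed = raw.str.extract(pattern)

# Strip thousands separators, then convert to integer
parsed["views"] = parsed["views"].str.replace(",", "", regex=False).astype(int)
print(parsed)
```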


Transcripts contain a lot of audience behavior in parentheses, such as (Applause) or (Laughter).

Use a regular expression to remove any text inside parentheses.

In [46]:
transcripts['transcript'] = [re.sub(r"\([^()]*\)", "", talk_text) for talk_text in transcripts['transcript']]

# Hat tip to this site for the regular expression
#https://www.kite.com/python/answers/how-to-use-regular-expressions-to-remove-text-within-parentheses-in-python
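To see what this pattern does, here it is applied to a made-up sentence. Removing a parenthetical leaves the spaces on both sides of it, so the result contains doubled spaces. The whitespace-collapsing step at the end is an optional extra, not something the notebook does.

```python
import re

# Hypothetical transcript fragment
sample = "Thank you. (Applause) Now, where was I? (Laughter)"

# Remove each parenthesized span that contains no nested parentheses
cleaned = re.sub(r"\([^()]*\)", "", sample)
print(repr(cleaned))

# Optional: collapse runs of whitespace left behind and trim the ends
tidy = re.sub(r"\s+", " ", cleaned).strip()
print(repr(tidy))
```

Because `[^()]*` excludes parentheses, a single pass removes only the innermost pair of any nested parentheses.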

Remove any word that is followed by a colon. This strips the words or names that indicate which person is speaking.

For example:
> Cecily: Ah, well, I feel rather

In [47]:
transcripts['transcript'] = [re.sub(r"\w+:", "", talk_text) for talk_text in transcripts['transcript']]

In [48]:
test = transcripts.iloc[75]['transcript']
print(test)

  Ah, well, I feel rather frightened. I'm so afraid he will look just like everyone else.   He does.  You are my little cousin Cecily, I'm sure.  You are under some grave mistake. I'm not little. In fact, I do believe I'm actually more than usually tall for my age. But I am your cousin Cecily, and you, I see, are also here helping Jo Michael Rezes with their TEDx talk. And you are my cousin Ernest, my wicked cousin Ernest.  Oh! Well, I'm not really wicked at all, cousin Cecily. You mustn't think that I am wicked.  Well, I hope you haven't been leading a double life, pretending to be good and being really wicked all the time. That would be hypocrisy.  Well, of course, I have been rather reckless.  I am glad to hear it.  But the world is good enough for me, cousin Cecily.  Yes, but are you good enough for it?  I'm afraid I am not that. That's why I want you to reform me.  Well, I'm afraid I have no time this afternoon. The TED talk and all.   Well, would you mind my reforming myself this
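A quick check of the `\w+:` pattern on a few short strings shows both what it removes and where it over- or under-matches. The first two strings come from the transcripts shown in this notebook; the third is made up to illustrate a side effect on clock times.

```python
import re

pattern = r"\w+:"

# Single-word speaker label: removed completely
one_word = re.sub(pattern, "", "Cecily: Ah, well, I feel rather")
print(repr(one_word))

# Two-word name: only the word touching the colon is removed
two_words = re.sub(pattern, "", "Don Cheadle: Home.")
print(repr(two_words))

# Side effect: the hour of a clock time also matches the pattern
clock = re.sub(pattern, "", "We met at 10:30.")
print(repr(clock))
```

For text generation these leftovers may be tolerable, but they are worth knowing about when inspecting the cleaned transcripts.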

Export the cleaned data as a pickle so it can be imported into other notebooks. (Exporting to CSV would convert the contents of the tags column from a list into a string, which would then have to be converted back. Pickling eliminates that hassle.)

In [50]:
# Save the cleaned DataFrame to disk; the with block closes the file afterwards
with open('../data/transcripts_clean.pickle', 'wb') as f:
    pickle.dump(transcripts, f)
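The difference between the two export formats can be seen with an in-memory round trip on a one-row toy frame. This is only a sketch: the data is made up and nothing is written to disk.

```python
import io
import pickle

import pandas as pd

# Toy frame with a list-valued tags column (hypothetical data)
toy = pd.DataFrame({"title": ["Talk A"], "tags": [["activism", "money"]]})

# CSV round trip: the list is written as its string representation
csv_text = toy.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Pickle round trip: Python objects are restored as they were
from_pickle = pickle.loads(pickle.dumps(toy))

print(type(from_csv.loc[0, "tags"]))
print(type(from_pickle.loc[0, "tags"]))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.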