# TED Talks: Terms

With the TEDtalks-all dataset created, we have xxxx talks with which to work. This is a small corpus, so the usual reasons for shrinking the feature set do not apply. Even so, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results are comparable to the work of others and so that we can scale up the work here without having to rethink the foundations.

## Imports and Data

In [22]:
# Imports
import pandas as pd
import re

In [3]:
# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

In [6]:
# A quick check of the columns
list(df)

['Set',
 'Talk_ID',
 'public_url',
 'headline',
 'description',
 'event',
 'duration',
 'published',
 'tags',
 'views',
 'text',
 'speaker_1',
 'speaker1_occupation',
 'speaker1_introduction',
 'speaker1_profile',
 'speaker_2',
 'speaker2_occupation',
 'speaker2_introduction',
 'speaker2_profile',
 'speaker_3',
 'speaker3_occupation',
 'speaker3_introduction',
 'speaker3_profile',
 'speaker_4',
 'speaker4_occupation',
 'speaker4_introduction',
 'speaker4_profile']

In [7]:
df.head()

| | Set | Talk_ID | public_url | headline | description | event | duration | published | tags | views | ... | speaker2_introduction | speaker2_profile | speaker_3 | speaker3_occupation | speaker3_introduction | speaker3_profile | speaker_4 | speaker4_occupation | speaker4_introduction | speaker4_profile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | only | 1 | https://www.ted.com/talks/al_gore_on_averting_... | Averting the climate crisis | With the same humor and humanity he exuded in ... | TED2006 | 0:16:17 | 6/27/06 | alternative energy,cars,global issues,climate ... | 3266733 | ... | | | | | | | | | | |
| 1 | only | 7 | https://www.ted.com/talks/david_pogue_says_sim... | Simplicity sells | New York Times columnist David Pogue takes aim... | TED2006 | 0:21:26 | 6/27/06 | simplicity,entertainment,interface design,soft... | 1702201 | ... | | | | | | | | | | |
| 2 | only | 53 | https://www.ted.com/talks/majora_carter_s_tale... | Greening the ghetto | In an emotionally charged talk, MacArthur-winn... | TED2006 | 0:18:36 | 6/27/06 | MacArthur grant,cities,green,activism,politics... | 2000421 | ... | | | | | | | | | | |
| 3 | only | 66 | https://www.ted.com/talks/ken_robinson_says_sc... | Do schools kill creativity? | Sir Ken Robinson makes an entertaining and pro... | TED2006 | 0:19:24 | 6/27/06 | children,teaching,creativity,parenting,culture... | 51614087 | ... | | | | | | | | | | |
| 4 | only | 92 | https://www.ted.com/talks/hans_rosling_shows_t... | The best stats you've ever seen | You've never seen data presented like this. Wi... | TED2006 | 0:19:50 | 6/27/06 | demo,Asia,global issues,visualizations,global ... | 12662135 | ... | | | | | | | | | | |


## Coming to Terms

The sole purpose of this notebook is to establish how we are going to elicit our features, our words, from the collection of talks. Thus, the only column we are interested in is the one with the texts of the talks. We do, however, want to be able to trace any interesting developments back to a given talk, so we will label each text with its `public_url` minus `https://www.ted.com/talks/`.
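
The labeling step might be sketched as follows. The URL below is a made-up placeholder rather than a row from the dataset, and `url_to_label` is a hypothetical helper name; the sketch assumes only that each `public_url` begins with the prefix named above.

```python
# Hypothetical helper: turn a talk's public URL into a short label.
PREFIX = "https://www.ted.com/talks/"

def url_to_label(url):
    """Return the URL without the TED talks prefix; leave other URLs unchanged."""
    if url.startswith(PREFIX):
        return url[len(PREFIX):]
    return url

# Placeholder URL for illustration only (not taken from the dataset).
example_url = "https://www.ted.com/talks/example_speaker_example_title"
print(url_to_label(example_url))
```

Leaving non-matching URLs untouched, rather than raising an error, is a design choice made here so that an unexpected URL remains traceable.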

Creating two lists, one of the URLs and one of the texts, is easy. We'll then use scikit-learn's counter to create the term frequency array and add the URLs back as labels, perhaps by merging the two into a dataframe.

In [8]:
urls  = df.public_url.tolist()
texts = df.text.tolist()
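
Before reaching for scikit-learn, the shape of the intended result can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the scikit-learn counter itself: the two short texts and their labels are invented, tokenization is a simple lowercase word regex, and `term_frequencies` is a hypothetical helper.

```python
import re
from collections import Counter

def term_frequencies(texts):
    """Count lowercase word tokens in each text; returns one Counter per text."""
    return [Counter(re.findall(r"[a-z']+", text.lower())) for text in texts]

# Invented stand-ins for the shortened URL labels and the `texts` list.
sample_labels = ["talk_a", "talk_b"]
sample_texts = ["Data tells a story.", "A story about data and more data."]

counts = term_frequencies(sample_texts)

# Shared vocabulary, sorted so the columns have a stable order.
vocabulary = sorted(set().union(*counts))

# One row per talk, keyed by its label, with one column per term.
matrix = {
    label: [count[term] for term in vocabulary]
    for label, count in zip(sample_labels, counts)
}
print(vocabulary)
print(matrix)
```

The row-per-talk, column-per-term layout is the same one a term frequency array provides, with the labels playing the role of the row index.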

Earlier explorations of the corpus revealed something we knew but had not realized could affect our work: some TED talks are not talks but musical performances. Generally, the texts of such performances are rather short. Using an arbitrary cutoff of `500` characters, we can see what these texts look like:

In [20]:
for text in texts:
    if len(text) < 500:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  
  (Music)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  


To list the indices of these short texts instead of their contents, substitute `texts.index(text)` for `text` as follows:

```python
for text in texts:
    if len(text) < 500:
        print(texts.index(text))
```
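
One caveat with this approach: `list.index` returns the position of the first matching item, so two identical texts would report the same index. A sketch using `enumerate`, with an invented list standing in for `texts`, shows the difference:

```python
# Invented stand-in for `texts`; note the duplicated short entry.
sample_texts = ["(Music) (Applause)", "x" * 600, "(Music) (Applause)"]

# list.index reports the position of the first match each time.
via_index = [sample_texts.index(t) for t in sample_texts if len(t) < 500]

# enumerate pairs each text with its own position.
via_enumerate = [i for i, t in enumerate(sample_texts) if len(t) < 500]

print(via_index)
print(via_enumerate)
```

When every short text is distinct, the two approaches agree; `enumerate` also avoids rescanning the list for each match.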

When it comes time to process the words in a text, our best bet will be to remove the parentheticals. Keeping them, though, would let us explore sentiment, using `(Applause)` and `(Laughter)` as contextual valuations.
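
As a rough sketch of that idea, the parentheticals in a transcript can be tallied before they are discarded. The sample string is invented, `count_parentheticals` is a hypothetical helper, and treating every parenthetical as an audience or stage cue is an assumption: a speaker's own parenthetical remarks would be counted too.

```python
import re
from collections import Counter

def count_parentheticals(text):
    """Tally each parenthetical, such as '(Applause)', found in a transcript."""
    return Counter(re.findall(r"\([^()]*\)", text))

# Invented transcript fragment for illustration.
sample = "Thank you. (Applause) So I tried it. (Laughter) It failed. (Laughter)"
cues = count_parentheticals(sample)
print(cues["(Laughter)"], cues["(Applause)"])
```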

For now, we will need some regex to remove the parentheses and their contents from our texts.
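
A minimal sketch of such a pattern, assuming the parentheticals are never nested and that collapsing the leftover whitespace is acceptable. `strip_parentheticals` is a hypothetical helper, and the sample is modeled on one of the short texts printed above.

```python
import re

# Matches an opening parenthesis, any run of non-parenthesis characters,
# and the closing parenthesis.
PAREN = re.compile(r"\([^()]*\)")

def strip_parentheticals(text):
    """Remove parentheticals, then collapse runs of whitespace."""
    without = PAREN.sub(" ", text)
    return " ".join(without.split())

sample = "  Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.  "
print(strip_parentheticals(sample))
```

A text made up entirely of parentheticals, such as a musical performance, reduces to an empty string, which offers one way to flag it later.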

In [25]:
for text in texts:
    if len(text) < 500:
        print(texts.index(text))

113
235
382
496
573
799
899
1484
1564


In [31]:
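# An alternative pattern: r"\(([^\)]+)\)"

# Search the whole string at once; looping over a string yields
# single characters, which can never match the pattern.
parens = re.findall(r'\(.*?\)', texts[235])
print(parens)


['(Whirring)', '(Laughter)', '(Music)', '(Beatboxing)', '(Applause)']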
