# Talks as Performances & the Problem of Parentheticals

Earlier explorations of the corpus revealed something we knew but had not realized could affect our work: some TED talks are not talks but musical performances. Generally, the texts of such performances are rather short. Using an arbitrary cutoff of `500` characters, we can see what these texts look like:

## Imports and Data

In [5]:
# Imports
import pandas as pd
import re

# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')

# Create a list of all the texts
texts = df.text.tolist()

In [6]:
for text in texts:
    if len(text) < 500:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  
  (Music)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  


To get a list of the indices for the texts, substitute `texts.index(text)` for `text` as follows:

```python
for text in texts:
    if len(text) < 500:
        print(texts.index(text))
```

Here is the same thing in a list comprehension:

In [7]:
shorts = [ texts.index(text) for text in texts if len(text) < 500 ]
print(shorts)

[113, 235, 382, 496, 573, 799, 899, 1484, 1564]
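
One caveat with `texts.index(text)`: it returns the position of the first matching element, so two identical short texts would be reported under the same index, and each lookup rescans the list. A minimal sketch of an alternative using `enumerate`, run here on a made-up `sample_texts` list (an assumption for illustration, not the TED corpus):

```python
# Hypothetical stand-in for the corpus; the real `texts` comes from the CSV loaded earlier.
sample_texts = ["(Music) (Applause)", "A much longer talk. " * 50, "(Music) (Applause)"]

# enumerate() yields each position directly, so duplicate texts keep distinct indices.
short_indices = [i for i, text in enumerate(sample_texts) if len(text) < 500]
print(short_indices)

# list.index() returns the first match only, so the duplicate at position 2 is reported as 0.
first_match_indices = [sample_texts.index(text) for text in sample_texts if len(text) < 500]
print(first_match_indices)
```

On a corpus with no duplicate texts, both approaches should produce the same list.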


When it comes time to process the words in a text, our best bet will be to remove the parentheticals. Keeping them, though, would let us explore sentiment, using `(Applause)` and `(Laughter)` as contextual valuations.

For now, we need a regex that removes the parentheses and their contents from our texts. An examination of text `113` above reveals that it consists of only three parenthetical expressions:

    (Applause)    (Music)    (Applause)

We need a sample text that mixes speech with parentheticals, so we will use `235`:

In [18]:
print(texts[235])

  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  


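Two different regexes give us two lists: the first uses lookarounds and returns the contents without the parentheses, while the second returns each expression with its parentheses: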

In [26]:
print(re.findall(r'(?<=\().*?(?=\))', texts[235]))
print(re.findall(r'\([^)]*\)', texts[235]))

['Whirring', 'Laughter', 'Music', 'Beatboxing', 'Applause']
['(Whirring)', '(Laughter)', '(Music)', '(Beatboxing)', '(Applause)']
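
The second pattern can also serve for the removal step itself. The following is a minimal sketch, assuming parentheticals are never nested; the helper name `strip_parentheticals` is hypothetical, and the sample string reproduces text `235` from above:

```python
import re

# Sample reproducing text 235 as printed above.
sample = ("  Let's just get started here.    Okay, just a moment.    (Whirring)    "
          "All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  ")

def strip_parentheticals(text):
    """Remove (...) spans, then collapse the leftover runs of whitespace."""
    # Assumes no nested parentheses: [^)]* stops at the first closing parenthesis.
    without_parens = re.sub(r'\([^)]*\)', ' ', text)
    return ' '.join(without_parens.split())

cleaned = strip_parentheticals(sample)
print(cleaned)
```

Replacing each match with a space rather than an empty string keeps neighboring words from running together before the whitespace is collapsed.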


We could use sklearn's `CountVectorizer` to count only these parentheticals!

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
vec = CountVectorizer(token_pattern = r'(?<=\().*?(?=\))')
X = vec.fit_transform(texts[230:240])
X.shape

(10, 6)

In [31]:
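# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# A new name (df_sample) avoids overwriting the corpus DataFrame `df`
df_sample = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
# Label rows with their positions in `texts`
df_sample.index = range(230, 240)
df_sample.head(10)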

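| | applause | beatboxing | laugher | laughter | music | whirring |
|---|---|---|---|---|---|---|
| 230 | 1 | 0 | 0 | 2 | 0 | 0 |
| 231 | 0 | 0 | 0 | 0 | 0 | 0 |
| 232 | 3 | 0 | 0 | 10 | 7 | 0 |
| 233 | 3 | 0 | 1 | 10 | 0 | 0 |
| 234 | 0 | 0 | 0 | 9 | 0 | 0 |
| 235 | 1 | 1 | 0 | 1 | 1 | 1 |
| 236 | 3 | 0 | 0 | 12 | 0 | 0 |
| 237 | 0 | 0 | 0 | 2 | 0 | 0 |
| 238 | 1 | 0 | 0 | 6 | 0 | 0 |
| 239 | 1 | 0 | 0 | 25 | 0 | 0 |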


In [33]:
parentheticals = vec.fit_transform(texts)
parentheticals.shape

(1747, 620)

In [36]:
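# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
df_parens = pd.DataFrame(parentheticals.toarray(), columns=vec.get_feature_names_out())
# The default RangeIndex already matches each text's position in `texts`
df_parens.head(10)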

| | "actually about ... 1%" | "although it's nothing serious, let's keep an eye on it to make sure it doesn't turn into a major lawsuit." | "close it!" | "do architects have ears?" | "i sold my soul for about a tenth of what the damn things are going for now." | "in order to remain competitive in today's marketplace, i'm afraid we're going to have to replace you with a sleezeball." | "intrigue and murder among 16th century ottoman court painters." | "kill him." | "michael crichton responds by fax:" | "sure" | ... | whistling | whoosh | with 4 attempts | woman screaming | woman: have you ever done a kissing test before? | woman: okay. | woo-hoo-hoo-hoo | xylophone | yelling more loudly | your fathers bristles white and stiff now |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


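Sigh, more TED randomness: the parentheticals include quoted captions, bits of dialogue, and sound descriptions, not just audience reactions.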

In [38]:
parens = list(df_parens)
print(parens[0:20])

['"actually about ... 1%"', '"although it\'s nothing serious, let\'s keep an eye on it to make sure it doesn\'t turn into a major lawsuit."', '"close it!"', '"do architects have ears?"', '"i sold my soul for about a tenth of what the damn things are going for now."', '"in order to remain competitive in today\'s marketplace, i\'m afraid we\'re going to have to replace you with a sleezeball."', '"intrigue and murder among 16th century ottoman court painters."', '"kill him."', '"michael crichton responds by fax:"', '"sure"', '"that\'s wrong too."', '"that\'s wrong"', '"the end" by the doors', '"what\'s a jurassic park?"', '"wow! fucking fantastic jacket"', '"yes, books. you know, the bound volumes with ink on paper. you cannot turn them off with a switch. tell your kids."', '1902', '1925', '1927', '1939']


In [39]:
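from collections import Counter
# Note: `parens` holds the DataFrame's column labels, each of which is unique,
# so every count here is 1 (this tallies labels, not occurrences in the corpus)
paren_counts = Counter(parens)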
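
Because every column label is unique, `Counter(parens)` can only report a count of 1 per label. To see how often each parenthetical actually occurs, one option is to sum the columns of the DataFrame (`df_parens.sum()`); another is to count regex matches directly. The sketch below takes the second route with the standard library only, on a made-up `sample_texts` list (an assumption for illustration), lowercasing matches to mirror `CountVectorizer`'s default behavior:

```python
import re
from collections import Counter

# Hypothetical stand-in for `texts`.
sample_texts = [
    "(Music) Thank you. (Applause)",
    "Hello. (Laughter) Well. (Laughter) (Applause)",
    "No audience reactions here.",
]

# The capture group returns the contents without the surrounding parentheses.
PAREN = re.compile(r'\(([^)]*)\)')

# Total occurrences of each parenthetical across all texts.
totals = Counter(m.lower() for text in sample_texts for m in PAREN.findall(text))

# Number of texts in which each parenthetical appears at least once.
doc_freq = Counter(
    label
    for text in sample_texts
    for label in {m.lower() for m in PAREN.findall(text)}
)

print(totals.most_common())
print(doc_freq.most_common())
```

Comparing the two counters separates parentheticals that recur within a few talks from those that appear across many.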

In [40]:
paren_counts

Counter({'"actually about ... 1%"': 1,
         '"although it\'s nothing serious, let\'s keep an eye on it to make sure it doesn\'t turn into a major lawsuit."': 1,
         '"close it!"': 1,
         '"do architects have ears?"': 1,
         '"i sold my soul for about a tenth of what the damn things are going for now."': 1,
         '"in order to remain competitive in today\'s marketplace, i\'m afraid we\'re going to have to replace you with a sleezeball."': 1,
         '"intrigue and murder among 16th century ottoman court painters."': 1,
         '"kill him."': 1,
         '"michael crichton responds by fax:"': 1,
         '"sure"': 1,
         '"that\'s wrong too."': 1,
         '"that\'s wrong"': 1,
         '"the end" by the doors': 1,
         '"what\'s a jurassic park?"': 1,
         '"wow! fucking fantastic jacket"': 1,
         '"yes, books. you know, the bound volumes with ink on paper. you cannot turn them off with a switch. tell your kids."': 1,
         '1902': 1,
       