# Talks as Performances & the Problem of Parentheticals

Earlier explorations of the corpus revealed something we knew but had not realized could affect our work: some TED talks are not talks but musical performances. Generally, the texts of such performances are rather short. Using an arbitrary cutoff of `500` characters, we can see what these texts look like:

## Imports and Data

In [5]:
# Imports
import pandas as pd
import re

# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')

# Create a list of all the texts
texts = df.text.tolist()

In [6]:
for text in texts:
    if len(text) < 500:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  
  (Music)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  


To get a list of the indices for the texts, substitute `texts.index(text)` for `text` as follows:

```python
for text in texts:
    if len(text) < 500:
        print(texts.index(text))
```

Here is the same thing in a list comprehension:

In [7]:
shorts = [ texts.index(text) for text in texts if len(text) < 500 ]
print(shorts)

[113, 235, 382, 496, 573, 799, 899, 1484, 1564]
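One caveat with `texts.index(text)`: it returns the position of the *first* matching text, so two identical transcripts would report the same index, and every lookup rescans the list. Below is a minimal sketch of an alternative built on `enumerate`. The helper name `short_text_indices` and the made-up `sample_texts` list are illustrative assumptions, not part of the corpus code:

```python
def short_text_indices(texts, max_chars=500):
    """Return the positions of all texts shorter than max_chars characters."""
    return [i for i, text in enumerate(texts) if len(text) < max_chars]


# Made-up stand-in for the corpus; note the two identical short texts.
sample_texts = [
    "(Music) (Applause)",
    "x" * 600,
    "(Music) (Applause)",
]

# enumerate keeps the two duplicates apart
print(short_text_indices(sample_texts))

# the index()-based version reports the first position twice
print([sample_texts.index(t) for t in sample_texts if len(t) < 500])
```

On the real `texts` list both approaches should agree as long as no two short transcripts are identical, which appears to be the case for the nine indices printed above.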


When it comes time to process the words in a text, our best bet will be to remove the parentheticals. Having them, though, means we may be able to explore sentiment using `(Applause)` and `(Laughter)` as contextual valuations.

For now, we will need some regex to remove the parentheses and their contents from our texts. An examination of text `113` above reveals that it consists of only three parenthetical expressions:

    (Applause)    (Music)    (Applause)

We need a sample text that mixes speech and parentheticals, so we will use `235`:

In [18]:
print(texts[235])

  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  


In [19]:
parens = re.findall(r'(?<=\().*?(?=\))', texts[235])
print(parens)

['Whirring', 'Laughter', 'Music', 'Beatboxing', 'Applause']


Okay, now we have a way to get the text inside the parentheses out, so we can examine it later. We also need a way to remove the parenthetical material so that only the spoken words remain.

In [23]:
text = texts[235]
retext = re.sub(r'\([^)]*\)', ' ', text)
print(retext)

  Let's just get started here.    Okay, just a moment.         All right.   Oh, sorry.           Thank you.       
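The output above still carries the runs of spaces left behind where the parentheticals used to be. One possible cleanup, sketched here under the assumption that parentheticals are never nested, is to collapse the whitespace right after the substitution. The names `PAREN_RE` and `strip_parentheticals` are hypothetical, and the sample string is a shortened version of text `235`:

```python
import re

# Same pattern as the re.sub call above: an opening parenthesis,
# anything that is not a closing parenthesis, then the closing one.
PAREN_RE = re.compile(r'\([^)]*\)')


def strip_parentheticals(text):
    """Remove parenthetical annotations, then collapse runs of whitespace."""
    without_parens = PAREN_RE.sub(' ', text)
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return ' '.join(without_parens.split())


sample = "  Let's just get started here.    (Whirring)    All right. (Laughter) Oh, sorry.    (Applause)  "
print(strip_parentheticals(sample))
```

A text made up entirely of parentheticals, such as `113`, comes back as an empty string, which could be a handy test for spotting the pure performances.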


Okay, so now we have a way to filter out the parentheticals. At some point later, let's come back and count them.
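As a rough sketch of what that counting might look like, `collections.Counter` can tally the labels that `re.findall` returns. This version swaps the lookarounds for a capture group, which should return the same labels. The function name `count_parentheticals` and the two-item `sample_texts` list are assumptions for illustration only:

```python
import re
from collections import Counter


def count_parentheticals(texts):
    """Tally the labels found inside parentheses across a list of texts."""
    counts = Counter()
    for text in texts:
        # With one capture group, findall returns only the captured labels.
        counts.update(re.findall(r'\(([^)]*)\)', text))
    return counts


# Made-up stand-in for the corpus, modeled on the short texts above.
sample_texts = [
    "  (Music)    (Applause)  ",
    "  All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  ",
]

counts = count_parentheticals(sample_texts)
print(counts.most_common())
```

Running the same function over the full `texts` list would show which annotations dominate the corpus.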

***Oi!*** I just realized that the two regexes above do nearly the same thing: both match the parentheticals, and all I've really done is change from `findall` to `sub`. I need a way to pattern match the parentheses *out*.