# Talks as Performances & the Problem of Parentheticals

## Imports and Data

In [5]:
# Imports
import pandas as pd
import re

# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')

# Create a list of all the texts
texts = df.text.tolist()

## Exploration

Earlier explorations of the corpus revealed something we knew but had not realized could affect our work: some TED talks are not talks but musical performances. Generally, the texts of such performances are rather short. Using an arbitrary cutoff of `500` characters, we can see what these texts look like:

In [6]:
for text in texts:
    if len(text) < 500:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  
  (Music)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  


To get a list of the indices for the texts, substitute `texts.index(text)` for `text` as follows:

```python
for text in texts:
    if len(text) < 500:
        print(texts.index(text))
```

Here is the same thing in a list comprehension:

In [7]:
shorts = [ texts.index(text) for text in texts if len(text) < 500 ]
print(shorts)

[113, 235, 382, 496, 573, 799, 899, 1484, 1564]
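
One caveat with `texts.index(text)`: it returns the position of the first matching element, so two identical transcripts (plausible for performances that consist only of `(Music)` and `(Applause)`) would report the same index twice. A minimal sketch of the alternative with `enumerate`, using a made-up three-item `sample_texts` list rather than the corpus:

```python
# Hypothetical stand-in for the corpus: items 0 and 2 are identical short texts.
sample_texts = [
    "(Music) (Applause)",
    "A long talk. " * 50,   # 650 characters, so it is not "short"
    "(Music) (Applause)",
]

# list.index() returns the first match, so the duplicate is reported as 0 again.
via_index = [sample_texts.index(text) for text in sample_texts if len(text) < 500]

# enumerate() carries each item's actual position along with it.
via_enumerate = [i for i, text in enumerate(sample_texts) if len(text) < 500]

print(via_index)      # [0, 0]
print(via_enumerate)  # [0, 2]
```

On the nine short texts above the two approaches agree, because those texts are all distinct.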


When it comes time to process the words in a text, our best bet will be to remove the parentheticals. Keeping them, though, would let us explore sentiment, using `(Applause)` and `(Laughter)` as contextual valuations.

For now, we need a regex to remove the parentheses and their contents from our texts. An examination of `113` above reveals that it consists of only three parenthetical expressions:

    (Applause)    (Music)    (Applause)

We need a sample text that is a mix, and so we will use `235`:

In [18]:
print(texts[235])

  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  


Two different regexes give us one list without the parentheses and one with:

In [26]:
print(re.findall(r'(?<=\().*?(?=\))', texts[235]))
print(re.findall(r'\([^)]*\)', texts[235]))

['Whirring', 'Laughter', 'Music', 'Beatboxing', 'Applause']
['(Whirring)', '(Laughter)', '(Music)', '(Beatboxing)', '(Applause)']
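
The first pattern uses a lookbehind and a lookahead to keep the parentheses out of the match; the second matches the parentheses and everything between them that is not a closing parenthesis. As a sketch of a third option (not part of the original notebook), a capturing group returns the inner text directly, and a `Counter` tallies it. The `sample` string is a copy of text `235`:

```python
import re
from collections import Counter

sample = ("  Let's just get started here.    Okay, just a moment.    (Whirring)"
          "    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)"
          "    Thank you.    (Applause)  ")

# With one capturing group, findall() returns the group, not the whole match.
inner = re.findall(r'\(([^)]*)\)', sample)
print(inner)

# Lowercase before counting so "(Laughter)" and "(laughter)" are tallied together.
tally = Counter(label.lower() for label in inner)
print(tally)
```

Because `[^)]` also matches newlines, this form keeps working if a parenthetical wraps across lines, which `.*?` does not do by default.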


We could use sklearn's `CountVectorizer` to catch only these parentheticals!

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
vec = CountVectorizer(token_pattern = r'(?<=\().*?(?=\))')
X = vec.fit_transform(texts[230:240])
X.shape

(10, 6)

In [31]:
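# Use a new name so the corpus DataFrame `df` is not overwritten.
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
df_sample = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
df_sample.index = range(230, 240)
df_sample.head(10)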

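| | applause | beatboxing | laugher | laughter | music | whirring |
|---|---|---|---|---|---|---|
| 230 | 1 | 0 | 0 | 2 | 0 | 0 |
| 231 | 0 | 0 | 0 | 0 | 0 | 0 |
| 232 | 3 | 0 | 0 | 10 | 7 | 0 |
| 233 | 3 | 0 | 1 | 10 | 0 | 0 |
| 234 | 0 | 0 | 0 | 9 | 0 | 0 |
| 235 | 1 | 1 | 0 | 1 | 1 | 1 |
| 236 | 3 | 0 | 0 | 12 | 0 | 0 |
| 237 | 0 | 0 | 0 | 2 | 0 | 0 |
| 238 | 1 | 0 | 0 | 6 | 0 | 0 |
| 239 | 1 | 0 | 0 | 25 | 0 | 0 |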


In [33]:
parentheticals = vec.fit_transform(texts)
parentheticals.shape

(1747, 620)

In [36]:
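# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
df_parens = pd.DataFrame(parentheticals.toarray(), columns=vec.get_feature_names_out())
# Each row's index is the position of its text in `texts`.
df_parens.index = range(len(texts))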
df_parens.head(10)

| | "actually about ... 1%" | "although it's nothing serious, let's keep an eye on it to make sure it doesn't turn into a major lawsuit." | "close it!" | "do architects have ears?" | "i sold my soul for about a tenth of what the damn things are going for now." | "in order to remain competitive in today's marketplace, i'm afraid we're going to have to replace you with a sleezeball." | "intrigue and murder among 16th century ottoman court painters." | "kill him." | "michael crichton responds by fax:" | "sure" | ... | whistling | whoosh | with 4 attempts | woman screaming | woman: have you ever done a kissing test before? | woman: okay. | woo-hoo-hoo-hoo | xylophone | yelling more loudly | your fathers bristles white and stiff now |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Sigh, more TED randomness. Let's determine the top parentheticals, take a look at their numbers, and then decide on the best path:

In [49]:
sums = df_parens.sum(axis = 0)
sums.sort_values(ascending=False).head(25)

laughter            7275
applause            4271
music                424
video                269
laughs                41
applause ends         37
audio                 37
singing               35
music ends            33
cheers                30
cheering              21
recording             19
beatboxing            18
audience              15
guitar strum          14
clicks metronome      13
sighs                 13
guitar                13
marimba sounds        10
drum sounds           10
clicking               8
sings                  8
voice-over             8
in chinese             7
clapping               7
dtype: int64

Based on these results, the top 20 parentheticals could be inserted into a stopword list. Doing so would remove, especially in the case of the top 4, words or phrases that might otherwise affect results.

## Revised Frequencies without Parentheticals

### Experiment 1

For now, without better regex-fu, we are going to filter the texts and then pass them onto the vectorizer:

In [None]:
# For more information on what these steps do, see the cell in Frequencies above.
vec = CountVectorizer(token_pattern=r'(?<=\().*?(?=\))')
X2 = vec.fit_transform(texts[235:237])
X2.shape

pd.DataFrame(X2.toarray(), columns=vec.get_feature_names_out())

# Here's what the texts look like unfiltered:
for text in texts[235:237]:
    print(text[0:80])

# Here's the filter:
for text in texts[235:237]:
    retext = re.sub(r'\([^)]*\)', ' ', text)
    print(retext[0:80])

# This is a quick check of the list comprehension:
retexts = [ re.sub(r'\([^)]*\)', ' ', text) for text in texts[235:237]]

for text in retexts:
    print(text[0:80])

# Everything above in one compact block:

# Filter texts to remove parentheticals
retexts = [ re.sub(r'\([^)]*\)', ' ', text) for text in texts ]
# Load and run the vectorizer
vec1 = CountVectorizer()
X1 = vec1.fit_transform(retexts)
X1.shape

# with open('../output/word_freq_noparens.csv','w') as out:
#     csv_out = csv.writer(out)
#     csv_out.writerow(['word','count'])
#     for row in words_freq:
#         csv_out.writerow(row)

We have not dropped that many words, `50379 - 50316 = 63`, but the words we have removed were not spoken by the speakers. We now have a finished frequency count with which to begin the rest of our work. The commented-out lines at the end of the cell above were used to write a new CSV; they are commented out so that choosing the "run all" option does not risk overwriting the files.
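
The subtraction above compares vocabulary sizes before and after filtering. A rough standard-library sketch of the same comparison on a toy example follows. The two `toy_texts` strings are invented, and the `\b\w\w+\b` pattern is used here only to approximate the vectorizer's default tokenization:

```python
import re

# Hypothetical mini-corpus; "whirring" appears only inside a parenthetical.
toy_texts = [
    "Thank you. (Applause) Thank you very much.",
    "Okay, just a moment. (Whirring) All right. (Laughter)",
]

def vocabulary(docs):
    """Set of lowercased tokens of two or more word characters."""
    return {tok for doc in docs for tok in re.findall(r'\b\w\w+\b', doc.lower())}

# Same filter as in the cell above: drop parentheses and their contents.
filtered = [re.sub(r'\([^)]*\)', ' ', doc) for doc in toy_texts]

before = vocabulary(toy_texts)
after = vocabulary(filtered)
print(len(before) - len(after))   # 3
print(sorted(before - after))     # ['applause', 'laughter', 'whirring']
```

A word that occurs both inside and outside parentheses stays in the vocabulary, which may be one reason the drop in the real corpus is small.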

### Experiment 2

Further experiments in the parentheticals notebook (01-Terms-02-Parentheticals) revealed that a reasonable amount of speaker discourse is being parenthesized, which makes removing all parenthetical material less than optimal. However, the parentheticals appear to follow a Zipfian distribution, so we can effectively remove 80% of them by treating a relatively small number of them as stop words. We will need to do so in a pre-processing step.
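
To check a coverage claim of this kind, one could compute the cumulative share of parenthetical occurrences accounted for by the top *k* types. A sketch, assuming a hypothetical `counts` mapping whose numbers are invented and are not the corpus totals:

```python
from collections import Counter

# Invented counts for illustration only.
counts = Counter({"laughter": 60, "applause": 25, "music": 5,
                  "video": 4, "whirring": 3, "mock sob": 3})

def coverage(counts, k):
    """Fraction of all occurrences covered by the k most frequent types."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

for k in (1, 2, 4):
    print(k, round(coverage(counts, k), 2))
```

With the real data, `counts` could be built from the column sums of `df_parens`, for example `Counter(df_parens.sum(axis=0).to_dict())`.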

In [1]:
stopped_parens = [ "(laughter)", "(applause)", "(music)", "(video)", "(laughs)", 
                  "(applause ends)", "(audio)", "(singing)", "(music ends)", "(cheers)", 
                  "(cheering)", "(recording)", "(beatboxing)", "(audience)", 
                  "(guitar strum)", "(clicks metronome)", "(sighs)", "(guitar)", 
                  "(marimba sounds)", "(drum sounds)" ]
len(stopped_parens)

20

Success in this small-scale experiment: `(laughter)` has been removed and `(whirring)` remains. We will now scale this up to the entire corpus.

First, we work out what code will give us the results we want and then, in the next cell, we use it with the vectorizer. (José Blanco has a post on _Towards Data Science_ on ["Hacking Scikit-Learn’s Vectorizers"](https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af).)

In [None]:
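def paren_cleanse(text):
    # Keep letters, digits, apostrophes, and parentheses; everything else becomes a space
    words = re.sub(r"[^a-zA-Z0-9'()]", " ", text).lower().split()
    # Drop any token that matches a stopped parenthetical
    return ' '.join(word for word in words if word not in stopped_parens)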

clean_texts = [ paren_cleanse(text) for text in texts]
print(clean_texts[0][0:200])

Splitting on whitespace breaks up multi-word parentheticals: `'(mock', 'sob)'` is problematic. Also a bit of a problem: `sklearn` expects its input as strings. I have not found a way to bypass this, save writing custom preprocessors or tokenizers, *à la*:

```python
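def paren_cleanse(text):
    # Split into words first; iterating over the string itself would yield characters
    words = re.sub(r"[^a-zA-Z0-9'()]", " ", text.lower()).split()
    return " ".join(word for word in words if word not in stopped_parens)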

vec3 = CountVectorizer(preprocessor = paren_cleanse)
```
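
One way around the `'(mock', 'sob)'` problem, sketched here as an alternative rather than as the notebook's method, is to remove the stopped parentheticals as whole strings before any tokenizing happens. The function name `strip_stopped_parens` is hypothetical, and the stop list is shortened for the example:

```python
import re

# Shortened, illustrative stop list that includes a multi-word entry.
stopped = ["(laughter)", "(applause)", "(applause ends)", "(music)"]

# One alternation of escaped literals, matched case-insensitively.
pattern = re.compile("|".join(re.escape(p) for p in stopped), flags=re.IGNORECASE)

def strip_stopped_parens(text):
    """Remove listed parentheticals, then collapse the leftover whitespace."""
    return " ".join(pattern.sub(" ", text).split())

sample = "Thank you. (Applause ends) So. (Mock sob) (Laughter) Anyway."
print(strip_stopped_parens(sample))
# Thank you. So. (Mock sob) Anyway.
```

Because the match is on the full parenthetical, `(Applause ends)` is removed as a unit while the unlisted `(Mock sob)` survives intact. A function like this could be passed as `preprocessor=` to `CountVectorizer`, with the caveat that a custom preprocessor replaces the built-in lowercasing step.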

A simpler approach might be to feed the vectorizer the stopped parenthetical tokens as stop words. We do not know at what point in the process the words are removed, so one check will be to see whether the count of a word like "laughter" decreases from one feature set to the other.
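
A sketch of that check using only the standard library: count a word before and after removing the parenthetical and compare. The `sample` text is invented. With the default `token_pattern`, the vectorizer discards the parentheses themselves, so an annotated `(Laughter)` and a spoken "laughter" land in the same feature, which is why the difference between the two counts is informative:

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercased counts of tokens with two or more word characters."""
    return Counter(re.findall(r'\b\w\w+\b', text.lower()))

# Invented example: "laughter" is spoken once and annotated twice.
sample = "Laughter is contagious. (Laughter) I mean it. (Laughter)"
cleaned = re.sub(r'\(laughter\)', ' ', sample, flags=re.IGNORECASE)

before = word_counts(sample)["laughter"]
after = word_counts(cleaned)["laughter"]
print(before, after)   # 3 1
```

The same reasoning suggests that a stop word written as `"(laughter)"` would not match any token produced by the default pattern, while `"laughter"` would also remove the spoken word.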