# TED Talk Transcription Oddities

In [75]:
# IMPORTS
import re
import numpy as np, pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk import tokenize, text

In [4]:
# Load the data, which is partitioned by speaker gender:
df_male = pd.read_csv('talks_male.csv', index_col='Talk_ID')
df_female = pd.read_csv('talks_female.csv', index_col='Talk_ID')
df_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')

df_all = pd.concat([df_male, df_female, df_nog])

texts = df_all.text.tolist()

print(f"From our {df_all.shape[0]} x {df_all.shape[1]} CSV, \
we have a list of {len(texts)} talks.")

From our 992 x 14 CSV, we have a list of 992 talks.


In [5]:
# Default vectorizer: lowercases, treats punctuation as a token separator
# (which splits contractions), keeps tokens of 2+ characters, removes no stopwords
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
X.shape

(992, 39515)

In [109]:
df_all_words = pd.DataFrame(X.toarray(), 
                            columns = vectorizer.get_feature_names_out())

wc_all = df_all_words.sum()
wc_all.head(10)
wc_all.to_csv('../output/gender-part-word_counts.csv')

## Pause Fillers and Interjections: Variations on "ah"

The words from a search for "ah" were: 

    aaaah, aah, ah, ahh, ahhh
    
The words from a search for "aa" were: 

    aa, aaa, aaaa, aaaaa, aaaah, aah
    
Code used:

```python
for match in wc_all.index:
    if "aa" in match:
        print(match)
```
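
The two substring searches can also be folded into one regular expression. What follows is a minimal sketch, not the code used above: the `vocab` list is a hypothetical stand-in for `wc_all.index`, and the pattern assumes we only want words made of a run of "a" followed by an optional run of "h". Using `fullmatch` keeps out words that merely contain "ah" or "aa".

```python
import re

# Hypothetical stand-in for wc_all.index (the real vocabulary is far larger).
vocab = ["aa", "aaa", "aaaah", "aah", "ah", "ahh", "ahhh",
         "yeah", "aardvark", "utah"]

# One or more "a", then zero or more "h", and nothing else in the word.
ah_pattern = re.compile(r"a+h*")

ah_variants = [word for word in vocab if ah_pattern.fullmatch(word)]
print(ah_variants)
```

With the real vocabulary, the same list comprehension would run over `wc_all.index`.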

In [102]:
ahs_list = ["aa", "aaa", "aaaa", "aaaaa", "aah", 
            "aaaah", "ah", "ahh", "ahhh" ] 

ahs = wc_all.filter(items = ahs_list, axis=0)
ahs.head(len(ahs_list))

aa        11
aaa        6
aaaa       2
aaaaa      1
aah       10
aaaah      2
ah       116
ahh        6
ahhh       1
dtype: int64

In [76]:
alltexts = ' '.join(texts).lower()
t = tokenize.WhitespaceTokenizer()
corpse = text.Text(t.tokenize(alltexts))
print(alltexts[0:100])

  thank you so much, chris. and it's truly a great honor to have the opportunity to come to this sta


In [84]:
corpse.concordance("ahh")

no matches


<div class="alert alert-block alert-warning">
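There appears to be a tokenizer mismatch between scikit-learn, from which the list of <b>ah</b>s above is drawn, and the NLTK tokenizer, which is not finding the same tokens. A likely cause is that the <code>WhitespaceTokenizer</code> leaves punctuation attached to words, so a token such as <code>ahh,</code> never matches a search for <code>ahh</code>.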
</div>
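
To see how the two tokenizations can disagree, here is a small sketch on an invented sentence. It uses only `re` and `str.split()`: the regex is scikit-learn's documented default `token_pattern`, and `str.split()` is an approximation of what NLTK's `WhitespaceTokenizer` does. The sample sentence is an assumption made for illustration.

```python
import re

# Invented sample; not taken from the corpus.
sample = "and then, ahh, we all said (Laughter) ahh!"

# Whitespace splitting keeps punctuation attached to each token.
whitespace_tokens = sample.lower().split()

# scikit-learn's default token_pattern: runs of two or more word characters.
sklearn_like_tokens = re.findall(r"(?u)\b\w\w+\b", sample.lower())

print(whitespace_tokens)
print(sklearn_like_tokens)
print(whitespace_tokens.count("ahh"), sklearn_like_tokens.count("ahh"))
```

If this is what is happening in the corpus, tokenizing with a regex before building the NLTK `Text` object should bring the two counts closer together.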

### Word Count - Sorted

Since we have a series of word counts, it is easy enough to sort it for a quick overview of the most frequent words in the corpus:

In [24]:
wc_sorted = wc_all.sort_values(ascending=False, inplace=False)
wc_sorted.head(10)

the     93853
and     67710
to      57089
of      52313
that    44087
it      35339
in      34728
you     34162
we      30407
is      28569
dtype: int64

## Performatives and Other Parenthetical Expressions <a class="anchor" id="performatives"></a>

Earlier explorations of the corpus revealed something we knew but had not realized could affect our work: some TED talks are not talks but musical performances. Generally, the texts of such performances are rather short. Using an arbitrary cutoff of `500` characters, we can see what these texts look like:

In [85]:
shorts = [ i for i, talk in enumerate(texts) if len(talk) < 500 ]
print(shorts)

[84, 183, 297, 388, 602, 983, 991]


In [106]:
print(f"{shorts[0]}:\n{texts[shorts[0]]}\n\n{shorts[1]}:\n{texts[shorts[1]]}")

84:
  (Applause)    (Music)    (Applause)  

183:
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  


As the two examples above reveal, some talks are not talks at all and contain no text except for the performatives, while other talks contain non-textual material, some of which is part of the performance and some of which is the audience's response to the performance. 

When it comes time to process words in a text, we think the best bet is to remove the performatives. We do think, however, that having them means we can possibly explore sentiment using `(Applause)` and `(Laughter)` as contextual valuations.
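
As a first step toward using audience reactions as contextual signals, the markers can be counted per talk before they are removed. This is only a sketch: `sample_texts` is a hypothetical stand-in for `texts`, and `count_marker` is a helper name introduced here for illustration.

```python
import re

# Hypothetical mini-corpus standing in for `texts`.
sample_texts = [
    "Thank you. (Laughter) It is great to be here. (Laughter) (Applause)",
    "(Applause) (Music) (Applause)",
    "No audience reaction was transcribed for this one.",
]

def count_marker(talk, marker):
    """Count case-insensitive occurrences of a parenthetical such as (Laughter)."""
    pattern = r"\(" + re.escape(marker) + r"\)"
    return len(re.findall(pattern, talk, flags=re.IGNORECASE))

# One dictionary of reaction counts per talk, in corpus order.
reactions = [
    {"laughter": count_marker(talk, "laughter"),
     "applause": count_marker(talk, "applause")}
    for talk in sample_texts
]
print(reactions)
```

Counts like these could later be joined to the talk metadata by position in the list.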

For now, we will need some regex to remove the parentheses and their contents from our texts. An examination of `84` above reveals that it consists of only three parenthetical expressions:

    (Applause)    (Music)    (Applause)

We need a sample text that is a mix, and so we will use `183` from above:

> Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)

A quick check using two regexes gives us one list without the parentheses and one with:

In [110]:
print(re.findall(r'(?<=\().*?(?=\))', texts[183]))
print(re.findall(r'\([^)]*\)', texts[183]))

['Whirring', 'Laughter', 'Music', 'Beatboxing', 'Applause']
['(Whirring)', '(Laughter)', '(Music)', '(Beatboxing)', '(Applause)']
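
The second pattern can be passed to `re.sub` to strip every parenthetical in one pass. The sketch below reuses the sample text quoted above. Note the assumption built into it: it removes all parenthesized material indiscriminately, which may be too aggressive if some parentheses hold spoken or quoted words.

```python
import re

sample = ("Let's just get started here.    Okay, just a moment.    (Whirring)    "
          "All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    "
          "Thank you.    (Applause)")

# Replace each parenthetical with a space, then collapse runs of whitespace.
without_parens = re.sub(r"\([^)]*\)", " ", sample)
cleaned = re.sub(r"\s+", " ", without_parens).strip()
print(cleaned)
```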


We can also use `CountVectorizer` to inventory all the parenthetical expressions in the corpus to see if we are missing anything. 

First, we test it on a known text:

In [118]:
vec = CountVectorizer(token_pattern = r'(?<=\().*?(?=\))')
X183 = vec.fit_transform(texts[183:184])
df183 = pd.DataFrame(X183.toarray(), columns=vec.get_feature_names_out())
df183.index = [183]  # label the single row with its position in texts
df183.head()

|     | applause | beatboxing | laughter | music | whirring |
|-----|----------|------------|----------|-------|----------|
| 183 | 1 | 1 | 1 | 1 | 1 |


In [119]:
parentheticals = vec.fit_transform(texts)
parentheticals.shape

(992, 449)

In [122]:
df_parens = pd.DataFrame(parentheticals.toarray(), 
                         columns=vec.get_feature_names_out())
# Index rows by their position in texts
df_parens.index = range(len(texts))
df_parens.head()

Unnamed: 0,"""although it's nothing serious, let's keep an eye on it to make sure it doesn't turn into a major lawsuit.""","""close it!""","""i sold my soul for about a tenth of what the damn things are going for now.""","""in order to remain competitive in today's marketplace, i'm afraid we're going to have to replace you with a sleezeball.""","""intrigue and murder among 16th century ottoman court painters.""","""michael crichton responds by fax:""","""the end"" by the doors","""what's a jurassic park?""","""wow! fucking fantastic jacket""","""yes, books. you know, the bound volumes with ink on paper. you cannot turn them off with a switch. tell your kids.""",...,whistles,whistling,whoosh,woman screaming,woman: have you ever done a kissing test before?,woman: okay.,woo-hoo-hoo-hoo,xylophone,yelling more loudly,your fathers bristles white and stiff now
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


With all the parenthetical expressions captured, we can determine the most frequent, take a look at their numbers, and then decide what's the best path:

In [123]:
sums = df_parens.sum(axis = 0)
sums.sort_values(ascending=False).head(25)

laughter           4625
applause           2446
music               297
video               185
audio                33
music ends           29
singing              29
applause ends        21
laughs               16
cheers               14
guitar strum         14
guitar               13
sighs                13
drum sounds          10
marimba sounds       10
sings                 8
ball squeaks          7
clicking              7
drum sound            6
drum roll             6
cheering              6
audio: laughing       6
piano                 6
beep                  6
ss's voice            5
dtype: int64

Based on these results, the top 20 parentheticals could be added to a stopword list. Removing them, the top four especially, strips out words and phrases that might otherwise skew the word counts.

The code below can be included in `CountVectorizer` as a preprocessor:

In [125]:
# A Refined Preprocessor --
# This version also removes two-word parentheticals

parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)",  
                  r"\(video\)", r"\(laughs\)", r"\(applause ends\)", 
                  r"\(audio\)", r"\(singing\)", r"\(music ends\)", 
                  r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", 
                  r"\(marimba sounds\)", r"\(drum sounds\)" ]

def remove_parentheticals(text):
    # A custom preprocessor replaces CountVectorizer's own lowercasing step,
    # so the text is lowercased here before the patterns are applied.
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text

A quick test of the function:

In [126]:
test = """Laughter is the best medicine. (Laughter) 
Hold your applause; I'm not done yet. (Applause ends)"""

print(remove_parentheticals(test))

laughter is the best medicine.   
hold your applause; i'm not done yet.  


In [127]:
the_vec = CountVectorizer( preprocessor = remove_parentheticals,
                          max_df = 0.9, min_df = 2 )
the_X = the_vec.fit_transform(texts)
the_X.shape

(992, 22389)
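
The hand-written pattern list could also be generated from the frequency table, so that the stopword list tracks the data. The following is a sketch under assumptions: `top_labels` is a hand-picked subset standing in for the top of `sums`, and `strip_listed_parentheticals` is a hypothetical name. `re.escape` guards against labels that contain regex metacharacters.

```python
import re

# Stand-in for the most frequent labels; in the notebook these could come
# from the sorted `sums` series.
top_labels = ["laughter", "applause", "music", "video",
              "music ends", "applause ends"]

# One compiled alternation: a literal "(", any listed label, a literal ")".
paren_pattern = re.compile(
    r"\((?:" + "|".join(re.escape(label) for label in top_labels) + r")\)",
    flags=re.IGNORECASE,
)

def strip_listed_parentheticals(talk):
    """Lowercase the text and blank out only the listed parentheticals."""
    return paren_pattern.sub(" ", talk.lower())

result = strip_listed_parentheticals(
    "Hold your applause. (Applause ends) (Whirring)")
print(result)
```

Parentheticals that are not in the list, such as `(Whirring)` here, are left in place, which matches the behavior of the list-based preprocessor.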

## Numbers

One of the dimensions of the corpus that arises out of a hand inspection of the terms is the frequency with which some numbers appear. The following table captures the top ten numbers:

| TERM | FREQUENCY |
|------|-----------|
| 000  | 2098 |
| 10   | 1691 |
| 20   | 1107 |
| 100  |  902 |
| 30   |  827 |
| 50   |  784 |
| 15   |  659 |
| 40   |  494 |
| 12   |  460 |
| 25   |  410 |

Other frequently occurring numbers: 60, 500, 200, 11, 18, 80, 14 (241 times!). 

To examine the numbers in context, we join the list of strings, `texts`, into one long sequence of tokens: which talk a number appears in matters less than its immediate context. 

In [6]:
onetext = nltk.Text('\n'.join(texts).split())
# And here's what an NLTK text object looks like: a list of words, really
print(onetext[0:10])

['Thank', 'you', 'so', 'much,', 'Chris.', 'And', "it's", 'truly', 'a', 'great']


In [7]:
onetext.concordance("000")

no matches


In [8]:
onetext.concordance("10")

Displaying 25 of 1216 matches:
Thank you very much. (Applause) About 10 years ago, I took on the task to teac
tion of income of people. One dollar, 10 dollars or 100 dollars per day. There
 a long time, but they come out after 10 years very, very differently. And the
at drives you in your life today? Not 10 years ago. Are you running the same p
really heavy, but in the last five or 10 years, have there been some decisions
. (Laughter) Are you sure? (Laughter) 10 seconds! (Laughter and applause) 10 s
) 10 seconds! (Laughter and applause) 10 seconds, I want to be respectful. All
principle in the Bible that says give 10 percent of what you get back to chari
ional shelter that would last five to 10 years, that would be placed next to t
tandards of five billion people? With 10 million solutions. So I wish to devel
 to go see Central Command, which was 10 minutes away. And that way, I could g
 will not launch this without five to 10 million units in the first run. And t
 down, and that's why

In [15]:
onetext.concordance("40")

Displaying 25 of 387 matches:
w York City already handled more than 40 percent of the entire city's commerci
ing rooms, whose evolution in 20, 30, 40 years we can't predict. So that liter
nd all the other teams have done this 40 Days of Purpose, based on the book. A
nternet tools, and we ended up having 40 chapters starting up, thousands of ar
cumented the Lower Ninth for the last 40 years. That was their home, and these
me. And a long time ago — well, about 40 years ago — my mom had an exchange st
 world where women and children spend 40 billion hours a year fetching water. 
 age category of 76 to 85, as much as 40 percent of people have nothing really
things tend to happen every 25 years. 40 years long, with an overlap. You can 
 all high-rises. So they'll put 20 or 40 up at a time, and they just go up in 
te, we've seen no side effects in the 40 or so patients in whom it's been impl
 terms of price performance, that's a 40 to 50 percent deflation rate. And eco
 people may increase t

A couple of things to note here:

First, there is a discrepancy between the counts from `sklearn` and those from NLTK: the former counted 2098 occurrences of `000`, the latter none. Every other count shows a similar mismatch:

| TERM | `sklearn` | `nltk` |
|------|-----------|--------|
| 000  | 2098 | "no match" |
| 10   | 1691 | 1216 |
| 20   | 1107 | 879 |
| 100  |  902 | 647 |
| 30   |  827 | 650 |
| 50   |  784 | 594 |
| 15   |  659 | 512 |
| 40   |  494 | 387 |
| ...  |  ... | ... |
| 14   |  241 | 148 |

I don't have a ready explanation for this.

Second, the frequencies of some numbers are readily explained:

* Round numbers like 10, 20, 30, 50, and 100 are approximations -- though it would be interesting to explore how often they are attached to large scalars like "thousand" or "million." 
* Some numbers seem to represent alternate ways of counting: 25 regularly stands in for "one-quarter" -- though not as often as we might imagine -- and 18 is regularly paired with *month* as a more precise way to say "a year and a half."
* There are some numbers, like 11 and 14, which seem to have power all their own, perhaps tied to particular ages in humans. 

Next up is some code to explore the words that most commonly occur with these numbers.

In [10]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()

All my searches for "collocations with specific words" took me to the NLTK, which means, so far as I can tell, generating all the bigrams and then filtering to get the one(s) you want. This seems backwards to me: wouldn't it be faster simply to find the word and then what comes after it? I'll take a look at regex for this later.
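
Here is a rough sketch of that regex idea: find the number as a whole token and capture the word that follows it. The sample string and the `following_words` helper are assumptions made for illustration; on the corpus, the input would be the joined `texts`.

```python
import re
from collections import Counter

# Invented sample; "140" is included to show that partial matches are skipped.
sample = ("She was 14 years old. It took 14 hours, then 14 years. "
          "About 14 billion years ago, and 140 years later.")

def following_words(talk, target):
    """Count the words that immediately follow `target` as a whole token."""
    pattern = r"\b" + re.escape(target) + r"\s+([A-Za-z']+)"
    return Counter(word.lower() for word in re.findall(pattern, talk))

after_14 = following_words(sample, "14")
print(after_14.most_common(3))
```

This only looks to the right of the number. Catching words that precede it would need a second pattern with the capture group placed before the target.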

In [11]:
## Bigrams
finder = BigramCollocationFinder.from_words(onetext)

In [12]:
## Here's the filter operation:
the_number = lambda *w: '14' not in w
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain the number
finder.apply_ngram_filter(the_number)
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[('14', 'years'), ('14', 'billion'), ('was', '14'), ('14', 'years,'), ('14', 'hours'), ('14', 'orders'), ('14', 'million'), ('14', 'percent'), ('14', 'feet'), ('14', 'times')]


This does not return a count. *Oi!*
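
One way to get counts is to skip the scoring step and count adjacent pairs directly. This is a plain-Python sketch rather than the NLTK route: `tokens` is a hypothetical stand-in for the token list behind `onetext`. NLTK's finder also keeps its raw bigram frequencies in `finder.ngram_fd`, which may be worth inspecting as an alternative.

```python
from collections import Counter

# Hypothetical token list standing in for the tokens in `onetext`.
tokens = "he was 14 years old and she was 14 too , 14 years ago".split()

# Count every adjacent pair, then keep only the pairs that contain the number.
bigram_counts = Counter(zip(tokens, tokens[1:]))
with_number = {pair: n for pair, n in bigram_counts.items() if "14" in pair}

# Print the pairs from most to least frequent.
for pair, n in sorted(with_number.items(), key=lambda item: -item[1]):
    print(pair, n)
```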

### Trial 2

In [13]:
the_one = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).lower().split())
# And here's what an NLTK text object looks like: a list of words, really
print(the_one[0:10])

['thank', 'you', 'so', 'much', 'chris', 'and', "it's", 'truly', 'a', 'great']


In [16]:
the_one.concordance("40")

Displaying 25 of 494 matches:
oking for a place to eat we were on i 40 we got to exit 238 lebanon tennessee 
w york city already handled more than 40 percent of the entire city's commerci
eading rooms whose evolution in 20 30 40 years we can't predict so that litera
nd all the other teams have done this 40 days of purpose based on the book and
internet tools and we ended up having 40 chapters starting up thousands of arc
cumented the lower ninth for the last 40 years that was their home and these a
e time and a long time ago well about 40 years ago my mom had an exchange stud
 world where women and children spend 40 billion hours a year fetching water t
be someone coming to rescue me cut to 40 some odd years later we go to kenya a
t age category of 76 to 85 as much as 40 percent of people have nothing really
 is how do you go to the loo at minus 40 ben i've read somewhere that at minus
ben i've read somewhere that at minus 40 exposed skin becomes frostbitten in l
ou answer the call of 

Well, there's the missing `000`! It comes from TED's transcription practice of rendering a number like "sixty thousand" as "60,000": once the comma is treated as a separator, the token splits into `60` and `000`. 
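
The effect is easy to reproduce on an invented sentence. The sketch below compares whitespace splitting with scikit-learn's default `token_pattern`, and it suggests that the same comma handling likely accounts for the higher `sklearn` counts of numbers like `10`, since the `10` in "10,000" is counted as well.

```python
import re

# Invented sample; not taken from the corpus.
sample = "About 60,000 people came; 10 stayed, and 10,000 watched online."

# Whitespace tokens keep "60,000" and "10,000" whole.
whitespace_tokens = sample.split()

# The default scikit-learn pattern splits at the comma.
sklearn_like_tokens = re.findall(r"(?u)\b\w\w+\b", sample)

print(whitespace_tokens.count("000"), sklearn_like_tokens.count("000"))
print(whitespace_tokens.count("10"), sklearn_like_tokens.count("10"))
```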

One thing we know now: reporting large numbers is a part of TED talks.

**TO DO**: How to keep the comma marker between numbers? (Or should we just look to 000 as a possible collocate with the other numbers?) One solution from the [Regex Cookbook][]:

```python
\b[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?\b|\.[0-9]+\b
```
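
A quick trial of that pattern in Python, as a sketch: the groups are rewritten as non-capturing so that `re.findall` returns whole matches, and the sample sentence is invented. One caveat worth checking against the corpus is that an unpunctuated four-digit number such as a year does not match this pattern at all.

```python
import re

# The Regex Cookbook pattern with its groups made non-capturing.
number_pattern = re.compile(
    r"\b[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\b|\.[0-9]+\b"
)

sample = "About 60,000 people, 3.5 percent, or 1,234,567 units."
numbers = number_pattern.findall(sample)
print(numbers)

# Four digits with no comma: the pattern finds nothing.
print(number_pattern.findall("in 2015"))
```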

[Regex Cookbook]: https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch06s11.html