Focusing only on talks sponsored by the TED organization itself, and not any of the TEDx talks, we have a total of 1747 texts. This is a small corpus, and so the usual reasons for shrinking the feature set do not (necessarily) apply. Still, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results are comparable to the work of others and so that we could scale up the work here without having to rethink the foundations. At the same time, we can enjoy taking our time to explore the methodological and topical issues that a small corpus presents.

**Table of Contents**

* [Summary](#summary)
* [Tokenization(s)](#tokenizations)
* [Vectorizer Variations](#variations)
  - Frequencies of Terms with a 2-Document Minimum
  - Frequencies of Terms Appearing in All/Most Documents
  - [One Doc Wonders](#wonders)
    - A Very Peculiar Vector: `max_df = 1`
    - Words That Appear More Than Once But Only in One Document
* [Transcription Artifacts](#artifacts)
  - Parentheticals
  - Numbers
  - Other Artifacts
* [Conclusions](#conclusions)

# Summary <a class="anchor" id="summary"></a>

Summing the words to get a total for each word across all talks in the dataset, we then hand-inspect the totals and discover that there is an assortment of numbers that recur: an exploration of that phenomenon appears in a separate notebook. The primary task of this notebook is to count the words in the text, which means addressing the matter of parentheticals head on. Much of this work occurs in a separate notebook, but everything comes down to the fact that there are two kinds of parenthetical expressions in TED talks:

* those which indicate actions and events outside the speaker's discourse, such as the audience laughing or applauding or the speaker sighing or playing music. 
* those which the TED transcription practices have somehow marked as digressive material within the speaker's own speech.

We want to remove the former and keep the latter. In the end, the decision is made to remove, through `regex` matching, the top 20 parentheticals and to leave the others, most of which occur fewer than a dozen times. The effect of the parentheticals on the overall study is thus reduced significantly.

After removing the parentheticals, a second term matrix is derived with a total vocabulary of 50377. (This after discovering a slight bug, or at least weirdness, in `sklearn` that is now in their bug-tracking system.) 

The final step in this notebook is to raise the threshold for a word to be included in the TF-matrix to 2 documents. With that done, we convert the matrix to a pandas dataframe and then save to CSV. (QUESTION: better/more efficient to save as pickle?)

In [1]:
# IMPORTS
import re 
import nltk
import numpy as np, pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# DATA
df = pd.read_csv('../output/TEDall_speakers.csv') # Skipping `index_col='Talk_ID'`

# GRAB the talks and create a list of strings
texts = df.text.tolist()

print(f"From our {df.shape[0]}x{df.shape[1]} CSV, \
we have a list of {len(texts)} talks.")

From our 1747x27 CSV, we have a list of 1747 talks.


In [3]:
# Remove quotes at top and bottom to uncomment this block

""" 
# Loading the Data in a gendered partitioned fashion: 

# Load binary gendered talks 
talks_male = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_female = pd.read_csv('talks_female.csv', index_col='Talk_ID')

# No one gender ==> NOG
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')

all_talks = pd.concat([talks_male,talks_female,talks_nog])

texts = all_talks.text.tolist()

print(f"From our {all_talks.shape[0]}x{all_talks.shape[1]} CSV, \
we have a list of {len(texts)} talks.")

"""


# Tokenization(s) <a class="anchor" id="tokenizations"></a>

Before we can count words and establish frequencies, we need to settle upon what we are going to consider words, which means determining our method of tokenizing our strings of characters into lists of tokens.

- The first tokenizer is a regex that John has long used in order to keep contractions as single words, but it throws away all punctuation other than the apostrophe, along with numbers, which, as we will see later in this notebook, feature prominently in the TED talk transcripts.
- The NLTK's `word_tokenize()` function is based on the `TreebankWordTokenizer`: it tokenizes text as in the Penn Treebank, which means contractions are broken into their distinct parts (e.g., `I'm` becomes `I` + `'m`). By contrast, `wordpunct_tokenize()` is a regex that splits off the apostrophe of a contraction as its own token (e.g., `I'm` becomes `I` + `'` + `m`).
- SciKit-Learn's tokenization comes out the leanest.
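
To make the first two differences concrete, here is a minimal sketch that applies the contraction-preserving regex and a stand-in for `wordpunct_tokenize()` to a made-up sentence. The stand-in pattern `\w+|[^\w\s]+` is the one NLTK documents for its `WordPunctTokenizer`; the sample sentence and the variable names are illustrative only and do not come from the corpus.

```python
import re

sample = "I'm on I-40, aren't I?"  # made-up sentence, not from the corpus

# The contraction-preserving regex: everything except letters and
# apostrophes becomes a space, so "I-40" loses its hyphen and its digits.
kept_contractions = re.sub("[^a-zA-Z']", " ", sample).lower().split()

# Stand-in for nltk.wordpunct_tokenize(): runs of word characters or runs
# of punctuation, so each apostrophe ends up as its own token.
wordpunct_like = re.findall(r"\w+|[^\w\s]+", sample.lower())

print(kept_contractions)
print(wordpunct_like)
```

The first list keeps `i'm` and `aren't` whole but has no trace of `40`; the second keeps `40` but scatters the contractions across three tokens each.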

In [4]:
# REGEX
regex = [word for word in re.sub("[^a-zA-Z']"," ", texts[0]).lower().split()]

# NLTK
w_tokens = [word.lower() for word in nltk.word_tokenize(texts[0])]
wp_tokens = [word.lower() for word in nltk.wordpunct_tokenize(texts[0])]

# SciKit-Learn
vectorizer = CountVectorizer( lowercase = True ) # We are vectorizing
x = vectorizer.fit_transform([texts[0]])         # the same text as above
count = np.sum(x.toarray(), axis = 1)            # then summing the freq count

# Print to Compare
print(f"regex:       {len(regex)}")
print(f"nltk words:  {len(w_tokens)}")
print(f"nltk wpunct: {len(wp_tokens)}")
print(f"scikit:      {count[0]}")

regex:       2146
nltk words:  2576
nltk wpunct: 2686
scikit:      2035


When examined as sets of unique tokens, the differences do not seem so great: a matter of a little more than a dozen words, as seen in the first cell below. In the following cell we take a look at the tokens involved; note that the NLTK tokenizer, unlike scikit-learn's, keeps hyphenated words as single tokens.

In [5]:
print(f"METHOD : TOKEN SET")
print(f"regex  :  {len(set(regex))}")
print(f"NLTK   :  {len(set(w_tokens))}")
print(f"SciKit :  {x.shape[1]}")

METHOD : TOKEN SET
regex  :  727
NLTK   :  725
SciKit :  711


In [6]:
difference = set(w_tokens) - set(vectorizer.get_feature_names_out())
print(difference)

{'(', 'i-40', 're-brand', 'short-term', 'end-use', 'g-v', "'s", ';', '—', "'ll", 'beach-combing', 'a', "'", 'rear-view', 'low-cost', '?', 'limited-use', 'ok.', '``', '28-second', 'does', 'click-through', 'u.s.', 'has', "'m", "'ve", '39.5', ')', ',', 'climatecrisis.net', "'my", 'consumer-friendly', 'low-hanging', 'i', ':', 'hot-button', '30-second', 'carbon-neutral', "'re", "'d", "'named", '!', '2.0', "n't", 're-purposed', "''", '.'}


# Vectorizer Variations <a class="anchor" id="variations"></a>

With the short run of experiments above establishing SciKit-Learn's tokenizer as sufficient, we first run `CountVectorizer` unadorned. 

The default options are: 

- lowercase everything, 
- get rid of all punctuation, 
- make a word out of any run of two or more alphanumeric characters, 
- split contractions, and 
- no stopwords.

The tokenizer is not without its problems: while it breaks contractions at the apostrophe, like NLTK, it then throws away any token shorter than two characters, which means `I'm` disappears entirely. And pity the indefinite article *a*, which is pitched while *an* and the definite article *the* remain.
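
The effect is easy to reproduce outside the vectorizer. The sketch below applies `(?u)\b\w\w+\b`, the pattern scikit-learn documents as the `CountVectorizer` default for `token_pattern`, directly with `re`. The helper name `sklearn_like_tokens` and the sample sentence are made up for illustration; this only mimics the tokenizing step, not the rest of the vectorizer.

```python
import re

# The default token_pattern documented for CountVectorizer:
# a token is a run of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def sklearn_like_tokens(text):
    """Lowercase the text, then keep only runs of 2+ word characters."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens = sklearn_like_tokens("I'm a fan; it isn't the end.")
print(tokens)
```

Neither half of `I'm` survives, `a` is gone, and `isn't` is reduced to `isn`.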

In [7]:
# We are going with the defaults, 
# so no options/arguments are being passed:
vectorizer = CountVectorizer()

# fit the model to the data 
# vecs = vectorizer.fit(texts)
X = vectorizer.fit_transform(texts)

# see how many features we have
X.shape

(1747, 50379)

50,379 tokens is our raw, unfiltered, no pre-processing baseline. It includes, as we will see, a number of artifacts of the TED transcription process, including a variety of ways to spell out *ah*, the use of numerals for a host of things -- from dates to counts and everything in between, and some things which actually take place outside the talk itself, like audience laughter, which we here term *parentheticals*.

In [8]:
raw_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
raw_series = raw_df.sum()

In [9]:
raw_series.head()

00             15
000          2098
000000004       1
0000001         1
000001          1
dtype: int64

Over two thousand occurrences of `000`! That's unexpected (but not unexplainable). We explore this oddity among other transcription artifacts below. 

## Frequencies of Terms with a 2-Document Minimum

Setting `min_df = 2` removes all words that appear in only one document. (An exploration of *one document wonders* appears below.) As we can see from the `shape` below, we dropped from 50379 features to 29343, a reduction of 42% of the possible features.

In [10]:
min2_vec = CountVectorizer(min_df = 2)
min2_X = min2_vec.fit_transform(texts)
min2_X.shape

(1747, 29343)
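
What `min_df = 2` filters on is document frequency, not total frequency: a term repeated fifty times in a single talk is still dropped. Here is a minimal sketch of that idea on three toy documents, using the same two-or-more-characters token rule as above; the documents and the names `doc_freq`, `kept`, and `dropped` are assumptions for illustration.

```python
import re
from collections import Counter

docs = [  # toy documents, not from the corpus
    "telomeres shorten with age",
    "age is just a number",
    "fonio is a grain",
]

def tokenize(text):
    """Mimic the default token rule: runs of 2+ word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Document frequency: the number of documents in which a term appears,
# counted once per document no matter how often it repeats there.
doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))

kept = sorted(term for term, n in doc_freq.items() if n >= 2)    # like min_df = 2
dropped = sorted(term for term, n in doc_freq.items() if n < 2)  # like max_df = 1

print(kept)
print(dropped)
```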

As above we can sum our terms and write the results to a CSV file for closer inspection. Here, we highlight just a particular series of words.

In [11]:
# Create the dataframe
min2_df = pd.DataFrame(min2_X.toarray(), columns = min2_vec.get_feature_names_out())

# Create a series of sums
min2_series = min2_df.sum()

# Write sums to CSV:
# min2_series.to_csv('../output/word_freq_clean.csv')

# Inspect the sums
min2_series[1000:1005]

affiliations    3
affinity        3
affirm          7
affirmation     4
affirmations    3
dtype: int64

## Frequencies of Terms Appearing in All/Most Documents

Having trimmed terms that occur in only one document from our feature set, we also need to determine what words occur across all the documents and thus are not particularly effective discriminators of topic. 

Note that this is a somewhat more dangerous move to make, since it has been shown that one of the keys to genre and gender in documents is the varying use of so-called function words. In fact, as our explorations with sklearn's built-in options establish, removing common words does not result in much of a reduction in the overall feature set: only 8 words occur in 99% or more of the documents, and that number rises only to 47 when we lower the threshold to 90%.

In [12]:
alldocs_vec = CountVectorizer( min_df = 0.99 )
alldocs_X = alldocs_vec.fit_transform(texts)
alldocs_X.shape

(1747, 8)

In [13]:
alldocs_df = pd.DataFrame(alldocs_X.toarray(), columns = alldocs_vec.get_feature_names_out())
alldocs_series = alldocs_df.sum()
alldocs_series

and    118989
in      62673
is      50072
it      59191
of      92416
the    166093
to     102276
you     56296
dtype: int64

In [14]:
mostdocs_vec = CountVectorizer( min_df = 0.90 )
mostdocs_X = mostdocs_vec.fit_transform(texts)
mostdocs_X.shape

(1747, 47)

In [15]:
# mostdocs_df = pd.DataFrame(mostdocs_X.toarray(), columns = mostdocs_vec.get_feature_names_out())
# mostdocs_series = mostdocs_df.sum()
# mostdocs_series.head(47)

This last run simply verifies that the results obtained above with `min_df` hold when we switch to `max_df`: that is, that our use of the parameters was not asking the code for something unexpected and thereby generating inexplicable and undesirable results.

In [16]:
# tnt = topsntails
tnt_vec = CountVectorizer( max_df = 0.9, min_df = 2 )
tnt_X = tnt_vec.fit_transform(texts)
tnt_X.shape

(1747, 29296)

The numbers add up. With that done, my recommendation is that we do not, for the time being, throw away common and/or function words: that is, let's not use a stopword list -- at least not in the usual way. Those words could be important in other ways, and dropping them has only a limited impact on the actual document-term matrix, one that pales by comparison to the words that occur only in one document.
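
The arithmetic behind "the numbers add up" can be checked directly from the shapes reported in this notebook, including the `max_df = 1` run in the next section. The variable names are made up for this check; the figures are the ones printed by the cells.

```python
baseline = 50379        # CountVectorizer() with defaults
one_doc_only = 21036    # max_df = 1 (the one-document terms, next section)
at_least_two = 29343    # min_df = 2
in_most_docs = 47       # min_df = 0.90
tops_and_tails = 29296  # max_df = 0.9, min_df = 2

# Dropping one-document terms from the baseline gives the min_df = 2 matrix.
assert baseline - one_doc_only == at_least_two

# Dropping the near-universal terms from that gives the final matrix.
assert at_least_two - in_most_docs == tops_and_tails

share_one_doc = one_doc_only / baseline
share_common = in_most_docs / baseline
print(f"one-document terms: {share_one_doc:.1%} of the baseline vocabulary")
print(f"terms in 90%+ of talks: {share_common:.2%} of the baseline vocabulary")
```

The two shares make the point of the recommendation above: the one-document terms are roughly two-fifths of the vocabulary, while the near-universal terms are a fraction of a percent.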

## One Doc Wonders <a class="anchor" id="wonders"></a>

After establishing words that may have limited usage in terms of separating one group of documents from another but may have other uses, we also need to explore words that only occur in one document and thus, while they could carry a great deal of meaning within that one document, do not have much meaning outside of that document. 

We first take a look at all words that occur in only one document, and then take a somewhat closer look at words that appear multiple times within a given document, attempting to distinguish between nonce words and words that may carry some weight, if only at the local level of the particular document.

### A Very Peculiar Vector: `max_df = 1`

In [17]:
vec = CountVectorizer( max_df = 1)
X = vec.fit_transform(texts)
X.shape

(1747, 21036)

In [18]:
dfx = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out())

# Create a pandas series of just the terms 
# and the number of times they occur in total
sums = dfx.sum()

# Save to CSV
#sums.to_csv('../output/one_doc_wonders.csv')

# Sort and View
sums = sums.sort_values(ascending = False)
sums.head()

bf           75
gk           46
telomeres    40
abed         39
mzuri        36
dtype: int64

### Words That Appear More Than Once But Only in One Document

Filtering a series is just like filtering a dataframe, and here we create a new series with that data and save it to a CSV.

In [19]:
repeated = sums [ sums > 1]
print(f"The number of words that appear more than once but \
only in one document: {len(repeated)}.")

# Write repeated to CSV:
# repeated.to_csv('../output/one_doc_repeated.csv')

# The Top 10
repeated.head(10)

The number of words that appear more than once but only in one document: 3894.


bf           75
gk           46
telomeres    40
abed         39
mzuri        36
jf           36
indus        33
fonio        32
teszler      30
tkm          30
dtype: int64

It's time to see these terms in context. For now, we will convert the list of texts into a single NLTK text, but after that we will want to see in which texts these words occur.

In [20]:
onetext = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).split())

In [21]:
onetext.concordance("tkm")

Displaying 25 of 30 matches:
w Laughter Siphumeze Khundayi Hi guys TKM Hello everyone TKM So you think you k
e Khundayi Hi guys TKM Hello everyone TKM So you think you know about sex Chanc
e going to keep things safe and spicy TKM So now the act of rubbing our naked b
are going to live our best sexy lives TKM And we're going to tell you how to ha
out the things that we need to change TKM And the things we need to embrace in 
 From the top of my head rape culture TKM How tradition and culture limit ideas
hen he's trying to turn up the volume TKM Like that is a personal pet peeve of 
ve of mine SK We are so scared of sex TKM And we need somebody to blame for our
 see how well that goes down Laughter TKM Does not go down well I once challeng
at you saw on the internet by mistake TKM Mhm So now in order to cure this ailm
ach us to help us upgrade the present TKM So now if I had a glass of Merlot whi
t be named SK Whispering Colonization TKM Came through Within African societies
es of old w

### Initial Observations

* Unexplained (artifacts of some kind?): bf, gk, jf, tkm.
* Yup, used: telomeres (biological), Indus, fonio (some kind of grain).
* Name: Abed, Teszler.
* Song lyric: **mzuri** (in a transliterated language).

# Transcription Artifacts <a class="anchor" id="artifacts"></a>

A hand inspection of the raw output above turned up a couple of interesting issues: 

* that there are **parentheticals** mixed in with the text of the talks,
* that **numbers** feature in the talks regularly,
* and that there are also some interesting tokenization practices that emerge out of a crowd-sourced transcription technology.

One example of "interesting tokenization" is the number of ways that the pause filler *ah* can be transcribed:

| Term    | Freq |
|:---------|---:|
|aa       |12 |
|aaa      |7  |
|aaaa     |2  |
|aaaaa    |1  |
|aaaaaaaah|1  |
|aaaaaaah |1  |
|aaaaaah  |3  |
|aaaah    |2  |
|aaaahhh  |1  |
|aaah     |3  |
|aag      |1  |
|aah      |10 |

There are an equal number of alternate spellings for *shh*. 

There's also this oddity:

| Term    | Freq |
|:---------|---:|
|FALSE    |115|
|TRUE     |909|

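It appears to be simply the occurrences of *true* and *false*. Since the vectorizer lowercases every token, the capitalized Boolean forms were more likely introduced by whatever tool displayed the CSV of totals (a spreadsheet, for instance) than by the vectorizer itself.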

## Parentheticals

Earlier explorations of the corpus revealed something we knew but had not realized could affect our work: some TED talks are not talks but musical performances. Generally, the texts of such performances are rather short. Using an arbitrary cutoff of `500` characters, we can see what these texts look like:

In [22]:
for text in texts:
    if len(text) < 500:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  
  (Music)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  


In [23]:
# Text Indices
shorts = [ i for i, text in enumerate(texts) if len(text) < 500 ]  # enumerate avoids index() returning the first duplicate
print(shorts)

[113, 235, 382, 496, 573, 799, 899, 1484, 1564]


When it comes time to process words in a text, our best bet will be to remove the parentheticals, though having them means we can possibly explore sentiment using `(Applause)` and `(Laughter)` as contextual valuations.

For now, we will need some regex to remove the parentheses and their contents from our texts. An examination of `113` above reveals that it is only three parenthetical expressions:

    (Applause)    (Music)    (Applause)

We need a sample text that is a mix, and so we will use `235`:

In [24]:
print(texts[235])

  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  


Two different regexes give us one list without the parentheses and one with:

In [25]:
print(re.findall(r'(?<=\().*?(?=\))', texts[235]))
print(re.findall(r'\([^)]*\)', texts[235]))

['Whirring', 'Laughter', 'Music', 'Beatboxing', 'Applause']
['(Whirring)', '(Laughter)', '(Music)', '(Beatboxing)', '(Applause)']


In the next two cells we use the regex pattern established above on a subset of the larger list, bracketing a text, `235`, known to have parentheticals:

In [26]:
parenvec = CountVectorizer(token_pattern = r'(?<=\().*?(?=\))')
parentheticals = parenvec.fit_transform(texts[230:240])
parentheticals.shape

(10, 6)

In [27]:
# Use a new name so the dataframe loaded from the CSV is not overwritten
sample_df = pd.DataFrame(parentheticals.toarray(), columns=parenvec.get_feature_names_out())
sample_df.index = range(230, 240)  # positions of these talks in `texts`
sample_df.head(10)

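| index | applause | beatboxing | laugher | laughter | music | whirring |
|------:|---:|---:|---:|---:|---:|---:|
| 230 | 1 | 0 | 0 | 2  | 0 | 0 |
| 231 | 0 | 0 | 0 | 0  | 0 | 0 |
| 232 | 3 | 0 | 0 | 10 | 7 | 0 |
| 233 | 3 | 0 | 1 | 10 | 0 | 0 |
| 234 | 0 | 0 | 0 | 9  | 0 | 0 |
| 235 | 1 | 1 | 0 | 1  | 1 | 1 |
| 236 | 3 | 0 | 0 | 12 | 0 | 0 |
| 237 | 0 | 0 | 0 | 2  | 0 | 0 |
| 238 | 1 | 0 | 0 | 6  | 0 | 0 |
| 239 | 1 | 0 | 0 | 25 | 0 | 0 |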


Now we apply the regex pattern to the entire list:

In [28]:
parentheticals = parenvec.fit_transform(texts)
print(parentheticals.shape)

df_parens = pd.DataFrame(parentheticals.toarray(), columns=parenvec.get_feature_names_out())
df_parens.index = range(len(texts))  # row i corresponds to texts[i]
df_parens.head(10)

(1747, 620)


Unnamed: 0,"""actually about ... 1%""","""although it's nothing serious, let's keep an eye on it to make sure it doesn't turn into a major lawsuit.""","""close it!""","""do architects have ears?""","""i sold my soul for about a tenth of what the damn things are going for now.""","""in order to remain competitive in today's marketplace, i'm afraid we're going to have to replace you with a sleezeball.""","""intrigue and murder among 16th century ottoman court painters.""","""kill him.""","""michael crichton responds by fax:""","""sure""",...,whistling,whoosh,with 4 attempts,woman screaming,woman: have you ever done a kissing test before?,woman: okay.,woo-hoo-hoo-hoo,xylophone,yelling more loudly,your fathers bristles white and stiff now
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And we discover more TED randomness! 

Let's determine the top parentheticals, take a look at their numbers, and then decide what's the best path:

In [29]:
sums = df_parens.sum(axis = 0)
sums.sort_values(ascending=False).head(25)

laughter            7275
applause            4271
music                424
video                269
laughs                41
audio                 37
applause ends         37
singing               35
music ends            33
cheers                30
cheering              21
recording             19
beatboxing            18
audience              15
guitar strum          14
clicks metronome      13
sighs                 13
guitar                13
drum sounds           10
marimba sounds        10
voice-over             8
clicking               8
sings                  8
clapping               7
in chinese             7
dtype: int64

Based on these results, the top 20 parentheticals could be inserted into a stopword list and we would remove, in the case of the top 4 especially, words or clauses that might affect results.
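
The same tally can be sketched without building a document-term matrix, by collecting the contents of every pair of parentheses into a `Counter`. This is only an alternative illustration, not the method used above; the helper name `tally_parentheticals` and the toy texts are assumptions.

```python
import re
from collections import Counter

# Capture whatever sits between an opening and the next closing parenthesis.
PAREN_CONTENT = re.compile(r"\(([^)]*)\)")

def tally_parentheticals(texts):
    """Count lowercased parenthetical contents across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in PAREN_CONTENT.findall(text))
    return counts

toy_texts = [  # modelled on the short performance transcripts shown earlier
    "(Music) (Applause)",
    "All right. (Laughter) Oh, sorry. (Music) (Beatboxing) Thank you. (Applause)",
]

tally = tally_parentheticals(toy_texts)
print(tally.most_common())
```

Because the counts live in a `Counter`, `tally.most_common(20)` would give the candidates for a removal list directly.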

## Removing Parentheticals

Further hand inspection revealed that there is a reasonable amount of speaker discourse being parenthesized, making the removal of all parenthetical material less than optimal. However, the parentheticals appear to follow a Zipf distribution, and so we can effectively remove 80% of the parentheticals with a relatively small number of them either included in a stop word list or fed to the vectorizer as a pre-processing step.

In [30]:
# Parenthetical Token List
parentheticals = [ "(laughter)", "(applause ends)", "(applause)", "(music)", 
                  "(video)", "(laughs)", "(audio)", "(singing)", "(music ends)", 
                  "(cheers)", "(cheering)", "(recording)", "(beatboxing)", 
                  "(audience)", "(guitar strum)", "(clicks metronome)", "(sighs)", 
                  "(guitar)", "(marimba sounds)", "(drum sounds)" ]

def remove_parentheticals (text):
    result = ' '.join([ word for word in re.sub("[^a-zA-Z0-9'()]"," ", text).lower().split() 
              if word not in parentheticals ])
    return result

In the cells that follow, we create a test string that has two terms, *laughter* and *applause*, both in the speech of the imaginary speaker and in a parenthetical. We want to keep the former and remove the latter.

The function defined above, `remove_parentheticals`, checks each whitespace-separated token of a text against the list of terms to be removed. This is a loop within a loop, and I am not sure if this is the most efficient way to do this, but it works. 

We then test the `remove_parentheticals` function on the `test` string and then on one of the talks we know is mostly a performance. The results, at long last, are close to what we want.

In [31]:
test = """Laughter is the best medicine. (Laughter) 
Hold your applause; I'm not done yet. (Applause ends)"""

print(remove_parentheticals(test))

laughter is the best medicine hold your applause i'm not done yet (applause ends)


Success in this small-scale experiment here: `(laughter)` has been removed and `laughter` remains. Unfortunately, `(applause ends)` also remains. (José Blanco's ["Hacking Scikit-Learn’s Vectorizers"](https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af) may hold some ideas here.)

Multi-word parentheticals are the problem: splitting on whitespace turns a parenthetical into tokens like `'(mock', 'sob)'`, neither of which matches an entry in the list. Also a bit of a problem: `sklearn` expects input as strings, and I have not found a way to bypass this, save writing custom preprocessors or tokenizers like the one above.

A simpler approach might be to feed the vectorizer the stopped parenthetical tokens, but when we tried this we got the following warning:

```
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['applause', 'audience', 'audio', 'beatboxing', 'cheering', 'cheers', 'clicks', 'drum', 'ends', 'guitar', 'laughs', 'laughter', 'marimba', 'metronome', 'music', 'recording', 'sighs', 'singing', 'sounds', 'strum', 'video'] not in stop_words.
  % sorted(inconsistent)
```

Removing multiple words remains unresolved at this time, but the frequency is low enough that we think it can be ignored.
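
One possible way around the multi-word problem, offered only as a sketch, is to match each listed parenthetical as a whole string, parentheses included, before any splitting on whitespace happens. The names `STAGE_DIRECTIONS`, `PAREN_RE`, and `strip_stage_directions` are hypothetical, and the list is a short subset of the twenty above.

```python
import re

# A subset of the removal list, without the parentheses.
STAGE_DIRECTIONS = ["laughter", "applause ends", "applause", "music"]

# One alternation wrapped in literal parentheses: "(applause ends)" is
# matched as a unit, while unlisted material such as "(mock sob)" is not.
PAREN_RE = re.compile(
    r"\((?:" + "|".join(re.escape(p) for p in STAGE_DIRECTIONS) + r")\)",
    flags=re.IGNORECASE,
)

def strip_stage_directions(text):
    """Remove listed parentheticals, then collapse the leftover whitespace."""
    return " ".join(PAREN_RE.sub(" ", text).split())

test = """Laughter is the best medicine. (Laughter) 
Hold your applause; I'm not done yet. (Applause ends)"""

cleaned = strip_stage_directions(test)
print(cleaned)
```

Unlike `remove_parentheticals`, this leaves case and punctuation alone, so it could run as a pre-processing step ahead of whichever tokenizer is chosen.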

## Numbers

One of the dimensions of the corpus that arises out of a hand inspection of the terms is the frequency with which some numbers appear. The following table captures the top ten numbers:

| TERM | FREQUENCY |
|------|-----------|
| 000  | 2098 |
| 10   | 1691 |
|  20  | 1107 |
| 100  |  902 |
|  30  |  827 |
|  50  |  784 |
|  15  |  659 | 
|  40  |  494 |
|  12  |  460 | 
|  25  |  410 |

Other frequently occurring numbers: 60, 500, 200, 11, 18, 80, 14 (241 times!). 

In order to examine the appearance of the numbers in context, we make a giant string out of the list of strings, `texts`: in which text a number appears is less important than its immediate context. 

In [32]:
onetext = nltk.Text('\n'.join(texts).split())
# And here's what an NLTK text object looks like: a list of words, really
print(onetext[0:10])

['Thank', 'you', 'so', 'much,', 'Chris.', 'And', "it's", 'truly', 'a', 'great']


### Numbers: Experiment 1

In [33]:
onetext.concordance("000") # Best guess for lack of matches: doesn't stand alone as a word

no matches


In [34]:
onetext.concordance("10")

Displaying 25 of 1216 matches:
Thank you very much. (Applause) About 10 years ago, I took on the task to teac
tion of income of people. One dollar, 10 dollars or 100 dollars per day. There
 a long time, but they come out after 10 years very, very differently. And the
at drives you in your life today? Not 10 years ago. Are you running the same p
really heavy, but in the last five or 10 years, have there been some decisions
. (Laughter) Are you sure? (Laughter) 10 seconds! (Laughter and applause) 10 s
) 10 seconds! (Laughter and applause) 10 seconds, I want to be respectful. All
principle in the Bible that says give 10 percent of what you get back to chari
ional shelter that would last five to 10 years, that would be placed next to t
tandards of five billion people? With 10 million solutions. So I wish to devel
 to go see Central Command, which was 10 minutes away. And that way, I could g
 will not launch this without five to 10 million units in the first run. And t
 down, and that's why

In [35]:
onetext.concordance("40")

Displaying 25 of 387 matches:
w York City already handled more than 40 percent of the entire city's commerci
ing rooms, whose evolution in 20, 30, 40 years we can't predict. So that liter
nd all the other teams have done this 40 Days of Purpose, based on the book. A
nternet tools, and we ended up having 40 chapters starting up, thousands of ar
cumented the Lower Ninth for the last 40 years. That was their home, and these
me. And a long time ago — well, about 40 years ago — my mom had an exchange st
 world where women and children spend 40 billion hours a year fetching water. 
 age category of 76 to 85, as much as 40 percent of people have nothing really
things tend to happen every 25 years. 40 years long, with an overlap. You can 
 all high-rises. So they'll put 20 or 40 up at a time, and they just go up in 
te, we've seen no side effects in the 40 or so patients in whom it's been impl
 terms of price performance, that's a 40 to 50 percent deflation rate. And eco
 people may increase t

A couple of things to note here:

First, there is a discrepancy in the count between `sklearn` and the NLTK: the former counted 2098 occurrences of `000`, the latter none. In all the counts that follow, there is a similar mismatch:

| TERM | `sklearn` | `nltk` |
|------|-----------|--------|
| 000  | 2098 | "no match" |
|  10  | 1691 | 1216 |
|  20  | 1107 | 879 |
| 100  |  902 | 647 |
|  30  |  827 | 650 |
|  50  |  784 | 594 |
|  15  |  659 | 512 |
|  40  |  494 | 387 |
| ...  |  ... | ... |
|  14  |  241 | 148 |

I don't have a ready explanation for this, except that there is a difference between the two tokenizers in how they parse a number like `1,000`.
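
One way to probe that suspicion is to compare the two splitting strategies on a made-up sentence. The `nltk.Text` object above was built with `str.split()`, which leaves punctuation attached to numbers, while the vectorizer's token rule (mimicked here with its documented default pattern) breaks a number at the comma. The sentence and variable names are illustrative assumptions.

```python
import re

sample = "It cost $10, not 10,000. Give 10 percent."  # made-up sentence

# Whitespace splitting, as used to build the NLTK text above:
# "$10," and "10,000." stay glued to their punctuation.
whitespace_tokens = sample.split()

# The vectorizer's default rule: runs of 2+ word characters,
# so "10,000" becomes "10" and "000".
sklearn_like = re.findall(r"(?u)\b\w\w+\b", sample.lower())

print("10 :", whitespace_tokens.count("10"), "vs", sklearn_like.count("10"))
print("000:", whitespace_tokens.count("000"), "vs", sklearn_like.count("000"))
```

On this sentence the whitespace tokens contain a bare `10` once and `000` never, while the vectorizer-style tokens contain `10` three times and `000` once, the same direction of mismatch as in the table.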

Second, the frequency of some numbers is readily explained:

* Round numbers like 10, 20, 30, 50, and 100 are approximations -- though it would be interesting to explore how often they are attached to large scalars like "thousand" or "million." 
* Some numbers seem to represent alternate ways of counting: 25 regularly stands in for "one-quarter" -- though not as often as we might imagine -- and 18 is regularly paired with *month* as a more precise way to say "a year and a half."
* There are some numbers, like 11 and 14, which seem to have a power all their own, perhaps tied to particular ages in humans. 

Next up is some code to explore the words that most commonly occur with these numbers.

In [36]:
# Import only the two names used below rather than everything
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()

All my searches for "collocations with specific words" took me to the NLTK, which means, so far as I can tell, generating all the bigrams and then filtering to get the one(s) you want. This seems backwards to me: wouldn't it be faster simply to find the word and then what comes after it? I'll take a look at regex for this later.

In [37]:
## Bigrams
finder = BigramCollocationFinder.from_words(onetext)

In [38]:
## Here's the filter operation:
the_number = lambda *w: '14' not in w
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain the number
finder.apply_ngram_filter(the_number)
# return the 10 bigrams with the highest likelihood ratio
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[('14', 'years'), ('14', 'billion'), ('was', '14'), ('14', 'years,'), ('14', 'hours'), ('14', 'orders'), ('14', 'million'), ('14', 'percent'), ('14', 'feet'), ('14', 'times')]


This does not return a count. *Oi!*
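
If counts are what we are after, one possible shortcut is the find-the-word-then-look-right approach: walk the token list once and count whatever follows the target. The sketch below assumes a hypothetical helper, `following_words`, and a made-up sample; it cleans the text with the same alphanumeric-plus-apostrophe regex used elsewhere in this notebook.

```python
import re
from collections import Counter

def following_words(tokens, target):
    """Count the tokens that immediately follow each occurrence of `target`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

sample = "She was 14 years old, and he spent 14 years there, then 14 hours more."
tokens = re.sub("[^a-zA-Z0-9']", " ", sample).lower().split()

counts = following_words(tokens, "14")
print(counts.most_common())
```

Swapping the arguments of `zip` (`zip(tokens[1:], tokens)`) would count the preceding word instead, which would surface pairs like `('was', '14')`.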

### Numbers: Experiment 2

In [39]:
the_one = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).lower().split())
# And here's what an NLTK text object looks like: a list of words, really
print(the_one[0:10])

['thank', 'you', 'so', 'much', 'chris', 'and', "it's", 'truly', 'a', 'great']


In [40]:
the_one.concordance("40")

Displaying 25 of 494 matches:
oking for a place to eat we were on i 40 we got to exit 238 lebanon tennessee 
w york city already handled more than 40 percent of the entire city's commerci
eading rooms whose evolution in 20 30 40 years we can't predict so that litera
nd all the other teams have done this 40 days of purpose based on the book and
internet tools and we ended up having 40 chapters starting up thousands of arc
cumented the lower ninth for the last 40 years that was their home and these a
e time and a long time ago well about 40 years ago my mom had an exchange stud
 world where women and children spend 40 billion hours a year fetching water t
be someone coming to rescue me cut to 40 some odd years later we go to kenya a
t age category of 76 to 85 as much as 40 percent of people have nothing really
 is how do you go to the loo at minus 40 ben i've read somewhere that at minus
ben i've read somewhere that at minus 40 exposed skin becomes frostbitten in l
ou answer the call of 

Well, there's the missing `000`! TED's transcription practice renders a number like "sixty thousand" as "60,000", and because our cleaning regex replaces the comma with a space, the number arrives as two tokens: `60` and `000`.

One thing we know now: reporting large numbers is a part of TED talks.

**TO DO**: How to keep the comma marker between numbers? (Or should we just look to 000 as a possible collocate with the other numbers?) One solution from the [Regex Cookbook][]:

```python
\b[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?\b|\.[0-9]+\b
```

[Regex Cookbook]: https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch06s11.html
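
As a rough sketch of how that pattern could be put to work, the tokenizer below tries the number pattern first and falls back to ordinary word characters. The `tokenize` function and the fallback pattern are assumptions for illustration, not something the notebook has settled on; the groups are made non-capturing so that `re.findall` returns whole matches.

```python
import re

# The Regex Cookbook pattern quoted above, with non-capturing groups so that
# re.findall returns whole matches rather than group contents.
NUMBER = r"\b[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\b|\.[0-9]+\b"

# Hypothetical tokenizer: try the number pattern first, then fall back to
# runs of letters, digits, and apostrophes.
TOKEN = re.compile(NUMBER + r"|[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and return tokens, keeping comma-grouped numbers whole."""
    return TOKEN.findall(text.lower())

print(tokenize("We spend 60,000 hours, or 2.5 percent, on it."))
```

Numbers without separators, such as a four-digit year, fail the number pattern's word-boundary check and are picked up whole by the fallback instead.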

## Other Artifacts

A hand inspection of the output above turned up a couple of interesting issues: that parentheticals are mixed in with the text of the talks (see 01-Terms-02-Parentheticals) and that numbers feature regularly in the talks (see 01-Terms-03-Numbers). 

A crowd-sourced transcription process also produces some interesting tokens, such as the many spellings of a drawn-out *aah*:

| Term    | Freq |
|:---------|---:|
|aa       |12 |
|aaa      |7  |
|aaaa     |2  |
|aaaaa    |1  |
|aaaaaaaah|1  |
|aaaaaaah |1  |
|aaaaaah  |3  |
|aaaah    |2  |
|aaaahhh  |1  |
|aaah     |3  |
|aag      |1  |
|aah      |10 |

There are just as many alternate spellings for *shh*. There's also this oddity:

| Term    | Freq |
|:---------|---:|
|FALSE    |115|
|TRUE     |909|

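These appear to be simply the occurrences of *true* and *false*. `CountVectorizer` lowercases its input by default and emits only string tokens, so the uppercase, Boolean-looking form was most likely introduced after export, by whatever tool opened the CSV for hand inspection, rather than by the vectorizer itself.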

The rest of the hand examination turns up no other issues, so the basic vectorization built into `sklearn` appears to be satisfactory. The one exception is that it breaks contractions at the apostrophe: e.g., *isn't* becomes *isn* and *t*, with the latter thrown away because the default token pattern keeps only tokens of two or more characters. For the same reason, the indefinite article *a* is also absent from the frequencies. 
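
The effect can be checked with the `re` module alone. The first pattern below is the documented default `token_pattern` of `CountVectorizer`; the second is a hypothetical alternative, offered only as a sketch, that keeps single characters and straight-apostrophe contractions.

```python
import re

# CountVectorizer's documented default token_pattern: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: allow single characters and one internal apostrophe
KEEP_CONTRACTIONS = r"(?u)\b\w+(?:'\w+)?\b"

sentence = "this isn't a test"
default_tokens = re.findall(DEFAULT_PATTERN, sentence)
contraction_tokens = re.findall(KEEP_CONTRACTIONS, sentence)

print(default_tokens)      # "isn't" is split and the lone "t" and "a" are dropped
print(contraction_tokens)  # the contraction and the article both survive
```

A pattern like the second one could be passed to the vectorizer through its `token_pattern` parameter. It only handles the straight apostrophe (`'`), so curly apostrophes in the transcripts would still split.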

In the cells that follow, we take the array holding all the talks and their terms and transpose it so that the terms are on the rows. We then sum each row, turn the sum column into a series, sort it, and save it to CSV for hand inspection. The hand inspection revealed that, beyond the usual stopwords, enough of the top 300 terms had semantic possibilities that they should be kept. The best bet moving forward is to use an available and widely used stopword list; we chose the one provided by the NLTK.

In [41]:
term_matrix = pd.DataFrame(X.todense(), columns=vec.get_feature_names_out())
tfdf = term_matrix.transpose()
tfdf.shape

(21036, 1747)

In [42]:
tfdf.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1737,1738,1739,1740,1741,1742,1743,1744,1745,1746
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
# Sum the counts for each term into a column titled "sum":
tfdf['sum'] = tfdf.sum(axis=1)

# Isolate the sums in a series
sums = pd.Series(tfdf['sum'])

In [44]:
# Reassign rather than sort in place: `sums` may share data with `tfdf`
sums = sums.sort_values(ascending=False)

In [45]:
# sums.to_csv('../output/tf_main.csv')

# Conclusions <a class="anchor" id="conclusions"></a>

Taking into consideration all the peculiarities of the TED talks corpus and the strengths and weaknesses of the various ways to process the corpus, we suggest the following setup:

### A preprocessor to handle parentheticals

In [46]:
# A Refined Preprocessor --
# This one removes two-word phrases/clauses

# Raw strings keep the escaped parentheses intact for the regex engine
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)",  
                  r"\(video\)", r"\(laughs\)", r"\(applause ends\)", 
                  r"\(audio\)", r"\(singing\)", r"\(music ends\)", 
                  r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", 
                  r"\(marimba sounds\)", r"\(drum sounds\)" ]

def remove_parentheticals(text):
    # Lowercase once here: a custom preprocessor replaces
    # CountVectorizer's own lowercasing step
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text

test = """Laughter is the best medicine. (Laughter) 
Hold your applause; I'm not done yet. (Applause ends)"""

print(remove_parentheticals(test))

laughter is the best medicine.   
hold your applause; i'm not done yet.  


In [47]:
the_vec = CountVectorizer( preprocessor = remove_parentheticals,
                          max_df = 0.9, min_df = 2 )
the_X = the_vec.fit_transform(texts)
the_X.shape

(1747, 29294)