In this notebook, we focus only on talks sponsored by the TED organization itself (1747 texts), not the TEDx talks. With a focus on gender, we have further restricted the set to the 992 talks given at the main TED event, removing the effects that the *TED Women* event may have had.

This is a small corpus, so the usual reasons for shrinking the feature set do not (necessarily) apply. Still, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results are comparable to the work of others and so that we can scale up the work here without having to rethink the foundations. (At the same time, we can enjoy taking our time to explore various methodological and topical issues that a small corpus presents. For more on that, see the Transcription Effects notebook.)

**Table of Contents**

* [Summary](#summary)
* [Tokenization(s)](#tokenizations)
* [Vectorizer Variations](#variations)
  - Frequencies of Terms with a 2-Document Minimum
  - Frequencies of Terms Appearing in All/Most Documents
  - [One Doc Wonders](#wonders)
    - A Very Peculiar Vector: `max_df = 1`
    - Words That Appear More Than Once But Only in One Document
* [Conclusions](#conclusions)

# Summary <a class="anchor" id="summary"></a>

Summing the words to get a total for each word across all talks in the dataset, we then hand-inspect the totals and discover that there is an assortment of numbers that recur: an exploration of that phenomenon appears in a separate notebook. The primary task of this notebook is to count the words in the text, which means addressing the matter of parentheticals head on. Much of this work occurs in a separate notebook, but everything comes down to the fact that there are two kinds of parenthetical expressions in TED talks:

* those that indicate actions and events outside the speaker's discourse, such as the audience laughing or applauding or the speaker sighing or playing music.
* those that the TED transcription practices have somehow identified as digressive material within the speaker's own speech.

We want to remove the former and keep the latter. In the end, we decided to remove, through `regex` matching, the top 20 parentheticals and to leave the others, most of which occur fewer than a dozen times. The effect of the remaining parentheticals on the overall study is thus reduced significantly.
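How the "top 20" might be identified can be sketched with the standard library alone. This is a minimal illustration rather than the procedure used in the separate notebook: the two sample strings, `paren_pattern`, and `paren_counts` are hypothetical names and data invented for the example.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the TED transcripts
sample_texts = [
    "Thank you. (Applause) So (and this is the odd part) it worked. (Laughter)",
    "We tried again. (Laughter) And again. (Laughter) (Applause)",
]

# Match anything between one pair of parentheses (no nesting)
paren_pattern = re.compile(r"\([^()]*\)")

# Count each parenthetical, case-folded so (Laughter) == (laughter)
paren_counts = Counter(
    match.lower()
    for text in sample_texts
    for match in paren_pattern.findall(text)
)

# The most frequent parentheticals are the candidates for removal;
# rare ones are more likely to be part of the speaker's own discourse
top_parentheticals = paren_counts.most_common(2)
print(top_parentheticals)
```

On the real corpus one would raise the cutoff to 20 and hand-inspect the list before committing to it.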

After removing the parentheticals, a second term matrix is derived with a total vocabulary of 50377. (This after discovering a slight bug, or at least weirdness, in `sklearn` that is now in their bug-tracking system.) 

The final step in this notebook is to raise the threshold for a word to be included in the TF-matrix to 2 documents. With that done, we convert the matrix to a pandas dataframe and then save to CSV. (QUESTION: better/more efficient to save as pickle?)

In [1]:
# IMPORTS
import re 
import nltk
import numpy as np, pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Load the data, partitioned by speaker gender:

# Load binary gendered talks 
talks_male = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_female = pd.read_csv('talks_female.csv', index_col='Talk_ID')

# No one gender ==> NOG
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')

all_talks = pd.concat([talks_male,talks_female,talks_nog])

texts = all_talks.text.tolist()

print(f"From our {all_talks.shape[0]}x{all_talks.shape[1]} CSV, \
we have a list of {len(texts)} talks.")

From our 992x14 CSV, we have a list of 992 talks.


# Tokenization(s) <a class="anchor" id="tokenizations"></a>

Before we can count words and establish frequencies, we need to settle upon what we are going to consider words, which means determining our method of tokenizing our strings of characters into lists of tokens.

- The first tokenizer is a regex that John has long used in order to keep contractions as single words, but it throws away everything else that is not a letter: punctuation and also numerals, which, as we will see later in this notebook, feature prominently in the TED talk transcripts.
- NLTK's `word_tokenize()` function is based on the `TreebankWordTokenizer`: it tokenizes text as in the Penn Treebank, which means contractions are split into their distinct parts (e.g., `I'm` becomes `I` + `'m`). By contrast, `wordpunct_tokenize()` is a regex that splits off apostrophes as their own tokens, so `I'm` becomes `I` + `'` + `m`.
- SciKit Learn's tokenization comes up the leanest. 

In [3]:
# REGEX
regex = [word for word in re.sub("[^a-zA-Z']"," ", texts[0]).lower().split()]

# NLTK
w_tokens = [word.lower() for word in nltk.word_tokenize(texts[0])]
wp_tokens = [word.lower() for word in nltk.wordpunct_tokenize(texts[0])]

# SciKit-Learn
vectorizer = CountVectorizer( lowercase = True ) # We are vectorizing
x = vectorizer.fit_transform([texts[0]])         # the same text as above
count = np.sum(x.toarray(), axis = 1)            # then summing the freq count

# Print to Compare
print(f"regex:       {len(regex)}")
print(f"nltk words:  {len(w_tokens)}")
print(f"nltk wpunct: {len(wp_tokens)}")
print(f"scikit:      {count[0]}")

regex:       2146
nltk words:  2576
nltk wpunct: 2686
scikit:      2035


When we compare the sets of distinct tokens rather than the raw token counts, the differences do not seem so great: a matter of a little more than a dozen words, as seen in the first cell below. In the following cell we take a look at the tokens involved; it's not clear why the NLTK tokenizer keeps hyphenated words as single tokens.

In [4]:
print(f"METHOD : TOKEN SET")
print(f"regex  :  {len(set(regex))}")
print(f"NLTK   :  {len(set(w_tokens))}")
print(f"SciKit :  {x.shape[1]}")

METHOD : TOKEN SET
regex  :  727
NLTK   :  725
SciKit :  711


In [5]:
difference = set(w_tokens) - set(vectorizer.get_feature_names_out())
print(difference)

{"'d", '30-second', 're-brand', '``', "'ve", 'hot-button', 'beach-combing', 'i-40', '.', 'limited-use', 'carbon-neutral', "n't", 'has', 'low-hanging', 'climatecrisis.net', 'low-cost', 'end-use', "''", 'a', 'short-term', "'ll", '2.0', 'g-v', '?', '39.5', '—', '!', 'click-through', 'does', "'my", ':', '(', 'rear-view', "'s", "'named", 'ok.', 'i', "'re", '28-second', ',', "'", ')', "'m", ';', 'consumer-friendly', 'u.s.', 're-purposed'}


# Vectorizer Variations <a class="anchor" id="variations"></a>

With the short run of experiments above establishing that SciKit-Learn's tokenizer is sufficient, we first run `CountVectorizer` unadorned.

The default options are: 

- lowercase everything, 
- get rid of all punctuation, 
- make a word out of any run of two or more alphanumeric characters, 
- split contractions, and 
- no stopwords.

The tokenizer is not without its problems: while it breaks contractions at the apostrophe, like NLTK, it then throws away any token shorter than two characters, which means `I'm` disappears entirely. And pity the indefinite article *a*, which is pitched while *an* and the definite article *the* remain.
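The behavior described above can be seen directly by asking `CountVectorizer` for its analyzer, the callable that lowercases and tokenizes a string. This is a small sketch on a made-up sentence (`sample` is not from the corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer applies to each
# document: lowercase first, then extract tokens of two or more characters
analyzer = CountVectorizer().build_analyzer()

# Made-up sentence with contractions, a hyphenated word, and "a"
sample = "I'm sure a re-branded idea isn't the answer."
tokens = analyzer(sample)
print(tokens)
```

Both halves of `I'm` vanish, `isn't` survives only as `isn`, the hyphenated word is split in two, and `a` is dropped.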

In [6]:
# We are going with the defaults, 
# so no options/arguments are being passed:
vectorizer = CountVectorizer()

# fit the model to the data 
# vecs = vectorizer.fit(texts)
X = vectorizer.fit_transform(texts)

# see how many features we have
X.shape

(992, 39515)

39,515 distinct tokens is our raw, unfiltered, no pre-processing baseline. It includes, as we will see, a number of artifacts of the TED transcription process, including a variety of ways to spell out *ah*, the use of numerals for a host of things -- from dates to counts and everything in between, and some things which actually take place outside the talk itself, like audience laughter, which we here term *parentheticals*.

In [7]:
raw_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
raw_series = raw_df.sum()

In [8]:
raw_series.head()

00              4
000          1187
000000004       1
000001          1
000042          1
dtype: int64

Over a thousand occurrences of `000`! That's unexpected (but not unexplainable). We explore this oddity among other transcription artifacts below.
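One plausible source of all those `000` tokens is numerals written with thousands separators: the comma is not a word character, so a number like `20,000` is split in two. The sketch below applies scikit-learn's default `token_pattern` with the standard library's `re` module; the sample phrase is made up for illustration.

```python
import re

# The default token_pattern used by CountVectorizer:
# two or more word characters between word boundaries
default_token_pattern = r"(?u)\b\w\w+\b"

# Made-up phrase containing a numeral with a thousands separator
sample = "chromosomes, around 20,000, so lots of telomeres"

tokens = re.findall(default_token_pattern, sample.lower())
print(tokens)
```

The numeral yields two features, `20` and `000`, and every "thousands" figure in the corpus contributes another `000`.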

## Frequencies of Terms with a 2-Document Minimum

Setting `min_df = 2` removes all words that appear in only one document. (An exploration of *one document wonders* appears below.) As we can see from the `shape` below, we dropped from 39,515 to 22,433 features, a reduction of 43%.

In [9]:
min2_vec = CountVectorizer(min_df = 2)
min2_X = min2_vec.fit_transform(texts)
min2_X.shape

(992, 22433)
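The size of the reduction follows from the two shapes printed so far. A quick check of the arithmetic, with the variable names chosen only for this example:

```python
# Feature counts copied from the two shapes printed above
raw_features = 39515   # default CountVectorizer
min2_features = 22433  # CountVectorizer(min_df = 2)

# Features that appear in exactly one document
dropped = raw_features - min2_features
share_dropped = dropped / raw_features

print(f"{dropped} features dropped ({share_dropped:.1%})")
```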

As above we can sum our terms and write the results to a CSV file for closer inspection. Here, we highlight just a particular series of words.

In [10]:
# Create the dataframe
min2_df = pd.DataFrame(min2_X.toarray(), columns = min2_vec.get_feature_names_out())

# Create a series of sums
min2_series = min2_df.sum()

# Write sums to CSV:
# min2_series.to_csv('../output/word_freq_clean.csv')

# Inspect the sums
min2_series[1000:1005]

alternating       5
alternative      82
alternatively     2
alternatives     13
alters            2
dtype: int64

## Frequencies of Terms Appearing in All/Most Documents

Having trimmed terms that occur in only one document from our feature set, we also need to determine what words occur across all the documents and thus are not particularly effective discriminators of topic.

Note that this is a somewhat more dangerous move to make, since it has been shown that one of the keys to genre and gender in documents is the varying use of so-called function words. In fact, as our explorations establish, removing common words with sklearn's built-in thresholds does not result in much of a reduction in the overall feature set: only 5 words occur in 99% or more of the documents, and that number rises only to 43 when we lower the threshold to 90%.

In [11]:
alldocs_vec = CountVectorizer( min_df = 0.99 )
alldocs_X = alldocs_vec.fit_transform(texts)
alldocs_X.shape

(992, 5)

In [12]:
alldocs_df = pd.DataFrame(alldocs_X.toarray(), columns = alldocs_vec.get_feature_names_out())
alldocs_series = alldocs_df.sum()
alldocs_series

and    67710
of     52313
the    93853
to     57089
you    34162
dtype: int64

In [13]:
mostdocs_vec = CountVectorizer( min_df = 0.90 )
mostdocs_X = mostdocs_vec.fit_transform(texts)
mostdocs_X.shape

(992, 43)

In [14]:
# mostdocs_df = pd.DataFrame(mostdocs_X.toarray(), columns = mostdocs_vec.get_feature_names_out())
# mostdocs_series = mostdocs_df.sum()
# mostdocs_series.head(46)

This last run simply verifies that the exploration above with `min_df` holds true when we switch to `max_df`. That is, we confirm that our exploration of the parameters was not asking something unexpected of the code and possibly generating inexplicable and undesirable results.

In [15]:
# tnt = topsntails
tnt_vec = CountVectorizer( max_df = 0.9, min_df = 2 )
tnt_X = tnt_vec.fit_transform(texts)
tnt_X.shape

(992, 22390)

The numbers add up: the 22,433 terms kept by `min_df = 2`, minus the 43 terms that appear in 90% or more of the documents, leaves 22,390. With that done, my recommendation is that we do not, for the time being, throw away common and/or function words: that is, let's not use a stopword list -- at least not in the usual way. Those words could be important in other ways, and dropping them has only a limited impact on the actual document-term matrix, one that pales by comparison to the words that occur only in one document.
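The same bookkeeping can be reproduced on a toy corpus, which makes it easier to see why the counts must agree. This is a sketch under stated assumptions: `toy_docs` and `vocabulary_size` are invented for the example, and the four sentences are chosen so that exactly one word (*the*) occurs in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented four-document corpus: "the" is in every document,
# "sat", "on", "mat", and "log" are each in two, the rest in one
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
    "the fish swam under the log",
]

def vocabulary_size(**options):
    """Number of features CountVectorizer keeps under the given options."""
    return CountVectorizer(**options).fit_transform(toy_docs).shape[1]

kept_min2 = vocabulary_size(min_df=2)                # in 2+ documents
kept_common = vocabulary_size(min_df=0.9)            # in 90%+ of documents
kept_both = vocabulary_size(max_df=0.9, min_df=2)    # tops and tails trimmed

print(kept_min2, kept_common, kept_both)
```

A float threshold is read as a proportion of documents and an integer as an absolute count, so `min_df = 0.9` and `max_df = 0.9` split the vocabulary cleanly whenever 90% of the document count is not a whole number, as is the case both here and with 992 talks.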

## One Doc Wonders <a class="anchor" id="wonders"></a>

After establishing words that may have limited usage in terms of separating one group of documents from another but may have other uses, we also need to explore words that only occur in one document and thus, while they could carry a great deal of meaning within that one document, do not have much meaning outside of that document. 

We first take a look at all words that occur only in one document, and then take a somewhat closer look at words that appear multiply within a given document, attempting to distinguish between nonce words and words that may carry some weight, if only at the local level of the particular document.

### A Very Peculiar Vector: `max_df = 1`

In [16]:
vec = CountVectorizer( max_df = 1)
X = vec.fit_transform(texts)
X.shape

(992, 17082)
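What makes this vector peculiar is that `max_df = 1` (an integer) and `max_df = 1.0` (a float, the default) mean very different things: the integer is an absolute document count, the float a proportion. A toy illustration, with `toy_docs` and `vocabulary_size` invented for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: 13 distinct words, 8 of which occur in one document only
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
    "the fish swam under the log",
]

def vocabulary_size(**options):
    """Number of features CountVectorizer keeps under the given options."""
    return CountVectorizer(**options).fit_transform(toy_docs).shape[1]

everything = vocabulary_size(max_df=1.0)   # float: up to 100% of documents
one_doc_only = vocabulary_size(max_df=1)   # integer: at most 1 document
two_or_more = vocabulary_size(min_df=2)    # integer: at least 2 documents

print(everything, one_doc_only, two_or_more)
```

The one-document terms and the `min_df = 2` terms partition the full vocabulary, which is the same relationship as 17,082 + 22,433 = 39,515 in the shapes above.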

In [17]:
dfx = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out())

# Create a pandas series of just the terms 
# and the number of times they occur in total
sums = dfx.sum()

# Save to CSV
#sums.to_csv('../output/one_doc_wonders.csv')

# Sort and View
sums = sums.sort_values(ascending = False)
sums.head()

bf           75
gk           46
telomeres    40
abed         39
indus        33
dtype: int64

### Words That Appear More Than Once But Only in One Document

Filtering a series is just like filtering a dataframe. Here we create a new series containing only the terms that occur more than once and save it to a CSV.

In [18]:
repeated = sums [ sums > 1]
print(f"The number of words that appear more than once but \
only in one document: {len(repeated)}.")

# Write repeated to CSV:
# repeated.to_csv('../output/one_doc_repeated.csv')

# The Top 10
repeated.head(10)

The number of words that appear more than once but only in one document: 3024.


bf           75
gk           46
telomeres    40
abed         39
indus        33
fonio        32
teszler      30
mycelium     29
ems          29
edi          27
dtype: int64

It's time to see these terms in context. For now, we will convert the list of texts into a single NLTK text, but after that we will want to see in which texts these words occur.

In [19]:
onetext = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).split())

In [23]:
onetext.concordance("telomeres")

Displaying 25 of 39 matches:
 very ends of chromosomes known as telomeres Now when I started my quest all we
tant but I wanted to find out what telomeres consisted of and for that I needed
romosomes around 20 000 so lots of telomeres And I discovered that telomeres co
of telomeres And I discovered that telomeres consisted of special segments of n
cells never got old and died Their telomeres weren't shortening as time marched
e that could replenish make longer telomeres and we named it telomerase And whe
d our pond scum's telomerase their telomeres ran down and they died So it was t
urns out that as we humans age our telomeres do shorten and remarkably that sho
Generally speaking the longer your telomeres the better off you are It's the ov
you are It's the overshortening of telomeres that leads us to feel and see sign
we look and we feel older yeah Our telomeres are losing the war of attrition fa
l youthful longer it turns out our telomeres are staying longer for longer peri
 seems like
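To see in which texts a given word occurs, one option is to read the answer straight off the document-term matrix: the rows with a nonzero count in that word's column. The following is a minimal sketch, not code from the project; `toy_docs` and `documents_containing` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for three talks
toy_docs = [
    "the ends of chromosomes are called telomeres and telomeres shorten",
    "mycelium connects the trees",
    "the grain is called fonio",
]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)

def documents_containing(term):
    """Map row index -> count for every document that uses `term`."""
    column = toy_vec.vocabulary_[term]            # column index of the term
    counts = toy_X[:, [column]].toarray().ravel() # one count per document
    return {int(row): int(counts[row]) for row in np.flatnonzero(counts)}

print(documents_containing("telomeres"))
print(documents_containing("the"))
```

With the real matrix, the row positions could then be mapped back to `Talk_ID` through the index of `all_talks`, assuming the rows are still in the order in which `texts` was built.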

### Initial Observations

* Unexplained (artifacts of some kind?): bf, gk, jf, tkm.
* Yup, used: telomeres (biological), Indus, fonio (some kind of grain).
* Names: Abed, Teszler.
* Song lyric: **mzuri** (in a transliterated language).

# Conclusions <a class="anchor" id="conclusions"></a>

Taking into consideration all the peculiarities of the TED talks corpus and the strengths and weaknesses of the various ways to process the corpus, we suggest the following setup:

## A preprocessor to handle parentheticals

In [21]:
# A Refined Preprocessor --
# This one removes two-word phrases/clauses

# Raw strings keep the escaped parentheses from triggering
# invalid-escape warnings in recent versions of Python
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)",  
                  r"\(video\)", r"\(laughs\)", r"\(applause ends\)", 
                  r"\(audio\)", r"\(singing\)", r"\(music ends\)", 
                  r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", 
                  r"\(marimba sounds\)", r"\(drum sounds\)" ]

def remove_parentheticals(text):
    # Lowercase once, up front: a custom preprocessor replaces
    # CountVectorizer's own lowercasing step
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text

test = """Laughter is the best medicine. (Laughter) 
Hold your applause; I'm not done yet. (Applause ends)"""

print(remove_parentheticals(test))

laughter is the best medicine.   
hold your applause; i'm not done yet.  


In [22]:
the_vec = CountVectorizer( preprocessor = remove_parentheticals,
                          max_df = 0.9, min_df = 2 )
the_X = the_vec.fit_transform(texts)
the_X.shape

(992, 22389)