# Term Frequencies

With the TEDtalks-all dataset created, we have 1747 talks with which to work. This is a small corpus, so the usual reasons for shrinking the feature set for texts do not (necessarily) apply. Still, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results are comparable to the work of others and so that we could scale up the work here without having to rethink the foundations.

At the same time, with a small corpus, we can take our time to explore the methodological issues, the algorithms, and their various (hyper)parameters, all of which studies focused on larger corpora often have to take as given.

## Summary

In this notebook we load the complete corpus of the TED-curated talks, which includes both the main TED events and the TED+ events described in the previous notebook. We then use Python's `scikit-learn` library to explore the document-term frequency matrix (DTM), which has an initial shape of 1747 x 50379.

Summing the words to get a total for each word across all talks in the dataset, we then hand-inspect the totals and discover that there is an assortment of numbers that recur: an exploration of that phenomenon appears in a separate notebook. The primary task of this notebook is to count the words in the text, which means addressing the matter of parentheticals head on. Much of this work occurs in a separate notebook, but everything comes down to the fact that there are two kinds of parenthetical expressions in TED talks:

* those which indicate actions and events outside the speaker's discourse, such as the audience laughing or applauding or the speaker sighing or playing music. 
* those which the TED transcription practices have somehow marked as digressive material within the speaker's own speech.

We want to remove the former and keep the latter. In the end, the decision is made to remove, through `regex` matching, the top 20 parentheticals and to leave the others, most of which occur fewer than a dozen times. The effect of parentheticals on the overall study is thus reduced significantly.

After removing the parentheticals, a second term matrix is derived with a total vocabulary of 50377. (This after discovering a slight bug, or at least weirdness, in `sklearn` that is now in their bug-tracking system.) 

The final step in this notebook is to raise the threshold for a word to be included in the TF-matrix to 2 documents. With that done, we convert the matrix to a pandas dataframe and then save to CSV. (QUESTION: better/more efficient to save as pickle?)

**TODO**: Complete commenting throughout this notebook.

## Load Imports and Data

In [1]:
# IMPORTS

import pandas as pd, re, csv, nltk
from sklearn.feature_extraction.text import CountVectorizer

# LOCAL FUNCTION --> Explained below

def remove_parentheticals(text):
    # Lowercase once, then replace each listed parenthetical with a space
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text

# Raw strings keep the escaped parentheses from being read as
# (invalid) string escape sequences
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]

# DATA

df = pd.read_csv('../output/TEDall_speakers.csv') # Skipping `index_col='Talk_ID'`
df.shape

(1747, 27)

## Raw Frequencies

The goal of this notebook is to establish how we are going to elicit our features, our words, from the collection of talks. Thus, the only column we are interested in is the one with the texts of the talks. While we recognize the utility of having the data in one file, we are looking to minimize individual file sizes and to make it possible to load only the data we need for any particular task. For that reason, we have chosen to create a number of CSV files, keyed on the ID number that TED assigns each talk when it publishes the talk to its website.

For this first experiment, we keep it simple by creating two lists, one of the talk IDs and one of the texts themselves: we will only be using the IDs later when we write the DTM to a CSV. This ID will allow us later to correlate the DTM with metadata associated with each talk: the speaker, her biography, the talk's popularity, etc. 

In [2]:
texts = df.text.tolist()

**Nota bene**: If you are using this notebook to re-create the basic TF matrix you can stop here and jump to the end of this notebook.

There are a number of ways to get term frequencies, but **scikit-learn**'s `CountVectorizer` offers a one-stop shop for generating a DTM from which we can examine words as well as generate BoW-products like topic models. This is made possible by the interoperability between `CountVectorizer` and the other vectorizers and models available in `sklearn`.



In our first experiment, we run `CountVectorizer` unadorned. The default options are: lowercase everything, treat punctuation as a separator between tokens, and make a word out of any run of two or more alphanumeric characters. The only thing that might not be welcome is the splitting of contractions. For now, we will leave things as they are. (Also, please note that no stopwords are used and no pre-processing of the texts is done, so we have an unfiltered word list.)
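To make the default behavior concrete, here is a minimal sketch that runs the default analyzer on a made-up sentence. The sentence is an assumption for illustration only and is not drawn from the TED corpus; `build_analyzer()` returns the callable that `CountVectorizer` applies to each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer bundles the default preprocessing (lowercasing)
# with the default tokenization.
analyze = CountVectorizer().build_analyzer()

# Illustrative sentence, not from the corpus
sample = "Isn't it a GREAT idea? I think so."
tokens = analyze(sample)
print(tokens)
```

The contraction loses everything after the apostrophe, and the one-letter words drop out entirely, which is the behavior discussed later in this notebook.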

CHECK THIS ASSERTION: Since `CountVectorizer` only learns the vocabulary when *fitting* and only counts against that vocabulary when *transforming*, with no normalization or centering in either step, there is no need to break the two functionalities out in the code that follows.

In [3]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer( lowercase = True )

# fit the model to the data 
# vecs = vectorizer.fit(texts)
X = vectorizer.fit_transform(texts)

# see how many features we have
X.shape

(1747, 50379)

50,379 distinct terms is our raw, unfiltered, no pre-processing baseline. It includes, as we will see, a number of artifacts of the TED transcription process: a variety of ways to spell out *ah*; the use of numerals for a host of things, from dates to counts and everything in between; and some things which actually take place outside the talk itself, like audience laughter, which we here term *parentheticals*. (Accounting for all these artifacts requires a number of closer inspections, which take place in a separate notebook: [Term Artifacts](Terms-02).)

In [6]:
# On scikit-learn 1.0 or later, use get_feature_names_out() instead
raw_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names())
raw_series = raw_df.sum()

In [7]:
raw_series.head()

00             15
000          2098
000000004       1
0000001         1
000001          1
dtype: int64
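The totals above come from converting the whole matrix to a dense dataframe first. As a hedged alternative, the same per-term sums can be taken on the sparse matrix itself. The sketch below uses a toy corpus and toy variable names (all assumptions for illustration), and reads the term labels from the fitted `vocabulary_` so that it does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
toy_texts = ["the cat sat", "the cat ran", "the dog ran away"]
toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)

# Column sums on the sparse matrix: one total per term, no dense copy needed
totals = np.asarray(toy_X.sum(axis=0)).ravel()

# Terms in column order, recovered from the term -> column index mapping
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

toy_series = pd.Series(totals, index=terms)
print(toy_series)
```

For a 1747 x 50379 matrix the dense route is still workable, so this is a convenience rather than a necessity.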

A hand inspection of the output above turned up a couple of interesting issues: that there are parentheticals mixed in with the text of the talks (see 01-Terms-02-Parentheticals) and that numbers feature in the talks regularly (see 01-Terms-03-Numbers). 

There are also some interesting tokenization dimensions to a crowd-sourced transcription technology:

| Term    | Freq |
|:---------|---:|
|aa       |12 |
|aaa      |7  |
|aaaa     |2  |
|aaaaa    |1  |
|aaaaaaaah|1  |
|aaaaaaah |1  |
|aaaaaah  |3  |
|aaaah    |2  |
|aaaahhh  |1  |
|aaah     |3  |
|aag      |1  |
|aah      |10 |

There are an equal number of alternate spellings for *shh*. There's also this oddity:

| Term    | Freq |
|:---------|---:|
|FALSE    |115|
|TRUE     |909|

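It appears to be simply the occurrences of *true* and *false*, but somehow they are being treated as Booleans. Since the vectorizer lowercases every term, the uppercase forms most likely come from whatever software was used to view the saved totals rather than from `sklearn` itself.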

The rest of the hand-examination turns up no other issues, so the basic vectorization built into `sklearn` appears to be satisfactory, with the only exception being its breaking of contractions at the apostrophe: e.g., *isn't* becomes *isn* and *t*, with the latter being thrown away as too short. The same length rule means that one-letter words such as *a* and *I* are not present in the frequencies, while *an* and the definite article *the* remain.

## Revised Frequencies without Parentheticals

For more on the work leading up to this regex, see the `01-Terms-02-Parentheticals` notebook: it contains a number of experiments, which were vexed by sklearn's vectorizer only accepting strings as inputs. (While gensim appears to accept texts as lists of words, we are hoping, I think, to keep our code base as simple as possible.) The eventual workaround is to clean the texts of the top 20 parentheticals, join the list back to a string, and then push the result to sklearn. This is less than optimal, but the hack works and appears to be a hack performed elsewhere by others.

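The list of the top 20 parentheticals, some of which are two words, is defined in the first code cell of this notebook, along with the `remove_parentheticals` function that we pass to `CountVectorizer` as its `preprocessor`: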

In [9]:
noparens_vec = CountVectorizer(preprocessor = remove_parentheticals)
noparens_X = noparens_vec.fit_transform(texts)
noparens_X.shape

(1747, 50377)

Good news. Our removal of parentheticals has removed only two words from the vocabulary.
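To see what the preprocessor does to an individual text, here is a small standalone sketch. The function name `strip_parentheticals`, the shortened pattern list, and the sample sentence are all assumptions made for illustration; the logic mirrors the `remove_parentheticals` function defined at the top of this notebook. Note that a custom `preprocessor` replaces the vectorizer's own lowercasing step, which is why the function lowercases the text itself.

```python
import re

# Shortened, illustrative subset of the notebook's list of 20 patterns
sample_parentheticals = [r"\(laughter\)", r"\(applause\)", r"\(music ends\)"]

def strip_parentheticals(text, patterns=sample_parentheticals):
    """Lowercase the text and replace each listed parenthetical with a space."""
    cleaned = text.lower()
    for pattern in patterns:
        cleaned = re.sub(pattern, " ", cleaned)
    return cleaned

# Invented sample, not a quotation from any talk
sample = "Thank you. (Applause) And then (as you might guess) we laughed. (Laughter)"
cleaned = strip_parentheticals(sample)
print(cleaned)
```

Only the listed stage directions are removed. The digressive parenthetical, which is part of the speaker's own discourse, survives, as do ordinary uses of words like *laughed*.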

In [11]:
noparens_df = pd.DataFrame(noparens_X.toarray(), columns = noparens_vec.get_feature_names())

In [13]:
term = 'laughter'
f"The baseline count for *{term}* is {raw_df[term].sum()}; \
with cleaning it is {noparens_df[term].sum()}"

'The baseline count for *laughter* is 7374; with cleaning it is 98'

## Frequencies of Terms with a 2-Document Minimum

Setting `min_df = 2` removes all words that appear in only one document. (An exploration of *one document wonders* is elsewhere.) As we can see from the `shape` below, the vocabulary drops from 50379 to 29340, a reduction of 42% of the possible features.

In [19]:
min2_vec = CountVectorizer(preprocessor = remove_parentheticals, min_df = 2)
min2_X = min2_vec.fit_transform(texts)
min2_X.shape

(1747, 29340)
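The effect of `min_df` is easier to see on a corpus small enough to read. The sketch below uses three invented documents (an assumption for illustration) and also re-derives the 42% figure from the two vocabulary sizes reported in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
toy_texts = ["apples and pears", "apples and plums", "pears and quinces"]

# Vocabulary with no document-frequency floor
full_vocab = CountVectorizer().fit(toy_texts).vocabulary_

# Vocabulary keeping only terms found in at least two documents
min2_vocab = CountVectorizer(min_df=2).fit(toy_texts).vocabulary_

# Terms that occur in exactly one document are the ones dropped
dropped = sorted(set(full_vocab) - set(min2_vocab))
toy_reduction = 1 - len(min2_vocab) / len(full_vocab)

# The same arithmetic applied to the sizes reported in this notebook
ted_reduction = 1 - 29340 / 50379

print(dropped, round(toy_reduction, 2), round(ted_reduction, 2))
```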

As above we can sum our terms and write the results to a CSV file for closer inspection. Here, we highlight just a particular series of words.

In [21]:
# Create the dataframe
min2_df = pd.DataFrame(min2_X.toarray(), columns = min2_vec.get_feature_names())

# Create a series of sums
min2_series = min2_df.sum()

# Write sums to CSV:
# min2_series.to_csv('../output/word_freq_clean.csv')

# Inspect the sums
min2_series[1000:1005]

affiliations    3
affinity        3
affirm          7
affirmation     4
affirmations    3
dtype: int64

## Frequencies of Terms Appearing in All Documents

Having trimmed terms that occur in only one document from our feature set, we also need to determine what words occur across all the documents and thus are not particularly effective discriminators of topic. 

Note that this is a somewhat more dangerous move to make, since it has been shown that one of the keys to genre and gender in documents is the varying use of so-called function words. In fact, as our explorations with sklearn's built-in options establish, removing common words does not result in much of a reduction in the overall feature set: only 8 words occur in 99% or more of the documents, and that number only rises to 46 when we lower the threshold to 90%.

In [34]:
alldocs_vec = CountVectorizer(preprocessor = remove_parentheticals, min_df = 0.99)
alldocs_X = alldocs_vec.fit_transform(texts)
alldocs_X.shape

(1747, 8)

In [43]:
alldocs_df = pd.DataFrame(alldocs_X.toarray(), columns = alldocs_vec.get_feature_names())
alldocs_series = alldocs_df.sum()
alldocs_series

and    118989
in      62673
is      50072
it      59191
of      92416
the    166093
to     102276
you     56296
dtype: int64

In [40]:
mostdocs_vec = CountVectorizer(preprocessor = remove_parentheticals, min_df = 0.90)
mostdocs_X = mostdocs_vec.fit_transform(texts)
mostdocs_X.shape

(1747, 46)

In [42]:
# mostdocs_df = pd.DataFrame(mostdocs_X.toarray(), columns = mostdocs_vec.get_feature_names())
# mostdocs_series = mostdocs_df.sum()
# mostdocs_series.head(46)

This last run simply verifies that the exploration above with `min_df` holds true when we switch to `max_df`. That is, it checks that our exploration of the parameters was not asking something unexpected of the code and possibly generating inexplicable and undesirable results.

In [44]:
# tnt = topsntails
tnt_vec = CountVectorizer(preprocessor = remove_parentheticals, max_df = 0.9, min_df = 2)
tnt_X = tnt_vec.fit_transform(texts)
tnt_X.shape

(1747, 29294)
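The same consistency check can be rehearsed on a toy corpus, where the document frequencies are small enough to verify by eye. Everything in this sketch (the four texts and the helper `vocab_size`) is an assumption for illustration. With four documents, a proportion of 0.9 corresponds to 3.6 documents, so no term sits exactly on the boundary and the `min_df` and `max_df` groups do not overlap.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
toy_texts = [
    "the cat and the dog",
    "the bird and the fish",
    "the cat and a bird",
    "the dog ran",
]

# Document frequency: number of documents in which each term occurs
vec = CountVectorizer()
X = vec.fit_transform(toy_texts)
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
doc_freq = dict(zip(terms, X.getnnz(axis=0).tolist()))

def vocab_size(**kwargs):
    """Hypothetical helper: vocabulary size for the given CountVectorizer options."""
    return len(CountVectorizer(**kwargs).fit(toy_texts).vocabulary_)

n_min2 = vocab_size(min_df=2)                 # terms in two or more documents
n_common = vocab_size(min_df=0.9)             # terms in 90% or more of documents
n_trimmed = vocab_size(min_df=2, max_df=0.9)  # both limits applied

print(doc_freq)
print(n_min2, n_common, n_trimmed)
```

The notebook's own figures follow the same pattern: 29340 terms with the 2-document minimum, less the 46 terms found in 90% or more of the talks, gives 29294.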

The numbers add up: the 29,340 terms with a 2-document minimum, less the 46 terms that occur in 90% or more of the documents, leaves 29,294. With that done, my recommendation is that we do not, for the time being, throw away common and/or function words: that is, let's not use a stopword list. Those words could be important in other ways, and dropping them has only a limited impact on the actual document-term matrix, one that pales by comparison to the words that occur only in one document.
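The summary mentions converting the final matrix to a dataframe and saving it to CSV, and asks whether another format would be more efficient. The sketch below shows one possible shape for that step on toy data. The IDs, texts, and file names are invented, and the `Talk_ID` index name is taken from the commented-out `index_col` option in the data-loading cell. It writes the labeled dataframe to CSV and, as one alternative to pickling, stores the sparse matrix with SciPy's `save_npz`. It does not measure which option is faster or smaller on the full corpus.

```python
import tempfile
from pathlib import Path

import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Stand-ins for TED talk IDs and transcripts, invented for illustration
toy_ids = [101, 102, 103]
toy_texts = ["apples and pears", "apples and plums", "pears and quinces"]

vec = CountVectorizer(min_df=2)
X = vec.fit_transform(toy_texts)
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Label rows with talk IDs so the DTM can later be joined to the metadata
dtm = pd.DataFrame(X.toarray(),
                   index=pd.Index(toy_ids, name="Talk_ID"),
                   columns=terms)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy_dtm.csv"
    npz_path = Path(tmp) / "toy_dtm.npz"

    dtm.to_csv(csv_path)                  # human-readable, keeps the labels
    sparse.save_npz(str(npz_path), X)     # compact, but stores no labels

    dtm_back = pd.read_csv(csv_path, index_col="Talk_ID")
    X_back = sparse.load_npz(str(npz_path))

print(dtm_back)
```

The sparse file keeps only the counts, so the talk IDs and terms would need to be saved alongside it. The CSV keeps everything in one file at the cost of writing out every zero.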