# Term Frequencies

With the TEDtalks-all dataset created, we have 1747 talks with which to work. This is a small corpus, so the usual reasons for shrinking the feature set for texts do not (necessarily) apply. Still, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results are comparable to the work of others and so that we could scale up the work here without having to rethink the foundations.

At the same time, with a small corpus, we can enjoy taking our time to explore the methodological issues, algorithms and their various (hyper)parameters that studies focused on larger corpora often have to take as given.

## Summary

In this notebook we load the complete corpus of TED-curated talks, that is, both the main TED events and the TED+ events. We then use Python's scikit-learn (`sklearn`) library to explore the document-term matrix (DTM), which has an initial shape of 1747 x 50379.

Summing the words to get a total for each word across all talks in the dataset, we then hand-inspect the totals and discover that there is an assortment of numbers that recur: an exploration of that phenomenon appears in a separate notebook. The primary task of this notebook is to count the words in the text, which means addressing the matter of parentheticals head on. Much of this work occurs in a separate notebook, but everything comes down to the fact that there are two kinds of parenthetical expressions in TED talks:

* those which indicate actions and events outside the speaker's discourse, such as the audience laughing or applauding or the speaker sighing or playing music. 
* those which the TED transcription practices have somehow discerned as being digressive material within the speaker's own speech. 

We want to remove the former and keep the latter. In the end, the decision is made to remove, through `regex` matching, the top 20 parentheticals and to leave the others, most of which occur fewer than a dozen times. The effect of the remaining parentheticals on the overall study is thus reduced significantly.

After removing the parentheticals, a second term matrix is derived with a total vocabulary of 50377. (This after discovering a slight bug, or at least weirdness, in `sklearn` that is now in their bug-tracking system.) 

The final step in this notebook is to raise the threshold for a word to be included in the TF-matrix to 2 documents. With that done, we convert the matrix to a pandas dataframe and then save to CSV. (QUESTION: better/more efficient to save as pickle?)

**TODO**: Complete commenting throughout this notebook.

## Imports and Data

In [1]:
# Imports
import csv
import re

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

In [3]:
df.head()

Unnamed: 0,Set,Talk_ID,public_url,headline,description,event,duration,published,tags,views,...,speaker2_introduction,speaker2_profile,speaker_3,speaker3_occupation,speaker3_introduction,speaker3_profile,speaker_4,speaker4_occupation,speaker4_introduction,speaker4_profile
0,only,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,...,,,,,,,,,,
1,only,7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,...,,,,,,,,,,
2,only,53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,...,,,,,,,,,,
3,only,66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,...,,,,,,,,,,
4,only,92,https://www.ted.com/talks/hans_rosling_shows_t...,The best stats you've ever seen,You've never seen data presented like this. Wi...,TED2006,0:19:50,6/27/06,"demo,Asia,global issues,visualizations,global ...",12662135,...,,,,,,,,,,


**To Do**: Edit the CSV to remove the vestigial index column at the start of each line. Then use `df.set_index('Talk_ID')`.

## Frequencies

The sole purpose of this notebook is to establish how we are going to elicit our features, our words, from the collection of talks. Thus, the only column we are interested in is the one with the texts of the talks. While we recognize the utility of having the data in one file, we are looking to minimize individual file sizes, and also make it possible only to load the data we need for any particular task. For that reason, we have chosen to create a number of CSV files, using the ID number that TED assigns each talk as it publishes them to its website.

For this first experiment, we will keep it simple, creating two lists, one of the IDs and one of the texts. (Another possibility is to grab the URLs which are also unique as well as human-friendly identifiers for the talks.)

**TODO**: We can, perhaps, make the URLs a bit more friendly by removing the shared prefix, `https://www.ted.com/talks/`, from each.
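As a rough sketch of that TODO, the shared prefix can be stripped from each URL to leave a shorter label. The URLs below are made up for illustration, and `short_label` is a hypothetical helper name, not something defined elsewhere in this project.

```python
# Prefix shared by the talk URLs in the dataset
PREFIX = "https://www.ted.com/talks/"

def short_label(url, prefix=PREFIX):
    """Return the URL without the shared prefix (hypothetical helper)."""
    # Only strip when the prefix is actually present
    return url[len(prefix):] if url.startswith(prefix) else url

# Made-up URLs, for illustration only
example_urls = [
    "https://www.ted.com/talks/some_speaker_some_title",
    "https://example.org/not_a_ted_url",
]
labels = [short_label(u) for u in example_urls]
print(labels)
```

Leaving non-matching URLs untouched is a design choice here: it keeps the labels unique even if a stray URL does not follow the usual pattern.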

In [4]:
IDs  = df.Talk_ID.tolist()
texts = df.text.tolist()

**>>> Nota bene**: If you are using this notebook to re-create the basic TF matrix you can stop here and jump to the end of this notebook.

There are a number of ways to get term frequencies, but **SciKit-Learn**'s `CountVectorizer` offers a one-stop shop for generating a DTM from which we can examine words as well as generate BoW-products like topic models. This is made possible by the interoperability between `CountVectorizer` and the other vectorizers and models available in `sklearn`.


### Experiment 1: Raw Counts


In our first experiment, we run `CountVectorizer` unadorned. The default options are: lowercase everything, get rid of all punctuation, and make a word out of anything two or more characters long. The only thing that might not be welcome is the splitting of contractions. For now, we will leave things as they are. (Also, please note, no stopwords were used, so we have an unfiltered word list and no pre-processing of the texts is done.)

CHECK THIS ASSERTION: Since `CountVectorizer` only counts when *fitting* and does not engage in any kind of normalization or centering, *transforming*, there is no need to break the two functionalities out in the code that follows.
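One way to check the assertion above is to compare the two routes on a toy corpus. The sketch below uses made-up documents (`docs` is not part of the dataset) and assumes only that scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this check
docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]

# Route 1: fit and transform in one call
X_one_step = CountVectorizer().fit_transform(docs)

# Route 2: fit first (learns the vocabulary), then transform (does the counting)
vec_two_step = CountVectorizer()
vec_two_step.fit(docs)
X_two_step = vec_two_step.transform(docs)

# Compare the dense versions of the two count matrices
same = bool((X_one_step.toarray() == X_two_step.toarray()).all())
print(X_one_step.shape, same)
```

If the two matrices match, `fit_transform` is a safe shorthand for this notebook. Separating the steps mainly matters when a vocabulary fitted on one set of texts is applied to another.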

In [5]:
# If you want to pass options, pass them here:
vec = CountVectorizer( lowercase = True )

# fit the model to the data 
X = vec.fit_transform(texts)

# see how many features we have
X.shape

(1747, 50379)

50,379 tokens is our raw, unfiltered, no-pre-processing baseline. It includes, as we will see, a number of artifacts of the TED transcription process: a variety of ways to spell out *ah*; the use of numerals for a host of things, from dates to counts and everything in between; and some things which actually take place outside the talk itself, like audience laughter, which we here term *parentheticals*. Accounting for all these artifacts requires a number of closer inspections, which take place in a separate notebook: [Term Artifacts](Terms-02).

## Frequency Totals per Word

In [5]:
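# `vec` is the vectorizer fitted above. Note: get_feature_names() was deprecated in
# scikit-learn 1.0 and removed in 1.2 in favor of get_feature_names_out().
df_base = pd.DataFrame(X.toarray(), columns = vec.get_feature_names())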

sums_base = df_base.sum()

A hand inspection of the output above turned up a couple of interesting issues: that there are parentheticals mixed in with the text of the talks (see 01-Terms-02-Parentheticals) and numbers feature in the talks regularly (see 01-Terms-03-Numbers). 

There are also some interesting tokenization dimensions to a crowd-sourced transcription technology:

| Term    | Freq |
|:---------|---:|
|aa       |12 |
|aaa      |7  |
|aaaa     |2  |
|aaaaa    |1  |
|aaaaaaaah|1  |
|aaaaaaah |1  |
|aaaaaah  |3  |
|aaaah    |2  |
|aaaahhh  |1  |
|aaah     |3  |
|aag      |1  |
|aah      |10 |

There are an equal number of alternate spellings for *shh*. There's also this oddity:

| Term    | Freq |
|:---------|---:|
|FALSE    |115|
|TRUE     |909|

It appears to be simply the occurrences of *true* and *false* but somehow the vectorizer thinks they are Booleans?

The rest of the hand-examination turns up no other issues, so the basic vectorization built into `sklearn` appears to be satisfactory, with the only exception being its breaking of contractions at the apostrophe: e.g., *isn't* becomes *isn* and *'t*, with the latter being thrown away as too small -- which means that the indefinite article *a* is also not present in the frequencies. 
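The contraction behavior can be reproduced outside `sklearn` with the regular expression that `CountVectorizer` documents as its default `token_pattern`, `(?u)\b\w\w+\b`. The sketch below only approximates the vectorizer's pipeline (lowercase, then tokenize), and the sample sentence is made up.

```python
import re

# Default token_pattern documented for CountVectorizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sample = "This isn't a test, and I can't stop."
tokens = TOKEN_PATTERN.findall(sample.lower())
print(tokens)
```

The single-character pieces (*t*, *a*, *i*) never match the pattern, which is why they are absent from the frequencies.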

## Revised Frequencies without Parentheticals

For more on the work leading up to this regex, see the `01-Terms-02-Parentheticals` notebook: it contains a number of experiments, which were vexed by sklearn's vectorizer only accepting strings as inputs. (While gensim appears to accept texts as lists of words, we are hoping, I think, to keep our code base as simple as possible.) The eventual workaround is to clean the texts of the top 20 parentheticals, join the list back into a string, and then push the result to sklearn. This is less than optimal, but the hack works and appears to be a hack performed elsewhere by others.

For now, the first thing we have is the list of the top 20 parentheticals, some of which are two words:

In [6]:
# Raw strings keep the regex escapes (\( and \)) from being read as string escapes
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]
len(parentheticals)

20

In the cells that follow, we create a test string that has two terms, *laughter* and *applause*, both in the speech of the imaginary speaker as well as in a parenthetical. We want to keep the former and remove the latter.

Then we develop a function, `clean_text`, that goes through the list of terms to be removed as it goes through a text. This is a loop within a loop, and I am not sure if this is the most efficient way to do this, but it works. 

We then test the `clean_text` function on the `test` string and then on one of the talks we know is mostly a performance. The results, at long last, are what we want.

In [None]:
test = """Laughter is the best medicine. (Laughter) 
Hold your applause; I'm not done yet. (Applause ends)"""

def clean_text(rgx_list, text):
    new_text = text
    for rgx_match in rgx_list:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text

In [None]:
print(clean_text(parentheticals, test))

In [None]:
print(clean_text(parentheticals, texts[235]))
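On the efficiency question raised above, one possible alternative to the loop is to join the expressions into a single compiled pattern, so each text is scanned once. This is only a sketch: `clean_text_single_pass` is a hypothetical name, the list is shortened for illustration, and whether it is actually faster on this corpus would need timing.

```python
import re

# A few of the parentheticals from the list above, written as raw strings
parentheticals = [r"\(laughter\)", r"\(applause\)", r"\(applause ends\)"]

# Join the alternatives into one pattern and compile it once
PAREN_RE = re.compile("|".join(parentheticals), flags=re.IGNORECASE)

def clean_text_single_pass(text):
    """Replace every listed parenthetical with a space in one pass (hypothetical variant)."""
    return PAREN_RE.sub(" ", text)

test = """Laughter is the best medicine. (Laughter)
Hold your applause; I'm not done yet. (Applause ends)"""
cleaned = clean_text_single_pass(test)
print(cleaned)
```

The spoken uses of *laughter* and *applause* survive because the pattern only matches the terms when they are wrapped in parentheses.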

We now have a working function, but the two arguments of `clean_text` cannot be used in `CountVectorizer`: we are going to use this as a `preprocessor` argument, which takes only one argument itself, the text. For the sake of simplicity, we are simply going to embed the regex list, `parentheticals`, into the function, renaming it `clean_parens`.

With that done, we are going to use our new custom preprocessor in a new instantiation of the `CountVectorizer`, which runs up our term count from 50379 to 58404. 

Locally, the comparisons look good: a term like *laughter* drops from 7374 occurrences to 87. Other terms show similar drops in usage. 

In [7]:
def clean_parens(text):
    # Lowercase once, before the loop, rather than on every pass
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text
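An alternative to embedding the list in the function, sketched here under the assumption that the two-argument `clean_text` is kept, is to bind its first argument with `functools.partial`. The result is a one-argument callable of the kind the `preprocessor` option expects. The name `clean_parens_partial` and the shortened list are illustrative only, and this variant does not lowercase.

```python
import re
from functools import partial

def clean_text(rgx_list, text):
    """Two-argument cleaner, as defined earlier in this notebook."""
    new_text = text
    for rgx_match in rgx_list:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text

# Shortened list, for illustration only
some_parentheticals = [r"\(laughter\)", r"\(applause\)"]

# Bind the first argument; the resulting callable takes only the text
clean_parens_partial = partial(clean_text, some_parentheticals)

result = clean_parens_partial("Thank you. (Applause) Thank you very much.")
print(result)
```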

A survey of the global increase in terms, from 50379 to 58404, turned up some interesting results: first, the difference between the two sets of feature names was 19,891 and it looks like the `CountVectorizer` didn't lowercase a lot of things, even when the `lowercase=True` argument was passed to it. (The scikit-learn documentation describes a custom `preprocessor` as overriding the built-in preprocessing stage, which includes lowercasing, so this may be expected behavior rather than a bug.) This necessitated a slight workaround, until we understand otherwise, of lowering everything in the regex function, `clean_parens`, as shown above.
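The lowercasing behavior can be reproduced on a toy input. The sketch below compares the vocabulary learned with and without a do-nothing preprocessor; `passthrough` and `docs` are invented for the comparison, and scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

def passthrough(text):
    """A do-nothing preprocessor (hypothetical, for illustration)."""
    return text

# One made-up document with the same word in two cases
docs = ["Laughter and laughter"]

# No custom preprocessor: lowercase=True is applied
vocab_default = sorted(CountVectorizer(lowercase=True).fit(docs).vocabulary_)

# Custom preprocessor: the text reaches the tokenizer exactly as passthrough returns it
vocab_custom = sorted(
    CountVectorizer(lowercase=True, preprocessor=passthrough).fit(docs).vocabulary_
)

print(vocab_default)
print(vocab_custom)
```

If the second vocabulary keeps the capitalized form, then the preprocessor, not the `lowercase` flag, is governing case, which is why `clean_parens` lowercases the text itself.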

With that change in place, we vectorize things again, and we get the slightest of drops in features, from 50379 to 50377, which means we are pretty stable. Nice! (A hand inspection reveals that this still includes a lot of named entities, so we are going to raise the threshold for `min_df` from the default 1 to 2 in a moment.) First let's explore what our cleaning has wrought:

In [8]:
vec_clean = CountVectorizer(preprocessor = clean_parens)
X_clean = vec_clean.fit_transform(texts)
X_clean.shape

(1747, 50377)

In [9]:
df_clean = pd.DataFrame(X_clean.toarray(), columns = vec_clean.get_feature_names())

In [10]:
term = 'laughter'
f"The baseline count for *{term}* is {df_base[term].sum()}; \
with cleaning it is {df_clean[term].sum()}"

'The baseline count for *laughter* is 7374; with cleaning it is 98'

## TF Matrix with a 2-Document Minimum

The list of parentheticals and the `clean_parens` function are repeated here for ease of readability and portability.

In [11]:
# Raw strings keep the regex escapes (\( and \)) from being read as string escapes
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]

def clean_parens(text):
    # Lowercase once, before the loop, rather than on every pass
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text

vec_new = CountVectorizer(preprocessor = clean_parens, min_df = 2)
X_new = vec_new.fit_transform(texts)
X_new.shape

(1747, 29340)
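To make the effect of `min_df = 2` concrete, the filter can be imitated with the standard library alone: count the number of documents each term appears in, then keep the terms that reach the threshold. The corpus below is invented, and the tokenizer is only an approximation of the vectorizer's default.

```python
import re
from collections import Counter

# Approximation of the default tokenizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Toy corpus, invented for illustration
docs = [
    "Robots learn quickly",
    "Robots and people learn",
    "People like marimba marimba marimba",
]

# Document frequency: number of documents containing each term at least once
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))

# Keep terms found in at least min_df documents
min_df = 2
kept = sorted(term for term, n_docs in doc_freq.items() if n_docs >= min_df)
print(kept)
```

Note that *marimba* is dropped even though it occurs three times: the threshold counts documents, not occurrences, which is what makes it useful for trimming names and other terms confined to a single talk.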

As above, we sum our terms and write the results to a CSV file using `pandas`:

In [12]:
# Create the dataframe
df_new = pd.DataFrame(X_new.toarray(), columns = vec_new.get_feature_names())

# Create a series of sums
sums_new = df_new.sum()

# Inspect the sums
sums_new[1000:1005]

affiliations    3
affinity        3
affirm          7
affirmation     4
affirmations    3
dtype: int64

In [None]:
# Write sums to CSV:
# sums_new.to_csv('../output/word_freq_clean.csv')

## Creating a Revised Dataset with Term Frequencies

We now have a dataframe whose contents are those of the term frequency matrix for all our texts. First we give it an index using the `urls` list drawn from the base dataframe. Then we save it to a CSV, and finally we merge the base dataframe with the term matrix to produce a dataframe that has everything we need for our exploration of trends. 

As above, where cells write files, or possibly overwrite files, the code is commented out.

In [13]:
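# Draw the list of talk URLs from the base dataframe and use it as the row labels
urls = df.public_url.tolist()
df_new_labeled = df_new.set_axis(urls, axis=0)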

# df_new_labeled.to_csv('../output/talks_all_tf.csv')

In [14]:
# Set the index for the new dataframe as 'public_url'
df_new_labeled.index.name = 'public_url'

# Use this to check the name
df_new_labeled.index.name

'public_url'

In [None]:
# We can merge the dataframes on the columns we know match:
df_merge = pd.merge(df, df_new_labeled, on = 'public_url')

# Write the merged dataframe to file:
# df_merge.to_csv('../output/talks_all_tf_matrix.csv')
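One thing to watch in a merge like this is that some terms in the matrix probably share names with metadata columns (*event* and *views*, for example, are both column names and ordinary words). The sketch below uses toy stand-ins, `meta` and `tf`, to show one way of making the join keys explicit and tagging clashing term columns with a suffix. The `_tf` suffix is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy stand-ins for the metadata and term-frequency dataframes
meta = pd.DataFrame({
    "public_url": ["url_a", "url_b"],
    "event": ["TED2006", "TED2007"],
})
tf = pd.DataFrame(
    {"event": [0, 2], "robot": [5, 1]},   # "event" is also a term here
    index=pd.Index(["url_a", "url_b"], name="public_url"),
)

# Join the metadata column to the term matrix's index,
# suffixing only the term columns whose names clash
merged = pd.merge(
    meta, tf,
    left_on="public_url", right_index=True,
    suffixes=("", "_tf"),
)
print(merged.columns.tolist())
```

With the default suffixes, pandas would rename both clashing columns (`event_x`, `event_y`), which makes the metadata harder to find afterwards.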

## Deriving a Custom Stopword List Based on Term Frequencies

In the section that follows, we take the array holding all the talks and their terms and transpose it so that the terms are on the rows. We then sum the rows and turn the sum column into a series which is then sorted and saved to CSV for hand inspection. What the hand inspection revealed was, beyond the usual stopwords, there were enough terms in the top 300 terms that had semantic possibilities that they should be kept. The best bet moving forward is to use an available, and widely-used, stopword list. We chose the one provided by the NLTK.

In [15]:
term_matrix = pd.DataFrame(X.todense(), columns=vec.get_feature_names())
tfdf = term_matrix.transpose()
tfdf.shape

(50379, 1747)

In [17]:
tfdf.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1738,1739,1740,1741,1742,1743,1744,1745,1746,sum
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,15
000,0,1,4,0,1,4,0,0,0,2,...,0,0,0,1,3,1,0,0,1,2098
000000004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
0000001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
000001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
000042,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
0001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
00046,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
000th,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6


In [25]:
# Sum the counts for each term into a column titled "sum":
tfdf['sum'] = tfdf.sum(axis=1)

# Isolate the sums in a series
sums = pd.Series(tfdf['sum'])

In [26]:
sums.sort_values(ascending=False, inplace=True)

In [27]:
sums.to_csv('../output/tf_main.csv')
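The hand inspection of the top terms can be partly scripted. The sketch below uses made-up totals and a tiny stand-in stopword set; in practice the totals would be the sorted sums saved above and the set would come from the NLTK list.

```python
from collections import Counter

# Made-up term totals, standing in for the sorted sums
totals = Counter({"the": 900, "and": 700, "people": 120, "world": 95, "of": 650})

# Tiny stand-in stopword set (the real list would come from NLTK)
stopwords = {"the", "and", "of"}

# Walk the terms from most to least frequent, keeping those not on the stopword list
top_content_terms = [
    (term, count) for term, count in totals.most_common() if term not in stopwords
]
print(top_content_terms)
```

What remains is the list of high-frequency terms a generic stopword list would leave in place, which are the ones worth reading by eye.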