# Overall Term Frequencies

With the TEDtalks-all dataset created, we have 1747 talks to work with. This is a small corpus, so the usual reasons for shrinking the feature set do not apply. Still, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results are comparable to the work of others and so that we can scale up the work here without having to rethink the foundations.

## Summary

In this notebook we load the complete dataset of both TED and TED+ talks (no TEDx talks). We use Python's `scikit-learn` library to create a document-term frequency matrix with a shape of 1747 x 50379. Summing the counts to get a total for each word across all talks in the dataset, we then hand-inspect the totals and discover that a small set of numbers recurs. We approach mapping those numbers in two trials before returning to the task of getting a clean frequency list with no repetitions or other oddities.

## Imports and Data

In [1]:
# Imports
import pandas as pd, re, csv, nltk
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

**To Do**: Edit the CSV to remove the vestigial index column at the start of each line. Then use `df.set_index('Talk_ID')`.
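
Until the CSV itself is edited, the same cleanup could be done at load time. The following is only a sketch on a toy file: it assumes the vestigial index is the first column and has a blank header (which pandas labels `Unnamed: 0`), and the `speaker` column is hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for TEDall_speakers.csv; the blank first header mimics a
# saved index, and every column other than Talk_ID is hypothetical.
raw = io.StringIO(
    ",Talk_ID,speaker\n"
    "0,1,Speaker A\n"
    "1,7,Speaker B\n"
)

toy = pd.read_csv(raw)

# Drop the first column, whatever pandas has named it
toy = toy.drop(columns=toy.columns[0])

# set_index returns a new frame, so the result has to be assigned
toy = toy.set_index('Talk_ID')
print(toy)
```

Note that `df.set_index('Talk_ID')` on its own leaves `df` unchanged; the returned frame needs to be kept.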

## Frequencies

The sole purpose of this notebook is to establish how we are going to elicit our features, our words, from the collection of talks. Thus, the only column we are interested in is the one with the texts of the talks. As we move forward, however, we will want to decide if we are simply going to append ~30,000 columns to a version of the extant CSV or create a separate CSV for each experiment. 

For this first experiment, we will keep it simple, creating two lists, one of the URLs and one of the texts. The URLs are unique, human-friendly identifiers for the talks. (We can, perhaps, make them a bit more friendly by modifying them a bit, subtracting `https://www.ted.com/talks/` from each.)
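
A minimal sketch of that trimming, assuming every URL starts with the same prefix. The helper name `short_name` and the example URL are hypothetical.

```python
PREFIX = 'https://www.ted.com/talks/'

def short_name(url, prefix=PREFIX):
    """Return the URL without the shared prefix; leave other strings alone."""
    return url[len(prefix):] if url.startswith(prefix) else url

# Hypothetical URL, for illustration only
example = 'https://www.ted.com/talks/some_speaker_some_title'
print(short_name(example))
```

Applied to the whole list, this would be `[short_name(u) for u in urls]`.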

In [3]:
urls  = df.public_url.tolist()
texts = df.text.tolist()

There are a number of ways to get term frequencies, but **scikit-learn**'s `CountVectorizer` is, I think, the way to go, since it will work well with the other vectorizers and models also available in `sklearn`.

In our first experiment, we run `CountVectorizer` unadorned. The default options are: lowercase everything, treat punctuation as a separator between tokens, and make a word out of any run of two or more word characters (letters, digits, or underscores). The only thing that might not be welcome is the splitting of contractions. For now, we will leave things as they are. (Also, please note, no stopwords were used, so we have an unfiltered word list.)
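
To see these defaults at work, here is a small sketch that runs the vectorizer's own analyzer on the opening words of the first talk. `build_analyzer()` returns the function `CountVectorizer` applies to each document; the variable name `analyze` is just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the lowercasing + tokenizing function
# that an unadorned CountVectorizer applies to every document
analyze = CountVectorizer().build_analyzer()

# Opening words of the first talk
tokens = analyze("Thank you so much, Chris. And it's truly a great honor")
print(tokens)
```

The contraction `it's` comes out as `it` alone: the apostrophe splits it, and the one-character pieces `s` and `a` fall below the two-character minimum.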

For this current work, we run `fit()` and `transform()` separately. `fit()` calculates the parameters (here, the vocabulary) and saves them as internal state on the object; `transform()` then applies the transformation to a particular set of examples (here, the ones we just fitted). The two operations are usually done in a single step with `fit_transform()`.

In [4]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer()

# fit the model to the data 
vecs = vectorizer.fit(texts)

# transform the data according to the fitted model
bow = vecs.transform(texts)

# see how many features we have
bow.shape

(1747, 50379)
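
For comparison, a minimal sketch of the single-call form, run on a toy corpus rather than on `texts`. The names `toy_texts`, `two_step`, and `one_step` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for `texts`
toy_texts = ["the talk was great", "the great talk about the talk"]

# Two steps: learn the vocabulary, then count
two_step = CountVectorizer().fit(toy_texts).transform(toy_texts)

# One step: learn the vocabulary and count in a single call
one_step = CountVectorizer().fit_transform(toy_texts)

# Both routes give the same document-term matrix
print(one_step.shape)
print((one_step.toarray() == two_step.toarray()).all())
```

Keeping the steps separate matters mainly when the fitted vocabulary will later be applied to texts it was not fitted on.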

## Frequency Totals per Word

We can total up our columns for each feature (word), which is something we will be doing per year, per gender, per discipline. Here, we take the vector describing a word and sum it. We then pair the sum with the word in a tuple, which we then sort by frequency. 

(I'm doing it this way because it appears to be the way to do it, but it also strikes me that there should be a way to do this within the array itself, or, perhaps, to do it through **pandas**.)

We save the results to a CSV file so that we can hand-check the words: are these the results we expected? (We don't want any weirdness affecting our overall results.) The hand inspection looks good. I didn't see anything among the words occurring 4 or more times that looked off. (So, the simplest solution works!) What I did note was the frequency of certain **numbers**: **100**, **12**, etc. This might be worth a closer look: are there *power numbers*? (I am thinking here of Alan Dundes' essay on the "power" of three in American culture.)

**To Do**: It would be nice to be able to grab all words of a certain frequency, or range of frequencies.

---
**Follow-up**: whenever I attempt some version of
```python
for item in vecs:
    if vecs.vocabulary_.get(item) == 1691:
        print(item)
```
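I get **`TypeError: 'CountVectorizer' object is not iterable`**. The vectorizer object itself cannot be looped over, and its `vocabulary_` attribute maps each term to its column index rather than to its count, so comparing `vocabulary_.get(item)` to 1691 would not test frequency in any case. My best guess, for now, is that we need to use the list of `(word, count)` tuples, `words_freq`, to get this information.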
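
A minimal sketch of such a lookup, working from a list of `(word, count)` pairs shaped like `words_freq`. The helper name `words_in_range` is hypothetical, and the sample pairs are a handful of the number counts recorded in this notebook.

```python
def words_in_range(word_counts, low, high=None):
    """Return the (word, count) pairs whose count falls in [low, high]."""
    if high is None:
        high = low
    return [(w, c) for w, c in word_counts if low <= c <= high]

# A few pairs from the hand inspection, standing in for words_freq
sample = [('000', 2098), ('10', 1691), ('20', 1107), ('100', 902), ('30', 827)]

print(words_in_range(sample, 1691))       # a single frequency
print(words_in_range(sample, 800, 1200))  # a range of frequencies
```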

---


In [5]:
# summing up the counts for each word
sum_words = bow.sum(axis=0)

# create a list of (word, count) tuples
words_freq = [(word, sum_words[0, idx]) for word, idx in vecs.vocabulary_.items()]

# sort the list by count, highest first
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# check the results of our work by printing the 20 most frequent words
print(words_freq[0:20])

[('the', 166093), ('and', 118989), ('to', 102276), ('of', 92416), ('that', 76268), ('in', 62673), ('it', 59191), ('you', 56296), ('we', 54458), ('is', 50072), ('this', 38510), ('so', 29001), ('they', 25157), ('was', 24582), ('for', 24445), ('are', 22592), ('have', 21965), ('but', 21804), ('on', 20978), ('what', 20907)]


In [6]:
# with open('../output/word_freq.csv','w') as out:
#     csv_out = csv.writer(out)
#     csv_out.writerow(['word','count'])
#     for row in words_freq:
#         csv_out.writerow(row)

## Frequently Occurring Numbers

One of the dimensions of the corpus that arises out of a hand inspection of the terms is the frequency with which some numbers appear. The following table captures the top ten numbers:

| TERM | FREQUENCY |
|------|-----------|
| 000  | 2098 |
| 10   | 1691 |
| 20   | 1107 |
| 100  | 902  |
| 30   | 827  |
| 50   | 784  |
| 15   | 659  |
| 40   | 494  |
| 12   | 460  |
| 25   | 410  |

Other frequently occurring numbers: 60, 500, 200, 11, 18, 80, 14 (241 times!). 

In order to examine the appearance of the numbers in context, we make a giant string out of the list of strings, `texts`, since which text a number appears in matters less than its immediate context.

First, a quick reminder of what the `texts` look like:

In [7]:
print(texts[0][0:100])

  Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this sta


### Trial 1

Normally I would use `words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()`, but that pattern strips out digits and in this case we want to keep the numbers, so we'll keep it simple and split on whitespace:

In [8]:
onetext = nltk.Text('\n'.join(texts).split())
# And here's what an NLTK text object looks like: a list of words, really
print(onetext[0:10])

['Thank', 'you', 'so', 'much,', 'Chris.', 'And', "it's", 'truly', 'a', 'great']


In [9]:
onetext.concordance("000")

no matches


In [10]:
onetext.concordance("10")

Displaying 25 of 1216 matches:
Thank you very much. (Applause) About 10 years ago, I took on the task to teac
tion of income of people. One dollar, 10 dollars or 100 dollars per day. There
 a long time, but they come out after 10 years very, very differently. And the
at drives you in your life today? Not 10 years ago. Are you running the same p
really heavy, but in the last five or 10 years, have there been some decisions
. (Laughter) Are you sure? (Laughter) 10 seconds! (Laughter and applause) 10 s
) 10 seconds! (Laughter and applause) 10 seconds, I want to be respectful. All
principle in the Bible that says give 10 percent of what you get back to chari
ional shelter that would last five to 10 years, that would be placed next to t
tandards of five billion people? With 10 million solutions. So I wish to devel
 to go see Central Command, which was 10 minutes away. And that way, I could g
 will not launch this without five to 10 million units in the first run. And t
 down, and that's why

In [11]:
onetext.concordance("25")

Displaying 25 of 320 matches:
ts live at or below the poverty line; 25 percent of us are unemployed. Low-inc
k. Right now we're separated by about 25 feet of water, but this link will cha
about earlier. And although less than 25 percent of South Bronx residents own 
en I started Saddleback Church, I was 25 years old. I started it with one othe
 the church had paid me over the last 25 years, and I gave it back. And I gave
now it. But I think the man was maybe 25 years too early. So let's do it. Than
reports in the world came from GPHIN, 25 percent of all the reports in the wor
 years, imagine them coming out every 25 seconds. So, imagine we could do that
nd his wife gave birth to her baby at 25 weeks. And he never expected this. On
 and the other, certainly of the last 25 years, that are going to have an impa
ccording to Howard, somewhere between 25 and 27 percent of you. Most of you li
bout it, I would say that in the last 25 years, of every invention or innovati
years, you expect to d

A couple of things to note here:

First, there is a discrepancy in the count between `sklearn` and the NLTK: the former counted 2098 occurrences of `000`, the latter none. In all the counts that follow, there is a similar mismatch:

| TERM | `sklearn` | `nltk` |
|------|-----------|--------|
| 000  | 2098 | "no match" |
| 10   | 1691 | 1216 |
| 20   | 1107 | 879  |
| 100  | 902  | 647  |
| 30   | 827  | 650  |
| 50   | 784  | 594  |
| 15   | 659  | 512  |
| 40   | 494  | 387  |
| ...  | ...  | ...  |
| 14   | 241  | 148  |
I don't have a ready explanation for this.
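
One possible explanation, offered as an untested assumption about this corpus rather than a finding: the two counts rest on different tokenizations. `CountVectorizer`'s default token pattern keeps only runs of word characters, so punctuation falls away, while `'\n'.join(texts).split()` splits on whitespace alone and leaves tokens like `10,` or `$10` intact, which an exact search for `10` will not match. The sketch below applies both tokenizations to an invented sentence.

```python
import re
from collections import Counter

# Invented sentence, for illustration only
sample = "About 10 years ago, 10 percent paid $10, or was it 10?"

# Whitespace split, as in Trial 1: punctuation stays attached to the token
by_whitespace = Counter(sample.split())

# scikit-learn's default token_pattern, applied by hand to lowercased text
by_pattern = Counter(re.findall(r"(?u)\b\w\w+\b", sample.lower()))

print(by_whitespace["10"], by_pattern["10"])
```

Here the whitespace tokens include `$10,` and `10?`, so only two of the four occurrences count as `10`, while the pattern-based tokens count all four.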

Second, the frequency of some numbers is readily explained:

* Round numbers like 10, 20, 30, 50, and 100 are approximations -- though it would be interesting to explore how often they are attached to large scalars like "thousand" or "million."
* Some numbers seem to represent alternate ways of counting: 25 regularly stands in for "one-quarter" -- though not as often as we might imagine -- and 18 is regularly paired with *month* as a more precise way to say "a year and a half."
* There are some numbers, like 11 and 14, which seem to have power all their own, perhaps tied to particular ages in humans.

Next up is some code to explore the words that most commonly occur with these numbers.

In [12]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()

All my searches for "collocations with specific words" took me to the NLTK, which means, so far as I can tell, generating all the bigrams and then filtering to get the one(s) you want. This seems backwards to me: wouldn't it be faster simply to find the word and then what comes after it? I'll take a look at regex for this later.

In [13]:
## Bigrams
finder = BigramCollocationFinder.from_words(onetext)

In [14]:
## Here's the filter operation:
# the filter returns True for bigrams to discard: those without the number
the_number = lambda *w: '14' not in w
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain the number
finder.apply_ngram_filter(the_number)
# return the 10 bigrams with the highest likelihood-ratio score
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[('14', 'years'), ('14', 'billion'), ('was', '14'), ('14', 'years,'), ('14', 'hours'), ('14', 'orders'), ('14', 'million'), ('14', 'percent'), ('14', 'feet'), ('14', 'times')]


This does not return a count. *Oi!*
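
To get counts, one option is to skip the scoring and tally the neighbors directly. The sketch below does this with the standard library on an invented token list; the helper name `following_words` is hypothetical. (The NLTK finder also appears to keep its raw bigram counts in its `ngram_fd` frequency distribution, which may be worth checking as an alternative.)

```python
from collections import Counter

def following_words(tokens, target):
    """Count the tokens that come immediately after each occurrence of target."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

# Invented token list, for illustration only; in the notebook this would be
# a list of tokens such as list(onetext)
tokens = "she was 14 years old and 14 years later paid 14 percent".split()

print(following_words(tokens, "14").most_common())
```

This also bears on the earlier worry about generating every bigram first: here only the pairs that start with the target word are ever counted.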

### Trial 2

In [15]:
the_one = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).lower().split())
# And here's what an NLTK text object looks like: a list of words, really
print(the_one[0:10])

['thank', 'you', 'so', 'much', 'chris', 'and', "it's", 'truly', 'a', 'great']


In [16]:
the_one.concordance("000")

Displaying 25 of 2098 matches:
u for 99 i know one guy who's spent 4 000 just on photoshop over the years and 
er industries that bring more than 60 000 diesel truck trips to the area each w
ed by the parks department about a 10 000 seed grant initiative to help develop
re than 60 years we leveraged that 10 000 seed grant more than 300 times into a
before their buildings were razed 600 000 people were displaced the common perc
p by showing the internet users per 1 000 in this software we access about 500 
ll tell you about emotion there are 6 000 emotions that we have words for in th
f your dominant emotions if i have 20 000 people or 1 000 and i have them write
emotions if i have 20 000 people or 1 000 and i have them write down all the em
ith this i was in hawaii i was with 2 000 people from 45 countries we were tran
 vertebrate landmass that was just 10 000 years ago yesterday in biological ter
rillion facts your mind can handle 15 000 decisions a second well it would be i
d in the 

Well, there's the missing `000`! It's in the idiomatic transcription practices of TED wherein a number like "sixty thousand" is rendered as "60,000." 

One thing we know now: reporting large numbers is a part of TED talks.

**TO DO**: How to keep the comma marker between numbers? (Or should we just look to 000 as a possible collocate with the other numbers?) One solution from the [Regex Cookbook][]:

```python
\b[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?\b|\.[0-9]+\b
```

[Regex Cookbook]: https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch06s11.html
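
A quick sketch of how that pattern behaves in Python's `re` module, run on an invented sentence. The groups are made non-capturing here (an adaptation, not part of the Cookbook's version) so that `findall()` returns whole matches rather than group tuples.

```python
import re

# The Cookbook pattern with its groups made non-capturing, so that
# findall() returns whole matches rather than tuples of groups
NUMBER = re.compile(r"\b[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\b|\.[0-9]+\b")

# Invented sentence in TED's transcription style, for illustration only
sample = "More than 60,000 trucks, about 10 years ago, cost 1,500.50 dollars."

print(NUMBER.findall(sample))
```

One caveat: as written, the pattern expects a comma after every group of up to three digits, so an unseparated number of four or more digits, such as a year like 2008, is not matched at all and would need an extra alternative.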

## An Improved Term Frequency Mapping

With the numbers revealing some problems with the way the default tokenization works in `sklearn`, we need to tweak things a bit. 