## Frequently Occurring Numbers

One of the dimensions of the corpus that emerges from a hand inspection of the terms is the frequency with which some numbers appear. The following table captures the top ten numbers:

| TERM | FREQUENCY |
|------|-----------|
| 000  | 2098 |
| 10   | 1691 |
|  20  | 1107 |
| 100  |  902 |
|  30  |  827 |
|  50  |  784 |
|  15  |  659 | 
|  40  |  494 |
|  12  |  460 | 
|  25  |  410 |

Other frequently occurring numbers: 60, 500, 200, 11, 18, 80, and 14 (which appears 241 times!).

To examine the numbers in context, we join the list of strings, `texts`, into one giant string: which text a number appears in matters less than its immediate context.

## Imports and Data

In [1]:
# Imports
import pandas as pd, re, nltk

In [2]:
# Load the Main Dataframe
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

In [3]:
texts = df.text.tolist()

In [4]:
frequencies = pd.read_csv('../output/word_freq.csv')
frequencies.shape

(50379, 2)

First, a quick reminder of what the `texts` look like:

In [5]:
print(texts[0][0:100])

  Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this sta


### Trial 1

Normally I would use `words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()`, but that pattern throws away digits, and in this case we want to keep the numbers. So we'll keep it simple and just split on whitespace:

In [6]:
onetext = nltk.Text('\n'.join(texts).split())
# And here's what an NLTK text object looks like: a list of words, really
print(onetext[0:10])

['Thank', 'you', 'so', 'much,', 'Chris.', 'And', "it's", 'truly', 'a', 'great']


In [7]:
onetext.concordance("000")

no matches


In [8]:
onetext.concordance("10")

Displaying 25 of 1216 matches:
Thank you very much. (Applause) About 10 years ago, I took on the task to teac
tion of income of people. One dollar, 10 dollars or 100 dollars per day. There
 a long time, but they come out after 10 years very, very differently. And the
at drives you in your life today? Not 10 years ago. Are you running the same p
really heavy, but in the last five or 10 years, have there been some decisions
. (Laughter) Are you sure? (Laughter) 10 seconds! (Laughter and applause) 10 s
) 10 seconds! (Laughter and applause) 10 seconds, I want to be respectful. All
principle in the Bible that says give 10 percent of what you get back to chari
ional shelter that would last five to 10 years, that would be placed next to t
tandards of five billion people? With 10 million solutions. So I wish to devel
 to go see Central Command, which was 10 minutes away. And that way, I could g
 will not launch this without five to 10 million units in the first run. And t
 down, and that's why

In [15]:
onetext.concordance("40")

Displaying 25 of 387 matches:
w York City already handled more than 40 percent of the entire city's commerci
ing rooms, whose evolution in 20, 30, 40 years we can't predict. So that liter
nd all the other teams have done this 40 Days of Purpose, based on the book. A
nternet tools, and we ended up having 40 chapters starting up, thousands of ar
cumented the Lower Ninth for the last 40 years. That was their home, and these
me. And a long time ago — well, about 40 years ago — my mom had an exchange st
 world where women and children spend 40 billion hours a year fetching water. 
 age category of 76 to 85, as much as 40 percent of people have nothing really
things tend to happen every 25 years. 40 years long, with an overlap. You can 
 all high-rises. So they'll put 20 or 40 up at a time, and they just go up in 
te, we've seen no side effects in the 40 or so patients in whom it's been impl
 terms of price performance, that's a 40 to 50 percent deflation rate. And eco
 people may increase t

A couple of things to note here:

First, there is a discrepancy in the count between `sklearn` and the NLTK: the former counted 2098 occurrences of `000`, the latter none. In all the counts that follow, there is a similar mismatch:

| TERM | `sklearn` | `nltk` |
|------|-----------|--------|
| 000  | 2098 | "no match" |
|  10  | 1691 | 1216 |
|  20  | 1107 | 879 |
| 100  |  902 | 647 |
|  30  |  827 | 650 |
|  50  |  784 | 594 | 
|  15  |  659 | 512 | 
|  40  |  494 | 387 | 
| ...  | ... | ... |
|  14  |  241 | 148 | 

I don't have a ready explanation for this.

Second, the frequency of some numbers is readily explained:

* Round numbers like 10, 20, 30, 50, and 100 are approximations -- though it would be interesting to explore how often they are attached to large scalars like "thousand" or "million."
* Some numbers seem to represent alternate ways of counting: 25 regularly stands in for "one-quarter" -- though not as often as we might imagine -- and 18 is regularly paired with *month* as a more precise way to say "a year and a half."
* There are some numbers, like 11 and 14, which seem to have power all their own, perhaps tied to particular ages in humans.
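
The question in the first bullet (how often a round number is attached to a large scalar) could be explored with something like the following. This is only a sketch: `SCALE_WORDS` and `scale_word_counts` are hypothetical names that are not part of the notebook, and the sample sentence is made up for illustration.

```python
from collections import Counter

# Hypothetical list of "large scalars"; extend as needed.
SCALE_WORDS = {"hundred", "thousand", "million", "billion", "trillion"}

def scale_word_counts(tokens, number):
    """Count the scale words that directly follow `number` in a list of tokens."""
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        if current == number:
            # Whitespace tokens keep their punctuation, so strip it before comparing.
            word = following.lower().strip(".,;:!?")
            if word in SCALE_WORDS:
                counts[word] += 1
    return counts

# Made-up sample text, split on whitespace like `onetext`.
sample = "About 10 million units, or 10 thousand per day, over 10 years.".split()
result = scale_word_counts(sample, "10")
print(result)
```

Run against the full token list, `sum(result.values())` divided by the number's total count would give the share of occurrences that are really part of a larger figure.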

Next up is some code to explore the words that most commonly occur with these numbers.

In [10]:
# Import only the finder we use below rather than everything in the module
from nltk.collocations import BigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()

All my searches for "collocations with specific words" took me to the NLTK, which means, so far as I can tell, generating all the bigrams and then filtering to get the one(s) you want. This seems backwards to me: wouldn't it be faster simply to find the word and then what comes after it? I'll take a look at regex for this later.
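
As a rough sketch of the "find the word, then look at what comes after it" idea, a regular expression can be run over the raw string instead of building every bigram first. The helper name `words_after` and the sample sentence are assumptions for illustration, not something from the notebook, and whether this is actually faster on the full corpus has not been measured here.

```python
import re
from collections import Counter

def words_after(text, number):
    """Count the words that immediately follow a standalone `number` in `text`."""
    # The lookbehind rejects matches inside longer figures such as 114, 1,014 or 3.14.
    pattern = re.compile(r"(?<![\w,.])" + re.escape(number) + r"\s+([A-Za-z']+)")
    return Counter(word.lower() for word in pattern.findall(text))

# Made-up sample text.
sample = "She was 14 years old. It took 14 hours, not 114 hours. 14 Years later..."
following = words_after(sample, "14")
print(following.most_common(2))
```

Because the result is a `Counter`, it also reports how often each following word occurs.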

In [11]:
## Bigrams
finder = BigramCollocationFinder.from_words(onetext)

In [12]:
## Here's the filter operation:
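# apply_ngram_filter removes every bigram for which this function returns True,
# so we return True for bigrams that do NOT contain the number
the_number = lambda *w: '14' not in w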
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain the number
finder.apply_ngram_filter(the_number)
# return the 10 bigrams with the highest likelihood-ratio score
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[('14', 'years'), ('14', 'billion'), ('was', '14'), ('14', 'years,'), ('14', 'hours'), ('14', 'orders'), ('14', 'million'), ('14', 'percent'), ('14', 'feet'), ('14', 'times')]


This does not return a count. *Oi!*

### Trial 2

In [13]:
the_one = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).lower().split())
# And here's what an NLTK text object looks like: a list of words, really
print(the_one[0:10])

['thank', 'you', 'so', 'much', 'chris', 'and', "it's", 'truly', 'a', 'great']


In [16]:
the_one.concordance("40")

Displaying 25 of 494 matches:
oking for a place to eat we were on i 40 we got to exit 238 lebanon tennessee 
w york city already handled more than 40 percent of the entire city's commerci
eading rooms whose evolution in 20 30 40 years we can't predict so that litera
nd all the other teams have done this 40 days of purpose based on the book and
internet tools and we ended up having 40 chapters starting up thousands of arc
cumented the lower ninth for the last 40 years that was their home and these a
e time and a long time ago well about 40 years ago my mom had an exchange stud
 world where women and children spend 40 billion hours a year fetching water t
be someone coming to rescue me cut to 40 some odd years later we go to kenya a
t age category of 76 to 85 as much as 40 percent of people have nothing really
 is how do you go to the loo at minus 40 ben i've read somewhere that at minus
ben i've read somewhere that at minus 40 exposed skin becomes frostbitten in l
ou answer the call of 

Well, there's the missing `000`! It comes from TED's transcription convention, in which a number like "sixty thousand" is rendered as "60,000": once the comma is replaced with a space, `000` becomes a token of its own.

One thing we know now: reporting large numbers is a part of TED talks.
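
A small experiment makes the difference between the two trials visible. It is a sketch on a made-up sentence, not on the corpus, but it is consistent with the counts above: the Trial 2 concordance for `40` finds 494 matches, the same figure as the `sklearn` count, while the whitespace split of Trial 1 finds only 387.

```python
import re

# Made-up sample sentence.
sample = "About 60,000 people came, and 10, maybe 20, stayed."

# Trial 1 style: split on whitespace only, so punctuation stays attached to tokens.
whitespace_tokens = sample.split()

# Trial 2 style: replace everything except letters, digits, and apostrophes with a space.
cleaned_tokens = re.sub("[^a-zA-Z0-9']", " ", sample).lower().split()

print(whitespace_tokens)
print(cleaned_tokens)
print(whitespace_tokens.count("10"), cleaned_tokens.count("10"))
print(whitespace_tokens.count("000"), cleaned_tokens.count("000"))
```

In the whitespace version, `10,` and `60,000` are single tokens, so neither `10` nor `000` is found as an exact match; in the cleaned version both appear.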

**TO DO**: How to keep the comma marker between numbers? (Or should we just look to 000 as a possible collocate with the other numbers?) One solution from the [Regex Cookbook][]:

```python
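# The cookbook pattern stored as a raw string; `number_pattern` is just a placeholder name
number_pattern = r"\b[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?\b|\.[0-9]+\b"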
```

[Regex Cookbook]: https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch06s11.html
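
One possible way to try the cookbook pattern is sketched below. The names `NUMBER_PATTERN` and `extract_numbers` are hypothetical, the sample sentence is made up, and the groups have been made non-capturing, which does not change what the pattern matches.

```python
import re

# Cookbook pattern with non-capturing groups (an adaptation, not the verbatim original).
NUMBER_PATTERN = re.compile(r"\b[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\b|\.[0-9]+\b")

def extract_numbers(text):
    """Return every number in `text`, keeping thousands separators intact."""
    return [match.group(0) for match in NUMBER_PATTERN.finditer(text)]

# Made-up sample sentence.
sample = "We reached 60,000 people, 1,250,000 views and 10 percent growth."
numbers = extract_numbers(sample)
print(numbers)
```

One caveat worth checking before relying on it: because the pattern expects groups of three digits separated by commas, an unseparated four-digit figure such as a year (`2012`) is not matched at all.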