
# Words

As with skinning cats, there is more than one way to get bags of words out of a collection of texts. And that's all we are doing here. We are not worrying about capturing any context more fine-grained than the text itself. So, no syntax, no sentences, no paragraphs. We may look at parts of speech, but only as a dimension of a word itself.

That noted, there are two questions any effort at tokenization must answer: (1) **tokens** or **lemmas**? and (2) **stopwords**?

**Sections in this Notebook**
1. [Loading the Data](#Loading-the-Data)
2. [Term Frequencies](#Term-Frequencies)
3. [TF-IDF](#TF-IDF)

## Loading the Data

Working with only the data from the release, we have two data sets, `TEDonly` and `TEDplus`, which we previously merged into `TEDall_speakers`, with an additional column (`Set`) indicating the data set from which a given talk is taken. The first thing we will do is load `TEDall_speakers`.

In [5]:
import pandas as pd

df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

In [6]:
df.head(3)

Unnamed: 0,Set,Talk_ID,public_url,headline,description,event,duration,published,tags,views,...,speaker2_introduction,speaker2_profile,speaker_3,speaker3_occupation,speaker3_introduction,speaker3_profile,speaker_4,speaker4_occupation,speaker4_introduction,speaker4_profile
0,only,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,...,,,,,,,,,,
1,only,7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,...,,,,,,,,,,
2,only,53,https://www.ted.com/talks/majora_carter_s_tale...,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",TED2006,0:18:36,6/27/06,"MacArthur grant,cities,green,activism,politics...",2000421,...,,,,,,,,,,


To work with just the talks for the time being, we will double-check the label of the talk column:

In [7]:
with open('../output/TEDall_speakers.csv') as f:
    cols_only = f.readline().strip().split(",")

print(cols_only)

['Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']


For the sake of this particular experiment, I am going to convert the column into a list.

In [11]:
talks = df.text.tolist()
len(talks)

1747

The first thing to check is how many talks are, essentially, empty. We will use an arbitrary threshold, a string length of fewer than 10 characters, and see what gets returned:

In [16]:
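# enumerate() gives each talk's own position; talks.index(talk) would
# return only the first match if two talks shared the same text
for i, talk in enumerate(talks):
    if len(talk) < 10:
        print(i)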

Okay, no empty talks. Now let's see how many short ones there are. For reference, each of the first 10 talks in our dataset has a length in the five figures.

In [15]:
for talk in talks[0:10]:
    print(len(talk))

12074
18785
18477
17801
17561
23666
18794
16063
18097
20267


In [17]:
# print the position of every talk shorter than 500 characters
for i, talk in enumerate(talks):
    if len(talk) < 500:
        print(i)

113
235
382
496
573
799
899
1484
1564
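
The same length check can also be phrased in **pandas**. This is only a minimal sketch: the three toy strings below stand in for `df.text` and are not from the data, and `toy_text`, `lengths`, and `short` are names invented for the illustration.

```python
import pandas as pd

# Toy stand-in for df.text (hypothetical strings, not from the TED data)
toy_text = pd.Series([
    "  (Applause)    (Music)    (Applause)  ",
    "Thank you so much, Chris. " * 40,
    "(Music)",
])

# .str.len() returns the character length of every entry at once
lengths = toy_text.str.len()

# Boolean indexing keeps the index labels, so we see which rows are short
short = lengths[lengths < 500]
print(short)
```

On the real data, something like `df.text.str.len()` should give the same numbers as the loops above, with the advantage that the result stays attached to the DataFrame index.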


We can also inspect one of the short talks: it looks like it's a musical performance and all we have in the transcript are three parentheticals.

In [19]:
print(talks[113])

  (Applause)    (Music)    (Applause)  


And, just to make sure that the list does not include the column header, we will print a bit of the first item:

In [21]:
print(talks[0][0:50])

  Thank you so much, Chris. And it's truly a great


Okay, so no empty texts, but some short ones, and our first text is not the column name, but our first talk. 

## Term Frequencies

Next up is to reduce our texts to bags of words. There are a number of ways to do this, but **scikit-learn**'s `CountVectorizer` is, I think, the way to go, since it will work well with the other vectorizers and models also available in `sklearn`. We can store the counts either as their own CSV or as an amendment to our current CSV.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

If we run it unadorned, we get a vocabulary of a little over 50,000 distinct words. In the `TEDonly` talks, there were not quite 40,000 words. So `TEDplus` has added 10,000 words. I'm not sure of the significance, or the lack thereof, of that number. We have used the default options of the `CountVectorizer`: lowercase everything, treat punctuation as a separator between tokens, and make a word out of any run of two or more letters, digits, or underscores. The only thing that might not be welcome is the splitting of contractions. For now, we will leave things as they are. (Also, please note, no stopwords were used, so we have an unfiltered word list.)
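
To see what those defaults do in practice, here is a small sketch that runs the vectorizer's own analyzer on a single made-up sentence. The sentence and the names `sample`, `toy_vectorizer`, and `tokens` are assumptions for illustration only; `build_analyzer()` is the `CountVectorizer` method that returns the preprocessing-plus-tokenizing function.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentence (not from the TED data) chosen to exercise the defaults:
# a contraction, a hyphenated word, a one-character word, and a single digit
sample = "I don't think co-founders should pay $5, do you?"

toy_vectorizer = CountVectorizer()          # default options only
analyze = toy_vectorizer.build_analyzer()   # lowercasing + tokenizing in one callable

tokens = analyze(sample)
print(tokens)
```

The contraction loses its one-letter tail, the hyphenated word comes out as two tokens, and the one-character items ("I" and "5") disappear altogether.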

For this current work, we are running `fit()` and `transform()` separately. `fit()` just calculates the parameters and saves them as the object's internal state, while `transform()` applies the transformation to a particular set of examples (here, the ones we just fitted). Because the two so often go together, they are usually done in one step with `fit_transform()`.

In [24]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer()

# fit the model to the data 
vecs = vectorizer.fit(talks)

# transform the data according to the fitted model
bow = vecs.transform(talks)

# see how many features we have
bow.shape

(1747, 50379)
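
As a quick sanity check on the `fit()` / `transform()` versus `fit_transform()` point, here is a sketch on a toy corpus. The three toy documents and the names `toy_docs`, `two_step`, `one_step`, and `same` are invented for the example; it only shows that the two routes produce the same matrix when the same texts are used for both steps.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the TED talks)
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# Route 1: learn the vocabulary, then count against it
two_step = CountVectorizer().fit(toy_docs).transform(toy_docs)

# Route 2: do both in a single call
one_step = CountVectorizer().fit_transform(toy_docs)

# Dense comparison is fine for a toy matrix; avoid .toarray() on large data
same = np.array_equal(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

Keeping the steps separate still makes sense when the fitted vocabulary will later be applied to texts it was not fitted on.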

With a complete matrix in place, we can total up our columns for each feature (word). We can then take that vector, with each position describing one word, and convert it into a list of (word, count) tuples, which we can then sort by frequency. (I'm doing it this way because it appears to be the way to do it, but it also strikes me that there should be a way to do this within the array itself, or, perhaps, to do it through **pandas**.)

In [25]:
# summing up the counts for each word
sum_words = bow.sum(axis=0)

# pair each word with its total: vocabulary_ maps word -> column index
words_freq = [(word, sum_words[0, idx]) for word, idx in vecs.vocabulary_.items()]

# sort the list of (word, count) tuples, highest count first
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# check the results of our work by printing the 20 most frequent words
print(words_freq[0:20])

[('the', 166093), ('and', 118989), ('to', 102276), ('of', 92416), ('that', 76268), ('in', 62673), ('it', 59191), ('you', 56296), ('we', 54458), ('is', 50072), ('this', 38510), ('so', 29001), ('they', 25157), ('was', 24582), ('for', 24445), ('are', 22592), ('have', 21965), ('but', 21804), ('on', 20978), ('what', 20907)]
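
On the question of doing this through **pandas**: one possibility, sketched here on a toy corpus rather than on the talks, is to flatten the column sums into a one-dimensional array and hang the vocabulary on it as the index of a `Series`. The names `toy_docs`, `toy_vec`, `toy_bow`, `vocab`, `totals`, and `freq` are all invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the TED talks)
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

toy_vec = CountVectorizer()
toy_bow = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps word -> column index, so sorting the words by that
# index puts them in the same order as the columns of the matrix
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

# sum(axis=0) returns a 1 x n_features result; flatten it to 1-D
totals = np.asarray(toy_bow.sum(axis=0)).ravel()

# a Series keeps word and count together and sorts in one call
freq = pd.Series(totals, index=vocab).sort_values(ascending=False)
print(freq.head())
```

A `Series` like this can be filtered (`freq[freq == 2]`) or written out with `freq.to_csv(...)`, which may be more convenient than the list of tuples for later inspection.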


Let's save these results to a CSV file and check to see that the words we see are the kinds of words we want. We don't want any weirdness affecting our overall results.

In [27]:
import csv

with open('../output/word_freq.csv', 'w', newline='') as out:  # newline='' avoids blank rows on Windows
    csv_out = csv.writer(out)
    csv_out.writerow(['word','count'])
    for row in words_freq:
        csv_out.writerow(row)

The hand inspection looks good. I didn't see anything among the words with a frequency of 4 or above that looked off. (So, the simplest solution works!)

What I did note was the frequency of certain **numbers**: **100**, **12**, etc. This might be worth a closer look: are there *power numbers*? (I am thinking here of Alan Dundes' essay on the "power" of three in American culture.)

We can also do some one-off inspections: I would also like to develop code where we could grab all words of a certain frequency, or range of frequencies. (In part, this is a note to myself to understand better the nature of `vecs` and `bow` so as to be able to interact with them more effectively.)

**Follow-up**: whenever I attempt some version of
```python
for item in vecs:
    if vecs.vocabulary_.get(item) == 1691:
        print(item)
```
I get **`TypeError: 'CountVectorizer' object is not iterable`**. My best guess, for now, is that we need to use the list of tuples above to get this information.

In [28]:
vecs.vocabulary_.get('algorithm') # Wait! This doesn't get the count but the index?!

2124

In [30]:
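# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vecs.get_feature_names_out()[2120:2125].tolist()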

['algeria', 'algerian', 'algiers', 'algo', 'algorithm']
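
Since `vocabulary_` hands back a column index rather than a count, here is a small sketch, on a toy corpus, of how the index can be used to get at the count and how the mapping can be turned around to go from index to word. Everything named `toy_*`, along with `idx`, `count`, and `index_to_word`, is made up for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the TED talks)
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

toy_vec = CountVectorizer()
toy_bow = toy_vec.fit_transform(toy_docs)

# word -> column index (features are stored in alphabetical order)
idx = toy_vec.vocabulary_['cat']

# summing that one column gives the word's total count across documents
count = toy_bow[:, idx].sum()

# invert the dictionary to look a word up by its column index
index_to_word = {i: word for word, i in toy_vec.vocabulary_.items()}

print(idx, count, index_to_word[idx])
```

The inverted dictionary does the job that the failed loop over `vecs` was reaching for: given an index, it returns the word.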

The list of features begins with numbers, and there are a lot of them. (We don't, as yet, know how often or where they occur.) The range below isolates where the list transitions from numbers to words, in case anyone was wondering. The different spellings of "aaa" are interesting, but I am most curious about "aatcagggaccc."

In [32]:
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
feature_names = vecs.get_feature_names_out().tolist()
print(feature_names[900:1000])

['838', '84', '841', '844', '8462852', '849', '85', '850', '855', '8567', '85871', '85th', '86', '860', '862', '863', '8649', '86th', '87', '88', '880', '883', '885', '886', '89', '890', '8pm', '8th', '8x10', '90', '900', '9000', '90s', '91', '911', '912', '92', '920', '92121', '928', '93', '930', '931', '94', '95', '950', '952', '956', '96', '961', '962', '967', '97', '974', '979', '98', '982', '984', '987', '99', '996', '997', '999', '999999', '9mm', '9th', '______', '________', 'a10', 'a12', 'a380', 'a4', 'a92', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaaaah', 'aaaaaaah', 'aaaaaah', 'aaaah', 'aaaahhh', 'aaah', 'aag', 'aah', 'aakash', 'aaleh', 'aargh', 'aarhus', 'aaron', 'aaronson', 'aarp', 'aat', 'aatcagggaccc', 'ab', 'ab32', 'aba', 'ababa', 'abacazanian', 'abacha']


Okay, the solution to this problem appears to be to return to the list of tuples and, perhaps, convert it to a dictionary. Or at least that's one solution:

In [33]:
wf_dict = dict(words_freq)
wf_dict['algorithm']

123

In [34]:
for key in wf_dict:
    if wf_dict[key] == 123:
        print(key, end = ", ")

fresh, personally, appreciate, features, pulled, increased, budget, ride, 45, rooms, lay, embrace, stronger, affected, younger, fears, battery, milk, horse, inequality, oceans, algorithm, 
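
The exact-match loop above extends naturally to a range of frequencies. Here is a sketch using a hypothetical helper, `words_in_range`, and a made-up dictionary, `toy_freqs`, standing in for `wf_dict`:

```python
def words_in_range(freqs, low, high):
    """Return, in alphabetical order, the words whose count lies
    between low and high (both ends included)."""
    return sorted(word for word, count in freqs.items() if low <= count <= high)

# Made-up counts standing in for wf_dict
toy_freqs = {'the': 3, 'cat': 2, 'sat': 2, 'on': 1, 'mat': 1}

in_range = words_in_range(toy_freqs, 2, 3)
print(in_range)
```

Setting `low` and `high` to the same number reproduces the single-frequency lookup.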

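One interesting weirdness I discovered while doing this: "co" has apparently been broken off from the nouns and verbs it might modify, presumably because the default tokenizer splits on hyphens (so "co-founder" becomes "co" and "founder"). Oops.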

In [35]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

In theory, if we save the term frequency matrix, we need not run all the code above again. Here we are saving `bow` and `vecs` as pickles. The `bow` file, at 12.5MB, is the larger of the two; the `vecs` file weighs in at a mere 2.5MB by contrast.

In [36]:
# Save the term-frequency matrix as a pickle file
joblib.dump(bow, '../output/bow.pkl') 

['../output/bow.pkl']

In [37]:
joblib.dump(vecs, '../output/vec.pkl') 

['../output/vec.pkl']
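
For completeness, here is a sketch of the return trip: loading a pickled matrix back with `joblib.load()` and checking that it matches what was saved. It uses a toy matrix and a temporary directory, so the file name `toy_bow.pkl` and the variables `toy_bow`, `path`, and `restored` are assumptions of the example rather than anything in `../output/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the TED talks)
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
toy_bow = CountVectorizer().fit_transform(toy_docs)

# Write to a throwaway directory so nothing in ../output/ is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_bow.pkl')
    joblib.dump(toy_bow, path)
    restored = joblib.load(path)

# The restored matrix should have the same shape and the same counts
print(restored.shape, np.array_equal(restored.toarray(), toy_bow.toarray()))
```

In a later notebook, `joblib.load('../output/bow.pkl')` and `joblib.load('../output/vec.pkl')` should bring back the matrix and the fitted vectorizer, provided the installed library versions are compatible with the ones that wrote the files.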

## TF-IDF

Can we use a TF-IDF matrix to glean how words achieve a certain level of significance? (Is this what NMF does?)


**Useful Links**

* [Extracting Important Keywords from Text with TF-IDF](https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb) explains how to train and then test a document to discover its keywords. 


In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
# fit_transform() returns the weighted matrix; fit() alone returns the transformer
tfidf = tfidf_transformer.fit_transform(bow)
tfidf.shape
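
As a first step toward understanding what the transformer does to our counts, here is a sketch on a toy corpus. With `smooth_idf=True`, scikit-learn documents the inverse document frequency as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term; the code recomputes that by hand and compares it with the fitted `idf_` values. All `toy_*` names, `n_docs`, `doc_freq`, and `expected_idf` are invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical, not the TED talks)
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
toy_bow = CountVectorizer().fit_transform(toy_docs)

toy_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
toy_tfidf = toy_transformer.fit_transform(toy_bow)

# Recompute the smoothed IDF by hand
n_docs = toy_bow.shape[0]
# number of documents each term appears in (dense is fine for a toy matrix)
doc_freq = (toy_bow.toarray() > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(toy_transformer.idf_, expected_idf))
print(toy_tfidf.shape)
```

Terms that appear in fewer documents get a larger IDF, and by default each row of the result is scaled to unit length, so the weights are comparable across talks of very different lengths.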