
# Words

As with skinning cats, there is more than one way to get bags of words out of a collection of texts, and that is all we are doing here. We are not worrying about capturing any context more fine-grained than the text itself: no syntax, no sentences, no paragraphs. We may look at parts of speech, but only as a dimension of a word itself.

That noted, there are two questions any effort at tokenization must answer: (1) **tokens** or **lemmas**? and (2) **stopwords**?

## Loading the Data

Working with only the data from the release, we have two files:

    TEDonly_speakers_final.csv
    TEDplus_speakers_final.csv

The first thing we will do is to load both as `pandas` dataframes, but I think we also want to consider creating a merged file -- something like `TEDall.csv`. 

In [4]:
import pandas as pd

df_only = pd.read_csv('../data/Release_v0/TEDonly_speakers_final.csv')
df_plus = pd.read_csv('../data/Release_v0/TEDplus_speakers_final.csv')

print("df_only = ", df_only.shape, "\n" + "df_plus = ", df_plus.shape)

df_only =  (992, 27) 
df_plus =  (755, 27)
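
The merged file mentioned above could be built with `pd.concat`. This is only a minimal sketch on toy frames, assuming both CSVs share the same columns (the matching shapes suggest they do). The `source` column and the `_demo` names are my own assumptions, not part of the release.

```python
import pandas as pd

# Toy stand-ins for df_only and df_plus; the real frames come from the CSVs above.
df_only_demo = pd.DataFrame({"Talk_ID": [1, 2], "text": ["first talk", "second talk"]})
df_plus_demo = pd.DataFrame({"Talk_ID": [3], "text": ["third talk"]})

# Tag each row with its origin so the two sets can be separated again later.
# The "source" column name is an assumption, not something in the release.
df_all_demo = pd.concat(
    [df_only_demo.assign(source="only"), df_plus_demo.assign(source="plus")],
    ignore_index=True,
)

# Writing it out would look something like this:
# df_all_demo.to_csv("TEDall.csv", index=False)
print(df_all_demo.shape)
```

Keeping a `source` column means nothing is lost by merging: filtering on it recovers either original set.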


That's a lot of columns, so we want to double check the column names to get the texts of the talks.

In [5]:
with open('../data/Release_v0/TEDonly_speakers_final.csv') as f:
    colnames = f.readline().strip().split(",")
    
print(colnames)

['', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']


For the sake of this particular experiment, I am going to convert the `text` column of each dataframe into a list.

In [6]:
only = df_only.text.tolist()
plus = df_plus.text.tolist()
print(len(only), len(plus))

992 755


In [None]:
all_talks = only + plus
print(len(all_talks))

For now, as we work through our basic methods, let's work with the `only` texts. The first thing to check is how many talks are empty. 

In [7]:
# enumerate gives the true position; only.index(text) would return the first match only
for i, text in enumerate(only):
    if len(text) < 10:
        print(i)

Now let's see how many short ones there are:

In [8]:
# enumerate gives the true position; only.index(text) would return the first match only
for i, text in enumerate(only):
    if len(text) < 300:
        print(i)

113
235
382
496
573
799
899
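
The same check could also be run on the dataframe column directly, which has the side benefit of coping with missing transcripts (a `NaN` in the list would make `len()` fail). A minimal sketch, assuming a toy Series in place of `df_only.text`:

```python
import pandas as pd

# Toy column standing in for df_only.text; None mimics a missing transcript.
texts = pd.Series(["a" * 500, "short", None, "b" * 299])

# str.len() yields a missing value rather than raising, so treat those as length 0.
lengths = texts.str.len().fillna(0).astype(int)

# Index positions of talks shorter than 300 characters.
short_idx = lengths[lengths < 300].index.tolist()
print(short_idx)
```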


And, just to make sure that the list does not include the column header, we will print a bit of the first item:

In [9]:
print(only[0][0:50])

  Thank you so much, Chris. And it's truly a great


Okay, so no empty texts, but some short ones, and our first text is not the column name, but our first talk. 

## Term Frequencies

Next up is to reduce our texts to bags of words. There are a number of ways to do this, but **scikit-learn**'s `CountVectorizer` is, I think, the way to go, since it will work well with the library's TF-IDF vectorizer, and both of those are foundations for either LDA or NMF topic modeling as well as being useful in and of themselves. We can also store the counts as either their own CSV or as an amendment to our current CSV.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

If we run it unadorned, we get a vocabulary of almost 40,000 distinct words from our not quite a thousand texts. For now, let's let that be: the default options lowercase everything, strip punctuation, and drop single-character tokens. The only thing that might not be welcome is the splitting of contractions.

For this current work, we are running `fit()` and `transform()` separately. `fit()` just calculates the parameters (here, the vocabulary) and saves them as the object's internal state; `transform()` then applies the transformation to a particular set of examples (here, the same ones we just fitted). Because of that, the two operations are usually done in one step as `fit_transform()`.
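
To convince myself that the two routes agree, here is a small sketch on a toy corpus (the `docs` list is made up for illustration). It also shows the one case where keeping the steps separate matters: transforming texts the vectorizer was not fitted on.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

# Two-step version: learn the vocabulary, then count against it.
vec = CountVectorizer().fit(docs)
two_step = vec.transform(docs)

# One-step version on a fresh vectorizer.
one_step = CountVectorizer().fit_transform(docs)

# Both yield the same document-term matrix.
same = bool((two_step.toarray() == one_step.toarray()).all())
print(two_step.shape, same)

# A fitted vectorizer can also count new text against the existing vocabulary;
# words it has never seen ("dog") are simply ignored.
new_counts = vec.transform(["the dog sat"])
print(new_counts.sum())
```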

In [11]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer()

# fit the model to the data 
vecs = vectorizer.fit(only)

# transform the data according to the fitted model
bow = vecs.transform(only)

In [73]:
bow.shape

(992, 39515)

With a complete matrix in place, we can total up our columns for each feature (word). We can then take that vector, with each position describing one word, and convert it into a list of `(word, count)` tuples, which we can then sort by frequency. (I'm doing it this way because it appears to be the way to do it, but it also strikes me that there should be a way to do this within the array itself, or, perhaps, to do it through **pandas**.)

In [12]:
# summing up the counts for each word
sum_words = bow.sum(axis=0)

# create a list of (word, count) tuples
words_freq = [(word, sum_words[0, idx]) for word, idx in vecs.vocabulary_.items()]

# sort the list by count, highest first
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

# check the results of our work by printing the 20 most frequent words
print(words_freq[0:20])

[('the', 93853), ('and', 67710), ('to', 57089), ('of', 52313), ('that', 44087), ('it', 35339), ('in', 34728), ('you', 34162), ('we', 30407), ('is', 28569), ('this', 22640), ('so', 16733), ('was', 14124), ('they', 13963), ('for', 13606), ('but', 12527), ('have', 12298), ('are', 12245), ('on', 12118), ('what', 12019)]


Let's save these results to a CSV file and check to see that the words we see are the kinds of words we want. We don't want any weirdness affecting our overall results.

In [None]:
import csv

# newline='' keeps the csv module from writing blank rows on Windows
with open('word_freq.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['name','num'])
    for row in words_freq:
        csv_out.writerow(row)

The hand inspection looks good. I didn't see anything among the words with a frequency of 4 or above that looked off. (So, the simplest solution works!)

What I did note was the frequency of certain **numbers**: **100**, **12**, etc. This might be worth a closer look: are there *power numbers*? (I am thinking here of Alan Dundes' essay on the "power" of three in American culture.)

We can also do some one-off inspections: I would also like to develop code where we could grab all words of a certain frequency, or range of frequencies. (In part, this is a note to myself to understand better the nature of `vecs` and `bow` so as to be able to interact with them more effectively.)
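
A first pass at grabbing words by frequency might look like the sketch below. It works on a list of `(word, count)` tuples shaped like `words_freq`; the helper name `words_in_range` and the toy data are hypothetical.

```python
# Toy stand-in for the words_freq list of (word, count) tuples.
words_freq_demo = [("the", 93853), ("algorithm", 63), ("teeth", 63), ("aakash", 2)]

def words_in_range(pairs, low, high=None):
    """Return words whose count falls in [low, high]; high defaults to low."""
    high = low if high is None else high
    return [word for word, count in pairs if low <= count <= high]

exact = words_in_range(words_freq_demo, 63)        # a single frequency
ranged = words_in_range(words_freq_demo, 2, 100)   # an inclusive range
print(exact)
print(ranged)
```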

**Follow-up**: whenever I attempt some version of
```python
for item in vecs:
    if vecs.vocabulary_.get(item) == 1691:
        print(item)
```
I get **`TypeError: 'CountVectorizer' object is not iterable`**. My best guess, for now, is that we need to use the tuple above to get this information. 

In [45]:
vecs.vocabulary_.get('algorithm') # Wait! This doesn't get the count but the index?!

1691

In [46]:
# get_feature_names() was removed in scikit-learn 1.2; there, use get_feature_names_out()
vecs.get_feature_names()[1690:1695]

['algeria', 'algorithm', 'algorithmic', 'algorithmically', 'algorithms']
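
So `vocabulary_` is a plain dictionary mapping each term to its column index in the matrix, which is why looking up a word returns a position rather than a count. If we want to go the other way, from index to term, one option is to invert the dictionary. A minimal sketch on a toy corpus (the names are mine):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer().fit(docs)

# vocabulary_ maps term -> column index; flip it to get index -> term.
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

# Columns are assigned in alphabetical order of the terms.
print(index_to_term[0], vec.vocabulary_["the"])
```

Iterating over `vec.vocabulary_.items()` is also the working version of the loop that raised the `TypeError`: the dictionary is iterable even though the vectorizer is not.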

The list of features begins with numbers, and there are a lot of them. (We don't, as yet, know how often or where they occur.) The range below isolates where the list transitions from numbers to words, in case anyone was wondering. The different spellings of "aaa" are interesting, but I am most curious about "aakash."

In [68]:
# on scikit-learn >= 1.2, use vecs.get_feature_names_out().tolist() instead
feature_names = vecs.get_feature_names()
print(feature_names[760:800])

['94', '95', '950', '952', '956', '96', '962', '97', '974', '98', '982', '984', '987', '99', '999', '999999', '9mm', '9th', '______', '________', 'a10', 'a12', 'a380', 'a4', 'a92', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaah', 'aag', 'aah', 'aakash', 'aargh', 'aaron', 'aaronson', 'aarp', 'aat', 'aatcagggaccc', 'ab']


Okay, the solution to this problem appears to be to return to the list of tuples and, perhaps, convert it to a dictionary. Or at least that's one solution:

In [47]:
wf_dict = dict(words_freq)
wf_dict['algorithm']

63

In [53]:
for key in wf_dict:
    if wf_dict[key] == 63:
        print(key, end = ", ")

length, admit, automatically, imagined, amounts, presentation, miss, axis, shit, recording, grows, require, depends, co, suit, witness, rid, parent, actions, relate, furniture, wanting, teeth, practical, terrorism, classes, battle, chicken, algorithm, 

Oh, interesting weirdness: "co" has apparently been broken off from the words it prefixes (as in "co-founder"), presumably because the default tokenizer splits on hyphens. Oops.
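
If the stray "co" turns out to matter, one thing to try is a custom `token_pattern` that allows hyphens inside a token. The sketch below compares the default behavior with such a pattern on a toy corpus. The regular expression is my own assumption, not a scikit-learn recommendation, and it would also glue together words joined by double hyphens, so it needs checking against the real transcripts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["our co-founder built a co-op", "the founder left"]

# Default pattern: runs of two or more word characters, so "-" splits tokens.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Assumed alternative: allow hyphens between the first and last word characters.
hyphen_vec = CountVectorizer(token_pattern=r"(?u)\b\w[\w-]*\w\b")
hyphen_vocab = sorted(hyphen_vec.fit(docs).vocabulary_)

print(default_vocab)
print(hyphen_vocab)
```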

In [70]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is its own package
import joblib

In [71]:
# Output a pickle file for the bag-of-words matrix
joblib.dump(bow, 'saved_bow_only.pkl') 

['saved_bow_only.pkl']

In [72]:
joblib.dump(vecs, 'saved_vecs_only.pkl') 

['saved_vecs_only.pkl']

## TF-IDF

Can we use a TF-IDF matrix to glean how words achieve a certain level of significance? (Is this what NMF does?)


**Useful Links**

* [Extracting Important Keywords from Text with TF-IDF](https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb) explains how to train and then test a document to discover its keywords. 


In [23]:
from sklearn.feature_extraction.text import TfidfTransformer

In [55]:
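tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
# fit() alone returns the fitted transformer, not a matrix;
# fit_transform() returns the TF-IDF weighted matrix, which does have a shape
tfidf = tfidf_transformer.fit_transform(bow)
# tfidf.shape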