# Coming to Terms

With the TEDtalks-all dataset created, we have xxxx talks with which to work. This is a small corpus, so the usual reasons for shrinking the feature set for the texts do not apply. Still, as we begin our survey of the contents of the TED talks, we want to be mindful of the standards that have emerged, both so that our results will be comparable to the work of others and so that we can scale up the work here without having to rethink the foundations.

## Imports and Data

In [1]:
# Imports
import csv
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

**To Do**: Edit the CSV to remove the vestigial index column at the start of each line. Then use `df.set_index('Talk_ID')`.
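
One way to handle this To Do without editing the file by hand is to let **pandas** absorb the leftover column on load. This is a minimal sketch under assumptions: the CSV below is made up for illustration, and it assumes the vestigial index is the first, unnamed column and that `Talk_ID` values are unique.

```python
import io
import pandas as pd

# Made-up CSV with a leftover, unnamed index column (illustration only)
raw = io.StringIO(
    ",Talk_ID,public_url,text\n"
    "0,1,https://www.ted.com/talks/example_one,first text\n"
    "1,7,https://www.ted.com/talks/example_two,second text\n"
)

# index_col=0 reads the leftover column as the index;
# set_index then replaces that index with Talk_ID
df_demo = pd.read_csv(raw, index_col=0).set_index("Talk_ID")

print(df_demo.shape)
print(df_demo.index.tolist())
```

With the real file, the same two calls would be applied to `'../output/TEDall_speakers.csv'` in place of the in-memory buffer.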

## Frequencies

The sole purpose of this notebook is to establish how we are going to elicit our features, our words, from the collection of talks. Thus, the only column we are interested in is the one with the texts of the talks. As we move forward, however, we will want to decide whether we are simply going to append ~30,000 columns to a version of the extant CSV or create a separate CSV for each experiment.

For this first experiment, we will keep it simple, creating two lists, one of the URLs and one of the texts. The URLs are unique, human-friendly identifiers for the talks. (We can, perhaps, make them a bit more friendly by stripping the `https://www.ted.com/talks/` prefix from each.)

In [3]:
urls  = df.public_url.tolist()
texts = df.text.tolist()
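
The friendlier identifiers mentioned above could be produced along these lines. This is only a sketch: `PREFIX` and `short_name` are hypothetical names, and the sample URLs are made up rather than taken from the dataset.

```python
# Hypothetical helper: strip the common TED prefix to get a short slug
PREFIX = "https://www.ted.com/talks/"

def short_name(url, prefix=PREFIX):
    """Return the URL without the prefix; leave other strings untouched."""
    return url[len(prefix):] if url.startswith(prefix) else url

# Made-up URLs (illustration only)
sample_urls = [
    "https://www.ted.com/talks/example_speaker_example_title",
    "https://example.org/not_a_ted_url",
]

slugs = [short_name(u) for u in sample_urls]
print(slugs)
```

In the notebook, this would amount to `urls = [short_name(u) for u in df.public_url.tolist()]`.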

There are a number of ways to get term frequencies, but **scikit-learn**'s `CountVectorizer` is, I think, the way to go, since it will work well with the other vectorizers and models also available in `sklearn`.

In our first experiment, we run `CountVectorizer` unadorned. The default options are: lowercase everything, treat punctuation as a token separator (so it is discarded), and make a word out of any run of two or more alphanumeric characters. The only thing that might not be welcome is the splitting of contractions. For now, we will leave things as they are. (Also, please note, no stopwords were used, so we have an unfiltered word list.)
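
To see what those defaults do to a piece of text, we can ask the vectorizer for its analyzer and run it on a single string. The sentence below is made up for illustration; `build_analyzer()` is the `CountVectorizer` method that returns the function applied to each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that CountVectorizer applies to every document
analyze = CountVectorizer().build_analyzer()

# Made-up sentence (illustration only)
sample = "Don't stop: it's a 100% A-OK idea!"

tokens = analyze(sample)
print(tokens)
```

The contractions lose their tails ("don't" becomes "don"), one-character tokens such as "a" disappear, and numbers such as "100" are kept as words.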

For this current work, we are running `fit()` and `transform()` separately. `fit()` learns the parameters from the data (for `CountVectorizer`, the vocabulary) and saves them as part of the object's internal state; `transform()` then applies that fitted state to a particular set of examples, here the same texts we just fitted. Because the two steps usually run on the same data, they are usually done in a single call, `fit_transform()`.

In [15]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer()

# fit the model to the data 
vecs = vectorizer.fit(texts)

# transform the data according to the fitted model
bow = vecs.transform(texts)

# see how many features we have
bow.shape

(1747, 50379)
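
As a quick check on the claim that the two-step and one-step routes are equivalent, here is a small sketch on a made-up mini-corpus (the three documents are for illustration only, not from the talks).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini-corpus (illustration only)
docs = ["the cat sat", "the cat sat on the mat", "a dog barked"]

# Two steps: learn the vocabulary, then count against it
two_step = CountVectorizer().fit(docs).transform(docs)

# One step: learn the vocabulary and count in a single call
one_step = CountVectorizer().fit_transform(docs)

# Compare the two sparse matrices: no differing entries means identical
same = (two_step != one_step).nnz == 0
print(two_step.shape, one_step.shape, same)
```

Keeping `fit()` separate matters mainly when the fitted vocabulary will later be applied to texts it was not fitted on; in that case `transform()` simply ignores words it has not seen.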

## Frequency Totals per Word

We can total up our columns for each feature (word), which is something we will be doing per year, per gender, per discipline. Here, we take the vector describing a word and sum it. We then pair the sum with the word in a tuple, which we then sort by frequency. 

(I'm doing it this way because it appears to be the way to do it, but it also strikes me that there should be a way to do this within the array itself, or, perhaps, to do it through **pandas**.)

We save the results to a CSV file so that we can hand-check the words: are these the results we expected? (We don't want any weirdness affecting our overall results.) The hand inspection looks good. I didn't see anything among the words with a frequency of 4 or above that looked off. (So, the simplest solution works!) What I did note was the frequency of certain **numbers**: **100**, **12**, etc. This might be worth a closer look: are there *power numbers*? (I am thinking here of Alan Dundes' essay on the "power" of three in American culture.)

**To Do**: It would be nice to be able to grab all words of a certain frequency, or range of frequencies.
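
A possible approach to this To Do, once a list of (word, count) tuples like `words_freq` exists, is a simple filter. This is a sketch only: `words_in_range` is a hypothetical helper and the counts below are made up.

```python
def words_in_range(words_freq, low, high=None):
    """Return (word, count) pairs with low <= count <= high.

    If high is None, only the exact count `low` is matched.
    """
    if high is None:
        high = low
    return [(w, c) for w, c in words_freq if low <= c <= high]

# Made-up counts (illustration only)
sample_freq = [("the", 12), ("talk", 5), ("idea", 5), ("dundes", 1)]

exactly_five = words_in_range(sample_freq, 5)
two_to_twelve = words_in_range(sample_freq, 2, 12)
print(exactly_five)
print(two_to_twelve)
```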

---
**Follow-up**: whenever I attempt some version of
```python
for item in vecs:
    if vecs.vocabulary_.get(item) == 1691:
        print(item)
```
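I get **`TypeError: 'CountVectorizer' object is not iterable`**. The vectorizer itself cannot be looped over, and its `vocabulary_` attribute is a dictionary mapping each word to its column index in the matrix, not to its count, so comparing it against a frequency would not work in any case. For now, we need to use the `words_freq` list of tuples to get this information.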
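
A working version of that lookup might look like the sketch below, which iterates over `vocabulary_.items()` and checks each word's column total. The mini-corpus and the `target` value are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini-corpus (illustration only)
docs = ["ideas worth spreading", "spreading ideas", "ideas"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# vocabulary_ is a dict: word -> column index (NOT word -> count)
print(vectorizer.vocabulary_)

# Total count per column, flattened to a 1-D array
totals = np.asarray(bow.sum(axis=0)).ravel()

# Words whose total count equals the target value
target = 2
matches = [w for w, idx in vectorizer.vocabulary_.items() if totals[idx] == target]
print(matches)
```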

---


In [16]:
# summing up the counts for each word
sum_words = bow.sum(axis=0)

# pair each word with its total count, as a list of (word, count) tuples
words_freq = [(word, sum_words[0, idx]) for word, idx in vecs.vocabulary_.items()]

# sort the list by count, highest first
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# check the results of our work by printing the 20 most frequent words
print(words_freq[0:20])

[('the', 166093), ('and', 118989), ('to', 102276), ('of', 92416), ('that', 76268), ('in', 62673), ('it', 59191), ('you', 56296), ('we', 54458), ('is', 50072), ('this', 38510), ('so', 29001), ('they', 25157), ('was', 24582), ('for', 24445), ('are', 22592), ('have', 21965), ('but', 21804), ('on', 20978), ('what', 20907)]
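
On the question of whether this could be done through **pandas**: one option is to put the column totals into a `Series` indexed by word. This is a sketch on a made-up mini-corpus, not a replacement for the cell above; the word order is recovered from `vocabulary_` so that it lines up with the matrix columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini-corpus (illustration only)
docs = ["ideas worth spreading", "spreading ideas", "ideas"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# Words ordered by their column index, so they line up with the matrix
vocab = vectorizer.vocabulary_
words = sorted(vocab, key=vocab.get)

# Column sums as a flat array, wrapped in a Series indexed by word
totals = np.asarray(bow.sum(axis=0)).ravel()
freq = pd.Series(totals, index=words, name="count").sort_values(ascending=False)
print(freq.head())
```

A `Series` also makes selection by frequency straightforward, for example `freq[freq.between(2, 3)]`.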


In [19]:
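# newline='' is what the csv module expects; it avoids blank rows on Windows
with open('../output/word_freq.csv', 'w', newline='', encoding='utf-8') as out: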
    csv_out = csv.writer(out)
    csv_out.writerow(['word','count'])
    for row in words_freq:
        csv_out.writerow(row)