# Words

As with skinning cats, there is more than one way to get bags of words out of a collection of texts. And that's all we are doing here. We are not worrying about capturing any context more fine-grained than the text itself. So, no syntax, no sentences, no paragraphs. We may look at parts of speech, but only as a dimension of a word itself.

That noted, there are two questions any effort at tokenization must answer: (1) **tokens** or **lemmas**? and (2) **stopwords**?

## Loading the Data

Working with only the data from the release, we have two files:

    TEDonly_speakers_final.csv
    TEDplus_speakers_final.csv

The first thing we will do is to load both as `pandas` dataframes, but I think we also want to consider creating a merged file -- something like `TEDall.csv`. 

In [1]:
import pandas as pd

df_only = pd.read_csv('../data/Release_v0/TEDonly_speakers_final.csv')
df_plus = pd.read_csv('../data/Release_v0/TEDplus_speakers_final.csv')

print("df_only = ", df_only.shape, "\n" + "df_plus = ", df_plus.shape)

df_only =  (992, 27) 
df_plus =  (755, 27)


That's a lot of columns, so we want to double check the column names to get the texts of the talks.

In [2]:
with open('../data/Release_v0/TEDonly_speakers_final.csv') as f:
    colnames = f.readline().strip().split(",")
    
print(colnames)

['', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']


For the sake of this particular experiment, I am going to convert the `text` column of each dataframe into a list.

In [3]:
only = df_only.text.tolist()
plus = df_plus.text.tolist()
print(len(only), len(plus))

992 755


In [4]:
all_talks = only + plus
print(len(all_talks))

1747
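The same concatenation could be done at the dataframe level, which is one way to get the merged file mentioned earlier. This is only a sketch: the toy frames stand in for the two release files, and the `source` column is a hypothetical addition so that the two sets can still be told apart after merging.

```python
import pandas as pd

# Toy stand-ins for the two release files (hypothetical data, same columns)
df_only = pd.DataFrame({"Talk_ID": [1, 2], "text": ["first talk", "second talk"]})
df_plus = pd.DataFrame({"Talk_ID": [3], "text": ["third talk"]})

# Tag each row with its origin before merging
df_only["source"] = "only"
df_plus["source"] = "plus"

# Stack the two frames and renumber the index from 0
df_all = pd.concat([df_only, df_plus], ignore_index=True)
print(df_all.shape)

# df_all.to_csv("TEDall.csv", index=False)  # uncomment to write the merged file
```

With the real dataframes, the result should have 1747 rows, matching the length of `all_talks`.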


For now, as we work through our basic methods, let's work with the `only` texts. The first thing to check is how many talks are empty. 

In [5]:
# enumerate() gives the true position of each text, even if two texts are identical
for i, text in enumerate(only):
    if len(text) < 10:
        print(i)

Now let's see how many short ones there are:

In [6]:
# enumerate() gives the true position of each text, even if two texts are identical
for i, text in enumerate(only):
    if len(text) < 300:
        print(i)

113
235
382
496
573
799
899


And, just to make sure that the list does not include the column header, we will print a bit of the first item:

In [7]:
print(only[0][0:50])

  Thank you so much, Chris. And it's truly a great


Okay, so no empty texts, but some short ones, and our first text is not the column name, but our first talk. 
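The same length checks could also be run on the dataframe column directly, which has the advantage of coping with missing values (a missing text would make `len()` fail in the loops above). A minimal sketch on a made-up series, with the thresholds taken from the cells above:

```python
import pandas as pd

# Hypothetical stand-in for df_only.text: empty, short, long, and missing entries
texts = pd.Series(["", "short one", "x" * 400, None])

# Count missing values, then treat them as empty strings for the length check
missing = int(texts.isna().sum())
lengths = texts.fillna("").str.len()

# Positions of texts under each threshold
empty_idx = lengths[lengths < 10].index.tolist()
short_idx = lengths[lengths < 300].index.tolist()
print(missing, empty_idx, short_idx)
```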

## Term Frequencies

Next up is to reduce our texts to bags of words. There are a number of ways to do this, but **SciKit-Learn**'s `CountVectorizer` is, I think, the way to go, since it will work well with the Kit's TF-IDF vectorizer, and both of those are foundations for either LDA or NMF topic modeling as well as being useful in and of themselves. We can also store the counts as either their own CSV or as an amendment to our current CSV.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

If we run it unadorned, we get a vocabulary of almost 40,000 distinct words from our not quite a thousand texts. For now, let's let that be: the default options are to lowercase everything, get rid of all punctuation, and drop single-character tokens. The only thing that might not be welcome is the splitting of contractions.
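To see what the defaults do to contractions, here is a small sketch that applies the default `token_pattern` documented for `CountVectorizer` by hand, using only the standard library. The sample sentence is made up, and the real vectorizer also does some preprocessing not reproduced here.

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more word characters, so punctuation acts as a separator
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample = "It's truly a GREAT day, isn't it?"

# Lowercase first (the vectorizer's default), then extract tokens
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)
```

The apostrophe splits `isn't` into `isn` and `t`, and the one-letter pieces (`s`, `t`, `a`) are then dropped because the pattern requires at least two characters.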

For this current work, we are running `fit()` and `transform()` separately. `fit()` just calculates the parameters (here, the vocabulary) and saves them as the object's internal state, while `transform()` applies the transformation to a particular set of examples (here, the ones we just fitted). Because of that, the two operations are usually done in one step with `fit_transform()`.

In [9]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer()

# fit the model to the data 
vecs = vectorizer.fit(only)

# transform the data according to the fitted model
bow = vecs.transform(only)
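For comparison, a sketch of the one-step form on a toy corpus (the two sentences are made up), checking that it yields the same matrix as the two-step version:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat down"]  # toy corpus

# Two steps: learn the vocabulary, then count
two_step = CountVectorizer()
two_step.fit(docs)
bow_a = two_step.transform(docs)

# One step: learn the vocabulary and count in a single call
bow_b = CountVectorizer().fit_transform(docs)

# Densify only because the toy matrices are tiny
same = bool((bow_a.toarray() == bow_b.toarray()).all())
print(bow_a.shape, same)
```

Keeping the steps separate matters mainly when the fitted vocabulary will be reused on texts other than the ones it was fitted on.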

With a complete matrix in place, we can total up the column for each feature (word). That gives us a single row vector in which each position holds the total count for one word. We can then pair each word with its total in a list of `(word, count)` tuples, which we can sort by frequency.

In [11]:
# summing up the counts for each word
sum_words = bow.sum(axis=0)

# create a list of (word, count) tuples
words_freq = [(word, sum_words[0, idx]) for word, idx in vecs.vocabulary_.items()]

# sort the list by count, highest first
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

# check the results of our work by printing the 20 most frequent words
print(words_freq[0:20])

[('the', 93853), ('and', 67710), ('to', 57089), ('of', 52313), ('that', 44087), ('it', 35339), ('in', 34728), ('you', 34162), ('we', 30407), ('is', 28569), ('this', 22640), ('so', 16733), ('was', 14124), ('they', 13963), ('for', 13606), ('but', 12527), ('have', 12298), ('are', 12245), ('on', 12118), ('what', 12019)]
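The top of the list is all function words, which brings back the stopword question raised at the outset. As a hedged sketch of one option, `CountVectorizer` accepts `stop_words="english"` to apply scikit-learn's built-in English list. The toy corpus is made up, and whether that particular list suits these talks is an open question.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat and the hat", "the hat is on the cat"]  # toy corpus

# Default settings keep every token of two or more characters
plain = CountVectorizer().fit(docs)

# stop_words="english" filters tokens found in the built-in English stop list
filtered = CountVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```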


In [None]:
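import csv

# newline='' lets the csv module control line endings itself
with open('word_freq.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['name','num'])
    # words_freq is the sorted list of (word, count) tuples built above
    for row in words_freq:
        csv_out.writerow(row)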

In [21]:
vectorizer.vocabulary_.get('algorithm')

1691

## TF-IDF

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer

In [23]:
# bow is the count matrix built above; use_idf=False gives normalized term frequencies only
tf_transformer = TfidfTransformer(use_idf=False).fit(bow)
tf = tf_transformer.transform(bow)
tf.shape

(992, 39515)

In [26]:
tf

<992x39515 sparse matrix of type '<class 'numpy.float64'>'
	with 584565 stored elements in Compressed Sparse Row format>

In [25]:
tf.vocabulary_get('algorithm')

AttributeError: vocabulary_get not found
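The error makes sense: `tf` is a plain sparse matrix and carries no vocabulary. The word-to-column mapping lives on the fitted `CountVectorizer`, and the columns of `tf` are in the same order as the columns of the count matrix. A minimal sketch on a toy corpus (the two sentences are made up) of looking up a word's column and pulling its values out of the term-frequency matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the algorithm works", "the algorithm is the best algorithm"]  # toy corpus

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)
tf = TfidfTransformer(use_idf=False).fit_transform(bow)

# The vocabulary belongs to the vectorizer, not to the matrix
col = vectorizer.vocabulary_.get("algorithm")

# One value per document: the length-normalized frequency of the word
column = tf[:, col].toarray().ravel()
print(col, column)
```

With the default `norm='l2'`, each row is divided by its Euclidean length, so the values are relative weights within a talk rather than raw counts.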