***This notebook needs to be revised.***

## Summary

In this notebook we load the complete corpus of TED-curated talks and focus on some of the oddities in the texts that result from their being transcribed.

## Imports and Data

In [1]:
# Imports
import csv
import re
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

In [2]:
texts = df.text.tolist()

In [5]:
# If you want to pass options, pass them here:
vec = CountVectorizer( lowercase = True )

# fit the model to the data 
X = vec.fit_transform(texts)

# see how many features we have
X.shape

(1747, 50379)

The 50,379 distinct terms are our raw, unfiltered baseline with no pre-processing. As we will see, this vocabulary includes a number of artifacts of the TED transcription process: a variety of ways to spell out *ah*; numerals used for everything from dates to counts; and events that take place outside the talk itself, such as audience laughter, which we here term *parentheticals*. Accounting for all these artifacts requires closer inspection, which takes place in a separate notebook: [Term Artifacts](Terms-02).
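
As a rough illustration of two of these artifact types, the sketch below pulls parentheticals and numerals out of a made-up transcript fragment with plain regular expressions. The sample string and both patterns are assumptions for illustration only; the actual handling is worked out in the Term Artifacts notebook.

```python
import re

# Hypothetical transcript fragment; not taken from the corpus.
sample = "In 1998 we had 3,000 users. (Laughter) It grew to 12 million. (Applause)"

# Parentheticals: capitalized notes in parentheses, e.g. (Laughter).
parentheticals = re.findall(r"\(([A-Z][^()]*)\)", sample)

# Numerals: runs of digits, optionally grouped with commas.
numerals = re.findall(r"\d+(?:,\d{3})*", sample)

print(parentheticals)
print(numerals)
```

Note that the default vectorizer would split `3,000` at the comma, which is one reason a token such as `000` shows up so often in the vocabulary.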

## Frequency Totals per Word

In [5]:
# Note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there.
df_base = pd.DataFrame(X.toarray(), columns = vec.get_feature_names())

sums_base = df_base.sum()

A hand inspection of these totals turned up two issues: parentheticals are mixed in with the text of the talks (see 01-Terms-02-Parentheticals), and numbers feature in the talks regularly (see 01-Terms-03-Numbers).

Crowd-sourced transcription also yields some interesting spelling variation among the tokens:

| Term    | Freq |
|:---------|---:|
|aa       |12 |
|aaa      |7  |
|aaaa     |2  |
|aaaaa    |1  |
|aaaaaaaah|1  |
|aaaaaaah |1  |
|aaaaaah  |3  |
|aaaah    |2  |
|aaaahhh  |1  |
|aaah     |3  |
|aag      |1  |
|aah      |10 |
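
One possible way to fold such variants together is to shrink every run of a repeated letter to a single letter. The sketch below applies that idea to the counts from the table above; the helper `collapse_runs` is a hypothetical name, and this is not a step the notebook itself performs.

```python
import re
from collections import Counter

# Counts copied from the table above.
variants = {
    "aa": 12, "aaa": 7, "aaaa": 2, "aaaaa": 1,
    "aaaaaaaah": 1, "aaaaaaah": 1, "aaaaaah": 3, "aaaah": 2,
    "aaaahhh": 1, "aaah": 3, "aag": 1, "aah": 10,
}

def collapse_runs(term):
    """Shrink every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", term)

# Add up the frequencies of all spellings that collapse to the same form.
collapsed = Counter()
for term, freq in variants.items():
    collapsed[collapse_runs(term)] += freq

print(collapsed)
```

Applied to the whole vocabulary this would damage ordinary words (*good* would become *god*), so it only makes sense on a hand-picked list of interjections.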

There are an equal number of alternate spellings for *shh*. There's also this oddity:

| Term    | Freq |
|:---------|---:|
|FALSE    |115|
|TRUE     |909|

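These appear to be simply the occurrences of *true* and *false*, somehow rendered as Booleans. Because the vectorizer lowercases every token (`lowercase = True`), the uppercase forms probably come from whatever tool displayed the saved counts, a spreadsheet for instance, rather than from the vectorizer itself.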

The rest of the hand examination turns up no other issues, so the basic vectorization built into `sklearn` appears to be satisfactory. The one exception is that it breaks contractions at the apostrophe: *isn't* becomes *isn* and *t*, and the latter is thrown away because the default token pattern keeps only tokens of two or more characters. For the same reason, the indefinite article *a* is not present in the frequencies.
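
The contraction behavior can be reproduced with `re` alone, since `CountVectorizer` applies its `token_pattern` as a regular expression. The first pattern below is scikit-learn's documented default; the second, `KEEP_CONTRACTIONS`, is a hypothetical alternative shown only as a sketch, and it handles straight apostrophes only.

```python
import re

# scikit-learn's documented default for CountVectorizer(token_pattern=...).
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative that keeps one-letter words and contractions.
KEEP_CONTRACTIONS = r"(?u)\b\w+(?:'\w+)?\b"

sentence = "this isn't a test"

default_tokens = re.findall(DEFAULT_PATTERN, sentence)
alt_tokens = re.findall(KEEP_CONTRACTIONS, sentence)

print(default_tokens)  # ['this', 'isn', 'test']
print(alt_tokens)      # ['this', "isn't", 'a', 'test']
```

If the alternative were wanted, it could be passed to the vectorizer as `CountVectorizer(token_pattern=KEEP_CONTRACTIONS)`.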

In the section that follows, we take the array holding all the talks and their terms and transpose it so that the terms are on the rows. We then sum each row, turn the sum column into a series, sort it, and save it to CSV for hand inspection. The hand inspection revealed that, beyond the usual stopwords, enough of the top 300 terms had semantic possibilities that they should be kept. The best bet moving forward is therefore to use an available, widely used stopword list. We chose the one provided by NLTK.

In [15]:
term_matrix = pd.DataFrame(X.todense(), columns=vec.get_feature_names())
tfdf = term_matrix.transpose()
tfdf.shape

(50379, 1747)

In [17]:
tfdf.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1738,1739,1740,1741,1742,1743,1744,1745,1746,sum
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,15
000,0,1,4,0,1,4,0,0,0,2,...,0,0,0,1,3,1,0,0,1,2098
000000004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
0000001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
000001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
000042,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
0001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
00046,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
000th,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6


In [25]:
# Sum the counts for each term into a column titled "sum":
tfdf['sum'] = tfdf.sum(axis=1)

# Isolate the sums in a series
sums = pd.Series(tfdf['sum'])
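
The dense frame built above holds 1747 × 50379 cells. As a sketch of a lighter alternative, the same per-term totals can be read straight off the sparse matrix. The toy texts and the `toy_` names below are stand-ins, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the `texts` list used above.
toy_texts = ["the talk was good", "the audience liked the talk"]

toy_vec = CountVectorizer(lowercase=True)
toy_X = toy_vec.fit_transform(toy_texts)

# Summing over axis 0 of the sparse matrix gives one total per term,
# without building the dense documents-by-terms array.
totals = np.asarray(toy_X.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index; sorting the terms by
# that index recovers the column order.
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

toy_sums = pd.Series(totals, index=terms).sort_values(ascending=False)
print(toy_sums)
```

Reading the terms from `vocabulary_` also sidesteps the difference between `get_feature_names()` and `get_feature_names_out()` across scikit-learn versions.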

In [26]:
sums.sort_values(ascending=False, inplace=True)

In [27]:
sums.to_csv('../output/tf_main.csv')
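
Following on from the decision to use the NLTK stopword list, here is a minimal sketch of how such a list might be applied to the term totals. The totals and the three-word stop set are made up for illustration; the comment shows where the NLTK list would come from, assuming its `stopwords` corpus has been downloaded.

```python
import pandas as pd

# Hypothetical totals standing in for the `sums` series saved above.
toy_sums = pd.Series({"the": 100, "and": 80, "people": 40, "of": 35, "world": 20})

# Tiny stand-in list. With NLTK installed and its "stopwords" corpus
# downloaded, the full list would come from:
#     from nltk.corpus import stopwords
#     stop = set(stopwords.words("english"))
stop = {"the", "and", "of"}

# Keep only the terms that are not in the stopword set.
filtered = toy_sums[~toy_sums.index.isin(stop)]
print(filtered)
```

The same list could instead be handed to `CountVectorizer` through its `stop_words` parameter, which removes the terms before counting rather than after.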