# Data cleaning: TF-IDF

## 0. Overview

Consider the two one-sentence documents below:

1. This is the first sentence.
2. Here is another sentence.

While written language is straightforward for a human reader to evaluate, an automated system needs to translate it into a more machine-friendly format before we can use data science techniques to grade the essays. A simple method is to count the occurrences of each word in each document, an approach called a counting vectorizer. In that case, the two documents above would become:

| |another|first|here|is|sentence|the|this|
|---|---|---|---|---|---|---|---|
|**This is the first sentence.**|0.000|1.000|0.000|1.000|1.000|1.000|1.000|
|**Here is another sentence.**|1.000|0.000|1.000|1.000|1.000|0.000|0.000|
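
To make the counting step concrete, here is a minimal standard-library sketch that rebuilds the table above. The helper name `count_vectorize` is hypothetical, and the tokenization rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to mimic the default behavior of scikit-learn's `CountVectorizer`; it is an illustration, not the vectorizer's actual implementation.

```python
import re
from collections import Counter


def count_vectorize(documents):
    """Return (vocabulary, rows); rows[d][k] counts vocabulary[k] in documents[d]."""
    # Assumed tokenization: lowercase, then keep runs of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


documents = ["This is the first sentence.", "Here is another sentence."]
vocabulary, rows = count_vectorize(documents)
print(vocabulary)
for row in rows:
    print(row)
```

Each row has one entry per vocabulary word, so documents of different lengths all map to vectors of the same size.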

However, the algorithm we have chosen is called [term frequency-inverse document frequency](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), or **tf-idf** ("term frequency" because it emphasizes words that appear often in a particular essay, and "inverse document frequency" because it penalizes words that appear in many of the documents). Essentially, this method takes a piece of text as input (in our case an essay) and *vectorizes* it, outputting a vector of numbers corresponding to that document.

To create this output vector, a **tf-idf** algorithm counts the number of times a word appears in a document and scales it according to the formula

$$
w_{j,i} = tf_{j,i} \left(\log\frac{1 + N}{1 + df_i} + 1\right),
$$

where $tf_{j,i}$ is the number of times term $i$ occurs in essay $j$, $N$ is the total number of essays, $df_i$ is the number of essays containing term $i$, $\log$ is the natural logarithm, and $w_{j,i}$ is the tf-idf weight of term $i$ in essay $j$. The particular tf-idf vectorizer we chose also normalizes each essay's tf-idf vector by its 2-norm to make it a unit vector (that is, $\hat{w}_{j,:} = w_{j,:} / \|w_{j,:}\|_2$). The same two sentences run through the tf-idf vectorizer look like:

| |another|first|here|is|sentence|the|this|
|---|---|---|---|---|---|---|---|
|**This is the first sentence.**|0.000|0.499|0.000|0.355|0.355|0.499|0.499|
|**Here is another sentence.**|0.576|0.000|0.576|0.410|0.410|0.000|0.000|

Notably, the words "sentence" and "is" receive lower weights than the other words in each document, since they appear in both documents. Creating these vectors for every essay and stacking them on top of one another produces a **tf-idf matrix**, a straightforward way of letting a computer handle text. Conveniently, taking any column from this matrix gives the (weighted) number of times a given word appears in each essay. Taking every column into account, we can now build a model that scores an essay based on the *word-content* of that essay.
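
As a check on the formula, the sketch below recomputes the tf-idf table by hand from the raw counts, using only the standard library. The function name `tfidf` is hypothetical, and the code assumes the natural logarithm and 2-norm normalization described above; it is meant as an illustration of the arithmetic rather than a replacement for the library vectorizer.

```python
import math

# Raw counts from the counting-vectorizer table (rows: documents, columns: words).
vocabulary = ["another", "first", "here", "is", "sentence", "the", "this"]
counts = [
    [0, 1, 0, 1, 1, 1, 1],  # "This is the first sentence."
    [1, 0, 1, 1, 1, 0, 0],  # "Here is another sentence."
]


def tfidf(counts):
    """Apply w = tf * (ln((1 + N) / (1 + df)) + 1), then scale each row to unit 2-norm."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[i]: number of documents in which term i occurs at least once.
    df = [sum(1 for row in counts if row[i] > 0) for i in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    normalized = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(w * w for w in weighted))
        # Leave an all-zero row unchanged instead of dividing by zero.
        normalized.append([w / norm for w in weighted] if norm > 0 else weighted)
    return normalized


weights = tfidf(counts)
for row in weights:
    print(" ".join("%.3f" % w for w in row))

# One column of the matrix: the weight of "is" in each document.
is_column = [row[vocabulary.index("is")] for row in weights]
print(["%.3f" % w for w in is_column])
```

Because "is" and "sentence" occur in both documents, their inverse document frequency factor is exactly 1, while words found in only one document get a factor of about 1.405. That difference is why the shared words end up with the smaller weights in each row.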

## 1. Import packages

In [13]:
# The code below computes the example tables shown above.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

def toMarkDownTable(vect, sentences):
    # Print a markdown table of the vectorizer output for two sentences

    # Fit the vectorizer and transform the sentences into a sparse matrix
    wordVec = vect.fit_transform(sentences)

    # Each table row is built as a list of string fragments, joined when printed
    vectWords = ['|', ' ']
    vectCounts1 = ['|', '**', sentences[0], '**']
    vectCounts2 = ['|', '**', sentences[1], '**']
    dividers = ['|', '---']

    # Header row: one column per vocabulary word, in alphabetical order.
    # get_feature_names_out() needs scikit-learn >= 1.0;
    # older releases used get_feature_names() instead.
    for word in vect.get_feature_names_out():
        vectWords.append('|%s' % word)
        dividers.append('|---')
    vectWords.append('|')
    dividers.append('|')

    # First sentence: one formatted weight per vocabulary word
    for count in np.array(wordVec.todense()[0, :]).reshape(-1):
        vectCounts1.append('|%4.3f' % count)
    vectCounts1.append('|')

    # Second sentence
    for count in np.array(wordVec.todense()[1, :]).reshape(-1):
        vectCounts2.append('|%4.3f' % count)
    vectCounts2.append('|')

    # Print out the results
    print(''.join(vectWords))
    print(''.join(dividers))
    print(''.join(vectCounts1))
    print(''.join(vectCounts2))


# The example sentences
exampleSentence = 'This is the first sentence.'
sentenceTwo = 'Here is another sentence.'

# Print out the tables
toMarkDownTable(CountVectorizer(), [exampleSentence, sentenceTwo])
print('\n')
toMarkDownTable(TfidfVectorizer(norm='l2'), [exampleSentence, sentenceTwo])

| |another|first|here|is|sentence|the|this|
|---|---|---|---|---|---|---|---|
|**This is the first sentence.**|0.000|1.000|0.000|1.000|1.000|1.000|1.000|
|**Here is another sentence.**|1.000|0.000|1.000|1.000|1.000|0.000|0.000|


| |another|first|here|is|sentence|the|this|
|---|---|---|---|---|---|---|---|
|**This is the first sentence.**|0.000|0.499|0.000|0.355|0.355|0.499|0.499|
|**Here is another sentence.**|0.576|0.000|0.576|0.410|0.410|0.000|0.000|
