# Data cleaning: TF-IDF

## 0. Overview

Below is a representative sentence from one essay:

>Did you know that more and more people these days are depending on computers for their safety, natural education, and their social life? ...

While written language is straightforward for a human reader to evaluate, an automated system needs to translate the text into a machine-friendly format before we can apply data science techniques to grade the essays. The method we have chosen is called term frequency-inverse document frequency, or **tf-idf** ("term frequency" because it emphasizes words that appear often in a particular essay, and "inverse document frequency" because it penalizes words that appear in many documents). Essentially, this method takes a piece of text as input (in our case, an essay) and *vectorizes* it, outputting a vector of numbers that represents that document.

To create this output vector, a **tf-idf** algorithm counts the number of times each word appears in a document and scales that count according to the formula

$$
w_{j,i} = tf_{j,i} \cdot \log\frac{N}{df_i},
$$

where $tf_{j,i}$ is the number of times term $i$ occurs in essay $j$, $N$ is the total number of essays, $df_i$ is the number of essays containing term $i$, and $w_{j,i}$ is the resulting tf-idf weight of term $i$ in essay $j$. Collecting the weights $w_{j,i}$ over all terms $i$ gives the tf-idf vector for essay $j$.
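To make the formula concrete, here is a minimal sketch that applies it by hand using only the standard library. The three toy essays, the whitespace tokenization, the helper name `tfidf_weights`, and the choice of the natural logarithm are all assumptions made for illustration; the formula above does not fix the base of the logarithm, and the real essays would need proper tokenization.

```python
import math
from collections import Counter

# Three toy "essays" (hypothetical, not taken from the essay data set).
essays = [
    "computers help people learn",
    "people depend on computers",
    "people enjoy nature",
]

# Naive whitespace tokenization, good enough for this illustration.
tokenized = [essay.split() for essay in essays]
N = len(tokenized)  # total number of essays

# df[term]: number of essays that contain the term at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_weights(tokens):
    """Return {term: w} for one essay, with w = tf * log(N / df)."""
    tf = Counter(tokens)  # raw count of each term in this essay
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}


weights = [tfidf_weights(tokens) for tokens in tokenized]
for j, w in enumerate(weights):
    print(j, {term: round(value, 3) for term, value in w.items()})
```

In this toy corpus "people" occurs in every essay, so $\log(N/df_i) = \log(1) = 0$ and its weight vanishes, while a word unique to one essay, such as "help", receives the largest weight.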

## EXAMPLES GO HERE

Creating these vectors for every essay and stacking them on top of one another provides a **tf-idf matrix**, a straightforward way of letting a computer handle text. Each row of this matrix corresponds to one essay and each column to one word, so any column gives the weighted count of a given word in each essay. Taking every column into account, we can now build a model that scores an essay based on the *word content* of that essay.
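The stacking step can be sketched with NumPy as follows. The vocabulary and the count values are made up for illustration and are not derived from the essays; the natural logarithm is again an assumption.

```python
import numpy as np

# Hypothetical raw term counts: one row per essay, one column per vocabulary term.
vocabulary = ["computers", "nature", "people"]
counts = np.array([
    [2, 0, 1],  # essay 0
    [1, 0, 1],  # essay 1
    [0, 3, 1],  # essay 2
])

N = counts.shape[0]            # total number of essays
df = (counts > 0).sum(axis=0)  # essays containing each term: [2, 1, 3]
idf = np.log(N / df)           # natural log assumed

# One tf-idf vector per essay, stacked row by row into the tf-idf matrix.
rows = [tf * idf for tf in counts]
tfidf = np.vstack(rows)

# A single column holds the weighted counts of one word across all essays.
column = tfidf[:, vocabulary.index("computers")]
print(tfidf.round(3))
print(column.round(3))
```

Each row of `tfidf` is the vector for one essay, and the "people" column is all zeros because that word appears in every essay of this toy corpus.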

# TO DO:
- Add in a couple example matrices based on the sentence above. I'm having a bit of trouble making it come out right (see `tfidf_example.py`)

## 1. Import packages

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import warnings
import os
import copy
from scipy.stats import spearmanr
warnings.filterwarnings('ignore')
%matplotlib inline
```