# Intro to TF-IDF

### Introduction

In the previous section, we saw a basic technique for turning our text into numbers.  We simply count the number of times that each word occurs, let each word be represented by a separate dimension in a vector, and then train a random forest on those vectors.

In this lesson, we'll see a different way to encode our words.  As we'll see, this encoding gives more weight to infrequent words.

### What is TF-IDF

TF-IDF stands for term frequency-inverse document frequency.  We can read the formula for TF-IDF literally as:

$tfidf = term\_frequency * inverse\_document\_frequency $

In other words, take the proportion of a document that a word makes up and multiply it by how rare that word is in general.  The result is a measure of each word's importance in a document.

For example, if we have a document in which both $armadillo$ and $house$ occur twice, we won't record these words as:

`[2, 2]`, where the first element represents armadillo, but rather something like

`[2.5, 2]` because armadillo is a less frequent word than house, and therefore we want to add more weight to it.  Of course, the formula is a little more complex than that, so let's move through it in the rest of this lesson.

### Defining Terms

Remember that we previously defined our terms as follows.

* Term - a term is an individual word
* Document - a document is a collection of words, such as a single restaurant review
* Corpus - the set of all documents in our dataset

Now from this, we can start to piece together what term frequency-inverse document frequency means.

### Term frequency

Let's take a look at a couple of our restaurant reviews.

* `Wow I loved the pizza.`
* `I loved the rolls, but the pizza crust was not good.`

Both of the reviews contain the word 'loved'.  But to represent the importance of 'loved' in each review, we can divide its count by the length of the document.  The word 'loved' makes up $\frac{1}{5}$ of the first document, but only $\frac{1}{11}$ of the second document.  This is called the term frequency.

$term\_frequency = \frac{term\_count}{|D|} $

> where $|D|$ is the length of the document.

So, in general, we don't think of our features as word counts, but as the proportion of the document that each word makes up.
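To make the term frequency calculation concrete, here is a minimal sketch applied to the two reviews above. The `tokenize` helper and its regular expression are assumptions made for illustration (lowercasing and splitting on letters), not the tokenizer used elsewhere in this lesson.

```python
import re

def tokenize(document):
    """Lowercase a document and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", document.lower())

def term_frequency(term, document):
    """Return the share of the document's tokens that equal `term`."""
    tokens = tokenize(document)
    return tokens.count(term) / len(tokens)

reviews = [
    "Wow I loved the pizza.",
    "I loved the rolls, but the pizza crust was not good.",
]

for review in reviews:
    print(round(term_frequency("loved", review), 3))
# prints 0.2, then 0.091
```

Note that this sketch assumes each document contains at least one token; an empty document would cause a division by zero.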

### Inverse Document Frequency

Now with inverse document frequency, we weight words based on how rare they are throughout the corpus.  The less frequent a word, the more we weight it.

* $IDF(term) = \frac{\text{total number of docs}}{\text{number of docs containing the term}}$

So if the word *pizza* appears in five out of one hundred documents in our corpus, we have an inverse document frequency of $IDF(pizza) = \frac{100}{5} = 20$.

If the word *crust* appears in one out of one hundred documents, inverse document frequency gives us $IDF(crust) = \frac{100}{1} = 100$.
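The two IDF values above can be reproduced with a short sketch. The corpus here is hypothetical: one hundred already-tokenized documents constructed so that *pizza* appears in five of them and *crust* in one.

```python
def inverse_document_frequency(term, corpus):
    """Total number of documents divided by the number of documents containing `term`."""
    docs_with_term = sum(1 for document in corpus if term in document)
    return len(corpus) / docs_with_term

# Hypothetical corpus of 100 tokenized documents:
# 5 contain "pizza", and 1 of those also contains "crust".
corpus = [["pizza", "crust"]] + [["pizza"]] * 4 + [["rolls"]] * 95

print(inverse_document_frequency("pizza", corpus))  # 20.0
print(inverse_document_frequency("crust", corpus))  # 100.0
```

This version assumes the term appears in at least one document; a term that never occurs would cause a division by zero.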

### TF-IDF Final Touches

So we can now express TF-IDF as the following:

$\text{TF-IDF}(w) = \frac{ \text{term count}}{\text{document length}} * \frac{\text{total number of docs}}{\text{docs with term}}$.

> And we can think of this as the word's *proportion* of the document, multiplied by the rarity of the word.

There is one last idea in TF-IDF.  The raw rarity ratio can differ enormously from one word to the next: the word *crust* is likely to be far rarer than the word *pizza* throughout the corpus.  To prevent vastly different scores based on rarity, we generally update the IDF term by taking its natural log:

$\text{TF-IDF}(w) = \frac{ \text{term count}}{\text{document length}} * \ln(\frac{\text{total number of docs}}{\text{docs with term}})$.

As we know, applying the log tends to reduce the spread of our scores, because it shrinks larger numbers more than smaller numbers.
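As a rough illustration of the dampening effect, the sketch below scores *pizza* and *crust* with the logged formula. The figures are assumptions that combine the earlier examples: each word occurs once in the eleven-word review, and the corpus has one hundred documents, with *pizza* in five of them and *crust* in one.

```python
import math

def tf_idf(term_count, document_length, total_docs, docs_with_term):
    """Term frequency times the natural log of the inverse document frequency."""
    term_frequency = term_count / document_length
    inverse_document_frequency = math.log(total_docs / docs_with_term)
    return term_frequency * inverse_document_frequency

# Assumed figures: an 11-word review in a 100-document corpus,
# where pizza appears in 5 documents and crust in 1.
pizza_score = tf_idf(1, 11, 100, 5)
crust_score = tf_idf(1, 11, 100, 1)

print(round(pizza_score, 3), round(crust_score, 3))  # 0.272 0.419
print(round(crust_score / pizza_score, 2))           # 1.54
```

Without the log, the *crust* score would be five times the *pizza* score (100 versus 20); with the log, it is only about one and a half times as large.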

### TF-IDF in Sklearn

Let's use our newsgroups dataset to explore some of the differences between the bag of words model and tf-idf in sklearn.

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

documents = pd.Series(newsgroups_train['data'])
y = newsgroups_train['target']

Once again, let's look at the first document.

In [30]:
documents[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

Now simply using the bag of words model gives us the following:

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
# 'english' selects sklearn's built-in English stop word list
cv = CountVectorizer(stop_words='english')
cv_vectors = cv.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.Series(cv_vectors[0].toarray()[0], cv.get_feature_names_out()).sort_values(ascending=False)[:10]

car           5
lerxst        2
edu           2
umd           2
wam           2
bricklin      1
info          1
funky         1
university    1
separate      1
dtype: int64

Notice that we get a similar ranking with sklearn's `TfidfVectorizer`.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=True is the default; note that no stop words are removed here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(documents)

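The main difference is that `edu` is now ranked considerably lower in tf-idf, dropping out of the top ten entirely.  This seems like a good change.  As a downside, we also notice that some typos, such as `tellme`, are ranked higher.  Common words like `was` and `this` also appear, because we did not pass `stop_words` to the `TfidfVectorizer`.  Notice that both approaches do a good job of pulling out the word `car` as an important feature.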

In [18]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.Series(tfidf_vectorizer_vectors[0].toarray()[0], feature_names).sort_values(ascending=False)[:10]

car         0.381339
lerxst      0.353835
wam         0.259709
umd         0.211868
tellme      0.176918
bricklin    0.167132
rac3        0.160686
funky       0.155872
was         0.145347
this        0.144473
dtype: float64
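The scores above do not match the formula from earlier in this lesson exactly, because sklearn's defaults differ in a few documented ways: term frequency is the raw count rather than a proportion, IDF is smoothed as $\ln(\frac{1 + n}{1 + df}) + 1$, and each document vector is scaled to unit (L2) length. The pure Python sketch below mimics those defaults on a small hypothetical corpus. The function name `sklearn_style_tfidf` and the toy documents are assumptions for illustration, and this is not sklearn's actual implementation.

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus):
    """Weight each tokenized document the way TfidfVectorizer's defaults are documented to.

    corpus: list of non-empty token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(corpus)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for document in corpus for term in set(document))
    # Smoothed IDF, as with smooth_idf=True
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for document in corpus:
        counts = Counter(document)  # raw counts serve as term frequency
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2 normalization, as with norm='l2'
        norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
        rows.append({term: weight / norm for term, weight in weights.items()})
    return rows

# Hypothetical toy corpus: "edu" appears in every document, "car" in only one
toy_corpus = [["car", "car", "edu"], ["edu", "pizza"], ["edu", "crust"]]
rows = sklearn_style_tfidf(toy_corpus)
print(rows[0])
```

Because `edu` appears in every toy document, its smoothed IDF is the minimum value of 1, so `car` outweighs it in the first document. The L2 scaling is also why the sklearn scores above all fall between 0 and 1.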

### Resources

[computing tf-idf in python](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/)

[naive bayes classifier](https://medium.com/@baemaek/text-mining-preprocess-and-naive-bayes-classifier-da0000f633b2)