# Term Frequency - Inverse Document Frequency

Valuable references from [Christian S. Perone](https://github.com/perone)'s blog

Notebook Created by [Prashant Brahmbhatt](https://github.com/hashbanger)

___

### Understanding Tf-Idf

TF-IDF measures how important a word is to a document. It is also a useful way to convert a textual representation of information into a **Vector Space Model (VSM)**, that is, into sparse features.  
VSM is an algebraic model that represents textual information as a vector. The components of this vector can represent the importance of a term (tf-idf) or simply its presence or absence (Bag of Words) in a document.

The first step in modeling the document into a vector space is to create a dictionary of terms present in documents.

We need to select all terms from the documents and convert each one to a dimension in the vector space, but first we want to remove the **stopwords**, which are present in almost all documents. We want to extract important features from documents, features that can distinguish them from other similar documents, so terms like “the”, “is”, “at”, and “on” are unhelpful.

**Let us create a toy dataset**

In [8]:
train_set = ("The sky is blue.",                          #d1
             "The sun is bright.")                        #d2
test_set = ("The sun in the sky is bright.",              #d3
            "We can see the shining sun, the bright sun.")#d4

We have to create an index vocabulary (dictionary) of the words in the training document set. Using the documents **d1** and **d2**, we get the following index vocabulary, denoted ***E(t)***, where *t* is the term:

$$ \mathrm{E}(t) = \begin{cases} 1, & \mbox{if } t\mbox{ is ``blue''} \\ 2, & \mbox{if } t\mbox{ is ``bright''} \\ 3, & \mbox{if } t\mbox{ is ``sky''} \\ 4, & \mbox{if } t\mbox{ is ``sun''} \\ \end{cases}$$
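
As a rough illustration, the index vocabulary above can be built by hand in plain Python. This is only a sketch: the `STOPWORDS` set and the `tokenize` helper are hypothetical names introduced here, and the stop-word list is deliberately tiny and covers only this toy example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this toy example only.
STOPWORDS = {"the", "is", "in", "we", "can", "see"}

train_set = ("The sky is blue.",      # d1
             "The sun is bright.")    # d2

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens without stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

# Collect the distinct terms, sort them, and number them from 1 as in E(t).
terms = sorted({term for doc in train_set for term in tokenize(doc)})
E = {term: index for index, term in enumerate(terms, start=1)}

print(E)  # {'blue': 1, 'bright': 2, 'sky': 3, 'sun': 4}
```

Sorting the terms alphabetically is a design choice made here so that the numbering matches ***E(t)***; any fixed ordering would work equally well.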

 

We’re going to use the term frequency to represent each term in our vector space.  
The term frequency is nothing more than a count of how many times each term in our vocabulary ***E(t)*** appears in the documents **d3** or **d4**. We define the term frequency as a counting function:

$$ \mathrm{tf}(t,d) = \sum\limits_{x\in d} \mathrm{fr}(x, t) $$

where the ***fr(x, t)*** is a simple function defined as:  
$$\mathrm{fr}(x,t) = \begin{cases} 1, & \mbox{if } x = t \\ 0, & \mbox{otherwise} \\ \end{cases} $$

***tf(t, d)*** returns how many times the term **t** is present in the document **d**.  
Example: ***tf("sun", d4)*** = 2
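
The two definitions above translate almost directly into Python. The following is a minimal sketch, assuming a simple regex tokenizer (the `tokens` helper is a hypothetical name, not part of any library):

```python
import re

def tokens(document):
    """Split a document into lowercase alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", document.lower())

def fr(x, t):
    """Return 1 if the token x equals the term t, otherwise 0."""
    return 1 if x == t else 0

def tf(t, d):
    """Sum fr(x, t) over every token x in the document d."""
    return sum(fr(x, t) for x in tokens(d))

d4 = "We can see the shining sun, the bright sun."
print(tf("sun", d4))   # 2
print(tf("blue", d4))  # 0
```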

___

### Creating document vector

Now that we understand how tf works, we can move on to creating the document vector, which is represented by:  
$$  \displaystyle \vec{v_{d_n}} =(\mathrm{tf}(t_1,d_n), \mathrm{tf}(t_2,d_n), \mathrm{tf}(t_3,d_n), \ldots, \mathrm{tf}(t_n,d_n)) $$

Documents **d3** and **d4** can be represented in vectors as:  
    $$ \vec{v_{d_3}} = (\mathrm{tf}(t_1,d_3), \mathrm{tf}(t_2,d_3), \mathrm{tf}(t_3,d_3), \ldots, \mathrm{tf}(t_n,d_3)) \\ \vec{v_{d_4}} = (\mathrm{tf}(t_1,d_4), \mathrm{tf}(t_2,d_4), \mathrm{tf}(t_3,d_4), \ldots, \mathrm{tf}(t_n,d_4)) $$  
    which evaluates to:  
    $$ \vec{v_{d_3}} = (0, 1, 1, 1) \\ \vec{v_{d_4}} = (0, 1, 0, 2) $$

In d4 there is no occurrence of the words **blue** and **sky**, hence the 0 values.

We have a collection of documents, now represented by vectors, so we can represent them as a matrix of shape **|D| x F**, where **|D|** is the cardinality of the document space (how many documents we have) and **F** is the number of features, in our case the vocabulary size.  
$$ M_{|D| \times F} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 1 & 0 & 2 \end{bmatrix} $$
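
To make the construction concrete, here is a small sketch that builds one row per test document by counting the vocabulary terms. It assumes the term order of ***E(t)*** (blue, bright, sky, sun) and a simple regex tokenizer; `document_vector` is a hypothetical helper name.

```python
import re
from collections import Counter

# Vocabulary in the order given by E(t): blue, bright, sky, sun.
vocabulary = ["blue", "bright", "sky", "sun"]

test_set = ("The sun in the sky is bright.",                 # d3
            "We can see the shining sun, the bright sun.")   # d4

def document_vector(document, vocabulary):
    """Return the term counts of `document`, one entry per vocabulary term."""
    counts = Counter(re.findall(r"[a-z]+", document.lower()))
    return [counts[term] for term in vocabulary]

# One row per document, one column per feature: a |D| x F matrix.
M = [document_vector(doc, vocabulary) for doc in test_set]

for row in M:
    print(row)
# [0, 1, 1, 1]
# [0, 1, 0, 2]
```

The rows agree with the vectors for **d3** and **d4** computed above.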

________

In **sklearn**, the term-frequency representation presented above is implemented by **CountVectorizer**.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

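By default, **CountVectorizer** uses the **`'word'`** analyzer (called **WordNGramAnalyzer** in very old scikit-learn releases), which converts the text to lowercase and extracts tokens. Stop words are filtered here because we passed `stop_words='english'`, and accent removal is available through the `strip_accents` parameter.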

In [19]:
vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary_)

{'blue': 0, 'sun': 3, 'sky': 2, 'bright': 1}


So the vocabulary is the same as the ***E(t)*** we built by hand, except that the indices here start from 0.

Now we use the same vectorizer to create a sparse matrix for our **test_set**:

In [26]:
test_matrix = vectorizer.transform(test_set)
print(test_matrix)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (1, 1)	1
  (1, 3)	2


This **test_matrix** is a sparse matrix (stored in SciPy's CSR format and printed here as *(row, column) value* entries), and it can be converted into dense format.  
As discussed, its dimensions are **|D| x F**.

In [28]:
test_matrix.todense()

matrix([[0, 1, 1, 1],
        [0, 1, 0, 2]], dtype=int64)

_____

### Inverse Document Frequency

The main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms, even though rare terms are empirically more informative than high-frequency ones.  
The basic intuition is that a term that occurs frequently in many documents is not a good discriminator.

tf-idf tells us how important a word is to a document in a collection. That is why tf-idf incorporates both local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection.

tf-idf scales down the frequent terms and scales up the rarely occurring ones, using a logarithmic scale.  
We can remove stopwords using the predefined stopword lists that libraries generally provide, but a better way would be to:  

"convert all the documents to tf-idf weights and then remove the words whose value is lower than a chosen threshold."

Going back to our definition of **tf(t,d)**, it is simply the count of the term t in the document d.  
Using this simple term frequency can lead to problems such as **keyword spamming**, where a term is repeated in a document in order to improve its ranking in an IR (Information Retrieval) system. It can also create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.  
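
The effect is easy to reproduce on the toy data. In the sketch below, the "spam" document is simply **d4** repeated three times (a made-up example, not part of the original data set), and `tf_vector` is a hypothetical helper that counts the vocabulary terms:

```python
import re
from collections import Counter

vocabulary = ["blue", "bright", "sky", "sun"]

def tf_vector(document):
    """Raw term counts of `document` for each vocabulary term."""
    counts = Counter(re.findall(r"[a-z]+", document.lower()))
    return [counts[term] for term in vocabulary]

d4 = "We can see the shining sun, the bright sun."
spam = " ".join([d4] * 3)  # same content, repeated three times

print(tf_vector(d4))    # [0, 1, 0, 2]
print(tf_vector(spam))  # [0, 3, 0, 6]
```

Both documents say exactly the same thing, yet the raw counts of the repeated one are three times larger.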

So the term frequency **tf(t,d)** of a document in a vector space is usually also normalized.

### Vector Normalization

Suppose we want to normalize **d4** - "We can see the shining sun, the bright sun."  
Its vector representation was:  
$$\vec{v_{d_4}} = (0, 1, 0, 2) $$

Normalizing a vector is the same as calculating its *unit vector*, which is denoted using the **“hat”** notation: **v^ (v hat)**.  
The definition of the unit vector **v^** of a vector **v** is:

$$\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} $$

![Normalize](img/normalize.png)
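
As a worked example, the sketch below normalizes the **d4** vector in plain Python. It assumes the Euclidean norm (p = 2), which is a common choice but only one possible value of p; `p_norm` and `normalize` are hypothetical helper names.

```python
def p_norm(v, p=2):
    """Return the p-norm of v: (sum of |x|^p) raised to the power 1/p."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def normalize(v, p=2):
    """Divide every component of v by its p-norm to get the unit vector."""
    norm = p_norm(v, p)
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

v_d4 = [0, 1, 0, 2]
v_hat = normalize(v_d4)  # assumption: p = 2 (Euclidean norm)

print(p_norm(v_d4))                  # sqrt(5), about 2.236
print([round(x, 3) for x in v_hat])  # [0.0, 0.447, 0.0, 0.894]
```

After normalization the vector has length 1 under the chosen norm, so documents of different lengths become comparable.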
