# TF-IDF: Term Frequency–Inverse Document Frequency

TF-IDF is a numerical statistic intended to reflect how important 
a word is to a document in a collection or corpus. It is often used as a 
weighting factor in information retrieval, 
text mining, and user modeling. The TF-IDF value increases proportionally to 
the number of times a word appears in the document and 
is offset by the number of documents in the corpus 
that contain the word, which helps to adjust for the fact that some words 
appear more frequently in general. TF-IDF is one of 
the most popular term-weighting schemes today. A survey conducted 
in 2015 showed that 83% of text-based recommender systems in digital 
libraries use TF-IDF.

In this post, we will:

- learn how to build up a tf-idf model from scratch
- apply it to match names (entities) 

The IDF weighting at the heart of TF-IDF was introduced by [Karen Spärck Jones](https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones) in 1972;
her original papers are among the most cited papers in the field of CIS.

## Definition

- TF: term frequency is the number of times a token appears in a document.
- IDF: inverse document frequency is a weight indicating how commonly a 
word is used across documents. The more documents a word appears in, the lower its score.
For instance, 'the' shows up in almost every document, so it has a low IDF score. The lower
the score, the less important the word is.

The formula to compute the TF-IDF of a token $t$ in a document $d$
from a document set is:

$$\text{tf-idf}(t, d) = tf(t, d) \times idf(t), \quad idf(t) = \log \left( \frac{n}{df(t)+1} \right)$$

where $n$ is the total number of documents in the document set and 
$df(t)$ is the _document frequency_ of $t$, that is,
the number of documents in the document set that contain the term $t$.


For the IDF, we will implement the smoothed version described in the 
`sklearn` documentation, which is 

$$idf(t) = \log \left ( \frac{1+n}{1+df(t)} \right ) +1 $$
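
To make the difference between the two IDF formulas concrete, here is a small sketch that evaluates both for a few document frequencies. The function names `idf_basic` and `idf_smooth` and the corpus size `n = 4` are assumptions chosen for illustration.

```python
import math


def idf_basic(n: int, df: int) -> float:
    """idf(t) = log(n / (df(t) + 1))"""
    return math.log(n / (df + 1))


def idf_smooth(n: int, df: int) -> float:
    """idf(t) = log((1 + n) / (1 + df(t))) + 1"""
    return math.log((1 + n) / (1 + df)) + 1


n = 4  # hypothetical corpus size
for df in (1, 2, 4):
    print(f"df={df}  basic={idf_basic(n, df):.4f}  smooth={idf_smooth(n, df):.4f}")
```

With four documents, a term that occurs in every document gets a slightly negative weight under the first formula, while the smoothed version bottoms out at exactly 1, so no term is ever zeroed out or given a negative weight.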

In [1]:
import time
import struct
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import matplotlib as mplt
import pandas as pd
from collections import Counter
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
%config InlineBackend.figure_formats = ['svg']

In [12]:
# example corpus: a list of documents
corpus_1 = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]
corpus_1 = np.array(corpus_1)
corpus_1

array(['this is the first document',
       'this document is the second document',
       'and this is the third one', 'is this the first document'],
      dtype='<U36')

In [20]:
def count_dimension(corpus: np.ndarray) -> dict:
    """
    Assign a dimension (column index) to each unique token in the corpus
    
    Parameters
    ----------
    corpus: np.ndarray, shape (n, ) or (n, 1), where each row is a 
    long string with spaces such as "this is the first document"
    
    Returns
    ----------
    unique_tokens_dict: dict{'token': dimension: int}
    """
    corpus = corpus.ravel()  # make it shape (n, )
    
    unique_tokens = set()
    
    # loop over rows of corpus np.ndarray
    for row in corpus:
        # here each row is a long string with spaces; adapt this loop
        # if your rows are already arrays or lists of tokens
        for token in row.split():  # split on whitespace
            # ignore single-character tokens, e.g. len('b') == 1
            if len(token) >= 2:
                unique_tokens.add(token)
    
    # sort once, after all rows are processed, so the indices are deterministic
    unique_tokens_dict = {token: i for i, token in enumerate(sorted(unique_tokens))}
    
    return unique_tokens_dict

In [21]:
count_dimension(corpus_1)

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}

In [None]:
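def count_word(corpus: np.ndarray, word: str) -> int:
    """
    Count the number of documents in the corpus that contain the word
    (its document frequency)
    
    Parameters
    -----------
    corpus: np.ndarray, shape (n, ) or (n, 1)
    word: string
    
    Return
    ------------
    count: integer, number of documents containing the word
    """
    corpus = corpus.ravel()
    count = 0
    
    for row in corpus:
        # assumes row is a string; compare against whole tokens, since
        # `word in row` would also match substrings ('is' inside 'this')
        if word in row.split():
            count += 1
    
    return count 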
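
One thing to keep in mind when counting documents: Python's `in` operator on a string tests for substrings, so a short word can match inside a longer one. The sketch below counts document frequency at the token level instead. The helper name `document_frequency` and the plain-list corpus `docs` are assumptions for illustration.

```python
def document_frequency(corpus, word):
    """Number of documents whose whitespace-separated tokens include `word`."""
    return sum(1 for doc in corpus if word in doc.split())


docs = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

# Substring test: 'is' is found inside 'this', although it is not a token here
print('is' in 'this document')           # True
# Token test: the same string has no standalone token 'is'
print('is' in 'this document'.split())   # False

print(document_frequency(docs, 'document'))  # 3
print(document_frequency(docs, 'first'))     # 2
```

Note that a document counts once no matter how often the word occurs in it, which is why `'document'` yields 3 even though it appears twice in the second document.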

The two functions above are straightforward. We will use them to 
build the vocabulary and to count document frequencies. Before we proceed, let's recap:

- we have documents
- we have texts in each document 
- we have a set of unique words in the texts of each document
- we have a set of unique words in all the documents 

To calculate the inverse document frequency (IDF), we will use the following
formula:

$$idf(t) = \log \left ( \frac{1+n}{1+df(t)} \right ) +1 $$

- $n$ is the total number of documents
- $df(t)$ is the number of documents containing term $t$
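
As a rough illustration of the formula, the following sketch computes $df(t)$ and the smoothed $idf(t)$ for every token of the four example documents using only the standard library. The variable names `docs`, `df`, and `idf` are assumptions, and tokens are obtained by simple whitespace splitting.

```python
import math
from collections import Counter

docs = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

n = len(docs)

# set() removes repeats inside a document, so each document counts once per token
df = Counter(token for doc in docs for token in set(doc.split()))

# smoothed idf: log((1 + n) / (1 + df(t))) + 1
idf = {token: math.log((1 + n) / (1 + count)) + 1 for token, count in df.items()}

for token in sorted(idf):
    print(f"{token:>8}  df={df[token]}  idf={idf[token]:.4f}")
```

Tokens such as `the`, `is`, and `this` occur in all four documents and receive the minimum IDF of 1, while tokens found in a single document, such as `second`, receive about 1.92.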

To help us to 

In [None]:
def tf_idf_transform(corpus: np.ndarray, token_dimension: dict):
    """
    Transform the corpus into a sparse matrix 
    """
    
    corpus = corpus.ravel()
    
    rows = []
    columns = []
    frequency = []
    tf_val = []
    idf_val = []
    
    for idx, row in enumerate(corpus):
        # count how often each token occurs in this document
        word_freq = Counter(row.split())
        
        # visit every distinct token of the current document
        for word, freq in word_freq.items():
            # only retrieve those words with len >= 2
            if len(word) >= 2:
                # get the dimension as idx
                # if the key does not exist, dimension_idx = -1
                word_dimension_idx = token_dimension.get(word, -1)
                
                if word_dimension_idx != -1:
                    # first, store the index of the document
                    rows.append(idx)
                    # second, store the index of the word
                    columns.append(word_dimension_idx)

In [24]:
Counter(corpus_1[0].split())

Counter({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1})
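
The transform above stops after collecting the row and column indices. One possible way to finish the idea is sketched below: it multiplies the raw term count by the smoothed IDF and returns the non-zero entries in coordinate form. This is only a sketch under stated assumptions, not the notebook's final implementation; the names `tf_idf_triplets`, `docs`, and `vocabulary` are hypothetical, and only the standard library is used.

```python
import math
from collections import Counter


def tf_idf_triplets(corpus, token_dimension):
    """Return (rows, columns, values) for the non-zero tf-idf entries."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]

    # document frequency: each document counts once per distinct token
    df = Counter(token for tokens in tokenized for token in set(tokens))

    rows, columns, values = [], [], []
    for doc_idx, tokens in enumerate(tokenized):
        for word, freq in Counter(tokens).items():
            col = token_dimension.get(word)
            if col is None:  # token is not part of the vocabulary
                continue
            idf = math.log((1 + n) / (1 + df[word])) + 1  # smoothed idf
            rows.append(doc_idx)
            columns.append(col)
            values.append(freq * idf)  # tf is the raw count in the document
    return rows, columns, values


docs = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

# vocabulary: sorted unique tokens with at least two characters
vocabulary = {
    token: i
    for i, token in enumerate(sorted({t for d in docs for t in d.split() if len(t) >= 2}))
}

rows, columns, values = tf_idf_triplets(docs, vocabulary)
for r, c, v in zip(rows, columns, values):
    print(r, c, round(v, 4))
```

The three lists can be passed to `csr_matrix((values, (rows, columns)), shape=(len(docs), len(vocabulary)))` from `scipy.sparse` to obtain a sparse matrix. `sklearn`'s `TfidfVectorizer` additionally L2-normalizes each row by default, which this sketch leaves out.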