# Tf-idf
Term Frequency-Inverse Document Frequency

#### Purpose:
- to evaluate how important a word is to a document in a collection or corpus



TF (Term Frequency) measures how frequently a term occurs in a document:

$$TF(\text{T, D}) = \frac{\text{Number of times term T appears in document D}}{\text{Total number of terms in document D}}$$
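The TF definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `term_frequency` and the naive whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Assumption: lowercase text split on whitespace stands in for real tokenization.
doc = "this document is the second document"
tf = term_frequency(doc.split())

# "document" accounts for 2 of the 6 tokens.
print(tf["document"])
```

Because every count is divided by the same total, the TF values of one document always sum to 1.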

IDF (Inverse Document Frequency) measures how informative a term is across the corpus. When computing TF, all terms are treated as equally important. However, common terms such as "is", "of", and "that" may appear many times yet carry little meaning. We therefore weigh down terms that occur in many documents and scale up the rare ones by computing:

$$IDF(\text{T, Corpus}) = \log \left( \frac{\text{Total number of documents in Corpus}}{\text{Number of documents with term T in it}} \right)$$
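As a rough sketch of the IDF formula, the snippet below counts the documents that contain a term and takes the logarithm of the ratio. The function name `inverse_document_frequency` is hypothetical, the natural logarithm is an assumption (the formula above does not fix a base), and a term that never occurs raises an error because the ratio is undefined.

```python
import math


def inverse_document_frequency(term, tokenized_docs):
    """log(number of documents / number of documents containing term)."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / n_containing)


# Assumption: lowercase text split on whitespace stands in for real tokenization.
docs = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
tokenized = [d.split() for d in docs]

idf_this = inverse_document_frequency("this", tokenized)      # log(4 / 4)
idf_second = inverse_document_frequency("second", tokenized)  # log(4 / 1)
print(idf_this, idf_second)
```

A term that appears in every document, such as "this", gets an IDF of zero, which is exactly the down-weighting of frequent terms described above.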

For each term T in a document D, we have:

$$tfidf(\text{T, D}) = TF(\text{T, D}) \cdot IDF(\text{T, Corpus})$$

Given a document D that has k unique terms, the j-th component of the L2-normalized vector for D is:

$$\frac{tfidf(\text{term}_j, \text{D})}{\sqrt{\sum_{i=1}^{k} \big( tfidf(\text{term}_i, \text{D}) \big)^2}}, \quad j = 1, \dots, k$$
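The normalization step can be illustrated as follows. This is only a sketch: `l2_normalize` is a hypothetical helper, and the input weights are made-up numbers chosen so the arithmetic is easy to check by hand.

```python
import math


def l2_normalize(weights):
    """Divide every weight by the Euclidean length of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


# Made-up tf-idf weights: the vector length is sqrt(3^2 + 4^2) = 5.
raw = {"document": 3.0, "second": 4.0}
unit = l2_normalize(raw)
print(unit)
```

After normalization the squared components sum to 1, so documents of different lengths become directly comparable.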


# scikit-learn:
https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# one string per document:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [2]:
# unique tokens/terms (get_feature_names() was removed in scikit-learn 1.2):
print(vectorizer.get_feature_names_out().tolist())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [3]:
# shape of the matrix of the transformed corpus:
print(X.shape)

(4, 9)


In [4]:
# print the transformed corpus:
X.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [5]:
vectorizer.transform(['This is the first document']).toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [6]:
vectorizer.transform(['first']).toarray()

array([[0., 0., 1., 0., 0., 0., 0., 0., 0.]])
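The numbers above do not follow from the plain formulas at the top of this page. By default, `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The pure-Python sketch below tries to reproduce the first row under those settings. The regular expression is an approximation of the default tokenizer (lowercased tokens of two or more word characters), and `tfidf_row` is a hypothetical helper.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumption: approximates the default tokenization of TfidfVectorizer.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({t for tokens in tokenized for t in tokens})

n_docs = len(tokenized)
df = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}
# Smoothed IDF: behaves as if one extra document contained every term.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalized."""
    raw = [tokens.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]

# First row of the matrix printed by X.toarray() above.
expected_first_row = [0.0, 0.46979139, 0.58028582, 0.38408524, 0.0,
                      0.0, 0.38408524, 0.0, 0.38408524]
print(all(abs(a - b) < 1e-6 for a, b in zip(rows[0], expected_first_row)))
```

The same reasoning explains the last output: a query with a single known term such as `'first'` has only one nonzero weight, so L2 normalization scales it to exactly 1.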

# From Scratch:

In [7]:
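class TfIdf:
    """Skeleton of a from-scratch tf-idf model; the method bodies are still unimplemented."""

    def __init__(self,
                 corpus_dir: str = None,
                 stopword_filename: str = None,
                 DEFAULT_IDF: float = 1.5):
        # parse_corpus_dir is an instance method, so it must be called through self.
        self.num_of_docs = self.parse_corpus_dir(corpus_dir)

    def get_tokens(self, single_file_str: str) -> list:
        """Split the text of a single file into tokens."""
        pass

    def parse_corpus_dir(self, corpus_dir: str) -> int:
        """Read the documents in corpus_dir and return how many there are."""
        pass

    def get_idf(self, term: str) -> float:
        """Return the IDF of the given term."""
        pass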
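One possible way to fill in the skeleton above is sketched below. It is a hypothetical completion, not the author's implementation, and it rests on several assumptions: the corpus directory holds one `.txt` file per document, the stopword file lists one word per line, tokens are lowercase runs of word characters, unseen terms fall back to the default IDF, and the extra method `get_tfidf` is added only for illustration. The class is named `SimpleTfIdf` to keep it distinct from the skeleton.

```python
import math
import re
import tempfile
from collections import Counter
from pathlib import Path


class SimpleTfIdf:
    """Hypothetical completion of the TfIdf skeleton (plain, unsmoothed formulas)."""

    def __init__(self, corpus_dir=None, stopword_filename=None, default_idf=1.5):
        self.default_idf = default_idf
        self.stopwords = set()
        self.doc_freq = Counter()  # term -> number of documents containing it
        if stopword_filename is not None:
            lines = Path(stopword_filename).read_text(encoding="utf-8").splitlines()
            self.stopwords = {line.strip().lower() for line in lines if line.strip()}
        self.num_of_docs = self.parse_corpus_dir(corpus_dir) if corpus_dir else 0

    def get_tokens(self, single_file_str):
        """Lowercase the text, keep runs of word characters, drop stopwords."""
        words = re.findall(r"\w+", single_file_str.lower())
        return [w for w in words if w not in self.stopwords]

    def parse_corpus_dir(self, corpus_dir):
        """Update document frequencies from every .txt file; return the file count."""
        num_of_docs = 0
        for path in sorted(Path(corpus_dir).glob("*.txt")):
            tokens = self.get_tokens(path.read_text(encoding="utf-8"))
            self.doc_freq.update(set(tokens))  # count each term once per document
            num_of_docs += 1
        return num_of_docs

    def get_idf(self, term):
        """log(N / df) for known terms, default_idf for terms never seen."""
        df = self.doc_freq.get(term.lower(), 0)
        if df == 0:
            return self.default_idf
        return math.log(self.num_of_docs / df)

    def get_tfidf(self, text):
        """Return {term: L2-normalized tf-idf weight} for one document."""
        tokens = self.get_tokens(text)
        counts = Counter(tokens)
        raw = {t: (c / len(tokens)) * self.get_idf(t) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {t: (w / norm if norm else 0.0) for t, w in raw.items()}


# Usage: write the example corpus to a temporary directory, one file per document.
docs = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
with tempfile.TemporaryDirectory() as tmp:
    for i, text in enumerate(docs):
        (Path(tmp) / f"doc{i}.txt").write_text(text, encoding="utf-8")
    model = SimpleTfIdf(corpus_dir=tmp)

idf_second = model.get_idf("second")  # log(4 / 1)
idf_unseen = model.get_idf("banana")  # not in the corpus: default_idf
vector = model.get_tfidf(docs[1])
print(vector)
```

Because this sketch uses the unsmoothed formulas from the top of the page, terms that occur in every document ("this", "is", "the") receive a weight of zero, so its output differs from the scikit-learn matrix shown earlier.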
    