# Term Frequency - Inverse Document Frequency (TF-IDF)

* https://en.wikipedia.org/wiki/Tf–idf
* http://kak.tx0.org/IR/TFxIDF
* https://ecommons.cornell.edu/bitstream/handle/1813/6721/87-881.pdf?sequence=1
* http://www.tfidf.com


* **recall**: the proportion of relevant items that are retrieved, measured by the ratio of the number of relevant retrieved items to the total number of relevant items in the collection
* **precision**: the proportion of retrieved items that are relevant, measured by the ratio of the number of relevant retrieved items to the total number of retrieved items
* **TF**: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization:
<br>
`TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).`

* **IDF**: Inverse Document Frequency, which measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little meaning. We therefore weigh down the frequent terms and scale up the rare ones by computing the following:
<br>
`IDF(t) = log_e(Total number of documents / Number of documents with term t in it).`

* **TF-IDF**: the product of the two, `TF-IDF(t) = TF(t) * IDF(t)`
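
As a quick illustration of the three formulas above, here is a minimal sketch on a toy corpus. The three sentences and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf_score`) are hypothetical and chosen only for this example. It assumes whitespace tokenization and the length-normalized TF definition.

```python
import math

# Hypothetical toy corpus; each string stands in for one document.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def term_frequency(term, tokens):
    # Occurrences of the term divided by the document length.
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, all_tokens):
    # Natural log of (number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / n_containing)

def tf_idf_score(term, tokens, all_tokens):
    return term_frequency(term, tokens) * inverse_document_frequency(term, all_tokens)

for term in ("the", "cat"):
    print(term, round(tf_idf_score(term, tokenized[0], tokenized), 4))
```

In the first document "the" occurs twice and "cat" once, yet "cat" receives the higher score because it appears in only one of the three documents.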


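The recall and precision definitions given earlier can be sketched as two ratios over sets of document ids. The helper name `precision_recall` and the ids below are made up for illustration. Returning 0.0 for an empty set is an assumption of this sketch, not part of the definitions.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    # Assumption: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four documents retrieved, three relevant overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.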

# Implementation

In [157]:
import nltk
from nltk.corpus import reuters  # needs the corpus data: nltk.download('reuters')
import operator
import matplotlib.pyplot as plt
import numpy as np
import re
import textwrap

%matplotlib inline

In [158]:
def clean_text(original_text):
    original_text = re.sub(r'[~!@#$%^&*()_|+\-=?;:",.<>\{\}\[\]\\\/\n0-9]',' ',original_text.lower())
    original_text = re.sub(r'\s+',' ',original_text)
    return original_text

def get_word_counts(document):
    word_counts = {}
    document_words = document.split()  # split on any whitespace run; avoids empty-string tokens
    for w in document_words:
        if w in word_counts:
            word_counts[w] += 1
        else:
            word_counts[w] = 1
    return word_counts

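def tf(document_word_counts, word):
    # Returns the raw count of `word` in the document. Unlike the TF formula
    # above, the count is not divided by the document length.
    if word in document_word_counts:
        return document_word_counts[word]
    return 0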

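def idf(documents_word_counts, word):
    # Count the documents that contain the term at least once.
    n_of_docs_with_term = 0
    for word_counts in documents_word_counts:
        if word in word_counts:
            n_of_docs_with_term += 1
    if n_of_docs_with_term == 0:
        # Term is absent from the corpus: return a pair so that callers
        # unpacking (idf, n_of_docs_with_term) do not fail.
        return 0, 0

    _idf = np.log(len(documents_word_counts) / float(n_of_docs_with_term))
    return _idf, n_of_docs_with_term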

def tf_idf(document_word_counts, documents_word_counts, word):
    _tf = tf(document_word_counts, word)
    _idf, n_of_docs_with_term = idf(documents_word_counts, word)
    return {
        'tf': _tf,
        'n_of_docs_with_term': n_of_docs_with_term,
        'idf': _idf,
        'tf_idf': _tf * _idf
    }
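
The counting loop in `get_word_counts` could also be written with `collections.Counter` from the standard library. The sketch below is an alternative and does not reproduce the notebook's code: the `count_words` helper and its regex tokenizer are assumptions for this example and simpler than `clean_text`, and the sample sentence is made up.

```python
import re
from collections import Counter

def count_words(text):
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = count_words("Trade friction: U.S. trade with Japan, Japan's trade.")
print(counts["trade"])
print(counts["missing"])  # a Counter yields 0 for unseen words
```

Because a `Counter` returns 0 for missing keys, a lookup on it behaves like the `tf` function above without an explicit membership test.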


In [159]:

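# Take the first 250,000 characters of the Reuters corpus and normalize them.
corpus_length = 250000
reuters_text = clean_text(reuters.raw()[:corpus_length])
corpus_length = len(reuters_text)  # length after cleaning
# Split the text into roughly 100 fixed-width chunks that stand in for documents.
# textwrap.wrap needs an integer width, hence the floor division.
documents = textwrap.wrap(reuters_text, corpus_length // 100)
number_of_documents = len(documents)
print('n of documents: %d' % number_of_documents)
print(documents[0][:200])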


n of documents: 101
asian exporters fear damage from u s japan rift mounting trade friction between the u s and japan has raised fears among many of asia's exporting nations that the row could inflict far reaching econom
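
The "documents" here are fixed-width chunks produced by `textwrap.wrap`, not the original Reuters articles. The following minimal sketch shows the same chunking on a short string taken from the output above. The target of three chunks is an arbitrary choice for illustration.

```python
import textwrap

text = "mounting trade friction between the u s and japan has raised fears"
width = len(text) // 3  # integer width; aims for roughly three chunks
chunks = textwrap.wrap(text, width)

for chunk in chunks:
    # Each chunk is at most `width` characters and ends on a word boundary.
    print(len(chunk), repr(chunk))
```

Because lines are broken only at word boundaries, most chunks come out shorter than `width`, and the total can exceed the target count. This is consistent with the 101 documents reported above for a target of 100.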


In [160]:

documents_word_counts = [get_word_counts(d) for d in documents]

print(tf_idf(documents_word_counts[0], documents_word_counts, 'nations'))
print(tf_idf(documents_word_counts[0], documents_word_counts, 'mounting'))
print(tf_idf(documents_word_counts[0], documents_word_counts, 'trade'))
print(tf_idf(documents_word_counts[0], documents_word_counts, 'row'))
print(tf_idf(documents_word_counts[0], documents_word_counts, 'inflict'))


{'tf': 1, 'idf': 2.8233610476132043, 'n_of_docs_with_term': 6, 'tf_idf': 2.8233610476132043}
{'tf': 1, 'idf': 3.5165082281731497, 'n_of_docs_with_term': 3, 'tf_idf': 3.5165082281731497}
{'tf': 7, 'idf': 1.0042026041970351, 'n_of_docs_with_term': 37, 'tf_idf': 7.029418229379246}
{'tf': 1, 'idf': 4.6151205168412597, 'n_of_docs_with_term': 1, 'tf_idf': 4.6151205168412597}
{'tf': 1, 'idf': 4.6151205168412597, 'n_of_docs_with_term': 1, 'tf_idf': 4.6151205168412597}
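
The scores above can be reproduced from the reported counts alone, which is a useful sanity check on the formula. The sketch below copies the document count (101) and the per-term `tf` and `n_of_docs_with_term` values from the output. The dictionary names are introduced only for this example.

```python
import math

n_documents = 101
# Values copied from the printed results for the first document.
docs_with_term = {"nations": 6, "mounting": 3, "trade": 37, "row": 1, "inflict": 1}
term_counts = {"nations": 1, "mounting": 1, "trade": 7, "row": 1, "inflict": 1}

scores = {}
for term, n in docs_with_term.items():
    idf_value = math.log(n_documents / n)  # natural log, as in the IDF formula
    scores[term] = term_counts[term] * idf_value

# Rank the terms of the first document by descending TF-IDF.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 4))
```

Note that "trade" ranks first despite having the lowest IDF, because `tf` in this notebook is a raw count (7) and is not normalized by document length.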
