# Tf-idf

Tf-idf, short for term frequency–inverse document frequency, is a method for summarizing a text by its word frequencies, weighted by inverse document frequency. The idea is that a word that occurs more often in a text is more important to that text than other words, while a word that occurs in more documents is less significant.

$\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}(t, D)$ where<br>
$\mathrm{idf}(t, D) = \log \frac{|D|}{1 + |\{ d \in D : t \in d \}|}$,<br>
with $t$ a term, $d$ a document, and $D$ the set of all documents.
<br>Read more on https://en.wikipedia.org/wiki/Tf%E2%80%93idf
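
To make the idf formula concrete, here is a minimal sketch that evaluates it on a made-up corpus. The function name `smoothed_idf`, the corpus `toy_docs`, and the choice of the natural logarithm are assumptions for illustration; the formula itself does not fix the base of the logarithm.

```python
import math

def smoothed_idf(term, documents):
    """idf(t, D) = log(|D| / (1 + number of documents containing t))."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (1 + containing))

# Hypothetical toy corpus: each document is a set of tokens.
toy_docs = [{"red", "apple"}, {"red", "car"}, {"red", "sky"}, {"green", "apple"}]

# The rarer the term, the larger its idf.
for term in ("red", "apple", "car"):
    print(term, round(smoothed_idf(term, toy_docs), 3))
```

With this variant, a term that occurs in every document gets a slightly negative weight, because the denominator then exceeds $|D|$.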

Let's make a small corpus before implementing tf-idf.

In [62]:
# A small toy corpus of three already-tokenizable sentences
texts = ['This is a girl with a telescope .',
         'This is a boy with a pizza .',
         'Irrelevent sentence should tell a difference !']
def tokenizer(texts):
    tokenized_texts = []
    for text in texts:
        tokenized_texts.append(text.lower().split())
    return tokenized_texts
    
tokenized_texts = tokenizer(texts)
print(tokenized_texts)

[['this', 'is', 'a', 'girl', 'with', 'a', 'telescope', '.'], ['this', 'is', 'a', 'boy', 'with', 'a', 'pizza', '.'], ['irrelevent', 'sentence', 'should', 'tell', 'a', 'difference', '!']]


First, we count the document frequency of every word, i.e. the number of documents in which it appears.

In [63]:
def df_count(tokenized_texts):
    df = {}
    # Count each word once per document, no matter how often it occurs there
    for doc in tokenized_texts:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    
    return df

df = df_count(tokenized_texts)
print(df)

{'a': 3, 'this': 2, 'with': 2, '.': 2, 'is': 2, 'girl': 1, 'telescope': 1, 'pizza': 1, 'boy': 1, 'should': 1, 'tell': 1, 'irrelevent': 1, '!': 1, 'difference': 1, 'sentence': 1}


Then we write a function that computes the term frequency: the count of each word divided by the number of tokens in the document.

In [66]:
# Term frequency for a single document: count / number of tokens
def tf_count(tokenized_text):
    tf = {}
    total_no = len(tokenized_text)
    for token in tokenized_text:
        tf[token] = tf.get(token, 0) + 1
    for key in tf:
        tf[key] = tf[key] / total_no
    return tf

print(tf_count(tokenized_texts[0]))

{'this': 0.125, 'is': 0.125, 'a': 0.25, 'girl': 0.125, 'with': 0.125, 'telescope': 0.125, '.': 0.125}


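Then we can calculate the score. Note that the code below uses a base-10 logarithm and omits the $+1$ in the denominator of the idf formula, so a word that appears in every document (here `a`) gets a score of 0.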

In [70]:
import math

def get_score(tokenized_texts):
    tfidf = []
    df = df_count(tokenized_texts)
    total_docs = len(tokenized_texts)
    for doc in tokenized_texts:
        tf = tf_count(doc)
        score = []
        # One entry per vocabulary word, in the key order of df
        for word in df:
            if word in tf:
                idf = math.log10(total_docs / df[word])
                score.append(tf[word] * idf)
            else:
                # The word does not occur in this document
                score.append(0)
        tfidf.append(score)
    return tfidf

tf_idf = get_score(tokenized_texts)
print(tf_idf)

[[0.0, 0.022011407381960155, 0.022011407381960155, 0.022011407381960155, 0.022011407381960155, 0.059640156839957804, 0.059640156839957804, 0, 0, 0, 0, 0, 0, 0, 0], [0.0, 0.022011407381960155, 0.022011407381960155, 0.022011407381960155, 0.022011407381960155, 0, 0, 0.059640156839957804, 0.059640156839957804, 0, 0, 0, 0, 0, 0], [0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0.06816017924566606, 0.06816017924566606, 0.06816017924566606, 0.06816017924566606, 0.06816017924566606, 0.06816017924566606]]
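
The raw vectors are hard to read because the column order is implicit. As a rough sketch, the same scores can be kept in one dictionary per document so that the highest-weighted words are easy to inspect. The helpers `tfidf_by_word` and `top_terms` are hypothetical names introduced here, not part of the code above; they use the same unsmoothed base-10 idf.

```python
import math
from collections import Counter

def tfidf_by_word(tokenized_texts):
    """Return one {word: score} dict per document (unsmoothed idf, log base 10)."""
    n_docs = len(tokenized_texts)
    # Document frequency: number of documents containing each word
    df = Counter(word for doc in tokenized_texts for word in set(doc))
    scores = []
    for doc in tokenized_texts:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log10(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

def top_terms(word_scores, k=2):
    """Return the k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(word_scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

docs = [s.lower().split() for s in [
    'This is a girl with a telescope .',
    'This is a boy with a pizza .',
    'Irrelevent sentence should tell a difference !',
]]
for word_scores in tfidf_by_word(docs):
    print(top_terms(word_scores))
```

Words shared by several documents, such as `this` or `with`, rank below the words that occur in only one document, which is the weighting the idf factor is meant to produce.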


Now that we have document vectors, how can we make use of them?

# Cosine Similarity

Cosine similarity measures the angle between two n-dimensional vectors, irrespective of their magnitude.
It is very common to use tf-idf with cosine similarity.<br>
$\cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}$

In [71]:
import math

def cosSim(A, B):
    # Cosine similarity of A and B: (A dot B) / (||A|| * ||B||)
    sum_aa, sum_ab, sum_bb = 0, 0, 0
    for a, b in zip(A, B):
        sum_aa += a * a
        sum_ab += a * b
        sum_bb += b * b
    if sum_aa == 0 or sum_bb == 0:
        # An all-zero vector has no direction; report no similarity
        return 0.0
    return sum_ab / math.sqrt(sum_aa * sum_bb)

In [74]:

sim_score = cosSim(tf_idf[0], tf_idf[1])
print('similarity between 1st doc and 2nd doc:', sim_score)
sim_score = cosSim(tf_idf[0], tf_idf[2])
print('similarity between 1st doc and 3rd doc:', sim_score)

similarity between 1st doc and 2nd doc: 0.21409949120674793
similarity between 1st doc and 3rd doc: 0.0


We can see that docs 1 and 2 are more similar to each other than either is to doc 3.
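
Comparing documents one pair at a time gets tedious as the corpus grows. The following sketch computes all pairwise similarities at once. The names `cosine` and `similarity_matrix` and the example `vectors` are assumptions made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the formula defines.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Return the full matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical document vectors, for illustration only.
vectors = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as the first, twice the length
    [0.0, 0.0, 3.0],   # shares no non-zero dimension with the others
]
for row in similarity_matrix(vectors):
    print([round(value, 3) for value in row])
```

The first two vectors differ only in length, so their similarity is 1 up to floating-point rounding, while the third has no non-zero dimension in common with them and scores 0. Passing `tf_idf` from above instead of `vectors` should give the document-level matrix.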