# Similarity Functions

This notebook describes the similarity functions that can be used to measure the similarity between two sets.

First, we import the shingling functions and a few other helpers.

In [1]:
from shingle import *
from math import ceil, floor
import numpy as np

Next, we need to know how frequent each shingle is in the corpus. The frequencies have already been computed and stored in `data/portuguese/two_ends.txt`; here we use a Portuguese corpus.

We then build a dictionary called `frequencies` that maps each word to its frequency.

In [2]:
# Load the precomputed frequencies: each line holds a word and its count
frequencies = {}
with open("data/portuguese/two_ends.txt", "r") as text:
    for line in text:
        word = line.strip().split(' ')
        frequencies[word[0]] = float(word[1])
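
To make the expected file layout concrete, here is a minimal sketch that parses the same one-pair-per-line format from an in-memory string instead of the real file. The sample words and counts are made up for illustration, and the helper name `load_frequencies` is hypothetical (it is not part of the `shingle` module).

```python
import io

def load_frequencies(handle):
    """Parse lines of the form '<word> <count>' into a dict of floats."""
    table = {}
    for line in handle:
        parts = line.strip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = float(parts[1])
    return table

# Made-up sample data, not taken from the real corpus file
sample = io.StringIO("pi 120\nzz 45\nza 300\n")
sample_frequencies = load_frequencies(sample)
print(sample_frequencies)
```

Skipping malformed lines is a design choice of this sketch; the loading cell above assumes every line is well formed.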

## TF-IDF

TF-IDF (term frequency, inverse document frequency) scores the similarity between a query $q$ and a document $d$ as:
$$
tfidf(q, d) = \sum_{t \in q \cap d} f_{t, q} \cdot \log \frac{N + 1}{df_{t, d} + 0.5}
$$

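Here $f_{t, q}$ is the number of times term $t$ occurs in the query, $N$ is the number of documents in the corpus and $df_{t, d}$ is the document frequency of $t$. First, we define `tf`, which simply counts how often each word of the intersection occurs in the query.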

In [1]:
def tf(intersection, query):
    '''Counts how often each word of the intersection occurs in the query'''
    counts = [query.count(word) for word in intersection]
    return np.array(counts)
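
As an illustration of what `tf` returns, the sketch below computes the same counts with `collections.Counter`, which scans the query once instead of once per word. The name `tf_counter` and the toy shingle lists are hypothetical and only serve the example.

```python
from collections import Counter

import numpy as np

def tf_counter(intersection, query):
    """Term counts in the query for each word of the intersection."""
    counts = Counter(query)  # one pass over the query
    return np.array([counts[word] for word in intersection])

# "ab" occurs twice in the query, "bc" once
print(tf_counter(["ab", "bc"], ["ab", "bc", "ab"]))  # → [2 1]
```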

Next, we compute `idf`, the inverse document frequency. The document frequencies are looked up in the `frequencies` dictionary created earlier, and the logarithm is taken in base 10.

In [2]:
def idf(intersection, document, N):
    '''Computes the inverse document frequency (base-10 log) of each word'''
    df = np.array([frequencies[word] for word in intersection])
    return np.log10(np.divide(N + 1, df + 0.5))
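
To get a feel for the IDF weight, the following sketch evaluates the same expression for three made-up document frequencies. The helper `idf_weight` and the values in `toy_frequencies` are assumptions for illustration only; the point is that rare words receive a larger weight than common ones.

```python
import numpy as np

# Hypothetical document frequencies for three words
toy_frequencies = {"zz": 10.0, "pi": 1000.0, "za": 40000.0}
N = 50000

def idf_weight(df, n_docs):
    """Base-10 IDF weight with the same +1 / +0.5 smoothing as idf."""
    return np.log10((n_docs + 1) / (df + 0.5))

for word, df in toy_frequencies.items():
    print(word, round(float(idf_weight(df, N)), 3))
```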

Finally, we define the function `tf_idf`, which takes the dot product of the `tf` and `idf` arrays.

In [3]:
def tf_idf(query, document, N):
    intersection = [word for word in document if word in query] # intersection
    score = np.dot(tf(intersection, query), idf(intersection, document, N))
    return score
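
One detail of the list comprehension above is that it keeps one entry per matching position in the document, so a word that is repeated in the document contributes to the sum more than once. The sketch below shows this on made-up shingle lists, together with an order-preserving de-duplication. Whether each shared word should count only once is an assumption about the intended behaviour, not something the notebook states.

```python
# Made-up shingle lists; "zz" occurs twice in the document
query = ["pi", "iz", "zz", "za"]
document = ["pi", "iz", "zz", "zz", "za", "ab"]

# Same comprehension as in tf_idf: one entry per matching document position
intersection = [word for word in document if word in query]
print(intersection)  # → ['pi', 'iz', 'zz', 'zz', 'za']

# Order-preserving de-duplication, if each shared word should count once
unique_intersection = list(dict.fromkeys(intersection))
print(unique_intersection)  # → ['pi', 'iz', 'zz', 'za']
```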

In [11]:
query = two_ends("pizza", 2)
document = two_ends("pizza", 2)
tf_idf(query, document, 50000)

13.615376041936951

## BM25


$$
\mathrm{BM25}(q, d) = \sum_{t \in q \cap d} \frac{f_{t, q} \cdot (k_1 + 1.0)}{f_{t, q} + k_1 \cdot (1.0 - b + b \frac{|q|}{avgdl})} \cdot \mathrm{IDF}(t, d)
$$
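
To see what the fraction in this formula does, the sketch below evaluates it for a growing term count. The helper name `tf_factor` is hypothetical; the parameter values $k_1 = 1.2$, $b = 0.75$ and $avgdl = 8.3$ are the defaults used by `bm25` in this notebook, and the length is set equal to `avgdl` purely to keep the arithmetic simple.

```python
def tf_factor(f, k1=1.2, b=0.75, length=8.3, avgdl=8.3):
    """Term-frequency factor of BM25 for a raw count f."""
    return f * (k1 + 1.0) / (f + k1 * (1.0 - b + b * length / avgdl))

# The factor grows with f but levels off instead of growing linearly
values = [tf_factor(f) for f in (1, 2, 5, 50)]
for f, value in zip((1, 2, 5, 50), values):
    print(f, round(value, 3))
```

As the count grows, the factor approaches $k_1 + 1$ from below, so repeating a term many times yields diminishing returns.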

In [18]:
def bm25_tf(intersection, query, document, k1, b, avgdl, N):
    '''Computes the saturated, length-normalised term-frequency factor of BM25'''
    tf_ = tf(intersection, query)  # f_{t, q}, as in the formula above
    numerator = tf_ * (k1 + 1.0)
    denominator = tf_ + k1 * (1.0 - b + b * (len(query) / avgdl))
    return np.divide(numerator, denominator)

In [19]:
def bm25(query, document, k1 = 1.2, b = 0.75, avgdl = 8.3, N = 50000):
    intersection = [word for word in document if word in query] # intersection
    score = np.dot(bm25_tf(intersection, query, document, k1, b, avgdl, N), idf(intersection, document, N))
    return score
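
The default `avgdl = 8.3` is presumably the average number of shingles per entry in the corpus; that reading is an assumption, since the notebook does not say where the value comes from. Under that assumption, a minimal sketch of how such a value could be computed from a collection of shingled documents looks like this (the helper `average_length` and the toy collection are hypothetical):

```python
def average_length(documents):
    """Mean number of shingles per document in a collection."""
    return sum(len(doc) for doc in documents) / len(documents)

# Made-up collection of shingled documents
toy_collection = [["pi", "iz"], ["pi", "iz", "zz", "za"]]
print(average_length(toy_collection))  # → 3.0
```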

In [24]:
query = two_ends("pizza", 2)
document = two_ends("pizza", 2)
bm25(query, document)

15.356193114624382

## Dirichlet

$$
\mathrm{Dir}(q, d) = \sum_{t \in q \cap d} c(t, q) \log \Bigg(1 + \frac{c(t, d)}{\mu \cdot p(t | C)}\Bigg) + |q| \log \frac{\mu}{\mu + |d|}
$$
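
The two parts of this formula can be examined separately. The sketch below is a simplified, self-contained illustration with hypothetical helper names (`dirichlet_term`, `length_penalty`) and made-up probabilities. It uses a base-10 logarithm, matching the code cells in this notebook.

```python
import math

def dirichlet_term(c_tq, c_td, p_tc, mu=100.0):
    """Contribution of one shared term: c(t,q) * log(1 + c(t,d) / (mu * p(t|C)))."""
    return c_tq * math.log10(1.0 + c_td / (mu * p_tc))

def length_penalty(q_len, d_len, mu=100.0):
    """Length term: |q| * log(mu / (mu + |d|)), which is never positive."""
    return q_len * math.log10(mu / (mu + d_len))

# A term that is rare in the collection contributes more than a common one
print(dirichlet_term(1, 1, p_tc=0.001), dirichlet_term(1, 1, p_tc=0.1))

# Longer documents receive a larger (more negative) penalty
print(length_penalty(5, 10), length_penalty(5, 100))
```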

In [32]:
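shingles = 470751  # normaliser: frequencies[word] / shingles plays the role of p(t | C)
def smooth(intersection, document, mu):
    '''Computes log10(1 + c(t, d) / (mu * p(t | C))) for each shared word'''
    smoothed = []
    for word in intersection:
        prob = 1.0 + np.divide(document.count(word), mu * frequencies[word] / shingles)
        smoothed.append(np.log10(prob))
    return np.array(smoothed)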

In [40]:
def dirichlet(query, document, mu = 100.0):
    intersection = [word for word in document if word in query] # intersection
    # Length term of the formula: |q| * log(mu / (mu + |d|))
    add = len(query) * np.log10(np.divide(mu, mu + len(document)))
    score = np.dot(tf(intersection, query), smooth(intersection, document, mu)) + add
    return score

In [41]:
query = two_ends("pizzzza", 2)
document = two_ends("pizzza", 2)
print(dirichlet(query, document))

11.15631567880404
