## New Similarity Metrics for Probability Distributions and Bags of Words

A small tutorial to illustrate the new similarity functions.

These functions are mostly useful for comparing how similar two probability distributions are. In gensim, that usually means comparing the LSI or LDA topic distributions of documents once we have a trained model.

Gensim already has functionality for retrieving the most similar documents - [this](http://radimrehurek.com/topic_modeling_tutorial/3%20-%20Indexing%20and%20Retrieval.html), [this](https://radimrehurek.com/gensim/tut3.html) and [this](https://radimrehurek.com/gensim/similarities/docsim.html) are examples of the relevant documentation and tutorials.

This tutorial covers a building block of those larger methods: a small suite of similarity metrics.
We'll start by setting up a small corpus and then try out each metric.

In [1]:
from gensim.corpora import Dictionary
from gensim.models import ldamodel
from gensim.matutils import kullback_leibler, jaccard, hellinger
import numpy

In [2]:
# you can use any corpus, this one is just illustrative

texts = [['bank','river','shore','water'],
        ['river','water','flow','fast','tree'],
        ['bank','water','fall','flow'],
        ['bank','bank','water','rain','river'],
        ['river','water','mud','tree'],
        ['money','transaction','bank','finance'],
        ['bank','borrow','money'], 
        ['bank','finance'],
        ['finance','money','sell','bank'],
        ['borrow','sell'],
        ['bank','loan','sell']]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [3]:
numpy.random.seed(1) # setting random seed to get the same results each time.
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)

model.show_topics()

[(0,
  u'0.164*bank + 0.142*water + 0.108*river + 0.076*flow + 0.067*borrow + 0.063*sell + 0.060*tree + 0.048*money + 0.046*fast + 0.044*rain'),
 (1,
  u'0.196*bank + 0.120*finance + 0.100*money + 0.082*sell + 0.067*river + 0.065*water + 0.056*transaction + 0.049*loan + 0.046*tree + 0.040*mud')]

Let's take a few sample documents and get them ready for testing the similarity metrics.

In [4]:
doc_1 = ['river', 'water', 'shore']
doc_2 = ['bank', 'money', 'sell']
doc_3 = ['money', 'finance', 'tree', 'water']

# now let's make these into a bag of words format

bow_1 = model.id2word.doc2bow(doc_1)   
bow_2 = model.id2word.doc2bow(doc_2)   
bow_3 = model.id2word.doc2bow(doc_3)   

# we can now get the LDA topic distributions for these
lda_bow_1 = model[bow_1]
lda_bow_2 = model[bow_2]
lda_bow_3 = model[bow_3]
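
Indexing an LDA model with a bag of words returns a sparse list of `(topic_id, probability)` pairs, and topics with very small probability may be left out. The sketch below shows one way to expand such a list into a dense vector, which can make the distributions easier to inspect by hand. The helper name `to_dense` and the sample values are hypothetical and are not taken from the model trained above; gensim itself offers `matutils.sparse2full` for the same purpose.

```python
def to_dense(sparse_vec, length):
    """Expand a list of (index, value) pairs into a dense list of floats.

    Indices missing from sparse_vec are filled with 0.0.
    """
    dense = [0.0] * length
    for index, value in sparse_vec:
        dense[index] = float(value)
    return dense


# Hypothetical topic distribution for a 2-topic model (not real model output).
example_topics = [(0, 0.83), (1, 0.17)]
print(to_dense(example_topics, 2))

# A topic that was dropped from the sparse output simply becomes 0.0.
print(to_dense([(1, 1.0)], 3))
```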

## Hellinger and Kullback–Leibler

We're now ready to apply our similarity metrics.

Let's start with the popular Hellinger distance. 
The Hellinger distance metric gives an output in the range [0,1] for two probability distributions, with values closer to 0 meaning they are more similar.

In [5]:
hellinger(lda_bow_1, lda_bow_2)

0.49050745261365591

In [6]:
hellinger(lda_bow_2, lda_bow_3)

0.11733372841789531

Makes sense, right? In the first example, Document 1 and Document 2 have little in common, so we get a distance of roughly 0.5.

In the second case, Document 2 and Document 3 are semantically much more similar. Their topic distributions under the trained model reflect this: the distance is much closer to 0, which means a higher similarity.
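
To make the numbers above less opaque, here is a minimal sketch of the Hellinger distance for two dense discrete distributions, using the textbook definition: the Euclidean distance between the element-wise square roots, divided by the square root of 2. It is an illustration only, not gensim's implementation. The function name `hellinger_dense` and the example distributions are assumptions made up for this sketch.

```python
import math


def hellinger_dense(p, q):
    """Hellinger distance between two dense discrete distributions.

    Both inputs are sequences of non-negative probabilities of equal length.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Sum of squared differences between the square roots of the probabilities.
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Hypothetical 2-topic distributions (not the model output shown above).
mostly_topic_0 = [0.9, 0.1]
mostly_topic_1 = [0.1, 0.9]

print(hellinger_dense(mostly_topic_0, mostly_topic_0))  # identical inputs give 0.0
print(hellinger_dense(mostly_topic_0, mostly_topic_1))  # very different inputs give a larger value
```

Because the result is bounded by 1, two distributions with no overlapping mass (for example `[1.0, 0.0]` and `[0.0, 1.0]`) sit at the maximum distance.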

Let's run similar examples with the Kullback–Leibler divergence.

In [7]:
kullback_leibler(lda_bow_1, lda_bow_3)

0.55465853

In [8]:
kullback_leibler(lda_bow_2, lda_bow_3)

0.051609553

Again, values closer to 0 indicate that the two distributions are more similar.
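
One property worth keeping in mind is that the Kullback–Leibler divergence is not symmetric: swapping the arguments generally changes the result, whereas the Hellinger distance stays the same. The sketch below illustrates this with the standard definition for dense discrete distributions, using the natural logarithm. It is not gensim's implementation; the function name `kl_divergence` and the example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two dense discrete distributions, in nats.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical 2-topic distributions (not the model output shown above).
skewed = [0.9, 0.1]
uniform = [0.5, 0.5]

forward = kl_divergence(skewed, uniform)
backward = kl_divergence(uniform, skewed)

print(forward, backward)   # the two directions differ
print(kl_divergence(skewed, skewed))  # identical inputs give 0.0
```

If a symmetric measure is needed, Hellinger is the simpler choice among the metrics shown here.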

As the examples above suggest, these two metrics are mainly meant for probability distributions, but the methods accept several input formats. You can do some further reading on [Kullback Leibler](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) and [Hellinger](https://en.wikipedia.org/wiki/Hellinger_distance) to figure out which one suits your needs.

## Jaccard 

Let us now look at the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) metric for similarity between bags of words (i.e., documents).

In [9]:
jaccard(bow_1, bow_3)

0.14285714285714285

In [10]:
jaccard(doc_1, doc_3)

0.16666666666666666

In [11]:
jaccard(['word'], ['word'])

1.0

The three examples above use two different input formats.

In the first case, we pass jaccard document vectors that are already in bag-of-words format. The similarity is defined as the size of the intersection divided by the size of the union of the vectors. Unlike the previous metrics we saw, a higher return value signifies a higher similarity.

Manual inspection suggests that the similarity should not be very high, and indeed it isn't.

The last two examples show that jaccard also accepts plain lists of tokens (i.e., documents) as input.
In the last case, because the two inputs are identical, the value returned is 1, the highest possible similarity.
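
The intersection-over-union definition is easy to check by hand for the list inputs. The following is a minimal set-based sketch, not gensim's implementation. The function name `jaccard_sets` is hypothetical, and returning 1.0 for two empty documents is a convention assumed here. Since it ignores word counts, it only mirrors the list-input case; the bag-of-words call handles weights and may return a slightly different value.

```python
def jaccard_sets(doc_a, doc_b):
    """Jaccard similarity between two token lists, treated as sets."""
    set_a, set_b = set(doc_a), set(doc_b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty documents count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


doc_1 = ['river', 'water', 'shore']
doc_3 = ['money', 'finance', 'tree', 'water']

# One shared token ('water') out of six distinct tokens.
print(jaccard_sets(doc_1, doc_3))

# Identical documents give the maximum similarity.
print(jaccard_sets(['word'], ['word']))
```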

## Conclusion

That brings us to the end of this small tutorial.
There is plenty of scope for adding new similarity metrics, since a much larger suite of metrics and methods exists that could be added to the matutils.py file. ([This](http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf) is one paper that discusses some of them.)

Looking forward to more PRs towards this functionality in Gensim! :)