Add other distance measures to Similarity #64

piskvorky · 2011-11-04T18:16:02Z

Currently, Similarity works purely over cosine similarity (~the angle between query and indexed document).

Make this more general, using e.g. Hellinger distance for models that represent the documents as probability distributions.

At the same time, try to still keep things computationally efficient (using BLAS & mmap etc.).

piskvorky · 2012-01-08T13:41:27Z

Similarity Measures for Text Document Clustering by Anne Huang, 2008.
http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

cmooony · 2015-10-23T16:51:55Z

I am trying to add another similarity function to gensim.docsim, but when I search for cossim in the source files, I get only two results: cossim in matutils.py and matutils.cossim in test_lee.py.
So I am wondering how exactly gensim calculates the similarity of documents, and how I can add my own similarity function to gensim.
Thanks, and looking forward to your advice!

piskvorky · 2015-10-23T23:49:08Z

The classes in docsim only support dot product + cossim.

You can add another similarity function by simply writing it and using it; or what sort of API do you need?

Or did you expect the Similarity class to accept arbitrary sim functions as input?

cmooony · 2015-10-24T01:47:26Z

Ahh, thanks. I got it. It is computed as a dot product: in MatrixSimilarity it is result = numpy.dot(self.index, query.T).T, and in SparseMatrixSimilarity it is result = self.index * query.tocsc()
As you mentioned in #69, I think it is better for 'humans' to choose when we need another method predefined in gensim.matutils.
Thanks again for your nice work!
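
A plain dot product can stand in for cosine similarity because the two coincide once both vectors have unit length. The following is a minimal sketch in plain Python, assuming dense vectors stored as lists of floats and assuming the indexed documents and the query are normalized beforehand; the helper names `dot`, `unit`, and `cosine` are illustrative and not part of gensim.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(a, a))
    return list(a) if norm == 0.0 else [x / norm for x in a]

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 0.0 if denom == 0.0 else dot(a, b) / denom

doc = [1.0, 2.0, 0.0]
query = [2.0, 1.0, 1.0]

# Normalizing once up front lets every later comparison be a cheap dot product.
via_dot = dot(unit(doc), unit(query))
via_cosine = cosine(doc, query)
```

Here `via_dot` and `via_cosine` agree up to floating-point rounding, which is why an index of pre-normalized vectors only needs a matrix product at query time.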

piskvorky · 2015-10-24T03:06:41Z

Well, we can support other ways too; it was an honest question.

I'm just not sure what kind of functionality people expect, or how to structure the API. I'm not a fan of engineering new functionality without clear use cases :)

cschwem2er · 2016-02-06T11:33:50Z

Is it reasonable to use cosine similarity for gensim LDA models? Or should Hellinger distance be strongly preferred? If so, I would love to see support for it. :)

piskvorky · 2016-02-25T05:54:39Z

@methodds looking forward to the PR!

The code for Hellinger distance is really simple; see for example here:
http://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model#answer-22756647

bhargavvader · 2016-03-22T07:48:49Z

Hello, is there any particular reason the cossim method is not used in docsim, and a simple dot product is used instead? Is it because the cossim method in matutils is only for sparse vectors?

I looked at your Stack Overflow answer and wrote a method for Hellinger, but it's very simple and I'm not sure if I'm going in the right direction.

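import numpy
from gensim.matutils import sparse2full

def hellinger(vec1, vec2, num_topics):
    # convert gensim sparse vectors (lists of (id, weight)) to dense arrays
    dense1 = sparse2full(vec1, num_topics)
    dense2 = sparse2full(vec2, num_topics)
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2))**2).sum())
    return sim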

If I am, I can write up some test cases and submit a pull request. There are also some more generic implementations of Hellinger here - https://gist.github.com/larsmans/3116927

[edit: I imagine what I wrote in a hurry only works for an LDA distribution and is not very generic, either]

piskvorky · 2016-03-22T09:05:20Z

@bhargavvader: Your version looks more generic than the one in the gist -- not sure what you mean there. It could be made even more generic if you accept more formats on input: dense (numpy), sparse (scipy.sparse), gensim vector (sequence of (id, weight)).

Ping @tmylk -- good intro test, adding tiny sim metric functions like this into matutils.

bhargavvader · 2016-03-22T19:00:38Z

Hello, could you please have a look at this and tell me if I'm heading the right way?

import numpy
from scipy.sparse import issparse
import gensim
from gensim import matutils

def hellinger(vec1, vec2, lda=None):
    if lda is None:
        if issparse(vec1) and issparse(vec2):
            # scipy.sparse matrices: densify, then apply the formula
            dense1 = vec1.todense()
            dense2 = vec2.todense()
            sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2))**2).sum())
            return sim
        elif isinstance(vec1, numpy.ndarray) and isinstance(vec2, numpy.ndarray):
            sim = numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
            return sim
        elif type(vec1) is list:
            # gensim vectors: sequences of (id, weight)
            vec1, vec2 = dict(vec1), dict(vec2)
            # square each difference and cover every id in either vector; a missing id counts as 0.0
            indices = set(vec1) | set(vec2)
            sim = numpy.sqrt(0.5 * sum((numpy.sqrt(vec1.get(index, 0.0)) - numpy.sqrt(vec2.get(index, 0.0)))**2 for index in indices))
            return sim
    elif isinstance(lda, gensim.models.ldamodel.LdaModel):
        # LDA topic distributions: expand to dense vectors of length num_topics
        dense1 = matutils.sparse2full(vec1, lda.num_topics)
        dense2 = matutils.sparse2full(vec2, lda.num_topics)
        sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2))**2).sum())
        return sim
    else:
        # return some error
        raise ValueError("lda must be None or an LdaModel instance")

This works fine for LDA distribution vectors, for bag-of-words representations (I used an implementation similar to the one in your cossim method; does it look ok?), and for matrices that have the same dimensions. If this is an ok way to go ahead, I'll try to fix the problem with dense and sparse matrices of different dimensions.
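
For the `(id, weight)` list format, one point worth checking is that every id present in either vector contributes a squared difference to the sum, since an id missing from one vector still has weight 0 there. Below is a small sketch in plain Python, assuming both inputs are probability distributions (non-negative weights summing to 1); the name `hellinger_sparse` is hypothetical and not a gensim function.

```python
import math

def hellinger_sparse(vec1, vec2):
    """Hellinger distance between two sparse vectors given as (id, weight) pairs.

    Assumes each vector holds non-negative weights that sum to 1.
    """
    d1, d2 = dict(vec1), dict(vec2)
    # Every id present in either vector contributes; a missing id has weight 0.
    ids = set(d1) | set(d2)
    total = sum(
        (math.sqrt(d1.get(i, 0.0)) - math.sqrt(d2.get(i, 0.0))) ** 2
        for i in ids
    )
    return math.sqrt(0.5 * total)

# Identical distributions, fully disjoint ones, and a partial overlap.
same = hellinger_sparse([(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)])
disjoint = hellinger_sparse([(0, 1.0)], [(1, 1.0)])
partial = hellinger_sparse([(0, 1.0)], [(0, 0.5), (1, 0.5)])
```

Under these assumptions the distance is 0 for identical distributions and 1 for distributions with no shared ids. In the `partial` case, iterating over the shorter vector alone would skip id 1 and underestimate the distance.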

bhargavvader · 2016-03-23T08:05:19Z

If this approach is fine, I could also add something simple like Jaccard Coefficient for gensim vectors (bag of words).
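
As an illustration of how small such a function can be, here is a sketch of a set-based Jaccard distance over bag-of-words vectors. It assumes the `(id, count)` list format and deliberately ignores the counts; `jaccard_distance` is a hypothetical name, and a weighted variant would be an equally valid design.

```python
def jaccard_distance(vec1, vec2):
    """Jaccard distance between the id sets of two bag-of-words vectors.

    Weights are ignored here; only which ids occur matters.
    """
    ids1 = {i for i, _ in vec1}
    ids2 = {i for i, _ in vec2}
    union = ids1 | ids2
    if not union:
        # Convention chosen for this sketch: two empty documents are identical.
        return 0.0
    return 1.0 - len(ids1 & ids2) / len(union)

doc_a = [(0, 2), (1, 1), (3, 1)]
doc_b = [(1, 4), (3, 1), (5, 2)]
distance = jaccard_distance(doc_a, doc_b)
```

The two example documents share ids 1 and 3 out of four distinct ids overall, so the distance comes out at 0.5.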

bhargavvader · 2016-04-03T09:30:38Z

@tmylk, should I go ahead and submit a PR for this? Also, any suggestions on dealing with matrices of different dimensions?

bhargavvader · 2016-08-25T15:23:49Z

@piskvorky what are the steps ahead for this?

piskvorky · 2016-08-26T01:32:41Z

Deferring to @tmylk .

tmylk · 2016-09-18T13:48:28Z

Implemented in #656

jesusepfvazquez · 2018-06-05T14:42:18Z

I wanted to know if there was a way for the SIMILARITY class to accept other similarity functions besides the cosine similarity. For example, I want to write my own similarity function and substitute this function for the cosine similarity function in the gensim package.

Any help would be greatly appreciated. @piskvorky

#64 (comment)

menshikh-iv · 2018-07-30T13:37:01Z

@jvazquez2, unfortunately there is no way to do that. You can implement your own class, using for example https://github.com/RaRe-Technologies/gensim/blob/2ce4699e048a4bb02be06b0412a42da9bd7fbdfe/gensim/similarities/docsim.py#L722
as a base class, or simply calculate everything manually.
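
The "calculate manually" route might look roughly like the sketch below: loop over the corpus and score each document with whatever function you choose. It assumes documents are sparse `(id, weight)` probability vectors; `rank_by_similarity` and `hellinger_similarity` are hypothetical helpers written for this example, not gensim APIs.

```python
import math

def hellinger_similarity(vec1, vec2):
    """1 minus the Hellinger distance, so that larger means more similar."""
    d1, d2 = dict(vec1), dict(vec2)
    total = sum(
        (math.sqrt(d1.get(i, 0.0)) - math.sqrt(d2.get(i, 0.0))) ** 2
        for i in set(d1) | set(d2)
    )
    return 1.0 - math.sqrt(0.5 * total)

def rank_by_similarity(query, corpus, sim_fn):
    """Score `query` against every document in `corpus` with `sim_fn`.

    Returns (doc_index, score) pairs sorted from most to least similar.
    """
    scores = [(idx, sim_fn(query, doc)) for idx, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

corpus = [
    [(0, 1.0)],            # all mass on topic 0
    [(0, 0.5), (1, 0.5)],  # split between topics 0 and 1
    [(1, 1.0)],            # all mass on topic 1
]
query = [(0, 1.0)]
ranking = rank_by_similarity(query, corpus, hellinger_similarity)
```

Any callable taking two vectors can be passed as `sim_fn`. The trade-off is that this loop runs in pure Python, one document at a time, so it gives up the BLAS-backed matrix product that the built-in index classes rely on.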

piskvorky mentioned this issue Nov 29, 2011

make normalization a transformation #69

Closed

tmylk added the "difficulty easy" label (Easy issue: required small fix) Jan 23, 2016

piskvorky assigned tmylk Mar 23, 2016

bhargavvader mentioned this issue Apr 4, 2016

[MRG] Adding Distance metrics to matutils + Tutorial #656

Merged

tmylk closed this as completed Sep 18, 2016

jesusepfvazquez unassigned tmylk Jun 5, 2018
