Add other distance measures to Similarity #64

Closed
piskvorky opened this issue Nov 4, 2011 · 17 comments
Labels
difficulty easy, wishlist

Comments

@piskvorky
Owner

Currently, Similarity works purely over cosine similarity (~the angle between query and indexed document).

Make this more general, using e.g. Hellinger distance for models that represent the documents as probability distributions.

At the same time, try to still keep things computationally efficient (using BLAS & mmap etc.).
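
One way to reconcile the two goals above, offered as a sketch under stated assumptions rather than as gensim's implementation: when every vector is a probability distribution (non-negative, summing to 1), the squared Hellinger distance equals 1 minus the dot product of the element-wise square roots. An index could therefore store the square roots once and answer a query with a single matrix-vector product, which is exactly the kind of operation BLAS accelerates. The names `hellinger_index` and `hellinger_query` are hypothetical and do not exist in gensim.

```python
import numpy as np

def hellinger_index(docs):
    """Precompute element-wise square roots of a (n_docs, n_topics) array.

    Assumption: every row is a probability distribution
    (non-negative entries that sum to 1).
    """
    return np.sqrt(np.asarray(docs, dtype=np.float64))

def hellinger_query(index, query):
    """Return the Hellinger distance from `query` to every indexed row.

    Uses H(p, q)^2 = 1 - sum_i sqrt(p_i * q_i), which holds when p and q
    each sum to 1, so the whole batch is one matrix-vector product.
    """
    bc = index @ np.sqrt(np.asarray(query, dtype=np.float64))
    # Clip to guard against tiny negative values caused by rounding.
    return np.sqrt(np.clip(1.0 - bc, 0.0, None))

docs = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.2, 0.3, 0.5]])
index = hellinger_index(docs)
distances = hellinger_query(index, [0.5, 0.5, 0.0])
print(distances)
```

The identity only holds for normalized distributions; for unnormalized vectors the full formula with the squared differences is needed.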

@piskvorky
Owner Author

Similarity Measures for Text Document Clustering by Anne Huang, 2008.
http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

@cmooony

cmooony commented Oct 23, 2015

I am trying to add another similarity function to gensim.docsim, but when I search for cossim in the source files, I get only two results: cossim in matutils.py and matutils.cossim in test_lee.py.
So I am wondering how exactly gensim calculates the similarity of documents, and how I can add my own similarity function to gensim.
Thanks and looking forward to your advice!

@piskvorky
Owner Author

The classes in docsim only support dot product + cossim.

You can add another similarity function by simply writing it and using it; or what sort of API do you need?

Or did you expect the Similarity class to accept arbitrary sim functions as input?

@cmooony

cmooony commented Oct 24, 2015

Ahh, thanks. I got it. It is computed as a dot product: in MatrixSimilarity, result = numpy.dot(self.index, query.T).T, and in SparseMatrixSimilarity, result = self.index * query.tocsc().
As you mentioned in #69, I think it is better to let 'humans' choose another method predefined in gensim.matutils when they need one.
Thanks again for your nice work!
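
The two expressions quoted above are plain dot products; a dot product equals cosine similarity only when both the indexed vectors and the query have been scaled to unit length beforehand. The following is a minimal NumPy-only sketch of that relationship, not gensim code, and the helper `unit_rows` is a hypothetical name introduced for illustration.

```python
import numpy as np

def unit_rows(matrix):
    """Scale each row to unit L2 length; all-zero rows are left as zeros."""
    matrix = np.asarray(matrix, dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    return matrix / norms

# Two indexed documents and one query, all normalized up front.
index = unit_rows([[1.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0]])
query = unit_rows([[2.0, 4.0, 0.0]])

# Same form as the dense expression quoted above: one row of scores per query.
result = np.dot(index, query.T).T
print(result)
```

Because the query points in the same direction as the first document and is orthogonal to the second, the scores come out as 1 and 0 up to rounding.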

@piskvorky
Owner Author

Well, we can support other ways too; it was an honest question.

I'm just not sure what kind of functionality people expect or how to structure the API. I'm not a fan of engineering new functionality without clear use cases :)

@tmylk added the "difficulty easy" label on Jan 23, 2016

@cschwem2er

Is it reasonable to use cosine similarity for gensim LDA models? Or is Hellinger distance strongly preferred? If so, I would love to see support for it. :)

@piskvorky
Owner Author

@methodds looking forward to the PR!

The code for Hellinger distance is really simple; see for example here:
http://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model#answer-22756647

@bhargavvader
Contributor

Hello, is there any particular reason the cossim method is not used in docsim, and a simple dot product is used instead? Is it because the cossim method in matutils is only for sparse vectors?

I looked at your Stack Overflow answer and wrote a method for Hellinger, but it's very simple and I'm not sure if I'm going in the right direction.

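import numpy
from gensim.matutils import sparse2full

def hellinger(vec1, vec2, num_topics):
    # Expand gensim sparse vectors [(topic_id, weight), ...] to dense arrays.
    dense1 = sparse2full(vec1, num_topics)
    dense2 = sparse2full(vec2, num_topics)
    # Hellinger distance: 0 for identical distributions, 1 for disjoint ones.
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2))**2).sum())
    return sim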

If I am, I can write up some test cases and submit a pull request. There are also some more generic implementations of Hellinger here - https://gist.github.com/larsmans/3116927

[edit: I imagine what I wrote in a hurry is only for an LDA distribution and is not very generic, as well]

@piskvorky
Owner Author

@bhargavvader: Your version looks more generic than the one in the gist -- not sure what you mean there. It could be made even more generic if you accept more formats on input: dense (numpy), sparse (scipy.sparse), gensim vector (sequence of (id, weight)).

Ping @tmylk -- good intro test, adding tiny sim metric functions like this into matutils.

@bhargavvader
Contributor

Hello, could you please have a look at this and tell me if I'm heading the right way?

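import numpy
from scipy.sparse import issparse
from gensim import matutils
from gensim.models.ldamodel import LdaModel

def hellinger(vec1, vec2, lda=None):
    if lda is None:
        if issparse(vec1) and issparse(vec2):
            # scipy.sparse inputs: convert to dense arrays (shapes must match)
            dense1 = vec1.toarray()
            dense2 = vec2.toarray()
            sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2))**2).sum())
            return sim
        elif isinstance(vec1, numpy.ndarray) and isinstance(vec2, numpy.ndarray):
            # dense numpy inputs (shapes must match)
            sim = numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
            return sim
        elif isinstance(vec1, list) and isinstance(vec2, list):
            # gensim vectors: lists of (id, weight) pairs
            vec1, vec2 = dict(vec1), dict(vec2)
            # Unlike a dot product, ids present in only one vector still contribute,
            # so iterate over the union of ids and square each difference.
            indices = set(vec1) | set(vec2)
            sim = numpy.sqrt(0.5 * sum((numpy.sqrt(vec1.get(index, 0.0)) - numpy.sqrt(vec2.get(index, 0.0)))**2 for index in indices))
            return sim
        else:
            raise TypeError("vec1 and vec2 must both be scipy.sparse, numpy.ndarray, or lists of (id, weight)")
    elif isinstance(lda, LdaModel):
        # sparse LDA topic vectors: expand to dense vectors of length num_topics
        dense1 = matutils.sparse2full(vec1, lda.num_topics)
        dense2 = matutils.sparse2full(vec2, lda.num_topics)
        sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2))**2).sum())
        return sim
    else:
        raise TypeError("lda must be None or an LdaModel instance")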

This works fine for LDA distribution vectors, for bag-of-words representations (I used an implementation similar to the one in your cossim method; does it look OK?), and for matrices that have the same dimensions. If this is an OK way of going ahead, I'll try to fix the problem with dense and sparse matrices of different dimensions.

@bhargavvader
Contributor

If this approach is fine, I could also add something simple like Jaccard Coefficient for gensim vectors (bag of words).
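
For readers unfamiliar with the measure, here is a minimal sketch of a Jaccard coefficient over gensim-style bag-of-words vectors. It is an illustration under assumptions, not the code that was eventually merged: it ignores the weights and compares only the sets of token ids, and the function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(vec1, vec2):
    """Jaccard coefficient between two bag-of-words vectors.

    Each vector is a sequence of (token_id, weight) pairs. Weights are
    ignored; only the sets of token ids with non-zero weight are compared.
    """
    ids1 = {token_id for token_id, weight in vec1 if weight}
    ids2 = {token_id for token_id, weight in vec2 if weight}
    union = ids1 | ids2
    if not union:
        # Convention chosen for this sketch: two empty documents are identical.
        return 1.0
    return len(ids1 & ids2) / len(union)

doc_a = [(0, 1.0), (1, 2.0), (5, 1.0)]
doc_b = [(1, 1.0), (5, 3.0), (7, 1.0)]
# Ids 1 and 5 are shared; ids 0, 1, 5, 7 are distinct overall.
print(jaccard_similarity(doc_a, doc_b))  # 0.5
```

A weighted variant (sum of minimum weights over sum of maximum weights) would be the natural extension if term frequencies should count.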

@bhargavvader
Contributor

@tmylk, should I go ahead and submit a PR for this? And any suggestions on dealing with matrices of different dimensions?

@bhargavvader
Contributor

@piskvorky what are the steps ahead for this?

@piskvorky
Owner Author

Deferring to @tmylk .

@tmylk
Contributor

tmylk commented Sep 18, 2016

Implemented in #656

@tmylk closed this as completed on Sep 18, 2016

@jesusepfvazquez

I wanted to know if there is a way for the Similarity class to accept similarity functions other than cosine similarity. For example, I want to write my own similarity function and substitute it for the cosine similarity function in the gensim package.

Any help would be greatly appreciated. @piskvorky

#64 (comment)

@menshikh-iv
Contributor

menshikh-iv commented Jul 30, 2018

@jvazquez2, unfortunately there is no way to do that. You can implement your own class, using for example https://github.com/RaRe-Technologies/gensim/blob/2ce4699e048a4bb02be06b0412a42da9bd7fbdfe/gensim/similarities/docsim.py#L722
as the base class, or simply calculate everything manually.
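
As a rough illustration of the "calculate manually" route, the sketch below is a standalone brute-force index that scores a query with any callable. It deliberately does not subclass anything from gensim, so no gensim internals are assumed; the class `FunctionSimilarity` and the function `hellinger_similarity` are hypothetical names, and the corpus is assumed to consist of equal-length dense vectors.

```python
import numpy as np

class FunctionSimilarity:
    """Hypothetical brute-force index that scores a query with any callable."""

    def __init__(self, corpus, sim_func):
        # corpus: iterable of equal-length dense vectors
        self.index = np.asarray(list(corpus), dtype=np.float64)
        # sim_func: callable taking (document_vector, query_vector) -> float
        self.sim_func = sim_func

    def __getitem__(self, query):
        # Score the query against every indexed document, one at a time.
        query = np.asarray(query, dtype=np.float64)
        return np.array([self.sim_func(doc, query) for doc in self.index])

def hellinger_similarity(p, q):
    """1 - Hellinger distance, so that larger values mean more similar."""
    return 1.0 - np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

index = FunctionSimilarity([[0.5, 0.5, 0.0],
                            [0.0, 0.0, 1.0]], hellinger_similarity)
sims = index[[0.5, 0.5, 0.0]]
print(sims)
```

The per-document Python loop gives up the BLAS speed of a single matrix product, which is the price of accepting an arbitrary function.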


7 participants