Add other distance measures to Similarity #64
Reference: *Similarity Measures for Text Document Clustering*, Anna Huang, 2008.
I am trying to add another similarity function to gensim.docsim, but when I search for cossim in the source files, I only get two results: cossim in matutils.py and matutils.cossim in test_lee.py.
The classes in docsim only support dot product + cossim. You can add another similarity function by simply writing it and using it; or what sort of API do you need? Or did you expect the
Ah, thanks, I get it. It is computed as a dot product in MatrixSimilarity.
Well, we can support other ways too; it was an honest question. I'm just not sure what kind of functionality people expect, or how to structure the API. I'm not a fan of engineering new functionality without clear use cases :)
Is it a reasonable approach to use cosine similarity for gensim LDA models? Or should Hellinger distance be strongly preferred? If so, I would love to see support for it. :)
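For context on the question above: LDA represents each document as a probability distribution over topics, and Hellinger distance is a proper bounded metric on such distributions, while cosine similarity ignores that the vectors live on the probability simplex. A minimal illustrative sketch (not gensim code) comparing the two on toy topic distributions:

```python
import numpy

def cossim(p, q):
    """Cosine similarity between two dense vectors."""
    return p.dot(q) / (numpy.linalg.norm(p) * numpy.linalg.norm(q))

def hellinger(p, q):
    """Hellinger distance between two probability distributions (in [0, 1])."""
    return numpy.sqrt(0.5 * ((numpy.sqrt(p) - numpy.sqrt(q)) ** 2).sum())

# two documents dominated by opposite topics
a = numpy.array([0.9, 0.1])
b = numpy.array([0.1, 0.9])
# cosine still reports moderate similarity (~0.22), while Hellinger
# places them far apart (~0.63 on a 0..1 scale)
```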
@methodds looking forward to the PR! The code for Hellinger distance is really simple; see for example here: |
Hello, is there any particular reason the
I looked at your stackoverflow answer and wrote a method for Hellinger, but it's very simple and I'm not sure if I'm going in the right direction:

```python
import numpy
from gensim.matutils import sparse2full

def hellinger(vec1, vec2, num_topics):
    # convert sparse gensim topic vectors to dense numpy arrays
    dense1 = sparse2full(vec1, num_topics)
    dense2 = sparse2full(vec2, num_topics)
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
    return sim
```

If I am, I can write up some test cases and submit a pull request. There are also some more generic implementations of Hellinger here: https://gist.github.com/larsmans/3116927 [edit: I imagine what I wrote in a hurry is only for an LDA distribution and is not very generic, as well]
@bhargavvader: Your version looks more generic than the one in the gist -- not sure what you mean there. It could be made even more generic if you accept more formats on input: dense (numpy), sparse (scipy.sparse), gensim vector (sequence of
Ping @tmylk -- good intro test, adding tiny sim metric functions like this into
Hello, could you please have a look at this and tell me if I'm heading the right way?

```python
import numpy
import gensim
from gensim import matutils
from scipy.sparse import issparse
from six import iteritems

def hellinger(vec1, vec2, lda=None):
    if lda is None:
        if issparse(vec1) and issparse(vec2):
            # scipy.sparse input: densify first
            dense1 = vec1.todense()
            dense2 = vec2.todense()
            sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
            return sim
        elif isinstance(vec1, numpy.ndarray) and isinstance(vec2, numpy.ndarray):
            # dense numpy input
            sim = numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2)) ** 2).sum())
            return sim
        elif isinstance(vec1, list):
            # gensim sparse vectors: lists of (id, value) 2-tuples
            vec1, vec2 = dict(vec1), dict(vec2)
            if len(vec2) < len(vec1):
                vec1, vec2 = vec2, vec1  # swap references so that we iterate over the shorter vector
            sim = numpy.sqrt(0.5 * sum(
                (numpy.sqrt(value) - numpy.sqrt(vec2.get(index, 0.0))) ** 2
                for index, value in iteritems(vec1)))
            return sim
    elif isinstance(lda, gensim.models.ldamodel.LdaModel):
        # vec1/vec2 are sparse topic distributions from an LDA model
        dense1 = matutils.sparse2full(vec1, lda.num_topics)
        dense2 = matutils.sparse2full(vec2, lda.num_topics)
        sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
        return sim
    else:
        # return some error
        pass
```

This works fine for lda distribution vectors, for bag of words representations (I used an implementation similar to that in your
If this approach is fine, I could also add something simple like Jaccard Coefficient for gensim vectors (bag of words). |
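A sketch of what such a Jaccard function for gensim bag-of-words vectors could look like (illustrative only; the function name and the choice to return a similarity rather than a distance are assumptions, and an actual implementation in gensim may return the complementary distance, 1 minus this ratio):

```python
def jaccard(vec1, vec2):
    """Weighted Jaccard coefficient between two gensim bag-of-words vectors,
    i.e. lists of (token_id, count) 2-tuples: the sum of element-wise minima
    divided by the sum of element-wise maxima over the two bags."""
    d1, d2 = dict(vec1), dict(vec2)
    union = sum(max(d1.get(i, 0.0), d2.get(i, 0.0)) for i in set(d1) | set(d2))
    if union == 0:
        return 0.0
    intersection = sum(min(d1[i], d2[i]) for i in set(d1) & set(d2))
    return float(intersection) / union
```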
@tmylk , should I go ahead and submit a PR for this? And any suggestion on dealing with matrices of different dimensions? |
@piskvorky what are the steps ahead for this? |
Deferring to @tmylk . |
Implemented in #656 |
I wanted to know if there was a way for the Similarity class to accept other similarity functions besides cosine similarity. For example, I want to write my own similarity function and substitute it for the cosine similarity function in the gensim package. Any help would be greatly appreciated. @piskvorky
@jvazquez2, unfortunately there is no built-in way; you can implement your own class using, for example, https://github.com/RaRe-Technologies/gensim/blob/2ce4699e048a4bb02be06b0412a42da9bd7fbdfe/gensim/similarities/docsim.py#L722
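For illustration, a custom brute-force index could loosely mirror the interface of the class linked above: store the documents as a dense matrix and return one score per indexed document for each query. This is a hypothetical sketch, not gensim's API; the class name and the distance-to-similarity conversion are my own choices.

```python
import numpy

class HellingerSimilarity:
    """Hypothetical brute-force index: documents are dense probability
    distributions (e.g. LDA topic vectors); queries return a Hellinger-based
    similarity score (1 - distance) against every indexed document."""

    def __init__(self, corpus_dense):
        # pre-compute square roots once, at indexing time
        self.index = numpy.sqrt(numpy.asarray(corpus_dense, dtype=float))

    def get_similarities(self, query):
        q = numpy.sqrt(numpy.asarray(query, dtype=float))
        # Hellinger distance of the query to each row, vectorized over the index
        dist = numpy.sqrt(0.5 * ((self.index - q) ** 2).sum(axis=1))
        return 1.0 - dist  # convert distance to a similarity score

    def __getitem__(self, query):
        return self.get_similarities(query)
```

Usage: `sims = index[query_vector]` yields a numpy array of scores, one per document, just like querying gensim's built-in similarity classes.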
Currently, Similarity works purely over cosine similarity (~the angle between query and indexed document).
Make this more general, using e.g. Hellinger distance for models that represent the documents as probability distributions.
At the same time, try to still keep things computationally efficient (using BLAS & mmap etc.).
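One way Hellinger can stay BLAS-friendly, sketched under the assumption that the index rows are probability distributions: for such rows, the squared Hellinger distance equals 1 minus the Bhattacharyya coefficient, so pre-computing the square roots of the index turns each query into a single matrix-vector product.

```python
import numpy

# For probability distributions p, q:
#   H(p, q)^2 = 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2 = 1 - sum_i sqrt(p_i * q_i)
# so the whole batch reduces to one BLAS matrix-vector product.
index = numpy.array([[0.7, 0.3],
                     [0.5, 0.5],
                     [0.1, 0.9]])
sqrt_index = numpy.sqrt(index)  # done once, at indexing time

query = numpy.array([0.7, 0.3])
bc = sqrt_index.dot(numpy.sqrt(query))            # Bhattacharyya coefficients
dists = numpy.sqrt(numpy.maximum(0.0, 1.0 - bc))  # clamp for float round-off
```

The same pattern works with mmap'ed index matrices, since only the pre-rooted matrix needs to be stored.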