Respec clip_start, clip_end in most_similar. Fix #601 (#994)
* fixed incorrect doctag in most_similar

* remove query tagID from sim

* added test and modified CHANGELOG.md

* correct PEP8

* changed clip end
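
The indexing issue addressed by this commit can be illustrated with a small standalone sketch. This is not gensim's code: `most_similar_clipped` and the random `vectors` array are hypothetical stand-ins, and only NumPy is assumed. The point is that positions returned by an argsort over a slice are relative to that slice, so `clip_start` has to be added back before they are reported or compared against the query index.

```python
import numpy as np


def most_similar_clipped(vectors, query_index, topn=3, clip_start=0, clip_end=None):
    """Toy clipped similarity search (hypothetical helper, not gensim's API)."""
    # Unit-normalise rows so a dot product equals cosine similarity.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Restrict the search to the requested slice of documents.
    clipped = normed[clip_start:clip_end]
    dists = clipped @ normed[query_index]
    # Positions in `best` are relative to the slice, not to `vectors`.
    best = np.argsort(-dists)
    result = []
    for pos in best:
        doc_index = int(pos) + clip_start  # map back to the full index space
        if doc_index == query_index:
            continue  # do not return the query document itself
        result.append((doc_index, float(dists[pos])))
    return result[:topn]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(12, 4))  # made-up document vectors
hits = most_similar_clipped(vectors, query_index=0, topn=3, clip_start=6, clip_end=9)
```

Without the `+ clip_start` step, the sketch would report indices 0 to 2 for a slice that actually covers documents 6 to 8, which matches the kind of mismatch described in #601.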
parulsethi authored and tmylk committed Nov 9, 2016
1 parent 7f867c7 commit a0443e4
Showing 3 changed files with 9 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -20,6 +20,7 @@ Changes
* Remove ShardedCorpus from init because of Theano dependency (@tmylk, [#919](https://github.com/RaRe-Technologies/gensim/pull/919))
* Documentation improvements ( @dsquareindia & @tmylk, [#914](https://github.com/RaRe-Technologies/gensim/pull/914), [#906](https://github.com/RaRe-Technologies/gensim/pull/906) )
* Add Annoy memory-mapping example (@harshul1610, [#899](https://github.com/RaRe-Technologies/gensim/pull/899))
+ * Fixed issue [#601](https://github.com/RaRe-Technologies/gensim/issues/601), correct docID in most_similar for clip range (@parulsethi, [#994](https://github.com/RaRe-Technologies/gensim/pull/994))

0.13.2, 2016-08-19

2 changes: 1 addition & 1 deletion gensim/models/doc2vec.py
@@ -460,7 +460,7 @@ def most_similar(self, positive=[], negative=[], topn=10, clip_start=0, clip_end
return dists
best = matutils.argsort(dists, topn=topn + len(all_docs), reverse=True)
# ignore (don't return) docs from the input
- result = [(self.index_to_doctag(sim), float(dists[sim])) for sim in best if sim not in all_docs]
+ result = [(self.index_to_doctag(sim + clip_start), float(dists[sim])) for sim in best if (sim + clip_start) not in all_docs]
return result[:topn]

def doesnt_match(self, docs):
7 changes: 7 additions & 0 deletions gensim/test/test_doc2vec.py
@@ -8,6 +8,7 @@
Automated tests for checking transformation algorithms (the models package).
"""


from __future__ import with_statement

import logging
@@ -159,6 +160,12 @@ def model_sanity(self, model):
self.assertEqual(list(zip(*sims))[0], list(zip(*sims2))[0]) # same doc ids
self.assertTrue(np.allclose(list(zip(*sims))[1], list(zip(*sims2))[1])) # close-enough dists

+ # sim results should be in clip range if given
+ clip_sims = model.docvecs.most_similar(fire1, clip_start=len(model.docvecs) // 2, clip_end=len(model.docvecs) * 2 // 3)
+ sims_doc_id = [docid for docid, sim in clip_sims]
+ for s_id in sims_doc_id:
+     self.assertTrue(len(model.docvecs) // 2 <= s_id <= len(model.docvecs) * 2 // 3)

# tennis doc should be out-of-place among fire news
self.assertEqual(model.docvecs.doesnt_match([fire1, tennis1, fire2]), tennis1)
