Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix formula in gensim.summarization.bm25. Fix #1828 #1833

Merged
merged 16 commits into from Jan 11, 2018

Conversation

sj29-innovate
Copy link
Contributor

No description provided.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jan 8, 2018

Hey @sj29-innovate, what was it #1832 #1831 #1830?
Probably better pre-calculate length of documents in initialize method, wdyt?
Also, please add tests for bm25.

@sj29-innovate
Copy link
Contributor Author

I think you are correct , i should probably pre calculate all document lengths.

@sj29-innovate
Copy link
Contributor Author

sj29-innovate commented Jan 8, 2018

Please ignore #1832 #1831 #1830 , I was trying to figure out my way through and due to build errors i closed those PR's instead of committing new changes there itself.
I am committing changes to #1833 , have precomputed document lengths in " doc_len" list inside the initialization function so that they can be used elsewhere easily.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sj29-innovate, please make last fixes and I'll merge your PR!

self.initialize()

def initialize(self):
"""Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
for document in self.corpus:
frequencies = {}
(self.doc_len).append(len(document))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless ( ), please remove.

@@ -122,7 +124,7 @@ def get_score(self, document, index, average_idf):
continue
idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
score += (idf * self.f[index][word] * (PARAM_K1 + 1)
/ (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * len(document) / self.avgdl)))
/ (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.doc_len[index] / self.avgdl)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add several simple tests for BM25

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv Can you please guide me a bit towards writing unit tests and also what corpus should i be using for the tests. Thanks a lot.

@menshikh-iv
Copy link
Contributor

@sj29-innovate
corpus - any small, for example, you can use common_texts from from gensim.test.utils import common_texts
test - something simple, apply BM25 to corpus and query document (some kind of "sanity-check")

@sj29-innovate
Copy link
Contributor Author

Should I be testing things like document should show maximum matching with itself and other such conditions ? Thanks for your help.

@menshikh-iv
Copy link
Contributor

@sj29-innovate sounds good.

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv ci/circleci tests shows command timed out . At some point in the tests it is asking for username , i think it is getting stuck there.Can you please help me with this.

https://circleci.com/gh/RaRe-Technologies/gensim/25?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link#

@menshikh-iv
Copy link
Contributor

@sj29-innovate please merge fresh "develop" to your PR (CircleCI don't see needed config in your branch -> build from "inner-default-python" config), this should fix your problem!

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv I managed to resolve circleci failing tests . Requesting your feedback for the same. Thanks!

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @sj29-innovate, please make last changes and I'll merge your PR!

@@ -4,7 +4,7 @@
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove `Keras-2.1.2-py2.7.egg from PR (comment here, because I can't add comment to binary file)

for index, doc_weights in enumerate(weights):
expected = max(doc_weights)
predicted = doc_weights[index]
self.assertEqual(expected, predicted)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be assertAlmostEqual (because scores are float)

weights = get_bm25_weights(common_texts)
for doc_weights in weights:
for weight in doc_weights:
self.assertTrue(weight >= 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 -> 0.

""" A document should always get the same weight when matched with a particular document """
corpus = [['cat', 'dog', 'mouse'], ['cat', 'lion'], ['cat', 'lion']]
weights = get_bm25_weights(corpus)
self.assertEqual(weights[0][1], weights[0][2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be assertAlmostEqual (because scores are float)

""" Two disjoint documents should have zero matching"""
corpus = [['cat', 'dog', 'lion'], ['goat', 'fish', 'tiger']]
weights = get_bm25_weights(corpus)
self.assertTrue(weights[0][1] == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be assertAlmostEqual (because scores are float), same for next line.

@sj29-innovate
Copy link
Contributor Author

@menshikh-iv I have made the required changes , please let me know if any more changes or add ons are required. Thanks.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jan 11, 2018

I close+open your PR because CI doesn't run correctly, don't worry.

@menshikh-iv menshikh-iv reopened this Jan 11, 2018
@menshikh-iv menshikh-iv changed the title Fixes #1828 Fix formula in gensim.summarization.bm25. Fix #1828 Jan 11, 2018
@menshikh-iv
Copy link
Contributor

Big thanks @sj29-innovate, good job! Congratz with the first contribution 🥇

@menshikh-iv menshikh-iv merged commit 4c2c638 into piskvorky:develop Jan 11, 2018
@sj29-innovate
Copy link
Contributor Author

@menshikh-iv thank you so much for your constant help , can you please suggest me some other tasks as i really wish to contribute more. thanks!

@sj29-innovate sj29-innovate deleted the samyak-develop branch January 11, 2018 17:18
@menshikh-iv
Copy link
Contributor

Yes, please work on #1401 (this is the really critical bug).

sharanry added a commit to sharanry/gensim that referenced this pull request Jan 15, 2018
* Add wordnet mammal train file for Poincare notebook (piskvorky#1781)

* Adds wordnet mammal train file

* Adds link to data file in notebook

* Fix tox.ini/setup.cfg configuration (piskvorky#1815)

* update according to new pytest_benchmark version

* update wheel-storage url

* use only twine

* Fix docstrings for `gensim.utils` (piskvorky#1797)

* Add docstrings in numpy-style fromat

* fix PEP8

* remove outdated "hack" (smart_open is core dependency right now)

* fix docstrings[1]

* remove unused internal class

* fix docstrings[2]

* fix docstrings[3]

* fix docstrings[4]

* fix docstrings[5]

* fix docstrings[6]

* fix docstrings[7]

* fix docstrings[8]

* add missing `pattern` to doc dependencies

* fix docstrings[9]

* fix docstrings[10]

* Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802)

* first attempt to convert few lines into numpy-style doc

* added parameters in documentation

* more documentation

* few corrections

* show inheritance and undoc members

* show special members

* example is executable now

* link to the paper added, named parameters

* fixed doc

* fixed doc

* fixed whitespaces

* fix docstrings & PEP8

* fix docstrings

* fix typo

* Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806)

* convert Space class doc to numpy style

* fix docstrings[1]

* fix docstrings[2]

* remove useless load

* fix docstrings[3]

* add missing import

* fix docstrings[4]

* Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822)

* init config for circle

* change

* rm cache, install tox distinctly

* fix indentation & command

* update venv

* add pip-cache

* add apt packages for latex

* rename

* enable latex rendering

* remove doc building from Travis

* store new doc version

* Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714)

* Refactored aggregation

* Micro-Fix for aggregation.py, partially refactored direct_confirmation.py

* Partially refactored indirect_confirmation_measure

* Some additions

* Math attempts

* add math extension for sphinx

* Minor refactoring

* Some refactoring for probability_estimation

* Beta-strings

* Different additions

* Minor changes

* text_analysis left

* Added example for ContextVectorComputer class

* probability_estimation 0.9

* beta_version

* Added some examples for text_analysis

* text_analysis: corrected example for class UsesDictionary

* Final additions for text_analysis.py

* fix cross-reference problem

* fix pep8

* fix aggregation

* fix direct_confirmation_measure

* fix types in direct_confirmation_measure

* partial fix indirect_confirmation_measure

* HotFix for probability_estimation and segmentation

* Refactoring for probability_estimation

* Changes for indirect_confirmation_measure

* Fixed segmentation, partly fixed text_analysis

* Add Notes for text_analysis

* fix di/ind

* fix doc examples in probability_estimation

* fix probability_estimation

* fix segmentation

* fix docstring in probability_estimation

* partial fix test_analysis

* add latex stuff for docs build

* doc fix[1]

* doc fix[2]

* remove apt install from travis (now doc build in circle)

* Fix docstrings for `gensim.models.normmodel` (piskvorky#1805)

* First edits

* changed bow

* Added examples

* Final commit of the night

* Still struggling with docs

* Removed examples but still struggling with documentation

* fix docstring

* fix docstring[2]

* Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803)

* improve and correct documentation of models/logentropy_model

* include fixes according to comments

* implement fixes suggested

* associate methods with examples.

* fix minor typos

* doc fix

* Fix docstrings for `gensim.matutils` (piskvorky#1804)

* numpy style documentation on matutils.py

* doc fix[1]

* doc fix[2]

* doc fix[3]

* doc fix[4]

* doc fix[5]

* doc fix[6]

* Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833)

* bm25 scoring function updated

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828

* Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821)

* minor style refactoring and comment fixes in accordance to PEP8

* Created test data in legitimate compressed XML format (.xml.bz2) for the WikiCorpus class.

* Used the same raw data found for other sources (9 articles).

* Added Various wiki markup to test the parsing regural expressions

* Added test class for the WikiCorpus source.

* Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase.

* Testing methods are overriden where necessary to reflect logic changes.

* All existing functionality is tested (account for markup handling, minimum article length etc)

* Fix python 3 compatibility for generator next method

* code review corrections

* Moved WikiCorpus tests from test/test_wikicorpus.py into its class within the test_corpora.py file.

* Adapted all old tests to the new class

* Current Test class schema ensures that WikiCorpus also passes tests defined in parents

* Deleted test_wikicorpus.py since it is now redundant

* Discarded the empty input test for the WikiCorpus since an empty file is not legitimate XML

* Added 2 more tests

* Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837)

* bm25 scoring function updated

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828

* Function Parameters corrected , Fixes piskvorky#1818

* add missing params + add supercall

* Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823)

* add keyword params for call to gensim.models.CoherenceModel as positional arguments for coherence and topn were incorrect due to skipping param for keyed_vectors

* Fix PEP8
sharanry added a commit to sharanry/gensim that referenced this pull request Jan 15, 2018
…into smart_open

* 'develop' of https://github.com/RaRe-Technologies/gensim:
  Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823)
  Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837)
  Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821)
  Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833)
  Fix docstrings for `gensim.matutils` (piskvorky#1804)
  Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803)
  Fix docstrings for `gensim.models.normmodel` (piskvorky#1805)
  Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714)
  Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822)
  Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806)
  Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802)
  Fix docstrings for `gensim.utils` (piskvorky#1797)
  Fix tox.ini/setup.cfg configuration (piskvorky#1815)
  Add wordnet mammal train file for Poincare notebook (piskvorky#1781)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants