Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix min_count handling in phrases detection using npmi #2072

Merged
merged 6 commits into from
Jul 31, 2018

Conversation

lopusz
Copy link
Contributor

@lopusz lopusz commented May 29, 2018

Dear Gensim Developers,

First of all, gensim is an amazing tool. Thank you for your work!

Recently, I played a bit with phrases (collocation) detection via gensim.

It seems that the npmi version ignores min_count parameter.
I believe min_count should allow me to filter rarely occurring bigrams no matter how strong the collocation score might be. It works fine with the default (original) scorer, but npmi ignores min_count parameter.

The attached example illustrates the point (sorry for txt ext, but github complains about py). It displays detected phrases, calls are with min_count=3. Before modification the NPMI version does not drop cases with 3 or fewer occurrences.

Default scoring
(b'A', b'A') (5, 0.18)
(b'B', b'A') (4, 0.1125)
NPMI scoring
(b'E', b'B') (1, 0.29377325990008535)
(b'A', b'E') (1, 0.2179885169004593)
(b'B', b'A') (4, -0.032919469600875044)
(b'A', b'A') (5, -0.03842191266041472)
(b'B', b'B') (2, -0.23155457855616865)
(b'A', b'B') (1, -0.48823822319945537)

After the correction, the low count bigrams are correctly pruned:

Default scoring
(b'A', b'A') (5, 0.18)
(b'B', b'A') (4, 0.1125)
NPMI scoring
(b'B', b'A') (4, -0.032919469600875044)
(b'A', b'A') (5, -0.03842191266041472)

If my way of reasoning is right, but you want me to add some love to the attached pull
request do not hesitate to let me know (perhaps stating explicitly in the docs that
scorer should handle the min_count would be a nice addition, in case somebody wants to write her/his own...)

Best regards,
Michał
mwe.txt

@lopusz
Copy link
Contributor Author

lopusz commented Jul 13, 2018

This seems to be also the issue here
#2086

@piskvorky piskvorky added the bug Issue described a bug label Jul 13, 2018
@piskvorky
Copy link
Owner

Thanks @lopusz ! Sorry it's taking longer to process, we were busy with the documentation updates.

This indeed looks like a bug to me. @michaelwsherman thoughts?

else:
# Return the value below minimal npmi, to make sure that phrases
# will be created only out of bigrams more frequent than min_count
return -1.1
Copy link
Owner

@piskvorky piskvorky Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning behind -1.1? Looks too magical.

Deserves a code comment at the very least, but maybe a principled value would be even better? -inf? None?

Copy link

@rafabr4 rafabr4 Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That value will be compared against the threshold parameter that was supplied by the user, which even though is not restricted, should fall within -1 and 1.

Also. I would use a >= instead of > in if bigram_count > min_count:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

I'd be in favor of returning None (for all scoring functions) rather than a number for any min_count violation since that would mean all scorers return the same value that means "min_count not met". But I also don't know if None may break things. What tests fail with None? From a quick look this is probably safe.

I don't think returning -1.1 is a great idea either, but I agree about it being commented at the very least.

Copy link

@rafabr4 rafabr4 Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you can't compare int/float with NoneType (at least in python 3.6.2), so line 168 of phrases.py would have to be something like if score is not None and score > threshold:. This would also require to change the default scorer to return None explicitly when min_count is not met, for consistency across scorers.

Other solution, instead of returning None in scorers, could be return -float('Inf') which would result in 'False' when comparing score > threshold.

I personally like solution 1 more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for a slow reply. The pull-request woke up in a pretty busy time for me.

Thank you for remarks. Indeed -inf sounds a bit less fragile-magical. Code updated.

Personally, I would avoid returning None from function expected to return a number and cluttering the rest of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to -Inf, I learned something :). LGTM.

@lopusz
Copy link
Contributor Author

lopusz commented Jul 17, 2018

Code updated along the review lines. @piskvorky, Others do you find it mergeable?

@piskvorky
Copy link
Owner

@lopusz thanks! Let's wait for @menshikh-iv 's verdict.

@lopusz
Copy link
Contributor Author

lopusz commented Jul 21, 2018

@menshikh-iv what do you think?

pb = wordb_count / corpus_word_count
pab = bigram_count / corpus_word_count
return log(pab / (pa * pb)) / -log(pab)
if bigram_count > min_count:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sharp inequality goes against the established pattern in the other classes in this module, and in word2vec etc. Should be >= IMO.

@lopusz
Copy link
Contributor Author

lopusz commented Jul 23, 2018

@piskvorky Code updated.

However, I want to bikeshed a bit on that (If the stuff below looks like nonsense, just ignore it...)

I arrived at > sign because IMO gensim api for the default scorer seems to indirectly enforces > min_count in the default scorer. It would be great if default and npmi were compatible.

How is that enforced, you say?

Two things:

  • the score formula (bigram_count - min_count)/SOMETHING
  • Runtime check in the Phrases constructor enforcing threshold>0 (note sharp inequality)
   if threshold <= 0 and scoring == 'default':
         raise ValueError("threshold should be positive for default scoring")

Since, I am not allowed to threshold=0, the default scorer will always give me
bigrams > min_count.

Perhaps the check could enforce only that threshold>=0 for the default scoring?

How I arrived at this problem and why it might be important?

I wanted to generate all the phrases above certain min_count and compare
scores (ranking) provided by both methods for the same bigrams.
It seams reasonable that to have the option to score all
bigrams fulfilling min_count requirement. And indeed min_count should be
compatible with the rest of the library.

What do you think about allowing threshold>=0 for the default scoring?
@piskvorky @menshikh-iv @michaelwsherman @rafabr4

If I am the only bothered about that, just ignore.
Or maybe there are some negative side effects of this additional change I cannot see?

@piskvorky
Copy link
Owner

piskvorky commented Jul 23, 2018

No problem :) Good points.

As a user, without reading the docs, I'd expect min_count to mean "minimum count to pass the filter", not "maximum count that still doesn't pass the filter".

The principle of least surprise.

If min_count means something else, in this PR or in the old code, we should either
a) change the variable name to something more appropriate, or
b) fix the code to match the expectation.

I'm in favour of b). But let's be consistent in the naming and the semantics, across all supported scorers.

@menshikh-iv
Copy link
Contributor

@lopusz I'm +1 for #2072 (comment), i.e. should be >=.

Current code looks good, but please resolve merge conflict.

@lopusz
Copy link
Contributor Author

lopusz commented Jul 30, 2018

Great.

Corrected doc string and comment to reflect the >= change + resolved the merge conflict.

@lopusz
Copy link
Contributor Author

lopusz commented Jul 30, 2018

@menshikh-iv Do you think it is now mergeable?

@menshikh-iv
Copy link
Contributor

@lopusz looks good, thanks! Congratz with first contribution 🥇

@menshikh-iv menshikh-iv merged commit 4d921da into piskvorky:develop Jul 31, 2018
@lopusz
Copy link
Contributor Author

lopusz commented Jul 31, 2018

@menshikh-iv :) Yes, my mergeytocin level has never been so high :)

I hope to be contributing more in the future.

@menshikh-iv
Copy link
Contributor

@lopusz 🌟 great, you are welcome :)

@lopusz lopusz deleted the develop-phrases branch November 27, 2019 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants