Multi-document usage doesn't work with MMR diversity changes #54

Closed
emigre459 opened this issue Jul 29, 2021 · 2 comments

@emigre459

When I try to compare keyphrase lists by adjusting diversity for the MMR similarity metric, the keyphrases never change, regardless of the diversity value used. The output is as expected only when I take the example document out of its list and pass it as a single document. Please see the code and results below (using the text snippet from the README tutorial):

Version with a single piece of text in a single-element list:

keywords = kw_model.extract_keywords(
    docs=[doc],
    keyphrase_ngram_range=(2,5),
    use_mmr=True,
    diversity=0.7,
    vectorizer=None
)

>>> [[('examples supervised learning example', 0.6984),
  ('supervised learning machine', 0.6989),
  ('supervised learning', 0.7035),
  ('examples supervised learning', 0.7041),
  ('supervised learning example', 0.7558)]]

Version with a single piece of text fed directly as just a single document:

keywords = kw_model.extract_keywords(
    docs=doc,
    keyphrase_ngram_range=(2,5),
    use_mmr=True,
    diversity=0.7,
    vectorizer=None
)

>>> [('supervised learning example', 0.7558),
 ('output pairs', 0.1079),
 ('reasonable way', 0.1529),
 ('object typically vector', 0.2502),
 ('value called supervisory signal', 0.3348)]

Looking at the code for KeyBERT.extract_keywords(), it appears this is intentional, as there aren't even MMR/MaxSum options for _extract_keywords_multiple_docs(). Is there a reason why these diversity-enhancing metrics can't be used in the multi-document scenario?

@MaartenGr
Owner

This was indeed on purpose: MMR and MaxSum work at the individual-document level, whereas the multiple_docs implementation compares matrices of embeddings. MMR and MaxSum cannot be run during that process without significantly slowing down the application. If I were to implement that option, it would essentially be the same as iterating over extract_keywords, which defeats its purpose.
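For anyone who needs MMR on several documents in the meantime, the iteration mentioned above can be sketched roughly as follows. This is only an illustration, not part of KeyBERT: extract_per_document is a hypothetical helper name, and it assumes a KeyBERT version where passing a single string takes the single-document path (as in the second example in this issue), so that use_mmr and diversity are honoured.

```python
from typing import Callable, List, Sequence, Tuple

Keyword = Tuple[str, float]


def extract_per_document(
    extract_fn: Callable[..., List[Keyword]],
    docs: Sequence[str],
    **kwargs,
) -> List[List[Keyword]]:
    """Call extract_fn once per document and collect the results.

    Hypothetical helper (not a KeyBERT function). Each document is passed
    as a plain string, so any single-document options in kwargs
    (e.g. use_mmr, diversity) are applied to every document.
    """
    return [extract_fn(doc, **kwargs) for doc in docs]


# Assumed usage with KeyBERT (requires the keybert package and a model download):
#
#   from keybert import KeyBERT
#   kw_model = KeyBERT()
#   keywords = extract_per_document(
#       kw_model.extract_keywords,
#       [doc],
#       keyphrase_ngram_range=(2, 5),
#       use_mmr=True,
#       diversity=0.7,
#   )
```

As noted above, this gives up the speed advantage of the matrix-based multi-document path, since each document is embedded and scored in its own call.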

@MaartenGr
Owner

Since this issue has been without activity for a while, I'll be closing it for now. However, if you are still experiencing the issue or want to discuss it further, let me know!
