Multi-document usage doesn't work with MMR diversity changes #54

emigre459 · 2021-07-29T15:51:39Z

When attempting to look at different keyphrase lists by adjusting diversity for the MMR similarity metric, I find that the keyphrases never change regardless of the diversity value used. The keyphrase output changes as expected when I take the exemplar document out of the list and feed it in as a single document. Please see the code and results below (using the text snippet from the README tutorial):

Version with a single piece of text in a single-element list:

keywords = kw_model.extract_keywords(
    docs=[doc],
    keyphrase_ngram_range=(2,5),
    use_mmr=True,
    diversity=0.7,
    vectorizer=None
)

>>> [[('examples supervised learning example', 0.6984),
  ('supervised learning machine', 0.6989),
  ('supervised learning', 0.7035),
  ('examples supervised learning', 0.7041),
  ('supervised learning example', 0.7558)]]

Version with a single piece of text fed directly as just a single document:

keywords = kw_model.extract_keywords(
    docs=doc,
    keyphrase_ngram_range=(2,5),
    use_mmr=True,
    diversity=0.7,
    vectorizer=None
)

>>> [('supervised learning example', 0.7558),
 ('output pairs', 0.1079),
 ('reasonable way', 0.1529),
 ('object typically vector', 0.2502),
 ('value called supervisory signal', 0.3348)]

Looking at the code for KeyBERT.extract_keywords(), it appears this is intentional, as there aren't even MMR/MaxSum options for _extract_keywords_multiple_docs(). Is there a reason why these diversity-enhancing metrics can't be used in the multi-document scenario?
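For context, here is a minimal pure-Python sketch of the general MMR idea, to show why the diversity value should change the output when MMR is actually applied. It is not KeyBERT's implementation: the names `cosine` and `mmr` and the toy 2-D vectors in the usage note are made up for illustration, and the scoring rule (relevance to the document minus redundancy with already-selected phrases, weighted by `diversity`) is the textbook form of MMR as I understand it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, candidates, top_n=5, diversity=0.5):
    """Illustrative MMR: trade off relevance to ONE document against redundancy."""
    if not candidates:
        return []
    # Relevance of every candidate phrase to this single document.
    doc_sim = [cosine(doc_vec, c) for c in cand_vecs]
    # Start with the most relevant candidate.
    selected = [max(range(len(candidates)), key=lambda i: doc_sim[i])]
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        def score(i):
            # Redundancy: similarity to the closest already-selected phrase.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [(candidates[i], round(doc_sim[i], 4)) for i in selected]
```

With toy vectors where "a" and "b" are near-duplicates and "c" points elsewhere, `diversity=0.0` keeps the two near-duplicates, while a high value such as `diversity=0.9` swaps the second pick for "c". That is the kind of change I expected to see in the list case as well.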

MaartenGr · 2021-07-30T05:09:16Z

This was indeed on purpose, since MMR and MaxSum work on an individual-document level whereas the multiple_docs implementation compares matrices of embeddings. MMR and MaxSum cannot be executed during that process, as doing so would significantly slow down the application. If I were to implement that option, it would essentially be the same as iterating over extract_keywords, which defeats its purpose.
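The iteration mentioned above can be sketched as a small wrapper. This is only an illustration under assumptions: `extract_keywords_per_doc` is a hypothetical helper name, not part of KeyBERT, and it assumes the behavior described in this thread, where passing a single string to extract_keywords takes the single-document path that honors use_mmr and diversity. Other versions of the library may behave differently.

```python
def extract_keywords_per_doc(kw_model, docs, **kwargs):
    """Call extract_keywords once per document instead of once on the whole list.

    `kw_model` is assumed to be an object with an extract_keywords(docs=..., **kwargs)
    method, such as the model used in the snippets above. Each document is passed
    as a plain string so that per-document options (use_mmr, diversity, ...) apply.
    Returns one keyword list per input document, in input order.
    """
    return [kw_model.extract_keywords(docs=doc, **kwargs) for doc in docs]
```

For example, `extract_keywords_per_doc(kw_model, [doc], keyphrase_ngram_range=(2, 5), use_mmr=True, diversity=0.7)` should return a one-element list containing the same result as the single-document call shown earlier. The trade-off is the one described above: each document is embedded and processed separately, so the speed benefit of the batched multi-document path is lost.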

MaartenGr · 2021-10-26T06:21:21Z

Since this issue has been inactive for a while, I'll be closing it for now. However, if you are still experiencing the issue or want to discuss it further, let me know!

MaartenGr closed this as completed Oct 26, 2021
