Repetition of Named Entity in coref resolution #288

grovershreyf9t · 2020-10-30T16:09:19Z

I am trying to find corefs for the following text:
'Alan and Bruder are great friends with Chris and Donald. Alan and Bruder want to head to Lebanon while Chris and Donald wish to stay in United States. Chris and Donald have not made up their mind yet, but will get there soon. Alan and Bruder do not want to separate but there seems to be no choice.'

As seen in the screenshot, I get correct clusters for 'Chris and Donald' (cell 336), but when I resolve these corefs, 'Donald' is duplicated in the result (cell 337, second line).

Can someone help me understand what might be going wrong here?

svlandeg · 2020-11-18T22:18:00Z

Hm, that looks like a bug to me. Thanks for the report!

svlandeg · 2020-11-18T22:18:56Z

It's really weird that the wrong substitution only happens once, though. The other occurrences seem fine...

grovershreyf9t · 2020-11-19T05:52:37Z

@svlandeg After inspecting some cases, I believe it happens when 'Chris', 'Donald' and 'Chris and Donald' are recognised by neuralcoref as three separate entities.
I also wrote a short custom snippet to resolve the corefs, which tries to eliminate this error:

    # Assumes case_doc is a spaCy Doc processed by a pipeline that includes neuralcoref
    clusters = case_doc._.coref_clusters
    # Start from each token's text plus its trailing whitespace
    tok_list = [token.text_with_ws for token in case_doc]
    for cluster in clusters:
        # Words of the cluster's representative mention
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            if coref == cluster.main:
                continue  # never rewrite the representative mention itself
            # Skip mentions that equal the representative text or share a word with it.
            # This was done to handle nested coreference scenarios.
            if coref.text == cluster.main.text or set(coref.text.split(' ')) & cluster_main_words:
                continue
            tok_list[coref.start] = cluster.main.text + case_doc[coref.end - 1].whitespace_
            for i in range(coref.start + 1, coref.end):
                tok_list[i] = ""
    # Join the pieces back into the resolved text
    resolved_text = "".join(tok_list)

Basically, I skip the substitution whenever the mention and the representative element have the same text (or share any word). E.g. if 'Chris' is the representative entity and the code encounters the mention 'Chris', it is not substituted, since the two are the same. This avoids the duplication.
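To make the idea easier to try without spaCy or neuralcoref installed, here is a minimal sketch of the same substitution logic on plain token lists. `Mention`, `Cluster` and `resolve` are hypothetical stand-ins written for this illustration, not neuralcoref's API, and the nested clusters below are an assumed example of how the duplication might arise rather than a confirmed root cause.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for a spaCy Span: token offsets [start, end)."""
    start: int
    end: int


@dataclass
class Cluster:
    """Hypothetical stand-in for a coref cluster with a representative mention."""
    main: Mention
    mentions: List[Mention]


def span_text(tokens: List[str], m: Mention) -> str:
    return " ".join(tokens[m.start:m.end])


def resolve(tokens: List[str], clusters: List[Cluster], skip_overlapping: bool) -> str:
    """Replace each mention with its cluster's representative text.

    With skip_overlapping=True, mentions that share any word with the
    representative text are left untouched (the filter described above).
    """
    out = [t + " " for t in tokens]
    for cluster in clusters:
        main_text = span_text(tokens, cluster.main)
        main_words = set(main_text.split(" "))
        for m in cluster.mentions:
            if m == cluster.main:
                continue
            mention_words = set(span_text(tokens, m).split(" "))
            if skip_overlapping and mention_words & main_words:
                continue
            out[m.start] = main_text + " "
            for i in range(m.start + 1, m.end):
                out[i] = ""
    return "".join(out).strip()


tokens = "Chris and Donald are friends . Chris and Donald will stay . They agree .".split(" ")
clusters = [
    # 'Chris and Donald' (tokens 0-2), repeated at 6-8 and referred to by 'They' (12)
    Cluster(main=Mention(0, 3), mentions=[Mention(0, 3), Mention(6, 9), Mention(12, 13)]),
    # 'Donald' on its own (token 2), nested inside the second 'Chris and Donald' (token 8)
    Cluster(main=Mention(2, 3), mentions=[Mention(2, 3), Mention(8, 9)]),
]

naive = resolve(tokens, clusters, skip_overlapping=False)
filtered = resolve(tokens, clusters, skip_overlapping=True)
print(naive)     # second sentence contains 'Chris and Donald Donald'
print(filtered)  # no duplication; 'They' is still resolved
```

In the unfiltered run, the nested 'Donald' mention overwrites a slot that the enclosing 'Chris and Donald' substitution had already blanked, which reproduces the repeated name. The word-overlap check alone is enough in this sketch because identical texts always share a word. One trade-off is that the filter also leaves alone any mention that only partially overlaps the representative text.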

I hope this makes sense.

stale · 2022-01-08T22:06:31Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

svlandeg added the usage label Nov 18, 2020

svlandeg added bug feat / coref and removed usage labels Nov 18, 2020

stale bot added the wontfix label Jan 8, 2022

stale bot closed this as completed Apr 17, 2022
