Repetition of Named Entity in coref resolution #288

Closed
grovershreyf9t opened this issue Oct 30, 2020 · 4 comments

Comments

@grovershreyf9t

I am trying to find corefs for the following text:
'Alan and Bruder are great friends with Chris and Donald. Alan and Bruder want to head to Lebanon while Chris and Donald wish to stay in United States. Chris and Donald have not made up their mind yet, but will get there soon. Alan and Bruder do not want to separate but there seems to be no choice.'

Screenshot 2020-10-30 at 21 07 39

As seen in the screenshot, I get the correct clusters for 'Chris and Donald' (cell 336), but when I resolve these corefs, 'Donald' appears twice in the result (cell 337, second line).

Can someone help me understand what might be going wrong here?

@svlandeg svlandeg added the usage label Nov 18, 2020
@svlandeg
Collaborator

Hm, that looks like a bug to me. Thanks for the report!

@svlandeg
Collaborator

It's really weird that the wrong substitution only happens once, though; the other occurrences seem fine...

@grovershreyf9t
Author

grovershreyf9t commented Nov 19, 2020

@svlandeg After inspecting some cases, I believe it happens when neuralcoref recognises 'Chris', 'Donald' and 'Chris and Donald' as three separate entities.
I also wrote a short custom snippet to resolve the corefs, which tries to avoid this error:

    # case_doc: spaCy Doc processed by neuralcoref; clusters: its coref clusters (e.g. case_doc._.coref_clusters)
    tok_list = [token.text_with_ws for token in case_doc]  # tokens with their trailing whitespace
    for cluster in clusters:
        # words of the cluster's representative mention
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            # Skip the representative mention itself and any mention that shares a word with it
            # (this also covers identical text, and handles nested coreference scenarios).
            shares_word = bool(set(coref.text.split(' ')) & cluster_main_words)
            if coref != cluster.main and not shares_word:
                # put the representative text on the first token, keeping the mention's trailing whitespace
                tok_list[coref.start] = cluster.main.text + case_doc[coref.end - 1].whitespace_
                for i in range(coref.start + 1, coref.end):
                    tok_list[i] = ""  # blank out the remaining tokens of the mention
    resolved_text = "".join(tok_list)

Basically, I skip the substitution whenever the mention and the representative element have the same text (or share any word). E.g. if 'Chris' is the representative entity and the code encounters another 'Chris', it does not substitute it, since the two are the same. This way the duplication can be avoided.
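To make the idea concrete, here is a minimal, library-free sketch of the same guard on hand-built data. It does not call neuralcoref or spaCy: the token list, the `(start, end)` spans, the two toy clusters, and the helper names (`span_text`, `trailing_ws`, `resolve`) are all made up for illustration. The unguarded variant is only an assumption about how a nested mention could end up duplicated, not a confirmed description of the library's internals.

```python
# Toy tokens with trailing whitespace (hypothetical stand-in for token.text_with_ws).
TOKENS = ["Chris ", "and ", "Donald ", "stayed", ". ",
          "Chris ", "and ", "Donald ", "said ", "they ", "agree", "."]

# Each cluster is (main_span, mentions); spans are (start, end) token offsets.
# Cluster 1: "Chris and Donald", with a repeated mention and the pronoun "they".
# Cluster 2: "Donald", whose second mention is nested inside tokens 5..8.
CLUSTERS = [
    ((0, 3), [(0, 3), (5, 8), (9, 10)]),
    ((2, 3), [(2, 3), (7, 8)]),
]


def span_text(tokens, span):
    """Text of a span without its trailing whitespace."""
    start, end = span
    return "".join(tokens[start:end]).strip()


def trailing_ws(token):
    """Whitespace that follows a token, if any."""
    return token[len(token.rstrip()):]


def resolve(tokens, clusters, skip_overlapping):
    """Replace each non-main mention with its cluster's main text.

    With skip_overlapping=True, a mention that shares any word with the
    main mention is left untouched (the guard described above).
    """
    resolved = list(tokens)
    for main, mentions in clusters:
        main_text = span_text(tokens, main)
        main_words = set(main_text.split())
        for mention in mentions:
            if mention == main:
                continue
            mention_words = set(span_text(tokens, mention).split())
            if skip_overlapping and mention_words & main_words:
                continue
            start, end = mention
            resolved[start] = main_text + trailing_ws(tokens[end - 1])
            for i in range(start + 1, end):
                resolved[i] = ""
    return "".join(resolved)


naive = resolve(TOKENS, CLUSTERS, skip_overlapping=False)
guarded = resolve(TOKENS, CLUSTERS, skip_overlapping=True)
print(naive)    # the nested 'Donald' is written back after the outer span was blanked
print(guarded)  # only 'they' is substituted
```

In this toy setup the unguarded pass first rewrites tokens 5..8 as a single 'Chris and Donald' and then the second cluster writes 'Donald' back into token 7, which gives the doubled name. The guard trades some recall for safety: any mention sharing a word with the representative mention is left alone, even when substituting it would have been harmless.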

I hope this makes sense.

@stale

stale bot commented Jan 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 8, 2022
@stale stale bot closed this as completed Apr 17, 2022