Repetition of Named Entity in coref resolution #288
Comments
Hm, that looks like a bug to me. Thanks for the report!
It's really weird that the wrong substitution only happens once, though; the other occurrences seem fine...
@svlandeg After inspecting some cases, I believe it happens when 'Chris', 'Donald' and 'Chris and Donald' are recognised by neuralcoref as three separate entities.

    # case_doc is the spaCy Doc processed by neuralcoref; clusters are its coref clusters
    tok_list = [token.text_with_ws for token in case_doc]  # tokens with trailing whitespace
    for cluster in clusters:
        # words of the representative (main) mention of this cluster
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            if coref != cluster.main:  # skip the representative mention itself
                # Substitute only if the mention text differs from the representative text
                # and shares no word with it. This was done to handle nested coreference scenarios.
                if coref.text != cluster.main.text and not set(coref.text.split(' ')).intersection(cluster_main_words):
                    tok_list[coref.start] = cluster.main.text + case_doc[coref.end - 1].whitespace_
                    for i in range(coref.start + 1, coref.end):
                        tok_list[i] = ""

Basically, I skip the substitution when the mention and the representative text are the same (or, per the second condition, when they share any word). E.g. if 'Chris' is the representative entity and the code encounters the mention 'Chris', it is not substituted, since they are the same. This avoids the duplication. I hope this makes sense.
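The workaround above can be tried without spaCy or neuralcoref installed. The following is a minimal sketch, not neuralcoref's own implementation: `Mention`, `Cluster` and `resolve` are hypothetical stand-ins for a spaCy span, a coref cluster and the resolution loop, and the token list and cluster are written by hand rather than produced by a model. It also assumes the words of a mention are separated by single spaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for a spaCy Span: token offsets [start, end)."""
    start: int
    end: int


@dataclass
class Cluster:
    """Hypothetical stand-in for a coref cluster with a representative mention."""
    main: Mention
    mentions: List[Mention]


def resolve(tokens: List[str], spaces: List[str], clusters: List[Cluster]) -> str:
    """Replace each mention by its cluster's main text, unless they share a word."""
    out = [tok + ws for tok, ws in zip(tokens, spaces)]
    for cluster in clusters:
        main_text = " ".join(tokens[cluster.main.start:cluster.main.end])
        main_words = set(main_text.split(" "))
        for m in cluster.mentions:
            if m == cluster.main:
                continue  # never rewrite the representative mention itself
            if set(tokens[m.start:m.end]) & main_words:
                continue  # overlaps the representative text, e.g. 'Donald'
            # Put the main text on the first token, keeping the mention's trailing space
            out[m.start] = main_text + spaces[m.end - 1]
            for i in range(m.start + 1, m.end):
                out[i] = ""
    return "".join(out)


# Hand-written example: "They" and the second "Donald" are in the same cluster
tokens = ["Chris", "and", "Donald", "left", ".", "They", "were", "tired", ".",
          "Donald", "slept", "."]
spaces = [" ", " ", " ", "", " ", " ", " ", "", " ", " ", "", ""]
cluster = Cluster(main=Mention(0, 3),
                  mentions=[Mention(0, 3), Mention(5, 6), Mention(9, 10)])
resolved = resolve(tokens, spaces, [cluster])
print(resolved)
```

Here 'They' is replaced by 'Chris and Donald', while the standalone 'Donald' is left alone because it shares a word with the representative mention. The trade-off is that a partial mention such as 'Donald' is never expanded, which is what prevents the repeated name.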
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am trying to find corefs for the following text:
'Alan and Bruder are great friends with Chris and Donald. Alan and Bruder want to head to Lebanon while Chris and Donald wish to stay in United States. Chris and Donald have not made up their mind yet, but will get there soon. Alan and Bruder do not want to separate but there seems to be no choice.'
As seen in the screenshot, I get the correct clusters for 'Chris and Donald' (cell 336), but when I try to resolve these corefs, 'Donald' appears twice in the result (cell 337, second line).
Can someone help me understand what might be going wrong here?