@noelslice
Contributor

Using the same example input mentioned in #215 (comment), there is a spurious mention "than Shyam", because the subordinating conjunction "than" is not excluded during mention span detection.

This PR adds the SCONJ tag to the REMOVE_POS list.
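
To illustrate the effect of the change, here is a minimal, dependency-free sketch of trimming a candidate mention span by part-of-speech tag. It is not the library's actual code: the helpers `trim_leading` and `span_text`, the two set names, and the tags other than SCONJ are hypothetical stand-ins chosen for illustration, and tokens are modelled as plain `(text, pos)` tuples rather than spaCy tokens.

```python
# Hypothetical stand-ins for the POS filter list; only SCONJ comes from this PR.
REMOVE_POS_BEFORE = {"CCONJ", "INTJ", "ADP"}      # assumed contents, for illustration
REMOVE_POS_AFTER = REMOVE_POS_BEFORE | {"SCONJ"}  # the same set with SCONJ added


def trim_leading(tokens, remove_pos):
    """Drop leading (text, pos) tokens whose coarse POS tag is in remove_pos."""
    start = 0
    while start < len(tokens) and tokens[start][1] in remove_pos:
        start += 1
    return tokens[start:]


def span_text(tokens):
    """Join the token texts back into a span string."""
    return " ".join(text for text, _ in tokens)


# Candidate span from "Ram is older than Shyam."
candidate = [("than", "SCONJ"), ("Shyam", "PROPN")]

print(span_text(trim_leading(candidate, REMOVE_POS_BEFORE)))  # than Shyam
print(span_text(trim_leading(candidate, REMOVE_POS_AFTER)))   # Shyam
```

With SCONJ in the filter, the candidate reduces to "Shyam", which is consistent with the new output below, where "than Shyam" no longer appears as a separate mention.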

Test case:

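from pprint import pprint

import spacy
import neuralcoref

# Load the large English model and attach the coreference component.
nlp = spacy.load('en_core_web_lg')
neuralcoref.add_to_pipe(nlp, greedyness=0.5)

doc = nlp(u'Ram and Shyam are good boys. Ram is older than Shyam. But, they are not friends.')

# Print the resolved clusters, then the pairwise scores for each detected mention.
print(doc._.coref_clusters)
pprint(doc._.coref_scores)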

Current output:

[Ram: [Ram, Ram], Shyam: [Shyam, Shyam]]
{Ram: {Ram: 1.775342583656311},
 Ram and Shyam: {Ram and Shyam: 1.7628642320632935, Ram: -1.576068639755249},
 Shyam: {Ram: -1.5397948026657104,
         Ram and Shyam: -1.5207256078720093,
         Shyam: 1.6105855703353882},
 good boys: {Ram: -1.5992192029953003,
             Ram and Shyam: -1.5002832412719727,
             Shyam: -1.6027263402938843,
             good boys: 1.738552212715149},
 Ram: {Ram: 7.551267623901367,
       Ram and Shyam: -0.8156640529632568,
       Shyam: -1.614872932434082,
       good boys: -1.514532446861267,
       Ram: 1.5904799699783325},
 than Shyam: {Ram: -1.5681246519088745,
              Ram and Shyam: -1.4285391569137573,
              Shyam: -1.5769987106323242,
              good boys: -1.5076005458831787,
              Ram: -1.5768016576766968,
              than Shyam: 1.704783320426941},
 Shyam: {Ram: -1.6349478960037231,
         Ram and Shyam: -1.1569286584854126,
         Shyam: 5.653580665588379,
         good boys: -1.526012897491455,
         Ram: -1.6253626346588135,
         than Shyam: -1.5083305835723877,
         Shyam: 1.242653489112854},
 they: {Ram: -2.0989551544189453,
        Ram and Shyam: -0.7402747869491577,
        Shyam: -2.3023903369903564,
        good boys: -1.5382691621780396,
        Ram: -2.296427011489868,
        than Shyam: -1.0285108089447021,
        Shyam: -2.670758008956909,
        they: 0.07739335298538208},
 friends: {Ram: -1.5777109861373901,
           Ram and Shyam: -1.5296742916107178,
           Shyam: -1.725807785987854,
           good boys: -1.5094072818756104,
           Ram: -1.5740591287612915,
           than Shyam: -1.5106748342514038,
           Shyam: -1.783818006515503,
           they: -1.5725568532943726,
           friends: 2.009723663330078}}

New output ("than Shyam" excluded):

[Ram: [Ram, Ram], Shyam: [Shyam, Shyam]]
{Ram: {Ram: 1.775342583656311},
 Ram and Shyam: {Ram and Shyam: 1.7629910707473755, Ram: -1.5760746002197266},
 Shyam: {Ram: -1.5397844314575195,
         Ram and Shyam: -1.5207990407943726,
         Shyam: 1.6113454103469849},
 good boys: {Ram: -1.5991358757019043,
             Ram and Shyam: -1.5002236366271973,
             Shyam: -1.602735996246338,
             good boys: 1.7384239435195923},
 Ram: {Ram: 7.543191909790039,
       Ram and Shyam: -0.8214647769927979,
       Shyam: -1.6146637201309204,
       good boys: -1.5146090984344482,
       Ram: 1.5892621278762817},
 Shyam: {Ram: -1.578922986984253,
         Ram and Shyam: -0.6316158771514893,
         Shyam: 7.046931266784668,
         good boys: -1.525830626487732,
         Ram: -1.813422441482544,
         Shyam: 1.1222282648086548},
 they: {Ram: -2.0966665744781494,
        Ram and Shyam: -0.29233384132385254,
        Shyam: -2.266399621963501,
        good boys: -1.5540210008621216,
        Ram: -2.2621068954467773,
        Shyam: -2.6278762817382812,
        they: 0.0765305757522583},
 friends: {Ram: -1.5773955583572388,
           Ram and Shyam: -1.5293686389923096,
           Shyam: -1.721515417098999,
           good boys: -1.5099279880523682,
           Ram: -1.5666728019714355,
           Shyam: -1.809272050857544,
           they: -1.5722771883010864,
           friends: 2.0099644660949707}}

The live demo also doesn't display this mention:

Screenshot from 2020-07-15 13-53-24

@noelslice
Contributor Author

Disclaimer: I'm still not convinced that the logic in extract_mentions_spans and _extract_from_sent is robust, and I'm still working on my understanding of the code. It would help to add some test cases.

@svlandeg
Collaborator

svlandeg commented Sep 7, 2020

Thanks for this PR @noelslice! Looks good to me.
There are definitely parts of the code base that could use more test cases - all contributions welcome!

@svlandeg svlandeg merged commit 18c0f4c into huggingface:master Sep 7, 2020
@noelslice
Contributor Author

Thanks for having a look and merging this in, @svlandeg!
