Misc. annotation errors and/or conversion script bugs #6

Closed
lgessler opened this issue Sep 8, 2021 · 6 comments

lgessler commented Sep 8, 2021

There are some annotations which I'm fairly sure are incorrect and which are choking the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also indicate a bug in conllulex2json.py.

  1. vs mistagged as a noun--should be prep

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

  2. ditto

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

  3. Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:
13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

Relevant span of code:

            if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                ('ADP','P'),('ADV','P'),('SCONJ','P'),
                ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                ('PART','POSS')}:
                # most often, the single-word lexcat should match its upos
                # check a list of exceptions
                mismatchOK = False
                if xpos=='TO' and lc.startswith('INF'):
                    mismatchOK = True
                elif (xpos=='TO')!=lc.startswith('INF'):
                    assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                    mismatchOK = True
  4. Originator as function:

(in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02)
AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

  5. lexcat DISC with ADJ:

AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

  6. "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

  7. "NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})
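
For item 6, rows like this can be spotted before running the converter by scanning the .conllulex columns directly. A minimal sketch, assuming the 19-column layout seen in the snippets above (UPOS in column 4, XPOS in column 5, supersense in column 14). The function name and the "pronoun with a supersense but no PRP$" heuristic are illustrative assumptions, not part of STREUSLE:

```python
def suspicious_pronouns(rows):
    """Return (id, form, xpos, ss) for PRON tokens that carry a
    supersense but are not tagged PRP$.

    Heuristic only; assumes 19 whitespace-separated columns with
    UPOS = col 4, XPOS = col 5, SS = col 14.
    """
    hits = []
    for row in rows:
        cols = row.split()  # works for tab- or space-separated rows
        if len(cols) != 19:
            continue  # skip comments, blank lines, malformed rows
        tok_id, form, upos, xpos, ss = cols[0], cols[1], cols[3], cols[4], cols[13]
        # Backtick labels such as `i are not adposition supersenses.
        has_supersense = ss != "_" and not ss.startswith("`")
        if upos == "PRON" and has_supersense and xpos != "PRP$":
            hits.append((tok_id, form, xpos, ss))
    return hits


# Rows copied from the snippets in this issue.
rows = [
    "1 My my PRON PRP$ _ 2 nmod:poss _ _ _ _ _ SocialRel Gestalt _ _ _ _",
    "4 her she PRON PRP _ 3 iobj _ _ _ _ _ Possessor Possessor _ _ _ _",
    "22 it it PRON PRP _ 21 obj _ _ _ _ _ _ _ _ _ _ _",
]
print(suspicious_pronouns(rows))
```

Only the "her" row is flagged here; "My" is already PRP$ and "it" has no supersense.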

nschneid commented Sep 8, 2021

Thanks for catching these.

1-2. Yes, I would say "vs" is ADP or CCONJ depending on use.

  3. Should be "if I want it to/PART/TO"

  4. Please give more context, I don't know why "you" should be annotated as prepositional

  5. See what discourse particle "like" is tagged as in STREUSLE/EWT? Probably ADV is better than ADJ for the UPOS.
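
On item 3, the quoted span of conllulex2json.py seems to explain why ADP/IN trips the assertion while PART/TO would not. The sketch below re-creates only that quoted branch so the two taggings can be traced side by side. `trace_inf_check` and its return strings are hypothetical, and anything the real script does outside the quoted lines is not modeled:

```python
# Allowed (upos, lexcat) mismatches, copied from the quoted span.
ALLOWED_MISMATCHES = {
    ("NOUN", "N"), ("PROPN", "N"), ("VERB", "V"),
    ("ADP", "P"), ("ADV", "P"), ("SCONJ", "P"),
    ("ADP", "DISC"), ("ADV", "DISC"), ("SCONJ", "DISC"),
    ("PART", "POSS"),
}


def trace_inf_check(upos, xpos, lexcat, lexlemma):
    """Hypothetical helper: report which branch of the quoted check a token reaches."""
    if upos == lexcat or (upos, lexcat) in ALLOWED_MISMATCHES:
        return "no mismatch to check"
    if xpos == "TO" and lexcat.startswith("INF"):
        return "ok: TO with INF lexcat"
    if (xpos == "TO") != lexcat.startswith("INF"):
        # The quoted code asserts here that the token is infinitival "for".
        if upos in ("SCONJ", "ADP") and lexlemma == "for":
            return "ok: infinitival 'for'"
        return "AssertionError"
    return "falls through to checks not quoted in the issue"


# Token 23 as annotated, then with the suggested PART/TO tags.
print(trace_inf_check("ADP", "IN", "INF", "to"))
print(trace_inf_check("PART", "TO", "INF", "to"))
```

As annotated, the token has lexcat INF but xpos IN, so it lands in the branch reserved for infinitival "for" and fails because its lemma is "to". With xpos TO it is accepted by the first exception.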

lgessler commented Sep 8, 2021

re: 4.:

# sent_id = french-c02823ec-60bd-adce-7327-01337eb9d1c8-02
# text = Your point is : being ignorant makes it right ?
1       Your    you     PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       Originator      Originator      _       _       _       _
2       point   point   NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
3       is      be      VERB    VBZ     _       0       root    _       _       _       _       _       _       _       _       _       _       _
4       :       :       PUNCT   :       _       3       punct   _       _       _       _       _       _       _       _       _       _       _
5       being   be      AUX     VBG     _       6       cop     _       _       _       _       _       _       _       _       _       _       _
6       ignorant        ignorant        ADJ     JJ      _       7       nsubj   _       _       _       _       _       _       _       _       _       _       _
7       makes   make    VERB    VBZ     _       3       ccomp   _       _       _       _       _       _       _       _       _       _       _
8       it      it      PRON    PRP     _       7       obj     _       _       _       _       _       _       _       _       _       _       _
9       right   right   ADV     RB      _       7       advmod  _       _       _       _       _       _       _       _       _       _       _
10      ?       ?       PUNCT   .       _       3       punct   _       _       _       _       _       _       _       _       _       _       _

lgessler commented Sep 8, 2021

re: 5 I can't seem to find any in STREUSLE but GUM tags them as UH/INTJ: https://corpling.uis.georgetown.edu/annis/#_q=Imxpa2UiIC4gIiwi&_c=R1VN&cl=5&cr=5&s=0&l=10

and EWT has one example also tagged as UH: https://corpling.uis.georgetown.edu/annis/#_q=Imxpa2UiIC4gIiwi&_c=VURfRW5nbGlzaC1FV1Q&cl=5&cr=5&s=0&l=10

lgessler commented Sep 8, 2021

OK actually on (5), the sentence is just "I was like ...", which GUM consistently tags as RP/ADP when it's a quotative "be like", e.g.:

  • I was like/RP/ADP, I'll call you when I get home

EWT also seems to follow this except for one case where it's UH (results):

  • He was like/UH/INTJ what ???
  • I was like/RP/ADP Ummmm
  • he was like/RP/ADP Oops
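
If more query results need comparing, the tally can be scripted. A small sketch, assuming examples are written in the word/XPOS/UPOS notation used above; `tally_tags` is an illustrative name, and the input is just the three EWT examples listed here:

```python
from collections import Counter


def tally_tags(examples, word="like"):
    """Count (XPOS, UPOS) pairs for `word` in strings annotated as word/XPOS/UPOS."""
    counts = Counter()
    for text in examples:
        for piece in text.split():
            parts = piece.rstrip(",").split("/")
            if len(parts) == 3 and parts[0].lower() == word:
                counts[(parts[1], parts[2])] += 1
    return counts


ewt_examples = [
    "He was like/UH/INTJ what ???",
    "I was like/RP/ADP Ummmm",
    "he was like/RP/ADP Oops",
]
print(tally_tags(ewt_examples))
```

For these three examples RP/ADP is the majority tagging (2 of 3), with UH/INTJ as the single outlier.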

lgessler added a commit to lgessler/pastrie that referenced this issue Sep 8, 2021
(1-2) are actually duplicate issues--there was only one mistagged
lgessler added a commit to lgessler/pastrie that referenced this issue Sep 8, 2021
lgessler added a commit to lgessler/pastrie that referenced this issue Sep 8, 2021
lgessler added a commit to lgessler/pastrie that referenced this issue Sep 8, 2021
not redoing the parse
lgessler added a commit to lgessler/pastrie that referenced this issue Sep 8, 2021
lgessler added a commit to lgessler/pastrie that referenced this issue Sep 8, 2021

lgessler commented Sep 8, 2021

Other instances of "your" are Originator~>Gestalt, so I changed it to that for (4)
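
The constraint behind the error in (4) can be sketched as a check on the role/function pair. This is a hypothetical re-creation, not the validator's actual code: `check_construal` and `ROLE_ONLY` are made-up names, and only p.Originator is attested as role-only by the error message above.

```python
# Labels assumed never to appear as function (ss2).
# Only p.Originator is grounded in the error message from this issue.
ROLE_ONLY = {"p.Originator"}


def check_construal(ss, ss2):
    """Hypothetical check: ss is the scene role, ss2 is the function."""
    if ss2 in ROLE_ONLY:
        raise AssertionError(f"{ss2} should never be function")
    # Render the pair in role~>function notation when the two differ.
    return ss if ss == ss2 else f"{ss}~>{ss2}"


# After the fix: role Originator, function Gestalt.
print(check_construal("p.Originator", "p.Gestalt"))

# Before the fix: Originator used as both role and function.
try:
    check_construal("p.Originator", "p.Originator")
except AssertionError as err:
    print("rejected:", err)
```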

lgessler commented Sep 15, 2021

All errors (1-7) are addressed in #7

nschneid added a commit that referenced this issue Sep 15, 2021