Proposal for a New Tagging Scheme

James Tauber edited this page Jan 30, 2018 · 20 revisions

NOTE: I'm punting some of this (e.g. anything related to part-of-speech or to aspect/tense). For the stuff that I'm actively considering, see Handling Ambiguity.

Many of these are changes I've wanted to make for a long time; others are more recent ideas.

The beginnings of this proposal came out of discussions with Mike Aubrey at SBL 2014 but should not necessarily be taken as an endorsement by him.

I will make these changes on a branch (although not all at once) so people can see the result but I am still keen for discussion and feedback.


  1. This should be a morphological analysis and not a syntactic or semantic one. Put another way, it should be formal, not functional, and based on the surface form in isolation (other than basic disambiguation). In fact, one of the goals is that the scheme make it easy to produce automated analyses, which humans can then take further with (functional) analysis. Note that I'm not against things like semantically passive verbs being indicated in MorphGNT, just against it being via this set of fields.
  2. Whether a particular parse field is allowed should, wherever possible, be licensed by the part-of-speech tag, not the value of another parse field.
  3. The details below only include changes to the existing data, not new information I am working on such as inflectional classes (which will either live in a separate lexicon or as a new field).
  4. Producing alternative forms of the data, for example in JSON or XML, is outside the scope of this proposal.
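
To make point 2 concrete, here is a minimal sketch of what "licensed by the part-of-speech tag" could look like in code. The table is hypothetical: it uses the tentative VF/VI/VP and NP/N- tags floated in the specific proposals, and the field names and the choice of which fields each tag allows are my assumptions for illustration, not a settled design.

```python
# Hypothetical licensing table: which parse fields each part-of-speech tag
# allows. Tags and field sets are illustrative assumptions only.
LICENSED_FIELDS = {
    "VF": {"person", "tense", "voice", "mood", "number"},      # finite verb
    "VI": {"tense", "voice"},                                  # infinitive
    "VP": {"tense", "voice", "case", "number", "gender"},      # participle
    "N-": {"case", "number", "gender"},                        # common noun
    "NP": {"case", "number", "gender"},                        # proper noun
    "A-": {"case", "number", "gender"},                        # adjective
}


def unlicensed_fields(pos, parse):
    """Return names of fields that carry a value but are not licensed by pos.

    `parse` maps field names to one-letter values, with "-" meaning "no value".
    """
    allowed = LICENSED_FIELDS.get(pos, set())
    return sorted(
        name for name, value in parse.items()
        if value != "-" and name not in allowed
    )


# An infinitive should not need a mood value once the part of speech says VI.
print(unlicensed_fields("VI", {"tense": "A", "voice": "A", "mood": "N"}))
```

The point of the sketch is that the check only ever consults the part-of-speech tag, never the value of another parse field.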

Specific Proposals

  1. Distinguish proper nouns from common ones, based on their capitalization properties (and likely use the part-of-speech field for this; e.g. NP vs N-).
  2. Distinguish finite verbs from infinitives and participles without abusing the mood field for this purpose (and likely use the part-of-speech field for this, per meta-proposal 2; e.g. VF vs VI vs VP).
  3. Treat neuter nouns as being underspecified in the nominative/accusative case and use a new case value "core" (possibly C) for this.
  4. Treat genitive plurals as unmarked for gender. (OPEN QUESTION: just use - for this or something new like U?)
  5. Introduce a value for gender of "non-neuter" to cover cases where the word is underspecified for masculine vs feminine. (OPEN QUESTIONS: which letter to use for this? Something like A for animate? What about cases that are underspecified for masculine vs neuter?)
  6. Treat adjective degree as derivational rather than inflectional (i.e. treat them as separate lexemes and indicate degree in the lexicon not in the morphological analysis). (OPEN QUESTION: should I use a different part-of-speech for this, effectively moving degree to the second character in the part-of-speech field, e.g. A-, AC, AS?)
  7. (Really just follows on from 2) Only allow the values D, I, S, O for mood.
  8. Only use the values A (active) and M (middle) for voice, treating passive as requiring semantic analysis and not treating θη as formally indicating passive. UPDATE: I might start with the less controversial change: P is only used in the future and aorist, while present, imperfect, perfect and pluperfect always use M where there's currently a P.
  9. Separate aspect and tense into two fields, aspect taking on values for perfective, imperfective and resultative/completive, and tense taking on values for past, non-past and future. (OPEN QUESTIONS: letters to use for this?)
  10. OPEN QUESTION: rethinking pronouns.
  11. Only mark vocative where the form is distinct from the nominative; otherwise just mark it as nominative.
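
As a rough illustration of how a few of these could be applied mechanically, the sketch below rewrites one (part-of-speech, parse code) pair according to proposals 2 and 3 and the UPDATE to proposal 8. It assumes the existing layout is a two-character part-of-speech tag plus an eight-character parse code (person, tense, voice, mood, case, number, gender, degree), with N and P in the mood slot for infinitives and participles and the tense letters P, I, F, A, X, Y. The function name `convert` is hypothetical, and the letters VF, VI, VP and C are only the tentative ones suggested above.

```python
# Assumed order of the existing eight-character parse code.
FIELDS = ("person", "tense", "voice", "mood", "case", "number", "gender", "degree")


def convert(pos, parse):
    """Apply proposals 2, 3 and the UPDATE to 8 to one (pos, parse) pair."""
    f = dict(zip(FIELDS, parse))

    # Proposal 2: move finite/infinitive/participle out of the mood field
    # and into the part-of-speech tag.
    if pos == "V-":
        if f["mood"] == "N":
            pos, f["mood"] = "VI", "-"
        elif f["mood"] == "P":
            pos, f["mood"] = "VP", "-"
        else:
            pos = "VF"

    # Proposal 3: neuter nouns in the nominative or accusative become "core".
    if pos.startswith("N") and f["gender"] == "N" and f["case"] in ("N", "A"):
        f["case"] = "C"

    # Proposal 8 UPDATE: P is kept only in the future and aorist; present,
    # imperfect, perfect and pluperfect use M instead.
    if f["voice"] == "P" and f["tense"] in ("P", "I", "X", "Y"):
        f["voice"] = "M"

    return pos, "".join(f[name] for name in FIELDS)


print(convert("V-", "-PPN----"))  # present passive infinitive
print(convert("N-", "----NSN-"))  # neuter nominative singular noun
```

Proposals that need more than the tag itself (for example 11, which depends on comparing the vocative form against the nominative) are deliberately left out of this sketch.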

Open Questions

Repeated from above for convenience.

  1. should unmarked gender use - or something new like U? UPDATE: Now see Proposal for Gender Tagging.
  2. which letter to use for "non-neuter"? Something like A for animate? UPDATE: Now see Proposal for Gender Tagging.
  3. should there also be a gender value of "non-feminine" for cases that are ambiguously masculine/neuter? UPDATE: Now see Proposal for Gender Tagging.
  4. should adjective degree be marked as a sub-type of part-of-speech (e.g. A-, AC, AS)?
  5. which letters should be used for aspects and tenses?
  6. how do we indicate the future infinitive? have a future tense?
  7. how should we do pronouns?