Relation between annotations and text classes is not explicit #29

proycon · 2017-06-07T11:24:20Z

FoLiA allows multiple text content elements on the same structural element, provided each additional text content element carries a different class. This indicates an alternative text for the same element and is used, for instance, for pre-OCR vs. post-OCR or pre-normalisation vs. post-normalisation distinctions (e.g. by Ticcl), or for transliterations (e.g. a text in Chinese characters as well as pinyin). The standard text class is always current (the only case in which FoLiA predefines a class).

In Nederlab, historical text is modernised: the modernised text is stored in the contemporary text class and the original historical text is in the default current class. The issue now is that they want to annotate both spelling variants. Software such as Frog allows you to specify which text class to use as input, and it is viable to run Frog multiple times, with some post-processing, and add alternative annotations that are based on a different text class input. This is what I have currently implemented, and it works okay.

The problem with this approach, however, is that the relation between annotations and text classes is not explicit. It is now merely a convention in my Nederlab pipeline that the alternatives are based on the historical text, whilst the authoritative annotations are based on the contemporary variant.

This is a limitation in FoLiA that should be thought about and remedied. In FoLiA, annotations are tied to structural elements (e.g. words/tokens) rather than to any particular text surface form (all textual forms are equally valid and describe the same thing). How do we establish a link with a text class?

For morphology/phonology and corrections this issue does not occur, as those explicitly use text content elements; but for normal token annotation and span annotation (wref) the link is not explicit, and an elegant solution needs to be devised. A symptom of this problem is also apparent in the serialisation of the wref/@t attribute, which currently always contains the text from the current layer, even if the span annotation was derived from another text layer.

This issue also encroaches upon another (deliberate) limitation in FoLiA: the general inability to have multiple tokenisations (though there are already some ways around this).


proycon · 2017-06-07T12:31:50Z

An initial proposal for a solution

Introduce a new common attribute textclass on all token & span annotations. If omitted, the value of textclass defaults to current. This ensures backwards compatibility and lets us omit an explicit class assignment by default (and save on verbosity), just as we do with <t class="current"> == <t>.

An example of this on token annotation:

         <w class="WORD" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3">
              <t>aengename</t>
              <t class="contemporary">aangename</t>
              <metric value="lexicon" class="modernisationsource"/>
              <pos head="ADJ" class="ADJ(prenom,basis,met-e,stan)" confidence="0.885728" 
                textclass="contemporary">
                <feat class="prenom" subset="positie"/>
                <feat class="basis" subset="graad"/>
                <feat class="met-e" subset="buiging"/>
                <feat class="stan" subset="naamval"/>
              </pos>
              <lemma class="aangenaam" textclass="contemporary" />
              <morphology>
                <morpheme>
                  <t class="contemporary">aangenaam</t>
                </morpheme>
                <morpheme>
                  <t class="contemporary">e</t>
                </morpheme>
              </morphology>
              <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.1">
                <pos head="N" class="N(soort,ev,basis,zijd,stan)" confidence="0.976563">
                  <feat class="soort" subset="ntype"/>
                  <feat class="ev" subset="getal"/>
                  <feat class="basis" subset="graad"/>
                  <feat class="zijd" subset="genus"/>
                  <feat class="stan" subset="naamval"/>
                </pos>
              </alt>
              <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.2">
                <lemma class="aengename"/>
              </alt>
              <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.altlayers.1">
                <morphology>
                  <morpheme>
                    <t>aengename</t>
                  </morpheme>
                </morphology>
              </altlayers>
            </w>
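
To illustrate how a consumer could read the proposed attribute, here is a minimal sketch in Python using only the standard library. It is not how pynlpl or libfolia do it; the fragment is a simplified, namespace-free version of the word above, and the helper names `effective_textclass` and `text_for` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment modelled on the example above
# (real FoLiA documents use an XML namespace; omitted here for brevity).
FRAGMENT = """
<w xml:id="w.3">
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <pos class="ADJ" textclass="contemporary"/>
  <lemma class="aangenaam" textclass="contemporary"/>
  <alt auth="no" xml:id="w.3.alt.2">
    <lemma class="aengename"/>
  </alt>
</w>
"""


def effective_textclass(annotation):
    """Return the annotation's text class; 'current' when omitted."""
    return annotation.get("textclass", "current")


def text_for(word, textclass):
    """Return the text of the word in the given class, or None if absent."""
    for t in word.findall("t"):  # direct <t> children of the word only
        if t.get("class", "current") == textclass:
            return t.text
    return None


word = ET.fromstring(FRAGMENT)

# Pair every lemma (authoritative or alternative) with the text it was based on.
resolved = [
    (lemma.get("class"), effective_textclass(lemma),
     text_for(word, effective_textclass(lemma)))
    for lemma in word.iter("lemma")
]
for row in resolved:
    print(row)
```

With this fragment, the authoritative lemma resolves to the contemporary text (aangename) and the alternative lemma, which has no textclass, falls back to the current text (aengename).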

And on span annotation (the entity is a false-positive named entity but that's what Frog outputs):

            <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1">
              <entity class="per" confidence="0.444326" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1.entity.3"
                 textclass="contemporary">
                <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleide"/>
              </entity>
            </entities>
            <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2">
              <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1">
                <entity class="per" confidence="0.401266" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1.entity.3">
                  <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleyde"/>
                </entity>
              </entities>
            </altlayers>
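
The intended relation between wref/@t and textclass in the span example can be checked mechanically. The following is a rough sketch, assuming a namespace-free sentence with shortened IDs, and assuming the referenced word carries "Verbleyde" as its current text and "Verbleide" as its contemporary text (the word itself is not shown above). The function name `check_wrefs` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified sentence; IDs are shortened and the <w> content is an assumption.
SENTENCE = """
<s xml:id="s.1">
  <w xml:id="s.1.w.9">
    <t>Verbleyde</t>
    <t class="contemporary">Verbleide</t>
  </w>
  <entities>
    <entity class="per" textclass="contemporary">
      <wref id="s.1.w.9" t="Verbleide"/>
    </entity>
  </entities>
  <altlayers auth="no">
    <entities>
      <entity class="per">
        <wref id="s.1.w.9" t="Verbleyde"/>
      </entity>
    </entities>
  </altlayers>
</s>
"""


def check_wrefs(sentence):
    """List (word id, textclass, t) for every wref whose t attribute does not
    match the referenced word's text in the entity's text class."""
    words = {w.get(XML_ID): w for w in sentence.iter("w")}
    mismatches = []
    for entity in sentence.iter("entity"):
        textclass = entity.get("textclass", "current")  # default when omitted
        for wref in entity.findall("wref"):
            word = words[wref.get("id")]
            texts = {t.get("class", "current"): t.text
                     for t in word.findall("t")}
            if texts.get(textclass) != wref.get("t"):
                mismatches.append((wref.get("id"), textclass, wref.get("t")))
    return mismatches


print(check_wrefs(ET.fromstring(SENTENCE)))
```

An empty list means every wref/@t agrees with the text class of its entity; the current serialisation problem described earlier would show up here as a mismatch on the entity that has a non-default textclass.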

Things to note that are part of this proposal:

The wref/@t attribute now corresponds to the textclass of the entity.
textclass="current" is the default and need not be serialised.
Default rules for unique elements apply; textclass does not make an element unique like set does. So for pos, for example, given a certain set, only one can be authoritative and the rest must be alternatives (in alt), regardless of textclass.
Elements such as morphology, phonology, correction and str are not span/token annotation, so this issue does not apply to them (they explicitly take text content). They do not need, and do not get, a textclass attribute. Whether we want to allow it on certain higher-order elements is debatable: alignment, desc, comment and perhaps even metric are candidates where it might make sense too.
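
The uniqueness rule in the notes above (textclass does not distinguish annotations the way set does) could be sketched as a small validation step. This is only an illustration under simplifying assumptions: namespace-free input, authoritative annotations are the direct children of the word, and a missing set attribute is grouped under a placeholder key. The function names are hypothetical and not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DEFAULT_SET = "(default)"  # placeholder key for annotations without @set


def authoritative_counts(word, tag):
    """Count authoritative annotations of `tag` per set.

    Only direct children of the word are counted; anything inside <alt>
    is an alternative and therefore ignored. textclass is deliberately
    not part of the key.
    """
    return Counter(el.get("set", DEFAULT_SET) for el in word.findall(tag))


def violations(word, tag):
    """Return the sets that have more than one authoritative annotation."""
    return [s for s, n in authoritative_counts(word, tag).items() if n > 1]


# Two authoritative pos annotations that differ only in textclass: not allowed.
BAD = ET.fromstring("""
<w>
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <pos class="ADJ" textclass="contemporary"/>
  <pos class="N"/>
</w>
""")

# The second pos is wrapped in <alt>, as in the proposal's example.
GOOD = ET.fromstring("""
<w>
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <pos class="ADJ" textclass="contemporary"/>
  <alt auth="no"><pos class="N"/></alt>
</w>
""")

print(violations(BAD, "pos"))
print(violations(GOOD, "pos"))
```

The first word is flagged because both pos elements are authoritative within the same set, even though their text classes differ; the second passes because the annotation based on the current text sits in alt.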

proycon · 2017-07-20T14:04:30Z

I think this proposal is accepted. I'll probably put out a forward-compatible v1.4.3 release that already allows for this, so it's not held up by the other, more complicated issues for v1.5.

kosloot · 2017-08-14T08:27:40Z

Implemented the textclass attribute in the libfolia 1.4.3 branch too, and in foliatest.

proycon added enhancement question labels Jun 7, 2017

proycon assigned proycon and kosloot Jun 7, 2017

proycon added the PRIORITY label Jun 7, 2017

proycon added this to the v1.5 milestone Jun 23, 2017

proycon removed the question label Jul 20, 2017

proycon added a commit that referenced this issue Jul 20, 2017

Adding textclass attribute (#29) to FoLiA specification (v1.4.3)

9ef6b52

proycon modified the milestones: v1.4.3, v1.5 Jul 20, 2017

proycon added a commit that referenced this issue Jul 20, 2017

Adding textclass example to example.xml (#29)

c01f616

proycon added a commit to proycon/pynlpl that referenced this issue Jul 20, 2017

Implemented textclass attribute (proycon/folia#29)

29b9824

proycon mentioned this issue Aug 14, 2017

Implement textclass attribute support LanguageMachines/frog#36

Closed

proycon added a commit that referenced this issue Aug 16, 2017

Documented textclass attribute (1.4.3) #29

d487c7c

proycon closed this as completed Aug 16, 2017

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Implemented textclass attribute (proycon/folia#29)

8fc081b
