[proposal] token annotations on multi-word spans (group annotations) and discussion of other multi-word issues. #51
Comments
A few remarks:
as a side note: |
<w xml:id="w.1">
<t>it</t>
<morphology>
<morpheme class='word'>
<t>it</t>
<pos class="pron" />
<lemma class="it" />
</morpheme>
</morphology>
</w> or no morphemes? <w xml:id="w.33227">
<t class="default">vyt</t>
<lemma class="uit"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="285"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl" class="ADV(bw-deel-ww)"/>
<part class="wordPart" n="1">
<feat subset="deel" class="bw-deel-ww"/>
<feat subset="pos" class="ADV"/>
<feat subset="lemma" class="uit"/>
</part>
</w>
<w xml:id="w.33228">
<t class="default">heeft</t>
<lemma class="hebben"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="213"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl" class="WW(hulp-of-koppel,pv,tgw,met-t)"/>
<part class="wordPart" n="1">
<feat subset="wwtype" class="hulp-of-koppel"/>
<feat subset="wvorm" class="pv"/>
<feat subset="pvtijd" class="tgw"/>
<feat subset="pvagr" class="met-t"/>
<feat subset="pos" class="WW"/>
<feat subset="lemma" class="hebben"/>
</part>
</w>
<w xml:id="w.33229">
<t class="default">gegheven</t>
<lemma class="geven"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="274"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl" class="WW(hoofd,part,met-n,hoofddeel-ww)"/>
<part class="wordPart" n="1">
<feat subset="wwtype" class="hoofd"/>
<feat subset="wvorm" class="part"/>
<feat subset="buiging" class="met-n"/>
<feat subset="deel" class="hoofddeel-ww"/>
<feat subset="pos" class="WW"/>
<feat subset="lemma" class="geven"/>
</part>
</w>
...
<dependencies>
<dependency class="separable-part">
<hd>
<wref id="w.33229" t="gegheven"/>
</hd>
<dep>
<wref id="w.33227" t="vyt"/>
</dep>
</dependency>
</dependencies> |
Thanks for the reactions thus far! @JessedeDoes, in reaction to your points:
In your solution the set for You did: <w xml:id="Corpus-Gysseling-1-1_9464f509-128c-410e-833f-a9439ee8b83a.text.1.body.1.p.1.part.4.w.17">
<t class="default">dyserine.</t>
<lemma class="DE+IJZEREN"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="470+101"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl"
class="LID(bep,zonder)+ADJ(ev,met-e)"/>
<part class="wordPart" n="1">
<feat subset="lwtype" class="bep"/>
<feat subset="buiging" class="zonder"/>
<feat subset="pos" class="LID"/>
<feat subset="lemma" class="DE"/>
</part>
<part class="wordPart" n="2">
<feat subset="getal" class="ev"/>
<feat subset="buiging" class="met-e"/>
<feat subset="pos" class="ADJ"/>
<feat subset="lemma" class="IJZEREN"/>
</part>
</w> It should be: <w xml:id="Corpus-Gysseling-1-1_9464f509-128c-410e-833f-a9439ee8b83a.text.1.body.1.p.1.part.4.w.17">
<t class="default">dyserine.</t>
<lemma class="DE+IJZEREN"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="470+101"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl"
class="LID(bep,zonder)+ADJ(ev,met-e)"/>
<morphology>
<morpheme class="wordPart" n="1">
<pos class="LID">
<feat subset="lwtype" class="bep"/>
<feat subset="buiging" class="zonder"/>
</pos>
<lemma class="DE"/>
</morpheme>
<morpheme class="wordPart" n="2">
<pos class="ADJ">
<feat subset="getal" class="ev"/>
<feat subset="buiging" class="met-e"/>
</pos>
<lemma class="IJZEREN"/>
</morpheme>
</morphology>
</w> (You can optionally also associate a text with the morphemes, like
Good. I don't want to enforce any linguistic theory with FoLiA and leave as much to the user as possible. I simply count all substructure inside tokens/words as morphology (and anything above the word/token level would be syntax). If there are counter-examples where this is grossly inaccurate, I'd be glad to hear them.
<w xml:id="w.1">
<t>it</t>
<pos class="pron" />
<lemma class="it" />
</w> Adding an extra morphology layer with one root morpheme and the same information is unconventional but possible if you insist, but it's rather redundant.
I think dependencies are quite a fair solution here yes to express the relation. @kosloot I'll add a separate comment to address your concerns, otherwise I write too much again :) |
That was my initial thought as well. My concern that led me down the other path was mainly that people might confuse But you have a point; adding a
I don't think this is any less real morphological information :) But I see what you mean. You can just add morphology layers with different sets yes if you want to make conflicting subdivisions.
Yeah, I don't think calling this all morphology is a misnomer, but that's up for the experts to decide.
A complete programming language would be yet another step further :) I don't want to go there either. The idea was that specific operators would be a FoLiA set definition thing, and only the notion of operators and brackets would be a FoLiA thing. Even then, distinguishing operators would only be relevant for things like deep validation, so most tools can be totally oblivious to it (just like many tools are currently oblivious to set definitions altogether and happily work with non-existent sets). It would indeed be specialised external tools that actually deal with this information. It would make expressing more complex vocabularies possible without explicitly enumerating all possibilities (which can blow up in complexity really fast). |
I've been considering this issue a bit more. Although the use for "Group annotations" would then be the term for inline annotations (aka token annotations) on span annotations. One downside is that the RelaxNG schema is not strict enough to take the declarations and the groupannotations attribute into account, but then again, the RelaxNG schema is not strict enough for the current FoLiA either, and validation by a proper validation tool as provided by the FoLiA libraries is always needed. |
…ication (will be constrained by groupannotations="yes" attribute in declaration) (#51)
…nstrain AbstractSpanAnnotation.accepts() based on its value) (proycon/folia#51)
proycon/flat#138 and serving as an example and test for hidden tokens (#51)
This has been implemented as proposed, and documented here: https://folia.readthedocs.io/en/latest/span_annotation_category.html#group-annotations-inline-annotations-on-span-annotations |
Since its inception, FoLiA has made a distinction between annotations on single tokens (or other single structural elements) and annotations made on spans of tokens. These are called token annotations and span annotations respectively. The former are implemented inline, using the natural hierarchy in XML, whereas the latter form a stand-off layer. Each particular annotation type (e.g. lemma, pos, entities, syntax) is implemented as one of these forms. Which form is used depends on the nature of the annotation type.
FoLiA is, by design, limited to a single tokenisation, or no tokenisation at all, in which case actual linguistic annotation abilities are limited. Tokens are represented as <w> (word) elements. How tokenisation should be performed is not prescribed by FoLiA but left to the tokeniser. Whitespace in a token is not prohibited (as long as the token contains more than just whitespace), so the notion of a word or token is a flexible one and the two concepts are not strongly distinguished.
However, it appears that more expressive flexibility is needed, as challenges appear in the situations where:
case of a contraction (e.g. "it's"). This is already largely solved by the morphology layer in FoLiA.
Both are symptoms of the same underlying theme: the lack of atomicity of the token/word. The most straightforward solution would seem to be to retokenise the document, but this is too rigid and not always feasible or desirable. Sometimes maintaining an explicit distinction between tokens, words and groups of words is needed.
Multi-token words
Consider the following FoLiA mock-example of three tokens which together form a compound noun:
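A minimal sketch of such a fragment. The tokens, the sentence wrapper and all IDs are invented purely for illustration:

```xml
<!-- Three separate tokens that together form one compound noun.
     Token texts and xml:id values are hypothetical. -->
<s xml:id="s.1">
  <w xml:id="s.1.w.1"><t>ice</t></w>
  <w xml:id="s.1.w.2"><t>cream</t></w>
  <w xml:id="s.1.w.3"><t>cone</t></w>
</s>
```

Each token is a well-formed `<w>` on its own, but nothing in this fragment says that the three belong together as a single word.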
FoLiA already has facilities to express that a group of tokens forms some type of entity (named or otherwise), or to correct the tokens to a single new one (<correction>). But in cases where all of this is undesirable, where you want to keep the tokens as-is because they were expressed that way in the original, but still want to express that they form a single word with a single part-of-speech tag and lemma, new facilities are needed to use token annotations with spans.
When looking at other formats: NAF makes an explicit distinction between tokens (wordforms) and what it calls terms, and then proceeds to annotate largely (though not always consistently) on the terms rather than the wordforms. An extension of FoLiA is therefore also needed for the NAF-FoLiA converter (see issue cltl/NAFFoLiAPy#4; as such it may also be relevant for @antske).
I propose the following: adding a facility to FoLiA that can group words (like any normal span annotation element, no news here) but that allows for token annotations within its scope.
I think the simplest and least intrusive way to do this is to expand the existing entity annotation, example:
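A rough sketch of what the proposed extension could look like for three hypothetical tokens with IDs s.1.w.1 to s.1.w.3. The class names, lemma value and IDs are assumptions made for illustration, and the fragment follows the proposal as worded here rather than any released FoLiA version:

```xml
<!-- An entity that groups three tokens (ordinary span annotation)
     and, as proposed, carries token annotations for the group. -->
<entities>
  <entity xml:id="s.1.entity.1" class="compound">
    <wref id="s.1.w.1" t="ice"/>
    <wref id="s.1.w.2" t="cream"/>
    <wref id="s.1.w.3" t="cone"/>
    <!-- Token annotations that apply to the group as a whole -->
    <pos class="n"/>
    <lemma class="ice cream cone"/>
  </entity>
</entities>
```

Because the grouping is expressed through `<wref>` references, the referenced tokens do not have to be adjacent.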
This would cover non-contiguous spans just as well.
Such an annotation would be declared in the header as follows:
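A sketch of such a declaration, assuming the usual FoLiA declaration block in the document metadata. The set URL is a placeholder, and the `type` attribute is the proposed addition, not an established FoLiA attribute:

```xml
<!-- Declaration in the document header. The set URL is hypothetical;
     type="complex" is the attribute proposed in this issue. -->
<annotations>
  <entity-annotation set="https://example.org/sets/compounds.foliaset.xml"
                     type="complex"/>
</annotations>
```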
The type attribute is new here and would default to simple, the current behaviour. The value complex is used for the proposed extension, to explicitly denote that we are allowing token annotations on entities. I want this attribute so we can explicitly distinguish the two: documents with the new complex entities pose extra challenges for FoLiA tools, so we want to know from the declaration already whether this will happen.
Alternatives to this solution would be:
wgroup
). The disadvantage is that it may be a bit too similar to entity and the two could be used interchangeably when no further annotations are added.
The motivation for the proposed solution is to keep changes as minimal and simple as possible and not introduce too many new things. Despite the simplicity of the change, it does have quite some implications for the tools and libraries.
I do not propose that other span elements can in turn refer (wref) to entities rather than tokens/words (there are already facilities for doing that anyway), and it would add unnecessary ambiguity.
Non-atomic tokens
In cases where we have a token annotation (e.g. pos, lemma) that cannot be assigned to a single token but only to a part of it, we can use the already existing morphology layer:
Consider the example of the English contraction it's:
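A minimal sketch of the single-token representation, with PoS classes, lemmas and the ID chosen only for illustration (FoLiA does not prescribe the tagset):

```xml
<!-- One token, with pos and lemma assigned per morpheme.
     Class values are hypothetical. -->
<w xml:id="w.1">
  <t>it's</t>
  <morphology>
    <morpheme>
      <t>it</t>
      <pos class="pron"/>
      <lemma class="it"/>
    </morpheme>
    <morpheme>
      <t>'s</t>
      <pos class="v"/>
      <lemma class="be"/>
    </morpheme>
  </morphology>
</w>
```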
Here I want to stress that this is not the only possible representation for this contraction, as we can just as well express it with two tokens as shown in the next example. It's not FoLiA's job to favour one over the other, but that is a decision of the creator/researcher/tokeniser, FoLiA just has to provide the facilities that make both models possible:
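The two-token alternative could be sketched as follows, again with illustrative classes and IDs. The `space="no"` attribute marks that no whitespace follows the first token:

```xml
<!-- Same contraction, tokenised as two words; annotations sit
     directly on each token instead of on morphemes. -->
<w xml:id="w.1" space="no">
  <t>it</t>
  <pos class="pron"/>
  <lemma class="it"/>
</w>
<w xml:id="w.2">
  <t>'s</t>
  <pos class="v"/>
  <lemma class="be"/>
</w>
```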
The morphology notation in FoLiA is very powerful and nestable. Consider the Arabic token فيبيتك. This consists of three words meaning "in your house", transliterated in the example below for ease of reading:
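A sketch of how the nesting might look. The transliteration, the segmentation into morphemes and the class names are assumptions made for illustration, not an authoritative analysis:

```xml
<!-- Nested morphemes: "fi" (in) + [ "bayt" (house) + "ik" (your) ].
     Transliteration and classes are hypothetical. -->
<w xml:id="w.1">
  <t>fibaytik</t>
  <morphology>
    <morpheme>
      <t>fi</t>
      <pos class="prep"/>
    </morpheme>
    <morpheme>
      <t>baytik</t>
      <morpheme>
        <t>bayt</t>
        <pos class="n"/>
      </morpheme>
      <morpheme>
        <t>ik</t>
        <pos class="pron"/>
      </morpheme>
    </morpheme>
  </morphology>
</w>
```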
Morphemes (and phonemes) can explicitly be referred to (like words/tokens) from any span annotation (wref).
My question (mainly for @kdepuydt, @JessedeDoes) is whether this solution is sufficient (it can capture contractions, clitics, etc.) and linguistically accurate enough (e.g. grouping it all under morphology). If there are counter-examples, I'd be very interested.
Compound classes
One point that arises from current annotations in the CRM and Gysseling corpora (historical Dutch) is the use of what I call compound PoS tags and lemmas. Take the Arabic example above: the token itself does not have a PoS tag, but one may want to force a tag anyway and assign something like prep+n+pron. Recall that FoLiA itself does not define the tagset, so this would be valid. However, the semantics of it being some kind of compound class would not be formalised in any way. The question arises whether we need facilities for explicitly representing compound classes. Perhaps we should allow FoLiA set definitions to define operators such as +, allowing for more expressivity in classes. This as opposed to defining operators in FoLiA itself, because that raises the question of which operators are needed, which is more a property of the vocabulary in question. In categorial grammars, for instance, one would want to define / and \. In other vocabularies, perhaps more set-theoretic operators such as ∪ and ∩ make sense. If operators are introduced, then of course bracketing and operator precedence become factors to take into account as well. The class would cease to be a simple reference and allow for a mini-language in its own right, although for many tools this is of no consequence.
I'm not yet including a specific proposal for this, but would very much like to hear your thoughts on this direction.