Referencing implicit/empty/ghost words from span annotation #58

proycon · 2018-10-30T10:55:38Z

How to refer to words/tokens that are not actually there?

proycon · 2019-02-21T11:52:32Z

It is time to pick this up again now that a lot of progress has been made on FoLiA v2. Most of this discussion, with @luutuntin, took place in proycon/flat#138. We need an explicit mechanism to refer to tokens that do not really exist, mostly for syntactic movements.

I have a proposal: introduce a <hiddenw> element (hidden word/token) that may be used just like words/tokens (<w>) but which denotes a word/token that is explicitly not part of the original text, and therefore does not appear in normal text serialisation. It may, however, be a valid target for <wref>. Example, following the earlier examples in proycon/flat#138 and @luutuntin's nice tree:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    <w xml:id="s.1.w.1" space="no">
        <t>Is</t>
        <pos class="BEP" />
    </w>
    <w xml:id="s.1.w.2">
        <t>n't</t>
        <pos class="NEG" />
    </w>
    <w xml:id="s.1.w.3">
        <t>a</t>
        <pos class="D" />
    </w>
    <w xml:id="s.1.w.4">
        <t>whole</t>
        <pos class="ADJ" />
    </w>
    <w xml:id="s.1.w.5">
        <t>lot</t>
        <pos class="N" />
    </w>
    <w xml:id="s.1.w.6" space="no">
        <t>left</t>
        <pos class="VAN" />
    </w>
    <w xml:id="s.1.w.7">
        <t>.</t>
        <pos class="PUNC" />
    </w>
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" />
            </su>
            <su xml:id="s.1.su.3" class="VP">
                <su xml:id="s.1.su.4" class="BEP">
                    <wref id="s.1.w.1" />
                </su>
                <su xml:id="s.1.su.5" class="NEG">
                    <wref id="s.1.w.2" />
                </su>
                <su xml:id="s.1.su.6" class="VP">
                    <su xml:id="s.1.su.7" class="NP=LGS">
                        <wref id="s.1.w.3" />
                        <su xml:id="s.1.su.8" class="ADJP">
                            <wref id="s.1.w.4" />
                        </su>
                        <wref id="s.1.w.5" />
                    </su>
                    <wref id="s.1.w.6" />
                </su>
            </su>
            <su class="PUNC">
                <wref id="s.1.w.7" />
            </su>
        </su>
    </syntax>
</s>

The hidden tokens would have their own annotation type and can be bound to a set, which allows for multiple hidden tokenisation layers in case several are needed for different purposes. The <hiddenw> elements are structure elements (albeit hidden by default), so they may appear interleaved with the normal tokenisation layer. Existing expressions that operate on words should not be affected by them, but libraries do need extra code to ensure this element is skipped in text serialisation (and text consistency checks) by default. I don't want to forbid text content (<t>) in the hidden tokens, as there is probably good use for that.
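To make the "skipped in text serialisation by default" behaviour concrete, here is a minimal sketch using only Python's standard library. It is not how any FoLiA library actually implements this: the function name sentence_text and its include_hidden parameter are made up for illustration, and the fragment is a trimmed, namespace-less version of the example above.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the example sentence (no FoLiA namespace, annotations omitted).
SENTENCE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <pos class="EX" />
    </hiddenw>
    <w xml:id="s.1.w.1" space="no"><t>Is</t></w>
    <w xml:id="s.1.w.2"><t>n't</t></w>
    <w xml:id="s.1.w.3"><t>a</t></w>
    <w xml:id="s.1.w.4"><t>whole</t></w>
    <w xml:id="s.1.w.5"><t>lot</t></w>
    <w xml:id="s.1.w.6" space="no"><t>left</t></w>
    <w xml:id="s.1.w.7"><t>.</t></w>
</s>
"""


def sentence_text(s, include_hidden=False):
    """Join token texts, honouring space="no".

    Hypothetical helper: <hiddenw> is skipped unless include_hidden is True,
    even when it carries text content (<t>).
    """
    parts = []
    for token in s:
        if token.tag not in ("w", "hiddenw"):
            continue  # span layers such as <syntax> carry no text of their own
        if token.tag == "hiddenw" and not include_hidden:
            continue  # hidden tokens are not part of the original text
        t = token.find("t")
        if t is None or t.text is None:
            continue
        parts.append(t.text)
        if token.get("space", "yes") != "no":
            parts.append(" ")
    return "".join(parts).strip()


sentence = ET.fromstring(SENTENCE)
print(sentence_text(sentence))
print(sentence_text(sentence, include_hidden=True))
```

The point of the opt-in flag is that the hidden token's text stays available for tools that want it, while the default serialisation still matches the original text.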

Thoughts and comments welcome!

luutuntin · 2019-02-23T23:35:07Z

I'm happy with your proposal. I also agree that we shouldn't forbid text content (<t>) in the hidden tokens. For instance, we can use it in our example as below:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    ...
</s>

When you mentioned that the hidden tokens would have their own annotation type, do you mean that hidden tokens can have the same token annotations as word tokens (<w>) do (such as part-of-speech, lemma, language identification, lexical semantic sense, domain, subjectivity), and will also have additional (hidden) token annotations such as annotation layer?

proycon · 2019-02-27T10:24:29Z

do you mean that hidden tokens can have the same token annotations as word tokens (<w>) do (such as part-of-speech, lemma, language identification, lexical semantic sense, domain, subjectivity)

Yes

and will also have additional (hidden) token annotations such as annotation layer?

I meant rather that hidden tokens would themselves be a new, specific annotation type that needs to be declared. There's no annotation layer associated with this type, as it's a structural element rather than a span element.

luutuntin · 2019-02-27T12:37:36Z

Thank you. What I mean by "additional (hidden) token annotations such as annotation layer" is, for example:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer='syntax'>
    </hiddenw>
    ...
</s>

proycon · 2019-02-27T12:55:10Z

The syntax annotation layer would be embedded in <s> and refers back to <hiddenw> using the normal <wref> mechanism. If you want to make explicit that the hidden token is a syntactic one, just invent a class for it in some set, something like:

<hiddenw class="syntactic">

luutuntin · 2019-02-27T14:09:22Z

I see. So the example above should be like this, right?:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer class='syntax'>
    </hiddenw>
    ...
</s>

proycon · 2019-02-27T14:14:01Z

no, like this:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0" class="syntax">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer class='syntax'>
    </hiddenw>
</s>

"syntax" is then just a custom defined class in a set which you use for hidden tokens (you could also opt for something more especific like exp as a class, it's up to you). The set just needs to be declared in the document metadata:

 <hiddentoken-annotation set="http://wherever/the/set/definition/is/if/it/exists/at/all" />

luutuntin · 2019-02-27T14:49:11Z

But then we don't need <annotation_layer class='syntax'>, I assume:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0" class='syntax'>
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <!--annotation_layer='syntax'-->
    </hiddenw>
    ...
</s>

My next questions (which are not critical now) are how we deal with the following cases:

when the same hidden token is used in different annotation layers (as I don't think we can have <hiddenw class="syntax" class="semantic_role">)?
when we want to annotate a hidden token in different aspects, for example the annotation layer it belongs to and some other aspect we haven't thought of yet? (Do we mix these categories in the set of hiddentoken-annotation?)

proycon · 2019-02-27T16:43:26Z

Ah yes, sorry, I accidentally left in <annotation_layer class='syntax'> when copy-pasting your example. That should indeed go away, as it is not even FoLiA. :)

when the same hidden token is used in different annotation layers (as I don't think we can have <hiddenw class="syntax" class="semantic_role">)?

Multiple classes are not allowed indeed. But you can refer to the same token from multiple span annotation layers using <wref>, that's no problem. Alternatively, you can use multiple hidden token layers in different sets (but I wouldn't really recommend that as it makes things needlessly complex).

The fact you refer to a hidden token from a syntax layer should already make clear that it has a role in syntax, so putting classes on the <hiddenw> itself may already be overkill, see the following example:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    ...
    <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" /> <!-- reference to the hiddenw -->
            </su>
            ...
        </su>
    </syntax>
    ...
    <semroles>
        <predicate xml:id="s.1.pr.1">
            <semrole xml:id="s.1.pr.1.r.1" class="AGENT">
                <wref id="s.1.w.0" /> <!-- another reference to the hiddenw -->
            </semrole>
            ...
        </predicate>
    </semroles>
</s>
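As a rough illustration of how a consumer could work out which layers use a hidden token, the sketch below collects every <wref> that points at a given token id and also reports <wref> ids that resolve to no <w> or <hiddenw>. It uses only Python's standard library; the helper names references_to and dangling_wrefs are hypothetical, and the fragment is a cut-down, namespace-less variant of the example above rather than a complete FoLiA document.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Cut-down variant of the example: one hidden token, two span layers.
SENTENCE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><t>*exp*</t></hiddenw>
    <w xml:id="s.1.w.1"><t>Is</t></w>
    <syntax>
        <su xml:id="s.1.su.2" class="NP-SBJ">
            <wref id="s.1.w.0" />
        </su>
    </syntax>
    <semroles>
        <predicate xml:id="s.1.pr.1">
            <semrole xml:id="s.1.pr.1.r.1" class="AGENT">
                <wref id="s.1.w.0" />
            </semrole>
        </predicate>
    </semroles>
</s>
"""


def references_to(s, token_id):
    """Return (layer tag, class of the enclosing span) for each wref to token_id."""
    found = []
    for layer in s:
        if layer.tag in ("w", "hiddenw"):
            continue  # structure elements, not span layers
        for span in layer.iter():
            for wref in span.findall("wref"):  # direct children only
                if wref.get("id") == token_id:
                    found.append((layer.tag, span.get("class")))
    return found


def dangling_wrefs(s):
    """Return wref ids that match no <w> or <hiddenw> in the sentence."""
    known = {el.get(XML_ID) for el in s.iter() if el.tag in ("w", "hiddenw")}
    return [w.get("id") for w in s.iter("wref") if w.get("id") not in known]


sentence = ET.fromstring(SENTENCE)
print(references_to(sentence, "s.1.w.0"))
print(dangling_wrefs(sentence))
```

Because the role of the hidden token can be read off the layers that reference it, no class on the <hiddenw> itself is needed for this lookup.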

luutuntin · 2019-02-27T16:49:44Z

Thank you.

luutuntin · 2019-03-10T01:46:33Z

Today I reviewed the FoLiA documentation (PDF), section 2.10.8 Corrections. It seems that an insertion, for example, could be a way to introduce a hidden word/token. Is there any reason why this would not be a good solution?

proycon · 2019-03-13T12:49:38Z

That would not be an appropriate solution for hidden words, because those are not explicitly hidden words. Corrections really express "the old situation was wrong, this is how it was and this is how it should be instead", which is semantically different from what you want with hidden tokens.

luutuntin · 2019-03-13T13:13:42Z

Thank you.

proycon · 2019-03-14T11:59:22Z

This is now released as proposed with FoLiA v2.0.0 and documented here as part of the new FoLiA documentation: https://folia.readthedocs.io/en/latest/hiddentoken_annotation.html

luutuntin · 2019-03-14T12:57:35Z

Great.

proycon added the enhancement label Oct 30, 2018

proycon added this to the v2.0 milestone Oct 30, 2018

proycon self-assigned this Oct 30, 2018

proycon mentioned this issue Oct 30, 2018

folialint produces invalid FoLiA out of dubious input LanguageMachines/libfolia#23

Closed

proycon added the to do staged to be worked on label Dec 19, 2018

proycon added in progress and removed to do staged to be worked on labels Feb 21, 2019

proycon added a commit that referenced this issue Feb 27, 2019

Adding hidden words to specification #58

ee83949

proycon added a commit that referenced this issue Feb 27, 2019

Set printable/speakable to false for hiddenword, that hopefully does …

4bb85af

…enough to omit it from any text serialisation and consistency checks #58

proycon added a commit that referenced this issue Feb 27, 2019

Added syntactic movement example, adapted from @luutuntin 's example in

c226701

proycon/flat#138 and serving as an example and test for hidden tokens (#58)

proycon added a commit that referenced this issue Feb 27, 2019

allow hiddenw where w is allowed #58

bf22e8e

proycon added a commit that referenced this issue Feb 27, 2019

added documentation for hiddentoken annotation #58

535f45f

proycon added a commit to proycon/foliapy that referenced this issue Feb 27, 2019

Added a test for hidden words (proycon/folia#58)

20fbe24

proycon added ready Implemented but not released yet and removed in progress labels Feb 28, 2019

proycon closed this as completed Mar 13, 2019

proycon mentioned this issue Mar 18, 2019

Implement hidden words annotation LanguageMachines/libfolia#31

Closed
