Referencing implicit/empty/ghost words from span annotation #58

Closed
proycon opened this issue Oct 30, 2018 · 15 comments
Labels
enhancement ready Implemented but not released yet

@proycon
Owner

proycon commented Oct 30, 2018

How to refer to words/tokens that are not actually there?

Discussed as part of proycon/flat#138

@proycon proycon added this to the v2.0 milestone Oct 30, 2018
@proycon proycon self-assigned this Oct 30, 2018
@proycon proycon added the to do staged to be worked on label Dec 19, 2018
@proycon
Owner Author

proycon commented Feb 21, 2019

It is time to pick this up again now that a lot of progress has been made on FoLiA v2. Most of this discussion, with @luutuntin, took place in proycon/flat#138. We need an explicit mechanism to refer to tokens that do not really exist, mostly for syntactic movements.

I have a proposal: introduce a <hiddenw> element (hidden word/token) that may be used just like words/tokens (<w>) but denotes a word/token that is explicitly not part of the original text and therefore does not appear in normal text serialisation. It may, however, be a valid target for <wref>. Example, following the earlier examples in proycon/flat#138 and @luutuntin's nice tree:

[Image: syntax_tree_empty_expletive_subject]

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    <w xml:id="s.1.w.1" space="no">
        <t>Is</t>
        <pos class="BEP" />
    </w>
    <w xml:id="s.1.w.2">
        <t>n't</t>
        <pos class="NEG" />
    </w>
    <w xml:id="s.1.w.3">
        <t>a</t>
        <pos class="D" />
    </w>
    <w xml:id="s.1.w.4">
        <t>whole</t>
        <pos class="ADJ" />
    </w>
    <w xml:id="s.1.w.5">
        <t>lot</t>
        <pos class="N" />
    </w>
    <w xml:id="s.1.w.6" space="no">
        <t>left</t>
        <pos class="VAN" />
    </w>
    <w xml:id="s.1.w.7">
        <t>.</t>
        <pos class="PUNC" />
    </w>
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" />
            </su>
            <su xml:id="s.1.su.3" class="VP">
                <su xml:id="s.1.su.4" class="BEP">
                    <wref id="s.1.w.1" />
                </su>
                <su xml:id="s.1.su.5" class="NEG">
                    <wref id="s.1.w.2" />
                </su>
                <su xml:id="s.1.su.6" class="VP">
                    <su xml:id="s.1.su.7" class="NP=LGS">
                        <wref id="s.1.w.3" />
                        <su xml:id="s.1.su.8" class="ADJP">
                            <wref id="s.1.w.4" />
                        </su>
                        <wref id="s.1.w.5" />
                    </su>
                    <wref id="s.1.w.6" />
                </su>
            </su>
            <su class="PUNC">
                <wref id="s.1.w.7" />
            </su>
        </su>
    </syntax>
</s>
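
Since <hiddenw> is meant to be a valid <wref> target, a quick consistency check on such a sentence can be sketched in Python as below. This is only an illustration using the standard library's ElementTree, not the FoLiA library API: the function name `dangling_wrefs` is hypothetical, the sample is a shortened, namespace-free variant of the example, and one id is deliberately misspelled to show what the check reports.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, namespace-free sample; the last wref id is deliberately misspelled.
SENTENCE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><pos class="EX" /></hiddenw>
    <w xml:id="s.1.w.1"><t>Is</t></w>
    <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ"><wref id="s.1.w.0" /></su>
            <su xml:id="s.1.su.3" class="BEP"><wref id="s.1.w.1" /></su>
            <su xml:id="s.1.su.4" class="VAN"><wref id="s1.w.6" /></su>
        </su>
    </syntax>
</s>
"""


def dangling_wrefs(root):
    """Return the ids of all <wref> elements that match no <w> or <hiddenw>."""
    targets = {el.get(XML_ID) for el in root.iter() if el.tag in ("w", "hiddenw")}
    return [ref.get("id") for ref in root.iter("wref") if ref.get("id") not in targets]


root = ET.fromstring(SENTENCE)
print(dangling_wrefs(root))  # ['s1.w.6']
```

The reference to the hidden token `s.1.w.0` resolves like any other; only the misspelled id is reported.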

The hidden tokens would have their own annotation type and can be bound to a set, which allows for multiple hidden tokenisation layers in case several are needed for different purposes. The <hiddenw> element is a structure element (albeit one that is hidden by default), so it may appear interleaved with the normal tokenisation layer. Existing expressions that operate on words should not be bothered by it, but libraries do need extra code to ensure this element is skipped in text serialisation (and text consistency checks) by default. I don't want to forbid text content (<t>) in the hidden tokens, as there is probably good use for that.
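
The "extra code" a library would need for text serialisation could look roughly like the sketch below. It is a minimal illustration with Python's standard ElementTree, not how any FoLiA library actually implements it: `sentence_text` and its `include_hidden` parameter are hypothetical names, the sample omits the FoLiA namespace, and the handling of `space="no"` is simplified.

```python
import xml.etree.ElementTree as ET

# Namespace-free sample; the hidden token carries text content on purpose.
SENTENCE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><t>*exp*</t></hiddenw>
    <w xml:id="s.1.w.1" space="no"><t>Is</t></w>
    <w xml:id="s.1.w.2"><t>n't</t></w>
    <w xml:id="s.1.w.3"><t>a</t></w>
    <w xml:id="s.1.w.4"><t>whole</t></w>
    <w xml:id="s.1.w.5"><t>lot</t></w>
    <w xml:id="s.1.w.6" space="no"><t>left</t></w>
    <w xml:id="s.1.w.7"><t>.</t></w>
</s>
"""


def sentence_text(sentence, include_hidden=False):
    """Join token texts in document order, skipping <hiddenw> unless asked."""
    token_tags = ("w", "hiddenw") if include_hidden else ("w",)
    parts = []
    for child in sentence:
        if child.tag not in token_tags:
            continue  # hidden tokens and span layers do not contribute text
        parts.append(child.findtext("t", default=""))
        if child.get("space", "yes") != "no":
            parts.append(" ")
    return "".join(parts).strip()


sentence = ET.fromstring(SENTENCE)
print(sentence_text(sentence))                       # Isn't a whole lot left.
print(sentence_text(sentence, include_hidden=True))  # *exp* Isn't a whole lot left.
```

Making the hidden tokens opt-in keeps the default serialisation identical to the original text, while their <t> content stays available for tools that want to display it.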

Thoughts and comments welcome!

@proycon proycon added in progress and removed to do staged to be worked on labels Feb 21, 2019
@luutuntin

luutuntin commented Feb 23, 2019

I'm happy with your proposal. I also agree that we shouldn't forbid text content (<t>) in the hidden tokens. For instance, we can use it in our example as below:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    ...
</s>

When you mentioned that the hidden tokens would have their own annotation type, do you mean that hidden tokens can have the same token annotations as word tokens (<w>) do (such as part-of-speech, lemma, language identification, lexical semantic sense, domain, subjectivity), and will also have additional (hidden) token annotations such as annotation layer?

@proycon
Owner Author

proycon commented Feb 27, 2019

do you mean that hidden tokens can have the same token annotations as word tokens (<w>) do (such as part-of-speech, lemma, language identification, lexical semantic sense, domain, subjectivity)

Yes

and will also have additional (hidden) token annotations such as annotation layer?

I rather meant that hidden tokens would be a new specific annotation type itself and needs to be declared. There's no annotation layer associated with this type, as it's a structural element rather than a span element.

proycon added a commit that referenced this issue Feb 27, 2019
…enough to omit it from any text serialisation and consistency checks #58
@luutuntin

luutuntin commented Feb 27, 2019

Thank you. What I mean by "additional (hidden) token annotations such as annotation layer" is, for example:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer='syntax'>
    </hiddenw>
    ...
</s>

@proycon
Owner Author

proycon commented Feb 27, 2019

The syntax annotation layer would be embedded in <s> and refers back to <hiddenw> using the normal <wref> mechanism. If you want to make explicit that the hidden token is a syntactic one, just invent a class for it in some set, something like:

<hiddenw class="syntactic">

@luutuntin

I see. So the example above should look like this, right?

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer class='syntax'>
    </hiddenw>
    ...
</s>

proycon added a commit that referenced this issue Feb 27, 2019
proycon/flat#138 and serving as an example and test for hidden tokens (#58)
@proycon
Owner Author

proycon commented Feb 27, 2019

no, like this:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0" class="syntax">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer class='syntax'>
    </hiddenw>
</s>

"syntax" is then just a custom-defined class in a set which you use for hidden tokens (you could also opt for something more specific like exp as a class, it's up to you). The set just needs to be declared in the document metadata:

 <hiddentoken-annotation set="http://wherever/the/set/definition/is/if/it/exists/at/all" />
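
A tool could check for that declaration before accepting classed hidden tokens. The sketch below shows one possible shape, under loud assumptions: the wrapping <FoLiA>, <metadata>, <annotations> and <text> elements are a simplified, namespace-free stand-in for a real document, and `undeclared_hidden_tokens` is a hypothetical helper, not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a document; the real layout and namespace are omitted.
DECLARED = """
<FoLiA>
    <metadata>
        <annotations>
            <hiddentoken-annotation set="http://wherever/the/set/definition/is/if/it/exists/at/all" />
        </annotations>
    </metadata>
    <text>
        <s><hiddenw class="syntax"><t>*exp*</t></hiddenw></s>
    </text>
</FoLiA>
"""

# Same document without the declaration.
UNDECLARED = """
<FoLiA>
    <metadata><annotations /></metadata>
    <text>
        <s><hiddenw class="syntax"><t>*exp*</t></hiddenw></s>
    </text>
</FoLiA>
"""


def undeclared_hidden_tokens(root):
    """Return classed <hiddenw> elements when no set is declared for them."""
    declared = any(d.get("set") for d in root.iter("hiddentoken-annotation"))
    if declared:
        return []
    return [h for h in root.iter("hiddenw") if h.get("class")]


print(len(undeclared_hidden_tokens(ET.fromstring(DECLARED))))    # 0
print(len(undeclared_hidden_tokens(ET.fromstring(UNDECLARED))))  # 1
```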

proycon added a commit that referenced this issue Feb 27, 2019
@luutuntin

luutuntin commented Feb 27, 2019

But then we don't need <annotation_layer class='syntax'>, I assume:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0" class='syntax'>
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <!--annotation_layer='syntax'-->
    </hiddenw>
    ...
</s>

My next questions (which are not critical now) are how we deal with the following cases:

  • when the same hidden token is used in different annotation layers (as I don't think we can have <hiddenw class="syntax" class="semantic_role">)?
  • when we want to annotate a hidden token in different aspects, for example the annotation layer it belongs to and some other aspect we haven't thought of yet? (Do we mix these categories in the set of hiddentoken-annotation?)

@proycon
Owner Author

proycon commented Feb 27, 2019

Ah yes, sorry, I accidentally left in <annotation_layer class='syntax'> when copy-pasting your example. That should indeed go away, as it is not even FoLiA. :)

when the same hidden token is used in different annotation layers (as I don't think we can have <hiddenw class="syntax" class="semantic_role">)?

Multiple classes are not allowed indeed. But you can refer to the same token from multiple span annotation layers using <wref>, that's no problem. Alternatively, you can use multiple hidden token layers in different sets (but I wouldn't really recommend that as it makes things needlessly complex).

The fact you refer to a hidden token from a syntax layer should already make clear that it has a role in syntax, so putting classes on the <hiddenw> itself may already be overkill, see the following example:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    ...
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" /> <!-- reference to the hiddenw -->
            </su>
            ...
     </syntax>
      ..
     <semroles>
        <predicate xml:id="s.1.pr.1">
          <semrole xml:id="s.1.pr.1.r.1" class="AGENT">
             <wref id="s.1.w.0" /> <!-- another reference to the hiddenw -->
          </semrole>
          ...
        </predicate>
     </semroles>
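
To see that the role of a hidden token can be recovered from the layers that point at it, without any class on <hiddenw> itself, here is a small Python sketch using the standard ElementTree. The function name `referencing_spans` is hypothetical and the sample is a completed, namespace-free version of the fragment above (the elided parts are simply left out).

```python
import xml.etree.ElementTree as ET

# Completed, namespace-free version of the fragment; elided parts are omitted.
SENTENCE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><t>*exp*</t><pos class="EX" /></hiddenw>
    <w xml:id="s.1.w.1"><t>Is</t></w>
    <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ"><wref id="s.1.w.0" /></su>
        </su>
    </syntax>
    <semroles>
        <predicate xml:id="s.1.pr.1">
            <semrole xml:id="s.1.pr.1.r.1" class="AGENT"><wref id="s.1.w.0" /></semrole>
        </predicate>
    </semroles>
</s>
"""


def referencing_spans(sentence, token_id):
    """Map each layer tag to the classes of spans directly referencing token_id."""
    result = {}
    for layer in sentence:
        if layer.tag in ("w", "hiddenw"):
            continue  # tokens themselves are not annotation layers
        for span in layer.iter():
            if any(c.tag == "wref" and c.get("id") == token_id for c in span):
                result.setdefault(layer.tag, []).append(span.get("class"))
    return result


root = ET.fromstring(SENTENCE)
print(referencing_spans(root, "s.1.w.0"))
# {'syntax': ['NP-SBJ'], 'semroles': ['AGENT']}
```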

@luutuntin

Thank you.

proycon added a commit to proycon/foliapy that referenced this issue Feb 27, 2019
@proycon proycon added ready Implemented but not released yet and removed in progress labels Feb 28, 2019
@luutuntin

luutuntin commented Mar 10, 2019

Today, I just reviewed FoLiA documentation (PDF) - section 2.10.8 Corrections. I can see that, for example, an insertion can be a solution for introducing a hidden word/token. Is there any reason that makes this not a good solution?

@proycon
Owner Author

proycon commented Mar 13, 2019

That would not be an appropriate solution for hidden words, because words inserted through a correction are not explicitly hidden words. Corrections really express "the old situation was wrong, this is how it was and this is how it should be instead", which is semantically different from what you want with hidden tokens.

@luutuntin

Thank you.

@proycon proycon closed this as completed Mar 13, 2019
@proycon
Owner Author

proycon commented Mar 14, 2019

This is now released as proposed with FoLiA v2.0.0 and documented here as part of the new FoLiA documentation: https://folia.readthedocs.io/en/latest/hiddentoken_annotation.html

@luutuntin

Great.
