NER example questions #20

Open
tuurma opened this issue Apr 11, 2019 · 5 comments

@tuurma

tuurma commented Apr 11, 2019

I took the liberty of creating a small example inspired by the NER standoff example here: https://github.com/laurentromary/stdfSpec/blob/AnnArbor/Scenarios/StandOffScenarios.xml#L203-L221

Let's start with a very short text, segmented into <w> elements for ease of addressing. I have omitted the spans for specifying targets, since they would be quite superfluous here.

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
     <!-- The header of the main document being annotated -->
      <fileDesc>
         <titleStmt>
            <title>My standoff NER sample</title>
         </titleStmt>
         <publicationStmt>
            <p>Publication Information</p>
         </publicationStmt>
         <sourceDesc>
            <p>Information about the source</p>
         </sourceDesc>
      </fileDesc>
  </teiHeader>
   <standOff>
      <teiHeader>
         <!-- The header of the annotation part -->
      </teiHeader>
      <!-- The annotations -->
      <listAnnotation type="people">
         <annotationBlock>
            <person xml:id="JH">
               <persName>Jon Hamon</persName>
            </person>
            <interp ana="#JH" inst="#w5"/>
         </annotationBlock>
         <annotationBlock>
            <person xml:id="CH">
               <persName>Charles Hamon</persName>
            </person>
            <interp ana="#CH" inst="#w9"/>
         </annotationBlock>
      </listAnnotation>
   </standOff>
  <text>
      <body>
         <!-- The text being annotated, possibly segmented (e.g. in <w>) -->
         <p>
            <w xml:id="w1">Some</w> <w xml:id="w2">text</w> <w xml:id="w3">talking</w> <w xml:id="w4">about</w> <w xml:id="w5">Jon</w> 
            <w xml:id="w6">and</w> <w xml:id="w7">his</w> <w xml:id="w8">brother</w> <w xml:id="w9">Charles</w>.
         </p>
      </body>
  </text>
</TEI>

Everything looks fine so far. Let's expand the sample text a little to include one more sentence about our protagonists, Jon and Charles:

Some text talking about Jon
and his brother Charles. And
the time when both brothers
died in a car crash
together as the effect of
Charles drunk driving.

 <text>
      <body>
         <!-- The text being annotated, possibly segmented (e.g. in <w>) -->
         <p>
            <w xml:id="w1">Some</w> <w xml:id="w2">text</w> <w xml:id="w3">talking</w> <w xml:id="w4">about</w> <w xml:id="w5">Jon</w> 
            <w xml:id="w6">and</w> <w xml:id="w7">his</w> <w xml:id="w8">brother</w> <w xml:id="w9">Charles</w>. <w xml:id="w10">And</w> 
            <w xml:id="w11">the</w> <w xml:id="w12">time</w> <w xml:id="w13">when</w> <w xml:id="w14">both</w> <w xml:id="w15">brothers</w> 
            <w xml:id="w16">died</w> <w xml:id="w17">in</w> <w xml:id="w18">a</w> <w xml:id="w19">car</w> <w xml:id="w20">crash</w> 
            <w xml:id="w21">together</w> <w xml:id="w22">as</w> <w xml:id="w23">the</w> <w xml:id="w24">effect</w> <w xml:id="w25">of</w> 
            <w xml:id="w26">Charles</w> <w xml:id="w27">drunk</w> <w xml:id="w28">driving</w>.
         </p>
      </body>
  </text>

What shall we do then to cover all references to CH? Create another annotationBlock and duplicate person there? Surely not.

      <listAnnotation type="people">
...
         <annotationBlock>
            <person xml:id="JH">
               <persName>Jon Hamon</persName>
            </person>
            <interp ana="#JH" inst="#w5"/>
         </annotationBlock>
         <annotationBlock>
            <person xml:id="CH">
               <persName>Charles Hamon</persName>
            </person>
            <interp ana="#CH" inst="#w9"/>
         </annotationBlock>
         <annotationBlock>
            <person xml:id="CH2">
               <persName>Charles Hamon</persName>
            </person>
            <interp ana="#CH2" inst="#w15"/>
         </annotationBlock>
         <annotationBlock>
            <person xml:id="CH3">
               <persName>Charles Hamon</persName>
            </person>
            <interp ana="#CH3" inst="#w26"/>
         </annotationBlock>
      </listAnnotation>

Push multiple targets into @inst?

     <listAnnotation type="people">
...
         <annotationBlock>
            <person xml:id="CH">
               <persName>Charles Hamon</persName>
            </person>
            <interp ana="#CH" inst="#w9 #w15 #w26"/>
         </annotationBlock>
      </listAnnotation>

Slightly better, but it hides the fact that w15, the word "brothers", references not only Charles but also his (now late) brother Jon.

Things get even more problematic if we consider a corpus of multiple TEI resources referencing a shared prosopography.

I would like the example in the proposal to cover each of the use cases above clearly, ideally in a way that is consistent, at least at an abstract level, across different annotation types (e.g. morphological annotation vs NER vs events vs commentaries).

I've opened this issue as a contribution to the standoff discussion started here: https://docs.google.com/document/d/1rloyaZzQJQIsBkCC_1BC33lrTQIvZ3FC-YUXL0bpwdk/edit# and would be happy to isolate and discuss in the same way the other use case examples presented in https://github.com/laurentromary/stdfSpec/blob/AnnArbor/Scenarios/StandOffScenarios.xml

@laurentromary
Owner

laurentromary commented Apr 13, 2019

Hi @tuurma, you actually dropped the main feature of the original example, namely the <span> elements that point to the source text, and that is what makes your encoding problematic, as you've observed. In conformance with the Web Annotation Data Model (WADM), <span> implements the notion of target, whereas <interp> implements the annotation itself (the hinge between target and body). Done this way (and this is the intended behavior in WADM), you can point to complex, possibly discontinuous targets by using several <span> elements.
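
A minimal sketch of that arrangement for the discontinuous references to Charles, assuming the draft <annotationBlock> accepts <span> children next to <person> and <interp>, and that @inst may point at the spans rather than at the words. The span identifiers (CH.t1, CH.t2) are hypothetical, and this is not copied from the linked scenario file:

```xml
<annotationBlock xmlns="http://www.tei-c.org/ns/1.0">
  <!-- body: the entity being identified -->
  <person xml:id="CH">
    <persName>Charles Hamon</persName>
  </person>
  <!-- targets: one <span> per stretch of the source text (ids are hypothetical) -->
  <span xml:id="CH.t1" target="#w9"/>
  <span xml:id="CH.t2" target="#w26"/>
  <!-- annotation: the hinge between the body and its targets -->
  <interp ana="#CH" inst="#CH.t1 #CH.t2"/>
</annotationBlock>
```

A mention spanning several words could use @from and @to on a single <span> instead of @target.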

@laurentromary
Owner

Concerning your other question: when the annotations are generated by a NERD module such as [http://nerd.huma-num.fr/nerd/], you do want the actual identified output made explicit for each instance, and then (e.g. using @corresp or <idno>) indicate the disambiguated result. This is particularly important because the forms can differ (C. Hamon, Dr. Hamon, etc.) and because you may want to add other features to the output (score, etc.).
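
As a rough illustration of this per-instance pattern, assuming the canonical <person xml:id="CH"> is declared once elsewhere (for example in a <listPerson>), each recognised mention could carry its own surface form and point to the disambiguated entity with @corresp. The mention identifiers and the score values are invented for illustration, and recording the score in @cert is only one possible choice, not something the proposal prescribes:

```xml
<listAnnotation xmlns="http://www.tei-c.org/ns/1.0" type="people">
  <annotationBlock>
    <!-- the form as found at this instance; @corresp names the disambiguated entity -->
    <!-- @cert value is a made-up score, shown only as a placeholder -->
    <person xml:id="m1" corresp="#CH" cert="0.87">
      <persName>Charles</persName>
    </person>
    <span xml:id="m1.t" target="#w9"/>
    <interp ana="#m1" inst="#m1.t"/>
  </annotationBlock>
  <annotationBlock>
    <!-- a second, independent mention of the same entity -->
    <person xml:id="m2" corresp="#CH" cert="0.91">
      <persName>Charles</persName>
    </person>
    <span xml:id="m2.t" target="#w26"/>
    <interp ana="#m2" inst="#m2.t"/>
  </annotationBlock>
</listAnnotation>
```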

@tuurma
Author

tuurma commented Apr 13, 2019

@laurentromary Thanks for your comments. Would it be helpful to look at and discuss the full example? See https://github.com/tuurma/stdfSpec/blob/AnnArbor/Samples/NER-Jon_and_Charles.xml

I've skipped the spans initially but even with them my questions still stand. Let me elaborate below.

An important assumption is that this example serves the sole purpose of identifying people in the text. The inline TEI equivalent would simply be, e.g.:

<persName ref="#JH #CH"><w xml:id="w15">brothers</w></persName>

  1. multiple fragments of the transcription reference the same entity; the example should illustrate how to avoid redundant entries and, ideally, how to link to existing TEI mechanisms for canonical identification of people, including cases where that identification is not part of the document being annotated
  2. text fragment may reference multiple different entities; the example should illustrate how to encode such cases unambiguously
  3. each annotation, whether created and maintained by a human or a software agent, should be able to carry relevant metadata: agent responsible, time of creation/update, etc.
  4. annotations may, but do not have to, reside within the document being annotated; the example (or a variant of it) should illustrate how to create standoff annotations for external documents
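
Points 3 and 4 could be sketched along the following lines, assuming the annotations live in their own file, the annotated text is a separate document called source.xml, and the prosopography is a shared file called people.xml (both names are hypothetical). TEI pointer attributes take URI references, so a pointer can name an external file before the fragment identifier. @resp is an existing TEI attribute for stating responsibility; the sketch assumes the draft <annotationBlock> carries it and that #nerd-tool is declared in the annotation file's header. How a creation or update time would be attached is left open here:

```xml
<listAnnotation xmlns="http://www.tei-c.org/ns/1.0" type="people">
  <!-- @resp points at a hypothetical declaration of the responsible agent -->
  <annotationBlock resp="#nerd-tool">
    <!-- both the entity and the words live outside this file -->
    <interp ana="people.xml#CH" inst="source.xml#w9 source.xml#w26"/>
  </annotationBlock>
</listAnnotation>
```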

@sydb

sydb commented Apr 29, 2019

First, as I said I would, I've updated the structure of @tuurma’s example, leaving out entirely the details of how the annotations are actually encoded:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://raw.githubusercontent.com/sydb/stdfSpec/linkDataBlock/Specification/standoff-proposal.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://raw.githubusercontent.com/sydb/stdfSpec/linkDataBlock/Specification/standoff-proposal.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns:so="http://www.tei-c.org/proposal/standoff/ns"
     xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- The header of the main document being annotated -->
    <fileDesc>
      <titleStmt>
        <title>My standoff NER sample</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- The text being annotated, possibly segmented (e.g. in <w>) -->
      <p>
        <w xml:id="w1">Some</w>
        <w xml:id="w2">text</w>
        <w xml:id="w3">talking</w>
        <w xml:id="w4">about</w>
        <w xml:id="w5">Jon</w> 
        <w xml:id="w6">and</w>
        <w xml:id="w7">his</w>
        <w xml:id="w8">brother</w>
        <w xml:id="w9">Charles</w>
        <pc>.</pc>
        <w xml:id="w10">And</w> 
        <w xml:id="w11">the</w>
        <w xml:id="w12">time</w>
        <w xml:id="w13">when</w>
        <w xml:id="w14">both</w>
        <w xml:id="w15">brothers</w> 
        <w xml:id="w16">died</w>
        <w xml:id="w17">in</w>
        <w xml:id="w18">a</w>
        <w xml:id="w19">car</w>
        <w xml:id="w20">crash</w> 
        <w xml:id="w21">together</w>
        <w xml:id="w22">as</w>
        <w xml:id="w23">the</w>
        <w xml:id="w24">effect</w>
        <w xml:id="w25">of</w> 
        <w xml:id="w26">Charles</w>
        <w xml:id="w27">drunk</w>
        <w xml:id="w28">driving</w>
        <pc>.</pc>
      </p>
    </body>
  </text>
  <TEI>
    <teiHeader>
      <!-- The header of the annotation part -->
    </teiHeader>
    <!-- The annotations -->
    <so:ldb type="people">
      <!-- could just as easily be <so:standOff> -->
      <!-- either way, annotations go here pointing into <text>, above -->
    </so:ldb>
  </TEI>
</TEI>

As for the annotations themselves, I am not sure I understand things well enough to really say definitively how it should be done. But certainly

        <interp ana="#JH" inst="#w5"/>

is not correct. It is the content of <interp> that provides the interpretation, not the value of @ana. (Thus this example implies that the analysis of the interpretation is the <person> Jon Hamon. The intent, of course, is that the analysis of the word(s) pointed to by @inst be the <person> Jon Hamon.)

        <interp inst="#w5"><ptr type="referenceTo" target="#JH"/></interp>

makes much more sense to me. Except, of course, that <ptr> is not valid inside <interp>, meaning that we have to either change the content model or wrap the <ptr> in a <desc>. (Or better yet, use <span>.)

As for @tuurma’s issues, my first step is to see if I understand them.

  1. multiple fragments of the transcription reference the same entity

So the usual, non-<annotationBlock> mechanism would just be something like one of the following.

  <interp inst="#w9 #w15 #w26"><desc><ptr type="referenceTo" target="#CH"/></desc></interp>
  <link type="reference" target="#CH #w9 #w15 #w26"/>
  <relation active="#w9 #w15 #w26" passive="#CH" name="referenceTo"/>
  <span target="#w9 #w15 #w26"><ptr type="referenceTo" target="#CH"/></span>

Personally, I like <relation> and <span>. I dislike <interp>, because the addition of the @inst attribute always seemed silly to me: it just means that <interp> can now do what <span> already does, so we have two ways to do something when (IMHO) one would have done just fine. And I am very cautious about <link>, because it relies on the positionality of the values within @target, which some consider fragile.
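
The positional concern can be illustrated with a small sketch. It assumes a project convention that the first pointer in @target is the entity and the rest are the words; the second <link> breaks that convention and nothing in the markup reveals it, whereas the <relation> form above names each role in @active and @passive:

```xml
<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="reference">
  <!-- assumed convention: first pointer is the entity, the rest are the words -->
  <link target="#CH #w9 #w15 #w26"/>
  <!-- same intent, but the entity comes last; a processor relying on position misreads it -->
  <link target="#w5 #w15 #JH"/>
</linkGrp>
```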

  2. text fragment may reference multiple different entities

Seems to me the encoding would be different depending on whether the reference is a conjunction (refers simultaneously to more than one) or a disjunction (refers to one of several). The usual, non-<annotationBlock> mechanism for a conjunction would just be something like the following.

  <span target="#w15"><ptr type="referenceTo" target="#CH #JH"/></span>

The usual, non-<annotationBlock> mechanism for a disjunction would just be something like the following.

        <w xml:id="w29">It</w>
        <w xml:id="w30">was</w>
        <w xml:id="w31">one</w>
        <w xml:id="w32">of</w>
        <w xml:id="w33">them</w>
        <w xml:id="w34">that</w>
        <w xml:id="w35">had</w>
        <w xml:id="w36">written</w>
        <w xml:id="w37">an</w>
        <w xml:id="w38">anonymous</w>
        <w xml:id="w39">letter-to-the-editor</w>
        <w xml:id="w40">about</w>
        <w xml:id="w41">the</w>
        <w xml:id="w42">perils</w>
        <w xml:id="w43">of</w>
        <w xml:id="w44">DWI</w>
        <pc>.</pc>
        <altGrp type="persons">
          <alt xml:id="CoJ" target="#CH #JH"/>
        </altGrp>
        <span from="#w31" to="#w33"><ptr type="referenceTo" target="#CoJ"/></span>

@tuurma
Author

tuurma commented Apr 29, 2019

Tried to adjust my example (cf tuurma@1605d6c). Would you say it's closer to what we spoke about? I factored out the persons, replaced interp with annotation, and for now grouped them, together with spans that just mark the targets, within annotationBlock, but I'm not sure this is an ideal solution. It would probably make more sense to work with an example closer to real life, as Laurent said: with actual NER output, scoring, etc.
