NER example questions #20
Hi @tuurma, you actually dropped the main feature from the original example, which makes your encoding problematic, as you've observed, namely the presence of |
Concerning your other question, when the annotations are generated from a NERD module such as [http://nerd.huma-num.fr/nerd/], you do want to have the actual identified out made explicit for each instance and then (e.g. using |
@laurentromary Thanks for your comments. Would it be helpful to see and discuss the full example at https://github.com/tuurma/stdfSpec/blob/AnnArbor/Samples/NER-Jon_and_Charles.xml? I skipped the spans initially, but even with them my questions still stand. Let me elaborate below. An important assumption is that this example serves the sole purpose of identifying people in the text. The inline TEI equivalent would be just using e.g. <persName ref="#JH #CH"><w xml:id="w15">brothers</w></persName>
First, as I said I would, I've updated the structure of @tuurma’s example, leaving out entirely the details of how the annotations are actually encoded:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://raw.githubusercontent.com/sydb/stdfSpec/linkDataBlock/Specification/standoff-proposal.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="https://raw.githubusercontent.com/sydb/stdfSpec/linkDataBlock/Specification/standoff-proposal.isosch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns:so="http://www.tei-c.org/proposal/standoff/ns"
xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!-- The header of the main document being annotated -->
<fileDesc>
<titleStmt>
<title>My standoff NER sample</title>
</titleStmt>
<publicationStmt>
<p>Publication Information</p>
</publicationStmt>
<sourceDesc>
<p>Information about the source</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<!-- The text being annotated, possibly segmented (e.g. in <w>) -->
<p>
<w xml:id="w1">Some</w>
<w xml:id="w2">text</w>
<w xml:id="w3">talking</w>
<w xml:id="w4">about</w>
<w xml:id="w5">Jon</w>
<w xml:id="w6">and</w>
<w xml:id="w7">his</w>
<w xml:id="w8">brother</w>
<w xml:id="w9">Charles</w>
<pc>.</pc>
<w xml:id="w10">And</w>
<w xml:id="w11">the</w>
<w xml:id="w12">time</w>
<w xml:id="w13">when</w>
<w xml:id="w14">both</w>
<w xml:id="w15">brothers</w>
<w xml:id="w16">died</w>
<w xml:id="w17">in</w>
<w xml:id="w18">a</w>
<w xml:id="w19">car</w>
<w xml:id="w20">crash</w>
<w xml:id="w21">together</w>
<w xml:id="w22">as</w>
<w xml:id="w23">the</w>
<w xml:id="w24">effect</w>
<w xml:id="w25">of</w>
<w xml:id="w26">Charles</w>
<w xml:id="w27">drunk</w>
<w xml:id="w28">driving</w>
<pc>.</pc>
</p>
</body>
</text>
<TEI>
<teiHeader>
<!-- The header of the annotation part -->
</teiHeader>
<!-- The annotations -->
<so:ldb type="people">
<!-- could just as easily be <so:standOff> -->
<!-- either way, annotations go here pointing into <text>, above -->
</so:ldb>
</TEI>
</TEI>
As for the annotations themselves, I am not sure I understand things well enough to really say definitively how it should be done. But certainly <interp ana="#JH" inst="#w5"/> is not correct. It is the content of <interp inst="#w5"><ptr type="referenceTo" target="#JH"/></interp> makes much more sense to me. Except, of course, that
As for @tuurma’s issues, my first step is to see if I understand them.
So the usual, non-
<interp inst="#w9 #w15 #w26"><desc><ptr type="referenceTo" target="#CH"/></desc></interp>
<link type="reference" target="#CH #w9 #w15 #w26"/>
<relation active="#w9 #w15 #w26" passive="#CH" name="referenceTo"/>
<span target="#w9 #w15 #w26"><ptr type="referenceTo" target="#CH"/></span>
Personally, I like
It seems to me that the encoding would differ depending on whether the reference is a conjunction (refers simultaneously to more than one) or a disjunction (refers to one of several). The usual, non-
<span target="#w15"><ptr type="referenceTo" target="#CH #JH"/></span>
The usual, non-
<w xml:id="w29">It</w>
<w xml:id="w30">was</w>
<w xml:id="w31">one</w>
<w xml:id="w32">of</w>
<w xml:id="w33">them</w>
<w xml:id="w34">that</w>
<w xml:id="w35">had</w>
<w xml:id="w36">written</w>
<w xml:id="w37">an</w>
<w xml:id="w38">anonymous</w>
<w xml:id="w39">letter-to-the-editor</w>
<w xml:id="w40">about</w>
<w xml:id="w41">the</w>
<w xml:id="w42">perils</w>
<w xml:id="w43">of</w>
<w xml:id="w44">DWI</w>
<pc>.</pc>
<altGrp type="persons">
<alt xml:id="CoJ" target="#CH #JH"/>
</altGrp>
<span from="#w31" to="#w33"><ptr type="referenceTo" target="#CoJ"/></span>
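To make the conjunction/disjunction distinction concrete, here is a minimal sketch of how the people block might look if the span-plus-ptr pattern were adopted throughout. It only combines fragments already shown in this thread. The listPerson with its two person entries, and the idea that it may sit inside so:ldb at all, are assumptions for illustration and are not confirmed by the proposal schema.

```xml
<so:ldb type="people"
        xmlns:so="http://www.tei-c.org/proposal/standoff/ns"
        xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Assumption: persons are declared once; the names are illustrative only -->
  <listPerson>
    <person xml:id="JH"><persName>Jon</persName></person>
    <person xml:id="CH"><persName>Charles</persName></person>
  </listPerson>
  <!-- Single referent: w5 ("Jon") refers to JH -->
  <span target="#w5"><ptr type="referenceTo" target="#JH"/></span>
  <!-- Conjunction: w15 ("brothers") refers to both persons at once -->
  <span target="#w15"><ptr type="referenceTo" target="#CH #JH"/></span>
  <!-- Disjunction: w31 to w33 ("one of them") refers to exactly one of the two -->
  <altGrp type="persons">
    <alt xml:id="CoJ" target="#CH #JH"/>
  </altGrp>
  <span from="#w31" to="#w33"><ptr type="referenceTo" target="#CoJ"/></span>
</so:ldb>
```

The only structural difference between the two cases is whether the ptr points at the persons directly or at an alt that groups them, which keeps the span itself identical in shape for every kind of reference.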
I tried to adjust my example (cf. tuurma@1605d6c). Would you say it's closer to what we spoke about? Factored out persons, replaced |
I took the liberty of creating a small example inspired by the NER standoff example here: https://github.com/laurentromary/stdfSpec/blob/AnnArbor/Scenarios/StandOffScenarios.xml#L203-L221
Let's start with a very short text, segmented in w for ease of addressing. I have omitted the spans for specifying targets, as they would be quite superfluous.
Everything looks fine so far. Let's expand the sample text a little to include one more sentence about our protagonists, Jon and Charles:
What shall we do then to cover all references to CH? Create another annotationBlock and duplicate person there? Surely not.
Push multiple targets into @inst?
Slightly better, but it somehow omits the fact that w15, the word brothers, references not only Charles but also his (now late) brother Jon.
Things get even more problematic if we consider having a corpus of multiple TEI resources referencing a shared prosopography.
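For the shared-prosopography case, one possible sketch, assuming the span-plus-ptr style discussed in the comments and assuming that person records are kept once in a separate file: the pointer simply becomes a URI with a fragment identifier. The file name prosopography.xml is hypothetical.

```xml
<!-- Assumption: person records live once in a shared, hypothetical file prosopography.xml -->
<!-- w9 and w26 are the two "Charles" tokens in the full sample text -->
<span target="#w9 #w26">
  <ptr type="referenceTo" target="prosopography.xml#CH"/>
</span>
```

Each TEI resource in the corpus would then carry only its own spans, and nothing about a person would need to be duplicated per document.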
I would like the example in the proposal to cover each of the use cases above in a clear manner, ideally one that is, at least on an abstract level, consistent across different annotation types (e.g. morphological annotation vs. NER vs. events vs. commentaries).
I've opened this issue as a contribution to the standoff discussion started at https://docs.google.com/document/d/1rloyaZzQJQIsBkCC_1BC33lrTQIvZ3FC-YUXL0bpwdk/edit# and would be happy to isolate and similarly discuss the other use case examples presented in https://github.com/laurentromary/stdfSpec/blob/AnnArbor/Scenarios/StandOffScenarios.xml