process input in format XML TEI #32

lfoppiano opened this Issue Sep 18, 2017 · 5 comments


3 participants

lfoppiano commented Sep 18, 2017

For reference, some examples/resources related to this:

@lfoppiano lfoppiano self-assigned this Sep 18, 2017




kermitt2 commented Sep 18, 2017

There is an existing implementation here:

The only difficult aspect is the position information of the entities. In that implementation it is handled via standoff annotation on the XML, which assumes that textual elements carry an xml:id (added beforehand in anHALytics) and that a space-preserve attribute is set (one that is not ignored by the other tools, or by the humans of course!).

So one solution is to return both the entities and an augmented "XML" file with xml:id added where required. We also need to add the xml:id to the JSON positions of each entity.




lfoppiano commented Nov 6, 2017

(My previous comments got lost somewhere)

For the first implementation we assume the input XML already has xml:id attributes. That makes input/output simpler to handle while testing consistency.

Regarding the positions, it's pretty tricky. The anHALytics approach is simple and works in most cases; however, I noticed that it might not work when there is a <note> inline in the text. Such notes are neither part of the sentence nor part of the formatting. To be tested.

For the output format, I will start with the JSON output. For the XML output, @laurentromary could you point me to the latest resource/example/use case I can take inspiration from? (I remember it pretty much followed our case with named entities, but I forgot everything about it 😓 )




laurentromary commented Nov 6, 2017

I suggest you give me an example of your output and we work out a possible TEI-XML equivalent. OK?




kermitt2 commented Nov 6, 2017

The xml:id must always be present on the deepest element we want to position with respect to the text. There could be an issue if the annotation span overlaps several elements. To simplify, we can make sure this case never happens -> if we have something like this:

<p> bla1 <note>bla2</note> bla3</p>

we add, for anchoring the offsets:

<p xml:id="id1"> bla1 <note xml:id="id2">bla2</note> bla3</p>

then we process bla1, then bla2, then bla3 separately with entity-fishing -> we never have an annotation over bla1+bla2 or bla2+bla3 and we're safe.
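A minimal sketch of the anchoring step above, using only the Python standard library. The `idN` naming scheme and the helper names `add_ids` and `text_runs` are assumptions made for illustration, not taken from entity-fishing or anHALytics.

```python
import xml.etree.ElementTree as ET

# Clark-notation key that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_ids(root):
    """Give every element lacking an xml:id a generated one (hypothetical 'idN' scheme)."""
    used = {e.get(XML_ID) for e in root.iter() if e.get(XML_ID)}
    counter = 0
    for elem in root.iter():
        if elem.get(XML_ID) is None:
            counter += 1
            # Skip generated names that collide with ids already in the document.
            while f"id{counter}" in used:
                counter += 1
            elem.set(XML_ID, f"id{counter}")
            used.add(f"id{counter}")
    return root


def text_runs(elem):
    """Yield (anchor xml:id, text) for each text run, in document order."""
    if elem.text:
        yield elem.get(XML_ID), elem.text
    for child in elem:
        yield from text_runs(child)
        if child.tail:
            # Text that follows a child element belongs to the parent.
            yield elem.get(XML_ID), child.tail


root = add_ids(ET.fromstring("<p> bla1 <note>bla2</note> bla3</p>"))
augmented = ET.tostring(root, encoding="unicode")
runs = list(text_runs(root))
# runs == [("id1", " bla1 "), ("id2", "bla2"), ("id1", " bla3")]
```

Each run would then be sent to the disambiguation service on its own, so no annotation can cross an element boundary.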

So in the case of an inline <note>:

  • if the annotation is inside the <note>, we use the xml:id associated with the <note>
  • if the annotation is outside the <note>, we use the xml:id of the containing element (e.g. <p>); offsets after the <note> are incremented by the text length of the <note>
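The two rules above could be sketched as follows. This is only an illustration: `locate` is a stand-in that finds a mention by plain string search instead of calling entity-fishing, and the `xmlId` field name in the result is an assumption.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def anchored_runs(elem):
    """Yield (anchor xml:id, start offset within the anchor, text) per text run."""
    anchor = elem.get(XML_ID)
    pos = 0
    if elem.text:
        yield anchor, 0, elem.text
        pos = len(elem.text)
    for child in elem:
        # Runs inside the child are anchored on the child itself.
        yield from anchored_runs(child)
        # Offsets after the child are incremented by the child's text length.
        pos += len("".join(child.itertext()))
        if child.tail:
            yield anchor, pos, child.tail
            pos += len(child.tail)


def locate(root, mention):
    """Stand-in for entity recognition: plain string search in each run."""
    hits = []
    for anchor, start, text in anchored_runs(root):
        i = text.find(mention)
        if i >= 0:
            hits.append({
                "rawName": mention,
                "xmlId": anchor,  # hypothetical field name
                "offsetStart": start + i,
                "offsetEnd": start + i + len(mention),
            })
    return hits


root = ET.fromstring('<p xml:id="id1"> bla1 <note xml:id="id2">bla2</note> bla3</p>')
inside = locate(root, "bla2")   # anchored on the <note>, offsets 0 to 4
outside = locate(root, "bla3")  # anchored on the <p>, offsets 11 to 15
```

The offsets for `bla3` count the four characters of the note, so they match the position in the full text content of the `<p>`.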

If we want to support annotation spans overlapping several elements, I think the solution is to introduce a list of offset positions, similar to the list of bounding boxes in the PDF annotation, but it might be quite complicated to do in the XML world.
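To make the "list of offset positions" idea concrete, here is a rough sketch that splits one span over the flattened text of an element into per-anchor pieces. The function names and the tuple layout are assumptions for illustration only; nothing like this is confirmed to exist in entity-fishing.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def runs(elem, global_pos=0):
    """Yield (anchor, global start, anchor-relative start, text) per text run."""
    anchor = elem.get(XML_ID)
    local = 0
    if elem.text:
        yield anchor, global_pos, 0, elem.text
        local = len(elem.text)
    for child in elem:
        yield from runs(child, global_pos + local)
        local += len("".join(child.itertext()))
        if child.tail:
            yield anchor, global_pos + local, local, child.tail
            local += len(child.tail)


def split_span(root, start, end):
    """Split [start, end) over root's flattened text into (anchor, start, end) pieces."""
    pieces = []
    for anchor, g, local, text in runs(root):
        # Intersect the span with this run, in flattened-text coordinates.
        s, e = max(start, g), min(end, g + len(text))
        if s < e:
            pieces.append((anchor, local + s - g, local + e - g))
    return pieces


root = ET.fromstring('<p xml:id="id1"> bla1 <note xml:id="id2">bla2</note> bla3</p>')
flat = "".join(root.itertext())
start = flat.index("bla2 bla3")
pieces = split_span(root, start, start + len("bla2 bla3"))
# pieces == [("id2", 0, 4), ("id1", 10, 15)]
```

A span that stays inside one element yields a single piece, so the simple case keeps the same shape as the multi-element one.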

Of course, this all works with the xml:space="preserve" constraint - which is apparently not followed by the TEI example of OpenEdition :/

Regarding the output, I assume that so far we are only doing JSON output! TEI output would be a lot of work I think - and if we are addressing monographs, scaling to thousands of TEI standoff annotations might be a practical issue to cover too.




lfoppiano commented Nov 7, 2017

@laurentromary indeed; anyway, for the moment we freeze the TEI output

@kermitt2 OK, so the <note> is treated as a separate sentence, and there won't be any overlapping entity to bother us. That way it should be fine. I think this approximation (no entity crossing tag boundaries) also holds for formatting tags (see example below).

Le problème exposé dans cet ouvrage m’est apparu sur une intuition. Pour avoir travaillé dans des <hi rendition="#style01">startups</hi> puis dans des sociétés de capital-risque, j’ai pu constater que des sommes d’argent considérables affluaient dans les entreprises de logiciel. Par ailleurs, en tant que développeuse logiciel amateur, j’étais bien consciente que je n’aurais rien pu produire toute seule. J’utilisais du code gratuit et public (plus connu sous le nom de <hi rendition="#style01">code open source</hi>) dont j’assemblais des éléments afin de répondre à des objectifs personnels ou commerciaux. Et franchement, les personnes impliquées dans ces projets avaient, quel que soit leur rôle, fait le plus gros du travail.

Formatting tags, however, might be slightly more of a pain in the ass.

Would the fact that we have a sentence, then a single word, then another sentence somehow penalise or affect the mention/link calculation, even with the context updated with the previous entities?

I personally don't much like adding xml:id to formatting tags, but on the other hand... if the alternative is to decide whether to ignore a tag based on whether it is a formatting or a non-formatting tag, it looks pretty hard to predict what could be present in the <p> (i.e. to classify tags as formatting or non-formatting, and hence ignore them or treat them like <note>).

@lfoppiano lfoppiano removed their assignment Jun 13, 2018
