
Prepare evaluation material for entity detection (through analiticcl) #1

proycon opened this issue Aug 17, 2021 · 18 comments

proycon commented Aug 17, 2021

We need some evaluation material to ascertain how well analiticcl performs in combination with the chosen input lexicons. In order to do that we need a simple gold standard that marks the different kinds of entities we are interested in (names, locations, objects). A full annotation task to build a representative gold standard is probably too much work at this stage, so we'll have to settle for annotating a few scans to get an initial impression of the performance to steer development.

We can either evaluate analiticcl and the input lexicons by themselves at a lower level, or evaluate the wider pipeline (with @brambg's components). The latter is probably the best option. In that case the input would be PageXML:

  1. select a few PageXML scans from the HTR data that contain different kinds of entities (names, locations, objects)
  2. annotate all entities, marking the following:
  • the category (something simple like per, loc, obj probably suffices)
  • the normalised/corrected spelling

We can go about this in two ways:

  1. Have annotators use a text editor and directly edit the PageXML (it is human-readable enough), using some agreed-upon XML tags. Syntax can be as simple as elements like obj, per and loc and a norm attribute:
een geverft ront <obj norm="tafeltje">taeffeltje</obj>

This is rather ad-hoc, but a simple and quick solution. It does however require a basic amount of technical expertise from the annotators, who need to be comfortable using a text editor to edit XML. The output will be some ad-hoc Page-XML hybrid (a small parsing sketch follows below).

  2. Set up a proper annotation environment using FLAT (as discussed in the meeting today). This implies the following:
  • We need to convert the PageXML to FoLiA, for which we already have a tool, FoLiA-page, available in foliautils, but this tool does need some further adaptation for our purposes.
  • FLAT requires tokenisation of the data (which can be done by ucto), which is somewhat at odds with the kind of task we are doing, where the tokenisation is uncertain and HTR errors complicate matters.
  • We need to prepare a FLAT instance for our annotation task.
  • (any conversion back to an ad-hoc enhanced PageXML would be fairly senseless in this scenario)

This is less ad-hoc, uses well-defined formats and tools and will be more accessible to annotators. But it does come at the cost of increased complexity and extra overhead (=extra work/time) to set up. It might be overkill for our purposes.

In both scenarios, entities that span a linebreak (or are hyphenated) will pose an additional challenge.
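
For the ad-hoc tagging route (option 1), the gold annotations could later be pulled back out with a few lines of Python. A minimal sketch, assuming well-formed, non-nested per/loc/obj tags with a norm attribute as in the example above:

```python
import re

# Sketch: extract ad-hoc entity tags (per/loc/obj with a norm attribute) from an
# annotated line, assuming well-formed, non-nested tags as in the example above.
TAG_PATTERN = re.compile(r'<(per|loc|obj)\s+norm="([^"]*)">(.*?)</\1>')

def extract_entities(line):
    """Yield (category, normalised form, original text, offset of the tag in the line)."""
    for match in TAG_PATTERN.finditer(line):
        category, norm, original = match.groups()
        yield category, norm, original, match.start()

example = 'een geverft ront <obj norm="tafeltje">taeffeltje</obj>'
print(list(extract_entities(example)))  # [('obj', 'tafeltje', 'taeffeltje', 17)]
```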

Whatever option we choose, in order to evaluate analiticcl and the wider pipeline, we need an evaluation pipeline (as part of @brambg's pipeline I'd say) that reads the gold standard (either the Page-XML superset or FoLiA XML), compares the system output with it, and produces output metrics (precision/recall/F1).
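
To make that concrete, a minimal sketch of the metric computation, assuming gold standard and system output have already been reduced to comparable entity tuples (the tuple layout here is illustrative, not a fixed format):

```python
# Sketch: precision/recall/F1 over sets of entity tuples, e.g. (scan, offset, text, category).
def precision_recall_f1(gold, system):
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("scan01", 17, "taeffeltje", "obj"), ("scan01", 42, "Amsterdam", "loc")}
system = {("scan01", 17, "taeffeltje", "obj"), ("scan01", 60, "Jan", "per")}
print(precision_recall_f1(gold, system))  # (0.5, 0.5, 0.5)
```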

(Tagging @menzowindhouwer , @HarmTseard, @LvanWissen, @brambg )

proycon changed the title from "Prepare evaluation material for analiticcl" to "Prepare evaluation material for entity detection (through analiticcl)" on Aug 17, 2021

proycon commented Aug 18, 2021

Some issues are already popping up regarding the PageXML->FoLiA conversion; they will be resolved (it takes some time), but the question remains whether we want to go this route.

@HarmTseard Can you check with the people you want to appoint as annotators what their preferences are for this annotation task? (including yourself, of course)


brambg commented Aug 18, 2021

Leon has given me a file with scan "classifiers". The classifiers used are:

  • Akkoord
  • Akte van executeurschap
  • Attestatie
  • Beraad
  • Bestek
  • Bevrachtingscontract
  • Bewijs aan minderjarigen
  • Bijlbrief
  • Bodemerij
  • Boedelinventaris
  • Boedelscheiding
  • Borgtocht
  • Cessie
  • Compagnieschap
  • Consent
  • Contract
  • Conventie (echtscheiding)
  • Huur
  • Huwelijkse voorwaarden
  • Hypotheek
  • Insinuatie
  • Interrogatie
  • Koop
  • Kwitantie
  • Machtiging
  • Non prejuditie
  • Obligatie
  • Onbekend
  • Overig
  • Procuratie
  • Renunciatie
  • Revocatie
  • Scheepsverklaring
  • Schenking
  • Testament
  • Transport
  • Trouwbelofte
  • Uitspraak
  • Voogdij
  • Wisselprotest

Would it be helpful if I assembled a set of PageXML for manual annotation based on some of the classifiers, like "Boedelinventaris", "Huwelijkse voorwaarden" or "Contract"?


proycon commented Aug 18, 2021

That sounds like a good way to select some diverse material, yes; we'd best consult with Harm, who knows more about the actual contents and will be working on the annotation.


proycon commented Aug 19, 2021

The PageXML to FoLiA converter (FoLiA-page) is ready now.


proycon commented Aug 20, 2021

We also have a third option that avoids the FoLiA overhead (which I'm skeptical about for this task) but still gives us an annotation tool: I'm looking into doccano as a hopefully more lightweight alternative. We still have overhead and conversion issues, but this time to a simpler CoNLL format.
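
For reference, a minimal sketch of what such a conversion to CoNLL-style BIO lines could look like (the token offsets and span data are made up for illustration; the exact import format doccano expects would still need checking):

```python
# Sketch: write tokens with BIO entity labels in a two-column CoNLL-style format.
def to_conll(tokens, spans):
    """tokens: list of (token, start offset); spans: list of (start, end, label)."""
    lines = []
    for token, start in tokens:
        end = start + len(token)
        label = "O"
        for span_start, span_end, span_label in spans:
            if start >= span_start and end <= span_end:
                label = ("B-" if start == span_start else "I-") + span_label
                break
        lines.append(f"{token}\t{label}")
    return "\n".join(lines)

tokens = [("een", 0), ("geverft", 4), ("ront", 12), ("taeffeltje", 17)]
spans = [(17, 27, "obj")]
print(to_conll(tokens, spans))
```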


LvanWissen commented Aug 20, 2021 via email


proycon commented Aug 20, 2021

I don't know that one yet, but I'm open to anything if we have a convenient way of getting the PageXML into whatever the tool needs, and some useful output format (ideally web annotations directly on the PageXML, as that is what our system outputs too; the evaluation pipeline would then compare web annotations). Preferably with the least amount of overhead (that's what's keeping me from recommending the FoLiA approach).


proycon commented Aug 20, 2021

I tried both recogito and doccano with plain text input, and both work pretty well and are easy to use. Doccano doesn't seem to allow an open vocabulary (needed for correcting words); Recogito seems more suitable. This is less overhead than FLAT and FoLiA, so I suggest we go this way.

For Recogito, we'd only need something that interprets the results (it can export a simple CSV or full web annotations), and we'd need to convert the PageXML to plain text (one TextLine per line should do?) for input, plus perhaps some minimal bookkeeping to relate the offsets back.
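
A minimal sketch of such an extraction, assuming standard PageXML with TextLine elements carrying a line-level TextEquiv/Unicode (the mapping structure is just an illustration of the offset bookkeeping):

```python
import xml.etree.ElementTree as ET

def line_text(textline):
    """Return the text of a TextLine's own (line-level) TextEquiv/Unicode, if any."""
    for child in textline:
        if child.tag.endswith("}TextEquiv"):
            for sub in child:
                if sub.tag.endswith("}Unicode"):
                    return sub.text or ""
    return ""

def pagexml_to_text(path):
    """Extract one plain-text line per TextLine, plus offset bookkeeping per line id."""
    lines, mapping, offset = [], [], 0
    for element in ET.parse(path).iter():
        if element.tag.endswith("}TextLine"):
            text = line_text(element)
            mapping.append({"line_id": element.get("id"),
                            "start": offset, "end": offset + len(text)})
            lines.append(text)
            offset += len(text) + 1  # +1 for the newline separator
    return "\n".join(lines), mapping
```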

With Recogito we can use the "tag" field for a label like "per", "obj", "loc", and the comment field for the corrected spelling.


brambg commented Sep 3, 2021

Based on the extracted text of the HTR of 2408, I've made 2 sample versions:

All in one:
Pro: This has the Re-Apply advantage: tag one term, and Recogito notifies you that there are more occurrences of that term, with the option of re-applying the annotation to those occurrences (which takes a while).
Con: The document is large, and you'd have to add markers in the text (before uploading) to indicate where each page starts.

Per page:
Pro: easy to pick the page
Con: when uploading multiple files, the order is not guaranteed. You'd have to manually fix the order of the Document Parts in the Metadata tab.

The script that extracts the text from the PageXML also saves the necessary metadata to be able to map the Recogito CSV export back to the original position in the PageXML.
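
A sketch of what that mapping back could look like, assuming the export exposes a character offset per annotation (the column names used here are placeholders and would need to be checked against the actual Recogito export):

```python
import csv

def resolve(offset, mapping):
    """Map a character offset in the extracted text back to a PageXML line id.
    mapping: list of {"line_id", "start", "end"} dicts saved during extraction."""
    for entry in mapping:
        if entry["start"] <= offset < entry["end"]:
            return entry["line_id"], offset - entry["start"]
    return None, None

def read_annotations(csv_path, mapping):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Placeholder column names; adjust to the real Recogito CSV export.
            line_id, line_offset = resolve(int(row["CHAR_OFFSET"]), mapping)
            yield line_id, line_offset, row.get("QUOTE"), row.get("TAGS"), row.get("COMMENTS")
```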

We probably want a unit size smaller than the whole archive, but larger than an individual page.


LvanWissen commented Sep 3, 2021

There's an option in between: we can segment it per notarial deed, so per probate inventory. For every deed in the archive (e.g. https://archief.amsterdam/indexen/deeds/9d6d21e3-1fe7-666d-e053-b784100a1840) we know on which scans the deed is located (e.g. A16098000031 - A16098000044).

In principle, there is information at VeleHanden on where the deeds are segmented on the scan, but this information is not published.


brambg commented Sep 6, 2021

That would be a good-sized segmentation. How can I get the information linking scans to deeds?


LvanWissen commented Sep 6, 2021

Query: https://api.triplydb.com/s/sHNKX0vs6 (only for deed types that contain 'boedel')

Result: na_boedel_deeds_scans20210906.csv
Please keep the deed URI (or at least the uuid in it) as identifier for each individual deed!
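
For reference, a minimal sketch of grouping the scans per deed from that CSV, keeping the deed URI as the key (the column names deed and scan are guesses and should be adjusted to the actual header):

```python
import csv
from collections import defaultdict

def scans_per_deed(csv_path):
    """Group scan identifiers per deed URI from na_boedel_deeds_scans20210906.csv.
    The column names "deed" and "scan" are guesses; check the actual CSV header."""
    deeds = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            deeds[row["deed"]].append(row["scan"])
    return deeds
```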

Should we already try to match the names and location descriptions from the index to the HTR?


brambg commented Sep 6, 2021

Should we already try to match the names and location descriptions from the index to the HTR?

Is this the "Tag de tekst" index you're referring to? Sure, if this data is available in a suitable format I wouldn't mind doing an Analiticcl evaluation with it.


brambg commented Sep 6, 2021

For the precision/recall/F1 metrics we need to compare the sets of recognized entities from the Recogito export to those produced by Analiticcl.

I see 2 properties to match the entities on (other than the original words + their location):

  • the person/location/object category
  • the normalised/corrected spelling

I'm assuming we should separate these 2 properties when comparing, so we get 2 sets of metrics:

  • one set where we match the entities based on the category assigned (+ original words & location)
  • one set where we match the entities based on the normalised/corrected spelling (+ original words & location)

Two sets, because we would be using the metrics to evaluate two separate functions of Analiticcl.
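
Roughly, a sketch of how the two comparisons could be derived from the same annotations (the field names are illustrative; the actual data would come from the Recogito export and the Analiticcl output):

```python
# Sketch: two evaluations from the same entity records, keyed on category vs. normalisation.
def category_key(entity):
    return (entity["scan"], entity["offset"], entity["text"], entity["category"])

def normalisation_key(entity):
    return (entity["scan"], entity["offset"], entity["text"], entity["norm"])

def evaluate(gold, system, key):
    gold_set = {key(e) for e in gold}
    system_set = {key(e) for e in system}
    tp = len(gold_set & system_set)
    precision = tp / len(system_set) if system_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# detection_metrics = evaluate(gold, system, category_key)
# correction_metrics = evaluate(gold, system, normalisation_key)
```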

Does this make sense?


proycon commented Sep 6, 2021

Yes, we want two separate evaluations indeed. This would correspond to the detection and correction stages.

@LvanWissen

Should we already try to match the names and location descriptions from the index to the HTR?

Is this the "Tag de tekst" index you're referring to? Sure, if this data is available in a suitable format I wouldn't mind doing an Analiticcl evaluation with it.

No, this is not yet available (?). What we already have are the person names and location descriptions from the index itself. Example: https://api.triplydb.com/s/eUOYpzVZB. Included in https://github.com/knaw-huc/golden-agents-htr/tree/master/resources (see README, personnamesNA_20201204.csv).

For pre-tagging purposes: can we use TICCLAT (the lexica) for this, if there is information on (art historical) objects available in there?


proycon commented Sep 7, 2021

Yes, we can use TICCLAT in analiticcl as a variant list; it's just a mapping of words to spelling variants and has no specific domain, so I'm sure there are objects in it too.

The whole idea of this evaluation pipeline is that we can try multiple input lexicons, variant lists, analiticcl parameters, etc, and have some kind of ground truth to measure what works best for us.
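
For reference, a rough sketch of what such a run could look like with the analiticcl Python binding (file paths are placeholders, and the exact call for loading a variant list like TICCLAT should be checked against the analiticcl documentation):

```python
from analiticcl import VariantModel, Weights, SearchParameters

# Placeholder file paths; weights and search parameters would be tuned per experiment.
model = VariantModel("alphabet.tsv", Weights(), debug=False)
model.read_lexicon("objects_lexicon.tsv")    # e.g. an object-term lexicon
model.read_variants("ticclat_variants.tsv")  # variant list; check the exact method/arguments
model.build()

for match in model.find_variants("taeffeltje", SearchParameters(max_edit_distance=3)):
    print(match)
```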


LvanWissen commented Sep 14, 2021

Query: https://api.triplydb.com/s/sHNKX0vs6 (only for deed types that contain 'boedel')

Result: na_boedel_deeds_scans20210906.csv
Please keep the deed URI (or at least the uuid in it) as identifier for each individual deed!

Should we already try to match the names and location descriptions from the index to the HTR?

See: https://api.triplydb.com/s/HDqkdD6m9 for deeds, dates and their types. Or here for all:
na_boedel_deeds_dates_20210914.csv

And here for an overview of deeds, dates, types, inventories (incl. temporal coverage) and if they contain 'minuutakten': https://api.triplydb.com/s/iZIuKePza. Or here for all:
na_boedel_deeds_inventory_minuut_20210917.csv
