
Prepare evaluation material for entity detection (through analiticcl) #1

proycon opened this issue Aug 17, 2021 · 18 comments

proycon commented Aug 17, 2021

We need some evaluation material to ascertain how well analiticcl performs in combination with the chosen input lexicons. In order to do that we need a simple gold standard that marks the different kinds of entities we are interested in (names, locations, objects). A full annotation task to build a representative gold standard is probably too much work at this stage, so we'll have to settle for annotating a few scans to get an initial impression of the performance to steer development.

We can either evaluate analiticcl and the input lexicons by themselves at a lower level, or evaluate the wider pipeline (with @brambg's components). The latter is probably the best option. In that case the input would be PageXML:

  1. select a few PageXML scans from the HTR data that contain different kinds of entities (names, locations, objects)
  2. annotate all entities, marking the following:
  • the category (something simple like per, loc, obj probably suffices)
  • the normalised/corrected spelling

We can go about this in two ways:

  1. Have annotators use a text editor and directly edit the PageXML (it is human-readable enough), using some agreed-upon XML tags. Syntax can be as simple as elements like obj, per and loc and a norm attribute:
een geverft ront <obj norm="tafeltje">taeffeltje</obj>

This is rather ad-hoc, but a simple and quick solution. It does however require a basic amount of technical expertise from the annotators, who need to be comfortable using a text editor to edit XML. The output will be some ad-hoc Page-XML hybrid (a small parsing sketch follows below).

  2. Set up a proper annotation environment using FLAT (as discussed in the meeting today). This implies the following:
  • We need to convert the PageXML to FoLiA, for which we already have a tool, FoLiA-page, available in foliautils, but this tool does need some further adaptation for our purposes.
  • FLAT requires tokenisation of the data (which can be done by ucto), which is somewhat at odds with the kind of task we are doing, where the tokenisation is uncertain and HTR errors complicate matters.
  • We need to prepare a FLAT instance for our annotation task.
  • (any conversion back to an ad-hoc enhanced PageXML would be fairly senseless in this scenario)

This is less ad-hoc, uses well-defined formats and tools and will be more accessible to annotators. But it does come at the cost of increased complexity and extra overhead (=extra work/time) to set up. It might be overkill for our purposes.

In both scenarios, entities that span a linebreak (or are hyphenated) will pose an additional challenge.
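
For the ad-hoc tagging route (option 1), the gold annotations could later be pulled back out with a few lines of Python. A minimal sketch, assuming well-formed, non-nested per/loc/obj tags with a norm attribute as in the example above:

```python
import re

# Sketch: extract ad-hoc entity tags (per/loc/obj with a norm attribute) from an
# annotated line, assuming well-formed, non-nested tags as in the example above.
TAG_PATTERN = re.compile(r'<(per|loc|obj)\s+norm="([^"]*)">(.*?)</\1>')

def extract_entities(line):
    """Yield (category, normalised form, original text, offset of the tag in the line)."""
    for match in TAG_PATTERN.finditer(line):
        category, norm, original = match.groups()
        yield category, norm, original, match.start()

example = 'een geverft ront <obj norm="tafeltje">taeffeltje</obj>'
print(list(extract_entities(example)))  # [('obj', 'tafeltje', 'taeffeltje', 17)]
```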

Whatever option we choose, in order to evaluate analiticcl and the wider pipeline, we need an evaluation pipeline (as part of @brambg's pipeline I'd say) that reads the gold standard (either the Page-XML superset or FoLiA XML), compares the system output with it, and produces output metrics (precision/recall/F1).
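
To make that concrete, a minimal sketch of the metric computation, assuming gold standard and system output have already been reduced to comparable entity tuples (the tuple layout here is illustrative, not a fixed format):

```python
# Sketch: precision/recall/F1 over sets of entity tuples, e.g. (scan, offset, text, category).
def precision_recall_f1(gold, system):
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("scan01", 17, "taeffeltje", "obj"), ("scan01", 42, "Amsterdam", "loc")}
system = {("scan01", 17, "taeffeltje", "obj"), ("scan01", 60, "Jan", "per")}
print(precision_recall_f1(gold, system))  # (0.5, 0.5, 0.5)
```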

(Tagging @menzowindhouwer , @HarmTseard, @LvanWissen, @brambg )

proycon changed the title from "Prepare evaluation material for analiticcl" to "Prepare evaluation material for entity detection (through analiticcl)" on Aug 17, 2021

proycon commented Aug 18, 2021

Some issues are already popping up regarding the PageXML->FoLiA conversion; they will be resolved (it takes some time), but the question remains whether we want to go this route.

@HarmTseard Can you check with the people you want to appoint as annotators what their preferences are for this annotation task? (including yourself, of course)


brambg commented Aug 18, 2021

Leon has given me a file with scan "classifiers". The classifiers used are:

  • Akkoord
  • Akte van executeurschap
  • Attestatie
  • Beraad
  • Bestek
  • Bevrachtingscontract
  • Bewijs aan minderjarigen
  • Bijlbrief
  • Bodemerij
  • Boedelinventaris
  • Boedelscheiding
  • Borgtocht
  • Cessie
  • Compagnieschap
  • Consent
  • Contract
  • Conventie (echtscheiding)
  • Huur
  • Huwelijkse voorwaarden
  • Hypotheek
  • Insinuatie
  • Interrogatie
  • Koop
  • Kwitantie
  • Machtiging
  • Non prejuditie
  • Obligatie
  • Onbekend
  • Overig
  • Procuratie
  • Renunciatie
  • Revocatie
  • Scheepsverklaring
  • Schenking
  • Testament
  • Transport
  • Trouwbelofte
  • Uitspraak
  • Voogdij
  • Wisselprotest

Would it be helpful if I assembled a set of PageXML for manual annotation based on some of the classifiers, like "Boedelinventaris", "Huwelijkse voorwaarden" or "Contract"?


proycon commented Aug 18, 2021

That sounds like a good way to select some diverse material, yes; we'd best consult with Harm, who knows more about the actual contents and will be working on the annotation.


proycon commented Aug 19, 2021

The PageXML to FoLiA converter (FoLiA-page) is ready now.


proycon commented Aug 20, 2021

We also have a third option that avoids the FoLiA overhead (which I'm skeptical about for this task) but still gives us an annotation tool: I'm looking into doccano as a hopefully more lightweight alternative. We still have overhead and conversion issues, but this time to a simpler CoNLL format.
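
For reference, a minimal sketch of what such a conversion to CoNLL-style BIO lines could look like (the token offsets and span data are made up for illustration; the exact import format doccano expects would still need checking):

```python
# Sketch: write tokens with BIO entity labels in a two-column CoNLL-style format.
def to_conll(tokens, spans):
    """tokens: list of (token, start offset); spans: list of (start, end, label)."""
    lines = []
    for token, start in tokens:
        end = start + len(token)
        label = "O"
        for span_start, span_end, span_label in spans:
            if start >= span_start and end <= span_end:
                label = ("B-" if start == span_start else "I-") + span_label
                break
        lines.append(f"{token}\t{label}")
    return "\n".join(lines)

tokens = [("een", 0), ("geverft", 4), ("ront", 12), ("taeffeltje", 17)]
spans = [(17, 27, "obj")]
print(to_conll(tokens, spans))
```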


LvanWissen commented Aug 20, 2021 via email


proycon commented Aug 20, 2021

I don't know that one yet, but I'm open to anything if we have a convenient way of getting the PageXML into whatever the tool needs, and some useful output format (ideally web annotations directly on the PageXML, as that is what our system outputs too; the evaluation pipeline would then compare web annotations). Preferably with the least amount of overhead (that's what's keeping me from recommending the FoLiA approach).


proycon commented Aug 20, 2021

I tried both recogito and doccano with plain text input, and both work pretty well and are easy to use. Doccano doesn't seem to allow an open vocabulary (needed for correcting words); Recogito seems more suitable. This is less overhead than FLAT and FoLiA, so I suggest we go this way.

For Recogito, we'd only need something that interprets the results (it can export a simple CSV or full web annotations), and we'd need to convert the PageXML to plain text (one TextLine per line should do?) for input, plus perhaps some minimal bookkeeping to relate the offsets back.
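
A minimal sketch of such an extraction, assuming standard PageXML with TextLine elements carrying a line-level TextEquiv/Unicode (the mapping structure is just an illustration of the offset bookkeeping):

```python
import xml.etree.ElementTree as ET

def line_text(textline):
    """Return the text of a TextLine's own (line-level) TextEquiv/Unicode, if any."""
    for child in textline:
        if child.tag.endswith("}TextEquiv"):
            for sub in child:
                if sub.tag.endswith("}Unicode"):
                    return sub.text or ""
    return ""

def pagexml_to_text(path):
    """Extract one plain-text line per TextLine, plus offset bookkeeping per line id."""
    lines, mapping, offset = [], [], 0
    for element in ET.parse(path).iter():
        if element.tag.endswith("}TextLine"):
            text = line_text(element)
            mapping.append({"line_id": element.get("id"),
                            "start": offset, "end": offset + len(text)})
            lines.append(text)
            offset += len(text) + 1  # +1 for the newline separator
    return "\n".join(lines), mapping
```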

With Recogito we can use the "tag" field for a label like "per", "obj", "loc", and the comment field for the corrected spelling.


brambg commented Sep 3, 2021

Based on the extracted text of the HTR of 2408, I've made 2 sample versions:

All in one:
Pro: This has the Re-Apply advantage: tag one term, and Recogito notifies you that there are more occurrences of that term, with the option of re-applying the annotation to those occurrences (which takes a while).
Con: The document is large, and you'd have to add markers in the text (before uploading) to indicate where each page starts.

Per page:
Pro: easy to pick the page
Con: when uploading multiple files, the order is not guaranteed. You'd have to manually fix the order of the Document Parts in the Metadata tab.

The script that extracts the text from the PageXML also saves the necessary metadata to be able to map the Recogito CSV export back to the original position in the PageXML.
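
A sketch of what that mapping back could look like, assuming the export exposes a character offset per annotation (the column names used here are placeholders and would need to be checked against the actual Recogito export):

```python
import csv

def resolve(offset, mapping):
    """Map a character offset in the extracted text back to a PageXML line id.
    mapping: list of {"line_id", "start", "end"} dicts saved during extraction."""
    for entry in mapping:
        if entry["start"] <= offset < entry["end"]:
            return entry["line_id"], offset - entry["start"]
    return None, None

def read_annotations(csv_path, mapping):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Placeholder column names; adjust to the real Recogito CSV export.
            line_id, line_offset = resolve(int(row["CHAR_OFFSET"]), mapping)
            yield line_id, line_offset, row.get("QUOTE"), row.get("TAGS"), row.get("COMMENTS")
```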

We probably want a unit size smaller than the whole archive, but larger than an individual page.


LvanWissen commented Sep 3, 2021

There's an option in between: we can segment it per notarial deed, so per probate inventory. For every deed in the archive (e.g. https://archief.amsterdam/indexen/deeds/9d6d21e3-1fe7-666d-e053-b784100a1840) we know on which scans the deed is located (e.g. A16098000031 - A16098000044).

In principle, there is information at VeleHanden on where the deeds are segmented on the scan, but this information is not published.


brambg commented Sep 6, 2021

That would be a good-sized segmentation. How can I get the information linking scans to deeds?


LvanWissen commented Sep 6, 2021

Query: https://api.triplydb.com/s/sHNKX0vs6 (only for deed types that contain 'boedel')

Result: na_boedel_deeds_scans20210906.csv
Please keep the deed URI (or at least the uuid in it) as identifier for each individual deed!
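
For reference, a minimal sketch of grouping the scans per deed from that CSV, keeping the deed URI as the key (the column names deed and scan are guesses and should be adjusted to the actual header):

```python
import csv
from collections import defaultdict

def scans_per_deed(csv_path):
    """Group scan identifiers per deed URI from na_boedel_deeds_scans20210906.csv.
    The column names "deed" and "scan" are guesses; check the actual CSV header."""
    deeds = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            deeds[row["deed"]].append(row["scan"])
    return deeds
```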

Should we already try to match the names and location descriptions from the index to the HTR?


brambg commented Sep 6, 2021

Should we already try to match the names and location descriptions from the index to the HTR?

Is this the "Tag de tekst" index you're referring to? Sure, if this data is available in a suitable format I wouldn't mind doing an Analiticcl evaluation with it.


brambg commented Sep 6, 2021

For the precision/recall/F1 metrics we need to compare the sets of recognized entities from the Recogito export to those produced by Analiticcl.

I see 2 properties to match the entities on (other than the original words + their location):

  • the person/location/object category
  • the normalised/corrected spelling

I'm assuming we should separate these 2 properties when comparing, so we get 2 sets of metrics:

  • one set where we match the entities based on the category assigned (+ original words & location)
  • one set where we match the entities based on the normalised/corrected spelling (+ original words & location)

Two sets, because we would be using the metrics to evaluate two separate functions of Analiticcl.
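
Roughly, a sketch of how the two comparisons could be derived from the same annotations (the field names are illustrative; the actual data would come from the Recogito export and the Analiticcl output):

```python
# Sketch: two evaluations from the same entity records, keyed on category vs. normalisation.
def category_key(entity):
    return (entity["scan"], entity["offset"], entity["text"], entity["category"])

def normalisation_key(entity):
    return (entity["scan"], entity["offset"], entity["text"], entity["norm"])

def evaluate(gold, system, key):
    gold_set = {key(e) for e in gold}
    system_set = {key(e) for e in system}
    tp = len(gold_set & system_set)
    precision = tp / len(system_set) if system_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# detection_metrics = evaluate(gold, system, category_key)
# correction_metrics = evaluate(gold, system, normalisation_key)
```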

Does this make sense?


proycon commented Sep 6, 2021

Yes, we want two separate evaluations indeed. This would correspond to the detection and correction stages.

@LvanWissen

Should we already try to match the names and location descriptions from the index to the HTR?

Is this the "Tag de tekst" index you're referring to? Sure, if this data is available in a suitable format I wouldn't mind doing an Analiticcl evaluation with it.

No, this is not yet available (?). What we already have are the person names and location descriptions from the index itself. Example: https://api.triplydb.com/s/eUOYpzVZB. Included in https://github.com/knaw-huc/golden-agents-htr/tree/master/resources (see README, personnamesNA_20201204.csv).

For pre-tagging purposes: can we use TICCLAT (the lexica) for this, if there is information on (art historical) objects available in there?


proycon commented Sep 7, 2021

Yes, we can use TICCLAT in analiticcl as a variant list; it's just a mapping of words to spelling variants and has no specific domain, so I'm sure there are objects in it too.

The whole idea of this evaluation pipeline is that we can try multiple input lexicons, variant lists, analiticcl parameters, etc, and have some kind of ground truth to measure what works best for us.
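
For reference, a rough sketch of what such a run could look like with the analiticcl Python binding (file paths are placeholders, and the exact call for loading a variant list like TICCLAT should be checked against the analiticcl documentation):

```python
from analiticcl import VariantModel, Weights, SearchParameters

# Placeholder file paths; weights and search parameters would be tuned per experiment.
model = VariantModel("alphabet.tsv", Weights(), debug=False)
model.read_lexicon("objects_lexicon.tsv")    # e.g. an object-term lexicon
model.read_variants("ticclat_variants.tsv")  # variant list; check the exact method/arguments
model.build()

for match in model.find_variants("taeffeltje", SearchParameters(max_edit_distance=3)):
    print(match)
```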


LvanWissen commented Sep 14, 2021

Query: https://api.triplydb.com/s/sHNKX0vs6 (only for deed types that contain 'boedel')

Result: na_boedel_deeds_scans20210906.csv
Please keep the deed URI (or at least the uuid in it) as identifier for each individual deed!

Should we already try to match the names and location descriptions from the index to the HTR?

See: https://api.triplydb.com/s/HDqkdD6m9 for deeds, dates and their types. Or here for all:
na_boedel_deeds_dates_20210914.csv

And here for an overview of deeds, dates, types, inventories (incl. temporal coverage) and if they contain 'minuutakten': https://api.triplydb.com/s/iZIuKePza. Or here for all:
na_boedel_deeds_inventory_minuut_20210917.csv
