Prepare evaluation material for entity detection (through analiticcl) #1
Some issues are already popping up regarding the PageXML->FoLiA conversion; they will be resolved (it takes some time), but the question remains whether we want to go this route. @HarmTseard Can you check with the people you want to appoint as annotators what their preferences are for this annotation task? (including yourself, of course) |
Leon has given me a file with scan "classifiers"
Would it be helpful if I assemble a set of PageXML for manual annotation based on some of the classifiers, like "Boedelinventaris", "Huwelijkse voorwaarden" or "Contract" ? |
That sounds like a good way to select some diverse material, yeah. We'd best consult with Harm, who knows more about the actual contents and will be working on the annotation. |
The PageXML to FoLiA converter (FoLiA-page) is ready now. |
We also have a third option that avoids the FoLiA overhead (which I'm skeptical about for this task) but still gives us an annotation tool: we can use another annotation tool. I'm looking into doccano as a hopefully more lightweight alternative. We still have overhead and conversion issues, but this time to a simpler CoNLL format. |
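As a minimal sketch of what such a conversion could output (the two-column token/tag layout and the IOB-style labels are my own assumptions here, not something we've agreed on):

```python
# Minimal sketch: write (token, tag) pairs to a two-column CoNLL-style file.
# Whitespace tokenisation and IOB-style tags are assumptions, not project decisions.
def write_conll(lines, path):
    """lines: list of lists of (token, tag) tuples, one inner list per text line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            for token, tag in line:
                f.write(f"{token}\t{tag}\n")
            f.write("\n")  # blank line separates lines/sentences

write_conll([[("Jan", "B-per"), ("Claesz", "I-per"), ("verkoopt", "O")]], "sample.conll")
```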
Is https://recogito.pelagios.org/ an option? |
I don't know that one yet, but I'm open to everything if we have a convenient way of getting the PageXML into whatever the tool needs, and some useful output format (ideally web annotations directly on the PageXML, as that is what our system outputs too; then the evaluation pipeline would compare web annotations). Preferably with the least amount of overhead (that's what's keeping me from recommending the FoLiA approach). |
I tried both Recogito and doccano with plain text input, and it works pretty well and easily with both. Doccano doesn't seem to allow an open vocabulary (for correcting words); Recogito seems more suitable. This is less overhead than FLAT and FoLiA, so I suggest we go this way. For Recogito, we'd only need something that interprets the results (it can export a simple CSV or full web annotations), and we'd need to convert the PageXML to plain text (one TextLine per line should do?) for input, plus perhaps some minimal bookkeeping to relate the offsets back. With Recogito we can use the "tag" field for a label like "per", "obj", "loc", and the comment field for the corrected spelling. |
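For concreteness, a minimal sketch of that extraction step (one TextLine per line, with bookkeeping to relate plain-text offsets back to line ids); the PAGE namespace version and element paths below are assumptions about the input, not something we've pinned down:

```python
# Sketch: extract one plain-text line per PageXML <TextLine>, keeping the line id
# and character offsets so annotation offsets can later be traced back.
# The PAGE namespace version below is an assumption.
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(pagexml_path):
    tree = etree.parse(pagexml_path)
    lines = []
    for textline in tree.iterfind(".//pc:TextLine", namespaces=NS):
        unicode_el = textline.find("pc:TextEquiv/pc:Unicode", namespaces=NS)
        text = (unicode_el.text or "") if unicode_el is not None else ""
        lines.append({"id": textline.get("id"), "text": text})
    return lines

def to_plaintext(lines):
    """Return the plain text (one TextLine per line) plus an offset table."""
    offset, mapping, buf = 0, [], []
    for line in lines:
        mapping.append({"id": line["id"], "start": offset, "end": offset + len(line["text"])})
        buf.append(line["text"])
        offset += len(line["text"]) + 1  # +1 for the newline separator
    return "\n".join(buf), mapping
```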
Based on the extracted text of the HTR of 2408, I've made 2 sample versions: All in one: Per page: The script that extracts the text from the PageXML also saves the necessary metadata to be able to map the Recogito CSV export back to the original position in the PageXML. We probably want a unit size smaller than the whole archive, but larger than an individual page. |
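And a sketch of the reverse mapping, from a character offset in the Recogito CSV export back to the original TextLine; I'm not assuming anything about the export columns here, only that a character offset into the plain text can be obtained per annotation:

```python
# Sketch: map a character offset from the annotation export back to the original
# PageXML TextLine, using the offset table saved at extraction time.
import bisect

def locate(offset, mapping):
    """mapping: the offset table from to_plaintext(), sorted by 'start'."""
    starts = [m["start"] for m in mapping]
    i = bisect.bisect_right(starts, offset) - 1
    entry = mapping[i]
    if entry["start"] <= offset <= entry["end"]:
        return entry["id"], offset - entry["start"]  # TextLine id, offset within line
    return None, None
```

So `locate(134, mapping)` would give back the TextLine id and the annotation's position within that line.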
There's an option in between: we can segment it per notarial deed, so per probate inventory. For every deed in the archive (e.g. https://archief.amsterdam/indexen/deeds/9d6d21e3-1fe7-666d-e053-b784100a1840) we know on which scans the deed is located (e.g. A16098000031 - A16098000044). In principle, there is information at VeleHanden on where the deeds are segmented on the scan, but this information is not published. |
That would be a good sized segmentation. How can I get the information linking scans to deeds? |
Query: https://api.triplydb.com/s/sHNKX0vs6 (only for deed types that contain 'boedel') Result: na_boedel_deeds_scans20210906.csv Should we already try to match the names and location descriptions from the index to the HTR? |
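As a minimal sketch of how that CSV could drive the segmentation (the column names "deed" and "scan" are placeholders for whatever the export actually uses):

```python
# Sketch: group scan identifiers per deed so annotation units can be cut per
# notarial deed rather than per page or per whole archive.
# Column names "deed" and "scan" are placeholders, not the verified export headers.
import csv
from collections import defaultdict

def scans_per_deed(csv_path):
    deeds = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            deeds[row["deed"]].append(row["scan"])
    return deeds

# e.g. scans_per_deed("na_boedel_deeds_scans20210906.csv")
```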
Is this the "Tag de tekst" index you're referring to? Sure, if this data is available in a suitable format I wouldn't mind doing an Analiticcl evaluation with it. |
For the precision/recall/F1 metrics we need to compare the sets of recognized entities from the Recogito export to those produced by Analiticcl. I see 2 properties to match the entities on (other than the original words + their location): the tag ("per", "obj", "loc") and the corrected spelling.
I'm assuming we should separate these 2 properties when comparing, so we get 2 sets of metrics: one based on the tags and one based on the corrected spellings.
2 sets, because we would be using the metrics to evaluate 2 different, separate functions of Analiticcl. Does this make sense? |
Yes, we want two separate evaluations indeed. This would correspond with the detection and correction stages. |
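A minimal sketch of what those two evaluations could look like, under my own (not yet agreed) assumption that both gold and system entities can be reduced to (line id, start, end, tag, normalised form) tuples:

```python
# Sketch: two evaluations over sets of entity tuples, assumed here to be
# (line_id, start, end, tag, norm) for both gold standard and system output.
def prf(gold, system):
    tp = len(gold & system)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def evaluate(gold_entities, system_entities):
    # detection: only the span and the tag have to match
    detection = prf({e[:4] for e in gold_entities}, {e[:4] for e in system_entities})
    # correction: the normalised/corrected form has to match as well
    correction = prf(set(gold_entities), set(system_entities))
    return {"detection": detection, "correction": correction}
```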
No, this is not yet available (?). What we already have are the person names and location descriptions from the index itself. Example: https://api.triplydb.com/s/eUOYpzVZB. Included in https://github.com/knaw-huc/golden-agents-htr/tree/master/resources (see README, personnamesNA_20201204.csv). For pre-tagging purposes: can we use TICCLAT (the lexica) for this, if there is information on (art historical) objects available in there? |
Yes, we can use TICCLAT in analiticcl as a variant list; it's just a mapping of words to spelling variants and has no specific domain, so I'm sure there are objects in it too. The whole idea of this evaluation pipeline is that we can try multiple input lexicons, variant lists, analiticcl parameters, etc., and have some kind of ground truth to measure what works best for us. |
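For illustration, roughly how such a run could look with analiticcl's Python binding; the file names are placeholders and the parameters are untuned, and I'd still need to verify the exact loader call for a TICCLAT-style variant list, so that part is only a comment:

```python
# Sketch using analiticcl's Python binding; file names are placeholders and the
# parameters are illustrative, not tuned values.
from analiticcl import VariantModel, Weights, SearchParameters

model = VariantModel("alphabet.tsv", Weights(), debug=False)
model.read_lexicon("personnamesNA_20201204.csv")  # person names from the index
model.read_lexicon("locations.tsv")               # placeholder location lexicon
# A TICCLAT-derived variant list would be loaded here too (via the binding's
# variant-list loader; exact call omitted as I'd need to verify it).
model.build()

matches = model.find_all_matches(
    "Jan Claesz woonende op de Heeregraft",
    SearchParameters(max_edit_distance=3),
)
```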
See: https://api.triplydb.com/s/HDqkdD6m9 for deeds, dates and their types. Or here for all: And here for an overview of deeds, dates, types, inventories (incl. temporal coverage) and if they contain 'minuutakten': https://api.triplydb.com/s/iZIuKePza. Or here for all: |
We need some evaluation material to ascertain how well analiticcl performs in combination with the chosen input lexicons. In order to do that we need a simple gold standard that marks the different kinds of entities we are interested in (names, locations, objects). A full annotation task to build a representative gold standard is probably too much work at this stage, so we'll have to settle for annotating a few scans to get an initial impression of the performance and steer development.
We can either evaluate analiticcl and the input lexicons by themselves at a lower level, or evaluate the wider pipeline (with @brambg's components). The latter is probably the best option. In that case the input would be PageXML.
We can go about this in two ways:
1. Annotate directly in the PageXML: add ad-hoc inline tags such as `obj`, `per` and `loc` and a `norm` attribute (see the hypothetical example after this list). This is rather ad-hoc but a simple and quick solution. It does however require a basic amount of technical expertise from the annotators, who need to be comfortable using a text editor to edit XML. The output will be some ad-hoc Page-XML hybrid.
2. Convert the PageXML to FoLiA: a converter `FoLiA-page` is available in foliautils, but this tool does need some further adaptation for our purposes. This is less ad-hoc, uses well-defined formats and tools, and will be more accessible to annotators. But it does come at the cost of increased complexity and extra overhead (= extra work/time) to set up. It might be overkill for our purposes.
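Purely to illustrate option 1, a hypothetical (not agreed) example of the ad-hoc markup an annotator might add inside the Unicode content; note that this indeed no longer validates as plain PageXML:

```xml
<TextLine id="r1l12">
  <TextEquiv>
    <Unicode>een schilderij van <per norm="Rembrandt van Rijn">Rembrant van Rhijn</per>
    op de <loc norm="Herengracht">Heeregraft</loc></Unicode>
  </TextEquiv>
</TextLine>
```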
In both scenarios, entities that span a linebreak (or are hyphenated) will pose an additional challenge.
Whatever option we choose, in order to evaluate analiticcl and the wider pipeline, we need an evaluation pipeline (as part of @brambg's pipeline I'd say) that reads the gold standard (either the Page-XML superset or FoLiA XML), compares the system output with it, and produces output metrics (precision/recall/F1).
(Tagging @menzowindhouwer , @HarmTseard, @LvanWissen, @brambg )