Display / edit of GT labelling metadata #36

Open
kba opened this issue Jan 13, 2022 · 1 comment
kba commented Jan 13, 2022

It would be convenient if browse-ocrd were able to display and edit GT labelling metadata.

(we're discussing the OCR-D GT call, so I wanted to note it down lest we forget)

bertsky commented Jan 18, 2022

Note: In METS, the labels are a flat sequence of gt:state elements, each with a @prop value from the above-mentioned schema file. There is one such sequence (one dmdSec) per page.

   <mets:dmdSec ID="DMDGT_0001">
      <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="GT">
         <mets:xmlData>
            <gt:gt>
               <gt:state prop="granularity/physical/document-related/word"/>
               <gt:state prop="granularity/physical/document-related/text-line"/>
               <gt:state prop="granularity/physical/document-related/region"/>
               <gt:state prop="data-attributes/document-related/visual/text/font/multi-font/typefaces"/>
               <gt:state prop="data-attributes/document-related/visual/text/font/multi-font/font-sizes"/>
               <gt:state prop="data-attributes/language/mixed"/>
               <gt:state prop="condition/production-related/document-faults/ink-from-facing"/>
               <gt:state prop="condition/wear/additions/informative/annotations"/>
               <gt:state prop="condition/production-related/document-characteristics/low-contrast"/>
               <gt:state prop="condition/acquisition/method-flaws/imaging/uneven-illumination"/>
            </gt:gt>
         </mets:xmlData>
      </mets:mdWrap>
   </mets:dmdSec>

Each of these dmdSec elements is then referenced via @DMDID from the corresponding page div in the physical structMap.
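
The lookup described here could be sketched roughly as follows, using only the standard library. This is a minimal illustration, not existing core API: the function name `gt_labels_by_page`, the sample page IDs, and in particular the `gt` namespace URI (guessed as `http://www.ocr-d.de/GT/`) are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "gt": "http://www.ocr-d.de/GT/",  # assumption: actual gt namespace URI may differ
}

# Hypothetical, shortened METS document for illustration only
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:gt="http://www.ocr-d.de/GT/">
  <mets:dmdSec ID="DMDGT_0001">
    <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="GT">
      <mets:xmlData>
        <gt:gt>
          <gt:state prop="granularity/physical/document-related/word"/>
          <gt:state prop="data-attributes/language/mixed"/>
        </gt:gt>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001" DMDID="DMDGT_0001"/>
      <mets:div TYPE="page" ID="PHYS_0002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def gt_labels_by_page(root):
    """Return a dict mapping physical page ID to its list of GT labels."""
    # First pass: collect the label list of every GT dmdSec, keyed by its ID
    by_dmd = {}
    for dmd in root.findall("mets:dmdSec", NS):
        wrap = dmd.find("mets:mdWrap[@OTHERMDTYPE='GT']", NS)
        if wrap is None:
            continue
        states = wrap.findall("mets:xmlData/gt:gt/gt:state", NS)
        by_dmd[dmd.get("ID")] = [state.get("prop") for state in states]
    # Second pass: resolve each page's @DMDID (a space-separated list of IDs)
    pages = root.findall(
        "mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']", NS)
    result = {}
    for page in pages:
        labels = []
        for dmdid in page.get("DMDID", "").split():
            labels.extend(by_dmd.get(dmdid, []))
        result[page.get("ID")] = labels
    return result


labels = gt_labels_by_page(ET.fromstring(SAMPLE))
```

Pages without a @DMDID simply get an empty label list in this sketch.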

IMO, core first needs some additional API to support that, for example (by analogy with pageId):

OcrdMets.get_gt_labelling(self, for_fileIds=None) # returns dict of file ID to label list
OcrdMets.get_gt_labelling_for_file(self, ocrd_file) # returns label list
OcrdMets.set_gt_labelling_for_file(self, labels, ocrd_file) # takes label list
# but also:
OcrdMets.add_file(self, ... labels=None, ...) # add full label list
OcrdMets.find_files(self, ... labels=None, ...) # filter by label list (match any)
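
To make the "match any" semantics of the proposed `labels` filter concrete, here is a small standalone sketch. It does not use or extend `OcrdMets`. The helper name `filter_by_labels`, the file IDs, and the choice of exact string matching are all assumptions for illustration.

```python
def filter_by_labels(labelling, labels=None):
    """Return the file IDs whose label list shares at least one label with `labels`.

    labelling: dict mapping file ID to its list of GT labels
    labels: labels to match (any of them); None or empty means no filtering
    """
    if not labels:
        return list(labelling)
    wanted = set(labels)
    return [file_id for file_id, have in labelling.items()
            if wanted & set(have)]


# Hypothetical file IDs and label lists
labelling = {
    "FILE_0001": ["data-attributes/language/mixed",
                  "granularity/physical/document-related/word"],
    "FILE_0002": ["granularity/physical/document-related/region"],
    "FILE_0003": [],
}
mixed = filter_by_labels(labelling, ["data-attributes/language/mixed"])
```

Whether a real implementation should also match on label prefixes (for example everything under `condition/`) is left open here.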

What's your opinion, @kba?

Perhaps, instead of parsing the labels from the METS, we could also see to it that OCR-D mirrors them in the parsed PAGE-XML, i.e. OcrdPage.

For example as:

  <MetadataItem type="imageProperties" name="gt-labelling">
    <Labels externalModel="https://github.com/OCR-D/gt-labelling/blob/master/xsd_schema/OCR-D_GT_schema.xsd" externalId="http://www.ocr-d.de/GT/">
      <Label value="granularity/physical/document-related/word"/>
      <Label value="granularity/physical/document-related/text-line"/>
      <Label value="granularity/physical/document-related/region"/>
      <Label value="data-attributes/document-related/visual/text/font/multi-font/typefaces"/>
      <Label value="data-attributes/document-related/visual/text/font/multi-font/font-sizes"/>
      <Label value="data-attributes/language/mixed"/>
      <Label value="condition/production-related/document-faults/ink-from-facing"/>
      <Label value="condition/wear/additions/informative/annotations"/>
      <Label value="condition/production-related/document-characteristics/low-contrast"/>
      <Label value="condition/acquisition/method-flaws/imaging/uneven-illumination"/>
    </Labels>
  </MetadataItem>

This would make it easier to access the labels from a processor or PAGE viewer.
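
If the labels were mirrored like this, reading them back could look roughly like the sketch below. It uses only the standard library rather than OcrdPage, and assumes the PAGE 2019-07-15 namespace plus the `name="gt-labelling"` convention from the example above. The function name `page_gt_labels` and the shortened sample file are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE_NS = {
    "pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15",
}

# Hypothetical, shortened PAGE-XML file for illustration only
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Metadata>
    <MetadataItem type="imageProperties" name="gt-labelling">
      <Labels externalId="http://www.ocr-d.de/GT/">
        <Label value="granularity/physical/document-related/region"/>
        <Label value="condition/wear/additions/informative/annotations"/>
      </Labels>
    </MetadataItem>
  </Metadata>
</PcGts>"""


def page_gt_labels(root):
    """Return the GT label values stored in the page's gt-labelling MetadataItem."""
    xpath = ("pc:Metadata/pc:MetadataItem[@name='gt-labelling']"
             "/pc:Labels/pc:Label")
    return [label.get("value") for label in root.findall(xpath, PAGE_NS)]


page_labels = page_gt_labels(ET.fromstring(SAMPLE_PAGE))
```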
