Provenance Data

It is often desirable to know exactly which tools (in which versions, and even with which parameters) were invoked, and in which order, to produce a FoLiA document. This is called provenance data. In the metadata section, right after the annotation_declarations, FoLiA allows for a <provenance> block containing this information. It is not mandatory, but it is strongly recommended.

The <provenance> block defines one or more processors. Processors are processes or entities that have processed the document and often performed some kind of manipulation of it, such as adding annotations. The processors are listed in the order they were invoked. The annotation_declarations in turn link to these processors to tie a particular annotation type and set to one or more processors.

A <processor> carries the following attributes:
  • xml:id (mandatory) -- The ID of the processor. This is how it is referred to from the <annotator processor=".." /> element in the annotation_declarations and from the processor attribute (part of the common FoLiA attributes) on individual annotations.
  • name (mandatory) -- The name of the actual tool or human annotator.
  • type -- The type of the processor, one of:
    • auto - (default) - The processor is an automated tool that provided annotations
    • manual - The processor refers to a manual annotator
    • generator - The processor indicates the FoLiA library used by the parent and sibling processors (unless sibling processors specify another generator in their scope)
    • datasource - The processor is a reference to a particular data source that was used by the parent processor. If there is no parent processor but it is instead directly part of the provenance chain, often as the very first element, then you can interpret this to be the original data source from which the document sprung.
  • version (optional but strongly recommended) -- The version of the processor, i.e. of the tool.
  • document_version (optional) -- The version of the document. This can be any label the user wishes to use to indicate a version of the document, so the format is not predetermined and need not be numeric.
  • command (optional) -- The exact command that was run
  • host (optional) -- The host on which the processor ran. This identifies individual systems on a network/cluster.
  • user (optional) -- The user/executor who ran the processor. This identifies who ran an automated process, not who the annotator was!
  • src (optional) -- The source of the processor, a URL to the tool itself in case the software is an online tool, or to its website or source code repository if not. If the processor is of the datasource type, then this attribute should point to that data set or a website describing it. The format attribute can be used to further specify the type of source.
  • format (optional) -- MIME type describing the kind of resource pointed to by src. Use text/html for websites. Especially useful for processors of type datasource.
  • folia_version (optional) -- The FoLiA version that was written.
  • begindatetime (optional) -- Specifies when the process started. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD dateTime data type.
  • enddatetime (optional) -- Specifies when the process finished. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD dateTime data type.
  • resourcelink (optional) -- The URI of any RDF resource describing this processor. This allows linking to the external world of linked open data from the provenance chain in FoLiA.
  • Additional custom metadata is allowed in the form of <meta> elements (just like with FoLiA's native metadata) inside the scope of a processor. FoLiA does not define the semantics of such metadata; it is tool- or application-specific and could, for instance, be used to specify the tool parameters used.

First consider a fairly minimal example. Note that we include the annotation_declarations as well, with a link to the processor:

<annotations>
  <token-annotation set="tokconfig-nl">
      <annotator processor="p0" />
  </token-annotation>
</annotations>
<provenance>
    <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1" />
</provenance>

Individual annotations in the document can refer to this processor using the processor attribute:

<w class="PUNCTUATION" processor="p0">
 <t>.</t>
</w>

If there is only one <annotator> defined for a certain annotation type and set in the annotation_declarations, then it is the default and no processor attribute is necessary.
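
To illustrate: in the minimal declaration shown earlier, p0 is the only annotator declared for the token annotation. The same word could therefore be written without the attribute and would still be attributed to p0. A small sketch:

```xml
<!-- only one annotator (p0) is declared for this annotation type and set,
     so the processor attribute may be left out and p0 is implied -->
<w class="PUNCTUATION">
 <t>.</t>
</w>
```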

One of the powerful features of processors is that they can be nested. Nesting creates subprocessors and captures situations where one processor invokes others as part of its operation. Subprocessors can also provide extra information about their parent processor: they can, for example, state which FoLiA library was used (type="generator") or which data sources the processor used (type="datasource"). Moreover, arbitrary metadata can be added to any processor in the form of <meta> elements (just like with FoLiA's native metadata). FoLiA does not define the semantics of such metadata; it is tool- or application-specific and could, for instance, be used to specify the tool parameters used. Note that whereas the order of the processors in the <provenance> block is strictly significant, the order of subprocessors is not.

With all this in mind, we can expand our previous example:

<provenance>
    <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1">
        <meta id="config">tokconfig-nld</meta>
        <meta id="language">nld</meta>
        <processor xml:id="p0.1" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
        <processor xml:id="p0.2" name="tokconfig-nld" version="2.0" folia_version="2.0" type="datasource" />
    </processor>
</provenance>
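
A data source does not have to be a subprocessor. As described for the datasource type, it can also sit directly in the provenance chain, typically as the first element, to mark where the document originally came from. The following is a minimal sketch of that situation. The ID src0, the name example-corpus and the URL are hypothetical placeholders rather than a real data set; only the attribute names come from the list above:

```xml
<provenance>
    <!-- hypothetical data source: ID, name and URL are placeholders -->
    <!-- format="text/html" signals that src points to a website -->
    <processor xml:id="src0" name="example-corpus" type="datasource" src="https://example.org/corpus" format="text/html" />
    <!-- the tool that subsequently processed the document -->
    <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" />
</provenance>
```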

Or consider the following example, in which the tool is an annotation environment where human annotators edit a FoLiA document and add or edit annotations:

<provenance>
    <processor xml:id="p2" name="flat" version="0.8" folia_version="2.0" host="flat.science.ru.nl" begindatetime="2018-09-12T00:10:00" enddatetime="2018-09-12T00:20:00" document_version="3">
        <processor xml:id="p2.0" name="foliapy" version="2.0" folia_version="2.0" type="generator" />
        <processor xml:id="p2.1" name="proycon" type="manual" />
        <processor xml:id="p2.2" name="ko" type="manual" />
    </processor>
</provenance>
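
Because the processors in the <provenance> block are listed in the order they were invoked, a document that passed through both tools would carry them in a single chain. The sketch below combines the two examples above purely for illustration. It is a hypothetical combination, with the subprocessors and some attributes omitted for brevity:

```xml
<provenance>
    <!-- invoked first: the automated tool -->
    <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1" />
    <!-- invoked later: the annotation environment used by the human annotators -->
    <processor xml:id="p2" name="flat" version="0.8" folia_version="2.0" begindatetime="2018-09-12T00:10:00" enddatetime="2018-09-12T00:20:00" document_version="3" />
</provenance>
```

Swapping the two top-level processors would describe a different processing history, since their order is significant.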

From the annotation_declarations, we can then also refer directly to subprocessors. Moreover, a processor can be referred to from multiple annotation types/sets:

<annotations>
  ...
  <pos-annotation set="...">
      <annotator processor="p2.1" />
      <annotator processor="p2.2" />
  </pos-annotation>
  <lemma-annotation set="...">
      <annotator processor="p2.1" />
  </lemma-annotation>
  ...
</annotations>
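
Given declarations like these, individual annotations might look as sketched below. The word ID, the text and the class values are hypothetical placeholders, and the set attributes are left out on the assumption that only one set is declared per annotation type. Two annotators are declared for the part-of-speech annotation, so the processor attribute is spelled out there. The lemma annotation has a single declared annotator and can rely on the default:

```xml
<w xml:id="example.w.1"> <!-- hypothetical ID -->
  <t>house</t>
  <!-- two annotators are declared for pos, so name the one responsible -->
  <pos class="N" processor="p2.2" />
  <!-- only p2.1 is declared for lemma, so it is implied -->
  <lemma class="house" />
</w>
```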

Of course, providing all this is not mandatory, and it requires the specific tool to actually supply this provenance data. It is still possible to have FoLiA documents without any provenance data at all.

The following example provides a small but complete FoLiA document with provenance data:

../../examples/provenance.2.0.0.folia.xml

And another more real-life example:

../../examples/pos-features-deep.2.0.0.folia.xml

Another example with many annotation types and extensive provenance data:

../../examples/spacy-core-web-sm-en.2.0.0.folia.xml
