Starting a guidelines section in the documentation #43 (#70)
proycon committed Mar 13, 2019
1 parent f825455 commit c3251ad
Showing 4 changed files with 55 additions and 4 deletions.
46 changes: 46 additions & 0 deletions docs/source/guidelines.rst
@@ -0,0 +1,46 @@
Guidelines
=================

This section collects guidelines, tips, do's and don'ts, and conventions for working with FoLiA documents.

For data creators/publishers
-------------------------------

1. **Always validate all FoLiA documents you create and intend to publish!** Use one of the official validation tools
(``foliavalidator`` or ``folialint``). See :ref:`validation`. This will already catch most of the issues that could
arise from not following these guidelines.
2. Never invent custom XML elements and attributes. If you really must, make sure they are in a different XML namespace.
See :ref:`foreign_annotation`.
3. If you want to encode something and FoLiA does not seem to offer a good solution yet, or if you are simply unsure
whether the solution you want to use is appropriate, contact the FoLiA developers on our `Issue tracker <https://github.com/proycon/folia/issues/>`_.
FoLiA can be extended in collaboration. Do not simply add your own elements/attributes.
4. Mind the sets you use. Creating and publishing set definitions is recommended but not strictly mandatory for most uses. See :ref:`set_definitions`.
5. Identifiers should never change: once you assign an identifier to something and publish your data, do not change any
identifier that is in use.
6. All annotation types you use must be declared, see :ref:`annotation_declarations`. Take care not to declare annotation types that you don't actually use in your document unless you have good reason to believe the annotation type will be added soon.

For developers
-----------------

1. Using a high-level FoLiA programming library, if available for your programming language, is strongly recommended over parsing/writing/querying the XML yourself, as it will
make things a lot easier and save a lot of work.
2. Always use the latest version of FoLiA and its libraries.
3. Mind the sets you use. Actively check whether the sets used in a document are in fact the ones your software handles,
i.e. check the declarations (see :ref:`annotation_declarations`). For example, do not blindly assume any part-of-speech tag will be in your intended vocabulary. See
:ref:`set_definitions`.
4. Considering that FoLiA is vast, it is fine to support only a subset of the annotation types in your software,
or not to support certain complexities such as :ref:`correction_annotation`. Just make sure to check the declarations,
so that your software can refuse to process a document it does not support.
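
The declaration check described in points 3 and 4 could look roughly like the sketch below. It is only an illustration under
stated assumptions: it uses Python's standard library rather than a FoLiA library (which guideline 1 recommends in practice),
it assumes declarations appear as ``<TYPE-annotation set="...">`` elements under ``metadata/annotations`` in the
``http://ilk.uvt.nl/folia`` namespace, and the set URL, the ``SUPPORTED`` table and both helper functions are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed FoLiA XML namespace
FOLIA_NS = "http://ilk.uvt.nl/folia"

# Hypothetical: the (annotation type, set) pairs this software can handle
SUPPORTED = {("pos", "https://example.org/sets/my-pos-tagset")}

# Simplified, hypothetical document fragment; a real document has more content
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/sets/my-pos-tagset"/>
    </annotations>
  </metadata>
</FoLiA>"""


def declarations(root):
    """Return the (annotation type, set) pairs declared in the document."""
    found = set()
    path = f"{{{FOLIA_NS}}}metadata/{{{FOLIA_NS}}}annotations/*"
    for element in root.iterfind(path):
        tag = element.tag.split("}", 1)[-1]  # strip the namespace prefix
        if tag.endswith("-annotation"):
            found.add((tag[: -len("-annotation")], element.get("set")))
    return found


def unsupported_sets(root, annotation_type):
    """Return the declared sets of this type that the software does not handle."""
    return {
        declared_set
        for declared_type, declared_set in declarations(root)
        if declared_type == annotation_type
        and (declared_type, declared_set) not in SUPPORTED
    }


root = ET.fromstring(SAMPLE)
if unsupported_sets(root, "pos"):
    raise SystemExit("Document uses a part-of-speech set this tool does not support")
```

Rejecting the document up front, based on the declarations alone, avoids silently misinterpreting classes from an unfamiliar set later on.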

Conventions
-----------------------

Conventions are good practices that you will encounter and are encouraged to follow, but they remain just conventions
rather than strict guidelines.

1. Most FoLiA software assigns verbose identifiers for all elements. We use the ID of the FoLiA
document as the base identifier and then append the element type and sequence number, all delimited by dots. The IDs
are cumulative in nature, so we get for instance ``example.p.1.s.2.w.3`` for the third word in the second sentence in
the first paragraph of the document with ID ``example``. See :ref:`identifiers`.
2. Adding metadata to your document is always encouraged.
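
As a small illustration of the identifier convention in point 1, the following sketch builds a cumulative ID from a document
ID and a chain of (element type, sequence number) pairs. The helper name ``make_id`` is hypothetical and not part of any
FoLiA library; it merely spells out the convention.

```python
def make_id(doc_id, *steps):
    """Build a cumulative identifier from (element type, sequence number) pairs."""
    parts = [doc_id]
    for element_type, number in steps:
        parts.append(f"{element_type}.{number}")
    # All components are delimited by dots
    return ".".join(parts)


# Third word in the second sentence of the first paragraph of document "example"
word_id = make_id("example", ("p", 1), ("s", 2), ("w", 3))
print(word_id)  # example.p.1.s.2.w.3
```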

2 changes: 1 addition & 1 deletion docs/source/implementations.rst
@@ -10,7 +10,7 @@ Currently, the following FoLiA library implementations exist. Both follow a high
object-oriented model in which FoLiA XML elements correspond with classes.

* `FoLiApy <https://github.com/proycon/foliapy>`_ - A FoLiA library in Python.
* `libfolia <https://github.com/LanguageMachines/libfolia>`_ - A FoLiA library in C++. Obtain from https://proycon.github.io/folia
* `libfolia <https://github.com/LanguageMachines/libfolia>`_ - A FoLiA library in C++.

Both libraries are shipped as part of our `LaMachine <https://proycon.github.io/LaMachine>`_ software
distribution.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -26,6 +26,7 @@ version: 2.0.0
foreign_annotation
querying
implementations
guidelines

* :ref:`genindex`
* :ref:`search`
10 changes: 7 additions & 3 deletions docs/source/introduction.rst
@@ -158,6 +158,9 @@ all annotation types.

Read the full specification in the following section: :ref:`set_definitions`


.. _validation:

Validation
-------------

@@ -167,7 +170,7 @@ design, we prefer to be explicit and do away with any ambiguity or any ad-hoc co
FoLiA is clear for both humans and machines. Specific validator software is provided to this end.

* A first level of validation is performed by comparing your document against the FoLiA schema (in RelaxNG), this gives you a
good indication whether the document is formed correctly; but is not sufficient for full validation!
good indication whether the document is formed correctly; but is **not sufficient** for full validation!
* For full validation, process the document using one of the provided validation tools. These tools make a distinction
between **shallow validation** and **deep validation**, the distinction being that only in the latter case the validity of all used
classes will be put to the test using the set definitions. Shallow validation allows users to still use FoLiA without
@@ -350,7 +353,7 @@ attributes per annotation type. Altogether, we distinguish the following:
* ``datetime`` -- The date and time when this annotation was recorded, the format is ``YYYY-MM-DDThh:mm:ss`` (note the literal T in the middle to separate date from time), as per the XSD Datetime data type.
* ``n`` -- A number in a sequence, corresponding to a number in the original document, for example chapter numbers, section numbers, list item numbers. This does not have to be an actual number; other sequence identifiers are also possible (think alphanumeric characters or roman numerals).
* ``textclass`` -- Refers to the text class this annotation is based on. This is an advanced attribute, if not specified, it defaults to ``current``. See :ref:`textclass_attribute`.
* ``space`` -- This attribute indicates whether spacing should be inserted after this element (its default value is always ``yes``, so it does not need to be specified in that case), but if tokens or other structural elements are glued together then the value should be set to ``no``. This allows for reconstruction of the detokenised original text.
* ``space`` -- This attribute indicates whether spacing should be inserted after this element (its default value is always ``yes``, so it does not need to be specified in that case), but if tokens or other structural elements are glued together then the value should be set to ``no``. This allows for reconstruction of the detokenised original text.
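
To illustrate how the ``space`` attribute allows the detokenised text to be reconstructed, here is a minimal sketch. It
assumes tokens have already been extracted as (text, space) pairs, with ``None`` standing for an unspecified attribute;
the function name ``detokenise`` and this data layout are hypothetical and not taken from any FoLiA library.

```python
def detokenise(tokens):
    """Join (text, space) pairs; space is "yes", "no", or None (defaults to "yes")."""
    pieces = []
    for text, space in tokens:
        pieces.append(text)
        # Insert a space after the token unless it is explicitly glued to the next one
        if space != "no":
            pieces.append(" ")
    # Drop the space that may follow the final token
    return "".join(pieces).rstrip()


tokens = [("Hello", "no"), (",", None), ("world", "no"), ("!", "yes")]
print(detokenise(tokens))  # Hello, world!
```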

**Speech attributes**, the following attributes apply mostly in a speech context (please read :ref:`speech` for more):

@@ -366,9 +369,10 @@ details.
* ``xlink:href`` -- Creates a hyperlink on a text to the specified URL
* ``xlink:type`` -- Specifies the type of the hyperlink. (should be set to ``simple`` in almost all cases)

.. _identifiers:

Identifiers
~~~~~~~~~~~~~~~
.. _identifiers:

Many elements in FoLiA take an identifier by which the element is uniquely identifiable. This makes referring to any
part of a FoLiA document easy. Identifiers should be unique in the entire document, and ideally within the entire corpus