Merge pull request #145 from kermitt2/feature/holdout-set

Create holdout set
lfoppiano · Nov 9, 2022 · 8da45fe
2 parents 0957bc6 + 06c7e11
commit 8da45fe

Showing 96 changed files with 290,025 additions and 12,867 deletions.
diff --git a/README.md b/README.md
diff --git a/doc/conf.py b/doc/conf.py
@@ -47,17 +47,17 @@
 
 # General information about the project.
 project = 'Grobid-quantities'
-copyright = '2016, Grobid contributors'
+copyright = '2022, Grobid contributors'
 author = 'Patrice Lopez <patrice.lopez@science-miner.com>, Luca Foppiano <luca@foppiano.org>'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
 # built documents.
 #
 # The short X.Y version.
-version = '0.5.2'
+version = '0.7.1'
 # The full version, including alpha/beta/rc tags.
-release = '0.5.2'
+release = '0.7.1'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

diff --git a/doc/gettingStarted.rst b/doc/gettingStarted.rst
@@ -1,30 +1,36 @@
+.. topic:: Getting started, build, install
+
 .. _Python client GitHub page: https://github.com/lfoppiano/grobid-quantities-python-client
 
-.. topic:: Getting started, build, install
+.. _not compatible with Windows: https://grobid.readthedocs.io/en/latest/Troubleshooting/#windows-related-issues
+
+
 
 Getting started
 ===============
 
-Grobid-quantities requires *JDK 1.8 or greater* and Grobid to be installed.
+Before you start
+~~~~~~~~~~~~~~~~
+.. warning:: Grobid and grobid-quantities are `not compatible with Windows`_. Windows users can easily use Grobid and grobid-quantities through docker containers. See below.
 
 Install and build
 ~~~~~~~~~~~~~~~~~
 
 Docker containers
-~~~~~~~~~~~~~~~~~
+-----------------
 The simplest way to run grobid-quantities is via docker containers.
-To run the container with the default configuration:
-::
-     docker run --rm --init -p 8060:8060 -p 8061:8061  lfoppiano/grobid-quantities:0.7.1
 
-To run the container with custom configuration, is possible by providing a configuration file with the parameter ``-v``
-Grobid quantities repository provides already the file `resources/config/config-docker.yml` that contains the correct grobidHome and can be modified to best suits ones's needs: 
+The Grobid-quantities repository provides a configuration file for docker, ``resources/config/config-docker.yml``, which should work out of the box, although we recommend **checking the configuration** (e.g., to enable modules using deep learning).
+
+To run the container use:
 ::
      docker run --rm --init -p 8060:8060 -p 8061:8061 -v resources/config/config-docker.yml:/opt/grobid/grobid-quantities/config.yml:ro  lfoppiano/grobid-quantities:0.7.1
 
+The container will respond at http://localhost:8060, and on port 8061 for the admin interface.
 
 Local installation 
-~~~~~~~~~~~~~~~~~~~~~
+------------------
+Grobid-quantities requires *JDK 1.8 or greater*, and Grobid to be installed.
 
 First install the latest development version of GROBID as explained by the `documentation <http://grobid.readthedocs.org>`_.
 
@@ -44,7 +50,7 @@ Then, build everything with:
    ./gradlew clean build
 
 
-You should have the directories of the models ``quantities``, ``units`` and ``values`` inside ``../grobid-home/models``
+You should have the directories of the models ``quantities*``, ``units*`` and ``values*`` inside ``../grobid-home/models``
 
 Run some test:
 ::
@@ -55,15 +61,18 @@ Run some test:
 
 
 Start and use the service
-~~~~~~~~~~~~~~~~~~~~~~~~~
+'''''''''''''''''''''''''
 
 Grobid-quantities can be run with the following command:
 ::
 
   java -jar build/libs/grobid-quantities-{version}-onejar.jar server resources/config/config.yml
 
 
-There is a GUI interface demo accessible at ``http://localhost:8060``, and a REST API, reachable under ``http://localhost:8060/service`` and documented in the :ref:`rest_api`
+Accessing the service
+~~~~~~~~~~~~~~~~~~~~~
+
+Grobid-quantities provides a graphical demo accessible at ``http://localhost:8060``, and a REST API, reachable under ``http://localhost:8060/service`` and documented in the :ref:`rest_api`
 
 To test the API, is possible to run a simple text using ``curl``:
 
@@ -72,11 +81,11 @@ To test the API, is possible to run a simple text using ``curl``:
   curl -X POST -F "text=I've lost two minutes." localhost:8060/service/processQuantityText
 
 
-**Note**: The model is designed and trained to work at *paragraph level*. The expected text input to the parser is a paragraph or a text segment of similar size, not a complete document. In case you have a long textual document, it is better either to exploit existing structures (e.g. XML/HTML ``<p>`` elements) to initially segment it into paragraphs or sentences, or to apply an automatic paragraph/sentence segmentation. Then send them separately to grobid-quantities to be processed.
+.. note:: The model is designed and trained to work at *paragraph level*. The expected text input to the parser is a paragraph or a text segment of similar size, not a complete document. In case you have a long textual document, it is better either to exploit existing structures (e.g. XML/HTML ``<p>`` elements) to initially segment it into paragraphs or sentences, or to apply an automatic paragraph/sentence segmentation. Then send them separately to grobid-quantities to be processed.
 
 
-Clients
-~~~~~~~
+Using the python client
+-----------------------
 
 The easiest way to interact with the server is to use the Python Client.
 It removes the complexity of dealing with the output data, and managing single or multi-thread processing.
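
For readers who prefer to call the REST API directly, the paragraph-level note and the ``curl`` call in this file can be sketched in Python using only the standard library. This is a minimal sketch, not the Python client's implementation: the endpoint path comes from the ``curl`` example, while the host and port, the blank-line paragraph splitting, and the helper names ``split_paragraphs``, ``build_multipart`` and ``process_paragraph`` are assumptions made for illustration.

```python
import re
import urllib.request
import uuid

# Endpoint path taken from the curl example; host and port assume a default local server.
SERVICE_URL = "http://localhost:8060/service/processQuantityText"


def split_paragraphs(document: str) -> list[str]:
    """Split a long text on blank lines (a simple stand-in for real segmentation)."""
    parts = re.split(r"\n\s*\n", document)
    return [part.strip() for part in parts if part.strip()]


def build_multipart(text: str, boundary: str) -> bytes:
    """Encode one form field named 'text', mirroring curl's -F "text=..."."""
    lines = [
        f"--{boundary}",
        'Content-Disposition: form-data; name="text"',
        "",
        text,
        f"--{boundary}--",
        "",
    ]
    return "\r\n".join(lines).encode("utf-8")


def process_paragraph(text: str, url: str = SERVICE_URL) -> str:
    """POST a single paragraph and return the raw response body as text."""
    boundary = uuid.uuid4().hex
    request = urllib.request.Request(
        url,
        data=build_multipart(text, boundary),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running locally, calling ``process_paragraph(p)`` for each ``p`` in ``split_paragraphs(document)`` sends the paragraphs one at a time, as the note recommends. The format of the response is not shown in this file, so the sketch returns it unparsed.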

diff --git a/doc/guidelines.rst b/doc/guidelines.rst
@@ -6,41 +6,51 @@ Annotation guidelines
 The first step of the annotation process is to generate training data from unlabeled documents based on the current models.
 The procedure is explained in details in :ref:`training_data`. GROBID will create the training data corresponding to these documents in the right TEI format and with pre-annotations.
 The annotation work then consists of manually checking the produced annotations and adding the missing ones.
-**It is very important not to modify the text content in these generated files, not adding spaces or other characters, but only adding or moving XML tags.**
 
-When the training data has been manually corrected, move the file under the repository ``resouces/dataset/${model}/corpus/`` for retraining, or under ``resouces/dataset/${model}/evaluation/`` is the annotated data should be used for evalutation only.
+.. warning:: It is very important not to modify the text content of these generated files: do not add spaces or other characters, only add or move XML tags.
+
+When the training data has been manually corrected, move the file under the repository ``resources/dataset/${model}/corpus/`` for retraining, or under ``resources/dataset/${model}/evaluation/`` if the annotated data should be used for evaluation only.
 To see the different evaluation options, see GROBID documentation on `training and evaluating <http://grobid.readthedocs.org/en/latest/Training-the-models-of-Grobid>`_.
 
-**NOTE**: the exact directory where the data is picked up could be also a ``final`` under ``corpus``. Please check the description under each model definition, below.
+.. note:: The exact directory from which the data is picked up may also be a ``final`` directory under ``corpus``. Please check the description under each model definition, below.
 
 
 In this document we describe the annotation guidelines for the following models:
- - :ref:`Quantities CRF model` - STABLE
- - :ref:`Units CRF model` - STABLE
- - :ref:`Values CRF model` - STABLE
- - :ref:`Quantified objects CRF model` / substance CRF model - WIP
+ - :ref:`Quantities model` - STABLE
+ - :ref:`Units model` - STABLE
+ - :ref:`Values model` - STABLE
+ - :ref:`Quantified objects CRF model` / substance CRF model - **Work in progress**
 
 
 .. _Quantities CRF model:
 
-Quantities CRF model
+Quantities model
 --------------------
-The Quantities CRF models is the first model of the chain and segment text into units and values.
-This model pick up the training data from ``resouces/dataset/${model}/corpus/final``.
+The Quantities model is the first model of the chain and segments text into units and values.
+This model picks up the training data from ``resources/dataset/${model}/corpus``.
+
+Currently it supports three types of measurements:
+ - single value (or *atomic* value),
+ - continuous values in intervals (or range of values),
+ - lists of discrete values.
 
-Currently it supports three types of measurements: single value (or *atomic* value), continuous values in intervals (or range of values) or lists of discrete values.
-At the present time we do not distinguish between conjunctive and disjunctive lists.
+.. note:: At the present time we do not distinguish between conjunctive and disjunctive lists.
 
 Unit type vocabulary
 ~~~~~~~~~~~~~~~~~~~~
 
-The list of unit types (temperature, pressure, length, etc.) is controlled and based on SI definition. This controlled vocabulary contains currently 50 types.
-The unit types are given in the file ```src/main/java/org/grobid/core/utilities/UnitUtilities.java```. They are used to get the right transformation.
+The list of unit types (temperature, pressure, length, etc.) is controlled and based on SI definition.
+This controlled vocabulary currently contains around 50 types.
+The unit types are provided in the file ``src/main/java/org/grobid/core/utilities/UnitUtilities.java`` and they are used to get the right transformation.
+
+.. note:: At the moment we do not support disambiguation of overlapping units.
+
 The given names of the unit types has to be used when annotating measurement. 
 
-In the future, the list of units should however not be controlled and GROBID should support units never seen before.
+.. note:: In the future, the list of units should however not be controlled and GROBID should support units never seen before. For now, annotating with ``UNKNOWN`` is accepted in case of doubt about the type.
 
-For now it is admitted to annotate with ``UNKNOWN`` in case of doubt about the type. Examples:
+
+Examples:
 
 • ``m^2/kg`` (`specific surface area <https://en.wikipedia.org/wiki/Specific_surface_area>`_)
 
@@ -106,7 +116,7 @@ Intervals
 An interval introduces a range of values. We can distinguish two kinds of interval expressions:
 
 1. Bounded value
-^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^
 
 Interval defined by a lower bound value and an upper bound value:
 
@@ -133,7 +143,7 @@ Note that an interval can be introduced by only one boundary value:
 
 
 2. Base and differential value
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Take the example
 
 .. code-block:: xml
@@ -276,7 +286,7 @@ The encoding is then straightforward for atomic values (with attribute ``@when``
 
 
 Time tag (and difference with Date tag)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 • if only the part of a date is expressed (for example the time of a day), but we can infer the date, a complete date is implicit and the context can make it being fully quantified.
 For example ``20:10 UTC`` will be annotated:
@@ -685,7 +695,7 @@ can be annotated as:
 
 The quantified object is identified by its ID  and linked to the measure via the attribute `ptr="#ID"`.
 
-*NOTE* This implementation allows the linking of objects directly attached on the left or right of the measurement, for the time being far entities are not supported.
+.. note:: This implementation allows the linking of objects directly attached on the left or right of the measurement; for the time being, distant entities are not supported.
 
 
 *How to annotate?*
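
Before moving a corrected file into the corpus, the rule stated earlier in these guidelines (only XML tags may be added or moved, the text itself must stay untouched) can be checked mechanically. The following is a minimal sketch under stated assumptions: it uses Python's standard ``xml.etree.ElementTree``, the function names ``text_content`` and ``same_text`` are hypothetical, and the ``<measure>`` markup in the usage example is purely illustrative rather than the exact annotation scheme.

```python
import xml.etree.ElementTree as ET


def text_content(xml_string: str) -> str:
    """Return the concatenated character data of an XML document, dropping all tags."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


def same_text(generated: str, corrected: str) -> bool:
    """True when the two documents differ only in markup, not in their text."""
    return text_content(generated) == text_content(corrected)


# Illustrative markup only: the corrected version adds a tag but leaves the text alone.
generated = "<p>We used 10 grams of salt.</p>"
corrected = "<p>We used <measure>10 grams</measure> of salt.</p>"
print(same_text(generated, corrected))
```

A comparison that fails usually points to an accidental space or character inserted while editing tags, which is the kind of change the guidelines warn against.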

diff --git a/doc/img/cascade-schema.png b/doc/img/cascade-schema.png
diff --git a/doc/index.rst b/doc/index.rst
@@ -6,7 +6,9 @@
 Welcome to Grobid-quantities's documentation!
 =============================================
 
-Grobid-quantities is a ML-based application for identification, parsing and normalisation of any expressions of measurements (e.g. pressure, temperature, etc.) from text. This work focuses on technical and scientific articles and supports input data from raw text, PDF and XML. Extracted measurements normalised toward the International System of Units (SI).
+Grobid-quantities is an ML-based application for identification, parsing and normalisation of any expressions of measurements (e.g. pressure, temperature, etc.) from text.
+This work focuses on technical and scientific articles and supports input data from raw text, PDF, and XML.
+Extracted measurements are normalised toward the International System of Units (SI).
 
 
 .. toctree::

diff --git a/doc/introduction.rst b/doc/introduction.rst
@@ -1,20 +1,27 @@
+.. topic:: Introduction
+
 .. _Grobid: http://github.com/kermitt2/grobid
 .. _Units of measurement: http://unitsofmeasurement.github.io/
 
 
-.. topic:: Introduction
-
 Introduction
 ===============
 
-Grobid-quantities is a Java application, based on `Grobid`_ (GeneRation Of BIbliographic Data), a machine learning framework for parsing and structuring raw documents such as PDF or plain text. Grobid-quantities is designed for large-scale processing tasks in batch or via a web REST API.
+Grobid-quantities is a Java application, based on `Grobid`_ (GeneRation Of BIbliographic Data), a machine learning framework for parsing and structuring raw documents such as PDF or plain text.
+Grobid-quantities is designed for large-scale processing tasks in batch or via a web REST API.
+
+The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task.
+
+.. figure:: img/cascade-schema.png
+   :alt: Grobid-quantities cascade schema
+
+The models are trained using the Conditional Random Field (CRF) algorithm and recurrent neural networks (RNN), namely a bidirectional LSTM with CRF as activation layer (BidLSTM_CRF).
 
-The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using CRF (Conditional Random Field) algorithm.
 
 **quantities** are modelled using three different types:
     (a) ``atomic values`` in case of single measurements (e.g., 10 grams),
     (b) ``interval`` (e.g. ``from 3 to 5 km``) and ``range`` (``100 +- 4``  ) for continuous values, and,
-    (c) ``lists`` of discrete values:
+    (c) ``lists`` of discrete values where the measurement unit is shared.
 
 **units** are decomposed and restructured. Complementary information like unit system, type of measurement are attached by lookup in an internal lexicon.
 
@@ -28,7 +35,7 @@ The machine learning engine architecture follows the cascade approach, where eac
 
 The measurements that are identified are normalised toward the International System of Units (SI) using the java library `Units of measurement`_.
 
-Grobid-quantities also contains a module implementing the identification of the "quantified" object/substance related to the measure. This module is currently *experimental*.
+Grobid-quantities also contains a module implementing the identification of the "quantified" object/substance related to the measure. This module is still *experimental*.
 
 The following screenshot illustrate an example of measurement that is extracted, parsed and normalised, the quantified substance, *streptomycin* is additionally recognised:
 

diff --git a/doc/references.rst b/doc/references.rst
@@ -1,3 +1,4 @@
+.. _References:
 
 References
 ==========
@@ -7,30 +8,33 @@ How to cite
 
 If you want to cite this work, please simply refer to the github project, with optionally the `Software Heritage <https://www.softwareheritage.org/>`_ project-level permanent identifier:
 ::
-    grobid-quantities (2015-2022) <https://github.com/kermitt2/grobid-quantities>, swh:1:dir:dbf9ee55889563779a09b16f9c451165ba62b6d7
+ grobid-quantities (2015-2022) <https://github.com/kermitt2/grobid-quantities>, swh:1:dir:dbf9ee55889563779a09b16f9c451165ba62b6d7
 
 Here's a BibTeX entry using the `Software Heritage <https://www.softwareheritage.org/>`_ project-level permanent identifier:
 ::
-    @misc{grobid-quantities,
-        title = {grobid-quantities},
-        howpublished = {\url{https://github.com/kermitt2/grobid-quantities}},
-        publisher = {GitHub},
-        year = {2015--2022},
-        archivePrefix = {swh},
-        eprint = {1:dir:dbf9ee55889563779a09b16f9c451165ba62b6d7}
-    }
-
-Main works using grobid-quantities
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+  @misc{grobid-quantities,
+    title = {grobid-quantities},
+    howpublished = {\url{https://github.com/kermitt2/grobid-quantities}},
+    publisher = {GitHub},
+    year = {2015--2022},
+    archivePrefix = {swh},
+    eprint = {1:dir:dbf9ee55889563779a09b16f9c451165ba62b6d7}
+  }
+
+
+Main papers about grobid-quantities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+|    Luca Foppiano, Laurent Romary, Masashi Ishii, and Mikiko Tanifuji.
+|    Automatic identification and normalisation of physical measurements in scientific literature.
+|    September 2019, ACM, DocEng '19, Berlin, Germany.
+|    https://hal.inria.fr/hal-02294424
 
 |    Kyle Hundman and Chris A. Mattmann.
 |    Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science.
 |    2017, KDD 2017, Halifax, Nova Scotia, Canada.
 |    https://arxiv.org/pdf/1710.04312.pdf
 
-|    Luca Foppiano, Laurent Romary, Masashi Ishii, and Mikiko Tanifuji.
-|    Automatic identification and normalisation of physical measurements in scientific literature.
-|    September 2019, ACM, DocEng '19, Berlin, Germany.
 
 Other
 ~~~~~