Merge pull request #168 from lfoppiano/update-training-data
Update training data, update models, fix Docker, update end-to-end evaluation, new documentation in Markdown
Showing 142 changed files with 1,334,313 additions and 847,015 deletions.

# Evaluation scores

## End-to-end evaluation

The end-to-end evaluation was performed with the [MeasEval dataset](https://github.com/harperco/MeasEval) (SemEval-2021 Task 8).
The scores in the following table are micro averages. MeasEval was annotated to allow approximate entities, which are not supported in grobid-quantities.

| Type (Ref)                | Matching method | Precision | Recall | F1-score | Support |
|---------------------------|-----------------|-----------|--------|----------|---------|
| Quantities (QUANT)        | strict          | 54.09     | 54.47  | 54.28    | 1137    |
| Quantities (QUANT)        | soft            | 67.02     | 67.49  | 67.26    | 1137    |
| Quantified substance (ME) | strict          | 13.82     | 9.67   | 11.38    | 615     |
| Quantified substance (ME) | soft            | 21.63     | 15.13  | 17.80    | 615     |

Note: ME (Measured Entity) recognition is still experimental in grobid-quantities.
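
As an illustration of the two matching methods above (not the official MeasEval scorer, which is more involved): strict matching requires identical span boundaries, while soft matching accepts any overlap between predicted and reference spans. A minimal sketch:

```python
# Minimal sketch of strict vs. soft span matching on (start, end) character
# offsets; treat this as an illustration, not the official metric.

def strict_match(pred, gold):
    """Exact boundary match: offsets must be identical."""
    return pred == gold

def soft_match(pred, gold):
    """Overlap match: the predicted and gold spans share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]

# Example: a prediction whose start is off by one character.
print(strict_match((5, 10), (4, 10)))  # False
print(soft_match((5, 10), (4, 10)))    # True
```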

To reproduce the end-to-end evaluation, run the `scripts/measeval_e2e_eval.py` script, after installing the dependencies listed in `requirements.txt`.
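
A possible invocation, assuming the commands are run from the project root (the script may take additional arguments, such as the location of the MeasEval data; check the script itself):

```shell
# install the evaluation dependencies, then run the end-to-end evaluation script
pip install -r requirements.txt
python scripts/measeval_e2e_eval.py
```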

## Machine Learning Named Entity Recognition Evaluation

The scores (P: Precision, R: Recall, F1: F1-score) for all the models are computed either with 10-fold cross-validation or on a holdout dataset.
The holdout dataset of grobid-quantities is composed of the following examples:

- Quantities ML: 10 articles
- Units ML: [UNISCOR dataset](references.md) with around 1600 examples
- Values ML: 950 examples

For deep learning models (BidLSTM_CRF, BidLSTM_CRF_FEATURES, BERT_CRF), we report the average over 5 runs; a sketch of this aggregation follows.
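
A minimal sketch of how per-run scores can be aggregated into the reported mean and standard deviation (the run values below are made up):

```python
# Aggregate scores from repeated training runs into mean and standard deviation.
from statistics import mean, stdev

f1_runs = [91.4, 91.6, 91.5, 91.6, 91.5]  # hypothetical F1 scores from 5 runs
print(f"F1 = {mean(f1_runs):.2f} +/- {stdev(f1_runs):.4f}")
```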

The models are organised as follows (an illustrative sketch of the BidLSTM-CRF architecture appears after this list):

- BidLSTM_CRF is an RNN model based on the work of (Lample et al., 2016), with a CRF as the final classification layer
- BidLSTM_CRF_FEATURES is an extension of BidLSTM_CRF that allows using layout features
- BERT_CRF is a BERT-based model obtained by fine-tuning a SciBERT encoder; like the others, it is topped with a CRF layer
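
For illustration only, here is a minimal BiLSTM-CRF tagger in PyTorch using the pytorch-crf package. The actual grobid-quantities models are implemented in the DeLFT library; the layer sizes below are arbitrary assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int,
                 embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional LSTM over the token embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # project LSTM states to per-tag emission scores
        self.to_tags = nn.Linear(2 * hidden_dim, num_tags)
        # CRF layer on top, replacing a plain softmax classifier
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None):
        emissions = self.to_tags(self.lstm(self.embed(tokens))[0])
        if tags is not None:
            return -self.crf(emissions, tags)  # negative log-likelihood for training
        return self.crf.decode(emissions)      # best tag sequence at inference

model = BiLSTMCRF(vocab_size=1000, num_tags=5)
print(model(torch.randint(0, 1000, (1, 7))))  # e.g. [[3, 3, 1, 0, 2, 4, 4]]
```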

### Results on the holdout dataset

The evaluation was performed on the holdout portion of the grobid-quantities dataset.
Average values are computed as micro averages (see the sketch below for how these differ from macro averages).
To reproduce these results, see the evaluation documentation.
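
As a worked illustration of micro vs. macro averaging (the counts below are invented and do not correspond to the tables):

```python
# Micro averaging pools the counts across labels, so frequent labels dominate;
# macro averaging weighs every label equally.
labels = {
    # label: (true positives, false positives)
    "<unitLeft>": (400, 40),
    "<unitRight>": (4, 8),
}

# Macro average: mean of the per-label precisions.
macro_p = sum(tp / (tp + fp) for tp, fp in labels.values()) / len(labels)

# Micro average: pool the counts first, then compute one precision.
tp_sum = sum(tp for tp, _ in labels.values())
fp_sum = sum(fp for _, fp in labels.values())
micro_p = tp_sum / (tp_sum + fp_sum)

print(f"macro P = {macro_p:.2%}, micro P = {micro_p:.2%}")  # 62.12% vs 89.38%
```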

#### Quantities

| Labels          | CRF   |       |       | BERT_CRF |       |       |        | Support |
|-----------------|-------|-------|-------|----------|-------|-------|--------|---------|
| Metrics         | P     | R     | F1    | P        | R     | F1    | St.dev |         |
| `<unitLeft>`    | 90.26 | 83.84 | 86.93 | 93.13    | 89.96 | 91.52 | 0.0086 | 464     |
| `<unitRight>`   | 36.36 | 30.77 | 33.33 | 23.67    | 40.00 | 29.70 | 0.0139 | 13      |
| `<valueAtomic>` | 75.75 | 77.97 | 76.84 | 85.46    | 87.99 | 86.70 | 0.0041 | 581     |
| `<valueBase>`   | 80.77 | 60.00 | 68.85 | 98.75    | 90.29 | 94.33 | 0.0163 | 35      |
| `<valueLeast>`  | 76.24 | 61.11 | 67.84 | 84.58    | 72.22 | 77.91 | 0.0212 | 126     |
| `<valueList>`   | 27.27 | 11.32 | 16.00 | 61.10    | 39.62 | 47.79 | 0.0262 | 53      |
| `<valueMost>`   | 68.35 | 55.67 | 61.36 | 78.93    | 71.75 | 75.16 | 0.0179 | 97      |
| `<valueRange>`  | 91.18 | 88.57 | 89.86 | 100.00   | 91.43 | 95.52 | 0.0000 | 35      |
| All (micro avg) | 79.49 | 73.72 | 76.50 | 86.50    | 83.97 | 85.22 | 0.0031 | 1404    |

| Labels          | BidLSTM_CRF |       |       |        | BidLSTM_CRF_FEATURES |       |       |        | Support |
|-----------------|-------------|-------|-------|--------|----------------------|-------|-------|--------|---------|
| Metrics         | P           | R     | F1    | St.dev | P                    | R     | F1    | St.dev |         |
| `<unitLeft>`    | 87.58       | 89.96 | 88.75 | 0.0074 | 86.95                | 89.57 | 88.24 | 0.0097 | 464     |
| `<unitRight>`   | 25.01       | 30.77 | 27.50 | 0.0193 | 23.99                | 30.77 | 26.91 | 0.0146 | 13      |
| `<valueAtomic>` | 79.52       | 85.71 | 82.49 | 0.0044 | 78.33                | 86.57 | 82.24 | 0.0062 | 581     |
| `<valueBase>`   | 83.84       | 97.14 | 89.97 | 0.0185 | 80.99                | 97.14 | 88.32 | 0.0115 | 35      |
| `<valueLeast>`  | 83.79       | 62.38 | 71.45 | 0.0294 | 84.37                | 60.00 | 70.06 | 0.0335 | 126     |
| `<valueList>`   | 80.12       | 13.58 | 23.05 | 0.0326 | 69.29                | 14.34 | 23.37 | 0.0715 | 53      |
| `<valueMost>`   | 75.91       | 70.92 | 73.22 | 0.0311 | 75.54                | 67.01 | 70.99 | 0.0370 | 97      |
| `<valueRange>`  | 92.87       | 94.86 | 93.84 | 0.0783 | 95.58                | 97.14 | 96.35 | 0.0673 | 35      |
| All (micro avg) | 82.12       | 81.28 | 81.70 | 0.0048 | 81.26                | 81.11 | 81.19 | 0.0090 | 1404    |

#### Units

Units were evaluated using the UNISCOR dataset; for more information, see the [UNISCOR](references.md#uniscor) section.

| Labels          | CRF   |       |       | BERT_CRF |       |       |        | Support |
|-----------------|-------|-------|-------|----------|-------|-------|--------|---------|
| Metrics         | P     | R     | F1    | P        | R     | F1    | St.dev |         |
| `<base>`        | 80.64 | 82.71 | 81.66 | 73.63    | 76.26 | 74.89 | 0.0231 | 3228    |
| `<pow>`         | 71.94 | 74.34 | 73.12 | 80.20    | 57.35 | 66.75 | 0.0752 | 1773    |
| `<prefix>`      | 92.60 | 86.48 | 89.43 | 77.61    | 88.05 | 82.12 | 0.0338 | 1287    |
| All (micro avg) | 80.39 | 81.12 | 80.76 | 75.55    | 73.34 | 74.41 | 0.0178 | 6288    |

| Labels          | BidLSTM_CRF |       |       |        | BidLSTM_CRF_FEATURES |       |       |        | Support |
|-----------------|-------------|-------|-------|--------|----------------------|-------|-------|--------|---------|
| Metrics         | P           | R     | F1    | St.dev | P                    | R     | F1    | St.dev |         |
| `<base>`        | 52.17       | 46.16 | 48.93 | 0.0494 | 51.99                | 48.00 | 49.88 | 0.0259 | 3228    |
| `<pow>`         | 94.25       | 56.89 | 70.94 | 0.0125 | 94.20                | 56.92 | 70.96 | 0.0062 | 1773    |
| `<prefix>`      | 81.36       | 82.88 | 82.01 | 0.0119 | 82.11                | 82.94 | 82.43 | 0.0201 | 1287    |
| All (micro avg) | 68.12       | 56.70 | 61.85 | 0.0282 | 67.76                | 57.67 | 62.29 | 0.0173 | 6288    |

#### Values

| Labels          | CRF    |        |        | BERT_CRF |        |        |        | Support |
|-----------------|--------|--------|--------|----------|--------|--------|--------|---------|
| Metrics         | P      | R      | F1     | P        | R      | F1     | St.dev |         |
| `<alpha>`       | 96.90  | 99.21  | 98.04  | 99.21    | 99.37  | 99.29  | 0.0017 | 464     |
| `<base>`        | 100.00 | 92.31  | 96.00  | 100.00   | 100.00 | 100.00 | 0.0000 | 13      |
| `<number>`      | 99.14  | 99.63  | 99.38  | 99.43    | 99.46  | 99.44  | 0.0005 | 581     |
| `<pow>`         | 100.00 | 100.00 | 100.00 | 100.00   | 100.00 | 100.00 | 0.0000 | 35      |
| All (micro avg) | 98.86  | 99.48  | 99.17  | 99.42    | 99.46  | 99.44  | 0.0004 | 1093    |

| Labels          | BidLSTM_CRF |       |       |        | BidLSTM_CRF_FEATURES |       |       |        | Support |
|-----------------|-------------|-------|-------|--------|----------------------|-------|-------|--------|---------|
| Metrics         | P           | R     | F1    | St.dev | P                    | R     | F1    | St.dev |         |
| `<alpha>`       | 97.82       | 99.53 | 98.66 | 0.0035 | 93.13                | 89.96 | 91.52 | 0.0086 | 464     |
| `<base>`        | 97.78       | 67.69 | 79.46 | 0.0937 | 23.67                | 40.00 | 29.70 | 0.0139 | 13      |
| `<number>`      | 98.92       | 99.33 | 99.13 | 0.0008 | 85.46                | 87.99 | 86.70 | 0.0041 | 581     |
| `<pow>`         | 69.11       | 73.85 | 71.29 | 0.1456 | 98.75                | 90.29 | 94.33 | 0.0163 | 35      |
| All (micro avg) | 98.34       | 98.59 | 98.47 | 0.0023 | 86.50                | 83.97 | 85.22 | 0.0031 | 1093    |

### Other published results

> :information_source: The paper "Automatic Identification and Normalisation of Physical Measurements in Scientific Literature", published in September 2019, reported macro-averaged evaluation scores.