Commit: Clarification on the annotated data used for training
Showing 3 changed files with 36 additions and 15 deletions.
# Generate training corpus

## Datasets

The grobid-ner project includes the following datasets:

- a manually annotated extract of the *Wikipedia article on World War 1* (approximately 10k words, 26 classes). This data, like the whole Wikipedia content, is available under the [Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/3.0/).

- a manually annotated extract of the *Holocaust data* from the [EHRI](https://portal.ehri-project.eu) research portal (26 classes), openly available ([data policy](https://portal.ehri-project.eu/data-policy)).

In addition to these datasets, the CRF models shipped with Grobid NER have also been trained on the following datasets:

- a small extract of the *Reuters dataset* corresponding to the *[CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) subset*, around 10k words, which has been manually re-annotated with 26 classes. The Reuters dataset was made available by Reuters Ltd.; it is not public and is not provided within this project. To obtain it, contact [NIST](http://trec.nist.gov/data/reuters/reuters.html). The public CONLL 2003 release ships only the annotations and likewise requires the Reuters dataset [to be built](http://www.cnts.ua.ac.be/conll2003/ner/000README). If you are interested in the 26-class manual annotations only, we can distribute them, but you will need to combine them with the Reuters corpus yourself.

- a large set of Wikipedia articles automatically annotated by Idilia, which can be downloaded from the [IDILIA download page](http://download.idilia.com/datasets/wikipedia/index.html). We use a part of this dataset in a semi-supervised training step as a complement to the supervised training based on the manually annotated corpus.
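To illustrate what annotated NER training data of this kind typically looks like, here is a minimal sketch that parses CoNLL-style token/label lines into sentences. The tab-separated layout and the sample labels are illustrative assumptions, not the exact grobid-ner or CONLL 2003 file format.

```python
# Minimal sketch: parse CoNLL-style NER annotations, assuming one
# "token<TAB>label" pair per line and a blank line between sentences.
# The format and the labels below are illustrative assumptions, not
# the exact grobid-ner training format.

def parse_conll(lines):
    """Group (token, label) pairs into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:                           # flush the last sentence
        sentences.append(current)
    return sentences

# Hypothetical sample in the assumed format
sample = [
    "Wilhelm\tB-PERSON",
    "II\tI-PERSON",
    "declared\tO",
    "war\tO",
    "",
    "Germany\tB-LOCATION",
]
for sentence in parse_conll(sample):
    print(sentence)
```

Distributing annotations separately from the text, as described for the Reuters subset above, would then amount to shipping only the label column and re-aligning it token by token against the licensed corpus.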