Updated documentation - added some training infos

kermitt2 · Sep 1, 2016 · 3263de0
1 parent 2bd81a2
commit 3263de0

Showing 2 changed files with 25 additions and 0 deletions.
diff --git a/grobid-ner/doc/index.md b/grobid-ner/doc/index.md
@@ -25,3 +25,5 @@ Annotated data will be always welcomed, if you like to contribute, you can conta
 
 * [Annotation guidelines](annotation-guidelines.md)
 
+* [Training NER models](training-ner-model.md)
+
diff --git a/grobid-ner/doc/training-ner-model.md b/grobid-ner/doc/training-ner-model.md
@@ -13,3 +13,26 @@ In addition to these datasets, the CRF models shipped with Grobid NER has been a
  - a small extract of the *Reuters dataset* corresponding to the *[CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) subset*, around 10k words, which has been manually re-annotated with 26 classes. The Reuters corpus is available by contacting [NIST](http://trec.nist.gov/data/reuters/reuters.html). If you are interested only in the 26-class manual annotations, we can distribute them, but you will need to combine these annotations with the Reuters corpus.  
 
  - A large set of Wikipedia articles automatically annotated by Idilia. This data can be downloaded from [IDILIA download page](http://download.idilia.com/datasets/wikipedia/index.html). We use a part of this dataset in a semi-supervised training step as a complement to the supervised training based on the manually annotated corpus.
+
+
+## Training NER Model
+
+### Training data
+Since the training data are not freely available, they must be assembled beforehand.
+
+TBD
+
+
+### Train the NER model 
+This step assumes that all the datasets have been downloaded and that the property file `grobid-ner.properties` has been updated accordingly.
+
+```
+mvn generate-resources -Ptrain_ner
+```
+
+The training process is computationally heavy and can take several days, depending on the available hardware.
+
+
+### Train the Sense model 
+
+TBD