Updated documentation - added some training infos
lfoppiano committed Sep 1, 2016
1 parent 2bd81a2 commit 3263de0
Showing 2 changed files with 25 additions and 0 deletions.
2 changes: 2 additions & 0 deletions grobid-ner/doc/index.md
@@ -25,3 +25,5 @@ Annotated data will be always welcomed, if you like to contribute, you can conta

* [Annotation guidelines](annotation-guidelines.md)

* [Training NER models](training-ner-model.md)

23 changes: 23 additions & 0 deletions grobid-ner/doc/training-ner-model.md
@@ -13,3 +13,26 @@ In addition to these datasets, the CRF models shipped with Grobid NER has been a
- a small extract of the *Reuters dataset* corresponding to the *[CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) subset*, around 10k words, which has been manually re-annotated with 26 classes. The Reuters corpus is available by contacting [NIST](http://trec.nist.gov/data/reuters/reuters.html). If you are interested only in the 26-class manual annotations, we can distribute them, but you will need to combine these annotations with the Reuters corpus.

- A large set of Wikipedia articles automatically annotated by Idilia. This data can be downloaded from the [IDILIA download page](http://download.idilia.com/datasets/wikipedia/index.html). We use a part of this dataset in a semi-supervised training step as a complement to the supervised training based on the manually annotated corpus.


## Training NER Model

### Training data
Since the training data are not freely available, it is necessary to assemble them beforehand.

TBD


### Train the NER model
This step assumes that all the datasets have been downloaded and that the property file `grobid-ner.properties` has been updated accordingly.

```
mvn generate-resources -Ptrain_ner
```

The process is computationally heavy and can take several days, depending on the available hardware.


### Train the Sense model

TBD
