Commit

Update training documentation for French NER
kermitt2 committed Oct 20, 2016
1 parent 6769cf8 commit f04d968
Showing 1 changed file with 16 additions and 5 deletions.
21 changes: 16 additions & 5 deletions grobid-ner/doc/training-ner-model.md
@@ -2,6 +2,8 @@

## Datasets

### English

The grobid-ner project includes the following dataset:

- manually annotated extract of the *Wikipedia article on World War 1* (approximately 10k words, 26 classes). This data, like all Wikipedia content, is available under the [Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/3.0/).
@@ -14,14 +16,15 @@ In addition to these datasets, the CRF models shipped with Grobid NER has been a

- A large set of Wikipedia articles automatically annotated by Idilia. This data can be downloaded from [IDILIA download page](http://download.idilia.com/datasets/wikipedia/index.html). We use a part of this dataset in a semi-supervised training step as a complement to the supervised training based on the manually annotated corpus.

### French

grobid-ner can be trained with the _Le Monde corpus_; a dedicated XML parser is included.

## Training NER Model
## Training NER Models

### Training data
Since the training data are not freely available, it is necessary to assemble them beforehand.

TBD


### Train the NER model
The assumption is that all the required datasets have been downloaded and the property file `grobid-ner.properties` updated accordingly.
@@ -38,9 +41,17 @@ To start the training:
mvn generate-resources -Ptrain_ner
```

The process is pretty heavy and it will require several days, depending on the hardware available.
Due to the semi-supervised training, the process is computationally heavy and can take several days, depending on the available hardware.

The French NER can be trained as follows:

> mvn generate-resources -Ptrain_nerfr

### Train the Sense model

TBD
To start the training:

```
mvn generate-resources -Ptrain_nersense
```
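
Since each training run can take a long time, it may be convenient to chain the Maven profiles shown above in one script. The following is only a minimal sketch, not part of the project: the `train_profiles` function and the `MVN` override variable are hypothetical names introduced for illustration, and it assumes the datasets and `grobid-ner.properties` are already in place.

```shell
#!/bin/sh
# Hypothetical helper (not part of grobid-ner): run each Maven training
# profile in the given order and stop at the first failure.
# MVN is an assumed override variable; set MVN=echo for a dry run.
train_profiles() {
    for profile in "$@"; do
        echo "Training with profile: $profile"
        "${MVN:-mvn}" generate-resources "-P$profile" || return 1
    done
}

# Example call, using the profile names from the commands above:
# train_profiles train_ner train_nerfr train_nersense
```

Stopping at the first failing profile avoids spending days on a later run when an earlier one has already failed.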
