Commit: Clarification on the annotated data used for training
Showing 3 changed files with 36 additions and 15 deletions.
# Generate training corpus

## Datasets

The grobid-ner project includes the following datasets:

- a manually annotated extract of the *Wikipedia article on World War 1* (approximately 10k words, 26 classes). This data, like the whole Wikipedia content, is available under the [Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/3.0/).

- a manually annotated extract of the *Holocaust data* from the [EHRI](https://portal.ehri-project.eu) research portal (26 classes), openly available ([data policy](https://portal.ehri-project.eu/data-policy)).

In addition to these datasets, the CRF models shipped with Grobid NER have also been trained on the following datasets:

- a small extract of the *Reuters dataset* corresponding to the *[CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) subset*, around 10k words, which has been manually re-annotated with 26 classes. The Reuters dataset was made available by Reuters Ltd.; it is not public and is not provided within this project. To obtain it, contact [NIST](http://trec.nist.gov/data/reuters/reuters.html). The public CONLL 2003 release ships only the annotations and likewise requires the Reuters dataset [to be built](http://www.cnts.ua.ac.be/conll2003/ner/000README). If you are interested in the 26-class manual annotations only, we can distribute them, but you will need to combine them with the Reuters corpus yourself.

- a large set of Wikipedia articles automatically annotated by Idilia, which can be downloaded from the [IDILIA download page](http://download.idilia.com/datasets/wikipedia/index.html). We use a part of this dataset in a semi-supervised training step as a complement to the supervised training based on the manually annotated corpus.
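To illustrate what annotated NER training data of this kind typically looks like, here is a minimal sketch that parses CoNLL-style token/label lines into sentences. The tab-separated layout and the sample labels are illustrative assumptions, not the exact grobid-ner or CONLL 2003 file format.

```python
# Minimal sketch: parse CoNLL-style NER annotations, assuming one
# "token<TAB>label" pair per line and a blank line between sentences.
# The format and the labels below are illustrative assumptions, not
# the exact grobid-ner training format.

def parse_conll(lines):
    """Group (token, label) pairs into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:                           # flush the last sentence
        sentences.append(current)
    return sentences

# Hypothetical sample in the assumed format
sample = [
    "Wilhelm\tB-PERSON",
    "II\tI-PERSON",
    "declared\tO",
    "war\tO",
    "",
    "Germany\tB-LOCATION",
]
for sentence in parse_conll(sample):
    print(sentence)
```

Distributing annotations separately from the text, as described for the Reuters subset above, would then amount to shipping only the label column and re-aligning it token by token against the licensed corpus.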