Added class to guess the grobid-home, added lemonde xml with obfuscated content only for training parser purposes,

Updated documentation.
lfoppiano committed Aug 31, 2016
1 parent cc07cd0 commit 98c398d
Showing 9 changed files with 154 additions and 51 deletions.
9 changes: 3 additions & 6 deletions grobid-ner/doc/Licence.md
Original file line number Diff line number Diff line change
@@ -9,9 +9,6 @@ For citing the tool, please refer to the github project: <https://github.com/gro

### Datasets

The datasets distributed with this project are publicly available under the following licences:
- [Wikipedia](http://www.wikipedia.org) data is available under the licence [Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/3.0/).
- [EHRI](https://portal.ehri-project.eu) data from the research portal, openly available as mentioned in the EHRI [data policy](https://portal.ehri-project.eu/data-policy).

The following datasets have been used as training data, but are not distributed with the project:
- Reuters corpus, not publicly available. To obtain it, contact [NIST](http://trec.nist.gov/data/reuters/reuters.html).
The datasets used for training the models behind this tool are not all publicly available.
They can be requested from their respective owners in order to rebuild the original GROBID NER dataset.
See the [respective pages](training-ner-model.md) for more information.
8 changes: 2 additions & 6 deletions grobid-ner/doc/index.md
@@ -6,17 +6,13 @@ GROBID NER is a Named-Entity Recogniser module for [GROBID](https://raw.github.c
GROBID NER has been developed more specifically for the purpose of further supporting post disambiguation and resolution of entities against knowledge bases such as Wikipedia.

The current models shipped with the source use 26 Named Entity [classes](classes-ane-senses.md) and have been trained using the following datasets:
- Reuters NER [CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) manually annotated training data (10k words, 26 classes). This dataset is not public, so not shipped with the code. In order to obtain it,
- Manually annotated extract from the Wikipedia article on World War 1 (approximately 10k words, 26 classes)
- Reuters NER [CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) manually annotated training data (10k words, 26 classes). This dataset is not public, so not shipped with the code. In order to obtain it,
- Manually annotated extract from the Wikipedia article on World War 1 (approximately 10k words, 26 classes)

The training has been complemented with a very large semi-supervised training set based on the Wikipedia Idilia dataset.

Annotated data is always welcome. If you would like to contribute, you can contact us via email or by opening an issue in the GitHub project.

## About

* [License](License.md)

## User manual

* [Install GROBID NER](build-and-install.md)
14 changes: 14 additions & 0 deletions grobid-ner/doc/training-ner-model.md
@@ -0,0 +1,14 @@
# Generate training corpus

### Datasets

GROBID NER has been trained on several datasets:
- Reuters NER [CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) manually annotated training data (10k words, 26 classes). This dataset is not public, so not shipped with the code. In order to obtain it,
- Manually annotated extract from the Wikipedia article on World War 1 (approximately 10k words, 26 classes)

The datasets distributed with this project are publicly available under the following licences:
- [Wikipedia](http://www.wikipedia.org) data is available under the licence [Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/3.0/).
- [EHRI](https://portal.ehri-project.eu) data from the research portal, openly available as mentioned in the EHRI [data policy](https://portal.ehri-project.eu/data-policy).

The following datasets have been used as training data, but are not distributed with the project:
- Reuters corpus, not publicly available. To obtain it, contact [NIST](http://trec.nist.gov/data/reuters/reuters.html).
42 changes: 42 additions & 0 deletions grobid-ner/src/main/java/org/grobid/core/utilities/GrobidHome.java
@@ -0,0 +1,42 @@
package org.grobid.core.utilities;

import org.grobid.core.exceptions.GrobidPropertyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.apache.commons.lang3.StringUtils.isEmpty;

/**
 * Created by lfoppiano on 31/08/16.
 */
public class GrobidHome {

    private static final Logger LOGGER = LoggerFactory.getLogger(GrobidHome.class);

    /**
     * Try to get the GROBID_HOME from the environment variable or by using some default locations:
     * - ../grobid-home
     * - ../../grobid-home (in case the whole repository is cloned directly under the grobid project)
     */
    public static void findGrobidHome() {
        String grobidHome = System.getenv("GROBID_HOME");
        if (!isEmpty(grobidHome)) {
            GrobidProperties.set_GROBID_HOME_PATH(grobidHome);
            GrobidProperties.setGrobidPropertiesPath(grobidHome + "/config/grobid.properties");
        } else {
            try {
                LOGGER.trace("Trying grobid home from the usual location at ../grobid-home ");
                GrobidProperties.set_GROBID_HOME_PATH("../grobid-home");
                GrobidProperties.setGrobidPropertiesPath("../grobid-home/config/grobid.properties");
            } catch (GrobidPropertyException gpe) {
                LOGGER.error("Grobid HOME not found, trying to fish it from ../../grobid-home ");
                try {
                    GrobidProperties.set_GROBID_HOME_PATH("../../grobid-home");
                    GrobidProperties.setGrobidPropertiesPath("../../grobid-home/config/grobid.properties");
                } catch (GrobidPropertyException gpe2) {
                    LOGGER.error("Grobid HOME at ../../grobid-home not found, set the environment variable GROBID_HOME");
                }
            }
        }
    }
}
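
Note on the lookup order in `GrobidHome.findGrobidHome()`: the method prefers the `GROBID_HOME` environment variable and otherwise falls back to `../grobid-home`, then `../../grobid-home`. The sketch below restates that order as a small standalone helper, without any GROBID dependency, so the precedence can be checked in isolation. The class name `GrobidHomeLookupSketch`, the `resolve` method and its parameters are hypothetical and not part of this commit. It also checks for directories itself instead of relying on `GrobidPropertyException`, which is an assumption about how the fallback could be expressed.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical, dependency-free restatement of the lookup order; not the committed class.
final class GrobidHomeLookupSketch {

    // Relative locations tried when GROBID_HOME is unset, in the same order as the commit.
    static final List<String> DEFAULT_CANDIDATES =
            Arrays.asList("../grobid-home", "../../grobid-home");

    /**
     * Returns the first usable grobid-home location, if any.
     * envValue stands in for System.getenv("GROBID_HOME") so the logic is easy to test.
     */
    static Optional<File> resolve(String envValue, File workingDir, List<String> candidates) {
        if (envValue != null && !envValue.isEmpty()) {
            // The environment variable wins and is returned without an existence check.
            return Optional.of(new File(envValue));
        }
        for (String candidate : candidates) {
            File dir = new File(workingDir, candidate);
            if (dir.isDirectory()) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        File workingDir = new File(System.getProperty("user.dir"));
        Optional<File> home =
                resolve(System.getenv("GROBID_HOME"), workingDir, DEFAULT_CANDIDATES);
        System.out.println(home.map(File::getPath)
                .orElse("grobid-home not found, set the environment variable GROBID_HOME"));
    }
}
```

Returning an `Optional` rather than logging inside the helper leaves it to the caller to decide whether a missing grobid-home is fatal.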
15 changes: 5 additions & 10 deletions grobid-ner/src/main/java/org/grobid/trainer/AssembleNERCorpus.java
@@ -11,10 +11,7 @@
import org.grobid.core.lexicon.Lexicon;
import org.grobid.core.main.LibraryLoader;
import org.grobid.core.mock.MockContext;
import org.grobid.core.utilities.GrobidProperties;
import org.grobid.core.utilities.OffsetPosition;
import org.grobid.core.utilities.Pair;
import org.grobid.core.utilities.TextUtilities;
import org.grobid.core.utilities.*;
import org.grobid.trainer.sax.ReutersSaxHandler;
import org.grobid.trainer.sax.SemDocSaxHandler;
import org.grobid.trainer.sax.TextSaxHandler;
@@ -770,18 +767,16 @@ public void assembleWikipedia() {
*/
public static void main(String[] args) {
try {
String pGrobidHome = "../grobid-home";
String pGrobidProperties = "../grobid-home/config/grobid.properties";
GrobidHome.findGrobidHome();

MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
GrobidProperties.getInstance();
MockContext.setInitialContext();

AssembleNERCorpus assembler = new AssembleNERCorpus();
//assembler.assembleCoNLL();
assembler.assembleWikipedia();
assembler.assembleWikipedia();
}
catch (Exception e) {
e.printStackTrace();

}
finally {
try {
30 changes: 2 additions & 28 deletions grobid-ner/src/test/java/org/grobid/core/EngineMockTest.java
@@ -4,6 +4,7 @@
import org.grobid.core.exceptions.GrobidPropertyException;
import org.grobid.core.factory.GrobidFactory;
import org.grobid.core.mock.MockContext;
import org.grobid.core.utilities.GrobidHome;
import org.grobid.core.utilities.GrobidProperties;
import org.junit.AfterClass;
import org.junit.BeforeClass;
@@ -23,37 +24,10 @@ public static void destroyInitialContext() throws Exception {

@BeforeClass
public static void initInitialContext() throws Exception {
findGrobidHome();
GrobidHome.findGrobidHome();

GrobidProperties.getInstance();
MockContext.setInitialContext();
engine = GrobidFactory.getInstance().createEngine();
}

/**
* Try to get the GROBID_HOME from the environment variable or by using some default locations:
* - ../grobid-home
* - ../../grobid-home (in case the whole repository is cloned directly under the grobid project)
*/
public static void findGrobidHome() {
String grobidHome = System.getenv("GROBID_HOME");
if (!isEmpty(grobidHome)) {
GrobidProperties.set_GROBID_HOME_PATH(grobidHome);
GrobidProperties.setGrobidPropertiesPath(grobidHome + "/config/grobid.properties");
} else {
try {
LOGGER.trace("Trying grobid home from the usual location at ../grobid-home ");
GrobidProperties.set_GROBID_HOME_PATH("../grobid-home");
GrobidProperties.setGrobidPropertiesPath("../grobid-home/config/grobid.properties");
} catch (GrobidPropertyException gpe) {
LOGGER.error("Grobid HOME not found, trying to fish it from ../../grobid-home ");
try {
GrobidProperties.set_GROBID_HOME_PATH("../../grobid-home");
GrobidProperties.setGrobidPropertiesPath("../../grobid-home/config/grobid.properties");
} catch (GrobidPropertyException gpe2) {
LOGGER.error("Grobid HOME at ../../grobid-home not found, set the environment variable GROBID_HOME");
}
}
}
}
}
Original file line number Diff line number Diff line change
@@ -39,7 +39,7 @@ public void testSampleParsing_shouldWork() throws Exception {
String splitted[] = output.split("\n");

assertThat(splitted[0], is("-DOCSTART- id248980"));
assertThat(splitted[1], is("Certes\tO"));
assertThat(splitted[1], is("zzbbzb\tO"));
assertThat(splitted[2], is(",\tO"));
}
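
Note on the output format asserted above: the test expects a `-DOCSTART- <id>` line followed by one `token<TAB>label` line per token. The sketch below shows one way such lines could be read back. It is a minimal illustration under that assumption only; `ConllOutputSketch`, `documentId` and `tokenAndLabel` are hypothetical names and do not exist in GROBID NER.

```java
// Hypothetical reader for the line format asserted in the test; not part of GROBID NER.
final class ConllOutputSketch {

    static final String DOC_START = "-DOCSTART-";

    /** Returns the id from a "-DOCSTART- <id>" line, or null if the line is not a document marker. */
    static String documentId(String line) {
        if (!line.startsWith(DOC_START)) {
            return null;
        }
        return line.substring(DOC_START.length()).trim();
    }

    /** Splits a "token<TAB>label" line into {token, label}, using the last tab as separator. */
    static String[] tokenAndLabel(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab separator in: " + line);
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        // Same three lines the test checks.
        String output = "-DOCSTART- id248980\nzzbbzb\tO\n,\tO\n";
        for (String line : output.split("\n")) {
            String id = documentId(line);
            if (id != null) {
                System.out.println("document " + id);
            } else {
                String[] pair = tokenAndLabel(line);
                System.out.println("token=" + pair[0] + " label=" + pair[1]);
            }
        }
    }
}
```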

84 changes: 84 additions & 0 deletions grobid-ner/src/test/resources/le.monde.corpus.sample.xml
@@ -0,0 +1,84 @@
<?xml version="1.0" encoding="utf-8"?>
<corpus>
<subcorpus name="ftb6 1">
<document id="id248980">
<sentence id="E1">
zzbbzb, bbzb bz zbb bb'bbz bzzbbzz bbbbbb zz zzbbbbz bbb bbb bbbbzb zz bbb, bzbbzbzbb b'zzzbbz zb 10
zbbbb, b'zbbzbb bzb ébé bz zbbbz bzbb zzbbz bbbzzzbbbbé, bzzbbbzbbbzbb bzb bbzbzbbbbzb, b'ébzbb bzb bz
bbbb bbzbbbzbz.
</sentence>
<sentence id="E2">
bbbbbbbb zbb-bb bbz bz bbbbzbbzbzbb z zézé zbbbb bbz bzb zzbzbbzbbzb bbbbbbbbzb bz bbbbbbzbb zbzbbz
bbbbbbbb zbbzbbzbbbz zb zbbbbzèbzbb, zbzz bb zzb zbbzbzbz, bbz bz zézzbbz zzb bbbébêbb zzb bzbzzzb bzbbz
bzb bz bzbbbbzb z'bb bbbbèbz bbb zzbbbbb à bbbbbbbbzb bzb zzbzb bbbbbbzb zb zb zébzbbzbzbb bébébzbbbé
zzb bbbbbzbbzb zz bz "<ENAMEX type="Organization" sub_type="InstitutionalOrganization" eid="null"
name="Sécurité sociale en France" uri="null">bécb</ENAMEX>".
</sentence>
<sentence id="E3">
zb zbbb, bb b'zbb bbzbbbé zzbzbb bz zbbbbbzbbbbz zzb bzbbébzbbzbbb zb zbbbb bézbzzb (bbb bbzbzbbzbb bz
bêbz zbbbz zz bzbbébzbbzbbbbbé bbz bzb bbbzbzzbbbbzb zb bbbzz bzbzbbé), bbbb bbébzzbbéb bzb bzbbb
bbbzbbbéb bbbzbbzb bbz bzb bzb zbbzbbzb zzb bézzzbbb à bz bzzbzbzbz zz bèbbzb zb bzb zbzbbzb zb
zzzzbbzzbzb.
</sentence>
<sentence id="E4">
bz bbbbzbbzbzbb bz bébbbb zbbz à bzbbbbzb bz zzbbz zzbb bz zzbb zz bzbbzbzbbzb bbb bbb zbbbzbzbb zzbb bz
bbzbbz zz bzbb bbzzbzzbbé à bébzb bz bbbbèbz zz bzbbé.
</sentence>
<sentence id="E5">
bbbbbz zb bbbb, zbzzbb b'zbbzbz à zz bbz bz zbbbzbb z'zbzb zbbbbz zzbb bzb bbbzbzbbb bbbbb bz zbbbzbbbbb
bézbzzbz zz 1989 bébbzbéz bzb b'zbbbbzbzz-bzbzzbz zb bzb bézzzbbb...
</sentence>
<sentence id="E6">
M. <ENAMEX type="Person" gender="m" eid="1000000001656194" oldname="René Teulade" name="René Teulade"
uri="http://fr.wikipedia.org/wiki/René Teulade">Teulade
</ENAMEX> bzbb, à bbbbz bbbbz, zbbbbzébzb bbz "bz zbézbzbbbbé zb bbbbèbz zbbbzbbbbbbzb zbb zb bzb".
</sentence>
<sentence id="E12">
z'zzbbz, bzbzz bbz b'bbbbbbzbzz zz b. <ENAMEX type="Person" gender="m" eid="1000000001656194"
oldname="René Teulade" name="René Teulade"
uri="http://fr.wikipedia.org/wiki/René Teulade">Teulade
</ENAMEX> zb zz bbb bbézézzbbzbb, b. <ENAMEX type="Person" gender="m" eid="1000000000009172"
oldname="Jean-Louis Bianco" name="Jean-Louis Bianco"
uri="http://fr.wikipedia.org/wiki/Jean-Louis Bianco">Jean-Louis
Bbabcb</ENAMEX>, à bbzbzzb bz zzbbz zz bz bzbbbbzbbbzbbbb zb bbbbèbz zz bzbbé zbbbzbçzbb à bbbbzb bzb
zbbbbb.
</sentence>
<sentence id="E13">
Sur lzs zouzz zzrnizrs mois, lzs zépznszs zz sznté n'ont progrzssé quz zz 5% zlors quz lzur zroissznzz
vzrizit zntrz 6% zt 9% zzs trois zzrnièrzs znnézs.
</sentence>
<sentence id="E14">
zb bzzbbz bbzb, bzb bébzbzbbbbbbb zb bzbzbbbbbzbzbb ézbbbbbbbz zb zz bz bbbbéz zb zbôbzbz bbb bzb
bzbbbézb zz zbbbbzbbbbb, bbb bbbb bz bbzzbbbz zzbb bzb bbbzbzbbb bbbbb bzb bb zézbbbzbb zz 30 bbbbbzbzb
zz zbzbzb zz bz bbébbbzbbz zz bz "<ENAMEX type="Organization" sub_type="InstitutionalOrganization"
eid="null" name="Sécurité sociale en France" uri="null">
Sécu</ENAMEX>", bbbbbzbzbb zbzbbz zzbzbbzbz bbz bzîbbbbz zbbzzbz zzb zébzbbzb.
</sentence>
<sentence id="E15">
Lz <ENAMEX type="Organization" eid="1000000001671259" sub_type="InstitutionalOrganization"
oldname="Caisse nationale de l’assurance maladie des travailleurs salariés"
name="Caisse nationale de l’assurance maladie des travailleurs salariés"
uri="http://fr.wikipedia.org/wiki/Caisse nationale de l’assurance maladie des travailleurs salariés">
zzbbbz bzbbbbzbz z'zbbbbzbzz-bzbzzbz
</ENAMEX> bbzbb z'zbbbzbbb zz bzbbbz zb bbbbb bb zbbbbbbbbz bzbzbzzbb zz bbzbbz bbbbb bz zébzb zz
bzbzbbbbzbzbb zzb zbbbbéb.
</sentence>
<sentence id="E16">
zzbbz bzbbbz, bbb bbbbbzbb êbbz zbbbbbbéz zzbb bzb bbbzbzbbzb bzbzbbzb, bzbbzbbbzbb z'ézbbbbbbzb bbzbbbz
4 bbbbbzbzb zz zbzbzb.
</sentence>
</document>
<document id="id248982">
<sentence id="E49">
bzbbb bbz ébbzz zz b'<ENAMEX type="Organization" sub_type="InstitutionalOrganization"
eid="1000000000002268"
oldname="Organisation de coopération et de développement économiques"
name="Organisation de coopération et de développement économiques"
uri="http://fr.wikipedia.org/wiki/Organisation de coopération et de développement économiques">
OCDE</ENAMEX>, zb zbb zbb, zbbbz 1980 zb 1990, bz bbbzbzbbbbbé zbbzbb bbbbbzbbé zz 47% zzbb bz bzzbzbb
bbzbbbbbzb zb bzbbzbzbb zz 2% zzbb bzb bzbbbzzb, bbbb bbz zbbbzbbzbbbb zzb bbbb à bz bbbzbzbbbb zzbb
zbbb bbbbzbz zzbb b'bbzbbbbbz.
</sentence>
</document>
</subcorpus>
</corpus>
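
Note on the obfuscation: the sample text looks consistent with a simple letter mapping in which the letters a to e become `z`, the other ASCII letters become `b`, and accented characters, digits and punctuation are kept. This matches the updated test assertion, where the first token `Certes` becomes `zzbbzb`. The sketch below reproduces that mapping. It is inferred from the sample only and is not the tool used for this commit. The class name `ObfuscationSketch`, the case-insensitive handling, the treatment of the letter z, and the lack of any `ENAMEX` markup handling are all assumptions.

```java
// Hypothetical letter mapping inferred from the obfuscated sample; not the author's tool.
final class ObfuscationSketch {

    /** Maps ASCII letters a-e to 'z' and f-z to 'b' (case-insensitively); keeps everything else. */
    static String obfuscate(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!asciiLetter) {
                // Accented letters, digits, whitespace and punctuation pass through unchanged.
                out.append(c);
            } else if (Character.toLowerCase(c) <= 'e') {
                out.append('z');
            } else {
                out.append('b');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(obfuscate("Certes")); // prints "zzbbzb"
    }
}
```

Because the mapping keeps token lengths, punctuation and digits, tokenisation of the obfuscated text stays the same as for the original, which is what a parser test needs.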
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -12,3 +12,4 @@ pages:
- ['using-grobid-ner.md', 'User Manual', 'Using GROBID NER']
- ['class-and-senses.md', 'User Manual', 'Class and Senses']
- ['annotation-guidelines.md', 'User Manual', 'Annotation Guidelines']
- ['training-ner-model.md', 'User Manual', 'Datasets and training']
