Commit

brexit 1
wigdan committed May 23, 2017
2 parents 3160217 + 9224549 commit 8a58798
Showing 47 changed files with 1,057 additions and 496 deletions.
14 changes: 7 additions & 7 deletions grobid-ner/doc/Licence.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@

### Grobid & Grobid NER
## Grobid & Grobid NER

Grobid and Grobid NER are distributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0).
Grobid and Grobid NER are distributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0).

Author and contact: Patrice Lopez (<patrice.lopez@science-miner.com>)

Author and contact: Patrice Lopez (<patrice.lopez@science-miner.com>)

For citing the tool, please refer to the github project: <https://github.com/grobid/grobid-ner> (2013-2017)

### Datasets
## Datasets

Not all of the datasets used for training the models behind this tool are publicly available.
Not all of the datasets used for training the models behind this tool are publicly available.
The missing ones can be requested from their respective owners in order to rebuild the original Grobid NER dataset.
See the [respective pages](training-ner-model.md) for more information.
See the [respective pages](training-ner-model.md) for more information.
8 changes: 4 additions & 4 deletions grobid-ner/doc/annotation-examples.md
@@ -1,9 +1,9 @@
## Annotation examples and cases
<h1>Annotation examples and cases</h1>

### Location
### Location

#### Eretz Israel
In the following case the sentence is `he continued to write articles and newspaper stories regarding events in the history of the Jewish people throughout the world, the Zionist movement in Bulgaria and the world and the history of the Sephardi Jews and their situation in Eretz Israel.`.
In the following case the sentence is `he continued to write articles and newspaper stories regarding events in the history of the Jewish people throughout the world, the Zionist movement in Bulgaria and the world and the history of the Sephardi Jews and their situation in Eretz Israel.`.
In this case `Eretz Israel` is annotated as 'LOCATION' and not as a 'CONCEPT':
```
in O
@@ -12,4 +12,4 @@ Israel LOCATION
. O
```

Link to discussion [here](https://github.com/kermitt2/grobid-ner/issues/18).
Link to discussion [here](https://github.com/kermitt2/grobid-ner/issues/18).
57 changes: 25 additions & 32 deletions grobid-ner/doc/annotation-guidelines.md
@@ -1,12 +1,12 @@
# Annotation guidelines for GROBID NER
<h1>Annotation Guidelines for GROBID NER</h1>

### Principle
# Principle

Creating an annotated corpus for Named Entity Recognition requires identifying named entities in a text and classifying them, based on context, into a set of classes (27 classes in the case of grobid-ner).

Like grobid's other CRF models, grobid-ner can bootstrap training data. grobid-ner can automatically generate training data from any text files ( [Link to Page] ), labeling tokens with the named entity classes based on the existing model. A human annotator then corrects the generated training data by modifying the labels produced for each token. This curated training data can be added to the existing training data and used to train a new, improved model.
Like grobid's other CRF models, grobid-ner can bootstrap training data. grobid-ner can automatically generate training data from any text files ( [Link to Page] ), labeling tokens with the named entity classes based on the existing model. A human annotator then corrects the generated training data by modifying the labels produced for each token. This curated training data can be added to the existing training data and used to train a new, improved model.

### Format
# Format

The current format of the training data follows the [CONLL 2003 NER format](http://www.cnts.ua.ac.be/conll2003/ner/), which is an n-column tab-separated text file.
In our case, the first column is a token, the second column is the NER class, and an optional third column gives a word sense:
@@ -18,38 +18,33 @@ token2 CLASS word_sense2
token3 0 0
```

Non-named entity tokens are labeled with the default label ```0```.
Non-named entity tokens are labeled with the default label ```0```.

Word senses are optional and correspond to a WordNet synset. They are only indicated for Named Entity tokens.
Word senses are optional and correspond to a WordNet synset. They are only indicated for Named Entity tokens.

The `B-` prefix is used to indicate the beginning of an entity. This `B-` marker is necessary when several entities of the same NER class immediately follow one another. As the prefix marker is needed only in rare cases, it can be omitted by default.

The end of a sentence is marked by an empty line.
The end of a sentence is marked by an empty line.

Tokens and tokenization must not be modified during manual correction. Only the labels can be changed.
Tokens and tokenization must not be modified during manual correction. Only the labels can be changed.
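
Because only labels may change, a corrected file can be checked mechanically against the generated one. The following is a minimal Python sketch, not part of grobid-ner: the helper names `token_column` and `tokens_unchanged` are hypothetical, and the code assumes tab-separated columns. It compares the first column line by line, so sentence boundaries (empty lines) must match as well.

```python
from typing import List


def token_column(content: str) -> List[str]:
    """Return the first column of every line; empty lines are kept as ''."""
    return [line.split("\t")[0] if line.strip() else ""
            for line in content.splitlines()]


def tokens_unchanged(generated: str, corrected: str) -> bool:
    """True if both versions contain the same tokens and sentence breaks."""
    return token_column(generated) == token_column(corrected)


# Generated labels versus a manual correction that only touches labels.
generated = "World\tB-EVENT\nWar\tB-EVENT\nI\tB-EVENT\n"
corrected = "World\tB-EVENT\nWar\tEVENT\nI\tEVENT\n"
```

With these two strings, `tokens_unchanged(generated, corrected)` holds, whereas merging `World` and `War` into a single token would make the check fail.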

### Classes
# Classes

The list of NER classes with examples is given in the [classes page](class-and-senses.md).

### Largest entity mention
The list of NER classes with examples is given in the [classes page](class-and-senses.md).

Entities with more than one token can embed sub-entities. The approach currently followed by grobid-ner is to annotate only the largest entity mention and not the sub-entities. For example:
# Largest entity mention

1. Let's consider the token _British_. Depending on the context, _British_ in isolation can be labelled with the classes NATIONAL (when introducing a relation to Great Britain), PERSON_TYPE (for the British people) or CONCEPT (when referring to the British English language).

In contrast, _British referendum_ is entirely labeled with the class EVENT, because British is part of a larger entity mention. The fact that British here also refers to the country (so class NATIONAL) must not be annotated. _British government_ is similarly entirely labeled with class INSTITUTION.
Entities with more than one token can embed sub-entities. The approach currently followed by grobid-ner is to annotate only the largest entity mention and not the sub-entities. See the ["largest entity mention" section](largest-entity-mention.md).

2. Similarly, in order to be consistent, for phrases like __President of the United States__ and __United States President__, the class labeling will be identical, entirely as PERSON. The manual annotator must be careful not to annotate two consecutive named entities (in particular in the first case, with __United States__ as LOCATION), and in general to annotate only the largest entity.


# Practical example of correction

### Practical example of correction
<!-- TODO modify this section (xml and not conll) / check when the conversion conll-xml is done -->

For example the sentence:
For example the sentence:

```
World War I (WWI) was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918.
World War I (WWI) was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918.
```

The training data automatically generated by grobid-ner is as follows:
@@ -82,26 +77,26 @@ November B-PERIOD
1918 PERIOD
. O
```
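
Training data in this format can be read with a few lines of code. The following is a minimal Python sketch, not part of grobid-ner: the names `read_sentences` and `entity_spans` are hypothetical, and the code assumes tab-separated columns, accepts both `O` and `0` as the default label, and ignores the optional word sense column. It groups consecutive tokens of the same class into one entity and starts a new entity whenever a `B-` prefix appears.

```python
from typing import List, Optional, Tuple

OUTSIDE = {"O", "0"}  # assumption: both spellings of the default label are accepted


def read_sentences(content: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, label) pairs per sentence."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in content.splitlines():
        if not line.strip():
            # An empty line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        current.append((columns[0], columns[1]))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group labeled tokens into (entity text, class) pairs."""
    spans: List[Tuple[List[str], str]] = []
    previous: Optional[str] = None  # class of the previous token, if any
    for token, label in sentence:
        if label in OUTSIDE:
            previous = None
            continue
        begins = label.startswith("B-")
        ner_class = label[2:] if begins else label
        if begins or ner_class != previous:
            spans.append(([token], ner_class))  # start a new entity
        else:
            spans[-1][0].append(token)  # continue the current entity
        previous = ner_class
    return [(" ".join(tokens), ner_class) for tokens, ner_class in spans]
```

Applied to the generated lines `November B-PERIOD` and `1918 PERIOD`, `entity_spans` yields a single PERIOD entity covering both tokens, while a preceding `11 B-PERIOD` stays a separate entity because `November` carries its own `B-` prefix.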

Annotation process:

1. The first tokens __World War I__ are correctly marked as Named Entities of class EVENT, but incorrectly labeled as three independent entities (note the B- at the beginning of each class). The correction will be:

Annotation process:

1. The first tokens __World War I__ are correctly marked as Named Entities of class EVENT, but incorrectly labeled as three independent entities (note the B- at the beginning of each class). The correction will be:

```
World B-EVENT
War EVENT
I EVENT
```

Note that as the entity is not adjacent to any other entity, the ```B-``` marker is optional.
Note that as the entity is not adjacent to any other entity, the ```B-``` marker is optional.

2. __WWI__ is not marked as a Named Entity and should be tagged as ACRONYM:

```
WWI B-ACRONYM
```

3. __Europe__ refers to the European continent, therefore the class LOCATION is correct.
3. __Europe__ refers to the European continent, therefore the class LOCATION is correct.

4. The tokens __28 July 1914__ correspond to a single PERIOD and not two:

@@ -111,15 +106,15 @@ July PERIOD
1914 PERIOD
```

5. Lastly, the tokens __11 November 1918__ have been wrongly identified as two entities:
5. Lastly, the tokens __11 November 1918__ have been wrongly identified as two entities:

```
11 B-PERIOD
November PERIOD
1918 PERIOD
```

The result is as follows:
The result is as follows:

```
World B-EVENT
@@ -149,5 +144,3 @@ November PERIOD
1918 PERIOD
. O
```
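
The corrections in steps 1, 4 and 5 all follow the same pattern: a run of tokens is relabeled as one entity, with `B-` on the first token only. A minimal Python sketch of that pattern follows; `relabel_span` is a hypothetical helper, not a grobid-ner function, and it works on in-memory (token, label) pairs rather than on files.

```python
from typing import List, Tuple


def relabel_span(tokens: List[Tuple[str, str]], start: int, end: int,
                 ner_class: str) -> List[Tuple[str, str]]:
    """Label tokens[start:end] as a single entity of class ner_class.

    Token texts are copied unchanged; only the labels are rewritten.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span is empty or out of range")
    result = list(tokens)
    for i in range(start, end):
        prefix = "B-" if i == start else ""
        result[i] = (tokens[i][0], prefix + ner_class)
    return result


# Step 1 of the example: three separate EVENT entities become one.
generated = [("World", "B-EVENT"), ("War", "B-EVENT"), ("I", "B-EVENT")]
corrected = relabel_span(generated, 0, 3, "EVENT")
for token, label in corrected:
    print(f"{token}\t{label}")
```

The helper always writes the `B-` marker on the first token, which is the safe choice when the entity may be adjacent to another entity of the same class.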

