Training guidelines update
lfoppiano committed Aug 26, 2016
1 parent 5203a6d commit 97b7e8c
Showing 1 changed file with 14 additions and 10 deletions.
24 changes: 14 additions & 10 deletions grobid-ner/doc/training-guidelines.md
@@ -1,19 +1,19 @@
# Guidelines for annotation of Named Entities Recognition

The creation of an annotated corpus for named entities is the process of finding the correct named-entity class for words based on their context.
Grobid-NER can automatically generate training data from text files ( [Link to Page] ), recognising named entities with the model currently in use.
The goal of the annotator is to correct the generated entities by: (1) changing them, (2) extending them to neighbouring tokens, or (3) removing them.

### Format
The training data is managed in the [CONLL 2003 format](http://www.cnts.ua.ac.be/conll2003/ner/), which is a two-column, tab-separated file.
The first column is the token, the second column is the class:
```
token B-CLASS
token CLASS
```

The `B-` prefix indicates the beginning of an entity. This matters when two adjacent entities share the same class, **which is normally a very rare event**.
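
To illustrate how the `B-` prefix separates adjacent entities, here is a minimal Python sketch that groups token/class rows into entities. It is not part of Grobid-NER: the function name `group_entities` is hypothetical, and the use of `O` as the label for tokens outside any entity is an assumption not stated in these guidelines.

```python
def group_entities(lines, outside="O"):
    """Group tab-separated 'token<TAB>class' rows into (class, [tokens]) pairs.

    Assumption: tokens outside any entity carry the label given by `outside`.
    """
    entities = []
    current = None  # (class_name, tokens) of the entity being built
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            current = None  # a blank line closes any open entity
            continue
        token, label = line.split("\t")
        if label == outside:
            current = None
            continue
        begins = label.startswith("B-")
        cls = label[2:] if begins else label
        # Start a new entity on B-, or when the class changes.
        if begins or current is None or current[0] != cls:
            current = (cls, [token])
            entities.append(current)
        else:
            current[1].append(token)
    return entities


rows = [
    "European\tB-INSTITUTION",
    "Union\tINSTITUTION",
    "British\tB-NATIONAL",
    "French\tB-NATIONAL",  # same class, but B- starts a second entity
]
for cls, tokens in group_entities(rows):
    print(cls, " ".join(tokens))
```

Without the `B-` prefix on the last row, the two NATIONAL tokens would be merged into a single entity, which is the case the prefix is meant to prevent.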

When correcting the training data, the token must not be modified for any reason. Only the class column can be changed.
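
One way to verify that only the class column was edited is to compare the token column of the generated file with that of the corrected file. The following Python sketch assumes both files are available as lists of tab-separated lines; the helper name `changed_tokens` is hypothetical and not part of Grobid-NER.

```python
def changed_tokens(original_lines, corrected_lines):
    """Return (line_number, original_token, corrected_token) for each mismatch."""

    def tokens(lines):
        # The token is the first tab-separated column of each line.
        return [line.rstrip("\n").split("\t")[0] for line in lines]

    orig, corr = tokens(original_lines), tokens(corrected_lines)
    if len(orig) != len(corr):
        raise ValueError("files have a different number of lines")
    return [
        (number, a, b)
        for number, (a, b) in enumerate(zip(orig, corr), start=1)
        if a != b
    ]


generated = ["European\tB-NATIONAL", "Union\tB-CONCEPT"]
corrected = ["European\tB-INSTITUTION", "Union\tINSTITUTION"]
print(changed_tokens(generated, corrected))  # only classes changed, so no mismatch
```

An empty result means the annotator changed classes only; any returned tuple points at a line whose token was altered and should be restored.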

@@ -26,20 +26,24 @@ Composed concept should be considered instead of simple concept. Usually extende

1. the token _british_:
`_british_ is tagged with class NATIONAL`
but
`_british_ referendum is an EVENT`
`_british_ government is an INSTITUTION`

2. a composed token such as European Union should be considered as a whole (note that the class, here INSTITUTION, may vary with the context):

```
European B-INSTITUTION
Union INSTITUTION
```

rather than the following, which is incorrect because it considers the two tokens separately:

```
European B-NATIONAL
Union B-CONCEPT
```

3. a more real-world case is when the entity precedes an object and makes that object specific:

European B-EVENT