brexit 1

kermitt2 · May 23, 2017 · 8a58798
2 parents 3160217 + 9224549
commit 8a58798
Showing 47 changed files with 1,057 additions and 496 deletions.
diff --git a/grobid-ner/doc/Licence.md b/grobid-ner/doc/Licence.md
@@ -1,14 +1,14 @@
 
-### Grobid & Grobid NER
+## Grobid & Grobid NER
 
-Grobid and Grobid NER are distributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0). 
+Grobid and Grobid NER are distributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0).
+
+Author and contact: Patrice Lopez (<patrice.lopez@science-miner.com>)
 
-Author and contact: Patrice Lopez (<patrice.lopez@science-miner.com>) 
-
 For citing the tool, please refer to the github project: <https://github.com/grobid/grobid-ner> (2013-2017)
 
-### Datasets
+## Datasets
 
-All the datasets used for training the models behind this tool are not all publicly available. 
+Not all the datasets used for training the models behind this tool are publicly available.
 The missing ones can be requested from their respective owners in order to rebuild the original Grobid NER dataset.
-See the [respective pages](training-ner-model.md) for more information. 
+See the [respective pages](training-ner-model.md) for more information.
diff --git a/grobid-ner/doc/annotation-examples.md b/grobid-ner/doc/annotation-examples.md
@@ -1,9 +1,9 @@
-## Annotation examples and cases
+<h1>Annotation examples and cases</h1>
 
-### Location 
+### Location
 
 #### Eretz Israel
-In the following case the sentence is `he continued to write articles and newspaper stories regarding events in the history of the Jewish people throughout the world, the Zionist movement in Bulgaria and the world and the history of the Sephardi Jews and their situation in Eretz Israel.`. 
+In the following case the sentence is `he continued to write articles and newspaper stories regarding events in the history of the Jewish people throughout the world, the Zionist movement in Bulgaria and the world and the history of the Sephardi Jews and their situation in Eretz Israel.`.
 In this case `Eretz Israel` is annotated as 'LOCATION' and not as a 'CONCEPT':  
 ```
 in	O
@@ -12,4 +12,4 @@ Israel	LOCATION
 .	O
 ```
 
-Link to discussion [here](https://github.com/kermitt2/grobid-ner/issues/18).
+Link to discussion [here](https://github.com/kermitt2/grobid-ner/issues/18).
diff --git a/grobid-ner/doc/annotation-guidelines.md b/grobid-ner/doc/annotation-guidelines.md
@@ -1,12 +1,12 @@
-# Annotation guidelines for GROBID NER
+<h1>Annotation Guidelines for GROBID NER</h1>
 
-### Principle
+# Principle
 
 Creating an annotated corpus for Named Entity Recognition involves identifying Named Entities in a text and classifying them, based on the context, into a set of classes (27 classes in the case of grobid-ner).
 
-Similarly as grobid's other CRF models, grobid-ner can bootstrap training data. grobid-ner can automatically generate training data from any text files ( [Link to Page] ), labeling tokens with the named entity classes based on the existing model. A human annotator then corrects the generated training data by modifying the labels produced for each token. This curated training data can be added to the existing training data and used to train a new improved model. 
+As with grobid's other CRF models, grobid-ner can bootstrap training data. grobid-ner can automatically generate training data from any text files ( [Link to Page] ), labeling tokens with the named entity classes based on the existing model. A human annotator then corrects the generated training data by modifying the labels produced for each token. This curated training data can be added to the existing training data and used to train a new, improved model.
 
-### Format 
+# Format
 
 The current format of the training data follows the [CONLL 2003 NER format](http://www.cnts.ua.ac.be/conll2003/ner/), which is an n-column tab-separated text file.
 In our case, the first column is a token, the second column is the NER class and a third optional column gives a word sense:
@@ -18,38 +18,33 @@ token2           CLASS     word_sense2
 token3           0         0
 ```
 
-Non-named entity tokens are labeled with the default label ```0```. 
+Non-named entity tokens are labeled with the default label ```0```.
 
-Word senses are optional and correspond to a WordNet synset. They are only indicated for Named Entity tokens. 
+Word senses are optional and correspond to a WordNet synset. They are only indicated for Named Entity tokens.
 
 The `B-` prefix is used to indicate the beginning of an entity. This `B-` marker is necessary when several entities of the same NER class immediately follow one another. As the prefix marker is really needed only in rare cases, it can be omitted by default.  
 
-The end of a sentence is maked by an empty line. 
+The end of a sentence is marked by an empty line.
 
-Tokens and tokenization must not be modified during manual correction. Only the labels can be changed. 
+Tokens and tokenization must not be modified during manual correction. Only the labels can be changed.
 
-### Classes
+# Classes
 
-The list of NER classes with examples are given in the [classes page](class-and-senses.md). 
-
-### Largest entity mention
+The list of NER classes with examples is given in the [classes page](class-and-senses.md).
 
-Entities with more than one token can embed sub-entities. The approach currently followed by grobid-ner is to annotated only the largest entity mention and not the sub-entities. For example: 
+# Largest entity mention
 
-  1. Let's consider the token _British_. Depending on the context, _British_ in isolation can labelled with the classes NATIONAL (when introducing a relation to Great Britain), PERSON_TYPE (for the British people) or CONCEPT (when refering to the British English language)
-
-    In contrast, _British referendum_ is entirely labeled with the class EVENT, because British is part of a larger entity mention. The fact that British here also refers to the country (so class NATIONAL) must not be annotated. _British government_ is similarly entirely labeled with class INSTITUTION.
+Entities with more than one token can embed sub-entities. The approach currently followed by grobid-ner is to annotate only the largest entity mention and not the sub-entities. See the ["largest entity mention" section](largest-entity-mention.md).
 
-  2. Similarly, in order to be consistent, for phrases like __President of the United State__ and __United State President__, the class labeling will be identical, entirely as PERSON. The manual annotator must be careful not to annotated two NE following (in particular for the first case, with __United State__ as LOCATION), and in general to annotate only the largest entity.  
-
 
+# Practical example of correction
 
-### Practical example of correction
+<!-- TODO modify this section (xml and not conll) / check when the conversion conll-xml is done -->
 
-For example the sentence: 
+For example the sentence:
 
 ```
-World War I (WWI) was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918. 
+World War I (WWI) was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918.
 ```
 
 The training data automatically generated by grobid-ner is as follows.  
@@ -82,26 +77,26 @@ November    B-PERIOD
 1918        PERIOD
 .           O
 ```    
-
-Annotation process: 
 
-1. The first tokens __World War I__ are correctly maked as Named Entities of class EVENT, but incorectly labeled as three independant entities (note the B- at the beginning of each class). The correction will be: 
-
+Annotation process:
+
+1. The first tokens __World War I__ are correctly marked as Named Entities of class EVENT, but incorrectly labeled as three independent entities (note the B- at the beginning of each class). The correction will be:
+
 ```
 World        B-EVENT
 War          EVENT
 I            EVENT
 ```
 
-Note that as the entity is not adjacent to any other entity, the ```B-``` marker is optional. 
+Note that as the entity is not adjacent to any other entity, the ```B-``` marker is optional.
 
 2. __WWI__ is not marked as a Named Entity and should be tagged as ACRONYM:
 
 ```
 WWI        B-ACRONYM
 ```
 
-3. __Europe__ refers to the european continent, therefore the class LOCATION is correct. 
+3. __Europe__ refers to the European continent, therefore the class LOCATION is correct.
 
 4. The tokens __28 July 1914__ correspond to a single PERIOD and not two:
 
@@ -111,15 +106,15 @@ July      PERIOD
 1914      PERIOD
 ```
 
-5. lastly the tokens __11 Novembre 1918__ has been wrongly identified as two entities: 
-  
+5. Lastly, the tokens __11 November 1918__ have been wrongly identified as two entities:
+
 ```
 11          B-PERIOD
 November    PERIOD
 1918        PERIOD
 ```
 
-The result is as following: 
+The result is as follows:
 
 ```
 World       B-EVENT
@@ -149,5 +144,3 @@ November    PERIOD
 1918        PERIOD
 .           O
 ```    
-
-
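
Note on the training format described in `annotation-guidelines.md` above: the column layout, the blank-line sentence boundary and the optional `B-` prefix can be illustrated with a short Python reader. This is a minimal sketch and not part of grobid-ner. The function names `read_sentences` and `group_entities` are hypothetical, columns are assumed to be separated by tabs or spaces, the optional word-sense column is ignored, and both `O` and `0` are treated as the non-entity label because the guidelines show both spellings.

```python
from typing import Iterator, List, Optional, Tuple

# Assumption: the guidelines write the default label as "0" and the examples use "O".
NON_ENTITY = {"O", "0"}


def read_sentences(text: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, label); an empty line ends a sentence."""
    sentence: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()  # tolerant of tabs or aligned spaces
        label = columns[1] if len(columns) > 1 else "O"
        sentence.append((columns[0], label))
    if sentence:
        yield sentence


def group_entities(sentence: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge adjacent tokens of the same class; a B- prefix always starts a new entity."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    current: Optional[str] = None
    for token, label in sentence:
        starts_new = label.startswith("B-")
        cls: Optional[str] = label[2:] if starts_new else label
        if cls in NON_ENTITY:
            cls = None
        if cls is not None and cls == current and not starts_new:
            tokens.append(token)  # continuation of the running entity
            continue
        if current is not None:
            entities.append((" ".join(tokens), current))  # close the running entity
        tokens, current = ([token], cls) if cls is not None else ([], None)
    if current is not None:
        entities.append((" ".join(tokens), current))
    return entities
```

Applied to the corrected World War I example, `group_entities` would return one entity per mention, which makes it easy to spot a stray `B-` that splits a mention into several entities.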