Merge branch 'master' of https://github.com/kermitt2/grobid-ner
FEREDJ committed May 15, 2017
2 parents db57270 + 29f1b71 commit 98e009e
Showing 2 changed files with 76 additions and 23 deletions.
87 changes: 70 additions & 17 deletions grobid-ner/doc/class-and-senses.md
@@ -15,22 +15,22 @@ The following table describes the 27 named entity classes produced by the model.
| <a style="text-decorations:none; #color:#265C83" href=#artifact> ARTIFACT | human-made object, including softwares | _FIAT 634_, _Microsoft Word_ |
| AWARD | award for art, science, sport, etc. | _Balon d'or_, _Nobel prize_|
| BUSINESS | company / commercial organisation | _Air Canada_, _Microsoft_ |
| CONCEPT | abstract concept not included in another class | _English_ (as language) |
| CONCEPT | abstract concept not included in another class | _English_ (as language), _Communism_, _Zionism_ |
| CONCEPTUAL | entity relating to a concept | _Greek_ myths, _European Union membership_ |
| <a style="text-decorations:none; #color:#265C83" href=#creation> CREATION | artistic creation, such as song, movie, book, TV show, etc. | _Monna Lisa_, _Mullaholland drive_, _Kitchen Nightmares_, _EU Referendum: The Great Debate_, _Europe: The Final Debate_ |
| EVENT | event | _World War 2_, _Battle of France_ |
| IDENTIFIER | systematized identifier such as phone number, email address, ISBN | |
| INSTALLATION | structure built by humans | _Strasbourg Cathedral_, _Sforza Castle_ |
| INSTITUTION | organization of people and a location or structure that share the same name | _Yale University_, the _European Patent Office_, the _British government_ |
| <a style="text-decorations:none; color:#08c" href=#legal> LEGAL | legal mentions such as article of law, convention, cases, treaty., etc. | _European Patent Convention_; _Maastricht Treaty_; _Article 52(2)(c) and (3)_; _Roe v. Wade, 410 U.S. 113 (1973)_; _European Union Referendum Act 2015_ |
| LOCATION | physical location, including planets and galaxies. | _Los Angeles_, _Northern Madagascar_, _Southern Thailand_, _Channel Islands_, _Earth_, _Milky Way_ |
| <a style="text-decorations:none; color:#08c" href=#location> LOCATION | physical location, including planets and galaxies. | _Los Angeles_, _Northern Madagascar_, _Southern Thailand_, _Channel Islands_, _Earth_, _Milky Way_, _West Mountain_ |
| <a style="text-decorations:none; color:#08c" href=#measure> MEASURE | numerical amount, including an optional unit of measure | _1,500_, _six million_, _72%_, |
| MEDIA | media organization or publication | _Le monde_, _The New York Times_ |
| NATIONAL | relating to a location | _North American_, _German_, _Britain_ |
| NATIONAL | relating to a location | _North American_, _German_, _British_ |
| ORGANISATION | organized group of people | _Alcoholics Anonymous_ |
| <a style="text-decorations:none; color:#08c" href=#period> PERIOD | date, historical era or other time period, time expressions | _January_, the _2nd half of 2010_, _1985-1989_, _from 1930 to 1945_, _since 1918_, the _first four years_ |
| PERSON | first, middle, last names and aliases of people and fictional characters | _John Smith_ |
| <a style="text-decorations:none; color:#08c" href=#person_type> PERSON_TYPE | person type or role classified according to group membership | _African-American_, _Asian_, _Conservative_, _Liberal_, _Jews_ |
| <a style="text-decorations:none; color:#08c" href=#person_type> PERSON_TYPE | person type or role classified according to group membership | _African-American_, _Asian_, _Conservative_, _Liberal_, _Jews_, _Communist_ |
| PLANT | name of a plant | _Ficus religiosa_ |
| SPORT_TEAM | sport group or organisation | _The Yankees_ |
| SUBSTANCE | natural substance | |
@@ -105,6 +105,19 @@ Human-made object, including softwares.

---
#### LOCATION
➡ When there are modifiers along the location, they are included in the entity, for example:
```xml
- <ENAMEX type="LOCATION">Suvalkų area</ENAMEX>
- <ENAMEX type="LOCATION">Pakruojis local rural district</ENAMEX>
- <ENAMEX type="LOCATION">coast of Honolulu</ENAMEX>
```
➡ The articles and prepositions (_from_, _the_) are not included in the entity.

➡ In some cases surrounding elements are not included in the entity, for example _"west of the"_ in:
```xml
They established safe zones west of the <ENAMEX type="LOCATION">Rocky Mountains</ENAMEX>.
```
[issue #21](https://github.com/kermitt2/grobid-ner/issues/21)
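
The LOCATION boundary rules above can be checked mechanically on annotated text. The following is a minimal sketch, not part of GROBID NER: the function names and the `LEADING_WORDS` stop list are assumptions made for illustration (the guideline itself only names _from_ and _the_), and a regular expression stands in for a real XML parser.

```python
import re

# Matches one ENAMEX element and captures its type and surface text.
ENAMEX_RE = re.compile(r'<ENAMEX type="([A-Z_]+)">(.*?)</ENAMEX>')

# Hypothetical stop list: the guideline only names "from" and "the" explicitly.
LEADING_WORDS = {"the", "a", "an", "from", "of", "in"}


def extract_entities(text):
    """Return (type, surface) pairs for every ENAMEX element in text."""
    return ENAMEX_RE.findall(text)


def leading_word_violations(text):
    """Return LOCATION surfaces whose first token is an article or preposition."""
    bad = []
    for etype, surface in extract_entities(text):
        tokens = surface.split()
        first = tokens[0].lower() if tokens else ""
        if etype == "LOCATION" and first in LEADING_WORDS:
            bad.append(surface)
    return bad


sample = ('They established safe zones west of the '
          '<ENAMEX type="LOCATION">Rocky Mountains</ENAMEX>.')
print(extract_entities(sample))
print(leading_word_violations(sample))
```

Modifiers inside the entity, as in _coast of Honolulu_, pass this check because only the first token is inspected.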

---
#### MEASURE
@@ -181,8 +194,32 @@ issues [#13](https://github.com/kermitt2/grobid-ner/issues/13) and [#25](https:/

---
#### PERSON_TYPE

➡ Even though it's an approximation, entities like _**Jewry**_ (which means Jewish community) are included in this class. [(issue #28)](https://github.com/kermitt2/grobid-ner/issues/28)

<!-- TODO
➡ **PERSON_TYPE vs. NATIONAL**
Sometimes the context determines if an
➡ **PERSON_TYPE vs. CONCEPTUAL**
* Even if the entity doesn't modify a person, it's annotated as PERSON_TYPE, for example:
```xml
<ENAMEX type="PERSON_TYPE">Zionist</ENAMEX> events
<ENAMEX type="PERSON_TYPE">Jewish</ENAMEX> problems
```
These entities are annotated PERSON_TYPE as long as they can be substituted for another group of people. For example with `communist, liberal, blond people, diabetic, muffin lovers` **<span style="color:red">/!\ TO CHECK /!\ </span>**.
* Examples where the entity is included in a larger entity:
```xml
<ENAMEX type="ORGANISATION">Zionist movement</ENAMEX>
<ENAMEX type="ORGANISATION">Central Committee of the Zionist Union</ENAMEX>
```
[(issue #15)](https://github.com/kermitt2/grobid-ner/issues/15)
-->

---
#### PLANT

@@ -194,9 +231,14 @@ issues [#13](https://github.com/kermitt2/grobid-ner/issues/13) and [#25](https:/

---
#### TITLE
➡ Personal or honorific title, with a relatively loose definition. For example the following entities are annotated as TITLE: _**chairman**_, _**member**_, _**founder**_.

➡ The [Wikipedia page](https://en.wikipedia.org/wiki/Title) examples can be useful.
➡ Personal or honorific title, with a relatively loose definition. The [Wikipedia page](https://en.wikipedia.org/wiki/Title) examples can be useful. For example the following entities are annotated as TITLE: _**chairman**_, _**member**_, _**founder**_.

➡ For some terms, the context will determine the annotation. The term `engineer` for example can be a TITLE or not depending on the country:

* In France or Germany it is linked with a specific diploma so it's annotated as TITLE if the term is linked to these countries.

* In UK or USA, it refers to the job, so it's **not** annotated.

➡ To decide between TITLE and PERSON:

@@ -210,6 +252,11 @@ issues [#13](https://github.com/kermitt2/grobid-ner/issues/13) and [#25](https:/

* In case of the largest entity match of TITLE + PERSON, the priority goes to PERSON. For example _**The President of the United States Barack Obama**_ as a whole is annotated PERSON.

➡ The same principle applies between TITLE and PERSON_TYPE, for example this case of the largest entity match of TITLE + PERSON_TYPE:
```xml
<ENAMEX type="PERSON_TYPE">Members of the British Royal Family</ENAMEX> had fled.
```

issues [#12](https://github.com/kermitt2/grobid-ner/issues/12) and [#33](https://github.com/kermitt2/grobid-ner/issues/33)

---
@@ -222,15 +269,6 @@ issues [#12](https://github.com/kermitt2/grobid-ner/issues/12) and [#33](https:/

### Miscellaneous

➡ Punctuation (like quotation marks) are to be left outside the tags, for example: `"<ENAMEX type="PERSON_TYPE">socialists</ENAMEX>"` [(issue #26)](https://github.com/kermitt2/grobid-ner/issues/26).

**Currencies** alone (_pound sterling_, _US dollar_) should not be annotated [(issue #23)](https://github.com/kermitt2/grobid-ner/issues/23).

➡ When there is a **dash**, it can be considered a space, for example _**Nobel prize-winning economist**_ is annotated [(issue #31)](https://github.com/kermitt2/grobid-ner/issues/31):
```xml
<ENAMEX type="AWARD">Nobel prize</ENAMEX>-winning economist
```

➡ the classes may apply to fictive entities, for example:
```xml
- a multipurpose hand tool, the <ENAMEX type="ARTIFACT">"Lobotomizer"</ENAMEX> or <ENAMEX type="ARTIFACT">"Lobo"</ENAMEX> (...), for close-quarters combat.
@@ -239,21 +277,36 @@ issues [#12](https://github.com/kermitt2/grobid-ner/issues/12) and [#33](https:/
[issue #24](https://github.com/kermitt2/grobid-ner/issues/24)



## Conventions

For the class assignation to entities, GROBID NER follows the longest match convention. For instance, the entity _University of Minnesota_ as a whole (longest match) will belong to the class INSTITUTION. Its component _Minnesota_ is a LOCATION, but as it is part of a larger entity chunk, it will not be identified.
For the class assignation to entities, GROBID NER follows the longest match convention. For instance, the entity _University of Minnesota_ as a whole (longest match) will belong to the class INSTITUTION. Its component _Minnesota_ is a LOCATION, but as it is part of a larger entity chunk, it will not be identified.
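
The longest match convention can be illustrated with a small span filter. This is a sketch under stated assumptions, not GROBID NER's actual implementation: candidate entities are represented as hypothetical `(start, end, label)` character spans with an exclusive end, and any span fully contained in a longer one is dropped. Partially overlapping spans are not resolved here.

```python
def longest_match(spans):
    """Keep only the spans that are not contained in a longer kept span.

    Each span is a (start, end, label) tuple with an exclusive end offset.
    """
    # Visit the longest spans first; ties are broken by the earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        contained = any(k[0] <= start and end <= k[1] for k in kept)
        if not contained:
            kept.append((start, end, label))
    return sorted(kept)


text = "University of Minnesota"
candidates = [
    (14, 23, "LOCATION"),     # "Minnesota"
    (0, 23, "INSTITUTION"),   # "University of Minnesota"
]
for start, end, label in longest_match(candidates):
    print(label, text[start:end])
```

Only the INSTITUTION span survives, which matches the _University of Minnesota_ example in the paragraph above.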

<!-- TODO
/!\ WARNING: THE LARGEST ENTITY MATCH PRINCIPLE ALSO HAS A PARAGRAPH IN "ANNOTATION GUIDELINES"!!! TO BE UNIFIED
to add under largest entity match:
- examples:
issue #7 .
German South-West Africa -> all LOCATION
American Jewish Holocaust survivors -> PERSON_TYPE
- note that there is an exception to the largest entity match principle: the MEASURE class when it comes first, etc., cf. issue 32
-->

➡ There is no specific class for foreign words. They are annotated in one of the existing classes, if relevant, otherwise they are not annotated. In all cases, they are identified in parallel by another attribute.


➡ Punctuation (like quotation marks) are to be left outside the tags, for example: `"<ENAMEX type="PERSON_TYPE">socialists</ENAMEX>"` [(issue #26)](https://github.com/kermitt2/grobid-ner/issues/26).

**Currencies** alone (_pound sterling_, _US dollar_) should not be annotated [(issue #23)](https://github.com/kermitt2/grobid-ner/issues/23).

➡ When there is a **dash**, it can be considered a space, for example _**Nobel prize-winning economist**_ is annotated [(issue #31)](https://github.com/kermitt2/grobid-ner/issues/31):
```xml
<ENAMEX type="AWARD">Nobel prize</ENAMEX>-winning economist
```
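
The punctuation rule above (quotation marks stay outside the tags) lends itself to a simple consistency check. The sketch below is illustrative only: `quoted_entities` and the `QUOTES` set are hypothetical names, only double quotation marks are considered, and the text is assumed to contain literal quote characters rather than `&quot;` escapes.

```python
import re

# Matches one ENAMEX element and captures its type and surface text.
ENAMEX_RE = re.compile(r'<ENAMEX type="([A-Z_]+)">(.*?)</ENAMEX>')

# Straight and curly double quotation marks (an assumed, non-exhaustive set).
QUOTES = '"\u201c\u201d'


def quoted_entities(text):
    """Return entity surfaces that begin or end with a quotation mark."""
    return [
        surface
        for _, surface in ENAMEX_RE.findall(text)
        if surface and (surface[0] in QUOTES or surface[-1] in QUOTES)
    ]


ok = '"<ENAMEX type="PERSON_TYPE">socialists</ENAMEX>"'
flagged = 'the <ENAMEX type="ARTIFACT">"Lobotomizer"</ENAMEX>'
print(quoted_entities(ok))
print(quoted_entities(flagged))
```

A check like this only reports candidates for review; whether a quotation mark belongs to the entity name is still an annotator's decision.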


## Sense information

When possible, senses information are also assigned to entities in the form of one or several WordNet synsets.
@@ -26,8 +26,8 @@
<sentence xml:id="P3E4">Initially, these nations were able to cover up their smaller outbreaks, until a much larger outbreak in <ENAMEX type="LOCATION">South Africa</ENAMEX> brings the plague to public attention.</sentence>
</p>
<p xml:lang="en" xml:id="P4">
<sentence xml:id="P4E0">As the infection spreads, <ENAMEX type="LOCATION">Israel</ENAMEX> abandons the <ENAMEX type="LOCATION">Palestinian</ENAMEX> territories and initiates a nationwide quarantine, closing its borders to everyone except uninfected <ENAMEX type="PERSON_TYPE">Jews</ENAMEX> and <ENAMEX type="PERSON_TYPE">Palestinians</ENAMEX>.</sentence>
<sentence xml:id="P4E1">Its military then puts down an ultra-Orthodox uprising, which is later referred to as an <ENAMEX type="LOCATION">Israeli</ENAMEX> civil war.</sentence>
<sentence xml:id="P4E0">As the infection spreads, <ENAMEX type="LOCATION">Israel</ENAMEX> abandons the <ENAMEX type="LOCATION">Palestinian territories</ENAMEX> and initiates a nationwide quarantine, closing its borders to everyone except uninfected <ENAMEX type="PERSON_TYPE">Jews</ENAMEX> and <ENAMEX type="PERSON_TYPE">Palestinians</ENAMEX>.</sentence>
<sentence xml:id="P4E1">Its military then puts down an ultra-Orthodox uprising, which is later referred to as an <ENAMEX type="NATIONAL">Israeli</ENAMEX> civil war.</sentence>
<sentence xml:id="P4E2">The <ENAMEX type="LOCATION">United States</ENAMEX> does little to prepare because it is overconfident in its ability to suppress any threat.</sentence>
<sentence xml:id="P4E3">Although special forces teams contain initial outbreaks, a widespread effort never starts: the nation is deprived of political will by &quot;brushfire wars&quot;, and a widely distributed and marketed placebo vaccine creates a false sense of security.</sentence>
</p>
@@ -44,18 +44,18 @@
<sentence xml:id="P6E1">It calls for the establishment of small sanctuaries, leaving large groups of survivors abandoned in special zones in order to distract the undead and allowing those within the main safe zone time to regroup and recuperate.</sentence>
<sentence xml:id="P6E2">Governments worldwide assume similar plans or relocate the populace to safer foreign territory, such as the attempted complete evacuation of the <ENAMEX type="LOCATION">Japanese archipelago</ENAMEX> to <ENAMEX type="LOCATION">Kamchatka</ENAMEX>.</sentence>
<sentence xml:id="P6E3">Because zombies freeze solid in severe cold, many civilians in <ENAMEX type="LOCATION">North America</ENAMEX> flee to the wildernesses of <ENAMEX type="LOCATION">northern Canada</ENAMEX> and the <ENAMEX type="LOCATION">Arctic</ENAMEX>, where <ENAMEX type="MEASURE">eleven million</ENAMEX> people die of starvation and hypothermia.</sentence>
<sentence xml:id="P6E4">It is implied that some turn to cannibalism to survive; further interviews from other sources imply that cannibalism occurred in areas of the <ENAMEX type="LOCATION">United States</ENAMEX> where food shortages occurred.</sentence>
<sentence xml:id="P6E4">It is implied that some turn to <ENAMEX type="CONCEPT">cannibalism</ENAMEX> to survive; further interviews from other sources imply that <ENAMEX type="CONCEPT">cannibalism</ENAMEX> occurred in areas of the <ENAMEX type="LOCATION">United States</ENAMEX> where food shortages occurred.</sentence>
<sentence xml:id="P6E5">The <ENAMEX type="MEASURE">three</ENAMEX> remaining astronauts in the <ENAMEX type="INSTALLATION">International Space Station</ENAMEX> survive the war by salvaging supplies from the abandoned <ENAMEX type="INSTALLATION">Chinese space station</ENAMEX> and maintain some military and civilian satellites using an orbital fuel station.</sentence>
<sentence xml:id="P6E6">A surviving <ENAMEX type="TITLE">member of the ISS crew</ENAMEX> describes &quot;mega&quot; swarms of zombies on the <ENAMEX type="LOCATION">American Great Plains</ENAMEX> and <ENAMEX type="LOCATION">Central Asia</ENAMEX>, and how the crisis affected <ENAMEX type="LOCATION">Earth</ENAMEX>&apos;s atmosphere.</sentence>
</p>
<p xml:lang="en" xml:id="P7">
<sentence xml:id="P7E0">The <ENAMEX type="INSTITUTION">U.S.</ENAMEX> eventually establishes safe zones west of the <ENAMEX type="LOCATION">Rocky Mountains</ENAMEX> and spends much of the <ENAMEX type="PERIOD">next decade</ENAMEX> eradicating zombies in that region.</sentence>
<sentence xml:id="P7E0">The <ENAMEX type="LOCATION">U.S.</ENAMEX> eventually establishes safe zones west of the <ENAMEX type="LOCATION">Rocky Mountains</ENAMEX> and spends much of the <ENAMEX type="PERIOD">next decade</ENAMEX> eradicating zombies in that region.</sentence>
<sentence xml:id="P7E1">All aspects of civilian life are devoted to supporting the war effort against the pandemic.</sentence>
<sentence xml:id="P7E2">Much of it resembles total war strategies: rationing of fuel and food, cultivation of private gardens, and civilian neighborhood patrols.</sentence>
<sentence xml:id="P7E3">The <ENAMEX type="INSTITUTION">U.S. government</ENAMEX> also initiates a &quot;<ENAMEX type="LEGAL">Re-education Act</ENAMEX>&quot; to train the civilian population for the war effort and restore order; the people with skills such as carpentry and construction find themselves more valuable than people with managerial skills.</sentence>
</p>
<p xml:lang="en" xml:id="P8">
<sentence xml:id="P8E0"><ENAMEX type="PERIOD">Seven years after the outbreak began</ENAMEX>, a conference is held off the coast of <ENAMEX type="LOCATION">Honolulu</ENAMEX>, aboard the <ENAMEX type="INSTALLATION">USS Saratoga</ENAMEX>, where most of the world&apos;s leaders argue that they can outlast the zombie plague if they stay in their safe zones.</sentence>
<sentence xml:id="P8E0"><ENAMEX type="PERIOD">Seven years after the outbreak</ENAMEX> began, a conference is held off the <ENAMEX type="LOCATION">coast of Honolulu</ENAMEX>, aboard the <ENAMEX type="INSTALLATION">USS Saratoga</ENAMEX>, where most of the world&apos;s leaders argue that they can outlast the zombie plague if they stay in their safe zones.</sentence>
<sentence xml:id="P8E1">The <ENAMEX type="TITLE">U.S. President</ENAMEX>, however, argues for going on the offensive.</sentence>
<sentence xml:id="P8E2">Determined to lead by example, the <ENAMEX type="INSTITUTION">U.S. military</ENAMEX> reinvents itself to meet the specific strategic requirements of fighting the undead: using semi-automatic, high-power rifles and volley firing, focusing on head shots and slow, steady rates of fire (a tactic &quot;re-invented&quot; by the <ENAMEX type="INSTITUTION">Indian Army</ENAMEX> during the <ENAMEX type="EVENT">Great Panic</ENAMEX>); and devising a multipurpose hand tool, the &quot;<ENAMEX type="ARTIFACT">Lobotomizer</ENAMEX>&quot; or &quot;<ENAMEX type="ARTIFACT">Lobo</ENAMEX>&quot; (described as a combination of a shovel and a battle axe), for close-quarters combat.</sentence>
<sentence xml:id="P8E3">The military, backed by a resurgent <ENAMEX type="NATIONAL">American</ENAMEX> wartime economy, began the <ENAMEX type="PERIOD">three-year</ENAMEX>-long process of retaking the contiguous <ENAMEX type="LOCATION">United States</ENAMEX> from both the undead as well as groups of hostile human survivors.</sentence>
@@ -103,7 +103,7 @@
<sentence xml:id="P15E1">He claimed inspiration from &quot;<ENAMEX type="CREATION">The Good War: An Oral History of World War Two</ENAMEX>&quot; (<ENAMEX type="PERIOD">1984</ENAMEX>) by <ENAMEX type="PERSON">Studs Terkel</ENAMEX>, stating: &quot;[<ENAMEX type="PERSON">Terkel</ENAMEX>&apos;s book is] an oral history of <ENAMEX type="EVENT">World War II</ENAMEX>.</sentence>
<sentence xml:id="P15E2">I read it when I was a teenager and it&apos;s sat with me ever since.</sentence>
<sentence xml:id="P15E3">When I sat down to write <ENAMEX type="CREATION">World War Z: An Oral History of the Zombie War</ENAMEX>, I wanted it to be in the vein of an oral history.&quot;</sentence>
<sentence xml:id="P15E4">[2] <ENAMEX type="PERSON">Brooks</ENAMEX> also cited renowned zombie film director <ENAMEX type="PERSON">George A. Romero</ENAMEX> as an influence and criticized <ENAMEX type="CREATION">The Return of the Living Dead</ENAMEX> films: &quot;They cheapen zombies, make them silly and campy.</sentence>
<sentence xml:id="P15E4">[2] <ENAMEX type="PERSON">Brooks</ENAMEX> also cited renowned zombie <ENAMEX type="PERSON">film director George A. Romero</ENAMEX> as an influence and criticized <ENAMEX type="CREATION">The Return of the Living Dead</ENAMEX> films: &quot;They cheapen zombies, make them silly and campy.</sentence>
<sentence xml:id="P15E5">They&apos;ve done for <ENAMEX type="CREATION">the living dead</ENAMEX> what the old <ENAMEX type="CREATION">Batman</ENAMEX> TV show did for <ENAMEX type="CREATION">The Dark Knight</ENAMEX>.&quot;</sentence>
<sentence xml:id="P15E6">[2] <ENAMEX type="PERSON">Brooks</ENAMEX> acknowledged making several references to popular culture in the novel, including one to alien robot franchise Transformers, but declined to identify the others so that readers could discover them independently.</sentence>
</p>