Commit

Updating the documentation a bit
kermitt2 committed Aug 29, 2016
1 parent eed6b73 commit 7dd5480
Showing 4 changed files with 20 additions and 18 deletions.
4 changes: 2 additions & 2 deletions Readme.md
@@ -1,13 +1,13 @@
# grobid-ner

[![Documentation Status](https://readthedocs.org/projects/grobid-ner/badge/?version=latest)](https://readthedocs.org/projects/grobid-ner/?badge=latest)
[![Documentation Status](https://readthedocs.org/projects/grobid-ner/badge/?version=latest)](http://grobid-ner.readthedocs.io/en/latest/)
[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)

## Purpose

GROBID NER is a Named-Entity Recogniser based on the GROBID library ([grobid](https://raw.github.com/kermitt2/grobid)), a text mining tool exploiting CRF. The installation of GROBID is necessary.

Grobid NER has been developed more specifically for the purpose of supporting disambiguation and resolution of the entities against knowledge bases such as Wikipedia. For a description of the NER, installation, usage and other technical features, see the [documentation](https://readthedocs.org/projects/grobid-ner/?badge=latest).
Grobid NER has been developed more specifically for the purpose of supporting disambiguation and resolution of the entities against knowledge bases such as Wikipedia. For a description of the NER, installation, usage and other technical features, see the [documentation](http://grobid-ner.readthedocs.io/en/latest/).

## License

12 changes: 6 additions & 6 deletions grobid-ner/doc/build-and-install.md
@@ -1,8 +1,8 @@
Grobid NER is a module of [Grobid](https://github.com/kermitt2/grobid).
GROBID NER is a module of [Grobid](https://github.com/kermitt2/grobid).

## Grobid Installation

Grobid is a library for extracting bibliographical information from technical and scientific documents.
GROBID is a library for extracting bibliographical information from technical and scientific documents.
The tool offers a convenient environment for creating efficient text mining tools based on CRF.

Clone source code from github:
@@ -29,15 +29,15 @@ Clone source code from github:
Or download directly the zip file:
> https://github.com/kermitt2/grobid/zipball/master
Grobid NER is actually a sub-project of Grobid.
Although Grobid NER will be merged with Grobid in the future, at this point the Grobid NER sub-module simply needs to be added manually.
In the main directory of Grobid NER:
GROBID NER is actually a sub-project of GROBID.
Although GROBID NER might be merged with GROBID in the future, at this point the GROBID NER sub-module simply needs to be added manually.
In the main directory of GROBID NER:

> cp -r grobid-ner /path/to/grobid/
> cp -r grobid-home/models/* /path/to/grobid/grobid-home/
Then build the Grobid NER subproject:
Then build the GROBID NER subproject:

> cd /path/to/grobid/grobid-ner
6 changes: 3 additions & 3 deletions grobid-ner/doc/class-and-senses.md
@@ -1,6 +1,6 @@
Grobid NER identifies named entities and classifies them into 26 classes, compared to the 4-class or 7-class models of most existing open source NER tools (usually using the Reuters/CoNLL 2003 annotated corpus, or the MUC annotated corpus).
GROBID NER identifies named entities and classifies them into 26 classes, compared to the 4-class or 7-class models of most existing open source NER tools (usually using the Reuters/CoNLL 2003 annotated corpus, or the MUC annotated corpus).

In addition, the entities are often enriched with WordNet sense annotations to help further disambiguation and resolution of the entity. Grobid NER has been developed for the purpose of disambiguating and resolving entities against knowledge bases such as Wikipedia and FreeBase. Sense information can help to disambiguate the entity, because it refines the entity class based on contextual clues.
In addition, the entities are often enriched with WordNet sense annotations to help further disambiguation and resolution of the entity. GROBID NER has been developed for the purpose of disambiguating and resolving entities against knowledge bases such as Wikipedia and FreeBase. Sense information can help to disambiguate the entity, because it refines the entity class based on contextual clues.

## Named entity classes

@@ -37,7 +37,7 @@ The following table describes the 26 named entity classes produced by the model.
## Conventions

When assigning classes to entities, Grobid NER follows the longest match convention. For instance, the entity _University of Minnesota_ as a whole (longest match) will belong to the class INSTITUTION. Its component _Minnesota_ is a LOCATION, but as it is part of a larger entity chunk, it will not be identified.
When assigning classes to entities, GROBID NER follows the longest match convention. For instance, the entity _University of Minnesota_ as a whole (longest match) will belong to the class INSTITUTION. Its component _Minnesota_ is a LOCATION, but as it is part of a larger entity chunk, it will not be identified.
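
The longest match convention described above can be illustrated with a small sketch. This is not how GROBID NER is implemented (it relies on CRF sequence labelling); the `longest_match` function and the `(start, end, label)` span representation are hypothetical and only show the effect of the convention on overlapping candidate entities.

```python
from typing import List, Tuple

# A candidate entity: (start token index, end token index (exclusive), class label).
# This representation is an assumption made for illustration only.
Span = Tuple[int, int, str]


def longest_match(candidates: List[Span]) -> List[Span]:
    """Keep only the longest candidates among overlapping spans."""
    # Consider longer spans first; break ties by start position.
    ordered = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, label in ordered:
        # Two half-open intervals overlap if each starts before the other ends.
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    # Return the surviving spans in document order.
    return sorted(kept)


tokens = ["University", "of", "Minnesota"]
candidates = [(0, 3, "INSTITUTION"), (2, 3, "LOCATION")]
print(longest_match(candidates))
```

Here the nested LOCATION candidate for _Minnesota_ is discarded because it overlaps the longer INSTITUTION span, matching the example in the text.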


## Sense information
16 changes: 9 additions & 7 deletions grobid-ner/doc/index.md
@@ -1,15 +1,17 @@
# GROBID Named Entity Recognition Documentation

## Purpose
## Purposes

Grobid NER is a Named-Entity Recogniser module for [GROBID](https://raw.github.com/kermitt2/grobid), a text mining tool exploiting CRF.
Grobid NER has been developed more specifically for the purpose of supporting disambiguation and resolution of entities against knowledge bases such as Wikipedia.
GROBID NER is a Named-Entity Recogniser module for [GROBID](https://raw.github.com/kermitt2/grobid), a tool based on CRF.
GROBID NER has been developed more specifically for the purpose of further supporting post disambiguation and resolution of entities against knowledge bases such as Wikipedia.

The models supplied with the source have been trained using the following datasets:
- [CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) Manually annotated training data (20k words, 4 classes)
- Wikipedia semi-automatic generated data (approximately 10k words, 26 classes)
The current models shipped with the source use 26 Named Entity classes and have been trained using the following datasets:
- Reuters NER [CONLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) partially manually annotated training data (10k words)
- Manually annotated extract from the Wikipedia article on World War 1 (approximately 10k words)

Training data and annotation work are always welcome. If you would like to contribute, you can contact us via email or by opening an issue in the GitHUB project.
The training has been completed with a very large semi-supervised training based on the Wikipedia Idilia data set.

Annotated data is always welcome. If you would like to contribute, you can contact us via email or by opening an issue in the GitHub project.

## About
