orlandxrf/spanish-ner

Probabilistic vs Deep Learning based approaches for narrow domain NER in Spanish

Experiments on the named entity recognition (NER) task in Spanish using two corpora: CoNLL-2002 and Mexican news (Mx-news).

Data

The datasets used are described as follows:

CoNLL-2002 dataset

The CoNLL-2002 dataset is in Spanish and has four classes, tagged under the IOB (Inside/Outside/Beginning) schema. An example Spanish sentence under the IOB schema:

El presidente Andrés Manuel López Obrador estará viajando a la ciudad de Puebla de Zaragoza este Miércoles .

PERSON: Andrés-PER-B Manuel-PER-I López-PER-I Obrador-PER-I
LOCATION: Puebla-LOC-B de-LOC-I Zaragoza-LOC-I

| No. | Class | IOB Schema | Description |
|-----|-------|------------|-------------|
| 1 | ORG | B-ORG, I-ORG | Organization names |
| 2 | PER | B-PER, I-PER | People names |
| 3 | LOC | B-LOC, I-LOC | Location names |
| 4 | MISC | B-MISC, I-MISC | Miscellaneous |
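As an illustration of the IOB schema, the following minimal sketch tags the example sentence from entity spans. The function name `spans_to_iob` and the `(start, end, class)` span format are assumptions made for this example, not part of the repository code.

```python
from typing import List, Tuple

def spans_to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with IOB labels.

    Each span is (start, end_exclusive, class); tokens outside any span get "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

tokens = ("El presidente Andrés Manuel López Obrador estará viajando a la "
          "ciudad de Puebla de Zaragoza este Miércoles .").split()
spans = [(2, 6, "PER"), (12, 15, "LOC")]    # hypothetical span annotations
tags = spans_to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here the tags are written in the common `B-PER` / `I-PER` form used in the class table, which carries the same information as the `Andrés-PER-B` notation in the example.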

The CoNLL-2002 dataset was split into four partitions.

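| No. | Partition | Original | Split | Tokens | Tags |
|-----|-----------|----------|-------|--------|------|
| 1 | Test A | 1915 | 2177 | 9646 | 8 |
| 2 | Test B | 1517 | 1848 | 9086 | 8 |
| 3 | Train | 8323 | 9947 | 26099 | 8 |
| 4 | Ensemble | 11755 | 13972 | 31405 | 8 |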

In the table headings, Original is the number of sentences at their original length, and Split is the number of sentences after splitting them to a length of 50 tokens.
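One simple way to obtain such a split is to cut each sentence into consecutive windows of at most 50 tokens. The sketch below assumes fixed-size windows; the repository may use a different rule (for example, avoiding cuts inside an entity), and the name `split_sentence` is hypothetical.

```python
def split_sentence(tokens, tags, max_len=50):
    """Return a list of (tokens, tags) chunks, each with at most max_len tokens."""
    chunks = []
    for i in range(0, len(tokens), max_len):
        chunks.append((tokens[i:i + max_len], tags[i:i + max_len]))
    return chunks

# A synthetic 120-token sentence with all-"O" tags, for illustration only.
long_tokens = [f"w{i}" for i in range(120)]
long_tags = ["O"] * 120
chunks = split_sentence(long_tokens, long_tags)
chunk_lengths = [len(chunk_tokens) for chunk_tokens, _ in chunks]
print(chunk_lengths)
```

Note that a fixed window can cut an entity in two, leaving a chunk that starts with an `I-` tag; a production splitter would likely need to handle that case.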

CoNLL-2002 dataset distribution

The plot shows the distribution of tags/classes in the dataset under the IOB schema. The x-axis shows the four classes and the y-axis shows the number of entities per class.

Figure: CoNLL-2002 dataset distribution
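The number of entities per class can be computed by counting the `B-` tags, since each entity contributes exactly one. This is a minimal sketch that assumes every entity starts with a `B-` tag; `count_entities` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter

def count_entities(tag_sequences):
    """Count entities per class by counting B- tags across all sentences."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1   # strip the "B-" prefix to get the class
    return counts

# Toy tag sequences, for illustration only.
sample = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["O", "B-PER", "O"],
]
entity_counts = count_entities(sample)
print(entity_counts)
```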

CoNLL-2002 Sentences histogram

The histogram shows the distribution of sentence lengths in the dataset.

Figure: CoNLL-2002 sentences histogram
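A sentence-length histogram of this kind can be sketched by grouping sentence lengths into fixed-width bins. The bin width of 10 tokens and the name `length_histogram` are assumptions for illustration; the binning used for the original figure is not stated.

```python
from collections import Counter

def length_histogram(sentences, bin_width=10):
    """Map each bin's lower bound to the number of sentences whose length falls in it."""
    bins = Counter((len(s) // bin_width) * bin_width for s in sentences)
    return dict(sorted(bins.items()))

# Toy sentences of lengths 3, 12, 18 and 55 tokens.
toy_sentences = [["w"] * 3, ["w"] * 12, ["w"] * 18, ["w"] * 55]
histogram = length_histogram(toy_sentences)
print(histogram)
```

The resulting dictionary can be passed to a plotting library such as matplotlib (listed in the requirements) to draw the bars.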


Mx-news dataset

The Mx-news dataset was built from 250 documents in the political news domain. It is in Spanish and has seventeen classes. A single annotator labeled the corpus using the IOBES (Inside/Outside/Beginning/End/Single) tagging schema. The previous Spanish example sentence under the IOBES schema:

El presidente Andrés Manuel López Obrador estará viajando a la ciudad de Puebla de Zaragoza este Miércoles .

PERSON: Andrés-PER-B Manuel-PER-I López-PER-I Obrador-PER-E
LOCATION: Puebla-LOC-B de-LOC-I Zaragoza-LOC-E
DATE: Miércoles-DAT-S

| No. | Class | IOBES Schema | Description |
|-----|-------|--------------|-------------|
| 1 | PER | B-PER, I-PER, E-PER, S-PER | People names, aliases and abbreviations |
| 2 | ORG | B-ORG, I-ORG, E-ORG, S-ORG | Organizations, institutions |
| 3 | DAT | B-DAT, I-DAT, E-DAT, S-DAT | Dates in different formats |
| 4 | TIT | B-TIT, I-TIT, E-TIT, S-TIT | Title or position of persons |
| 5 | GPE | B-GPE, I-GPE, E-GPE, S-GPE | Country names, states, cities, municipalities |
| 6 | PEX | B-PEX, I-PEX, E-PEX, S-PEX | Political party names, aliases and abbreviations |
| 7 | TIM | B-TIM, I-TIM, E-TIM, S-TIM | Time expressions |
| 8 | FAC | B-FAC, I-FAC, E-FAC, S-FAC | Facility names |
| 9 | EVT | B-EVT, I-EVT, E-EVT, S-EVT | Event names |
| 10 | ADD | B-ADD, I-ADD, E-ADD, S-ADD | Address expressions, URLs and Twitter users |
| 11 | MNY | B-MNY, I-MNY, E-MNY, ----- | Monetary amounts |
| 12 | DOC | B-DOC, I-DOC, E-DOC, S-DOC | Documents, laws, rules |
| 13 | PRO | B-PRO, I-PRO, E-PRO, S-PRO | Product names, brands, application names |
| 14 | PRC | B-PRC, I-PRC, E-PRC, ----- | Percentage expressions |
| 15 | DEM | B-DEM, -----, E-DEM, S-DEM | Geographical or racial origin of people |
| 16 | AGE | B-AGE, I-AGE, E-AGE, ----- | People's age |
| 17 | LOC | B-LOC, I-LOC, E-LOC, S-LOC | Locations such as regions, rivers, lakes |
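The relationship between IOB and IOBES can be made concrete with a small converter: the last token of a multi-token entity becomes `E-`, and a single-token entity becomes `S-`. The Mx-news corpus was annotated directly in IOBES, so this is only an illustrative sketch; `iob_to_iobes` is a hypothetical name.

```python
def iob_to_iobes(tags):
    """Convert a sequence of IOB tags (B-/I-/O) to IOBES tags."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"   # does the entity go on?
        if prefix == "B":
            iobes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            iobes.append(f"I-{label}" if continues else f"E-{label}")
    return iobes

iob = ["O", "B-PER", "I-PER", "I-PER", "I-PER", "O", "B-DAT", "O"]
converted = iob_to_iobes(iob)
print(converted)
```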

The Mx-news dataset was split into four partitions.

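| No. | Partition | Original | Split | Tokens | Tags |
|-----|-----------|----------|-------|--------|------|
| 1 | Split I | 1295 | 1666 | 7628 | 63 |
| 2 | Split II | 1295 | 1677 | 7726 | 63 |
| 3 | Split III | 1297 | 1661 | 7664 | 63 |
| 4 | Ensemble | 3888 | 5004 | 13273 | 65 |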

In the table headings, Original is the number of sentences at their original length, and Split is the number of sentences after splitting them to a length of 50 tokens.

Mx-news dataset distribution (seventeen classes)

The plot shows the distribution of tags/classes in the dataset under the IOBES schema. The x-axis shows the seventeen classes and the y-axis shows the number of entities per class.

Figure: Mx-news dataset distribution (seventeen classes)

Mx-news dataset distribution (four classes)

The Mx-news dataset was reduced to four classes: ORG, PER, LOC, and MISC (the remaining fourteen classes), labeled under the IOB (Inside/Outside/Beginning) schema. The plot shows the distribution of tags/classes in the dataset under the IOB schema. The x-axis shows the four classes and the y-axis shows the number of entities per class.

Figure: Mx-news dataset distribution (four classes)
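The reduction from seventeen IOBES classes to four IOB classes can be sketched as a per-tag mapping: ORG, PER and LOC keep their class, every other class becomes MISC, and the `S-` and `E-` prefixes collapse to `B-` and `I-`. This is an assumed reading of the description above, not the repository's own conversion script; `reduce_tag` and `KEPT_CLASSES` are hypothetical names.

```python
KEPT_CLASSES = {"ORG", "PER", "LOC"}        # every other class is mapped to MISC
PREFIX_MAP = {"S": "B", "E": "I"}           # IOBES -> IOB prefix collapse

def reduce_tag(tag):
    """Map one IOBES tag over seventeen classes to an IOB tag over four classes."""
    if tag == "O":
        return tag
    prefix, label = tag.split("-", 1)
    label = label if label in KEPT_CLASSES else "MISC"
    prefix = PREFIX_MAP.get(prefix, prefix)
    return f"{prefix}-{label}"

iobes_tags = ["B-PER", "I-PER", "E-PER", "O", "S-DAT", "B-GPE", "E-GPE"]
reduced = [reduce_tag(t) for t in iobes_tags]
print(reduced)
```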

Mx-news Sentences histogram

The histogram shows the distribution of sentence lengths in the dataset. The Natural Language Toolkit (NLTK) was used to split the text into sentences.

Figure: Mx-news sentences histogram


ELMo word embeddings

The ELMo word embeddings were trained using the elmoformanylangs Python library.

Download CoNLL-2002 ELMo embeddings

Download Mx-news ELMo embeddings

Requirements

The Python libraries used:

  • elmoformanylangs==0.0.2
  • graphviz==0.10.1
  • Keras==2.2.4
  • Keras-Applications==1.0.7
  • keras-contrib==2.0.8
  • Keras-Preprocessing==1.0.9
  • matplotlib==3.0.3
  • numpy==1.16.2
  • seqeval==0.0.10
  • scikit-learn==0.20.3
  • sklearn-crfsuite==0.3.6
  • tensorflow==1.13.1
