### View NER Labels

This notebook loads the results for all NER tools and verifies the set of labels each tool uses.

In [1]:
import pandas as pd

In [2]:
flair = pd.read_csv('../data/results/flair/flair_ner.csv')
spacy = pd.read_csv('../data/results/spacy_entityrecognizer/spacy_ner_sm.csv')
nltk = pd.read_csv('../data/results/nltk/nltk_ner_uppercased.csv')
stanza = pd.read_csv('../data/results/stanza/stanza_ner.csv')

**Get list of all labels that occur in the output data**

In [52]:
all_labels_found = set(pd.concat([flair['labels'], spacy['labels'], nltk['labels'], stanza['labels']]))

In [53]:
print(all_labels_found)

{'LOC', 'NORP', 'CARDINAL', 'MISC', 'PERSON', 'ORGANIZATION', 'QUANTITY', 'FAC', 'PER', 'DATE', 'ORG', 'GPE', 'PRODUCT', 'TIME'}


**See which labels occur for which tool**

We know that:
- spaCy and Stanza recognize the 18 labels in OntoNotes
- Flair recognizes the 4 labels in CoNLL-03
- NLTK recognizes the 7 labels in ACE-2005


In [44]:
# Label sets, taken from each benchmark's release notes
conll03_labels = ['PER','ORG','LOC','MISC']
ace05_labels = ['PERSON','ORGANIZATION','LOCATION','GPE','FACILITY','WEAPON','VEHICLE']
ontonotes_labels = ['PERSON', 'ORG', 'LOC', 'GPE', 'FAC', 'CARDINAL', 'DATE', 'EVENT', 'LANGUAGE', 'LAW', 'MONEY', 'NORP', 'ORDINAL', 'PERCENT', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']
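
Because the three lists are typed by hand, a quick sanity check can catch a missing or duplicated label before the lists are used for comparison. The sketch below repeats the lists from the cell above so it runs on its own; the `schemes` and `expected_sizes` dictionaries are illustrative names that are not part of the original notebook, and the sizes are the ones stated in the text (4, 7, and 18).

```python
# Label lists copied from the cell above so this block is self-contained.
conll03_labels = ['PER', 'ORG', 'LOC', 'MISC']
ace05_labels = ['PERSON', 'ORGANIZATION', 'LOCATION', 'GPE', 'FACILITY', 'WEAPON', 'VEHICLE']
ontonotes_labels = ['PERSON', 'ORG', 'LOC', 'GPE', 'FAC', 'CARDINAL', 'DATE', 'EVENT',
                    'LANGUAGE', 'LAW', 'MONEY', 'NORP', 'ORDINAL', 'PERCENT', 'PRODUCT',
                    'QUANTITY', 'TIME', 'WORK_OF_ART']

# Hypothetical helper dictionaries (names chosen for this sketch only).
schemes = {'conll03': conll03_labels, 'ace05': ace05_labels, 'ontonotes': ontonotes_labels}
expected_sizes = {'conll03': 4, 'ace05': 7, 'ontonotes': 18}  # sizes stated in the text

for name, labels in schemes.items():
    # Each list should have the stated number of labels ...
    assert len(labels) == expected_sizes[name], f"{name}: unexpected number of labels"
    # ... and no label should be listed twice.
    assert len(set(labels)) == len(labels), f"{name}: duplicate label"

print("All label lists have the expected size and contain no duplicates.")
```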

In [37]:
# Flair
print(flair['labels'].value_counts())

labels
LOC     327
ORG     267
PER     164
MISC     83
Name: count, dtype: int64


In [42]:
# spacy
print(spacy['labels'].value_counts())
print(f"Labels unused: {[label for label in ontonotes_labels if label not in list(spacy['labels'])]}")

labels
ORG         72
PERSON      22
CARDINAL     6
DATE         6
TIME         4
QUANTITY     3
GPE          3
PRODUCT      2
NORP         1
FAC          1
Name: count, dtype: int64
Labels unused: ['LOC', 'EVENT', 'LANGUAGE', 'LAW', 'MONEY', 'ORDINAL', 'PERCENT', 'WORK_OF_ART']


In [43]:
# NLTK
print(nltk['labels'].value_counts())
print(f"Labels unused: {[label for label in ace05_labels if label not in list(nltk['labels'])]}")

labels
ORGANIZATION    450
PERSON            6
GPE               2
Name: count, dtype: int64
Labels unused: ['LOCATION', 'FACILITY', 'WEAPON', 'VEHICLE']


In [45]:
# Stanza
print(stanza['labels'].value_counts())
print(f"Labels unused: {[label for label in ontonotes_labels if label not in list(stanza['labels'])]}")

labels
CARDINAL    10
GPE         10
ORG          5
DATE         4
PERSON       3
PRODUCT      3
TIME         1
Name: count, dtype: int64
Labels unused: ['LOC', 'FAC', 'EVENT', 'LANGUAGE', 'LAW', 'MONEY', 'NORP', 'ORDINAL', 'PERCENT', 'QUANTITY', 'WORK_OF_ART']
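
The per-tool cells above repeat the same list comprehension. One possible way to factor it out is a small helper that also reports labels a tool emits that are outside its assumed scheme, which the cells above do not check. `label_coverage` is a hypothetical name, not part of the original notebook; the example input is the set of NLTK labels copied by hand from the `value_counts()` output above, rather than loaded from the CSV.

```python
def label_coverage(observed, scheme):
    """Compare the labels a tool produced against its label scheme.

    Returns a pair (unused, unexpected):
      unused      scheme labels that never appear in the output, in scheme order
      unexpected  output labels that are not part of the scheme, sorted
    """
    observed = set(observed)
    unused = [label for label in scheme if label not in observed]
    unexpected = sorted(observed - set(scheme))
    return unused, unexpected


ace05_labels = ['PERSON', 'ORGANIZATION', 'LOCATION', 'GPE', 'FACILITY', 'WEAPON', 'VEHICLE']

# Labels NLTK produced, copied by hand from the value_counts() output above.
nltk_observed = ['ORGANIZATION', 'PERSON', 'GPE']

unused, unexpected = label_coverage(nltk_observed, ace05_labels)
print(f"Labels unused: {unused}")          # ['LOCATION', 'FACILITY', 'WEAPON', 'VEHICLE']
print(f"Labels unexpected: {unexpected}")  # []
```

In the notebook itself, the first argument could be the column directly, for example `label_coverage(nltk['labels'], ace05_labels)`, since the helper only needs an iterable of label strings.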


**Union of labels**

Below is the full set of labels that at least one of the NER tools may assign to an entity.

In [55]:
# Consolidate and remove duplicate abbreviations
all_labels = set(conll03_labels + ace05_labels + ontonotes_labels)
all_labels.remove('PER')
all_labels.remove('ORG')
all_labels.remove('LOC')
all_labels.remove('FAC')
print(all_labels)

{'LOCATION', 'NORP', 'ORGANIZATION', 'QUANTITY', 'FACILITY', 'GPE', 'CARDINAL', 'ORDINAL', 'PERCENT', 'WEAPON', 'PERSON', 'WORK_OF_ART', 'DATE', 'PRODUCT', 'MONEY', 'MISC', 'LANGUAGE', 'EVENT', 'VEHICLE', 'LAW', 'TIME'}


In [54]:
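# The full label set is larger than the set of labels actually produced in the output.
# Drop the abbreviations whose long form also occurs (ORGANIZATION, PERSON);
# discard() keeps this cell safe to re-run, whereas remove() would raise a KeyError.
all_labels_found.discard('ORG')
all_labels_found.discard('PER')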
print(all_labels_found)

{'LOC', 'NORP', 'CARDINAL', 'MISC', 'PERSON', 'ORGANIZATION', 'QUANTITY', 'FAC', 'DATE', 'GPE', 'PRODUCT', 'TIME'}
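
The two cells above remove abbreviations by hand, which leaves `LOC` and `FAC` in one set and `LOCATION` and `FACILITY` in the other, so the sets cannot be compared directly. A minimal sketch of an alternative is to map every short form to its long form before comparing. The `ALIASES` table and the `normalize` function are assumptions made for this sketch; in particular, treating `LOC`/`LOCATION` and `FAC`/`FACILITY` as the same entity type across benchmarks follows the notebook's own consolidation step and is not taken from the benchmark definitions. The `found` set is copied from the printed output of `all_labels_found` earlier in the notebook.

```python
# Hypothetical alias table: short form -> long form of the same entity type.
ALIASES = {'PER': 'PERSON', 'ORG': 'ORGANIZATION', 'LOC': 'LOCATION', 'FAC': 'FACILITY'}


def normalize(labels):
    """Return the set of labels with every known abbreviation expanded."""
    return {ALIASES.get(label, label) for label in labels}


conll03_labels = ['PER', 'ORG', 'LOC', 'MISC']
ace05_labels = ['PERSON', 'ORGANIZATION', 'LOCATION', 'GPE', 'FACILITY', 'WEAPON', 'VEHICLE']
ontonotes_labels = ['PERSON', 'ORG', 'LOC', 'GPE', 'FAC', 'CARDINAL', 'DATE', 'EVENT',
                    'LANGUAGE', 'LAW', 'MONEY', 'NORP', 'ORDINAL', 'PERCENT', 'PRODUCT',
                    'QUANTITY', 'TIME', 'WORK_OF_ART']

# Every label any tool may assign, after expanding abbreviations.
all_labels = normalize(conll03_labels + ace05_labels + ontonotes_labels)

# Labels found in the output data, copied from the printed set earlier in the notebook.
found = normalize({'LOC', 'NORP', 'CARDINAL', 'MISC', 'PERSON', 'ORGANIZATION', 'QUANTITY',
                   'FAC', 'PER', 'DATE', 'ORG', 'GPE', 'PRODUCT', 'TIME'})

# Labels that are available to some tool but never appear in any output.
never_produced = sorted(all_labels - found)

print(len(all_labels), len(found))  # 21 12
print(never_produced)
# ['EVENT', 'LANGUAGE', 'LAW', 'MONEY', 'ORDINAL', 'PERCENT', 'VEHICLE', 'WEAPON', 'WORK_OF_ART']
```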
