# Combining TaxoNERD with gazetteer-based NER for improved taxonomic entity recognition

TaxoNERD's models can natively detect scientific names, common names, abbreviated species names and user-defined abbreviations in text. However, performance on common (vernacular) names is weaker because these names are not systematically annotated in the training corpus.

A simple way to improve TaxoNERD's performance on this category of taxon mentions is to couple it with a gazetteer-based NER engine, e.g. spaCy's [EntityRuler](https://spacy.io/api/entityruler).

In this notebook, we show how TaxoNERD can be extended with an instance of EntityRuler for improved taxonomic entity recognition.

## TaxoNERD initialization and testing

In [1]:
import spacy
import json
from tqdm.notebook import trange, tqdm
from spacy.language import Language
from taxonerd import TaxoNERD

We start by loading the ``en_core_eco_biobert`` model. The ``pysbd_sentencizer`` and ``parser`` components are not needed here so they can be excluded.

In [2]:
taxonerd = TaxoNERD(prefer_gpu=True)
taxonerd.load(model="en_core_eco_biobert", exclude=["pysbd_sentencizer", "parser"])
taxonerd.nlp.pipe_names

['transformer',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'taxo_abbrev_detector']

Let's try our model on a piece of text!

In [3]:
text = """A study on pathological effects of Acholeplasma laidlawii isolated from buffaloes in Mice Model. 
-- Respiratory distress has become a hot issue that is causing severe infection in livestock industry of Pakistan. 
The exact and timely diagnosis is incredible to treat the disease. However, Acholeplasma (A.) laidlawii is found 
very significant from buffalo lungs but being a ubiquitous organism, its pathogenic description is not completely 
understood. The study was designed to validate the involvement of A. laidlawii in respiratory diseases in buffaloes.
For this purpose, experimental trials on mice were conducted to confirm the involvement of the organism in 
respiratory tract infection. It was re-isolated from experimentally infected mice, showing lesions in respiratory 
tract (83.3%), proving Koch's postulates. Statistically, the experimental group-A (subcutaneous route) showed 
significant difference (P < 0.05) except in case of mortality feature. The group-B (intraperitoneal route) 
indicated non-significant difference (P>0.05) in all cases. Based on current study, it may be concluded that the 
organism is opportunistic, and can produce either disease or lesions on targeted organs in stressed animals, 
particularly buffaloes."""

In [36]:
taxonerd.find_in_text(text)

| | offsets | text |
|---|---|---|
| T0 | LIVB 35 57 | Acholeplasma laidlawii |
| T1 | LIVB 290 317 | Acholeplasma (A.) laidlawii |
| T2 | LIVB 509 521 | A. laidlawii |


TaxoNERD successfully detects the three mentions of *Acholeplasma laidlawii*. However, it fails to detect common names of taxonomic entities such as "buffaloes" and "mice". To address this weakness, we propose to extend TaxoNERD's pipeline with an instance of spaCy's EntityRuler component, which recognizes names listed in a gazetteer.

## Creating the name gazetteer

In this example, we will use a very simple gazetteer for demonstration purposes.

In [47]:
gazetteer = ["buffalo", "mouse", "cat", "dog"]

Of course, it would be better to create a comprehensive list of vernacular names by querying taxonomic resources such as the NCBI Taxonomy, the GBIF backbone taxonomy or Wikidata. But for now, let's stick to our 4-name gazetteer.
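Whatever the source of the names, a larger list usually needs some cleanup before it can be turned into patterns. The following is a minimal sketch, assuming the names have already been exported from a taxonomic resource as plain strings; `build_gazetteer` and the sample `raw_names` are hypothetical and not part of TaxoNERD.

```python
def build_gazetteer(raw_names):
    """Normalize, de-duplicate and sort vernacular names (hypothetical helper)."""
    # Collapse runs of whitespace and lowercase each name; the set removes duplicates.
    cleaned = {" ".join(name.split()).lower() for name in raw_names}
    # Drop the empty string produced by blank entries, if any.
    cleaned.discard("")
    return sorted(cleaned)


# Made-up sample standing in for an export from a taxonomic resource.
raw_names = ["Buffalo", "mouse ", "House  mouse", "buffalo", "", "Cat", "dog"]
gazetteer = build_gazetteer(raw_names)
print(gazetteer)
```

Lowercasing at this stage is consistent with the lowercased lemmas used for the patterns later in this notebook.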

## Patterns initialization

The EntityRuler is a component that lets you add named entities based on pattern dictionaries. It accepts two types of patterns: phrase patterns for exact string matches, and token patterns in which each dictionary describes one token.

While exact string matching is very useful and quite effective for detecting "static" entities such as organization names (or scientific names of organisms), it is less effective for common nouns such as vernacular species names, whose singular and plural forms often differ.

For instance, using exact string matching with our 4-name gazetteer would not help us detect the mentions of "buffaloes" and "mice" in our piece of text. Exact string matching would need a comprehensive list of names with both singular and plural forms for each name.
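This limitation can be reproduced without loading TaxoNERD. The sketch below assumes a blank English spaCy pipeline (tokenizer only) and a made-up test sentence; it registers the four gazetteer names as phrase patterns and lists what the entity ruler finds.

```python
import spacy

gazetteer = ["buffalo", "mouse", "cat", "dog"]

# A blank pipeline is enough here: phrase patterns only need the tokenizer.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# A string pattern is a phrase pattern, i.e. an exact match on the token text.
ruler.add_patterns([{"label": "LIVB", "pattern": name} for name in gazetteer])

doc = nlp("Lesions were found in buffaloes and mice, and in one buffalo lung.")
exact_matches = [ent.text for ent in doc.ents]
print(exact_matches)
```

Only the singular "buffalo" is matched; "buffaloes" and "mice" are missed because their surface forms are not in the gazetteer.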

So, it seems that token patterns are the way to go! More specifically, we will define a set of patterns that use the base form (or [lemma](https://en.wikipedia.org/wiki/Lemmatisation)) of each name in our gazetteer.
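To illustrate the idea in isolation, here is a minimal sketch that again assumes a blank English pipeline. A blank pipeline has no lemmatizer, so the lemmas are filled in by hand from a tiny hypothetical lookup table (`TOY_LEMMAS`); in the rest of this notebook they come from TaxoNERD's own lemmatizer instead.

```python
import spacy

gazetteer = ["buffalo", "mouse", "cat", "dog"]

# Hypothetical stand-in for a real lemmatizer: maps plural forms to lemmas.
TOY_LEMMAS = {"buffaloes": "buffalo", "mice": "mouse"}

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One token pattern per name: match any token whose lemma equals the name.
ruler.add_patterns(
    [{"label": "LIVB", "pattern": [{"LEMMA": name}]} for name in gazetteer]
)

# Tokenize only, set the lemmas manually, then apply the ruler to the doc.
doc = nlp.make_doc("Lesions were found in buffaloes and mice, and in one buffalo lung.")
for token in doc:
    token.lemma_ = TOY_LEMMAS.get(token.lower_, token.lower_)
doc = ruler(doc)

lemma_matches = [ent.text for ent in doc.ents]
print(lemma_matches)
```

With lemma-based patterns, a single gazetteer entry covers both the singular and the plural form, provided the lemmatizer maps them to the same lemma.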

Let's start by defining a utility component that simply converts token lemmas to lowercase:

In [48]:
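@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    # Lowercase every lemma so that the patterns match regardless of case.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Insert the component right after the lemmatizer in TaxoNERD's pipeline.
taxonerd.nlp.add_pipe("lower_case_lemmas", after="lemmatizer")
taxonerd.nlp.pipe_names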

['transformer',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'lower_case_lemmas',
 'ner',
 'taxo_abbrev_detector']

Then, we create a list of token patterns by processing each name in the gazetteer using TaxoNERD's pipeline to obtain the (lowercased) lemmas of the name tokens.

In [49]:
patterns = []
label = "LIVB"
for name in gazetteer:
    doc = taxonerd.nlp(name)
    patterns.append({"label": label, "pattern": [{"LEMMA": token.lemma_} for token in doc]})
patterns

[{'label': 'LIVB', 'pattern': [{'LEMMA': 'buffalo'}]},
 {'label': 'LIVB', 'pattern': [{'LEMMA': 'mouse'}]},
 {'label': 'LIVB', 'pattern': [{'LEMMA': 'cat'}]},
 {'label': 'LIVB', 'pattern': [{'LEMMA': 'dog'}]}]

## EntityRuler initialization and testing

Adding an entity ruler to TaxoNERD's pipeline is as simple as:

In [51]:
ruler = taxonerd.nlp.add_pipe("entity_ruler")
taxonerd.nlp.pipe_names

['transformer',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'lower_case_lemmas',
 'ner',
 'taxo_abbrev_detector',
 'entity_ruler']

The final step is to add the patterns to the entity ruler:

In [53]:
ruler.add_patterns(patterns)
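Because the ruler was appended at the end of the pipeline, it runs after the `ner` component. By default (`overwrite_ents=False`), spaCy's EntityRuler keeps entities that are already set and skips matches that overlap them, so the gazetteer only fills in what the model missed. The sketch below illustrates this on a blank pipeline; the manually assigned `LIVB` span stands in for a model prediction, and the `GAZ` label and test sentence are made up for the demonstration.

```python
import spacy
from spacy.tokens import Span


def ruler_entities(overwrite_ents):
    """Apply an entity ruler to a doc that already carries one entity."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": overwrite_ents})
    ruler.add_patterns([
        {"label": "GAZ", "pattern": [{"LOWER": "laidlawii"}]},
        {"label": "GAZ", "pattern": [{"LOWER": "mice"}]},
    ])
    doc = nlp.make_doc("Acholeplasma laidlawii infects mice")
    # Stand-in for a statistical NER prediction covering the first two tokens.
    doc.ents = [Span(doc, 0, 2, label="LIVB")]
    doc = ruler(doc)
    return [(ent.text, ent.label_) for ent in doc.ents]


kept = ruler_entities(overwrite_ents=False)        # existing entity is preserved
overwritten = ruler_entities(overwrite_ents=True)  # overlapping match replaces it
print(kept)
print(overwritten)
```

If the gazetteer should take precedence over the model instead, the ruler can be created with `config={"overwrite_ents": True}` or added with `before="ner"`.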

Now that everything is ready, we can test our TaxoNERD pipeline that has been extended with gazetteer-based NER.

In [54]:
taxonerd.find_in_text(text)

| | offsets | text |
|---|---|---|
| T0 | LIVB 35 57 | Acholeplasma laidlawii |
| T1 | LIVB 72 81 | buffaloes |
| T2 | LIVB 290 317 | Acholeplasma (A.) laidlawii |
| T3 | LIVB 350 357 | buffalo |
| T4 | LIVB 509 521 | A. laidlawii |
| T5 | LIVB 549 558 | buffaloes |
| T6 | LIVB 601 605 | mice |
| T7 | LIVB 745 749 | mice |
| T8 | LIVB 1239 1248 | buffaloes |


In addition to the three mentions of *Acholeplasma laidlawii*, TaxoNERD now successfully detects the mentions of "buffalo(es)" and "mice" thanks to the entity ruler.