# Entity linking against custom taxonomies

This notebook is a step-by-step example of linking taxon mentions to a custom taxonomy using TaxoNERD.

Entity linking in TaxoNERD relies on fuzzy string matching: names are turned into TF-IDF vectors, and the closest names are retrieved with an approximate nearest neighbour search. This is the same approach as the one implemented in scispacy.
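
To make the idea concrete, here is a minimal, standard-library-only sketch of TF-IDF string matching. It is not TaxoNERD's or scispacy's code: it assumes character trigrams as features and a smoothed IDF formula, and it replaces the approximate nearest neighbour index with an exhaustive cosine-similarity search, which is only practical for a handful of names. All function names are hypothetical.

```python
import math
from collections import Counter

def char_trigrams(name):
    """Split a lower-cased, space-padded name into overlapping 3-character chunks."""
    padded = f" {name.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def fit_idf(names):
    """Compute a smoothed inverse document frequency for every trigram (assumed formula)."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_trigrams(name)))
    n = len(names)
    return {gram: math.log((1 + n) / (1 + df)) + 1 for gram, df in doc_freq.items()}

def tfidf_vector(name, idf):
    """Return an L2-normalised sparse TF-IDF vector as a dict; unseen trigrams are ignored."""
    counts = Counter(g for g in char_trigrams(name) if g in idf)
    vec = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}

def best_match(mention, names, idf):
    """Score every name by cosine similarity (exact search, not ANN) and return the best."""
    query = tfidf_vector(mention, idf)
    scored = []
    for name in names:
        candidate = tfidf_vector(name, idf)
        score = sum(w * candidate.get(g, 0.0) for g, w in query.items())
        scored.append((score, name))
    return max(scored)

names = ["Pandanus tectorius", "Pandanus utilis", "Stachys pustulosa", "Eugenia scoparia"]
idf = fit_idf(names)
score, name = best_match("Pandanus tectorus", names, idf)  # misspelled mention
print(name, round(score, 2))
```

Because the misspelled mention still shares most of its trigrams with the correct name, that name gets the highest similarity. An ANN index plays the same role as `best_match`, but avoids comparing the query against every alias.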

### 1. Download the taxonomy

The first step is to download the taxonomic reference. For this example, we will create a custom linker for the World Checklist of Vascular Plants (WCVP) database.

In [1]:
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

url = "http://sftp.kew.org/pub/data-repositories/WCVP/wcvp.zip"
extract_to = "./tmp/wcvp"

with urlopen(url) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        zfile.extractall(extract_to)

### 2. Export the taxonomy to a JSON Lines file

TaxoNERD expects the taxonomy to be provided as a JSON Lines (.jsonl) file.
A JSON Lines file is a text file in which every line is a valid JSON object and lines are separated by a newline character "\n". Each line must have the following format:

{"concept_id": concept_id, "canonical_name": canonical_name, "aliases": [name_1, name_2, ..., name_N], "definition": ""}

with:
- concept_id: the identifier of the taxon in the reference taxonomy
- canonical_name: the accepted taxon name (used for display only)
- aliases: a list of names and synonyms for the taxon
- definition: not used by TaxoNERD, leave empty
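
As an illustration of this format, the sketch below builds one record and serializes it to a single line. The helper name `make_entry` is hypothetical, and the identifier and name are simply the ones that appear in the linking result at the end of this notebook.

```python
import json

def make_entry(concept_id, canonical_name, aliases):
    """Build one knowledge-base record with the four expected keys."""
    return {
        "concept_id": concept_id,
        "canonical_name": canonical_name,
        "aliases": sorted(set(aliases)),  # drop duplicate names
        "definition": "",  # not used by TaxoNERD, left empty
    }

entry = make_entry(
    "895770-1",
    "Pandanus tectorius",
    ["Pandanus tectorius", "Pandanus tectorius"],  # duplicate on purpose
)

# One record = one line: json.dumps never emits a raw newline by default
line = json.dumps(entry, ensure_ascii=False)
print(line)
```

Parsing the line back with `json.loads(line)` should return the same dictionary, which is a quick way to check that a record is well formed.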

In [2]:
import pandas as pd

# Read the taxonomy
taxo_df = pd.read_csv("./tmp/wcvp/wcvp_names.csv", sep="|", usecols=["plant_name_id", "ipni_id", "taxon_status", "taxon_name", "accepted_plant_name_id", "taxon_authors", "first_published"])
# Keep only the names that can be associated with an accepted name
taxo_df = taxo_df.dropna(subset=["accepted_plant_name_id"])
taxo_df.head()

| | plant_name_id | ipni_id | taxon_status | first_published | taxon_name | taxon_authors | accepted_plant_name_id |
|---|---|---|---|---|---|---|---|
| 0 | 195508 | 243233-2 | Synonym | (1931) | Stachys pustulosa | Rydb. | 195467.0 |
| 1 | 197585 | 767122-1 | Synonym | (1830) | Stenostomum dichotomum | DC. | 197582.0 |
| 2 | 76791 | 595920-1 | Synonym | (1878) | Eugenia scoparia | Duthie | 199254.0 |
| 3 | 74373 | 593644-1 | Synonym | (1878) | Eugenia areolata | (DC.) Duthie | 200472.0 |
| 4 | 205204 | 884387-1 | Synonym | (1971) | Thymus pallasianus subsp. brachyodon | (Borbás) Jalas | 204938.0 |


In [3]:
from tqdm.notebook import tqdm
import json

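def dump_entries(output_file, entries):
    # Append one JSON object per line. Every line ends with "\n" so that
    # successive dumps never merge two objects onto the same line.
    with open(output_file, "a", encoding="utf8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")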

def wcvp_to_jsonl(taxo_df, output_file):
    with open(output_file, 'w'): pass
    count = 0
    total_names = 0
    entries = []
    for taxid in tqdm(taxo_df.accepted_plant_name_id.unique()):
        if count % 10000 == 0 and count != 0:
            print(count, "dump", len(entries))
            dump_entries(output_file, entries)
            entries = []
            break # We stop at the first dump to keep this tutorial quick, remove this line to export the whole taxonomy
        taxa = taxo_df.query("accepted_plant_name_id==@taxid")
        accepted = taxa[taxa["taxon_status"]=="Accepted"]
        if not accepted.empty:
            concept_id = accepted.ipni_id.iloc[0] # We use the ipni_id as the concept_id
            canonical_name = accepted.taxon_name.iloc[0]
            aliases = set()
            for idx, row in taxa.iterrows():
                aliases.add(row["taxon_name"])
                if not pd.isnull(row["taxon_authors"]): # We assemble a name with authorship if authorship information is available
                    aliases.add(row["taxon_name"]+" "+row["taxon_authors"])
            entries.append({"concept_id" : concept_id, "aliases" : list(aliases), "canonical_name" : canonical_name, "definition" : ""})
            count += 1
            total_names += len(aliases)
    dump_entries(output_file, entries)
    print(f"Written {total_names} names for {count} taxa to {output_file}")

In [4]:
output_file = "./tmp/wcvp/wcvp.jsonl"
wcvp_to_jsonl(taxo_df, output_file)

  0%|          | 0/433705 [00:00<?, ?it/s]

10000 dump 10000
Written 115160 names for 10000 taxa to ./tmp/wcvp/wcvp.jsonl
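
Before building the index, it can be worth checking that the exported file really contains one valid record per line. The function below is a small sketch of such a check; `check_jsonl` and the demo records are hypothetical and not part of TaxoNERD.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"concept_id", "canonical_name", "aliases", "definition"}

def check_jsonl(path):
    """Parse every line of a JSON Lines file and return (n_taxa, n_names)."""
    n_taxa = n_names = 0
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)  # raises ValueError on a malformed line
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            n_taxa += 1
            n_names += len(entry["aliases"])
    return n_taxa, n_names

# Demo on a throwaway file with two made-up records
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    demo_entries = [
        {"concept_id": "demo-1", "canonical_name": "Name a", "aliases": ["Name a", "Name b"], "definition": ""},
        {"concept_id": "demo-2", "canonical_name": "Name c", "aliases": ["Name c"], "definition": ""},
    ]
    with open(demo_path, "w", encoding="utf8") as f:
        for demo_entry in demo_entries:
            f.write(json.dumps(demo_entry) + "\n")
    demo_counts = check_jsonl(demo_path)

print(demo_counts)
```

On the real export, `check_jsonl("./tmp/wcvp/wcvp.jsonl")` should return the number of taxa and names reported by `wcvp_to_jsonl`. A line holding two concatenated JSON objects makes `json.loads` fail, so the check also catches missing newlines between records.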


### 3. Fit a TF-IDF vectorizer and build the Approximate Nearest Neighbour index 

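We now have to fit a TF-IDF vectorizer on all the names in our knowledge base (the .jsonl file), compute their TF-IDF vectors, and build an ANN index using the HNSW algorithm. This can take a while and use a lot of memory, especially for large taxonomies. In the run below, building the index for 114,760 aliases took about 205 seconds, despite the "takes 2 hours" note in the log. All generated files are saved in the out_path directory.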

In [5]:
from taxonerd.linking.linking_utils import KnowledgeBase
from taxonerd.linking.candidate_generation import create_tfidf_ann_index

In [6]:
wcvp_kb = KnowledgeBase(file_path="./tmp/wcvp/wcvp.jsonl", prefix="WCVP:")
_, _, _ = create_tfidf_ann_index(kb=wcvp_kb, out_path="./tmp/wcvp/processed")

No tfidf vectorizer on ./tmp/wcvp/processed/tfidf_vectorizer.joblib or ann index on ./tmp/wcvp/processed/nmslib_index.bin
Fitting tfidf vectorizer on 114760 aliases
Saving tfidf vectorizer to ./tmp/wcvp/processed/tfidf_vectorizer.joblib
Fitting and saving vectorizer took 1.511032 seconds
Finding empty (all zeros) tfidf vectors
Deleting 0/114760 aliases because their tfidf is empty
Saving list of concept ids and tfidfs vectors to ./tmp/wcvp/processed/concept_aliases.json and ./tmp/wcvp/processed/tfidf_vectors_sparse.npz
Fitting ann index on 114760 aliases (takes 2 hours)



0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************


Fitting ann index took 205.331933 seconds


### 4. Tell TaxoNERD to use the custom linker

All the files TaxoNERD needs to link taxon mentions to entities in the custom taxonomy have now been created. All that remains is to tell TaxoNERD where to find them. As of TaxoNERD 1.5.2, the linker argument of the TaxoNERD.load() method accepts the path to a linker configuration file, so we write the knowledge base settings (kb) and the paths of the generated files (linker_paths) to a JSON configuration file.

In [7]:
import json

linker_cfg = {
    "name":"wcvp",
    "kb": {
        "file_path":"./tmp/wcvp/wcvp.jsonl",
        "prefix":"WCVP:",
    },
    "linker_paths": {
        "ann_index":"./tmp/wcvp/processed/nmslib_index.bin",
        "tfidf_vectorizer":"./tmp/wcvp/processed/tfidf_vectorizer.joblib",
        "tfidf_vectors":"./tmp/wcvp/processed/tfidf_vectors_sparse.npz",
        "concept_aliases_list":"./tmp/wcvp/processed/concept_aliases.json",
    }
}

with open("./tmp/wcvp/wcvp.cfg", "w") as f:
    json.dump(linker_cfg, f)
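
A wrong path in the configuration is easy to miss, so a quick check that every referenced file exists can save a confusing error later. The sketch below assumes the configuration layout used above (a `kb.file_path` entry plus a `linker_paths` mapping). `missing_linker_files` is a hypothetical helper, not a TaxoNERD function, and the demo uses empty placeholder files.

```python
import json
import tempfile
from pathlib import Path

def missing_linker_files(cfg_path):
    """Return the paths referenced in a linker config that do not exist on disk."""
    with open(cfg_path, encoding="utf8") as f:
        cfg = json.load(f)
    paths = [cfg["kb"]["file_path"], *cfg["linker_paths"].values()]
    return [p for p in paths if not Path(p).is_file()]

# Demo in a throwaway directory: the ANN index file is deliberately not created
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    for file_name in ["kb.jsonl", "tfidf_vectorizer.joblib"]:
        (root / file_name).touch()
    demo_cfg = {
        "name": "demo",
        "kb": {"file_path": str(root / "kb.jsonl"), "prefix": "DEMO:"},
        "linker_paths": {
            "ann_index": str(root / "nmslib_index.bin"),
            "tfidf_vectorizer": str(root / "tfidf_vectorizer.joblib"),
        },
    }
    demo_cfg_path = root / "demo.cfg"
    demo_cfg_path.write_text(json.dumps(demo_cfg), encoding="utf8")
    demo_missing = [Path(p).name for p in missing_linker_files(demo_cfg_path)]

print(demo_missing)
```

For the configuration written above, `missing_linker_files("./tmp/wcvp/wcvp.cfg")` should return an empty list. Relative paths are resolved against the current working directory, as in the rest of this notebook.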

In [8]:
from taxonerd import TaxoNERD

taxonerd = TaxoNERD(prefer_gpu=False)
taxonerd.load(model="en_core_eco_md", exclude=[], linker="./tmp/wcvp/wcvp.cfg", threshold=0.7)

<spacy.lang.en.English at 0x7fd303c1b3a0>

In [9]:
taxonerd.find_in_text("Wild distribution of Pandanus tectorius occurs in the exposed coastal headlands beaches and near-coastal forests of south Asia")

| | offsets | text | entity | sent |
|---|---|---|---|---|
| T0 | LIVB 21 39 | Pandanus tectorius | [(WCVP:'895770-1', Pandanus tectorius, 1.0)] | 0 |


The custom linker also works from TaxoNERD's command-line interface, by passing the configuration file with the -l option:

In [10]:
!taxonerd ask -l ./tmp/wcvp/wcvp.cfg "Wild distribution of Pandanus tectorius occurs in the exposed coastal headlands beaches and near-coastal forests of south Asia"

T0	LIVB 21 39	Pandanus tectorius	"[(""WCVP:'895770-1'"", 'Pandanus tectorius', 1.0)]"
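
If the CLI output needs to be post-processed, a line like the one above can be split back into its parts. This sketch assumes the output is always tab-separated with a CSV-quoted list of (concept id, name, score) candidates in the last field. That format is inferred from this single example, not from TaxoNERD's documentation, and `parse_cli_line` is a hypothetical helper.

```python
import ast
import csv
import io

def parse_cli_line(line):
    """Split one tab-separated output line into its parts (format inferred from one example)."""
    fields = next(csv.reader(io.StringIO(line), delimiter="\t"))
    ann_id, span, text, raw_candidates = fields
    label, start, end = span.split()
    return {
        "id": ann_id,
        "label": label,
        "start": int(start),
        "end": int(end),
        "text": text,
        # The last field is a Python-style list literal of candidate tuples
        "candidates": ast.literal_eval(raw_candidates),
    }

line = (
    "T0\tLIVB 21 39\tPandanus tectorius\t"
    "\"[(\"\"WCVP:'895770-1'\"\", 'Pandanus tectorius', 1.0)]\""
)
parsed = parse_cli_line(line)
best_id, best_name, best_score = parsed["candidates"][0]
print(parsed["text"], best_id, best_score)
```

The start and end values are character offsets into the input sentence, so `sentence[21:39]` gives back the matched mention.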


And that's it!