# Combining spaCy NER models with xMEN for German Clinical Entity Linking

## Preparation

### Download NER Model

In [1]:
!huggingface-cli download phlobo/de_ggponc_medbertde de_ggponc_medbertde-any-py3-none-any.whl --local-dir ../local_files

../local_files/de_ggponc_medbertde-any-py3-none-any.whl


In [2]:
!pip install -q ../local_files/de_ggponc_medbertde-any-py3-none-any.whl

### Prepare dicts and index

`xmen dict conf/ggponc.yaml`

`xmen index conf/ggponc.yaml --all --overwrite`

## Run spaCy NER Model on Sample Data

In [3]:
import spacy
nlp = spacy.load('de_ggponc_medbertde')

In [4]:
sentences = [
    # Adjacent string literals are concatenated; note the trailing space after "und"
    "Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und "
    "dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. "
    "allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.",
    "Die HPV-Diagnostik hat beim Plattenepithelkarzinom der Mundhöhle keinen validen Nutzen als prognostischer Faktor."
]
sentences

['Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.',
 'Die HPV-Diagnostik hat beim Plattenepithelkarzinom der Mundhöhle keinen validen Nutzen als prognostischer Faktor.']

In [5]:
docs = list(nlp.pipe(sentences))

In [6]:
import pandas as pd

In [7]:
ents = []
for d in docs:
    for span in sorted(d.spans['entities'], key=lambda s: s.start):
        ents.append({'mention' : span.text, 'class' : span.label_})
pd.DataFrame(ents)

| | mention | class |
| --- | --- | --- |
| 0 | Cetuximab | Clinical_Drug |
| 1 | monoklonaler Antikörper | Clinical_Drug |
| 2 | epidermalen Wachstumsfaktorrezeptor | Nutrient_or_Body_Substance |
| 3 | EGFR | Nutrient_or_Body_Substance |
| 4 | Therapie des fortgeschrittenen kolorektalen Ka... | Therapeutic |
| 5 | fortgeschrittenen kolorektalen Karzinoms | Diagnosis_or_Pathology |
| 6 | Irinotecan | Clinical_Drug |
| 7 | FOLFOX | Therapeutic |
| 8 | Versagen einer Behandlung | Other_Finding |
| 9 | Behandlung mit Oxaliplatin und Irinotecan | Therapeutic |


## Candidate Generation

In [8]:
from xmen.data import from_spacy
from xmen.linkers import SapBERTLinker, TFIDFNGramLinker, EnsembleLinker
from xmen import load_config

In [9]:
dataset = from_spacy(docs, span_key='entities')

In [10]:
dataset

Dataset({
    features: ['id', 'document_id', 'passages', 'entities', 'coreferences', 'relations', 'events', 'corpus_id', 'lang'],
    num_rows: 2
})

In [11]:
conf = load_config('../examples/conf/ggponc.yaml')

In [12]:
ngram_linker = TFIDFNGramLinker(**conf.linker.candidate_generation.ngram)


In [13]:
SapBERTLinker.clear()
sap_linker = SapBERTLinker(cuda=False, **conf.linker.candidate_generation.sapbert)

In [14]:
linker = EnsembleLinker()
linker.add_linker('ngram', ngram_linker, k=conf.linker.candidate_generation.ngram.k, threshold=0.9)
linker.add_linker('sap', sap_linker, k=conf.linker.candidate_generation.sapbert.k, threshold=0.8)

prediction = linker.predict_batch(dataset)

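The ensemble above registers two candidate generators, each with its own `k` and `threshold`. How xMEN merges their outputs internally is not shown in this notebook. The following is only a minimal sketch of one plausible strategy: drop candidates below each linker's threshold, keep the maximum score per concept, and record which linkers proposed it. The function `merge_candidates`, the CUIs, and the scores are hypothetical; only the threshold values (0.9 and 0.8) are taken from the cell above.

```python
from collections import defaultdict


def merge_candidates(linker_outputs, thresholds, k):
    """Merge per-linker candidate lists into one ranked list (assumed strategy).

    linker_outputs: {linker_name: [(cui, score), ...]}
    thresholds:     {linker_name: minimum score required to keep a candidate}
    k:              maximum number of merged candidates to return
    """
    best_score = {}
    predicted_by = defaultdict(list)
    for name, cands in linker_outputs.items():
        for cui, score in cands:
            if score < thresholds[name]:
                continue  # below this linker's threshold
            best_score[cui] = max(score, best_score.get(cui, 0.0))
            predicted_by[cui].append(name)
    # Rank concepts by their best score, highest first
    ranked = sorted(best_score, key=lambda c: best_score[c], reverse=True)
    return [
        {"db_id": cui, "score": best_score[cui], "predicted_by": predicted_by[cui]}
        for cui in ranked[:k]
    ]


# Made-up candidates for illustration only
outputs = {
    "ngram": [("C0000001", 0.95), ("C0000002", 0.85)],
    "sap": [("C0000001", 0.90), ("C0000003", 0.82)],
}
merged = merge_candidates(outputs, thresholds={"ngram": 0.9, "sap": 0.8}, k=5)
for cand in merged:
    print(cand)
```

In this toy run, the n-gram candidate scoring 0.85 is discarded because it falls below the 0.9 threshold, while the concept proposed by both linkers is listed with both names under `predicted_by`, mirroring the structure of the `normalized` entries printed later in the notebook.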

### Semantic Type Filtering

We filter the generated output to make sure the semantic type of the predicted concepts actually matches the semantic class of the named entity.

The GGPONC entity classes are based on SNOMED CT top-level concepts, whereas we link against UMLS CUIs. We therefore provide a mapping from GGPONC entity types to UMLS TUIs in `ggponc_tuis.csv`.

Semantic type filtering is particularly useful for ambiguous abbreviations (e.g., "EGFR" in the example above, which can denote either a receptor protein or a laboratory measurement).
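To make the idea concrete, here is a minimal, library-free sketch of semantic type filtering. It is not xMEN's `SemanticTypeFilter` implementation. The names `CLASS_TO_TUIS`, `CUI_TO_TUIS`, and `filter_by_semantic_type` are hypothetical, and the class-to-TUI mapping is illustrative rather than copied from the actual mapping file. The TUIs of the two CUIs are the ones printed for the EGFR example in this notebook.

```python
# Illustrative mapping from NER classes to allowed UMLS semantic types (TUIs).
# Assumption: the real mapping is loaded from the CSV file and may differ.
CLASS_TO_TUIS = {
    "Nutrient_or_Body_Substance": {"T116", "T192"},
    "Diagnosis_or_Pathology": {"T191"},
}

# Toy knowledge base: CUI -> semantic types
CUI_TO_TUIS = {
    "C3811844": {"T059"},          # estimated glomerular filtration rate (eGFR)
    "C1739039": {"T116", "T192"},  # EGFR protein
}


def filter_by_semantic_type(entity_class, candidates):
    """Keep candidates whose TUIs overlap with those allowed for the class."""
    allowed = CLASS_TO_TUIS.get(entity_class)
    if allowed is None:
        # Unknown class: leave the candidate list untouched
        return list(candidates)
    return [c for c in candidates if CUI_TO_TUIS.get(c["db_id"], set()) & allowed]


candidates = [
    {"db_id": "C3811844", "score": 1.0},
    {"db_id": "C1739039", "score": 1.0},
]
kept = filter_by_semantic_type("Nutrient_or_Body_Substance", candidates)
print([c["db_id"] for c in kept])  # → ['C1739039']
```

Because both candidates have the same score, only the type constraint separates the laboratory measurement from the receptor protein.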

In [15]:
from xmen.kb import load_kb
from xmen.data import SemanticTypeFilter
import pandas as pd
from pathlib import Path

In [16]:
kb = load_kb(Path(conf.cache_dir) / 'ggponc' / 'ggponc.jsonl')

In [17]:
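# Map each GGPONC entity class to the list of UMLS TUIs allowed for it
type2tui = pd.read_csv('ggponc_tuis.csv').groupby('class')['tui'].apply(list).to_dict()
# Candidates whose semantic types do not match the entity class are removed
type_filter = SemanticTypeFilter(type2tui, kb)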

In [18]:
filtered_prediction = type_filter.transform_batch(prediction)


In [19]:
# Before Filtering
entity = prediction[0]['entities'][3]
print(entity['text'])
print(entity['normalized'][0])
print(kb.cui_to_entity['C3811844'])

['EGFR']
{'db_id': 'C3811844', 'db_name': 'UMLS', 'score': 1.0, 'predicted_by': ['ngram', 'sap']}
CUI: C3811844, Name: Geschaetzte glomerulaere Filtrationsrate
Definition: A laboratory test that estimates kidney function. It is calculated using an individual's serum creatinine measurement, age, gender, and race. Actual results are reported when the estimated glomerular filtration rate is less than 60 ml/min.
TUI(s): T059
Aliases: (total: 3): 
	 eGFR, Estimated Glomerular Filtration Rate, Estimated glomerular filtration rate


In [20]:
# After Filtering
entity = filtered_prediction[0]['entities'][3]
print(entity['text'])
print(entity['normalized'][0])
print(kb.cui_to_entity['C1739039'])

['EGFR']
{'db_id': 'C1739039', 'db_name': 'UMLS', 'score': 1.0, 'predicted_by': ['ngram', 'sap']}
CUI: C1739039, Name: EGFR
Definition: The protein found on the surface of some cells and to which epidermal growth factor binds, causing the cells to divide. It is found at abnormally high levels on the surface of many types of cancer cells, so these cells may divide excessively in the presence of epidermal growth factor.
TUI(s): T116, T192
Aliases (abbreviated, total: 20): 
	 EGF Receptor, ERBB Protein, HER1 protein, human, epidermal growth factor receptor related protein, human, EGFR protein, human, HER-1, epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) protein, human, Proto-Oncogene c-erbB-1, ERRP protein, human, Erb-B2 Receptor Tyrosine Kinase 1


In [21]:
from util import get_dataframe
get_dataframe(filtered_prediction, kb)

| | mention | class | cui | canonical name | linked by | score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Cetuximab | Clinical_Drug | C0995188 | Cetuximab | [ngram, sap] | 1.0 |
| 1 | monoklonaler Antikörper | Clinical_Drug | C0003250 | Antikörper, monoklonale | [ngram, sap] | 0.982318 |
| 2 | epidermalen Wachstumsfaktorrezeptor | Nutrient_or_Body_Substance | C3812393 | ErbB-Rezeptoren | [ngram, sap] | 0.957236 |
| 3 | EGFR | Nutrient_or_Body_Substance | C1739039 | EGFR | [ngram, sap] | 1.0 |
| 4 | Therapie des fortgeschrittenen kolorektalen Ka... | Therapeutic | Not linkable | | | |
| 5 | fortgeschrittenen kolorektalen Karzinoms | Diagnosis_or_Pathology | C4721579 | Kolorektalkarzinom mit Metastasen | [sap] | 0.843411 |
| 6 | Irinotecan | Clinical_Drug | C0123931 | Irinotecan | [ngram, sap] | 1.0 |
| 7 | FOLFOX | Therapeutic | C0309154 | FUROX | [sap] | 0.842153 |
| 8 | Versagen einer Behandlung | Other_Finding | C0162643 | Behandlungsfehler | [sap] | 0.936106 |
| 9 | Behandlung mit Oxaliplatin und Irinotecan | Therapeutic | C0796324 | IROX Regimen | [sap] | 0.879975 |


## Re-Ranking

In [22]:
from xmen.linkers import default_ensemble
linker_no_thresh = default_ensemble(Path(conf.linker.candidate_generation.ngram.index_base_path).parent, cuda=False)


In [23]:
candidates = type_filter.transform_batch(linker_no_thresh.predict_batch(dataset))


In [24]:
from xmen.reranking import CrossEncoderReranker

In [25]:
ce_candidates = CrossEncoderReranker.prepare_data(candidates, None, kb, k=64)

Context length: 128
Use NIL values: True



In [26]:
rr = CrossEncoderReranker.load("phlobo/xmen-de-ce-medmentions", device=0)

In [27]:
reranked = rr.rerank_batch(candidates, ce_candidates, k=64)


In [28]:
# Before Re-ranking
get_dataframe(candidates, kb)

| | mention | class | cui | canonical name | linked by | score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Cetuximab | Clinical_Drug | C0995188 | Cetuximab | [ngram, sapbert] | 1.0 |
| 1 | monoklonaler Antikörper | Clinical_Drug | C0003250 | Antikörper, monoklonale | [ngram, sapbert] | 0.982318 |
| 2 | epidermalen Wachstumsfaktorrezeptor | Nutrient_or_Body_Substance | C3812393 | ErbB-Rezeptoren | [ngram, sapbert] | 0.957236 |
| 3 | EGFR | Nutrient_or_Body_Substance | C1739039 | EGFR | [ngram, sapbert] | 1.0 |
| 4 | Therapie des fortgeschrittenen kolorektalen Ka... | Therapeutic | C0281190 | Prevention of Colorectal Cancer | [sapbert] | 0.736036 |
| 5 | fortgeschrittenen kolorektalen Karzinoms | Diagnosis_or_Pathology | C4721579 | Kolorektalkarzinom mit Metastasen | [ngram, sapbert] | 0.843411 |
| 6 | Irinotecan | Clinical_Drug | C0123931 | Irinotecan | [ngram, sapbert] | 1.0 |
| 7 | FOLFOX | Therapeutic | C0309154 | FUROX | [sapbert] | 0.842153 |
| 8 | Versagen einer Behandlung | Other_Finding | C0162643 | Behandlungsfehler | [ngram, sapbert] | 0.936106 |
| 9 | Behandlung mit Oxaliplatin und Irinotecan | Therapeutic | C0796324 | IROX Regimen | [sapbert] | 0.879975 |


In [29]:
# After Re-ranking
get_dataframe(reranked, kb)

| | mention | class | cui | canonical name | linked by | score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Cetuximab | Clinical_Drug | C0995188 | Cetuximab | [ngram, sapbert] | 0.040555 |
| 1 | monoklonaler Antikörper | Clinical_Drug | C0003250 | Antikörper, monoklonale | [ngram, sapbert] | 0.043325 |
| 2 | epidermalen Wachstumsfaktorrezeptor | Nutrient_or_Body_Substance | C3812393 | ErbB-Rezeptoren | [ngram, sapbert] | 0.019964 |
| 3 | EGFR | Nutrient_or_Body_Substance | C1368111 | EGFR-ECD | [ngram, sapbert] | 0.020019 |
| 4 | Therapie des fortgeschrittenen kolorektalen Ka... | Therapeutic | C4763871 | Colorectal Cancer Surgery | [sapbert] | 0.017078 |
| 5 | fortgeschrittenen kolorektalen Karzinoms | Diagnosis_or_Pathology | C0009402 | Kolorektales Karzinom | [ngram, sapbert] | 0.019689 |
| 6 | Irinotecan | Clinical_Drug | C0123931 | Irinotecan | [ngram, sapbert] | 0.051333 |
| 7 | FOLFOX | Therapeutic | C0392943 | Fluorouracil/Leucovorin Calcium/Oxaliplatin | [ngram, sapbert] | 0.0253 |
| 8 | Versagen einer Behandlung | Other_Finding | C0162643 | Behandlungsfehler | [ngram, sapbert] | 0.041808 |
| 9 | Behandlung mit Oxaliplatin und Irinotecan | Therapeutic | C0796324 | IROX Regimen | [sapbert] | 0.01942 |
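To see at a glance what the re-ranker changed, the top-ranked CUIs from the two tables above can be compared directly. The snippet below is a small stand-alone sketch: the dictionaries `before` and `after` are typed in by hand from those tables (they are not produced by an xMEN call), and the truncated mention text is kept exactly as displayed.

```python
# Top-ranked CUI per mention, copied from the "before" and "after" tables
before = {
    "Cetuximab": "C0995188",
    "monoklonaler Antikörper": "C0003250",
    "epidermalen Wachstumsfaktorrezeptor": "C3812393",
    "EGFR": "C1739039",
    "Therapie des fortgeschrittenen kolorektalen Ka...": "C0281190",
    "fortgeschrittenen kolorektalen Karzinoms": "C4721579",
    "Irinotecan": "C0123931",
    "FOLFOX": "C0309154",
    "Versagen einer Behandlung": "C0162643",
    "Behandlung mit Oxaliplatin und Irinotecan": "C0796324",
}
after = dict(before)
after.update({
    "EGFR": "C1368111",
    "Therapie des fortgeschrittenen kolorektalen Ka...": "C4763871",
    "fortgeschrittenen kolorektalen Karzinoms": "C0009402",
    "FOLFOX": "C0392943",
})

# Mentions whose top-ranked concept differs after re-ranking
changed = [m for m in before if before[m] != after[m]]
for mention in changed:
    print(f"{mention}: {before[mention]} -> {after[mention]}")
print(f"{len(changed)} of {len(before)} mentions changed")
```

Four of the ten mentions receive a different top concept. For example, FOLFOX moves from FUROX (C0309154) to Fluorouracil/Leucovorin Calcium/Oxaliplatin (C0392943), while EGFR moves from the EGFR concept (C1739039) to EGFR-ECD (C1368111). The scores in the two tables come from different models and are therefore not directly comparable.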
