# Combining NER models with xMEN for German Clinical Entity Linking

## Preparation

### Get access to GGPONC Models

Request access via the GGPONC project page: https://www.leitlinienprogramm-onkologie.de/projekte/ggponc-english/

Download the spaCy model from the v2.0 release (`models` folder) into a location of your choice (we assume `../temp/ggponc/spacy`).

### Prepare dicts and index

`xmen dict conf/ggponc.yaml`

`xmen index conf/ggponc.yaml --all --overwrite`

In [None]:
!pip install spacy-transformers

In [None]:
!git clone https://github.com/hpi-dhc/ggponc_annotation ../temp/ggponc/ggponc_annotation

# Run GGPONC NER Model on sample data

In [1]:
from pathlib import Path
GGPONC_PROJECT_PATH = Path("../temp/ggponc/ggponc_annotation")
GGPONC_MODEL_PATH = Path('../temp/ggponc/spacy')

In [2]:
import sys
sys.path.append(str(GGPONC_PROJECT_PATH / 'spacy'))

In [3]:
import spacy
import snomed_spans # Import custom span suggester and scorer for spaCy spancat 

nlp = spacy.load(GGPONC_MODEL_PATH)



In [4]:
sentences = [
    "Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und" \
       "dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. " \
       "allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.",
    "Die HPV-Diagnostik hat beim Plattenepithelkarzinom der Mundhöhle keinen validen Nutzen als prognostischer Faktor.",
    "Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie."
]
sentences

['Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist unddient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.',
 'Die HPV-Diagnostik hat beim Plattenepithelkarzinom der Mundhöhle keinen validen Nutzen als prognostischer Faktor.',
 'Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie.']

In [5]:
docs = list(nlp.pipe(sentences))

In [6]:
for d in docs:
    for span in sorted(d.spans['snomed'], key=lambda s: s.start):
        print(span, '---', span.label_)

Cetuximab --- Clinical_Drug
monoklonaler Antikörper --- Clinical_Drug
epidermalen Wachstumsfaktorrezeptor --- Nutrient_or_Body_Substance
EGFR --- Nutrient_or_Body_Substance
Therapie des fortgeschrittenen kolorektalen Karzinoms --- Therapeutic
fortgeschrittenen kolorektalen Karzinoms --- Diagnosis_or_Pathology
Irinotecan --- Clinical_Drug
FOLFOX --- Therapeutic
Versagen --- Diagnosis_or_Pathology
Behandlung mit Oxaliplatin und Irinotecan --- Therapeutic
Oxaliplatin --- Clinical_Drug
Irinotecan --- Clinical_Drug
HPV-Diagnostik --- Diagnostic
Plattenepithelkarzinom der Mundhöhle --- Diagnosis_or_Pathology
zytologischem Verdacht auf CIN 1/2 --- Other_Finding
sofortige Kolposkopie --- Diagnostic


# Run Entity Linker

In [7]:
from xmen.data import from_spacy
from xmen.linkers import SapBERTLinker, TFIDFNGramLinker, EnsembleLinker
from xmen.confhelper import load_config

In [8]:
dataset = from_spacy(docs, span_key='snomed')

In [9]:
dataset[0]

{'id': 0,
 'document_id': 0,
 'passages': [{'id': 0,
   'offsets': [[0, 310]],
   'text': ['Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist unddient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.'],
   'type': 'sentence'}],
 'entities': [{'id': 0,
   'normalized': [],
   'offsets': [[0, 9]],
   'text': ['Cetuximab'],
   'type': 'Clinical_Drug'},
  {'id': 1,
   'normalized': [],
   'offsets': [[18, 41]],
   'text': ['monoklonaler Antikörper'],
   'type': 'Clinical_Drug'},
  {'id': 2,
   'normalized': [],
   'offsets': [[57, 92]],
   'text': ['epidermalen Wachstumsfaktorrezeptor'],
   'type': 'Nutrient_or_Body_Substance'},
  {'id': 3,
   'normalized': [],
   'offsets': [[94, 98]],
   'text': ['EGFR'],
   'type': 'Nutrient_or_Body_Substance'},
  {'id': 4,
   'normalized': [],
   'offse
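
Each entity in the record above carries character offsets into the document text. As a minimal sketch (plain Python on a shortened, hand-copied version of the record, not an xMEN utility), the following checks that slicing the passage text with an entity's offsets reproduces its surface form. The helper name `check_offsets` is hypothetical, and the sketch assumes a single passage that starts at offset 0.

```python
# Shortened copy of the first sentence and three of its entities from dataset[0].
text = ("Cetuximab ist ein monoklonaler Antikörper, der gegen den "
        "epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist")

record = {
    "passages": [{"text": [text], "offsets": [[0, len(text)]]}],
    "entities": [
        {"id": 0, "offsets": [[0, 9]], "text": ["Cetuximab"],
         "type": "Clinical_Drug"},
        {"id": 1, "offsets": [[18, 41]], "text": ["monoklonaler Antikörper"],
         "type": "Clinical_Drug"},
        {"id": 3, "offsets": [[94, 98]], "text": ["EGFR"],
         "type": "Nutrient_or_Body_Substance"},
    ],
}


def check_offsets(record):
    """Return the ids of entities whose offsets do not reproduce their text.

    Assumption: the record has one passage starting at character 0, so
    entity offsets can be applied directly to that passage's text.
    """
    passage = record["passages"][0]["text"][0]
    mismatches = []
    for ent in record["entities"]:
        for (start, end), surface in zip(ent["offsets"], ent["text"]):
            if passage[start:end] != surface:
                mismatches.append(ent["id"])
    return mismatches


mismatched_ids = check_offsets(record)
print(mismatched_ids)
```

An empty list means every offset pair points at the text stored alongside it, which is a cheap sanity check before passing the dataset to a linker.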

In [10]:
conf = load_config('../conf/ggponc.yaml')

In [11]:
ngram_linker = TFIDFNGramLinker(**conf.linker.candidate_generation.ngram)

In [12]:
SapBERTLinker.clear()
sap_linker = SapBERTLinker(cuda=False, **conf.linker.candidate_generation.sapbert)

In [13]:
linker = EnsembleLinker()
linker.add_linker('ngram', ngram_linker, k=10, threshold=0.9)
linker.add_linker('sap', sap_linker, k=10, threshold=0.8)

prediction = linker.predict_batch(dataset, batch_size=1)

Map:   0%|          | 0/3 [00:00<?, ? examples/s]
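
To make the roles of `k` and `threshold` more concrete, here is a minimal plain-Python sketch of one way to combine candidate lists from two linkers. It is not xMEN's implementation: the notebook does not show how `EnsembleLinker` merges scores, so taking the maximum score per concept is an assumption, as are the function name `merge_candidates`, the placeholder concept ids, and all scores.

```python
def merge_candidates(per_linker, settings):
    """Merge candidate lists from several linkers into one ranked list.

    per_linker: {linker_name: [(concept_id, score), ...]}
    settings:   {linker_name: {"k": int, "threshold": float}}

    Assumption: a concept proposed by several linkers keeps its highest
    score; the real EnsembleLinker may combine scores differently.
    """
    merged = {}
    for name, candidates in per_linker.items():
        k = settings[name]["k"]
        threshold = settings[name]["threshold"]
        # Drop low-scoring candidates, then keep at most k per linker.
        kept = [c for c in candidates if c[1] >= threshold]
        kept = sorted(kept, key=lambda c: c[1], reverse=True)[:k]
        for concept_id, score in kept:
            entry = merged.setdefault(
                concept_id,
                {"db_id": concept_id, "score": 0.0, "predicted_by": []},
            )
            entry["score"] = max(entry["score"], score)
            entry["predicted_by"].append(name)
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)


# Hypothetical candidates with placeholder ids and made-up scores.
per_linker = {
    "ngram": [("A", 1.0), ("B", 0.92), ("C", 0.5)],
    "sap": [("A", 0.85), ("D", 0.7)],
}
settings = {
    "ngram": {"k": 10, "threshold": 0.9},
    "sap": {"k": 10, "threshold": 0.8},
}

ranked = merge_candidates(per_linker, settings)
for entry in ranked:
    print(entry)
```

In this toy run, candidates `C` and `D` fall below their linker's threshold, while `A` is proposed by both linkers and therefore lists both names under `predicted_by`, similar in shape to the `normalized` entries produced later in the notebook.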

## Semantic Type Filtering

We filter the generated candidates so that the semantic type of each predicted concept matches the semantic class of the named entity.

The GGPONC entity classes are based on SNOMED CT top-level concepts, whereas we link against UMLS CUIs. We therefore provide a mapping of GGPONC entity types to UMLS TUIs in `ggponc2tui.csv`.

Semantic type filtering is particularly useful for ambiguous abbreviations (e.g., "EGFR" in the example sentence, which can denote either a protein or a laboratory test).
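
The idea can be sketched in a few lines of plain Python. This is an illustration of the filtering principle, not the code behind `SemanticTypeFilter`: the function name `filter_by_semantic_type` is hypothetical, and the allowed TUIs for `Nutrient_or_Body_Substance` are an assumption chosen for the example rather than the contents of the real mapping file. The TUIs of the two concepts are those reported by the knowledge base in this notebook.

```python
def filter_by_semantic_type(entity_type, candidates, type2tui, cui2tuis):
    """Keep candidates that share at least one TUI with the entity type."""
    allowed = set(type2tui.get(entity_type, []))
    return [
        cand for cand in candidates
        if allowed & set(cui2tuis.get(cand["db_id"], []))
    ]


# TUIs per concept, as printed by the knowledge base in this notebook.
cui2tuis = {
    "C3811844": ["T059"],          # estimated glomerular filtration rate
    "C1739039": ["T116", "T192"],  # EGFR protein
}

# Assumption: illustrative subset, not the actual mapping file.
type2tui = {"Nutrient_or_Body_Substance": ["T116", "T192"]}

candidates = [
    {"db_id": "C3811844", "score": 1.0},
    {"db_id": "C1739039", "score": 1.0},
]

kept = filter_by_semantic_type(
    "Nutrient_or_Body_Substance", candidates, type2tui, cui2tuis
)
print([cand["db_id"] for cand in kept])
```

Both candidates have the same score, so only the type constraint separates the laboratory test from the protein.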

In [14]:
from xmen.kb import load_kb
from xmen.data import SemanticTypeFilter
import pandas as pd

In [15]:
kb = load_kb(Path(conf.cache_dir) / 'ggponc' / 'ggponc.jsonl')

In [16]:
tui_df = pd.read_csv('ggponc2tui.csv')
type2tui = {}
for c in ['Diagnosis_or_Pathology', 'Other_Finding', 'Clinical_Drug', 'Nutrient_or_Body_Substance',
       'External_Substance', 'Therapeutic', 'Diagnostic']:
    type2tui[c] = list(tui_df.TUI[tui_df[c] == 'x'].values)
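
The loop above implies a particular table layout: one row per TUI, a `TUI` column, and one column per GGPONC entity type that contains `x` where the type applies. The following sketch rebuilds the same kind of dictionary from a tiny inline table using only the standard library. The rows are made up for illustration and do not reproduce the real mapping file.

```python
import csv
import io

# Hypothetical excerpt in the assumed layout (not the real file contents).
toy_table = """TUI,Clinical_Drug,Diagnostic
T059,,x
T121,x,
T200,x,
"""

rows = list(csv.DictReader(io.StringIO(toy_table)))

# For each entity type, collect the TUIs whose cell is marked with 'x'.
type2tui_toy = {
    col: [row["TUI"] for row in rows if row[col] == "x"]
    for col in ["Clinical_Drug", "Diagnostic"]
}
print(type2tui_toy)
```

Each entity type can map to several TUIs, which is why the values are lists rather than single codes.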

In [17]:
type_filter = SemanticTypeFilter(type2tui, kb)

In [18]:
filtered_prediction = type_filter.transform_batch(prediction)

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

In [19]:
# Before Filtering
entity = prediction[0]['entities'][3]
print(entity['text'])
print(entity['normalized'][0])
print(kb.cui_to_entity['C3811844'])

['EGFR']
{'db_id': 'C3811844', 'db_name': 'UMLS', 'predicted_by': ['ngram', 'sap'], 'score': 1.0}
CUI: C3811844, Name: Geschaetzte glomerulaere Filtrationsrate
Definition: A laboratory test that estimates kidney function. It is calculated using an individual's serum creatinine measurement, age, gender, and race. Actual results are reported when the estimated glomerular filtration rate is less than 60 ml/min.
TUI(s): T059
Aliases: (total: 3): 
	 eGFR, Estimated Glomerular Filtration Rate, Estimated glomerular filtration rate


In [20]:
# After Filtering
entity = filtered_prediction[0]['entities'][3]
print(entity['text'])
print(entity['normalized'][0])
print(kb.cui_to_entity['C1739039'])

['EGFR']
{'db_id': 'C1739039', 'db_name': 'UMLS', 'predicted_by': ['ngram', 'sap'], 'score': 1.0}
CUI: C1739039, Name: EGFR
Definition: The protein found on the surface of some cells and to which epidermal growth factor binds, causing the cells to divide. It is found at abnormally high levels on the surface of many types of cancer cells, so these cells may divide excessively in the presence of epidermal growth factor.
TUI(s): T116, T192
Aliases (abbreviated, total: 20): 
	 EGF Receptor, ERBB Protein, HER1 protein, human, epidermal growth factor receptor related protein, human, EGFR protein, human, HER-1, epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) protein, human, Proto-Oncogene c-erbB-1, ERRP protein, human, Erb-B2 Receptor Tyrosine Kinase 1


# Final Output

In [21]:
for d in filtered_prediction:
    for e in d['entities']:
        span = ' '.join(e['text'])
        label = e['type']
        # Candidates are ranked, so the first entry is the top-scoring concept
        top_concept = e['normalized'][0] if e['normalized'] else None
        if top_concept:
            cui = top_concept['db_id']
            print(span, '---', label, '--->', f"{cui} ({kb.cui_to_entity[cui].canonical_name}), Score: {top_concept['score']:.2f}")
        else:
            print(span, '---', label, '-/-', '### Not linkable ###')

Cetuximab --- Clinical_Drug ---> C0995188 (Cetuximab), Score: 1.00
monoklonaler Antikörper --- Clinical_Drug ---> C0003250 (Antikörper, monoklonale), Score: 0.98
epidermalen Wachstumsfaktorrezeptor --- Nutrient_or_Body_Substance ---> C3812393 (ErbB-Rezeptoren), Score: 0.96
EGFR --- Nutrient_or_Body_Substance ---> C1739039 (EGFR), Score: 1.00
Therapie des fortgeschrittenen kolorektalen Karzinoms --- Therapeutic -/- ### Not linkable ###
fortgeschrittenen kolorektalen Karzinoms --- Diagnosis_or_Pathology ---> C4721579 (Kolorektalkarzinom mit Metastasen), Score: 0.84
Irinotecan --- Clinical_Drug ---> C0123931 (Irinotecan), Score: 1.00
FOLFOX --- Therapeutic ---> C0309154 (FUROX), Score: 0.84
Versagen --- Diagnosis_or_Pathology -/- ### Not linkable ###
Behandlung mit Oxaliplatin und Irinotecan --- Therapeutic ---> C0796324 (IROX Regimen), Score: 0.88
Oxaliplatin --- Clinical_Drug ---> C0069717 (Oxaliplatin), Score: 1.00
Irinotecan --- Clinical_Drug ---> C0123931 (Irinotecan), Score: 1.00
HP