# Combining NER models with xMEN for German Clinical Entity Linking

## Preparation

### Get access to GGPONC Models

Request access at https://www.leitlinienprogramm-onkologie.de/projekte/ggponc-english/

Download the spaCy model from the v2.0 release (`models` folder) into a location of your choice (we assume `../local_files/ggponc/spacy`).

### Prepare dicts and index

`xmen dict conf/ggponc.yaml`

`xmen index conf/ggponc.yaml --all --overwrite`

In [1]:
!git clone https://github.com/hpi-dhc/ggponc_annotation ../local_files/ggponc/ggponc_annotation

fatal: destination path '../local_files/ggponc/ggponc_annotation' already exists and is not an empty directory.


# Run GGPONC NER Model on sample data

In [2]:
from pathlib import Path
GGPONC_PROJECT_PATH = Path("../local_files/ggponc/ggponc_annotation")
GGPONC_MODEL_PATH = Path('../local_files/ggponc/spacy')

In [3]:
import sys
sys.path.append(str(GGPONC_PROJECT_PATH / 'spacy'))

In [4]:
import spacy
import snomed_spans # Import custom span suggester and scorer for spaCy spancat 

nlp = spacy.load(GGPONC_MODEL_PATH)


If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current 'transformers' and 'spacy-transformers' versions. For more details and available updates, run: python -m spacy validate


In [5]:
sentences = [
    # Adjacent string literals are concatenated, so each part must end with a space
    "Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und "
    "dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. "
    "allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.",
    "Die HPV-Diagnostik hat beim Plattenepithelkarzinom der Mundhöhle keinen validen Nutzen als prognostischer Faktor.",
    "Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie."
]
sentences

['Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.',
 'Die HPV-Diagnostik hat beim Plattenepithelkarzinom der Mundhöhle keinen validen Nutzen als prognostischer Faktor.',
 'Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie.']

In [6]:
docs = list(nlp.pipe(sentences))

In [7]:
import pandas as pd

In [8]:
ents = []
for d in docs:
    for span in sorted(d.spans['snomed'], key=lambda s: s.start):
        ents.append({'mention' : span.text, 'class' : span.label_})
pd.DataFrame(ents)

Unnamed: 0,mention,class
0,Cetuximab,Clinical_Drug
1,monoklonaler Antikörper,Clinical_Drug
2,epidermalen Wachstumsfaktorrezeptor,Nutrient_or_Body_Substance
3,EGFR,Nutrient_or_Body_Substance
4,Therapie des fortgeschrittenen kolorektalen Ka...,Therapeutic
5,fortgeschrittenen kolorektalen Karzinoms,Diagnosis_or_Pathology
6,Irinotecan,Clinical_Drug
7,FOLFOX,Therapeutic
8,Versagen,Diagnosis_or_Pathology
9,Behandlung mit Oxaliplatin und Irinotecan,Therapeutic


# Run Entity Linker

In [9]:
from xmen.data import from_spacy
from xmen.linkers import SapBERTLinker, TFIDFNGramLinker, EnsembleLinker
from xmen import load_config

In [10]:
dataset = from_spacy(docs, span_key='snomed')

In [11]:
dataset

Dataset({
    features: ['id', 'document_id', 'passages', 'entities', 'coreferences', 'relations', 'events', 'corpus_id', 'lang'],
    num_rows: 3
})

In [12]:
conf = load_config('../examples/conf/ggponc.yaml')

In [13]:
ngram_linker = TFIDFNGramLinker(**conf.linker.candidate_generation.ngram)



In [14]:
SapBERTLinker.clear()
sap_linker = SapBERTLinker(cuda=False, **conf.linker.candidate_generation.sapbert)

In [15]:
linker = EnsembleLinker()
linker.add_linker('ngram', ngram_linker, k=conf.linker.candidate_generation.ngram.k, threshold=0.9)
linker.add_linker('sap', sap_linker, k=conf.linker.candidate_generation.sapbert.k, threshold=0.8)

prediction = linker.predict_batch(dataset)
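
The cell above gives each linker its own `k` and `threshold`. As a rough illustration of what such a threshold-then-merge step could look like, here is a minimal sketch in plain Python. It is not xMEN's `EnsembleLinker` implementation: the function `merge_candidates`, the placeholder concept IDs (`CUI_A`, `CUI_B`, `CUI_C`), and the choice to keep the maximum score per concept are all assumptions made for illustration.

```python
def merge_candidates(per_linker, thresholds, k):
    """Drop candidates below each linker's threshold, then merge by concept ID.

    Hypothetical helper for illustration only; not part of xMEN.
    """
    merged = {}
    for name, candidates in per_linker.items():
        for cui, score in candidates:
            if score < thresholds[name]:
                continue  # below this linker's threshold
            entry = merged.setdefault(
                cui, {"db_id": cui, "score": score, "predicted_by": []}
            )
            # Assumption: a concept found by several linkers keeps its best score
            entry["score"] = max(entry["score"], score)
            entry["predicted_by"].append(name)
    ranked = sorted(merged.values(), key=lambda c: c["score"], reverse=True)
    return ranked[:k]


# Placeholder candidates: (concept ID, score) pairs per linker
per_linker = {
    "ngram": [("CUI_A", 1.0), ("CUI_B", 0.85)],
    "sap": [("CUI_A", 1.0), ("CUI_C", 0.84)],
}
merged = merge_candidates(per_linker, {"ngram": 0.9, "sap": 0.8}, k=2)
for candidate in merged:
    print(candidate)
```

In this toy input, `CUI_B` falls below the n-gram threshold of 0.9 and is dropped, while `CUI_C` passes the SapBERT threshold of 0.8 and is kept.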


## Semantic Type Filtering

We filter the generated output to make sure the semantic type of the predicted concepts actually matches the semantic class of the named entity.

The GGPONC entity classes are based on SNOMED CT top-level concepts, whereas we link against UMLS CUIs. We therefore provide a mapping from GGPONC entity types to UMLS TUIs in `ggponc2tui.csv`.

Semantic type filtering is particularly useful for ambiguous abbreviations (e.g., "EGFR" in the first example sentence).
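
To make the idea concrete, the following is a minimal sketch of type-based filtering using plain dictionaries. It does not reproduce xMEN's `SemanticTypeFilter`; the helper `filter_by_semantic_type` and the `cui2tuis` lookup are hypothetical. The TUIs per CUI are taken from the knowledge base output shown in this notebook, while the assumption that `Nutrient_or_Body_Substance` maps to T116 and T192 is illustrative and should be checked against the mapping file.

```python
# Assumed mapping from entity class to allowed TUIs (illustrative subset)
type2tui = {"Nutrient_or_Body_Substance": ["T116", "T192"]}

# Hypothetical lookup from CUI to its semantic types
cui2tuis = {
    "C3811844": ["T059"],          # estimated glomerular filtration rate
    "C1739039": ["T116", "T192"],  # EGFR protein
}


def filter_by_semantic_type(entity_type, candidates, type2tui, cui2tuis):
    """Keep only candidates that share at least one TUI with the entity class."""
    allowed = set(type2tui.get(entity_type, []))
    return [c for c in candidates if allowed & set(cui2tuis.get(c["db_id"], []))]


candidates = [
    {"db_id": "C3811844", "score": 1.0},
    {"db_id": "C1739039", "score": 1.0},
]
kept = filter_by_semantic_type(
    "Nutrient_or_Body_Substance", candidates, type2tui, cui2tuis
)
print([c["db_id"] for c in kept])
```

Under these assumptions, the laboratory-test concept is removed because its only TUI (T059) is not allowed for the entity class, so the receptor concept moves to the top of the list.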

In [16]:
from xmen.kb import load_kb
from xmen.data import SemanticTypeFilter
import pandas as pd

In [17]:
kb = load_kb(Path(conf.cache_dir) / 'ggponc' / 'ggponc.jsonl')

In [18]:
tui_df = pd.read_csv('ggponc2tui.csv')
type2tui = {}
for c in ['Diagnosis_or_Pathology', 'Other_Finding', 'Clinical_Drug', 'Nutrient_or_Body_Substance',
       'External_Substance', 'Therapeutic', 'Diagnostic']:
    type2tui[c] = list(tui_df.TUI[tui_df[c] == 'x'].values)

In [19]:
type_filter = SemanticTypeFilter(type2tui, kb)

In [20]:
filtered_prediction = type_filter.transform_batch(prediction)


In [21]:
# Before Filtering
entity = prediction[0]['entities'][3]
print(entity['text'])
print(entity['normalized'][0])
print(kb.cui_to_entity['C3811844'])

['EGFR']
{'db_id': 'C3811844', 'db_name': 'UMLS', 'score': 1.0, 'predicted_by': ['ngram', 'sap']}
CUI: C3811844, Name: Geschaetzte glomerulaere Filtrationsrate
Definition: A laboratory test that estimates kidney function. It is calculated using an individual's serum creatinine measurement, age, gender, and race. Actual results are reported when the estimated glomerular filtration rate is less than 60 ml/min.
TUI(s): T059
Aliases: (total: 3): 
	 eGFR, Estimated Glomerular Filtration Rate, Estimated glomerular filtration rate


In [22]:
# After Filtering
entity = filtered_prediction[0]['entities'][3]
print(entity['text'])
print(entity['normalized'][0])
print(kb.cui_to_entity['C1739039'])

['EGFR']
{'db_id': 'C1739039', 'db_name': 'UMLS', 'score': 1.0, 'predicted_by': ['ngram', 'sap']}
CUI: C1739039, Name: EGFR
Definition: The protein found on the surface of some cells and to which epidermal growth factor binds, causing the cells to divide. It is found at abnormally high levels on the surface of many types of cancer cells, so these cells may divide excessively in the presence of epidermal growth factor.
TUI(s): T116, T192
Aliases (abbreviated, total: 20): 
	 EGF Receptor, ERBB Protein, HER1 protein, human, epidermal growth factor receptor related protein, human, EGFR protein, human, HER-1, epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) protein, human, Proto-Oncogene c-erbB-1, ERRP protein, human, Erb-B2 Receptor Tyrosine Kinase 1


## Output

In [23]:
def get_dataframe(predictions):
    ents = []
    for d in predictions:
        for e in d['entities']:
            span = ' '.join(e['text'])
            label = e['type']
            top_concept = e['normalized'][0] if len(e['normalized']) > 0 else None        
            if top_concept:
                cui = top_concept['db_id']
                ents.append({'mention' : span, 'class' :  label, 'cui' : cui, 'canonical name' : kb.cui_to_entity[cui].canonical_name, 'linked by' : top_concept['predicted_by'], 'score' : top_concept['score']})
            else:
                ents.append({'mention' : span, 'class' :  label, 'cui' : 'Not linkable'})
    return pd.DataFrame(ents)

In [24]:
get_dataframe(filtered_prediction)

Unnamed: 0,mention,class,cui,canonical name,linked by,score
0,Cetuximab,Clinical_Drug,C0995188,Cetuximab,"[ngram, sap]",1.0
1,monoklonaler Antikörper,Clinical_Drug,C0003250,"Antikörper, monoklonale","[ngram, sap]",0.982318
2,epidermalen Wachstumsfaktorrezeptor,Nutrient_or_Body_Substance,C3812393,ErbB-Rezeptoren,"[ngram, sap]",0.957236
3,EGFR,Nutrient_or_Body_Substance,C1739039,EGFR,"[ngram, sap]",1.0
4,Therapie des fortgeschrittenen kolorektalen Ka...,Therapeutic,Not linkable,,,
5,fortgeschrittenen kolorektalen Karzinoms,Diagnosis_or_Pathology,C4721579,Kolorektalkarzinom mit Metastasen,[sap],0.843411
6,Irinotecan,Clinical_Drug,C0123931,Irinotecan,"[ngram, sap]",1.0
7,FOLFOX,Therapeutic,C0309154,FUROX,[sap],0.842153
8,Versagen,Diagnosis_or_Pathology,Not linkable,,,
9,Behandlung mit Oxaliplatin und Irinotecan,Therapeutic,C0796324,IROX Regimen,[sap],0.879975


## Re-Ranking

In [25]:
from xmen.linkers import default_ensemble
linker_no_thresh = default_ensemble(Path(conf.linker.candidate_generation.ngram.index_base_path).parent, cuda=False)



In [26]:
candidates = type_filter.transform_batch(linker_no_thresh.predict_batch(dataset))


In [27]:
from xmen.reranking import CrossEncoderReranker

In [28]:
ce_candidates = CrossEncoderReranker.prepare_data(candidates, None, kb, k=64)

Context length: 128
Use NIL values: True



In [29]:
rr = CrossEncoderReranker.load("phlobo/xmen-de-ce-medmentions", device=0)

In [30]:
reranked = rr.rerank_batch(candidates, ce_candidates, k=64)


In [31]:
# Before Re-ranking
get_dataframe(candidates)

Unnamed: 0,mention,class,cui,canonical name,linked by,score
0,Cetuximab,Clinical_Drug,C0995188,Cetuximab,"[ngram, sapbert]",1.0
1,monoklonaler Antikörper,Clinical_Drug,C0003250,"Antikörper, monoklonale","[ngram, sapbert]",0.982318
2,epidermalen Wachstumsfaktorrezeptor,Nutrient_or_Body_Substance,C3812393,ErbB-Rezeptoren,"[ngram, sapbert]",0.957236
3,EGFR,Nutrient_or_Body_Substance,C1739039,EGFR,"[ngram, sapbert]",1.0
4,Therapie des fortgeschrittenen kolorektalen Ka...,Therapeutic,C0281190,Prevention of Colorectal Cancer,[sapbert],0.736036
5,fortgeschrittenen kolorektalen Karzinoms,Diagnosis_or_Pathology,C4721579,Kolorektalkarzinom mit Metastasen,"[ngram, sapbert]",0.843411
6,Irinotecan,Clinical_Drug,C0123931,Irinotecan,"[ngram, sapbert]",1.0
7,FOLFOX,Therapeutic,C0309154,FUROX,[sapbert],0.842153
8,Versagen,Diagnosis_or_Pathology,C0231184,Inefficiency,[sapbert],0.795575
9,Behandlung mit Oxaliplatin und Irinotecan,Therapeutic,C0796324,IROX Regimen,[sapbert],0.879975


In [32]:
# After Re-ranking
get_dataframe(reranked)

Unnamed: 0,mention,class,cui,canonical name,linked by,score
0,Cetuximab,Clinical_Drug,C0995188,Cetuximab,"[ngram, sapbert]",0.040555
1,monoklonaler Antikörper,Clinical_Drug,C0003250,"Antikörper, monoklonale","[ngram, sapbert]",0.043325
2,epidermalen Wachstumsfaktorrezeptor,Nutrient_or_Body_Substance,C3812393,ErbB-Rezeptoren,"[ngram, sapbert]",0.019964
3,EGFR,Nutrient_or_Body_Substance,C1368111,EGFR-ECD,"[ngram, sapbert]",0.020019
4,Therapie des fortgeschrittenen kolorektalen Ka...,Therapeutic,C4763871,Colorectal Cancer Surgery,[sapbert],0.017078
5,fortgeschrittenen kolorektalen Karzinoms,Diagnosis_or_Pathology,C0009402,Kolorektales Karzinom,"[ngram, sapbert]",0.019689
6,Irinotecan,Clinical_Drug,C0123931,Irinotecan,"[ngram, sapbert]",0.051333
7,FOLFOX,Therapeutic,C0392943,Fluorouracil/Leucovorin Calcium/Oxaliplatin,"[ngram, sapbert]",0.0253
8,Versagen,Diagnosis_or_Pathology,C0018801,Herzinsuffizienz,"[ngram, sapbert]",0.01759
9,Behandlung mit Oxaliplatin und Irinotecan,Therapeutic,C0796324,IROX Regimen,[sapbert],0.01942
