# Combining NER models with xMEN for German Clinical Entity Linking

## Preparation

### Get access to GGPONC Models

Request access to GGPONC at:

https://www.leitlinienprogramm-onkologie.de/projekte/ggponc-english/

Then put the spaCy model from the v2.0 release (`models` folder) into a location of your choice.

### Prepare dicts and index

`xmen dict conf/ggponc.yaml`

`xmen index conf/ggponc.yaml --all --overwrite`

In [2]:
# Location of spaCy model
GGPONC_MODEL_PATH = '../temp/ggponc/spacy'

In [None]:
!pip install spacy-transformers

In [None]:
!git clone https://github.com/hpi-dhc/ggponc_annotation ../temp/ggponc/ggponc_annotation

In [3]:
from pathlib import Path
GGPONC_PROJECT_PATH = Path("../temp/ggponc/ggponc_annotation")

In [4]:
import sys
sys.path.append(str(GGPONC_PROJECT_PATH / 'spacy'))

In [5]:
import spacy
import snomed_spans # Import custom span suggester and scorer for spaCy spancat 

nlp = spacy.load(GGPONC_MODEL_PATH)



In [6]:
sentences = [
    # Adjacent string literals are concatenated; each fragment ends with a space so words do not run together
    "Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und "
    "dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. "
    "allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.",
    "Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie."
]
sentences

['Cetuximab ist ein monoklonaler Antikörper, der gegen den epidermalen Wachstumsfaktorrezeptor (EGFR) gerichtet ist und dient zur Therapie des fortgeschrittenen kolorektalen Karzinoms zusammen mit Irinotecan oder in Kombination mit FOLFOX bzw. allein nach Versagen einer Behandlung mit Oxaliplatin und Irinotecan.',
 'Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie.']

In [7]:
docs = list(nlp.pipe(sentences))

In [8]:
for d in docs:
    for span in sorted(d.spans['snomed'], key=lambda s: s.start):
        print(span, '---', span.label_)

Cetuximab --- Clinical_Drug
monoklonaler Antikörper --- Clinical_Drug
epidermalen Wachstumsfaktorrezeptor --- Nutrient_or_Body_Substance
EGFR --- Nutrient_or_Body_Substance
Therapie des fortgeschrittenen kolorektalen Karzinoms --- Therapeutic
fortgeschrittenen kolorektalen Karzinoms --- Diagnosis_or_Pathology
Irinotecan --- Clinical_Drug
FOLFOX --- Therapeutic
Versagen --- Diagnosis_or_Pathology
Behandlung mit Oxaliplatin und Irinotecan --- Therapeutic
Oxaliplatin --- Clinical_Drug
Irinotecan --- Clinical_Drug
zytologischem Verdacht auf CIN 1/2 --- Other_Finding
sofortige Kolposkopie --- Diagnostic


# Run Entity Linker

In [9]:
from xmen.data import from_spacy
from xmen.linkers import SapBERTLinker, TFIDFNGramLinker, EnsembleLinker
from xmen.confhelper import load_config

In [10]:
dataset = from_spacy(docs, span_key='snomed')

In [11]:
dataset[1]

{'id': 1,
 'document_id': 1,
 'passages': [{'id': 0,
   'offsets': [[0, 101]],
   'text': ['Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht auf CIN 1/2 die sofortige Kolposkopie.'],
   'type': 'sentence'}],
 'entities': [{'id': 0,
   'normalized': [],
   'offsets': [[40, 74]],
   'text': ['zytologischem Verdacht auf CIN 1/2'],
   'type': 'Other_Finding'},
  {'id': 1,
   'normalized': [],
   'offsets': [[79, 100]],
   'text': ['sofortige Kolposkopie'],
   'type': 'Diagnostic'}]}
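
The entity offsets in each record are character positions into the passage text. As a minimal sketch, the following uses a hand-copied excerpt of the record shown above (it does not call xMEN, and the helper name `mention_surface_forms` is made up for illustration) to slice the mentions back out of the sentence. It assumes a single passage that starts at offset 0, as in this example.

```python
# Hand-copied excerpt of dataset[1]; inside the notebook, dataset[1] could be passed instead.
record = {
    "passages": [
        {
            "text": [
                "Als Alternative empfiehlt die ASCCP bei zytologischem Verdacht "
                "auf CIN 1/2 die sofortige Kolposkopie."
            ]
        }
    ],
    "entities": [
        {"offsets": [[40, 74]], "text": ["zytologischem Verdacht auf CIN 1/2"], "type": "Other_Finding"},
        {"offsets": [[79, 100]], "text": ["sofortige Kolposkopie"], "type": "Diagnostic"},
    ],
}


def mention_surface_forms(rec):
    """Slice every entity offset pair out of the first passage's text.

    Hypothetical helper; assumes one passage whose text starts at offset 0.
    """
    text = rec["passages"][0]["text"][0]
    return [
        (text[start:end], entity["type"])
        for entity in rec["entities"]
        for start, end in entity["offsets"]
    ]


surface_forms = mention_surface_forms(record)
for surface, label in surface_forms:
    print(f"{label}: {surface}")
```

The `normalized` lists are still empty at this point; they are the field that the linkers are expected to populate with candidate concepts.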

In [12]:
conf = load_config('../conf/ggponc.yaml')

In [13]:
ngram_linker = TFIDFNGramLinker(**conf.linker.candidate_generation.ngram)

In [14]:
prediction = ngram_linker.predict_batch(dataset)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
SapBERTLinker.clear()
sap_linker = SapBERTLinker(**conf.linker.candidate_generation.sapbert)

## Semantic Type Filtering

We filter the generated output to make sure the semantic type of the predicted concepts actually matches the semantic class of the named entity.

As the GGPONC entity classes are based on SNOMED CT top-level concepts, while we link against UMLS CUIs, we provide a mapping of GGPONC entity types to UMLS TUIs in `ggponc2tui.csv`.
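
To illustrate the idea of semantic type filtering independently of xMEN, here is a minimal sketch in plain Python. The concept identifiers and the simplified mapping are placeholders invented for this example; only the two TUIs are real UMLS semantic types. xMEN's own `SemanticTypeFilter` may apply additional logic, so this should be read as an illustration of the principle rather than a description of the library.

```python
# Toy knowledge base: the concept IDs are placeholders, not real UMLS CUIs.
# T121 = Pharmacologic Substance, T061 = Therapeutic or Preventive Procedure.
cui_to_tuis = {
    "C_DRUG": {"T121"},
    "C_PROCEDURE": {"T061"},
}

# Hypothetical, simplified mapping from an entity class to its allowed TUIs.
type2tui_toy = {"Clinical_Drug": {"T121"}}


def filter_candidates(entity_type, candidate_cuis):
    """Keep candidates that share at least one TUI with the entity type's allowed TUIs."""
    allowed = type2tui_toy.get(entity_type, set())
    return [cui for cui in candidate_cuis if cui_to_tuis.get(cui, set()) & allowed]


kept = filter_candidates("Clinical_Drug", ["C_PROCEDURE", "C_DRUG"])
print(kept)
```

In this sketch, an entity type without any mapped TUIs ends up with no candidates at all; whether that is the desired behaviour depends on the application.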

In [15]:
from xmen.kb import load_kb
from xmen.data import SemanticTypeFilter
import pandas as pd

In [16]:
kb = load_kb(Path(conf.cache_dir) / 'ggponc' / 'ggponc.jsonl')

In [17]:
tui_df = pd.read_csv('ggponc2tui.csv')
type2tui = {}
for c in ['Diagnosis_or_Pathology',
       'Other_Finding', 'Clinical_Drug', 'Nutrient_or_Body_Substance',
       'External_Substance', 'Therapeutic', 'Diagnostic']:
    type2tui[c] = list(tui_df.TUI[tui_df[c] == 'x'].values)
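
The cell above implies a particular table layout: one `TUI` column plus one column per entity class, with an `x` marking each TUI that belongs to the class. The following sketch reproduces the same lookup on a two-row table. The rows and the choice of which TUI is marked for which class are invented for illustration and are not taken from the real mapping file.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt; the actual mapping file has more rows and columns.
toy_csv = io.StringIO(
    "TUI,Clinical_Drug,Therapeutic\n"
    "T121,x,\n"
    "T061,,x\n"
)
toy_df = pd.read_csv(toy_csv)

# Same selection logic as in the notebook cell: collect the TUIs marked with 'x' per class.
toy_type2tui = {
    column: list(toy_df.TUI[toy_df[column] == "x"].values)
    for column in ["Clinical_Drug", "Therapeutic"]
}
print(toy_type2tui)
```

Empty cells are read as missing values, so they never compare equal to `'x'` and are skipped by the selection.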

In [18]:
type_filter = SemanticTypeFilter(type2tui, kb)

In [19]:
filtered_prediction = type_filter.transform_batch(prediction)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

AttributeError: 'function' object has no attribute 'type_id_to_node'