# Prediction of overlapping spans with spaCy's SpanCategorizer

**Motivation**:

Annotations in GGPONC are often overlapping or nested.

For instance, `Versagen einer Behandlung mit Oxaliplatin und Irinotecan`
- is a *Finding*
- which contains a *Therapeutic Procedure*: `Behandlung mit Oxaliplatin und Irinotecan`
    - which in turn contains two *Clinical Drug* names: `Oxaliplatin` and `Irinotecan`.

Standard IOB-encoded labels, like most NER implementations, can model only one label per token. In the IOB-based HuggingFace implementation, we therefore keep only the longest enclosing mention span by default (in this case, the *Finding*).
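
The one-label-per-token restriction can be illustrated with a small sketch. It is not the project's preprocessing code: the helper names `longest_outer_spans` and `to_iob`, the whitespace tokenization, and the label strings are assumptions made for illustration only.

```python
# Tokens of the example phrase (whitespace-tokenized for illustration).
tokens = "Versagen einer Behandlung mit Oxaliplatin und Irinotecan".split()

# Nested mentions as (start, end, label) with an exclusive end token index.
mentions = [
    (0, 7, "Finding"),
    (2, 7, "Therapeutic_Procedure"),
    (4, 5, "Clinical_Drug"),
    (6, 7, "Clinical_Drug"),
]


def longest_outer_spans(spans):
    """Greedily keep the longest spans and drop every span that overlaps a kept one."""
    kept = []
    # Visit longer spans first so that outer mentions win over nested ones.
    for start, end, label in sorted(spans, key=lambda s: s[0] - s[1]):
        if not any(k_start < end and start < k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def to_iob(n_tokens, spans):
    """Encode non-overlapping spans as one IOB tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


flat = longest_outer_spans(mentions)
iob_tags = to_iob(len(tokens), flat)
for token, tag in zip(tokens, iob_tags):
    print(f"{token}\t{tag}")
```

Every token ends up tagged as part of the *Finding*, so the nested *Therapeutic Procedure* and both *Clinical Drug* mentions are no longer recoverable from the tag sequence.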

**Solution**:

Instead of token-level labels, we use spaCy's [SpanCategorizer](https://spacy.io/api/spancategorizer/) to predict overlapping mention spans, which are stored as a `SpanGroup` on the spaCy `Doc`.
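
To show what such a span group looks like, the following sketch builds one by hand on a blank German pipeline. No trained model is involved: the token offsets and label strings are written out manually as assumptions, and only the `snomed` key is taken from the inference code in this notebook.

```python
import spacy
from spacy.tokens import Span

# A blank German pipeline only tokenizes; no trained model is needed.
nlp = spacy.blank("de")
doc = nlp("Versagen einer Behandlung mit Oxaliplatin und Irinotecan")

# Offsets are set by hand here; a trained SpanCategorizer would predict them.
doc.spans["snomed"] = [
    Span(doc, 0, 7, label="Finding"),
    Span(doc, 2, 7, label="Therapeutic_Procedure"),
    Span(doc, 4, 5, label="Clinical_Drug"),
    Span(doc, 6, 7, label="Clinical_Drug"),
]

# Overlapping and nested spans coexist in the same group.
for span in doc.spans["snomed"]:
    print(span.start, span.end, span.label_, "|", span.text)
```

Unlike `doc.ents`, which rejects overlapping spans, a span group places no such restriction on its members.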

## Training

See the `spacy` folder in the root directory of the project. The model configuration is located in `configs`, and training is run as a spaCy project (see `spacy/run_training.sh`).

*Note:* We have not yet optimized the many hyperparameters related to span suggestion and model training. Even so, performance is close to that of the HuggingFace models evaluated on non-nested mention spans.
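
To give a feel for why span suggestion matters, here is a rough pure-Python sketch of n-gram style candidate generation. spaCy ships an n-gram suggester, but whether the configuration in `configs` uses it, and with which sizes, is an assumption here; `ngram_candidates` is a hypothetical helper and not part of spaCy.

```python
def ngram_candidates(tokens, sizes):
    """Enumerate every contiguous token span whose length is in `sizes`."""
    candidates = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))  # exclusive end index
    return candidates


tokens = "Versagen einer Behandlung mit Oxaliplatin und Irinotecan".split()

# With n-grams of up to three tokens, longer mentions are never proposed.
candidates = ngram_candidates(tokens, sizes=(1, 2, 3))
print(len(candidates))
print((2, 7) in candidates)  # the five-token span starting at "Behandlung"
```

A span that is never suggested can never be predicted, so the choice of sizes bounds recall on long mentions, while larger sizes increase the number of candidates the model has to score.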

## Inference

In [43]:
import sys
sys.path.append('../spacy')

In [49]:
import spacy
import snomed_spans  # TODO: imported only to register the custom spaCy components; is there another way?

In [50]:
nlp = spacy.load('../data/models/spacy')

In [166]:
doc = nlp("""Versagen einer Behandlung mit Oxaliplatin und Irinotecan""")

In [167]:
# Print the predicted spans in order of their start token
for s in sorted(doc.spans['snomed'], key=lambda s: s.start):
    print(s, s.label_)

Versagen einer Behandlung Diagnosis_or_Pathology
Behandlung mit Oxaliplatin und Irinotecan Therapeutic
Oxaliplatin Clinical_Drug
Irinotecan Clinical_Drug
