# Prediction of overlapping spans with spaCy's SpanCategorizer

**Motivation**:

Annotations in GGPONC are often overlapping or nested.

For instance, `Versagen einer Behandlung mit Oxaliplatin und Irinotecan`
- is a *Finding*
- which contains a *Therapeutic Procedure*: `Behandlung mit Oxaliplatin und Irinotecan`
    - which in turn contains two *Clinical Drug* names: `Oxaliplatin` and `Irinotecan`.

Standard IOB-encoded labels, and most NER implementations, can model only one label per token. The IOB-based HuggingFace implementation therefore keeps only the longest surrounding mention span by default (in this case, the *Finding*).
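
Reducing nested annotations to the longest surrounding span can be sketched in plain Python. The helper name `outermost_spans` and the `(start_char, end_char, label)` tuple layout are assumptions made for this illustration; they are not taken from the project code.

```python
from typing import List, Tuple

# (start_char, end_char, label); this layout is assumed for illustration only
Span = Tuple[int, int, str]


def outermost_spans(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not fully covered by an already kept span."""
    # Sort by start offset, then longest first, so containing spans are seen first.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    kept: List[Span] = []
    for start, end, label in ordered:
        if any(k[0] <= start and end <= k[1] for k in kept):
            continue  # nested inside a kept span: lost in a one-label-per-token encoding
        kept.append((start, end, label))
    return kept


text = "Versagen einer Behandlung mit Oxaliplatin und Irinotecan"
drug_start = text.index("Oxaliplatin")
nested = [
    (0, len(text), "Finding"),
    (text.index("Behandlung"), len(text), "Therapeutic Procedure"),
    (drug_start, drug_start + len("Oxaliplatin"), "Clinical Drug"),
    (text.index("Irinotecan"), len(text), "Clinical Drug"),
]
flat = outermost_spans(nested)
print(flat)
```

Only the *Finding* survives here; the three nested mentions are exactly the information that a span-based model can keep. Spans that overlap only partially are both retained by this sketch.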

**Solution**:

Instead of token-level labels, we use spaCy's [SpanCategorizer](https://spacy.io/api/spancategorizer/) to predict overlapping mention spans, which are stored as a `SpanGroup` on the spaCy `Doc`.

## Training

See the `spacy` folder in the root directory of the project. The model configuration is located in `configs`, and training is run as a spaCy project (see `spacy/run_training.sh`).

*Note:* We have not yet optimized the many hyperparameters related to span suggestion and model training. Even so, performance is close to that of the HuggingFace models evaluated on non-nested mention spans.

## Inference

In [3]:
import sys
sys.path.append('../spacy')

In [4]:
import spacy
from spacy.tokens import Doc, Span
import snomed_spans #TODO: import needed to enable custom spaCy components, is there another way?

In [5]:
nlp = spacy.load('../data/models/spacy')



In [24]:
doc = nlp("""Versagen einer Behandlung mit Oxaliplatin und Irinotecan""")

### GraSCCo Samples

In [21]:
doc = nlp("""6.04.2029: Nachdebridement am Kopf, VAG-Wechsel linke Hand""")


In [20]:
doc = nlp("""Röntgen : Rippstein I : Gute Hüftkopfepiphysenkonturgebung , minimale Lateralisation , li. etwas stärker als re. , noch übergreifende Pfannendächer , Shenton-Menard-Linie nicht wesentlich unterbrochen , Pfannendachwinkel Ii. 30° , re. ebenfalls knapp 30° .""")


In [2]:
from pathlib import Path

from spacy.scorer import Scorer
from spacy.tokens import DocBin
from spacy.training import Example

# Machine-specific locations of the raw GraSCCo texts and the manually annotated test split
folder_raw = Path("/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/source")
manual_annotated_file = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grassco_anno_2023-01-05_0021/spacy/test.spacy"
files = [x for x in folder_raw.glob("*.txt") if x.is_file()]

# Concatenate all raw files and run the pipeline once over the combined text.
# Requires `nlp` to be loaded first (see the model-loading cell in the Inference section).
print("working ...")
text = "".join(file.read_text(encoding="utf-8") for file in files)
docs = nlp(text)

working ...

In [8]:
doc_bin = DocBin().from_disk(manual_annotated_file)
gold_docs = list(doc_bin.get_docs(nlp.vocab))

# Raw texts of the source files (not used for scoring below)
docs = [file.read_text(encoding="utf-8") for file in files]
print("Files merged...")

print("NLP Pipe...")
# Predict on the raw text of each gold document. Passing the gold Doc objects
# themselves would let the pipeline modify them in place.
docs_all = nlp.pipe(
    (gold_doc.text for gold_doc in gold_docs),
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
    batch_size=10,
)

print(gold_docs[1])
print("Building Examples...")
# Pair each predicted Doc with its gold-standard reference
examples = [Example(predicted, gold) for predicted, gold in zip(docs_all, gold_docs)]

print("eval...")
# Scorer(nlp) uses the scoring functions of the loaded pipeline's components
scorer = Scorer(nlp)
scores = scorer.score(examples)

Files merged...
NLP Pipe...
Department Orthopädie und Traumatologie Friedrichstraße 55 , 10117 Berlin 
Building Examples...
eval...

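In [9]:
scorer = Scorer(nlp)

# score() returns the metrics as a dict; the Scorer object has no `scores` attribute
scores = scorer.score(examples)

# Key names assume the spans are stored under doc.spans['snomed'] and scored
# by spaCy's default span categorizer scorer; .get() returns None otherwise
print("Spans F-Score:", scores.get("spans_snomed_f"))
print("Spans Precision:", scores.get("spans_snomed_p"))
print("Spans Recall:", scores.get("spans_snomed_r"))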


## Initial Sentence Based Processing

In [39]:
import json

nlp = spacy.load('../data/models/spacy')

def find_annotated_entities(sentence, document):
    """Collect the gold entities that lie fully inside the given sentence."""
    s_start, s_end = sentence['offsets'][0]
    entity_list = []
    for entity in document['entities']:
        if entity is None:
            continue  # skip empty entries instead of aborting the scan
        e_start, e_end = entity['offsets'][0]
        if e_start > s_end:
            break  # entities are assumed to be sorted by start offset
        if e_start >= s_start and e_end <= s_end:
            entity_list.append(entity)
    return entity_list

# Running totals over all sentences
tp = 0
e_count = 0
t_count = 0

def compare_findings(predi_entities, truth_entities, sentence):
    """Count predicted spans that match a gold entity exactly (label and offsets)."""
    global tp, e_count, t_count
    # Gold offsets are document-relative, predicted offsets are sentence-relative
    sentence_delta = sentence['offsets'][0][0]
    for e in predi_entities.spans['snomed']:
        e_count += 1
        for t in truth_entities:
            if (e.label_ == t['type']
                    and e.start_char == t['offsets'][0][0] - sentence_delta
                    and e.end_char == t['offsets'][0][1] - sentence_delta):
                tp += 1
                break
    t_count += len(truth_entities)


json_path = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grassco_anno_2023-01-05_0021/json/fine/long/test.json"
json_path_new = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grascco_hpi_anno_2023_02_08/annotations/json/fine/long/all_short.json"

with open(json_path_new, encoding="utf-8") as json_file:
    data = json.load(json_file)

for document in data:
    print("Current document_id: " + document['document_id'])
    for sentence in document['passages']:
        # Disable components not needed for span prediction; keeping "parser" enabled helps (10 fewer matches without it)
        nlp_findings = nlp(sentence['text'], disable=["tok2vec", "tagger", "attribute_ruler", "lemmatizer"])
        manual_findings = find_annotated_entities(sentence, document)
        compare_findings(nlp_findings, manual_findings, sentence)



Current document_id: Albers.tsv
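
The offset handling in the sentence-based comparison can be illustrated without the model. The sketch below assumes the JSON layout used in the cell above (`offsets` as a list holding one `[start, end]` pair, `type` as the label); the helper name `gold_spans_in_sentence` and the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

SpanKey = Tuple[int, int, str]  # (start_char, end_char, label), relative to the sentence


def gold_spans_in_sentence(sentence: Dict, entities: List[Optional[Dict]]) -> Set[SpanKey]:
    """Shift document-level gold offsets into the coordinate system of one sentence."""
    s_start, s_end = sentence["offsets"][0]
    return {
        (e["offsets"][0][0] - s_start, e["offsets"][0][1] - s_start, e["type"])
        for e in entities
        # Skip empty entries and entities that are not fully inside the sentence
        if e and s_start <= e["offsets"][0][0] and e["offsets"][0][1] <= s_end
    }


# Toy data (made up): a sentence that starts at character 100 of its document
sentence = {"text": "Behandlung mit Oxaliplatin", "offsets": [[100, 126]]}
entities = [
    {"type": "Therapeutic", "offsets": [[100, 126]]},
    {"type": "Clinical_Drug", "offsets": [[115, 126]]},
    None,
    {"type": "Finding", "offsets": [[200, 210]]},  # belongs to a later sentence
]

gold = gold_spans_in_sentence(sentence, entities)
# Predictions use sentence-relative offsets, like spans from nlp(sentence["text"])
predicted = {(0, 26, "Therapeutic"), (15, 26, "Finding")}
true_positives = len(predicted & gold)
print(true_positives)
```

A predicted span counts as a match only if start, end and label all agree, which is why the second prediction (correct offsets, wrong label) is not counted.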


## Sentence Based Processing (Optimized)

In [47]:
import json

def find_annotated_entities(sentence, document):
    """Return the gold entities that lie fully inside the given sentence."""
    s_start, s_end = sentence['offsets'][0]
    return [entity for entity in document['entities']
            if entity
            and entity['offsets'][0][0] >= s_start
            and entity['offsets'][0][1] <= s_end]

def compare_findings(predi_entities, truth_entities, sentence):
    """Return (true positives, predicted count, gold count) for one sentence."""
    sentence_delta = sentence['offsets'][0][0]
    predicted = predi_entities.spans['snomed']
    tp = sum(
        1 for e in predicted
        if any(e.label_ == t['type']
               and e.start_char == t['offsets'][0][0] - sentence_delta
               and e.end_char == t['offsets'][0][1] - sentence_delta
               for t in truth_entities)
    )
    return tp, len(predicted), len(truth_entities)

tp = e_count = t_count = 0

json_path = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grassco_anno_2023-01-05_0021/json/fine/long/test.json"
json_path_new = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grascco_hpi_anno_2023_02_08/annotations/json/fine/long/all_short.json"

with open(json_path, encoding="utf-8") as json_file:
    data = json.load(json_file)

for document in data:
    print("Current document_id: " + document['document_id'])
    # Filter first so that the pipeline outputs stay aligned with their passages
    passages = [p for p in document['passages'] if 'text' in p]
    texts = [p['text'] for p in passages]
    # nlp.pipe processes the sentences in batches; keeping "parser" enabled helps (10 fewer matches without it)
    for passage, nlp_findings in zip(passages, nlp.pipe(texts, disable=["tok2vec", "tagger", "attribute_ruler", "lemmatizer"])):
        manual_findings = find_annotated_entities(passage, document)
        tp_i, e_count_i, t_count_i = compare_findings(nlp_findings, manual_findings, passage)
        tp += tp_i
        e_count += e_count_i
        t_count += t_count_i



Current document_id: Weil.tsv
Current document_id: Recklinghausen.tsv



In [42]:
fn = t_count-tp
fp = e_count-tp

print(f'Actual Entities: {t_count}')
print(f'Predicted Entities: {e_count}')

print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'False Negatives: {fn}')

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1_score:.2f}')


#result for fine/long

#Actual Entities: 6194
#Predicted Entities: 5171
#True Positives: 2584
#False Positives: 2587
#False Negatives: 3610
#Precision: 0.50
#Recall: 0.42
#F1 Score: 0.45

#result for fine/short
#Actual Entities: 7201
#Predicted Entities: 5171
#True Positives: 1737
#False Positives: 3434
#False Negatives: 5464
#Precision: 0.34
#Recall: 0.24
#F1 Score: 0.28

Actual Entities: 6234
Predicted Entities: 5101
True Positives: 2381
False Positives: 2720
False Negatives: 3853
Precision: 0.47
Recall: 0.38
F1 Score: 0.42
tp: 2381 e_count: 5101 t_count: 6234
precision: 0.46677122132915116 recall: 0.38193776066730833 f1_score: 0.42011468901632115
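
The metric computation can be wrapped in a small helper that also handles empty inputs. This is a minimal sketch; the function name `precision_recall_f1` is an assumption, and the counts passed in are the ones printed in the output above.

```python
from typing import Tuple


def precision_recall_f1(tp: int, predicted: int, actual: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts, returning 0.0 for empty denominators."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts from the run above: tp, e_count (predicted) and t_count (actual)
p, r, f1 = precision_recall_f1(tp=2381, predicted=5101, actual=6234)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1 Score: {f1:.2f}")
```

The guards matter when a run is interrupted before any span has been counted, in which case the plain divisions would raise `ZeroDivisionError`.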


In [25]:
for s in sorted(list(doc.spans['snomed']), key=lambda s: s.start):
    print(s, s.label_)

Versagen einer Behandlung Diagnosis_or_Pathology
Behandlung mit Oxaliplatin und Irinotecan Therapeutic
Oxaliplatin Clinical_Drug
Irinotecan Clinical_Drug


## Document Based Processing

In [31]:
import json

#nlp = spacy.load('../data/models/spacy')

def get_string_sentences(docs):
    """Rebuild one text per document by joining its sentences with newlines."""
    # Gold offsets only line up with this text if consecutive sentences
    # are exactly one character apart in the annotated document
    return ['\n'.join(sentence['text'] for sentence in doc['passages']) for doc in docs]

def find_annotated_entities(annotated_doc):
    """Return all non-empty gold entities of a document."""
    return [entity for entity in annotated_doc['entities'] if entity]

def compare_findings(predicted_findings, truth_entities):
    """Return (true positives, predicted count, gold count) for one document."""
    predicted_findings = sorted(predicted_findings, key=lambda x: x.start_char)
    truth_starts = {t['offsets'][0][0] for t in truth_entities}
    tp, e_count, t_count = 0, len(predicted_findings), len(truth_entities)
    for e in predicted_findings:
        if e.start_char in truth_starts:
            # Only the first gold entity starting at this offset is compared
            t = next(t for t in truth_entities if t['offsets'][0][0] == e.start_char)
            if e.label_ == t['type'] and e.end_char == t['offsets'][0][1]:
                tp += 1
    return tp, e_count, t_count

tp, e_count, t_count = 0, 0, 0

json_path = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grassco_anno_2023-01-05_0021/json/fine/long/test.json"
json_path_new = "/Users/leon.sarodnik/Documents/GitHub/ggponc_annotation/GraSSco/grascco_hpi_anno_2023_02_08/annotations/json/fine/long/all_short.json"

with open(json_path_new, encoding="utf-8") as json_file:
    data = json.load(json_file)

print("Processing all documents... this can take several minutes, hang tight!")

# Process whole documents with nlp.pipe; keeping "parser" enabled helps (10 fewer matches without it)
texts = get_string_sentences(data)
for doc, annotated_doc in zip(nlp.pipe(texts, disable=["tok2vec", "attribute_ruler", "lemmatizer"]), data):
    predicted_findings = list(doc.spans['snomed'])
    manual_findings = find_annotated_entities(annotated_doc)
    tp_i, e_count_i, t_count_i = compare_findings(predicted_findings, manual_findings)
    tp += tp_i
    e_count += e_count_i
    t_count += t_count_i


Processing all documents... this can take several minutes, hang tight!


In [32]:
fn = t_count-tp
fp = e_count-tp

print(f'Actual Entities: {t_count}')
print(f'Predicted Entities: {e_count}')

print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'False Negatives: {fn}')

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1_score:.2f}')


#sentence based result for all-short

#Actual Entities: 195
#Predicted Entities: 187
#True Positives: 87
#False Positives: 100
#False Negatives: 108
#Precision: 0.47
#Recall: 0.45
#F1 Score: 0.46

#document based result for all-short

#Actual Entities: 195
#Predicted Entities: 177
#True Positives: 81
#False Positives: 96
#False Negatives: 114
#Precision: 0.46
#Recall: 0.42
#F1 Score: 0.44

Actual Entities: 195
Predicted Entities: 177
True Positives: 81
False Positives: 96
False Negatives: 114
Precision: 0.46
Recall: 0.42
F1 Score: 0.44
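
The document-based `compare_findings` uses a `next(...)` lookup, so it compares each prediction only with the first gold entity that shares its start offset. With nested annotations, several gold entities can start at the same character. The sketch below contrasts that lookup with a set-based exact match on toy data; the function names, labels and offsets are hypothetical, and whether this affects the reported numbers has not been verified.

```python
from typing import Dict, List, Tuple

Pred = Tuple[int, int, str]  # (start_char, end_char, label)


def count_first_match(predicted: List[Pred], gold: List[Dict]) -> int:
    """Compare each prediction only with the first gold entity at the same start offset."""
    hits = 0
    for start, end, label in predicted:
        first = next((g for g in gold if g["offsets"][0][0] == start), None)
        if first and first["type"] == label and first["offsets"][0][1] == end:
            hits += 1
    return hits


def count_set_match(predicted: List[Pred], gold: List[Dict]) -> int:
    """Compare each prediction with every gold entity via a set of (start, end, label) keys."""
    gold_keys = {(g["offsets"][0][0], g["offsets"][0][1], g["type"]) for g in gold}
    return sum(1 for p in predicted if p in gold_keys)


# Toy data (made up): two nested gold entities that both start at offset 0
gold = [
    {"type": "Finding", "offsets": [[0, 56]]},
    {"type": "Diagnosis_or_Pathology", "offsets": [[0, 25]]},
]
predicted = [(0, 25, "Diagnosis_or_Pathology"), (0, 56, "Finding")]

first_hits = count_first_match(predicted, gold)
set_hits = count_set_match(predicted, gold)
print(first_hits, set_hits)
```

In this toy case the first-match lookup misses the inner entity, while the set-based variant counts both predictions. The set lookup also avoids rescanning the gold list for every predicted span.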
