# Tutorial for annotation import and agreement (WIP)

## Import functionality

The AnnotationAggregator class helps import annotations from different file formats into spaCy and pandas DataFrame objects, which are widely used in the NLP and data science communities.

In [1]:
import spacy
import medspacy  # Note: AnnotationAggregator keeps most of its non-import functionality without medspacy
from spacy.tokens import Span, Token
import pandas as pd
import os
import sys
sys.path.insert(1, '../AnnotationAggregator/')
import AnnotationAggregator as AA

### Spacy import

We'll start by creating some example spaCy annotations. The cell below reads three manually fabricated documents containing symptom information about patients, then adds a few symptom annotations by hand for demonstration purposes.

In [2]:
Span.set_extension("Temporality", default='Present')
Span.set_extension("Negation", default='Affirmed')

##Pull from corpus files
folder_path = "../example_input/Annotator_1/corpus/"
texts = []
for filename in sorted(os.listdir(folder_path)):  # sort so the document order (and the indices used below) is deterministic
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path,filename)
        with open(file_path,"r") as f:
            texts.append(f.read())

## Process text files with default spaCy pipeline
nlp = spacy.load("en_core_web_sm")
docs = [nlp(text) for text in texts]

##Manually create entities
#####Doc 0
spans = []
span = Span(docs[0],3,4,label="Anatomical_Location")
spans.append(span)
span = Span(docs[0],4,5,label="Pain")
span._.Temporality = "Past"
span._.Negation = "Affirmed"
spans.append(span)
docs[0].ents = spans

#####Doc 1
spans = []
span = Span(docs[1],11,12,label="Gastrointestinal_and_Genitourinary")
span._.Temporality = "Past"
span._.Negation = "Affirmed"
spans.append(span)
span = Span(docs[1],20,21,label="Gastrointestinal_and_Genitourinary")
span._.Temporality = "Present"
span._.Negation = "Affirmed"
spans.append(span)
span = Span(docs[1],24,26,label="Pain")
span._.Temporality = "Present"
span._.Negation = "Negated"
spans.append(span)
docs[1].ents = spans

#####Doc 2
spans = []
span = Span(docs[2],8,9,label="Pain")
span._.Temporality = "Past"
span._.Negation = "Affirmed"
spans.append(span)
span = Span(docs[2],21,22,label="Pain")
span._.Temporality = "Present"
span._.Negation = "Affirmed"
spans.append(span)
span = Span(docs[2],28,29,label="Gastrointestinal_and_Genitourinary")
span._.Temporality = "Present"
span._.Negation = "Negated"
spans.append(span)
span = Span(docs[2],30,31,label="Gastrointestinal_and_Genitourinary")
span._.Temporality = "Present"
span._.Negation = "Negated"
spans.append(span)
docs[2].ents = spans

In [3]:
for i,doc in enumerate(docs):
    print("\n".join(["Document Text for document "+str(i+1),"-------------------",doc.text,"",""]))

Document Text for document 1
-------------------
Pt is experiencing back pain. They also report having a constant headache in mornings.


Document Text for document 2
-------------------
Pt has painful breathing. Stops eating after lunch to avoid reflux.

HPI:
[x] GERD
[] back pain





Document Text for document 3
-------------------
Review of systems:

Eyes: no pain, discharge, or dryness
Back pain: Has low grade pain w/ walking
Gastrointestinal: no nausea, vomiting, blood




AnnotationAggregator has 'add' methods for loading spaCy documents, eHOST annotations, and dataframes. Let's initialize the AnnotationAggregator and add our spaCy annotations.

In [4]:
Agg = AA.AnnotationAggregator()
Agg.add_spacyDocs(docs,attributes=["Temporality","Negation"],id_list=["Example_1","Example_2","Example_3"])

Now let's see what is stored in the 'Agg' object. Note that the getters always return dictionaries, with each key generally naming a group of documents.

In [5]:
Agg.get_spacy_docs().keys(), Agg.get_raw_df().keys(), Agg.get_text().keys()

(dict_keys(['spacyDoc_set1']),
 dict_keys(['spacyDoc_set1']),
 dict_keys(['Example_1', 'Example_2', 'Example_3']))

In [6]:
Agg.get_spacy_docs()['spacyDoc_set1']

{'Example_1': Pt is experiencing back pain. They also report having a constant headache in mornings.,
 'Example_2': Pt has painful breathing. Stops eating after lunch to avoid reflux.
 
 HPI:
 [x] GERD
 [] back pain
 
 ,
 'Example_3': Review of systems:
 
 Eyes: no pain, discharge, or dryness
 Back pain: Has low grade pain w/ walking
 Gastrointestinal: no nausea, vomiting, blood}

In [7]:
Agg.get_raw_df()['spacyDoc_set1']

| | DocID | annotatedSpan | spanStartChar | spanEndChar | spanLabel | Temporality | Negation |
|---|---|---|---|---|---|---|---|
| 0 | Example_1 | back | 19 | 23 | Anatomical_Location | Present | Affirmed |
| 1 | Example_1 | pain | 24 | 28 | Pain | Past | Affirmed |
| 2 | Example_2 | reflux | 60 | 66 | Gastrointestinal_and_Genitourinary | Past | Affirmed |
| 3 | Example_2 | GERD | 78 | 82 | Gastrointestinal_and_Genitourinary | Present | Affirmed |
| 4 | Example_2 | back pain | 86 | 95 | Pain | Present | Negated |
| 5 | Example_3 | pain | 29 | 33 | Pain | Past | Affirmed |
| 6 | Example_3 | pain | 82 | 86 | Pain | Present | Affirmed |
| 7 | Example_3 | nausea | 119 | 125 | Gastrointestinal_and_Genitourinary | Present | Negated |
| 8 | Example_3 | vomiting | 127 | 135 | Gastrointestinal_and_Genitourinary | Present | Negated |


In [8]:
Agg.get_text()['Example_1']

'Pt is experiencing back pain. They also report having a constant headache in mornings.'
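
The `spanStartChar` and `spanEndChar` values in the dataframe above appear to be character offsets into the document text, with the end offset exclusive. As a quick sanity check, the sketch below slices the Example_1 text using the offsets copied from that dataframe. It is plain Python and does not call AnnotationAggregator.

```python
# Text of Example_1, copied from the output above.
text = ("Pt is experiencing back pain. They also report having "
        "a constant headache in mornings.")

# (annotatedSpan, spanStartChar, spanEndChar) rows copied from the raw dataframe.
rows = [("back", 19, 23), ("pain", 24, 28)]

# Slice the text with each pair of offsets; the end offset is treated as exclusive.
recovered = [text[start:end] for _, start, end in rows]

# Collect any rows whose offsets do not reproduce the annotated span text.
mismatches = [(span, got) for (span, _, _), got in zip(rows, recovered) if span != got]

print(recovered)
print(mismatches)
```

An empty `mismatches` list means every offset pair points at the text it claims to annotate. The same check is useful on your own data after any import step.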

### Ehost import

Now let's look at importing eHOST annotations. Pass a list of eHOST file paths to the method below. Each path is treated as its own group of documents. The code first imports the files as spaCy documents, then converts them to dataframes. You can access both using the getter functions shown above.

Note that the eHOST functions here rely on the medspacy-io package, which you can install using the cell below (see https://github.com/medspacy/medspacy_io for more extensive use cases and documentation of that code). **Under 'Dataframe import' we will import the pkl file that was derived using this method.**

In [9]:
##Need to have example ehost file to import

In [10]:
#!pip install medspacy-io

In [11]:
#For this to work you must have medspacy-io installed and importable
#Agg.add_ehost_files(annot_dirs="../example_input/Annotator_1/")

In [12]:
##Agg.get_raw_df()

### Dataframe import

As above, let's use an 'add' method, this time to add a dataframe. Note that AA uses default column names in the comparison. These can be changed when you initialize the class; however, the spaCy and eHOST import functions will still use the standard column names on import.
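
Before adding your own dataframe, it can help to check that it carries the columns the comparison expects. The sketch below is an illustration only: the column list is read off the dataframes shown in this tutorial, and `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, not part of AnnotationAggregator. It works on plain lists of column names so that it runs without pandas; with a real dataframe, the same check could be applied to `df.columns`.

```python
# Column names as they appear in the dataframes shown in this tutorial.
# Assumption: these are the standard names the comparison relies on.
REQUIRED_COLUMNS = ["DocID", "annotatedSpan", "spanStartChar", "spanEndChar", "spanLabel"]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# A column list that follows the standard names (extra columns are fine).
good = ["DocID", "annotatedSpan", "spanStartChar", "spanEndChar", "spanLabel", "Negation"]
# A column list that uses different names for the document ID and offsets.
bad = ["doc", "annotatedSpan", "start", "end", "spanLabel"]

print(missing_columns(good))
print(missing_columns(bad))
```

If the second call reports missing names, either rename the columns before adding the dataframe or set the matching column names when initializing the class.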

In [13]:
ehost_import_results = pd.read_pickle('../example_input/example_raw_df.pkl')
ehost_import_results

| relLabel | DocID | annotatedSpan | spanStartChar | spanEndChar | spanLabel | spanID | Temporality | Negation | Anatomy_to_Pain |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Example_1 | back | 19 | 23 | Anatomical_Location | EHOST_Instance_2 | | | EHOST_Instance_1 |
| 1 | Example_1 | headache | 65 | 73 | Pain | EHOST_Instance_3 | Present | Affirmed | |
| 2 | Example_1 | pain | 24 | 28 | Pain | EHOST_Instance_1 | Present | Affirmed | |
| 3 | Example_2 | GERD | 78 | 82 | Gastrointestinal_and_Genitourinary | EHOST_Instance_11 | Present | Affirmed | |
| 4 | Example_2 | back | 86 | 90 | Anatomical_Location | EHOST_Instance_13 | | | EHOST_Instance_12 |
| 5 | Example_2 | pain | 91 | 95 | Pain | EHOST_Instance_12 | Hypothetical | Negated | |
| 6 | Example_2 | painful breathing | 7 | 24 | Pain | EHOST_Instance_9 | Present | Affirmed | |
| 7 | Example_2 | reflux | 60 | 66 | Gastrointestinal_and_Genitourinary | EHOST_Instance_10 | Hypothetical | Negated | |
| 8 | Example_3 | Back | 57 | 61 | Anatomical_Location | EHOST_Instance_26 | | | |
| 9 | Example_3 | Eyes | 20 | 24 | Anatomical_Location | EHOST_Instance_24 | | | EHOST_Instance_23 |


In [14]:
Agg.add_dataframe({"ehost_import_df":ehost_import_results})

In [15]:
Agg.get_spacy_docs().keys(), Agg.get_raw_df().keys(), Agg.get_text().keys()

(dict_keys(['spacyDoc_set1']),
 dict_keys(['spacyDoc_set1', 'ehost_import_df']),
 dict_keys(['Example_1', 'Example_2', 'Example_3']))

## Agreement (Performance)

Now that the data is imported, let's look at the pairwise agreement (performance) between each pair of document sets. Note that the agreement dataframe and the formal metrics are each keyed by the names of the two document sets, joined by a hyphen. 'get_agreement_metrics()' additionally holds an inner dictionary for each pair, with one entry per metric.

In [16]:
Agg.get_agreement_dict().keys() , Agg.get_agreement_metrics().keys()

(dict_keys(['spacyDoc_set1-ehost_import_df']),
 dict_keys(['spacyDoc_set1-ehost_import_df']))
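
The key shown above joins the two set names with a hyphen. A minimal sketch of how such keys could be built for any number of sets follows. It assumes each unordered pair appears once, in the order the sets were added, which is an inference from this single example rather than documented behavior.

```python
from itertools import combinations

# Names of the document sets, in the order they were added in this tutorial.
set_names = ["spacyDoc_set1", "ehost_import_df"]

# One key per unordered pair of sets, joined with a hyphen.
pair_keys = ["-".join(pair) for pair in combinations(set_names, 2)]
print(pair_keys)

# With a third (hypothetical) set, every pair gets its own key.
three_sets = set_names + ["annotator_3"]
three_set_keys = ["-".join(pair) for pair in combinations(three_sets, 2)]
print(three_set_keys)
```

With n sets this produces n(n-1)/2 keys, so the number of pairwise comparisons grows quickly as annotators are added.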

In [17]:
Agg.get_agreement_dict()['spacyDoc_set1-ehost_import_df']

Unnamed: 0,doc_name,Annotation_1,Annotation_2,Annot_1_label,Annot_1_char,Annot_2_label,Annot_2_char,Overall_start_char,Exact_Match?,Duplicate_Matches?,Overlap?,Matching_label?,A1_Negation,A2_Negation,A1_Temporality,A2_Temporality,context,Negation_Match?,Temporality_Match?
0,Example_2,,painful breathing,,,Pain,7-24,7,False,False,False,False,,Affirmed,,Present,...Pt has painful breathing. Stops eating afte...,False,False
1,Example_2,reflux,reflux,Gastrointestinal_and_Genitourinary,60-66,Gastrointestinal_and_Genitourinary,60-66,60,True,False,True,True,Affirmed,Negated,Past,Hypothetical,...nful breathing. Stops eating after lunch to...,False,False
2,Example_2,GERD,GERD,Gastrointestinal_and_Genitourinary,78-82,Gastrointestinal_and_Genitourinary,78-82,78,True,False,True,True,Affirmed,Affirmed,Present,Present,...ops eating after lunch to avoid reflux.\n\n...,True,True
3,Example_2,back pain,back || pain,Pain,86-95,Anatomical_Location || Pain,86-90 || 91-95,86,False,True,True,False,Negated,NA || Negated,Present,NA || Hypothetical,...ng after lunch to avoid reflux.\n\nHPI:\n[x...,False,False
4,Example_3,,Eyes,,,Anatomical_Location,20-24,20,False,False,False,False,,,,,"...Review of systems:\n\nEyes: no pain, discha...",False,False
5,Example_3,pain,pain,Pain,29-33,Pain,29-33,29,True,False,True,True,Affirmed,Negated,Past,Present,"...Review of systems:\n\nEyes: no pain, discha...",False,False
6,Example_3,,Back,,,Anatomical_Location,57-61,57,False,False,False,False,,,,,"...of systems:\n\nEyes: no pain, discharge, or...",False,False
7,Example_3,,pain,,,Pain,62-66,62,False,False,False,False,,Affirmed,,Hypothetical,"...stems:\n\nEyes: no pain, discharge, or dryn...",False,False
8,Example_3,pain,pain,Pain,82-86,Pain,82-86,82,True,False,True,True,Affirmed,Affirmed,Present,Past,"...n, discharge, or dryness\nBack pain: Has lo...",True,False
9,Example_3,,Gastrointestinal,,,Gastrointestinal_and_Genitourinary,98-114,98,False,False,False,False,,Affirmed,,Hypothetical,... dryness\nBack pain: Has low grade pain w/ ...,False,False
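
To make the 'Exact_Match?', 'Overlap?' and 'Duplicate_Matches?' columns concrete, here is a minimal sketch of one way to compare spans by character offsets. It is not AnnotationAggregator's implementation: `spans_overlap` and `match_annotation` are hypothetical helpers, offsets are treated as end-exclusive, and "duplicate" is taken to mean that one annotation overlaps more than one annotation from the other set, which is how row 3 above reads.

```python
def spans_overlap(a, b):
    """True if two (start, end) character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def match_annotation(span, candidates):
    """Compare one (start, end) span against (text, start, end) candidates."""
    hits = [c for c in candidates if spans_overlap(span, (c[1], c[2]))]
    return {
        # Overlapping candidates are joined with " || ", mirroring the table above.
        "matched_text": " || ".join(c[0] for c in hits),
        "exact": len(hits) == 1 and (hits[0][1], hits[0][2]) == span,
        "overlap": bool(hits),
        "duplicate": len(hits) > 1,
    }


# Example_2 annotations from the second set, copied from the imported dataframe.
set2 = [("painful breathing", 7, 24), ("reflux", 60, 66), ("GERD", 78, 82),
        ("back", 86, 90), ("pain", 91, 95)]

reflux = match_annotation((60, 66), set2)     # same offsets in both sets
back_pain = match_annotation((86, 95), set2)  # covers both "back" and "pain"
print(reflux)
print(back_pain)
```

Under these assumptions the two results line up with rows 1 and 3 of the agreement dataframe: "reflux" is an exact match, while "back pain" overlaps two shorter annotations and is flagged as a duplicate match.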


In [19]:
Agg.get_agreement_metrics()['spacyDoc_set1-ehost_import_df'].keys()

dict_keys(['span_metrics', 'token_level_metrics', 'label_metrics', 'overall_label_metrics', 'attr_metrics', 'overall_attr_metrics', 'attr_metrics_hier', 'overall_attr_metrics_hier', 'rel_metrics', 'overall_rel_metrics'])

In [20]:
Agg.get_agreement_metrics()['spacyDoc_set1-ehost_import_df']['label_metrics']

| | TP | FP | FN | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| Gastrointestinal_and_Genitourinary | 3 | 1 | 1 | 0.75 | 0.75 | 0.75 |
| Anatomical_Location | 1 | 2 | 0 | 1.0 | 0.333333 | 0.5 |
| Pain | 4 | 3 | 0 | 1.0 | 0.571429 | 0.727273 |
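
The Recall, Precision and F1 columns above are consistent with the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The sketch below recomputes them from the TP/FP/FN counts in the table. `prf` is a hypothetical helper written for this illustration, not a function from AnnotationAggregator, and returning 0.0 when a denominator is zero is an assumption.

```python
def prf(tp, fp, fn):
    """Return (recall, precision, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0.0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1


# (TP, FP, FN) counts copied from the label_metrics output above.
counts = {
    "Gastrointestinal_and_Genitourinary": (3, 1, 1),
    "Anatomical_Location": (1, 2, 0),
    "Pain": (4, 3, 0),
}

# Recompute the metrics per label, rounded to six decimals like the table.
results = {label: tuple(round(v, 6) for v in prf(*c)) for label, c in counts.items()}
for label, (recall, precision, f1) in results.items():
    print(f"{label}: recall={recall} precision={precision} f1={f1}")
```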


In [21]:
Agg.get_agreement_metrics()['spacyDoc_set1-ehost_import_df']['overall_label_metrics']

| | TP | FP | FN | Recall | Precision | F1 (Micro) | F1 (Macro) |
|---|---|---|---|---|---|---|---|
| Overall Label Metrics | 8 | 6 | 1 | 0.888889 | 0.571429 | 0.695652 | 0.659091 |
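
The two F1 columns differ in how they aggregate across labels. From the numbers shown, 'F1 (Micro)' matches the F1 computed from the pooled TP/FP/FN counts, and 'F1 (Macro)' matches the unweighted mean of the per-label F1 scores. The sketch below reproduces both from the per-label counts. The helper name `f1_score` is hypothetical and the code is independent of AnnotationAggregator.

```python
def f1_score(tp, fp, fn):
    """F1 written directly in terms of counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: 0.0 when undefined


# (TP, FP, FN) per label, copied from the label_metrics output.
per_label = [
    (3, 1, 1),  # Gastrointestinal_and_Genitourinary
    (1, 2, 0),  # Anatomical_Location
    (4, 3, 0),  # Pain
]

# Micro: pool the counts across labels first, then compute a single F1.
tp, fp, fn = (sum(col) for col in zip(*per_label))
micro_f1 = f1_score(tp, fp, fn)

# Macro: compute F1 per label, then take the unweighted mean.
macro_f1 = sum(f1_score(*c) for c in per_label) / len(per_label)

print(tp, fp, fn)
print(round(micro_f1, 6), round(macro_f1, 6))
```

Micro averaging weights every annotation equally, so frequent labels dominate. Macro averaging weights every label equally, so a rare label with poor agreement pulls the score down more.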


You can also write this information (along with a couple of visualizations) to HTML and CSV files using the 'generate_report()' method.

In [22]:
Agg.generate_report()