# Tutorial for Annotation Import and Agreement (WIP)

This is the skeleton of a tutorial. The current plan is to finish it by the end of April.

## Import functionality

The AnnotationAggregator class helps import annotations from different file formats into spaCy and pandas DataFrame objects, which are widely used in the NLP and data science communities.

In [1]:
import spacy
import medspacy
import pandas
import sys
sys.path.insert(1, '../AnnotationAggregator/')
import AnnotationAggregator as AA

### spaCy import

We'll start by creating some example spaCy annotations and printing the extracted entities.

In [2]:
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md

nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_md")

#Note for development: Get better examples or make my own entities
doc1 = nlp1("Last year, I traveled to Dustin, Dallas, and Atlanta.")
doc2 = nlp2("Last year, I traveled to Dustin, Dallas, and Atlanta.")

print('doc1.ents: ',doc1.ents)
print('doc2.ents: ',doc2.ents)

doc1.ents:  (Last year, Dustin, Dallas, Atlanta)
doc2.ents:  (Last year, Dustin, Dallas, Atlanta)


AnnotationAggregator objects can be initialized with no parameters. (Note: if you are using DataFrame input, you currently must pass the DataFrames when instantiating the object. An upcoming update will make this work like the other input types.)

In [3]:
Agg = AA.AnnotationAggregator()

To add spaCy documents that carry entities, or spans stored in span groups, use the 'add_spacyDocs' method.

In [4]:
Agg.add_spacyDocs([doc1])
Agg.add_spacyDocs([doc2])

Now let's see what is stored in the 'Agg' object. Note that the getters always return dictionaries, with each key generally identifying a group of documents.

In [5]:
Agg.get_spacy_docs().keys(), Agg.get_raw_df().keys(), Agg.get_text().keys()

(dict_keys(['spacyDoc_set1', 'spacyDoc_set2']),
 dict_keys(['spacyDoc_set1', 'spacyDoc_set2']),
 dict_keys(['0']))

In [6]:
Agg.get_spacy_docs()['spacyDoc_set1']

{'0': Last year, I traveled to Dustin, Dallas, and Atlanta.}

In [7]:
Agg.get_raw_df()['spacyDoc_set1']

Unnamed: 0,DocID,annotatedSpan,spanStartChar,spanEndChar,spanLabel
0,0,Last year,0,9,DATE
1,0,Dustin,25,31,GPE
2,0,Dallas,33,39,GPE
3,0,Atlanta,45,52,GPE


In [8]:
Agg.get_text()['0']

'Last year, I traveled to Dustin, Dallas, and Atlanta.'
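To make the raw dataframe layout concrete, here is a minimal sketch of how one row per annotated span could be built from character offsets. It uses only the column names and offsets shown in the outputs above; the helper `spans_to_rows` is hypothetical and is not part of AnnotationAggregator, whose internals may differ.

```python
# Hypothetical helper, NOT part of AnnotationAggregator: it only illustrates
# the row layout (one row per annotated span) seen in get_raw_df().
text = "Last year, I traveled to Dustin, Dallas, and Atlanta."

# (start_char, end_char, label) triples copied from the output above.
spans = [(0, 9, "DATE"), (25, 31, "GPE"), (33, 39, "GPE"), (45, 52, "GPE")]


def spans_to_rows(doc_id, text, spans):
    """Build one dict per span, using the column names of the raw dataframe."""
    return [
        {
            "DocID": doc_id,
            "annotatedSpan": text[start:end],  # end offset is exclusive
            "spanStartChar": start,
            "spanEndChar": end,
            "spanLabel": label,
        }
        for start, end, label in spans
    ]


rows = spans_to_rows("0", text, spans)
for row in rows:
    print(row)
```

Slicing the text with the stored offsets recovers each annotated span, which is a quick sanity check that an import kept text and offsets aligned.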

### eHOST import

Pass a list of eHOST file paths to this method. Each path is treated as its own group of documents.

The code first imports the files as spaCy documents and then converts them to dataframes. You can access both using the getters shown above.

Note that the eHOST functions here rely on the medspacy-io package, which you can install using the cell below. For more extensive use cases and documentation of that code, see https://github.com/medspacy/medspacy_io

In [10]:
##Need to have example ehost file to import

In [None]:
#!pip install medspacy-io

In [15]:
#Agg.add_ehost_files(annot_dirs=***PUT list of filepaths here***)

In [None]:
##Agg.get_raw_df()

## Agreement (Performance)

Now that the data are imported, let's look at the pairwise agreement (performance) between each pair of document sets. Note that the agreement dataframes and the formal metrics are keyed by the names of the two document sets, separated by a hyphen. 'get_agreement_metrics()' returns an inner dictionary for each pair, holding the different metrics.

In [14]:
Agg.get_agreement_dict().keys() , Agg.get_agreement_metrics().keys()

(dict_keys(['spacyDoc_set1-spacyDoc_set2']),
 dict_keys(['spacyDoc_set1-spacyDoc_set2']))
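The hyphenated keys can be pictured as every unordered pair of document sets. The sketch below reproduces the key seen above with `itertools.combinations`; how AnnotationAggregator actually enumerates and orders pairs is an assumption here, not something confirmed by the library.

```python
from itertools import combinations

# Names of the document sets, as returned by get_spacy_docs().keys() above.
set_names = ["spacyDoc_set1", "spacyDoc_set2"]

# Assumption: one key per unordered pair, joined with a hyphen.
pair_keys = [f"{first}-{second}" for first, second in combinations(set_names, 2)]
print(pair_keys)
```

Under this assumption, three document sets would yield three keys and four sets would yield six.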

In [15]:
Agg.get_agreement_dict()['spacyDoc_set1-spacyDoc_set2']

Unnamed: 0,doc_name,Annotation_1,Annotation_2,Annot_1_label,Annot_1_char,Annot_2_label,Annot_2_char,Overall_start_char,Exact_Match?,Duplicate_Matches?,Overlap?,Matching_label?,context
0,0,Last year,Last year,DATE,0-9,DATE,0-9,0,True,False,True,True,"...Last year, I traveled to Dustin, Dallas, an..."
1,0,Dustin,Dustin,GPE,25-31,GPE,25-31,25,True,False,True,True,"...Last year, I traveled to Dustin, Dallas, an..."
2,0,Dallas,Dallas,GPE,33-39,GPE,33-39,33,True,False,True,True,"...Last year, I traveled to Dustin, Dallas, an..."
3,0,Atlanta,Atlanta,GPE,45-52,GPE,45-52,45,True,False,True,True,"...Last year, I traveled to Dustin, Dallas, an..."
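The boolean columns in this dataframe can be read as simple comparisons between two spans. The following is a minimal sketch under two assumptions: end offsets are exclusive (as with spaCy's `end_char`), and two spans overlap when they share at least one character. The function `compare_spans` is hypothetical and may not match how the library computes these flags.

```python
def compare_spans(span_1, span_2):
    """Compare two (start_char, end_char, label) spans from the same document."""
    start_1, end_1, label_1 = span_1
    start_2, end_2, label_2 = span_2
    return {
        # Same boundaries on both sides.
        "Exact_Match?": (start_1, end_1) == (start_2, end_2),
        # Half-open intervals share at least one character.
        "Overlap?": start_1 < end_2 and start_2 < end_1,
        "Matching_label?": label_1 == label_2,
    }


same = compare_spans((25, 31, "GPE"), (25, 31, "GPE"))      # identical spans
partial = compare_spans((25, 31, "GPE"), (28, 39, "GPE"))   # boundaries differ
disjoint = compare_spans((25, 31, "GPE"), (33, 39, "GPE"))  # no shared characters

print(same)
print(partial)
print(disjoint)
```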


In [17]:
Agg.get_agreement_metrics()['spacyDoc_set1-spacyDoc_set2'].keys()

dict_keys(['span_metrics', 'token_level_metrics', 'label_metrics', 'overall_label_metrics', 'attr_metrics', 'overall_attr_metrics', 'rel_metrics', 'overall_rel_metrics'])

In [21]:
Agg.get_agreement_metrics()['spacyDoc_set1-spacyDoc_set2']['label_metrics']

Unnamed: 0,TP,FP,FN,Recall,Precision,F1
DATE,1,0,0,1.0,1.0,1.0
GPE,3,0,0,1.0,1.0,1.0


In [24]:
Agg.get_agreement_metrics()['spacyDoc_set1-spacyDoc_set2']['overall_label_metrics']

Unnamed: 0,TP,FP,FN,Recall,Precision,F1 (Micro),F1 (Macro)
Overall Label Metrics,4,0,0,1.0,1.0,1.0,1.0
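For readers who want to check the numbers, the sketch below recomputes precision, recall, and F1 from the TP/FP/FN counts in the label table, using the standard definitions. Treating micro F1 as F1 over pooled counts and macro F1 as the unweighted mean of per-label F1 is the usual convention; whether the library handles edge cases such as zero denominators the same way is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# (TP, FP, FN) per label, copied from the 'label_metrics' output above.
counts = {"DATE": (1, 0, 0), "GPE": (3, 0, 0)}

per_label = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-label F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_label.values()) / len(per_label)

# Micro F1: pool the counts across labels first, then compute F1 once.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_precision, micro_recall, micro_f1 = precision_recall_f1(total_tp, total_fp, total_fn)

print(per_label)
print(total_tp, total_fp, total_fn, micro_precision, micro_recall, micro_f1, macro_f1)
```

The two averages coincide here because both sets agree perfectly. They diverge when a rare label scores much worse than a frequent one, since macro F1 weights every label equally.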


You can also generate an HTML file and a CSV file containing this information (along with a couple of visualizations) by using the 'generate_report()' method.

In [None]:
#Agg.generate_report()