# Tutorial

In this tutorial I will go over the basic, front-end usage for calculating inter-annotator agreement (IAA) between two sets of annotations.

## Corpus agreement between two lists of spacy documents

In [1]:
import spacy
import medspacy
import pandas
import sys
sys.path.insert(1, './Integrated_code/')
import IAA_ as IAA

In [2]:
# If the models are not installed yet, download them first:
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md
nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_md")

#Note for John: Get better examples or make my own entities
doc1 = nlp1("this is a test document made in utah or mississippi, or salt lake city.")
doc2 = nlp2("this is a test document made in utah or mississippi, or salt lake city.")

print('doc1.ents: ',doc1.ents)
print('doc2.ents: ',doc2.ents)

doc1.ents:  (utah, mississippi)
doc2.ents:  (utah, mississippi, lake city)


Above, we made two documents from the same sentence using two of spaCy's pretrained pipelines, each with its own NER model. Document 2 has one more entity ("lake city") than document 1. Let's calculate the IAA between these documents!

In [3]:
IAA.corpus_agreement([doc1],[doc2])[0]

| | IAA | Recall | Precision | True Positives | False Positives | False Negative |
|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.666667 | 1.0 | 2 | 1 | 0 |


'corpus_agreement' calculates the agreement between two lists of documents, two lists containing inner lists/tuples of entities/spans, or two dataframes. Note the brackets around 'doc1' and 'doc2': corpus_agreement expects lists of documents, so each document is wrapped in a list.

Also note that we are selecting the first element of the returned value. This is because the code actually returns a list of 2 elements. We will look at the second element later in the tutorial under "The returned dataframe with mappings".

Let's look at other arguments for IAA.

## Corpus Agreement between two dataframes

*corpus_agreement* can accept two dataframes (one for each annotation set) provided they are structured correctly. Below are the default column names that the code looks for:

__'start loc'__ : column containing starting positions of ents

__'end loc'__ : column containing ending positions of ents

__'Concept Label'__ : column containing label of ent. Only applicable if labels=1.

__'doc name'__ : column containing document name/file name. The code will calculate tp,fp,fn for each document name (i.e., the dataframes are segmented by document name, then each resulting pair of dataframes is passed along iteratively)

__input attribute column names__ : This includes any column names passed in through the "attributes" argument list (we will talk more about this below)

You must use the correct dataframe column names for the code to read your dataframes. Alternatively, you can edit the default column name strings the code searches for at the top of the IAA_ code.

Note that these are the same default columns created by the medspacy ereader code.

__Using dataframes as input is the preferred way to use *corpus_agreement*.__

In [None]:
#IAA.corpus_agreement(df1,df2)[0]
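
As a rough illustration of the expected layout, the sketch below builds the rows of two small annotation tables with the default column names listed above, then groups them by 'doc name' in the way the description of that column suggests. The file names, offsets, labels, and the helper `rows_by_doc` are made up for this example and are not part of the IAA_ code.

```python
from collections import defaultdict

# Hypothetical rows for annotation 1 and annotation 2, keyed by the default
# column names listed above. All values are invented for illustration.
rows_1 = [
    {"doc name": "note_a.txt", "start loc": 32, "end loc": 36, "Concept Label": "GPE"},
    {"doc name": "note_a.txt", "start loc": 40, "end loc": 51, "Concept Label": "GPE"},
    {"doc name": "note_b.txt", "start loc": 5, "end loc": 12, "Concept Label": "GPE"},
]
rows_2 = [
    {"doc name": "note_a.txt", "start loc": 32, "end loc": 36, "Concept Label": "GPE"},
    {"doc name": "note_b.txt", "start loc": 5, "end loc": 12, "Concept Label": "GPE"},
]


def rows_by_doc(rows):
    """Group annotation rows by their 'doc name' value (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["doc name"]].append(row)
    return dict(grouped)


grouped_1 = rows_by_doc(rows_1)
grouped_2 = rows_by_doc(rows_2)

# Each document name yields one pair of row groups to compare.
for name in sorted(set(grouped_1) | set(grouped_2)):
    print(name, len(grouped_1.get(name, [])), len(grouped_2.get(name, [])))
```

Passing either list to `pandas.DataFrame(...)` gives a dataframe with these four columns, which is the shape described above.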

## Corpus Agreement between two lists of lists/tuples of entities/spans

You can also manually make lists of all the spans/entities in each spacy document and pass them to corpus_agreement.

In [5]:
IAA.corpus_agreement([doc1.ents],[doc2.ents])[0]

| | IAA | Recall | Precision | True Positives | False Positives | False Negative |
|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.666667 | 1.0 | 2 | 1 | 0 |


Note that corpus_agreement internally converts spacy documents into lists of the spans/entities in each document before computing the overlaps. If you do this yourself, you can pass the lists directly to the overlaps function, which will return mapping dictionaries of the entities. We'll talk about this more under 'Other functionality'.

## Arguments for corpus_agreement

'corpus_agreement' can also take other arguments to be more flexible with other IAA methods. Below are the arguments:

### corpus_agreement(docs1, docs2, loose=1, labels=1,ent_or_span='ent',attributes=[ ])

__docs1__: Either a list of spacy documents, a list containing inner tuples/lists of entities/spans, a list of spangroups, or a dataframe with the proper column names. Considered the gold/reference annotation for tp,fp,fn.
    
__docs2__: Expects the same types of inputs as docs1. Either a list of spacy documents, list of tuples/lists of entities/spans, list of spangroups, or a dataframe.

__loose__: Boolean. 1 counts any overlap between entities as a match. 0 counts only exact matches.

__labels__: Boolean. 1 uses entity labels as an additional matching criterion.

__ent_or_span__: String of either 'ent' or 'span'. 'ent' indicates to compare doc.ents between documents. 'span' indicates to compare doc1's only spangroup (note that doc1 must have only 1 spangroup) with doc2's equivalently named spangroup. This argument is only relevant if passing in a list of spacy documents (i.e., it can be ignored if passing in a list of tuples/lists of ents/spans/spangroups or a dataframe).
    
__attributes__: List containing column names. These columns will be compared between the two annotations as additional matching criteria. Only applies to dataframes.

In [10]:
IAA.corpus_agreement([doc1],[doc2],loose=0,labels=0)[0]

| | IAA | Recall | Precision | True Positives | False Positives | False Negative |
|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.666667 | 1.0 | 2 | 1 | 0 |
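
To make the effect of `loose` and `labels` concrete, here is a minimal sketch of a match test on (start, end, label) tuples. It only mirrors the documented meaning of the two arguments and is not the IAA_ implementation; the function name `spans_match` and the labels are assumptions, and the offsets are the character positions of "salt lake city" and "lake city" in the example sentence.

```python
def spans_match(span1, span2, loose=1, labels=1):
    """Decide whether two (start, end, label) tuples count as a match.

    Illustrative only: this follows the documented meaning of `loose`
    and `labels`, not the actual IAA_ code.
    """
    start1, end1, label1 = span1
    start2, end2, label2 = span2
    if loose:
        # Any shared character counts (end offsets are exclusive).
        positions_ok = start1 < end2 and start2 < end1
    else:
        # Both boundaries must be identical.
        positions_ok = (start1, end1) == (start2, end2)
    labels_ok = (label1 == label2) if labels else True
    return positions_ok and labels_ok


salt_lake_city = (56, 70, "GPE")  # "salt lake city"
lake_city = (61, 70, "GPE")       # "lake city"
lake_city_org = (61, 70, "ORG")   # same span, different (made-up) label

print(spans_match(salt_lake_city, lake_city, loose=1))           # True
print(spans_match(salt_lake_city, lake_city, loose=0))           # False
print(spans_match(lake_city, lake_city_org, loose=0, labels=1))  # False
print(spans_match(lake_city, lake_city_org, loose=0, labels=0))  # True
```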


## The returned dataframe with mappings

The second element returned by *corpus_agreement* is a dataframe containing relevant information on all entities in the two documents, including information on matches and matching criteria. Note that (as it stands) this functionality only works when passing dataframes into *corpus_agreement*. This is one reason dataframes are preferred; the other is that the attributes argument only works with dataframes.

Here is an example of what this dataframe looks like:

In [None]:
#IAA.corpus_agreement(df1,df2,loose=1,labels=1)[1]

The dataframe contains a row for every entity included in the input dataframes. If entities match between both annotations, the matching entities are included in the same row. If an entity is included in one annotation but not the other, the columns for the other annotation are left blank.

Here are the descriptions of the columns within this dataframe:

__doc name__: Name of the document from which the entity(s) came. Derived from the doc name of the input dataframe.

__Annotation_1__: Entity text from annotation 1.

__Annotation_2__: Entity text from annotation 2.

__Annot_1_label__: The label of the entity from annotation 1.

__Annot_1_char__: Character positions of annotation 1.

__Annot_2_label__: The label of the entity from annotation 2.

__Annot_2_char__: Character positions of annotation 2.

__Overall_start_char__: The earliest starting character position between the two entities. Used to sort the entities within each document.

__Exact Match?__: Boolean indicating if there is an exact match between the starting and ending characters of the matched entities within the row. (Note that this does not include exact matches of labels and attributes, unlike the case when loose=0)

__Duplicate Matches?__: Boolean indicating if an entity included in the row matches multiple entities of the other annotation.

__Overlap?__: Boolean indicating if there is any overlap (i.e., match) between entities in this row. In other words, this will be a 1 if there is at least one entity in Annotation_1 and at least one entity in Annotation_2.

## Other functionality and specifics for calculation of tp,fp,fn

Internally, corpus_agreement determines which kind of input you gave it (dataframes, lists of documents, or lists of lists/tuples of spans/entities). For lists of documents, it uses a helper function to convert the documents into lists of lists/tuples of each document's spans/entities.

corpus_agreement then calls an overlap function, which returns 2 dictionaries with mappings of all matched entities/spans. These dictionaries are used to calculate the true positives, false positives, and false negatives using the 'conf_matrix' function.

corpus_agreement then uses tp,fp,fn to calculate precision, recall, and f1 (using the 'pairwise_f1' function).

*corpus_agreement* will then call *create_agreement_df*, which uses the dictionary mappings and initial dataframes to construct the returned dataframe (containing the entity information and matches). This step only applies when dataframes were input as arguments for *corpus_agreement*.

Let's try using these other functions individually.

### *overlaps* for ents and span lists and *df_overlaps* for dataframes

In [11]:
mapping_dictionaries = IAA.overlaps(doc1.ents,doc2.ents)
mapping_dictionaries

({0: [0], 1: [1]}, {0: [0], 1: [1]})
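
For intuition, the mapping dictionaries can be reproduced with a small stand-alone sketch that works on (start, end) character offsets instead of spaCy spans. The helpers `overlap_mappings` and `find_span` are hypothetical and only approximate the loose (any-overlap) behaviour described earlier; they are not the IAA_ code.

```python
def overlap_mappings(spans1, spans2):
    """Return two dicts mapping each matched span index to the indices it overlaps.

    spans1 and spans2 are lists of (start, end) character offsets with
    exclusive ends. Unmatched spans get no key. Hypothetical helper,
    not the IAA_ implementation.
    """
    map1, map2 = {}, {}
    for i, (start1, end1) in enumerate(spans1):
        for j, (start2, end2) in enumerate(spans2):
            if start1 < end2 and start2 < end1:  # any shared character
                map1.setdefault(i, []).append(j)
                map2.setdefault(j, []).append(i)
    return map1, map2


text = "this is a test document made in utah or mississippi, or salt lake city."


def find_span(phrase):
    """Locate a phrase in the example sentence as a (start, end) tuple."""
    start = text.find(phrase)
    return (start, start + len(phrase))


spans1 = [find_span("utah"), find_span("mississippi")]
spans2 = [find_span("utah"), find_span("mississippi"), find_span("lake city")]

print(overlap_mappings(spans1, spans2))
# ({0: [0], 1: [1]}, {0: [0], 1: [1]})
```

"lake city" has no key in the second dictionary because nothing in the first list overlaps it, which is why it is later counted as a false positive.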

### *conf_matrix* 
This uses the mapping dictionaries and the number of entities in each annotation to calculate tp,fp,fn. The exact calculations are described in a section below.

In [15]:
tp,fp,fn = IAA.conf_matrix(mapping_dictionaries[0],mapping_dictionaries[1],len(doc1.ents),len(doc2.ents))
print(tp,fp,fn)

2 1 0


### *pairwise_f1*
Finally, tp,fp,fn can be used to calculate pairwise f1.

In [16]:
IAA.pairwise_f1(tp,fp,fn)

0.8
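
The value 0.8 is what the standard F1 formula gives for these counts. The sketch below assumes that standard definition, F1 = 2·tp / (2·tp + fp + fn); `f1_from_counts` is a hypothetical stand-in, not the library function.

```python
def f1_from_counts(tp, fp, fn):
    """Standard pairwise F1: the harmonic mean of precision and recall.

    Hypothetical stand-in for illustration; returns 0.0 when there are
    no annotations at all, to avoid dividing by zero.
    """
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


print(f1_from_counts(2, 1, 0))  # 0.8
```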

### Calculations for true positives (tp), false positives (fp), and false negatives (fn) based on mappings

Below is a description of the calculations that go into tp, fp, and fn. The high level description is given first, followed by the technical description and an example.

Note that annotation 1 is considered the gold/reference standard.

__True Positives (tp):__

<u>High level:</u> 

tp describes the number of annotations in the reference standard (annotation_1) that have a match in annotation_2. This is equivalent to the number of keys in annotation_1's mapping dictionary, since keys are only placed in the mapping dictionary if there is a match in the other annotation.

*However*, an exception to this rule is the case where two or more entities in annotation_1 match the same entity in annotation_2. In these cases, we do not count the "duplicate" matches as multiple true positives. To correct for this, we look at annotation_2's mapping dictionary for cases where an annotation in annotation_2 maps to multiple annotations in annotation_1. We use this information to subtract out the duplicated matches. In other words, if 2 or more annotations from annotation_1 map to the same entity in annotation_2, this will only count as 1 tp, 0 fp, and 0 fn.

If one entity from annotation_1 maps to several annotations in annotation_2, this also counts as 1 tp, 0 fp, and 0 fn.

<u>Technical description:</u>

tp = doc1_match_num - duplicate_matches

where doc1_match_num is the number ("len()") of keys in the mapping dictionary for annotation_1, and duplicate_matches is the sum of the lengths of all lists in the values of annotation_2's mapping dictionary, minus the number of keys in annotation_2's mapping dictionary.

<u>Example:</u>

For example, suppose annotation_1's mapping dictionary, annot1_mapping, is "{1: [4], 2: [4], 3: [9, 10]}" and annot2_mapping is "{4: [1, 2], 9: [3], 10: [3]}". The number of keys in annot1_mapping is len(annot1_mapping.keys()) = 3. The summed lengths of annot2_mapping's values minus its number of keys is 2 (for "len([1,2])") + 1 (for "len([3])") + 1 (for "len([3])") - 3 (one for each key: "4", "9", and "10"), for a total of 1. Altogether, this is 3 - 1 = 2 for tp.
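
The same arithmetic can be checked in a few lines of Python. This is a sketch of the formula as written above; `count_true_positives` is a hypothetical name, not the 'conf_matrix' function itself.

```python
def count_true_positives(annot1_mapping, annot2_mapping):
    """tp = doc1_match_num - duplicate_matches, following the formula above."""
    doc1_match_num = len(annot1_mapping)
    # Every extra index in a value list means several annotation_1
    # entities matched the same annotation_2 entity.
    duplicate_matches = (
        sum(len(matches) for matches in annot2_mapping.values())
        - len(annot2_mapping)
    )
    return doc1_match_num - duplicate_matches


annot1_mapping = {1: [4], 2: [4], 3: [9, 10]}
annot2_mapping = {4: [1, 2], 9: [3], 10: [3]}
print(count_true_positives(annot1_mapping, annot2_mapping))  # 2
```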

__False Positives (fp):__

<u>High level:</u>

fp describes the number of annotations in annotation_2 that did not match an annotation in annotation_1. This can be described as the total number of annotations made in annotation_2, minus the number of annotations in annotation_2 that matched.

<u>Technical Description:</u>

fp = doc2_ent_num - doc2_match_num

where doc2_ent_num is the total number of annotations (entities) in annotation_2, and doc2_match_num is the number ("len()") of keys in the mapping dictionary for annotation_2.

<u>Example:</u>

If there were 10 annotations made in annotation_2, and the mapping dictionary is "{4:[1,2],9:[3],10:[3]}", then fp would be 10 - 3 ("len({4:[1,2],9:[3],10:[3]})") = 7.

__False Negatives (fn):__

<u>High level:</u>

fn describes the number of annotations in annotation_1 that did not match an annotation in annotation_2. This can be described as the total number of annotations made in annotation_1, minus the number of annotations in annotation_1 that matched.

<u>Technical Description:</u>

fn = doc1_ent_num - doc1_match_num

where doc1_ent_num is the total number of annotations (entities) in annotation_1, and doc1_match_num is the number ("len()") of keys in the mapping dictionary for annotation_1.

<u>Example:</u>

If there were 10 annotations made in annotation_1, and annotation_1's mapping dictionary is "{1: [4], 2: [4], 3: [9, 10]}", then fn would be 10 - 3 ("len({1: [4], 2: [4], 3: [9, 10]})") = 7.
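
The fp and fn formulas can be sketched the same way. `count_fp_fn` is a hypothetical helper that applies the two subtractions described above. The first call uses the example dictionaries with 10 annotations on each side, and the second uses the mappings and entity counts from the tutorial documents (2 entities in doc1, 3 in doc2).

```python
def count_fp_fn(annot1_mapping, annot2_mapping, doc1_ent_num, doc2_ent_num):
    """Return (fp, fn) from the mapping dictionaries and entity totals."""
    fp = doc2_ent_num - len(annot2_mapping)  # annotation_2 entities with no match
    fn = doc1_ent_num - len(annot1_mapping)  # annotation_1 entities with no match
    return fp, fn


annot1_mapping = {1: [4], 2: [4], 3: [9, 10]}
annot2_mapping = {4: [1, 2], 9: [3], 10: [3]}
print(count_fp_fn(annot1_mapping, annot2_mapping, 10, 10))  # (7, 7)

# Mappings from the tutorial documents: doc2's extra "lake city" is the one fp.
print(count_fp_fn({0: [0], 1: [1]}, {0: [0], 1: [1]}, 2, 3))  # (1, 0)
```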