# Tutorial

In this tutorial I will go over the basic, front-end usage for calculating inter-annotator agreement (IAA) between two annotators.

## Corpus Agreement between two lists of spaCy documents

In [2]:
import spacy
import medspacy
import pandas
import sys
sys.path.insert(1, './Integrated_code/')
import IAA_ as IAA

In [3]:
# If the models are not installed yet, download them first:
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md
nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_md")

#Note for John: Get better examples or make my own entities
doc1 = nlp1("this is a test document made in utah or mississippi, or salt lake city.")
doc2 = nlp2("this is a test document made in utah or mississippi, or salt lake city.")

print('doc1.ents: ',doc1.ents)
print('doc2.ents: ',doc2.ents)

doc1.ents:  (utah, mississippi)
doc2.ents:  (utah, mississippi, lake city)


Above, we made two documents by running the same sentence through two spaCy models that include NER. Document 2 contains one more entity ("lake city") than document 1. Let's calculate the IAA between these documents!

In [4]:
IAA.corpus_agreement([doc1],[doc2])

| Unnamed: 0 | IAA | Recall | Precision | True Positives | False Positives | False Negative |
|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.666667 | 1.0 | 2 | 1 | 0 |


'corpus_agreement' calculates the agreement between two lists of documents, two lists containing inner lists/tuples of entities/spans, or two dataframes. Note the brackets around 'doc1' and 'doc2': corpus_agreement assumes you are passing lists of documents, so even a single document must be wrapped in a list.

Let's look at other inputs for IAA.

## Corpus Agreement between two dataframes

corpus_agreement can accept two dataframes (one for each annotation set) provided they are structured correctly. Below are the default column names that the code looks for:

__'start loc'__ : column containing starting positions of ents

__'end loc'__ : column containing ending positions of ents

__'Concept Label'__ : column containing label of ent. Only applicable if labels=1.

__'doc name'__ : column containing document name/file name. The code will calculate true positives, false positives, and false negatives (tp, fp, fn) for each document name (i.e. the dataframes are segmented by document name, then each resulting pair of dataframes is processed in turn).

For the code to read your dataframes, they must follow this structure, including the exact column names. Alternatively, you can edit the default column name strings that the code searches for at the top of the IAA_ code.

In [None]:
#IAA.corpus_agreement(df1,df2)
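
As a rough illustration of the structure described above, the sketch below builds two small dataframes with the default column names. The document name `doc_a.txt`, the `GPE` label, and the helper `make_rows` are made up for this example, and it assumes that 'start loc' and 'end loc' are character offsets with an exclusive end; check the IAA_ code for the convention it actually expects.

```python
import pandas as pd

# The example sentence used earlier in this tutorial.
text = "this is a test document made in utah or mississippi, or salt lake city."

def make_rows(doc_name, phrases, label="GPE"):
    """Build one row per phrase using the default column names.

    Hypothetical helper. Assumption: offsets are character positions
    and 'end loc' is exclusive.
    """
    rows = []
    for phrase in phrases:
        start = text.find(phrase)
        rows.append({
            "doc name": doc_name,
            "start loc": start,
            "end loc": start + len(phrase),
            "Concept Label": label,  # made-up label for illustration
        })
    return rows

# One dataframe per annotator; "doc_a.txt" is a made-up document name.
df1 = pd.DataFrame(make_rows("doc_a.txt", ["utah", "mississippi"]))
df2 = pd.DataFrame(make_rows("doc_a.txt", ["utah", "mississippi", "lake city"]))

print(df1)
print(df2)

# With IAA_ importable, the call from the cell above would then be:
# IAA.corpus_agreement(df1, df2)
```

If your annotations cover several documents, each row simply carries its own 'doc name' value, and both dataframes should use the same names for the same documents.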

## Corpus Agreement between two lists of lists/tuples of entities/spans

You can also manually make lists of all the spans/entities in a spaCy document and pass them to corpus_agreement.

In [7]:
IAA.corpus_agreement([doc1.ents],[doc2.ents])

({0: [0], 1: [1]}, {0: [0], 1: [1]})

Note that corpus_agreement internally converts spaCy documents into lists of their spans/entities before computing the overlaps. If you do this conversion yourself, you can pass the lists directly to the overlaps function, which returns mapping dictionaries of the entities. We'll talk about this more under 'Other functionality'.

## Arguments for corpus_agreement

'corpus_agreement' also takes optional arguments that change how matches are counted. Below are the arguments:

### corpus_agreement(docs1, docs2, loose=1, labels=1, ent_or_span='ent')

__docs1__: Either a list of spaCy documents, a list containing inner tuples/lists of entities/spans, a list of spangroups, or a dataframe.
    Treated as the gold-standard (reference) annotation when counting fp and fn.

__docs2__: Either a list of spaCy documents, a list containing inner tuples/lists of entities/spans, a list of spangroups, or a dataframe.

__loose__: Boolean. 1 counts any overlap as a match. 0 counts only exact matches.

__labels__: Boolean. 1 indicates that labels must also agree for two annotations to match.

__ent_or_span__: String of either 'ent' or 'span'. 'ent' indicates to compare doc.ents between documents. 'span' indicates to compare doc1's only spangroup (note that doc1 must have only 1 spangroup) with doc2's equivalently named spangroup. This argument is only relevant if passing in a list of spaCy documents (i.e. it can be ignored if passing in a list of tuples/lists of ents/spans/spangroups or a dataframe).
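
To make the effect of `loose` and `labels` concrete, here is a minimal sketch of a matching rule on plain `(start, end, label)` tuples. The function `spans_match` is hypothetical and is not part of IAA_; it only illustrates the two switches as described above, assuming character offsets with exclusive ends. The `LOC` label is invented for the example.

```python
def spans_match(a, b, loose=1, labels=1):
    """Return True if two (start, end, label) spans count as a match.

    Hypothetical helper, not part of IAA_. Ends are treated as exclusive.
    """
    a_start, a_end, a_label = a
    b_start, b_end, b_label = b
    if labels and a_label != b_label:
        # With labels=1, differing labels rule out a match.
        return False
    if loose:
        # Any shared character counts as an overlap.
        return a_start < b_end and b_start < a_end
    # Exact matching requires identical boundaries.
    return (a_start, a_end) == (b_start, b_end)

salt_lake_city = (56, 70, "GPE")  # "salt lake city" in the example sentence
lake_city = (61, 70, "GPE")       # "lake city"
lake_city_loc = (61, 70, "LOC")   # same span, different (made-up) label

print(spans_match(salt_lake_city, lake_city, loose=1))           # True
print(spans_match(salt_lake_city, lake_city, loose=0))           # False
print(spans_match(lake_city, lake_city_loc, loose=0, labels=1))  # False
print(spans_match(lake_city, lake_city_loc, loose=0, labels=0))  # True
```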

In [10]:
IAA.corpus_agreement([doc1],[doc2],loose=0,labels=0)

| Unnamed: 0 | IAA | Recall | Precision | True Positives | False Positives | False Negative |
|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.666667 | 1.0 | 2 | 1 | 0 |


## Other functionality

Internally, corpus_agreement first determines which kind of input you gave it (dataframes, lists of documents, or lists of lists/tuples of spans/entities). For lists of documents, it uses a helper function to convert each document into a list/tuple of that document's spans/entities.

corpus_agreement then calls an overlap function, which returns two dictionaries that map each matched entity/span to its matches in the other annotation set. These dictionaries are used to calculate the true positives, false positives, and false negatives using the 'conf_matrix' function.
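
The two steps just described can be sketched in plain Python. The functions `overlaps_sketch` and `conf_matrix_sketch` are hypothetical stand-ins, not the IAA_ implementations: they assume `(start, end)` character offsets with exclusive ends, loose matching, and a counting rule chosen to be consistent with the outputs recorded in this tutorial. The real functions may handle labels, exact matching, and many-to-one overlaps differently.

```python
def overlaps_sketch(spans1, spans2):
    """Map each span index to the indices it overlaps in the other list.

    Hypothetical stand-in for an overlap function, using loose matching.
    """
    map1, map2 = {}, {}
    for i, (s1, e1) in enumerate(spans1):
        for j, (s2, e2) in enumerate(spans2):
            if s1 < e2 and s2 < e1:  # the spans share at least one character
                map1.setdefault(i, []).append(j)
                map2.setdefault(j, []).append(i)
    return map1, map2

def conf_matrix_sketch(map1, map2, n1, n2):
    """Derive (tp, fp, fn), treating the first list as the reference.

    Assumption: tp counts reference spans with at least one match, fn counts
    reference spans with none, and fp counts second-list spans with none.
    """
    tp = len(map1)
    fn = n1 - len(map1)
    fp = n2 - len(map2)
    return tp, fp, fn

ents1 = [(32, 36), (40, 51)]            # utah, mississippi
ents2 = [(32, 36), (40, 51), (61, 70)]  # utah, mississippi, lake city

map1, map2 = overlaps_sketch(ents1, ents2)
print(map1, map2)  # {0: [0], 1: [1]} {0: [0], 1: [1]}
print(conf_matrix_sketch(map1, map2, len(ents1), len(ents2)))  # (2, 1, 0)
```

Unmatched spans never appear as keys, which is why the list lengths have to be passed in separately to count them.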

corpus_agreement then uses tp,fp,fn to calculate precision, recall, and f1 (using 'pairwise_f1').
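
For reference, the arithmetic behind this step can be written out as below, using the textbook definitions of precision, recall, and F1. The function `precision_recall_f1` is a hypothetical sketch, not the library's 'pairwise_f1'. With the counts from this tutorial (tp=2, fp=1, fn=0) the textbook definitions give precision 2/3 and recall 1.0, whereas the table output earlier in this tutorial lists 0.666667 under Recall and 1.0 under Precision, so IAA_ may label the two columns by a different convention. The F1 value is the same either way.

```python
def precision_recall_f1(tp, fp, fn):
    """Textbook precision, recall and F1 from confusion counts.

    Hypothetical helper; returns 0.0 where a ratio is undefined.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=2, fp=1, fn=0)
print(round(precision, 6), round(recall, 6), round(f1, 6))  # 0.666667 1.0 0.8
```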

Let's try using these other functions individually.

### overlaps() and df_overlaps()

In [11]:
mapping_dictionaries = IAA.overlaps(doc1.ents,doc2.ents)
mapping_dictionaries

({0: [0], 1: [1]}, {0: [0], 1: [1]})

In [15]:
tp,fp,fn = IAA.conf_matrix(mapping_dictionaries[0],mapping_dictionaries[1],len(doc1.ents),len(doc2.ents))
print(tp,fp,fn)

2 1 0


In [16]:
IAA.pairwise_f1(tp,fp,fn)

0.8