# Spacy/Medspacy IAA

## Resources

Prodigy forum answer about IAA for spans: https://support.prodi.gy/t/proper-way-to-calculate-inter-annotator-agreement-for-spans-ner/5760

spaCy `Scorer` API: https://spacy.io/api/scorer

## End Goal

### Functionality

Provide a collection of methods to evaluate inter-annotator agreement (IAA) between _n_ arbitrary spaCy `Doc` objects, plus methods that aid error analysis, such as listing the differences between docs.

Priorities:
* Pairwise F1
    * configurable strict/loose matching
    * configurable inclusion of labels/attributes (calculate just span vs span+class agreement)

* Imported python files
    * reasonable docstrings on methods/classes
    
* Unit tests
    * add CI to repo for automated testing later

Extra features:
* List of differences between docs
* 

Expected challenges:
* spaCy scorer functions are useful, but they _only_ do strict span matching
* Fewer resources are available for comparisons between 3+ docs
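
One common way to extend a pairwise measure to 3+ annotators is to average it over every unordered pair. The following is a minimal sketch of that idea, not a settled design: `span_f1` and `mean_pairwise_f1` are hypothetical names, each annotation is assumed to be a list of `(start_char, end_char, label)` tuples, matching is strict, and returning 0.0 for two empty annotations is an assumed convention.

```python
from itertools import combinations


def span_f1(spans_a, spans_b):
    """Strict-match F1 between two collections of (start_char, end_char, label) tuples."""
    a, b = set(spans_a), set(spans_b)
    tp = len(a & b)  # spans both annotators agree on exactly
    fp = len(a - b)  # spans only in a
    fn = len(b - a)  # spans only in b
    denom = 2 * tp + fp + fn
    # Assumption: F1 is undefined when neither side has spans; return 0.0 here.
    return 2 * tp / denom if denom else 0.0


def mean_pairwise_f1(annotations):
    """Average strict-match F1 over every unordered pair of annotators."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(span_f1(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical annotators of the same text.
annotators = [
    [(0, 9, "PERSON"), (37, 41, "GPE")],
    [(4, 9, "PERSON"), (37, 41, "GPE")],
    [(0, 9, "PERSON"), (37, 41, "GPE")],
]
score = mean_pairwise_f1(annotators)
```

With spaCy docs, the tuples could be built from `doc.ents` using `ent.start_char`, `ent.end_char`, and `ent.label_`.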


In [47]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [48]:
doc1 = nlp("Mr. Spacy is a test document made in utah.")
for ent in doc1.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Spacy 4 9 PERSON
utah 37 41 GPE


In [49]:
doc2 = nlp("Mr. Spacy is a test document made in utah.")

Now we want to modify the named entity in `doc2` so that the full span 'Mr. Spacy' is labeled PERSON.

In [50]:
from spacy.tokens import Span

# Create a span for the new entity
fb_ent = Span(doc2, 0, 2, label="PERSON")
orig_ents = list(doc2.ents)

# Modify the provided entity spans, leaving the rest unmodified
doc2.set_ents([fb_ent], default="unmodified")

for ent in doc2.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Mr. Spacy 0 9 PERSON
utah 37 41 GPE
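
The two docs now disagree on one entity boundary, which makes them a handy case for the "list of differences" feature. A minimal sketch, assuming each doc's entities are first reduced to `(start_char, end_char, label)` tuples (the values below are copied from the printed outputs above; `span_differences` is a hypothetical helper name):

```python
def span_differences(spans_a, spans_b):
    """Return (only_in_a, only_in_b) for two collections of span tuples."""
    set_a, set_b = set(spans_a), set(spans_b)
    only_in_a = sorted(set_a - set_b)
    only_in_b = sorted(set_b - set_a)
    return only_in_a, only_in_b


# Entities of doc1 and doc2 as printed above.
doc1_spans = [(4, 9, "PERSON"), (37, 41, "GPE")]
doc2_spans = [(0, 9, "PERSON"), (37, 41, "GPE")]

only_in_doc1, only_in_doc2 = span_differences(doc1_spans, doc2_spans)
print(only_in_doc1)
print(only_in_doc2)
```

For real docs, the tuples could be built with `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`.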


# Inter-Annotator Agreement

In [55]:
def overlaps(ent1, ent2):  # compare the spans
    '''Calculate whether two ents overlap by at least one character.
    Returns bool.
    '''
    # end_char is exclusive, so spans that merely touch do not overlap
    return ent1.start_char < ent2.end_char and ent2.start_char < ent1.end_char

In [56]:
def exact_match(ent1, ent2):  # compare the spans
    '''Calculate whether two ents cover exactly the same characters.
    Returns bool.
    '''
    return ent1.start_char == ent2.start_char and ent1.end_char == ent2.end_char

In [58]:
def agreement(doc1, doc2, fuzzy):  # pairwise
    '''Calculate the agreement between two docs, treating doc2 as gold.
    Returns the counts (tp, fp, fn); true negatives are not defined for spans.
    '''
    ents1 = doc1.ents
    ents2 = doc2.ents

    # fuzzy: any character overlap counts; otherwise the offsets must be identical
    spans_match = overlaps if fuzzy else exact_match

    def same(ent1, ent2):
        return spans_match(ent1, ent2) and ent1.label_ == ent2.label_

    # an ent in doc1 is a true positive if some gold ent matches it,
    # otherwise it is a false positive
    tp = sum(1 for ent1 in ents1 if any(same(ent1, ent2) for ent2 in ents2))
    fp = len(ents1) - tp
    # a gold ent that no ent in doc1 matches is a false negative, so a
    # label mismatch on the same span counts once as fp and once as fn
    fn = sum(1 for ent2 in ents2 if not any(same(ent1, ent2) for ent1 in ents1))

    return (tp, fp, fn)
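
The `fuzzy` flag covers strict versus loose span matching; the priorities list also asks for optional label comparison (span-only versus span+class agreement). A minimal sketch of how both switches might be combined is below. Everything in it is illustrative: `spans_match`, `count_agreement`, `strict`, and `use_labels` are hypothetical names, and `Ent` is a stand-in for a spaCy `Span` so the example runs without a model.

```python
from collections import namedtuple

# Stand-in for a spaCy Span: only the attributes used for matching.
Ent = namedtuple("Ent", ["start_char", "end_char", "label_"])


def spans_match(ent1, ent2, strict=True, use_labels=True):
    """Return True if two ents match under the chosen configuration."""
    if strict:
        span_ok = (ent1.start_char, ent1.end_char) == (ent2.start_char, ent2.end_char)
    else:
        # Loose matching: any character overlap (end_char is exclusive).
        span_ok = ent1.start_char < ent2.end_char and ent2.start_char < ent1.end_char
    return span_ok and (not use_labels or ent1.label_ == ent2.label_)


def count_agreement(ents1, ents2, strict=True, use_labels=True):
    """Return (tp, fp, fn), treating ents2 as the reference annotation."""
    tp = sum(
        any(spans_match(e1, e2, strict, use_labels) for e2 in ents2) for e1 in ents1
    )
    fp = len(ents1) - tp
    fn = sum(
        not any(spans_match(e1, e2, strict, use_labels) for e1 in ents1) for e2 in ents2
    )
    return tp, fp, fn


a = [Ent(4, 9, "PERSON"), Ent(37, 41, "GPE")]
b = [Ent(0, 9, "ORG"), Ent(37, 41, "GPE")]

strict_with_labels = count_agreement(a, b)
loose_with_labels = count_agreement(a, b, strict=False)
loose_span_only = count_agreement(a, b, strict=False, use_labels=False)
```

Here the first entity differs in both boundary and label, so it only counts as agreement when matching is loose and labels are ignored.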

In [60]:
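def pairwise_f1(tp, fp, fn):
    '''calculate f1 with given true positive, false positive, and false negative values'''
    denominator = 2*tp + fp + fn
    if denominator == 0:
        # no annotations on either side: f1 is undefined, return 0.0 by convention
        return 0.0
    return 2*tp / denominator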
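
For error analysis it can help to see precision and recall alongside F1, and to check that the count-based formula above equals their harmonic mean. A minimal sketch with made-up counts (`precision_recall_f1` is a hypothetical helper, and returning 0.0 for undefined ratios is an assumed convention):

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, with annotator B treated as gold.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)

# Treating annotator A as gold instead swaps fp and fn.
p_swapped, r_swapped, f1_swapped = precision_recall_f1(tp=3, fp=2, fn=1)
```

Under strict matching, swapping which annotator is treated as gold exchanges precision and recall but leaves F1 unchanged, which is one reason F1 is a convenient agreement measure when neither annotator is a true gold standard.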

In [14]:
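def corpus_agreement(docs1, docs2, fuzzy):
    '''accumulate (tp, fp, fn) over an entire corpus of paired documents;
    pass the result to pairwise_f1 to get the corpus-level f1
    '''
    corpus_tp = 0
    corpus_fp = 0
    corpus_fn = 0

    # docs1[i] and docs2[i] are assumed to be two annotations of the same text
    for doc1, doc2 in zip(docs1, docs2):
        tp, fp, fn = agreement(doc1, doc2, fuzzy)
        corpus_tp += tp
        corpus_fp += fp
        corpus_fn += fn

    return (corpus_tp, corpus_fp, corpus_fn)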