# Spacy/Medspacy IAA

## Resources

Prodigy forum answer about IAA for spans https://support.prodi.gy/t/proper-way-to-calculate-inter-annotator-agreement-for-spans-ner/5760

Spacy scorer object https://spacy.io/api/scorer

## End Goal

### Functionality

Provide a collection of methods to evaluate inter-annotator agreement (IAA) between _n_ arbitrary spaCy `Doc` objects, plus methods that aid error analysis, such as listing the differences between documents.

Priorities:
* Pairwise F1
    * configurable strict/loose matching
    * configurable inclusion of labels/attributes (calculate just span vs span+class agreement)

* Importable Python files
    * reasonable docstrings on methods/classes
    
* Unit tests
    * add CI to repo for automated testing later

Extra features:
* List of differences between docs

Expected challenges
* Spacy scorer functions are useful, but _only_ do strict span matching
* Fewer resources (obviously?) available for comparisons between 3+ docs
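
For the 3+ document case, one common convention (not necessarily the one this project will settle on) is to average the pairwise F1 over every pair of annotators. The sketch below assumes each annotation is a plain `(start, end, label)` tuple and uses strict matching only; `f1_from_sets` and `mean_pairwise_f1` are hypothetical helper names, and treating two empty annotation sets as perfect agreement is an assumption.

```python
from itertools import combinations

def f1_from_sets(spans_a, spans_b):
    """Strict-match F1 between two sets of (start, end, label) tuples."""
    tp = len(spans_a & spans_b)
    fp = len(spans_b - spans_a)
    fn = len(spans_a - spans_b)
    denom = 2 * tp + fp + fn
    # Assumption: two empty annotation sets count as perfect agreement.
    return 2 * tp / denom if denom else 1.0

def mean_pairwise_f1(annotations):
    """Average F1 over every unordered pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(f1_from_sets(a, b) for a, b in pairs) / len(pairs)

# Hypothetical annotations of one document by three annotators.
annotator_a = {(0, 4, "GPE"), (10, 21, "GPE")}
annotator_b = {(0, 4, "GPE"), (10, 21, "PERSON")}
annotator_c = {(0, 4, "GPE")}

score = mean_pairwise_f1([annotator_a, annotator_b, annotator_c])
print(round(score, 3))
```

Because F1 is symmetric in the two annotators, the order within each pair does not change the result.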


# Testing Section

In [None]:
import spacy

In [None]:
#!python -m spacy download en_core_web_md

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_md")

In [None]:
doc = nlp("this is a test document made in utah or mississippi, or salt lake city.")

In [None]:
doc.ents

In [None]:
doc2 = nlp2("this is a test document made in utah or mississippi, or salt lake city.")

In [None]:
doc2.ents

In [None]:
from spacy.tokens import Span
spand = list()
spand += [Span(doc, 2, 4, label="PERSON"),Span(doc,7,8,label="GPE"),Span(doc,9,10,label="PERSON"),Span(doc,13,14,label="GPE"),Span(doc,14,15,label="GPE")]

# Add the span to the doc's entities
doc.ents = spand

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

In [None]:
#Run below cells before this
tp,fp,fn = agreement(doc,doc2,1,1)
print(tp,fp,fn)
print(pairwise_f1(tp,fp,fn))

In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

test_str = "This is a test document. For testing lol."
test_str_2 = "This is a test document. For testing lol."

doc = nlp(test_str)
doc2 = nlp(test_str_2)

from spacy.tokens import SpanGroup

spans = [doc[0:1], doc[1:3]]
group = SpanGroup(doc, name="errors", spans=spans, attrs={"annotator": "matt"})
doc.spans["errors"] = group
group = SpanGroup(doc, name="entity1", spans=spans, attrs={"annotator": "John"})
#doc.spans["entity"] = group


spans = [doc2[0:1], doc2[1:5],doc2[4:5]]
group = SpanGroup(doc2, name="errors", spans=spans, attrs={"annotator": "matt"})
doc2.spans["errors"] = group
group = SpanGroup(doc2, name="entity1", spans=spans, attrs={"annotator": "John"})
#doc2.spans["entity"] = group

In [8]:
IAA.corpus_agreement([doc.spans['errors']],doc.ents,labels=0)

Input Error: Input must be iterable of spacy documents, or dataframe.


In [5]:
from spacy.tokens import Span
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("this is a test document made in utah or mississippi, or salt lake city.")

spand = list()
spand += [Span(doc, 2, 4, label="PERSON"),Span(doc,7,8,label="GPE"),Span(doc,9,10,label="PERSON"),Span(doc,13,14,label="GPE"),Span(doc,14,15,label="GPE")]

# Add the span to the doc's entities
doc.ents = spand

print([(ent.text, ent.label_) for ent in doc.ents])

from spacy.tokens import SpanGroup

spans = [doc[7:8], doc[8:10]]
group = SpanGroup(doc, name="errors", spans=spans, attrs={"annotator": "matt"})
doc.spans["errors"] = group
print(doc.spans['errors'])
print(doc.spans['errors'][0].label_)

[('a test', 'PERSON'), ('utah', 'GPE'), ('mississippi', 'PERSON'), ('lake', 'GPE'), ('city', 'GPE')]
[utah, or mississippi]



In [None]:
print(doc.spans['errors'])
for span in doc.spans['errors']:
    print('what')

In [None]:
agreement(doc.ents,doc.spans['errors'],labels=0)

In [None]:
agreement(doc,doc2,ent_or_span='ent')

In [None]:
type(doc.ents[0]) is spacy.tokens.span.Span
type(doc.spans['errors']) is spacy.tokens.span_group.SpanGroup

type(doc)

# Code

In [None]:
from quicksectx import IntervalNode, IntervalTree, Interval #note that you need the quicksectx library

#In order to make the code a little more adaptable for situations of multiple overlapping entities, as well as for 
#transparency and testing the code, I wrote the overlaps code to output a mapping of which entities are being matched. 
#Then agreement can parse this output for how many valid overlaps exist.

#This makes the code a little more complicated to understand, but I think it makes everything more transparent and adaptable.

def overlaps(doc1_ents, doc2_ents,labels=1):
    '''Calculates overlapping entities between two spacy documents. Also checks for matching labels if labels=1.
    
    Return:
        Two dictionaries (doc1 -> doc2 and doc2 -> doc1) with the mapping of matching entity indices:
            keys: entity index from one annotation
            value: list of matched entity indices from the other annotation
        
        Ex: "{1 : [2] , 3 : [4,5]}" means that entity 1 from doc1 matches entity 2 in doc2, and entity 3 in doc1 matches 
        entities 4 and 5 from doc2.
    '''
    
    doc1_matches = dict()
    doc2_matches = dict()
    
    tree = IntervalTree()
    for index2,ent2 in enumerate(doc2_ents):
        tree.add(ent2.start_char,ent2.end_char,index2)
    
    for index1,ent1 in enumerate(doc1_ents):
        matches = tree.search(ent1.start_char,ent1.end_char)
        for match in matches:
            index2 = match.data #match.data is the index of doc2_ents
            if ((labels == 0) | (doc2_ents[index2].label_ == ent1.label_)):
                if index1 not in doc1_matches.keys():
                    doc1_matches[index1] = [index2]
                else:
                    doc1_matches[index1].append(index2)
                if index2 not in doc2_matches.keys():
                    doc2_matches[index2] = [index1]
                else:
                    doc2_matches[index2].append(index1)
                
    return doc1_matches, doc2_matches

In [None]:
### This is the old, less efficient code. The newer code uses a tree search instead of the nested for-loop.

def old_overlaps(doc1_ents, doc2_ents,labels):
    '''Old code for calculating overlapping entities between two spacy documents. Also checks for matching labels if labels=1.
    
    Return:
        Two dictionaries (doc1 -> doc2 and doc2 -> doc1) with the mapping of matching entity indices:
            keys: entity index from one annotation
            value: list of matched entity indices from the other annotation
        
        Ex: "{1 : [2] , 3 : [4,5]}" means that entity 1 from doc1 matches entity 2 in doc2, and entity 3 in doc1 matches 
        entities 4 and 5 from doc2.
    '''
    
    doc1_matches = dict()
    doc2_matches = dict()

    for index1,ent1 in enumerate(doc1_ents):
        for index2,ent2 in enumerate(doc2_ents):
            if (ent1.end_char >= ent2.start_char) & (ent1.start_char <= ent2.end_char) & ((labels==0) | (ent1.label_ == ent2.label_)):
                if index1 not in doc1_matches.keys():
                    doc1_matches[index1] = [index2]
                else:
                    doc1_matches[index1].append(index2)
                if index2 not in doc2_matches.keys():
                    doc2_matches[index2] = [index1]
                else:
                    doc2_matches[index2].append(index1)
                
    return doc1_matches, doc2_matches
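
One detail worth checking in the nested-loop comparison above: spaCy's `end_char` is exclusive (`doc.text[span.start_char:span.end_char] == span.text`), so the `>=` / `<=` test also counts two spans that merely touch as overlapping. The sketch below contrasts the two comparisons on plain `(start, end)` tuples; the function names and offsets are hypothetical, and whether the tree search treats boundaries the same way is something to verify rather than assume.

```python
def overlaps_inclusive(a, b):
    """Comparison used in the nested-loop version: touching spans count as overlapping."""
    return a[1] >= b[0] and a[0] <= b[1]

def overlaps_half_open(a, b):
    """Comparison for half-open [start, end) offsets: touching spans do not overlap."""
    return a[1] > b[0] and a[0] < b[1]

# Hypothetical character offsets with an exclusive end, as with start_char/end_char.
first = (32, 36)
adjacent = (36, 39)   # starts exactly where `first` ends
crossing = (34, 39)   # shares two characters with `first`

print(overlaps_inclusive(first, adjacent), overlaps_half_open(first, adjacent))
print(overlaps_inclusive(first, crossing), overlaps_half_open(first, crossing))
```

Only the adjacent pair is treated differently by the two comparisons; spans that genuinely share characters match under both.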
    

In [None]:
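def exact_match(doc1_ents, doc2_ents, labels):
    '''Finds entities whose character offsets (and labels, if labels=1) are identical in both documents.
    
    Returns two dictionaries mapping matched entity indices (doc1 -> doc2 and doc2 -> doc1),
    in the same format as overlaps().
    '''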
    
    doc1_matches = dict()
    doc2_matches = dict()

    doc1_ent_dict = dict()
    doc2_ent_dict = dict()
    
    for index1,ent1 in enumerate(doc1_ents):
        if labels == 1: #If checking for labels, then include this in the tuple's to-be-compared elements
            doc1_ent_dict[(ent1.start_char,ent1.end_char,ent1.label_)] = index1
        else:
            doc1_ent_dict[(ent1.start_char,ent1.end_char)] = index1
            
    for index2,ent2 in enumerate(doc2_ents):
        if labels == 1:    
            doc2_ent_dict[(ent2.start_char,ent2.end_char,ent2.label_)] = index2
        else:
            doc2_ent_dict[(ent2.start_char,ent2.end_char)] = index2
        
    doc1_ent_set = set(doc1_ent_dict.keys())
    doc2_ent_set = set(doc2_ent_dict.keys())
    
    matched_ents = doc1_ent_set.intersection(doc2_ent_set)
    
    for match in matched_ents:
        index1 = doc1_ent_dict[match]
        index2 = doc2_ent_dict[match]
        doc1_matches[index1] = [index2]
        doc2_matches[index2] = [index1]
        
    return doc1_matches, doc2_matches
    

In [None]:
def agreement(doc1, doc2, loose=1, labels=1, ent_or_span = 'ent'):
    '''Calculates confusion matrix for agreement between two documents.
    
       returns true positive, false positive, and false negative
    '''
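    #Parentheses are needed here: 'and' binds tighter than 'or', so without them a tuple doc1 alone would pass the check
    if ((type(doc1) is tuple) or (type(doc1) is spacy.tokens.span_group.SpanGroup)) and \
    ((type(doc2) is tuple) or (type(doc2) is spacy.tokens.span_group.SpanGroup)):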
        doc1_ents = doc1
        doc2_ents = doc2
    elif (type(doc1) is spacy.tokens.doc.Doc) and (type(doc2) is spacy.tokens.doc.Doc):
        if ent_or_span == 'ent':
            doc1_ents = doc1.ents
            doc2_ents = doc2.ents
        elif ent_or_span == 'span':
            if len(doc1.spans) > 1:
                #raise error
                print("Error: cannot distinguish which span group to use from doc1.")
                return
            else:
                span_group = list(doc1.spans.keys())[0]
                doc1_ents = doc1.spans[span_group]
                doc2_ents = doc2.spans[span_group]
        else:
            #raise error
            print("Error: Must select 'span' or 'ent' for ent_or_span option.")
            return
    else:
        #raise error
        print("Error: Input must be of type 'tuples', 'spacy.tokens.span_group.SpanGroup', or 'spacy.tokens.doc.Doc'")
        return
        
    if loose:
        doc1_matches, doc2_matches = overlaps(doc1_ents, doc2_ents, labels)
    else:
        doc1_matches, doc2_matches = exact_match(doc1_ents, doc2_ents, labels)
    
    return conf_matrix(doc1_matches,doc2_matches,len(doc1_ents),len(doc2_ents))


In [None]:
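def conf_matrix(doc1_matches,doc2_matches,doc1_ent_num,doc2_ent_num):
    '''Counts true positives, false positives, and false negatives from the match mappings.
    
    doc1 is treated as the reference: unmatched doc1 entities are false negatives and
    unmatched doc2 entities are false positives.
    '''
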
    doc1_match_num = len(doc1_matches.keys())
    doc2_match_num = len(doc2_matches.keys())
    
    duplicate_matches = 0
    for value in doc2_matches.values():
        duplicate_matches += len(value) - 1
    
    tp = doc1_match_num - duplicate_matches #How many entity indices from doc1 matched, minus duplicated matches
    fp = doc2_ent_num - doc2_match_num #How many entities from doc2 that didn't match
    fn = doc1_ent_num - doc1_match_num #How many entities from doc1 that didn't match
    
    return (tp,fp,fn)
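
To make the duplicate handling concrete, here is a small worked example. It restates the same arithmetic under a hypothetical name, `count_matches`, so that it runs on its own; the mappings and entity counts are made up for illustration.

```python
def count_matches(doc1_matches, doc2_matches, doc1_ent_num, doc2_ent_num):
    """Same arithmetic as conf_matrix, restated so the example is self-contained."""
    duplicate_matches = sum(len(v) - 1 for v in doc2_matches.values())
    tp = len(doc1_matches) - duplicate_matches
    fp = doc2_ent_num - len(doc2_matches)
    fn = doc1_ent_num - len(doc1_matches)
    return tp, fp, fn

# Hypothetical mappings: doc1 entities 0 and 1 both overlap doc2 entity 0,
# doc1 entity 2 overlaps doc2 entity 1, and doc1 entity 3 has no match.
doc1_matches = {0: [0], 1: [0], 2: [1]}
doc2_matches = {0: [0, 1], 1: [2]}

# doc1 has 4 entities in total; doc2 has 3, one of which is unmatched.
result = count_matches(doc1_matches, doc2_matches, 4, 3)
print(result)
```

The two doc1 entities that share one doc2 entity are collapsed into a single true positive, which is why the counts here come out as 2 true positives, 1 false positive, and 1 false negative.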

In [None]:
def pairwise_f1(tp,fp,fn):
    '''calculate f1 given true positive, false positive, and false negative values'''
    
    return (2*tp)/float(2*tp+fp+fn)
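
The formula is F1 = 2·TP / (2·TP + FP + FN), which is the harmonic mean of precision and recall. As a quick check, the snippet below applies it to the counts that appear in the two result tables further down in this notebook (11/19/27 and 20/7/18); `f1` is just a local copy of the formula so the example is self-contained.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; same formula as pairwise_f1."""
    return (2 * tp) / float(2 * tp + fp + fn)

# (true positives, false positives, false negatives) from the result tables below.
with_attributes = f1(11, 19, 27)
labels_only = f1(20, 7, 18)

print(round(with_attributes, 6), round(labels_only, 6))
```

Both values agree with the `IAA` column of those tables (0.323529 and 0.615385).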

In [None]:
import pandas as pd
import spacy

def corpus_agreement(docs1, docs2, loose=1, labels=1,ent_or_span='ent'):
    '''calculate f1, precision, and recall over an entire corpus of documents'''
    corpus_tp, corpus_fp, corpus_fn = (0,0,0)
    
    #Check for a dataframe first: docs1[0] on a dataframe is a column lookup and raises a KeyError
    if type(docs1) is pd.DataFrame:
        for doc_name in docs1['doc name'].unique():
            docs1_df = docs1[docs1['doc name'] == doc_name]
            docs2_df = docs2[docs2['doc name'] == doc_name]
            doc1_matches,doc2_matches = df_overlaps(docs1_df,docs2_df,labels)
            tp,fp,fn = conf_matrix(doc1_matches,doc2_matches,docs1_df.shape[0],docs2_df.shape[0])
            corpus_tp += tp
            corpus_fp += fp
            corpus_fn += fn
    elif type(docs1[0]) is spacy.tokens.doc.Doc:
        for i, doc1 in enumerate(docs1):
            tp,fp,fn = agreement(doc1, docs2[i],loose,labels,ent_or_span)
            corpus_tp += tp
            corpus_fp += fp
            corpus_fn += fn
    else:
        #raise error
        print('Input Error: Input must be iterable of spacy documents, or dataframe.')
        return
    
    #Report corpus totals rather than the counts of the last document processed.
    #With docs1 as the reference, precision = tp/(tp+fp) and recall = tp/(tp+fn).
    data = {'IAA' : [pairwise_f1(corpus_tp,corpus_fp,corpus_fn)],
            'Recall' : [corpus_tp/float(corpus_tp+corpus_fn)],
            'Precision' : [corpus_tp/float(corpus_tp+corpus_fp)],
            'True Positives' : [corpus_tp], 'False Positives' : [corpus_fp], 'False Negative' : [corpus_fn]}
    
    return pd.DataFrame(data)
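
Summing the counts over documents before computing F1 gives a micro-average, so documents with many annotations weigh more than short ones. The sketch below contrasts this with a macro-average (mean of per-document F1) on made-up counts; the numbers and the `f1` helper are illustrative only.

```python
def f1(tp, fp, fn):
    """F1 from raw counts."""
    return (2 * tp) / float(2 * tp + fp + fn)

# Hypothetical per-document (tp, fp, fn) counts for a three-document corpus.
per_doc_counts = [(8, 1, 1), (1, 3, 2), (5, 0, 0)]

# Micro-average: sum the counts over documents, then compute a single F1.
corpus_tp = sum(tp for tp, _, _ in per_doc_counts)
corpus_fp = sum(fp for _, fp, _ in per_doc_counts)
corpus_fn = sum(fn for _, _, fn in per_doc_counts)
micro_f1 = f1(corpus_tp, corpus_fp, corpus_fn)

# Macro-average, for contrast: compute F1 per document, then take the mean.
macro_f1 = sum(f1(*counts) for counts in per_doc_counts) / len(per_doc_counts)

print(round(micro_f1, 3), round(macro_f1, 3))
```

In this example the weak second document pulls the macro-average down more than the micro-average, because it contributes few annotations to the pooled counts.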

# Tutorial

In this tutorial I will go over the basic, front-end usage of calculating IAA between 2 annotators.

In [None]:
import spacy
import medspacy
nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_md")
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md

#Note for John: Get better examples or make my own entities
doc1 = nlp1("this is a test document made in utah or mississippi, or salt lake city.")
doc2 = nlp2("this is a test document made in utah or mississippi, or salt lake city.")

print('doc1.ents: ',doc1.ents)
print('doc2.ents: ',doc2.ents)

Above we made two documents using spacy's NER packages. Document 2 added more entities than document 1. Let's calculate the IAA between these documents!

In [None]:
corpus_agreement([doc1],[doc2])

'corpus_agreement' calculates the agreement between two lists of documents. Note the brackets around 'doc1' and 'doc2' so they are passed in as lists of size 1 each.

'corpus_agreement' can also take options to be more flexible with other IAA methods. Below are the arguments:

### corpus_agreement(docs1, docs2, loose=1, labels=1,ent_or_span='ent')

docs1: list of spacy documents

docs2: list of spacy documents with same order as docs1

loose: Boolean for allowing loose matching. '1' indicates that any overlap between two spans/entities is counted towards IAA. '0' indicates that only exact matches will be allowed.

labels: Boolean to include labels when matching. Uses the .label_ attribute of entities/spans to access labels.

ent_or_span: string of whether spans or entities are being compared. If set to 'ent', code will iterate through doc1.ents and doc2.ents. If set to 'span', code will iterate through the spans within doc1 and doc2's first span group. 'span' only works if doc1 has one span group. This option may be extended to include a doc._.concepts option, or to allow multiple span groups to be compared.
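
To illustrate how `loose` and `labels` interact, here is a minimal sketch on stand-in annotations. `Ann` and `is_match` are hypothetical names that are not part of the code above, and the overlap test assumes exclusive end offsets, which may differ from how the real functions treat spans that only touch.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy span: character offsets (end exclusive) plus a label.
Ann = namedtuple("Ann", ["start_char", "end_char", "label_"])

def is_match(a, b, loose=1, labels=1):
    """Return True if annotations a and b count as a match under the given options."""
    if loose:
        # Any shared character is enough.
        spans_agree = a.start_char < b.end_char and b.start_char < a.end_char
    else:
        # Offsets must be identical.
        spans_agree = (a.start_char, a.end_char) == (b.start_char, b.end_char)
    labels_agree = (not labels) or a.label_ == b.label_
    return spans_agree and labels_agree

a = Ann(56, 70, "GPE")      # "salt lake city" in the example sentence
b = Ann(61, 70, "GPE")      # "lake city"
c = Ann(56, 70, "PERSON")   # same offsets as a, different label

print(is_match(a, b, loose=1, labels=1))
print(is_match(a, b, loose=0, labels=1))
print(is_match(a, c, loose=0, labels=1))
print(is_match(a, c, loose=0, labels=0))
```

The overlapping pair matches only with `loose=1`, and the pair with identical offsets but different labels matches only with `labels=0`.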

Internally, corpus_agreement calls the 'agreement' function on each pair of documents in the lists and passes the summed counts to 'pairwise_f1'. We can instead choose to call these functions separately.

# Code in Development

In [None]:
import pandas as pd
john_df = pd.read_pickle('./df_John.pkl')
#john_df = john_df[john_df['Concept Label'] == 'Symptom']
mengke_df = pd.read_pickle('./df_Mengke.pkl')

In [None]:
from quicksectx import IntervalNode, IntervalTree, Interval #note that you need the quicksectx library

#In order to make the code a little more adaptable for situations of multiple overlapping entities, as well as for 
#transparency and testing the code, I wrote the overlaps code to output a mapping of which entities are being matched. 
#Then agreement can parse this output for how many valid overlaps exist.

#This makes the code a little more complicated to understand, but I think it makes everything more transparent and adaptable.

def df_overlaps(docs1_df, docs2_df,labels=1):
    '''Calculates overlapping annotations between two dataframes of annotations for the same document. Also checks for matching labels if labels=1.
    
    Return:
        Two dictionaries (docs1 -> docs2 and docs2 -> docs1) with the mapping of matching dataframe indices:
            keys: row index from one annotation dataframe
            value: list of matched row indices from the other annotation dataframe
        
        Ex: "{1 : [2] , 3 : [4,5]}" means that row 1 from docs1_df matches row 2 in docs2_df, and row 3 in docs1_df matches 
        rows 4 and 5 from docs2_df.
    '''
    
    doc1_matches = dict()
    doc2_matches = dict()
    
    tree = IntervalTree()
    for index2,row2 in docs2_df.iterrows():
        tree.add(row2['start loc'],row2['end loc'],index2)
    
    for index1,row1 in docs1_df.iterrows():
        matches = tree.search(row1['start loc'],row1['end loc'])
        for match in matches:
            index2 = match.data #match.data is the index of doc2_ents
            if ((labels == 0) | (docs2_df.loc[index2,'Concept Label'] == row1['Concept Label'])):
                if index1 not in doc1_matches.keys():
                    doc1_matches[index1] = [index2]
                else:
                    doc1_matches[index1].append(index2)
                if index2 not in doc2_matches.keys():
                    doc2_matches[index2] = [index1]
                else:
                    doc2_matches[index2].append(index1)
                
    return doc1_matches, doc2_matches

In [None]:
def df_corpus_agreement(docs1, docs2, loose=1, labels=1,ent_or_span='ent'):
    '''calculate f1 over an entire corpus of documents'''
    corpus_tp, corpus_fp, corpus_fn = (0,0,0)
    
    for doc_name in docs1['doc name'].unique():
        docs1_df = docs1[docs1['doc name'] == doc_name]
        docs2_df = docs2[docs2['doc name'] == doc_name]
        doc1_matches,doc2_matches = df_overlaps(docs1_df,docs2_df,labels)
        tp,fp,fn = conf_matrix(doc1_matches,doc2_matches,docs1_df.shape[0],docs2_df.shape[0])
        corpus_tp += tp
        corpus_fp += fp
        corpus_fn += fn
    
    print('corpus tp: ',corpus_tp,'\ncorpus fp: ',corpus_fp,'\ncorpus fn: ',corpus_fn)
    
    print(tp)
    print(corpus_tp)
    
    #print("Not in doc2 annotations:\n")
    for index,row in docs1.iterrows():
        if (index not in doc2_matches.keys()):
            #print(row['Span Text'])
            break
    
    return pairwise_f1(corpus_tp,corpus_fp,corpus_fn)

In [None]:
df_corpus_agreement(john_df,mengke_df,1)

In [None]:
john_df_symptoms = john_df[john_df['Concept Label'] == 'Symptom']
mengke_df_symptoms = mengke_df[mengke_df['Concept Label'] == 'Symptom']

df_corpus_agreement(john_df_symptoms,mengke_df_symptoms)

In [None]:
corpus_agreement(john_df_symptoms,mengke_df_symptoms,1)

In [None]:
type(john_df)

In [1]:
import sys
# caution: path[0] is reserved for script path (or '' in REPL)
sys.path.insert(1, './Integrated_code/')
import IAA_ as IAA

In [2]:
import pandas as pd
john_df = pd.read_pickle('./df_John.pkl')
#john_df = john_df[john_df['Concept Label'] == 'Symptom']
mengke_df = pd.read_pickle('./df_Mengke.pkl')

In [3]:
#IAA.corpus_agreement(john_df,mengke_df,labels=1)[1].to_csv('C:/Users/johna/Desktop/mimic_annot.csv')

In [3]:
john_df['doc name'].unique()

array(['485939.txt', '366026.txt', '5585.txt', '669731.txt', '33200.txt',
       '36609.txt', '335643.txt', '52374.txt', '38580.txt', '601884.txt',
       '38757.txt', '416614.txt', '45725.txt', '56773.txt', '19070.txt',
       '628045.txt', '45509.txt', '527735.txt', '439912.txt',
       '457012.txt', '595179.txt', '575514.txt', '28056.txt',
       '655523.txt', '537108.txt', '33137.txt', '330814.txt', '18826.txt',
       '700062.txt', '26272.txt', '321062.txt', '584041.txt',
       '369557.txt', '679566.txt', '593885.txt', '379.txt', '489228.txt',
       '24881.txt', '377554.txt', '700799.txt', '26388.txt', '330335.txt',
       '18305.txt', '375911.txt', '345622.txt', '346854.txt',
       '536291.txt', '456788.txt', '5749.txt', '3271.txt'], dtype=object)

In [4]:
john_df_doc1 = john_df[john_df['doc name'] == '18826.txt']
mengke_df_doc1 = mengke_df[mengke_df['doc name'] == '18826.txt']
mengke_df_doc1

Unnamed: 0,Span Text,Concept Label,start loc,end loc,doc name
328,ILLNESS:,SectionHeader_HasSymptom,183,191,18826.txt
329,of,SectionHeader_HasSymptom,538,540,18826.txt
330,of,SectionHeader_HasSymptom,1397,1399,18826.txt
331,of,SectionHeader_HasSymptom,1455,1457,18826.txt
332,of,SectionHeader_HasSymptom,4407,4409,18826.txt
333,migraine headaches,Symptom,250,268,18826.txt
334,cool and\nnumb hands,Symptom,363,382,18826.txt
335,fatigue,Symptom,403,410,18826.txt
336,upper extremity\nweakness,Symptom,415,439,18826.txt
337,photophobia,Symptom,541,552,18826.txt


In [5]:
john_df_doc1

Unnamed: 0,Span Text,Concept Label,start loc,end loc,doc name
772,HISTORY OF PRESENT ILLNESS: The patient is a ...,Symptom_Section,164,1310,18826.txt
773,PAST MEDICAL HISTORY: Other past medical hist...,Symptom_Section,1312,1487,18826.txt
774,PHYSICAL EXAMINATION ON PRESENTATION: On phys...,Symptom_Section,1831,2436,18826.txt
775,CONCISE SUMMARY OF HOSPITAL COURSE: The impre...,Symptom_Section,3126,4614,18826.txt
776,"DISCHARGE DIAGNOSIS:\n1. Delirium, not otherw...",Symptom_Section,4744,4852,18826.txt
777,HISTORY OF PRESENT ILLNESS:,SectionHeader_HasSymptom,164,191,18826.txt
778,PAST MEDICAL HISTORY:,SectionHeader_HasSymptom,1312,1333,18826.txt
779,PHYSICAL EXAMINATION ON PRESENTATION:,SectionHeader_HasSymptom,1831,1868,18826.txt
780,CONCISE SUMMARY OF HOSPITAL COURSE:,SectionHeader_HasSymptom,3126,3161,18826.txt
781,DISCHARGE DIAGNOSIS:,SectionHeader_HasSymptom,4744,4764,18826.txt


In [22]:
IAA.corpus_agreement(john_df_doc1,mengke_df_doc1,labels=1,attributes=['attribute1','attribute2'])[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  docs1['Span Text'] = docs1['Span Text'].str.replace('\n',' ')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  docs2['Span Text'] = docs2['Span Text'].str.replace('\n',' ')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  docs1['Span Text'] = docs1['Span Text'].str.replace('\t',' ')
A value is trying t

Unnamed: 0,IAA,Recall,Precision,True Positives,False Positives,False Negative
0,0.323529,0.366667,0.289474,11,19,27


In [17]:
IAA.corpus_agreement(john_df_doc1,mengke_df_doc1,labels=1)[0]

Unnamed: 0,IAA,Recall,Precision,True Positives,False Positives,False Negative
0,0.615385,0.740741,0.526316,20,7,18


In [21]:
newArray = []
for i in range(john_df_doc1.shape[0]):
    newArray.append("cardio")
john_df_doc1['attribute1'] = newArray

newArray = []
for i in range(john_df_doc1.shape[0]):
    newArray.append("brain")
john_df_doc1['attribute2'] = newArray

newArray = []
for i in range(mengke_df_doc1.shape[0]):
    if i%2 == 0:
        newArray.append("not cardio")
    else:
        newArray.append("cardio")
mengke_df_doc1['attribute1'] = newArray
        
newArray = []
for i in range(mengke_df_doc1.shape[0]):
    newArray.append("brain")
mengke_df_doc1['attribute2'] = newArray

mengke_df_doc1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  john_df_doc1['attribute1'] = newArray
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  john_df_doc1['attribute2'] = newArray
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mengke_df_doc1['attribute1'] = newArray
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_ind

Unnamed: 0,Span Text,Concept Label,start loc,end loc,doc name,attribute1,attribute2
328,ILLNESS:,SectionHeader_HasSymptom,183,191,18826.txt,not cardio,brain
329,of,SectionHeader_HasSymptom,538,540,18826.txt,cardio,brain
330,of,SectionHeader_HasSymptom,1397,1399,18826.txt,not cardio,brain
331,of,SectionHeader_HasSymptom,1455,1457,18826.txt,cardio,brain
332,of,SectionHeader_HasSymptom,4407,4409,18826.txt,not cardio,brain
333,migraine headaches,Symptom,250,268,18826.txt,cardio,brain
334,cool and numb hands,Symptom,363,382,18826.txt,not cardio,brain
335,fatigue,Symptom,403,410,18826.txt,cardio,brain
336,upper extremity weakness,Symptom,415,439,18826.txt,not cardio,brain
337,photophobia,Symptom,541,552,18826.txt,cardio,brain


In [7]:
dicts = IAA.df_overlaps(john_df_doc1,mengke_df_doc1,labels=1)
dicts

({772: [353, 352],
  773: [354, 355],
  775: [356],
  776: [357],
  777: [328],
  782: [333],
  783: [334],
  784: [335],
  785: [336],
  786: [337],
  787: [338],
  788: [339],
  790: [340],
  791: [341],
  792: [342],
  794: [343],
  795: [344],
  807: [348],
  808: [349],
  809: [351, 350]},
 {353: [772],
  352: [772],
  354: [773],
  355: [773],
  356: [775],
  357: [776],
  328: [777],
  333: [782],
  334: [783],
  335: [784],
  336: [785],
  337: [786],
  338: [787],
  339: [788],
  340: [790],
  341: [791],
  342: [792],
  343: [794],
  344: [795],
  348: [807],
  349: [808],
  351: [809],
  350: [809]})

In [8]:
test = IAA.create_agreement_df(dicts[0],dicts[1],john_df_doc1,mengke_df_doc1) #index is not finding a duplicate_match that exists
#because duplicate match is already added to a previous duplicate match

In [11]:
test.loc[0,'Annotation_2']

'developed fatigue and upper extremity\nweakness || photophobia,\nlightheadedness, and a headache'

In [8]:
test_dict = {"Test1" : [4,5,6]}
test_dict["Test1"][0] += 2
test_dict

{'Test1': [6, 5, 6]}

In [None]:
##This was a very bad idea

def match_indices_recursive(doc1_matches,doc2_matches,index,doc1_doc2,doc1_indices,doc2_indices):
    '''Recursively collects every index connected to `index` through chains of matches.
    
    doc1_doc2 is 1 if `index` belongs to doc1 and 2 if it belongs to doc2.
    Returns [doc1_indices, doc2_indices]; both sets are updated in place.
    '''
    if doc1_doc2 == 1:
        if index not in doc1_indices:
            doc1_indices.add(index)
            for index2 in doc1_matches.get(index, []):
                match_indices_recursive(doc1_matches,doc2_matches,index2,2,doc1_indices,doc2_indices)
    elif doc1_doc2 == 2:
        if index not in doc2_indices:
            doc2_indices.add(index)
            for index1 in doc2_matches.get(index, []):
                match_indices_recursive(doc1_matches,doc2_matches,index1,1,doc1_indices,doc2_indices)
    return [doc1_indices, doc2_indices]
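
The goal of the recursive helper above, collecting every annotation linked through a chain of matches, can also be reached without recursion by walking the two mapping dictionaries with an explicit stack. This is only a sketch: `match_groups` is a hypothetical name, and index 774 is an invented entry added to the sample data to create a chain.

```python
def match_groups(doc1_matches, doc2_matches):
    """Group indices that are linked through any chain of matches.

    Returns a list of (doc1_indices, doc2_indices) pairs of sets.
    """
    groups = []
    seen1 = set()
    for start in doc1_matches:
        if start in seen1:
            continue
        group1, group2 = set(), set()
        stack = [(1, start)]
        while stack:
            side, index = stack.pop()
            if side == 1:
                if index in group1:
                    continue
                group1.add(index)
                stack.extend((2, i2) for i2 in doc1_matches.get(index, []))
            else:
                if index in group2:
                    continue
                group2.add(index)
                stack.extend((1, i1) for i1 in doc2_matches.get(index, []))
        seen1 |= group1
        groups.append((group1, group2))
    return groups

# Entries in the shape of the df_overlaps output; 774 is a hypothetical extra
# annotation that creates the chain 773 -> 355 -> 774.
doc1_matches = {772: [353, 352], 773: [354, 355], 774: [355], 775: [356]}
doc2_matches = {353: [772], 352: [772], 354: [773], 355: [773, 774], 356: [775]}

groups = match_groups(doc1_matches, doc2_matches)
for g1, g2 in groups:
    print(sorted(g1), sorted(g2))
```

Each group could then become one row of an agreement table, with the members of each side joined by " || ".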

In [34]:
test = pd.DataFrame(columns=['test','test2'])
test2 = pd.DataFrame({'test': [2,3], 'test2' : [3,4]})
pd.concat([test,test2])

Unnamed: 0,test,test2
0,2,3
1,3,4


In [5]:
def create_agreement_df(doc1_matches,doc2_matches,doc1_ents,doc2_ents):
    result_dict = {"Index" : [],"Annotation_1" : [],"Annotation_2" : [], "Exact Match?" : [], "Duplicate Matches?" : [], "Overlap?" : []} 
    
    #Add doc name,labels,start_char(1&2),end_char(1&2), fix index, get rid of index column
    
    for index1 in doc1_ents.index: #iterate through all ents in set one (the match dictionaries are keyed by dataframe index labels)
        if index1 in doc1_matches.keys(): #if ent is in
            #if another index1 is in doc2_matches.values(), then add it to this row
            first_index2 = sorted(doc1_matches[index1])[0]
            first_index1 = sorted(doc2_matches[first_index2])[0]
            if first_index1 < index1:
                #Add to index: sorted(doc2_matches[first_index2])[0]
                duplicate_match_index = result_dict["Index"].index(first_index1)
                result_dict["Annotation_1"][duplicate_match_index] += " || " + doc1_ents.loc[index1,'Span Text']
                result_dict["Duplicate Matches?"][duplicate_match_index] = 1
            else:
                result_dict["Index"].append(index1)
                result_dict["Annotation_1"].append(doc1_ents.loc[index1,'Span Text'])
                annot_2 = ""
                first_time=1
                for index2 in sorted(doc1_matches[index1]):
                    if first_time ==1:
                        annot_2 += doc2_ents.loc[index2,'Span Text']
                        first_time=0
                    else:
                        annot_2 += " || " + doc2_ents.loc[index2,'Span Text']
                result_dict["Annotation_2"].append(annot_2)
                result_dict["Exact Match?"].append("")
                if len(doc1_matches[index1]) > 1:
                    result_dict["Duplicate Matches?"].append(1)
                else:
                    result_dict["Duplicate Matches?"].append(0)
                result_dict["Overlap?"].append(1)
        else:
            result_dict["Index"].append(index1)
            result_dict["Annotation_1"].append(doc1_ents.loc[index1,'Span Text'])
            result_dict["Annotation_2"].append("")
            result_dict["Exact Match?"].append(0)
            result_dict["Duplicate Matches?"].append(0)
            result_dict["Overlap?"].append(0)
    for index2 in doc2_ents.index:
        if index2 not in doc2_matches.keys():
            result_dict["Index"].append(index2)
            result_dict["Annotation_1"].append("")
            result_dict["Annotation_2"].append(doc2_ents.loc[index2,'Span Text'])
            result_dict["Exact Match?"].append(0)
            result_dict["Duplicate Matches?"].append(0)
            result_dict["Overlap?"].append(0)
        #Add annotation2
    return pd.DataFrame.from_dict(result_dict)

## Join Tables

In [11]:
mimic_IAA = pd.read_csv('C:/Users/johna/Desktop/mimic_annot.csv',index_col=[0])
mimic_quickumls = pd.read_csv('C:/Users/johna/Desktop/work forms filled/projects/PascLex/mimic_annotation_quickumls.csv')

In [25]:
mimic_quickumls.dtypes

Text           object
start_char      int64
end_char        int64
label          object
Row_ID          int64
similarity    float64
dtype: object

In [6]:
mimic_IAA

Unnamed: 0,doc name,Annotation_1,Annotation_2,Annot_1_label,Annot_1_char,Annot_2_label,Annot_2_char,Overall_start_char,Exact Match?,Duplicate Matches?,Overlap?
0,485939.txt,Chief Complaint:,,SectionHeader_HasSymptom,0-16,,,0,0,0,0
1,485939.txt,Transfer from [**Hospital Unit Name 1**] for X...,,Symptom_Section,17-1307,,,17,0,0,0
2,485939.txt,respiratory failure,respiratory failure,Symptom,103-122,Symptom,103-122,103,1,0,1
3,485939.txt,afib,,Symptom,259-263,,,259,0,0,0
4,485939.txt,right apical pneumothorax,,Symptom,729-754,,,729,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1389,3271.txt,left sided hemiparesis,,Symptom,6030-6052,,,6030,0,0,0
1390,3271.txt,dysarthria,dysarthria,Symptom,6075-6085,Symptom,6075-6085,6075,1,0,1
1391,3271.txt,,swallowing,,,Symptom,6090-6100,6090,0,0,0
1392,3271.txt,decreased arousability,,Symptom,6202-6224,,,6202,0,0,0


In [28]:
mimic_IAA['start_char'] = mimic_IAA['Overall_start_char']
mimic_IAA['Row_ID'] = mimic_IAA['doc name'].str.replace('.txt','',regex=False).astype(int)
mimic_IAA.dtypes

doc name              object
Annotation_1          object
Annotation_2          object
Annot_1_label         object
Annot_1_char          object
Annot_2_label         object
Annot_2_char          object
Overall_start_char     int64
Exact Match?           int64
Duplicate Matches?     int64
Overlap?               int64
start_char             int64
Row_ID                 int32
dtype: object

Index(['doc name', 'Annotation_1', 'Annotation_2', 'Annot_1_label',
       'Annot_1_char', 'Annot_2_label', 'Annot_2_char', 'Overall_start_char',
       'Exact Match?', 'Duplicate Matches?', 'Overlap?', 'start_char',
       'Row_ID'],
      dtype='object')

In [30]:
merged_annotation = pd.merge(mimic_IAA,mimic_quickumls,on=['Row_ID','start_char'],how='inner')
merged_annotation.to_csv('C:/Users/johna/Desktop/merged_IAA_quick.csv')