In [39]:
import spacy
nlp = spacy.load('en_core_web_sm')
from conll import evaluate
from conll import read_corpus_conll
import pandas as pd

corpus = read_corpus_conll('test.txt')
sentence = "Apple's Steve Jobs died in 2011 in Palo Alto, California."

##Evaluate spaCy NER on CoNLL 2003 data, token-level performance (per class and total)
- accuracy of correctly recognizing all tokens that belong to named entities

The first step is to read the sentences of the CoNLL dataset correctly, which is done with the function `read_corpus_conll` from the `conll.py` script. Since its output is a `list of list of list`, I implemented a function (`getSentences`) that takes the result of `read_corpus_conll` as input and outputs a `list` holding the full sentences of the dataset. After collecting the sentences that will be fed to `spacy`, I use the function `getSentencesTag` to create a `list of list of tuple` that stores the ground truth for the evaluation. It works essentially like `getSentences`, but instead of simply appending each token to the previous ones, it builds a pair of `token` and `ent_tag`, which is later compared against spaCy's NER output.
The function `getSpacyTag` takes the `list` of sentences as input and, for each of them, creates a `list of tuples` containing `token` and `ent_tag`. Because `spacy` tokenizes some words differently, some manipulation was needed: for example, a date like `19-04-2020` would be split into three tokens, one per number, while in the CoNLL dataset it is treated as a single token. Using the `whitespace_` attribute I was able to detect those instances, and I wrote a helper function, `getWord`, which takes a `doc` object and the index of the token being processed. It returns the full "span" that makes up the whole token in the CoNLL dataset, together with the index of the token that immediately follows it. `getSpacyTag` uses that index to skip the "extra" tokens that `spacy` produces but that are already covered by the output of `getWord`.
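
The merging idea can be illustrated without loading a model. The sketch below is a simplified stand-in, not the notebook's `getWord`: it assumes each token is a hypothetical `(text, trailing_whitespace)` pair mirroring spaCy's `Token.text` and `Token.whitespace_`, and it glues tokens together until one is followed by whitespace. The tokenization in the example is hand-written for illustration.

```python
def merge_on_whitespace(tokens):
    """Rebuild whitespace-delimited words from (text, trailing_whitespace) pairs."""
    words = []
    current = ""
    for text, trailing_ws in tokens:
        current += text
        # A token followed by whitespace closes the current word.
        if trailing_ws:
            words.append(current)
            current = ""
    # The last token of a sentence usually has no trailing whitespace.
    if current:
        words.append(current)
    return words


# Hand-written (hypothetical) tokenization of the text "on 19-04-2020 ."
tokens = [("on", " "), ("19", ""), ("-", ""), ("04", ""), ("-", ""), ("2020", " "), (".", "")]
print(merge_on_whitespace(tokens))  # ['on', '19-04-2020', '.']
```

Because the CoNLL sentences are rebuilt by joining tokens with single spaces, every original token boundary corresponds to a whitespace, which is what makes this kind of realignment possible.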

Now that I have everything needed for the evaluation, I create two variables that count all the tokens and the correctly recognized ones. I also create two `dict`s that store the total and the correct counts per tag. Their keys are the four entity types of the CoNLL dataset, each in two variants because a token can be at the beginning of an entity (`B-ENT`) or inside it (`I-ENT`), plus a key for the non-entity tag `O`.
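
The counting scheme described above can be sketched on toy data. This is an illustrative variant rather than the notebook's exact loop: it assumes the two tag sequences are already aligned and non-empty, it compares tags only (the notebook compares `[token, tag]` pairs), and the name `token_accuracy` is made up for this example.

```python
from collections import Counter


def token_accuracy(gold, predicted):
    """Return (overall accuracy, per-tag accuracy) for aligned tag sequences."""
    total = Counter()    # how often each gold tag occurs
    correct = Counter()  # how often the prediction matches that gold tag
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    per_tag = {tag: correct[tag] / total[tag] for tag in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_tag


gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O"]
overall, per_tag = token_accuracy(gold, pred)
print(overall)           # 0.8
print(per_tag["I-PER"])  # 0.0
```

Since the denominator is the number of gold tokens carrying each tag, the per-tag figure answers the question "how many of the tokens with this gold tag were labelled correctly".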

I should also point out that CoNLL "only" uses four entity types while spacy recognizes 18 of them, so part of `getSpacyTag` is dedicated to mapping `spacy` tags to the CoNLL ones: `GPE` is reported as `LOC`, `PERSON` is converted to `PER`, `LOC` and `ORG` remain untouched, and all the others are assigned to the `MISC` label.
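
The same mapping can be written as a small lookup table. This is only a sketch of the rule stated above; the names `SPACY_TO_CONLL` and `to_conll_tag` are hypothetical and do not appear in the notebook.

```python
# Explicit mappings; every other spaCy type falls back to MISC.
SPACY_TO_CONLL = {"GPE": "LOC", "PERSON": "PER", "LOC": "LOC", "ORG": "ORG"}


def to_conll_tag(ent_iob, ent_type):
    """Convert a spaCy IOB flag and entity type into a CoNLL-style tag."""
    if ent_iob == "O" or not ent_type:
        return "O"
    return ent_iob + "-" + SPACY_TO_CONLL.get(ent_type, "MISC")


print(to_conll_tag("B", "GPE"))     # B-LOC
print(to_conll_tag("I", "PERSON"))  # I-PER
print(to_conll_tag("B", "DATE"))    # B-MISC
print(to_conll_tag("O", ""))        # O
```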

I also feel the need to point out why I computed the accuracy "by hand": the request was to compute the token-level performance both in total and per class, and I found that the latter was not possible with the function the professor suggested on Piazza.


In [40]:
def getSentences(corpus):
    sentence=''
    sentences=[]
    space= ' '
    for s in corpus:
        for i in s:
            words=i[0].split()
            word=words[0]
            if(word=='-DOCSTART-'):
                continue
            else:
                sentence=sentence+space+word
        sent = sentence.strip()
        if (len(sent)>0):
            sentences.append(sent)
        sentence=''
    return sentences   
        
def getSentencesTag(corpus):
    sentences=[]
    for s in corpus:
        sentence=[]
        for i in s:
            words=i[0].split()
            tup=[words[0], words[3]]
            if(words[0]=='-DOCSTART-'):
                continue
            else:
                sentence.append(tup)
        if (len(sentence)>0):
            sentences.append(sentence)
    return sentences

def getWord(doc, i):
    # start from token i and glue on the following tokens
    # until one of them is followed by a whitespace
    word=doc[i].text
    for skip in range(i+1,len(doc)):
        word+=doc[skip].text
        if(doc[skip].whitespace_!=''):
            break
    # index of the first token after the merged word
    skip+=1
    return word, skip

def getSpacyTag(sentences):
    spacyTag=[]
    for i,sent in enumerate(sentences):
        doc=nlp(sent)
        sentTag=[]
        numberSkip=0
        for j, t in enumerate(doc):
            if(j>=numberSkip):
                skip=False
            if(t.whitespace_=='' and not(len(doc)==j+1)and not(skip)):
                text, numberSkip=getWord(doc, j)
                skip=True
                sentTag.append([text, tag])
            if (t.ent_iob_=='O' and not(skip)): 
                tag=t.ent_iob_ 
                text=t.text
            elif(not(skip)):
                if(t.ent_type_=='GPE'):
                    tag=t.ent_iob_+'-LOC'
                elif(t.ent_type_=='PERSON'):
                    tag=t.ent_iob_+'-PER'
                elif(t.ent_type_=='LOC' or t.ent_type_=='ORG'):
                    tag=t.ent_iob_+'-'+t.ent_type_
                else:       
                    tag=t.ent_iob_+'-MISC'
                text=t.text
                
            if(not(skip)):
                sentTag.append([text, tag])
        spacyTag.append(sentTag)
    return spacyTag   

sentences = getSentences(corpus)
sentencesTag= getSentencesTag(corpus)
spacyTag=getSpacyTag(sentences)
        
count=0
correct=0
correctDict={"O":0, "B-PER":0, "B-LOC":0, "B-ORG":0, "I-PER":0, "I-LOC":0, "I-ORG":0, "B-MISC":0, "I-MISC":0}
countDict={"O":0, "B-PER":0, "B-LOC":0, "B-ORG":0, "I-PER":0, "I-LOC":0, "I-ORG":0, "B-MISC":0, "I-MISC":0}
for i,sent in enumerate(spacyTag):
    gt=sentencesTag[i]
    for j,token in enumerate(sent):
        count+=1
        countDict[gt[j][1]]+=1
        if(token==gt[j]):
            correct+=1
            correctDict[token[1]]+=1

print('O Accuracy:', correctDict.get("O")/countDict.get("O"))
print('PER Accuracy:', (correctDict.get("B-PER")+correctDict.get("I-PER"))/(countDict.get("B-PER")+countDict.get("I-PER")))
print('ORG Accuracy:', (correctDict.get("B-ORG")+correctDict.get("I-ORG"))/(countDict.get("B-ORG")+countDict.get("I-ORG")))
print('LOC Accuracy:', (correctDict.get("B-LOC")+correctDict.get("I-LOC"))/(countDict.get("B-LOC")+countDict.get("I-LOC")))
print('MISC Accuracy:', (correctDict.get("B-MISC")+correctDict.get("I-MISC"))/(countDict.get("B-MISC")+countDict.get("I-MISC")))
print ('Total Accuracy:', correct/count)    

O Accuracy: 0.8645200010437596
PER Accuracy: 0.6891453299675442
ORG Accuracy: 0.37219551282051283
LOC Accuracy: 0.6545454545454545
MISC Accuracy: 0.5185185185185185
Total Accuracy: 0.812038333153871


##Evaluate spaCy NER on CoNLL 2003 data, chunk-level performance (per class and total)
- precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total 

Since the previously generated `list of list of tuple` was already the correct input for the function `evaluate` inside the `conll.py` script, I simply generated the results and printed them using the `pandas` library.

In [41]:
results=evaluate(sentencesTag,spacyTag)
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| | p | r | f | s |
|---|---|---|---|---|
| PER | 0.747 | 0.594 | 0.662 | 1617 |
| ORG | 0.45 | 0.267 | 0.335 | 1661 |
| LOC | 0.727 | 0.661 | 0.693 | 1668 |
| MISC | 0.107 | 0.54 | 0.178 | 702 |
| total | 0.393 | 0.511 | 0.445 | 5648 |
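
To make the chunk-level metrics concrete, here is a minimal sketch of how precision, recall and F-measure can be computed from IOB tags. It is a simplified illustration under the assumption that a chunk counts as correct only when its boundaries and type match exactly; it is not the implementation inside `conll.py`, and the function names are made up for this example.

```python
def extract_chunks(tags):
    """Return a set of (start, end, type) spans from a list of IOB tags."""
    chunks = set()
    start, ctype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing chunk
        prefix, _, etype = tag.partition("-")
        # Close the open chunk when a new one begins or the type changes.
        if start is not None and (prefix != "I" or etype != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # Open a chunk on B-, or on an I- that has nothing to continue.
        if prefix == "B" or (prefix == "I" and start is None):
            start, ctype = i, etype
    return chunks


def chunk_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]
print(chunk_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In the toy example the `PER` chunk is predicted with the wrong right boundary, so it counts as both a missed gold chunk and a spurious predicted chunk, even though one of its two tokens was tagged correctly.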


##Grouping of Entities. Write a function to group recognized named entities using `noun_chunks` method of spaCy. 

The function `groupEnteties` takes a sentence as input and outputs a list of all the entities recognized by `spacy`, grouped by noun chunk. I first build lists of the entities and of the noun chunks, then I cycle through the entities and use the function `findChunk` to check whether an entity is inside one of the chunks. If it is, `findChunk` returns the `bool` value `True` and the text of the chunk it belongs to; otherwise it returns `False` and an empty string. A `temp` list lets me track whether the current entity is still part of the same chunk as the previous one. If it is, I simply append its label to `temp`. If it is in a chunk, but a different one, I add `temp` to the output list (if it is not empty) and start a new group for the new chunk. If it is not in any chunk, I again add `temp` to the output list (if it is not empty) and then add the entity to the output on its own, since it is still an entity even though it lies outside every chunk.

In [42]:
def findChunk(substring,chunks):
    for chunk in chunks:
        if substring in chunk.text:
            return True, chunk.text
    return False, ''

def groupEnteties(sentence):
    doc = nlp(sentence)
    labels=[] #list of all the entities
    for i in doc.ents:
        labels.append([i.text,i.label_])
    chunks=[]
    for i in doc.noun_chunks:
        chunks.append(i)
    out=[]
    temp = []
    prev_chunk = ''
    for i, ents in enumerate(labels):
        ret, chunk = findChunk(ents[0], chunks) 
        if i==0:
            prev_chunk = chunk
        if (ret and chunk == prev_chunk):
            temp.append(ents[1])
        elif(ret):
            if(len(temp)>0):
                out.append(temp) 
            prev_chunk=chunk
            temp=[]
            temp.append(ents[1])
        else:
            if(len(temp)>0):
                out.append(temp)
                temp=[]
            out.append([ents[1]])
    if(len(temp)>0):
        out.append(temp)
    return out

print(groupEnteties(sentence))

[['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE']]
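
The grouping logic can also be sketched with character offsets instead of substring matching, so that an entity is assigned to a chunk only if the chunk actually contains it. This is an alternative illustration, not the notebook's `groupEnteties`: the function name is hypothetical, and the spans below are hand-written offsets for the example sentence, chosen to be consistent with the output shown above (spaCy exposes such offsets as `start_char` and `end_char` on spans).

```python
def group_by_chunk(entities, chunks):
    """Group entity labels by the chunk whose character span contains them.

    entities: list of (start_char, end_char, label), in document order
    chunks:   list of (start_char, end_char)
    """
    groups = []
    last_chunk = None
    for start, end, label in entities:
        # Find the first chunk that fully contains the entity, if any.
        chunk = next((c for c in chunks if c[0] <= start and end <= c[1]), None)
        if chunk is not None and chunk == last_chunk:
            groups[-1].append(label)  # same chunk as the previous entity
        else:
            groups.append([label])    # new chunk, or entity outside any chunk
        last_chunk = chunk
    return groups


# "Apple's Steve Jobs died in 2011 in Palo Alto, California."
entities = [(0, 5, "ORG"), (8, 18, "PERSON"), (27, 31, "DATE"),
            (35, 44, "GPE"), (46, 56, "GPE")]
chunks = [(0, 18), (35, 44), (46, 56)]  # assumed noun chunks
print(group_by_chunk(entities, chunks))
# [['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE']]
```

Matching on offsets avoids one weakness of the `substring in chunk.text` test: an entity whose text happens to occur inside an unrelated chunk elsewhere in the sentence would be attributed to that chunk.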


##Analyze the groups in terms of most frequent combinations

The function `getFrequencies` takes a list of sentences as input and outputs a `dict` whose keys are the combinations of entities that were found. For each sentence, `groupEnteties` is called to compute the entities grouped by noun chunk. I then iterate over the groups, and whenever a group has more than one element (that is, at least two entities grouped together) its combination is counted in the dictionary: if the dictionary already has a key for that combination I add one to its count, and if it is the first time the combination comes up I create the key and initialize it to one.
Since the request was to report the most frequent combinations, I use the function `nbest` to output the top 5 combinations.

In [43]:
def getFrequencies(sentences):
    d = dict()
    for sent in sentences:
        temp = groupEnteties(sent)
        for t in temp:
            if (len(t)>1):
                # key: the labels of the group joined by spaces, e.g. 'ORG PERSON'
                key = ' '.join(t)
                if (key in d):
                    d[key]+=1
                else:
                    d[key]=1
    return d

def nbest(d, n=1):
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

sentences = getSentences(corpus)
diz = getFrequencies(sentences)

d = nbest(diz, 5)
for c in d:
    print(c, '=', d.get(c))

CARDINAL PERSON = 49
NORP PERSON = 42
GPE PERSON = 33
GPE GPE = 28
ORG PERSON = 20
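
The same counting can be expressed with `collections.Counter` from the standard library, which handles the "create the key or increment it" step and the top-n selection. This is a sketch on made-up groups, assuming the input has the shape returned by `groupEnteties` (one list of label groups per sentence); `combination_counts` is a hypothetical name.

```python
from collections import Counter


def combination_counts(grouped_sentences):
    """Count label combinations over groups that contain at least two entities."""
    counts = Counter()
    for groups in grouped_sentences:
        for group in groups:
            if len(group) > 1:
                counts[" ".join(group)] += 1
    return counts


# Made-up groups for three sentences.
grouped = [
    [["ORG", "PERSON"], ["DATE"], ["GPE"]],
    [["GPE", "PERSON"]],
    [["ORG", "PERSON"], ["GPE", "GPE"]],
]
counts = combination_counts(grouped)
print(counts.most_common(1))  # [('ORG PERSON', 2)]
```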


##Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

For this task I created a function `fixEntity` that takes a sentence as input and returns a `list of tuples` where the first element is the `token.text` and the second is the entity IOB tag and type. If a token has the dependency `compound` and is not itself part of any entity while its head is, the tag of the `compound` token is changed so that it is considered part of the head's entity. If the `compound` token comes before its head, the head's IOB tag is changed as well (to `I`, because it is now inside the entity and no longer at its beginning). To do so, a `list` called `change` collects the heads to update, and all of them are applied just before the function returns its output.


In [44]:
def fixEntity(sentence):
    doc = nlp(sentence)
    ret =[]
    change = []  # [index, tag] pairs for heads whose tag must be updated
    for token in doc:
        if (token.dep_ == 'compound' and token.ent_type_ == '' and token.head.ent_type_ != ''):
            if(token.i<token.head.i):
                # compound before its head: it becomes the start of the entity
                # and the head (at its own index) moves inside the entity
                tag = 'B-'+token.head.ent_type_
                change.append([token.head.i, 'I-'+token.head.ent_type_])
                ret.append([token.text, tag])
                continue
            elif (token.head.ent_iob_ == 'I'):
                # compound after its head: it continues the head's entity
                tag = 'I-'+token.head.ent_type_
                ret.append([token.text, tag])
                continue
        if token.ent_type_ == '':
            tag = token.ent_iob_
        else:
            tag = token.ent_iob_+'-'+token.ent_type_
        ret.append([token.text, tag])
    # the heads come later in the sentence, so they are updated at the end
    for c in change:
        ret[c[0]]=[ret[c[0]][0],c[1]]
    return ret

test = fixEntity(sentence)
for i in test:
    print(i)

['Apple', 'B-ORG']
["'s", 'O']
['Steve', 'B-PERSON']
['Jobs', 'I-PERSON']
['died', 'O']
['in', 'O']
['2011', 'B-DATE']
['in', 'O']
['Palo', 'B-GPE']
['Alto', 'I-GPE']
[',', 'O']
['California', 'B-GPE']
['.', 'O']
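
In the example sentence every `compound` token is already inside an entity, so the output is the same as spaCy's original tagging. To show the extension actually firing, the sketch below replays the idea on a hand-made parse. Everything in it is an assumption for illustration: the function name, the token dictionaries (with keys `text`, `dep`, `head`, `tag`), and the scenario in which only "Francisco" was tagged. It also assumes the compound modifier is adjacent to the entity it extends.

```python
def extend_with_compounds(tokens):
    """Extend entity tags over untagged compound modifiers.

    tokens: list of dicts with keys 'text', 'dep', 'head' (index), 'tag'.
    """
    tags = [t["tag"] for t in tokens]
    for i, tok in enumerate(tokens):
        head = tok["head"]
        if tok["dep"] == "compound" and tags[i] == "O" and tags[head] != "O":
            etype = tags[head].split("-", 1)[1]
            if i < head and tags[head].startswith("B-"):
                # The modifier becomes the new start; the old start moves inside.
                tags[i] = "B-" + etype
                tags[head] = "I-" + etype
            else:
                tags[i] = "I-" + etype
    return [(t["text"], tag) for t, tag in zip(tokens, tags)]


# Hypothetical parse of "in San Francisco" where only "Francisco" was tagged.
tokens = [
    {"text": "in", "dep": "prep", "head": 0, "tag": "O"},
    {"text": "San", "dep": "compound", "head": 2, "tag": "O"},
    {"text": "Francisco", "dep": "pobj", "head": 0, "tag": "B-GPE"},
]
print(extend_with_compounds(tokens))
# [('in', 'O'), ('San', 'B-GPE'), ('Francisco', 'I-GPE')]
```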
