## Assignment 2

This assignment sits at the intersection of Named Entity Recognition and Dependency Parsing.

1. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

2. Grouping of Entities.
Write a function to group recognized named entities using spaCy's [`noun_chunks`](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of their most frequent combinations (i.e. NER types that go together).

3. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover full noun compounds. Make use of the `compound` dependency relation.

In [42]:
import pandas as pd
import conll
import spacy
from spacy.tokens import Doc, Span
from tqdm import tqdm

DATASET_PATH = 'data/test.txt'

# Read dataset
corpus = conll.read_corpus_conll(DATASET_PATH, fs=' ')
corpus = list(filter(lambda sent: sent[0][0] != '-DOCSTART-', corpus)) # Remove -DOCSTART- sentences
words = [word[0] for sent in corpus for word in sent]

# Parse with spaCy
nlp = spacy.load('en_core_web_sm')
doc = Doc(nlp.vocab, words)

# Custom sentence division
i = 0
for sent in corpus:
    for j, word in enumerate(sent):
        doc[i].is_sent_start = (j == 0) # Set to true only if first word in sentence
        i += 1

# NER with spaCy
for name, proc in nlp.pipeline:
    doc = proc(doc)

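The `conll` helper module used above is not shown in this notebook. As a rough sketch of what a reader like `read_corpus_conll` presumably does, assuming blank-line-separated sentences with space-separated columns (token, POS, chunk, NER tag), the hypothetical `read_conll` below groups lines into sentences and then applies the same `-DOCSTART-` filter as the cell above. The sample lines are hand-written for illustration.

```python
def read_conll(lines, sep=' '):
    """Group CoNLL-style lines into sentences of column tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split(sep)))
    if current:
        # Flush the last sentence when the input has no trailing blank line
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed 4-column layout
sample = """-DOCSTART- -X- -X- O

SOCCER NN B-NP O
- : O O
JAPAN NNP B-NP B-LOC

Nadim NNP B-NP B-PER
Ladki NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
# Drop the document separators, as the cell above does
sentences = [s for s in sentences if s[0][0] != '-DOCSTART-']
words = [tok[0] for s in sentences for tok in s]
```

Here `words` is the flat token list that would be passed to `Doc(nlp.vocab, words)`, and column 3 of each tuple holds the reference NER tag.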
### 1. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

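The two metrics can diverge considerably, because a chunk only counts as correct when both its boundaries and its type match. The sketch below illustrates this on a single hand-written sentence. It is not the `conll.evaluate` implementation: `extract_chunks` and `prf` are hypothetical helpers that assume IOB2 tags.

```python
def extract_chunks(tags):
    """Return a set of (start, end, type) spans from a list of IOB2 tags."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes the last chunk
        prefix, _, label = tag.partition('-')
        # Close the open chunk on O, on B-, or on an I- with a different type
        if start is not None and (prefix != 'I' or label != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # Open a new chunk on B- (or on a stray I- with no open chunk)
        if prefix in ('B', 'I') and start is None:
            start, ctype = i, label
    return chunks


def prf(ref_tags, hyp_tags):
    """Chunk-level precision, recall and F1 for one tag sequence."""
    ref, hyp = extract_chunks(ref_tags), extract_chunks(hyp_tags)
    tp = len(ref & hyp)  # exact boundary and type match
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


ref = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
hyp = ['B-PER', 'O',     'O', 'B-LOC', 'O']

# Token level: 4 of 5 tags agree
token_acc = sum(a == b for a, b in zip(ref, hyp)) / len(ref)
# Chunk level: the truncated PER span counts as both a miss and a false alarm
p, r, f = prf(ref, hyp)
print(token_acc, p, r, f)  # 0.8 0.5 0.5 0.5
```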
In [43]:
# Map spacy tags to conll
def to_conll(iob, ent_type):
    conll_type = {
        'PERSON': 'PER',
        'LOC': 'LOC',
        'GPE': 'LOC',
        'ORG': 'ORG',
        '': ''
    }.get(ent_type)
    if conll_type is None: 
        conll_type = 'MISC'
    return f'{iob}-{conll_type}'.strip('-')

# Compute token-level accuracy
def accuracy(refs, hyps):
    acc, tot = {}, {}
    if len(refs) != len(hyps): raise ValueError(f'Size mismatch: ref: {len(refs)}, hyp: {len(hyps)}')
    for ref_chunk, hyp_chunk in zip(refs, hyps):
        if len(ref_chunk) != len(hyp_chunk): raise ValueError(f'Size mismatch: ref: {len(ref_chunk)}, hyp: {len(hyp_chunk)}')
        for ref_token, hyp_token in zip(ref_chunk, hyp_chunk):
            # The two compared tokens must be equal
            if ref_token[0] != hyp_token[0]: raise ValueError(f'Alignment mismatch: ref: {ref_token} & hyp: {hyp_token}')
            # Create missing keys
            if ref_token[1] not in acc:
                acc[ref_token[1]], tot[ref_token[1]] = 0, 0
            # Increase counts for accuracy and total
            if ref_token[1] == hyp_token[1]:
                acc[ref_token[1]] += 1
            tot[ref_token[1]] += 1
    # Compute total metrics
    acc['total'], tot['total'] = sum(acc.values()), sum(tot.values())
    # Compute accuracy by category
    for key in acc:
        acc[key] = {'accuracy': acc[key] / tot[key], 's': tot[key] }
    return acc

def evaluate(doc):
    # Results pre-processing
    refs = [[(word[0], word[3]) for word in sent] for sent in corpus]
    hyps = [[(word.text, to_conll(word.ent_iob_, word.ent_type_)) for word in sent] for sent in doc.sents]

    # Token-level
    token_lvl = accuracy(refs, hyps)
    token_lvl = pd.DataFrame.from_dict(token_lvl, orient='index').sort_index()

    # Chunk-level
    chunk_lvl = conll.evaluate(refs, hyps)
    chunk_lvl = pd.DataFrame.from_dict(chunk_lvl, orient='index').sort_index()

    return token_lvl, chunk_lvl

# Compute performances
token_lvl, chunk_lvl = evaluate(doc)
# Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas >= 1.3
display(token_lvl.style.format(precision=3).set_caption("Token-level performances"))
display(chunk_lvl.style.format(precision=3).set_caption("Chunk-level performances"))


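Token-level performances

| tag | accuracy | s (reference tokens) |
| --- | --- | --- |
| B-LOC | 0.685 | 1668 |
| B-MISC | 0.554 | 702 |
| B-ORG | 0.289 | 1661 |
| B-PER | 0.607 | 1617 |
| I-LOC | 0.595 | 257 |
| I-MISC | 0.398 | 216 |
| I-ORG | 0.499 | 835 |
| I-PER | 0.735 | 1156 |
| O | 0.868 | 38323 |
| total | 0.813 | 46435 |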


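Chunk-level performances

| class | p | r | f | s (reference chunks) |
| --- | --- | --- | --- | --- |
| LOC | 0.777 | 0.676 | 0.723 | 1668 |
| MISC | 0.104 | 0.543 | 0.175 | 702 |
| ORG | 0.455 | 0.256 | 0.328 | 1661 |
| PER | 0.757 | 0.59 | 0.663 | 1617 |
| total | 0.395 | 0.511 | 0.446 | 5648 |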

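The low MISC precision (0.104) is consistent with `to_conll` sending every spaCy label that is not PERSON, LOC, GPE or ORG to MISC, including numeric and temporal types such as CARDINAL and DATE, which CoNLL 2003 most likely leaves untagged. A stricter mapping could treat those labels as `O`. The table below is an assumption rather than an official correspondence, and `to_conll_strict` is a hypothetical replacement whose effect on the scores would still have to be measured.

```python
# Assumed mapping from spaCy (OntoNotes-style) labels to CoNLL 2003 types.
# Labels missing from the table (CARDINAL, DATE, MONEY, ...) count as non-entities.
SPACY_TO_CONLL = {
    'PERSON': 'PER',
    'LOC': 'LOC', 'GPE': 'LOC', 'FAC': 'LOC',
    'ORG': 'ORG',
    'NORP': 'MISC', 'EVENT': 'MISC', 'WORK_OF_ART': 'MISC',
    'LANGUAGE': 'MISC', 'PRODUCT': 'MISC', 'LAW': 'MISC',
}


def to_conll_strict(iob, ent_type):
    """Map a spaCy (ent_iob_, ent_type_) pair to a CoNLL tag, or 'O'."""
    conll_type = SPACY_TO_CONLL.get(ent_type)
    if iob == 'O' or conll_type is None:
        return 'O'
    return f'{iob}-{conll_type}'


print(to_conll_strict('B', 'PERSON'))    # B-PER
print(to_conll_strict('I', 'CARDINAL'))  # O
```

Because every token of a dropped entity maps to `O`, the resulting tag sequence stays a valid IOB sequence.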

### 2. Grouping of Entities.

Write a function to group recognized named entities using spaCy's [`noun_chunks`](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of their most frequent combinations (i.e. NER types that go together).

In [7]:
def group_ents(doc):
    result = []
    i = 0
    doc_ents = list(doc.ents)
    for chunk in tqdm(list(doc.noun_chunks)):
        if len(chunk.ents) > 0:
            result.append([]) # New chunk in result
            for chunk_ent in chunk.ents:
                if chunk_ent.text != doc_ents[i].text and len(result[-1]) > 0: # Misaligned noun_chunks and ents, create new chunk
                    result.append([])
                while i < len(doc_ents) and chunk_ent != doc_ents[i]: # Search for an alignment
                    result[-1].append(doc_ents[i].label_) # Add missing ents as standalone chunks
                    i+=1
                    result.append([])
                result[-1].append(doc_ents[i].label_)
                i+=1
    return result


def evaluate_groups(groups):
    result = {}
    for group in groups:
        key = "_".join(group)
        if key not in result:
            result[key] = {'freq': 0}
        result[key]['freq'] += 1
    return result

# Compute metrics
ents_groups = group_ents(doc)
groups_freq = evaluate_groups(ents_groups)
groups_freq = pd.DataFrame.from_dict(groups_freq, orient='index').sort_values(by='freq', ascending=False)
display(groups_freq.style.set_caption("Groups frequencies"))

100%|██████████| 11157/11157 [04:10<00:00, 44.51it/s]


Groups frequencies

| group | freq |
| --- | --- |
| CARDINAL | 1500 |
| GPE | 1246 |
| PERSON | 1021 |
| DATE | 859 |
| ORG | 831 |
| NORP | 292 |
| MONEY | 146 |
| ORDINAL | 119 |
| CARDINAL_PERSON | 92 |
| TIME | 80 |

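The grouping pass above took a little over four minutes on this corpus. One possible alternative is a single linear sweep over plain token offsets, sketched below with the hypothetical `group_by_chunk`. It assumes the inputs are built beforehand from `(ent.start, ent.end, ent.label_)` and `(chunk.start, chunk.end)`, that both lists are sorted and non-overlapping, and that an entity belongs to a chunk only when it lies fully inside it. It may therefore not reproduce exactly the same groups as `group_ents`.

```python
from collections import Counter


def group_by_chunk(ents, chunks):
    """Group entity labels by the noun chunk that fully contains them.

    ents:   sorted, non-overlapping (start, end, label) tuples, end exclusive
    chunks: sorted, non-overlapping (start, end) tuples, end exclusive
    Entities outside every chunk become single-label groups.
    """
    groups, c = [], 0
    last_chunk = None  # index of the chunk that owns the last group, if any
    for start, end, label in ents:
        # Skip chunks that end before this entity starts
        while c < len(chunks) and chunks[c][1] <= start:
            c += 1
        inside = c < len(chunks) and chunks[c][0] <= start and end <= chunks[c][1]
        if inside and last_chunk == c:
            groups[-1].append(label)   # same chunk as the previous entity
        else:
            groups.append([label])     # new chunk or standalone entity
        last_chunk = c if inside else None
    return groups


# Hand-written offsets for illustration
ents = [(0, 1, 'CARDINAL'), (1, 3, 'PERSON'), (5, 6, 'GPE'), (8, 9, 'DATE')]
chunks = [(0, 3), (4, 6)]

groups = group_by_chunk(ents, chunks)
freq = Counter('_'.join(g) for g in groups)
print(groups)  # [['CARDINAL', 'PERSON'], ['GPE'], ['DATE']]
```

`Counter` replaces the manual dictionary bookkeeping of `evaluate_groups` and can be turned into a frequency table with `freq.most_common()`.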

### 3. Fix segmentation errors.
One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover full noun compounds. Make use of the `compound` dependency relation.

In [65]:
new_doc = doc.copy()
new_ents = []
for ent in new_doc.ents:
    new_ents.append(ent)
    for token in ent:
        for child in token.children:
            # Span.end is exclusive: the token right after the entity sits at index `end`
            if child.dep_ == 'compound' and child.ent_type_ == '' and (child.i == new_ents[-1].start-1 or child.i == new_ents[-1].end):
                new_start, new_end = min(new_ents[-1].start, child.i), max(new_ents[-1].end, child.i+1)
                new_ents[-1] = Span(new_doc, new_start, new_end, new_ents[-1].label_)
                # N.B.: spaCy will add the correct ent_type_ and ent_iob_ attributes to the new generated entity
                #print(f'ADDED [{child.i}, {child}, {child.ent_type_}]', 'TO', new_ents[-1])
new_doc.set_ents(new_ents)

# Compute performances
new_token_lvl, new_chunk_lvl = evaluate(new_doc)
new_token_lvl = pd.concat([token_lvl, new_token_lvl], axis=1, keys=['previous', 'new'])
new_chunk_lvl = pd.concat([chunk_lvl, new_chunk_lvl], axis=1, keys=['previous', 'new'])
display(new_token_lvl.style.format(precision=3)
        .applymap(lambda s: 'background-color: red', subset=pd.IndexSlice[:, ['new']])
        .set_caption("New token-level performances"))
display(new_chunk_lvl.style.format(precision=3).set_caption("New chunk-level performances"))

New token-level performances

| tag | previous accuracy | previous s | new accuracy | new s |
| --- | --- | --- | --- | --- |
| B-LOC | 0.685 | 1668 | 0.672 | 1668 |
| B-MISC | 0.554 | 702 | 0.554 | 702 |
| B-ORG | 0.289 | 1661 | 0.284 | 1661 |
| B-PER | 0.607 | 1617 | 0.502 | 1617 |
| I-LOC | 0.595 | 257 | 0.595 | 257 |
| I-MISC | 0.398 | 216 | 0.403 | 216 |
| I-ORG | 0.499 | 835 | 0.503 | 835 |
| I-PER | 0.735 | 1156 | 0.743 | 1156 |
| O | 0.868 | 38323 | 0.859 | 38323 |
| total | 0.813 | 46435 | 0.802 | 46435 |


New chunk-level performances

| class | previous p | previous r | previous f | previous s | new p | new r | new f | new s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LOC | 0.777 | 0.676 | 0.723 | 1668 | 0.764 | 0.664 | 0.71 | 1668 |
| MISC | 0.104 | 0.543 | 0.175 | 702 | 0.104 | 0.541 | 0.174 | 702 |
| ORG | 0.455 | 0.256 | 0.328 | 1661 | 0.446 | 0.251 | 0.321 | 1661 |
| PER | 0.757 | 0.59 | 0.663 | 1617 | 0.624 | 0.486 | 0.546 | 1617 |
| total | 0.395 | 0.511 | 0.446 | 5648 | 0.368 | 0.476 | 0.415 | 5648 |

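The extension step can also be written against plain token indices, which makes the exclusive end index explicit and is easy to check by hand. The sketch below is not the notebook's code: `extend_with_compounds` is a hypothetical helper, and the parse of the example sentence is hand-written rather than produced by spaCy. It also shows one plausible reason for the drop in PER scores: a title attached as `compound` is absorbed into the person span, while the reference annotation would presumably cover only the name.

```python
def extend_with_compounds(start, end, heads, deps, ent_types):
    """Grow the span [start, end) over adjacent, untagged `compound` modifiers.

    heads[i]     index of the syntactic head of token i
    deps[i]      dependency label of token i
    ent_types[i] entity type of token i ('' when the token is outside entities)
    """
    def attachable(i):
        # Token i must exist, be an untagged compound, and attach inside the span
        return (0 <= i < len(deps)
                and deps[i] == 'compound'
                and ent_types[i] == ''
                and start <= heads[i] < end)

    changed = True
    while changed:
        changed = False
        if attachable(start - 1):      # token just left of the span
            start, changed = start - 1, True
        if attachable(end):            # end is exclusive: the token just right
            end, changed = end + 1, True
    return start, end


# Hand-written parse, for illustration only
tokens    = ['U.S.', 'President', 'Bill', 'Clinton', 'spoke']
heads     = [1, 3, 3, 4, 4]
deps      = ['compound', 'compound', 'compound', 'nsubj', 'ROOT']
ent_types = ['GPE', '', 'PERSON', 'PERSON', '']

# The PERSON entity "Bill Clinton" covers tokens [2, 4)
new_span = extend_with_compounds(2, 4, heads, deps, ent_types)
print(new_span, tokens[new_span[0]:new_span[1]])
# (1, 4) ['President', 'Bill', 'Clinton']
```

`U.S.` is left out because it already carries an entity type, mirroring the `child.ent_type_ == ''` condition in the cell above.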