## Assignment 2

This assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on the CoNLL 2003 data (provided)
    - report token-level performance (per class and total):
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
    - report CoNLL chunk-level performance (per class and total):
        - precision, recall, and F-measure of correctly recognizing complete named-entity chunks

1. Grouping of Entities.
Write a function to group recognized named entities using the `noun_chunks` property of a [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks) `Doc`. Analyze the groups in terms of the most frequent combinations (i.e. NER types that occur together).

2. Fixing segmentation errors is one possible post-processing step.
Write a function that extends an entity span to cover the full noun compound. Make use of the `compound` dependency relation.
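
Tasks 1 and 2 could be approached along the following lines. This is only a sketch under stated assumptions, not the reference solution: the function names `group_entities`, `most_frequent_groups` and `extend_entity_bounds` are hypothetical, an entity is assumed to belong to a noun chunk only when it lies entirely inside it, and the extension returns `(start, end, label)` tuples instead of modifying the `Doc`. The only spaCy attributes relied on are `doc.ents`, `doc.noun_chunks`, `span.start`, `span.end`, `span.label_`, `token.i`, `token.dep_` and `token.head`.

```python
from collections import Counter


def group_entities(doc):
    """Task 1 sketch: group entity labels that share a noun chunk."""
    groups, grouped = [], set()
    for chunk in doc.noun_chunks:
        # Assumption: an entity belongs to a chunk only if fully inside it.
        inside = [ent for ent in doc.ents
                  if chunk.start <= ent.start and ent.end <= chunk.end]
        if inside:
            groups.append([ent.label_ for ent in inside])
            grouped.update((ent.start, ent.end) for ent in inside)
    # Entities outside every noun chunk become singleton groups.
    for ent in doc.ents:
        if (ent.start, ent.end) not in grouped:
            groups.append([ent.label_])
    return groups


def most_frequent_groups(groups, n=10):
    """Count label combinations, most frequent first."""
    return Counter(tuple(group) for group in groups).most_common(n)


def extend_entity_bounds(doc):
    """Task 2 sketch: widen each entity over `compound` arcs.

    Returns (start, end, label) tuples; overlaps between widened
    entities are NOT resolved here.
    """
    bounds = []
    for ent in doc.ents:
        start, end = ent.start, ent.end
        changed = True
        while changed:  # repeat until no compound arc crosses the span
            changed = False
            for tok in doc:
                if tok.dep_ != "compound":
                    continue
                tok_inside = start <= tok.i < end
                head_inside = start <= tok.head.i < end
                if tok_inside == head_inside:
                    continue  # arc is fully inside or fully outside
                start = min(start, tok.i, tok.head.i)
                end = max(end, tok.i + 1, tok.head.i + 1)
                changed = True
        bounds.append((start, end, ent.label_))
    return bounds
```

The groups come out with chunk-based groups first and singletons last rather than in document order, which does not matter for the frequency analysis. To write the widened spans back, one option is to build `Span(doc, start, end, label=label)` objects and assign them to `doc.ents`, after removing overlaps, since spaCy rejects overlapping entities.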

In [45]:
import pandas as pd
import conll
import spacy
from spacy.tokens import Doc


DATASET_PATH = 'data/test.txt'

# Read dataset
corpus = conll.read_corpus_conll(DATASET_PATH, fs=' ')
corpus = list(filter(lambda sent: sent[0][0] != '-DOCSTART-', corpus)) # Remove -DOCSTART- sentences
words = [word[0] for sent in corpus for word in sent]

# Parse with spaCy
nlp = spacy.load('en')
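# Build the Doc from the pre-tokenized words so spaCy keeps the CoNLL tokenization
# (replaces the deprecated nlp.tokenizer.tokens_from_list)
doc = Doc(nlp.vocab, words=words)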

# Custom sentence division
i = 0
for sent in corpus:
    for j, word in enumerate(sent):
        doc[i].is_sent_start = (j == 0) # Set to true only if first word in sentence
        i += 1

# Run every pipeline component (including NER) on the pre-tokenized doc
for name, proc in nlp.pipeline:
    doc = proc(doc)

In [56]:
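# Map spaCy tags to CoNLL tags
def to_conll(iob, ent_type):
    # Any spaCy label not listed here (e.g. GPE, NORP, DATE) falls back to MISC;
    # the empty type is kept empty so that 'O-' is stripped down to 'O'.
    conll_type = {
        'PERSON': 'PER',
        'LOC': 'LOC',
        'ORG': 'ORG',
        '': ''
    }.get(ent_type, 'MISC')
    return f'{iob}-{conll_type}'.strip('-')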

# Token-level performance
refs = [[(word[0], word[3]) for word in sent] for sent in corpus]
hyps = [[(word.text, to_conll(word.ent_iob_, word.ent_type_)) for word in sent] for sent in doc.sents]

token_lvl = conll.evaluate(refs, hyps)
token_lvl = pd.DataFrame.from_dict(token_lvl, orient='index').round(decimals=3).style.set_caption("Token-level performances")
display(token_lvl)


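| class | p | r | f | s |
|-------|-------|-------|-------|------|
| MISC  | 0.073 | 0.581 | 0.13  | 702  |
| PER   | 0.762 | 0.584 | 0.662 | 1617 |
| LOC   | 0.403 | 0.016 | 0.031 | 1668 |
| ORG   | 0.425 | 0.291 | 0.345 | 1661 |
| total | 0.232 | 0.33  | 0.272 | 5648 |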
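
For reference, the two kinds of scores requested in task 0 can be computed without any helper module, roughly as sketched below. Everything here is an assumption made for illustration: the function names are hypothetical, the inputs are plain lists of IOB tags per sentence (not the `(word, tag)` pairs used above), per-class token accuracy is read as "fraction of reference tokens with that tag that were predicted correctly", and the sketch is not claimed to reproduce what `conll.evaluate` does internally.

```python
from collections import defaultdict


def token_accuracy(refs, hyps):
    """Tag-level accuracy per reference tag and in total."""
    correct, total = defaultdict(int), defaultdict(int)
    for ref_sent, hyp_sent in zip(refs, hyps):
        for ref, hyp in zip(ref_sent, hyp_sent):
            total[ref] += 1
            correct[ref] += ref == hyp  # True counts as 1
    acc = {tag: correct[tag] / total[tag] for tag in total}
    acc["total"] = sum(correct.values()) / sum(total.values())
    return acc


def iob_chunks(tags):
    """Return the set of (start, end, type) chunks in one sentence."""
    chunks, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last chunk
        prefix, _, label = tag.partition("-")
        # A chunk begins on B-X, or on an I-X that does not continue type X.
        begins = prefix == "B" or (prefix == "I" and label != kind)
        if kind is not None and (prefix == "O" or begins):
            chunks.add((start, i, kind))
            start, kind = None, None
        if begins:
            start, kind = i, label
    return chunks


def chunk_prf(refs, hyps):
    """Chunk-level precision, recall, F-measure and support per type."""
    tp, n_ref, n_hyp = defaultdict(int), defaultdict(int), defaultdict(int)
    for ref_sent, hyp_sent in zip(refs, hyps):
        ref_c, hyp_c = iob_chunks(ref_sent), iob_chunks(hyp_sent)
        for *_, kind in ref_c:
            n_ref[kind] += 1
        for *_, kind in hyp_c:
            n_hyp[kind] += 1
        for *_, kind in ref_c & hyp_c:  # exact boundary and type match
            tp[kind] += 1

    def prf(t, h, r):
        p = t / h if h else 0.0
        rec = t / r if r else 0.0
        f = 2 * p * rec / (p + rec) if p + rec else 0.0
        return {"p": p, "r": rec, "f": f, "s": r}

    scores = {k: prf(tp[k], n_hyp[k], n_ref[k]) for k in set(n_ref) | set(n_hyp)}
    scores["total"] = prf(sum(tp.values()), sum(n_hyp.values()), sum(n_ref.values()))
    return scores
```

A chunk counts as correct only when its start, end and type all match the reference, so a single mislabeled or missing token makes the whole entity an error at chunk level while costing just one token at tag level.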
