# Assignment 2

This assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

1. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

2. Grouping of Entities.
Write a function to group recognized named entities using the `noun_chunks` property of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of the most frequent combinations (i.e. NER types that go together).

3. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun compound, making use of the `compound` dependency relation.

## Preprocessing
Parse the corpus with spaCy and align the results to the CoNLL tokenization and sentence segmentation.
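The alignment idea can be sketched without spaCy: flatten the CoNLL sentences into one token list and keep a parallel list of flags marking the first token of each sentence. The helper name `flatten_with_sent_starts` and the toy sentences are assumptions made for illustration; the flags correspond to what the notebook assigns to `is_sent_start` token by token.

```python
def flatten_with_sent_starts(sentences):
    """Flatten sentences into one token list plus parallel sentence-start flags."""
    words, sent_starts = [], []
    for sent in sentences:
        for j, word in enumerate(sent):
            words.append(word)
            sent_starts.append(j == 0)  # True only for the first token of a sentence
    return words, sent_starts


# Toy input (hypothetical, not taken from the dataset file)
toy = [["EU", "rejects", "German", "call"], ["Peter", "Blackburn"]]
words, sent_starts = flatten_with_sent_starts(toy)
print(words)
print(sent_starts)
```

Keeping the CoNLL tokens and boundaries fixed means every predicted tag lines up one-to-one with a reference tag, which is what the evaluation relies on.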

In [20]:
import pandas as pd
import conll
import spacy
from spacy.tokens import Doc, Span
from sklearn.metrics import classification_report
from tqdm import tqdm

DATASET_PATH = 'data/test.txt'

# Read dataset
corpus = conll.read_corpus_conll(DATASET_PATH, fs=' ')
corpus = [sent for sent in corpus if sent[0][0] != '-DOCSTART-'] # Remove -DOCSTART- sentences
words = [word[0] for sent in corpus for word in sent]

# Parse with spaCy
nlp = spacy.load('en_core_web_sm')
doc = Doc(nlp.vocab, words)

# Custom sentence split, to align the spaCy results with the CoNLL sentences
i = 0
for sent in corpus:
    for j, word in enumerate(sent):
        doc[i].is_sent_start = (j == 0) # Set to true only when first word in sentence
        i += 1

# Run every pipeline component (including the parser and NER) on the pre-tokenized Doc
for _, proc in nlp.pipeline:
    doc = proc(doc)

## 1. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

In [21]:
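# Convert spaCy (OntoNotes) entity tags to CoNLL 2003 tags
def to_conll(iob, ent_type):
    # Types without a direct CoNLL counterpart (DATE, CARDINAL, NORP, ...) fall back to MISC
    conll_type = {
        'PERSON': 'PER',
        'GPE': 'LOC',
        'FAC': 'LOC',
        'LOC': 'LOC',
        'ORG': 'ORG',
        '': '',  # Token outside any entity
    }.get(ent_type, 'MISC')
    # 'O' with an empty type yields 'O-', so strip the trailing dash
    return f'{iob}-{conll_type}'.strip('-')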

def evaluate(doc):
    # Results pre-processing
    refs = [[(word[0], word[3]) for word in sent] for sent in corpus]
    hyps = [[(word.text, to_conll(word.ent_iob_, word.ent_type_)) for word in sent] for sent in doc.sents]

    # Token-level performance
    token_lvl = classification_report([w[1] for s in refs for w in s], [w[1] for s in hyps for w in s], digits=3)

    # Chunk-level performance
    chunk_lvl = conll.evaluate(refs, hyps)
    chunk_lvl = pd.DataFrame.from_dict(chunk_lvl, orient='index').sort_index().round(3).to_string()

    return token_lvl, chunk_lvl

# Compute performances
token_lvl, chunk_lvl = evaluate(doc)
print('Token-level performance\n', token_lvl)
print('\n')
print('Chunk-level performance\n', chunk_lvl)

Token-level performance
               precision    recall  f1-score   support

       B-LOC      0.780     0.692     0.733      1668
      B-MISC      0.107     0.553     0.179       702
       B-ORG      0.513     0.289     0.370      1661
       B-PER      0.779     0.607     0.683      1617
       I-LOC      0.566     0.630     0.597       257
      I-MISC      0.055     0.389     0.096       216
       I-ORG      0.459     0.499     0.478       835
       I-PER      0.829     0.735     0.779      1156
           O      0.940     0.868     0.902     38323

    accuracy                          0.813     46435
   macro avg      0.559     0.585     0.535     46435
weighted avg      0.883     0.813     0.843     46435



Chunk-level performance
            p      r      f     s
LOC    0.769  0.682  0.723  1668
MISC   0.105  0.541  0.175   702
ORG    0.455  0.256  0.328  1661
PER    0.757  0.590  0.663  1617
total  0.397  0.513  0.447  5648
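To make the difference between the two reports concrete, here is a minimal sketch of both measures on a toy sentence. It is not the implementation behind `conll.evaluate`; the function names `extract_chunks`, `chunk_prf` and `token_accuracy` and the tag sequences are assumptions made for illustration. A chunk counts as correct only if its start, end and type all match, so one wrong token can cost a whole entity.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks encoded by a list of IOB tags."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing chunk
        prefix, _, label = tag.partition('-')
        # An open chunk ends on O, on B-, or on an I- with a different type
        if start is not None and (prefix != 'I' or label != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # A chunk starts on B-, or on an I- when no chunk is open
        if prefix in ('B', 'I') and start is None:
            start, ctype = i, label
    return chunks


def chunk_prf(ref_tags, hyp_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    ref, hyp = extract_chunks(ref_tags), extract_chunks(hyp_tags)
    correct = len(ref & hyp)
    p = correct / len(hyp) if hyp else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def token_accuracy(ref_tags, hyp_tags):
    """Fraction of positions where the predicted tag equals the reference tag."""
    return sum(r == h for r, h in zip(ref_tags, hyp_tags)) / len(ref_tags)


# Toy example (hypothetical): the hypothesis truncates the PER entity
ref = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
hyp = ['B-PER', 'O', 'O', 'B-LOC', 'O']
print(token_accuracy(ref, hyp))  # 4 of 5 tags match
print(chunk_prf(ref, hyp))       # only the LOC chunk matches exactly
```

In this toy case four of five tags are right, yet only one of two chunks is, which illustrates why chunk-level scores tend to sit below token-level ones.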


## 2. Grouping of Entities

Write a function to group recognized named entities using the `noun_chunks` property of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of the most frequent combinations (i.e. NER types that go together).
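As a rough illustration of the grouping task, the sketch below works on plain token offsets instead of spaCy objects: an entity joins a group when its span lies entirely inside a noun chunk, and entities outside every chunk become singleton groups. The function name `group_labels` and the spans are assumptions made for illustration; with spaCy the offsets would come from `chunk.start`, `chunk.end`, `ent.start`, `ent.end` and `ent.label_`. This is an alternative to matching entities by text and is not the approach used in the cell that follows.

```python
from collections import Counter


def group_labels(chunk_spans, ent_spans):
    """Group entity labels by the noun chunk that fully contains them.

    chunk_spans: list of (start, end) token offsets, end exclusive.
    ent_spans:   list of (start, end, label) token offsets, end exclusive.
    """
    groups, grouped = [], set()
    for c_start, c_end in chunk_spans:
        inside = [e for e in ent_spans if c_start <= e[0] and e[1] <= c_end]
        if inside:
            groups.append([label for _, _, label in inside])
            grouped.update(inside)
    # Entities not covered by any chunk are kept as standalone groups
    groups.extend([e[2]] for e in ent_spans if e not in grouped)
    return groups


# Toy spans (hypothetical): the first chunk holds two entities
chunks = [(0, 3), (5, 6)]
ents = [(0, 1, 'NORP'), (1, 3, 'PERSON'), (4, 5, 'DATE'), (5, 6, 'GPE')]
groups = group_labels(chunks, ents)
freq = Counter('_'.join(g) for g in groups)
print(freq.most_common())
```

The nested scan is quadratic in the worst case; for a corpus-sized document a single pass over both sorted span lists would be the natural refinement.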

In [22]:
# Group entities based on noun_chunks method
def group_ents(doc):
    result = []
    i = 0
    doc_ents = list(doc.ents)
    for chunk in tqdm(list(doc.noun_chunks)):
        if len(chunk.ents) > 0:
            result.append([]) # Add new chunk in result
            for chunk_ent in chunk.ents:
                # Misaligned noun_chunks and ents found, create a new chunk
                if chunk_ent.text != doc_ents[i].text and len(result[-1]) > 0:
                    result.append([])
                while i < len(doc_ents) and chunk_ent.text != doc_ents[i].text: # Search for an alignment
                    result[-1].append(doc_ents[i].label_) # Add missing ents as standalone chunks
                    i+=1
                    result.append([])
                result[-1].append(doc_ents[i].label_)
                i+=1
    return result

# Compute frequency for each group
def evaluate_groups(groups):
    result = {}
    for group in groups:
        key = "_".join(group)
        if key not in result:
            result[key] = {'freq': 0}
        result[key]['freq'] += 1
    return result

# Compute metrics
ents_groups = group_ents(doc)
groups_freq = evaluate_groups(ents_groups)
groups_freq = pd.DataFrame.from_dict(groups_freq, orient='index').sort_values(by='freq', ascending=False).to_string()
print('\n')
print('Groups frequencies\n', groups_freq)

100%|██████████| 11157/11157 [04:08<00:00, 44.88it/s]

Groups frequencies
                              freq
CARDINAL                     1500
GPE                          1246
PERSON                       1021
DATE                          860
ORG                           831
NORP                          292
MONEY                         146
ORDINAL                       119
CARDINAL_PERSON                92
TIME                           80
PERCENT                        70
EVENT                          57
QUANTITY                       51
LOC                            49
NORP_PERSON                    41
GPE_PERSON                     32
ORG_PERSON                     24
PRODUCT                        23
FAC                            23
CARDINAL_ORG                   19
CARDINAL_GPE                   18
CARDINAL_NORP                  18
GPE_GPE                        17
LAW                            13
WORK_OF_ART                    13
GPE_ORG                        12
ORG_ORG

## 3. Fix segmentation errors
One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun compound, making use of the `compound` dependency relation.
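The extension rule can be sketched on plain lists before touching spaCy objects. The sketch assumes each token is described by the index of its head and its dependency label, and grows an entity over a neighbouring token when that token is free and linked to the span by `compound` in either direction. The function name `extend_with_compounds`, the sentence and its hand-written parse are assumptions made for illustration, not parser output.

```python
def extend_with_compounds(ents, heads, deps):
    """Grow each (start, end, label) span over adjacent tokens linked by `compound`."""
    taken = set()
    for start, end, _ in ents:
        taken.update(range(start, end))

    def linked(i, start, end):
        # Token i is a compound modifier of a token inside the span ...
        if deps[i] == 'compound' and start <= heads[i] < end:
            return True
        # ... or a token inside the span is a compound modifier of token i
        return any(deps[j] == 'compound' and heads[j] == i for j in range(start, end))

    result = []
    for start, end, label in ents:
        changed = True
        while changed:  # repeat until neither neighbour can be absorbed
            changed = False
            for i in (start - 1, end):  # left and right neighbours
                if 0 <= i < len(heads) and i not in taken and linked(i, start, end):
                    start, end = min(start, i), max(end, i + 1)
                    taken.add(i)  # prevents another entity from claiming the token
                    changed = True
        result.append((start, end, label))
    return result


# Hypothetical parse of: the Boston Red Sox won
heads = [3, 3, 3, 4, 4]
deps = ['det', 'compound', 'compound', 'nsubj', 'ROOT']
extended = extend_with_compounds([(2, 4, 'ORG')], heads, deps)
print(extended)  # the span "Red Sox" grows to include "Boston"
```

Restricting candidates to immediate neighbours keeps the resulting span contiguous, which an entity span requires.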

In [23]:
new_doc = doc.copy()
new_ents = []
for ent in new_doc.ents:
    new_ents.append(ent)
    # This list grows while it is iterated, so newly added tokens are processed too
    ent_tokens = list(ent)
    for token in ent_tokens:
        # Candidates: the children (with their own relation) and the head (with the token's relation)
        candidates = [(c, c.dep_) for c in token.children] + [(token.head, token.dep_)]
        for candidate, dep in candidates:
            # A candidate is valid if it is linked by `compound`, does not already belong
            # to another NE, and is immediately adjacent to the current span
            if (dep == 'compound' and candidate.ent_type_ == '' and
                    (candidate.i == new_ents[-1].start - 1 or candidate.i == new_ents[-1].end)):
                # Mark the newly added token with the NE label, to avoid re-adding it to another NE
                candidate.ent_type_ = new_ents[-1].label_
                # Create a new Span for the extended entity
                # N.B.: set_ents below assigns ent_type_ and ent_iob_ for the tokens of the new span
                new_start = min(new_ents[-1].start, candidate.i)
                new_end = max(new_ents[-1].end, candidate.i + 1)
                new_ents[-1] = Span(new_doc, new_start, new_end, new_ents[-1].label_)
                # Add the new candidate token to the list of tokens to be processed
                ent_tokens.append(candidate)
new_doc.set_ents(new_ents)

# Compute performances
new_token_lvl, new_chunk_lvl = evaluate(new_doc)
print('Token-level performance\n', new_token_lvl)
print('\n')
print('Chunk-level performance\n', new_chunk_lvl)

Token-level performance
               precision    recall  f1-score   support

       B-LOC      0.765     0.679     0.719      1668
      B-MISC      0.107     0.551     0.179       702
       B-ORG      0.503     0.284     0.363      1661
       B-PER      0.675     0.526     0.592      1617
       I-LOC      0.373     0.634     0.470       257
      I-MISC      0.048     0.394     0.086       216
       I-ORG      0.398     0.511     0.448       835
       I-PER      0.670     0.747     0.707      1156
           O      0.941     0.849     0.893     38323

    accuracy                          0.795     46435
   macro avg      0.498     0.575     0.495     46435
weighted avg      0.874     0.795     0.828     46435



Chunk-level performance
            p      r      f     s
LOC    0.707  0.627  0.665  1668
MISC   0.095  0.493  0.160   702
ORG    0.371  0.209  0.267  1661
PER    0.647  0.504  0.567  1617
total  0.350  0.452  0.394  5648
