# Introduction

The goal of this assignment is to create a basic program that reports standard evaluation metrics (precision, recall, f-score and a confusion matrix) for annotations provided in the CoNLL format. You will need to implement the calculations for precision, recall and f-score yourself (i.e. do not use an existing module that computes them for you). Make sure that your code can handle the situation where there are no true positives for a specific class.

This notebook provides functions for reading in CoNLL structures with pandas and proposes a structure for calculating your evaluation metrics and producing the confusion matrix. Feel free to adjust the proposed structure as you see fit.
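
For reference, the three scores are conventionally defined per class $c$ from its true positives ($TP_c$), false positives ($FP_c$) and false negatives ($FN_c$). The subscripted symbols are notation introduced here, not taken from the assignment text.

```latex
\[
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F_{1,c} = \frac{2 \, P_c \, R_c}{P_c + R_c}
\]
```

When a denominator is zero, for example for a class without any true positives, one common convention (assumed here) is to report the score as 0.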

In [27]:
# libraries

import sys
import numpy as np
import pandas as pd
# see tips & tricks on using defaultdict (remove when you do not use it)
from collections import defaultdict, Counter
# module for verifying output
from nose.tools import assert_equal

In [2]:
def extract_annotations(inputfile, annotationcolumn, delimiter='\t'):
    '''
    This function extracts annotations represented in the CoNLL format from a file
    
    :param inputfile: the path to the CoNLL file
    :param annotationcolumn: the name of the column in which the target annotation is provided
    :param delimiter: optional parameter to overwrite the default delimiter (tab)
    :type inputfile: string
    :type annotationcolumn: string
    :type delimiter: string
    :returns: the annotations as a list
    '''
    #https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    conll_input = pd.read_csv(inputfile, sep=delimiter, on_bad_lines='skip')
    annotations = conll_input[annotationcolumn].tolist()
    return annotations
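
As a point of comparison, the same column extraction can be sketched with only the standard library. This is a swapped-in alternative to the pandas reader above, not part of the assignment. `read_conll_column` is a hypothetical helper name and the sample rows are made up for illustration. The sketch assumes the file has a header row, as the column names used above suggest.

```python
import csv
import io


def read_conll_column(handle, column, delimiter="\t"):
    """Return the values of one named column from a delimited file with a header row."""
    # QUOTE_NONE treats quote characters as ordinary text, which matters for
    # tokenised corpora where '"' can be a token of its own.
    reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [row[column] for row in reader]


# Hypothetical three-token sample; not taken from the assignment data.
sample = io.StringIO('token\tgold\nPiek\tB-PER\nVossen\tI-PER\n"\tO\n')
tokens = read_conll_column(sample, "token")
sample.seek(0)  # rewind so the same sample can be read again
labels = read_conll_column(sample, "gold")

print(tokens)
print(labels)
```

The explicit `quoting=csv.QUOTE_NONE` is the main design choice here: with default quoting, a lone quotation-mark token can swallow the lines that follow it.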

In [3]:
extract_annotations("datas/minigold.csv","token")

['The',
 'Computational',
 'Lexicology',
 'and',
 'Terminology',
 'Lab',
 'headed',
 'by',
 'Piek',
 'Vossen',
 'offers',
 'mutliple',
 'courses',
 'in',
 'NLP',
 '.']

In [4]:
goldannotations = extract_annotations("datas/minigold.csv","gold")
machineannotations = extract_annotations("datas/miniout1.csv","NER")

In [5]:
goldannotations

['O',
 'B-ORG',
 'I-ORG',
 'I-ORG',
 'B-ORG',
 'I-ORG',
 'O',
 'O',
 'B-PER',
 'I-PER',
 'O',
 'O',
 'O',
 'O',
 'B-MISC',
 'O']

In [6]:
machineannotations

['O',
 'B-ORG',
 'I-ORG',
 'O',
 'B-ORG',
 'I-ORG',
 'O',
 'O',
 'B-PER',
 'I-PER',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [7]:
results = {}

for i in machineannotations:
    results[i] = machineannotations.count(i)
    
print(results)

{'O': 10, 'B-ORG': 2, 'I-ORG': 2, 'B-PER': 1, 'I-PER': 1}


In [8]:
results = {}

for i in goldannotations:
    results[i] = goldannotations.count(i)
    
print(results)

{'O': 8, 'B-ORG': 2, 'I-ORG': 3, 'B-PER': 1, 'I-PER': 1, 'B-MISC': 1}
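
The two counting loops above call `list.count` once per token. A sketch of the same tally with `collections.Counter`, which makes a single pass over the list, is given here. The `gold` list is copied from the `goldannotations` output shown earlier.

```python
from collections import Counter

# Gold labels copied from the goldannotations output shown earlier.
gold = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'O', 'O',
        'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-MISC', 'O']

label_counts = Counter(gold)  # counts every label in one pass
print(dict(label_counts))
```

Because `Counter` keeps labels in first-seen order, the printed dictionary matches the one produced by the loop.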


In [9]:
def obtain_counts(goldannotations, machineannotations):
    '''
    This function compares the gold annotations to machine output
    
    :param goldannotations: the gold annotations
    :param machineannotations: the output annotations of the system in question
    :type goldannotations: the type of the object created in extract_annotations
    :type machineannotations: the type of the object created in extract_annotations
    
    :returns: a container providing the counts for each predicted and gold class pair
    '''
    
    # TIP on how to get the counts for each class
    # https://stackoverflow.com/questions/49393683/how-to-count-items-in-a-nested-dictionary, last accessed 22.10.2020
    evaluation_counts = defaultdict(Counter)
        
    for i, j in zip(goldannotations, machineannotations):
        evaluation_counts[i][j] += 1

    
    return evaluation_counts

In [10]:
evaluation_counts = obtain_counts(goldannotations,machineannotations)
evaluation_counts

defaultdict(collections.Counter,
            {'O': Counter({'O': 8}),
             'B-ORG': Counter({'B-ORG': 2}),
             'I-ORG': Counter({'I-ORG': 2, 'O': 1}),
             'B-PER': Counter({'B-PER': 1}),
             'I-PER': Counter({'I-PER': 1}),
             'B-MISC': Counter({'O': 1})})

In [11]:
evaluation_counts['I-ORG']

Counter({'I-ORG': 2, 'O': 1})

In [12]:
TP = 0
FP = 0
FN = 0

for k,v in evaluation_counts.items():   # k = gold label, v = Counter of predicted labels
    for i,j in v.items():                # i = predicted label, j = count
        if i == k:
            TP += j
            print("TP: ",TP)
        else:
            FP += j
            print("FP: ",FP)
            # FN is only incremented when the gold label has two or more distinct predictions
            if len(v) >= 2:
                FN += j
                print("FN: ",FN)

TP:  8
TP:  10
TP:  12
FP:  1
FN:  1
TP:  13
TP:  14
FP:  2


In [17]:
def calculate_true_false(evaluation_counts):
    '''
    Calculates the total number of true positives, true negatives, false positives and false negatives, summed over all classes
    
    :param evaluation_counts: nested counts of the form counts[gold label][predicted label]
    :type evaluation_counts: type of object returned by obtain_counts
    
    :returns: the tuple (TP, TN, FP, FN); TN is not computed and is always 0
    '''
    
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    
    for k,v in evaluation_counts.items(): # e.g. k = 'O', v = Counter({'O': 8})
        for i,j in v.items():              # i = predicted label, j = count
            if i == k:
                TP += j
            else:
                FP += j
                # NOTE: an error only counts as a false negative when the gold class has
                # at least two distinct predicted labels, so a gold class that is always
                # mispredicted (B-MISC in the mini example) adds to FP but not to FN.
                if len(v) >= 2:
                    FN += j
                    
    return TP,TN,FP,FN

In [18]:
TP,TN,FP,FN = calculate_true_false(evaluation_counts)
print(TP,TN,FP,FN)

14 0 2 1
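
The totals above are summed over all classes, whereas the assignment asks for counts per class. One way to derive them from the nested counts is sketched here. `per_class_counts` is a hypothetical helper name, and the `counts` literal is the `evaluation_counts` output shown earlier, written as plain dicts.

```python
def per_class_counts(counts):
    """Return {label: (tp, fp, fn)} from nested counts[gold][predicted]."""
    # Collect every label that occurs as a gold or as a predicted label.
    labels = set(counts)
    for predictions in counts.values():
        labels.update(predictions)

    result = {}
    for label in sorted(labels):
        row = counts.get(label, {})          # predictions for this gold label
        tp = row.get(label, 0)               # gold == predicted
        fn = sum(row.values()) - tp          # gold is label, predicted something else
        fp = sum(preds.get(label, 0)         # predicted label, gold something else
                 for gold, preds in counts.items() if gold != label)
        result[label] = (tp, fp, fn)
    return result


# Copied from the evaluation_counts output shown earlier.
counts = {
    'O': {'O': 8},
    'B-ORG': {'B-ORG': 2},
    'I-ORG': {'I-ORG': 2, 'O': 1},
    'B-PER': {'B-PER': 1},
    'I-PER': {'I-PER': 1},
    'B-MISC': {'O': 1},
}

per_class = per_class_counts(counts)
for label, (tp, fp, fn) in per_class.items():
    print(label, "TP:", tp, "FP:", fp, "FN:", fn)
```

Under this definition every misclassified token is a false positive for the predicted class and a false negative for the gold class, so both the per-class false positives and the per-class false negatives sum to 2 for this data.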


In [19]:
def calculate_precision_recall_fscore(evaluation_counts):
    '''
    Calculate precision, recall and f-score from the counts summed over all classes and return them in a tuple
    
    :param evaluation_counts: nested counts of the form counts[gold label][predicted label]
    :type evaluation_counts: type of object returned by obtain_counts
    
    :returns: the tuple (precision, recall, f1_score)
    '''
    
    # TIP: you may want to write a separate function that provides an overview of true positives, false positives and false negatives
    #      for each class based on the outcome of obtain counts
    
    TP,TN,FP,FN = calculate_true_false(evaluation_counts)
    
    # fall back to 0.0 when a denominator is zero (e.g. when there are no true positives)
    recall = TP / (TP+FN) if (TP+FN) > 0 else 0.0
    precision = TP / (TP+FP) if (TP+FP) > 0 else 0.0
    f1_score = (2*precision*recall) / (precision+recall) if (precision+recall) > 0 else 0.0
    
    return precision,recall,f1_score

In [23]:
precision,recall,f1_score = calculate_precision_recall_fscore(evaluation_counts)
print("precision: ",precision)
print("recall : ",recall)
print("f1_score : ",f1_score)

precision:  0.875
recall :  0.9333333333333333
f1_score :  0.9032258064516129
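
The scores above describe the whole dataset at once. A per-class version that also copes with a class without true positives could look like the sketch below. `safe_divide` and `precision_recall_f1` are hypothetical names, returning 0.0 for a zero denominator is an assumed convention, and the three `(TP, FP, FN)` triples were derived by hand from the `evaluation_counts` output shown earlier.

```python
def safe_divide(numerator, denominator):
    """Divide, returning 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class."""
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return precision, recall, f1


# (TP, FP, FN) per class, derived by hand from evaluation_counts:
# 'O' is predicted for one I-ORG and one B-MISC token, hence FP = 2.
per_class = {'O': (8, 2, 0), 'I-ORG': (2, 0, 1), 'B-MISC': (0, 0, 1)}

scores = {label: precision_recall_f1(*triple) for label, triple in per_class.items()}
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

B-MISC has no true positives and is never predicted, so all three of its scores fall back to 0.0 rather than raising a `ZeroDivisionError`.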


In [30]:
def provide_confusion_matrix(evaluation_counts):
    '''
    Read in the evaluation counts and provide a 2x2 confusion matrix of the counts summed over all classes
    
    :param evaluation_counts: a container from which you can obtain the true positives, false positives and false negatives for each class
    :type evaluation_counts: type of object returned by obtain_counts
    
    :returns: a numpy array laid out as [[TN, FN], [FP, TP]]
    '''
    
    # TIP: provide_output_tables does something similar, but those tables are assuming one additional nested layer
    # your solution can thus be a simpler version of the one provided in provide_output_tables below
    
    TP,TN,FP,FN = calculate_true_false(evaluation_counts)
    confusion_matrix = np.array([[TN,FN],[FP,TP]])
    
    return confusion_matrix

In [31]:
confusion_matrix = provide_confusion_matrix(evaluation_counts)
print("Confusion matrix of a given dataset is:")
print(confusion_matrix)

Confusion matrix of a given dataset is:
[[ 0  1]
 [ 2 14]]
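
The 2x2 matrix above collapses all labels into one set of totals. A label-by-label confusion matrix, with gold labels as rows and predicted labels as columns, can be read off the nested counts directly. The sketch below uses only the standard library. `confusion_table` is a hypothetical helper name, and `counts` repeats the `evaluation_counts` output shown earlier as plain dicts.

```python
def confusion_table(counts):
    """Return (labels, matrix) where matrix[i][j] counts gold labels[i] predicted as labels[j]."""
    labels = sorted(set(counts) | {p for preds in counts.values() for p in preds})
    matrix = [[counts.get(gold, {}).get(pred, 0) for pred in labels] for gold in labels]
    return labels, matrix


# Copied from the evaluation_counts output shown earlier.
counts = {
    'O': {'O': 8},
    'B-ORG': {'B-ORG': 2},
    'I-ORG': {'I-ORG': 2, 'O': 1},
    'B-PER': {'B-PER': 1},
    'I-PER': {'I-PER': 1},
    'B-MISC': {'O': 1},
}

labels, matrix = confusion_table(counts)

# Print a plain-text table: header row of predicted labels, one row per gold label.
width = max(len(label) for label in labels) + 2
print(' ' * width + ''.join(label.rjust(width) for label in labels))
for label, row in zip(labels, matrix):
    print(label.ljust(width) + ''.join(str(value).rjust(width) for value in row))
```

The diagonal holds the correctly labelled tokens, and each off-diagonal cell shows which gold label was confused with which predicted label.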


In [33]:
def carry_out_evaluation(gold_annotations, systemfile, systemcolumn, delimiter='\t'):
    '''
    Carries out the evaluation process (from input file to calculating relevant scores)
    
    :param gold_annotations: list of gold annotations
    :param systemfile: path to file with system output
    :param systemcolumn: indication of column with relevant information
    :param delimiter: specification of formatting of file (default delimiter set to '\t')
    
    :returns: the (precision, recall, f-score) tuple for this specific system
    '''
    system_annotations = extract_annotations(systemfile, systemcolumn, delimiter)
    evaluation_counts = obtain_counts(gold_annotations, system_annotations)
    provide_confusion_matrix(evaluation_counts)
    evaluation_outcome = calculate_precision_recall_fscore(evaluation_counts)
    
    return evaluation_outcome

In [48]:
systemfile = "datas/miniout1.csv"
evaluation_outcome = carry_out_evaluation(gold_annotations=goldannotations,systemfile=systemfile,systemcolumn="NER")
print(evaluation_outcome)

(0.875, 0.9333333333333333, 0.9032258064516129)


In [49]:
systemfile2 = "datas/miniout2.csv"
evaluation_outcome = carry_out_evaluation(gold_annotations=goldannotations,systemfile=systemfile2,systemcolumn="NER")
print(evaluation_outcome)

(0.6875, 0.9166666666666666, 0.7857142857142857)
