# Introduction

The goal of this assignment is to create a program that provides an overview of basic evaluation metrics (in particular precision, recall, f-score and a confusion matrix) for documents provided in the CoNLL format.
You will need to implement the calculations for precision, recall and f-score yourself (i.e. do not use an existing module that spits them out). Make sure that your code can handle the situation where there are no true positives for a specific class.

This notebook provides functions for reading CoNLL files with pandas and proposes a structure for calculating your evaluation metrics and producing the confusion matrix. Feel free to adjust the proposed structure as you see fit.
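
As a point of reference, the three scores for a single class follow from its true positives (TP), false positives (FP) and false negatives (FN). The sketch below is a minimal illustration, not the required solution. The names `safe_divide` and `scores_from_counts` are hypothetical, and returning 0.0 for an undefined score is one common convention rather than the only option.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def scores_from_counts(tp, fp, fn):
    """Compute precision, recall and f-score (F1) for one class."""
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    # F1 is the harmonic mean of precision and recall
    f_score = safe_divide(2 * precision * recall, precision + recall)
    return precision, recall, f_score


# A class with 4 true positives, 1 false positive and 0 false negatives
print(scores_from_counts(4, 1, 0))
# A class without any true positives: every score falls back to 0.0
print(scores_from_counts(0, 2, 3))
```

Keeping the zero check in one helper means the "no true positives" case is handled the same way for all three scores.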

In [None]:
import sys
import pandas as pd
# see tips & tricks on using defaultdict (remove when you do not use it)
from collections import defaultdict, Counter
# module for verifying output
from nose.tools import assert_equal

# A note on Pandas

Pandas is a module that provides data structures and is widely used for dealing with data representations in machine learning. It is a bit more advanced than the csv module we saw in the preprocessing notebook.
Working with pandas data structures can be tricky, but it will generally work well if you follow online tutorials and examples closely. If your code is slow before you have even started training your models, the cause is likely the way you are using Pandas (it will still work in most cases, you will just have to wait a bit longer). Once you are more used to working with modules and complex objects, working with Pandas will also become easier.
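
To make the reading step concrete, here is a small sketch of loading tab-separated, CoNLL-style data with pandas. The column names `token` and `gold` and the three sample rows are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical mini file: a header row followed by one token per line
sample = "token\tgold\nJohn\tI-PER\nlives\tO\nhere\tO\n"

# read_csv accepts a path or any file-like object; sep sets the delimiter
frame = pd.read_csv(io.StringIO(sample), sep="\t")

# Selecting one column gives a Series; tolist() turns it into a plain list
labels = frame["gold"].tolist()
print(labels)
```

Converting the column to a list once and working on the list afterwards avoids repeated row-by-row access to the data frame, which is a common source of slow code.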

In the examples below, we assume that the 

In [None]:
def extract_annotations(inputfile, annotationcolumn, delimiter='\t'):
    '''
    This function extracts annotations represented in the conll format from a file
    
    :param inputfile: the path to the conll file
    :param annotationcolumn: the name of the column in which the target annotation is provided
    :param delimiter: optional parameter to overwrite the default delimiter (tab)
    :type inputfile: string
    :type annotationcolumn: string
    :type delimiter: string
    :returns: the annotations as a list
    '''
    #https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    conll_input = pd.read_csv(inputfile, sep=delimiter)
    annotations = conll_input[annotationcolumn].tolist()
    return annotations

In [None]:
def obtain_counts(goldannotations, machineannotations):
    '''
    This function compares the gold annotations to machine output
    
    :param goldannotations: the gold annotations
    :param machineannotations: the output annotations of the system in question
    :type goldannotations: the type of the object created in extract_annotations
    :type machineannotations: the type of the object created in extract_annotations
    
    :returns: a container providing the counts for each predicted and gold class pair
    '''
    
    # TIP on how to get the counts for each class
    # https://stackoverflow.com/questions/49393683/how-to-count-items-in-a-nested-dictionary, last accessed 22.10.2020
    evaluation_counts = defaultdict(Counter)
    assert len(goldannotations) == len(machineannotations)
    for gold_label, machine_label in zip(goldannotations, machineannotations):
        evaluation_counts[gold_label][machine_label] += 1
    return evaluation_counts
    
def calculate_precision_recall_fscore(evaluation_counts):
    '''
    Calculate precision, recall and f-score for each class and return them in a dictionary
    
    :param evaluation_counts: a container from which you can obtain the true positives, false positives and false negatives for each class
    :type evaluation_counts: type of object returned by obtain_counts
    
    :returns: the precision, recall and f-score of each class in a container
    '''
    
    # TIP: you may want to write a separate function that provides an overview of true positives, false positives and false negatives
    #      for each class based on the outcome of obtain counts
    # YOUR CODE HERE (and remove statement below)
    # plain dicts are sufficient here: the stored values are scores, not counts
    value_dict = defaultdict(dict)

    
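    for label in evaluation_counts.keys():
        # Calculation of TP, FP & FN
        TP = evaluation_counts[label][label]
        # FN: gold 'label' items predicted as any other class, including
        # classes that never occur in the gold annotations
        FN = sum(evaluation_counts[label].values()) - TP
        # FP: items of any other gold class that were predicted as 'label'
        FP = 0
        for label_ in evaluation_counts.keys():
            if label_ != label:
                FP += evaluation_counts[label_][label]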

        # Precision, Recall & F-Score for class 'label'
        try:
            precis = TP / (TP + FP)
        except ZeroDivisionError:
            precis = 0
        try:
            recall = TP / (TP + FN)
        except ZeroDivisionError:
            recall = 0
        try:
            FScore = (2 * precis * recall) / (precis + recall)
        except ZeroDivisionError:
            FScore = 0
        
        # Assign Precision, Recall, & F-Score according to set structure
        value_dict[label]['precision'] = precis
        value_dict[label]['recall'] = recall
        value_dict[label]['f-score'] = FScore

    return value_dict
            

def provide_confusion_matrix(evaluation_counts):
    '''
    Read in the evaluation counts and provide a confusion matrix for each class
    
    :param evaluation_counts: a container from which you can obtain the true positives, false positives and false negatives for each class
    :type evaluation_counts: type of object returned by obtain_counts
    
    :prints out a confusion matrix
    '''
    
    # TIP: provide_output_tables does something similar, but those tables are assuming one additional nested layer
    #      your solution can thus be a simpler version of the one provided in provide_output_tables below
    
    # YOUR CODE HERE (and remove statement below)
    # columns are gold labels, rows are predicted labels; absent pairs become 0
    res = pd.DataFrame.from_dict(evaluation_counts).fillna(0).astype(int)

    print(res)
    print(res.to_latex())

In [None]:
def carry_out_evaluation(gold_annotations, systemfile, systemcolumn, delimiter='\t'):
    '''
    Carries out the evaluation process (from input file to calculating relevant scores)
    
    :param gold_annotations: list of gold annotations
    :param systemfile: path to file with system output
    :param systemcolumn: indication of column with relevant information
    :param delimiter: specification of formatting of file (default delimiter set to '\t')
    
    :returns: evaluation information for this specific system
    '''
    system_annotations = extract_annotations(systemfile, systemcolumn, delimiter)
    evaluation_counts = obtain_counts(gold_annotations, system_annotations)
    provide_confusion_matrix(evaluation_counts)
    evaluation_outcome = calculate_precision_recall_fscore(evaluation_counts)
    
    return evaluation_outcome

In [None]:
def provide_output_tables(evaluations):
    '''
    Create tables based on the evaluation of various systems
    
    :param evaluations: the outcome of evaluating one or more systems
    '''
    # https://stackoverflow.com/questions/13575090/construct-pandas-dataframe-from-items-in-nested-dictionary
    evaluations_pddf = pd.DataFrame.from_dict({(i,j): evaluations[i][j]
                                              for i in evaluations.keys()
                                              for j in evaluations[i].keys()},
                                             orient='index')
    print(evaluations_pddf)
    print(evaluations_pddf.to_latex())
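
The dictionary comprehension above flattens the two outer levels of the nested dictionary into tuple keys. The sketch below shows the same idea on made-up scores. The values and the variable names `rows` and `table` are hypothetical, and pandas typically displays the tuple keys as a two-level row index.

```python
import pandas as pd

# Hypothetical scores: system name -> class label -> metric name -> value
evaluations = {
    "system1": {
        "O": {"precision": 0.8, "recall": 1.0},
        "I-PER": {"precision": 0.5, "recall": 0.25},
    }
}

# Flatten the two outer levels into (system, label) tuples, one per row
rows = {
    (system, label): scores
    for system, labels in evaluations.items()
    for label, scores in labels.items()
}

# orient="index" makes each dictionary key a row and each metric a column
table = pd.DataFrame.from_dict(rows, orient="index")
print(table)
```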

In [None]:
def run_evaluations(goldfile, goldcolumn, systems):
    '''
    Carry out standard evaluation for one or more system outputs
    
    :param goldfile: path to file with goldstandard
    :param goldcolumn: indicator of column in gold file where gold labels can be found
    :param systems: required information to find and process system output
    :type goldfile: string
    :type goldcolumn: string (the column name, as expected by extract_annotations)
    :type systems: list (providing file name, name of the column with system output and system name for each element)
    
    :returns: the evaluations for all systems
    '''
    evaluations = {}
    #not specifying delimiters here, since it corresponds to the default ('\t')
    gold_annotations = extract_annotations(goldfile, goldcolumn)
    for system in systems:
        sys_evaluation = carry_out_evaluation(gold_annotations, system[0], system[1])
        evaluations[system[2]] = sys_evaluation
    return evaluations

# Checking the overall set-up

The functions below illustrate how to run the setup outlined above using a main function and, later, command-line arguments. This setup will ease the transition to an experimental setup that no longer uses notebooks, which you will submit later on. There are also some functions that can be used to test your implementation. You can carry out a few small tests yourself with the data provided in the data/ folder.

In [None]:
def identify_evaluation_value(system, class_label, value_name, evaluations):
    '''
    Return the outcome of a specific value of the evaluation
    
    :param system: the name of the system
    :param class_label: the name of the class for which the value should be returned
    :param value_name: the name of the score that is returned
    :param evaluations: the overview of evaluations
    
    :returns: the requested value
    '''
    return evaluations[system][class_label][value_name]

In [None]:
def create_system_information(system_information):
    '''
    Takes system information in the form that it is passed on through sys.argv or via a settingsfile
    and returns a list of elements specifying all the needed information on each system output file to carry out the evaluation.
    
    :param system_information: the input as received from the command line or an input file
    '''
    # https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    systems_list = [system_information[i:i + 3] for i in range(0, len(system_information), 3)]
    return systems_list
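
The slicing step groups a flat argument list into one triple per system. The sketch below repeats that idea on made-up arguments. The function name `chunk_triples` and the file names are hypothetical.

```python
def chunk_triples(flat_arguments):
    """Group a flat list into consecutive sublists of at most three items."""
    return [flat_arguments[i:i + 3] for i in range(0, len(flat_arguments), 3)]


# Hypothetical arguments for two systems: file, column name, system name
flat = ["out1.csv", "NER", "system1", "out2.csv", "NER", "system2"]
for system_file, column, name in chunk_triples(flat):
    print(name, "->", system_file, column)
```

If the number of arguments is not a multiple of three, the last sublist is shorter, so it may be worth validating the length before unpacking.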

In [None]:

def main(my_args=None):
    
    if my_args is None:
        # skip sys.argv[0], which holds the script name rather than an argument
        my_args = sys.argv[1:]
    
    system_info = create_system_information(my_args[2:])
    evaluations = run_evaluations(my_args[0], my_args[1], system_info)
    provide_output_tables(evaluations)
    check_eval = identify_evaluation_value('system1', 'O', 'f-score', evaluations)
    # if the f-score was not computed correctly, this assert statement raises an AssertionError
    assert_equal("%.3f" % check_eval,"0.889")
    

# these can come from the commandline using sys.argv for instance
my_args = ['../../data/minigold.csv','gold','../../data/miniout1.csv','NER','system1']
main(my_args)

In [None]:
#some additional tests

test_args = ['../../data/minigold.csv','gold','../../data/miniout2.csv','NER','system2']
system_info = create_system_information(test_args[2:])
evaluations = run_evaluations(test_args[0], test_args[1], system_info)
test_eval = identify_evaluation_value('system2', 'I-ORG', 'f-score', evaluations)
assert_equal("%.3f" % test_eval,"0.571")
test_eval2 = identify_evaluation_value('system2', 'I-PER', 'precision', evaluations)
assert_equal("%.3f" % test_eval2,"0.500")
test_eval3 = identify_evaluation_value('system2', 'I-ORG', 'recall', evaluations)
assert_equal("%.3f" % test_eval3,"0.667")