# Evaluation 

This notebook contains the evaluation scores of the systems that I trained. The code computes the macro-averaged precision, recall and F1 scores by comparing the output of each system to the gold data. All files must be in CoNLL format. 
The main function has an `extended` setting, which also prints the confusion matrix and the evaluation measures for each class. 
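
As a rough illustration of what macro averaging means here, the sketch below computes precision and recall per class on a toy label sequence and then averages them with equal weight per class. The labels and the helper name `per_class_scores` are invented for this example and are not part of the notebook's code.

```python
from collections import Counter

# toy data, invented for illustration
gold = ['O', 'O', 'B-PER', 'B-LOC', 'O', 'B-PER']
pred = ['O', 'B-PER', 'B-PER', 'O', 'O', 'B-PER']

def per_class_scores(gold, pred):
    """Return {label: (precision, recall)} for every label that occurs in the gold data."""
    pairs = Counter(zip(gold, pred))
    scores = {}
    for label in set(gold):
        tp = pairs[(label, label)]
        # predicted as this label, but the gold label is different
        fp = sum(n for (g, p), n in pairs.items() if p == label and g != label)
        # gold is this label, but something else was predicted
        fn = sum(n for (g, p), n in pairs.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores

scores = per_class_scores(gold, pred)
# macro average: every class counts equally, regardless of how often it occurs
macro_precision = sum(p for p, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r in scores.values()) / len(scores)
print(scores)
print(round(macro_precision, 3), round(macro_recall, 3))
```

Because every class has the same weight, the single missed `B-LOC` token pulls the macro averages down as much as an error on the frequent `O` class would.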

In [1]:
import sys
import pandas as pd

# defaultdict and Counter are used to collect the counts per gold and predicted label
from collections import defaultdict, Counter

In [2]:
def extract_annotations(inputfile: str, annotationcolumn: str, delimiter: str ='\t'):
    '''
    This function extracts annotations represented in the conll format from a file
    
    :param inputfile: the path to the conll file
    :param annotationcolumn: the name/index of the column in which the target annotation is provided
    :param delimiter: optional parameter to overwrite the default delimiter (tab)

    :returns: the annotations as a list
    '''
    annotations = []
    first_line = True
    with open(inputfile, 'r', encoding='utf8') as infile:
        for line in infile:
            #skip the first line in the data, this contains the column names (which equal the indices)
            if first_line:
                first_line = False
                continue
            components = line.rstrip('\n').split(delimiter)
            #skip empty lines
            if len(components) > 1:
                annotations.append(components[int(annotationcolumn)])
    return annotations
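
To make the expected input concrete, here is a minimal sketch that writes a tiny tab-separated file and reads one column back. The file content is invented, and `read_column` is a compact stand-in that follows the same steps as `extract_annotations` (skip the header line, skip empty lines); it is not the function itself.

```python
import os
import tempfile

def read_column(path, column, delimiter='\t'):
    """Compact stand-in for extract_annotations: skip the header, skip empty lines."""
    values = []
    with open(path, 'r', encoding='utf8') as infile:
        next(infile, None)  # the first line holds the column names
        for line in infile:
            components = line.rstrip('\n').split(delimiter)
            if len(components) > 1:  # sentence boundaries are empty lines
                values.append(components[int(column)])
    return values

# invented toy file: a header line and two sentences separated by an empty line
content = "0\t1\t2\t3\nEU\tNNP\tB-NP\tB-ORG\nrejects\tVBZ\tB-VP\tO\n\nPeter\tNNP\tB-NP\tB-PER\n"
with tempfile.NamedTemporaryFile('w', suffix='.conll', delete=False, encoding='utf8') as tmp:
    tmp.write(content)

labels = read_column(tmp.name, '3')
os.remove(tmp.name)
print(labels)
```

The column index is passed as a string, as it would arrive from the commandline, and is cast to an integer only when the column is accessed.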

In [3]:
def obtain_counts(gold_annotations, machine_annotations):
    '''
    This function compares the gold annotations to machine output
    
    :param gold_annotations: the gold annotations
    :param machine_annotations: the output annotations of the system in question
    :type gold_annotations: the type of the object created in extract_annotations
    :type machine_annotations: the type of the object created in extract_annotations
    
    :returns: a container providing the counts for each predicted and gold class pair
    '''
    evaluation_counts = defaultdict(Counter)
    for i, annotation in enumerate(gold_annotations):
        evaluation_counts[annotation][machine_annotations[i]] += 1
    return evaluation_counts
        
def safe_divide(numerator, denominator):
    #This function divides the numerator by the denominator. If the denominator is zero it returns zero
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return 0
    
def calculate_precision_recall_fscore(evaluation_counts):
    '''
    Calculate precision, recall and f-score for each class and return them in a dictionary; the macro averages are printed
    
    :param evaluation_counts: a container from which you can obtain the true positives, false positives and false negatives for each class
    :type evaluation_counts: type of object returned by obtain_counts
    
    :returns: the precision, recall and f-score of each class in a container
    '''
    scores = {}
    for classlabel, counts in evaluation_counts.items():
        #True positives are the number of times we correctly classify the label
        TP = counts[classlabel]
        #false negatives are the number of times we should have selected the current label but selected another
        FN = sum([count for label, count in counts.items() if label != classlabel])
        #false positives are the number of times we should have selected another label but selected this label
        FP = sum([label_counts[classlabel] for label, label_counts in evaluation_counts.items() if label != classlabel])
        
        #calculate metrics, make sure to safe divide in case the denominator is zero
        precision = round(safe_divide(TP, (TP + FP)),3)
        recall = round(safe_divide(TP, (TP + FN)),3)
        F1 = round(safe_divide((2* precision * recall), (precision + recall)),3)
        #save scores
        scores[classlabel] = {'precision' : precision, 'recall': recall, 'f-score': F1}
        
    #get macro averages
    macro_precision = sum([score['precision'] for score in scores.values()]) / len(scores.keys())  
    macro_recall = sum([score['recall'] for score in scores.values()]) / len(scores.keys())  
    macro_F1 = safe_divide((2* macro_precision * macro_recall), (macro_precision + macro_recall))
    
    #print macro averages
    print(f"Macro precision score : {round(macro_precision * 100,2)}")
    print(f"Macro recall score : {round(macro_recall * 100,2)}")
    print(f"Macro F1 score : {round(macro_F1 *100,2)}")
    return scores

def provide_confusion_matrix(evaluation_counts):
    '''
    Read in the evaluation counts and provide a confusion matrix for each class
    
    :param evaluation_counts: a container from which you can obtain the true positives, false positives and false negatives for each class
    :type evaluation_counts: type of object returned by obtain_counts
    
    :prints out a confusion matrix
    '''
    #make sure all values are in the table, and that the same order is maintained for all labels, so that we get a clean table
    #the table is built from a copy: overwriting evaluation_counts here would discard predictions of labels
    #that never occur in the gold data, which changes the scores calculated afterwards
    labels = list(evaluation_counts.keys())
    confusions = {i: {j: evaluation_counts[i][j] for j in labels} for i in labels}
        
    # create matrix
    confusions_pddf = pd.DataFrame.from_dict(confusions, orient='index', columns=labels)
    #print matrix and latex version of matrix
    print(confusions_pddf)
    print(confusions_pddf.to_latex())
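
`calculate_precision_recall_fscore` derives the macro F1 as the harmonic mean of the macro precision and the macro recall. Another common convention is to average the per-class F1 scores, and the two can give different numbers. The sketch below contrasts them on invented per-class scores; the dictionary mimics the format returned by the function, but the values are made up.

```python
# invented per-class scores in the format returned by calculate_precision_recall_fscore
scores = {
    'O':     {'precision': 0.9, 'recall': 0.9, 'f-score': 0.9},
    'B-PER': {'precision': 1.0, 'recall': 0.2, 'f-score': 0.333},
}

macro_precision = sum(s['precision'] for s in scores.values()) / len(scores)
macro_recall = sum(s['recall'] for s in scores.values()) / len(scores)

# convention used in this notebook: harmonic mean of the macro averages
f1_from_macro_pr = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)

# alternative convention: plain average of the per-class F1 scores
mean_of_class_f1 = sum(s['f-score'] for s in scores.values()) / len(scores)

print(round(f1_from_macro_pr, 3), round(mean_of_class_f1, 3))
```

When comparing the numbers in this notebook with scores from other tools, it is worth checking which of the two conventions those tools follow.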

In [4]:

def carry_out_evaluation(gold_annotations, systemfile, systemcolumn, extended, delimiter='\t'):
    '''
    Carries out the evaluation process (from input file to calculating relevant scores)
    
    :param gold_annotations: list of gold annotations
    :param systemfile: path to file with system output
    :param systemcolumn: indication of column with relevant information
    :param extended: if True, the confusion matrix is printed as well
    :param delimiter: specification of formatting of file (default delimiter set to '\t')
    
    :returns: evaluation information for this specific system
    '''
    #retrieve annotations of the system
    system_annotations = extract_annotations(systemfile, systemcolumn, delimiter)
    #evaluate
    evaluation_counts = obtain_counts(gold_annotations, system_annotations)
    
    #print confusion matrix in extended evaluation setting
    if extended:
        provide_confusion_matrix(evaluation_counts)
    #get evaluation metrics
    evaluation_outcome = calculate_precision_recall_fscore(evaluation_counts)
    return evaluation_outcome
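
`obtain_counts` pairs the gold and system labels by position, so both files need to contain the same number of tokens: a shorter system file raises an `IndexError`, and the extra lines of a longer one are ignored without notice. A minimal sketch of a check that could run before counting is given below; `check_alignment` is a hypothetical helper and not part of the code above.

```python
def check_alignment(gold_annotations, system_annotations):
    """Raise a ValueError when the two label lists differ in length (hypothetical helper)."""
    if len(gold_annotations) != len(system_annotations):
        raise ValueError(
            f"gold has {len(gold_annotations)} labels, "
            f"system output has {len(system_annotations)}"
        )

check_alignment(['O', 'B-PER'], ['O', 'O'])  # same length, passes silently

try:
    check_alignment(['O', 'B-PER', 'O'], ['O', 'O'])
except ValueError as error:
    print(error)
```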

In [5]:
def provide_output_tables(evaluations):
    '''
    Create tables based on the evaluation of various systems
    
    :param evaluations: the outcome of evaluating one or more systems
    '''
    evaluations_pddf = pd.DataFrame.from_dict({(i,j): evaluations[i][j]
                                              for i in evaluations.keys()
                                              for j in evaluations[i].keys()},
                                             orient='index')

    print(evaluations_pddf)
    print(evaluations_pddf.to_latex())

In [6]:
def run_evaluations(goldfile, goldcolumn, systems, extended):
    '''
    Carry out standard evaluation for one or more system outputs
    
    :param goldfile: path to file with goldstandard
    :param goldcolumn: indicator of column in gold file where gold labels can be found
    :param systems: required information to find and process system output
    :type goldfile: string
    :type goldcolumn: integer
    :type systems: list (providing file name, information on tab with system output and system name for each element)
    
    :returns the evaluations for all systems
    '''
    evaluations = {}
    #extract gold annotations
    gold_annotations = extract_annotations(goldfile, goldcolumn)
    #evaluate
    for system in systems:
        sys_evaluation = carry_out_evaluation(gold_annotations, system[0], system[1], extended)
        evaluations[system[2]] = sys_evaluation
    return evaluations

# Checking the overall set-up

The functions below illustrate how to run the setup outlined above using a main function and, later, commandline arguments. This setup facilitates the transition to an experimental setup that no longer makes use of notebooks, which you will submit later on. There are also some functions that can be used to test your implementation. You can carry out a few small tests yourself with the data provided in the data/ folder.

In [7]:
def identify_evaluation_value(system, class_label, value_name, evaluations):
    '''
    Return the outcome of a specific value of the evaluation
    
    :param system: the name of the system
    :param class_label: the name of the class for which the value should be returned
    :param value_name: the name of the score that is returned
    :param evaluations: the overview of evaluations
    
    :returns the requested value
    '''
    return evaluations[system][class_label][value_name]

In [8]:
def create_system_information(system_information):
    '''
    Takes system information in the form that it is passed on through sys.argv or via a settingsfile
    and returns a list of elements specifying all the needed information on each system output file to carry out the evaluation.
    
    :param system_information is the input as from a commandline or an input file
    '''
    systems_list = [system_information[i:i + 3] for i in range(0, len(system_information), 3)]
    return systems_list
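
The slicing in `create_system_information` groups a flat argument list into triples of file, column and system name. A small sketch with invented file names:

```python
# arguments as they follow the gold file and gold column; the file names are invented
system_information = ['sys_a.conll', '2', 'system_a', 'sys_b.conll', '9', 'system_b']

# step through the list three items at a time
systems_list = [system_information[i:i + 3] for i in range(0, len(system_information), 3)]
for path, column, name in systems_list:
    print(name, path, column)
```

If the number of arguments is not a multiple of three, the last group is shorter than the others, which only surfaces later when `run_evaluations` accesses `system[2]`.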

In [9]:
def main(my_args=None, extended=False):
    '''
    A main function. This makes sure to carry out the evaluation for the given gold and machine annotation files
    
    sys.argv is a very lightweight way of passing arguments from the commandline to a script.
    
    :param my_args : a list containing the following parameters:
                    my_args[0] : the path (str) to the goldfile
                    my_args[1] : the index of the column in the gold file in which the gold labels can be found
                    my_args[2] : the path (str) to the file containing machine annotations to be evaluated
                    my_args[3] : the index of the column in the machine annotations file in which the annotations can be found
                    my_args[4] : the name of the system, used as key in the overview of evaluations
                    
    :param extended: if this is true, the output is not only the performance measures but also the confusion matrix and 
                      the scores per label. Default is false 
    '''
    if my_args is None:
        #sys.argv[0] is the name of the script itself, so the actual arguments start at index 1
        my_args = sys.argv[1:]
        
    
    system_info = create_system_information(my_args[2:])
    evaluations = run_evaluations(my_args[0], my_args[1], system_info, extended)
    #in the extended setting, also print confusion and output tables
    if extended:
        provide_output_tables(evaluations)
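
For a script version, the flat list could also be built with `argparse` instead of reading `sys.argv` directly. The sketch below is one possible front end under that assumption; `parse_arguments` and the `--extended` flag are hypothetical and do not exist in the code above.

```python
import argparse

def parse_arguments(argv):
    """Hypothetical argparse front end that rebuilds the flat my_args list expected by main."""
    parser = argparse.ArgumentParser(description='Evaluate system output against gold data')
    parser.add_argument('goldfile')
    parser.add_argument('goldcolumn')
    parser.add_argument('systems', nargs='+',
                        help='repeated triples: file, column, system name')
    parser.add_argument('--extended', action='store_true')
    args = parser.parse_args(argv)
    # every system needs exactly three values
    if len(args.systems) % 3 != 0:
        parser.error('each system needs a file, a column and a name')
    return [args.goldfile, args.goldcolumn] + args.systems, args.extended

my_args, extended = parse_arguments(['gold.conll', '3', 'out.conll', '2', 'system1', '--extended'])
print(my_args, extended)
```

The returned values could then be passed on as `main(my_args, extended)`.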

## Evaluations
The results for all models I evaluated are listed below.

### Pre-trained models
spaCy

In [10]:
my_args = ['../data/conll2003.dev-preprocessed.conll','3','../data/spacy_out.dev-preprocessed.conll','2','system1']
main(my_args)

Macro precision score : 53.86
Macro recall score : 62.77
Macro F1 score : 57.97


Stanford CoreNLP

In [11]:
my_args = ['../data/conll2003.dev-preprocessed.conll','3','../data/stanford_out.dev-preprocessed.conll','3','system2']
main(my_args)

Macro precision score : 34.33
Macro recall score : 49.16
Macro F1 score : 40.43


### Feature engineering models
#### Logistic Regression
basic setting (token only)

In [12]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/logreg_basic.conll','9','system2']
main(my_args)

Macro precision score : 81.13
Macro recall score : 54.76
Macro F1 score : 65.38


extended setting (all features included, token as one-hot)

In [13]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/logreg_extended.conll','9','system2']
main(my_args)

Macro precision score : 85.97
Macro recall score : 78.98
Macro F1 score : 82.32


embeddings setting (all features included, token as word embedding)

In [14]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/logreg_embeddings.conll','9','system2']
main(my_args)

Macro precision score : 81.21
Macro recall score : 77.6
Macro F1 score : 79.36


#### SVM
basic setting (token only)

In [15]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_basic.conll','9','system2']
main(my_args)

Macro precision score : 80.57
Macro recall score : 64.99
Macro F1 score : 71.94


extended setting (all features included, token as one-hot)

In [16]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended.conll','9','system2']
main(my_args, extended=True)

            O  B-ORG  B-LOC  B-MISC  I-MISC  B-PER  I-PER  I-LOC  I-ORG
O       42606     26     12      21      14     38      9      2     31
B-ORG      49   1041    100      24       1    100      4      1     21
B-LOC      59     87   1604      23       0     46     10      4      4
B-MISC     54     55     40     736       5     25      2      0      5
I-MISC     41     13      4      13     225      5     26      8     11
B-PER      58     28     41      12       0   1665     32      0      6
I-PER      25      2      3       2       1     15   1251      3      5
I-LOC      12      2      5       0       2      1     17    200     18
I-ORG      79      6     23       7      15      7     63     38    513

embeddings setting (all features included, token as word embedding)

In [17]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_embeddings.conll','9','system2']
main(my_args)

Macro precision score : 81.13
Macro recall score : 75.67
Macro F1 score : 78.3


#### Naive Bayes
basic setting (token only)

In [18]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/NB_basic.conll','9','system2']
main(my_args)

Macro precision score : 78.71
Macro recall score : 64.73
Macro F1 score : 71.04


extended setting (all features included, token as one-hot)

In [19]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/NB_extended.conll','9','system2']
main(my_args)

Macro precision score : 76.66
Macro recall score : 75.93
Macro F1 score : 76.29


### CRF

In [20]:
my_args = ['../data/conll2003.dev.conll','3','../data/CRF.conll','2','system2']
main(my_args)

Macro precision score : 85.52
Macro recall score : 80.7
Macro F1 score : 83.04


## Results for ablation study (SVM - extended setting)
including ALL features:

No Token:

In [21]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noToken.conll','9','system2']
main(my_args)

Macro precision score : 70.49
Macro recall score : 65.69
Macro F1 score : 68.0


No POS:


In [22]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noPos.conll','9','system2']
main(my_args)

Macro precision score : 86.16
Macro recall score : 82.39
Macro F1 score : 84.23


No Cap:

In [23]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noCap.conll','9','system2']
main(my_args)

Macro precision score : 86.61
Macro recall score : 81.43
Macro F1 score : 83.94


No Number:

In [24]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noNumber.conll','9','system2']
main(my_args)

Macro precision score : 86.64
Macro recall score : 82.34
Macro F1 score : 84.44


No Punctuation:

In [25]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noPunct.conll','9','system2']
main(my_args)

Macro precision score : 86.67
Macro recall score : 82.34
Macro F1 score : 84.45


No Previous token:

In [26]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noPrevToken.conll','9','system2']
main(my_args)

Macro precision score : 84.47
Macro recall score : 77.47
Macro F1 score : 80.82


No Previous POS:

In [27]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noPrevPOS.conll','9','system2']
main(my_args)

Macro precision score : 85.73
Macro recall score : 81.53
Macro F1 score : 83.58


No Token and no PrevToken:

In [28]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noTokenPrevToken.conll','9','system2']
main(my_args)

Macro precision score : 43.78
Macro recall score : 44.26
Macro F1 score : 44.02


No Token and no PrevToken, and no POS:

In [29]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noTokenPrevToken_Pos.conll','9','system2']
main(my_args)

Macro precision score : 43.13
Macro recall score : 36.77
Macro F1 score : 39.7


No Token and no PrevToken, and no Capitalization:

In [30]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noTokenPrevToken_Cap.conll','9','system2']
main(my_args)

Macro precision score : 41.1
Macro recall score : 38.41
Macro F1 score : 39.71


No Token and no PrevToken, and no number:

In [31]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noTokenPrevToken_Number.conll','9','system2']
main(my_args)

Macro precision score : 44.19
Macro recall score : 43.61
Macro F1 score : 43.9


No Token and no PrevToken, and no punctuation:

In [32]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noTokenPrevToken_Punct.conll','9','system2']
main(my_args)

Macro precision score : 44.0
Macro recall score : 43.6
Macro F1 score : 43.8


No Token and no PrevToken, and no previous POS:

In [33]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/SVM_extended_noTokenPrevToken_PrevPos.conll','9','system2']
main(my_args)

Macro precision score : 38.1
Macro recall score : 27.13
Macro F1 score : 31.69


Insights:
* The token is the most important feature, followed by the previous token: removing them lowers the macro F1 to 68.0 and 80.82, respectively.
* If we exclude both, the previous POS becomes the most important feature (macro F1 drops from 44.02 to 31.69), although it barely had an effect while the token and the previous token were still included.
* If we exclude both, POS and capitalization also have an effect (macro F1 drops to 39.7 and 39.71, respectively).
* If we exclude both, number and punctuation still barely contribute (macro F1 of 43.9 and 43.8, respectively).
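
The ranking for the second group of ablations can be reproduced by subtracting each macro F1 score from the score of the run without token and previous token. The numbers in the sketch below are copied from the ablation cells above; the variable names are only illustrative.

```python
# macro F1 scores copied from the ablation cells above (token and previous token already removed)
baseline_f1 = 44.02
ablations = {
    'POS': 39.7,
    'capitalization': 39.71,
    'number': 43.9,
    'punctuation': 43.8,
    'previous POS': 31.69,
}

# a larger drop means the removed feature mattered more
drops = {feature: round(baseline_f1 - f1, 2) for feature, f1 in ablations.items()}
for feature, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature:15s} {drop:5.2f}")
```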

Are the same features important for logistic regression? Let's see.

No token embedding:

In [34]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/logreg_extended_noToken.conll','9','system2']
main(my_args)

Macro precision score : 71.57
Macro recall score : 63.93
Macro F1 score : 67.53


No previous token:

In [35]:
my_args = ['../data/conll2003.dev-preprocessed-features.conll','3','../data/logreg_extended_noPrevToken.conll','9','system2']
main(my_args)

Macro precision score : 82.16
Macro recall score : 74.47
Macro F1 score : 78.12
