# Notebook to Calculate the Inter-Annotator Agreement

### Import the libraries

In [1]:
import sys
import pandas as pd
import collections 
import os
import numpy as np
import json
from itertools import chain
from itertools import combinations
sys.path.insert(0, '..')
from src.experiment_utils.helper_classes import token, span, repository
from src.d02_corpus_statistics.corpus import Corpus
from src.d03_inter_annotator_agreement.inter_annotator_agremment import Inter_Annotator_Agreement
from definitions import df_annotation_marker
from src.d03_inter_annotator_agreement.inter_annotator_agremment import row_to_span_list, keep_valid_anotations
from src.d03_inter_annotator_agreement.scoring_functions import create_scoring_matrix

from definitions import ROOT_DIR


## Introduction

Most of the functions used in this tutorial are based on the "Corpus" class. Please work through the corresponding tutorial first.

In [2]:
dataframe_dir = os.path.join(ROOT_DIR,'data/02_processed_to_dataframe', 'preprocessed_dataframe.pkl')
stat_df = pd.read_pickle(dataframe_dir)
stat_df.head()

Unnamed: 0,Policy,Text,Tokens,Article_State,Finished_Annotators,Curation,A,C,F,B,E,G,D
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_43,,article 43\r\nexercise of the delegation\r\n1....,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR0 annotator:Curation layer:Instrum...,[span id:A1 annotator:A layer:Instrumenttypes ...,[span id:C1 annotator:C layer:Policydesignchar...,,,,,
EU_32019R0631_Title_0_Chapter_0_Section_0_Article_12,,article 12\r\nreal-world co2 emissions and fue...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[F, B]",[span id:CUR36 annotator:Curation layer:Instru...,,,[span id:F1 annotator:F layer:Instrumenttypes ...,[span id:B1 annotator:B layer:Policydesignchar...,,,
EU_32018L2001_Title_0_Chapter_0_Section_0_Article_11,,article 11\r\njoint projects between member st...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[C, F]",[span id:CUR116 annotator:Curation layer:Instr...,,[span id:C28 annotator:C layer:Instrumenttypes...,[span id:F58 annotator:F layer:Instrumenttypes...,,,,
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_56,,article 56\r\namendments to directive (eu) 201...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR202 annotator:Curation layer:Polic...,[span id:A38 annotator:A layer:Policydesigncha...,[span id:C129 annotator:C layer:Policydesignch...,,,,,
EU_32018L2001_Title_0_Chapter_0_Section_0_Article_03,,article 3\r\nbinding overall union target for ...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[C, F, B]",[span id:CUR211 annotator:Curation layer:Instr...,,[span id:C138 annotator:C layer:Instrumenttype...,[span id:F165 annotator:F layer:Instrumenttype...,[span id:B27 annotator:B layer:Instrumenttypes...,,,


First, create an object of class Inter_Annotator_Agreement. The constructor takes a stat_df as input and has an optional argument DEBUG; when it is set, only the first 10 articles are kept, which is useful for testing individual functions. By default, the preamble ("Front" and "Whereas" articles) is excluded.



In [3]:
test_evaluator = Inter_Annotator_Agreement(stat_df, front_and_whereas = False)
test_evaluator_debug = Inter_Annotator_Agreement(stat_df, DEBUG = True)

In [4]:
test_evaluator.df.shape

(412, 13)

In [5]:
test_evaluator_debug.df.shape

(10, 13)

Inter_Annotator_Agreement is a child class of the Corpus class, so all methods of the Corpus class are available:

In [6]:
test_dir = repository(policy = 'EU_32008R1099')
test_evaluator.get_span_list(conditional_rep = test_dir, annotators = 'annotators', item = 'tag', value =  'Tech_LowCarbon')

[span id:F3882 annotator:F layer:Technologyandapplicationspecificity feature:TechnologySpecificity tag:Tech_LowCarbon start:18 stop:25 text:nuclear,
 span id:F3883 annotator:F layer:Technologyandapplicationspecificity feature:TechnologySpecificity tag:Tech_LowCarbon start:95 stop:109 text:nuclear energy,
 span id:F3884 annotator:F layer:Technologyandapplicationspecificity feature:TechnologySpecificity tag:Tech_LowCarbon start:151 stop:158 text:nuclear,
 span id:F5161 annotator:F layer:Technologyandapplicationspecificity feature:TechnologySpecificity tag:Tech_LowCarbon start:125 stop:141 text:renewable energy,
 span id:F5162 annotator:F layer:Technologyandapplicationspecificity feature:TechnologySpecificity tag:Tech_LowCarbon start:393 stop:409 text:renewable energy,
 span id:F5163 annotator:F layer:Technologyandapplicationspecificity feature:TechnologySpecificity tag:Tech_LowCarbon start:499 stop:515 text:renewable energy,
 span id:F5164 annotator:F layer:Technologyandapplicationspecif

To calculate the inter-annotator agreement, there are two options: 1) appending the score to the dataframe and 2) getting the total score based on a spanlist. We will now walk through both options.


## Append the score to dataframe

This method appends the inter-annotator agreement for each article based on a set of inter-annotator agreement measures. As the scores can be calculated in parallel, this is the recommended method for computationally intensive scores.

First, we only consider the articles where the curation is finished and the text was annotated by at least two annotators:

In [7]:
test_evaluator.keep_only_finished_articles()

Define a list of scoring metrics:

In [8]:
scoring_metrics = ['f1_exact', 'f1_tokenwise', 'f1_heuristic']

We will now make use of the function
**append_total_score_per_article(scoring_metrics, parallel = False, ** optional_tuple_properties)**.

This function calculates the score of each article for each metric defined in scoring_metrics (which can be a list of metrics or a single metric). For each metric, a new column is appended to the dataframe, so the scores are stored and don't have to be recalculated. To speed up the computation, the scores can be calculated in parallel using the pandarallel library. The kwargs "optional_tuple_properties" are reserved for pygamma properties such as a dissimilarity matrix.
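To make the difference between an exact and a partial-credit metric concrete, here is a minimal sketch. It is not the implementation behind 'f1_exact' or 'f1_tokenwise': the `Span` tuple, the helper names, and the use of character positions as a stand-in for tokens are all assumptions made for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for the project's span class."""
    start: int
    stop: int
    tag: str


def _f1(units_a: set, units_b: set) -> float:
    """Symmetric F1 between two sets of annotation units."""
    if not units_a and not units_b:
        return 1.0  # assumption: two empty annotations count as full agreement
    return 2 * len(units_a & units_b) / (len(units_a) + len(units_b))


def f1_exact(spans_a, spans_b) -> float:
    """A span only counts as shared if start, stop and tag are identical."""
    return _f1(set(spans_a), set(spans_b))


def f1_charwise(spans_a, spans_b) -> float:
    """Compare (character position, tag) pairs, so partial overlaps earn partial credit."""
    units_a = {(pos, s.tag) for s in spans_a for pos in range(s.start, s.stop)}
    units_b = {(pos, s.tag) for s in spans_b for pos in range(s.start, s.stop)}
    return _f1(units_a, units_b)


# Made-up annotations of one article by two annotators.
annotator_1 = [Span(0, 10, "Time_InEffect"), Span(20, 30, "Addressee_default")]
annotator_2 = [Span(0, 15, "Time_InEffect"), Span(20, 30, "Addressee_default")]

exact = f1_exact(annotator_1, annotator_2)        # only the second span matches exactly
charwise = f1_charwise(annotator_1, annotator_2)  # the first span still earns partial credit
print(exact, charwise)
```

In this toy case the exact variant penalizes the boundary disagreement on the first span fully, while the character-wise variant only penalizes the five characters that differ.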



In [9]:
test_evaluator.append_total_score_per_article(scoring_metrics)

100%|██████████| 412/412 [00:00<00:00, 705.93it/s]
100%|██████████| 412/412 [00:01<00:00, 252.93it/s]
100%|██████████| 412/412 [00:00<00:00, 704.21it/s]


In [10]:
test_evaluator.get_total_score_df(weight_by = 'no_weighting')

{'f1_exact_score': 0.40028724799007476,
 'f1_heuristic_score': 0.5275951402554365,
 'f1_tokenwise_score': 0.5197949109942909}

In [11]:
test_evaluator.df.head()

Unnamed: 0,Policy,Text,Tokens,Article_State,Finished_Annotators,Curation,A,C,F,B,E,G,D,f1_exact_score,f1_tokenwise_score,f1_heuristic_score
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_43,,article 43\r\nexercise of the delegation\r\n1....,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR0 annotator:Curation layer:Instrum...,[span id:A1 annotator:A layer:Instrumenttypes ...,[span id:C1 annotator:C layer:Policydesignchar...,,,,,,0.21875,0.494915,0.28125
EU_32019R0631_Title_0_Chapter_0_Section_0_Article_12,,article 12\r\nreal-world co2 emissions and fue...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[F, B]",[span id:CUR36 annotator:Curation layer:Instru...,,,[span id:F1 annotator:F layer:Instrumenttypes ...,[span id:B1 annotator:B layer:Policydesignchar...,,,,0.289157,0.385827,0.421687
EU_32018L2001_Title_0_Chapter_0_Section_0_Article_11,,article 11\r\njoint projects between member st...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[C, F]",[span id:CUR116 annotator:Curation layer:Instr...,,[span id:C28 annotator:C layer:Instrumenttypes...,[span id:F58 annotator:F layer:Instrumenttypes...,,,,,0.567308,0.559615,0.653846
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_56,,article 56\r\namendments to directive (eu) 201...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR202 annotator:Curation layer:Polic...,[span id:A38 annotator:A layer:Policydesigncha...,[span id:C129 annotator:C layer:Policydesignch...,,,,,,0.736842,0.875,0.736842
EU_32018L2001_Title_0_Chapter_0_Section_0_Article_03,,article 3\r\nbinding overall union target for ...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[C, F, B]",[span id:CUR211 annotator:Curation layer:Instr...,,[span id:C138 annotator:C layer:Instrumenttype...,[span id:F165 annotator:F layer:Instrumenttype...,[span id:B27 annotator:B layer:Instrumenttypes...,,,,0.420198,0.511835,0.544511


The same scores can also be calculated in parallel by setting parallel = True:

In [20]:
test_evaluator.append_total_score_per_article(scoring_metrics, parallel = True)

INFO: Pandarallel will run on 48 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


### Gamma-score

The gamma-score is a special case that is calculated using the pygamma-agreement package. To calculate the gamma-score, we have the option to feed in a dissimilarity matrix and a list of all the possible annotated spans. The dissimilarity matrix can be calculated via the create_scoring_matrix function. With the soft options enabled, a mismatch between tags of the same feature or layer is penalized less than a mismatch between different features or layers. For more details, refer to https://pygamma-agreement.readthedocs.io/en/latest/.

This dissimilarity matrix can be fed into the function as kwargs. When soft_tagset_dissimilarity = False and soft_layer_dissimilarity = False, all the mismatches are penalized equally.
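The idea of a soft penalty can be sketched as follows. This is not the code of create_scoring_matrix, and the structure of the JSON file it reads is not shown in this notebook; the function `build_dissimilarity`, its parameters, and the small tag dictionary are hypothetical and only illustrate how such a matrix could be laid out.

```python
def build_dissimilarity(tagsets, soft_penalty=0.5, soft_tagset=True):
    """Return (categories, matrix) for a dict mapping tagset name -> list of tags.

    Hypothetical helper: 0.0 for identical tags, `soft_penalty` for different
    tags of the same tagset (if `soft_tagset` is True), 1.0 otherwise.
    """
    categories = [tag for tags in tagsets.values() for tag in tags]
    owner = {tag: name for name, tags in tagsets.items() for tag in tags}
    matrix = []
    for tag_a in categories:
        row = []
        for tag_b in categories:
            if tag_a == tag_b:
                row.append(0.0)
            elif soft_tagset and owner[tag_a] == owner[tag_b]:
                row.append(soft_penalty)
            else:
                row.append(1.0)
        matrix.append(row)
    return categories, matrix


# A tiny, made-up excerpt of a coding scheme.
tagsets = {
    "InstrumentType": ["VoluntaryAgrmt", "PublicInvt"],
    "Time": ["Time_InEffect"],
}

categories, soft_matrix = build_dissimilarity(tagsets, soft_penalty=0.5)
_, hard_matrix = build_dissimilarity(tagsets, soft_tagset=False)

for tag, row in zip(categories, soft_matrix):
    print(f"{tag:15s}", row)
```

With the soft option, confusing two instrument types costs 0.5, while confusing an instrument type with a time tag costs 1.0; with the option disabled, every mismatch costs 1.0.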

In [13]:
# Create a custom scoring matrix.
# With soft_tagset_dissimilarity = True, mismatches within the same tagset are penalized less.
# With soft_layer_dissimilarity = True, mismatches within the same layer would be penalized less.

category_list, cat_dissimilarity_matrix = create_scoring_matrix(os.path.join(ROOT_DIR,'src/experiment_utils/tag_set.json'),  soft_dissimilarity_penality = 0.5, soft_tagset_dissimilarity = True, soft_layer_dissimilarity = False)

In [12]:
# To penalize all the mismatches equally, i.e. no soft penalty
category_list, cat_dissimilarity_matrix = create_scoring_matrix(os.path.join(ROOT_DIR,'Coding_Scheme.json'), soft_tagset_dissimilarity = False, soft_layer_dissimilarity = False)

In [13]:
test_evaluator.append_total_score_per_article(scoring_metrics = 'pygamma', category_list = category_list, cat_dissimilarity_matrix = cat_dissimilarity_matrix, soft = False, parallel=True)



INFO: Pandarallel will run on 48 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=9), Label(value='0 / 9'))), HBox(c…

Long-step dual simplex will be used

The dataframe now shows the inter-annotator agreement scores for each article:

In [15]:
test_evaluator.df.head()

Unnamed: 0,Policy,Text,Tokens,Article_State,Finished_Annotators,Curation,A,C,F,B,E,G,D,f1_exact_score,f1_tokenwise_score,f1_heuristic_score,pygamma_score
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_43,,article 43\r\nexercise of the delegation\r\n1....,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR0 annotator:Curation layer:Instrum...,[span id:A1 annotator:A layer:Instrumenttypes ...,[span id:C1 annotator:C layer:Policydesignchar...,,,,,,0.21875,0.494915,0.28125,0.569688
EU_32019R0631_Title_0_Chapter_0_Section_0_Article_12,,article 12\r\nreal-world co2 emissions and fue...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[F, B]",[span id:CUR36 annotator:Curation layer:Instru...,,,[span id:F1 annotator:F layer:Instrumenttypes ...,[span id:B1 annotator:B layer:Policydesignchar...,,,,0.289157,0.385827,0.421687,0.462261
EU_32018L2001_Title_0_Chapter_0_Section_0_Article_11,,article 11\r\njoint projects between member st...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[C, F]",[span id:CUR116 annotator:Curation layer:Instr...,,[span id:C28 annotator:C layer:Instrumenttypes...,[span id:F58 annotator:F layer:Instrumenttypes...,,,,,0.567308,0.559615,0.653846,0.681931
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_56,,article 56\r\namendments to directive (eu) 201...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR202 annotator:Curation layer:Polic...,[span id:A38 annotator:A layer:Policydesigncha...,[span id:C129 annotator:C layer:Policydesignch...,,,,,,0.736842,0.875,0.736842,0.705969
EU_32018L2001_Title_0_Chapter_0_Section_0_Article_03,,article 3\r\nbinding overall union target for ...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[C, F, B]",[span id:CUR211 annotator:Curation layer:Instr...,,[span id:C138 annotator:C layer:Instrumenttype...,[span id:F165 annotator:F layer:Instrumenttype...,[span id:B27 annotator:B layer:Instrumenttypes...,,,,0.420198,0.511835,0.544511,0.604655


### Get total score

In [22]:
test_evaluator.df['Tokens'][1][27].get_token_spans()

[span id:CUR75 annotator:Curation layer:Technologyandapplicationspecificity feature:ApplicationSpecificity tag:App_Other start:173 stop:199 text:fuel or energy consumption,
 span id:CUR76 annotator:Curation layer:Technologyandapplicationspecificity feature:EnergySpecificity tag:Energy_Other start:173 stop:177 text:fuel]

The function **test_evaluator.get_total_score_df(scoring_metrics = 'all', annotator = 'all', weight_by = 'Tokens')** calculates the total scores for the metrics given in scoring_metrics, either for a specific annotator or for all the annotators. The scores can be weighted by {'no_weighting', 'Tokens', 'Spans'}. Note that this only works for scoring metrics that have already been calculated. The default argument 'all' retrieves every score that was appended to the dataframe in the previous step.
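The effect of the weighting can be illustrated with a plain weighted mean. The sketch below uses made-up article scores and counts and a hypothetical helper `weighted_mean`; it only shows the arithmetic, not how get_total_score_df is implemented.

```python
def weighted_mean(scores, weights=None):
    """Mean of `scores`; with `weights`, each score counts proportionally to its weight."""
    if weights is None:
        weights = [1.0] * len(scores)  # corresponds to no weighting
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Made-up per-article scores with token and span counts.
articles = [
    {"score": 0.25, "n_tokens": 100, "n_spans": 10},
    {"score": 0.75, "n_tokens": 300, "n_spans": 10},
]
scores = [a["score"] for a in articles]

unweighted = weighted_mean(scores)
by_tokens = weighted_mean(scores, [a["n_tokens"] for a in articles])
by_spans = weighted_mean(scores, [a["n_spans"] for a in articles])
print(unweighted, by_tokens, by_spans)
```

Here the longer article pulls the token-weighted mean towards its own score, while the span-weighted mean equals the unweighted one because both articles contain the same number of spans.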


In [14]:
# Get the total score of the corpus calculated as a mean of all the individual article scores
test_evaluator.get_total_score_df(weight_by = 'no_weighting')

{'f1_exact_score': 0.40028724799007476,
 'f1_heuristic_score': 0.5275951402554365,
 'f1_tokenwise_score': 0.5197949109942909,
 'pygamma_score': 0.5325009800483032}

In [15]:
# Get the total score of the corpus calculated as a mean of all the individual article scores weighted by the 
# total number of tokens per article
test_evaluator.get_total_score_df(weight_by = 'Tokens')

{'f1_exact_score': 0.37492081552891304,
 'f1_heuristic_score': 0.4870123246432716,
 'f1_tokenwise_score': 0.4641653577192596,
 'pygamma_score': 0.5034930161274749}

In [16]:
# Get the total score of the corpus calculated as a mean of all the individual article scores weighted by the 
# total number of spans per article
test_evaluator.get_total_score_df(weight_by = 'Spans')

{'f1_exact_score': 0.3874058515882508,
 'f1_heuristic_score': 0.5030915691250408,
 'f1_tokenwise_score': 0.48154350057838985,
 'pygamma_score': 0.5115099014940808}

If only specific scores are required, those can be displayed separately:

In [26]:
test_evaluator.get_total_score_df(scoring_metrics = 'f1_exact', weight_by = 'no_weighting')

{'f1_exact_score': 0.40028724799007476}

or for a list of scores:

In [27]:
test_evaluator.get_total_score_df(scoring_metrics =['f1_exact', 'f1_tokenwise'], weight_by = 'Spans')

{'f1_exact_score': 0.3874058515882508,
 'f1_tokenwise_score': 0.48154350057838985}

### Get total score per annotator

The same function can be used to retrieve the score of an individual annotator, i.e. the (optionally weighted) average of the scores of all the articles the annotator has participated in:
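Conceptually, a per-annotator score restricts the average to the articles that list the annotator among the finished annotators. The following sketch shows that selection on made-up data; the dictionary keys and the helper `annotator_score` are hypothetical and not part of the class.

```python
def annotator_score(articles, annotator, weight_key=None):
    """Average score over the articles the annotator took part in.

    Returns None if the annotator did not finish any article.
    """
    rows = [a for a in articles if annotator in a["finished_annotators"]]
    if not rows:
        return None
    weights = [a[weight_key] if weight_key else 1.0 for a in rows]
    total = sum(weights)
    if total == 0:
        return None
    return sum(a["score"] * w for a, w in zip(rows, weights)) / total


# Made-up articles; "score" stands for one of the appended score columns.
articles = [
    {"finished_annotators": ["A", "C"], "score": 0.25, "n_spans": 4},
    {"finished_annotators": ["F", "B"], "score": 0.50, "n_spans": 10},
    {"finished_annotators": ["A", "B"], "score": 0.75, "n_spans": 12},
]

score_a = annotator_score(articles, "A")
score_a_by_spans = annotator_score(articles, "A", weight_key="n_spans")
score_b = annotator_score(articles, "B")
print(score_a, score_a_by_spans, score_b)
```

Note that an article score always reflects the agreement between all annotators of that article, so an annotator's average also depends on who they were paired with.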

In [28]:
test_evaluator.get_total_score_df(scoring_metrics = 'all', annotator = 'A', weight_by = 'no_weighting')

{'f1_exact_score': 0.3270815592714389,
 'f1_heuristic_score': 0.45133424018050017,
 'f1_tokenwise_score': 0.44226065268711207,
 'pygamma_score': 0.5323319370748079}

To compare all the annotators (weighted by the spans):

In [29]:

for ann in test_evaluator.finished_annotators:
    print('annotator: ', ann)
    print(test_evaluator.get_total_score_df(annotator = ann, weight_by = 'Spans'))
    print('')

annotator:  F
{'f1_exact_score': 0.4381654829355101, 'f1_heuristic_score': 0.5468911489380965, 'f1_tokenwise_score': 0.5276050490415679, 'pygamma_score': 0.5959071455769209}

annotator:  C
{'f1_exact_score': 0.4427467808588547, 'f1_heuristic_score': 0.5771630720105967, 'f1_tokenwise_score': 0.5592613347512423, 'pygamma_score': 0.5984438500781765}

annotator:  A
{'f1_exact_score': 0.32323742866301736, 'f1_heuristic_score': 0.4441804412622658, 'f1_tokenwise_score': 0.41706341003087777, 'pygamma_score': 0.5153066290050988}

annotator:  B
{'f1_exact_score': 0.3071820274746298, 'f1_heuristic_score': 0.39638095203759655, 'f1_tokenwise_score': 0.3690580489778917, 'pygamma_score': 0.48478772099497336}



Or for specific scores:

In [30]:
test_evaluator.get_total_score_df(annotator ='A', scoring_metrics = ['f1_exact', 'f1_tokenwise'], weight_by = 'Spans')


{'f1_exact_score': 0.32323742866301736,
 'f1_tokenwise_score': 0.41706341003087777}

### Rank articles by score

In [None]:
test_evaluator.df.sort_values(by=['f1_heuristic_score'])

Unnamed: 0,Policy,Text,Tokens,Article_State,Finished_Annotators,Curation,A,B,D,C,E,F,G,f1_exact_score,f1_tokenwise_score,f1_heuristic_score,pygamma_score
EU_32019L0944_Title_0_Chapter_2_Section_0_Article_04,,article 4\r\nfree choice of supplier\r\nmember...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR20198 annotator:Curation layer:Pol...,[span id:A8184 annotator:A layer:Policydesignc...,,[],[span id:C4906 annotator:C layer:Policydesignc...,,,,0.000000,0.222222,0.0,0.065325
EU_32008R1099_Title_0_Chapter_0_Section_0_Article_05,,article 5\r\ntransmission and dissemination\r\...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR15252 annotator:Curation layer:Ins...,[span id:A6149 annotator:A layer:Instrumenttyp...,[span id:B9617 annotator:B layer:Instrumenttyp...,[span id:D6620 annotator:D layer:Policydesignc...,,,[],,0.000000,0.000000,0.0,0.284623
EU_32019R0631_Title_0_Chapter_0_Section_0_Article_18,,article 18\r\nrepeal\r\nregulations (ec) no 44...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[D, C]",[span id:CUR950 annotator:Curation layer:Polic...,,,[span id:D272 annotator:D layer:Policydesignch...,[span id:C289 annotator:C layer:Policydesignch...,,,,0.000000,0.368421,0.0,0.310785
EU_32006L0066_Title_0_Chapter_0_Section_0_Article_29,,article 29\r\nentry into force\r\nthis directi...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[B, D]",[span id:CUR5178 annotator:Curation layer:Poli...,,[span id:B2926 annotator:B layer:Policydesignc...,[span id:D2083 annotator:D layer:Policydesignc...,,,[],,0.000000,0.000000,0.0,0.292699
EU_32008R1099_Title_0_Chapter_0_Section_0_Article_08,,article 8\r\nannual nuclear statistics\r\nthe ...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR2501 annotator:Curation layer:Inst...,[span id:A1749 annotator:A layer:Instrumenttyp...,[span id:B856 annotator:B layer:Policydesignch...,[span id:D681 annotator:D layer:Policydesignch...,,,,,0.000000,0.000000,0.0,0.183265
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_55,,article 55\r\namendment to directive 2013/30/e...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR17821 annotator:Curation layer:Pol...,[span id:A7015 annotator:A layer:Policydesignc...,[span id:B10353 annotator:B layer:Policydesign...,,,,,,0.909091,0.983051,1.0,0.995044
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_54,,article 54\r\namendments to directive 2012/27/...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR17511 annotator:Curation layer:Pol...,[span id:A6952 annotator:A layer:Policydesignc...,[span id:B10310 annotator:B layer:Policydesign...,,,,,,1.000000,1.000000,1.0,1.000000
EU_32008R1099_Title_0_Chapter_0_Section_0_Article_12,,article 12\r\nentry into force\r\nthis regulat...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR17444 annotator:Curation layer:Pol...,[span id:A6907 annotator:A layer:Policydesignc...,[span id:B10274 annotator:B layer:Policydesign...,[span id:D7929 annotator:D layer:Policydesignc...,,,,,0.500000,0.777778,1.0,0.976971
EU_32018R1999_Title_0_Chapter_7_Section_3_Article_46,,article 46\r\namendments to directive 94/22/ec...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR19975 annotator:Curation layer:Pol...,[span id:A8091 annotator:A layer:Policydesignc...,[span id:B11052 annotator:B layer:Policydesign...,,,,,,1.000000,1.000000,1.0,1.000000


## Get total score based on a spanlist

The inter-annotator agreement score can also be calculated from a spanlist. For all the spans present, it calculates the inter-annotator agreement scores for all the articles with at least two valid annotations. This can also be used to calculate the similarity to the curation.

**get_score_spanlist(conditional_rep, annotators , scoring_metric, item = None, value = None, weight_by = 'Spans', ** optional_tuple_properties)**

The function is designed in a similar way to the get_span_list function. A set of spans is selected by providing a conditional repository and an optional item and value. For all the spans present, it calculates the inter-annotator agreement scores for all the articles where two finished annotators are found. The function returns the resulting spanlist and the score. This can be used to calculate the similarity to the curation or scores in different categories.
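The selection by annotators and by an optional item and value can be pictured as a simple attribute filter. The sketch below is an assumption about the general idea, not the library's code; the `Span` tuple and `select_spans` are hypothetical, and the example spans reuse tags and offsets that appear in this notebook.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for the project's span class."""
    annotator: str
    layer: str
    feature: str
    tag: str
    start: int
    stop: int


def select_spans(spans, annotators, item=None, value=None):
    """Keep spans of the given annotators; optionally require span.<item> == value."""
    selected = [s for s in spans if s.annotator in annotators]
    if item is not None:
        selected = [s for s in selected if getattr(s, item) == value]
    return selected


spans = [
    Span("A", "Policydesigncharacteristics", "Time", "Time_InEffect", 76, 110),
    Span("A", "Policydesigncharacteristics", "Actor", "Addressee_default", 239, 252),
    Span("B", "Policydesigncharacteristics", "Time", "Time_InEffect", 76, 134),
    Span("B", "Policydesigncharacteristics", "Actor", "Addressee_default", 239, 252),
]

time_spans = select_spans(spans, ["A", "B"], item="feature", value="Time")
only_a = select_spans(spans, ["A"])
print(len(time_spans), len(only_a))
```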

The following example shows two annotators in agreement:

In [49]:
test_dir = repository.from_repository_name('EU_32008R1099_Title_0_Chapter_0_Section_0_Article_12')
span_list = test_evaluator.get_span_list(test_dir, ['A', 'B'])

In [50]:
span_list

[span id:A6907 annotator:A layer:Policydesigncharacteristics feature:Time tag:Time_InEffect start:76 stop:110 text:20th day following its publication,
 span id:A6908 annotator:A layer:Policydesigncharacteristics feature:Actor tag:Addressee_default start:239 stop:252 text:member states,
 span id:B10274 annotator:B layer:Policydesigncharacteristics feature:Time tag:Time_InEffect start:76 stop:134 text:20th day following its publication in the official journal,
 span id:B10275 annotator:B layer:Policydesigncharacteristics feature:Actor tag:Addressee_default start:239 stop:252 text:member states]

Retrieving the inter-annotator agreement score of this spanlist:

In [62]:
span_list, score = test_evaluator.get_score_spanlist(conditional_rep = test_dir, annotators = ['A', 'B'] , 
                                                     scoring_metric = 'f1_heuristic', weight_by = 'Spans')
print(span_list)
print(f"\nscore: {score}")

[span id:A6907 annotator:A layer:Policydesigncharacteristics feature:Time tag:Time_InEffect start:76 stop:110 text:20th day following its publication, span id:A6908 annotator:A layer:Policydesigncharacteristics feature:Actor tag:Addressee_default start:239 stop:252 text:member states, span id:B10274 annotator:B layer:Policydesigncharacteristics feature:Time tag:Time_InEffect start:76 stop:134 text:20th day following its publication in the official journal, span id:B10275 annotator:B layer:Policydesigncharacteristics feature:Actor tag:Addressee_default start:239 stop:252 text:member states]

score: 1.0
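The score of 1.0 is reached although spans A6907 and B10274 end at different offsets, which suggests that 'f1_heuristic' tolerates boundary differences. How the heuristic is actually defined is not shown here. The sketch below assumes one plausible rule, namely that two spans match when they carry the same tag and overlap by at least one character; `Span` and `f1_overlap` are hypothetical names.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for the project's span class."""
    start: int
    stop: int
    tag: str


def _overlaps(x: Span, y: Span) -> bool:
    """Assumed matching rule: same tag and at least one shared character."""
    return x.tag == y.tag and x.start < y.stop and y.start < x.stop


def f1_overlap(spans_a, spans_b) -> float:
    """F1 where a span is a hit if any span of the other annotator matches it."""
    if not spans_a and not spans_b:
        return 1.0  # assumption: two empty annotations agree
    if not spans_a or not spans_b:
        return 0.0
    hits_a = sum(any(_overlaps(a, b) for b in spans_b) for a in spans_a)
    hits_b = sum(any(_overlaps(b, a) for a in spans_a) for b in spans_b)
    precision = hits_a / len(spans_a)
    recall = hits_b / len(spans_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Offsets and tags taken from the span list printed above.
annotator_a = [Span(76, 110, "Time_InEffect"), Span(239, 252, "Addressee_default")]
annotator_b = [Span(76, 134, "Time_InEffect"), Span(239, 252, "Addressee_default")]

overlap_score = f1_overlap(annotator_a, annotator_b)
print(overlap_score)
```

Under this assumed rule every span of A has a counterpart in B and vice versa, so the sketch also yields full agreement for this article.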


This example shows how to use this function to get scores in specific categories:

In [99]:
test_dir = repository.from_repository_name('EU_32018L2001_Title_0_Chapter_0_Section_0_Article_11')

span_list_score, score = test_evaluator.get_score_spanlist(test_dir, annotators = ['B', 'D'], 
                                                           item = 'layer', value = 'Instrumenttypes', 
                                                           scoring_metric = 'f1_heuristic', weight_by = 'Spans')
print(*span_list_score, sep='\n')
print(f"\nscore: {score}")


span id:B9435 annotator:B layer:Instrumenttypes feature:InstrumentType tag:VoluntaryAgrmt start:12 stop:26 text:joint projects
span id:B9436 annotator:B layer:Instrumenttypes feature:InstrumentType tag:VoluntaryAgrmt start:164 stop:178 text:joint projects
span id:B9437 annotator:B layer:Instrumenttypes feature:InstrumentType tag:Unspecified start:1662 stop:1676 text:support scheme
span id:B9438 annotator:B layer:Instrumenttypes feature:InstrumentType tag:PublicInvt start:1707 stop:1721 text:investment aid
span id:B9439 annotator:B layer:Instrumenttypes feature:InstrumentType tag:VoluntaryAgrmt start:1879 stop:1967 text:council of europe convention for the protection of human rights and fundamental freedoms
span id:B9440 annotator:B layer:Instrumenttypes feature:InstrumentType tag:VoluntaryAgrmt start:1978 stop:2031 text:international conventions or treaties on human rights
span id:B9441 annotator:B layer:Instrumenttypes feature:InstrumentType tag:VoluntaryAgrmt start:2835 stop:2848 tex

## Check closeness to curation

In the same spirit as calculating the inter-annotator agreement scores, we can check the closeness to the curation.

**append_score_to_curation(self, scoring_metrics, parallel = False, ** optional_tuple_properties)**

This method works very similarly to get_total_score_df, but calculates the closeness to the curation for all the metrics defined in scoring_metrics. Again, the scores can be calculated in parallel. For every annotator that contributed, the method appends a tuple of scores, where each element corresponds to one scoring metric. Again, the optional tuple properties are reserved for the pygamma score.
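The shape of the result, one tuple per annotator with one entry per metric, can be sketched as follows. The metric functions, the span representation as `(start, stop, tag)` tuples, and the helper `score_to_curation` are assumptions for illustration; only the `<annotator>_to_curation` naming follows the columns that appear in the dataframe.

```python
def exact_f1(spans, reference):
    """F1 over spans that match the reference exactly (start, stop, tag)."""
    a, b = set(spans), set(reference)
    if not a and not b:
        return 1.0  # assumption: two empty annotations agree
    return 2 * len(a & b) / (len(a) + len(b))


def char_f1(spans, reference):
    """F1 over (character position, tag) pairs, giving partial credit for overlaps."""
    a = {(pos, tag) for start, stop, tag in spans for pos in range(start, stop)}
    b = {(pos, tag) for start, stop, tag in reference for pos in range(start, stop)}
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def score_to_curation(annotations, curation, metrics):
    """Map '<annotator>_to_curation' to a tuple with one score per metric."""
    return {
        f"{name}_to_curation": tuple(metric(spans, curation) for metric in metrics)
        for name, spans in annotations.items()
    }


# Made-up curation and annotations for a single article.
curation = [(76, 134, "Time_InEffect"), (239, 252, "Addressee_default")]
annotations = {
    "A": [(76, 110, "Time_InEffect"), (239, 252, "Addressee_default")],
    "B": [(76, 134, "Time_InEffect"), (239, 252, "Addressee_default")],
}

result = score_to_curation(annotations, curation, [exact_f1, char_f1])
print(result)
```

Each tuple keeps the order of the metric list, so the first entry of `A_to_curation` is the exact score and the second the character-wise score in this sketch.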

In [17]:
scoring_metrics = ['f1_exact', 'f1_tokenwise', 'f1_heuristic']

In [18]:
test_evaluator.append_score_to_curation(scoring_metrics, parallel = False)

100%|██████████| 412/412 [00:01<00:00, 245.45it/s]
100%|██████████| 412/412 [00:00<00:00, 452.95it/s]
100%|██████████| 412/412 [00:01<00:00, 235.52it/s]
100%|██████████| 412/412 [00:01<00:00, 348.10it/s]


As before, we can specify a dissimilarity matrix for pygamma:

In [19]:
#category_list, cat_dissimilarity_matrix = create_scoring_matrix(os.path.join(ROOT_DIR,'src/experiment_utils/tag_set.json'),  soft_tagset_dissimilarity = True, soft_layer_dissimilarity = False)
#category_list, cat_dissimilarity_matrix = create_scoring_matrix(os.path.join(ROOT_DIR,'Coding_Scheme.json'),  soft_tagset_dissimilarity = True, soft_layer_dissimilarity = False)
category_list, cat_dissimilarity_matrix = create_scoring_matrix(os.path.join(ROOT_DIR,'Coding_Scheme.json'), soft_tagset_dissimilarity = False, soft_layer_dissimilarity = False)
test_evaluator.append_score_to_curation(scoring_metrics = 'pygamma', category_list = category_list, cat_dissimilarity_matrix = cat_dissimilarity_matrix, parallel=True)

INFO: Pandarallel will run on 24 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=18), Label(value='0 / 18'))), HBox…

Long-step dual simplex will be used
GLPK Integer Optimizer, v4.65
23 rows, 15 columns, 38 non-zeros
15 integer variables, all of which a


Checking out the dataframe:

In [69]:
test_evaluator.df.head()

Unnamed: 0,Policy,Text,Tokens,Article_State,Finished_Annotators,Curation,A,B,D,C,...,F,G,f1_exact_score,f1_tokenwise_score,f1_heuristic_score,pygamma_score,B_to_curation,C_to_curation,A_to_curation,D_to_curation
EU_32018R1999_Title_0_Chapter_3_Section_0_Article_15,,article 15\r\nlong-term strategies\r\n1. by ...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, B]",[span id:CUR0 annotator:Curation layer:Instrum...,[span id:A1 annotator:A layer:Instrumenttypes ...,[span id:B1 annotator:B layer:Instrumenttypes ...,,,...,,,0.3,0.524775,0.538889,0.494809,"[0.7234042553191489, 0.8297872340425532, 0.837...",,"[0.4117647058823529, 0.6411764705882352, 0.568...",
EU_32009L0028_Title_0_Chapter_0_Section_0_Article_19,,article 19\r\ncalculation of the greenhouse ga...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR89 annotator:Curation layer:Instru...,[span id:A82 annotator:A layer:Instrumenttypes...,,[],[span id:C1 annotator:C layer:Instrumenttypes ...,...,,,0.453608,0.508287,0.474227,0.720554,,"[0.5576923076923076, 0.5673076923076923, 0.635...","[0.7339449541284404, 0.7431192660550459, 0.787...",
EU_32009L0028_Title_0_Chapter_0_Section_0_Article_25,,article 25\r\ncommittees\r\n1. except in the...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR147 annotator:Curation layer:Polic...,[span id:A133 annotator:A layer:Policydesignch...,,[],[span id:C47 annotator:C layer:Policydesigncha...,...,,,0.363636,0.472973,0.363636,0.564383,,"[0.6, 0.6, 0.953125, 0.6900424476513854]","[0.7692307692307692, 0.7692307692307692, 0.549...",
EU_32019L0944_Title_0_Chapter_2_Section_0_Article_09,,article 9\r\npublic service obligations\r\n1. ...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR159 annotator:Curation layer:Instr...,[span id:A147 annotator:A layer:Instrumenttype...,,[],[span id:C55 annotator:C layer:Instrumenttypes...,...,,,0.254545,0.407407,0.4,0.475311,,"[0.3373493975903614, 0.37349397590361444, 0.60...","[0.5111111111111112, 0.5777777777777778, 0.417...",
EU_32019L0944_Title_0_Chapter_2_Section_0_Article_08,,article 8\r\nauthorisation procedure for new c...,"[start:0 stop:7 text:article tag_count:0, star...",CURATION_FINISHED,"[A, C]",[span id:CUR218 annotator:Curation layer:Instr...,[span id:A178 annotator:A layer:Instrumenttype...,,[],[span id:C79 annotator:C layer:Instrumenttypes...,...,,,0.253968,0.401163,0.285714,0.363413,,"[0.4415584415584415, 0.49350649350649345, 0.59...","[0.5, 0.5238095238095238, 0.6626984126984127, ...",


### Get individual score

To retrieve the scores, in a similar fashion to get_total_score_df, we use

**get_to_curation_score(self, weight_by = 'Tokens')**

This returns, for every annotator who participated, how closely their annotations match the curation, weighted by one of the following methods: {'no_weighting', 'Tokens', 'Spans'}
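
The effect of the three weighting options can be sketched as follows. This is a minimal illustration and not the implementation inside `Inter_Annotator_Agreement`: the helper `weighted_mean_score` and the toy numbers are hypothetical, and the sketch assumes that 'Tokens' and 'Spans' weight each article by its token count and span count respectively.

```python
import numpy as np


def weighted_mean_score(scores, n_tokens, n_spans, weight_by="Tokens"):
    """Average per-article scores of one annotator (hypothetical helper).

    Assumption: 'Tokens' / 'Spans' weight each article by its number of
    tokens / spans, and 'no_weighting' treats all articles equally.
    """
    scores = np.asarray(scores, dtype=float)
    if weight_by == "no_weighting":
        weights = np.ones_like(scores)
    elif weight_by == "Tokens":
        weights = np.asarray(n_tokens, dtype=float)
    elif weight_by == "Spans":
        weights = np.asarray(n_spans, dtype=float)
    else:
        raise ValueError(f"unknown weighting method: {weight_by}")
    return float(np.average(scores, weights=weights))


# Toy data: two articles, the second one three times longer than the first.
toy_scores = [0.8, 0.4]
toy_tokens = [100, 300]
toy_spans = [10, 10]

unweighted = weighted_mean_score(toy_scores, toy_tokens, toy_spans, "no_weighting")
token_weighted = weighted_mean_score(toy_scores, toy_tokens, toy_spans, "Tokens")
span_weighted = weighted_mean_score(toy_scores, toy_tokens, toy_spans, "Spans")
print(unweighted, token_weighted, span_weighted)
```

In this toy case the token-weighted mean is pulled toward the score of the longer article, while the span-weighted mean equals the unweighted one because both articles have the same number of spans.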


In [20]:
test_evaluator.get_to_curation_score(weight_by = 'no_weighting')

{'A': {'f1_exact': 0.5759851342165275,
  'f1_heuristic': 0.6626746821425515,
  'f1_tokenwise': 0.6238340038467249,
  'pygamma': 0.6447144861852127},
 'B': {'f1_exact': 0.43231750058721435,
  'f1_heuristic': 0.509642244445365,
  'f1_tokenwise': 0.4868209888810718,
  'pygamma': 0.4955379835017275},
 'C': {'f1_exact': 0.6832770638730249,
  'f1_heuristic': 0.752194329001551,
  'f1_tokenwise': 0.8209086648141731,
  'pygamma': 0.7301928529796907},
 'F': {'f1_exact': 0.6741487861941954,
  'f1_heuristic': 0.7422988818458282,
  'f1_tokenwise': 0.7300526054593619,
  'pygamma': 0.7087585120211776}}

In [21]:
test_evaluator.get_to_curation_score(weight_by = 'Tokens')

{'A': {'f1_exact': 0.5478083927236341,
  'f1_heuristic': 0.6311571670202513,
  'f1_tokenwise': 0.587297552665045,
  'pygamma': 0.6205081324796429},
 'B': {'f1_exact': 0.3884944280696229,
  'f1_heuristic': 0.4462154109859259,
  'f1_tokenwise': 0.42693279125784217,
  'pygamma': 0.4474737143219891},
 'C': {'f1_exact': 0.6814903098765762,
  'f1_heuristic': 0.7407736247878691,
  'f1_tokenwise': 0.8049646012591863,
  'pygamma': 0.7267490976389174},
 'F': {'f1_exact': 0.6507138521164361,
  'f1_heuristic': 0.715843449602768,
  'f1_tokenwise': 0.6752261129067305,
  'pygamma': 0.6963766913385405}}

In [22]:
test_evaluator.get_to_curation_score(weight_by = 'Spans')

{'A': {'f1_exact': 0.5467249839982933,
  'f1_heuristic': 0.6319074034563689,
  'f1_tokenwise': 0.5875171172675752,
  'pygamma': 0.6171804727147149},
 'B': {'f1_exact': 0.3859744866634834,
  'f1_heuristic': 0.4416399481547173,
  'f1_tokenwise': 0.4267859242472768,
  'pygamma': 0.4415601915354286},
 'C': {'f1_exact': 0.6837066473988437,
  'f1_heuristic': 0.7435422687861271,
  'f1_tokenwise': 0.8084465246563635,
  'pygamma': 0.7281822832755371},
 'F': {'f1_exact': 0.6498717755317543,
  'f1_heuristic': 0.7154422487051841,
  'f1_tokenwise': 0.6765125170545355,
  'pygamma': 0.6944983690860258}}

### Get Total score

To retrieve the total closeness-to-curation score, we use

**get_to_curation_score_total(self, weight_by = 'Tokens')**

For each article, we take the average of the annotator-curation scores over all finished annotators. To get the total average, the individual article averages are then weighted by one of the following methods: {'no_weighting', 'Tokens', 'Spans'}
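
The two-stage averaging described above can be sketched with pandas. This is only an illustration under stated assumptions: the column names `annotator_scores` and `n_tokens` and all values are made up for the example and do not come from `test_evaluator.df`.

```python
import pandas as pd

# Toy per-article data (hypothetical columns and values).
articles = pd.DataFrame(
    {
        # annotator-to-curation scores of the finished annotators per article
        "annotator_scores": [[0.9, 0.7], [0.5, 0.3], [0.6]],
        # article length in tokens, used as the weight in stage 2
        "n_tokens": [100, 300, 100],
    }
)

# Stage 1: average the annotator-curation scores within each article.
articles["article_mean"] = articles["annotator_scores"].apply(
    lambda scores: sum(scores) / len(scores)
)

# Stage 2a: plain mean over articles (the 'no_weighting' case).
unweighted_total = articles["article_mean"].mean()

# Stage 2b: mean over articles weighted by token count (the 'Tokens' case).
token_weighted_total = (
    (articles["article_mean"] * articles["n_tokens"]).sum()
    / articles["n_tokens"].sum()
)

print(unweighted_total, token_weighted_total)
```

Because the long article in this toy example has the lowest article mean, the token-weighted total comes out below the unweighted one, which is the same direction as the difference between the 'no_weighting' and 'Tokens' results reported in this notebook.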


In [23]:
test_evaluator.get_to_curation_score_total(weight_by = 'no_weighting')

{'f1_exact': 0.5994190569879829,
 'f1_heuristic': 0.6742289960750816,
 'f1_tokenwise': 0.6765053575584232,
 'pygamma': 0.6524052393256519}

In [24]:
test_evaluator.get_to_curation_score_total(weight_by = 'Tokens')

{'f1_exact': 0.5725567750165721,
 'f1_heuristic': 0.6388021548917368,
 'f1_tokenwise': 0.6305550505802415,
 'pygamma': 0.6280549895861997}

In [25]:
test_evaluator.get_to_curation_score_total(weight_by = 'Spans')

{'f1_exact': 0.5746747169968692,
 'f1_heuristic': 0.6419866749325036,
 'f1_tokenwise': 0.6357219405018641,
 'pygamma': 0.6282550674026643}

# Check scores in different categories

In [41]:
layers = ['Technologyandapplicationspecificity', 'Policydesigncharacteristics', 'Instrumenttypes']
repo = repository()

# Score each annotation layer separately, weighted by 'Spans'.
for layer in layers:
    span_list, score = test_evaluator.get_score_spanlist(conditional_rep=repo, annotators='annotators',
                                                         item='layer', value=layer, scoring_metric='f1_heuristic',
                                                         weight_by='Spans')
    print(f"layer: {layer}, len of spanlist: {len(span_list)}, score: {score}")
    

layer: Technologyandapplicationspecificity, len of spanlist: 9799, score: 0.49129379850132515
layer: Policydesigncharacteristics, len of spanlist: 18660, score: 0.5154005542648732
layer: Instrumenttypes, len of spanlist: 5548, score: 0.47970571741677825


By feature:

In [98]:
with open(os.path.join(ROOT_DIR, 'src/experiment_utils/Coding_Scheme.json'), 'r') as scheme_file:  # Change to tag_set.json
    coding_scheme = json.load(scheme_file)

layer_indices = range(0, 3)

for layer_index in layer_indices:
    # Drop the last 7 characters of each tagset name to obtain the feature name.
    features = [x['tagset'][:-7] for x in coding_scheme['layers'][layer_index]['tagsets']]

    repo = repository()

    # Score each feature separately; `feature` avoids shadowing the file handle above.
    for feature in features:
        span_list, score = test_evaluator.get_score_spanlist(conditional_rep=repo, annotators='annotators',
                                                             item='feature', value=feature,
                                                             scoring_metric='f1_heuristic',
                                                             weight_by='Spans')
        print(f"feature: {feature}, len of spanlist: {len(span_list)}, score: {score}")
        

feature: Objective, len of spanlist: 1355, score: 0.49909799097990976
feature: Reference, len of spanlist: 1830, score: 0.797260845680898
feature: Actor, len of spanlist: 10366, score: 0.4962007742097252
feature: Resource, len of spanlist: 1130, score: 0.2810619469026549
feature: Time, len of spanlist: 1240, score: 0.5084005376344086
feature: Compliance, len of spanlist: 2603, score: 0.534661663763712
feature: Reversibility, len of spanlist: 30, score: 0.06666666666666667
feature: EnergySpecificity, len of spanlist: 3833, score: 0.6086307325594438
feature: ApplicationSpecificity, len of spanlist: 2158, score: 0.31044382658840486
feature: TechnologySpecificity, len of spanlist: 3776, score: 0.4818443973634652
feature: InstrumentType, len of spanlist: 5236, score: 0.4821637143092047
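
To compare the features at a glance, one option is to collect the printed results in a table and sort them by score. The sketch below copies the values from the output above by hand (scores rounded to four decimals); the variable names are illustrative, and in practice the rows could be gathered inside the loop instead of being typed in.

```python
import pandas as pd

# (feature, len of spanlist, f1_heuristic score) copied from the output above,
# with scores rounded to four decimals.
feature_rows = [
    ("Objective", 1355, 0.4991),
    ("Reference", 1830, 0.7973),
    ("Actor", 10366, 0.4962),
    ("Resource", 1130, 0.2811),
    ("Time", 1240, 0.5084),
    ("Compliance", 2603, 0.5347),
    ("Reversibility", 30, 0.0667),
    ("EnergySpecificity", 3833, 0.6086),
    ("ApplicationSpecificity", 2158, 0.3104),
    ("TechnologySpecificity", 3776, 0.4818),
    ("InstrumentType", 5236, 0.4822),
]

feature_scores = pd.DataFrame(feature_rows, columns=["feature", "n_spans", "score"])

# Sort ascending so the features with the lowest agreement come first.
ranked = feature_scores.sort_values("score").reset_index(drop=True)
print(ranked.to_string(index=False))
```

Sorted this way, Reversibility has the lowest score and Reference the highest. The Reversibility score rests on only 30 spans, so it should probably be read with more caution than the scores of the larger features.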
