# Evaluating the trained models

**This notebook produces Tables 3 and 7 of the JCLS paper.**

We comparatively evaluate three trained models using Precision, Recall and $F_1$. $F_1$ is the harmonic mean of Precision and Recall and weights the two equally, because we have no immediate reason to favor one over the other.
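
As a reminder of how the three measures relate, here is a minimal sketch that computes them from raw counts. The function name `precision_recall_f1` and the counts are made up for illustration. The scores used in this notebook are read from the MaChAmp output files rather than computed this way.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# hypothetical counts for a single category
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.50 f1=0.62
```

Because the harmonic mean is pulled towards the lower of the two values, a model cannot reach a high $F_1$ by doing well on only one of them.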

The three models are:

- mBERT (MaChAmp default, multilingual)
- XLM-RoBERTa (multilingual)
- RobBERT-2023 (Dutch-only)

We created ground truth training, validation and test data using two different splits (percentages are train/validation/test):
- $0.1$: 80/10/10
- $0.2$: 60/20/20

In the paper we use the latter, because some categories have very few positive examples. The 60/20/20 split leaves those categories with enough examples in both the validation and test sets for reliable evaluation.

## Preparation

We first define mappings from the category names in the data to short codes and to more human-readable labels.

In [1]:
import glob
import json
import re

import pandas as pd


def map_cat_short(cat):
    short = cat[:3].lower()
    if '--' in cat:
        short = f"{short}_{cat.split('--')[-1][:3].lower()}"
    return short


def map_cat_pretty(cat):
    cat = cat.replace('_', ' ')
    if '--' in cat:
        cat = '~~~~' + cat.split('--')[-1]
    return cat


In [2]:
main_cats = [
    'Author',
    'Classification',
    'Content',
    'Other_works',
    'Reader_response',
    'Recommendations',
    'Style',
]

sub_cats = [
    'Content--Narrative',
    'Content--Other',
    'Content--Quote',
    'Content--Theme',
    'Reader_response--Evaluation_of_quality',
    'Reader_response--Feelings',
    'Reader_response--Identification_and_immersion',
    'Reader_response--Reading_Context',
    'Reader_response--Reception',
    'Reader_response--Reflection',
    'Style--Context',
    'Style--Structure',
    'Style--Stylistic_features'
]

all_cats = main_cats + sub_cats

In [3]:
{map_cat_pretty(cat): cat for cat in all_cats}

short_cat_map = {map_cat_short(cat): cat for cat in all_cats}
short_cat_map

{'aut': 'Author',
 'cla': 'Classification',
 'con': 'Content',
 'oth': 'Other_works',
 'rea': 'Reader_response',
 'rec': 'Recommendations',
 'sty': 'Style',
 'con_nar': 'Content--Narrative',
 'con_oth': 'Content--Other',
 'con_quo': 'Content--Quote',
 'con_the': 'Content--Theme',
 'rea_eva': 'Reader_response--Evaluation_of_quality',
 'rea_fee': 'Reader_response--Feelings',
 'rea_ide': 'Reader_response--Identification_and_immersion',
 'rea_rea': 'Reader_response--Reading_Context',
 'rea_rec': 'Reader_response--Reception',
 'rea_ref': 'Reader_response--Reflection',
 'sty_con': 'Style--Context',
 'sty_str': 'Style--Structure',
 'sty_sty': 'Style--Stylistic_features'}
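
The short codes only keep the first three letters of each name part, so two categories could in principle end up with the same code, and a dict comprehension would silently keep only the last one. A minimal sanity check is sketched below, using a small subset of the categories. The names `example_cats` and `short_codes` are introduced here for illustration only.

```python
def map_cat_short(cat):
    # same logic as the function defined above
    short = cat[:3].lower()
    if '--' in cat:
        short = f"{short}_{cat.split('--')[-1][:3].lower()}"
    return short


# a small subset of the categories, for illustration only
example_cats = [
    'Reader_response',
    'Reader_response--Reading_Context',
    'Reader_response--Reception',
]

short_codes = [map_cat_short(cat) for cat in example_cats]
print(short_codes)
# prints: ['rea', 'rea_rea', 'rea_rec']

# duplicate keys would be overwritten in a dict, so check for collisions explicitly
assert len(set(short_codes)) == len(example_cats), "short codes are not unique"
```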

## Read test output file

We ran MaChAmp on the sentences in the ground truth test data. The output contains category assignments per sentence and per category. We compare that against the ground truth to compute performance scores.
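
The loading code further down reads each `.eval` file as JSON, keyed by category code, with Precision, Recall and $F_1$ nested under an `f1_macro` key. The sketch below uses made-up content that mimics this shape. The values are invented, and the type of the `sum` entry is a guess; the notebook only skips it.

```python
import json

# made-up content mimicking the structure the notebook reads from an .eval file
example_eval = '''
{
  "aut": {"f1_macro": {"precision_macro": 0.9, "recall_macro": 0.8, "f1_macro": 0.85}},
  "sum": 0.85
}
'''

json_scores = json.loads(example_eval)
for cat, cat_scores in json_scores.items():
    if cat == 'sum':
        continue  # not a category, skipped by the notebook as well
    measures = cat_scores['f1_macro']
    print(cat, measures['precision_macro'], measures['recall_macro'], measures['f1_macro'])
```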

In [15]:
eval_files = glob.glob('../data/predictions/test/prediction-test-*.tsv.eval')
eval_files = [f for f in eval_files if 'split' in f]
for eval_file in eval_files:
    print(eval_file)


../data/predictions/test/prediction-test-split_0.1-model_roberta.tsv.eval
../data/predictions/test/prediction-test-split_0.2-model_roberta.tsv.eval
../data/predictions/test/prediction-test-split_0.2-model_robbert.tsv.eval
../data/predictions/test/prediction-test-split_0.2-model_mbert.tsv.eval
../data/predictions/test/prediction-test-split_0.1-model_mbert.tsv.eval


In [5]:

gt_file = '../data/ground_truth/ground_truth_test-cats_all-task_type_multi.tsv'
gt_files = {
    '0.1': '../data/ground_truth/ground_truth_test-cats_all-task_type_multi-split_0.1.tsv',
    '0.2': '../data/ground_truth/ground_truth_test-cats_all-task_type_multi-split_0.2.tsv'
}

pred_roberta_file = '../data/predictions/prediction-test-roberta.tsv.eval'
pred_mbert_file = '../data/predictions/prediction-test-mbert.tsv.eval'

split_dfs = []
for split in gt_files:
    split_df = pd.read_csv(gt_files[split], sep='\t', header=None, names=['sent_text'] + all_cats)
    split_df['split'] = split
    split_dfs.append(split_df)
gt_df = pd.concat(split_dfs)
gt_df.head(2)

,sent_text,Author,Classification,Content,Other_works,Reader_response,Recommendations,Style,Content--Narrative,Content--Other,...,Reader_response--Evaluation_of_quality,Reader_response--Feelings,Reader_response--Identification_and_immersion,Reader_response--Reading_Context,Reader_response--Reception,Reader_response--Reflection,Style--Context,Style--Structure,Style--Stylistic_features,split
0,De hoofdpersoon Charlotte leeft met een diep g...,,,con,,,,,con_nar,,...,,,,,,,,,,0.1
1,U heeft naar mijn mening een hier een illegale...,,,,,,,,,,...,,,,,,,,,,0.1


In [16]:
def make_split_support_column(gt_df):
    """Generate a table with the support (number of positive examples) per category and per split."""
    temp = gt_df.melt(id_vars=['split', 'sent_text'], value_vars=all_cats,
                      var_name='category', value_name='value')
    return (temp[temp.value.notna()]
            .groupby('split')
            .category
            .value_counts()
            .sort_index()
            .unstack()
            .fillna(0)
            .T)


def make_eval_dataframe(scores, cats, support):
    """Create a dataframe with categories, performance scores and support.

    Note: `cats` is currently unused; the categories are taken from the keys of `scores`.
    """
    measures = ['precision_macro', 'recall_macro', 'f1_macro']
    rows = []

    for split, model in scores:
        # read the categories from the scores themselves, skipping the 'sum' entry
        model_cats = [cat for cat in scores[(split, model)] if cat != 'sum']
        for cat in model_cats:
            # precision, recall and F1 are all stored under the 'f1_macro' key
            model_scores = scores[(split, model)][cat]['f1_macro']
            row = [split, model, cat] + [model_scores[measure] for measure in measures]
            rows.append(row)
    
    
    eval_df = pd.DataFrame(rows, columns=['split', 'model', 'category'] + ['precision', 'recall', 'f1'])
    eval_df = eval_df[eval_df.category != 'Style--Context']
    eval_df = eval_df.sort_values(['category', 'model', 'split'])

    #eval_df['support'] = eval_df.category.apply(lambda x: support[x])
    eval_df['support'] = eval_df.apply(lambda row: support.loc[row['category']][row['split']], axis=1)
    eval_df['category'] = eval_df.category.apply(map_cat_pretty)
    eval_df = eval_df.set_index(['category', 'model', 'split'])
    return eval_df


In [7]:
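def make_support_column(gt_df):
    """Generate a column with the support (number of positive examples) per category, across both splits."""
    temp = gt_df.melt(id_vars='sent_text', value_vars=all_cats,
                      var_name='category', value_name='value')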
    return temp[temp.value.notna()].category.value_counts().sort_index()



In [18]:
# Check that it works
#support = make_support_column(gt_df)
support = make_split_support_column(gt_df)
support

split,0.1,0.2
category,,
Author,153.0,282.0
Classification,32.0,62.0
Content,607.0,1224.0
Content--Narrative,533.0,1064.0
Content--Other,32.0,61.0
Content--Quote,23.0,62.0
Content--Theme,10.0,25.0
Other_works,53.0,120.0
Reader_response,566.0,1142.0
Reader_response--Evaluation_of_quality,324.0,661.0
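
To make the `melt` / `value_counts` / `unstack` chain more concrete, here is a toy example with two categories and four sentences. The dataframe `toy_df` and its contents are invented for illustration. As in the ground truth files, a non-null cell marks a positive example.

```python
import pandas as pd

# toy ground truth: a non-null cell means the sentence is a positive example for that category
toy_df = pd.DataFrame({
    'sent_text': ['s1', 's2', 's3', 's4'],
    'Author': ['aut', None, 'aut', None],
    'Style': [None, 'sty', None, None],
    'split': ['0.1', '0.1', '0.2', '0.2'],
})

# wide to long: one row per (sentence, category) pair
long_df = toy_df.melt(id_vars=['split', 'sent_text'], value_vars=['Author', 'Style'],
                      var_name='category', value_name='value')

# count the positive examples per split and category, then pivot the splits into columns
toy_support = (long_df[long_df.value.notna()]
               .groupby('split')
               .category
               .value_counts()
               .unstack()
               .fillna(0)
               .T)
print(toy_support)
```

In this toy data `Style` has no positive example in split `0.2`, so `unstack` introduces a missing value that `fillna(0)` turns into `0.0`. Missing values like this turn the counts into floats, which is presumably why the support values above are shown with a decimal point.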


### Load scores

Load performance scores per model and per split.

In [19]:

scores = {}
for eval_file in eval_files:
    if m := re.search(r"split_(0\.[12])-model_(mbert|roberta|robbert)\.", eval_file):
        split = m.group(1)
        model = m.group(2)
    else:
        raise ValueError(f"no valid model name in eval_file '{eval_file}'")
    with open(eval_file, 'rt') as fh:
        json_scores = json.load(fh)

        scores[(split, model)] = {}
        for cat in json_scores:
            if cat == 'sum':
                continue
            if cat in short_cat_map:
                clean_cat = short_cat_map[cat]
            else:
                clean_cat = cat
            scores[(split, model)][clean_cat] = json_scores[cat]
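
As a quick illustration of the file name parsing, the same regular expression can be applied to one of the file names listed earlier. The sketch also shows that a file name without split information does not match. The variable names `pattern`, `path` and `old_path` are introduced here for illustration.

```python
import re

pattern = re.compile(r"split_(0\.[12])-model_(mbert|roberta|robbert)\.")

path = '../data/predictions/test/prediction-test-split_0.2-model_robbert.tsv.eval'
m = pattern.search(path)
split, model = m.group(1), m.group(2)
print(split, model)
# prints: 0.2 robbert

# a file name without a split and model part does not match
old_path = '../data/predictions/prediction-test-roberta.tsv.eval'
print(pattern.search(old_path))
# prints: None
```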

In [22]:
#cats = ['aut', 'cla', 'con', 'con_nar', 'con_oth', 'con_quo', 'con_the', 'oth', 'rea', 'rea_eva', 'rea_fee', 'rea_ide', 'rea_rea', 'rea_rec', 'rea_ref', 'rec', 'sty', 'sty_con', 'sty_str', 'sty_sty', 'sum']

cats = short_cat_map.keys()

eval_df = make_eval_dataframe(scores, cats, support)

eval_df.head(10)

category,model,split,precision,recall,f1,support
Author,mbert,0.1,0.900046,0.8956,0.897805,153.0
Author,mbert,0.2,0.892547,0.908029,0.900074,282.0
Author,robbert,0.2,0.912719,0.911457,0.912087,282.0
Author,roberta,0.1,0.925786,0.91868,0.92219,153.0
Author,roberta,0.2,0.904906,0.913534,0.909155,282.0
Classification,mbert,0.1,0.839527,0.808514,0.823225,32.0
Classification,mbert,0.2,0.882169,0.827993,0.852882,62.0
Classification,robbert,0.2,0.8798,0.867873,0.873735,62.0
Classification,roberta,0.1,0.913572,0.888411,0.900571,32.0
Classification,roberta,0.2,0.891725,0.91626,0.903595,62.0


Let's look at mBERT only:

In [26]:
temp_df = eval_df.reset_index()
eval_mbert_df = temp_df[temp_df.model == 'mbert'].drop('model', axis=1).set_index(['category', 'split'])
eval_mbert_df.head(2)

category,split,precision,recall,f1,support
Author,0.1,0.900046,0.8956,0.897805,153.0
Author,0.2,0.892547,0.908029,0.900074,282.0


Next, focus on the $0.2$ split:

In [25]:
temp_df = eval_df.reset_index()
eval_02_df = temp_df[temp_df.split == '0.2'].drop('split', axis=1).set_index(['category', 'model'])
eval_02_df

category,model,precision,recall,f1,support
Author,mbert,0.892547,0.908029,0.900074,282.0
Author,robbert,0.912719,0.911457,0.912087,282.0
Author,roberta,0.904906,0.913534,0.909155,282.0
Classification,mbert,0.882169,0.827993,0.852882,62.0
Classification,robbert,0.8798,0.867873,0.873735,62.0
Classification,roberta,0.891725,0.91626,0.903595,62.0
Content,mbert,0.867023,0.867276,0.867139,1224.0
Content,robbert,0.893936,0.893708,0.893816,1224.0
Content,roberta,0.886086,0.883656,0.884472,1224.0
~~~~Narrative,mbert,0.860221,0.86133,0.860683,1064.0


Finally, we want to make a LaTeX table to copy and paste into the paper:

In [27]:
def prettify_table(table):
    table_lines = table.split('\n')
    pretty_lines = []
    for tl in table_lines:
        if tl.startswith('\\multirow'):
            first, second = tl.split('} &')
            first = f"{first}" + "}"
            second = f" &{second.replace('mbert', 'mbert  ')}"
            pretty_lines.extend([f"\t{first}", f"\t{second}"])
        else:
            pretty_lines.append(f"\t{tl}")
    return '\n'.join(pretty_lines)
    
table = eval_02_df.to_latex(float_format="{:.2f}".format)
pretty_table = prettify_table(table)
print(pretty_table)

	\begin{tabular}{llrrrr}
	\toprule
	 &  & precision & recall & f1 & support \\
	category & model &  &  &  &  \\
	\midrule
	\multirow[t]{3}{*}{Author}
	 & mbert   & 0.89 & 0.91 & 0.90 & 282.00 \\
	 & robbert & 0.91 & 0.91 & 0.91 & 282.00 \\
	 & roberta & 0.90 & 0.91 & 0.91 & 282.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{Classification}
	 & mbert   & 0.88 & 0.83 & 0.85 & 62.00 \\
	 & robbert & 0.88 & 0.87 & 0.87 & 62.00 \\
	 & roberta & 0.89 & 0.92 & 0.90 & 62.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{Content}
	 & mbert   & 0.87 & 0.87 & 0.87 & 1224.00 \\
	 & robbert & 0.89 & 0.89 & 0.89 & 1224.00 \\
	 & roberta & 0.89 & 0.88 & 0.88 & 1224.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{~~~~Narrative}
	 & mbert   & 0.86 & 0.86 & 0.86 & 1064.00 \\
	 & robbert & 0.88 & 0.88 & 0.88 & 1064.00 \\
	 & roberta & 0.88 & 0.88 & 0.88 & 1064.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{~~~~Other}
	 & mbert   & 0.89 & 0.73 & 0.79 & 61.00 \\
	 & robbert & 0.94 & 0.83 & 0.87 & 61.00 \\
	 & roberta & 0.94 & 0.82 & 0.87 & 6

### Macro and Weighted Average

We only have scores per category, but it's useful to also have weighted averages (where categories contribute to the average based on their support) and macro averages (where categories contribute equally to the average).
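
The difference between the two averages is easiest to see with a small worked example. The scores and support values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# hypothetical F1 scores and support for two categories
toy_f1 = {'frequent': 0.9, 'rare': 0.5}
toy_counts = {'frequent': 90, 'rare': 10}

# macro average: every category counts equally
toy_macro = sum(toy_f1.values()) / len(toy_f1)

# weighted average: every category counts in proportion to its support
toy_weighted = sum(toy_f1[c] * toy_counts[c] for c in toy_f1) / sum(toy_counts.values())

print(f"macro={toy_macro:.2f} weighted={toy_weighted:.2f}")
# prints: macro=0.70 weighted=0.86
```

The macro average is pulled down by the poorly performing rare category, while the weighted average is dominated by the frequent one.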

In [33]:
temp_df = eval_df.reset_index()
# compute macro average
macro_avg = temp_df.groupby(['split', 'model'])[['precision', 'recall', 'f1']].mean()
macro_avg['category'] = 'macro_avg'
macro_avg['support'] = temp_df.groupby(['split', 'model']).support.sum()
macro_avg

split,model,precision,recall,f1,category,support
0.1,mbert,0.841872,0.783018,0.802988,macro_avg,2867.0
0.1,roberta,0.852849,0.835427,0.838585,macro_avg,2867.0
0.2,mbert,0.830028,0.778318,0.797102,macro_avg,5732.0
0.2,robbert,0.865515,0.82875,0.844414,macro_avg,5732.0
0.2,roberta,0.854704,0.825517,0.835879,macro_avg,5732.0


In [34]:
# compute weighted average
# multiply each score by the support of its category
temp_df['f1_support'] = temp_df['f1'] * temp_df['support']
temp_df['precision_support'] = temp_df['precision'] * temp_df['support']
temp_df['recall_support'] = temp_df['recall'] * temp_df['support']

# sum the support-weighted scores per (split, model) and divide by the total support
weighted_avg = (temp_df.groupby(['split', 'model'])
                [['precision_support', 'recall_support', 'f1_support']]
                .sum()
                .div(temp_df.groupby(['split', 'model']).support.sum(), axis=0)
               )

weighted_avg = weighted_avg.rename(columns={col: col.replace('_support', '') for col in weighted_avg.columns})
#weighted_avg = weighted_avg.reset_index()
weighted_avg['category'] = 'weighted_avg'
weighted_avg['support'] = temp_df.groupby(['split', 'model']).support.sum()
weighted_avg

split,model,precision,recall,f1,category,support
0.1,mbert,0.858551,0.837642,0.845571,weighted_avg,2867.0
0.1,roberta,0.87203,0.870386,0.86926,weighted_avg,2867.0
0.2,mbert,0.849192,0.836729,0.840938,weighted_avg,5732.0
0.2,robbert,0.877521,0.866649,0.87117,weighted_avg,5732.0
0.2,roberta,0.87078,0.864154,0.866234,weighted_avg,5732.0


Combine the two types of averages:

In [35]:
avg = pd.concat([weighted_avg, macro_avg])
avg = avg.loc['0.2'].reset_index().set_index(['category', 'model'])
avg

category,model,precision,recall,f1,support
weighted_avg,mbert,0.849192,0.836729,0.840938,5732.0
weighted_avg,robbert,0.877521,0.866649,0.87117,5732.0
weighted_avg,roberta,0.87078,0.864154,0.866234,5732.0
macro_avg,mbert,0.830028,0.778318,0.797102,5732.0
macro_avg,robbert,0.865515,0.82875,0.844414,5732.0
macro_avg,roberta,0.854704,0.825517,0.835879,5732.0


Put it all together:

In [36]:
eval_all = pd.concat([eval_02_df, avg])
eval_all

category,model,precision,recall,f1,support
Author,mbert,0.892547,0.908029,0.900074,282.0
Author,robbert,0.912719,0.911457,0.912087,282.0
Author,roberta,0.904906,0.913534,0.909155,282.0
Classification,mbert,0.882169,0.827993,0.852882,62.0
Classification,robbert,0.879800,0.867873,0.873735,62.0
...,...,...,...,...,...
weighted_avg,robbert,0.877521,0.866649,0.871170,5732.0
weighted_avg,roberta,0.870780,0.864154,0.866234,5732.0
macro_avg,mbert,0.830028,0.778318,0.797102,5732.0
macro_avg,robbert,0.865515,0.828750,0.844414,5732.0


Now we have a single table with all scores for all models using the $0.2$ split.

In [37]:
table = eval_all.to_latex(float_format="{:.2f}".format)
pretty_table = prettify_table(table)
print(pretty_table)

	\begin{tabular}{llrrrr}
	\toprule
	 &  & precision & recall & f1 & support \\
	category & model &  &  &  &  \\
	\midrule
	\multirow[t]{3}{*}{Author}
	 & mbert   & 0.89 & 0.91 & 0.90 & 282.00 \\
	 & robbert & 0.91 & 0.91 & 0.91 & 282.00 \\
	 & roberta & 0.90 & 0.91 & 0.91 & 282.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{Classification}
	 & mbert   & 0.88 & 0.83 & 0.85 & 62.00 \\
	 & robbert & 0.88 & 0.87 & 0.87 & 62.00 \\
	 & roberta & 0.89 & 0.92 & 0.90 & 62.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{Content}
	 & mbert   & 0.87 & 0.87 & 0.87 & 1224.00 \\
	 & robbert & 0.89 & 0.89 & 0.89 & 1224.00 \\
	 & roberta & 0.89 & 0.88 & 0.88 & 1224.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{~~~~Narrative}
	 & mbert   & 0.86 & 0.86 & 0.86 & 1064.00 \\
	 & robbert & 0.88 & 0.88 & 0.88 & 1064.00 \\
	 & roberta & 0.88 & 0.88 & 0.88 & 1064.00 \\
	\cline{1-6}
	\multirow[t]{3}{*}{~~~~Other}
	 & mbert   & 0.89 & 0.73 & 0.79 & 61.00 \\
	 & robbert & 0.94 & 0.83 & 0.87 & 61.00 \\
	 & roberta & 0.94 & 0.82 & 0.87 & 6

This table corresponds to Tables 3 and 7 in the paper.