# Model Ensembling Notebook

This notebook contains the text and experiments regarding the benefits of combining / ensembling models for Page Stream Segmentation. It contains both an overview of a section of related work, as well as an analysis of various ensembling strategies from the literature.

## Index

1. [Introduction](#intro)
2. [Ensemble on predictions](#average_ensemble)
    - 2.1 [Table with combinations](#table_combi)
    - 2.2 [Interesting combination](#interesting)
    - 2.3 [Problems](#problems)
3. [Earlier Combination](#early)
4. [Combining Multiple Models](#multiple) 
5. [Conclusion](#conclusion)
6. [Extra Error Analysis](#extra)

In [17]:
import os
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import display

%run metricutils.py

## Loading in model predictions

In [18]:
import torch.nn.functional as F
import torch
def get_bert_prediction_dict(dataframe):
    output = {}
    for doc_id, stream in dataframe.groupby('name'):
        predictions = np.vstack([np.array(stream['raw_scores_0']), np.array(stream['raw_scores_1'])])
        output[doc_id] = F.softmax(torch.from_numpy(predictions), dim=0)[1, :].tolist()
        output[doc_id][0] = 1
    return output


In [19]:
# for investigation, also get the binary labels instead of the probabilities
def make_bin(probability_dict):
    out = {}
    for key, value in probability_dict.items():
        out[key] = np.array([round(item) for item in value])
        out[key][0] = 1
    return out
        

In [20]:
LONG_dataframe = pd.concat([pd.read_csv('../resources/datasets/LONG/dataframes/train.csv'), pd.read_csv('../resources/datasets/LONG/dataframes/test.csv')])
SHORT_dataframe = pd.concat([pd.read_csv('../resources/datasets/SHORT/dataframes/train.csv'), pd.read_csv('../resources/datasets/SHORT/dataframes/test.csv')])

TEXTCNN_predictions_raw_LONG = read_json('../resources/model_predictions/TEXT-CNN/LONG_LONG/raw_scores.json')
CNN_predictions_raw_LONG = read_json('../resources/model_predictions/CNN/LONG_LONG/raw_scores.json')
EFFICIENTNET_predictions_raw_LONG = read_json('../resources/model_predictions/EFFICIENTNET/LONG_LONG/raw_scores.json')
BERT_predictions_raw_LONG = get_bert_prediction_dict(pd.read_csv('../resources/model_predictions/BERT/LONG_LONG/model_output.csv'))

TEXTCNN_predictions_raw_SHORT = read_json('../resources/model_predictions/TEXT-CNN/SHORT_SHORT/raw_scores.json')
CNN_predictions_raw_SHORT = read_json('../resources/model_predictions/CNN/SHORT_SHORT/raw_scores.json')
EFFICIENTNET_predictions_raw_SHORT = read_json('../resources/model_predictions/EFFICIENTNET/SHORT_SHORT/raw_scores.json')
BERT_predictions_raw_SHORT = get_bert_prediction_dict(pd.read_csv('../resources/model_predictions/BERT/SHORT_SHORT/model_output.csv'))

LONG_gold_standard_train = read_json('../resources/model_predictions/CNN/LONG_train/gold_standard.json')
SHORT_gold_standard_train = read_json('../resources/model_predictions/CNN/SHORT_train/gold_standard.json')

LONG_gold_standard_test = read_json('../resources/model_predictions/CNN/LONG_LONG/gold_standard.json')
SHORT_gold_standard_test = read_json('../resources/model_predictions/CNN/SHORT_SHORT/gold_standard.json')

In [21]:
gold_standard = {**LONG_gold_standard_test, **SHORT_gold_standard_test}

In [22]:
BERT_predictions_raw_LONG_robust = get_bert_prediction_dict(pd.read_csv('../resources/model_predictions/BERT/SHORT_LONG/model_output.csv'))
BERT_predictions_raw_SHORT_robust = get_bert_prediction_dict(pd.read_csv('../resources/model_predictions/BERT/LONG_SHORT/model_output.csv'))


In [23]:
bert_k_standard = read_json('../resources/model_predictions//BERT-K/standard/predictions.json')
bert_k_robust = read_json('../resources/model_predictions//BERT-K/robust/predictions.json')

bert_k_standard_final = {}
bert_k_robust_final = {}

In [24]:
for key, value in bert_k_standard.items():
    prediction = np.zeros_like(np.array(value))
    num_docs = np.sum(gold_standard[key])
    top_k_indices = np.argpartition(value, -num_docs)[-num_docs:]
    prediction[top_k_indices] = 1
    prediction[0] = 1
    bert_k_standard_final[key] = prediction.astype(int).tolist()


In [25]:
for key, value in bert_k_robust.items():
    prediction = np.zeros_like(np.array(value))
    num_docs = np.sum(gold_standard[key])
    top_k_indices = np.argpartition(value, -num_docs)[-num_docs:]
    prediction[top_k_indices] = 1
    prediction[0] = 1
    bert_k_robust_final[key] = prediction.astype(int).tolist()


In [26]:
with open('bertk_robust.json', 'w') as j:
    json.dump(bert_k_robust_final, j)

In [27]:
LONG_results = {'VGG16': evaluation_report(LONG_gold_standard_test, make_bin(CNN_predictions_raw_LONG)),
             'EFFICIENTNET': evaluation_report(LONG_gold_standard_test, make_bin(EFFICIENTNET_predictions_raw_LONG)),
             'BERT': evaluation_report(LONG_gold_standard_test, make_bin(BERT_predictions_raw_LONG)),
             'TEXTCNN': evaluation_report(LONG_gold_standard_test, make_bin(TEXTCNN_predictions_raw_LONG))}

In [28]:
SHORT_results = {'VGG16': evaluation_report(SHORT_gold_standard_test, make_bin(CNN_predictions_raw_SHORT)),
             'EFFICIENTNET': evaluation_report(SHORT_gold_standard_test, make_bin(EFFICIENTNET_predictions_raw_SHORT)),
             'BERT': evaluation_report(SHORT_gold_standard_test, make_bin(BERT_predictions_raw_SHORT)),
             'TEXTCNN': evaluation_report(SHORT_gold_standard_test, make_bin(TEXTCNN_predictions_raw_SHORT))}

In [29]:
# use *args to make the function work with an arbitrary number of input dicts
def combine_predictions(*args):
    final_predictions = {}
    for key in args[0].keys():
        combi_prediction = (np.vstack([model[key] for model in args]).sum(axis=0) / len(args))
        final_predictions[key] = np.round(combi_prediction).astype(int).tolist()
    return final_predictions
        

This function works nicely for the general case, but we might want to experiment specifically with two models, and how we can adjust 
the balance betweem them to adjust for problems for example when one of the models is really bad, such as in the robst case.

In [30]:
def weigh_predictions(model_1_probabilities: dict, model_2_probabilities: dict, alpha: float= 0.5):
    output_dict = {}
    for key in model_1_probabilities.keys():
        combined_predictions = np.vstack([alpha*np.array(model_1_probabilities[key]), (1-alpha)*np.array(model_2_probabilities[key])])
        output_dict[key] = np.round(combined_predictions.sum(axis=0)).astype(int).tolist()
        output_dict[key][0] = 1
    return output_dict
    
    

In [31]:
# Now try the best prediction for the model fusion one in the robust setting
EFFICIENTNET_predictions_raw_LONG_robust = read_json('../resources/model_predictions/EFFICIENTNET/SHORT_LONG/raw_scores.json')
BERT_predictions_raw_LONG_robust = get_bert_prediction_dict(pd.read_csv('../resources/model_predictions/BERT/SHORT_LONG/model_output.csv'))
TEXTCNN_predictions_raw_LONG_robust = read_json('../resources/model_predictions/TEXT-CNN/SHORT_LONG/raw_scores.json')

EFFICIENTNET_predictions_raw_SHORT_robust = read_json('../resources/model_predictions/EFFICIENTNET/LONG_SHORT/raw_scores.json')
BERT_predictions_raw_SHORT_robust = get_bert_prediction_dict(pd.read_csv('../resources/model_predictions/BERT/LONG_SHORT/model_output.csv'))
TEXTCNN_predictions_raw_SHORT_robust = read_json('../resources/model_predictions/TEXT-CNN/LONG_SHORT/raw_scores.json')

In [33]:
combo_robustness_LONG = weigh_predictions(EFFICIENTNET_predictions_raw_LONG_robust, BERT_predictions_raw_LONG_robust, alpha=0.25)
evaluation_report(LONG_gold_standard_test, combo_robustness_LONG).mean().T.round(2)

  evaluation_report(LONG_gold_standard_test, combo_robustness_LONG).mean().T.round(2)


Precision               0.80
Recall                  0.58
F1                      0.60
SQ                      0.77
SQ*                     0.77
Weighted PQ P           0.55
Weighted PQ* P          0.55
Weighted PQ R           0.44
Weighted PQ* R          0.44
Weighted PQ F1          0.47
Weighted PQ* F1         0.47
Unweighted PQ P         0.59
Unweighted PQ* P        0.59
Unweighted PQ R         0.48
Unweighted PQ* R        0.48
Unweighted PQ F1        0.50
Unweighted PQ* F1       0.50
support              6347.00
dtype: float64

In [34]:
import scipy
def get_model_correlation(gold_standard, model_1, model_2):
    model1_arr = []
    model2_arr = []
    for key in gold_standard.keys():
        model1_correct = (np.array(gold_standard[key]) == np.array(model_1[key])).astype(int)
        model2_correct = (np.array(gold_standard[key]) == np.array(model_2[key])).astype(int)
        
        model1_arr.extend(model1_correct.tolist())
        model2_arr.extend(model2_correct.tolist())
        
    corr = scipy.stats.pearsonr(np.array(model1_arr), np.array(model2_arr))[0]
    return corr
    

<a id="intro" />

## Model Ensembles

As with many fields in Artifical Intelligence, the task of Page Stream Segmentation can benefit from the combination of the predictions of multiple models, where the combination of information form multiple modalities/models can lead to improved performance of the combined system.
Methods that combine the information from multiple modalities can be broadly classified into two categories, with either a  combination of the models at the decision level (i.e. the probabilities of the classes), or combination at the feature level.
In the case of combining at the feature level, the concatenated feature vectors are then fed into a single model for classification \cite{attr_mult10}. Which approach is most succesful depends on a number of factors, such as how closely the modalities are to each other \cite{wu_mult99}.

In the case of ensembling at the decision level, \cite{gune_affe05} have done a study on the most effective way of combining model predictions, in the task of classifying emotions based on facial expressions and body gestures. They concluded that summing the the raw probabilities of both modalities provided the best performance, but that the best method can depend on the specific task and model outputs.

In this paper, besides from combining predictions from both the text and image modalities, we also experiment with combining predictions from different classification architecture for the same modality.



## Theoretical benefits of ensembling

As mentioned in \cite{poli_ense06}, the most benefit from combining different classifiers can be obtained when the classifiers make mistakes on different instances, after which we can select the 'right' classifier with a smart combination strategy. These different classifiers can be obtained in different ways, such as using different initilizations, training on different subsets of the data, or by using different model achitecture. Intuitively this makes sense: If we have the models that make exactly the same mistakes, then there is no 'extra' information we can use, both models always agree.

We follow their work, and measure diversity by calculating the number of times both models were correct. We then use the following formula to measure the correlation between te predictions based on contingency table of the model predictions of both models, with a low score indicating diversity. Here, the different numbers indicate the different coordinates in the contingency table.

$corr = \frac{((00*11)-(10*01))}{\sqrt{(00+10)(01+11)(11+01)(10+11)}}$.

This formula looks a bit complicated, but it is equivalent to calculating the Pearson correlation between the vectors of the correctness of model predictions.

It is possible to calculate the maximum possible performance of a combined system, given the predictions of the individual models. As done in \cite{kunc_theo02}, we include an <i>oracle</i> model, which is correct when either of the two models is correct. In this case, False Positives are when both models wrongly predict 1, and False Negatives are when both models incorrectly predict 0.

Please note that, for the calculation of the maximum possible performance, we followed the scoring scheme used for the models, meaning that we calculate the maximum possible score per stream, and average this for all the metrics to obtain a final maximum possible score.


<a id="average_ensemble" />

## Comparing different ensembles

In this section, we will compare various ensemble strategies, both with combining models from different modalities, as well as combining different models from the same modality. We performed a grid search on weigthing schemes and found that average the output probabilities of both models worked best, which we will refer to as <i>average prediction ensemble<i/>.
   
We have created a table in which we both report the similarity / correlation between model mistakes, as well as to reporting the obtained Panoptic Quality score of best model in the combination, compared to their maximum possible achievable score.
    

In [35]:
def oracle_score(gold_dict, *args):
    maximum_score_dict = {}
    for key in gold_dict.keys():
        combined_prediction = []
        gold = np.array(gold_dict[key]).reshape(1, -1)
        models = np.stack([model[key] for model in args])
        equal = (models == gold).any(axis=0)
        maximum_score_dict[key] = np.where(equal, gold, 1-gold).flatten().tolist()
        
    return maximum_score_dict

In [36]:
LONG_all_predictions = {'TEXTCNN': TEXTCNN_predictions_raw_LONG, 'VGG16': CNN_predictions_raw_LONG,
                      'BERT': BERT_predictions_raw_LONG,
                     'EFFICIENTNET': EFFICIENTNET_predictions_raw_LONG}

SHORT_all_predictions = {'TEXTCNN': TEXTCNN_predictions_raw_SHORT, 'VGG16': CNN_predictions_raw_SHORT,
                      'BERT': BERT_predictions_raw_SHORT,
                     'EFFICIENTNET': EFFICIENTNET_predictions_raw_SHORT}

In [37]:
# Here we construct the very large table with all the models and their diversity score + orcale PQ - highest single model PQ
def create_confusion_matrix_all_models(gold_standard, all_models_prediction_dict, score: str="PQ",
                                      save_path: str = ""):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    similarity_matrix = {}
    maximum_score_diff = {}

    for first_name, first_model in tqdm(all_models_prediction_dict.items()):
        similarity_row = {}
        score_row = {}
        for second_name, second_model in all_models_prediction_dict.items():
            # We can use the predictions here to loop over the scores and average them 
            max_score_predictions = oracle_score(gold_standard, make_bin(first_model), make_bin(second_model))
            
            if first_name != second_name:
                best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
                                 evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
                max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
                score_diff = max_score - best_score
                similarity_row[second_name] = get_model_correlation(gold_standard, make_bin(first_model), make_bin(second_model))
                score_row[second_name] = score_diff
            else:
                similarity_row[second_name] = 1
                score_row[second_name] = 0
        maximum_score_diff[first_name] = score_row
        similarity_matrix[first_name] = similarity_row
    
    def clean_up_matrix(matrix):
        matrix.iloc[0, 0] = 0
        matrix.iloc[1, 1] = 0
        matrix.iloc[2, 2] = 0
        matrix.iloc[3, 3] = 0
        matrix.iloc[3, [0, 1, 2]] = 0
        matrix.iloc[2, [0, 1]] = 0
        matrix.iloc[1, 0] = 0
        return matrix
    
    correlation_dataframe = clean_up_matrix(pd.DataFrame(similarity_matrix).round(2))
    score_dataframe = clean_up_matrix(pd.DataFrame(maximum_score_diff).round(2))
    sns.heatmap(correlation_dataframe, annot=True, ax=axes[0])
    axes[0].set_title("Pearson correlation between various model combinations")
    axes[1].set_title("Difference between the maximum achievable PQ F1 score \n and the best performing single model of the combination")
    sns.heatmap(score_dataframe, annot=True, ax=axes[1])
    plt.tight_layout()
    plt.savefig(save_path)
    plt.show()

In [38]:
for first_name, first_model in tqdm(LONG_all_predictions.items()):
    for second_name, second_model in LONG_all_predictions.items():
        model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)
        print("%s + %s" % (first_name, second_name))
        display(model_scores)

  0%|          | 0/4 [00:00<?, ?it/s]

TEXTCNN + TEXTCNN


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.81,0.88,0.81,0.91,0.92,0.75,0.75,0.77,0.78,0.75,0.76,0.78,0.78,0.8,0.81,0.78,0.78,6347.0


TEXTCNN + VGG16


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.87,0.8,0.79,0.93,0.93,0.78,0.78,0.69,0.69,0.71,0.71,0.8,0.8,0.71,0.71,0.74,0.74,6347.0


TEXTCNN + BERT


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.88,0.84,0.9,0.9,0.79,0.79,0.79,0.79,0.78,0.78,0.82,0.82,0.82,0.82,0.81,0.81,6347.0


TEXTCNN + EFFICIENTNET


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.88,0.83,0.93,0.93,0.79,0.79,0.79,0.79,0.77,0.77,0.82,0.82,0.82,0.82,0.8,0.8,6347.0


 25%|██▌       | 1/4 [00:09<00:29,  9.93s/it]

VGG16 + TEXTCNN


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.87,0.8,0.79,0.93,0.93,0.78,0.78,0.69,0.69,0.71,0.71,0.8,0.8,0.71,0.71,0.74,0.74,6347.0


VGG16 + VGG16


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.86,0.7,0.72,0.92,0.92,0.71,0.72,0.57,0.58,0.61,0.62,0.75,0.76,0.6,0.61,0.64,0.65,6347.0


VGG16 + BERT


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.87,0.83,0.82,0.9,0.9,0.8,0.8,0.75,0.75,0.76,0.76,0.83,0.83,0.77,0.77,0.79,0.79,6347.0


VGG16 + EFFICIENTNET


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.84,0.81,0.93,0.92,0.78,0.78,0.75,0.75,0.74,0.74,0.82,0.82,0.77,0.78,0.77,0.77,6347.0


 50%|█████     | 2/4 [00:19<00:19,  9.79s/it]

BERT + TEXTCNN


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.88,0.84,0.9,0.9,0.79,0.79,0.79,0.79,0.78,0.78,0.82,0.82,0.82,0.82,0.81,0.81,6347.0


BERT + VGG16


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.87,0.83,0.82,0.9,0.9,0.8,0.8,0.75,0.75,0.76,0.76,0.83,0.83,0.77,0.77,0.79,0.79,6347.0


BERT + BERT


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.84,0.88,0.83,0.91,0.91,0.78,0.78,0.78,0.78,0.77,0.77,0.81,0.81,0.8,0.8,0.79,0.79,6347.0


BERT + EFFICIENTNET


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.87,0.83,0.94,0.94,0.8,0.8,0.78,0.78,0.78,0.78,0.82,0.82,0.81,0.81,0.8,0.8,6347.0


 75%|███████▌  | 3/4 [00:29<00:09,  9.98s/it]

EFFICIENTNET + TEXTCNN


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.88,0.83,0.93,0.93,0.79,0.79,0.79,0.79,0.77,0.77,0.82,0.82,0.82,0.82,0.8,0.8,6347.0


EFFICIENTNET + VGG16


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.84,0.81,0.93,0.92,0.78,0.78,0.75,0.75,0.74,0.74,0.82,0.82,0.77,0.78,0.77,0.77,6347.0


EFFICIENTNET + BERT


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.85,0.87,0.83,0.94,0.94,0.8,0.8,0.78,0.78,0.78,0.78,0.82,0.82,0.81,0.81,0.8,0.8,6347.0


EFFICIENTNET + EFFICIENTNET


  model_scores = pd.DataFrame(evaluation_report(LONG_gold_standard_test, combine_predictions(first_model, second_model)).mean()).T.round(2)


Unnamed: 0,Precision,Recall,F1,SQ,SQ*,Weighted PQ P,Weighted PQ* P,Weighted PQ R,Weighted PQ* R,Weighted PQ F1,Weighted PQ* F1,Unweighted PQ P,Unweighted PQ* P,Unweighted PQ R,Unweighted PQ* R,Unweighted PQ F1,Unweighted PQ* F1,support
0,0.82,0.86,0.8,0.92,0.92,0.76,0.76,0.75,0.75,0.73,0.74,0.78,0.79,0.78,0.78,0.76,0.76,6347.0


100%|██████████| 4/4 [00:39<00:00,  9.93s/it]


<a id="table_combi"/>

In [39]:
# replace image cnn with vgg16 after I fix this weird error
create_confusion_matrix_all_models(LONG_gold_standard_test, LONG_all_predictions, save_path="model_fusion_images/contingency_LONG.png")

  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
  max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
  max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
  max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, m

  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
  max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
  max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
  max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
  best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
  evaluation_report(gold_standard, m

FileNotFoundError: [Errno 2] No such file or directory: 'model_fusion_images/contingency_LONG.png'

In [None]:
create_confusion_matrix_all_models(SHORT_gold_standard_test, SHORT_all_predictions, save_path="model_fusion_images/contingency_SHORT.png")
# sometimes this assertion error pops up in the evaluation, not sure why its only between bert and efficientnet, which is extra weird.

In the above two heatmaps we can see the diversity score and the possible performance gain of all model model combinations, respectively. What is interesting to note is that in the left table we can see that in general, the models that predict the same modality (EFFICIENTNET, CNN), (BERT, TEXTCNN) have lower diversity scores than when compared to models that concern a different modality. This already confirms that hypothesis that these models from different modalities capture different aspects of the data. Apart from this, we can see that all the models have a very high diversity score when compared the LSTM model. This is mostly due to the fact that the LSTM model has lower accuracy and therefore makes more mistakes, increasing the diversity with the models that are more accurate.

In [None]:
# Here we construct the very large table with all the models and their diversity score + orcale PQ - highest single model PQ
def create_new_plot_all_models(gold_standard, all_models_prediction_dict, score: str="PQ",
                                      save_path: str = ""):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    similarity_matrix = {}
    maximum_score_diff = {}

    for first_name, first_model in tqdm(all_models_prediction_dict.items()):
        similarity_row = {}
        score_row = {}
        for second_name, second_model in all_models_prediction_dict.items():
            # We can use the predictions here to loop over the scores and average them 
            max_score_predictions = oracle_score(gold_standard, make_bin(first_model), make_bin(second_model))
            
            if first_name != second_name:
                best_score = max(evaluation_report(gold_standard, make_bin(first_model)).mean().T['Weighted PQ F1'],
                                 evaluation_report(gold_standard, make_bin(second_model)).mean().T['Weighted PQ F1'])
                max_score = evaluation_report(gold_standard, max_score_predictions).mean().T['Weighted PQ F1']
                score_diff = max_score - best_score
                similarity_row[second_name] = get_model_correlation(gold_standard, make_bin(first_model), make_bin(second_model))
                score_row[second_name] = score_diff
            else:
                similarity_row[second_name] = 1
                score_row[second_name] = 0
        maximum_score_diff[first_name] = score_row
        similarity_matrix[first_name] = similarity_row
    
    def clean_up_matrix(matrix):
        matrix.iloc[0, 0] = 0
        matrix.iloc[1, 1] = 0
        matrix.iloc[2, 2] = 0
        matrix.iloc[3, 3] = 0
        matrix.iloc[3, [0, 1, 2]] = 0
        matrix.iloc[2, [0, 1]] = 0
        matrix.iloc[1, 0] = 0
        return matrix
    
    def get_dict_combinations(d):
        output_dict = {'TEXT-CNN & VGG16': d['TEXTCNN']['VGG16'],
        'TEXT-CNN & BERT': d['TEXTCNN']['BERT'],
        'TEXT-CNN & EFFICIENTNET': d['TEXTCNN']['EFFICIENTNET'],
        'VGG16 & BERT': d['VGG16']['BERT'],
        'VGG16 & EFFICIENTNET': d['VGG16']['EFFICIENTNET'],
        'EFFICIENTNET & BERT': d['EFFICIENTNET']['BERT']}
        return output_dict
    
    combination_barplot = pd.DataFrame({'correlation': get_dict_combinations(similarity_matrix),
                          'oracle': get_dict_combinations(maximum_score_diff)})
    return combination_barplot

In [None]:
# replace image cnn with vgg16 after I fix this weird error
df = create_new_plot_all_models(SHORT_gold_standard_test, SHORT_all_predictions, save_path="../model_fusion_images/contingency_SHORT.png")

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))
df = df.sort_values(by='correlation', ascending=False)
df.index = ['(I + T) ', '(T + I) ', '(I + T) ', '(T + I) ', '(I + I) ', '(T + T) '] + df.index
df.plot(kind='barh', ax=ax)
ax.legend(loc='upper right')
plt.tight_layout()
ax.plot(df['correlation'], range(len(df)), color='black', marker='o')
ax.plot(df['oracle'], range(len(df)), color='black', marker='o')
plt.savefig('images/modelf_fusion_plot.eps', format='eps')
plt.show()

<a id="interesting" />

## Interesting combination of models

From the above observations of both datasets, we found that although the LSTM combinations showed the most improvement, this was mostly due to the fact that the predictions were less accurate. Because there is also quite a large possible improvement possible in the BERT-EFFICIENTNET combination, we took this as our example pair, and looked into it in more depth.

**BERT+EFF LONG**
| Model         | Page P    | Page R | Page F1 | Doc. PQ | Doc. SQ |Doc.  RQ F1|
|         :--:  |     ---:  |   ---: | ---:| ---:| ---: | ---: |
| EFFICIENTNET  |  0.82     |  0.85  | 0.80 | 0.76 | 0.95 | 0.76 |
| BERT          |  0.84     |  0.88  | 0.83 | 0.78 | 0.97 | 0.79 | 
| AVG. ENSEMBLE |  0.85     |  0.87  | 0.83 | 0.80 | 0.96 | 0.80 |
| ORACLE        |  0.92     |  0.94  | 0.91 | 0.89 | 0.97 | 0.89 |


**BERT+EFF SHORT**
| Model         | Page P    | Page R | Page F1 | Doc. PQ | Doc. SQ |Doc.  RQ F1 |
|        :--:   |       ---: |    ---: | ---:| ---:|---: |---: |
| EFFICIENTNET  |  0.83     |  0.75  |0.75|0.74|0.92|0.71|
| BERT          |  0.80     |  0.76  |0.73|0.67|0.92|0.66|
| AVG. ENSEMBLE |  0.85     |  0.75  |0.75|0.71|0.92|0.69|
| ORACLE        |  0.91     |  0.85  |0.85|0.83|0.96|0.80|

From the above results, we can see that the performance of the combined model is better than the individual model, but we are not able to actually reach the maximum potential of the combination. The most likely explanation for this is because of the fact that during training, the models are incentivised to push their predictions close to 0 or 1, due to the cross entropy loss. This partly explains why it is hard to optimize, as both models are generally really confident in their predictions, even when they are wrong. Thus if both models are really confident and they disagree, it is hard to know which one is correct, and as in late fusion you only have the probabilities to go by, the ability to learn this is also very limited.

We will try to quantatively show the problem with the raw binary predictions, we will do this by showing their average scores, both when the models are correct, and when the models are incorrect. 

<a id="problems" />

### Problems with binary predictions

In [None]:
def analyse_raw_model_scores(gold_standard, model_predictions, model: str):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
    binary_predictions = make_bin(model_predictions)
    
    gold_standard_label = []
    binary_label = []
    model_score = []
    
    for key in gold_standard.keys():        
        gold_standard_label.extend(gold_standard[key])
        binary_label.extend(binary_predictions[key])
        model_score.extend(model_predictions[key])
        
    scores_df = pd.DataFrame({'gold': gold_standard_label, 'prediction': binary_label, 'prob': model_score})
    sns.kdeplot(data=scores_df, x='prob', clip=[0, 1], ax=axes[0])
    axes[0].set_title("Distribution of the raw probability scores\n for the %s model" % model)
    # Apart from analysing the model only by plotting the scores in general, we also plot the scores
    # for the model when it is actually incorrect, we do this by taking a subsample of the data
    incorrect_predictions = scores_df[scores_df.gold != scores_df.prediction]
    sns.kdeplot(data=incorrect_predictions, x='prob', clip=[0,1], ax=axes[1], hue='prediction')
    axes[1].set_title("Distribution of the probability scores for the %s model\n when the prediction was incorrect" % model)
    plt.savefig("images/model_fusion_images/binary_bert.png")
    plt.tight_layout()
    plt.show()
        

In [None]:
analyse_raw_model_scores(LONG_gold_standard_test, BERT_predictions_raw_LONG, model="BERT")


In [None]:
def plot_binary_prediction(gold_standard, model_predictions, model: str):
    binary_predictions = make_bin(model_predictions)
    
    gold_standard_label = []
    binary_label = []
    model_score = []
    
    for key in gold_standard.keys():        
        gold_standard_label.extend(gold_standard[key])
        binary_label.extend(binary_predictions[key])
        model_score.extend(model_predictions[key])
        
    scores_df = pd.DataFrame({'gold': gold_standard_label, 'prediction': binary_label, 'prob': model_score})
    sns.kdeplot(data=scores_df, x='prob', clip=[0, 1])
    #plt.title("Distribution of the raw probability scores\n for the %s model" % model)
    plt.xlabel("Probability")
    plt.savefig('images/bert_probabilities.eps', format='eps')
    plt.tight_layout()
    plt.show()

In [None]:
plot_binary_prediction(LONG_gold_standard_test, BERT_predictions_raw_LONG, model="BERT")

In [None]:
analyse_raw_model_scores(LONG_gold_standard_test, CNN_predictions_raw_LONG, model="TEXTCNN")

Although it is not the case in the same extreme for all models, there are definitely a few models, such as the efficientnet model shown above, for which the predictions are very 'binary', even when the model is wrong, as we can see in the plot on the right. We also see the same kind of behaviour for the BERT, VGG16 model and the LSTM model, but less so for the TEXTCNN model, where the predictions are a bit closer to 0.5 when the model is incorrect, while still being very confident in correct predictions.

### Plot from the two different classifiers when they disagree, seperated for 1 and 0

In [None]:
def analyse_model_differences(model1_predictions, model2_predictions, model1: str, model2: str):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
    model1_binary = make_bin(model1_predictions)
    model2_binary = make_bin(model2_predictions)
    
    model1_label = []
    model2_label = []
    
    model1_score = []
    model2_score = []
    
    for key in model1_binary.keys():        
        model1_label.extend(model1_binary[key])
        model2_label.extend(model2_binary[key])
        
        model1_score.extend(model1_predictions[key])
        model2_score.extend(model2_predictions[key])
        
    
    scores_df = pd.DataFrame({'model1_pred': model1_label, 'model2_pred': model2_label, 'prob_model1': model1_score, 'prob_model2': model2_score})
    # plot two different ones, the first one where model1 predicts 0, and the other one where it predicts 1
    model_differences = scores_df[scores_df.model1_pred != scores_df.model2_pred]

    sns.kdeplot(data=model_differences, x='prob_model2', hue='model1_pred', clip=[0, 1], ax=axes[0])
    sns.kdeplot(data=model_differences, x='prob_model1', hue='model2_pred', clip=[0, 1], ax=axes[1])
    axes[0].set_xlabel(model2)
    axes[1].set_xlabel(model1)
    plt.show()
        

In [None]:
def mistake_types(gold_standard, model1_predictions, model2_predictions):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
    model1_binary = make_bin(model1_predictions)
    model2_binary = make_bin(model2_predictions)
    
    model1_label = []
    model2_label = []
    gold_standard_label = []
    
    model1_score = []
    model2_score = []
    
    for key in model1_binary.keys():        
        model1_label.extend(model1_binary[key])
        model2_label.extend(model2_binary[key])
        gold_standard_label.extend(gold_standard[key])
        
        model1_score.extend(model1_predictions[key])
        model2_score.extend(model2_predictions[key])
        
    
    scores_df = pd.DataFrame({'gold': gold_standard_label, 'model1_pred': model1_label, 'model2_pred': model2_label, 'prob_model1': model1_score, 'prob_model2': model2_score})
    # plot two different ones, the first one where model1 predicts 0, and the other one where it predicts 1
    model_differences = scores_df[scores_df.model1_pred != scores_df.model2_pred]
    print(model_differences['gold'].value_counts(normalize=True))
    
        

In [None]:
analyse_model_differences(EFFICIENTNET_predictions_raw_LONG, BERT_predictions_raw_LONG, model1="EFFICIENTNET", model2="BERT")

The above plot shows what happens when the BERT and EFFICIENTNET models disagree. In the case of the BERT model, if the EFFICENTNET model predicted something different the score are very binary. For the EFFICIENTNET bart, the scores are also quite binary, but a bit less so than for efficientnet

We also briefly looked into for which class the models disagreed the most, and it turns at that this is quite balanced with 54\% of the disagreement being for the 0 class, and 46\% for the 1 class. Given that the distribution of classes is quite skewed, this means the 1 class is a bit overrepresented in the disagreement.

In [None]:
mistake_types(LONG_gold_standard_test, EFFICIENTNET_predictions_raw_LONG, BERT_predictions_raw_LONG)

### Investigating the PQ score

Apart from the interesting observations about the binary preditions, we can also see that the model combination has a large possible improvement in the PQ scores, while the SQ score is almost perfect, but the RQ score is not. To investigate this further, we will split up this RQ score into Precision Recall and F1 to see where the biggest improvement is possible.

In [None]:
# get the score of our combined model
EFF_BERT_ensemble = combine_predictions(EFFICIENTNET_predictions_raw_LONG, BERT_predictions_raw_LONG)
oracle_predictions = oracle_score(LONG_gold_standard_test, make_bin(BERT_predictions_raw_LONG), make_bin(EFFICIENTNET_predictions_raw_LONG))

EFF_BERT_ensemble_scores = evaluation_report(LONG_gold_standard_test, EFF_BERT_ensemble)
oracle_scores = evaluation_report(LONG_gold_standard_test, oracle_predictions)

In [None]:
display(pd.DataFrame(EFF_BERT_ensemble_scores.loc['RQ']).T.iloc[:, :3])

In [None]:
display(pd.DataFrame(oracle_scores.loc['RQ']).T.iloc[:, :3])

This tells us that the performance potential missed by the models is not attributed to a drop in precision or recall only, but rather both although recall has a bit of a larger gap than precision. We will dig a bit deeper to find the issue here. The segmentation quality measures how well True Positives are lined up, summing IoU values over all TP values. This means that the model is able to match true positives very well. For the regocnition quality, we can investigate this by simply retrieving the false positives and false negatives that the model produces, and investigate these.

In [None]:
# use the 'align' function for this
combined_model_mistakes = {}
oracle_model_mistakes = {}
for key in EFF_BERT_ensemble.keys():
    _, _,FP, FN = align(bin_to_length_list(LONG_gold_standard_test[key]), bin_to_length_list(EFF_BERT_ensemble[key]), kind="and")
    combined_model_mistakes[key] = {'FP': FP, 'FN': FN}
    
oracle_mistakes = {}
for key in oracle_predictions.keys():
    _, _,FP, FN = align(bin_to_length_list(LONG_gold_standard_test[key]), bin_to_length_list(oracle_predictions[key]), kind="and")
    oracle_model_mistakes[key] = {'FP': FP, 'FN': FN}
    

Now that we have the mistakes for both the combination and the oracle, we can try to compare these on a stream basis.

In [None]:
# TODO: Even kijken hoe ik hier goed fouten kan vergelijken

In [None]:
for key in combined_model_mistakes.keys():
    combined_FP_mistakes = combined_model_mistakes[key]['FP'] - oracle_model_mistakes[key]['FP']
    combined_FN_mistakes = combined_model_mistakes[key]['FN'] - oracle_model_mistakes[key]['FN']
    #print(combined_FP_mistakes)


<a id="early" />

# Combining earlier layers of models

As we can see from the results above, by the time we combine the predictions on the decision level, we are actually already 'too late', i.e. the models have become really confident in their predictions, and combining them is problematic because of this. This poses a good reason for trying to combine the information from the models at an earlier stage, to make use of more rich information, and to avoid the problems of the binary predictions that we saw above. This technique is referred to as joint / hybrid fusion or ensebmling. To do this, we are now going to use the last linear layer for all the models, and use a simple logistic regression on the concatenated features, and then report the scores of the models, as well as to rerun the analysis on the output probabilities of the Logistic Regression to see whether the problem has been alleviated. We also normalize the concatenated embeddings to ensure they are of the same magnitude.

In [None]:
# First we load in the logistic regression.
# Note that we have to train this, we will train this on the train sets of the models and then we will test it on the test vectors of the models.

vector_base_path = '../resources/model_predictions'

TEXTCNN_train_vectors_LONG = np.load(os.path.join(vector_base_path, 'TEXT-CNN', 'LONG_train', 'raw_vecs.npy'), allow_pickle=True)[()]
TEXTCNN_test_vectors_LONG = np.load(os.path.join(vector_base_path, 'TEXT-CNN', 'LONG_LONG', 'raw_vecs.npy'), allow_pickle=True)[()]
TEXTCNN_train_vectors_SHORT = np.load(os.path.join(vector_base_path, 'TEXT-CNN', 'SHORT_train', 'raw_vecs.npy'), allow_pickle=True)[()]
TEXTCNN_test_vectors_SHORT = np.load(os.path.join(vector_base_path, 'TEXT-CNN', 'SHORT_SHORT', 'raw_vecs.npy'), allow_pickle=True)[()]

VGG16_train_vectors_LONG = np.load(os.path.join(vector_base_path, 'CNN', 'LONG_train', 'raw_vecs.npy'), allow_pickle=True)[()]
VGG16_test_vectors_LONG = np.load(os.path.join(vector_base_path, 'CNN', 'LONG_LONG', 'raw_vecs.npy'), allow_pickle=True)[()]
VGG16_train_vectors_SHORT = np.load(os.path.join(vector_base_path, 'CNN', 'SHORT_train', 'raw_vecs.npy'), allow_pickle=True)[()]
VGG16_test_vectors_SHORT = np.load(os.path.join(vector_base_path, 'CNN', 'SHORT_SHORT', 'raw_vecs.npy'), allow_pickle=True)[()]

EFFICIENTNET_train_vectors_LONG = np.load(os.path.join(vector_base_path, 'EFFICIENTNET', 'LONG_train', 'raw_vecs.npy'), allow_pickle=True)[()]
EFFICIENTNET_test_vectors_LONG = np.load(os.path.join(vector_base_path, 'EFFICIENTNET', 'LONG_LONG', 'raw_vecs.npy'), allow_pickle=True)[()]
EFFICIENTNET_train_vectors_SHORT = np.load(os.path.join(vector_base_path, 'EFFICIENTNET', 'SHORT_train', 'raw_vecs.npy'), allow_pickle=True)[()]
EFFICIENTNET_test_vectors_SHORT = np.load(os.path.join(vector_base_path, 'EFFICIENTNET', 'SHORT_SHORT', 'raw_vecs.npy'), allow_pickle=True)[()]

BERT_train_vectors_LONG = np.load(os.path.join(vector_base_path, 'BERT', 'LONG_train', 'raw_vecs.npy'), allow_pickle=True)[()]
BERT_test_vectors_LONG = np.load(os.path.join(vector_base_path,'BERT', 'LONG_LONG', 'raw_vecs.npy'), allow_pickle=True)[()]
BERT_train_vectors_SHORT = np.load(os.path.join(vector_base_path, 'BERT', 'SHORT_train', 'raw_vecs.npy'), allow_pickle=True)[()]
BERT_test_vectors_SHORT = np.load(os.path.join(vector_base_path, 'BERT', 'SHORT_SHORT', 'raw_vecs.npy'), allow_pickle=True)[()]

In [None]:
LONG_mean = []

In [None]:
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

def norm(vectors):
    return preprocessing.normalize(vectors, norm='l2')

def train_combined_vector_regression(train_gold_standard, test_gold_standard, model1_train_vectors, model1_test_vectors, model2_train_vectors, model2_test_vectors):
    logistic_regressor = LogisticRegression(max_iter=500)
    # First we create the training data
    train_X = []
    train_y = []
    
    for key in train_gold_standard.keys():
        train_y.extend(train_gold_standard[key])
        train_X.append(np.hstack([model1_train_vectors[key], model2_train_vectors[key]]))
    
    train_X = norm(np.vstack(train_X))
    logistic_regressor.fit(train_X, train_y)
    
    predictions = {}
    
    # Now that we have trained the model, we will test it.
    for key in test_gold_standard.keys():
        test_X = np.hstack([model1_test_vectors[key], model2_test_vectors[key]])
        predictions[key] = logistic_regressor.predict_proba(norm(test_X))[:, 1]
        predictions[key][0] = 1
        
    return predictions
    
    
    
        

In [None]:
model_vectors_LONG = {'BERT': [BERT_train_vectors_LONG, BERT_test_vectors_LONG],
                   'VGG16': [VGG16_train_vectors_LONG, VGG16_test_vectors_LONG],
                   'TEXTCNN': [TEXTCNN_train_vectors_LONG, TEXTCNN_test_vectors_LONG],
                   'EFFICIENTNET': [EFFICIENTNET_train_vectors_LONG, EFFICIENTNET_test_vectors_LONG]}
model_vectors_SHORT = {'BERT': [BERT_train_vectors_SHORT, BERT_test_vectors_SHORT],
                   'VGG16': [VGG16_train_vectors_SHORT, VGG16_test_vectors_SHORT],
                   'TEXTCNN': [TEXTCNN_train_vectors_SHORT, TEXTCNN_test_vectors_SHORT],
                   'EFFICIENTNET': [EFFICIENTNET_train_vectors_SHORT, EFFICIENTNET_test_vectors_SHORT]}

<a id="multiple" />

Now that we have written the code to use the classifier, we will train the models of all combinations and show its results when compared to predictions level combinations.

In [None]:
from tqdm import tqdm
early_combos_LONG = {}
combinations = ['BERT-VGG16', 'BERT-TEXTCNN', 'BERT-EFFICIENTNET', 'VGG16-TEXTCNN', "VGG16-EFFICIENTNET",
               'TEXTCNN-EFFICIENTNET']
for model1_name in tqdm(model_vectors_LONG.keys()):
    for model2_name in model_vectors_LONG.keys():
        if "%s-%s" % (model1_name, model2_name) in combinations:
            print("Combination of %s and %s" % (model1_name, model2_name))
            logistic_reg_predictions = train_combined_vector_regression(LONG_gold_standard_train, LONG_gold_standard_test,
                                                                        *model_vectors_LONG[model1_name], *model_vectors_LONG[model2_name])
            model_combination = combine_predictions(LONG_all_predictions[model1_name], LONG_all_predictions[model2_name])
            early_scores = evaluation_report(LONG_gold_standard_test, make_bin(logistic_reg_predictions)).loc[["Boundary", "PQ"], :]
            late_scores = evaluation_report(LONG_gold_standard_test, make_bin(model_combination)).loc[["Boundary", "PQ"], :]
            early_combos_LONG["%s-%s" % (model1_name, model2_name)] = {'early': early_scores.loc['PQ', 'recall'], 'late': late_scores.loc['PQ', 'recall']}

Surprisingly, we can see from the above results that the outputs of training the logistic regression are not on the same level as the model combination on the prediction level. However, it does seem to work well for the CNN and TEXTCNN combination, where performance is increased.

In [None]:
from tqdm import tqdm
early_combos_SHORT = {}
combinations = ['BERT-VGG16', 'BERT-TEXTCNN', 'BERT-EFFICIENTNET', 'VGG16-TEXTCNN', "VGG16-EFFICIENTNET",
               'TEXTCNN-EFFICIENTNET']
for model1_name in tqdm(model_vectors_SHORT.keys()):
    for model2_name in model_vectors_SHORT.keys():
        if "%s-%s" % (model1_name, model2_name) in combinations:
            print("Combination of %s and %s" % (model1_name, model2_name))
            logistic_reg_predictions = train_combined_vector_regression(SHORT_gold_standard_train, SHORT_gold_standard_test,
                                                                        *model_vectors_SHORT[model1_name], *model_vectors_SHORT[model2_name])
            model_combination = combine_predictions(SHORT_all_predictions[model1_name], SHORT_all_predictions[model2_name])
            early_scores = evaluation_report(SHORT_gold_standard_test, make_bin(logistic_reg_predictions)).loc[["Boundary", "PQ"], :]
            late_scores = evaluation_report(SHORT_gold_standard_test, make_bin(model_combination)).loc[["Boundary", "PQ"], :]
            early_combos_SHORT["%s-%s" % (model1_name, model2_name)] = {'early': early_scores.loc['PQ', 'recall'], 'late': late_scores.loc['PQ', 'recall']}

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)
pd.DataFrame(early_combos_LONG).T.plot(kind='bar', ax=axes[0])
pd.DataFrame(early_combos_SHORT).T.plot(kind='bar', ax=axes[1])
axes[0].legend(loc='upper center', bbox_to_anchor=(0.5, 1.05),
          ncol=3, fancybox=True)
axes[1].legend(loc='upper center', bbox_to_anchor=(0.5, 1.05),
          ncol=3, fancybox=True)
plt.savefig('images/early_ensemble_recall.png', bbox_inches='tight')
plt.show()

## Combining information from more than two classifiers

As a final experiment, we will try to combine the information from four classifiers, where we will again take an average decision level average, and see what the outcome is.

In [None]:
multiple_models_LONG = evaluation_report(LONG_gold_standard_test, combine_predictions(BERT_predictions_raw_LONG, TEXTCNN_predictions_raw_LONG, CNN_predictions_raw_LONG, EFFICIENTNET_predictions_raw_LONG))
oracle_scores_multi_LONG = evaluation_report(LONG_gold_standard_test, oracle_score(LONG_gold_standard_test, combine_predictions(BERT_predictions_raw_LONG, TEXTCNN_predictions_raw_LONG, CNN_predictions_raw_LONG, EFFICIENTNET_predictions_raw_LONG)))

In [None]:
multiple_models_SHORT = evaluation_report(SHORT_gold_standard_test, combine_predictions(BERT_predictions_raw_SHORT, TEXTCNN_predictions_raw_SHORT, CNN_predictions_raw_SHORT, EFFICIENTNET_predictions_raw_SHORT))
oracle_scores_multi_SHORT = evaluation_report(SHORT_gold_standard_test, oracle_score(SHORT_gold_standard_test, combine_predictions(BERT_predictions_raw_SHORT, TEXTCNN_predictions_raw_SHORT, CNN_predictions_raw_SHORT, EFFICIENTNET_predictions_raw_SHORT)))

To make this a bit more insightful, I will make a group barplot in which we can show some interesting results.

In [None]:
def collect_barplot_scores(model_dataframe):
    return [*model_dataframe.loc['Boundary', :].iloc[:3].values.tolist(), model_dataframe.loc['PQ']['F1'], model_dataframe.loc['SQ']['F1'], model_dataframe.loc['RQ']['F1']]
   
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 10))
metrics = ["Page P", "Page R", "Page F1", "Doc F1", "SQ", "RQ"]

single_models_LONG = [pd.DataFrame({model: collect_barplot_scores(scores)}) for model, scores in LONG_results.items()]
textcnn_bert_scores_LONG = evaluation_report(LONG_gold_standard_test, combine_predictions(TEXTCNN_predictions_raw_LONG, BERT_predictions_raw_LONG))
efficientnet_bert_scores_LONG = evaluation_report(LONG_gold_standard_test, combine_predictions(EFFICIENTNET_predictions_raw_LONG, BERT_predictions_raw_LONG))
combined_models_LONG = [pd.DataFrame({'BERT+TEXTCNN': collect_barplot_scores(textcnn_bert_scores_LONG)}),
                                pd.DataFrame({'BERT+EFFICIENTNET':collect_barplot_scores(efficientnet_bert_scores_LONG)}), pd.DataFrame({'ALL_MODELS': collect_barplot_scores(multiple_models_LONG)})]

single_models_SHORT = [pd.DataFrame({model: collect_barplot_scores(scores)}) for model, scores in SHORT_results.items()]
textcnn_bert_scores_SHORT = evaluation_report(SHORT_gold_standard_test, combine_predictions(TEXTCNN_predictions_raw_SHORT, BERT_predictions_raw_SHORT))
efficientnet_bert_scores_SHORT = evaluation_report(SHORT_gold_standard_test, combine_predictions(EFFICIENTNET_predictions_raw_SHORT, BERT_predictions_raw_SHORT))
combined_models_SHORT = [pd.DataFrame({'BERT+TEXTCNN': collect_barplot_scores(textcnn_bert_scores_SHORT)}),
                                pd.DataFrame({'BERT+EFFICIENTNET':collect_barplot_scores(efficientnet_bert_scores_SHORT)}), pd.DataFrame({'ALL_MODELS': collect_barplot_scores(multiple_models_SHORT)})]

pd.concat([*single_models_LONG, *combined_models_LONG],
    axis=1).plot.bar(ax=axes[0])
axes[0].set_xticklabels(metrics)
axes[0].set_title("Scores of the single model models, two combinations of models and the combination\n of all models on LONG")

pd.concat([*single_models_SHORT, *combined_models_SHORT],
    axis=1).plot.bar(ax=axes[1])
axes[1].set_xticklabels(metrics)
axes[1].set_title("Scores of the single model models, two combinations of models and the combination\n of all models on SHORT")
plt.tight_layout()
axes[0].legend(loc="lower center")
axes[1].legend(loc="lower center")
plt.savefig("images/all_models_barplots.png")
plt.show()

Unsurprisingly, the model that combines all models  as the best, slightly outperforming the best 2model combination. We also tried incorporating the LSTM model, but this decreased the overall performane.

<a id="extra" />

## Extra: Hard Examples

As another small extra we looked at the following: What pages did all the models predict wrong? This might help us identify some particularly hard phenomena or maybe some labelling mistakes. What is interesting to see is that there are quite a few examples where the models all wrongly predicted samples of consequtive pages, we will show some examples.

In [None]:
def all_models_wrong(gold_standard, *args):
    mistakes = {}
    
    for key in gold_standard.keys():
        model_predictions = np.vstack([np.array(model[key]).round() for model in args]).T
        # make a column for each model otherwise numpy gives errors
        tiled_gold = np.tile(gold_standard[key], reps=len(args)).reshape(-1, len(args))
        model_correctness = (tiled_gold == model_predictions).astype(int)
        
        # now find rows with only zeros, where all models were incorrect
        all_wrong = np.where(~model_correctness.any(axis=1))[0]
        mistakes[key] = {'pages': (all_wrong+1).tolist(), 'correct': np.array(gold_standard[key])[all_wrong]} 
    return mistakes
        

In [None]:
LONG_all_wrong = all_models_wrong(LONG_gold_standard_test, BERT_predictions_raw_LONG, TEXTCNN_predictions_raw_LONG, CNN_predictions_raw_LONG, EFFICIENTNET_predictions_raw_LONG)
SHORT_all_wrong = all_models_wrong(SHORT_gold_standard_test, BERT_predictions_raw_SHORT, TEXTCNN_predictions_raw_SHORT, CNN_predictions_raw_SHORT, EFFICIENTNET_predictions_raw_SHORT)

In [None]:
list(SHORT_all_wrong.keys())[20:30]

I have manually selected some images which we show here to see why all models get these wrong

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image_dict = '/Users/rvanheusden/Downloads/png/'
def plot_model_mistakes(mistakes_dict, specific_stream: str = None, specific_range = None):
    if specific_stream:
        info = mistakes_dict[specific_stream]
        pages = info['pages']
        ground_truth = info['correct']
        image_paths = [os.path.join(image_dict, specific_stream+'-%d.png' % num) for num in pages]
        for i, path in enumerate(image_paths):
            img = mpimg.imread(path)
            imgplot = plt.imshow(img)
            plt.xlabel("True label % d, models predicted %d" % (ground_truth[i], 1-ground_truth[i]))
            plt.show()
        

Although it is hard to see exactly, it seems that some of these issues with the consecutive pages come from emails. This is acutally not surprising as it seems emails are part of the same document sometimes and then concatenated, or they are part of a different document, but in either case there is one email per page, and the models really don't have anything to go by.

We kunnen dit hard maken door een kleine email classifier te maken op de tekst en dan te zien waar de fouten vandaan komen (zowel voor LONG als SHORT).

In [None]:
def get_text_from_mistakes(dataframe, dict_of_mistakes):
    page_text = []
    for key, value in dict_of_mistakes.items():
        for page in value['pages']:
            page_text.append(dataframe[(dataframe['name'] == key) & (dataframe['page'] == page)]['text'].tolist()[0])
    return page_text

In [None]:
#LONG_mistakes_text = get_text_from_mistakes(LONG_dataframe, LONG_all_wrong)
#SHORT_mistakes_text = get_text_from_mistakes(SHORT_dataframe, SHORT_all_wrong)

numbers_wrong_LONG = len([item for key in LONG_all_wrong.keys() for item in LONG_all_wrong[key]['pages']])
numbers_wrong_SHORT = len([item for key in SHORT_all_wrong.keys() for item in SHORT_all_wrong[key]['pages']])

In [None]:
print(numbers_wrong_LONG)
print(numbers_wrong_SHORT)

In [None]:
def email_classifier(page):
    if "van" in page.lower() and "aan" in page.lower() and "onderwerp" in page.lower() and "verzonden" in page.lower() and "cc" in page.lower():
        return True
    elif "from" in page.lower() and "to" in page.lower() and "subject" in page.lower() and "sent" in page.lower() and "cc" in page.lower():
        return True
    return False

In [29]:
for key, value in SHORT_all_wrong.items():
    for i, page in enumerate(value['pages']):
        text = SHORT_dataframe[(SHORT_dataframe['name'] == key) & (SHORT_dataframe['page'] == page)]['text'].tolist()[0]
        if email_classifier(text):
            pass
            # print(key, page, value['correct'][i])

NameError: name 'D2_all_wrong' is not defined

<a id="conclusion" />

## Conclusion

In this notebook we investigated the usefulness of ensembling different PSS classifiers. We found that although the usage of combining the models through decision level ensembling yielded some performance gain, the usefulness was limited due to the models outputting very binary predictions. To this extent, we investigated the usage of hybrid ensembling to circumvent this problem, but found that it had a negative effect on the overall performance of most models. Finally we tried combining all models to form a prediction with decision level ensembling, and found that this yielded a performance gain, at the case of having to train all models.