# Experiments Notebook

This notebook contains the experiments conducted in the "Using Deep Learned Vector Representations for Page Stream Segmentation by Agglomerative Clustering" paper submitted to the Short papers track of SIGIR 2023.

## Index

1. [Data Loading](#dataloading)
2. [Binary Classification](#bin_classification)
    - 2.1 [VGG16 without knowing K](#bin_class_no_k)
    - 2.2 [VGG16 with knowing K](#bin_class_k)
3. [Clustering](#agglo_clustering)
    - 3.1.1 [Pretrained No Switch](#pretrained_no_switch)
    - 3.1.2 [Pretrained Switch](#pretrained_switch)
    - 3.2.1 [Finetuned No Switch](#finetuned_no_switch)
    - 3.2.2 [Finetuned Switch](#finetuned_switch)
4. [Error Analysis](#result_analysis)
    - [Classification Mistakes](#classification_mistakes)
    - [Vector Distances](#vector_dist)



<a id="dataloading" />

## Loading in the data and utilities

To run the experiments, please first download the data and model from Zenodo using the following link: INSERT LINK. Alternatively, you can run 'TODO.py' to automatically download the data and have it in the right place in the repository.

In [1]:
# Imports
import os
import json
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import defaultdict

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Local imports
%run ../utils/utils.py
%run ../utils/metricutils.py

In [2]:
# Load the data vectors. The .npy files store pickled dictionaries, so pickle loading must be allowed;
# indexing with [()] unwraps the dictionary from the 0-d object array that np.load returns.
pretrained_vectors = np.load('../data/pretrained_vectors.npy', allow_pickle=True)[()]
finetuned_vectors = np.load('../data/finetuned_vectors.npy', allow_pickle=True)[()]

gold_standard = read_json('../data/gold_standard.json')

In [3]:
pretrained_vectors['82493e06e956a0262e67e32167c32ff4_documenten_2'][0].shape

(4096,)

### Note on input formats

In the experiments, we use the following format: each stream is a key-value pair in a dictionary, where the key is the stream name and the value is the list of labels, one for each individual page.
We use the binary classification format to define the gold standard: a stream is represented by a binary vector with one label per page, where a 0 means the page is part of the current document and a 1 means the page is a boundary page. By definition, the first page of a stream is always a boundary page.

Alternatively, some of the methods use a different format, where streams are represented as a list of document lengths. We use this format only sporadically in the internal workings of some functions, because it makes the logic easier.
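
To make the two formats concrete, here is a minimal sketch of the conversion in both directions. The function names `binary_to_lengths` and `lengths_to_binary` are illustrative only; the notebook's own helpers (`bin_to_length_list` and `length_list_to_bin`) live in `utils.py` and may differ in detail. The sketch assumes that a 1 marks the first page of a document.

```python
from typing import List


def binary_to_lengths(binary: List[int]) -> List[int]:
    """Convert a binary boundary vector into a list of document lengths.

    Assumption: a 1 marks the first page of a document, and the first
    page of a stream is always treated as a boundary page.
    """
    lengths: List[int] = []
    for label in binary:
        if label == 1 or not lengths:
            lengths.append(1)   # start a new document
        else:
            lengths[-1] += 1    # extend the current document
    return lengths


def lengths_to_binary(lengths: List[int]) -> List[int]:
    """Convert a list of document lengths back into a binary boundary vector."""
    binary: List[int] = []
    for length in lengths:
        binary.extend([1] + [0] * (length - 1))
    return binary


stream = [1, 0, 0, 1, 0, 1, 0, 0, 0]
lengths = binary_to_lengths(stream)
print(lengths)                     # [3, 2, 4]
print(lengths_to_binary(lengths))  # [1, 0, 0, 1, 0, 1, 0, 0, 0]
```

The round trip returns the original vector, which is why either format can serve as the internal representation.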

<a id="baseline" />

## Mean Document Baseline

The first baseline that we will use in the paper is the mean document baseline, where we predict every document in a stream to have the mean document length (rounded down) of that stream in the gold standard.

In [4]:
def mean_document_length_baseline(gold_standard: dict):
    predictions = {}
    for key, value in gold_standard.items():
        mean_prediction = np.zeros_like(value)
        mean_document_length = int(np.array(bin_to_length_list(value)).mean())
        mean_prediction[np.arange(len(value)) % mean_document_length == 0] = 1
        predictions[key] = mean_prediction
    return predictions

In [4]:
mean_length_predictions = mean_document_length_baseline(gold_standard)
evaluation_report(gold_standard, mean_length_predictions)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.38 | 0.48 | 0.42 | 6347 | 0.37-0.39 | 0.47-0.49 | 0.41-0.43 |
| Doc | 0.31 | 0.27 | 0.25 | 6347 | 0.3-0.32 | 0.26-0.28 | 0.24-0.26 |


<a id="bin_classification" />

## Binary Classification Results

To compare the agglomerative clustering model with the binary classification model, we need the predictions of the VGG16 model trained on the dataset. We load the saved model rather than retraining it, which saves time. Because we also have the vectors obtained after training, we only need the part of the model that follows those vectors to obtain the VGG16 results on the test set.

In [5]:
from tensorflow.keras.models import load_model, Model
trained_VGG16_model = load_model('../data/trained_VGG16_model')

# Select only the top end of the model, as we already have the vectors precomputed
layer_name = 'dense_1'
VGG16_top = Model(inputs=trained_VGG16_model.get_layer(layer_name).output,
                         outputs=trained_VGG16_model.output)
# Print the architecture of the model after the first dense layer
VGG16_top.summary()

2023-02-15 14:19:40.299675: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.






Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 256)]             0         
                                                                 
 leaky_re_lu_1 (LeakyReLU)   (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 257       
                                                                 
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________


Now that we have loaded the model, the next step is to obtain the predictions for all the streams using the finetuned vectors that we have saved. For the case in which we don't know k, we simply round all the predictions, so that outputs larger than 0.5 become 1 and outputs lower than 0.5 become 0.

In [6]:
def binary_classification(use_k: bool=False):
    VGG16_predictions = {}

    for stream_id, stream_vectors in finetuned_vectors.items():
        VGG16_prediction = VGG16_top.predict(stream_vectors)
        if use_k:
            # If we are using K, we take the total number of boundaries
            # from the gold standard and keep only the top-k predictions.
            number_of_boundaries = sum(gold_standard[stream_id])
            VGG16_prediction = select_topk(VGG16_prediction.flatten(), number_of_boundaries).flatten()
        else:
            # If K is not known, threshold the predictions at 0.5 by rounding.
            VGG16_prediction = VGG16_prediction.round()
        # Set the first element to 1, this is always true in our definition.
        VGG16_prediction[0] = 1
        VGG16_predictions[stream_id] = VGG16_prediction.flatten()
    return VGG16_predictions


In [7]:
print("Running predictions for VGG16 model where we do not know K")
VGG16_predictions_no_k = binary_classification(use_k=False)
print("Running predictions for VGG16 model where we do know K")
VGG16_predictions_k = binary_classification(use_k=True)

Running predictions for VGG16 model where we do not know K
Running predictions for VGG16 model where we do know K


We can now use these predictions to obtain scores for the model. First we simply round the predictions, which gives us scores for the normal scenario; then we select only the top-k predictions, in line with the experiments where k is known.
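
The difference between the two strategies can be illustrated on a toy score vector. This is a minimal sketch: `round_predictions` and `topk_predictions` are hypothetical names, and the top-k variant is only an assumption about what `select_topk` from `utils.py` does (keep the k highest scores), since that helper is not shown in this notebook. It also assumes k is at least 1.

```python
import numpy as np


def round_predictions(scores: np.ndarray) -> np.ndarray:
    """k unknown: threshold the scores at 0.5 by rounding."""
    binary = np.round(scores).astype(int)
    binary[0] = 1  # the first page is a boundary by definition
    return binary


def topk_predictions(scores: np.ndarray, k: int) -> np.ndarray:
    """k known: mark the k highest-scoring pages as boundaries (assumes k >= 1)."""
    binary = np.zeros(len(scores), dtype=int)
    top_indices = np.argsort(scores)[-k:]  # indices of the k largest scores
    binary[top_indices] = 1
    binary[0] = 1  # the first page is a boundary by definition
    return binary


scores = np.array([0.9, 0.2, 0.6, 0.4, 0.55, 0.1])
print(round_predictions(scores))     # [1 0 1 0 1 0]
print(topk_predictions(scores, 2))   # [1 0 1 0 0 0]
```

With rounding, the number of predicted boundaries depends on the scores alone (three here), whereas the top-k variant fixes it in advance.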

<a id ="bin_class_no_k" />

### Binary Classification Results without knowing k

In [8]:
# Scores are not quite equal, ask about the other model and vectors that were used.
evaluation_report(gold_standard, VGG16_predictions_no_k)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.83 | 0.82 | 0.78 | 6347 | 0.82-0.84 | 0.81-0.83 | 0.77-0.79 |
| Doc | 0.7 | 0.71 | 0.75 | 6347 | 0.69-0.71 | 0.7-0.72 | 0.74-0.76 |


<a id ="bin_class_k" />

### Binary Classification Results with knowing k

In [9]:
evaluation_report(gold_standard, VGG16_predictions_k)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.8 | 0.83 | 0.81 | 6347 | 0.79-0.81 | 0.82-0.84 | 0.8-0.82 |
| Doc | 0.75 | 0.74 | 0.73 | 6347 | 0.74-0.76 | 0.73-0.75 | 0.72-0.74 |


<a id="agglo_clustering" />

## Agglomerative Clustering

Here we set up the code to run the agglomerative clustering algorithm for the case where k is known, with either pretrained or finetuned embeddings, and with or without the 'switch' technique.

In [10]:
# These imports may already be provided by utils.py; they are repeated here so the cell is self-contained.
from scipy.spatial import distance
from sklearn.cluster import AgglomerativeClustering

def agglomerative_clustering(mode: str = "finetuned", use_switch: bool = False):
    assert mode in ["finetuned", "pretrained"]

    clustering_predictions = {}

    for stream_id in gold_standard.keys():

        if mode == "finetuned":
            image_vectors = finetuned_vectors[stream_id]
        elif mode == "pretrained":
            image_vectors = pretrained_vectors[stream_id]

        labels = gold_standard[stream_id]
        n_pages = len(labels)
        # k is known: the number of documents is taken from the gold standard
        n_docs = sum(labels)

        if use_switch:
            _, switch_predictions = cluster_with_switch(labels, image_vectors, switch_first=True)
            clustering_predictions[stream_id] = switch_predictions
        else:
            # Cosine distances between consecutive pages, min-max normalised
            # (kept for inspection; not used by the clustering call below)
            distance_list = np.array([distance.cosine(image_vectors[i], image_vectors[i+1]) for i in range(len(image_vectors)-1)])
            connectivity_matrix = create_connectivity_matrix(n_pages)
            if len(distance_list) > 1:
                dist_list_norm = (distance_list - np.min(distance_list)) / (np.max(distance_list) - np.min(distance_list))
                nth_highest = np.sort(dist_list_norm)[-n_docs]
            else:
                dist_list_norm = distance_list

            if n_pages > 1:
                # Note: newer scikit-learn releases call the `affinity` parameter `metric`
                cluster = AgglomerativeClustering(n_clusters=n_docs, affinity='cosine',
                                                  linkage='average', compute_distances=True,
                                                  connectivity=connectivity_matrix)
                image_predictions = cluster.fit_predict(image_vectors)
            else:
                # A single-page stream is one document; without this branch
                # image_predictions would be undefined or stale from the previous stream
                image_predictions = np.zeros(n_pages, dtype=int)

            clustering_predictions[stream_id] = length_list_to_bin(groups_to_lengths(image_predictions))

    # return the predictions
    return clustering_predictions

Using this code, we can obtain the results for all the possible settings with agglomerative clustering. We will first show the results of both using switching and not using switching with pretrained vectors, followed by the results of both strategies on finetuned vectors.
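
Two small pieces of the non-switch pipeline can be sketched in isolation: the connectivity constraint that only lets neighbouring pages merge, and the conversion of cluster labels back into boundary labels. This is a sketch under stated assumptions: `chain_connectivity` and `labels_to_boundaries` are hypothetical names, and they only approximate what `create_connectivity_matrix`, `groups_to_lengths` and `length_list_to_bin` in `utils.py` are assumed to do.

```python
import numpy as np


def chain_connectivity(n_pages: int) -> np.ndarray:
    """Adjacency matrix connecting each page to itself and its direct neighbours.

    Assumption: this mirrors the idea behind create_connectivity_matrix,
    so that clusters can only consist of consecutive pages.
    """
    return (np.eye(n_pages, dtype=int)
            + np.eye(n_pages, k=1, dtype=int)
            + np.eye(n_pages, k=-1, dtype=int))


def labels_to_boundaries(cluster_labels) -> np.ndarray:
    """Mark a boundary wherever the cluster label differs from the previous page."""
    labels = np.asarray(cluster_labels)
    boundaries = np.ones(len(labels), dtype=int)  # the first page is always a boundary
    boundaries[1:] = (labels[1:] != labels[:-1]).astype(int)
    return boundaries


print(chain_connectivity(4))
print(labels_to_boundaries([2, 2, 0, 0, 0, 1]))  # [1 0 1 0 0 1]
```

Because the cluster ids themselves are arbitrary, only changes between consecutive labels matter for the boundary vector.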

<a id="pretrained_no_switch" />

### Clustering with pretrained vectors no switch

In [11]:
clustering_pretrained_no_switch = agglomerative_clustering(mode="pretrained", use_switch = False)
evaluation_report(gold_standard, clustering_pretrained_no_switch)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.48 | 0.48 | 0.48 | 6347 | 0.47-0.49 | 0.47-0.49 | 0.47-0.49 |
| Doc | 0.27 | 0.27 | 0.27 | 6347 | 0.26-0.28 | 0.26-0.28 | 0.26-0.28 |


<a id="pretrained_switch" />

### Clustering with pretrained vectors with switch

In [12]:
clustering_pretrained_switch = agglomerative_clustering(mode="pretrained", use_switch = True)
evaluation_report(gold_standard, clustering_pretrained_switch)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.41 | 0.41 | 0.41 | 6347 | 0.4-0.42 | 0.4-0.42 | 0.4-0.42 |
| Doc | 0.28 | 0.28 | 0.28 | 6347 | 0.27-0.29 | 0.27-0.29 | 0.27-0.29 |


<a id="finetuned_no_switch" />

### Clustering with finetuned vectors no switch

In [13]:
clustering_finetuned_no_switch = agglomerative_clustering(mode="finetuned", use_switch = False)
evaluation_report(gold_standard, clustering_finetuned_no_switch)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.53 | 0.53 | 0.53 | 6347 | 0.52-0.54 | 0.52-0.54 | 0.52-0.54 |
| Doc | 0.28 | 0.28 | 0.28 | 6347 | 0.27-0.29 | 0.27-0.29 | 0.27-0.29 |


<a id="finetuned_switch" />

### Clustering with finetuned vectors with switch

In [14]:
clustering_finetuned_switch = agglomerative_clustering(mode="finetuned", use_switch = True)
evaluation_report(gold_standard, clustering_finetuned_switch)

| | precision | recall | F1 | support | CI Precision | CI Recall | CI F1 |
|---|---|---|---|---|---|---|---|
| Page | 0.62 | 0.62 | 0.62 | 6347 | 0.61-0.63 | 0.61-0.63 | 0.61-0.63 |
| Doc | 0.52 | 0.52 | 0.52 | 6347 | 0.51-0.53 | 0.51-0.53 | 0.51-0.53 |


<a id="result_analysis" />

## Result Analysis

In this section we will analyse the results, such as investigating the differences in the pretrained and finetuned vectors, as well as looking into the types of mistakes made by the clustering model.

<a id="classification_mistakes" />

### False Positives and False Negatives

In [15]:
def get_classification_statistics(gold_standard: dict, predictions: dict):
    
    statistics_dict = defaultdict(int)
    
    for key in gold_standard.keys():
        stream_statistics = get_base_metrics(np.array(gold_standard[key]),
                                             np.array(predictions[key]))

        # Sum the per-stream counts; a separate loop variable avoids shadowing `key`
        for metric in stream_statistics.keys():
            statistics_dict[metric] += stream_statistics[metric]
    
    return statistics_dict
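
For reference, the per-stream counts that get summed here can be sketched as follows. This is an illustration only: `confusion_counts` is a hypothetical stand-in for `get_base_metrics` from `metricutils.py`, which is not shown in this notebook. The `'FP'` and `'FN'` keys match those used below; the `'TP'` and `'TN'` keys are assumptions.

```python
import numpy as np


def confusion_counts(gold, predicted) -> dict:
    """Count true/false positives and negatives for two binary boundary vectors."""
    gold = np.asarray(gold).astype(bool)
    predicted = np.asarray(predicted).astype(bool)
    return {
        "TP": int(np.sum(gold & predicted)),    # boundary predicted and present
        "FP": int(np.sum(~gold & predicted)),   # boundary predicted but absent
        "FN": int(np.sum(gold & ~predicted)),   # boundary missed
        "TN": int(np.sum(~gold & ~predicted)),  # non-boundary correctly left alone
    }


print(confusion_counts([1, 0, 0, 1, 0, 1], [1, 0, 1, 0, 0, 1]))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```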


In [16]:
VGG16_no_k_stats = get_classification_statistics(gold_standard, VGG16_predictions_no_k)
VGG16_k_stats = get_classification_statistics(gold_standard, VGG16_predictions_k)

In [17]:
print("With K not known we have %d False Positives, and %d False Negatives" % (VGG16_no_k_stats['FP'], VGG16_no_k_stats['FN']))
print("With K known we have %d False Positives, and %d False Negatives" % (VGG16_k_stats['FP'], VGG16_k_stats['FN']))

With K not known we have 517 False Positives, and 734 False Negatives
With K known we have 547 False Positives, and 539 False Negatives


<a id="vector_dist" />

### Distances between vectors

In this section we will investigate the cosine similarity between the finetuned vectors obtained from the VGG16 model, and differentiate between starting pages and non-starting pages. We will also do this for the pretrained versions of the vectors.

In [133]:
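import scipy.linalg

def custom_cosine(A, B):
    # Note: for 2-D inputs, scipy.linalg.norm returns the Frobenius norm of the whole
    # matrix, so the result is a globally scaled Gram matrix, not row-wise cosine similarities.
    return (np.dot(A, B.T)) / (scipy.linalg.norm(A)*scipy.linalg.norm(B))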
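
When `A` and `B` each hold several page vectors (one per row), `custom_cosine` divides by the norms of the whole matrices. A row-wise variant instead normalises every vector separately, so that each entry is the cosine similarity of one pair of pages and the diagonal of a self-comparison equals 1. The sketch below is illustrative: `pairwise_cosine_similarity` is a hypothetical name that is not used elsewhere in this notebook, and it assumes that no row is all zeros.

```python
import numpy as np


def pairwise_cosine_similarity(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of A and every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)  # normalise each row of A
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)  # normalise each row of B
    return A_unit @ B_unit.T


pages = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
similarities = pairwise_cosine_similarity(pages, pages)
print(np.round(similarities, 3))
```

In the toy example the first and third vectors are orthogonal, so their similarity is 0, while each vector compared with itself gives 1.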

In [134]:
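# Here we calculate the average distance between vectors of different types.
from scipy.spatial.distance import pdist, squareform
def distance_against_eachother_2(vector_list, gold_standard):
    # Indices of the boundary pages, i.e. the positions where a document starts
    document_vector_boundaries = np.array(gold_standard).nonzero()[0]
    # Split the stream into per-document groups of vectors; the first split is empty
    # because the first page is always a boundary, so it is dropped
    document_vectors = np.split(vector_list, document_vector_boundaries)[1:]
    
    # For now, only the similarity matrix of the first document is printed
    print(custom_cosine(document_vectors[0], document_vectors[0]))
    
    #same_document_similarity = [(custom_cosine(document, document).sum()-vector_list.shape[0]) / 2 for document in document_vectors]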

In [135]:
def get_vector_distances(gold_standard: dict, predictions: dict):
    
    statistics_dict = defaultdict(int)
    
    for key in predictions.keys():
        vector_distances = distance_against_eachother_2(predictions[key], gold_standard[key])
        
    
    return statistics_dict

In [136]:
get_vector_distances(gold_standard, finetuned_vectors)

[[0.99999994]]
[[0.10977144 0.08487014 0.08364015 0.09373466 0.11134786]
 [0.08487014 0.22891751 0.15016362 0.09818999 0.23040353]
 [0.08364015 0.15016362 0.18969935 0.10831296 0.15938331]
 [0.09373466 0.09818999 0.10831296 0.14744616 0.12317356]
 [0.11134786 0.23040353 0.15938331 0.12317356 0.3241654 ]]
[[1.]]
[[ 0.9259466  -0.15513776]
 [-0.15513776  0.07405321]]
[[ 0.35969076 -0.12393659]
 [-0.12393659  0.64030915]]
[[1.0000001]]
[[ 0.32848597 -0.29789403]
 [-0.29789403  0.67151433]]
[[ 0.28505707 -0.01337749 -0.02258576 -0.02477903 -0.0467885 ]
 [-0.01337749  0.14644262  0.12656662  0.08848834  0.10869379]
 [-0.02258576  0.12656662  0.24715778  0.13081188  0.1640793 ]
 [-0.02477903  0.08848834  0.13081188  0.13204272  0.13426793]
 [-0.0467885   0.10869379  0.1640793   0.13426793  0.18930006]]
[[0.12264156 0.08722322 0.05287336 0.08066577 0.08800559 0.14034215
  0.12810043 0.08414943 0.05482504]
 [0.08722322 0.08990871 0.05094488 0.07972496 0.0885293  0.13078667
  0.12370169 0.08028

defaultdict(int, {})