# Simple Supervised baseline models

In this notebook we implement several simple supervised baselines using scikit-learn-compatible classifiers. Unlike fixed heuristics, these models are capable of learning, and they serve as a reference point for the more complex neural approaches that we use in the paper.

We will run K-nearest Neighbours and XGBoost. We will do both the classic and robust experiments, and we will finish both sections with a table containing all the approaches we used and the scores they achieved, where we report Page P, R and F1, and Document SQ, RQ and PQ.
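As a rough illustration of the page-level scores, the sketch below computes precision, recall and F1 over binary boundary labels for a single stream. It assumes that a label of 1 marks the first page of a document and that the page-level scores are the usual binary classification scores; the helper name `page_scores` is hypothetical, and the actual definitions (including the document-level scores, which are not covered here) live in `metricutils.py`.

```python
def page_scores(gold, predicted):
    # Count true positives, false positives and false negatives on boundary pages
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy stream of five pages (illustrative labels, not taken from the datasets)
toy_gold = [1, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 1, 0]
toy_precision, toy_recall, toy_f1 = page_scores(toy_gold, toy_pred)
print(round(toy_precision, 2), round(toy_recall, 2), round(toy_f1, 2))  # 0.67 1.0 0.8
```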

We will also try the robustness experiments on the XGBoost and the KNN models and see if combining both modalities leads to better scores than just using the individual modalities.

## Index
1. [Dataloading and Setting Up](#dataloading)
2. [XGBoost](#xgboost)
3. [K-Nearest Neighbours](#knearestneighbours)

<a id="dataloading" />

## Loading in the data and setting up

As a first step we load the data used in the experiments: the pretrained image and text vectors for both datasets, and the gold standard document boundaries.


In [23]:
import numpy as np
import pandas as pd

In [24]:
def pandas_to_json(dataframe):
    output_dict = {}
    for doc_id, doc_data in dataframe.groupby('name'):
        output_dict[doc_id] = doc_data['label'].tolist()
    
    return output_dict
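To make the output format concrete, here is a small sketch that runs the same helper on a toy dataframe. The column names `name` and `label` come from the function above; the stream names and labels are made up for illustration, and reading a 1 as "first page of a document" is an assumption based on how the labels are used later in the notebook.

```python
import pandas as pd

def pandas_to_json(dataframe):
    # Group the page labels per stream and return {stream_name: [label, ...]}
    output_dict = {}
    for doc_id, doc_data in dataframe.groupby('name'):
        output_dict[doc_id] = doc_data['label'].tolist()
    return output_dict

# Toy gold standard with two streams
toy_frame = pd.DataFrame({
    'name': ['stream_a', 'stream_a', 'stream_a', 'stream_b', 'stream_b'],
    'label': [1, 0, 1, 1, 0],
})

toy_gold = pandas_to_json(toy_frame)
print(toy_gold)  # {'stream_a': [1, 0, 1], 'stream_b': [1, 0]}
```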

In [25]:
# load gold standard
# Load train and test for LONG
LONG_train = pandas_to_json(pd.read_csv('../resources/datasets/LONG/dataframes/train.csv'))
LONG_test = pandas_to_json(pd.read_csv('../resources/datasets/LONG/dataframes/test.csv'))
                         
# Load train and test for SHORT
SHORT_train = pandas_to_json(pd.read_csv('../resources/datasets/SHORT/dataframes/train.csv'))
SHORT_test = pandas_to_json(pd.read_csv('../resources/datasets/SHORT/dataframes/test.csv'))

Next we load the pre-trained vectors for both datasets and both modalities. Each one is stored as a pickled dictionary inside a `.npy` file, which is why we pass `allow_pickle=True` and unwrap the loaded object with `[()]`.

In [27]:
# Load the image vectors for both datasets
LONG_image_train_vectors = np.load('../resources/page_vectors/image_vectors/LONG/train_vectors.npy', allow_pickle=True)[()]
LONG_image_test_vectors = np.load('../resources/page_vectors/image_vectors/LONG/test_vectors.npy', allow_pickle=True)[()]

SHORT_image_train_vectors = np.load('../resources/page_vectors/image_vectors/SHORT/train_vectors.npy', allow_pickle=True)[()]
SHORT_image_test_vectors = np.load('../resources/page_vectors/image_vectors/SHORT/test_vectors.npy', allow_pickle=True)[()]


# Load the text vectors for both datasets
LONG_text_train_vectors = np.load('../resources/page_vectors/bert_vectors/LONG/train_vectors.npy', allow_pickle=True)[()]
LONG_text_test_vectors = np.load('../resources/page_vectors/bert_vectors/LONG/test_vectors.npy', allow_pickle=True)[()]

SHORT_text_train_vectors = np.load('../resources/page_vectors/bert_vectors/SHORT/train_vectors.npy', allow_pickle=True)[()]
SHORT_text_test_vectors = np.load('../resources/page_vectors/bert_vectors/SHORT/test_vectors.npy', allow_pickle=True)[()]
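The `[()]` at the end of each `np.load` call can look odd. The sketch below shows the round trip on a toy dictionary, assuming the vector files were written with `np.save` on a plain Python dict (the notebook does not show how they were created). The file name and array sizes are placeholders.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the page vectors: {stream_name: array of shape (pages, features)}
toy_vectors = {'stream_a': np.zeros((3, 4)), 'stream_b': np.ones((2, 4))}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_vectors.npy')
    np.save(path, toy_vectors)               # the dict is pickled into a 0-d object array
    raw = np.load(path, allow_pickle=True)   # loading pickled objects must be allowed explicitly
    restored = raw[()]                       # indexing with an empty tuple unwraps the dict

print(raw.shape, type(restored).__name__)  # () dict
```

Because `allow_pickle=True` can execute arbitrary code during loading, it should only be used on files from a trusted source.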

In [28]:
# Create the combination vectors for the multimodal experiments
LONG_combined_train_vectors = {key: np.concatenate([LONG_text_train_vectors[key], LONG_image_train_vectors[key]], axis=1) for key in LONG_train.keys()}
LONG_combined_test_vectors = {key: np.concatenate([LONG_text_test_vectors[key], LONG_image_test_vectors[key]], axis=1) for key in LONG_test.keys()}

SHORT_combined_train_vectors = {key: np.concatenate([SHORT_text_train_vectors[key], SHORT_image_train_vectors[key]], axis=1) for key in SHORT_train.keys()}
SHORT_combined_test_vectors = {key: np.concatenate([SHORT_text_test_vectors[key], SHORT_image_test_vectors[key]], axis=1) for key in SHORT_test.keys()}
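The concatenation along `axis=1` keeps one row per page and places the text features next to the image features. A minimal sketch with toy sizes (the real feature dimensions are not stated in this notebook):

```python
import numpy as np

# One toy stream with 3 pages: 5 text features and 2 image features per page
toy_text_vectors = {'stream_a': np.zeros((3, 5))}
toy_image_vectors = {'stream_a': np.ones((3, 2))}

toy_combined = {
    key: np.concatenate([toy_text_vectors[key], toy_image_vectors[key]], axis=1)
    for key in toy_text_vectors
}
print(toy_combined['stream_a'].shape)  # (3, 7)
```

This only works when both modalities hold the same number of pages for every stream, in the same page order.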

Now that we have all the data, we can almost start the experiments. We only have to convert these dictionaries into numpy arrays so that we can use them directly with scikit-learn, for which we write a quick helper function. We do not need this for the test data, as we use a custom prediction function that works with the format our evaluation metrics expect.

In [29]:
def json_to_sklearn(x_json: dict, y_json: dict):
    
    X_data = []
    y_data = []
    for key in x_json.keys():
        X_data.append(x_json[key])
        y_data.extend(y_json[key])
        
    return np.concatenate(X_data), np.array(y_data)
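A short usage sketch of this helper on toy inputs, showing how the per-stream arrays are stacked into one training matrix with a matching label vector. The stream names and sizes are made up.

```python
import numpy as np

def json_to_sklearn(x_json: dict, y_json: dict):
    # Stack the per-stream feature matrices and flatten the per-stream labels
    X_data = []
    y_data = []
    for key in x_json.keys():
        X_data.append(x_json[key])
        y_data.extend(y_json[key])
    return np.concatenate(X_data), np.array(y_data)

toy_x = {'stream_a': np.zeros((3, 4)), 'stream_b': np.ones((2, 4))}
toy_y = {'stream_a': [1, 0, 1], 'stream_b': [1, 0]}

toy_X_train, toy_y_train = json_to_sklearn(toy_x, toy_y)
print(toy_X_train.shape, toy_y_train.tolist())  # (5, 4) [1, 0, 1, 1, 0]
```

Note that the loop follows the key order of the feature dictionary and looks the labels up by the same key, so rows and labels stay aligned even if the two dictionaries were built in a different order.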

In [30]:
# let's make the training data

# For the images
LONG_image_X_train, LONG_image_y_train = json_to_sklearn(LONG_image_train_vectors, LONG_train)

SHORT_image_X_train, SHORT_image_y_train = json_to_sklearn(SHORT_image_train_vectors, SHORT_train)

# For the text
LONG_text_X_train, LONG_text_y_train = json_to_sklearn(LONG_text_train_vectors, LONG_train)

SHORT_text_X_train, SHORT_text_y_train = json_to_sklearn(SHORT_text_train_vectors, SHORT_train)

# For the combined/ multimodal vectors
LONG_combo_X_train, LONG_combo_y_train = json_to_sklearn(LONG_combined_train_vectors, LONG_train)

SHORT_combo_X_train, SHORT_combo_y_train = json_to_sklearn(SHORT_combined_train_vectors, SHORT_train)

With the data loading done, the only thing left is to import the evaluation functions. We also define our own predict wrapper for the models: our metrics code expects the predictions as a dictionary at the stream level, which scikit-learn does not provide natively.

In [31]:
# import metricutils file
%run metricutils.py

## Simple Baselines

Here we calculate the two degenerate baselines first

In [32]:
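from tqdm import tqdm

def model_predict(sklearn_model, test_data_dict: dict, gold_standard, known_k: bool = True):
    # Predict page-level boundary labels per stream, returned as {stream_id: [0 or 1, ...]}
    output_predictions = {}
    for stream_id, vectors in tqdm(test_data_dict.items()):
        if known_k:
            # The number of documents is known: mark exactly the k most confident pages
            num_breaks = int(sum(gold_standard[stream_id]))
            probs = sklearn_model.predict_proba(vectors)[:, 1]
            preds = np.zeros(len(probs), dtype=int)
            if num_breaks > 0:
                preds[np.argpartition(probs, -num_breaks)[-num_breaks:]] = 1
        else:
            # reshape(-1) keeps a 1-d array even for single-page streams
            preds = np.asarray(sklearn_model.predict(vectors)).reshape(-1)
        output_predictions[stream_id] = preds.tolist()
        # the first page of a stream always starts a document
        output_predictions[stream_id][0] = 1
    return output_predictions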
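The known-k branch relies on `np.argpartition` to find the most confident pages without sorting the whole array. A minimal sketch of that selection step, with made-up probabilities:

```python
import numpy as np

# Toy boundary probabilities for a stream of five pages
toy_probs = np.array([0.9, 0.2, 0.7, 0.1, 0.4])
toy_num_breaks = 2

# The last k positions of the partitioned index array hold the k largest probabilities
toy_top_k = np.argpartition(toy_probs, -toy_num_breaks)[-toy_num_breaks:]

toy_preds = np.zeros(len(toy_probs), dtype=int)
toy_preds[toy_top_k] = 1
print(toy_preds.tolist())  # [1, 0, 1, 0, 0]
```

The order of the indices inside `toy_top_k` is not guaranteed, which does not matter here because they are only used to set entries to 1.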

In [33]:
# write a function that does an experiment and returns the results in the format that we want
def experiment(sklearn_model, train_x: np.ndarray, train_y: np.ndarray, test_x_dict: dict, test_y_dict: dict,
              known_k: bool=False):
    # train the model
    sklearn_model.fit(train_x, train_y)
    
    # make the predictions
    predictions = model_predict(sklearn_model, test_x_dict, test_y_dict, known_k=known_k)
    
    # return the stream-level predictions (no scoring happens inside this function)
    return predictions
    
    

<a id="xgboost" />

## Experiment 1: XGBoost

Next we try XGBoost to see whether it improves on the performance of the logistic regression model. As XGBoost is not part of scikit-learn, we have to install it as a separate package. Fortunately it offers a scikit-learn-compatible interface, so it can be used much like any other classifier. As with logistic regression, we tried several hyperparameter settings and show the best-performing model below.

In [35]:
import xgboost as xgb

In [36]:
LONG_text_results_xgboost = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), LONG_text_X_train, LONG_text_y_train,
                             LONG_text_test_vectors, LONG_test, known_k=False)

SHORT_text_results_xgboost = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), SHORT_text_X_train, SHORT_text_y_train,
                             SHORT_text_test_vectors, SHORT_test, known_k=False)



In [37]:
LONG_image_results_xgboost = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), LONG_image_X_train, LONG_image_y_train,
                             LONG_image_test_vectors, LONG_test, known_k=False)

SHORT_image_results_xgboost = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), SHORT_image_X_train, SHORT_image_y_train,
                             SHORT_image_test_vectors, SHORT_test, known_k=False)



Apart from these two unimodal models we also run a model based on both modalities, where we simply concatenate the features of both input modalities.
Here we do use normalization to get the features into the right range.

In [None]:
from sklearn.preprocessing import StandardScaler

LONG_scaler = StandardScaler()
LONG_train_vecs = LONG_scaler.fit_transform(LONG_combo_X_train)

LONG_combined_results_xgboost = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), LONG_train_vecs, LONG_combo_y_train,
                             {key: LONG_scaler.transform(value) for key , value in LONG_combined_test_vectors.items()}, LONG_test)

SHORT_scaler = StandardScaler()
SHORT_train_vecs = SHORT_scaler.fit_transform(SHORT_combo_X_train)

SHORT_combined_results_xgboost = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), SHORT_train_vecs, SHORT_combo_y_train,
                             {key: SHORT_scaler.transform(value) for key , value in SHORT_combined_test_vectors.items()}, SHORT_test)
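The pattern used above is to fit the scaler on the stacked training matrix and then apply it stream by stream to the test dictionary. A self-contained sketch on synthetic data (the sizes and distribution are arbitrary choices for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: a stacked training matrix and a dictionary of test streams
toy_train_matrix = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
toy_test_streams = {'stream_a': rng.normal(loc=5.0, scale=3.0, size=(3, 4))}

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train_matrix)  # statistics come from the training data only
toy_test_scaled = {key: toy_scaler.transform(value) for key, value in toy_test_streams.items()}

print(toy_train_scaled.mean(axis=0).round(6), toy_train_scaled.std(axis=0).round(6))
```

After fitting, every training feature has zero mean and unit variance, while the test streams are shifted and scaled with the same training statistics.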

In [None]:
xgboost_results_LONG = pd.DataFrame({'XGBOOST-TEXT': LONG_text_results_xgboost[1], 'XGBOOST-IMAGE': LONG_image_results_xgboost[1],
                                  'XGBOOST-MULTI': LONG_combined_results_xgboost[1]}).T
xgboost_results_SHORT = pd.DataFrame({'XGBOOST-TEXT': SHORT_text_results_xgboost[1], 'XGBOOST-IMAGE': SHORT_image_results_xgboost[1],
                                  'XGBOOST-MULTI': SHORT_combined_results_xgboost[1]}).T
xgboost_results = pd.concat([xgboost_results_LONG, xgboost_results_SHORT], axis=1, keys=['LONG', 'SHORT'])

In [None]:
xgboost_results

<a id="knearestneighbours" />

## Experiment 2: K-nearest neighbours

As the final experiment in this classic setup, we use K-Nearest Neighbours for the classification, following the same experimental setup as in the previous approaches. We select the value of K with a grid search over candidate values.
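The grid search itself is not included in this notebook. The following is a sketch of how such a search could be set up with scikit-learn's `GridSearchCV`, run on synthetic data. The candidate values, the `f1` scoring and the 3-fold cross-validation are assumptions made for the example, not the settings used for the reported models.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic binary problem standing in for the page vectors and boundary labels
toy_X = rng.normal(size=(200, 8))
toy_y = (toy_X[:, 0] > 0).astype(int)

toy_search = GridSearchCV(
    KNeighborsClassifier(weights='distance'),
    param_grid={'n_neighbors': [5, 15, 25]},  # hypothetical candidate values
    scoring='f1',
    cv=3,
)
toy_search.fit(toy_X, toy_y)

best_k = toy_search.best_params_['n_neighbors']
print(best_k)
```

Since pages from the same stream are correlated, a grouped split such as scikit-learn's `GroupKFold` may be a better fit for the real data than plain k-fold.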

In [50]:
from tqdm import tqdm
from sklearn.neighbors import KNeighborsClassifier

In [55]:
LONG_text_results_knn = experiment(KNeighborsClassifier(n_neighbors=5, weights='distance'), LONG_text_X_train, LONG_text_y_train,
                             LONG_text_test_vectors, LONG_test, known_k=True)

SHORT_text_results_knn = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), SHORT_text_X_train, SHORT_text_y_train,
                             SHORT_text_test_vectors, SHORT_test, known_k=True)

100%|██████████| 34/34 [00:53<00:00,  1.57s/it]
100%|██████████| 108/108 [00:23<00:00,  4.62it/s]


In [56]:
# and now do the image
LONG_image_results_knn = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), LONG_image_X_train, LONG_image_y_train,
                             LONG_image_test_vectors, LONG_test, known_k=True)

SHORT_image_results_knn = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), SHORT_image_X_train, SHORT_image_y_train,
                             SHORT_image_test_vectors, SHORT_test, known_k=True)

100%|██████████| 34/34 [02:31<00:00,  4.47s/it]
100%|██████████| 108/108 [01:35<00:00,  1.14it/s]


We did the same for the image classifier, finding optimal values of K for both datasets. Below we save the unimodal predictions and then run the multimodal version.

In [57]:
# we save the results so that we can put them in the large dataframe later
json_dump({**LONG_image_results_knn, **SHORT_image_results_knn}, '../../experiment_notebooks/experiment_results/KNN-IMAGE-K/predictions.json')

In [58]:
# we save the results so that we can put them in the large dataframe later
json_dump({**LONG_text_results_knn, **SHORT_text_results_knn}, '../../experiment_notebooks/experiment_results/KNN-TEXT-K/predictions.json')

In [None]:
LONG_scaler = StandardScaler()
LONG_train_vecs = LONG_scaler.fit_transform(LONG_combo_X_train)

LONG_combined_results_knn = experiment(KNeighborsClassifier(n_neighbors=5, weights='distance'), LONG_train_vecs, LONG_combo_y_train,
                             {key: LONG_scaler.transform(value) for key , value in LONG_combined_test_vectors.items()}, LONG_test)

SHORT_scaler = StandardScaler()
SHORT_train_vecs = SHORT_scaler.fit_transform(SHORT_combo_X_train)

SHORT_combined_results_knn = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), SHORT_train_vecs, SHORT_combo_y_train,
                             {key: SHORT_scaler.transform(value) for key , value in SHORT_combined_test_vectors.items()}, SHORT_test)

In [None]:
knn_results_LONG = pd.DataFrame({'KNN-TEXT': LONG_text_results_knn[1], 'KNN-IMAGE': LONG_image_results_knn[1], 'KNN-MULTI': LONG_combined_results_knn[1]}).T
knn_results_SHORT = pd.DataFrame({'KNN-TEXT': SHORT_text_results_knn[1], 'KNN-IMAGE': SHORT_image_results_knn[1], 'KNN-MULTI': SHORT_combined_results_knn[1]}).T
knn_results = pd.concat([knn_results_LONG, knn_results_SHORT], axis=1, keys=['LONG', 'SHORT'])

In [None]:
knn_results

In [46]:
import json

def json_dump(dictionary: dict, filepath: str):
    # write a dictionary of predictions to disk as JSON
    with open(filepath, 'w') as json_file:
        json.dump(dictionary, json_file)
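A quick round-trip check of this helper on toy predictions, written to a temporary folder so that no real result files are touched. The stream names are placeholders.

```python
import json
import os
import tempfile

def json_dump(dictionary: dict, filepath: str):
    # Write a dictionary of predictions to disk as JSON
    with open(filepath, 'w') as json_file:
        json.dump(dictionary, json_file)

toy_predictions = {'stream_a': [1, 0, 1], 'stream_b': [1, 0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'predictions.json')
    json_dump(toy_predictions, toy_path)
    with open(toy_path) as json_file:
        reloaded = json.load(json_file)

print(reloaded == toy_predictions)  # True
```

The `json` module cannot serialize numpy arrays or numpy integer types, which is why the prediction wrapper converts its outputs to plain lists with `tolist()` before they are saved.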

In [None]:
# we save the results so that we can put them in the large dataframe later
json_dump(SHORT_combined_results_knn[0], '../experiment_notebooks/experiment_results/KNN-BOTH/SHORT_SHORT/predictions.json')

## Final Ranking

Now that we have run the experiments for our baselines, we can combine these dataframes and print the final leaderboard, sorted on Document Weighted F1.

In [None]:
pd.concat([xgboost_results, knn_results])
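The `pd.concat` call stacks the tables but does not sort them. One way to sort on a column under the two-level header is to pass the full column tuple to `sort_values`, as sketched below. The column name `Doc W F1` is taken from the LaTeX tables in this notebook; the numbers are placeholders, not results.

```python
import pandas as pd

# Two-level columns as produced by pd.concat(..., axis=1, keys=['LONG', 'SHORT'])
toy_columns = pd.MultiIndex.from_product([['LONG', 'SHORT'], ['Page F1', 'Doc W F1']])

# Placeholder scores for illustration only
toy_xgboost = pd.DataFrame([[0.1, 0.2, 0.3, 0.4]], index=['XGBOOST-TEXT'], columns=toy_columns)
toy_knn = pd.DataFrame([[0.5, 0.6, 0.7, 0.8]], index=['KNN-TEXT'], columns=toy_columns)

toy_leaderboard = pd.concat([toy_xgboost, toy_knn]).sort_values(('LONG', 'Doc W F1'), ascending=False)
print(toy_leaderboard.index.tolist())  # ['KNN-TEXT', 'XGBOOST-TEXT']
```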

## Robustness

Apart from running the baselines on the standard task as we have done above, we can also run our models on the robust task with just a few changes to the code, and see what kind of scores we get in that scenario. We will save the results in json dictionaries, so that we can include the models in the plots that we will make of the other models and get a sense of their robustness.
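The only change for the robust task is that each model is trained on one dataset and tested on the other. The sketch below shows that swap on synthetic data. The helper `make_toy_dataset` and the toy data are hypothetical; the `TRAIN_TEST` key naming mirrors the folder names used for the saved predictions in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_toy_dataset(shift: float):
    # Synthetic stand-in for one dataset: features and binary labels
    features = rng.normal(loc=shift, size=(60, 4))
    labels = (features[:, 0] > shift).astype(int)
    return features, labels

toy_datasets = {'LONG': make_toy_dataset(0.0), 'SHORT': make_toy_dataset(0.5)}

robust_predictions = {}
for train_name, test_name in [('SHORT', 'LONG'), ('LONG', 'SHORT')]:
    X_train, y_train = toy_datasets[train_name]
    X_test, _ = toy_datasets[test_name]
    # Train on one dataset, predict on the other
    model = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X_train, y_train)
    robust_predictions[f'{train_name}_{test_name}'] = model.predict(X_test).tolist()

print(sorted(robust_predictions))  # ['LONG_SHORT', 'SHORT_LONG']
```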

### KNN Classifier

In [14]:
# first we will try the knn method
LONG_text_results_knn_robust = experiment(KNeighborsClassifier(n_neighbors=5, weights='distance'), SHORT_text_X_train, SHORT_text_y_train,
                             LONG_text_test_vectors, LONG_test)

SHORT_text_results_knn_robust = experiment(KNeighborsClassifier(n_neighbors=5, weights='distance'), LONG_text_X_train, LONG_text_y_train,
                             SHORT_text_test_vectors, SHORT_test)


In [15]:
LONG_image_results_knn_robust = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), SHORT_image_X_train, SHORT_image_y_train,
                             LONG_image_test_vectors, LONG_test)

SHORT_image_results_knn_robust = experiment(KNeighborsClassifier(n_neighbors=5, weights='distance'), LONG_image_X_train, LONG_image_y_train,
                             SHORT_image_test_vectors, SHORT_test)


In [179]:
from sklearn.preprocessing import StandardScaler

LONG_scaler = StandardScaler()
LONG_train_vecs = LONG_scaler.fit_transform(LONG_combo_X_train)

SHORT_scaler = StandardScaler()
SHORT_train_vecs = SHORT_scaler.fit_transform(SHORT_combo_X_train)


LONG_combined_results_knn_robust = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), SHORT_train_vecs, SHORT_combo_y_train,
                             {key: LONG_scaler.transform(value) for key , value in LONG_combined_test_vectors.items()}, LONG_test)

SHORT_combined_results_knn_robust = experiment(KNeighborsClassifier(n_neighbors=25, weights='distance'), LONG_train_vecs, LONG_combo_y_train,
                             {key: SHORT_scaler.transform(value) for key , value in SHORT_combined_test_vectors.items()}, SHORT_test)

In [101]:
knn_results_LONG_robust = pd.DataFrame({'KNN-TEXT': LONG_text_results_knn_robust[1], 'KNN-IMAGE': LONG_image_results_knn_robust[1], 'KNN-MULTI': LONG_combined_results_knn_robust[1]}).T
knn_results_SHORT_robust = pd.DataFrame({'KNN-TEXT': SHORT_text_results_knn_robust[1], 'KNN-IMAGE': SHORT_image_results_knn_robust[1], 'KNN-MULTI': SHORT_combined_results_knn_robust[1]}).T
knn_results_robust = pd.concat([knn_results_LONG_robust, knn_results_SHORT_robust], axis=1, keys=['LONG', 'SHORT'])

In [232]:
print(knn_results_robust.to_latex())

\begin{tabular}{lrrrrrrrrrrrr}
\toprule
{} & \multicolumn{6}{l}{D1} & \multicolumn{6}{l}{D2} \\
{} & Page P & Page R & Page F1 & Doc. SQ & Doc. F1 & Doc W F1 & Page P & Page R & Page F1 & Doc. SQ & Doc. F1 & Doc W F1 \\
\midrule
KNN-TEXT  &   0.54 &   0.40 &    0.40 &    0.80 &    0.29 &     0.30 &   0.52 &   0.58 &    0.47 &    0.78 &    0.38 &     0.32 \\
KNN-IMAGE &   0.69 &   0.37 &    0.36 &    0.84 &    0.23 &     0.30 &   0.57 &   0.53 &    0.49 &    0.81 &    0.41 &     0.38 \\
KNN-MULTI &   0.67 &   0.48 &    0.47 &    0.86 &    0.34 &     0.38 &   0.52 &   0.67 &    0.52 &    0.83 &    0.42 &     0.34 \\
\bottomrule
\end{tabular}





### XGBoost Classifier

In [180]:
# Now we will run XGBOOST with robustness
LONG_text_results_xgboost_robust = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), SHORT_text_X_train, SHORT_text_y_train,
                             LONG_text_test_vectors, LONG_test)

SHORT_text_results_xgboost_robust = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), LONG_text_X_train, LONG_text_y_train,
                             SHORT_text_test_vectors, SHORT_test)

In [181]:
LONG_image_results_xgboost_robust = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), SHORT_image_X_train, SHORT_image_y_train,
                             LONG_image_test_vectors, LONG_test)

SHORT_image_results_xgboost_robust = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), LONG_image_X_train, LONG_image_y_train,
                             SHORT_image_test_vectors, SHORT_test)

In [182]:
from sklearn.preprocessing import StandardScaler

LONG_scaler = StandardScaler()
LONG_train_vecs = LONG_scaler.fit_transform(LONG_combo_X_train)

SHORT_scaler = StandardScaler()
SHORT_train_vecs = SHORT_scaler.fit_transform(SHORT_combo_X_train)


LONG_combined_results_xgboost_robust = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), SHORT_train_vecs, SHORT_combo_y_train,
                             {key: LONG_scaler.transform(value) for key , value in LONG_combined_test_vectors.items()}, LONG_test)

SHORT_combined_results_xgboost_robust = experiment(xgb.XGBClassifier(objective='binary:logistic', booster='gbtree', verbosity=0), LONG_train_vecs, LONG_combo_y_train,
                             {key: SHORT_scaler.transform(value) for key , value in SHORT_combined_test_vectors.items()}, SHORT_test)

In [194]:
# we save the results so that we can put them in the large dataframe later
json_dump(SHORT_combined_results_xgboost_robust[0], '../experiment_notebooks/experiment_results/XGBOOST-BOTH/LONG_SHORT/predictions.json')

In [204]:
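# Boilerplate to merge the predictions of a model on the two datasets into one file.
import os

def combine_predictions(input_root_folder, f1, f2):
    # read_json is not defined in this notebook; it is assumed to be provided by metricutils.py
    first_set = read_json(os.path.join(input_root_folder, f1, 'predictions.json'))
    second_set = read_json(os.path.join(input_root_folder, f2, 'predictions.json'))
    # the in-domain pair (starting with 'LONG_LONG') goes to 'standard', any other pair to 'robust'
    out = 'standard' if f1 == 'LONG_LONG' else 'robust'
    combined_set = {**first_set, **second_set}
    # make sure the output folder exists before writing
    os.makedirs(os.path.join(input_root_folder, out), exist_ok=True)
    json_dump(combined_set, os.path.join(input_root_folder, out, 'predictions.json'))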
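The merge relies on dictionary unpacking, which is safe as long as the two datasets do not share stream names. A small sketch with made-up stream names, including a check for overlapping keys that is not part of the original helper:

```python
# Toy predictions for the two datasets
toy_long_predictions = {'long_stream_1': [1, 0, 0]}
toy_short_predictions = {'short_stream_1': [1, 1]}

# With unpacking, a key that occurs in both dictionaries keeps the value of the second one,
# so overlapping stream names would silently overwrite predictions
shared_keys = set(toy_long_predictions) & set(toy_short_predictions)
assert not shared_keys, f'stream names occur in both datasets: {shared_keys}'

toy_combined_predictions = {**toy_long_predictions, **toy_short_predictions}
print(sorted(toy_combined_predictions))  # ['long_stream_1', 'short_stream_1']
```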
    

In [231]:
combine_predictions('../experiment_notebooks/experiment_results/XGBOOST-TEXT/', 'SHORT_LONG', 'LONG_SHORT')

In [235]:
xgboost_results_LONG_robust = pd.DataFrame({'XGBOOST-TEXT': LONG_text_results_xgboost_robust[1], 'XGBOOST-IMAGE': LONG_image_results_xgboost_robust[1], 'XGBOOST-MULTI': LONG_combined_results_xgboost_robust[1]}).T
xgboost_results_SHORT_robust = pd.DataFrame({'XGBOOST-TEXT': SHORT_text_results_xgboost_robust[1], 'XGBOOST-IMAGE': SHORT_image_results_xgboost_robust[1], 'XGBOOST-MULTI': SHORT_combined_results_xgboost_robust[1]}).T
xgboost_results_robust = pd.concat([xgboost_results_LONG_robust, xgboost_results_SHORT_robust], axis=1, keys=['LONG', 'SHORT'])

In [236]:
print(xgboost_results_robust.to_latex())

\begin{tabular}{lrrrrrrrrrrrr}
\toprule
{} & \multicolumn{6}{l}{D1} & \multicolumn{6}{l}{D2} \\
{} & Page P & Page R & Page F1 & Doc. SQ & Doc. F1 & Doc W F1 & Page P & Page R & Page F1 & Doc. SQ & Doc. F1 & Doc W F1 \\
\midrule
XGBOOST-TEXT  &   0.65 &   0.31 &    0.33 &    0.83 &    0.19 &     0.23 &   0.69 &   0.44 &    0.48 &    0.77 &    0.38 &     0.40 \\
XGBOOST-IMAGE &   0.67 &   0.35 &    0.36 &    0.83 &    0.23 &     0.30 &   0.44 &   0.60 &    0.45 &    0.81 &    0.34 &     0.27 \\
XGBOOST-MULTI &   0.66 &   0.36 &    0.33 &    0.85 &    0.22 &     0.27 &   0.67 &   0.50 &    0.50 &    0.79 &    0.41 &     0.41 \\
\bottomrule
\end{tabular}



