# Training a fusion model for correct first page prediction

This notebook covers one of the approaches to training a model for predicting whether a page of the document is the first one or not -- a feature that would allow correct splitting for PDFs that consist of more than one actual document (we assume that the pages are already sorted). The approach used is fusion between VGG16 (CV-based neural network) and a plain LegalBERT instance.

Before you start, make sure you have **installed** and **initialized** the konfuzio_sdk package as shown in the readme of the [repository](https://github.com/konfuzio-ai/Python-SDK).

In [None]:
!pip install konfuzio-sdk

In [None]:
!konfuzio_sdk init

In [None]:
!pip install transformers datasets

Importing necessary libraries and packages:

In [9]:
import cv2 
import keras
import os
import torch

import numpy as np
import tensorflow as tf

from datasets import load_dataset, load_metric
from nltk import word_tokenize
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import Dense, Conv2D, MaxPool2D , Flatten
from keras.models import Sequential, load_model
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from konfuzio_sdk.data import Project, Document
from PIL import UnidentifiedImageError
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm
from transformers import BertTokenizer, AutoModelForSequenceClassification, AutoConfig

Setting seed for reproducibility purposes:

In [7]:
seed_value = 42
os.environ['PYTHONHASHSEED'] = str(seed_value)

## Gathering and processing the data

In [10]:
my_project = Project(id_=1644, update=True)

2022-10-05 23:36:27,337 [konfuzio_sdk.api    ] [MainThread] [INFO    ] [get_meta_of_files   ][0364] Iterate on paginated https://app.konfuzio.com/api/projects/1644/docs/?limit=1000.


In [11]:
train_docs = my_project.documents
test_docs = my_project.test_documents

In [13]:
for doc in train_docs:
    doc.get_images()

In [14]:
for doc in test_docs:
    doc.get_images()

In [20]:
doc_paths = ['data_{}/documents/{}'.format(my_project.id_, x.id_) for x in train_docs]

Preparing data for image processing:

In [25]:
page_paths = []
labels = []

for doc in train_docs:
    for page in doc.pages():
        page_paths.append(page.image_path)
        if page.image_path.split('.')[-2] == 'page_1':
            labels.append(1)
        else:
            labels.append(0)

With texts, we only need test documents because we don't run training with BERT:

In [28]:
texts = []
labels = []

for doc in test_docs:
    for page in doc.pages():
        texts.append(page.text)
        if page.number == 1:
            labels.append(1)
        else:
            labels.append(0)

Saving text versions of the pages for using with BERT later:

In [8]:
df = pd.DataFrame({'text': texts, 'label': labels})

In [10]:
df.to_csv('vgg.csv')

Gathering paths for test dataset's images:

In [29]:
test_paths = []
test_labels = []

for doc in test_docs:
    for page in doc.pages():
        test_paths.append(page.image_path)
        if page.image_path.split('.')[-2] == 'page_1':
            labels.append(1)
        else:
            labels.append(0)

## VGG16 

Processing images from training and test sets with Otsu binarization and resizing. Note: before running, you need to create directory with the following structure:

- otsu (or any other name of your choice)
    - train
        - not_first_page
        - first_page
    - test
        - not_first_page
        - first_page

In [9]:
for img, label in tqdm(zip(page_paths, labels)):
    image = cv2.imread(img)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ret, thresh1 = cv2.threshold(image, 120, 255, cv2.THRESH_BINARY + 
                                                cv2.THRESH_OTSU)
    image = cv2.resize(thresh1, (224,224), interpolation=cv2.INTER_AREA)
    if label == 0:
        cv2.imwrite('otsu/train/not_first_page/{}'.format(img.split('/')[-2] + '_' +
                                            img.split('/')[-1]), image)
    else:
        cv2.imwrite('otsu/train/first_page/{}'.format(img.split('/')[-2] + '_' +
                                            img.split('/')[-1]), image)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2641/2641 [02:09<00:00, 20.33it/s]


In [None]:
for img, label in tqdm(zip(test_paths, test_labels)):
    image = cv2.imread(img)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ret, thresh1 = cv2.threshold(image, 120, 255, cv2.THRESH_BINARY + 
                                                cv2.THRESH_OTSU)
    image = cv2.resize(thresh1, (224,224), interpolation=cv2.INTER_AREA)
    if label == 0:
        cv2.imwrite('otsu/test/not_first_page/{}'.format(img.split('/')[-2] + '_' +
                                        img.split('/')[-1]), image)
    else:
        cv2.imwrite('otsu/test/first_page/{}'.format(img.split('/')[-2] + '_' +
                                        img.split('/')[-1]), image)

Loading processed and sorted images:

In [2]:
trdata = ImageDataGenerator()
traindata = trdata.flow_from_directory(directory="otsu/train",target_size=(224,224))
tsdata = ImageDataGenerator()
testdata = tsdata.flow_from_directory(directory="otsu/test", target_size=(224,224))

Found 881 images belonging to 2 classes.
Found 235 images belonging to 2 classes.


Building VGG16 architecture:

In [None]:
model = Sequential()
model.add(Conv2D(input_shape=(224,224,3),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Flatten())
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=2, activation="softmax"))

In [None]:
opt = Adam(lr=0.001)
model.compile(optimizer=opt, loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])

In [7]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 224, 224, 64)      1792      
                                                                 
 conv2d_1 (Conv2D)           (None, 224, 224, 64)      36928     
                                                                 
 max_pooling2d (MaxPooling2D  (None, 112, 112, 64)     0         
 )                                                               
                                                                 
 conv2d_2 (Conv2D)           (None, 112, 112, 128)     73856     
                                                                 
 conv2d_3 (Conv2D)           (None, 112, 112, 128)     147584    
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 56, 56, 128)      0         
 2D)                                                    

Training the model with 100 epochs:

In [9]:
checkpoint = ModelCheckpoint("vgg16_1.h5", monitor='val_accuracy', verbose=1, 
                             save_best_only=True, save_weights_only=False, mode='auto', period=1)
early = EarlyStopping(monitor='val_accuracy', min_delta=0, patience=20, verbose=1, mode='auto')
hist = model.fit_generator(steps_per_epoch=100,generator=traindata, validation_data=testdata, 
                           validation_steps=10,epochs=100,callbacks=[checkpoint,early])



Epoch 1/100


  """







Epoch 1: val_accuracy improved from -inf to 0.73191, saving model to vgg16_1.h5


In [None]:
model.save('vgg16.h5')

### Metrics & prediction

In [None]:
saved_model = load_model("vgg16.h5")

In [37]:
def calculate_metrics(paths, model):
    true_positive = 0
    false_positive = 0
    false_negative = 0
    
    for path in tqdm(paths):
        if 'not_first_page' in path:
            label = 0
        else:
            label = 1

        img = image.load_img(path,target_size=(224,224))
        img = np.asarray(img)
        img = np.expand_dims(img, axis=0)
        output = saved_model.predict(img)

        if output[0][0] > output[0][1]:
            pred = 0
        else:
            pred = 1

        if label == 1 and pred == 1:
            true_positive += 1
        elif label == 1 and pred == 0:
            false_negative += 1
        elif label == 0 and pred == 1:
            false_positive += 1
    
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0
    
    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0
    
    if precision + recall != 0:
        f1 = 2 * precision * recall / (precision + recall)
    
    else:
        f1 = 0
    
    return precision, recall, f1

In [24]:
precision, recall, f1 = calculate_metrics(paths, saved_model)

100%|██████████| 235/235 [00:14<00:00, 16.18it/s]


In [25]:
print('\n Precision: {} \n Recall: {} \n F1 score: {}'.format(precision, recall, f1))


 Precision: 0.2680851063829787 
 Recall: 1.0 
 F1 score: 0.4228187919463087


## LegalBERT

Initializing the model and tokenizer:

In [4]:
configuration = AutoConfig.from_pretrained('nlpaueb/legal-bert-base-uncased')
configuration.num_labels = 2

Downloading config.json:   0%|          | 0.00/0.99k [00:00<?, ?B/s]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('nlpaueb/legal-bert-base-uncased', config=configuration)
tokenizer = BertTokenizer.from_pretrained('nlpaueb/legal-bert-base-uncased', do_lower_case=True, max_length=10000, padding="max_length", truncate=True)

## Processing visual and textual inputs

In [21]:
paths = []

for el in os.listdir('test/first_page'):
    paths.append('test/first_page/' + el)

for el in os.listdir('test/not_first_page'):
    paths.append('test/not_first_page/' + el)


Loading and processing images via the VGG16:

In [None]:
outputs = {}

for path in tqdm(paths):
    try:
        img = image.load_img(path,target_size=(224,224))
        img = np.asarray(img)
        img = np.expand_dims(img, axis=0)
        output = saved_model.predict(img)
        outputs[path] = output
    except UnidentifiedImageError:
        pass

100%|██████████| 437/437 [01:49<00:00,  3.98it/s]


Opening the textual file for further priocessing via BERT:

In [None]:
dataset = load_dataset('csv',
                      data_files={
                                 'test': 'vgg.csv'})



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-4dd2354e4a5420d2/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-4dd2354e4a5420d2/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
outputs_2 = {}

for i, test in tqdm(zip(texts, test_labels)):
    inputs = tokenizer(i, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = logits.argmax().item()
    outputs_2[i] = logits

435it [10:54,  1.50s/it]


Squashing the predictions together in a single input list:

In [None]:
logits = []

for logit_1, logit_2 in tqdm(outputs, outputs_2):
    logits.append([outputs[logit_1][0], outputs[logit_1][1], outputs_2[logit_2][0], outputs_2[logit_2][1]])

100%|██████████| 435/435 [00:00<00:00, 822597.94it/s]


In [None]:
Xtrain = np.array(logits[:350])

In [None]:
Xtest = np.array(logits[350:])

In [None]:
ytrain = np.array(test_labels[:350])
ytest = np.array(test_labels[350:])

In [None]:
input_shape = Xtest.shape[1]

Running the training of an MLP:

In [None]:
model = Sequential()
model.add(Dense(50, input_shape=(input_shape,), activation='relu'))
model.add(Dense(50, activation='elu'))
model.add(Dense(50, activation='elu'))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(Xtrain, ytrain, epochs=100, verbose=2)

Epoch 1/100
11/11 - 3s - loss: 0.6610 - accuracy: 0.6343 - 3s/epoch - 240ms/step
Epoch 2/100
11/11 - 0s - loss: 0.6574 - accuracy: 0.6343 - 40ms/epoch - 4ms/step
Epoch 3/100
11/11 - 0s - loss: 0.6534 - accuracy: 0.6343 - 43ms/epoch - 4ms/step
Epoch 4/100
11/11 - 0s - loss: 0.6526 - accuracy: 0.6343 - 37ms/epoch - 3ms/step
Epoch 5/100
11/11 - 0s - loss: 0.6550 - accuracy: 0.6343 - 42ms/epoch - 4ms/step
Epoch 6/100
11/11 - 0s - loss: 0.6515 - accuracy: 0.6343 - 44ms/epoch - 4ms/step
Epoch 7/100
11/11 - 0s - loss: 0.6506 - accuracy: 0.6343 - 44ms/epoch - 4ms/step
Epoch 8/100
11/11 - 0s - loss: 0.6509 - accuracy: 0.6343 - 44ms/epoch - 4ms/step
Epoch 9/100
11/11 - 0s - loss: 0.6503 - accuracy: 0.6343 - 45ms/epoch - 4ms/step
Epoch 10/100
11/11 - 0s - loss: 0.6485 - accuracy: 0.6371 - 44ms/epoch - 4ms/step
Epoch 11/100
11/11 - 0s - loss: 0.6483 - accuracy: 0.6400 - 41ms/epoch - 4ms/step
Epoch 12/100
11/11 - 0s - loss: 0.6481 - accuracy: 0.6400 - 47ms/epoch - 4ms/step
Epoch 13/100
11/11 - 0s -

<keras.callbacks.History at 0x7f355fcf4050>

Evaluating final model's performance and saving it:

In [None]:
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Accuracy: {}'.format(acc*100))

In [None]:
model.save('fusion.h5')

In [None]:
def predict_label(input, model):
    pred = model.predict(input, verbose=0)
    return round(pred[0,0])

In [None]:
def calculate_metrics(inputs, labels):
    true_positive = 0
    false_positive = 0
    false_negative = 0
    
    for i, test in tqdm(zip(labels, inputs)):
        pred = predict_label(test.reshape((1,4)), model)
        if i == 1 and pred == 1:
            true_positive += 1
        elif i == 1 and pred == 0:
            false_negative += 1
        elif i == 0 and pred == 1:
            false_positive += 1
    
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0
    
    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0
    
    if precision + recall != 0:
    
        f1 = 2 * precision * recall / (precision + recall)
    
    else:
        
        f1 = 0
    
    return precision, recall, f1

In [None]:
precision, recall, f1 = calculate_metrics(Xtest, test_labels)

85it [00:05, 16.51it/s]


In [None]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))


 Precision: 0.7534246575342466 
 Recall: 0.8333333333333334 
 F1-score: 0.7913669064748201
