# Training a model for correct first page prediction

This notebook covers one approach to training a model that predicts whether a page is the first page of a document. Such a model makes it possible to correctly split a PDF that consists of more than one actual document (we assume that the pages are already sorted). The approach used is NBOW (Neural Bag-of-Words).
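
To make the goal concrete, here is a minimal sketch of how per-page predictions could be turned into document boundaries. The function name `split_by_first_pages` and the example predictions are hypothetical and not part of konfuzio_sdk. The sketch assumes the pages are already in order and treats the very first page as the start of a document regardless of its prediction.

```python
def split_by_first_pages(predictions):
    """Group page indices into documents; a prediction of 1 starts a new document."""
    documents = []
    for page_index, is_first in enumerate(predictions):
        # The very first page always opens a document, even if it was predicted as 0.
        if is_first == 1 or not documents:
            documents.append([])
        documents[-1].append(page_index)
    return documents


# Hypothetical predictions for a six-page PDF that contains three documents.
example_split = split_by_first_pages([1, 0, 0, 1, 1, 0])
print(example_split)  # [[0, 1, 2], [3], [4, 5]]
```

Every page wrongly predicted as first adds a spurious boundary, and every missed first page merges two documents, which is why the evaluation later in the notebook looks at the first-page class specifically.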

Before you start, make sure you have **installed** and **initialized** the konfuzio_sdk package as shown in the readme of the [repository](https://github.com/konfuzio-ai/Python-SDK).

In [None]:
!pip install konfuzio-sdk

In [None]:
!konfuzio_sdk init

Importing necessary libraries and packages:

In [1]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf

from collections import Counter
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
from konfuzio_sdk.data import Project, Document
from nltk import word_tokenize
from PIL import Image
from tqdm import tqdm

Setting seed for reproducibility purposes:

In [2]:
seed_value = 42
os.environ['PYTHONHASHSEED'] = str(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

We will use a multilayer perceptron built with the Keras library and a vocabulary built with `Counter`.

### Gathering the data

Loading our project for training and testing purposes:

In [3]:
my_project = Project(id_=1644)

In [4]:
train_data = my_project.documents
test_data = my_project.test_documents

Preparing data for training and testing datasets:

In [5]:
train_data_texts = []
train_data_labels = []

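for doc in tqdm(train_data):
    for page in doc.pages():
        # Label 1 marks the first page of a document, label 0 any later page.
        if page.number == 1:
            label = 1
        elif int(page.number):
            label = 0
        else:
            # Unexpected page number: report it and skip the page
            # so that texts and labels stay aligned.
            print(page.number)
            continue
        train_data_texts.append(page.text)
        train_data_labels.append(label)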

100%|██████████████████████████████████████| 1443/1443 [00:02<00:00, 712.48it/s]


In [6]:
test_data_texts = []
test_data_labels = []

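for doc in tqdm(test_data):
    for page in doc.pages():
        # Label 1 marks the first page of a document, label 0 any later page.
        if page.number == 1:
            label = 1
        elif int(page.number):
            label = 0
        else:
            # Unexpected page number: report it and skip the page
            # so that texts and labels stay aligned.
            print(page.number)
            continue
        test_data_texts.append(page.text)
        test_data_labels.append(label)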

100%|████████████████████████████████████████| 286/286 [00:00<00:00, 644.59it/s]
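
The labelling rule used in both loops can be checked in isolation. The sketch below uses a hypothetical `StandInPage` class that exposes only the `number` and `text` attributes read above; it is not the konfuzio_sdk `Page` class, and the page texts are invented.

```python
from dataclasses import dataclass


@dataclass
class StandInPage:
    """Hypothetical stand-in exposing only the two attributes the loops read."""
    number: int
    text: str


def first_page_label(page):
    """Return 1 for the first page of a document and 0 for any later page."""
    return 1 if page.number == 1 else 0


stand_in_pages = [StandInPage(1, "Invoice 2022-001"), StandInPage(2, "Page 2 of 2")]
stand_in_texts = [page.text for page in stand_in_pages]
stand_in_labels = [first_page_label(page) for page in stand_in_pages]
print(stand_in_labels)  # [1, 0]
```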


### NBOW (no preprocessing)

Initializing and building the vocabulary:

In [7]:
vocab = Counter()

In [8]:
for text in tqdm(train_data_texts):
    tokens = word_tokenize(text)
    vocab.update(tokens)

100%|██████████████████████████████████████| 2634/2634 [00:07<00:00, 365.43it/s]
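
As a small illustration of what the vocabulary holds, the sketch below builds a `Counter` from two invented page texts. Whitespace splitting stands in for `word_tokenize` here so that the example runs without NLTK data; the real tokenizer also separates punctuation.

```python
from collections import Counter

# Invented page texts; str.split() is a stand-in for nltk's word_tokenize.
toy_pages = ["Invoice No 17 Invoice date", "Page 2 of 2 total amount"]

toy_vocab = Counter()
for toy_text in toy_pages:
    toy_vocab.update(toy_text.split())

print(toy_vocab["Invoice"])    # 2
print(toy_vocab["2"])          # 2
print("Receipt" in toy_vocab)  # False
```

The membership test on the last line is the operation used later to filter out tokens that were never seen during training.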


Initializing and fitting the tokenizer, which is then applied to the training and testing data:

In [9]:
tokenizer = Tokenizer()

In [10]:
tokenizer.fit_on_texts(train_data_texts)

In [11]:
Xtrain = tokenizer.texts_to_matrix(train_data_texts, mode='freq')
print(Xtrain.shape)

(2634, 49404)


In [12]:
Xtest = tokenizer.texts_to_matrix(test_data_texts, mode='freq')
print(Xtest.shape)

(435, 49404)
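
Each row of these matrices describes one page and each column one tokenizer index. The following pure-Python sketch shows the idea behind a frequency encoding on toy data. The helper names `fit_word_index` and `texts_to_freq_matrix` are hypothetical, and the sketch only approximates the Keras `Tokenizer`: it assumes lowercasing, whitespace splitting, indices that start at 1 (so column 0 stays unused) and counts divided by the number of known words on the page.

```python
def fit_word_index(texts):
    """Map each lowercased word to a column index, starting at 1."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            word_index.setdefault(word, len(word_index) + 1)
    return word_index


def texts_to_freq_matrix(texts, word_index):
    """Encode each text as relative frequencies of the known words."""
    n_columns = len(word_index) + 1  # column 0 is left unused
    matrix = []
    for text in texts:
        row = [0.0] * n_columns
        known = [word_index[w] for w in text.lower().split() if w in word_index]
        for column in known:
            # Each occurrence adds an equal share, so a non-empty row sums to 1.
            row[column] += 1 / len(known)
        matrix.append(row)
    return matrix


toy_train = ["invoice number invoice date", "page two of two"]
toy_index = fit_word_index(toy_train)
toy_matrix = texts_to_freq_matrix(toy_train + ["unknown words only"], toy_index)
print(len(toy_matrix), len(toy_matrix[0]))  # 3 7
print(toy_matrix[0][toy_index["invoice"]])  # 0.5
```

A page that contains only unseen words, like the third toy text, becomes an all-zero row, so the model receives no signal for it.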


Processing the labels:

In [13]:
ytrain = np.array(train_data_labels)
ytest = np.array(test_data_labels)

In [14]:
n_words = Xtrain.shape[1]  # number of input features (columns of the frequency matrix)

The architecture is a Keras `Sequential` model with three hidden Dense layers of 50 units each, followed by a single-unit sigmoid output layer. Training runs for 100 epochs; the chosen metric is accuracy.

In [16]:
model = Sequential()
model.add(Dense(50, input_shape=(n_words,), activation='relu'))
model.add(Dense(50, activation='elu'))
model.add(Dense(50, activation='elu'))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(Xtrain, ytrain, epochs=100, verbose=2)

2022-09-13 14:24:15.461986: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/100
83/83 - 2s - loss: 0.5283 - accuracy: 0.7388 - 2s/epoch - 23ms/step
Epoch 2/100
83/83 - 1s - loss: 0.2112 - accuracy: 0.9127 - 898ms/epoch - 11ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0762 - accuracy: 0.9749 - 1s/epoch - 13ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0335 - accuracy: 0.9913 - 917ms/epoch - 11ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0249 - accuracy: 0.9916 - 914ms/epoch - 11ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0204 - accuracy: 0.9939 - 924ms/epoch - 11ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0156 - accuracy: 0.9943 - 1s/epoch - 14ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0157 - accuracy: 0.9932 - 996ms/epoch - 12ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0131 - accuracy: 0.9954 - 935ms/epoch - 11ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0138 - accuracy: 0.9947 - 1s/epoch - 12ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0121 - accuracy: 0.9954 - 1s/epoch - 13ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0105 - accuracy: 0.9951 - 963ms/epoch - 12ms/step
Epoch 13/100
83/

<keras.callbacks.History at 0x7feeeb2a9c40>
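
To show what the layer stack computes for a single page, here is a pure-Python sketch of the forward pass with the same activation order (ReLU, ELU, ELU, sigmoid). The weights are random, untrained stand-ins, the layer sizes are shrunk for readability, and the helper names `dense` and `random_layer` are hypothetical. It illustrates the computation only and is not a reimplementation of Keras.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def elu(z):
    # ELU with alpha = 1: identity for positive inputs, exp(z) - 1 otherwise.
    return z if z > 0 else math.exp(z) - 1.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights[i][j] links input i to unit j."""
    return [
        activation(bias + sum(x * row[j] for x, row in zip(inputs, weights)))
        for j, bias in enumerate(biases)
    ]


def random_layer(rng, n_in, n_out):
    """Untrained stand-in parameters: small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(42)
n_features, n_hidden = 8, 5  # toy sizes; the notebook uses n_words inputs and 50 units
layers = [
    (*random_layer(rng, n_features, n_hidden), relu),
    (*random_layer(rng, n_hidden, n_hidden), elu),
    (*random_layer(rng, n_hidden, n_hidden), elu),
    (*random_layer(rng, n_hidden, 1), sigmoid),
]

activations = [1 / n_features] * n_features  # a frequency-style input row
for weights, biases, activation in layers:
    activations = dense(activations, weights, biases, activation)

first_page_probability = activations[0]
print(0.0 < first_page_probability < 1.0)  # True
```

The sigmoid in the last layer keeps the output between 0 and 1, which is why it can be read as the probability that the page is a first page.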

Let's save our model:

In [17]:
model.save('NBOW.h5')

In [15]:
model = tf.keras.models.load_model('NBOW.h5')


In [16]:
loss, acc = model.evaluate(Xtest, ytest, verbose=0)

### Metrics & prediction

Accuracy on the test set:

In [46]:
print('Accuracy: {}'.format(acc*100))

Accuracy: 78.85057330131531


The function for running predictions manually first filters the page tokens, keeping only those present in the previously built vocabulary, and then runs the prediction on the remaining tokens.

In [17]:
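def predict_label(page_text, vocab, tokenizer, model):
    """Return 1 if the page is predicted to be a first page, otherwise 0."""
    # Keep only the tokens that occur in the training vocabulary.
    tokens = word_tokenize(page_text)
    tokens = [t for t in tokens if t in vocab]
    line = ' '.join(tokens)
    # Encode the filtered text with the same frequency scheme used for training.
    encoded = tokenizer.texts_to_matrix([line], mode='freq')
    # The sigmoid output is a probability; rounding turns it into a 0/1 label.
    pred = model.predict(encoded, verbose=0)
    return round(pred[0, 0])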
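
The two non-model steps of this function, vocabulary filtering and rounding, can be tried without a trained network. The sketch below uses an invented vocabulary and invented probabilities, with whitespace splitting as a stand-in for `word_tokenize`.

```python
from collections import Counter

# Hypothetical vocabulary and page text, for illustration only.
known_vocab = Counter({"invoice": 3, "date": 2, "total": 1})
page_tokens = "invoice xyzzy date 2022".split()

# Tokens that never occurred in the training data are dropped.
kept_tokens = [t for t in page_tokens if t in known_vocab]
print(" ".join(kept_tokens))  # invoice date

# round() maps a probability to a hard label; Python rounds exact ties
# to the nearest even integer, so 0.5 becomes 0.
hard_labels = [round(p) for p in (0.03, 0.5, 0.97)]
print(hard_labels)  # [0, 0, 1]
```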

We calculate custom metrics with the following function: precision, recall and F1-score for the first-page class. Recall shows how many ground-truth first pages were actually predicted as first pages, and precision shows how many of the pages predicted as first really are first pages. The reasoning is that if first pages are identified correctly, documents can subsequently be split correctly, using each first page as a separator (since it marks the start of a new document).

In [None]:
def calculate_metrics(texts, labels):
    """Compute precision, recall and F1 for the first-page class (label 1)."""
    true_positive = 0
    false_positive = 0
    false_negative = 0

    for label, text in tqdm(zip(labels, texts), total=len(labels)):
        pred = predict_label(text, vocab, tokenizer, model)
        if label == 1 and pred == 1:
            true_positive += 1
        elif label == 1 and pred == 0:
            false_negative += 1
        elif label == 0 and pred == 1:
            false_positive += 1

    # Each ratio falls back to 0 when its denominator is 0.
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0

    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0

    if precision + recall != 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0

    return precision, recall, f1

In [None]:
precision, recall, f1 = calculate_metrics(test_data_texts, test_data_labels)

In [48]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))


 Precision: 0.7642276422764228 
 Recall: 0.986013986013986 
 F1-score: 0.8610687022900764
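
To see how these three numbers relate, the sketch below computes them for a small set of invented labels and predictions. The function name `precision_recall_f1` and the toy data are hypothetical; the formulas are the standard ones used above.

```python
def precision_recall_f1(labels, predictions):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0
    recall = tp / (tp + fn) if tp + fn else 0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0
    return precision, recall, f1


# Invented data: every true first page is found, but one extra page is flagged.
toy_labels = [1, 0, 0, 1, 0, 1]
toy_predictions = [1, 1, 0, 1, 0, 1]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_labels, toy_predictions)
print(toy_precision, toy_recall, round(toy_f1, 3))  # 0.75 1.0 0.857
```

The toy data follows the same pattern as the results above: recall higher than precision means that few first pages are missed, while some non-first pages are flagged as first, which would lead to extra splits.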


Manual assessment:

In [19]:
for test in test_data_texts[:10]: 
    print(predict_label(test, vocab, tokenizer, model))

1
1
1
1
1
1
1
1
0
1


In [20]:
test_data_labels[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

For these ten pages, every manual prediction matches its ground-truth label, which is in line with the evaluation given previously.
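
The comparison can also be done programmatically instead of by eye. The sketch below copies the two printed lists from the cells above and collects the positions where they differ; the variable names are hypothetical.

```python
manual_predictions = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]  # printed by the loop above
manual_labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]       # test_data_labels[:10]

mismatched_indices = [
    i
    for i, (pred, label) in enumerate(zip(manual_predictions, manual_labels))
    if pred != label
]
print(mismatched_indices)  # []
```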

Let us inspect a few individual documents to make sure the manually run predictions are correct as well.
First, let's take a look at the single-page document that comes first in the test set:

In [40]:
test_data[0]

Document 32.pdf (334665)

Since it's a single-page document, it only has the first page, and it was predicted as such.

In [21]:
print("original label:", test_data_labels[0] , ", prediction:", predict_label(test_data_texts[0], vocab, tokenizer, model))

original label: 1 , prediction: 1


Next, we'll take a look at a two-page document from the test set. Its first page should be predicted as first (label 1) and its second page as not first (label 0). Both pages are predicted correctly.

In [59]:
test_data[7]

Document hit_20200711_002.pdf (334821)

In [50]:
print("original label:", test_data_labels[7] , ", prediction:", predict_label(test_data_texts[7], vocab, tokenizer, model))

original label: 1 , prediction: 1


In [26]:
print("original label:", test_data_labels[8] , ", prediction:", predict_label(test_data_texts[8], vocab, tokenizer, model))

original label: 0 , prediction: 0


Not all pages are predicted correctly. Let's take a look at a three-page document in which both non-first pages were predicted as first:

In [51]:
print("original label:", test_data_labels[28] , ", prediction:", predict_label(test_data_texts[28], vocab, tokenizer, model))

original label: 1 , prediction: 1


In [28]:
# Fetch the document once and collect the image paths of its three pages.
pages = my_project.get_document_by_id(334946).pages()
path_1, path_2, path_3 = (page.image_path for page in pages[:3])

The second page is predicted incorrectly:

In [52]:
print("original label:", test_data_labels[29] , ", prediction:", predict_label(test_data_texts[29], vocab, tokenizer, model))

original label: 0 , prediction: 1


The third page also gets an incorrect prediction:

In [53]:
print("original label:", test_data_labels[30] , ", prediction:", predict_label(test_data_texts[30], vocab, tokenizer, model))

original label: 0 , prediction: 1
