# Training a model for correct first page prediction

This notebook covers one approach to training a model that predicts whether a page is the first page of a document. Such a model makes it possible to correctly split PDFs that consist of more than one actual document (we assume that the pages are already sorted). The approach used here is NBOW (Neural Bag-of-Words).
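
To make the goal concrete, here is a minimal sketch of how first-page predictions could be turned into a split. The function name `split_by_first_page` and the toy page names are hypothetical and are not part of the SDK; the flags stand for the 0/1 predictions the model is trained to produce.

```python
def split_by_first_page(pages, is_first_flags):
    """Group an ordered list of pages into documents using first-page flags."""
    documents = []
    for page, is_first in zip(pages, is_first_flags):
        # Start a new document on a first page (or at the very beginning).
        if is_first or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


pages = ['p1', 'p2', 'p3', 'p4', 'p5']  # hypothetical page identifiers
flags = [1, 0, 1, 0, 0]                 # 1 = predicted first page
print(split_by_first_page(pages, flags))  # [['p1', 'p2'], ['p3', 'p4', 'p5']]
```

A single wrong flag either merges two documents or cuts one in half, which is why precision and recall on the first-page class matter later on.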

Before you start, make sure you have **installed** and **initialized** the konfuzio_sdk package as shown in the readme of the [repository](https://github.com/konfuzio-ai/Python-SDK).

In [None]:
!pip install konfuzio-sdk

In [None]:
!konfuzio_sdk init

Importing necessary libraries and packages:

In [2]:
import os

import numpy as np
import pandas as pd

from collections import Counter
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
from konfuzio_sdk.data import Project, Document
from nltk import word_tokenize
from tqdm import tqdm

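Setting a seed for reproducibility:

In [3]:
seed_value = 77
# PYTHONHASHSEED is read at interpreter startup, so this only affects subprocesses.
os.environ['PYTHONHASHSEED'] = str(seed_value)
np.random.seed(seed_value)  # seed NumPy's global random generator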
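
Setting `PYTHONHASHSEED` from inside a running session does not seed any random number generator, so the generators have to be seeded separately. The sketch below shows the idea with Python's built-in `random` module only; the helper `sample_with_seed` is a hypothetical name used for illustration.

```python
import random

seed_value = 77  # same value as in the cell above


def sample_with_seed(seed, n=3):
    """Return n pseudo-random floats drawn right after seeding Python's RNG."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = sample_with_seed(seed_value)
second_run = sample_with_seed(seed_value)
print(first_run == second_run)  # True: same seed, same sequence
```

In this notebook the analogous calls would be `np.random.seed(seed_value)` and, assuming Keras runs on a TensorFlow backend, `tf.random.set_seed(seed_value)`.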

We will use a multilayer perceptron built with the Keras library and a vocabulary built with Counter.

### Gathering the data

Loading our project:

In [5]:
my_project = Project(id_=24)

In [6]:
train_docs = my_project.documents
test_docs = my_project.test_documents

Preparing data for training and testing datasets:

In [None]:
pages_train_docs = []
pages_labels = []

for doc in tqdm(train_docs):
    for page in doc.pages():
        pages_train_docs.append(page.text)
        # Label 1 for the first page of a document, 0 for any other page.
        pages_labels.append(int(page.number == 1))

In [None]:
pages_test_docs = []
pages_labels_test = []

for doc in tqdm(test_docs):
    for page in doc.pages():
        pages_test_docs.append(page.text)
        # Label 1 for the first page of a document, 0 for any other page.
        pages_labels_test.append(int(page.number == 1))
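
The labeling logic can be tried out without access to a project. In the sketch below, `FakePage` is a hypothetical stand-in that only mimics the two page attributes used above (`number` and `text`); it is not an SDK class.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a page: only the attributes used above."""
    number: int
    text: str


def label_pages(documents):
    """Flatten documents into page texts and first-page labels (1 or 0)."""
    texts, labels = [], []
    for pages in documents:
        for page in pages:
            texts.append(page.text)
            labels.append(int(page.number == 1))
    return texts, labels


docs = [
    [FakePage(1, 'Invoice 001'), FakePage(2, 'Terms and conditions')],
    [FakePage(1, 'Payslip March')],
]
texts, labels = label_pages(docs)
print(labels)  # [1, 0, 1]
```

Every document contributes exactly one positive example, so documents with many pages push the label distribution towards 0.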

### NBOW (no preprocessing)

Initializing and building the vocabulary:

In [None]:
vocab = Counter()

In [None]:
for text in tqdm(pages_train_docs):
    tokens = word_tokenize(text)
    vocab.update(tokens)
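
As a small illustration of what the Counter ends up holding, here is a sketch on toy texts. It uses `str.split()` instead of `word_tokenize` to stay dependency-free (an assumption made for the example), and the `min_count` filter is a hypothetical extra step that the notebook itself does not apply.

```python
from collections import Counter

toy_pages = ['Invoice number 42', 'invoice date and invoice total', 'Page 2 of 3']

toy_vocab = Counter()
for text in toy_pages:
    # str.split() stands in for nltk.word_tokenize in this sketch.
    toy_vocab.update(text.split())

# The Counter is case-sensitive: 'Invoice' and 'invoice' are separate entries.
print(toy_vocab.most_common(1))  # [('invoice', 2)]

# Hypothetical extra step: keep only tokens that occur at least min_count times.
min_count = 2
frequent = {token for token, count in toy_vocab.items() if count >= min_count}
print(frequent)  # {'invoice'}
```

The Counter supports `in` checks like a dictionary, which is what the prediction helper later relies on to drop unseen tokens.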

Initializing and fitting the tokenizer, which is then applied to the training and testing data:

In [9]:
tokenizer = Tokenizer()

In [10]:
tokenizer.fit_on_texts(pages_train_docs)

In [None]:
Xtrain = tokenizer.texts_to_matrix(pages_train_docs, mode='freq')
print(Xtrain.shape)

In [None]:
Xtest = tokenizer.texts_to_matrix(pages_test_docs, mode='freq')
print(Xtest.shape)
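
To show what `mode='freq'` produces, the following is a simplified pure-Python approximation, not the Keras implementation. It assumes lowercasing and whitespace splitting only, whereas the real Tokenizer also strips punctuation. The helper names `fit_word_index` and `texts_to_freq_matrix` are hypothetical.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an index; frequent words get small indices, 0 stays unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_freq_matrix(texts, word_index):
    """Encode each text as relative frequencies of its known words."""
    width = len(word_index) + 1  # one extra column for the unused index 0
    matrix = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        row = [0.0] * width
        for idx, count in Counter(ids).items():
            row[idx] = count / len(ids)  # share of this word among known words
        matrix.append(row)
    return matrix


train = ['invoice total invoice', 'page two']
word_index = fit_word_index(train)
X = texts_to_freq_matrix(['invoice page unknown'], word_index)
print(len(X[0]), X[0])  # 5 [0.0, 0.5, 0.0, 0.5, 0.0]
```

Words that were not seen during fitting (here `unknown`) are simply dropped, so a page from a very different domain can end up with a nearly empty feature vector.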

Processing the labels:

In [13]:
ytrain = np.array(pages_labels)
ytest = np.array(pages_labels_test)

In [14]:
n_words = Xtrain.shape[1]  # input size: number of columns in the training matrix

The model is a Keras Sequential network with two Dense layers. It is trained for 50 epochs, with accuracy as the metric.

In [None]:
model = Sequential()
model.add(Dense(50, input_shape=(n_words,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(Xtrain, ytrain, epochs=50, verbose=2)
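
For intuition about the size and behaviour of this network, here is a dependency-free sketch of its parameter count and of a forward pass for a single sample. The input width of 10,000 and the tiny weights are hypothetical values chosen for illustration; the notebook does not report the actual vocabulary size.

```python
import math


def mlp_param_count(n_inputs, hidden_units=50):
    """Trainable parameters of Dense(hidden_units) followed by Dense(1)."""
    hidden = n_inputs * hidden_units + hidden_units  # weights + biases
    output = hidden_units + 1                        # weights + bias
    return hidden + output


def forward(x, w1, b1, w2, b2):
    """Forward pass for one sample: ReLU hidden layer, then a sigmoid output."""
    hidden = [
        max(0.0, sum(xi * wi for xi, wi in zip(x, unit_weights)) + bias)
        for unit_weights, bias in zip(w1, b1)
    ]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


print(mlp_param_count(10_000))  # 500101

# Two inputs, two hidden units, hypothetical weights.
probability = forward(
    x=[1.0, 0.0],
    w1=[[1.0, 1.0], [-1.0, 0.0]],
    b1=[0.0, 0.0],
    w2=[1.0, 1.0],
    b2=0.0,
)
print(round(probability, 4))  # 0.7311
```

Almost all parameters sit in the first layer, so the model size grows linearly with the vocabulary.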

In [16]:
loss, acc = model.evaluate(Xtest, ytest, verbose=0)

### Metrics & prediction

Accuracy on the test set is 95%, which is unrealistically high.

In [17]:
print('Accuracy: {}'.format(acc*100))

Accuracy: 95.31915187835693
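
One way to put an accuracy figure into context is to compare it with a baseline that always predicts the most frequent class. The sketch below uses hypothetical toy labels; the class balance of the actual test set is not reported in this notebook.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return max(positives, negatives) / len(labels)


toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical: 2 first pages out of 10
print(majority_baseline_accuracy(toy_labels))  # 0.8
```

When first pages are rare, even this trivial baseline scores high, which is one reason to also look at precision and recall for the first-page class.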


The function below runs a prediction for a single page: it first filters out tokens that are not in the previously built vocabulary and then predicts the label from the remaining tokens.

In [18]:
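def predict_label(page_text, vocab, tokenizer, model):
    """Return 1 if the page is predicted to be a first page, else 0."""
    tokens = word_tokenize(page_text)
    # Keep only the tokens that were seen while building the vocabulary.
    tokens = [t for t in tokens if t in vocab]
    line = ' '.join(tokens)
    encoded = tokenizer.texts_to_matrix([line], mode='freq')
    pred = model.predict(encoded, verbose=0)
    # Cast to a Python float so that round() returns a plain int.
    return round(float(pred[0, 0]))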
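
The two steps of the helper can be looked at in isolation. In the sketch below, `filter_known_tokens` and `to_label` are hypothetical names and the vocabulary is a toy dictionary. An explicit threshold is shown as an alternative to `round()`; the two differ only at exactly 0.5, because Python's built-in `round(0.5)` returns 0.

```python
def filter_known_tokens(tokens, vocab):
    """Drop tokens that never occurred while building the vocabulary."""
    return [t for t in tokens if t in vocab]


def to_label(probability, threshold=0.5):
    """Turn a sigmoid output into a 0/1 label with an explicit threshold."""
    return int(probability >= threshold)


toy_vocab = {'Invoice': 3, 'total': 1}  # hypothetical vocabulary counts
print(filter_known_tokens(['Invoice', 'xyz', 'total'], toy_vocab))  # ['Invoice', 'total']
print(to_label(0.91), to_label(0.2))  # 1 0
```

An explicit threshold also makes it easy to trade precision against recall later by moving it away from 0.5.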

In [50]:
def calculate_metrics(texts, labels):
    """Compute precision, recall and F1 for the positive (first page) class."""
    true_positive = 0
    false_positive = 0
    false_negative = 0

    # Relies on the global vocab, tokenizer and model defined above.
    for label, text in tqdm(zip(labels, texts)):
        pred = predict_label(text, vocab, tokenizer, model)
        if label == 1 and pred == 1:
            true_positive += 1
        elif label == 1 and pred == 0:
            false_negative += 1
        elif label == 0 and pred == 1:
            false_positive += 1

    # Each metric falls back to 0 when its denominator would be zero.
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0

    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0

    if precision + recall != 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0

    return precision, recall, f1

In [26]:
precision, recall, f1 = calculate_metrics(pages_test_docs, pages_labels_test)

In [28]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))


 Precision: 0.9193548387096774 
 Recall: 0.9047619047619048 
 F1-score: 0.912
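
To see how these three numbers relate, here is a worked sketch of the same formulas applied directly to confusion counts. The counts (57 true positives, 5 false positives, 6 false negatives) are hypothetical: they were chosen because they reproduce the printed values, but the notebook does not report the actual counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Same formulas as above, applied directly to confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0
    recall = tp / (tp + fn) if tp + fn else 0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0
    return precision, recall, f1


# Hypothetical counts that are consistent with the output printed above.
p, r, f1_score = precision_recall_f1(tp=57, fp=5, fn=6)
print(round(p, 4), round(r, 4), round(f1_score, 4))  # 0.9194 0.9048 0.912
```

Under these assumed counts, the model would raise 5 false first pages and miss 6 real ones.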


Manual assessment:

In [22]:
for test in pages_test_docs[:10]: 
    print(predict_label(test, vocab, tokenizer, model))

1
0
1
0
0
1
0
0
1
0


In [20]:
pages_labels_test[:10]

[1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

For the first ten test pages, the predictions match the true labels exactly, which is consistent with the metrics computed above.
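
The same comparison can be done programmatically instead of by eye. The two lists below are copied from the outputs of the two cells above.

```python
predictions = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # printed by the prediction loop
true_labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # pages_labels_test[:10]

matches = sum(p == t for p, t in zip(predictions, true_labels))
print(f'{matches}/{len(true_labels)} predictions match')  # 10/10 predictions match
```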

### Running on different data

Let's check the model's performance on a different dataset.

In [29]:
my_project = Project(id_=91)

In [52]:
all_docs = my_project.test_documents

In [None]:
pages_test_docs = []
pages_labels_test = []

for doc in tqdm(all_docs):
    for page in doc.pages():
        pages_test_docs.append(page.text)
        # Label 1 for the first page of a document, 0 for any other page.
        pages_labels_test.append(int(page.number == 1))

In [54]:
Xtest = tokenizer.texts_to_matrix(pages_test_docs, mode='freq')

In [55]:
ytest = np.array(pages_labels_test)

In [56]:
loss, acc = model.evaluate(Xtest, ytest, verbose=0)

In [57]:
print('Accuracy: {}'.format(acc*100))

Accuracy: 42.85714328289032


This is significantly lower than the result we got on the initial test set. One possible explanation is that this dataset contains more complex cases than the ones in the initial training dataset. Note that this dataset is small (the progress bar below shows only 7 pages), so the figure is based on very few examples.

In [58]:
precision, recall, f1 = calculate_metrics(pages_test_docs, pages_labels_test)

7it [00:00, 24.46it/s]


In [59]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))


 Precision: 0 
 Recall: 0.0 
 F1-score: 0
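
These outputs can be read back into confusion counts. The printed precision is the integer `0`, which `calculate_metrics` returns only when no page was predicted as a first page, while a recall of `0.0` means at least one real first page was missed. Together with 7 pages and an accuracy of about 42.86% (3 out of 7), one consistent reading is sketched below. The counts are inferred, not reported, and `model.evaluate` works on unfiltered text while `predict_label` filters by vocabulary, so the two may not agree exactly.

```python
def accuracy(tp, fp, fn, tn):
    """Share of correctly classified pages."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts for the 7 pages: every page predicted as "not first".
tp, fp, fn, tn = 0, 0, 4, 3

print(round(accuracy(tp, fp, fn, tn) * 100, 2))  # 42.86
print(tp + fp == 0)  # True: precision takes the fallback branch and prints 0
print(tp / (tp + fn))  # 0.0: recall is a real division result
```

Under this reading, the model never fires on the new dataset, and the accuracy simply reflects the share of non-first pages.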


Let's save the new test set to a .csv file for later use with a Transformer-based model:

In [54]:
test_csv = pd.DataFrame({'text': pages_test_docs, 'label': pages_labels_test})

In [55]:
test_csv.to_csv('test_2.csv', index=False)
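
Page texts usually contain line breaks and commas, so it is worth checking that they survive a CSV round trip. The sketch below uses Python's built-in `csv` module and an in-memory buffer instead of pandas and a file on disk (an assumption made to keep it self-contained); the rows are hypothetical.

```python
import csv
import io

rows = [
    {'text': 'Invoice 001\nTotal: 42,00 EUR', 'label': 1},  # hypothetical page
    {'text': 'Page 2 of 2', 'label': 0},
]

# newline='' lets the csv module handle line endings itself.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['text', 'label'])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]['text'] == rows[0]['text'])  # True: the multi-line text is preserved
print(repr(restored[0]['label']))  # '1': the csv module returns strings
```

With the `csv` module the labels come back as strings and need to be cast to `int`, whereas `pd.read_csv('test_2.csv')` infers an integer column for them.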