# Training a model for correct first-page prediction

This notebook covers two approaches to training a model for predicting whether a page of the document is the first one or not -- a feature that would allow correct splitting for PDFs that consist of more than one actual document (we assume that the pages are already sorted). The first approach used is NBOW (Neural Bag-of-words) and the second is fine-tuning Transformers model (BERT) with our document-related dataset.

## Creating project and prerequisites

This is an extract from [this](https://help.konfuzio.com/tutorials/quickstart/index.html) video tutorial. For more detailed information address it.

To create a project, go to the main server page and press "Create a project". Give it a name and save it. 

Before adding any documents, we need to create labels for future annotations. They can include any type of information that is to be extracted from the documents, i.e. "Total", "Description", "Net sum". 

To create a label, go to Home > Labels > + Add. 

Each label has to be named; it also has the tickbox "Multiple" allowing for it occuring in the document more than once. 

Threshold allows setting a minimal level of confidence (model's prediction accuracy – how sure the model is about the label) for putting a label — for example, 0.1 threshold means the label will only be assigned if model has 10 or more percents of confidence about this label. 

Data type allows to choose what type of information will be under the label, i.e. text, percentage.

After filling all the necessary fields, save the label.

Labels are grouped into label sets. We need to create at least one label set because labels have to be assigned one.  

To create a label set, go to Home > Label sets > + Add; you can also use an automatically made set that has a name similar to that of the project. 

After creating or selecting a label set, add all the necessary labels in it.

If you create a new label set, you also need to assign it a category.

After all changes, save the resulting set. 

## Uploading the documents

Ideally, to build a dataset you will need at least a hundred of documents. The training and testing shares can be divided as 80%/20%. 

All the documents must fit a certain set of requirements:

- each document has to contain only one unit (i.e. no multiple documents scanned as a single file);
- if document's length is more than one page, the pages have to be sorted or at least the ground-truth first page should go first;
- the formats supported are PNG and PDF;
- each document has to be ascribed a certain category.

Preferably, a balanced dataset would contain not only single-page documents, but also multiple-paged ones in an equal amount; category-based splitting should also be balanced.

To add documents, go to the Documents page and add all the necessary files. 

Documents need to be categorized; for that, go to Home > Categories and add all the categories you deem necessary. After, it is possible to change the category in the Documents page in a similarly-named column, if it has not already been predicted by AI.

After that it is possible to open any of the documents in the smartview by clicking on the title.

When you have finished processing all the documents, go to the project's page and note the ID: we will have to use it for accessing the data later on.

## Imports and initializations

Before you start, make sure you have **installed** and **initialized** the konfuzio_sdk package as shown in the readme of the [repository](https://github.com/konfuzio-ai/Python-SDK).

In [None]:
!pip install konfuzio-sdk

In [None]:
!konfuzio_sdk init

Also, you will need to install the Transformers-related packages:

In [None]:
!pip install transformers datasets

Importing necessary libraries and packages:

In [1]:
import os
import torch

import numpy as np
import pandas as pd
import tensorflow as tf

from collections import Counter
from datasets import load_dataset, load_metric
from nltk import word_tokenize
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
from konfuzio_sdk.data import Project, Document
from tqdm import tqdm
from transformers import BertTokenizer, AutoModelForSequenceClassification, AutoConfig, \
                        TrainingArguments, DataCollatorWithPadding, Trainer

Setting seed for reproducibility purposes:

In [2]:
seed_value = 42
os.environ['PYTHONHASHSEED'] = str(seed_value)

Initializing the config file, model and the tokenizer:

In [10]:
configuration = AutoConfig.from_pretrained('bert-base-uncased')
configuration.num_labels = 2

In [11]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', config=configuration)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, max_length=10000, 
                                          padding="max_length", truncate=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Loading the data

Loading our project for training and testing purposes:

In [3]:
my_project = Project(id_=1644)

In [4]:
train_data = my_project.documents
test_data = my_project.test_documents

Preparing data for training and testing datasets:

In [5]:
train_data_texts = []
train_data_labels = []

for doc in tqdm(train_data):
    for page in doc.pages():
        train_data_texts.append(page.text)
        if page.number == 1:
            train_data_labels.append(1)
        elif page.number != 1 and int(page.number):
            train_data_labels.append(0)
        else:
            print(page.number)

100%|██████████████████████████████████████| 1443/1443 [00:02<00:00, 683.58it/s]


In [6]:
test_data_texts = []
test_data_labels = []

for doc in tqdm(test_data):
    for page in doc.pages():
        test_data_texts.append(page.text)
        if page.number == 1:
            test_data_labels.append(1)
        elif page.number != 1 and int(page.number):
            test_data_labels.append(0)
        else:
            print(page.number)

100%|████████████████████████████████████████| 286/286 [00:00<00:00, 696.91it/s]


## NBOW approach

### No preprocessing

Initializing and building the vocabulary:

In [7]:
vocab = Counter()

In [8]:
for text in tqdm(train_data_texts):
    tokens = word_tokenize(text)
    vocab.update(tokens)

100%|██████████████████████████████████████| 2634/2634 [00:07<00:00, 361.91it/s]


Intializing and fitting the tokenizer for subsequent applying at the training and testing data:

In [9]:
tokenizer = Tokenizer()

In [10]:
tokenizer.fit_on_texts(train_data_texts)

In [11]:
Xtrain = tokenizer.texts_to_matrix(train_data_texts, mode='freq')
print(Xtrain.shape)

(2634, 49404)


In [12]:
Xtest = tokenizer.texts_to_matrix(test_data_texts, mode='freq')
print(Xtest.shape)

(435, 49404)


Processing the labels:

In [13]:
ytrain = np.array(train_data_labels)
ytest = np.array(test_data_labels)

In [14]:
n_words = Xtest.shape[1]

The architecture is Keras's Sequential with two Dense layers. The training runs for 50 epochs; chosen metric is accuracy.

In [16]:
model = Sequential()
model.add(Dense(50, input_shape=(n_words,), activation='relu'))
model.add(Dense(50, activation='elu'))
model.add(Dense(50, activation='elu'))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(Xtrain, ytrain, epochs=100, verbose=2)

2022-09-13 14:24:15.461986: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/100
83/83 - 2s - loss: 0.5283 - accuracy: 0.7388 - 2s/epoch - 23ms/step
Epoch 2/100
83/83 - 1s - loss: 0.2112 - accuracy: 0.9127 - 898ms/epoch - 11ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0762 - accuracy: 0.9749 - 1s/epoch - 13ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0335 - accuracy: 0.9913 - 917ms/epoch - 11ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0249 - accuracy: 0.9916 - 914ms/epoch - 11ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0204 - accuracy: 0.9939 - 924ms/epoch - 11ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0156 - accuracy: 0.9943 - 1s/epoch - 14ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0157 - accuracy: 0.9932 - 996ms/epoch - 12ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0131 - accuracy: 0.9954 - 935ms/epoch - 11ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0138 - accuracy: 0.9947 - 1s/epoch - 12ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0121 - accuracy: 0.9954 - 1s/epoch - 13ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0105 - accuracy: 0.9951 - 963ms/epoch - 12ms/step
Epoch 13/100
83/

<keras.callbacks.History at 0x7feeeb2a9c40>

Let's save our model:

In [17]:
model.save('NBOW.h5')

In [15]:
model = tf.keras.models.load_model('NBOW.h5')

2022-09-21 13:45:12.881055: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Let's evaluate the model's performance:

In [16]:
loss, acc = model.evaluate(Xtest, ytest, verbose=0)

Accuracy on the test set:

In [46]:
print('Accuracy: {}'.format(acc*100))

Accuracy: 78.85057330131531


A function for running predictions manually consists of pre-filtering with the usage of previously built vocabulary and the prediction on the remaining tokens.

In [16]:
def predict_label(page_text, vocab, tokenizer, model):
    tokens = word_tokenize(page_text)
    tokens = [t for t in tokens if t in vocab]
    line = ' '.join(tokens)
    encoded = tokenizer.texts_to_matrix([line], mode='freq')
    pred = model.predict(encoded, verbose=0)
    return round(pred[0,0])

We calculate our custom metric via the following function that determines how many ground-truth first pages were actually predicted as first pages. The logic behind this approach suggests that by determining first pages correctly we can consecutively split documents correctly, using each first page as a separator (since it means a start of a new document).

In [17]:
def calculate_metrics(texts, labels):
    true_positive = 0
    false_positive = 0
    false_negative = 0
    
    for i, test in tqdm(zip(labels, texts)):
        pred = predict_label(test, vocab, tokenizer, model)
        if i == 1 and pred == 1:
            true_positive += 1
        elif i == 1 and pred == 0:
            false_negative += 1
        elif i == 0 and pred == 1:
            false_positive += 1
    
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0
    
    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0
    
    if precision + recall != 0:
    
        f1 = 2 * precision * recall / (precision + recall)
    
    else:
        
        f1 = 0
    
    return precision, recall, f1

In [None]:
precision, recall, f1 = calculate_metrics(test_data_texts, test_data_labels)

In [48]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))


 Precision: 0.7642276422764228 
 Recall: 0.986013986013986 
 F1-score: 0.8610687022900764


Manual assessment:

In [19]:
for test in test_data_texts[:10]: 
    print(predict_label(test, vocab, tokenizer, model))

1
1
1
1
1
1
1
1
0
1


In [20]:
test_data_labels[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

The results for the manual assessment prove to be similar to the evaluation given previously.

Let us make some visualizations to ensure the manually-run predictions are correct as well.
First, let's take a look at a single-page document which is the first in the test set:

In [40]:
test_data[0]

Document 32.pdf (334665)

Since it's a single-page document, it only has the first page, and it was predicted as such.

In [21]:
print("original label:", test_data_labels[0] , ", prediction:", predict_label(test_data_texts[0], vocab, tokenizer, model))

original label: 1 , prediction: 1


Next, we'll take a look at a two-page document which is also present in the test set. Its first page should be predicted to be the first (receive label 1) , and the second one should be predicted as not first (receive label 0), and it has been predicted as such.

In [50]:
print("original label:", test_data_labels[7] , ", prediction:", predict_label(test_data_texts[7], vocab, tokenizer, model))

original label: 1 , prediction: 1


In [26]:
print("original label:", test_data_labels[8] , ", prediction:", predict_label(test_data_texts[8], vocab, tokenizer, model))

original label: 0 , prediction: 0


Not all the pages get predicted correctly. Let's take a look at the three-page document that got 2 non-first pages predicted as first:

In [51]:
print("original label:", test_data_labels[28] , ", prediction:", predict_label(test_data_texts[28], vocab, tokenizer, model))

original label: 1 , prediction: 1


In [28]:
path_1 = my_project.get_document_by_id(334946).pages()[0].image_path
path_2 = my_project.get_document_by_id(334946).pages()[1].image_path
path_3 = my_project.get_document_by_id(334946).pages()[2].image_path

The following page gets predicted incorrectly:

In [52]:
print("original label:", test_data_labels[29] , ", prediction:", predict_label(test_data_texts[29], vocab, tokenizer, model))

original label: 0 , prediction: 1


The third page also gets an incorrect prediction:

In [53]:
print("original label:", test_data_labels[30] , ", prediction:", predict_label(test_data_texts[30], vocab, tokenizer, model))

original label: 0 , prediction: 1


### Various preprocessing techniques

In order to possibly enhance the results, let's try several preprocessing approaches for the texts. There will be eight of them:

Removal of punctuation:

In [5]:
def preprocess_punctuation(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation and non-alphabetical tokens:

In [6]:
def preprocess_punct_alpha(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation, non-alphabetical tokens, and 1-character-long tokens:

In [7]:
def preprocess_punct_alpha_len(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if len(word) > 1]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation, non-alphabetical tokens, 1-character-long tokens, and stopwords:

In [8]:
def preprocess_punct_alpha_len_sw(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if len(word) > 1]
    stop_words = set(stopwords.words('german'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation and stopwords:

In [9]:
def preprocess_punct_sw(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    stop_words = set(stopwords.words('german'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation, 1-character-long tokens, and stopwords:

In [10]:
def preprocess_punct_len_sw(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if len(word) > 1]
    stop_words = set(stopwords.words('german'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation, 1-character-long tokens, stopwords, and numbers:

In [11]:
def preprocess_punct_len_sw_nums(text):
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if len(word) > 1]
    stop_words = set(stopwords.words('german'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens = [w for w in tokens if not w.isnumeric()]
    tokens = ' '.join(tokens)
    return tokens

Removal of punctuation, 1-character-long tokens, stopwords, and dates:

In [12]:
def preprocess_punct_len_sw_dates(text):
    text = re.sub(r"[0-9]{1,4}[\_|\-|\/|\|\.][0-9]{1,2}[\_|\-|\/|\|\.][0-9]{1,4}", '', text)
    tokens = word_tokenize(text)
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if len(word) > 1]
    stop_words = set(stopwords.words('german'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens = ' '.join(tokens)
    return tokens

For each of the approaches, we will need a separate train and test set, a vocabulary and a tokenizer fit specifically on this set:

In [13]:
train_data_texts_1 = []
train_data_labels = []

train_data_texts_2 = []
train_data_texts_3 = []
train_data_texts_4 = []
train_data_texts_5 = []
train_data_texts_6 = []
train_data_texts_7 = []
train_data_texts_8 = []


for doc in tqdm(train_data):
    for page in doc.pages():
        train_data_texts_1.append(page.text)
        train_data_texts_2.append(preprocess_punctuation(page.text))
        train_data_texts_3.append(preprocess_punct_alpha(page.text))
        train_data_texts_4.append(preprocess_punct_alpha_len(page.text))
        train_data_texts_5.append(preprocess_punct_sw(page.text))
        train_data_texts_6.append(preprocess_punct_len_sw(page.text))
        train_data_texts_7.append(preprocess_punct_len_sw_nums(page.text))
        train_data_texts_8.append(preprocess_punct_len_sw_dates(page.text))
        if page.number == 1:
            train_data_labels.append(1)
        elif page.number != 1 and int(page.number):
            train_data_labels.append(0)
        else:
            print(page.number)

100%|███████████████████████████████████████| 1443/1443 [00:59<00:00, 24.17it/s]


In [14]:
test_data_texts_1 = []
test_data_labels = []

test_data_texts_2 = []
test_data_texts_3 = []
test_data_texts_4 = []
test_data_texts_5 = []
test_data_texts_6 = []
test_data_texts_7 = []
test_data_texts_8 = []

for doc in tqdm(test_data):
    for page in doc.pages():
        test_data_texts_1.append(page.text)
        test_data_texts_2.append(preprocess_punctuation(page.text))
        test_data_texts_3.append(preprocess_punct_alpha(page.text))
        test_data_texts_4.append(preprocess_punct_alpha_len(page.text))
        test_data_texts_5.append(preprocess_punct_sw(page.text))
        test_data_texts_6.append(preprocess_punct_len_sw(page.text))
        test_data_texts_7.append(preprocess_punct_len_sw_nums(page.text))
        test_data_texts_8.append(preprocess_punct_len_sw_dates(page.text))
        if page.number == 1:
            test_data_labels.append(1)
        elif page.number != 1 and int(page.number):
            test_data_labels.append(0)
        else:
            print(page.number)

100%|█████████████████████████████████████████| 286/286 [00:08<00:00, 32.41it/s]


In [15]:
vocab_1 = Counter()
vocab_2 = Counter()
vocab_3 = Counter()
vocab_4 = Counter()
vocab_5 = Counter()
vocab_6 = Counter()
vocab_7 = Counter()
vocab_8 = Counter()

vocab = [vocab_1, vocab_2, vocab_3, vocab_4, vocab_5, vocab_6, vocab_7, vocab_8]

In [16]:
train = [train_data_texts_1, train_data_texts_2, train_data_texts_3, train_data_texts_4,
        train_data_texts_5, train_data_texts_6, train_data_texts_7, train_data_texts_8]

for t, v in zip(train, vocab):
    for text in t:
        tokens = word_tokenize(text)
        v.update(tokens)

In [38]:
tokenizer_1 = Tokenizer()
tokenizer_2 = Tokenizer()
tokenizer_3 = Tokenizer()
tokenizer_4 = Tokenizer()
tokenizer_5 = Tokenizer()
tokenizer_6 = Tokenizer()
tokenizer_7 = Tokenizer()
tokenizer_8 = Tokenizer()

tokenizer = [tokenizer_1, tokenizer_2, tokenizer_3, tokenizer_4, tokenizer_5, tokenizer_6, tokenizer_7,
            tokenizer_8]

In [39]:
tokenizer_1.fit_on_texts(train_data_texts_1)
tokenizer_2.fit_on_texts(train_data_texts_2)
tokenizer_3.fit_on_texts(train_data_texts_3)
tokenizer_4.fit_on_texts(train_data_texts_4)
tokenizer_5.fit_on_texts(train_data_texts_5)
tokenizer_6.fit_on_texts(train_data_texts_6)
tokenizer_7.fit_on_texts(train_data_texts_7)
tokenizer_8.fit_on_texts(train_data_texts_8)

In [40]:
Xtrain_1 = tokenizer_1.texts_to_matrix(train_data_texts_1, mode='freq')
Xtrain_2 = tokenizer_2.texts_to_matrix(train_data_texts_2, mode='freq')
Xtrain_3 = tokenizer_3.texts_to_matrix(train_data_texts_3, mode='freq')
Xtrain_4 = tokenizer_4.texts_to_matrix(train_data_texts_4, mode='freq')
Xtrain_5 = tokenizer_5.texts_to_matrix(train_data_texts_5, mode='freq')
Xtrain_6 = tokenizer_6.texts_to_matrix(train_data_texts_6, mode='freq')
Xtrain_7 = tokenizer_7.texts_to_matrix(train_data_texts_7, mode='freq')
Xtrain_8 = tokenizer_8.texts_to_matrix(train_data_texts_8, mode='freq')

In [41]:
Xtest_1 = tokenizer_1.texts_to_matrix(test_data_texts_1, mode='freq')
Xtest_2 = tokenizer_2.texts_to_matrix(test_data_texts_2, mode='freq')
Xtest_3 = tokenizer_3.texts_to_matrix(test_data_texts_3, mode='freq')
Xtest_4 = tokenizer_4.texts_to_matrix(test_data_texts_4, mode='freq')
Xtest_5 = tokenizer_5.texts_to_matrix(test_data_texts_5, mode='freq')
Xtest_6 = tokenizer_6.texts_to_matrix(test_data_texts_6, mode='freq')
Xtest_7 = tokenizer_7.texts_to_matrix(test_data_texts_7, mode='freq')
Xtest_8 = tokenizer_8.texts_to_matrix(test_data_texts_8, mode='freq')

In [42]:
ytrain = np.array(train_data_labels)
ytest = np.array(test_data_labels)

In [43]:
n_words_1 = Xtest_1.shape[1]
n_words_2 = Xtest_2.shape[1]
n_words_3 = Xtest_3.shape[1]
n_words_4 = Xtest_4.shape[1]
n_words_5 = Xtest_5.shape[1]
n_words_6 = Xtest_6.shape[1]
n_words_7 = Xtest_7.shape[1]
n_words_8 = Xtest_8.shape[1]

Fitting the models:

In [46]:
model_1 = Sequential()
model_1.add(Dense(50, input_shape=(n_words_1,), activation='relu'))
model_1.add(Dense(50, activation='elu'))
model_1.add(Dense(50, activation='elu'))

model_1.add(Dense(1, activation='sigmoid'))

model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_1.fit(Xtrain_1, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 2s - loss: 0.5130 - accuracy: 0.7574 - 2s/epoch - 19ms/step
Epoch 2/100
83/83 - 1s - loss: 0.2018 - accuracy: 0.9172 - 924ms/epoch - 11ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0765 - accuracy: 0.9742 - 836ms/epoch - 10ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0368 - accuracy: 0.9882 - 936ms/epoch - 11ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0222 - accuracy: 0.9916 - 979ms/epoch - 12ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0190 - accuracy: 0.9928 - 865ms/epoch - 10ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0154 - accuracy: 0.9943 - 952ms/epoch - 11ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0202 - accuracy: 0.9924 - 897ms/epoch - 11ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0117 - accuracy: 0.9962 - 924ms/epoch - 11ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0137 - accuracy: 0.9935 - 859ms/epoch - 10ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0125 - accuracy: 0.9958 - 879ms/epoch - 11ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0118 - accuracy: 0.9954 - 870ms/epoch - 10ms/step
Epoc

Epoch 99/100
83/83 - 1s - loss: 0.0048 - accuracy: 0.9958 - 934ms/epoch - 11ms/step
Epoch 100/100
83/83 - 1s - loss: 0.0070 - accuracy: 0.9966 - 1s/epoch - 12ms/step


<keras.callbacks.History at 0x7f9d2ed56fd0>

In [47]:
model_2 = Sequential()
model_2.add(Dense(50, input_shape=(n_words_2,), activation='relu'))
model_2.add(Dense(50, activation='elu'))
model_2.add(Dense(50, activation='elu'))

model_2.add(Dense(1, activation='sigmoid'))

model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_2.fit(Xtrain_2, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 2s - loss: 0.5301 - accuracy: 0.7373 - 2s/epoch - 22ms/step
Epoch 2/100
83/83 - 1s - loss: 0.1955 - accuracy: 0.9203 - 1s/epoch - 14ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0647 - accuracy: 0.9822 - 1s/epoch - 14ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0266 - accuracy: 0.9920 - 1s/epoch - 14ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0157 - accuracy: 0.9966 - 1s/epoch - 14ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0143 - accuracy: 0.9947 - 1s/epoch - 14ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0175 - accuracy: 0.9939 - 1s/epoch - 14ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0123 - accuracy: 0.9958 - 1s/epoch - 14ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0108 - accuracy: 0.9966 - 1s/epoch - 14ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0105 - accuracy: 0.9951 - 1s/epoch - 14ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0115 - accuracy: 0.9962 - 1s/epoch - 14ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0119 - accuracy: 0.9958 - 1s/epoch - 14ms/step
Epoch 13/100
83/83 - 1s - loss: 0.009

<keras.callbacks.History at 0x7f9cf55e2be0>

In [48]:
model_3 = Sequential()
model_3.add(Dense(50, input_shape=(n_words_3,), activation='relu'))
model_3.add(Dense(50, activation='elu'))
model_3.add(Dense(50, activation='elu'))

model_3.add(Dense(1, activation='sigmoid'))

model_3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_3.fit(Xtrain_3, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 1s - loss: 0.5369 - accuracy: 0.7267 - 1s/epoch - 15ms/step
Epoch 2/100
83/83 - 1s - loss: 0.2024 - accuracy: 0.9241 - 625ms/epoch - 8ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0771 - accuracy: 0.9768 - 623ms/epoch - 8ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0396 - accuracy: 0.9894 - 623ms/epoch - 8ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0263 - accuracy: 0.9913 - 655ms/epoch - 8ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0188 - accuracy: 0.9939 - 671ms/epoch - 8ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0160 - accuracy: 0.9924 - 798ms/epoch - 10ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0223 - accuracy: 0.9928 - 641ms/epoch - 8ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0120 - accuracy: 0.9970 - 651ms/epoch - 8ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0170 - accuracy: 0.9935 - 755ms/epoch - 9ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0134 - accuracy: 0.9951 - 644ms/epoch - 8ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0116 - accuracy: 0.9954 - 652ms/epoch - 8ms/step
Epoch 13/100
8

Epoch 100/100
83/83 - 1s - loss: 0.0047 - accuracy: 0.9973 - 788ms/epoch - 9ms/step


<keras.callbacks.History at 0x7f9ccb06e6a0>

In [49]:
model_4 = Sequential()
model_4.add(Dense(50, input_shape=(n_words_4,), activation='relu'))
model_4.add(Dense(50, activation='elu'))
model_4.add(Dense(50, activation='elu'))

model_4.add(Dense(1, activation='sigmoid'))

model_4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_4.fit(Xtrain_4, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 2s - loss: 0.5189 - accuracy: 0.7426 - 2s/epoch - 21ms/step
Epoch 2/100
83/83 - 1s - loss: 0.1915 - accuracy: 0.9233 - 736ms/epoch - 9ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0773 - accuracy: 0.9711 - 707ms/epoch - 9ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0397 - accuracy: 0.9886 - 800ms/epoch - 10ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0259 - accuracy: 0.9913 - 1s/epoch - 17ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0189 - accuracy: 0.9920 - 1s/epoch - 14ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0193 - accuracy: 0.9932 - 1s/epoch - 16ms/step
Epoch 8/100
83/83 - 2s - loss: 0.0173 - accuracy: 0.9935 - 2s/epoch - 20ms/step
Epoch 9/100
83/83 - 2s - loss: 0.0170 - accuracy: 0.9947 - 2s/epoch - 19ms/step
Epoch 10/100
83/83 - 2s - loss: 0.0142 - accuracy: 0.9951 - 2s/epoch - 25ms/step
Epoch 11/100
83/83 - 2s - loss: 0.0122 - accuracy: 0.9966 - 2s/epoch - 18ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0157 - accuracy: 0.9951 - 866ms/epoch - 10ms/step
Epoch 13/100
83/83 - 1s - l

83/83 - 1s - loss: 0.0064 - accuracy: 0.9954 - 812ms/epoch - 10ms/step


<keras.callbacks.History at 0x7f9c9a7f3070>

In [50]:
model_5 = Sequential()
model_5.add(Dense(50, input_shape=(n_words_5,), activation='relu'))
model_5.add(Dense(50, activation='elu'))
model_5.add(Dense(50, activation='elu'))

model_5.add(Dense(1, activation='sigmoid'))

model_5.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_5.fit(Xtrain_5, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 2s - loss: 0.5127 - accuracy: 0.7677 - 2s/epoch - 22ms/step
Epoch 2/100
83/83 - 1s - loss: 0.1708 - accuracy: 0.9324 - 1s/epoch - 13ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0598 - accuracy: 0.9814 - 1s/epoch - 14ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0269 - accuracy: 0.9920 - 1s/epoch - 15ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0154 - accuracy: 0.9947 - 1s/epoch - 14ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0148 - accuracy: 0.9951 - 1s/epoch - 15ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0135 - accuracy: 0.9954 - 1s/epoch - 14ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0108 - accuracy: 0.9966 - 1s/epoch - 14ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0103 - accuracy: 0.9962 - 1s/epoch - 14ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0108 - accuracy: 0.9958 - 1s/epoch - 14ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0115 - accuracy: 0.9958 - 1s/epoch - 15ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0130 - accuracy: 0.9958 - 1s/epoch - 14ms/step
Epoch 13/100
83/83 - 1s - loss: 0.009

<keras.callbacks.History at 0x7f9c7a6c2190>

In [51]:
model_6 = Sequential()
model_6.add(Dense(50, input_shape=(n_words_6,), activation='relu'))
model_6.add(Dense(50, activation='elu'))
model_6.add(Dense(50, activation='elu'))

model_6.add(Dense(1, activation='sigmoid'))

model_6.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_6.fit(Xtrain_6, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 2s - loss: 0.5295 - accuracy: 0.7156 - 2s/epoch - 23ms/step
Epoch 2/100
83/83 - 1s - loss: 0.1672 - accuracy: 0.9358 - 1s/epoch - 15ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0599 - accuracy: 0.9803 - 1s/epoch - 17ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0258 - accuracy: 0.9951 - 1s/epoch - 16ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0199 - accuracy: 0.9935 - 1s/epoch - 17ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0143 - accuracy: 0.9962 - 1s/epoch - 17ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0136 - accuracy: 0.9947 - 1s/epoch - 17ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0137 - accuracy: 0.9954 - 1s/epoch - 15ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0124 - accuracy: 0.9951 - 1s/epoch - 16ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0108 - accuracy: 0.9970 - 1s/epoch - 16ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0140 - accuracy: 0.9951 - 1s/epoch - 15ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0117 - accuracy: 0.9966 - 1s/epoch - 15ms/step
Epoch 13/100
83/83 - 1s - loss: 0.010

<keras.callbacks.History at 0x7f9c5af99f70>

In [52]:
model_7 = Sequential()
model_7.add(Dense(50, input_shape=(n_words_7,), activation='relu'))
model_7.add(Dense(50, activation='elu'))
model_7.add(Dense(50, activation='elu'))

model_7.add(Dense(1, activation='sigmoid'))

model_7.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_7.fit(Xtrain_7, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 2s - loss: 0.5140 - accuracy: 0.7677 - 2s/epoch - 20ms/step
Epoch 2/100
83/83 - 1s - loss: 0.1807 - accuracy: 0.9256 - 982ms/epoch - 12ms/step
Epoch 3/100
83/83 - 1s - loss: 0.0657 - accuracy: 0.9787 - 902ms/epoch - 11ms/step
Epoch 4/100
83/83 - 1s - loss: 0.0361 - accuracy: 0.9879 - 903ms/epoch - 11ms/step
Epoch 5/100
83/83 - 1s - loss: 0.0179 - accuracy: 0.9939 - 928ms/epoch - 11ms/step
Epoch 6/100
83/83 - 1s - loss: 0.0197 - accuracy: 0.9947 - 922ms/epoch - 11ms/step
Epoch 7/100
83/83 - 1s - loss: 0.0108 - accuracy: 0.9962 - 958ms/epoch - 12ms/step
Epoch 8/100
83/83 - 1s - loss: 0.0146 - accuracy: 0.9943 - 931ms/epoch - 11ms/step
Epoch 9/100
83/83 - 1s - loss: 0.0142 - accuracy: 0.9947 - 917ms/epoch - 11ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0127 - accuracy: 0.9958 - 963ms/epoch - 12ms/step
Epoch 11/100
83/83 - 1s - loss: 0.0141 - accuracy: 0.9947 - 1s/epoch - 12ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0130 - accuracy: 0.9954 - 998ms/epoch - 12ms/step
Epoch 1

83/83 - 1s - loss: 0.0055 - accuracy: 0.9962 - 1s/epoch - 15ms/step


<keras.callbacks.History at 0x7f9c2d262af0>

In [53]:
model_8 = Sequential()
model_8.add(Dense(50, input_shape=(n_words_8,), activation='relu'))
model_8.add(Dense(50, activation='elu'))
model_8.add(Dense(50, activation='elu'))

model_8.add(Dense(1, activation='sigmoid'))

model_8.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_8.fit(Xtrain_8, ytrain, epochs=100, verbose=2)

Epoch 1/100
83/83 - 3s - loss: 0.5153 - accuracy: 0.7673 - 3s/epoch - 32ms/step
Epoch 2/100
83/83 - 2s - loss: 0.1653 - accuracy: 0.9362 - 2s/epoch - 23ms/step
Epoch 3/100
83/83 - 2s - loss: 0.0558 - accuracy: 0.9829 - 2s/epoch - 22ms/step
Epoch 4/100
83/83 - 2s - loss: 0.0230 - accuracy: 0.9928 - 2s/epoch - 21ms/step
Epoch 5/100
83/83 - 2s - loss: 0.0162 - accuracy: 0.9951 - 2s/epoch - 22ms/step
Epoch 6/100
83/83 - 2s - loss: 0.0138 - accuracy: 0.9943 - 2s/epoch - 23ms/step
Epoch 7/100
83/83 - 2s - loss: 0.0122 - accuracy: 0.9954 - 2s/epoch - 19ms/step
Epoch 8/100
83/83 - 2s - loss: 0.0109 - accuracy: 0.9973 - 2s/epoch - 22ms/step
Epoch 9/100
83/83 - 2s - loss: 0.0128 - accuracy: 0.9951 - 2s/epoch - 19ms/step
Epoch 10/100
83/83 - 1s - loss: 0.0126 - accuracy: 0.9962 - 1s/epoch - 18ms/step
Epoch 11/100
83/83 - 2s - loss: 0.0144 - accuracy: 0.9958 - 2s/epoch - 19ms/step
Epoch 12/100
83/83 - 1s - loss: 0.0163 - accuracy: 0.9951 - 1s/epoch - 17ms/step
Epoch 13/100
83/83 - 1s - loss: 0.011

<keras.callbacks.History at 0x7f9bff539c70>

Evaluating our models' performance:

In [54]:
loss_1, acc_1 = model_1.evaluate(Xtest_1, ytest, verbose=0)

In [55]:
loss_2, acc_2 = model_2.evaluate(Xtest_2, ytest, verbose=0)
loss_3, acc_3 = model_3.evaluate(Xtest_3, ytest, verbose=0)
loss_4, acc_4 = model_4.evaluate(Xtest_4, ytest, verbose=0)
loss_5, acc_5 = model_5.evaluate(Xtest_5, ytest, verbose=0)
loss_6, acc_6 = model_6.evaluate(Xtest_6, ytest, verbose=0)
loss_7, acc_7 = model_7.evaluate(Xtest_7, ytest, verbose=0)
loss_8, acc_8 = model_8.evaluate(Xtest_8, ytest, verbose=0)

In [57]:
print('Accuracy 1: {} \n'.format(acc_1*100),
     'Accuracy 2: {} \n'.format(acc_2*100),
     'Accuracy 3: {} \n'.format(acc_3*100),
     'Accuracy 4: {} \n'.format(acc_4*100),
     'Accuracy 5: {} \n'.format(acc_5*100),
     'Accuracy 6: {} \n'.format(acc_6*100),
     'Accuracy 7: {} \n'.format(acc_7*100),
     'Accuracy 8: {} \n'.format(acc_8*100))

Accuracy 1: 77.93103456497192 
 Accuracy 2: 77.7011513710022 
 Accuracy 3: 77.93103456497192 
 Accuracy 4: 77.93103456497192 
 Accuracy 5: 79.08046245574951 
 Accuracy 6: 79.08046245574951 
 Accuracy 7: 78.85057330131531 
 Accuracy 8: 77.7011513710022 



In [63]:
def predict_label(page_text, vocab, tokenizer, model):
    tokens = word_tokenize(page_text)
    tokens = [t for t in tokens if t in vocab]
    line = ' '.join(tokens)
    encoded = tokenizer.texts_to_matrix([line], mode='freq')
    pred = model.predict(encoded, verbose=0)
    return round(pred[0,0])

In [89]:
def calculate_metrics(texts, labels, tokenizer, model, vocab):
    true_positive = 0
    false_positive = 0
    false_negative = 0
    
    for i, test in tqdm(zip(labels, texts)):
        pred = predict_label(test, vocab, tokenizer, model)
        if i == 1 and pred == 1:
            true_positive += 1
        elif i == 1 and pred == 0:
            false_negative += 1
        elif i == 0 and pred == 1:
            false_positive += 1
    
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0
    
    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0
    
    if precision + recall != 0:
    
        f1 = 2 * precision * recall / (precision + recall)
    
    else:
        
        f1 = 0
    
    return precision, recall, f1

In [90]:
precision_1, recall_1, f1_1 = calculate_metrics(test_data_texts_1, test_data_labels, tokenizer_1, model_1, vocab_1)
precision_2, recall_2, f1_2 = calculate_metrics(test_data_texts_2, test_data_labels, tokenizer_2, model_2, vocab_2)
precision_3, recall_3, f1_3 = calculate_metrics(test_data_texts_3, test_data_labels, tokenizer_3, model_3, vocab_3)
precision_4, recall_4, f1_4 = calculate_metrics(test_data_texts_4, test_data_labels, tokenizer_4, model_4, vocab_4)
precision_5, recall_5, f1_5 = calculate_metrics(test_data_texts_5, test_data_labels, tokenizer_5, model_5, vocab_5)
precision_6, recall_6, f1_6 = calculate_metrics(test_data_texts_6, test_data_labels, tokenizer_6, model_6, vocab_6)
precision_7, recall_7, f1_7 = calculate_metrics(test_data_texts_7, test_data_labels, tokenizer_7, model_7, vocab_7)
precision_8, recall_8, f1_8 = calculate_metrics(test_data_texts_8, test_data_labels, tokenizer_8, model_8, vocab_8)

435it [00:19, 22.14it/s]
435it [00:16, 25.61it/s]
435it [00:16, 25.61it/s]
435it [00:16, 26.63it/s]
435it [00:16, 25.59it/s]
435it [00:16, 26.44it/s]
435it [00:16, 26.65it/s]
435it [00:16, 26.71it/s]


In [91]:
print('1. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_1, recall_1, f1_1),
     '2. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_2, recall_2, f1_2),
     '3. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_3, recall_3, f1_3),
     '4. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_4, recall_4, f1_4),
     '5. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_5, recall_5, f1_5),
     '6. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_6, recall_6, f1_6),
     '7. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_7, recall_7, f1_7),
     '8. Precision: {}, Recall: {}, F1-score: {} \n'.format(precision_8, recall_8, f1_8),)

1. Precision: 0.7608695652173914, Recall: 0.9790209790209791, F1-score: 0.8562691131498472 
 2. Precision: 0.7540106951871658, Recall: 0.986013986013986, F1-score: 0.8545454545454545 
 3. Precision: 0.7526595744680851, Recall: 0.9895104895104895, F1-score: 0.8549848942598186 
 4. Precision: 0.7574931880108992, Recall: 0.972027972027972, F1-score: 0.8514548238897396 
 5. Precision: 0.7613941018766756, Recall: 0.993006993006993, F1-score: 0.8619119878603945 
 6. Precision: 0.7629427792915532, Recall: 0.9790209790209791, F1-score: 0.8575803981623277 
 7. Precision: 0.7587131367292225, Recall: 0.9895104895104895, F1-score: 0.8588770864946889 
 8. Precision: 0.75, Recall: 0.986013986013986, F1-score: 0.8519637462235651 



## BERT approach

In [13]:
tokenized_train = tokenizer(train_data_texts)

In [25]:
train_df = pd.DataFrame({'text': train_data_texts, 'label': train_data_labels})
train_df.to_csv('train_1644.csv')

In [15]:
tokenized_test = tokenizer(test_data_texts)

In [26]:
test_df = pd.DataFrame({'text': test_data_texts, 'label': test_data_labels})
test_df.to_csv('test_1644.csv')

In [27]:
tokenized = load_dataset('csv',
                      data_files={'train': 'train_1644.csv',
                                 'test': 'test_1644.csv'})



Downloading and preparing dataset csv/default to /Users/macbookpro/.cache/huggingface/datasets/csv/default-0ae9c0752663b210/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /Users/macbookpro/.cache/huggingface/datasets/csv/default-0ae9c0752663b210/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Setting the training arguments:


In [28]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2634
    })
    test: Dataset({
        features: ['Unnamed: 0', 'text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 435
    })
})

In [20]:
arguments = TrainingArguments(
    do_predict=True,
    output_dir='model', 
    evaluation_strategy="steps", 
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=25,
    logging_steps=1000, 
    logging_strategy='steps', 
    save_strategy='no',
    save_total_limit=2,
    seed=42,

)

data_collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")

Tokenizing our dataset:

In [29]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [30]:
tokenized = dataset.map(preprocess_function, batched=True)

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Defining our metric of choice which is accuracy:

In [21]:
metric = load_metric('accuracy')

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

In [22]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Initializing the Trainer class and starting the training process:

In [23]:
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [15]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Unnamed: 0, text. If Unnamed: 0, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3515
  Num Epochs = 25
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 21975


Step,Training Loss,Validation Loss,Accuracy
1000,0.6552,0.592744,0.489362
2000,0.6516,0.634706,0.731915
3000,0.6835,0.926172,0.459574
4000,0.5817,1.027632,0.425532
5000,0.5811,0.627923,0.591489
6000,0.5686,0.587319,0.731915
7000,0.604,0.755748,0.523404
8000,0.5447,0.439216,0.791489
9000,0.5211,0.603451,0.714894
10000,0.5593,0.63522,0.417021


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Unnamed: 0, text. If Unnamed: 0, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 235
  Batch size = 4
Saving model checkpoint to drive/MyDrive/knfz/model/checkpoint-1000
Configuration saved in drive/MyDrive/knfz/model/checkpoint-1000/config.json
Model weights saved in drive/MyDrive/knfz/model/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in drive/MyDrive/knfz/model/checkpoint-1000/tokenizer_config.json
Special tokens file saved in drive/MyDrive/knfz/model/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Unnamed: 0, text. If Unnamed: 0, text are not expected by `BertForSequenceClassification.

TrainOutput(global_step=21975, training_loss=0.5708771286618316, metrics={'train_runtime': 9473.4121, 'train_samples_per_second': 9.276, 'train_steps_per_second': 2.32, 'total_flos': 2.309579388396528e+16, 'train_loss': 0.5708771286618316, 'epoch': 25.0})

In [None]:
trainer.save_model()

Evaluating the trained model's performance:

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('model', 
                                                           config=configuration)
tokenizer = BertTokenizer.from_pretrained('model', do_lower_case=True, 
                                          max_length=10000, padding="max_length", truncate=True)

We calculate our custom metric via the following function that determines how many ground-truth first pages were actually predicted as first pages. The logic behind this approach suggests that by determining first pages correctly we can consecutively split documents correctly, using each first page as a separator (since it means a start of a new document).

In [None]:
def calculate_metrics(texts, labels):
    true_positive = 0
    false_positive = 0
    false_negative = 0
    true_negative = 0
    
    for i, test in tqdm(zip(labels, texts)):
        inputs = tokenizer(test, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        pred = logits.argmax().item()
        
        if i == 1 and pred == 1:
            true_positive += 1
        elif i == 1 and pred == 0:
            false_negative += 1
        elif i == 0 and pred == 1:
            false_positive += 1
        elif i == 0 and pred == 0:
            true_negative += 1

        
    
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0
    
    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0
    
    if precision + recall != 0:
    
        f1 = 2 * precision * recall / (precision + recall)
    
    else:
        
        f1 = 0
    
    acc = (true_positive + true_negative) / len(texts)
    
    return precision, recall, f1, acc

In [10]:
precision, recall, f1, acc = calculate_metrics(pages_test_docs, pages_labels_test)

435it [13:05,  1.81s/it]


In [12]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))


 Precision: 0.7225433526011561 
 Recall: 0.8741258741258742 
 F1-score: 0.7911392405063291
