# Demo

In this notebook we will see how to use the Konfuzio SDK to train a model that finds and extracts relevant information from payslip documents.

## Setting things up

First, we need to install and initialize the konfuzio_sdk package. See [here](https://github.com/konfuzio-ai/konfuzio-sdk) for more info.

In [1]:
# !pip install konfuzio-sdk

In [2]:
# !konfuzio_sdk init

In [3]:
import os
import sys
sys.path.insert(1, os.path.join(sys.path[0], '..')) # for tests import

from copy import deepcopy

import logging
logger = logging.getLogger()

import konfuzio_sdk
from konfuzio_sdk.data import Project

from konfuzio_sdk.trainer.information_extraction import (DocumentAnnotationMultiClassModel, 
                                                         SeparateLabelsAnnotationMultiClassModel,
                                                         DocumentEntityMulticlassModel,
                                                         SeparateLabelsEntityMultiClassModel)

from konfuzio_sdk.tokenizer.regex import RegexTokenizer, WhitespaceTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer
from konfuzio_sdk.api import upload_ai_model

from konfuzio_sdk.evaluate import compare

from tests.variables import OFFLINE_PROJECT, TEST_DOCUMENT_ID


In [4]:
from konfuzio_sdk import KONFUZIO_HOST
KONFUZIO_HOST

'https://app.konfuzio.com'

In [5]:
# os.system('pwd')

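Now we can load the Konfuzio project. The active line below loads the project with ID 46 and updates its local data; the commented-out lines load the simple offline project included in the Konfuzio SDK instead.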


In [59]:
# project = Project(id_=109, project_folder=OFFLINE_PROJECT)
# project = Project(id_=None, project_folder=OFFLINE_PROJECT)
project = Project(id_=46, update=True)

# project = Project(id_=None, project_folder='text_annotation_training_tests') #OFFLINE_PROJECT)

In [7]:
# from konfuzio_sdk.trainer.information_extraction import extraction_result_to_document

In [8]:
# result = extraction_result_to_document(doc2, result)

Each project has one or more categories, which tell us how to deal with the documents belonging to that category.

In [9]:
# project.categories

Here we initialize the training pipeline:

In [10]:
category = project.categories[0]

# pipeline = SeparateLabelsAnnotationMultiClassModel() # DocumentAnnotationMultiClassModel
pipeline = DocumentEntityMulticlassModel()
# pipeline = SeparateLabelsEntityMultiClassModel()
# pipeline = DocumentAnnotationMultiClassModel()
pipeline.category = category
# pipeline.documents = pipeline.category.documents()[:5] # Why?
pipeline.test_documents = pipeline.category.test_documents()

In [11]:
pipeline.test_documents

[Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_18.pdf (44865),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_17.pdf (44866),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_19.pdf (44867)]

In [12]:
# for doc in pipeline.test_documents:
#     doc.pages()[0].get_image()

In [13]:
documents = project.documents

In [14]:
# dev
for doc in documents:
    if doc.category is None:
        dev_doc = doc
        print(doc, 'No Category!')
# dev_doc.category = category
# dev_doc.pages()

Document 2022-02-13 16:19:30.684745 (44864) No Category!


In [15]:
documents = documents[:25]
# documents = documents[:2]

In [16]:
len(documents)

25

Let's have a look at what exactly a document in this dataset looks like. 

In [17]:
document = documents[0]

In [18]:
document.text

"                                                            x02   328927/10103/00104\nAbrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1\n \nPersonal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub\n00104 150356 1  |     ‚ev                              30     400  3000       3400\n \nSV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage\n                                             \n50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30\n \n                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.\n                                             \n                                             170299  L L       l     L     l     l\n -                                       +  Steuer-ID       IMrB?

This is the output of an Optical Character Recognition (OCR) model on the following image:


In [19]:
# print(document.pages())
# document.pages()[0].get_image()

In [20]:
for label in category.labels:
    print(label.name, label.threshold)
#     label.threshold = 0.02

Steuer-Brutto 0.1
Austellungsdatum 0.1
Steuerklasse 0.1
Vorname 0.1
Personalausweis 0.1
Betrag 0.1
Bank inkl. IBAN 0.1
Lohnart 0.1
Menge 0.1
EMPTY_LABEL 0.1
Gesamt-Brutto 0.1
Nachname 0.1
Sozialversicherung 0.1
Netto-Verdienst 0.1
Bezeichnung 0.1
Auszahlungsbetrag 0.1
NO_LABEL 0.0
Steuerrechtliche Abzüge 0.1
Faktor 0.1


This is all the information we may want to identify in documents of this category.
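
The number printed next to each label is its threshold. A minimal sketch of how such a threshold could be used, assuming it acts as a minimum confidence below which a prediction is discarded (the SDK's actual rule may differ). The predictions and the helper `keep_confident` are hypothetical and not part of the SDK:

```python
# Hypothetical per-label thresholds, mirroring two rows of the printout above.
thresholds = {"Vorname": 0.1, "NO_LABEL": 0.0}

# Hypothetical predictions: (label name, offset string, confidence).
predictions = [
    ("Vorname", "Erika", 0.85),
    ("Vorname", "Bayern", 0.04),
    ("NO_LABEL", "x02", 0.60),
]


def keep_confident(predictions, thresholds, default=0.1):
    """Keep predictions whose confidence reaches the threshold of their label."""
    return [
        (label, text, confidence)
        for label, text, confidence in predictions
        if confidence >= thresholds.get(label, default)
    ]


kept = keep_confident(predictions, thresholds)
print(kept)
```

Under this assumption, the second prediction is dropped because 0.04 is below the 0.1 threshold of its label.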

### Training the Regex Tokenizer

In [21]:
# pipeline.tokenizer = ListTokenizer(tokenizers=[])
# for label in category.labels:
#     for regex in label.find_regex(category=pipeline.category):
#         pipeline.tokenizer.tokenizers.append(RegexTokenizer(regex=regex))
pipeline.tokenizer = WhitespaceTokenizer()
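
To make the idea of whitespace tokenization concrete, here is a rough standalone sketch using only the standard library. It is not the SDK's `WhitespaceTokenizer`; the helper `whitespace_spans` is hypothetical and simply treats every run of non-whitespace characters as one span with start and end offsets:

```python
import re


def whitespace_spans(text):
    """Return (start_offset, end_offset, token) for each run of non-whitespace."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


# A short made-up excerpt in the style of the payslip text.
sample = "Personal-Nr.  Geburtsdatum\n00104 150356"
spans = whitespace_spans(sample)
for start, end, token in spans:
    print(start, end, repr(token))
```

Each tuple can be sliced back out of the text with `sample[start:end]`, which is what makes offsets a convenient way to identify a span.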

In [22]:
# print(pipeline.tokenizer)

In [23]:
# print(len(pipeline.tokenizer.tokenizers))

In [24]:
# pipeline.tokenizer.tokenizers[0]


In [25]:
# raise

In [26]:
# len(document.annotations(use_correct=False)) 

In [27]:
# len(document.spans(use_correct=False))

And now we can create new NO_LABEL annotations:

In [28]:
# len(test_documents[1].annotations(use_correct=False))

In [29]:
# pipeline.tokenizer.tokenize(document)

In [30]:
# pipeline.tokenizer.processing_steps[0].runtime

In [31]:
# pipeline.tokenizer.processing_steps[1].runtime


In [32]:
# raise

In [33]:
# len(document.annotations(use_correct=False))

In [34]:
# for ann in document.annotations(use_correct=False):
#     print(ann)

In [35]:
# def label_train_doc(doc, doc_spans):
#     s_i = 0
#     for span in doc.spans():
#         while s_i < len(doc_spans) and span.start_offset > doc_spans[s_i].end_offset:
#             s_i += 1
#         if s_i >= len(doc_spans):
#             break
#         if span.end_offset < doc_spans[s_i].start_offset:
#             continue
# #         if span.start_offset <= doc_spans[s_i].end_offset and \
# #             span.end_offset >= doc_spans[s_i].start_offset:
# #             span.annotation.label = doc_spans[s_i].annotation.label
#         r = range(doc_spans[s_i].start_offset, doc_spans[s_i].end_offset+1)
#         if span.start_offset in r and \
#             span.end_offset in r:
#             span.annotation.label = doc_spans[s_i].annotation.label

            
# #         if span.start_offset <= doc_spans[s_i].end_offset and \
# #             span.end_offset >= doc_spans[s_i].start_offset:
# #             span.annotation.label = doc_spans[s_i].annotation.label
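
The commented-out helper above appears to copy labels from human-annotated spans onto tokenizer spans that lie inside them. A simplified sketch of that idea on plain tuples, assuming sorted, non-overlapping annotated spans with half-open offsets; `transfer_labels` and the sample offsets are hypothetical and do not use SDK objects:

```python
def transfer_labels(token_spans, annotated_spans):
    """Give each token span the label of the annotated span that contains it.

    token_spans: list of (start, end), sorted by start.
    annotated_spans: list of (start, end, label), sorted and non-overlapping.
    Tokens outside every annotated span get "NO_LABEL".
    """
    labelled = []
    i = 0
    for start, end in token_spans:
        # Skip annotated spans that end at or before the start of this token.
        while i < len(annotated_spans) and annotated_spans[i][1] <= start:
            i += 1
        label = "NO_LABEL"
        if i < len(annotated_spans):
            a_start, a_end, a_label = annotated_spans[i]
            # Only a token fully inside the annotated span inherits its label.
            if a_start <= start and end <= a_end:
                label = a_label
        labelled.append((start, end, label))
    return labelled


# Offsets for the made-up text "Erika Mustermann 3.400,00".
tokens = [(0, 5), (6, 16), (17, 25)]
annotated = [(0, 5, "Vorname"), (6, 16, "Nachname")]
labelled = transfer_labels(tokens, annotated)
print(labelled)
```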


Now we can do the same for all documents:

In [36]:
# Tokenize documents.
logger.setLevel(logging.ERROR)

# training_docs = [deepcopy(doc) for doc in documents]

# # for i, doc in enumerate(training_docs):    
# # #     doc._characters = documents[i].bboxes
# #     doc._hocr = documents[i].hocr

# for doc in training_docs:
#     pipeline.tokenizer.tokenize(doc)

# for i, t_doc in enumerate(training_docs):
#     label_train_doc(t_doc, documents[i].spans(use_correct=True))
    
# logger.setLevel(logging.INFO)
# for doc in test_documents:
#     pipeline.tokenizer.tokenize(doc)


In [37]:
# for doc in documents:
#     pipeline.tokenizer.tokenize(doc)
    

In [38]:
# training_docs[0].annotations(use_correct=False)
# training_docs[0]

In [39]:
# for i, doc in enumerate(training_docs):
#     doc.id_ = doc.copy_of_id #+ 1000
# #     doc.bboxes_available = True

In [40]:
print(sum([step.runtime for step in pipeline.tokenizer.processing_steps]))


0


In [41]:
# raise

In [42]:
# Extract features
pipeline.df_train, pipeline.label_feature_list = pipeline.feature_function(documents=documents, retokenize=True)
# pipeline.df_train, pipeline.label_feature_list, err = pipeline.features(documents=training_docs)

#pipeline.df_test, pipeline.test_label_feature_list = pipeline.feature_function(documents=test_documents)
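
Feature extraction turns every span into one row of numbers that the classifier can learn from. The toy sketch below shows the general idea only; the three features and the helper `toy_features` are made up for illustration and are not the features computed by the SDK:

```python
def toy_features(token):
    """Return a few hand-made features for one token (illustrative only)."""
    digits = sum(ch.isdigit() for ch in token)
    return {
        "length": len(token),
        "digit_ratio": digits / len(token) if token else 0.0,
        "is_upper": token.isupper(),
    }


# One row per token, one column per feature.
rows = [toy_features(token) for token in ["AOK", "3400", "Bayern"]]
feature_names = sorted(rows[0])
print(feature_names)
for row in rows:
    print([row[name] for name in feature_names])
```

The list of column names plays the same role as `pipeline.label_feature_list`: it fixes which columns are handed to the classifier.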



In [43]:
pipeline.df_train.shape

(7825, 339)

In [44]:
# number of annotations to label across all documents
# print(sum([len(d.annotations(use_correct=True)) for d in test_documents]))
# print(sum([len(d.spans(use_correct=True)) for d in test_documents]))

`len(pipeline.label_feature_list)` is the number of features we use to classify each annotation.

In [45]:
# print(sum([len(d.annotations(use_correct=False)) for d in test_documents]))

In [46]:
# pipeline.df_test.shape

In [47]:
len(pipeline.label_feature_list)

270

In [48]:
# pipeline.label_feature_list

In [49]:
# pipeline.feature_function(documents=documents)

In [50]:
# Start to train the classifier.
# logger.setLevel(logging.INFO)
pipeline.fit()
# logger.setLevel(logging.ERROR)

RandomForestClassifier(class_weight='balanced', max_depth=100, random_state=420)

And now we can save the trained Annotation classifier:

In [51]:
pipeline_path = pipeline.save(output_dir='.')
# pipeline_path

In [52]:
# logger.setLevel(logging.INFO)

In [53]:
# pipeline.evaluate() # https://github.com/konfuzio-ai/konfuzio-sdk/pull/109 ?

In [54]:
strict_evaluation = pipeline.evaluate_full(strict=True)

2022-08-19 17:43:57,300 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1459, 1483) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:43:57,301 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1544, 1573) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:43:57,302 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1705, 1722) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:43:57,303 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3452, 3477) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:43:57,304 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3566, 3593) contains Chractacters that don't provide a Bounding Box.

In [55]:
print('f1', strict_evaluation.f1(None))
print('tp', strict_evaluation.tp(None))
print('fp', strict_evaluation.fp(None))
print('fn', strict_evaluation.fn(None))
print('tn', strict_evaluation.tn(None))

f1 0.9279279279279279
tp 103
fp 16
fn 0
tn 0
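
The F1 score can be checked by hand from the counts printed above. The sketch below uses the standard definitions of precision, recall and F1; the helper name `precision_recall_f1` is ours, and whether the SDK computes the score through exactly this formula is an assumption:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=103, fp=16, fn=0)
print(precision, recall, f1)
```

With 103 true positives, 16 false positives and no false negatives, recall is 1.0, precision is 103/119 and F1 is 206/222, which is roughly 0.928 and agrees with the value reported above.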


In [56]:
raise

RuntimeError: No active exception to reraise

In [57]:
pipeline.test_documents

[Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_18.pdf (44865),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_17.pdf (44866),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_19.pdf (44867)]

In [60]:
# app_doc44865_eval = self.project.get_document_by_id(314250)
# app_doc44866_eval = self.project.get_document_by_id(314074)
# app_doc44867_eval = self.project.get_document_by_id(314249)
# app_docs_eval = [app_doc44865_eval, app_doc44866_eval, app_doc44867_eval]
app_doc44865 = project.get_document_by_id(318035)
app_doc44866 = project.get_document_by_id(318036)
app_doc44867 = project.get_document_by_id(318037)
app_docs = [app_doc44865, app_doc44866, app_doc44867]



In [61]:
from konfuzio_sdk.evaluate import Evaluation
eval_list = []
for i, document in enumerate(pipeline.test_documents):
    assert document.text == app_docs[i].text
    eval_list.append((document, app_docs[i]))

evaluation = Evaluation(eval_list, strict=True)
#     # F1 0.8725868725868726
#     # TP 113
#     # FP 33
#     # FN 0
# assert evaluation.f1(None) == 1.0

2022-08-19 17:44:44,550 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1459, 1483) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:44:44,551 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1544, 1573) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:44:44,552 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1705, 1722) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:44:44,553 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3452, 3477) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:44:44,554 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3566, 3593) contains Chractacters that don't provide a Bounding Box.

In [62]:
print('f1', evaluation.f1(None))
print('tp', evaluation.tp(None))
print('fp', evaluation.fp(None))
print('fn', evaluation.fn(None))
print('tn', evaluation.tn(None))


f1 0.9279279279279279
tp 103
fp 16
fn 0
tn 0


In [None]:
raise

In [None]:
strict_evaluation.data[['id_local']]

In [None]:
strict_evaluation.tokenizer()

In [None]:
non_strict_evaluation = pipeline.evaluate_full(strict=False)
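
The notebook evaluates once with `strict=True` and once with `strict=False`. One way to picture the difference is sketched below, under the loud assumption that strict means identical offsets and non-strict means any character overlap. The SDK's real matching rules may differ, and `is_match` is a hypothetical helper:

```python
def is_match(predicted, target, strict=True):
    """Compare two (start, end) spans with half-open offsets.

    strict=True: offsets must be identical (assumption).
    strict=False: any character overlap counts (assumption).
    """
    p_start, p_end = predicted
    t_start, t_end = target
    if strict:
        return (p_start, p_end) == (t_start, t_end)
    return p_start < t_end and t_start < p_end


# A prediction that includes one trailing character too many.
predicted, target = (10, 15), (10, 14)
print(is_match(predicted, target, strict=True))
print(is_match(predicted, target, strict=False))
```

Under these assumptions a prediction with a stray trailing character is a miss in the strict setting but a hit in the relaxed one, so the relaxed scores can only be equal or higher.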

In [None]:
print('f1', non_strict_evaluation.f1(None))
print('tp', non_strict_evaluation.tp(None))
print('fp', non_strict_evaluation.fp(None))
print('fn', non_strict_evaluation.fn(None))
print('tn', non_strict_evaluation.tn(None))

In [None]:
import pandas
# inference_document = pipeline.test_documents[0].__deepcopy__(None)
# 1. copy the Document so the original stays untouched
inference_document = project.get_document_by_id(44855).__deepcopy__(None)
# 2. tokenize
pipeline.tokenizer.tokenize(inference_document)
if not inference_document.spans():
    logger.error(f'{pipeline.tokenizer} does not provide Spans for {inference_document}')
    raise NotImplementedError('No error handling when Spans are missing.')
# 3. preprocessing: build the feature table for the tokenized Spans
df, _feature_names, _raw_errors = pipeline.features(inference_document)
independent_variables = df[pipeline.label_feature_list]

# 4. predict one probability per Label for each Span
results = pandas.DataFrame(data=pipeline.clf.predict_proba(X=independent_variables), columns=pipeline.clf.classes_)


In [None]:
results.columns


In [None]:
# results.columns
df['label_text'] = results.idxmax(axis=1)
df['Accuracy'] = results.max(axis=1)
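
The two lines above pick, for every row, the column with the highest probability and that probability itself. Spelled out in plain Python on made-up numbers (the class names and probabilities below are hypothetical):

```python
# Hypothetical class names and per-row probabilities, one row per span.
classes = ["Betrag", "NO_LABEL", "Vorname"]
probabilities = [
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
]


def best_label(row, classes):
    """Return the most probable class and its probability for one row."""
    confidence = max(row)
    # index() returns the first maximum, as idxmax does for ties.
    return classes[row.index(confidence)], confidence


predicted = [best_label(row, classes) for row in probabilities]
print(predicted)
```

Note that the value stored in the `Accuracy` column is the classifier's confidence in the chosen label, not an accuracy measured against ground truth.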

In [None]:
doc_44855 = project.get_document_by_id(44855) #.__deepcopy__(None)

In [None]:
doc_44855.file_path

In [None]:
for ann in doc_44855.annotations(use_correct=True):
    print(ann.label, ann.offset_string)

In [None]:
# flabels = [x for x in df['label_text'] if x != 'NO_LABEL']
# print(len(flabels))


In [None]:
pandas.set_option('display.max_rows', None)


In [None]:
print(df[['label_text', 'offset_string', 'Accuracy']])


In [None]:
results['Betrag']

In [None]:
# doc = project.get_document_by_id(44855)

In [None]:
project.get_document_by_id(44855).pages()[0].get_image()

In [None]:
non_strict_evaluation.data

In [None]:
from konfuzio_sdk.trainer.information_extraction import extraction_result_to_document
import pandas

In [None]:
####
test_doc = pipeline.test_documents[0]
extraction_result = pipeline.extract(document=test_doc)
predicted_doc = extraction_result_to_document(test_doc, extraction_result)

In [None]:
df_a = pandas.DataFrame(test_doc.eval_dict(use_correct=False))
df_b = pandas.DataFrame(predicted_doc.eval_dict(use_correct=False))

In [None]:
df_a.columns

In [None]:
df_b[['label_set_id', 'label_id']]


In [None]:
# for doc in pipeline.test_documents:
#     pipeline.tokenizer.tokenize(doc)

In [None]:
# pipeline.df_tests = 
print(sum([len(doc.annotations(use_correct=False)) for doc in pipeline.test_documents]))
print(sum([len(doc.view_annotations()) for doc in pipeline.test_documents]))

In [None]:
# for doc in pipeline.test_documents:
# print(sum([len(doc.view_annotations()) for doc in pipeline.test_documents]))

In [None]:
doc.annotation_sets()[0]

In [None]:
extraction_result.keys()

In [None]:
extraction_result['Brutto-Bezug']

In [None]:
pipeline.test_documents
pipeline.test_documents[0].annotations(use_correct=False)

In [None]:
test_doc = pipeline.test_documents[0]

In [None]:
len(test_doc.annotations(use_correct=False))

In [None]:
pipeline.tokenizer.tokenize(test_doc)

In [None]:
len(test_doc.annotations(use_correct=False))

In [None]:
result = pipeline.extract(test_doc)

In [None]:
result.keys()

In [None]:
result['NO_LABEL_SET']

In [None]:
# result['']

In [None]:
#result['Steuer']
# pipeline_path
category

In [None]:
# upload_ai_model(ai_model_path=pipeline_path)


In [None]:
####

In [None]:
from konfuzio_sdk.regex import regex_matches
import regex as re

In [None]:
pipeline.tokenizer.tokenizers[0]

In [None]:
reg = pipeline.tokenizer.tokenizers[10].regex
doc = project.get_document_by_id(44855)
text = doc.text


In [None]:
pattern = re.compile(reg, flags=0)

In [None]:
pattern

In [None]:
pats = pattern.finditer(text, overlapped=False)

In [None]:
pats = list(pats)
print(pats)

In [None]:
pat = pats[0]

In [None]:
pat.groups()

In [None]:
list(pat.re.groupindex.items())

In [None]:
pat[1]

In [None]:
pat.regs[1][1]
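
The cells above inspect match objects produced by one of the tokenizer's regexes. The same inspection can be tried on a small standalone example. This sketch uses the standard library `re` module instead of the third-party `regex` package, so the `overlapped` argument is not available; the pattern and the text are made up for illustration:

```python
import re

# Hypothetical pattern: a German-formatted amount captured in a named group.
pattern = re.compile(r"(?P<Betrag>\d{1,3}(?:\.\d{3})*,\d{2})")
text = "Gesamt-Brutto 3.400,00 Netto 2.150,75"

matches = list(pattern.finditer(text))
first = matches[0]

print(first.groups())                      # captured groups of the first match
print(list(first.re.groupindex.items()))   # group name -> group number
print(first[1], first.span(1))             # text and offsets of group 1
```

`first.end(1)` gives the same end offset that `pat.regs[1][1]` reads from the match object in the cell above.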