# Demo

In this notebook we will see how to use the Konfuzio SDK to train a model that finds and extracts relevant information from payslip documents.

## Setting things up

First, we need to install and initialize the konfuzio_sdk package. See [here](https://github.com/konfuzio-ai/konfuzio-sdk) for more info.

In [1]:
# !pip install konfuzio-sdk

In [2]:
# !konfuzio_sdk init

In [3]:
import os
import sys
sys.path.insert(1, os.path.join(sys.path[0], '..')) # for tests import

from copy import deepcopy

import logging
logger = logging.getLogger()

import konfuzio_sdk
from konfuzio_sdk.data import Project

from konfuzio_sdk.trainer.information_extraction import (DocumentAnnotationMultiClassModel, 
                                                         SeparateLabelsAnnotationMultiClassModel,
                                                         DocumentEntityMulticlassModel,
                                                         SeparateLabelsEntityMultiClassModel)

from konfuzio_sdk.tokenizer.regex import RegexTokenizer, WhitespaceTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer
from konfuzio_sdk.api import upload_ai_model

from konfuzio_sdk.evaluate import compare

from tests.variables import OFFLINE_PROJECT, TEST_DOCUMENT_ID



In [4]:
from konfuzio_sdk import KONFUZIO_HOST
KONFUZIO_HOST

'https://app.konfuzio.com'

In [5]:
# os.system('pwd')

Now we can load the Konfuzio project. The commented-out lines use a simple offline project included in the Konfuzio SDK, while the active line loads the project with ID 46.


In [89]:
# project = Project(id_=109, project_folder=OFFLINE_PROJECT)
# project = Project(id_=None, project_folder=OFFLINE_PROJECT)
project = Project(id_=46, update=True)

# project = Project(id_=None, project_folder='text_annotation_training_tests') #OFFLINE_PROJECT)

In [17]:
# doc2.text.replace(' ', '')

In [18]:
# for ann in doc.annotations(use_correct=False):
#     ann.is_correct = True

In [19]:
# doc.annotations()

In [20]:
# ndoc1 = project.get_document_by_id(314030)
# ndoc2 = project.get_document_by_id(44865)
# ndoc3 = project.get_document_by_id(314074)

In [21]:
# ndoc1.text==ndoc2.text

In [22]:
# len(ndoc1.text)

In [23]:
# ndoc2.text

In [24]:
# ndoc3.text

In [25]:
# result = pipeline.extract(doc2)

In [26]:
# from konfuzio_sdk.trainer.information_extraction import extraction_result_to_document

In [27]:
# result = extraction_result_to_document(doc2, result)

In [28]:
# result.annotations(use_correct=False)

In [29]:
# result.category.labels[0]

In [30]:
# category.labels[9] == result.category.labels[0]

In [31]:
# comp_res = compare(doc, result)

In [32]:
# comp_res['is_found_by_tokenizer']

Each project has one or more categories, which tell us how to deal with the documents belonging to that category.

In [33]:
# project.categories

Here we initialize the training pipeline:

In [34]:
category = project.categories[0]

pipeline = SeparateLabelsAnnotationMultiClassModel() # DocumentAnnotationMultiClassModel
# pipeline = DocumentEntityMulticlassModel()
# pipeline = SeparateLabelsEntityMultiClassModel()
# pipeline = DocumentAnnotationMultiClassModel()
pipeline.category = category
# pipeline.documents = pipeline.category.documents()[:5] # Why?
pipeline.test_documents = pipeline.category.test_documents()

In [35]:
pipeline.test_documents

[Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_18.pdf (44865),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_17.pdf (44866),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_19.pdf (44867)]

In [36]:
# for doc in pipeline.test_documents:
#     doc.pages()[0].get_image()

In [37]:
documents = project.documents

In [38]:
# dev
for doc in documents:
    if doc.category is None:
        dev_doc = doc
        print(doc, 'No Category!')
# dev_doc.category = category
# dev_doc.pages()

Document 2022-02-13 16:19:30.684745 (44864) No Category!


In [39]:
documents = documents[:25]
# documents = documents[:2]

In [40]:
len(documents)

25

In [41]:
# documents[25:]

In [42]:
# len(test_documents)

Let's have a look at what exactly a document in this dataset looks like. 

In [43]:
document = documents[0]

In [44]:
document.text

"                                                            x02   328927/10103/00104\nAbrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1\n \nPersonal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub\n00104 150356 1  |     ‚ev                              30     400  3000       3400\n \nSV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage\n                                             \n50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30\n \n                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.\n                                             \n                                             170299  L L       l     L     l     l\n -                                       +  Steuer-ID       IMrB?

This is the output of an Optical Character Recognition (OCR) model on the following image:


In [45]:
# print(document.pages())
# document.pages()[0].get_image()

In [46]:
# documents

In [47]:
for label in category.labels:
    print(label.name, label.threshold)
#     label.threshold = 0.02

Bezeichnung 0.1
Steuer-Brutto 0.1
Lohnart 0.1
Gesamt-Brutto 0.1
Steuerrechtliche Abzüge 0.1
Bank inkl. IBAN 0.1
Netto-Verdienst 0.1
Menge 0.1
Faktor 0.1
Austellungsdatum 0.1
NO_LABEL 0.0
Betrag 0.1
Steuerklasse 0.1
Sozialversicherung 0.1
EMPTY_LABEL 0.1
Personalausweis 0.1
Vorname 0.1
Nachname 0.1
Auszahlungsbetrag 0.1


This is all the information we may want to identify in documents of this category.
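
The second value printed for each label is its threshold. A minimal sketch of how such a threshold could be applied, assuming it acts as an inclusive minimum confidence for predictions of that label. The prediction tuples and the `keep_confident` helper are hypothetical and not part of the SDK:

```python
# Hypothetical per-label thresholds, mirroring the values printed above.
thresholds = {"Betrag": 0.1, "Vorname": 0.1, "NO_LABEL": 0.0}

# Hypothetical predictions: (label name, offset string, confidence).
predictions = [
    ("Betrag", "3400", 0.82),
    ("Vorname", "Bat", 0.04),
    ("Betrag", "30", 0.10),
]


def keep_confident(predictions, thresholds, default=0.1):
    """Keep predictions whose confidence reaches the threshold of their label.

    Assumption: the comparison is inclusive (confidence >= threshold).
    """
    return [
        (label, text, confidence)
        for label, text, confidence in predictions
        if confidence >= thresholds.get(label, default)
    ]


kept = keep_confident(predictions, thresholds)
print(kept)
```

Under this assumption, lowering a label's threshold (as in the commented-out `label.threshold = 0.02`) would let more low-confidence predictions through for that label.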

In [48]:
# p = DocumentEntityMulticlassModel()
# p.__init__()
# hasattr(p, 'tokenizer')

In [49]:
# p.tokenizer

### Training the Regex Tokenizer

In [50]:
pipeline.tokenizer = ListTokenizer(tokenizers=[])
for label in category.labels:
    for regex in label.find_regex(category=pipeline.category):
        pipeline.tokenizer.tokenizers.append(RegexTokenizer(regex=regex))
# pipeline.tokenizer = WhitespaceTokenizer()

2022-08-19 17:31:09,246 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [tokens              ][0577] Load existing tokens for Label Bezeichnung in Category Lohnabrechnung (63).
2022-08-19 17:31:14,217 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0715] For Label Bezeichnung we found 78 regex proposals for 75 annotations.
2022-08-19 17:31:16,883 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0727] We compare 78 regex for 75 correct Annotations for Category Lohnabrechnung (63).
2022-08-19 17:31:16,884 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0733] Evaluate Label: Bezeichnung for best regex.
2022-08-19 17:31:16,904 [konfuzio_sdk.regex  ] [MainThread] [INFO    ] [get_best_regex      ][0145] 

|    | regex                                                                                                                                                                                                                       

2022-08-19 17:31:18,209 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [tokens              ][0577] Load existing tokens for Label Lohnart in Category Lohnabrechnung (63).
2022-08-19 17:31:22,091 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0715] For Label Lohnart we found 134 regex proposals for 107 annotations.
2022-08-19 17:31:25,563 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0727] We compare 134 regex for 107 correct Annotations for Category Lohnabrechnung (63).
2022-08-19 17:31:25,564 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0733] Evaluate Label: Lohnart for best regex.
2022-08-19 17:31:25,592 [konfuzio_sdk.regex  ] [MainThread] [INFO    ] [get_best_regex      ][0145] 

|    | regex                                                            |   runtime |   annotation_recall |   annotation_precision |   f1_score |   new_matches_count |
|---:|:------------------------------------------------------------

2022-08-19 17:31:30,965 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0733] Evaluate Label: Bank inkl. IBAN for best regex.
2022-08-19 17:31:30,978 [konfuzio_sdk.regex  ] [MainThread] [INFO    ] [get_best_regex      ][0145] 

|    | regex                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |   runtime |   annotation_recall |   annotation_precision |   f1_score |   new_matches_count |
|---:|:------------------------------------------------------------------------

2022-08-19 17:31:36,912 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0727] We compare 20 regex for 24 correct Annotations for Category Lohnabrechnung (63).
2022-08-19 17:31:36,913 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0733] Evaluate Label: Austellungsdatum for best regex.
2022-08-19 17:31:36,922 [konfuzio_sdk.regex  ] [MainThread] [INFO    ] [get_best_regex      ][0145] 

|    | regex                                                                                |   runtime |   annotation_recall |   annotation_precision |   f1_score |   new_matches_count |
|---:|:-------------------------------------------------------------------------------------|----------:|--------------------:|-----------------------:|-----------:|--------------------:|
|  0 | [ ]{2,}(?:(?P<Label_867_N_12516017_159>\d\d\.\d\d\.\d\d\d\d))[ ]+[A-ZÄÖÜ][a-zäöüß]+: |    0.0004 |              1.0000 |                 0.9600 |     0.9796 |                  24 |

202

2022-08-19 17:31:50,135 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [tokens              ][0577] Load existing tokens for Label Personalausweis in Category Lohnabrechnung (63).
2022-08-19 17:31:51,686 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0715] For Label Personalausweis we found 24 regex proposals for 43 annotations.
2022-08-19 17:31:52,303 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0727] We compare 24 regex for 43 correct Annotations for Category Lohnabrechnung (63).
2022-08-19 17:31:52,304 [konfuzio_sdk.data   ] [MainThread] [INFO    ] [find_regex          ][0733] Evaluate Label: Personalausweis for best regex.
2022-08-19 17:31:52,317 [konfuzio_sdk.regex  ] [MainThread] [INFO    ] [get_best_regex      ][0145] 

|    | regex                                                                  |   runtime |   annotation_recall |   annotation_precision |   f1_score |   new_matches_count |
|---:|:----------------------------------
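
To make the role of these regex proposals more concrete, here is a minimal sketch of how a list of regexes with named groups can turn text into candidate spans. It uses only Python's standard `re` module rather than the SDK classes. The two patterns are simplified, illustrative versions of the proposals logged above, and `find_spans` is a hypothetical helper:

```python
import re

# Simplified patterns in the style of the logged proposals (illustrative only).
patterns = [
    r"[ ]{2,}(?P<date>\d\d\.\d\d\.\d\d\d\d)[ ]+[A-ZÄÖÜ][a-zäöüß]+:",
    r"(?P<word>[A-ZÄÖÜ][a-zäöüß]+)",
]

text = "Dezember 2018   22.05.2018 Bat:  1"


def find_spans(text, patterns):
    """Return sorted, de-duplicated (start, end, matched text) tuples.

    Only the named groups count as spans; the context around them merely
    anchors the match.
    """
    spans = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            for name in match.re.groupindex:
                start, end = match.span(name)
                if start != -1:  # the group took part in the match
                    spans.add((start, end, text[start:end]))
    return sorted(spans)


spans = find_spans(text, patterns)
for start, end, value in spans:
    print(start, end, value)
```

Collecting the results in a set means that two patterns proposing the same offsets yield only one span, which is one plausible way to combine several tokenizers.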

In [51]:
# print(pipeline.tokenizer)

In [52]:
# print(len(pipeline.tokenizer.tokenizers))

In [53]:
# pipeline.tokenizer.tokenizers[0]


In [54]:
# raise

In [55]:
# len(document.annotations(use_correct=False)) 

In [56]:
# len(document.spans(use_correct=False))

And now we can create new NO_LABEL annotations:

In [57]:
# len(test_documents[1].annotations(use_correct=False))

In [58]:
# pipeline.tokenizer.tokenize(document)

In [59]:
# pipeline.tokenizer.processing_steps[0].runtime

In [60]:
# pipeline.tokenizer.processing_steps[1].runtime


In [61]:
# raise

In [62]:
# len(document.annotations(use_correct=False))

In [63]:
# for ann in document.annotations(use_correct=False):
#     print(ann)

In [64]:
# def label_train_doc(doc, doc_spans):
#     s_i = 0
#     for span in doc.spans():
#         while s_i < len(doc_spans) and span.start_offset > doc_spans[s_i].end_offset:
#             s_i += 1
#         if s_i >= len(doc_spans):
#             break
#         if span.end_offset < doc_spans[s_i].start_offset:
#             continue
# #         if span.start_offset <= doc_spans[s_i].end_offset and \
# #             span.end_offset >= doc_spans[s_i].start_offset:
# #             span.annotation.label = doc_spans[s_i].annotation.label
#         r = range(doc_spans[s_i].start_offset, doc_spans[s_i].end_offset+1)
#         if span.start_offset in r and \
#             span.end_offset in r:
#             span.annotation.label = doc_spans[s_i].annotation.label

            
# #         if span.start_offset <= doc_spans[s_i].end_offset and \
# #             span.end_offset >= doc_spans[s_i].start_offset:
# #             span.annotation.label = doc_spans[s_i].annotation.label
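
The commented-out helper above assigns the label of a human-annotated span to every tokenizer span that lies inside it. A simplified sketch of that idea, using plain offset tuples instead of SDK objects and a nested loop instead of the two-pointer scan. All names here are hypothetical:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_offset, end_offset) in the document text


def label_token_spans(
    token_spans: List[Span],
    annotated_spans: List[Tuple[Span, str]],
    default: str = "NO_LABEL",
) -> List[str]:
    """Give each token span the label of the annotated span that fully contains it."""
    labels = []
    for start, end in token_spans:
        label = default
        for (a_start, a_end), a_label in annotated_spans:
            # Containment check: both offsets fall inside the annotated span.
            if a_start <= start and end <= a_end:
                label = a_label
                break
        labels.append(label)
    return labels


token_spans = [(0, 8), (16, 26), (18, 20), (27, 30)]
annotated_spans = [((16, 26), "Austellungsdatum")]
labels = label_token_spans(token_spans, annotated_spans)
print(labels)
```

Spans that match no annotation keep the default `NO_LABEL`, which gives the classifier negative examples to learn from.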


Now we can do the same for all documents:

In [65]:
# Tokenize documents.
logger.setLevel(logging.ERROR)

# training_docs = [deepcopy(doc) for doc in documents]

# # for i, doc in enumerate(training_docs):    
# # #     doc._characters = documents[i].bboxes
# #     doc._hocr = documents[i].hocr

# for doc in training_docs:
#     pipeline.tokenizer.tokenize(doc)

# for i, t_doc in enumerate(training_docs):
#     label_train_doc(t_doc, documents[i].spans(use_correct=True))
    
# logger.setLevel(logging.INFO)
# for doc in test_documents:
#     pipeline.tokenizer.tokenize(doc)


In [66]:
# for doc in documents:
#     pipeline.tokenizer.tokenize(doc)
    

In [67]:
# training_docs[0].annotations(use_correct=False)
# training_docs[0]

In [68]:
# for i, doc in enumerate(training_docs):
#     doc.id_ = doc.copy_of_id #+ 1000
# #     doc.bboxes_available = True

In [69]:
print(sum([step.runtime for step in pipeline.tokenizer.processing_steps]))


0


In [70]:
# raise

In [71]:
# Extract features
pipeline.df_train, pipeline.label_feature_list = pipeline.feature_function(documents=documents, retokenize=True)
# pipeline.df_train, pipeline.label_feature_list, err = pipeline.features(documents=training_docs)

#pipeline.df_test, pipeline.test_label_feature_list = pipeline.feature_function(documents=test_documents)


  span.bbox()  # check that the bbox can be calculated  # todo add test
  characters = {key: self.annotation.document.bboxes.get(key) for key in character_range}
  date2 = pandas.to_datetime(s, errors='ignore')
  df["relative_position_in_page"] = df["page_index"] / document_n_pages
2022-08-19 17:31:58,781 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1787, 1807) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:31:59,861 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1593, 1621) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:31:59,862 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1677, 1703) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:01,007 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1604, 1628) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:01,007 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1689, 1718) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:01,018 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3537, 3564) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:02,198 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1475, 1499) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:02,199 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1560, 1589) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:02,213 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3646, 3673) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:03,091 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1636, 1665) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:03,102 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3549, 3576) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:03,316 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1517, 1541) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:03,317 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1602, 1631) contains Chractacters that don't provide a Bounding Box.

In [72]:
pipeline.df_train.shape

(1274, 339)

In [73]:
# number of annotations to label across all documents
# print(sum([len(d.annotations(use_correct=True)) for d in test_documents]))
# print(sum([len(d.spans(use_correct=True)) for d in test_documents]))

The length of `pipeline.label_feature_list` is the number of features we use to classify each annotation.

In [74]:
# print(sum([len(d.annotations(use_correct=False)) for d in test_documents]))

In [75]:
# pipeline.df_test.shape

In [76]:
len(pipeline.label_feature_list)

270

In [77]:
# pipeline.label_feature_list

In [78]:
# pipeline.feature_function(documents=documents)

In [79]:
# Start to train the Classifier.
# logger.setLevel(logging.INFO)
pipeline.fit()
# logger.setLevel(logging.ERROR)

RandomForestClassifier(class_weight='balanced', max_depth=100, random_state=420)
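
The fitted classifier uses `class_weight='balanced'`. In scikit-learn this mode weights each class by `n_samples / (n_classes * class_count)`, so rare labels count for more than the frequent `NO_LABEL` class. The sketch below reproduces that formula with the standard library only. The label list is made up for illustration and `balanced_class_weights` is a hypothetical helper:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * count) for every class in labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up training labels: many NO_LABEL spans, few labelled ones.
labels = ["NO_LABEL"] * 6 + ["Betrag"] * 3 + ["Vorname"]
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```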

And now we can save the trained Annotation classifier:

In [80]:
pipeline_path = pipeline.save(output_dir='.')
# pipeline_path

In [81]:
# logger.setLevel(logging.INFO)


In [82]:
# pipeline.evaluate() # https://github.com/konfuzio-ai/konfuzio-sdk/pull/109 ?

In [83]:
strict_evaluation = pipeline.evaluate_full(strict=True)

  span.bbox()  # check that the bbox can be calculated  # todo add test
  characters = {key: self.annotation.document.bboxes.get(key) for key in character_range}
2022-08-19 17:32:23,592 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1459, 1483) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:23,593 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1544, 1573) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:23,594 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1705, 1722) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:23,605 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3566, 3593) contains Chractacters that don't provide a Bounding Box.
  date2 = pandas.to_datetime(s, errors='ignore')
  df["relative_position_in_page"] = df["page_index"] / document_n_pages
  df = df.append(

  df = df.append(label_df, sort=True)
  span.bbox()  # check that the bbox can be calculated  # todo add test
  characters = {key: self.annotation.document.bboxes.get(key) for key in character_range}
2022-08-19 17:32:25,266 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1474, 1498) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:

  df = df.append(label_df, sort=True)
  if self.bbox():
  characters = {key: self.annotation.document.bboxes.get(key) for key in character_range}
2022-08-19 17:32:25,928 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1459, 1483) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:25,929 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1544, 1573) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:25,930 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (1705, 1722) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:25,931 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox                ][0910] Span (3452, 3477) contains Chractacters that don't provide a Bounding Box.
2022-08-19 17:32:25,932 [konfuzio_sdk.data   ] [MainThread] [ERROR   ] [bbox  

In [84]:
print('f1', strict_evaluation.f1(None))
print('tp', strict_evaluation.tp(None))
print('fp', strict_evaluation.fp(None))
print('fn', strict_evaluation.fn(None))
print('tn', strict_evaluation.tn(None))

f1 0.8773584905660378
tp 93
fp 14
fn 12
tn 50
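
The F1 score above follows directly from the printed counts. The short check below recomputes precision, recall and F1 with the standard definitions. `precision_recall_f1` is a hypothetical helper, not an SDK function:

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Counts taken from the strict evaluation output above.
precision, recall, f1 = precision_recall_f1(tp=93, fp=14, fn=12)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

True negatives do not enter the F1 score, which is why the `tn` count has no effect on it.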


In [85]:
raise

RuntimeError: No active exception to reraise

In [86]:
pipeline.test_documents

[Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_18.pdf (44865),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_17.pdf (44866),
 Document Auswertungspaket - unterschiedliche B_N-Auswertungen.pdf_19.pdf (44867)]

In [90]:
# app_doc44865_eval = self.project.get_document_by_id(314250)
# app_doc44866_eval = self.project.get_document_by_id(314074)
# app_doc44867_eval = self.project.get_document_by_id(314249)
# app_docs_eval = [app_doc44865_eval, app_doc44866_eval, app_doc44867_eval]
app_doc44865 = project.get_document_by_id(318035)
app_doc44866 = project.get_document_by_id(318036)
app_doc44867 = project.get_document_by_id(318037)
app_docs = [app_doc44865, app_doc44866, app_doc44867]



In [95]:
from konfuzio_sdk.evaluate import Evaluation
eval_list = []
for i, document in enumerate(pipeline.test_documents):
    assert document.text == app_docs[i].text
    eval_list.append((document, app_docs[i]))

evaluation = Evaluation(eval_list, strict=True)
#     # F1 0.8725868725868726
#     # TP 113
#     # FP 33
#     # FN 0
# assert evaluation.f1(None) == 1.0

In [98]:
print('f1', evaluation.f1(None))
print('tp', evaluation.tp(None))
print('fp', evaluation.fp(None))
print('fn', evaluation.fn(None))
print('tn', evaluation.tn(None))


f1 0.9279279279279279
tp 103
fp 16
fn 0
tn 0


In [None]:
strict_evaluation.data[['id_local']]

In [None]:
strict_evaluation.tokenizer()

In [None]:
non_strict_evaluation = pipeline.evaluate_full(strict=False)

In [None]:
print('f1', non_strict_evaluation.f1(None))
print('tp', non_strict_evaluation.tp(None))
print('fp', non_strict_evaluation.fp(None))
print('fn', non_strict_evaluation.fn(None))
print('tn', non_strict_evaluation.tn(None))
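
The difference between the two evaluation modes can be pictured with plain offsets. The sketch below is an illustration only: it assumes that strict evaluation requires a predicted span to have exactly the same offsets as the annotated one, while non-strict evaluation accepts any overlap. Check the SDK documentation for the precise matching rules. The helper names are hypothetical:

```python
def exact_match(predicted, annotated):
    """Strict comparison (assumed): start and end offsets must be identical."""
    return predicted == annotated


def overlaps(predicted, annotated):
    """Non-strict comparison (assumed): spans share at least one character.

    Spans are half-open (start, end) offset pairs.
    """
    return predicted[0] < annotated[1] and annotated[0] < predicted[1]


annotated = (16, 26)   # e.g. "22.05.2018"
predicted = (16, 27)   # same value plus one trailing character

print(exact_match(predicted, annotated))  # strict view
print(overlaps(predicted, annotated))     # non-strict view
```

Under these assumptions the non-strict F1 score can only be equal to or higher than the strict one for the same predictions.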

In [None]:
import pandas
# inference_document = pipeline.test_documents[0].__deepcopy__(None)
inference_document = project.get_document_by_id(44855).__deepcopy__(None)
# 2. tokenize
pipeline.tokenizer.tokenize(inference_document)
if not inference_document.spans():
    logger.error(f'{pipeline.tokenizer} does not provide Spans for {inference_document}')
    raise NotImplementedError('No error handling when Spans are missing.')
# 3. preprocessing
df, _feature_names, _raw_errors = pipeline.features(inference_document)
independent_variables = df[pipeline.label_feature_list]

results = pandas.DataFrame(data=pipeline.clf.predict_proba(X=independent_variables), columns=pipeline.clf.classes_)


In [None]:
results.columns


In [None]:
# results.columns
df['label_text'] = results.idxmax(axis=1)
df['Accuracy'] = results.max(axis=1)
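
The two assignments above pick, for every span, the class with the highest predicted probability and that probability itself. Although the column is named `Accuracy`, it holds the highest class probability, which is better read as a confidence. The same selection can be sketched without pandas. The probability rows below are made up and `best_label` is a hypothetical helper:

```python
# Made-up class probabilities for three spans, one dict per row.
rows = [
    {"Betrag": 0.70, "NO_LABEL": 0.20, "Vorname": 0.10},
    {"Betrag": 0.05, "NO_LABEL": 0.90, "Vorname": 0.05},
    {"Betrag": 0.15, "NO_LABEL": 0.25, "Vorname": 0.60},
]


def best_label(row):
    """Return the (label, probability) pair with the highest probability in a row."""
    label = max(row, key=row.get)
    return label, row[label]


predicted = [best_label(row) for row in rows]
for label, confidence in predicted:
    print(label, confidence)
```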

In [None]:
doc_44855 = project.get_document_by_id(44855) #.__deepcopy__(None)

In [None]:
doc_44855.file_path

In [None]:
for ann in doc_44855.annotations(use_correct=True):
    print(ann.label, ann.offset_string)

In [None]:
# flabels = [x for x in df['label_text'] if x != 'NO_LABEL']
# print(len(flabels))


In [None]:
pandas.set_option('display.max_rows', None)


In [None]:
print(df[['label_text', 'offset_string', 'Accuracy']])


In [None]:
results['Betrag']

In [None]:
# doc = project.get_document_by_id(44855)

In [None]:
project.get_document_by_id(44855).pages()[0].get_image()

In [None]:
non_strict_evaluation.data

In [None]:
from konfuzio_sdk.trainer.information_extraction import extraction_result_to_document
import pandas

In [None]:
####
test_doc = pipeline.test_documents[0]
extraction_result = pipeline.extract(document=test_doc)
predicted_doc = extraction_result_to_document(test_doc, extraction_result)

In [None]:
df_a = pandas.DataFrame(test_doc.eval_dict(use_correct=False))
df_b = pandas.DataFrame(predicted_doc.eval_dict(use_correct=False))

In [None]:
df_a.columns

In [None]:
df_b[['label_set_id', 'label_id']]


In [None]:
# for doc in pipeline.test_documents:
#     pipeline.tokenizer.tokenize(doc)

In [None]:
# pipeline.df_tests = 
print(sum([len(doc.annotations(use_correct=False)) for doc in pipeline.test_documents]))
print(sum([len(doc.view_annotations()) for doc in pipeline.test_documents]))

In [None]:
# for doc in pipeline.test_documents:
# print(sum([len(doc.view_annotations()) for doc in pipeline.test_documents]))

In [None]:
doc.annotation_sets()[0]

In [None]:
extraction_result.keys()

In [None]:
extraction_result['Brutto-Bezug']

In [None]:
pipeline.test_documents
pipeline.test_documents[0].annotations(use_correct=False)

In [None]:
test_doc = pipeline.test_documents[0]

In [None]:
len(test_doc.annotations(use_correct=False))

In [None]:
pipeline.tokenizer.tokenize(test_doc)

In [None]:
len(test_doc.annotations(use_correct=False))

In [None]:
result = pipeline.extract(test_doc)

In [None]:
result.keys()

In [None]:
result['NO_LABEL_SET']

In [None]:
# result['']

In [None]:
#result['Steuer']
# pipeline_path
category

In [None]:
# upload_ai_model(ai_model_path=pipeline_path)


In [None]:
####

In [None]:
from konfuzio_sdk.regex import regex_matches
import regex as re

In [None]:
pipeline.tokenizer.tokenizers[0]

In [None]:
reg = pipeline.tokenizer.tokenizers[10].regex
doc = project.get_document_by_id(44855)
text = doc.text


In [None]:
pattern = re.compile(reg, flags=0)

In [None]:
pattern

In [None]:
pats = pattern.finditer(text, overlapped=False)

In [None]:
pats = list(pats)
print(pats)

In [None]:
pat = pats[0]

In [None]:
pat.groups()

In [None]:
list(pat.re.groupindex.items())

In [None]:
pat[1]

In [None]:
pat.regs[1][1]