pydocu

FastAPI server for classifying documents and extracting data.

clone it

git clone https://github.com/jankstar/pydocu.git

start server

source env/bin/activate
export HF_HOME=./build 
uvicorn main:app --reload --workers 10

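The --reload parameter is only needed during development. The uvicorn documentation lists --workers as not valid together with --reload, so drop --reload when running several workers. The source command is only needed when working with a virtual environment (env).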

The OpenAPI documentation is available at

http://127.0.0.1:8000/docs

The following directories must be provided on the server:
/home/pydoc - all temporary files created during processing
/home/build - the cached AI models

These directories are located under /home because Azure keeps data there persistently. Otherwise, the directories ./pydoc and ./build are used. Please note that on an Azure instance the latter directories are no longer available after a restart.
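
The choice between the persistent and the local directories can be sketched as follows. This is a minimal illustration, not the application's own code: the helper name resolve_dir is hypothetical, and it assumes that the fallback is selected simply by checking whether the /home directory exists.

```python
import os


def resolve_dir(preferred: str, fallback: str) -> str:
    """Return `preferred` if it exists as a directory, otherwise `fallback`.

    Assumption: existence of the directory is the only criterion.
    """
    return preferred if os.path.isdir(preferred) else fallback


# Paths as named in this README.
temp_dir = resolve_dir("/home/pydoc", "./pydoc")
model_dir = resolve_dir("/home/build", "./build")
```

On a machine without /home/pydoc and /home/build, both variables end up pointing at the local directories.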

Resources

The installation requires approx. 16 GB of disk space: the AI models account for approx. 5 GB and the Python installation with libraries for approx. 10 GB. The Azure instance requires approx. 7 GB of memory.

Docker

Build the Docker image: docker build --tag pydoc .

Function overview

login

Authentication uses a bearer token, which is obtained via the URL

/token

with username and password. The example user is admin with the password test - please change this.

All further requests must be made with the token.

Attention: the application does not manage a user database. There is a user "Admin" in the code together with its password hash; user management must be implemented individually. A service provides the hash for a password, so that you can write this hash into the code for testing. This is not a production-ready solution and must be adapted individually.
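
A client-side sketch of the login could look like this. It rests on assumptions that this README does not confirm: that /token accepts a form-encoded POST with the fields username and password (the usual FastAPI OAuth2 password flow), and that the JSON response carries the token in a field named access_token. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local address from the "start server" section


def build_token_request(username: str, password: str,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /token (assumed form-encoded body)."""
    body = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def bearer_headers(token: str) -> dict:
    """Header to attach to every further request."""
    return {"Authorization": f"Bearer {token}"}


def fetch_token(username: str, password: str, base_url: str = BASE_URL) -> str:
    """Send the request; assumes the response field is called 'access_token'."""
    request = build_token_request(username, password, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

With a running server, bearer_headers(fetch_token("admin", "test")) would give the headers for the calls described below.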

GET "/" - get main info

Overview of the installed components, the current user and the tenants in use.

{
  "name": "pydocu",
  "description": "Services for processing documents.",
  "version": "0.1.0",
  "datetime": "2023-01-22T11:25:32.455365+01:00",
  "ghostscript": "GPL Ghostscript 9.56.1 (2022-04-04)",
  "tesseract": "tesseract 5.3.0",
  "models": "yes",
  "temp_dir": "./pydoc",
  "status": {
    "tenants": [
      "1000"
    ],
    "background_tasks": 0
  },
  "user": {
    "username": "<...>",
    "email": "<...>",
    "full_name": "<...>",
    "tenants": [
      "1000"
    ],
    "roles": [
      "admin"
    ],
    "disabled": false,
    "hashed_password": "<...>"
  }
}

The application uses Ghostscript for conversion and Tesseract for OCR. These applications can be installed via a service. They are called via shell commands. The server must run on a Unix system, e.g. on Azure or in Docker.

POST "/install/{phrase}" - Installed asynchronously

args: phrase is one of gs, tesseract or models

The installation is performed via /usr/bin/apt-get and therefore requires a Linux system. Alternatively, the applications can be installed manually.

Tesseract is installed with German language support.

POST "/api/tenant" - define a new tenant

A separate directory is created for each tenant. Master data, e.g. for the sender and receiver search, can be defined per tenant. When a new document is submitted for processing, the tenant ID must also be provided. The permitted tenant IDs can be specified in the user authorization.

example:

{
  "id": "1000",
  "name": "PyDoc GmbH"
}

GET "/api/tenant/{id}" - provides all current information on a tenant

    return {"data": {
        "id": tenant.id,
        "name": tenant.name,
        "files": files, 
        "count": len(files),
        "classes": classes_count,
        "senders": senders_count,
        "receivers": receivers_count,
        "entities": entities_count,
        "documents counter": tenant.count_documents,
        "pages counter": tenant.count_pages,
        }}

POST "/api/delete_tenant/{id}" - delete the tenant files and directory

All files of the tenant and its directory will be deleted - be careful.

POST "/api/master_data/{list_name}/{tenant}" - Transfer of master data for "sender", "receiver" or "entities" to a tenant

The data is stored as txt files in the tenant's directory and read on demand.

Example:

{"entities": [
  { 
    "id": "LIFNR/70000", 
    "name": "Hans Müller-Lüdenscheid", 
    "receiver_id": "BUKRS/1000", 
    "sender_id": "", 
    "address": "Hauptstr. 100, 1000 Berlin", 
    "tax_id": "", 
    "iban": "", 
    "tel": "", 
    "email": "hans.mueller-luedenscheid@gmail.com", 
    "exact": "", 
    "similar": "", 
    "regexp": ""
  }, { 
    "id": "LIFNR/70001", 
    "name": "Krueger", 
    "receiver_id": "BUKRS/1000", 
    "sender_id": "", 
    "address": "Hauptstr 101, 1000 Berlin", 
    "tax_id": "", 
    "iban": "", 
    "tel": "", 
    "email": "hans-juergen.krueger@gmail.com", 
    "exact": "", 
    "similar": "", 
    "regexp": ""
  }]
}
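
Every record in the example carries the same twelve fields, most of them empty. A small helper can fill in the empty ones; the sketch below is only an illustration with hypothetical names (MasterRecord, master_data_body), and it assumes that the top-level key of the body equals the list name, as in the "entities" example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MasterRecord:
    """One master data record with the fields shown in the example."""
    id: str
    name: str
    receiver_id: str = ""
    sender_id: str = ""
    address: str = ""
    tax_id: str = ""
    iban: str = ""
    tel: str = ""
    email: str = ""
    exact: str = ""
    similar: str = ""
    regexp: str = ""


def master_data_body(list_name: str, records: list) -> str:
    """Serialize records as a JSON body; the top-level key is an assumption."""
    payload = {list_name: [asdict(record) for record in records]}
    return json.dumps(payload, ensure_ascii=False)


body = master_data_body("entities", [
    MasterRecord(id="LIFNR/70001", name="Krueger",
                 receiver_id="BUKRS/1000",
                 address="Hauptstr 101, 1000 Berlin"),
])
```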

POST "/api/classes/{tenant}" - Define classification classes

Example:

{"labels": [
  "Rechnung", "Schlussrechnung", "Abschlag", "Abschlagsrechnung", "Zahlungsaufforderung", "Zahlung", "Vorauszahlung", 
  "Mahnung", "Zahlungserinnerung", "Auftrag", "Auftragsbest\u00e4tigung", "B\u00fcrgschaft", "K\u00fcndigung", 
  "Jahresrechnung", "Monatsrechnung", "Miete", "Mietrechnung", 
  "Geb\u00fchr", "Geb\u00fchrenbescheid", "Bescheid", "Guthaben"
  ]
}
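
The \u00e4-style sequences in the example are ordinary JSON escapes for umlauts. As a short illustration, Python's json module produces them by default and decodes both spellings to the same labels; the label list here is just a subset of the example.

```python
import json

labels = ["Rechnung", "Auftragsbestätigung", "Bürgschaft", "Kündigung", "Gebühr"]

# Default: non-ASCII characters are escaped, as in the example above.
ascii_body = json.dumps({"labels": labels})

# Alternative: keep the umlauts as they are (the body must then be sent as UTF-8).
utf8_body = json.dumps({"labels": labels}, ensure_ascii=False)

# Both forms decode to the same data.
same = json.loads(ascii_body) == json.loads(utf8_body)
```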

POST "/api/new_document/{tenant}" - new document for processing

Transfer a new document for processing.

example:

{
  "id": "",
  "ext_id": "20230203/18:00:01",
  "inputpath": "pdf",
  "base64": "<****>",
  "email_text": "",
  "langu": "deu"
}

An ID can be provided, otherwise it will be generated. The ID is used to access the document later to read the status and the results.

A base64 data block is required, in which the document is stored as a pdf.

The processing takes place in the background in various steps:
10 - save as pdf
20 - convert page to jpg
30 - ocr
40 - classification
41 - find receiver
42 - find sender
43 - find entities
51 - find date
99 - end

The status must be queried with the ID via the GET function below. The document has been processed when step 99 is reached.
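
The submit-and-poll flow described above can be sketched as follows. The helper names are hypothetical, the HTTP call itself is left to a callable supplied by the caller, and the sketch assumes that the "task" field of the document always starts with the step number, as in "99 - end".

```python
import base64
import time
from typing import Callable


def new_document_body(pdf_bytes: bytes, ext_id: str = "", langu: str = "deu") -> dict:
    """Build the request body from the example, with the PDF base64-encoded."""
    return {
        "id": "",  # left empty so that the server generates an ID
        "ext_id": ext_id,
        "inputpath": "pdf",
        "base64": base64.b64encode(pdf_bytes).decode("ascii"),
        "email_text": "",
        "langu": langu,
    }


def step_number(task: str) -> int:
    """Extract the leading step number from e.g. '40 - classification'.

    Returns -1 if the text does not start with a number.
    """
    head = (task or "").split("-", 1)[0].strip()
    return int(head) if head.isdigit() else -1


def wait_until_done(fetch_document: Callable[[], dict],
                    interval: float = 2.0, max_polls: int = 150) -> dict:
    """Call fetch_document() repeatedly until step 99 is reached."""
    for _ in range(max_polls):
        document = fetch_document()
        if step_number(document.get("task", "")) >= 99:
            return document
        time.sleep(interval)
    raise TimeoutError("document did not reach step 99")
```

Here fetch_document would wrap the GET "/api/document/{tenant}/{id}" call; the polling interval and the upper limit are arbitrary choices.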

POST "/api/do_parse/{tenant}/{id}" - parse text data using a template

This function parses the text with the invoice2data library, using YAML templates that are based on regular expressions.

see invoice2data

The template files must be located in the template subdirectory of the tenant.

GET "/api/document/{tenant}/{id}" - Provides the status and data for the document

This function reads the processing status. Processing is finished when the step "99 - end" is reached; the result of the processing is then available.

The result contains the senders, the receivers and the entities with the highest scores.

example:

{
  "id": "a8fb6152-fc2f-4d35-a257-243b68242217",
  "ext_id": null,
  "inputpath": "pdf",
  "base64": "<***>",
  "email_text": null,
  "langu": "deu",
  "created_at": "2022-09-11T17:50:42.933206+02:00",
  "task": "99 - end",
  "tenant_id": "1000",
  "filename": "./pydoc/1000/a8fb6152-fc2f-4d35-a257-243b68242217",
  "ocr_all": "<* all pages *>",
  "ocr_p1": "<* page 1 *>",
  "pages": 1,
  "protocol": [
    "2022-09-11T17:50:42.933172+02:00/I - init new document",
    "2022-09-11T17:50:42.934290+02:00/10 - save as pdf",
    "2022-09-11T17:50:42.935085+02:00/20 - convert page to jpg",
    "2022-09-11T17:50:43.621445+02:00/ GPL Ghostscript 9.56.1 (2022-04-04)\nCopyright (C) 2022 Artifex Software, Inc.  All rights reserved.\nThis software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:\nsee the file COPYING for details.\nProcessing pages 1 through 1.\nPage 1\n",
    "2022-09-11T17:50:43.622931+02:00/30 - ocr",
    "2022-09-11T17:50:43.623012+02:00/03 - ocr file:a8fb6152-fc2f-4d35-a257-243b68242217page001.jpg",
    "2022-09-11T17:50:58.183496+02:00/ tesseract ready",
    "2022-09-11T17:50:58.185273+02:00/40 - classification",
    "2022-09-11T17:51:39.655181+02:00/41 - find receiver",
    "2022-09-11T17:51:46.139791+02:00/42 - find sender",
    "2022-09-11T17:51:46.141156+02:00/43 - find entities",
    "2022-09-11T17:51:50.285902+02:00/51 - find date ",
    "2022-09-11T17:51:50.286054+02:00/99 - end"
  ],
  "classification": [
    { "label": "Rechnung", "score": 6.586556136608124 },
    { "label": "Schlussrechnung", "score": 6.088626384735107 },
    { "label": "Abschlag", "score": 5.9678588062524796 },
    { "label": "Abschlagsrechnung", "score": 5.68193644285202 },
    { "label": "Zahlungsaufforderung", "score": 5.645296350121498 }
  ],
  "senders": [
    {
      "item": {
        "id": "LIFNR/70000",
        "name": "xyz",
        "receiver_id": "BUKRS/1000",
        "sender_id": "",
        "address": "xyz",
        "tax_id": "",
        "iban": "",
        "tel": "",
        "email": "xyz@gmail.com",
        "exact": "",
        "similar": "",
        "regexp": ""
      },
      "score": 100.0,
      "method": ""
    }
  ],
  "receiver": [
    {
      "item": {
        "id": "BUKRS/1000",
        "name": "abc",
        "receiver_id": "",
        "sender_id": "",
        "address": "",
        "tax_id": "",
        "iban": "",
        "tel": "",
        "email": "",
        "exact": "",
        "similar": "abc",
        "regexp": ""
      },
      "score": 38.43818008899689,
      "method": "similar"
    },
    {
      "item": {
        "id": "BUKRS/1001",
        "name": "def",
        "receiver_id": "",
        "sender_id": "",
        "address": "",
        "tax_id": "",
        "iban": "",
        "tel": "",
        "email": "",
        "exact": "",
        "similar": "def",
        "regexp": ""
      },
      "score": 36.11149191856384,
      "method": "similar"
    },
    {
      "item": {
        "id": "BUKRS/1002",
        "name": "ghj",
        "receiver_id": "",
        "sender_id": "",
        "address": "",
        "tax_id": "",
        "iban": "",
        "tel": "",
        "email": "",
        "exact": "",
        "similar": "ghj",
        "regexp": ""
      },
      "score": 34.47766900062561,
      "method": "similar"
    }
  ],
  "entities": [
    {
      "item": {
        "id": "1000/WE1000",
        "name": "lmn",
        "receiver_id": "BUKRS/1000",
        "sender_id": "",
        "address": "",
        "tax_id": "",
        "iban": "",
        "tel": "",
        "email": "",
        "exact": "",
        "similar": "lmn",
        "regexp": ""
      },
      "score": 22.262464463710785,
      "method": "similar"
    }
  ],
  "data": {
    "sender_id": "LIFNR/70000",
    "receiver_id": "BUKRS/1000",
    "datum": null
  }
}
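
A client will usually reduce such a response to the best candidate per list. The following sketch shows one way to do that; the function names are hypothetical, and it assumes only the structure visible in the example (a "score" per entry, a "label" for classifications and an "item" with an "id" for the other lists).

```python
from typing import Optional


def best_match(candidates: list) -> Optional[dict]:
    """Return the entry with the highest score, or None for an empty list."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate["score"])


def summarize(document: dict) -> dict:
    """Pick the top label and the top item IDs from a document response."""
    top_class = best_match(document.get("classification") or [])
    summary = {"label": top_class["label"] if top_class else None}
    for key in ("senders", "receiver", "entities"):
        top = best_match(document.get(key) or [])
        summary[key] = top["item"]["id"] if top else None
    return summary
```

For the example above, the summary would name "Rechnung" as the label and "BUKRS/1000" as the receiver.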

POST "/api/delete_document/{tenant}/{id}"

The clean-up should take place when the document has been processed - step 99 - and the data has been downloaded.

Test functions

POST "/api/predict_sts" - compares semantics of two sentences
POST "/api/predict_zs" - semantic classification of a sentence

GET "/api/find_date" - find a date in the parameter string by regex and use it as the document date (for testing the regex)
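
The regular expressions used by the application are not listed in this README. As a rough idea of what a regex-based date search does, here is a minimal sketch that assumes the German notation dd.mm.yyyy only; the pattern and the function name are illustrative and not taken from the code.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: day.month.year with a four-digit year, e.g. 03.02.2023.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def find_first_date(text: str) -> Optional[date]:
    """Return the first valid calendar date found in the text, or None."""
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            continue  # e.g. 99.99.2023 matches the pattern but is not a date
    return None
```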

GET "/api/hash_password/{password}" - print the hash of a string

AI Models

Sahajtomar/German_Zeroshot
A German zero-shot model, used for the classification of the forms.

Sahajtomar/German-semantic or
deutsche-telekom/gbert-large-paraphrase-cosine
Compares the semantics of two German sentences. It is used to determine business partners if no match was found via discrete comparison, e.g. of bank account or order number.

Attention: the models are so large that the application only runs on a virtual machine with at least 6 GB of memory. Installing the models, about 1 GB each, takes time and bandwidth. The models are cached on the machine, so disk space must be provided for them.

The application runs on a virtual Azure instance Basic B3 with 7 GB. Additional storage can be mounted under /home/build; the cached models are then located there.

Berlin 2023/05/15
