NeuralNet Document Classifier

Trains a document classification model using Hugging Face Transformers and the multilingual LayoutXLM weights. The classes are defined by the training data you provide.

Requires:

PyTorch
Hugging Face Transformers
A Tesseract installation plus the PyTesseract bindings (used for testing)

plus some pip dependencies.
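
A quick way to check the requirements above before running main.py is sketched below. It assumes the usual import names (torch, transformers, pytesseract) and the usual binary name (tesseract); the helper names are hypothetical and not part of this repository, and the remaining pip dependencies are not covered.

```python
import importlib.util
import shutil

# Usual import names for the listed requirements (an assumption, not read from main.py).
REQUIRED_MODULES = ("torch", "transformers", "pytesseract")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All listed Python packages are importable.")
    if not tesseract_on_path():
        print("No tesseract executable found on PATH.")
```

This only confirms that the packages can be located; it does not verify versions or GPU support.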

To train the model, create a /train_data subdirectory in the same directory as main.py, containing one folder per classifier label, e.g.:

/invoices
 - invoice_train1.jpg
 - invoice_train2.png
 - invoice_train3.tif
/vehicle_registrations
 - vehicle_registration_train1.jpg
 - ...

and so forth.

Then run main.py. This produces a model that loader.py can load and run inference with. loader.py starts a simple web server that returns the classification result in the following form:

{
    "class": "vehicle_registration",
    "class_id": 15,
    "confidence": "99.7821569442749",
    "inference_time": 0.055999755859375,
    "tokenizer_time": 1.834416389465332
}
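
The two data shapes described above, the /train_data folder layout and the JSON response, can be sketched as follows. This is an illustration only: the function names, the set of accepted image suffixes, and the way folders are scanned are assumptions and may differ from what main.py and loader.py actually do.

```python
import json
from pathlib import Path

# Suffixes taken from the example file names above, plus common variants (assumed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def discover_labels(train_dir):
    """Map each label folder under train_dir to a sorted list of its image paths."""
    samples = {}
    for folder in sorted(Path(train_dir).iterdir()):
        if folder.is_dir():
            samples[folder.name] = sorted(
                p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    return samples


def parse_response(body):
    """Decode a response in the format shown above.

    Note that "confidence" arrives as a string and is converted to float here.
    """
    data = json.loads(body)
    return data["class"], int(data["class_id"]), float(data["confidence"])


if __name__ == "__main__":
    if Path("train_data").is_dir():
        for label, files in discover_labels("train_data").items():
            print(f"{label}: {len(files)} training images")
```

Checking the per-label image counts before training is a cheap way to spot empty or misnamed folders.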

Repository: nurjeff/document_preprocessor_neural
