NeuralNet Document Classifier

Trains a model capable of classifying any document using some Huggingface transformers and LayoutXLM multi-language weights.

Requires:

PyTorch
Huggingface bindings
Tesseract installation as well as PyTesseract bindings for testing purposes

plus some pip dependencies.

To train the model, create a /train_data subdirectory on main.py dir level containing folders representing the classifier labels, e.G:

/invoices
 - invoice_train1.jpg
 - invoice_train2.png
 - invoice_train3.tif
/vehicle_registrations
 - vehicle_registration_train1.jpg
 - ...

and so forth

and run main.py. This will put out a model capable of being loaded and inferenced from by loader.py, which spins up a simple webserver serving the classified response in a form of:

{
    "class": "vehicle_registration",
    "class_id": 15,
    "confidence": "99.7821569442749",
    "inference_time": 0.055999755859375,
    "tokenizer_time": 1.834416389465332
}

~

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

NeuralNet Document Classifier

Files

README.md

Latest commit

History

README.md

File metadata and controls

NeuralNet Document Classifier