classifier does not work #9902

DrMartinus · 2025-05-08T14:30:40Z

DrMartinus
May 8, 2025

Hi,
I posted about this some time ago, but other, more important issues came up. I looked for the old post and couldn't find it; maybe it was somewhere else? Anyway, it has most likely been closed by now.
I really wish paperless-ngx would classify my documents, but it just does not. In the log I find the note: Document classification model does not exist (yet), not performing automatic matching.
So far, so bad. I remember being pointed to a support page with commands for starting the classification process manually, which I did (document_create_classifier), and I saw in the log that the process had started. It ran for several hours and then got killed. There is no indication of why it was killed. I didn't do anything at the time; I was usually away from the PC and only found out later.
Maybe there is a hint after all: at the command line, I got this output:
/usr/local/bin/document_create_classifier: line 14: 1507 Killed s6-setuidgid paperless python3 manage.py document_create_classifier "$@"

I repeated the process several times. Each time it got killed after two to three hours, mostly after it had started training the correspondent classifier (so the tags classifier had already been trained), though not immediately after starting that part of the training but somewhere along the way. It never got past training the correspondent classifier.

The result is that there is still no classification. In the meantime I have stored more than 3400 documents in paperless (initially I thought it was still learning and therefore didn't add tags and other info, but after a few weeks I realized that couldn't be the reason).
Classifying all existing documents myself will be cumbersome. I will do it, of course, but it is frustrating to know that it could work otherwise, and that what I am doing has no effect on the classifier because it simply isn't running.
Does anyone have an idea what I could do to get it working (besides running document_create_classifier)?

stumpylog · 2025-05-08T18:19:13Z

stumpylog
May 8, 2025
Maintainer

A process is typically killed like this because it ran out of memory. You may need to increase the resources available to the container, virtual machine, or however you have installed it.
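To see how much memory headroom the machine has before and during training, the figures can be read from /proc/meminfo. The following is a minimal sketch in Python, assuming a Linux host; the function names parse_meminfo and used_percent are hypothetical helpers for illustration, not part of paperless-ngx. Note that inside a container this file usually reflects the host's memory, so a separate container memory limit, if one is set, would not show up here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def used_percent(info):
    """Share of RAM in use, derived from MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


# Usage on a Linux host (not executed here):
#   with open("/proc/meminfo") as fh:
#       print(f"RAM in use: {used_percent(parse_meminfo(fh.read())):.1f}%")
```

If the reported usage approaches 100% while document_create_classifier is running, running out of memory is the likely cause of the kill.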

4 replies

jloehel May 9, 2025

Is it possible to train the model on a different machine? I am running paperless on a small rock64 and run into a segfault during training. How many resources do I need for the training?

stumpylog May 9, 2025
Maintainer

You would need all the data, or at least the database, on the other machine, then transfer the pickle file back. You would also have to disable the automatic training job. It would have to be transferred back and forth for updates.

So it's possible, but not easy.

jloehel May 9, 2025

Can I add an additional Celery worker and assign this task specifically to that worker?

stumpylog May 9, 2025
Maintainer

That's also not possible

DrMartinus · 2025-05-09T15:50:27Z

DrMartinus
May 9, 2025
Author

OK, I have checked now: the NAS has 8 GB RAM. I stopped as many services as possible on that machine, but it hasn't worked yet. This time it got as far as training the documents classifier, which is progress compared with before, when it got killed while training the correspondents classifier, but it still got killed. I noticed that RAM consumption went up to 80-90% in the first step, then rose by about another 3% for the correspondents training, and the last step took it to a maximum of 96%, at which point it got killed. Before it started, RAM consumption was at about 50%. Even though the manufacturer states 8 GB as the limit, some people have managed to run that NAS with more RAM, so I will try that.

0 replies

DrMartinus · 2025-05-13T08:56:15Z

DrMartinus
May 13, 2025
Author

Yesterday I got two 8 GB RAM modules and installed them in my Asustor AS6104T NAS, which doubled the RAM (and is also double the maximum the manufacturer allows for this NAS). After verifying that the modules were accepted, I started the (re)training again, and it ran through: the classifier finished by saving the data file. Maximum RAM usage was around 70% of the 16 GB available.
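As a rough sanity check on the numbers reported in this thread (about 50% of 8 GB in use before training, 96% of 8 GB when the process was killed, and a peak of around 70% of 16 GB after the upgrade), the percentages can be converted into absolute figures. This is only a sketch: the percentages are approximate, they include everything else running on the NAS, and the helper name gb_used is made up for illustration.

```python
def gb_used(total_gb, percent):
    """Convert a RAM usage percentage into gigabytes."""
    return total_gb * percent / 100.0


baseline_8gb = gb_used(8, 50)    # in use before training started
killed_8gb = gb_used(8, 96)      # in use when the process was killed
peak_16gb = gb_used(16, 70)      # peak during the successful run

print(f"Baseline on 8 GB:       {baseline_8gb:.2f} GB")
print(f"At the kill on 8 GB:    {killed_8gb:.2f} GB")
print(f"Peak on 16 GB:          {peak_16gb:.2f} GB")
print(f"Peak exceeds old total: {peak_16gb > 8}")
```

Under these assumptions the successful run peaked at roughly 11.2 GB in total, which is more than the original 8 GB, so the earlier runs could not have completed on that machine without freeing considerably more memory.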

Today I started adding new documents. First, I tagged and otherwise categorized a batch of about 20 papers with very similar content, in particular the same title (a kind of magazine). They all got almost the same tags (at least three are identical), the same correspondent, the same document type, and the same path. The path, the correspondent, and one of the four tags did not exist yet when I trained the classifier.

Then I uploaded one file of the same type (magazine) to see whether it would be classified. It wasn't, so I added the tags etc. manually again. Another upload of a similar file yielded the same result. After uploading 8 documents that should have been classified the same way as the 20 I had classified before, but weren't, I wonder: does the classifier work at all? It seems it doesn't.

I checked user rights - I am a superuser, so I assume that isn't the problem. In the log I don't see any hint that the classifier doesn't work - but also no hint that it does. The word classifier last appeared during the training.
So what could be wrong? Some setting that I missed? (It may be of interest that the classifier has never worked, right from when I started using paperless-ngx.)
Mine is running in a Docker container; does that matter? The maintainer of the app says that the software is downloaded from the official web page and no changes are made.

1 reply

shamoon May 13, 2025
Maintainer

Your tags / correspondents etc. need to have matching set to 'auto' and there need to be tagged documents without an inbox tag.

Then when you train the classifier the logs should show something like:

[2025-05-13 08:35:40,389] [DEBUG] [paperless.classifier] Gathering data from database...
[2025-05-13 08:35:40,433] [DEBUG] [paperless.classifier] 69 documents, 2 tag(s), 4 correspondent(s), 2 document type(s). 0 storage path(s)
[2025-05-13 08:35:40,433] [DEBUG] [paperless.classifier] Vectorizing data...
[2025-05-13 08:35:46,789] [DEBUG] [paperless.classifier] Training tags classifier...
[2025-05-13 08:36:23,620] [DEBUG] [paperless.classifier] Training correspondent classifier...
[2025-05-13 08:36:26,455] [DEBUG] [paperless.classifier] Training document type classifier...
[2025-05-13 08:39:24,159] [DEBUG] [paperless.classifier] There are no storage paths. Not training storage path classifier.
[2025-05-13 08:39:24,166] [INFO] [paperless.tasks] Saving updated classifier model to /usr/src/paperless/data/classification_model.pickle...

Or at least

[2025-05-13 02:01:03,325] [DEBUG] [paperless.classifier] Gathering data from database...
[2025-05-13 02:01:03,410] [INFO] [paperless.classifier] No updates since last training
[2025-05-13 02:01:03,412] [DEBUG] [paperless.tasks] Training data unchanged.
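To check a longer log for these lines without reading it by eye, something like the following could be used. It is a minimal sketch in Python that assumes the log format shown above ([timestamp] [LEVEL] [logger] message) and keys on the three messages quoted here; the function name classifier_status and its return values are made up for illustration and are not part of paperless-ngx.

```python
import re

# Matches lines such as:
# [2025-05-13 08:35:40,389] [DEBUG] [paperless.classifier] Gathering data from database...
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)


def classifier_status(log_text):
    """Report the most recent training outcome found in the log text.

    Returns 'trained', 'unchanged', or 'no training found'.
    """
    status = "no training found"
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        if msg.startswith("Saving updated classifier model"):
            status = "trained"
        elif msg.startswith(("No updates since last training",
                             "Training data unchanged")):
            status = "unchanged"
    return status
```

A result of 'no training found' for a log that covers a scheduled training run would suggest that the training task never reached either of the outcomes shown above.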

DrMartinus · 2025-05-14T09:50:16Z

DrMartinus
May 14, 2025
Author

Thank you for your reply. It seems to have started working. Today I added a few new documents; some of them received tags, correspondent, path and type, but a few didn't. Probably some more learning is still needed. Doesn't the training run automatically? If it does, is there a way to change its schedule, i.e. set the time and interval?
BTW, all tags etc. are set to auto matching, and there is no inbox tag.

1 reply

shamoon May 14, 2025
Maintainer

Re the schedule: https://docs.paperless-ngx.com/configuration/#PAPERLESS_TRAIN_TASK_CRON

2025-11-11T03:24:57Z

github-actions[bot]
Bot Nov 11, 2025

This discussion has been automatically closed due to inactivity. Please see our contributing guidelines for more details.

0 replies

2025-12-11T03:32:03Z

github-actions[bot]
Bot Dec 11, 2025

This discussion has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion for related concerns. See our contributing guidelines for more details.

0 replies

Replies: 6 comments · 6 replies
