Tags from consumer sub directories #50

jayme-github · 2020-11-27T09:20:02Z

I made a small patch that lets me set tags based on sub directories of the consumer directory (jayme-github/paperless#2), and I would like to know if you would accept that feature into your fork.
Looking at the feature-ocrmypdf branch you switched back to using INotify again, so my patch would also bring back the ability to run a recursive consumer.

Let me know what you think and I'll rebase against whatever branch makes sense.

jonaswinkler · 2020-11-27T10:31:46Z

That looks like a very neat idea! With this, we could have a TODO folder in the consumption directory and throw in documents we need to take care of.

I should really make some guidelines on how to contribute and how the code is organised in general at some point. That's based on og paperless code, and quite a few things moved and had their responsibilities changed.

master always reflects the latest release. Branch dev is for changes that will be in the next release. Use that. The feature-X branches are for experimental stuff.

consumer.py is not responsible for checking for new files anymore. consume_new_files is gone. The consumer does not care about where the files come from. The mail consumer also uses this and puts files in a temporary directory, entirely skipping the consume folder watching mechanisms.
FileInfo is responsible for getting information from the file name. Path information is not available. See above.
Watching the consume directory is now entirely done in the consumer management command. This is where you should implement your feature like so:
- just before async_task in _consume, check if the feature is enabled and get all sub folders from the given path.
- get any tags from the found sub folders and their corresponding ids.
- pass the ids of the tags as override_tag_ids=[tag1id,tag2id] to async_task. See documents.tasks.py for the interface.
- Watchdog is still used for polling and has a recursive option. No need to write all that folder watching code ourselves.
- I can't think of a good place to put the remove empty directories logic. That also should be part of the consume folder watcher, but it doesn't know when the tasks are completed. I'd skip that for now.
- Please make a test case! See test_management_consumer.test_consume_file on how I tested the other consumer logic. A few hints: Make a tag, make a sub folder in the consumer directory, move the sample file into the sub folder, wait for task_mock to get called and then check its arguments for the relevant tag ids.
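
A minimal sketch of the first two steps above, assuming nothing about the actual management command. The helper names subdir_tag_names and tag_ids_for are hypothetical, and the known_tags dict only stands in for whatever tag lookup the real code would do against the database. The resulting ids are what would be handed to async_task as override_tag_ids.

```python
from pathlib import PurePosixPath


def subdir_tag_names(consume_dir, file_path):
    """Return the names of the sub folders between consume_dir and the file.

    Hypothetical helper: raises ValueError if file_path is not inside consume_dir.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(consume_dir))
    # relative.parts ends with the file name; everything before it is a sub folder.
    return list(relative.parts[:-1])


def tag_ids_for(names, known_tags):
    """Map folder names to tag ids, skipping names without a matching tag.

    known_tags is a hypothetical name -> id mapping standing in for a database lookup.
    """
    return [known_tags[name] for name in names if name in known_tags]


names = subdir_tag_names("/consume", "/consume/todo/bank/scan.pdf")
override_tag_ids = tag_ids_for(names, {"todo": 3, "bank": 7})
```

A file placed directly in the consumption directory yields an empty list, so documents without a sub folder would be consumed without any extra tags.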

jayme-github · 2020-11-27T12:42:38Z

Okay, cool. I'll come up with something.

The remove empty directories thing is probably an edge case of my current setup where I consume files via an ocrmypdf docker container and "forward" them to paperless. In normal cases you will most likely want the folders to stick around anyways as you keep reusing them.

jonaswinkler · 2020-11-27T14:52:39Z

Neat. What options do you use with ocrmypdf? Still considering what to support with paperless. Right now it uses --skip by default, so that OCR is only done when required. --redo and --force are configurable, as well as --pages and --output-type.

jayme-github · 2020-11-27T15:00:24Z

I'm running with ocrmypdf.ocr(language=["deu", "eng"], tesseract_timeout=300, skip_text=True, deskew=False), only generating PDF/A. Oh, and export OMP_THREAD_LIMIT=1. My system is not that fast, and ocrmypdf runs one process (or thread, I don't remember) per page (up to the number of CPUs, I guess), plus tesseract uses multiple threads by default, which made the whole process painfully slow and led to timeouts (for documents with a lot of pages).

I did not check whether it is possible to make ocrmypdf fail on a tesseract timeout. As it currently is, you end up with a PDF that is not fully processed, which I find quite bad.
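
For readers who want to reproduce this setup, the options above could be wired together roughly as follows. This is only a sketch: the run_ocr wrapper and its file arguments are hypothetical, and the ocrmypdf import is done lazily so the option values can be inspected without the package installed.

```python
import os

# Limit tesseract to a single OpenMP thread; child processes inherit this variable.
os.environ["OMP_THREAD_LIMIT"] = "1"

# The keyword arguments quoted in the comment above.
OCR_OPTIONS = {
    "language": ["deu", "eng"],
    "tesseract_timeout": 300,  # seconds
    "skip_text": True,         # leave pages that already contain text alone
    "deskew": False,
}


def run_ocr(input_pdf, output_pdf):
    """Hypothetical wrapper: OCR input_pdf into output_pdf with the options above."""
    import ocrmypdf  # imported here so this module loads without ocrmypdf installed

    return ocrmypdf.ocr(input_pdf, output_pdf, **OCR_OPTIONS)
```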

jonaswinkler · 2020-11-27T15:06:19Z

Thank you. On a side note, tesseract uses twice as much cpu time when two languages are specified.

jayme-github · 2020-11-27T15:09:27Z

Ouch. You know if that's linear? That would be crazy!
I can count myself lucky that all of my documents are in one of those languages then ;)

totti4ever · 2020-11-28T13:07:02Z

Wow, good to know regarding the languages!!

Regarding the main idea of this issue: That is a really neat idea. I might not be able to use it for semantic tags (I would have to set up different targets on the scanner, which might become confusing in its interface), but one could use it as a source tag (e.g. scanner or samba, which then write to different sub folders).

jonaswinkler · 2020-11-28T14:42:49Z

See #23.

The names of sub directories in the consumer directory will be added as tags for the document to be consumed. To enable this, set:

PAPERLESS_CONSUMER_RECURSIVE=1
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=1

Fixes #50

jayme-github changed the title ~~Tags from consumer directories~~ Tags from consumer sub directories Nov 27, 2020

jonaswinkler added Back end feature request New feature or request labels Nov 27, 2020

jonaswinkler assigned jayme-github Nov 28, 2020

totti4ever mentioned this issue Nov 28, 2020

Integration with OCRmyPDF #23

Closed

jayme-github mentioned this issue Nov 29, 2020

Create tags from sub directories #69

Merged

jonaswinkler closed this as completed in fa9a5cc Dec 5, 2020
