This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

Tags from consumer sub directories #50

Closed
jayme-github opened this issue Nov 27, 2020 · 8 comments
Labels
feature request New feature or request

Comments

@jayme-github
Contributor

I made a small patch that lets me set tags based on sub directories of the consumer directory (jayme-github/paperless#2), and I would like to know if you would accept that feature into your fork.
Looking at the feature-ocrmypdf branch, you switched back to using INotify, so my patch would also bring back the ability to run a recursive consumer.

Let me know what you think and I'll rebase against whatever branch makes sense.

jayme-github changed the title from "Tags from consumer directories" to "Tags from consumer sub directories" Nov 27, 2020
@jonaswinkler
Owner

jonaswinkler commented Nov 27, 2020

That looks like a very neat idea! With this, we could have a TODO folder in the consumption directory and throw in documents we need to take care of.

I should really make some guidelines on how to contribute and how the code is organised in general at some point. That's based on og paperless code, and quite a few things moved and had their responsibilities changed.

master always reflects the latest release. Branch dev is for changes that will be in the next release. Use that. The feature-X branches are for experimental stuff.

  • consumer.py is not responsible for checking for new files anymore. consume_new_files is gone. The consumer does not care about where the files come from. The mail consumer also uses this and puts files in a temporary directory, entirely skipping the consume folder watching mechanisms.
  • FileInfo is responsible for getting information from the file name. Path information is not available. See above.
  • Watching the consume directory is now entirely done in the consumer management command. This is where you should implement your feature like so:
    • just before async_task in _consume, check if the feature is enabled and get all sub folders from the given path.
    • get any tags from the found sub folders and their corresponding ids.
    • pass the ids of the tags as override_tag_ids=[tag1id,tag2id] to async_task. See documents.tasks.py for the interface.
    • Watchdog is still used for polling and has a recursive option. No need to write all that folder watching code ourselves.
    • I can't think of a good place to put the remove empty directories logic. That also should be part of the consume folder watcher, but it doesn't know when the tasks are completed. I'd skip that for now.
    • Please make a test case! See test_management_consumer.test_consume_file on how I tested the other consumer logic. A few hints: Make a tag, make a sub folder in the consumer directory, move the sample file into the sub folder, wait for task_mock to get called and then check its arguments for the relevant tag ids.
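
The tag lookup steps above can be sketched roughly as follows. This is a minimal illustration, not the code that ended up in the project: `subdir_tag_names` and `tag_ids_for` are hypothetical helpers, and the `tag_ids_by_name` dict stands in for whatever database query would map tag names to ids. Skipping folders that have no matching tag is also an assumption.

```python
from pathlib import PurePath


def subdir_tag_names(filepath, consumption_dir):
    """Return the sub folder names between consumption_dir and the file.

    Hypothetical helper. Raises ValueError if filepath is not located
    inside consumption_dir.
    """
    relative = PurePath(filepath).relative_to(consumption_dir)
    # parent.parts is empty when the file sits directly in consumption_dir
    return list(relative.parent.parts)


def tag_ids_for(filepath, consumption_dir, tag_ids_by_name):
    """Map sub folder names to tag ids, skipping folders without a tag.

    tag_ids_by_name is a plain dict standing in for a real tag lookup
    (assumption for illustration only).
    """
    names = subdir_tag_names(filepath, consumption_dir)
    return [tag_ids_by_name[name] for name in names if name in tag_ids_by_name]


# Example: a file dropped into <consume>/todo/bank/
ids = tag_ids_for(
    "/consume/todo/bank/scan.pdf",
    "/consume",
    {"todo": 1, "bank": 2},
)
# ids == [1, 2]; this list is what would be passed as override_tag_ids
```

The resulting list is the value that would go into `override_tag_ids` when calling `async_task`, and the same ids are what a test could look for in the arguments of `task_mock`.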

@jayme-github
Contributor Author

Okay, cool. I'll come up with something.

The remove empty directories thing is probably an edge case of my current setup where I consume files via an ocrmypdf docker container and "forward" them to paperless. In normal cases you will most likely want the folders to stick around anyways as you keep reusing them.

@jonaswinkler
Owner

Neat. What options do you use with ocrmypdf? Still considering what to support with paperless. Right now it uses --skip by default, so that OCR is only done when required. --redo and --force are configurable, as well as --pages and --output-type.

jonaswinkler added the "Back end" and "feature request" labels Nov 27, 2020
@jayme-github
Contributor Author

jayme-github commented Nov 27, 2020

I'm running with ocrmypdf.ocr(language=["deu", "eng"], tesseract_timeout=300, skip_text=True, deskew=False), generating PDF/A only. Oh, and export OMP_THREAD_LIMIT=1. My system is not that fast, and ocrmypdf runs one process (or thread, I don't remember) per page (up to the number of CPUs, I guess). On top of that, tesseract uses multiple threads by default, which made the whole process painfully slow and led to timeouts (for documents with a lot of pages).

I did not look into whether it is possible to make ocrmypdf fail on a tesseract timeout. As it currently is, you end up with a PDF that is not fully processed, which I find quite bad.
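
Put together, the settings described above could look roughly like this. It is only a sketch: `OCR_OPTIONS` and `run_ocr` are made-up names, `output_type="pdfa"` is an assumed reading of "generating PDF/A only", and setting `OMP_THREAD_LIMIT` through `os.environ` relies on the tesseract child processes inheriting the environment of the Python process.

```python
import os

# Keyword arguments for ocrmypdf.ocr(), taken from the comment above.
# output_type="pdfa" is an assumption based on "generating PDF/A only".
OCR_OPTIONS = {
    "language": ["deu", "eng"],
    "tesseract_timeout": 300,
    "skip_text": True,
    "deskew": False,
    "output_type": "pdfa",
}


def run_ocr(input_pdf, output_pdf):
    """Hypothetical wrapper: limit tesseract threads, then run OCR."""
    # Must be set before tesseract is started so the child process sees it.
    os.environ["OMP_THREAD_LIMIT"] = "1"
    # Third-party package, imported lazily so the options above can be
    # inspected without ocrmypdf being installed.
    import ocrmypdf

    return ocrmypdf.ocr(input_pdf, output_pdf, **OCR_OPTIONS)
```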

@jonaswinkler
Owner

Thank you. On a side note, tesseract uses twice as much cpu time when two languages are specified.

@jayme-github
Contributor Author

Ouch. You know if that's linear? That would be crazy!
I can count myself lucky that all of my documents are in one of those languages then ;)

@totti4ever

totti4ever commented Nov 28, 2020

Wow, good to know regarding the languages!!

Regarding the main idea of this issue: That is a really neat idea. I might not be able to use it for semantic tags (I would have to set up different targets on the scanner, which might become confusing at its interface), but one could use it as a source tag (e.g. scanner or samba, which then write to different sub folders).

@jonaswinkler
Owner

See #23.

jonaswinkler pushed a commit that referenced this issue Nov 30, 2020
The names of sub directories in the consumer directory will be added as
tags for the document to be consumed.
To enable this, set:
PAPERLESS_CONSUMER_RECURSIVE=1
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=1

Fixes #50

3 participants