Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Dynamic document storage paths #916

Merged
merged 63 commits into from May 19, 2022

Conversation

mskg
Copy link
Contributor

@mskg mskg commented May 10, 2022

Proposed change

What I like about Paperless is the flexibility to provide a file-based document structure, as this allows for easy consumption of the documents with file-based replication. But, the "one size fits all" solution with PAPERLESS_FILENAME_FORMAT is not enough.

The change is simple:

  • Added storage pathes, where each entry is a different PAPERLESS_FILENAME_FORMAT
  • Each document get's a storage path assigned via existing matching mechanisms (all, exact, regular, fuzzy, auto), but is adjustable by the user at any time
  • file_handling.py has been modified to use the per document filename format
  • The implementation is complete, also adding bulk-actions, filtering, proposals to the UI and admin interface. I hope I didn't miss anything

Screenshot 2022-05-10 at 17 23 14

Usage

I use specific type of storage layouts for different purposes:

  • Everything related to warranty claims is a flat directory with file names like Warranty/2022-04 Correspondent Title.pdf

  • Insurance communication go into Insurance/Correspondent/..., whereas related receipts get sorted by year Insurance/Correspondent/Year/...

This also allows easy file-based consumption of related documents. The maintenance/assignment gets a no-brainer with auto learning enabled.

Compatibility

The change is non-intrusive and fully backward compatible.

Not using storage pathes or not having one assigned forces the system into using PAPERLESS_FILENAME_FORMAT.

Known limitations

  • If an existing storage path get's adjusted, this can only be fixed by running document_renamer again
  • There is a breaking change implemented regarding file naming. The implementation removes empty pathes and parts of the filename that would otherwise resolve to none. This is feature-flagged with PAPERLESS_FILENAME_REMOVE_NONE, defaulting to No and such being backward compatible with the existing behaviour.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (please explain)

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • I have checked my modifications for any breaking changes.

@mskg mskg requested review from a team as code owners May 10, 2022 15:31
@mskg mskg changed the title Feature document storage Feature: Dynamic document storage pathes May 10, 2022
@mskg
Copy link
Contributor Author

mskg commented May 10, 2022

I add the missing documentation part, once you're willing to integrate this.

@shamoon
Copy link
Member

shamoon commented May 10, 2022

Thanks for the PR! We will take a look asap. In the meantime this should be targeted to dev and rebased on that branch.

@mskg
Copy link
Contributor Author

mskg commented May 10, 2022

Thanks for the PR! We will take a look asap. In the meantime this should be targeted to dev and rebased on that branch.

Sure. Please take a look and give an indication if you're willing to integrate. If yes, I retarget it - should not be a big thing.

@qcasey qcasey deleted the branch paperless-ngx:dev May 10, 2022 23:11
@qcasey qcasey closed this May 10, 2022
@shamoon
Copy link
Member

shamoon commented May 10, 2022

This was closed automatically because it was targeting that outdated branch, I will change it to dev

Edit: still needs a re-base

@shamoon shamoon reopened this May 10, 2022
@shamoon shamoon changed the base branch from v1.7.1 to dev May 10, 2022 23:33
@mskg
Copy link
Contributor Author

mskg commented May 12, 2022

@shamoon I merged with dev

@shamoon
Copy link
Member

shamoon commented May 12, 2022

@shamoon I merged with dev

Thank you, but I think it needs a re-base, theres still a lot of changes that arent yours.

As for the PR itself, I was waiting for others to chime in. Personally, I dont think this is a feature I would use. My preference is to just store docs in some simple way and let paperless handle things from there rather than complicate the file storage. I bet there are some that would, perhaps a vocal group, but IMHO if its not going to get very wide use I dont think I would be in favor of including it as-is since its so prominent, etc (in that case perhaps it could be made optional or something like that?).

Obviously its not like we have telemetry data or something to have a real clear answer for this, so again, I'd love to hear others' thoughts.

@shamoon
Copy link
Member

shamoon commented May 19, 2022

The docs look good so far, thanks for writing those.

I don't think the document_renamer moves existing documents into new storage paths. Am I missing something there or is that unimplemented?

I think it does? E.g. if you edit a storage path with this feature it doesnt move immediately but then running document_renamer does move it.

@mskg
Copy link
Contributor Author

mskg commented May 19, 2022

I> I think it does? E.g. if you edit a storage path with this feature it doesnt move immediately but then running document_renamer does move it.

It does - as you described it. It always compares the calculated path with db values and acts if required. It only touches files with non-compliant pathes. The implementation is unchanged.

@qcasey
Copy link
Member

qcasey commented May 19, 2022

if you edit a storage path with this feature it doesnt move immediately but then running document_renamer does move it.

Right, I see that (emphasis mine). However I don't see the documents move when they are uploaded before the Storage Path rule. For example:

  1. Upload a document containing the words "test"
  2. Create a Storage Path rule matching the word "test"
  3. Run the document_renamer
  4. Notice the document is not renamed

The same document is matched and renamed according to the Storage Path rule if I upload it after the rule is created.

@mskg
Copy link
Contributor Author

mskg commented May 19, 2022

if you edit a storage path with this feature it doesnt move immediately but then running document_renamer does move it.

Right, I see that (emphasis mine). However I don't see the documents move when they are uploaded before the Storage Path rule. For example:

  1. Upload a document containing the words "test"
  2. Create a Storage Path rule matching the word "test"
  3. Run the document_renamer
  4. Notice the document is not renamed

The same document is matched and renamed according to the Storage Path rule if I upload it after the rule is created.

This is only available for tags - the command is called document_retagger.

I would propose to not implement this, as this would require conflict resolution - priorities of rules. That’s not the case for tags - there can be many.

That’s probably also why it doesn’t exist for correspondents, and types, too.

@qcasey
Copy link
Member

qcasey commented May 19, 2022

Also we should probably add how this feature handles precedence to the docs and UI.

A Document can have multiple Correspondents/Tags/Document Types but only one Storage Path. This feature appears to pick the oldest rule and apply that, regardless of more specific rules created later.

Nevermind, I see your comment above. That's news to me, I was under the misconception we could have multiple.

@mskg
Copy link
Contributor Author

mskg commented May 19, 2022

Nevermind, I see your comment above. That's news to me, I was under the misconception we could have multiple.

I saw many comments of people wanting to have many correspondents (sender, receivers). Definitely worth thinking about 😉

Copy link
Member

@qcasey qcasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is much more straightforward than I initially expected. Of course I'll defer to shamoon on the remaining UI stuff but I've tested this a bit and I personally don't have any more concerns.

@shamoon
Copy link
Member

shamoon commented May 19, 2022

Ok great, Im pretty happy with these changes, I do think they were important, appreciate your patience. Also after spending some time with this PR I can see how it will probably be useful to others.

My only outstanding concern is about validating the input. On the one hand users can make their own bad decisions =) on the other hand we might want to help prevent people making unintended mistakes. The link to the docs / example in the hint is helpful but especially for variables I still think we should validate them, for example if someone typed {cerated_year}. I added a very crude test for this, basically it tries to format the string the way generate_filename does with dummy replacements just so that it can catch a KeyError and then the frontend can show a helpful error, see screenshot.

Really, thats it! Let me know if theres a better way to do the validation, ofc.

Screen Shot 2022-05-19 at 1 56 59 PM

@mskg
Copy link
Contributor Author

mskg commented May 19, 2022

Really, thats it! Let me know if theres a better way to do the validation, ofc.

Syntax wise {bad variables} are the only real error that would result in the whole formula not being used. If we validate that - 90% of all errors.

Possible addition: Show him a formatted example while he types. Than we might also catch the unexpected result.

Really appreciate your work 👍

@shamoon
Copy link
Member

shamoon commented May 19, 2022

Yes that would be cool or I even thought about some kind of visual drag-and-drop tool to build the path but that seems like another PR.

As long as you Python folks are OK with my validate_path implementation, then as they say. LGTM 😄

@shamoon shamoon merged commit 69ef26d into paperless-ngx:dev May 19, 2022
@shamoon shamoon changed the title Feature: Dynamic document storage pathes Feature: Dynamic document storage paths May 19, 2022
@shamoon shamoon added this to the Next Release milestone May 20, 2022
gador pushed a commit to gador/paperless-ngx that referenced this pull request Jun 7, 2022
* Added devcontainer

* Add feature storage pathes

* Exclude tests and add versioning

* Check escaping

* Check escaping

* Check quoting

* Echo

* Escape

* Escape :

* Double escape \

* Escaping

* Remove if

* Escape colon

* Missing \

* Esacpe :

* Escape all

* test

* Remove sed

* Fix exclude

* Remove SED command

* Add LD_LIBRARY_PATH

* Adjusted to v1.7

* Updated test-cases

* Remove devcontainer

* Removed internal build-file

* Run pre-commit

* Corrected flak8 error

* Adjusted to v1.7

* Updated test-cases

* Corrected flak8 error

* Adjusted to new plural translations

* Small adjustments due to code-review backend

* Adjusted line-break

* Removed PAPERLESS prefix from settings variables

* Corrected style change due to search+replace

* First documentation draft

* Revert changes to Pipfile

* Add sphinx-autobuild with keep-outdated

* Revert merge error that results in wrong storage path is evaluated

* Adjust styles of generated files ...

* Adds additional testing to cover dynamic storage path functionality

* Remove unnecessary condition

* Add hint to edit storage path dialog

* Correct spelling of pathes to paths

* Minor documentation tweaks

* Minor typo

* improving wrapping of filter editor buttons with new storage path button

* Update .gitignore

* Fix select border radius in non input-groups

* Better storage path edit hint

* Add note to edit storage path dialog re document_renamer

* Add note to bulk edit storage path re document_renamer

* Rename FILTER_STORAGE_DIRECTORY to PATH

* Fix broken filter rule parsing

* Show default storage if unspecified

* Remove note re storage path on bulk edit

* Add basic validation of filename variables

Co-authored-by: Markus Kling <markus@markus-kling.net>
Co-authored-by: Trenton Holmes <holmes.trenton@gmail.com>
Co-authored-by: Michael Shamoon <4887959+shamoon@users.noreply.github.com>
Co-authored-by: Quinn Casey <quinn@quinncasey.com>
@shamoon shamoon deleted the feature-document-storage branch February 17, 2023 17:24
@github-actions
Copy link
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backend frontend non-trivial Requires approval by several team members work-in-progress Pull request which needs some changes before being able to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants