Adding date suggestions to the documents details view #1367

Merged
merged 11 commits into paperless-ngx:dev
Aug 25, 2022

Conversation

Eckii24
Contributor

@Eckii24 Eckii24 commented Aug 6, 2022

Proposed change

The detail view of a document already suggests tags, a correspondent, and a document type.
The creation date, however, is pre-filled only once, after parsing, with the first date found in the document.

This change adds a suggestion list with the other dates found in the document.


To achieve this, the parsing function is adjusted so that it can parse all date occurrences within the document.
Since this can be very time-consuming for large documents, a setting is also introduced to control how many of the document's dates are returned as suggestions. If this setting is set to zero, Paperless behaves exactly as before.
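
A minimal sketch of how such a limit could work, not the code from this PR: the `suggest_dates` helper, the `iter_dates` generator, and the ISO-only regular expression are all hypothetical stand-ins for the real parser. It collects distinct dates from top to bottom, stops once the limit is reached, and returns nothing when the limit is zero.

```python
import re
from datetime import date
from typing import Iterator, List

# Hypothetical, simplified pattern: only matches ISO-style dates (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def iter_dates(text: str) -> Iterator[date]:
    """Lazily yield each valid date in the order it appears in the text."""
    for match in DATE_PATTERN.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            yield date(year, month, day)
        except ValueError:
            # Skip matches that look like dates but are not (e.g. 2022-13-40).
            continue


def suggest_dates(text: str, limit: int) -> List[date]:
    """Return up to `limit` distinct dates; a limit of 0 disables suggestions."""
    if limit <= 0:
        return []
    seen = set()
    suggestions: List[date] = []
    for found in iter_dates(text):
        if found in seen:
            continue
        seen.add(found)
        suggestions.append(found)
        if len(suggestions) == limit:
            # Stop scanning as soon as enough suggestions were collected.
            break
    return suggestions
```

Because `iter_dates` is a generator, the rest of the text is never scanned once the limit is reached, which is the property that keeps the cost bounded for large documents that contain enough dates near the top.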

Another approach that could improve performance is to use dateparser.search.search_dates, as mentioned in #741. I have not tested this yet...
EDIT: I quickly checked search_dates. Right now it does not cover the common date formats that the current implementation handles. This may improve once the PR there to re-implement the feature is merged...

This addresses #384

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (please explain)

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • I have checked my modifications for any breaking changes.

@paperless-ngx-secretary paperless-ngx-secretary bot added the backend, documentation, frontend, and non-trivial labels Aug 6, 2022
@paperless-ngx-secretary

Hello @Eckii24,

thank you very much for submitting this PR to us!

This is what will happen next:

  1. My robotic colleagues will check your changes to see if they break anything. You can see the progress below.
  2. Once that is finished, human contributors from paperless-ngx review your changes. Since this is a non-trivial change, a review from at least two contributors is required.
  3. Please improve anything that comes up during the review until your pull request gets approved.
  4. Your pull request will be merged into the dev branch. Changes there will be tested further.
  5. Eventually, changes from you and other contributors will be merged into main and a new release will be made.

Please allow up to 7 days for an initial review. We're all very excited about new pull requests but we only do this as a hobby.
If any action will be required by you, please reply within a month.

@Eckii24 Eckii24 marked this pull request as ready for review August 6, 2022 11:40
@Eckii24 Eckii24 requested review from a team as code owners August 6, 2022 11:40
@shamoon
Member

shamoon commented Aug 6, 2022

This is an awesome PR, thank you! Highly requested. I see some very small issues with frontend code but I will find some time to test and review this hopefully in the next few days.

Member

@shamoon shamoon left a comment


The frontend code looks good to me, I will let others review on backend. In general this works pretty well in testing, though I realize the accuracy of the suggested dates comes from a dependency.

One thought: I would be in favor of setting PAPERLESS_NUMBER_OF_SUGGESTED_DATES to 3 (or something sane). The other suggestions are enabled by default, so I suspect people will not find or use this if it's not. And frankly I think it's super useful.

Thanks again for your contribution

@shamoon shamoon added this to the v1.8.1 milestone Aug 7, 2022
@Eckii24
Contributor Author

Eckii24 commented Aug 7, 2022

One thought: I would be in favor of setting PAPERLESS_NUMBER_OF_SUGGESTED_DATES to 3 (or something sane). The other suggestions are enabled by default, so I suspect people will not find or use this if it's not. And frankly I think it's super useful.

Initially I had a default of 10. The problem arises when you work with very large documents. The date parsing library is not very fast, and the request to the suggestion API can take up to 10 seconds, which is pretty noticeable.

Therefore my intention was to avoid introducing a kind of "breaking change" in those cases.

That's also why I took the dates from the top to the bottom of the document and did not rank each date by relevance: parsing every date in a document of about 25 pages took up to 30 seconds in my test.

@shamoon
Member

shamoon commented Aug 7, 2022

Would love to hear others' thoughts on the PAPERLESS_NUMBER_OF_SUGGESTED_DATES question. In real world testing it hasn't slowed things down for me so I would be in favor of a default > 0, but totally understand your concerns.

@qcasey qcasey added the enhancement label Aug 7, 2022
@qcasey
Member

qcasey commented Aug 7, 2022

Thank you for this PR, it looks really useful!

The parsing time depends heavily on the document and the hardware being used. With some of my test documents the extra parsing takes:

23 pg without dates

  • Laptop: ~2.3 seconds
  • RPi 3A: ~55.4 seconds

176 pg with date on pg 160

  • Laptop: ~0.2855 seconds
  • RPi 3A: ~5.3 seconds

That said, I did not experience noticeable slowdowns when date parsing on either machine.

the request to the suggestion API can take up to 10 seconds, which is pretty noticeable.

I'm curious where this is noticeable? I suppose like shamoon, the only difference I noticed was the date suggestions did not appear until they were parsed. If there is a hit to document load times, processing, etc. then I agree leaving it off should probably be the default. Right now the change seems benign to me.

@Eckii24
Contributor Author

Eckii24 commented Aug 9, 2022

Hi @qcasey

I'm curious where this is noticeable? I suppose like shamoon, the only difference I noticed was the date suggestions did not appear until they were parsed. If there is a hit to document load times, processing, etc. then I agree leaving it off should probably be the default. Right now the change seems benign to me.

That is exactly the point I wanted to make. The time until the suggestions appear (not the time to load the other document values) is delayed depending on the document size and hardware.

If the complete set of suggestions takes 10 seconds or longer, some users will carry on without the date suggestions and enter the value manually, but have the other suggestions directly at hand to use.

In other words, if the suggestions take too long to load, the document editing process will be negatively affected.

A solution to this potential problem could be an improvement in the date parsing logic (but currently I do not know how to improve it) or the introduction of caching or something similar (see #1367 (comment)).
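
As a rough illustration of the caching idea only, here is a sketch that keys an in-memory cache on a hash of the document content plus the limit. Nothing here comes from the PR: `cached_suggestions`, the `_cache` dictionary, and the `parse` callback are hypothetical names, and a real deployment would likely need a shared or persistent cache rather than a module-level dictionary.

```python
import hashlib
from datetime import date
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory cache: (content digest, limit) -> suggested dates.
_cache: Dict[Tuple[str, int], List[date]] = {}


def cached_suggestions(
    content: str,
    limit: int,
    parse: Callable[[str, int], List[date]],
) -> List[date]:
    """Run the (slow) parser once per distinct content and limit."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    key = (digest, limit)
    if key not in _cache:
        # Cache miss: pay the parsing cost once and remember the result.
        _cache[key] = parse(content, limit)
    # Return a copy so callers cannot mutate the cached list.
    return list(_cache[key])
```

With this shape, only the first request for a document's suggestions pays the parsing cost; repeated visits to the detail view are served from the cache until the content or the limit changes.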

Do you see a problem here as well?

@qcasey
Member

qcasey commented Aug 10, 2022

some users will carry on without the date suggestions and enter the value manually, but have the other suggestions directly at hand to use.

Please correct me if I'm wrong, but you're saying that if:

  1. PAPERLESS_NUMBER_OF_SUGGESTED_DATES=3
  2. Paperless finds 2 dates somewhat quickly but takes 55+ seconds to find the 3rd date.

Then the user will not have any suggested dates to choose because the parsing of all 3 takes too long.

If I'm reading that right, would a default of PAPERLESS_NUMBER_OF_SUGGESTED_DATES=1 still exhibit this issue? In this case, if the date suggestions take too long to load then there likely isn't a suggestion available anyway.
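
The scenario above can be sketched with a toy model, assuming the parser scans from top to bottom and stops as soon as the limit is reached (which the thread implies but does not show). The `lines_scanned` function and the `"DATE"` marker are hypothetical; the marker simply stands in for any line that contains a parseable date.

```python
from typing import List


def lines_scanned(lines: List[str], limit: int) -> int:
    """Count how many lines a top-to-bottom scan reads before `limit` dates are found."""
    if limit <= 0:
        # Suggestions disabled: nothing is scanned at all.
        return 0
    found = 0
    for scanned, line in enumerate(lines, start=1):
        if "DATE" in line:  # stand-in for "this line contains a date"
            found += 1
            if found == limit:
                return scanned
    # Fewer than `limit` dates exist, so the whole document was read.
    return len(lines)


# Two dates at the top, a third one on the very last line.
document = ["DATE", "DATE"] + ["text"] * 997 + ["DATE"]
print(lines_scanned(document, 1), lines_scanned(document, 2), lines_scanned(document, 3))
```

Under this model the cost is driven by the position of the last date that is needed. A document with fewer dates than the limit is always scanned in full, which would fit the measurement above where the 23-page document without dates took longer than the 176-page one.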

@stumpylog
Member

I've lost the thread on the status of this. I know it's often requested, so it would be nice to make it available, even if it defaults to off for the moment while we figure out the performance implications.

@shamoon
Member

shamoon commented Aug 21, 2022

Agree we should move ahead, only question is about defaulting to off or on. I think we should default to on, maybe leave PAPERLESS_NUMBER_OF_SUGGESTED_DATES low, like 2? No one will see it otherwise. My feeling is if people have issues with the performance after release we're sure to hear about it =)

Unless there are obvious ways to speed things up, but again, I'd be happy with it as-is and potentially improving it after the fact.

@stumpylog
Member

Yeah, I say we enable it, set to 2 or 3, and see what users think about any speed issues.

@shamoon
Member

shamoon commented Aug 25, 2022

Great, I set it to 3. Any more issues with this PR?
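
For readers wondering what a default of 3 means in practice, here is a hedged sketch of reading such a setting from the environment. Only the variable name `PAPERLESS_NUMBER_OF_SUGGESTED_DATES` and the default of 3 come from this thread; the `number_of_suggested_dates` helper and its handling of invalid or negative values are assumptions, not how Paperless necessarily reads it.

```python
import os


def number_of_suggested_dates(default: int = 3) -> int:
    """Read the suggestion limit from the environment, falling back to `default`."""
    raw = os.getenv("PAPERLESS_NUMBER_OF_SUGGESTED_DATES")
    if raw is None:
        return default
    try:
        # Assumption: negative values are clamped to 0, which disables suggestions.
        return max(0, int(raw))
    except ValueError:
        # Assumption: an unparseable value falls back to the default.
        return default
```

Setting the variable to `0` would then turn date suggestions off again for anyone who runs into performance problems.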

@Eckii24
Contributor Author

Eckii24 commented Aug 25, 2022

A default of 3 sounds great! Thanks for your feedback and the adjustments.

I hope to find some time next week, because I want to try a caching mechanism. If this leads to an improvement, I will submit a new PR with the caching in place.

So from my point of view, if there are no other open issues, we can merge this one.

@shamoon
Member

shamoon commented Aug 25, 2022

Caching would be great if you can figure it out. I think we're good to merge this; it's going into dev anyway and we can test it for the next release, 1.8.1. Thanks @Eckii24 !

@shamoon shamoon merged commit d40c134 into paperless-ngx:dev Aug 25, 2022
@Eckii24 Eckii24 deleted the feat/date-suggestions branch September 5, 2022 19:44
@ghost

ghost commented Sep 11, 2022

I'm really looking forward to it! This will be an awesome feature! Maybe we can introduce an ML algorithm in the future that autofills the "correct" date by learning from other documents in the same category (e.g., docs 1-100 always use the second date that appears, so doc 101 will also use the 2nd date).

@github-actions
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2023
Labels
backend, documentation, enhancement, frontend, non-trivial
5 participants