-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding date suggestions to the documents details view #1367
Conversation
Hello @Eckii24, thank you very much for submitting this PR to us! This is what will happen next:
Please allow up to 7 days for an initial review. We're all very excited about new pull requests but we only do this as a hobby. |
This is an awesome PR, thank you! Highly requested. I see some very small issues with frontend code but I will find some time to test and review this hopefully in the next few days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The frontend code looks good to me, I will let others review on backend. In general this works pretty well in testing, though I realize the accuracy of the suggested dates comes from a dependency.
One thought: I would be in favor of setting PAPERLESS_NUMBER_OF_SUGGESTED_DATES
to 3 (or something sane), the other suggestions are enabled by default so I suspect people will not find / use this if its not. And frankly I think its super useful.
Thanks again for your contribution
First of all I had a default of 10. The problem comes into place, if you work on very large documents. The performance of the date parsing library is not that good and the request to the suggestion api will take up to 10 sec. with is pretty noticeable. Therefore my intention was to not introduce a kind of „breaking change“ in those cases. That’s also the reason why I took the dates from the top to the bottom of the document and did not decide about the relevance of each date, because if you really parse all dates in the documents it took in my test for about 25 pages up to 30 sec… |
Would love to hear others' thoughts on the |
Thank you for this PR, it looks really useful! The parsing time is very dependent on the document and hardware being used. With some of my testing documents the extra parsing takes: 23 pg without dates
176 pg with date on pg 160
That said, I did not experience noticeable slowdowns when date parsing on either machine.
I'm curious where this is noticeable? I suppose like shamoon, the only difference I noticed was the date suggestions did not appear until they were parsed. If there is a hit to document load times, processing, etc. then I agree leaving it off should probably be the default. Right now the change seems benign to me. |
Hi @qcasey
That is exactly the point I wanted to make. The time until the suggestions appear (not the time to load the other document values) is delayed depending on the document size and hardware. If the complete suggestions take 10 seconds or longer, one or the other will go on without the date suggestions and continue to enter the value manually, but have the other suggestions directly at hand to use them. In other words, if the suggestions take too long to load, the document editing process will be negatively affected. A solution to this potential problem could be an improvement in the date parsing logic (but current I do not know how to improve it) or introduction of caching or something like that (see #1367 (comment)). Do you see a problem here as well? |
Please correct me if I'm wrong, but you're saying that if:
Then the user will not have any suggested dates to choose because the parsing of all 3 takes too long. If I'm reading that right, would a default of |
I've lost the thread on the status of this. i know it's often requested, so it would be nice to make it available. Even if defaulted to off for the moment while figuring out performance implications |
Agree we should move ahead, only question is about defaulting to off or on. I think we should default to on, maybe leave Unless there are obvious ways to speed things up, bug again, I'd be happy with as-is and potentially improving after-the-fact |
Yeah, I saw we enable it, set to 2 or 3, and see what user's think about any speed issues. |
Great, I set it to 3. Any more issues with this PR? |
A default of 3 sounds greats! Thanks for your feedback and the adjustments. I hope I will find some time next week, because I want to try a caching mechanism. If this will lead to an improvement I will submit a new PR, with the caching in place. So from my point of view, if there are no more other open issues, we can merge this one here. |
Caching would be great if you can figure it out. I think we're good to merge this, its going into |
I'm looking so forward to it! This will be an awesome feature! Maybe we can introduce a ML algorithm in the future that will autofill the "correct" date by learning from other documents falling in the same category (e.g.: docs 1-100 always use the second date that appears, doc 101 will also use the 2nd date). |
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns. |
Proposed change
Suggestions for tags, correspondent and the document type are made in the detail view of a document.
The value of the creation date is only pre-assigned once after parsing with the first date found in the document.
This change includes a suggestion list for other date values found in the document.
To achieve this, the parsing function is adjusted to be able to parse all date occurrences within the document.
Since this can be very time consuming for large documents, a setting is also introduced to control how many dates of the document should be output as a suggestion. If this setting is set to zero, Paperless behaves exactly as before.
An other approach to potentially improve the performance could be to use
dateparser.search.search_dates
like mentioned in #741.I have not tested this yet...EDIT: I quickly checked the usage of
search_dates
. Right now the implementation does not cover the common date formats of the current implementation. This may be improved when the PR there to re-implement the feature is merged....This addresses #384
Type of change
Checklist:
pre-commit
hooks, see documentation.