PDF Import: automatically figure out the date of the document #338

jflesch · 2014-09-30T17:38:35Z

PDFs have metadata, including "creation date" and "last modified" fields. These dates could be used as the document date.

Note: be careful, some PDFs have crappy dates
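
As a rough illustration (not Paperwork code): PDF date strings follow the form D:YYYYMMDDHHmmSSOHH'mm', with everything after the year optional. Below is a minimal Python sketch of parsing one. The helper name `parse_pdf_date` is hypothetical, and returning None for malformed values is an assumption about how bad dates should be handled.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like "D:YYYYMMDDHHmmSSOHH'mm'".
# Only the year is mandatory; O is "+", "-" or "Z".
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?)?$"
)


def parse_pdf_date(raw):
    """Return a datetime for a PDF date string, or None if it is malformed."""
    m = _PDF_DATE.match(raw.strip())
    if m is None:
        return None
    g = m.groupdict()
    try:
        tz = None
        if g["sign"] == "Z":
            tz = timezone.utc
        elif g["sign"] in ("+", "-"):
            offset = timedelta(hours=int(g["tzh"] or 0),
                               minutes=int(g["tzm"] or 0))
            tz = timezone(offset if g["sign"] == "+" else -offset)
        # Missing month/day default to 1, missing time fields to 0.
        return datetime(
            int(g["year"]), int(g["month"] or 1), int(g["day"] or 1),
            int(g["hour"] or 0), int(g["minute"] or 0), int(g["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range fields, e.g. month 13 or a 99-hour offset.
        return None
```

A string that parses cleanly can still be wrong, so the result would need a plausibility check before being trusted.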

tYYGH · 2014-10-06T18:35:45Z

If this feature does get included into a release, please make it optional.
Following Paperwork’s “scan-and-forget” logic, my papers are not sorted by date-on-the-paper, but by date-of-scan. My papers are a FIFO:
1— I scan
2— I put the paper on the FIFO
3— I forget
4— Any paper at the head of the FIFO, older than 10 years, gets discarded.
It seems to me that this is exactly how Paperwork is intended to work, and the date-on-the-paper has no role here.

akarzim · 2014-11-26T17:06:42Z

@tYYGH I don't agree with you. I just started using Paperwork a few weeks ago and I have a lot of papers to scan. In this context, the date-of-scan makes no sense: all my papers have the same dummy date.
So retrieving the date-on-the-paper follows the "scan-and-forget" logic more than adding notes manually, for example.

tYYGH · 2014-11-28T09:31:28Z

@akarzim I agree with you. You cannot compare first-time setup and regular usage, though. With this feature being optional, it could be enabled for the first-time setup, and then disabled for regular usage.
But then, you may disagree on the "regular usage" part. Well, to each their own ;-)

Whisprin · 2016-09-29T11:16:55Z

Maybe paperwork should actually keep two dates:

- scanned/imported-at date
- document date

One could even separate the scanned-at and imported-at dates. I scan documents as they come and import them in bulk into Paperwork. The imported-at date is even less interesting to me than the scanned-at date (e.g. the creation date of the PDF).

Also the document date shouldn't be set automatically. Paperwork should suggest dates based on file contents.

1) Do OCR
2) Return all strings matching ([0-9]?[0-9]).([0-9]?[0-9]).(20)?[0-9][0-9]
3) Suggest a list of document dates in the date chooser
4) Optional: set the document date to the best guess (first hit, date closest to the current date, etc.)
5) The regex could be user-configurable
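
To make the matching step concrete, here is a minimal Python sketch. It uses a slightly stricter variant of the pattern above: an unescaped `.` matches any character, so the separators are restricted to ".", "/" and "-", and both must be the same. Allowing a "19" century prefix as well as "20" is my own assumption, and `find_date_candidates` is a hypothetical name.

```python
import re

# 1-2 digit day and month, then a 2- or 4-digit year. The same separator
# must be used twice, and the match must not be glued to other digits.
DATE_CANDIDATE = re.compile(
    r"(?<!\d)(\d{1,2})([./-])(\d{1,2})\2((?:19|20)?\d{2})(?!\d)"
)


def find_date_candidates(text):
    """Return the date-like substrings of `text`, in order of appearance."""
    return [m.group(0) for m in DATE_CANDIDATE.finditer(text)]


print(find_date_candidates("Berlin, 29.09.2016. Due 5/9/12, ref 123.456.7890"))
# prints ['29.09.2016', '5/9/12']
```

The hits are only strings at this point. Whether "5/9/12" means May or September is a separate decision.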

If there's a better way to talk about features please let me know.

jflesch · 2016-09-29T11:39:58Z

Hm, below the calendar we could display a list of suggested dates based on the content of the document. I guess it could help first time users quite a lot.

One of the main problems will be to find the dates in the document. Each culture has its own way of writing the dates ... :/
The regex you provided will catch all the dates written numerically in Western cultures, I think, but not the ones written with words ("September the 5th 2012" or "5 septembre 2012").

It's not an easy problem, but it would be a fun one to try to solve :-)

Whisprin · 2016-09-29T11:57:16Z

Yes, you're right. Didn't think about different conventions. Even the order of month and day isn't fixed.

But parsing a date, without knowing the format already, from a human readable string sounds like something others have already taken on.
edit: Found this: https://github.com/kvh/recurrent

So there are two problems:

1) Find a string which could be a date
2) Parse that string into a date, taking locale and user preferences into account
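
A minimal Python sketch of the second problem, for the numeric case only. The function names, the `day_first` flag standing in for a locale/user preference, and the rule that two-digit years belong to the 2000s are all assumptions for illustration. Dates written with words would need month-name tables or a dedicated library.

```python
import re
from datetime import date

_NUMERIC = re.compile(r"(\d{1,2})[./-](\d{1,2})[./-](\d{2}|\d{4})")


def parse_numeric_date(candidate, day_first=True):
    """Parse 'a.b.yyyy' as day.month.year (or month.day.year); None if invalid."""
    m = _NUMERIC.fullmatch(candidate.strip())
    if m is None:
        return None
    a, b, year = (int(part) for part in m.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years are in the 2000s
    day, month = (a, b) if day_first else (b, a)
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 29
        return None


def interpretations(candidate):
    """All distinct valid readings of a numeric date, day-first reading first."""
    readings = []
    for day_first in (True, False):
        parsed = parse_numeric_date(candidate, day_first)
        if parsed is not None and parsed not in readings:
            readings.append(parsed)
    return readings
```

With this, "29.09.2016" has a single valid reading, while "5/9/12" has two (5 September and 9 May 2012). That is the case where the user's preference, or a suggestion list, has to decide.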

Looking at my local paperwork db: I could try to extract dates from a paper.words file on the command line.

tYYGH · 2016-09-29T13:10:42Z

I suggest we add a big-data AI to Paperwork, and we crowd-feed it all the dates from all our “words”-files, so that it learns… :-p

lapineige · 2016-10-09T08:44:39Z

Just a simple comment to say that this kind of feature would be very handy for me: I frequently have to import a PDF that is not a freshly scanned document, and sometimes a lot of them at the same time.

Relying on import time is just irrelevant in my case. I can't use it for a search, for instance.
So if during import I could activate an option to use the creation/modification time, it would be handy! :)

I don't understand why you want to read the date from the document's content. It would be useful for sure (even if it's more complex), but at least using the document date is handy (I mean the PDF metadata, not the parsed content), as you probably received the file at almost the same date as it was scanned/created.

jflesch · 2016-10-09T09:19:02Z

Ok, so we have 2 ideas here:

1) At import time, ask the users if they want to use the current time or the file metadata for the date. Note that you will always find files with crappy metadata like 01-01-1970, so using the metadata won't be the default.
2) On already scanned/imported documents, when editing document properties, suggest the dates found in the document (metadata/content)
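
A minimal Python sketch of idea 1), with a guard against bogus metadata. The names and the cutoff are assumptions for illustration: PDF was first released in 1993, so anything earlier (such as 1970-01-01) or in the future is treated as garbage.

```python
from datetime import date

# Assumption: no genuine PDF metadata date predates the format itself (1993).
EARLIEST_PLAUSIBLE = date(1993, 1, 1)


def is_plausible(candidate, today):
    """True if the metadata date is present, not too old and not in the future."""
    return candidate is not None and EARLIEST_PLAUSIBLE <= candidate <= today


def choose_import_date(metadata_date, today, use_metadata=False):
    """Pick the document date at import time.

    The metadata date is used only if the user opted in AND it looks sane;
    otherwise fall back to the import date.
    """
    if use_metadata and is_plausible(metadata_date, today):
        return metadata_date
    return today
```

`use_metadata` defaults to False, matching the point that metadata shouldn't be the default choice.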

Whisprin · 2016-10-09T10:03:27Z

Re: 1) Maybe show the dialog on first import and provide a "remember this setting" flag which can be changed in the settings. At least this dialog shouldn't block batch imports of PDFs from folders.

Re: 2) After looking at some recognized text, it turns out that a lot of dates are not recognized correctly, e.g.:
29.o9.2o1e
There might be an option to pass parameters to the OCR engine to improve the recognition of numbers or suggest similar looking characters. Also a preceding "Date: " (in different languages) could be handy.
This already sounds a lot like the suggested big-data AI ;)
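
A minimal Python sketch of the "similar looking characters" idea, applied after OCR rather than inside the OCR engine. The look-alike table is an assumption, and its "e" to "6" entry is only a guess based on the example above. Letters are swapped only inside tokens that already have a date-like shape, but false positives are still possible.

```python
import re

# Assumed look-alike table; "e" -> "6" is a guess from the "29.o9.2o1e" example.
LOOKALIKES = str.maketrans(
    {"o": "0", "O": "0", "l": "1", "I": "1", "S": "5", "B": "8", "e": "6"}
)

# One "digit": a real digit or one of the look-alike letters above.
_D = r"[0-9oOlISBe]"

# A token shaped like dd.mm.yyyy, not glued to other letters or digits.
DATE_SHAPED = re.compile(
    rf"(?<![0-9A-Za-z]){_D}{{1,2}}[./-]{_D}{{1,2}}[./-]{_D}{{2,4}}(?![0-9A-Za-z])"
)


def repair_dates(text):
    """Replace look-alike letters with digits, only inside date-shaped tokens."""
    return DATE_SHAPED.sub(lambda m: m.group(0).translate(LOOKALIKES), text)


print(repair_dates("Datum: 29.o9.2o1e, siehe Seite 3"))
# prints Datum: 29.09.2016, siehe Seite 3
```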

edit: I was able to get better OCR results (in general and regarding numbers) using those two commands:
convert -density 300 doc.pdf -colorspace gray doc.png
tesseract doc.png doc.words -l deu hocr

lapineige · 2016-10-09T10:34:47Z

"At least this dialog shouldn't block batch imports of pdfs from folders."

👍

I think 1) is a lot easier to implement, while I suppose 2) will require more work/testing.

jflesch added to study new feature labels Sep 30, 2014

jflesch added this to the 0.3-unstable milestone Sep 30, 2014

jflesch modified the milestones: 0.4-unstable, 0.3-unstable Oct 9, 2015

jflesch modified the milestones: 0.5.0, 0.4.0 Feb 19, 2016

jflesch modified the milestones: 2.0, 1.1.0 Dec 6, 2016

mjourdan mentioned this issue Jul 19, 2017

[Meta] pdf handling #655

mjourdan added the gui improvement label Jul 31, 2017
