This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

PDF Import: automatically figure out the date of the document #338

Open
jflesch opened this issue Sep 30, 2014 · 11 comments

Comments

@jflesch
Member

jflesch commented Sep 30, 2014

PDFs have metadata, including "creation date" and "last modified" fields. These dates could be used as the document date.

Note: be careful, some PDFs have crappy dates
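
As a rough illustration of reading such a date: PDF metadata dates are stored as strings of the form `D:YYYYMMDDHHmmSS` followed by an optional timezone offset, where everything after the year may be missing. The sketch below is only a minimal parser for that string. The function name `parse_pdf_date` is hypothetical, the timezone suffix is ignored, and how the raw string is obtained from the file is left to whatever PDF library is in use.

```python
import re
from datetime import datetime
from typing import Optional

# "D:YYYYMMDDHHmmSS..." where every field after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
)


def parse_pdf_date(raw: str) -> Optional[datetime]:
    """Return a naive datetime for a PDF date string, or None if it is unusable.

    Hypothetical helper: the timezone suffix is ignored in this sketch.
    """
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second = match.groups()
    try:
        # Missing fields fall back to the first month/day and to midnight.
        return datetime(
            int(year),
            int(month or 1),
            int(day or 1),
            int(hour or 0),
            int(minute or 0),
            int(second or 0),
        )
    except ValueError:
        # Out-of-range values such as month 13 in a corrupt file.
        return None
```

Returning `None` instead of raising keeps the "crappy dates" case cheap to handle for the caller.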

@tYYGH

tYYGH commented Oct 6, 2014

If this feature does get included into a release, please make it optional.
Following Paperwork’s “scan-and-forget” logic, my papers are not sorted by date-on-the-paper, but by date-of-scan. My papers are a FIFO:
1— I scan
2— I put the paper on the FIFO
3— I forget
4— Any paper at the head of the FIFO, older than 10 years, gets discarded.
It seems to me that this is exactly how Paperwork is intended to work, and the date-on-the-paper has no role here.

@akarzim

akarzim commented Nov 26, 2014

@tYYGH I don't agree with you. I just started using Paperwork a few weeks ago and I have a lot of papers to scan. In this context, the date-of-scan makes no sense: all my papers have the same dummy date.
So retrieving the date-on-the-paper follows the "scan-and-forget" logic more closely than, for example, adding notes manually.

@tYYGH

tYYGH commented Nov 28, 2014

@akarzim I agree with you. You cannot compare first-time setup and regular usage, though. With this feature being optional, it could be enabled for the first-time setup, and then disabled for regular usage.
But then, you may disagree on the "regular usage" part. Well, to each their own ;-)

jflesch modified the milestones: 0.4-unstable, 0.3-unstable Oct 9, 2015
jflesch modified the milestones: 0.5.0, 0.4.0 Feb 19, 2016

@Whisprin

Maybe paperwork should actually keep two dates:

  • scanned/imported at date
  • document date

One could even separate the scanned-at and imported-at dates. I scan documents as they come in and import them in bulk into Paperwork. The imported date is even less interesting to me than the scanned date (e.g. creation date of the pdf).

Also the document date shouldn't be set automatically. Paperwork should suggest dates based on file contents.

  1. Do OCR
  2. Return all strings matching ([0-9]?[0-9]).([0-9]?[0-9]).(20)?[0-9][0-9]
  3. Suggest a list of document dates in the date chooser
  4. Optional: Set document date to best guess (first hit, date closest to current date, etc.)
  5. The regex could be user configurable
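
Steps 2 to 4 could be sketched roughly as follows. This is not Paperwork code: the names `find_dates` and `best_guess` are hypothetical, the pattern is a stricter variant of the one in step 2 (the separator is limited to `.`, `/` or `-` instead of any character), and day-before-month order is assumed.

```python
import re
from datetime import date
from typing import List, Optional

# Stricter variant of the regex from step 2: explicit separators, word boundaries.
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-]((?:20)?\d{2})\b")


def find_dates(text: str) -> List[date]:
    """Return every valid day-month-year date found in the OCR text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        year_num = int(year)
        if year_num < 100:
            # Two-digit years are assumed to belong to the 2000s.
            year_num += 2000
        try:
            found.append(date(year_num, int(month), int(day)))
        except ValueError:
            # Matches such as 99.99.2016 look like dates but are not.
            continue
    return found


def best_guess(candidates: List[date], today: date) -> Optional[date]:
    """Pick the candidate closest to today, one of the heuristics from step 4."""
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs((d - today).days))
```

The list returned by `find_dates` would feed the date chooser, while `best_guess` covers the optional automatic pick.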

If there's a better way to talk about features please let me know.

@jflesch
Member Author

jflesch commented Sep 29, 2016

Hm, below the calendar we could display a list of suggested dates based on the content of the document. I guess it could help first time users quite a lot.

One of the main problems will be finding the dates in the document. Each culture has its own way of writing dates ... :/
The regex you provided will catch all the dates written numerically in Western cultures, I think, but not the ones written with words ("September the 5th 2012" or "5 septembre 2012").

It's not an easy problem, but it would be a fun one to try to solve :-)
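
For the worded forms, one possible starting point is a small month-name table combined with two patterns, one for day-first and one for month-first order. The sketch below only covers English and French, and every name in it (`MONTHS`, `find_worded_dates`) is hypothetical. A real solution would need many more languages and spellings.

```python
import re
from datetime import date
from typing import List

# Month names for English and French only; an assumption for this sketch.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
    "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
    "december": 12,
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10, "novembre": 11,
    "décembre": 12,
}

# "5 septembre 2012", "1er août 2015", "5th September 2012"
DAY_FIRST = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th|er)?\s+([^\W\d_]+)\s+(\d{4})\b")
# "September the 5th 2012", "September 5, 2012"
MONTH_FIRST = re.compile(
    r"\b([^\W\d_]+)\s+(?:the\s+)?(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b"
)


def _add(found: List[date], year: str, name: str, day: str) -> None:
    """Append the date if the word is a known month and the day is valid."""
    month = MONTHS.get(name.lower())
    if month is None:
        return
    try:
        found.append(date(int(year), month, int(day)))
    except ValueError:
        pass


def find_worded_dates(text: str) -> List[date]:
    """Return dates written with a month name, day-first matches first."""
    found: List[date] = []
    for day, name, year in DAY_FIRST.findall(text):
        _add(found, year, name, day)
    for name, day, year in MONTH_FIRST.findall(text):
        _add(found, year, name, day)
    return found
```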

@Whisprin

Whisprin commented Sep 29, 2016

Yes, you're right. Didn't think about different conventions. Even the order of month and day isn't fixed.

But parsing a date, without knowing the format already, from a human readable string sounds like something others have already taken on.
edit: Found this: https://github.com/kvh/recurrent

So there are two problems:

  • Find a string which could be a date
  • Parse that string into a date, taking locale and user preferences into account

Looking at my local paperwork db: I could try to extract dates from a paper.words file on the command line.

@tYYGH

tYYGH commented Sep 29, 2016

I suggest we add a big-data AI to Paperwork, and we crowd-feed it all the dates from all our “words”-files, so that it learns… :-p

@lapineige

A simple comment to say that this kind of feature would be very handy for me: I frequently have to integrate a PDF that is not a freshly scanned document. And sometimes a lot of them at the same time.

Relying on import time is just irrelevant in my case. I can't use it for a search for instance.
So if during import I could activate an option to use the creation/modification time, it would be handy! :)

I don't understand why you want to read the date from the document's content. It would be useful for sure (even if it's more complex), but at least using the document date is handy (I mean the PDF metadata, not the parsed content), since you probably scanned or received the document at almost the same date as the file was created.

@jflesch
Member Author

jflesch commented Oct 9, 2016

Ok, so we have 2 ideas here:

  1. At import time, ask the users if they want to use the current time or the file metadata for the date. Note that you will always find files with crappy metadata like 01-01-1970, so using metadata won't be the default.
  2. On already scanned/imported documents, when editing document properties, suggest the date found in the document (metadata/content)
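
Idea 1 could look roughly like the sketch below: the metadata date is used only when the user opted in and the value passes a plausibility check. The names `is_plausible` and `choose_import_date` are hypothetical, and the 1993 cut-off is an assumption of this sketch (PDF 1.0 dates from 1993, so earlier creation dates such as 01-01-1970 are suspect).

```python
from datetime import datetime
from typing import Optional

# Assumption: no genuine PDF creation date predates the format itself.
MIN_YEAR = 1993


def is_plausible(candidate: Optional[datetime], now: datetime) -> bool:
    """Reject missing dates, pre-PDF dates (e.g. 1970-01-01) and future dates."""
    if candidate is None:
        return False
    return candidate.year >= MIN_YEAR and candidate <= now


def choose_import_date(
    metadata_date: Optional[datetime],
    now: datetime,
    use_metadata: bool = False,
) -> datetime:
    """Return the metadata date only if requested and plausible, else now."""
    if use_metadata and is_plausible(metadata_date, now):
        return metadata_date
    return now
```

Keeping `use_metadata` off by default matches the note that metadata should not be the default value.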

@Whisprin

Whisprin commented Oct 9, 2016

Re: 1) Maybe show the dialog on first import and provide a "remember this setting" flag which can be changed in the settings. At least this dialog shouldn't block batch imports of pdfs from folders.

Re: 2) After looking at some recognized text it turns out that a lot of dates are not parsed correctly. e.g.:
29.o9.2o1e
There might be an option to pass parameters to the OCR engine to improve the recognition of numbers or suggest similar looking characters. Also a preceding "Date: " (in different languages) could be handy.
This already sounds a lot like the suggested big-data AI ;)
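
A much simpler stand-in for the "similar looking characters" idea is to rewrite lookalike letters only inside tokens that already have the shape of a numeric date. The sketch below is an illustration, not a tested OCR post-processor: `fix_ocr_dates` is a hypothetical name, and the lookalike table is a guess based on the single example above (in particular `e` read as `6`).

```python
import re

# Letters that OCR commonly confuses with digits; the table is an assumption.
LOOKALIKES = str.maketrans(
    {"o": "0", "O": "0", "l": "1", "I": "1", "S": "5", "B": "8", "e": "6"}
)

# Tokens shaped like dd.mm.yyyy where digits may have been misread as letters.
_CHARS = "[0-9oOlISBe]"
DATE_SHAPE = re.compile(
    r"\b" + _CHARS + r"{1,2}\." + _CHARS + r"{1,2}\." + _CHARS + r"{2,4}\b"
)


def fix_ocr_dates(text: str) -> str:
    """Replace lookalike letters inside date-shaped tokens, leave the rest alone."""

    def repair(match):
        token = match.group(0)
        # Require at least one real digit so plain words are never rewritten.
        if not any(ch.isdigit() for ch in token):
            return token
        return token.translate(LOOKALIKES)

    return DATE_SHAPE.sub(repair, text)
```

With this table, the misread token above comes back as 29.09.2016, which could then be handed to a date parser.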

edit: I was able to get better OCR results (in general and regarding numbers) using those two commands:
convert -density 300 doc.pdf -colorspace gray doc.png
tesseract doc.png doc.words -l deu hocr

@lapineige

At least this dialog shouldn't block batch imports of pdfs from folders.

👍

I think 1) is a lot easier to implement, while I suppose 2) will require more work/testing.

jflesch modified the milestones: 2.0, 1.1.0 Dec 6, 2016
mjourdan mentioned this issue Jul 19, 2017
7 tasks

6 participants