This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

[OCR] Memory leak #725

Open
jflesch opened this issue Dec 27, 2017 · 5 comments

@jflesch
Member

jflesch commented Dec 27, 2017

It looks like there is a memory leak in the OCR process.
See #724 (comment) for reference

@jflesch jflesch added this to the 1.2.3 milestone Dec 27, 2017
@Moini

Moini commented Dec 27, 2017

Thanks! Let me know if there's anything that I can do to help find the cause.

@jflesch
Member Author

jflesch commented Dec 28, 2017

This ticket is going to be trickier to fix. I think the simplest way for me to reproduce it is to disable the duplicated-PDF detection and import a PDF without text many times.
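
A minimal sketch of such a reproduction loop, assuming the import step can be wrapped in a plain callable. `measure_growth`, `leaky_import` and `clean_import` are hypothetical names used for illustration only; they are not part of Paperwork. Note that `tracemalloc` only sees allocations made through Python's allocator, so a leak inside a native library would not show up here and the process's resident memory would have to be watched instead.

```python
import tracemalloc


def measure_growth(action, repeats=50):
    """Call action() repeatedly and return Python-level memory growth in bytes."""
    tracemalloc.start()
    action()  # warm-up call, so one-time caches are not counted as a leak
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(repeats):
        action()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# Hypothetical stand-ins for "import the same text-less PDF once".
_leaked = []


def leaky_import():
    # Simulates a buffer that stays referenced after the import finishes.
    _leaked.append(bytearray(100_000))


def clean_import():
    # The buffer is released as soon as the call returns.
    bytearray(100_000)


print("leaky:", measure_growth(leaky_import))
print("clean:", measure_growth(clean_import))
```

If the growth scales with the number of repeats, the leak is on the Python side; if it stays flat while the process still grows, the native side is the more likely suspect.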

@Moini

Moini commented Dec 28, 2017

Some additional info:

It already output that it was opening the same PDF multiple times (IIRC, about 15 times in a row for each document; it's impossible to see in the edited log, sorry about that).

If it could keep the PDF names (and allowed naming PDFs), would that help with this? (I'd love that... because it would allow the collection to work even when Paperwork doesn't work - but it might be a policy thing...)

@jflesch
Member Author

jflesch commented Dec 28, 2017

> it already output that it was opening the same pdf multiple times

It's not surprising actually.
The main problem with poppler-glib + Python is that, while you open the PDF file explicitly, it is only closed when the garbage collector collects the PDF object. So it's unfortunately pretty hard to control exactly when the file is closed. Also, the PDF object is not serializable, which was a problem for me in paperwork-backend.

So I took the lazy but reliable way: I use lazy initialization. Every time Paperwork needs a piece of information from the PDF (number of pages, page rendering, page text extraction, etc.), it opens the file, gets the information and closes it. The disk cache is taking the hit... :/
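
A rough sketch of that open-per-query pattern, not the actual paperwork-backend code. `FakePdf` is a hypothetical stand-in for the poppler document object (it only counts how often it is opened), and `LazyPdfDoc` is an illustrative name. The wrapper stores just the path, opens the file for each query, and drops the reference straight away so the collector can release it.

```python
import gc


class FakePdf:
    """Hypothetical stand-in for a PDF handle; counts how often it is opened."""

    open_count = 0

    def __init__(self, path):
        FakePdf.open_count += 1
        self.path = path
        self.pages = ["page one text", "page two text"]

    def get_n_pages(self):
        return len(self.pages)

    def get_page_text(self, index):
        return self.pages[index]


class LazyPdfDoc:
    """Holds only the path; the file is reopened for every query."""

    def __init__(self, path, opener=FakePdf):
        self.path = path        # no open handle is ever stored on the object
        self._opener = opener

    def _with_pdf(self, query):
        pdf = self._opener(self.path)   # open the file
        try:
            return query(pdf)           # get the information
        finally:
            del pdf                     # drop the only reference...
            gc.collect()                # ...and ask the collector to release it

    @property
    def nb_pages(self):
        return self._with_pdf(lambda pdf: pdf.get_n_pages())

    def page_text(self, index):
        return self._with_pdf(lambda pdf: pdf.get_page_text(index))


doc = LazyPdfDoc("doc.pdf")
print(doc.nb_pages, doc.page_text(0), FakePdf.open_count)
```

Each query costs one open, which is why a log would show the same PDF being opened many times in a row for a single document.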

> it could keep the pdf names

It's a design thing. Giving documents a title is one of the most requested features, but this is beside the point of Paperwork. Paperwork is about being lazy, not about spending time sorting and naming documents.
However, instead of naming it "doc.pdf" every time, I guess it could keep the file name and just not display it in the UI. I've opened a ticket for that point: #726

@Moini

Moini commented Dec 28, 2017

Arghs, yes, sounds like an annoying issue... :-/

Thank you very much for putting thought into this and making a report! Keeping file names would indeed help a lot to make the program more useful to me.

I understand about the design. Yet I'm asking myself if perhaps the requesting users have a point, esp. if they are numerous, as you say...? As long as it's optional, it would help those who ask for it, and could be ignored by those who don't. When I look at the file thumbnails at the left, I feel a bit blind or lost, not knowing what is what (the thumbnails don't help much with certificates or invoices, they all look the same). Esp. if you don't know the specific search term (like mmh, was that certificate about ultrasound or sonography?), or have tagged things badly, or OCR didn't go well.

But I know that it appears to be fashionable now to make GNOME apps as feature-free and uncustomizable as possible, so people who have grown up with their mobile phones can still just tap and use the app, even on a tablet with no keyboard. And that's a valid reason.
There are many programs that fight the 'design' vs. 'user request' fight. And their devs often have good reasons. Sometimes I even agree :-)
Here, though, it is my belief that having good defaults that work for the simplest use(r) doesn't mean that one cannot offer options. But it's perfectly okay if we don't agree here - it's still a great program that you're offering, and I know that if it were a huge need for me, there are other options out there, that I could evaluate instead (but I find the auto-tagging very attractive and I doubt other (FLOSS) software can do that! Didn't see it in action yet, though, I'm hoping I can try it out! I also very much like that it's small, not a client-server architecture monster.). Sorry, probably TL;DR... and way off-topic.

Happy New Year and thank you very much!

@jflesch jflesch modified the milestones: 1.2.3, 1.2.4 Feb 1, 2018
@jflesch jflesch modified the milestones: 1.2.4, 1.3.0 Feb 27, 2018