[OCR] Memory leak #725

jflesch · 2017-12-27T12:02:33Z

It looks like there is a memory leak in the OCR process.
See #724 (comment) for reference

Moini · 2017-12-27T17:06:10Z

Thanks! Let me know if there's anything that I can do to help find the cause.

jflesch · 2017-12-28T12:36:27Z

This ticket is going to be trickier to fix. I think the simplest way for me to go about it is to disable the duplicated-PDF detection and import a PDF without text many times.
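A minimal sketch of how such a repeated-import test could be watched for growth. It is not Paperwork code: `sample_peak_rss` and the `action` callable are hypothetical names, and the real import call would have to be plugged in as `action`. It relies on the standard `resource` module, which is Unix-only; `ru_maxrss` is the peak resident set size (kilobytes on Linux, bytes on macOS), so it can reveal growth but never shrinks.

```python
import resource


def sample_peak_rss(action, iterations):
    """Call action() repeatedly and record the process peak RSS after each call.

    `action` is a placeholder for "import the same text-free PDF once".
    """
    samples = []
    for _ in range(iterations):
        action()
        usage = resource.getrusage(resource.RUSAGE_SELF)
        samples.append(usage.ru_maxrss)  # peak RSS so far, never decreases
    return samples
```

If the peak keeps climbing across many identical imports instead of levelling off, that points at memory that is never released.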

Moini · 2017-12-28T12:56:51Z

Some additional info:

It already output that it was opening the same PDF multiple times (IIRC, about 15 times in a row for each document; it's impossible to see in the edited log, sorry about that).

If it could keep the PDF names (and would allow naming PDFs), would that help with this? (I'd love that... because it would allow the collection to work even when Paperwork doesn't work - but it might be a policy thing...)

jflesch · 2017-12-28T13:43:02Z

> it already output that it was opening the same pdf multiple times

It's not surprising actually.
The main problem with poppler-glib + Python is that, while you open the PDF file explicitly, it is only closed when the garbage collector collects the PDF object. So it's unfortunately pretty hard to have clear control over when the file is closed. Also, the PDF object is not serializable, which was a problem for me in paperwork-backend.

So I took the lazy but reliable way: I use lazy initialization. Every time Paperwork needs a piece of information from the PDF (number of pages, page rendering, page text extraction, etc.), it opens the file, gets the information and closes it. The disk cache is taking the hit... :/
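To illustrate the open-per-query pattern described above, here is a minimal sketch. It is not the paperwork-backend implementation: `LazyDocument`, `get_size` and `get_header` are hypothetical names, and a plain binary file handle stands in for the poppler-glib document object. The object stores only the path (so it stays serializable), and every query opens the file and closes it deterministically instead of waiting for the garbage collector.

```python
import contextlib
import os


class LazyDocument:
    """Keeps only the file path; the file is opened anew for every query."""

    def __init__(self, path):
        self.path = path
        self.open_count = 0  # illustration only: counts how often the file is opened

    @contextlib.contextmanager
    def _opened(self):
        self.open_count += 1
        handle = open(self.path, "rb")
        try:
            yield handle
        finally:
            handle.close()  # closed right here, not whenever the GC runs

    def get_size(self):
        # Stand-in for a query such as "number of pages".
        with self._opened() as handle:
            handle.seek(0, os.SEEK_END)
            return handle.tell()

    def get_header(self, length=8):
        # Stand-in for a query such as "extract page text".
        with self._opened() as handle:
            return handle.read(length)
```

The trade-off is the one mentioned above: each query pays for a fresh open, which is why the same PDF shows up as opened many times in a row in the log.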

> it could keep the pdf names

It's a design thing. Giving documents a title is one of the most requested features, but this is beside the point of Paperwork. Paperwork is about being lazy, not about spending time sorting and naming documents.
However, instead of naming it "doc.pdf" every time, I guess it could keep the file name and just not display it in the UI. I've opened a ticket for that point: #726

Moini · 2017-12-28T14:43:23Z

Arghs, yes, sounds like an annoying issue... :-/

Thank you very much for putting thought into this and making a report! Keeping file names would indeed help a lot to make the program more useful to me.

I understand about the design. Yet I'm asking myself if perhaps the requesting users have a point, esp. if they are numerous, as you say...? As long as it's optional, it would help those who ask for it, and could be ignored by those who don't. When I look at the file thumbnails at the left, I feel a bit blind or lost, not knowing what is what (the thumbnails don't help much with certificates or invoices, they all look the same). Esp. if you don't know the specific search term (like mmh, was that certificate about ultrasound or sonography?), or have tagged things badly, or OCR didn't go well.

But I know that it appears to be fashionable now to make GNOME apps as feature-free and uncustomizable as possible, so people who have grown up with their mobile phones can still just tap and use the app, even on a tablet with no keyboard. And that's a valid reason.
There are many programs that fight the 'design' vs. 'user request' fight. And their devs often have good reasons. Sometimes I even agree :-)
Here, though, it is my belief that having good defaults that work for the simplest use(r) doesn't mean that one cannot offer options. But it's perfectly okay if we don't agree here - it's still a great program that you're offering, and I know that if it were a huge need for me, there are other options out there, that I could evaluate instead (but I find the auto-tagging very attractive and I doubt other (FLOSS) software can do that! Didn't see it in action yet, though, I'm hoping I can try it out! I also very much like that it's small, not a client-server architecture monster.). Sorry, probably TL;DR... and way off-topic.

Happy New Year and thank you very much!

jflesch added the "bug" and "to study" labels Dec 27, 2017

jflesch added this to the 1.2.3 milestone Dec 27, 2017

jflesch mentioned this issue Dec 28, 2017

[PDF import] Keep the file names #726

Open

jflesch modified the milestones: 1.2.3, 1.2.4 Feb 1, 2018

jflesch modified the milestones: 1.2.4, 1.3.0 Feb 27, 2018
