
Buenos Aires: MemoryError #154

Closed
jpmckinney opened this issue May 9, 2019 · 21 comments

@jpmckinney
Member

@romifz:

in collection_file table, some files have ["MemoryError()"] in the errors column

@odscjames:

Romi got it to work locally by using Flask with the dev server option. I wonder if it's uWSGI, then? But if it were, that's not how I would expect that error to appear.

@romifz
Contributor

romifz commented May 28, 2019

This also happens with the buenos-aires collection (@yolile).

By the way, shouldn't this issue be on the kingfisher-process project?

@jpmckinney
Member Author

jpmckinney commented May 28, 2019

I don't know - is the MemoryError occurring during a crawl, and then getting sent to kingfisher-process' file_errors API endpoint? If so, I think the fix is in kingfisher-scrape.

@romifz
Contributor

romifz commented May 28, 2019

Apparently no errors are sent during a crawl; see this log. The API calls aren't returning error codes either (I've seen HTTP error codes when sending files with incorrect formats).

@robredpath
Contributor

@romifz is this something that you'd like someone like @odscjames to look into? How urgent is it?

@romifz
Contributor

romifz commented Jun 5, 2019

It's not urgent; for now I've been downloading the files with kingfisher-scrape locally and loading them into kingfisher-process using local-load.

@odscjames
Contributor

When I run this locally on the Django debug server, I see memory usage of 681 MB for processing the largest file. uWSGI is currently capped at 500 MB. (Though I am puzzled this isn't the memory error I've seen before from uWSGI.)

Anyway, the short-term fix is simply to increase the memory limit; I think that server has 64 GB of RAM, after all.

The long-term fix would be to use the Redis queue: when a large file comes in, the web process simply stores it, returns a response to the caller as quickly as possible, and puts a message on the queue. A Redis worker then processes it; the workers have no memory limits, I think, and they can take longer to process larger files without an HTTP agent waiting for a response.
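
As an illustration of that pattern only (not Kingfisher Process's actual code), a minimal sketch assuming redis-py; the 'kingfisher:incoming' queue name and the handle_file() callback are hypothetical:

    # Minimal sketch of the store-then-queue pattern, assuming redis-py.
    # The queue name and the handle_file() callback are hypothetical.
    import json

    import redis

    r = redis.Redis()

    def enqueue_file(collection_id, filename):
        # Web process: record where the file is and return to the caller immediately.
        r.rpush('kingfisher:incoming', json.dumps({
            'collection_id': collection_id,
            'filename': filename,
        }))

    def worker_loop(handle_file):
        # Worker process: no uWSGI memory or request-time limits apply here.
        while True:
            _, raw = r.blpop('kingfisher:incoming')  # blocks until a message arrives
            message = json.loads(raw)
            handle_file(message['collection_id'], message['filename'])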

odscjames added a commit to OpenDataServices/opendataservices-deploy that referenced this issue Jun 21, 2019
@odscjames
Contributor

The limit is now increased to 1 GB. We also have a 10-minute limit per request, but we haven't seen that exceeded yet.

@romifz
Contributor

romifz commented Jul 3, 2019

@odscjames I checked the Buenos Aires file and it is ~800 MB (uncompressed), so the memory limit should work for now! The file gets updated every 15 days, so it will grow over time, but I think it would take a while to reach the 1 GB cap.

@odscjames
Contributor

Python will use more memory to load the file than the file's size on disk, but I'm not sure how much more. I'll increase the limits a bit for now anyway.
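
For a rough way to measure that overhead locally, a small sketch using only the standard library (the 'releases.json' path is a placeholder):

    # Rough measurement of peak memory while parsing a JSON file with json.load().
    # 'releases.json' is a placeholder path.
    import json
    import tracemalloc

    tracemalloc.start()
    with open('releases.json', encoding='utf-8') as f:
        data = json.load(f)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'peak memory while parsing: {peak / 1024 / 1024:.0f} MB')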

@robredpath
Contributor

@odscjames changed the title from "Honduras ONCAE: MemoryError" to "Honduras ONCAE and Buenos Aires scrapers running on hosted Kingfisher: MemoryError" Aug 5, 2019
@odscjames changed the title from "Honduras ONCAE and Buenos Aires scrapers running on hosted Kingfisher: MemoryError" to "Honduras ONCAE and Buenos Aires: MemoryError" Aug 5, 2019
@odscjames
Contributor

I increased the RAM limit to 4GB and Buenos Aires still errors. This needs looking at more.

@odscjames
Contributor

I have looked into this and something isn't right here.

I think there is actually only one place in the code that would cause it to crash the way it does:

                # json.load() reads and parses the entire file into memory at once
                with open(file_to_store.get_filename(), encoding=encoding) as f:
                    data = json.load(f)

I tested locally and got 3.3 GB of memory use opening the 800 MB Buenos Aires file, which is big; but even after upping the limit to 16 GB on the live server it still crashed, so something else is going on.

Need to dig into this more.

(We hear that local load works, so an incremental improvement to what's currently there, along the lines explored in open-contracting/kingfisher-process#171, would still be good to discuss too.)

@odscjames
Contributor

I have tried the code section above with the Buenos Aires data file via a UWSGI process and it worked fine on my desktop. So more digging into this is required.

@aguilerapy assigned aguilerapy and unassigned aguilerapy Jan 22, 2020
@jpmckinney added this to Unprioritized or blocked in kingfisher-collect Feb 4, 2020
@jpmckinney added the "blocked" (We can't do this yet) label Feb 4, 2020
@jpmckinney
Member Author

Marking as blocked as the fix is in Kingfisher Process. Updating issue title as ONCAE works now, but BA still fails.

@jpmckinney changed the title from "Honduras ONCAE and Buenos Aires: MemoryError" to "Buenos Aires: MemoryError" Feb 4, 2020
@jpmckinney moved this from To Do to Blocked in kingfisher-collect Feb 7, 2020
@yolile
Member

yolile commented Apr 22, 2020

@jpmckinney I think that I can do something similar to what we did for Portugal, but using ijson as you suggested

@yolile self-assigned this Apr 22, 2020
@jpmckinney
Member Author

I'm fine with fixing this on the Kingfisher Scrape side, as it will be some time before Kingfisher Process is fixed to handle large files.

@yolile
Member

yolile commented Apr 23, 2020

@jpmckinney Buenos Aires uses one big release package; should we add the package metadata to each release before sending it to kingfisher-process?

@jpmckinney
Member Author

Yes, that seems best.

@jpmckinney
Member Author

jpmckinney commented Apr 23, 2020

Hmm, I'm remembering it's not so easy to solve in the general case with streaming input (but we should be able to do it for just BA, where we can read a file from disk multiple times).

Note: Instead of sending a package with a single release, we can create packages with as many releases as we can without running out of memory.

Now, the challenge is extracting the metadata in an iterative fashion, and then extracting the releases in an iterative fashion.

I wrote up a general solution here, which requires that all metadata fields occur before the releases/records field: open-contracting/ocdskit#118

Another solution is to:

  1. First pass: Extract the metadata using ijson.parse instead of ijson.items, and break once the releases/records field is seen (or, if there's metadata after that field, we can let ijson.parse read the entire file, but only store any metadata it reads).
  2. Second pass: Run ijson.items(f, 'releases.item') and yield packages with the chosen number of releases.

Does that make sense?
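
To illustrate the two passes, a rough sketch (not Kingfisher's actual code; split_package() and the chunk size are hypothetical, and only top-level scalar metadata is handled):

    # Rough sketch of the two-pass approach above, using ijson on a release
    # package stored on disk. split_package() is a hypothetical name, and only
    # top-level scalar metadata is captured (nested objects such as publisher
    # would need extra handling).
    import ijson

    SKIP_EVENTS = ('start_map', 'end_map', 'start_array', 'end_array', 'map_key')

    def split_package(path, size=100):
        # First pass: collect top-level metadata, stopping at the releases array.
        metadata = {}
        with open(path, 'rb') as f:
            for prefix, event, value in ijson.parse(f):
                if prefix == 'releases':
                    break
                if prefix and '.' not in prefix and event not in SKIP_EVENTS:
                    metadata[prefix] = value

        # Second pass: stream the releases and yield smaller packages.
        with open(path, 'rb') as f:
            releases = []
            for release in ijson.items(f, 'releases.item'):
                releases.append(release)
                if len(releases) == size:
                    yield dict(metadata, releases=releases)
                    releases = []
            if releases:
                yield dict(metadata, releases=releases)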

@jpmckinney
Member Author

I'm also remembering that OCDS Kit has a grouper method, used in the package-* commands, to split an ijson stream of releases into equal-sized chunks.
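
For reference, a grouper along the lines of the classic itertools recipe (OCDS Kit's own helper may differ in detail):

    # Generic grouper in the style of the itertools recipe; OCDS Kit's own
    # helper may differ in detail. The last chunk is padded with a sentinel,
    # which callers then filter out.
    import itertools

    SENTINEL = object()

    def grouper(iterable, n):
        args = [iter(iterable)] * n
        return itertools.zip_longest(*args, fillvalue=SENTINEL)

    # Usage with an ijson stream (hypothetical file handle f):
    # for group in grouper(ijson.items(f, 'releases.item'), 100):
    #     releases = [r for r in group if r is not SENTINEL]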

@jpmckinney removed the "blocked" (We can't do this yet) label Apr 27, 2020
@jpmckinney moved this from Blocked to In progress in kingfisher-collect Apr 27, 2020
@jpmckinney added this to High priority in CDS 2020-05/2021-02 May 5, 2020
@jpmckinney moved this from High priority to In progress in CDS 2020-05/2021-02 May 5, 2020
@yolile
Member

yolile commented May 7, 2020

Done in #366.

@yolile closed this as completed May 7, 2020
CDS 2020-05/2021-02 automation moved this from In progress [6 max] to Done May 7, 2020