
Buenos Aires: MemoryError #154

Closed
jpmckinney opened this issue May 9, 2019 · 21 comments

@jpmckinney
Member

@romifz:

in collection_file table, some files have ["MemoryError()"] in the errors column

@odscjames:

Romi got it to work locally by using Flask with the dev server option. I wonder if it's uWSGI, then? But if it were, that's not how I would expect that error to appear.

@romifz
Contributor

romifz commented May 28, 2019

This also happens with the buenos-aires collection (@yolile).

By the way, shouldn't this issue be on the kingfisher-process project?

@jpmckinney
Member Author

jpmckinney commented May 28, 2019

I don't know - is the MemoryError occurring during a crawl, and then getting sent to kingfisher-process' file_errors API endpoint? If so, I think the fix is in kingfisher-scrape.

@romifz
Contributor

romifz commented May 28, 2019

Apparently no errors are sent during a crawl; see this log. The API calls aren't returning error codes either (I've seen HTTP error codes when sending files with incorrect formats).

@robredpath
Contributor

@romifz is this something that you'd like someone like @odscjames to look into? How urgent is it?

@romifz
Contributor

romifz commented Jun 5, 2019

It's not urgent; for now I've been downloading the files with kingfisher-scrape locally and loading them into kingfisher-process using local-load.

@odscjames
Contributor

When I run this locally on the Django debug server, I see memory usage of 681 MB for processing the largest file. uWSGI is currently capped at 500 MB. (Though I am puzzled this isn't the memory error I've seen before from uWSGI.)

Anyway, the short-term fix is simply to increase the memory limit; I think that server has 64 GB of RAM, after all.

The long-term fix would be to use the Redis queue: when a large file comes in, the web process simply stores it, returns a response to the caller as quickly as possible, and puts a message on the queue. A Redis worker then processes it; the workers have no memory limits, I think, and they can take longer to process larger files without an HTTP agent waiting for a response.
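
As an illustration of that pattern only (not Kingfisher Process's actual code), a minimal sketch assuming redis-py; the 'kingfisher:incoming' queue name and the handle_file() callback are hypothetical:

    # Minimal sketch of the store-then-queue pattern, assuming redis-py.
    # The queue name and the handle_file() callback are hypothetical.
    import json

    import redis

    r = redis.Redis()

    def enqueue_file(collection_id, filename):
        # Web process: record where the file is and return to the caller immediately.
        r.rpush('kingfisher:incoming', json.dumps({
            'collection_id': collection_id,
            'filename': filename,
        }))

    def worker_loop(handle_file):
        # Worker process: no uWSGI memory or request-time limits apply here.
        while True:
            _, raw = r.blpop('kingfisher:incoming')  # blocks until a message arrives
            message = json.loads(raw)
            handle_file(message['collection_id'], message['filename'])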

odscjames added a commit to OpenDataServices/opendataservices-deploy that referenced this issue Jun 21, 2019
@odscjames
Contributor

The limit is now increased to 1 GB. We also have a 10-minute limit per request, but we haven't seen that exceeded yet.

@romifz
Contributor

romifz commented Jul 3, 2019

@odscjames I checked the Buenos Aires file and it is ~800 MB (uncompressed), so the memory limit should work for now! The file gets updated every 15 days, so it will grow over time, but I think it would take a while to reach the 1 GB cap.

@odscjames
Contributor

Python will use more memory to load the file than the file's size on disk, but I'm not sure how much more. I'll increase the limits a bit for now anyway.
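
For a rough way to measure that overhead locally, a small sketch using only the standard library (the 'releases.json' path is a placeholder):

    # Rough measurement of peak memory while parsing a JSON file with json.load().
    # 'releases.json' is a placeholder path.
    import json
    import tracemalloc

    tracemalloc.start()
    with open('releases.json', encoding='utf-8') as f:
        data = json.load(f)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'peak memory while parsing: {peak / 1024 / 1024:.0f} MB')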

@robredpath
Contributor

@odscjames changed the title from "Honduras ONCAE: MemoryError" to "Honduras ONCAE and Buenos Aires scrapers running on hosted Kingfisher: MemoryError" Aug 5, 2019
@odscjames changed the title from "Honduras ONCAE and Buenos Aires scrapers running on hosted Kingfisher: MemoryError" to "Honduras ONCAE and Buenos Aires: MemoryError" Aug 5, 2019
@odscjames
Contributor

I increased the RAM limit to 4GB and Buenos Aires still errors. This needs looking at more.

@odscjames
Contributor

I have looked into this and something isn't right here.

I think there is actually only one place in the code that would cause it to crash the way it does:

                # json.load() reads and parses the entire file into memory at once
                with open(file_to_store.get_filename(), encoding=encoding) as f:
                    data = json.load(f)

I tested locally and got 3.3 GB of memory use opening the 800 MB Buenos Aires file, which is big; but even after upping the limit to 16 GB on the live server it still crashed, so something else is going on.

Need to dig into this more.

(We hear that local load works, so an incremental improvement to what's currently there, along the lines explored in open-contracting/kingfisher-process#171, would still be good to discuss too.)

@odscjames
Contributor

I have tried the code section above with the Buenos Aires data file via a UWSGI process and it worked fine on my desktop. So more digging into this is required.

@aguilerapy assigned aguilerapy and unassigned aguilerapy Jan 22, 2020
@jpmckinney added this to Unprioritized or blocked in kingfisher-collect Feb 4, 2020
@jpmckinney added the "blocked" (We can't do this yet) label Feb 4, 2020
@jpmckinney
Member Author

Marking as blocked as the fix is in Kingfisher Process. Updating issue title as ONCAE works now, but BA still fails.

@jpmckinney changed the title from "Honduras ONCAE and Buenos Aires: MemoryError" to "Buenos Aires: MemoryError" Feb 4, 2020
@jpmckinney moved this from To Do to Blocked in kingfisher-collect Feb 7, 2020
@yolile
Member

yolile commented Apr 22, 2020

@jpmckinney I think that I can do something similar to what we did for Portugal, but using ijson as you suggested

@yolile self-assigned this Apr 22, 2020
@jpmckinney
Member Author

I'm fine with fixing this on the Kingfisher Scrape side, as it will be some time before Kingfisher Process is fixed to handle large files.

@yolile
Member

yolile commented Apr 23, 2020

@jpmckinney Buenos Aires uses one big release package; should we add the package metadata to each release before sending it to kingfisher-process?

@jpmckinney
Member Author

Yes, that seems best.

@jpmckinney
Member Author

jpmckinney commented Apr 23, 2020

Hmm, I'm remembering it's not so easy to solve in the general case with streaming input (but we should be able to do it for just BA, where we can read a file from disk multiple times).

Note: Instead of sending a package with a single release, we can create packages with as many releases as we can without running out of memory.

Now, the challenge is extracting the metadata in an iterative fashion, and then extracting the releases in an iterative fashion.

I wrote up a general solution here, which requires that all metadata fields occur before the releases/records field: open-contracting/ocdskit#118

Another solution is to:

  1. First pass: Extract the metadata using ijson.parse instead of ijson.items, and break once the releases/records field is seen (or, if there's metadata after that field, we can let ijson.parse read the entire file, but only store any metadata it reads).
  2. Second pass: Run ijson.items(f, 'releases.item') and yield packages with the chosen number of releases.

Does that make sense?
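
To illustrate the two passes, a rough sketch (not Kingfisher's actual code; split_package() and the chunk size are hypothetical, and only top-level scalar metadata is handled):

    # Rough sketch of the two-pass approach above, using ijson on a release
    # package stored on disk. split_package() is a hypothetical name, and only
    # top-level scalar metadata is captured (nested objects such as publisher
    # would need extra handling).
    import ijson

    SKIP_EVENTS = ('start_map', 'end_map', 'start_array', 'end_array', 'map_key')

    def split_package(path, size=100):
        # First pass: collect top-level metadata, stopping at the releases array.
        metadata = {}
        with open(path, 'rb') as f:
            for prefix, event, value in ijson.parse(f):
                if prefix == 'releases':
                    break
                if prefix and '.' not in prefix and event not in SKIP_EVENTS:
                    metadata[prefix] = value

        # Second pass: stream the releases and yield smaller packages.
        with open(path, 'rb') as f:
            releases = []
            for release in ijson.items(f, 'releases.item'):
                releases.append(release)
                if len(releases) == size:
                    yield dict(metadata, releases=releases)
                    releases = []
            if releases:
                yield dict(metadata, releases=releases)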

@jpmckinney
Member Author

I'm also remembering that OCDS Kit has a grouper method, used in the package-* commands, to split an ijson stream of releases into equal-sized chunks.
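
For reference, a grouper along the lines of the classic itertools recipe (OCDS Kit's own helper may differ in detail):

    # Generic grouper in the style of the itertools recipe; OCDS Kit's own
    # helper may differ in detail. The last chunk is padded with a sentinel,
    # which callers then filter out.
    import itertools

    SENTINEL = object()

    def grouper(iterable, n):
        args = [iter(iterable)] * n
        return itertools.zip_longest(*args, fillvalue=SENTINEL)

    # Usage with an ijson stream (hypothetical file handle f):
    # for group in grouper(ijson.items(f, 'releases.item'), 100):
    #     releases = [r for r in group if r is not SENTINEL]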

@jpmckinney removed the "blocked" (We can't do this yet) label Apr 27, 2020
@jpmckinney moved this from Blocked to In progress in kingfisher-collect Apr 27, 2020
@jpmckinney added this to High priority in CDS 2020-05/2021-02 May 5, 2020
@jpmckinney moved this from High priority to In progress in CDS 2020-05/2021-02 May 5, 2020
@yolile
Member

yolile commented May 7, 2020

Done in #366.

@yolile closed this as completed May 7, 2020
CDS 2020-05/2021-02 automation moved this from In progress [6 max] to Done May 7, 2020