ESRI GeoJSON sources seem to use a lot of RAM #34
For reference, here's the code. I don't see anything that would be hanging on to memory, but maybe something inside OGR is?
Thanks, I agree that on a first read the code looks careful; it only holds 500 rows at a time in memory. I need to look deeper to confirm this really is a problem and really is unique to ESRI. I filed the issue so I don't forget.
Maybe the list comprehensions in the download code?
(own comment deleted, I was misreading the code)
It's the convert-GeoJSON-to-CSV function that's taking the RAM; I just ran one of the big sources and watched memory climb during that step. The OGR code is the exact same code we use for shapefile conversion. Perhaps OGR doesn't have a streaming GeoJSON parser? That wouldn't surprise me.

I don't feel any great need to solve this problem right now, I just wanted to record the observation. If it is a problem we could solve it by being smarter in how we use OGR, or by writing our own GeoJSON-to-CSV converter around a streaming JSON parser.
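If we went the second route, a minimal sketch of what a streaming converter could look like, assuming something like the ijson library (the function and column handling here are illustrative, not our actual code):

```python
import csv
import ijson  # iterative JSON parser: reads the file incrementally instead of all at once

def geojson_to_csv(geojson_path, csv_path):
    """Stream features out of a GeoJSON FeatureCollection one at a time,
    writing one CSV row per feature, so memory use stays roughly constant."""
    with open(geojson_path, 'rb') as src, open(csv_path, 'w', newline='') as dest:
        writer = None
        # 'features.item' yields each element of the top-level "features" array
        for feature in ijson.items(src, 'features.item'):
            props = feature.get('properties') or {}
            x, y = feature['geometry']['coordinates'][:2]  # assumes Point geometries
            if writer is None:
                writer = csv.DictWriter(dest, fieldnames=['X', 'Y'] + sorted(props))
                writer.writeheader()
            writer.writerow(dict(props, X=x, Y=y))
```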
Just noticed the Node code for GeoJSON doesn't use OGR; it has its own parsing path instead.

FWIW I just ran the entire source list in 3 hours on a single machine with 16 GB of RAM. 8x parallelism, never hit any memory limit. It may be we can just accept that the occasional source needs a gig or two of RAM to run, at least for now.
I'm closing this issue because I don't think we should take any action on it. I did a little research and I think the problem is simply how OGR works: it doesn't stream GeoJSON input. I'm guessing here, but I think even if a machine has to start swapping it won't be too bad. If we find GeoJSON sources are causing RAM contention problems in production we can fix it then. Of the solutions I proposed above, I like the idea of using our own converter.
I'm reopening this because Mike is seeing failures that suggest the EC2 system is running out of memory and the Linux OOM killer is wreaking havoc on our processing jobs. So the memory use might matter. Some big sources are involved.
I'd love to see one of those sources run with a memory profiler. Are there lots of dictionaries? Lots of strings?
I don't think it's Python memory, it's OGR. OGR doesn't have a streaming JSON parser; you can see it balloon up with a simple command-line conversion of a big GeoJSON file.
@iandees can you comment on how hard it'd be to parse these ESRI sources without OGR? EsriRestDownloadTask looks pretty simple and the data is quite stereotyped. For that matter I'm wondering why we deal with GeoJSON at all. Would it be just as easy to convert the ESRI JSON a row at a time directly to CSV right as we're downloading it? That'd be swell.
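To make the idea concrete, here's a rough sketch of the row-at-a-time conversion, assuming the stereotyped ESRI REST JSON layout and point geometries (the names here are illustrative, not taken from EsriRestDownloadTask):

```python
import csv
import requests  # assumed HTTP client for this sketch; the real downloader may differ

def write_esri_page_to_csv(query_url, params, writer):
    """Fetch one page of an ESRI REST query and write each feature straight to CSV.
    Assumes the stereotyped response shape:
    {"features": [{"attributes": {...}, "geometry": {"x": ..., "y": ...}}, ...]}"""
    response = requests.get(query_url, params=dict(params, f='json'))
    for feature in response.json().get('features', []):
        row = dict(feature.get('attributes') or {})
        geometry = feature.get('geometry') or {}
        row['X'], row['Y'] = geometry.get('x'), geometry.get('y')
        writer.writerow(row)  # writer is a csv.DictWriter with matching fieldnames
```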
Sure, I can definitely do that if it would help. This was originally written when the download tasks simply downloaded to an OGR-able geo format and there was a separate step that converted the cached data to CSV. How should I handle polygons and multipolygons getting written to CSV?
Just got confirmation from Mike that the OOM killer is to blame. It terminates a worker process in an unclean way, which triggers a design flaw in multiprocessing.Pool where the pool hangs. Yuck!

The short-term workaround is to run fewer jobs at once and hope we don't run out of RAM. The long-term fix is to use less RAM; notes on that coming soon. Mike sent a syslog excerpt showing the OOM killer at work.
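As an aside on the hang itself: a minimal sketch, assuming we switched from multiprocessing.Pool to Python 3's concurrent.futures (we haven't), of how a killed worker would surface as an exception instead of hanging the parent:

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def run_jobs(jobs, worker, parallelism=8):
    """Run jobs in worker processes. If the OOM killer takes a worker out,
    pending futures raise BrokenProcessPool rather than blocking forever."""
    results = {}
    with ProcessPoolExecutor(max_workers=parallelism) as pool:
        futures = {pool.submit(worker, job): job for job in jobs}
        for future, job in futures.items():
            try:
                results[job] = future.result()
            except BrokenProcessPool:
                results[job] = None  # worker died abruptly (e.g. SIGKILL)
    return results
```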
So, a proposal: @iandees, if you could help with this that'd be awesome. I'd be happy to do it but I'm just about to leave on vacation; I may have a little time tomorrow and Thursday to work on it, but not later. Let me know if you're on it. This task isn't urgent, I think just running fewer jobs in parallel will give us a workaround for now.

What I'd like is for ESRI sources to be converted to some natural CSV format that's as close to the source schema as possible. Then the conform processing code can feed that to our existing csv-to-csv transform. (Although really that's a bit silly; if you can write a CSV with X and Y columns, in WGS84 coordinates, and in UTF-8, then the csv-to-csv transform is a no-op. If some ESRI sources aren't in WGS84, then it'd be better to write native coordinates in the cached CSV and let the conform code reproject.)

For Polygons and Multipolygons, just outputting the centroid as X/Y is all we need. You can probably use OGR to do the centroid calculation by creating an OGR geometry, one per row from the ESRI source. Or just do the math in Python, that's what the Node code does.
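Roughly, the OGR route could look like this; a sketch only, assuming the osgeo bindings and the stereotyped ESRI polygon layout with a 'rings' array:

```python
from osgeo import ogr  # GDAL/OGR Python bindings, assumed available

def esri_polygon_centroid(esri_geometry):
    """Build an OGR polygon from ESRI JSON 'rings' and return its centroid as (x, y).
    Assumes {'rings': [[[x, y], ...], ...]}; extra per-vertex values (z/m) are ignored."""
    polygon = ogr.Geometry(ogr.wkbPolygon)
    for ring_coords in esri_geometry['rings']:
        ring = ogr.Geometry(ogr.wkbLinearRing)
        for vertex in ring_coords:
            ring.AddPoint(float(vertex[0]), float(vertex[1]))
        polygon.AddGeometry(ring)
    centroid = polygon.Centroid()
    return centroid.GetX(), centroid.GetY()
```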
I like this idea as well. The GeoJSON sources that blow up OGR are mostly our own creations, and we can choose an alternate form. CSV in UTF-8 sounds sensible, and OGR can read it natively, so we might not need to change any conform code. I'm unsure whether OGR's CSV reader has a default interpretation for X and Y columns; this page suggests it wants a VRT file for that.
FWIW we don't use OGR's reader for CSV sources now, mostly because the CSV parser code is all about dealing with strange CSV sources that OGR can't handle. If we write our own CSV files from ESRI they could go either through the OGR path or the CSV path. I'd prefer the CSV path; it keeps more code under our own control.
Got it. Glad we already have a CSV path; I keep getting confused with the extract code.
One of my favorite features of OpenAddresses is that we store the source data in case the source disappears or is otherwise unreachable. Does the new Python code still do that? If so, I'd love to figure out a way to store the original data as close to the original format as possible (GeoJSON in the ESRI case) and then write a custom iterative JSON parser for GeoJSON to deal with the memory issues in OGR.
The Python code still stores source data, it's the same code you wrote :-) I don't think the CSV format is much further from the ESRI JSON source than GeoJSON is; they're both transformations of the data. Smashing the geometry to a centroid does degrade the source data, though; if being "close to the original format" is a goal, maybe that's an argument for caching something richer than a centroid.

Note that the only GeoJSON sources we see now are ESRI; there are no other types of GeoJSON sources in our collection. There might be some day, but beware that writing a general-purpose GeoJSON feature extractor for any random collection of Features and GeometryCollections and the like is harder than just parsing the stereotyped data coming out of the ESRI source. I wouldn't try to do that without good examples and tests.
Yea, true. How about storing the full geometry in WKT or something as a column of the CSV, and also writing out the same information with a centroid?
Putting the geometry in the CSV would work; the WKT won't be any bigger than the GeoJSON representation. It'd be nice to do the centroid calculation in the ESRI downloader like you suggest.
If the ESRI download code is doing both the "original geometry as WKT" and "geometry as centroid" outputs, what file path should I write to? I would want the "original geometry" version to be cached/saved and the "geometry as centroid" to be passed on to the next step in the conform process.
I was thinking only a single CSV output, with centroid X and Y columns plus a column carrying the full geometry.

Are ESRI JSON sources guaranteed to be in EPSG:4326? If not, I suggest extracting everything in the source SRS; the CSV code does know how to reproject points.
I tell the ESRI server to reproject everything to 4326 for me, so we're guaranteed to have 4326 (or nothing, if for some reason the server can't reproject). I will use the X,Y,geometry column idea. 👍
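Something like this, with illustrative column names (the real field list will come from the source schema):

```python
import csv

# Illustrative sketch of one cached row: source attributes, centroid X/Y in
# EPSG:4326, and the full geometry carried along as WKT in its own column.
fieldnames = ['OBJECTID', 'ADDRESS', 'X', 'Y', 'geometry']

with open('cached.csv', 'w', newline='') as handle:
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({
        'OBJECTID': 1,
        'ADDRESS': '123 Main St',
        'X': -122.41,   # centroid longitude
        'Y': 37.77,     # centroid latitude
        'geometry': 'POLYGON ((-122.42 37.76, -122.40 37.76, '
                    '-122.40 37.78, -122.42 37.78, -122.42 37.76))',
    })
```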
🐾😸
Ian and I are working on ESRI-to-CSV now in the branch esri-outputs-csv. Basically working already, needs some cleanups to the test suite. Also extensive testing.

One wrinkle: we have a lot of user-contributed sources that are defined as "type": "geojson" even though they point at ESRI endpoints. But the cached data isn't GeoJSON at all now, it's CSV. I've hacked the code to understand that ESRI/geojson really means ESRI/csv so it keeps working. It'd be nice to clean those source definitions up some time. That will require updating the Node code too.
I've done a full run of 137 ESRI sources and compared its output to a full run I did with the old code yesterday. In general it is very promising. If we can fix a couple of problems in the downloader I think the code is ready for production. It's very nice to see these jobs only use 50M of working memory. Also, in every single case where an output file was produced, it matches the old run.

Errors: 25 of the 137 ESRI sources did not produce an output file. Here's a tarball with the outputs of these 25 problem runs: http://minar.us.to/~nelson/tmp/esri-errors.tgz

I think we have two actual problems. Some sources don't work with the way we extract metadata to write the CSV column headers, maybe some ESRI protocol thing? And the Python CSV module has a sanity check that no column can be more than 128k, which barfs on a few sources with giant polygons. The latter should be easy to fix. There was also a third problem with two sources; it may be they are returning bad data that OGR didn't catch. Here are the errors I saw:

KeyError: 'fields' — the way the ESRI download code gets the metadata for the schema isn't working on some sources: us-ar, us-ct-avon, us-fl-alachua, us-ga-gordon, us-ia-linn, us-il-tazewell, us-in-madison, us-mn-polk, us-mn-wadena, us-mn-yellow_medicine, us-mo-barry, us-mo-st_louis_county, us-nv-henderson, us-nv-lander, us-nv-nye

TypeError: 'NoneType' object is not iterable — this only showed up in one source and also seems related to metadata: us-va-roanoke

Error: field larger than field limit (131072) — some very large columns are causing the Python CSV parser to barf; I think it has a sanity check that no column can be more than 128k. This can be overridden with csv.field_size_limit. Affected: us-al-calhoun, us-ms-copiah, us-ms-hinds, us-nc-avery, us-nc-burke, us-nc-montgomery

TypeError: in method 'Geometry_AddPoint', argument 2 of type 'double' — not sure, maybe some of these sources don't contain valid coordinates? us-nc-alexander, us-va-city_of_emporia
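For the field-size error, the override is a one-liner; a sketch of the idea, leaving aside exactly where in the code it belongs:

```python
import csv
import sys

# The csv module caps fields at 131072 characters by default; giant geometry
# fields blow past that, so raise the limit before parsing those files.
csv.field_size_limit(sys.maxsize)
```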
The "type: geojson" thing has always seemed incredibly wrong to me. If we can synch it, I'd be happy to bulk-edit the main openaddresses source list README to reflect what we think should be there. |
With 3c1155d we should be handling the first three bugs you pointed out up there, @NelsonMinar. Looking at the last one now.
@iandees thanks for 3c1155d!

Edit: I tested us-ms-hinds, one of the ones with giant columns; it runs now and its output looks correct. And it used a tiny amount of RAM.
Oops, never mind! I think 3c1155d fixes all three issues. I'm not too worried about the 4th issue, so I'd say this code is good to merge. I leave it to @migurski to decide when to merge this vs. his testing. I think it's a significant improvement and will pretty much solve the memory problems in production.

On the metadata problem: when I was validating output against the old run I'd forgotten about issue #55. Almost every single run that has the metadata problem did produce an output file with the old code.
I did another full run with 3c1155d and am confident this new code is good and ready to merge. All tests pass. @iandees, on the last TypeError problem for the two sources, it appears that us-va-city_of_emporia is serving us data that the geometry code chokes on.
@NelsonMinar that should be taken care of now.
Travis looks good. Can we figure out how to get back the geometry type lost in 171fd8b? It's a useful hint of potential parcel data. Otherwise, sounds good to rebase and push live. I'm doing thrice-daily complete runs to http://s3.amazonaws.com/ditch-node.openaddresses.io/index.html to make sure everything's happy.
Thanks Ian, verified that us-nc-alexander and us-va-city_of_emporia are working now. I don't have time to help with the geometry type before I leave, sorry. I took a quick look though; I think that value comes from the cached data, which is all CSV now.
Of course, the CSV thing. I'll make this my problem.
But I won't let it stop the merge process. If one of you can confirm for me that this merge should go in, I'll take care of it.
I think it's ready to merge.
Rebased and merged up.
While doing some testing I noticed us-ca-kern, us-ca-marin_county, and us-ca-solano all map 1G+ of RAM while running, compared to 70M for other runs. Those are all ESRI sources, perhaps the ESRI download code tries to hold the whole dataset in RAM?
Update: us-al-calhoun is the biggest I've seen, at 10G.