This repository has been archived by the owner on May 5, 2022. It is now read-only.

ESRI GeoJSON sources seem to use a lot of RAM #34

Closed
NelsonMinar opened this issue Jan 9, 2015 · 39 comments

@NelsonMinar
Contributor

While doing some testing I noticed that us-ca-kern, us-ca-marin_county, and us-ca-solano all map 1G+ of RAM while running, compared to 70M for other runs. Those are all ESRI sources; perhaps the ESRI download code tries to hold the whole dataset in RAM?

update: us-al-calhoun is the biggest I've seen, at 10G.

@iandees
Member

iandees commented Jan 9, 2015

For reference, here's the code.

I don't see anything that would be hanging on to memory, but maybe something in requests is trying to cache?

@NelsonMinar
Contributor Author

Thanks. I agree that on first read the code looks careful; it only holds 500 rows at a time in memory. I need to look deeper to confirm this really is a problem and really unique to ESRI; I filed the issue so I don't forget.

@iandees
Member

iandees commented Jan 9, 2015

Maybe the list comprehensions in convert_esrijson_to_geojson() are hanging on to objects?

@migurski
Member

migurski commented Jan 9, 2015

(own comment deleted, I was mis-reading the code)

@NelsonMinar
Contributor Author

It's the convert-GeoJSON-to-CSV function that's taking the RAM. I just ran us-ca-solano again and watched closely. Resident RAM size is about 30M while downloading, then balloons to 2G when the OGR Python code converts cache.json to extracted-foo.csv.

The OGR code is the exact same code we use for shapefile conversion. Perhaps OGR doesn't have a streaming GeoJSON parser? That wouldn't surprise me. The cache.json GeoJSON file we download to is 140M, so I don't think it's just that the file was mmap()ed and showing up in the resident set.

I don't feel any great need to solve this problem right now; I just wanted to record the observation. If it is a problem we could solve it by being smarter in how we use OGR, writing our own GeoJSON-to-CSV converter using ijson, or just feeding the ESRI GeoJSON data 500 rows at a time to OGR.

@NelsonMinar
Contributor Author

Just noticed the Node code for GeoJSON doesn't use OGR, but instead uses geojson-stream and parses the data directly in Javascript.
https://github.com/openaddresses/openaddresses-conform/blob/572b7a8066db107f028ecd5407fda0b2c7041d84/Tools/convert.js#L116-L180

FWIW I just ran the entire source list in 3 hours on a single machine with 16 GB of RAM at 8x parallelism and never hit any memory limit. It may be we can just accept that the occasional source needs a gig or two of RAM to run, at least for now.

NelsonMinar added this to the Ditch Node milestone Jan 14, 2015
NelsonMinar changed the title from "ESRI sources seem to use a lot of RAM" to "ESRI GeoJSON sources seem to use a lot of RAM" Jan 14, 2015
@NelsonMinar
Contributor Author

I'm closing this issue because I don't think we should take any action on it. I did do a little research and I think the problem is simply how OGR works: it does no streaming of GeoJSON input.

I'm guessing here, but I think even if a machine has to start swapping it won't be too bad. If we find GeoJSON sources are causing RAM contention problems in production we can fix it then. Of the solutions I proposed above, I like the idea of using ijson best: doing our own GeoJSON parsing instead of using OGR. sample.py already has some code to read GeoJSON with ijson; it wouldn't be too hard to adapt it to convert the entire JSON input to a CSV file.
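
For what it's worth, a minimal sketch of what that ijson-based conversion could look like, assuming a FeatureCollection of Point features with flat properties; the function and file handling are illustrative, not the actual openaddr code:

# Sketch: stream a GeoJSON FeatureCollection to CSV with ijson, holding only
# one feature in memory at a time. Assumes Point geometries and flat
# properties; illustrative only.
import csv
import ijson

def stream_geojson_to_csv(geojson_path, csv_path):
    with open(geojson_path, 'rb') as src, open(csv_path, 'w', newline='') as dst:
        writer = None
        for feature in ijson.items(src, 'features.item'):
            row = dict(feature.get('properties') or {})
            row['X'], row['Y'] = feature['geometry']['coordinates'][:2]
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)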

@NelsonMinar
Contributor Author

I'm reopening this because Mike's having some failures that suggest the EC2 system is running out of system memory and the Linux OOM Killer is wreaking havoc on our processing jobs. So the memory use might matter.

Some big sources:
us-ca-orange: 85MB of JSON data, 15G virtual RAM (12G resident)
us-al-calhoun: 550MB of JSON data, 10G of RAM (?)

NelsonMinar reopened this Jan 21, 2015
@iandees
Member

iandees commented Jan 21, 2015

I'd love to see one of those sources run with a memory profiler. Are there lots of dictionaries? Lots of strings?

@NelsonMinar
Contributor Author

I don't think it's Python memory; it's OGR. OGR doesn't have a streaming JSON parser, and you can see it balloon up with a simple ogr2ogr. (Streaming JSON is not trivial; few folks bother implementing it.) That's why I think we should switch to ijson to read GeoJSON sources; we can cut OGR out entirely.
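
You can reproduce the ballooning outside our code entirely; a tiny sketch using the OGR Python bindings (the path is a placeholder), where resident memory grows to many times the file size as soon as the layer is opened:

# Sketch: OGR's GeoJSON driver parses the whole cached file up front, which is
# where the memory goes. Path is a placeholder.
from osgeo import ogr

ds = ogr.Open('out/us-ca-solano/cache.json')  # entire file is parsed here
layer = ds.GetLayer(0)
print(layer.GetFeatureCount())                # resident memory has already ballooned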

@NelsonMinar
Contributor Author

@iandees can you comment on how hard it'd be to parse these ESRI sources without OGR? EsriRestDownloadTask looks pretty simple and the data is quite stereotyped. For that matter I'm wondering why we deal with GeoJSON at all. Would it be just as easy to convert the ESRI JSON a row at a time directly to CSV right as we're downloading it? That'd be swell.

@iandees
Member

iandees commented Jan 21, 2015

Sure, if that would help I can do it. This was originally written when the download tasks simply downloaded to an OGR-able geo format and there was a separate step that converted the cached data to CSV. How should I handle polygons and multipolygons getting written to CSV?

@NelsonMinar
Contributor Author

Just got confirmation from Mike the OOM killer is to blame. That causes a worker process to terminate in an unclean way, which triggers a design flaw in multiprocessing.Pool where the pool hangs. Yuck! The short-term workaround is to run fewer jobs at once and hope we don't run out of RAM. The long term fix is to use less RAM. Notes on that coming soon.

Mike's syslog excerpt showing OOM killer at work:

Jan 20 22:35:44 ip-10-171-76-182 kernel: [192792.152126] [ 9747]  1000  9747  2571099  2493044    5007        0             0 openaddr us-ca-
Jan 20 22:35:44 ip-10-171-76-182 kernel: [192792.152175] Out of memory: Kill process 9747 (openaddr us-ca-) score 323 or sacrifice child
Jan 20 22:35:44 ip-10-171-76-182 kernel: [192792.155213] Killed process 9747 (openaddr us-ca-) total-vm:10284396kB, anon-rss:9972176kB, file-rss:0kB

@NelsonMinar
Contributor Author

So proposal: EsriRestDownloadTask should download to a cache.csv file instead of a cache.json file. This is a product change but one I think should be OK.

@iandees if you could help with this that'd be awesome. I'd be happy to do it but I'm just about to leave on vacation; I may have a little time tomorrow and Thursday to work on it, but not later. Let me know if you're on it. This task isn't urgent; I think just running fewer jobs in parallel will give us a workaround for now.

What I'd like is for ESRI sources to be converted to some natural CSV format that's as close to the source schema as possible. Then the conform processing code can feed that to our csv_source_to_csv() for CSV sources which will convert that source-native CSV to an extracted CSV, and thence to out.csv.

(Although really that's a bit silly; if you can write a CSV with X and Y columns, in WGS84 coordinates, and in UTF-8 then the csv-to-csv transform is a no-op. If some ESRI sources aren't in WGS84, then it'd be better to write native coordinates in the cache.csv file and let csv_source_to_csv() do the conversion.)

For Polygons and Multipolygons, just outputting the centroid as X/Y is all we need. You can probably use OGR to do the centroid calculation by creating an OGR geometry, one per row from the ESRI source. Or just do the math in Python; that's what the Node code does.
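
If OGR is handy for that, the per-row calculation could be as small as the sketch below, assuming the row's geometry has already been converted to a GeoJSON-style dict (as the existing download code does); illustrative only, not the final implementation:

# Sketch: per-row centroid via the OGR Python bindings, starting from a
# GeoJSON-style geometry dict. Illustrative only.
import json
from osgeo import ogr

def centroid_xy(geojson_geometry):
    geom = ogr.CreateGeometryFromJson(json.dumps(geojson_geometry))
    centroid = geom.Centroid()
    return centroid.GetX(), centroid.GetY()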

@migurski
Member

I like this idea as well. The GeoJSON sources that blow up OGR are mostly our own creations, and we can choose an alternate form. CSV in UTF-8 sounds sensible, and OGR can read it natively, so we might not need to change any conform code. I'm unsure whether OGR's CSV reader has a default interpretation for X and Y columns. This page suggests it wants a VRT file for that.

@NelsonMinar
Contributor Author

FWIW we don't use OGR's reader for CSV sources now, mostly because the CSV parser code is all about dealing with strange CSV sources that OGR can't handle. If we write our own CSV files from ESRI they could go either through the OGR path or the CSV path. I'd prefer the CSV path; it keeps more code under our own control.

@migurski
Member

Got it. Glad we already have a CSV path; I keep getting confused with the extract code.

@iandees
Member

iandees commented Jan 21, 2015

One of my favorite features of OpenAddresses is that we store the source data in case the source disappears or is otherwise unreachable. Does the new Python code still do that? If so, I'd love to figure out a way to store the original data as close to the original format as possible (GeoJSON in the ESRI case) and then write a custom iterative JSON parser for GeoJSON to deal with memory issues in OGR.

@NelsonMinar
Contributor Author

The Python code still stores source data; it's the same code you wrote :-) The file is cache.json for ESRI sources and is in GeoJSON* format. I'm proposing changing that to cache.csv.

I don't think that CSV format is much further from the ESRI JSON source than GeoJSON is; they're both transformations of the data. Smashing the geometry to a centroid does degrade the source data though.

If being "close to the original format" is a goal, maybe that cache.json file should be ESRI JSON and we transform it later, writing an ESRI ijson parser for the conform step? That's about as much work as writing a GeoJSON ijson parser.

Note that the only GeoJSON sources we see now are ESRI; there are no other types of GeoJSON sources in our collection. There might be some day, but beware that writing a general purpose GeoJSON feature extractor from any random collection of Features and GeometryCollections and the like is harder than just parsing the stereotyped data coming out of the ESRI source. I wouldn't try to do that without good examples and tests.

*The JSON in cache.json we write and serve right now is not valid JSON. The last Feature has a trailing comma, which JSON doesn't allow; neither Python's json module nor the JSON tool jq will parse it.

@iandees
Member

iandees commented Jan 21, 2015

Yeah, true. How about storing the full geometry in WKT (or something similar) as a column of the CSV, and also writing out the same information with a centroid?

@NelsonMinar
Contributor Author

Putting the geometry in the CSV would work; the WKT won't be any bigger than the GeoJSON representation. It'd be nice to do the centroid calculation in the ESRI downloader like you suggest; the csv_source_to_csv() code currently doesn't know anything about geometries.

@iandees
Member

iandees commented Jan 21, 2015

If the ESRI download code is doing both the "original geometry as WKT" and "geometry as centroid" outputs, what file path should I write to? I would want the "original geometry" version to be cached/saved and the "geometry as centroid" to be passed on to the next step in the conform process.

@NelsonMinar
Contributor Author

I was thinking of only a single cache.csv with both a geometry column and X and Y columns. The CSV code doesn't understand geometries; it'll just read X and Y. (In the common case where the source geometry is a point we could skip writing the redundant geometry column and save some space.) There's a concrete sketch of the layout at the end of this comment.

Are ESRI JSON sources guaranteed to be in EPSG:4326? If not I suggest extracting everything in the source SRS; the CSV code does know how to reproject points.
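
To make that layout concrete, a cache.csv row might be written roughly like this; the attribute column names are hypothetical, and the real ones would come from the source schema:

# Sketch: write one cache.csv row with the source attributes plus X, Y and a
# WKT geometry column. The attribute column names are hypothetical.
import csv
from osgeo import ogr

def write_cache_row(writer, attributes, geom):
    point = geom if geom.GetGeometryName() == 'POINT' else geom.Centroid()
    row = dict(attributes)
    row.update({'X': point.GetX(), 'Y': point.GetY(), 'geom': geom.ExportToWkt()})
    writer.writerow(row)

with open('cache.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['ADDRESS', 'CITY', 'X', 'Y', 'geom'])
    writer.writeheader()
    # ...then call write_cache_row() once per downloaded feature...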

@iandees
Member

iandees commented Jan 21, 2015

I tell the ESRI server to reproject everything to 4326 for me, so we're guaranteed to have 4326 (or nothing, if for some reason the server can't reproject).
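
For reference, the reprojection is requested in the parameters sent to the layer's /query endpoint; roughly something like the following, though the real downloader's parameter set may differ:

# Rough illustration of the query parameters sent to an ArcGIS REST layer's
# /query endpoint; outSR=4326 asks the server to reproject. The real
# downloader's parameters may differ.
params = {
    'where': '1=1',     # select all rows
    'outFields': '*',   # all attribute columns
    'outSR': 4326,      # reproject to WGS84
    'f': 'json',        # ESRI JSON response
}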

I will use the X,Y,geometry column idea.

👍

@NelsonMinar
Contributor Author

🐾😸
🐾

@NelsonMinar
Contributor Author

Ian and I are working on ESRI-to-CSV now in the branch esri-outputs-csv. It's basically working already; it needs some cleanups to the test suite, plus extensive testing.

One wrinkle: we have a lot of user-contributed sources that are defined as
type: ESRI
conform: type: geojson

But the cached data isn't GeoJSON at all now; it's CSV. I've hacked the code to understand that ESRI/geojson really means ESRI/csv so it keeps working. It'd be nice to clean those source definitions up some time. That will require updating the Node code too.

@NelsonMinar
Contributor Author

I've done a full run of 137 ESRI sources and compared its output to a full run I did with the old code yesterday. In general it is very promising. If we can fix a couple of problems in the downloader I think the code is ready for production. It's very nice to see these jobs only use 50M of working memory. Also, in every case where an out.csv was produced, it more or less matched the out.csv from the old code (small differences in rounding of coordinates).

Errors: 25 of the 137 ESRI sources did not produce an out.csv file, though they did with the old code. We need to fix these before I'm fully confident we can switch to the new code. I can try to lend a hand tomorrow, but I suspect @iandees is in a better position to fix these.

Here's a tarball with the outputs of these 25 problem runs – http://minar.us.to/~nelson/tmp/esri-errors.tgz
In those directories, new.txt is the debug log from the new run; out.txt and out.csv are the old files. These errors should all be reproducible with openaddr-process-one.

I think we have two actual problems. Some sources don't work with the way we extract metadata to write the CSV column headers (maybe some ESRI protocol thing?). And the Python csv module has a sanity check that no field can be more than 128k, which barfs on a few sources with giant polygons. The latter should be easy to fix. There was also a third problem with two sources; it may be that they are returning bad data that OGR didn't catch.

Here are the errors I saw:

KeyError: 'fields': The way the ESRI download code gets the metadata for the schema isn't working on some sources.

us-ar us-ct-avon us-fl-alachua us-ga-gordon us-ia-linn us-il-tazewell us-in-madison us-mn-polk us-mn-wadena us-mn-yellow_medicine us-mo-barry us-mo-st_louis_county us-nv-henderson us-nv-lander us-nv-nye

TypeError: 'NoneType' object is not iterable: This only showed up in one source, and also seems related to metadata.

us-va-roanoke

Error: field larger than field limit (131072): Some very large fields are causing the Python CSV parser to barf. I think it has a sanity check that no field can be more than 128k. This can be overridden, details here; a one-line example follows the source list below.

us-al-calhoun us-ms-copiah us-ms-hinds us-nc-avery us-nc-burke us-nc-montgomery
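
If it helps, the override is something along these lines (the exact limit value is arbitrary):

# Raise Python's default 128 KB per-field limit before parsing CSVs with very
# large geometry columns. The chosen limit here is arbitrary.
import csv
csv.field_size_limit(2 ** 30)  # default is 131072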

TypeError: in method 'Geometry_AddPoint', argument 2 of type 'double': Not sure, maybe some of these sources don't contain valid coordinates?
us-nc-alexander us-va-city_of_emporia

@migurski
Member

The "type: geojson" thing has always seemed incredibly wrong to me. If we can synch it, I'd be happy to bulk-edit the main openaddresses source list README to reflect what we think should be there.

@iandees
Member

iandees commented Jan 22, 2015

With 3c1155d we should be handling the first three bugs you pointed out up there, @NelsonMinar.

Looking at the last one now.

@NelsonMinar
Contributor Author

@iandees thanks for 3c1155d! But is there some better way we can handle the metadata problem? I'd hate to lose those sources; the old code was able to get useful data. Maybe just fetch an actual row and look at the property names in the ESRI JSON to infer the schema? (See below; I was wrong.)

Edit: I tested us-ms-hinds, one of the sources with giant columns; it runs now and its output looks correct. And it used a tiny amount of RAM.

@NelsonMinar
Contributor Author

Oops, never mind! I think 3c1155d fixes all three issues. I'm not too worried about the fourth issue, so I'd say this code is good to merge. I leave it to @migurski to decide when to merge this relative to his testing. I think it's a significant improvement and will pretty much solve the memory problems in production.

On the metadata problem: when I was validating output against the old run I'd forgotten about issue #55. Almost every run that has the metadata problem did produce an out.csv, but I didn't notice it was one of those useless 23-byte, 0-row files. Those servers are just broken. (us-mn-wadena is the exception; I think it was just down when I was testing last night.) So I think this code now fixes issue #55 too. My apologies for my confusion earlier.

@NelsonMinar
Contributor Author

I did another full run with 3c1155d and am confident this new code is good and ready to merge. All tests pass.

@iandees, on the last TypeError problem for two sources, it appears that us-va-city_of_emporia is serving us a POINT(NaN, NaN). It'd be nice to just ignore rows like that, but I wasn't sure what error handling strategy was appropriate so I didn't try to add any. FWIW we've always had a few ESRI sources with various kinds of geometry errors; that's why this one doesn't concern me.
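
A minimal sketch of the "just ignore the row" option (not the fix that actually landed):

# Sketch: drop rows whose coordinates are NaN before building a geometry.
# Not the actual fix that landed.
import math

def has_usable_point(x, y):
    return not (math.isnan(x) or math.isnan(y))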

@iandees
Member

iandees commented Jan 23, 2015

@NelsonMinar the POINT(NaN, NaN) problem is solved with c6b86f5.

@migurski
Member

Travis looks good. Can I figure out how to get back the geometry type lost in 171fd8b? It’s a useful hint of potential parcel data. Otherwise, sounds good to rebase and push live. I’m doing thrice-daily complete runs to http://s3.amazonaws.com/ditch-node.openaddresses.io/index.html to make sure everything’s happy.

@NelsonMinar
Contributor Author

Thanks Ian, verified that us-nc-alexander and us-va-city_of_emporia are working now.

I don't have time to help with the geometry type before I leave, sorry. I took a quick look though; I think that value is set in the excerpt() function, which has a completely different way of parsing source data than conform(). Now that ESRI sources write CSV instead of GeoJSON, that code doesn't know how to guess the geometry type. Maybe add code to look in CSV files for a geom column and infer the type from the WKT there, else assume Point (a rough sketch is below)? Or, since the downloader code is the thing that really knows the geometry type, if there's a way to set the variable in the state object there, that would be good too. Whoever adds that should re-enable the tests I disabled in 171fd8b and bae0efe.
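
A rough sketch of that CSV-based guess, assuming a hypothetical geom column holding WKT (not the code that eventually landed):

# Sketch: guess the geometry type of a cached CSV by peeking at the first
# row's WKT. Assumes a hypothetical 'geom' column; falls back to POINT.
import csv

def guess_geometry_type(csv_path):
    with open(csv_path, newline='') as f:
        row = next(csv.DictReader(f), None)
    wkt = ((row or {}).get('geom') or '').strip()
    # WKT starts with its geometry keyword, e.g. "POLYGON ((...))".
    keyword = wkt.split('(', 1)[0].strip().upper()
    return keyword or 'POINT'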

@migurski
Member

Of course, the CSV thing. I'll make this my problem.

@migurski
Member

But I won't let it stop the merge process. If one of you can confirm for me that this merge should go into ditch-node I'll make that happen ASAP.

@NelsonMinar
Contributor Author

I think it's ready to merge.

@migurski
Member

Rebased and merged up.
