This repository has been archived by the owner on May 5, 2022. It is now read-only.

ESRI GeoJSON sources seem to use a lot of RAM #34

Closed
NelsonMinar opened this issue Jan 9, 2015 · 39 comments

@NelsonMinar
Contributor

While doing some testing I noticed that us-ca-kern, us-ca-marin_county, and us-ca-solano all map 1G+ of RAM while running, compared to 70M for other runs. Those are all ESRI sources; perhaps the ESRI download code tries to hold the whole dataset in RAM?

update: us-al-calhoun is the biggest I've seen, at 10G.

@iandees
Member

iandees commented Jan 9, 2015

For reference, here's the code.

I don't see anything that would be hanging on to memory, but maybe something in requests is trying to cache?

@NelsonMinar
Contributor Author

Thanks. I agree that on first read the code looks careful; it only holds 500 rows at a time in memory. I need to look deeper to confirm this really is a problem and really unique to ESRI; I filed the issue so I don't forget.

@iandees
Member

iandees commented Jan 9, 2015

Maybe the list comprehensions in convert_esrijson_to_geojson() are hanging on to objects?

@migurski
Member

migurski commented Jan 9, 2015

(own comment deleted, I was mis-reading the code)

@NelsonMinar
Contributor Author

It's the convert-GeoJSON-to-CSV function that's taking the RAM. I just ran us-ca-solano again and watched closely. Resident RAM size is about 30M while downloading, then balloons to 2G when the OGR Python code converts cache.json to extracted-foo.csv.

The OGR code is the exact same code we use for shapefile conversion. Perhaps OGR doesn't have a streaming GeoJSON parser? That wouldn't surprise me. The cache.json GeoJSON file we download to is 140M, so I don't think it's just that the file was mmap()ed and showing up in the resident set.

I don't feel any great need to solve this problem right now; I just wanted to record the observation. If it is a problem we could solve it by being smarter in how we use OGR, writing our own GeoJSON-to-CSV converter using ijson, or just feeding the ESRI GeoJSON data 500 rows at a time to OGR.

@NelsonMinar
Contributor Author

Just noticed the Node code for GeoJSON doesn't use OGR, but instead uses geojson-stream and parses the data directly in Javascript.
https://github.com/openaddresses/openaddresses-conform/blob/572b7a8066db107f028ecd5407fda0b2c7041d84/Tools/convert.js#L116-L180

FWIW I just ran the entire source list in 3 hours on a single machine with 16 GB of RAM at 8x parallelism and never hit any memory limit. It may be we can just accept that the occasional source needs a gig or two of RAM to run, at least for now.

NelsonMinar added this to the Ditch Node milestone Jan 14, 2015
NelsonMinar changed the title from "ESRI sources seem to use a lot of RAM" to "ESRI GeoJSON sources seem to use a lot of RAM" Jan 14, 2015
@NelsonMinar
Contributor Author

I'm closing this issue because I don't think we should take any action on it. I did do a little research and I think the problem is simply how OGR works: it does no streaming of GeoJSON input.

I'm guessing here, but I think even if a machine has to start swapping it won't be too bad. If we find GeoJSON sources are causing RAM contention problems in production we can fix it then. Of the solutions I proposed above, I like the idea of using ijson best: doing our own GeoJSON parsing instead of using OGR. sample.py already has some code to read GeoJSON with ijson; it wouldn't be too hard to adapt it to convert the entire JSON input to a CSV file.
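
For what it's worth, a minimal sketch of what that ijson-based conversion could look like, assuming a FeatureCollection of Point features with flat properties; the function and file handling are illustrative, not the actual openaddr code:

# Sketch: stream a GeoJSON FeatureCollection to CSV with ijson, holding only
# one feature in memory at a time. Assumes Point geometries and flat
# properties; illustrative only.
import csv
import ijson

def stream_geojson_to_csv(geojson_path, csv_path):
    with open(geojson_path, 'rb') as src, open(csv_path, 'w', newline='') as dst:
        writer = None
        for feature in ijson.items(src, 'features.item'):
            row = dict(feature.get('properties') or {})
            row['X'], row['Y'] = feature['geometry']['coordinates'][:2]
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)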

@NelsonMinar
Contributor Author

I'm reopening this because Mike's having some failures that suggest the EC2 system is running out of system memory and the Linux OOM Killer is wreaking havoc on our processing jobs. So the memory use might matter.

Some big sources:
us-ca-orange: 85MB of JSON data, 15G virtual RAM (12G resident)
us-al-calhoun: 550MB of JSON data, 10G of RAM (?)

NelsonMinar reopened this Jan 21, 2015
@iandees
Member

iandees commented Jan 21, 2015

I'd love to see one of those sources run with a memory profiler. Are there lots of dictionaries? Lots of strings?

@NelsonMinar
Contributor Author

I don't think it's Python memory; it's OGR. OGR doesn't have a streaming JSON parser, and you can see it balloon up with a simple ogr2ogr. (Streaming JSON is not trivial; few folks bother implementing it.) That's why I think we should switch to ijson to read GeoJSON sources; we can cut OGR out entirely.
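
You can reproduce the ballooning outside our code entirely; a tiny sketch using the OGR Python bindings (the path is a placeholder), where resident memory grows to many times the file size as soon as the layer is opened:

# Sketch: OGR's GeoJSON driver parses the whole cached file up front, which is
# where the memory goes. Path is a placeholder.
from osgeo import ogr

ds = ogr.Open('out/us-ca-solano/cache.json')  # entire file is parsed here
layer = ds.GetLayer(0)
print(layer.GetFeatureCount())                # resident memory has already ballooned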

@NelsonMinar
Contributor Author

@iandees can you comment on how hard it'd be to parse these ESRI sources without OGR? EsriRestDownloadTask looks pretty simple and the data is quite stereotyped. For that matter I'm wondering why we deal with GeoJSON at all. Would it be just as easy to convert the ESRI JSON a row at a time directly to CSV right as we're downloading it? That'd be swell.

@iandees
Member

iandees commented Jan 21, 2015

Sure, if that would help I can do it. This was originally written when the download tasks simply downloaded to an OGR-able geo format and there was a separate step that converted the cached data to CSV. How should I handle polygons and multipolygons getting written to CSV?

@NelsonMinar
Contributor Author

Just got confirmation from Mike the OOM killer is to blame. That causes a worker process to terminate in an unclean way, which triggers a design flaw in multiprocessing.Pool where the pool hangs. Yuck! The short-term workaround is to run fewer jobs at once and hope we don't run out of RAM. The long term fix is to use less RAM. Notes on that coming soon.

Mike's syslog excerpt showing OOM killer at work:

Jan 20 22:35:44 ip-10-171-76-182 kernel: [192792.152126] [ 9747]  1000  9747  2571099  2493044    5007        0             0 openaddr us-ca-
Jan 20 22:35:44 ip-10-171-76-182 kernel: [192792.152175] Out of memory: Kill process 9747 (openaddr us-ca-) score 323 or sacrifice child
Jan 20 22:35:44 ip-10-171-76-182 kernel: [192792.155213] Killed process 9747 (openaddr us-ca-) total-vm:10284396kB, anon-rss:9972176kB, file-rss:0kB

@NelsonMinar
Contributor Author

So proposal: EsriRestDownloadTask should download to a cache.csv file instead of a cache.json file. This is a product change but one I think should be OK.

@iandees if you could help with this that'd be awesome. I'd be happy to do it but I'm just about to leave on vacation; I may have a little time tomorrow and Thursday to work on it, but not later. Let me know if you're on it. This task isn't urgent; I think just running fewer jobs in parallel will give us a workaround for now.

What I'd like is for ESRI sources to be converted to some natural CSV format that's as close to the source schema as possible. Then the conform processing code can feed that to our csv_source_to_csv() for CSV sources which will convert that source-native CSV to an extracted CSV, and thence to out.csv.

(Although really that's a bit silly; if you can write a CSV with X and Y columns, in WGS84 coordinates, and in UTF-8 then the csv-to-csv transform is a no-op. If some ESRI sources aren't in WGS84, then it'd be better to write native coordinates in the cache.csv file and let csv_source_to_csv() do the conversion.)

For Polygons and Multipolygons, just outputting the centroid as X/Y is all we need. You can probably use OGR to do the centroid calculation by creating an OGR geometry, one per row from the ESRI source. Or just do the math in Python; that's what the Node code does.
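
If OGR is handy for that, the per-row calculation could be as small as the sketch below, assuming the row's geometry has already been converted to a GeoJSON-style dict (as the existing download code does); illustrative only, not the final implementation:

# Sketch: per-row centroid via the OGR Python bindings, starting from a
# GeoJSON-style geometry dict. Illustrative only.
import json
from osgeo import ogr

def centroid_xy(geojson_geometry):
    geom = ogr.CreateGeometryFromJson(json.dumps(geojson_geometry))
    centroid = geom.Centroid()
    return centroid.GetX(), centroid.GetY()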

@migurski
Member

I like this idea as well. The GeoJSON sources that blow up OGR are mostly our own creations, and we can choose an alternate form. CSV in UTF-8 sounds sensible, and OGR can read it natively, so we might not need to change any conform code. I'm unsure whether OGR's CSV reader has a default interpretation for X and Y columns. This page suggests it wants a VRT file for that.

@NelsonMinar
Contributor Author

FWIW we don't use OGR's reader for CSV sources now, mostly because the CSV parser code is all about dealing with strange CSV sources that OGR can't handle. If we write our own CSV files from ESRI they could go either through the OGR path or the CSV path. I'd prefer the CSV path; it keeps more code under our own control.

@migurski
Member

Got it. Glad we already have a CSV path; I keep getting confused with the extract code.

@iandees
Member

iandees commented Jan 21, 2015

One of my favorite features of OpenAddresses is that we store the source data in case the source disappears or is otherwise unreachable. Does the new Python code still do that? If so, I'd love to figure out a way to store the original data as close to the original format as possible (GeoJSON in the ESRI case) and then write a custom iterative JSON parser for GeoJSON to deal with memory issues in OGR.

@NelsonMinar
Contributor Author

The Python code still stores source data; it's the same code you wrote :-) The file is cache.json for ESRI sources and is in GeoJSON* format. I'm proposing changing that to cache.csv.

I don't think that CSV format is much further from the ESRI JSON source than GeoJSON is; they're both transformations of the data. Smashing the geometry to a centroid does degrade the source data though.

If being "close to the original format" is a goal, maybe that cache.json file should be ESRI JSON and we transform it later, writing an ESRI ijson parser for the conform step? That's about as much work as writing a GeoJSON ijson parser.

Note that the only GeoJSON sources we see now are ESRI; there are no other types of GeoJSON sources in our collection. There might be some day, but beware that writing a general purpose GeoJSON feature extractor from any random collection of Features and GeometryCollections and the like is harder than just parsing the stereotyped data coming out of the ESRI source. I wouldn't try to do that without good examples and tests.

*The JSON in cache.json we write and serve right now is not valid JSON. The last Feature has a trailing comma, which JSON doesn't allow; neither Python's json module nor the JSON tool jq will parse it.

@iandees
Member

iandees commented Jan 21, 2015

Yeah, true. How about storing the full geometry in WKT (or something similar) as a column of the CSV, and also writing out the same information with a centroid?

@NelsonMinar
Contributor Author

Putting the geometry in the CSV would work; the WKT won't be any bigger than the GeoJSON representation. It'd be nice to do the centroid calculation in the ESRI downloader like you suggest; the csv_source_to_csv() code currently doesn't know anything about geometries.

@iandees
Member

iandees commented Jan 21, 2015

If the ESRI download code is doing both the "original geometry as WKT" and "geometry as centroid" outputs, what file path should I write to? I would want the "original geometry" version to be cached/saved and the "geometry as centroid" to be passed on to the next step in the conform process.

@NelsonMinar
Contributor Author

I was thinking of only a single cache.csv with both a geometry column and X and Y columns. The CSV code doesn't understand geometries; it'll just read X and Y. (In the common case where the source geometry is a point we could skip writing the redundant geometry column and save some space.) There's a concrete sketch of the layout at the end of this comment.

Are ESRI JSON sources guaranteed to be in EPSG:4326? If not I suggest extracting everything in the source SRS; the CSV code does know how to reproject points.
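
To make that layout concrete, a cache.csv row might be written roughly like this; the attribute column names are hypothetical, and the real ones would come from the source schema:

# Sketch: write one cache.csv row with the source attributes plus X, Y and a
# WKT geometry column. The attribute column names are hypothetical.
import csv
from osgeo import ogr

def write_cache_row(writer, attributes, geom):
    point = geom if geom.GetGeometryName() == 'POINT' else geom.Centroid()
    row = dict(attributes)
    row.update({'X': point.GetX(), 'Y': point.GetY(), 'geom': geom.ExportToWkt()})
    writer.writerow(row)

with open('cache.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['ADDRESS', 'CITY', 'X', 'Y', 'geom'])
    writer.writeheader()
    # ...then call write_cache_row() once per downloaded feature...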

@iandees
Member

iandees commented Jan 21, 2015

I tell the ESRI server to reproject everything to 4326 for me, so we're guaranteed to have 4326 (or nothing, if for some reason the server can't reproject).
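
For reference, the reprojection is requested in the parameters sent to the layer's /query endpoint; roughly something like the following, though the real downloader's parameter set may differ:

# Rough illustration of the query parameters sent to an ArcGIS REST layer's
# /query endpoint; outSR=4326 asks the server to reproject. The real
# downloader's parameters may differ.
params = {
    'where': '1=1',     # select all rows
    'outFields': '*',   # all attribute columns
    'outSR': 4326,      # reproject to WGS84
    'f': 'json',        # ESRI JSON response
}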

I will use the X,Y,geometry column idea.

👍

@NelsonMinar
Contributor Author

🐾😸
🐾

@NelsonMinar
Contributor Author

Ian and I are working on ESRI-to-CSV now in the branch esri-outputs-csv. It's basically working already; it needs some cleanups to the test suite, plus extensive testing.

One wrinkle: we have a lot of user-contributed sources that are defined as
type: ESRI
conform: type: geojson

But the cached data isn't GeoJSON at all now; it's CSV. I've hacked the code to understand that ESRI/geojson really means ESRI/csv so it keeps working. It'd be nice to clean those source definitions up some time. That will require updating the Node code too.

@NelsonMinar
Contributor Author

I've done a full run of 137 ESRI sources and compared its output to a full run I did with the old code yesterday. In general it is very promising. If we can fix a couple of problems in the downloader I think the code is ready for production. It's very nice to see these jobs only use 50M of working memory. Also, in every case where an out.csv was produced, it more or less matched the out.csv from the old code (small differences in rounding of coordinates).

Errors: 25 of the 137 ESRI sources did not produce an out.csv file, though they did with the old code. We need to fix these before I'm fully confident we can switch to the new code. I can try to lend a hand tomorrow, but I suspect @iandees is in a better position to fix these.

Here's a tarball with the outputs of these 25 problem runs – http://minar.us.to/~nelson/tmp/esri-errors.tgz
In those directories, new.txt is the debug log from the new run; out.txt and out.csv are the old files. These errors should all be reproducible with openaddr-process-one.

I think we have two actual problems. Some sources don't work with the way we extract metadata to write the CSV column headers (maybe some ESRI protocol thing?). And the Python csv module has a sanity check that no field can be more than 128k, which barfs on a few sources with giant polygons. The latter should be easy to fix. There was also a third problem with two sources; it may be that they are returning bad data that OGR didn't catch.

Here are the errors I saw:

KeyError: 'fields': The way the ESRI download code gets the metadata for the schema isn't working on some sources.

us-ar us-ct-avon us-fl-alachua us-ga-gordon us-ia-linn us-il-tazewell us-in-madison us-mn-polk us-mn-wadena us-mn-yellow_medicine us-mo-barry us-mo-st_louis_county us-nv-henderson us-nv-lander us-nv-nye

TypeError: 'NoneType' object is not iterable: This only showed up in one source, and also seems related to metadata.

us-va-roanoke

Error: field larger than field limit (131072): Some very large fields are causing the Python CSV parser to barf. I think it has a sanity check that no field can be more than 128k. This can be overridden, details here; a one-line example follows the source list below.

us-al-calhoun us-ms-copiah us-ms-hinds us-nc-avery us-nc-burke us-nc-montgomery
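
If it helps, the override is something along these lines (the exact limit value is arbitrary):

# Raise Python's default 128 KB per-field limit before parsing CSVs with very
# large geometry columns. The chosen limit here is arbitrary.
import csv
csv.field_size_limit(2 ** 30)  # default is 131072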

TypeError: in method 'Geometry_AddPoint', argument 2 of type 'double': Not sure, maybe some of these sources don't contain valid coordinates?
us-nc-alexander us-va-city_of_emporia

@migurski
Member

The "type: geojson" thing has always seemed incredibly wrong to me. If we can synch it, I'd be happy to bulk-edit the main openaddresses source list README to reflect what we think should be there.

@iandees
Member

iandees commented Jan 22, 2015

With 3c1155d we should be handling the first three bugs you pointed out up there, @NelsonMinar.

Looking at the last one now.

@NelsonMinar
Contributor Author

@iandees thanks for 3c1155d! But is there some better way we can handle the metadata problem? I'd hate to lose those sources; the old code was able to get useful data. Maybe just fetch an actual row and look at the property names in the ESRI JSON to infer the schema? (See below; I was wrong.)

Edit: I tested us-ms-hinds, one of the sources with giant columns; it runs now and its output looks correct. And it used a tiny amount of RAM.

@NelsonMinar
Contributor Author

Oops, never mind! I think 3c1155d fixes all three issues. I'm not too worried about the fourth issue, so I'd say this code is good to merge. I leave it to @migurski to decide when to merge this relative to his testing. I think it's a significant improvement and will pretty much solve the memory problems in production.

On the metadata problem: when I was validating output against the old run I'd forgotten about issue #55. Almost every run that has the metadata problem did produce an out.csv, but I didn't notice it was one of those useless 23-byte, 0-row files. Those servers are just broken. (us-mn-wadena is the exception; I think it was just down when I was testing last night.) So I think this code now fixes issue #55 too. My apologies for my confusion earlier.

@NelsonMinar
Contributor Author

I did another full run with 3c1155d and am confident this new code is good and ready to merge. All tests pass.

@iandees, on the last TypeError problem for two sources, it appears that us-va-city_of_emporia is serving us a POINT(NaN, NaN). It'd be nice to just ignore rows like that, but I wasn't sure what error handling strategy was appropriate so I didn't try to add any. FWIW we've always had a few ESRI sources with various kinds of geometry errors; that's why this one doesn't concern me.
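
A minimal sketch of the "just ignore the row" option (not the fix that actually landed):

# Sketch: drop rows whose coordinates are NaN before building a geometry.
# Not the actual fix that landed.
import math

def has_usable_point(x, y):
    return not (math.isnan(x) or math.isnan(y))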

@iandees
Member

iandees commented Jan 23, 2015

@NelsonMinar the POINT(NaN, NaN) problem is solved with c6b86f5.

@migurski
Member

Travis looks good. Can I figure out how to get back the geometry type lost in 171fd8b? It’s a useful hint of potential parcel data. Otherwise, sounds good to rebase and push live. I’m doing thrice-daily complete runs to http://s3.amazonaws.com/ditch-node.openaddresses.io/index.html to make sure everything’s happy.

@NelsonMinar
Contributor Author

Thanks Ian, verified that us-nc-alexander and us-va-city_of_emporia are working now.

I don't have time to help with the geometry type before I leave, sorry. I took a quick look though; I think that value is set in the excerpt() function, which has a completely different way of parsing source data than conform(). Now that ESRI sources write CSV instead of GeoJSON, that code doesn't know how to guess the geometry type. Maybe add code to look in CSV files for a geom column and infer the type from the WKT there, else assume Point (a rough sketch is below)? Or, since the downloader code is the thing that really knows the geometry type, if there's a way to set the variable in the state object there, that would be good too. Whoever adds that should re-enable the tests I disabled in 171fd8b and bae0efe.
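
A rough sketch of that CSV-based guess, assuming a hypothetical geom column holding WKT (not the code that eventually landed):

# Sketch: guess the geometry type of a cached CSV by peeking at the first
# row's WKT. Assumes a hypothetical 'geom' column; falls back to POINT.
import csv

def guess_geometry_type(csv_path):
    with open(csv_path, newline='') as f:
        row = next(csv.DictReader(f), None)
    wkt = ((row or {}).get('geom') or '').strip()
    # WKT starts with its geometry keyword, e.g. "POLYGON ((...))".
    keyword = wkt.split('(', 1)[0].strip().upper()
    return keyword or 'POINT'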

@migurski
Member

Of course, the CSV thing. I'll make this my problem.

@migurski
Member

But I won't let it stop the merge process. If one of you can confirm for me that this merge should go into ditch-node I'll make that happen ASAP.

@NelsonMinar
Contributor Author

I think it's ready to merge.

@migurski
Member

Rebased and merged up.
