Investigate memory issues on fetch #452
Hi, I was just browsing around and this issue caught my attention. It seems the main problem is that you load the whole file into memory, then parse it, then make some calculations. That means a payload of, say, 50 MB of files gets duplicated several times in memory (on each `Array.map`, `JSON.parse`, and function call). If you use a fairly recent Node.js, you could use a combination of JSONStream and scramjet so that the files are downloaded no faster than they are processed. I'd be happy to help.
Hey @MichalCz, thanks for commenting! I think you're most likely correct. However, due to the current architecture, we may be stuck with some of this behavior.
All that being said, there is no reason we couldn't stream results from an adapter back to fetch.js and have it insert into the database via the stream. This would require some architectural rework, though. But if you wanted to check out one of the bigger adapters, like https://github.com/openaq/openaq-fetch/blob/develop/adapters/eea-direct.js or https://github.com/openaq/openaq-fetch/blob/develop/adapters/airnow.js, and see if you think this may be an option, that'd be really helpful!
I think it may be an option; the only problem is that I haven't checked scramjet with Node below 8.2 lately. The good thing with streaming is that you could simplify the code by a lot: the adapters would fetch the data in any sensible way you'd like them to, and the results would then be transformed by a simple map method into some intermediate format. That works even without inserting the data into the database as a stream (or via batch save).
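The "simple map to an intermediate format" could look something like the sketch below. The field names (`pollutant`, `concentration`, `timestamp`) are made up for illustration; the actual raw shapes differ per adapter.

```javascript
// Hypothetical normalizing map for the "intermediate format" idea:
// each adapter fetches raw rows however it likes, and one map()
// converts them to a common measurement shape. All field names
// on the input side are invented for this example.
function toMeasurement(raw) {
  return {
    parameter: String(raw.pollutant).toLowerCase(),
    value: Number(raw.concentration),
    unit: raw.unit || 'µg/m³',
    date: new Date(raw.timestamp).toISOString(),
  };
}

const rawRows = [
  { pollutant: 'PM10', concentration: '42.5', timestamp: '2018-02-01T10:00:00Z' },
];
console.log(rawRows.map(toMeasurement));
```

With a streaming pipeline this same function would be passed to `.map()` instead of `Array.prototype.map`, so it runs per record rather than over a fully materialized array.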
Well, I haven't run this airnow adapter, but the code would look much nicer with Node.js 9.2+ async/await structures, and regardless it won't run on Node below 8.2. Check this file out. I believe it will take the memory footprint down by quite a lot, as I removed at least 5 operations on the whole aggregated array.
I had some time to take a closer look, including at fetch.js. It seems there's quite some optimization to do there as well, and that would probably yield a much bigger win. I'd suggest you add a fetchStream method next to fetchData (fetchData could remain as a fallback). The overall code would be (excuse my pseudo-js):

```js
await (
  (adapter.fetchStream || DataStream.fromArray(adapter.fetchData))
    .use(utils.verifyStreamFormat)  // I need to think a little bit about this
    .map(utils.cleanupMeasurements)
    .batch(1024)                    // group entries in batches for more optimal inserts
    .map(db.insertRecords)          // a multi-row insert query here, resolved with results
    .consume()
);
```

I would probably still need to add something for the jsonschema validation, so let me know what you think.
Thanks @MichalCz! I'll try to look at this today, probably just the single-adapter changes for now. Do you know of a good way to measure the memory footprint before and after, to see how the changes affect it?
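One common way to get a quick before/after number (not necessarily what the project settled on) is sampling `process.memoryUsage()` around the work; for deeper analysis you'd take heap snapshots via `node --inspect` and Chrome DevTools. A minimal sketch:

```javascript
// Quick-and-rough memory comparison: sample process.memoryUsage()
// before and after the work. Numbers are approximate because GC
// can run at any time; heap snapshots give a more reliable picture.
function heapUsedMB() {
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

const before = heapUsedMB();
// Simulate a large parsed payload held fully in memory.
const big = new Array(1e6).fill().map((_, i) => ({ i }));
const after = heapUsedMB();

console.log(`heap grew by ~${(after - before).toFixed(1)} MB (${big.length} objects)`);
```

Running the same fetch with the old buffered path and the new streamed path, and comparing peak `heapUsed` (or the container's RSS), is usually enough to see whether the change helps.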
@MichalCz I looked at this a bit this morning, but ran into an error. Also, I updated the code to Node.js 9.5 but it was throwing an error, so I dropped it down to 8.2 and everything was happy again, so we should be good to go at least for using scramjet. Thanks!
@jflasher - I probably won't find time sooner than the weekend, but I'll find some then.
That would be excellent, thanks!
Hi @jflasher, with the modified fetch.js and eea-direct.js I'm happy to report quite some success 😄, even though I think there are still more optimizations to achieve. Here's the output from the original fetch:

After changing most of the transforms to streams:

In general: wall clock time -30%. Check it out: I pushed the changes to my branch. The fetch.js file is largely rewritten but should work with any adapter as before, although I haven't had enough time to check it in detail. I could make a pull request during the coming week, but I think you guys may want to take a closer look first and consider the options around optimizations to the other adapters. If you export
I'll take a look too! 👍
@dolugen I'll make a pull request for you this weekend, let me know if you find anything worth fixing...
We've seen a repeating failure mode where the fetch instance runs out of memory. The logs look like:
With 2ced514, memory for the container is up to 3200, which is a fair amount, but we're still hitting limits.
My guess is that one of the big sources (probably EEA) is doing a big update at this point and our system is trying to pull it all in. Not sure how best to handle this.
To examine this a bit further, I added

```js
console.log(source.name, (JSON.stringify(data).length / 1024).toFixed(2));
```

to https://github.com/openaq/openaq-fetch/blob/develop/fetch.js#L161 to get a rough size of the data coming back from the adapters. Results of that are below; sizes should be in KB. Note that when I ran this on my local machine (this was for all sources), all the EEA sources failed for some reason, which is interesting in and of itself. So nothing above seems super huge, given we have a machine with ~3GB RAM. Using HEAD on the EEA objects like
```sh
curl -I http://discomap.eea.europa.eu/map/fme/latest/FR_PM10.csv
```
we can see that these are often around 5 MB, but that's just at the moment I checked. I bet there are big spikes when more data is added.
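The one-off `JSON.stringify` sizing above can be wrapped in a small helper for reuse across adapters. Note the caveat: string length approximates bytes only for ASCII-heavy payloads; multi-byte characters make it an undercount.

```javascript
// Rough in-memory payload size, same idea as the console.log sizing
// used in fetch.js above. JSON.stringify length ~ bytes for mostly-ASCII
// data; it undercounts when multi-byte characters are present.
function roughSizeKB(data) {
  return (JSON.stringify(data).length / 1024).toFixed(2);
}

// Hypothetical sample payload, just to show the helper in use.
const sample = Array.from({ length: 1000 }, (_, i) => ({ id: i, v: 1.5 }));
console.log('sample payload ~', roughSizeKB(sample), 'KB');
```

Logging this per source over a few days would show whether the suspected EEA spikes actually line up with the out-of-memory events.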
Open to suggestions here. I kinda sorta like the simplicity of fetch at the moment, but part of me wonders if we should use this as a reason to move to something like: a Lambda function to kick off fetch -> Lambda functions for each adapter -> SNS, SQS, or Kinesis for each fetch result -> some worker to get the data into the database and S3.