Investigate memory issues on fetch #452

Closed
jflasher opened this issue Jan 29, 2018 · 13 comments

@jflasher
Contributor

We've seen a repeating failure mode where the fetch instance runs out of memory. Logs look like:

> openaq-fetch@0.1.0 start /app
> node index.js

2018-01-28T21:31:09.829Z - info: Skipping inactive source: Buenos Aires
2018-01-28T21:31:09.833Z - info: Skipping inactive source: Dr. Raphael E. Arku and Colleagues
2018-01-28T21:31:09.833Z - info: Skipping inactive source: Punjabi Bagh
2018-01-28T21:31:09.833Z - info: Skipping inactive source: Mandir Marg
2018-01-28T21:31:09.833Z - info: Skipping inactive source: RK Puram
2018-01-28T21:31:09.833Z - info: Skipping inactive source: Anand Vihar
2018-01-28T21:31:09.834Z - info: Skipping inactive source: Mosecom
2018-01-28T21:31:09.834Z - info: Skipping inactive source: Spartan
2018-01-28T21:31:09.900Z - info: Connected to database.
2018-01-28T21:31:10.039Z - info: Database migrations are handled, ready to roll!
2018-01-28T21:31:10.040Z - info: Running all fetch tasks.
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

<--- Last few GCs --->

   89591 ms: Mark-sweep 1293.0 (1434.9) -> 1292.9 (1434.9) MB, 1603.4 / 0 ms [allocation failure] [GC in old space requested].
   91204 ms: Mark-sweep 1292.9 (1434.9) -> 1292.9 (1434.9) MB, 1613.1 / 0 ms [allocation failure] [GC in old space requested].
   92872 ms: Mark-sweep 1292.9 (1434.9) -> 1292.3 (1434.9) MB, 1668.0 / 0 ms [last resort gc].
   94512 ms: Mark-sweep 1292.3 (1434.9) -> 1292.1 (1434.9) MB, 1640.0 / 0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x1e07949c9e59 <JS Object>
    1: __push [/app/node_modules/csv-parse/lib/index.js:286] [pc=0x46e997523c3] (this=0x192287b067d1 <JS Object>,line=0x278a8d36dc01 <JS Array[24]>)
    2: __write [/app/node_modules/csv-parse/lib/index.js:496] [pc=0x46e9974f7de] (this=0x192287b067d1 <JS Object>,chars=0x278a8d00b499 <Very long string[196057]>,end=0x1e0794904299 <false>)
    3: /* anonymous */ [/app/node_modules/csv-parse/lib/sy...

With 2ced514, memory for the container is up to 3200 MB, which is a fair amount, but we're still hitting limits.

My guess is that one of the big sources (likely EEA) is doing a big update at this point and our system is trying to pull it all in. I'm not sure how best to handle this.

To examine this a bit further, I added

    console.log(source.name, (JSON.stringify(data).length / 1024).toFixed(2));

to https://github.com/openaq/openaq-fetch/blob/develop/fetch.js#L161 to get a rough size of the data coming back from the adapters. Results are below; sizes should be in KB. Note that when I ran this on my local machine (this was for all sources), all the EEA sources failed for some reason, which is interesting in and of itself.

StateAir_AbuDhabi 8.92
StateAir_Dhaka 8.84
Guangzhou 7.88
Shenyang 7.86
StateAir_Manama 17.59
Beijing US Embassy 8.08
Bosnia 15.29
Australia - Tasmania 24.48
Chengdu 7.85
Shanghai 7.87
StateAir_Bogota 8.82
Slovenian Environment Agency 2.87
StateAir_KuwaitCity 8.92
StateAir_Ulaanbaatar 8.96
StateAir_JakartaSouth 9.02
StateAir_Addis_Ababa_Central 9.13
StateAir_Kampala 8.85
StateAir_Mumbai 8.81
StateAir_Hanoi 8.87
StateAir_Addis_Ababa_School 9.10
StateAir_Pristina 8.89
StateAir_JakartaCentral 9.07
StateAir_Chennai 8.83
StateAir_Lima 8.87
Ho Chi Minh City 9.10
StateAir_Kolkata 8.84
StateAir_Colombo 8.85
Bosnia_Tuzlanski 7.87
StateAir_Kathmandu 18.14
Thailand 77.83
Bozhou 4.52
Arpae Emilia-Romagna 47.33
StateAir_Hyderabad 8.88
Chizhou 7.26
Lu'an 8.31
Norway 48.14
Anqing 7.71
Xuancheng 6.16
Wuhu 8.03
StateAir_PhoraDurbar 18.39
Huangshan 7.09
Huaibei 7.35
Suzhou 6.83
StateAir_NewDelhi 8.89
Chuzhou 4.86
Bengbu 13.76
Ma'anshan 11.44
Fuyang 6.57
Andalucia 0.03
Tongling 14.02
Hefei 19.98
Peru Clima 0.03
Chile - SINCA 74.78
Huainan 14.08
DEFRA 139.45
CPCB 155.53
Sweden 220.91
ARPALAZIO 1735.44
Sao Paulo 0.03
AirNow 86.75
Netherlands 99.12
Richards Bay 378.47
Israel 7604.79

So nothing above seems super huge given we have a machine with ~3GB RAM. Using HEAD on the EEA objects like

curl -I http://discomap.eea.europa.eu/map/fme/latest/FR_PM10.csv

we can see that these are often around 5 MB, but that's just at the moment I checked; I bet there are big spikes when more data is added.
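
If we wanted to script that check (e.g. to log source sizes over time), a minimal Node sketch of the same HEAD request could look like this - the URL is just the example above and the whole snippet is only illustrative:

    const http = require('http');

    // Issue a HEAD request and report the Content-Length, mirroring `curl -I`.
    const req = http.request(
      { method: 'HEAD', host: 'discomap.eea.europa.eu', path: '/map/fme/latest/FR_PM10.csv' },
      (res) => {
        const bytes = Number(res.headers['content-length'] || 0);
        console.log(`FR_PM10.csv: ${(bytes / 1048576).toFixed(2)} MB`);
      }
    );
    req.end();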

Open to suggestions here. I kind of like the simplicity of fetch at the moment, but part of me wonders if we should use this as a reason to move to something like a Lambda function to kick off fetch -> Lambda functions for each adapter -> SNS, SQS, or Kinesis for each fetch result -> some worker to get the data into the database and S3.

jflasher added the bug label Jan 29, 2018
@MichalCz
Contributor

MichalCz commented Feb 8, 2018

Hi,

I was just browsing around and this issue caught my attention - it seems the main problem is that you load the whole file into memory, then parse it, and then make some calculations.

This means that a payload of, let's say, 50 MB of files is duplicated in memory a couple of times (on each array.map, JSON.parse, and function call).

If you use a fairly recent Node.js, you could use a combination of JSONStream and scramjet so that the files are downloaded no faster than they are processed.
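
To illustrate the general idea with csv-parse (which is already in your stack trace) rather than scramjet - a minimal sketch, using the EEA URL from above just as an example:

    const http = require('http');
    const parse = require('csv-parse');

    // Rows are parsed as the response body arrives, so only a small window of the
    // file is ever in memory at once, instead of the whole multi-megabyte payload.
    http.get('http://discomap.eea.europa.eu/map/fme/latest/FR_PM10.csv', (res) => {
      const parser = parse({ columns: true });
      res.pipe(parser);
      parser.on('data', (record) => {
        // transform / insert each record here instead of accumulating a big array
      });
      parser.on('error', (err) => console.error(err));
      parser.on('end', () => console.log('done'));
    });

scramjet wraps the same pattern in a map/filter/batch style API on top of standard Node streams.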

I'd be happy to help.

@jflasher
Contributor Author

jflasher commented Feb 8, 2018

Hey @MichalCz, thanks for commenting! I think you're most likely correct. However, due to the current architecture, we may be stuck with some of this behavior.

  1. Each individual adapter comes up with a final list of all the data it sees from a source.
  2. The adapter sends this data back to the main fetch.js code.
  3. fetch.js attempts to insert each adapter's data into the database, where each record is either rejected (because it's already there; this is a uniqueness check) or accepted.

All that being said, there is no reason we couldn't stream results from an adapter back to fetch.js and have it insert into the database via the stream, roughly along the lines of the sketch below. It would require some architectural rework, though. But if you wanted to check out one of the bigger adapters like https://github.com/openaq/openaq-fetch/blob/develop/adapters/eea-direct.js or https://github.com/openaq/openaq-fetch/blob/develop/adapters/airnow.js and see whether you think this may be an option, that'd be really helpful!
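
Something like this is what I have in mind - the names (fetchStream, saveStream, insertBatch) are purely hypothetical, just to sketch the shape, not existing code:

    const http = require('http');
    const parse = require('csv-parse');

    // Hypothetical adapter side: hand back an object-mode stream of rows
    // instead of building one big aggregated array.
    exports.fetchStream = function (source) {
      const parser = parse({ columns: true });
      http.get(source.url, (res) => res.pipe(parser));
      return parser;
    };

    // Hypothetical fetch.js side: insert in batches as rows arrive, so memory is
    // bounded by the batch size rather than the source size.
    // (Simplified: no backpressure or error handling.)
    function saveStream(stream, insertBatch, batchSize = 1024) {
      let batch = [];
      stream.on('data', (row) => {
        batch.push(row);
        if (batch.length >= batchSize) {
          insertBatch(batch);
          batch = [];
        }
      });
      stream.on('end', () => { if (batch.length) insertBatch(batch); });
    }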

@MichalCz
Contributor

MichalCz commented Feb 8, 2018

I think it may be an option - the only problem is that I haven't checked scramjet with node below 8.2 lately.

The good thing with streaming is that you could simplify the code a lot - the adapters would fetch the data in any sensible way you'd like them to, and the results would then simply be transformed by a simple map method into some intermediate format. That holds even without inserting the data into the database as a stream (or via batch saves).

@MichalCz
Contributor

MichalCz commented Feb 8, 2018

Well I haven't run this airnow adapter, but:

https://github.com/signicode/openaq-fetch/blob/3c93556bbb4e5d1b365cf308f20a2bfc1d369ca9/adapters/airnow.js

The code would look much nicer with Node.js 9.2+ async/await structures, and regardless it won't run on Node below 8.2. Check this file out.

I believe it will take the memory footprint down quite a lot, as I removed at least five operations on the whole aggregated array.

@MichalCz
Contributor

MichalCz commented Feb 9, 2018

I had some time to take a closer look, including at fetch.js - it seems there's quite a bit of optimization to do there as well, and that would probably yield a much bigger improvement.

I'd suggest you add a fetchStream method next to fetchData (fetchData could remain as a fallback); the overall code would be (excuse my pseudo-JS):

await (
    (adapter.fetchStream || DataStream.fromArray(adapter.fetchData))
        .use(utils.verifyStreamFormat)  // I need to think a little bit about this
        .map(utils.cleanupMeasurements)
        .batch(1024)                    // this will group entries in batches for more optimal inserts
        .map(db.insertRecords)          // here there should be a multi-row insert query, resolved with results
        .consume()
)

Now I would probably need to add something for the jsonschema validation - something like the sketch below - so let me know what you think.
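
Reusing the names from the pseudo-JS above (stream stands for the fetchStream/fromArray expression; measurementSchema and log are placeholders), validation could be a filter stage, assuming the jsonschema package:

    const { Validator } = require('jsonschema');
    const validator = new Validator();

    // Drop (or just log) records that don't match the measurement schema
    // before they reach the batching / insert stages.
    stream
        .map(utils.cleanupMeasurements)
        .filter((measurement) => {
          const result = validator.validate(measurement, measurementSchema);
          if (!result.valid) log.warn('Dropping invalid measurement', result.errors);
          return result.valid;
        })
        .batch(1024)
        .map(db.insertRecords)
        .consume();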

@jflasher
Contributor Author

Thanks @MichalCz! I'll try and look at this today, probably just the single adapter changes for now. Do you know of a good way to measure the memory footprint before and after to see how the changes affect it?
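
(The crudest thing I can think of on my side is just logging process.memoryUsage() on an interval during a run and comparing peaks - a rough sketch, in case that's good enough:)

    // Log RSS and heap usage every few seconds while a fetch run is in progress.
    const mb = (n) => (n / 1048576).toFixed(1);
    const timer = setInterval(() => {
      const { rss, heapUsed } = process.memoryUsage();
      console.log(`rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB`);
    }, 5000);
    timer.unref(); // don't keep the process alive just for this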

@jflasher
Contributor Author

@MichalCz I looked at this a bit this morning, but ran into an error saying map() is not a function on getStream(). I presume this was just due to what was being returned, but I'm afraid I've never fully wrapped my head around Promises and streams, so I didn't troubleshoot too much. If you're interested, it'd be very helpful if you could make a similar modification to https://github.com/openaq/openaq-fetch/blob/develop/adapters/eea-direct.js, which is probably the adapter where it would matter most. If you could comment it well, we could use it as a template for new adapters to follow similar patterns.

Also, I updated the code to Node.js 9.5 but it was throwing an error, so I dropped it back down to 8.2 and everything was happy again. We should be good to go, at least for using scramjet.

Thanks!

@MichalCz
Contributor

@jflasher - I probably won't find time before the weekend, but I'll find some time then.

@jflasher
Contributor Author

That would be excellent, thanks!

@MichalCz
Contributor

Hi @jflasher,

With the modified fetch.js and eea-direct.js, I'm happy to report quite a bit of success 😄 - even though I think there are still quite a few optimizations to achieve, this should already be quite good.

Here's the /usr/bin/time -v output for both cases. First, the original fetch:

        Command being timed: "node index.js --dryrun --source EEA France"
        User time (seconds): 12.32
        System time (seconds): 0.55
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.81
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 576488
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 143989
        Voluntary context switches: 28184
        Involuntary context switches: 27
        Swaps: 0
        File system inputs: 0
        File system outputs: 21520
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After changing most of the transforms to streams:

        Command being timed: "node index.js --dryrun --source EEA France"
        User time (seconds): 6.10
        System time (seconds): 0.28
        Percent of CPU this job got: 46%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.84
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 170976
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 36222
        Voluntary context switches: 2329
        Involuntary context switches: 45
        Swaps: 0
        File system inputs: 0
        File system outputs: 21520
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

In general:

Wall clock time: -30%
Max RSS memory: -75% !!! (and this could be much less)
CPU time (user): -50%

Check it out - I pushed the changes to my branch.

The fetch.js file is substantially rewritten, but it should work with any adapter as before - although I haven't had enough time to check it in detail.

I could make a pull request during the coming week, but I think you guys may want to take a closer look first and consider the options around optimizing other adapters. If you export fetchStream from an adapter, fetch.js will use it.

@jflasher
Contributor Author

Thanks @MichalCz, I'll try to take a look soon! @dolugen, this might also be of interest for you to look over?

@dolugen
Contributor

dolugen commented Feb 22, 2018

I'll take a look too! 👍

@MichalCz
Contributor

MichalCz commented Mar 2, 2018

@dolugen I'll make a pull request for you this weekend - let me know if you find anything worth fixing...
