
Major performance issue with hapi.js 15.x #3347

Closed
lostthetrail opened this issue Sep 13, 2016 · 32 comments

@lostthetrail commented Sep 13, 2016

What are you trying to achieve or the steps to reproduce?

Under high concurrency, hapi 15.x memory usage spikes until the process fails, whereas hapi 14.x had no problem.

'use strict';

const Hapi = require('hapi');
const hoek = require('hoek');

const server = new Hapi.Server({
    connections: {
        router: {
            stripTrailingSlash: true
        }
    }
});

server.connection({
    port: process.env.NODE_PORT || 8080,
    host: process.env.HOSTNAME || '0.0.0.0',
    routes: {
        response: {
            emptyStatusCode: 204
        }
    }
});

server.route([
    {
        method: 'GET',
        path: '/validate',
        handler: function handle(request, reply) {
            return reply({
                validate: true
            });
        }
    }
]);

server.start(function handleServerStart(serverStartErr) {
    hoek.assert(!serverStartErr, serverStartErr);
    console.log(`Server running at: ${server.info.uri}`);
});
artillery quick --duration 60 --rate 100 -n 100 http://localhost:8080/validate

What was the result you received?

With hapi 15.x, the server hits a "JavaScript heap out of memory" error and crashes.
With hapi 14.x, the test completes successfully with zero errors.

What did you expect?

No major failures and zero errors.

Context

  • node version: 6.5.0 & 4.5.0
  • hapi version: 15.0.3
  • os: OSX & Centos
  • any other relevant information: I am aware this is a very high RPS to push through a server, but it is an attack vector. I was also able to replicate this with JMeter, so the load tool isn't important in this case; just high concurrency and RPS.
@hueniverse (Member) commented Sep 19, 2016

I cannot reproduce this issue with hapi@15.0.3. I ran the artillery test about 20 times against the same process. I ran it 3 times in parallel. I then ran 3 parallel artillery processes at 200 rps over 2 minutes - I got lots of errors on the artillery side (timeouts and disconnected sockets) but not a single memory error on the hapi server side. When that test was done, I was able to connect to the server and get the response.

Can someone else reproduce this?

And to the people posting thumbs up, what does that mean in this context? That you can reproduce it? That you are just concerned? You are not being helpful in any way.

@mtharrison (Member) commented Sep 19, 2016

I can't replicate any memory issues, but I do see slightly higher CPU usage with hapi 15.x.x compared with 14.x.x. The additional time is mainly spent in Joi validation for every request, coming from internals.Podium.registerEvent.

[Screenshot: profiler output, 2016-09-19]

@genediazjr commented Sep 19, 2016

Hi, @mtharrison what did you use for the profiling?

@mtharrison (Member) commented Sep 19, 2016

V8 inspector (node --inspect).

@genediazjr commented Sep 19, 2016

@mtharrison thanks!

@lostthetrail (Author) commented Sep 19, 2016

It appears there is some correlation between the failure I am seeing in Hapi and the version of Artillery used during the test.

I have pushed up an example with the specific version I used pinned:
https://github.com/lostthetrail/hapi-15-mem-issue

npm start in terminal 1
npm test in terminal 2

I had a co-worker clone the repo and follow the README. He was able to reproduce via the repo above.

@Marsup (Member) commented Sep 19, 2016

The most important source of memory retention seems to be _notificationsQueue on the hapi Connection object. I don't see why it would retain so much unless we never reach a point where setImmediate runs those events, but I thought the whole point of using it over nextTick was to yield to I/O; I'm puzzled.

@Marsup (Member) commented Sep 19, 2016

@hueniverse it doesn't seem to make a difference, but is https://github.com/hapijs/podium/blob/master/lib/index.js#L168 really in the right place?

@lostthetrail (Author) commented Sep 28, 2016

Let me know if there is anything else I can do.

@rfk commented Oct 3, 2016

FWIW, I can confirm that the steps in #3347 (comment) reproduced the bug for me as well, with the server process from npm start eventually dying with Allocation failed - process out of memory.

Based on #3347 (comment) I added some diagnostic logging inside podium, and the process got up to logging _notificationsQueue.length == 642928 before it died.

@Marsup (Member) commented Oct 3, 2016

That's a difficult trade-off: if we switch from setImmediate to nextTick so that node actually processes those events, we accept fewer sockets, but at least the server doesn't fall over. I still find it weird that node never runs setImmediate as long as I/O is keeping it busy; this looks like a flaw in node.

@DavidTPate commented Oct 3, 2016

@Marsup It actually makes sense (based on the docs) that what you are seeing is happening, below is why:

setImmediate

Schedules the "immediate" execution of the callback after I/O events' callbacks and before timers created using setTimeout() and setInterval() are triggered. Returns an Immediate for use with clearImmediate().

When multiple calls to setImmediate() are made, the callback functions are queued for execution in the order in which they are created. The entire callback queue is processed every event loop iteration. If an immediate timer is queued from inside an executing callback, that timer will not be triggered until the next event loop iteration.

nextTick

The process.nextTick() method adds the callback to the "next tick queue". Once the current turn of the event loop runs to completion, all callbacks currently in the next tick queue will be called.

This is not a simple alias to setTimeout(fn, 0), it's much more efficient. It runs before any additional I/O events (including timers) fire in subsequent ticks of the event loop.

So callbacks scheduled with setImmediate only run after the I/O callbacks of an event loop iteration have been processed; if I/O never lets up, they never get called. Callbacks added with nextTick, in contrast, run as soon as the current operation completes, before any further I/O.
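The ordering described above can be shown with a minimal sketch (plain Node, no hapi involved): sync code runs first, then the nextTick queue drains, and only then does the check phase run setImmediate callbacks.

```javascript
'use strict';

// setImmediate callbacks wait for the check phase of the event loop;
// process.nextTick callbacks drain as soon as the current operation
// completes. The array below records the resulting execution order.

const order = [];

setImmediate(() => order.push('immediate'));
process.nextTick(() => order.push('nextTick'));
order.push('sync');

setImmediate(() => {
    // Queued after the first setImmediate, so it observes all three.
    console.log(order); // ['sync', 'nextTick', 'immediate']
});
```

If I/O keeps the loop busy enough, that check phase is exactly the part that gets starved.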

@Marsup (Member) commented Oct 3, 2016

I know that, but I'm having a hard time understanding how I/O can never leave even a tiny amount of time for it to kick in.

@vladikoff commented Oct 3, 2016

Allocation profile chart below:

[Image: allocation profile chart]

Download it via:

hapi15v3.heapprofile.zip

based on https://github.com/lostthetrail/hapi-15-mem-issue

@DavidTPate commented Oct 3, 2016

Just going through the Node code (both JS and C++) around the underlying functionality of setImmediate, it can definitely get into what look to be some really interesting edge cases (still digging into the code, so I could be wrong here). For example, it will actually fall back to nextTick if it fails to execute the timer during the callback operation, and it reconstructs the queue.

There's also re-use of the handles for the Timers, but that doesn't appear to be a case this particular code would fall into from what I'm seeing. Definitely going to dig around some more.

@vladikoff commented Oct 24, 2016

@hueniverse @mtharrison how can we help move this issue forward?

@hueniverse (Member) commented Oct 26, 2016

This requires some attention from node core contributors who can help understand the actual cause. @cjihrig

@cjihrig (Contributor) commented Oct 26, 2016

So far, using hapi@15.0.3 and the hapi-15-mem-issue repo on OS X, I was able to recreate this on:

  • v6.0.0
  • v6.5.0
  • v6.9.1 (latest Boron release)

I was unable to reproduce on:

  • v4.5.0
  • v4.6.1 (latest Argon release)
  • v5.12.0 (the final v5 release)
  • v7.0.0

I was also unable to reproduce on v6.9.1 using hapi@14.2.0.

After a bit of bisecting on Node v6.9.1, the break occurs at 63f7d89. That commit represents the bump from podium@1.2.1 to 1.2.3. Within podium, the break occurs in podium@1.2.2, which is only 1 commit.

As mentioned previously in this thread, the problem seems to be coming from the use of setImmediate() (failures were also seen with setTimeout()). Substituting nextTick(), or not deferring at all, eliminates the error. It's worth noting that Node's builtin event emitter does not defer.

@Marsup (Member) commented Oct 26, 2016

@cjihrig isn't it surprising to you that there isn't any time left for node to slip in even a single setImmediate? Is it at all possible to saturate node like that?

@cjihrig (Contributor) commented Oct 26, 2016

@Marsup yes, it's possible to saturate Node. Since setImmediate() defers its work, the flood of incoming requests can prevent the setImmediate() callbacks from executing, causing memory consumption to increase. I added a counter for the setImmediate()s, and you can see them gradually slowing over time.
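The kind of counter described above can be sketched as follows (an assumed shape, not the exact instrumentation used): track how many deferred callbacks were scheduled versus how many have actually run; the gap between the two is the backlog that was observed growing in _notificationsQueue.

```javascript
'use strict';

let scheduled = 0;
let executed = 0;

// Each notification schedules its work on the check phase, the way a
// setImmediate-based deferred emit would.
function deferredNotify() {
    scheduled += 1;
    setImmediate(() => {
        executed += 1;
    });
}

// Simulate a burst of notifications queued in a single event loop turn.
for (let i = 0; i < 1000; i +=  1) {
    deferredNotify();
}

// At this moment nothing has run yet: the whole burst is backlog.
console.log(`scheduled=${scheduled} backlog=${scheduled - executed}`);
```

Under a sustained flood, new requests keep arriving faster than the check phase can drain the queue, so the backlog (and memory) only grows.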

@Marsup (Member) commented Oct 26, 2016

Then I guess the only solution is to go with nextTick in podium @hueniverse.

@cjihrig (Contributor) commented Oct 26, 2016

FWIW, using setImmediate() isn't necessarily wrong. The system is being stress tested, and eventually it breaks down. You'll also get an out of memory crash if you just write while (true) { console.log(); } :-)

hapi's load handling features would probably be another good way to mitigate the problem.
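Those load-handling features can be configured per connection (a sketch against the hapi 15.x API; the threshold values here are arbitrary examples, not recommendations):

```javascript
'use strict';

const Hapi = require('hapi');

const server = new Hapi.Server({
    load: {
        sampleInterval: 1000 // ms between load measurements
    }
});

server.connection({
    port: 8080,
    load: {
        maxHeapUsedBytes: 1073741824, // ~1 GB heap used: start shedding load
        maxRssBytes: 2147483648,      // ~2 GB RSS
        maxEventLoopDelay: 5000       // ms of event loop delay
    }
});
```

When any threshold is exceeded, hapi responds with 503 Service Unavailable instead of queueing more work, which bounds memory growth under a flood.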

@hueniverse (Member) commented Oct 26, 2016

Wouldn't nextTick() just starve something else that this test isn't looking for? I am not sure this performance issue is a real problem. Is this causing actual production issues or is this a theoretical limit given this particular test?

@Marsup (Member) commented Oct 26, 2016

It would slow down accepting new connections, so maybe a few clients would time out, but you'd be less likely to run out of memory in the case of a DDoS.

@pjcraven commented Oct 26, 2016

On our project we had to revert to 14.2.0 due to memory usage increasing to the point of failure while using nes. This is on Heroku, FWIW.

@vladikoff commented Oct 26, 2016

Is this causing actual production issues or is this a theoretical limit given this particular test?

Yes, crashes production servers on routes that are high usage.

@cjihrig (Contributor) commented Oct 26, 2016

I don't think performance is an issue, but memory consumption might be. Is there a reason podium needs to defer when emitting events, when Node core's event emitter does not?

@devinivy (Member) commented Nov 5, 2016

@cjihrig I wonder if it's a way to recurse through podiums and source podiums without worrying about creating an arbitrarily large stack.

@mcollina commented Nov 29, 2016

Please check if hapijs/podium#17 solves your issues.

@vladikoff commented Nov 29, 2016

@lostthetrail it seems like your repo is holding up with the fix from #3347 (comment), wanna double check?

@lostthetrail (Author) commented Nov 29, 2016

@vladikoff Yup. I am on it. At a conference so it may be a few hours.

@hueniverse hueniverse self-assigned this Nov 29, 2016

@hueniverse hueniverse added this to the 16.0.0 milestone Nov 29, 2016

hueniverse added a commit that referenced this issue Nov 29, 2016

@lostthetrail (Author) commented Nov 29, 2016

Confirmed. The patch to podium does fix the ever-growing memory usage. Thanks!
