Significant performance hit caused by throwing Errors: should be returned or passed to callback #5149

Closed
CrabDude opened this Issue Mar 26, 2013 · 34 comments

Projects

None yet

9 participants

@CrabDude

Branching from a related issue.

The way node currently handles errors exacerbate an application's ability to avoid an undefined state, increasing restart frequency by multiple orders of magnitude. If node handled errors as proposed below, this could be avoided.

  • invalid input errors should be passed to callbacks (async) or returned (sync). There is much precedent for this. JSON.parse is a poor API because JSON.validate does not exist, requiring a try/catch. Similarly, fs.readFile.validate, http.createServer.validate, util.inherits.validate, url.parse.validate, etc... do not exist.
  • handlers (and callbacks as necessary) should be wrapped in a try/catch to avoid undefined state in core.
  • Errors caught from handlers should be passed to domain.active or either emitted as a separate event like uncaughtApplicationException or have err.internal = false.

Goal:
Since nothing hurts performance worse than a process restart, anything that creates undefined state should be treated as the worst performing code in the entire codebase.

@isaacs
isaacs commented Mar 26, 2013

JSON.parse is a poor API because JSON.validate does not exist, requiring a try/catch.

It's much much more difficult to validate JSON strings than to make sure that a path string is actually a string. If it can be validated via a typeof check, then you don't need a special method for it.

The "precedents" you cite are from two different languages, neither of which are JavaScript, and neither of which are Node.

handlers (and callbacks as necessary) should be wrapped in a try/catch to avoid undefined state in core.

No. Qv the discussion in #5114. Actually try a patch to do this, and compare the benchmarks that @trevnorris suggested.

Errors caught from handlers should be passed to domain.active or either emitted as a separate event like uncaughtApplicationException or have err.internal = false.

Throws already go to the active domain. Your breakdown of "error" vs "exception" vs "application exception" is not a real thing. We're not going to reproduce all of the error handling logic in the hottest code path in node.

Since nothing hurts performance worse than a process restart

I disagree. Nothing hurts performance worse than a poorly performing component.

Write a very small master process whose only job is to manage a handful of cluster workers. When the domain catches an error, do server.close(), and exit gracefully. In the master process, whenever you get a cluster.on('disconnect'), create a new worker.

Voila. Nearly zero performance impact from graceful shutdowns. On the other hand, a misbehaving undefined-state worker is a real serious hazard to your system, because it's not clear what the error is.

(Of course, there are other ways to achieve similar results, which scale across machines, but then you have to use an http proxy. @Substack has written some very interesting tooling to achieve this. You can still do those other techniques along with the cluster/domain approach, and double-up on your error resilience.)


I am sorry to be so severe in shutting down this request. My goal is not to be cruel, or insensitive to your needs. I only wish to be very clear about what it is that we're willing to do, and what we are not. We are well past the point of inventing whole new error handling paradigms, especially patterns that will cause errors to be ignored or missed in existing Node programs.

@isaacs isaacs closed this Mar 26, 2013
@CrabDude

First, I don't mind being shut down, so long as discussion is occurring, and already it has yielded some potential progress as in #5114. Equally, I care more that the full depth and breadth of the conversation is explored and considered than whether the issue is closed. Any extent to which I continue beyond the issue being closed is in pursuit of this.

I referenced Go to disarm any rebuttals that the idea is unprecedented. While, as you say there is little precedence in JavaScript and node for returning errors, there was also none for error-first callbacks. I would place that discussion on par with that, if only to raise the consideration for node 2.0, though since the suggestion thus far in node is to avoid try/catch and merely pre-validate, I doubt many modules or applications would be affected by such a change.

Likely what I'll do is build a module that does precisely that, and see how pervasive invalid input try/catchs are in the module ecosystem. I suspect they're not, but will certainly report back if I'm wrong.

Your breakdown of "error" vs "exception" vs "application exception" is not a real thing.

I disagree, and you have yet to address any of my points on the distinction, nor did you provide any specific reasons for or examples of why my assertions regarding which create an undefined state are incorrect. It is incredibly important, as it drives the discussion on #5114 regarding whether invalid input exception can even be passed to the callback. You seem to (somewhat appropriately) treat undefined state as a mystical thing with a lot of FUD. My extremely verbose argument in #5114 is an attempt to chip away at a large portion of that FUD. You have instead chosen to focus on the question of whether the proposed solution is even feasible from a performance standpoint, an important question that needs to be answered nonetheless.

Your point about optimizing for the base case is an extremely important consideration. There will always be a point at which performance tradeoffs are acceptable, but the bar should be very high. We're in agreement here.

However, your point about nearly zero performance impact from graceful shutdown is a strawman. What we're talking about is a production server running 1000s of concurrent requests encountering the error from #5114 suddenly having to return 1000s of 500s due to "Gateway Timeout"s because the http client entered into an undefined state due to a failure to wrap handlers in try/catchs thinking the requests have timed out and we tried to let the requests be handled gracefully not knowing whether they were affected by the uncaughtException. The real performance hit to users is that 1000s of concurrent requests time out. Now extrapolate that across multiple processes and you have a server that performs nearly as bad as possible, and this is the current state of things and a very real consideration at LinkedIn.

@isaacs
isaacs commented Mar 27, 2013

suddenly having to return 1000s of 500s due to "Gateway Timeout"s because the http client entered into an undefined state due to a failure to wrap handlers in try/catchs thinking the requests have timed out and we tried to let the requests be handled gracefully not knowing whether they were affected by the uncaughtException.

That's the strawman, right there.

Use a domain. Send a 500 response to the request that triggered the error. Stop listening in the worker. Finish any existing requests in due time, and shut down gracefully. Meanwhile, spin up a new worker. You have 15 other workers that are not affected even slightly. (You know that server.close() just stops listening, right? It doesn't destroy any existing connections.)

I don't get why you are assuming that "thousands of requests" all have to be affected. When I say "your mission is to get out as soon as possible", I don't mean that you must process.exit() right away in that tick. But it should be a priority to stop routing new requests to this error-state process, and to close down as gracefully as you can in the near future.

@carlos8f

Isaac, I agree with your assessment here, this is a good case for using domains and shutting down gracefully if an unexpected exception is thrown. this has really nothing to do with performance, since a potentially corrupted application state is not generally a good tradeoff for avoiding the split second it takes to spawn a process.

Re: exceptions, I think the usefulness can be broken into two cases:

  1. known exceptions: for complex synchronous actions (such as JSON.parse). i don't think there's really a problem with this style, other than these need to be documented as much as possible. JSON.parse is left without a try/catch often in the wild, and it's unfortunate, but education/documentation is more the culprit than the api. i don't agree that node 2.0 should return errors (i find that more awkward than try/catch).
  2. unknown exceptions: (catch it with a domain, trigger a graceful shutdown sequence defined by the program and re-start the program in a clean state).

I think it's worth noting that the more you care about performance, the more distributed your application tends to be designed (spread out to multiple machines, ideally), and the more you can contain errors and minimize the functional and performance impact they have on the rest of your concurrent requests. I developed the amino suite which uses a load-balanced elastic pool of workers (amino-gateway and amino-deploy), when one of these workers times out or disconnects in some way, it is automatically removed from all load-balancers via pub/sub (in lieu of a traditional health check), and in some cases respawned in the background. this is a strategy that helps isolate faults and has worked very well in production.

@CrabDude

This is not about continuing in an undefined/corrupted state. This is about avoiding that state and decoupling node's state from your application state. If someone wants to provide any reason at all why that's not at least partially avoidable to the tune of what I've laid out here and in #5114, please do. In every one of his responses, @isaacs unfortunately has simply not provided any evidence to support his view or refute a single point of mine on the matter. If I'm wrong, then any resulting discussion will be significantly more fruitful than the otherwise lack of it.

Unknown core exceptions should require a restart.

Unknown application exceptions should be at the purview of the developer.

This is the current de facto state of how most node.js developers are writing their applications, and partially responsible for the popularity of things like promises.

@isaacs
isaacs commented Mar 29, 2013

@CrabDude

This is about avoiding that state and decoupling node's state from your application state.

As discussed on IRC, what you're asking for here is, in a nutshell, for Node's internals to be more stable and more resilient to operator error. You are asking for Node to be bulletproof.

Ok. Fine, I agree. Patches welcome.

Given the nature of the language we're using here, there are some things that are always going to be potentially hazardous, because JavaScript has a shared global environmetn, mutable state, and throw abruptly jumps out of loops to who-knows-where (and it is prohibitively expensive to wrap in try blocks at every level). This situation admits an iterative "find and fix bugs and problems" solution, not a sweeping "Node needs to change everything" solution.

We're not going to do crazy stuff like lock all the objects, make state immutable, or run your code in an entirely separate context. We pass functions around, because Node uses callbacks. If what you want is Haskell or Erlang, you should be using Haskell or Erlang. That comes with other tradeoffs, as you can see in those languages. JavaScript is what it is, and some of the things we like about it also open the door for certain problems.

In every one of his responses, @isaacs unfortunately has simply not provided any evidence to support his view or refute a single point of mine on the matter.

If you have a complaint about my handling of these issues, please address me directly, preferably via email, so that we can resolve our communication difficulties.

If you have a problem with a specific response that I've made, please refer to it specifically.

Saying "@isaacs doesn't provide evidence for anything he says, or addressed any of my complaints" is a personal attack, and patently untrue. I've been addressing many of your points (usually by explaining why your proposed solution would have adverse effects), and there are many cases where I've provided evidence to support my views. They're just not your views, or evidence that you find satisfactory. I feel offended and frustrated by what I perceive as a personal attack in this statement. This is not the appropriate forum in which to air grievances about my character. Please stick to the topic of the Node.js program, what problems you encounter with it, etc.


I suggest we back up a few steps.

What are your needs?

What is happening that is causing problems for LinkedIn? (Details would be helpful here, stack traces, example code, etc.)

Why are the proposed solutions inadequate?

Please be specific about your needs and problems, and let's do solutions afterwards. Show me the actual bug, and we can address how to fix it.

You've suggested a lot of solutions, without explaining the specific problem you're trying to solve (except "bad programs cause bad state", which is not really solvable). It has not been satisfying for any of us. I suggest a different approach. Continuing in this manner will only continue to lead to frustration. Every time you get the same response (use process isolation, get out asap on errors) you feel like you're not being heard. So stop pitching solutions, and just tell me the piece that I'm missing.

Use a domain. Send a 500 response to the request that triggered the error. Stop listening in the worker. Finish any existing requests in due time, and shut down gracefully. Meanwhile, spin up a new worker. You have 15 other workers that are not affected even slightly. (You know that server.close() just stops listening, right? It doesn't destroy any existing connections.)

I don't get why you are assuming that "thousands of requests" all have to be affected. When I say "your mission is to get out as soon as possible", I don't mean that you must process.exit() right away in that tick. But it should be a priority to stop routing new requests to this error-state process, and to close down as gracefully as you can in the near future.

Why is it not acceptable to isolate your application into multiple processes, and shut down gracefully when the domain catches an error?

Can you provide some details about the actual issues you have encountered using that method? That would be much more informative and productive for all of us.

@carlos8f

Unknown core exceptions should require a restart.
Unknown application exceptions should be at the purview of the developer.

If both are unknown, what's the practical difference between the two? Core API's aren't necessarily special, other than that they have bindings, which npm modules can have also.

@othiym23

I operate in three different roles with Node:

  1. An application developer who's mostly interested in quickly adding features to applications. In this capacity, I try to follow community conventions, not reinvent the wheel, and use only the standard features described in the API documentation.
  2. A tooling developer who's trying to capture state about the runtime in order to provide actionable information about applicaiton performance to other developers. In this capacity, I need access to the internal state of Node, need to understand community conventions so I can support them, and need frequently to spend time working with the source of both Node and popular modules to ensure that I can get the information I need without altering application behavior or breaking Node.
  3. An interested observer who stands on the periphery of Node core development, trying to figure out how to extend and alter Node to make role 2 easier, and also to square the Node team's understanding of the world with the needs of the broader application developer community.

Due to the nature of my job, I've worked with a lot of end-user code over the last year, and due to the specific details of what New Relic does, I've spent a lot of time obsessing over error handling in that time. I completely understand and sympathize with where @isaacs is coming from – it's difficult, if not impossible, to reconcile exceptions with the primitive continuation-esque async call chain that Node, and JavaScript, impose on us all. At the same time, there's a pretty major disjuncture between what the core team knows to be safe and unsafe and what the community at large is doing. I've seen uncaughtException handler code that dispatches exceptions off to Airbrake and then keeps going. I've seen a lot of discussion of using domains to capture and contain crashes within HTTP handler calls. I've seen loads of finally clauses in places where there's not a chance in hell they're going to do what they're meant to do.

Some of this is down to poor communications -- after spending a lot of time working with domains, the documentation makes sense, although it does a poor job of explaining to newcomers what they're for or why you'd want them. There's also not a clear sense in the documentation that domains are meant to be used as a mechanism for graceful shutdown rather than recovery. It seems entirely reasonable that you can use them to clean up after failed calls and proceed. I've been meaning to take a pass at a PR to make those things clearer, and it's on me for not having carved out the time to do that yet.

At the same time, there's a larger issue, which is that there isn't a really strong message overall about how Node deals with errors, which is a less strong way of saying that error-handling is kind of a mess in Node. We don't have the hot code replacement of Erlang (nor do I see much discussion among Node developers of crash-only architecture), we don't have the plethora of means for dealing with exceptions of Haskell, we don't have the thread-local storage for containing additional context of Python or Ruby, and we don't have the strong tools for reflection and introspection provided by the JVM and CLR.

In all of those cases, ultimately we're bumping up against the limitations of using a language intended for safe execution in an untrusted environment to run long-running server processes – mutation would be less of an issue if we could at least inspect the internal state of closures, V8 isn't built to take on the added complexity of treeshaking closures and optimizing code wrapped in try blocks, everybody suffers when the interface between V8 and libuv increases in complexity, and so on. Still, it kinda sucks in a systemic way, and when you look at trycatch, you can see what it is that @CrabDude's trying to make happen, which is to make this a simpler and more consistent story for everyone.

We can't make Node foolproof, and trying to do so runs the risk of compromising Node's key advantages, which are its speed and its minimalism (I almost wrote "simplicity", but I actually don't think Node is all that simple anymore). I do think there is a lot of value in making the behavior of core more consistent wrt how it deals with errors, even if it involves doing more work on the C++ side of things to back that up. #4583 is an example of this -- I agree with @hueniverse that a runtime error (or an error triggered by input data) should be coming in through the callback rather than being thrown -- even if making that happen is a pain in the ass for core developers. The reason for this is that doing so reduces the amount of undefined state that results from the computed gotos that are exceptions, and it makes it more reasonable to at least contemplate recovering from errors rather than shutting down.

The alternative is to put a lot more emphasis in the documentation on how to properly design a crash-only architecture using tools like cluster and forever, and maybe moving more of that style of process management into core. I'd be interested to hear @substack's thoughts on that, given that he hasn't seemed wildly enthusiastic about having cluster in core anyway. I do feel that whatever strategy we want users to follow around this stuff should be achievable using only the core distribution, and we should have lots of examples available to developers of what we consider the right / conventional way to do things.

I feel like this is a huge opportunity for us to improve Node developers' lives, and I'm sorry this is long on abstract principles and short on concrete suggestions, but I guess I wanted to see if I could bridge the gap I see here.

@bnoordhuis
Member

a runtime error (or an error triggered by input data) should be coming in through the callback rather than being thrown -- even if making that happen is a pain in the ass for core developers.

That's not the reason I'm against it. From an implementation POV, it's a pretty trivial change.

I'm opposed because I believe it will make application debugging much harder. With an exception, you know where you stand - the offending call site is right there in the stack trace.

When you start passing error objects around, that information gets lost. That's something long stack traces address but they're too expensive time and space-wise to enable in production applications.

What's more, errors objects have a tendency to get passed down the chain of callbacks (that is, it's often not the immediate callback that deals with it but some aggregator further down the line), which makes backtracking that much harder.

Better tooling would do much to address this (long stack traces, a smart debugger, etc.) but those tools don't exist yet.

@trevnorris

I just want to second what Ben is saying. Recently felt some of this pain
after upgrading to 3.17.13. You must generate the stack trace the moment
the function is called because once it's gone through the event loop the
stack is completely lost. It's not a possibility to preemptively generate
the trace on every function call just in case it's needed later.

Debugging from an asynchronous stack is pretty much useless, and only
brings pain and suffering. If you want an example of the difficulty doing
this look at the first DEBUG in fs.js. js is loose enough that pre
validating inputs is not difficult. There are also many cases were
automatic type coercion will take care of any errors.

@CrabDude

@isaacs I'm extremely sorry for the frustration and offense I've made you feel. I have failed in communicating if you feel this way, and it is not my intention to attack your character.

Part 1: The original issue

The current bugs we have hit are now being addressed since you reopened #5107. 5107 was avoidable from the beginning by wrapping http callbacks in a try/catch or process.nextTick though.

Is that a bug in node.js if the application can circumvent it?
Yes, I believe it is.

Will it prevent a graceful shutdown of all concurrent requests?
Yes, they will timeout unnecessarily.

Is it affecting us in that it cannot be worked around?
No.

Will others be bitten by the bug?
Yes, almost certainly.

Has the core team now addressed our most specific issue we're facing?
Yes.

Conclusion:
No further action required beyond 5107.

Part 2: Node is crash only.

This may be a little long, but I just want to be clear about what we're talking about. This doesn't need to be resolved in core, but it's difficult to have a conversation about whether the solution is worth the performance impact if we're not acknowledging there's a problem and properly defining it.

This specifically is what I was referring to when I said you weren't addressing my points. Your response was

Your breakdown of "error" vs "exception" vs "application exception" is not a real thing.

If it's not a real thing, why would you even consider a patch? If node should crash-only on any thrown exception (b/c none should be caught b/c there's no distinction) a patch should be discouraged. You never addressed any of my arguments about how and why it is a real thing and about how and why the distinction allows us to avoid restarts.

There is a context to this statement:
"Node is crash only on uncaught exception."

Is there a difference between an "uncaught exception" (exceptions) and an "exception that shouldn't be caught" (errors)?
Yes, it turns out there is.

If we don't throw exceptions, will we still require a restart?
No. Most of these are in core for invalid input. These could passed to callbacks instead, avoiding the need for an unnecessary restart.

What causes us to require a restart on exception then?
throwing

If exceptions don't require a restart and could be passed, why do we throw them forcing a restart?
We shouldn't.

If we never throw exceptions, will we ever require a restart?
Yes, * but only on errors,* which are far less frequent.

But what about runtime errors in application code?
These can be caught at the appropriate boundaries, avoiding the need for a restart.

Prove it:

fs.readFile = function(path, cb) {
// stuff
somethingAsync(function(data) {
  try {
    cb(null, data)
  } catch(e) {
    // Pass this somewhere that knows a restart is unnecessary
  }
  // continue doing stuff
})
// stuff
}

And this prevents core from entering into an undefined state?
Yes.

And this prevents application runtime errors from requiring a restart since we avoid placing core in an undefined state?
Yes.

But isn't the application code now in an undefined state?
Depends.

On what?
On what you're doing.

For example:

// This does not require a restart
http.createServer(function(req, res) {
  try {
    // Whoops! This is a bad request that doesn't contain a "user-agent" header.
    var isCurl = req.headers['user-agent'].indexOf('curl') !== -1
    // We'll never get here...
    // do stuff
  } catch(e) {
    res.writeHead(500)
    res.end(500, e.stack)
  }
}

// This does
var foo = ['hello']

http.createServer(function(req, res) {
  try {
    // After the bad request, foo was not properly placed back into its proper starting value state as an array, causing this to now error on every request.
    foo = foo.join('')
    // Whoops! This is a bad request that doesn't contain a "user-agent" header.
    foo += req.headers['user-agent']
    // We'll never get here...
    res.end(foo)
    foo = ['hello']
  } catch(e) {
    res.writeHead(500)
    res.end(500, e.stack)
  }
}

So not all unexpected application errors require a restart?
Correct.

So if we write our application code in a way that does not unnecessarily share state, and we protect core from any runtime errors in our application code, we could avoid a restart because our application never enters into an undefined state?
Yes.

But isn't try/catch slow?
Slower than what?

Isn't try/catch slower than not having it?
Not if you're having to restart constantly when you don't have to.

But what's so slow about restarting?
For starters, it requires a lot of synchronous IO.

But what about having a worker pool?
Imagine you have a minor error that is touched by 100% of all requests, but is completely non-critical. Even with a worker pool, you'll be limited to the number of requests your box can serve using one process per request (or slight more than one if you're lucky). Now let's say the process is non-critical, but we're willing to at least wait for a response.

If unnecessary restarts were avoided, even the erring process could still handle 10k concurrent requests resulting in (status code) 500s at 15ms each. If on the other hand, the process restarts on error, then it can only handle as much as a process per request model can handle, which is less and slower, slowing down the rest of the site.

But why don't you just fix the error, or never write errors?
One, you would as fast as you can, and two, errors happen.

The devil's in the details. In one model, an error occurred, but did not slow your site down, in the other, if you're lucky your box that was running at 40% capacity, will have enough head room to deal with the requests, slightly slower. If you're unlucky, it will spillover and DoS your box adding a timeout latency to every upstream request.

Why not just set your timeout to something minimal?
There are of course ways to mitigate, but in the end you've taken a problem that didn't have to be a problem at all even in a fault-tolerant SOA architecture, and turned it into a liability, unnecessarily.

Conclusion:
Yes, try/catch is slower than no try/catch, but crashing due to one error in your entire application often can be a lot slower than adding a couple try/catches.

Only a patch will reveal the real impact though, but I'm content to at least have it acknowledge the problem exists and is solvable, regardless of the efficacy of the solution.

@isaacs
isaacs commented Mar 30, 2013

@othiym23 In the latest release, a significant example and warning was added to http://nodejs.org/docs/latest/api/domain.html Do you think that it addresses some of your issues?

Everyone on the node core team, myself included, has some experience (past and/or present) running Node in production and having to debug real world apps. In fact, the most vocal core-team opponent to "never throw ever" is @bnoordhuis, based primarily on his experience debugging Cloud9. I've gotten similar feedback from some of the devs at Joyent.

Of course, at the "edge" of your application, the thing that talks to things you didn't write (like web browsers, foreign servers, etc.) you have to be a bit more Postel's Law about things.

@CrabDude Github just plunked your reply in here while I was writing this, I haven't read it yet, but I will now.

@CrabDude

@bnoordhuis @trevnorris
I agree a runtime error should not come in on the callback. We're in an undefined state here, and we should maintain the line that caused them and restart.

Are you also talking about why passing invalid input exceptions to the callback would be bad? Because in this case it doesn't apply because your throw line will always be:

throw new TypeError('bad path')

and you would only see that if you're throwing, which defeats the whole purpose of distinguishing them as exceptions, not errors, and passing them on the callback (i.e., you don't need a restart).

@othiym23

@isaacs That's a helpful addition to the docs, and I hadn't seen that yet. Sorry for jumping the gun!

I also apologize for the implicit call to authority in my comment. It wasn't intended. My intent was to present what I've seen in the code my customers have been writing, and to explain what's been useful to me in trying to help them without direct access to their production environments. Trying to deal with errors third parties are seeing is a real and ongoing pain point for me with Node.

@bnoordhuis @trevnorris I agree re: the uselessness of async stacktraces – most of the time. In certain situations, when you run into runtime faults triggered by bad data (rather than developer error), immediately throwing is just as, if not more, useless, than an async stacktrace, because as a developer you end up with insufficient context to figure out what went wrong. I think I'm just rehashing some of @hueniverse's argument at this point, so I'll leave it at that. I do think that Postel's Law is a good reference when thinking about this stuff, though – a big piece of the problem is components included by Node, like OpenSSL, that are either indifferent or hostile to the idea of being liberal about what they accept.

I also agree that tooling is a big part of the solution. It's distressing how frequently dropping a long stacktrace module into a badly-behaved application will immediately make clear what's going wrong, in a way that logging or debugging won't. Being able to step through turns of the event loop would be incredibly helpful, too.

@isaacs
isaacs commented Mar 30, 2013

If node should crash-only on any thrown exception (b/c none should be caught b/c there's no distinction) a patch should be discouraged.

To be clear, I'm not advocating "never catch errors". I'm advocating "catch errors, and then deal with your undefined state, for example by closing servers, finishing existing responses, and then shutting down asap." Also, fix whatever it is that's letting errors get thrown. Maybe it should write to a log file that is being monitored by a program that sends an SMS to the cell phones of all the dev team whenever it sees an error.

In JavaScript, you often cannot safely have a "crash only" design, because we don't have adequate isolation of requests for that. Crash-only approaches are feasible only when you can be sure that the crash is going to be suitably isolated. We can't make that guarantee of a node process, for obvious reasons.

I would like you to stop referring to my suggestion as "crash only". That term has a specific technical meaning in the context of servers and error handling, and it's not the technique that I am advocating. (At least, not for most user-facing web servers.)

If we don't throw exceptions, will we still require a restart?
No. Most of these are in core for invalid input. These could passed to callbacks instead, avoiding the need for an unnecessary restart.

There is some debate about specifically which sorts of errors should be passed to callbacks, and which should be thrown. If that's what this issue is about, we should talk about that specifically, and not get into all this other stuff.

I still don't know what the difference is between "Errors" and "Exceptions". Is an "exception" a thing that's thrown? It seems like this is a technical term you're using, and I just re-read your last comment where you broke it into all these different categories, and I still don't get it.

As a general rule, if it's worth creating an Error object, then it's a Big Problem.

In Node, you can break errors into three basic categories based on how they're handled:

  1. Egregious detectable indefensible programmer mistakes. fs.readFile({ number: 12 }, "what is this i don't even", /regexp not a cb/)
  2. Run-time errors from sync IO functions. fs.readFileSync('file-does-not-exist')
  3. Run-time errors from async IO functions. fs.readFile('file-does-not-exist', callback)

(2) and (3) are not really worth discussing much, since we all agree about them. (2) gets thrown, and (3) gets passed to the callback.

I'd say that "erroneous run-time state on an EventEmitter" is a special case of (3), but just s/callback/emit('error')/. You could call that a fourth category, I suppose, if you wanted to be very technical.

(1) is where it gets sticky. "Egregious" is a judgement call. One person's "horrible obvious mistake" is another person's "bad state at run time". However, if we make it easy to deal with the second at the expense of the first, then that's making a bad trade-off that makes debugging harder long term.

We're not talking about cases where it's expensive to test for validity, like JSON.parse. Most of the time, it's a typeof or a range check.

Maybe you might take issue with crypto.randomBytes getting a negative number, or url.decode getting a non-string. But what about fs.readFile({number:12}, /definitely not a function/). Should that throw, or not?

If you agree that this clear and obvious mistake should throw, then we agree that there are cases where it's justified to throw on invalid input, rather than pass it to a callback or something, and we only disagree on exactly where that line is.

Some other follow-up:

  1. The complications of passing detectable TypeError mistakes to the callback is nil. But what if you pass a non-function as the callback? Is THAT a TypeError that can be thrown? (So again, there is a line, it's just a matter of specifically where it ought to be placed!) The complication of catching emit() handler errors and making sure that the other handlers get called, or allowing domains to "fix" the error and safely continue from where it was thrown, is completely unreasonable, however.
  2. @othiym23 No worries about any claims of authority. You've been "in the shit" wrt error handling, far more than most, so you've got some authority and I appreciate the input.
  3. "throwing is just as, if not more, useless, than an async stacktrace, because as a developer you end up with insufficient context to figure out what went wrong." -- Watch for new V8 options to produce a core dump whenever an error goes unhandled. Even "one in a million" errors should be debuggable if they happen in production one time, provided you gather sufficient data at the time.
  4. Regarding tooling, I think there's still a long way to go with dtrace integration (or etw/systemtap) and core dump analysis. Ben's got some stuff in C++ that does full js-and-c++ stacks (a la mdb jsstack) but at run time. Now that socket handles have file descriptors, we can in theory trace that through its entire lifetime. There's a lot more work going on there.
@davepacheco

FWIW, I'm with @isaacs, @bnoordhuis, and @trevnorris here. TypeError and
ReferenceError are examples of errors thrown in many contexts where they
represent simple but egregious programmer mistakes (e.g., a mistyped variable
name) -- mistakes we all make from time to time, no matter how careful. As a
programmer, I'd much rather explicitly validate user input so that the runtime
can safely assume that when it does see a TypeError or ReferenceError, it's
clearly one of these egregious bugs, and the best thing to do is to fail
explicitly and quickly.

If that's manifesting as a performance problem, then the solution is to fix the
underlying bugs, not paper over them. (Yes, you'll always have errors, but they
won't be a performance problem if they're not happening often, and if they're
happening often, you've got a more serious issue with your software than its
performance.)

@CrabDude
CrabDude commented Apr 2, 2013

Since we're jumping around a bit and to facilitate discussion, here's a formalization of my Goal, Assumptions and Conclusions:

Goal:
Remove all avoidable sources of undefined state in core, and the corresponding need to restart.[1]

Assumptions: (Givens)

  1. Definition "undefined state": An indeterminate state.
  2. Core should crash (eventually) on undefined state.
  3. Core should not crash if not in an undefined state.
  4. TRESUR[2] Errors indicate undefined state within the layer they occur.
  5. throw creates undefined state (within the existing layer) if the state wasn't already undefined
  6. Thrown Errors create undefined state only to the breadth of their bubbling, which can be limited if caught.
  7. It is the purview of the developer/application whether an Error in application logic creates an undefined state.

Conclusions:

  1. throw on detectable undefined state. - Given 2
  2. Don't throw ever unless already in an undefined state - Given 3
  3. TRESUR Errors in core should not be caught. - Given 2, 4
  4. All other Errors should be passed to callbacks. - Given 3, 5
  5. Errors from sync (IO and non-IO) calls should be returned. - Given 3, 5
  6. Invalid input does not indicate undefined state. - Given 1
  7. Invalid input Errors should be passed to the callback or returned. - Proof 4, 5, 6
  8. Core should catch thrown Errors from the application layer. (See #5107) - Given 3, 4, 6
  9. A mechanism should exist to differentiate crash-level Errors from caught application Errors. - Given 2, 3, 7

QED

These are hierarchical of course. (i.e., It should first be concluded that avoiding unnecessary undefined state in core is desirable, then whether the assumptions are valid, and finally whether the conclusions are valid, what are their alternatives, and what are their tradeoffs)

[1] Since restarts require more CPU, IO and application logic than not.
[2] Unanticipated run-time errors: TypeError, ReferenceError, EvalError, SyntaxError, URIError, RangeError

@isaacs
isaacs commented Apr 2, 2013

@CrabDude You're making this all much more complicated than it has to be. I don't have the slightest idea what you're QEDing about.

Remove all avoidable sources of undefined state in core, and the corresponding need to restart.[1]

You mean... don't ever have errors in any program? That is unrealistic.

The rest is just confusing.

@CrabDude
CrabDude commented Apr 2, 2013

@isaacs

Do you have a definition of what you define as "undefined state" and the reasoning for why it requires a restart? Like an actual formal expression of what an "undefined state" is, how you decide that you're in an "undefined state" and why it is that "undefined state" requires a restart and what specifically causes it?

You mean... don't ever have errors in any program? That is unrealistic.

Errors are not sources of undefined state.
For example, the following error does not create any undefined state:

function foo(bar) {
  if (bar == null) return new Error('Invalid Input')
}

fs.readFile(path, function(err, data) {
  // I don't care if this fails b/c it's not critical
  foo(baz)
})

I list sources of undefined state in "Assumptions":

* TRESUR Errors indicate undefined state within the layer they occur.
* throw creates undefined state (within the existing layer) if the state wasn't already undefined

These are sources of "undefined state" as I see it. It's easy to go backwards to conclude that I define "undefined state" as either explicit (manually throwing on an invalid assumption of internal state, like that you've connected the database by the time you call query) or implicit (a unexpected thrown TRESUR error).

Here is an example of an error that creates an undefined state:

function foo(bar) {
  // if (bar == null) return new Error('Invalid Input')
  bar.qux = 'hello'
}

process.nextTick(function() {
  // I don't care if this fails b/c it's not critical
  foo(baz)
})

The reason this creates undefined state and the other does not, is because all the other nextTicks will not fire and we don't know if they're important or not.

The question you need to ask is, first do you agree that my second example creates an undefined state and second, that the other does not? Why?

@awwright
awwright commented Apr 2, 2013

I'm completely unable to decipher this conversation because no one can seem to use the terms consistently.

ECMAScript doesn't have Exception objects, it has try/catch/finally exception support that uses Error instances. If you're using Exception with a capital E, you're doing it wrong, I don't know what an Exception is.

Since when does Node.js crash? Without using native .node modules, I've never seen a Node.js crash. Maybe you refer to an uncaughtException exit? But that's not a crash, crashes are fatal.

What is an "undefined core state"? If I'm calling a safe (no modification of data) asynchronous function, the worst that can happen is that one of the callbacks is never called, maybe because an Error was never caught or properly tested. But if you do proper error checking, and proper try/catch, there's no such thing as a safe function that produces a "undefined state" never mind a "undefined core state". Furthermore, if I'm writing a file and I get an Error, restarting the process isn't going to fix anything, I have no way of knowing if the file was written or not, and restarting the process doesn't revert that. So indeed, what is this "undefined state" we're talking about?

@CrabDude
CrabDude commented Apr 2, 2013

@Acubed The situation with the terminology is unfortunate, but we're limited by what's available to us. FWIW, I took great care to keep the terminology consistent when I originally broached the subject in #5114.

ECMAScript doesn't have Exception objects

Correct, again, see #5114, but for this discussion Exception was meant to distinguish (as it does in other languages) a non-fatal Error. The capitalization was meant to distinguish it as a proper noun with a very exact meaning.

Since when does Node.js crash

Node can crash in that V8 can crash and uncaughtExceptions can come from errors in core. A part of what we're talking about here is uncaughtException, but it's specifically about distinguishing between uncaughtExceptions originating from TRESUR[1] Errrors in core, uncaught invalid input errors in core and application errors; and which need to be thrown and which require a restart.

You are correct; the current issue at hand is defining "undefined state," what it means for core to be in an "undefined state," and how it gets there, so we can address whether it's avoidable, and what is required to avoid it.

[1] Unanticipated run-time errors: TypeError, ReferenceError, EvalError, SyntaxError, URIError, RangeError

@davepacheco

We can spend years debating approaches to software structure, but there's nothing directly actionable in this issue. A patch would go a long way towards focusing the discussion.

@CrabDude
CrabDude commented Apr 2, 2013

@davepacheco See Specific goal, assumptions and approaches. The goal and conclusions are subjective, but the assumptions are objective.

Also, the OP has a bullet list of 3 actionable items. All 3 are non-trivial, and originally seemed like they would be a waste of time to implement since it didn't seem like anyone even understood the point I was trying to make (I'm still not sure if anyone understands the point I'm trying to make).

A patch would go a long way towards focusing the discussion.

Agreed.

FWIW, this started with a pull request, (promptly closed) followed by an issue (promptly closed, though now open after an irc chat with @isaacs) followed by a simpler issue (promptly closed) leading here.

I'm discussing it here precisely because the original pull request and issues failed to generate discussion on the core issue restated here. I would be more likely to submit a patch knowing anyone understands my point or my goal.

@trevnorris

@CrabDude Some of what I'm understanding is that you want to be allowed to make assumptions about the data you're working with. Which explains a lot about how/why you would experience so many application crashes. For example you wrote the following:

http.createServer(function(req, res) {
  try {
    // Whoops! This is a bad request that doesn't contain a "user-agent" header.
    var isCurl = req.headers['user-agent'].indexOf('curl') !== -1
    // We'll never get here...
    // do stuff
  } catch(e) {
    res.writeHead(500)
    res.end(500, e.stack)
  }
}

Which implies you assume req.headers['user-agent'] is available and depend on a try/catch to bail out if it doesn't. I absolutely agree this doesn't require a restart, but I also disagree with the approach. Mine would be like the following:

http.createServer(function(req, res) {
  if (!req && !req.headers['user-agent']) {
    res.writeHead(500);
    return res.end(500, 'Missing your freakin user-agent!');
  }
  var isCurl = req.headers['user-agent'].indexOf('curl') !== -1;
}

See, except for extreme cases like using JSON.parse, I never use try/catch. Because if I don't have a conditional handling the issue then I want it to explode in a flaming ball of death, and imho I don't want the other paradigm forced on me. js fields are easily verified and cheaply cast, so I like to know what my code is doing at all times.

Also when you're trying to squeeze out every last ounce of performance, allowing any unverified parameters through is likely to cause a DEOPT. v8 only allows a function to DEOPT 10 times before it's marked for non-optimization. Which means from that point forward you're going to loose the performance advantage Crankshaft could have given you.

EDIT: and figured I'd throw in that anytime my code would hit the catch I'd consider that an undefined state. Simply because I don't know exactly why it was reached. Therefore I'm not sure how to recuperate and continue.

DOUBLE EDIT: Was looking at another example you posted:

function foo(bar) {
  if (bar == null) return new Error('Invalid Input')
}
fs.readFile(path, function(err, data) {
  // I don't care if this fails b/c it's not critical
  foo(baz)
})

Then don't throw from the function if it's not critical. Instead do like the following:

function foo(bar) { }
fs.readFile(path, function(err, data) {
  if (baz != null)
    foo(baz)
  // just for an example, grabing the stacktrace w/o throwing
  else
    console.trace('baz == null');
})

There, you've just jumped over the non-critical function call and saved yourself the expense of calling a function, throwing, catching and generating a stack trace. I never make assumptions about my inputs, or what state my program is in. If ever my code would reach a catch then I want it to let me know violently.

@awwright
awwright commented Apr 2, 2013

I'm not sure a patch would do anything until we know what functionality we're looking for in our patch. The arguments that @CrabDude makes deserves serious consideration.They are formal and laid out, I like this. These issues really have no purpose being closed.

I would elaborate on application restarts:

An application restart is one of the most expensive tasks that a Node.js program can do, by design. An application restart must be avoidable in all situations, especially where user input may be involved, which poses a significant security risk (denial of service).

I would truncate the goal, I don't see how it has anything to do with restarts:

Goal: Remove all avoidable sources of undefined state in core.

Assumptions: (Givens)

I still have no clue what an undefined state is, and "TRESUR" seems unnecessary. How about this:

  1. Definition "undefined state": Any part of a program whose state could not be predictable by examining the return values is said to be undefined.
  2. Core must raise an Error if attempting to use an undefined state.
  3. Core must not crash if not in an undefined state.
  4. Unanticipated run-time errors (TypeError, ReferenceError, EvalError, SyntaxError, URIError, RangeError) indicate undefined state within the layer they occur. -- Not necessarially, a TypeError doesn't change the state of the program at all, it's effect is entirely predictable.
  5. throw creates undefined state (within the existing layer) if the state wasn't already undefined -- This is misleading, a single "throw" inside a single "try" block does not create any undefined state at all. Combined with (6) this may be redundant.
  6. Thrown Errors create undefined state only to the breadth of their bubbling, which can be limited if caught.
  7. It is the purview of the developer/application whether an Error in application logic creates an undefined state.

Is this all consistent with your definition of "undefined state"?

Throwing because of a JSON parse error by itself does not create an undefined state, throwing inside a try block, when there are calculations also inside the try block, does not create an undefined state. The "finally" block exists entirely so application developers can write programs without throwing any part of their application into an undefined state, e.g. so I can unlock a resource that was previously locked.

Likewise, if I'm writing to a file, and I get an Error either thrown or returned in a callback, is my application said to be undefined? It could be, but there's no reason it couldn't. In my understanding, Node.js can't be half-successful, either it returns success and all the contents were written out, or an Error is returned, or thrown, or passed-by-argument to a callback.

Conclusions:

  1. throw on detectable undefined state. - Given 2
  2. Don't throw ever unless already in an undefined state - Given 3
  3. TRESUR Errors in core must not be caught. - Given 2, 4 -- Not necessarily, this may be required of all types of Errors precisely to prevent core libraries from entering an undefined state. What if one of the functions locks a resource then an Error is thrown, shouldn't the error be caught so the resource can be unlocked?
  4. All other Errors must be passed to callbacks. - Given 3, 5
  5. Errors from sync (IO and non-IO) calls must be returned. - Given 3, 5 -- If Conclusion-3 says errors must not be caught, how are you supposed to return the Error object instead? Also, silent-by-default errors should be considered bad. There's nothing wrong with catching these errors, or am I missing something?
  6. Invalid input does not indicate undefined state. - Given 1
  7. Invalid input Errors must be passed to the callback or returned. - Proof 4, 5, 6
  8. Core must catch thrown Errors from the application layer. (See #5107) - Given 3, 4, 6
  9. A mechanism must exist to differentiate crash-level Errors from caught application Errors. - Given 2, 3, 7 -- I'm still not sure what a crash-level error is

Many of the other conclusions may not follow in light of my arguments so I can't comment on them.

@CrabDude
CrabDude commented Apr 2, 2013

@trevnorris To be fair, I'm familiar with JavaScript best practices.

I don't write code that way, it was meant to be demonstrative of a point, that application errors happen and that undefined state is avoidable when they do. We all write errors and they're not always found before code ships to production, nor are they always detectable on code inspection.

The issue here is whether the current architecture of node is brittle and vulnerable to DoS attacks due to the necessity to restart on application errors.

It would be naive to suggest a constantly restarting server pool could handle as many 500s as one that avoided undefined state and restarts altogether (though to what extent needs to be quantified). I've shipped production servers that didn't restart using my trycatch library (though in retrospect it was overzealous in catching all errors, but in practice nearly all errors in production were application errors (not invalid input, TRESUR, or other fatals, though those did happen occasionally), so it was a non-issue).

Also, your point about V8 DEOPT is valid in that we of course want to pre-validate for optimization, and I appreciate the numerous V8 reference links you provided earlier, but it doesn't seem to apply here specifically since we're comparing it with a restart, and I don't think the performance effect of a deoptimized function (that will be deoptimized again on restart) compares with processes constantly coming down and back up.

Because if I don't have a conditional handling the issue then I want it to explode in a flaming ball of death

What are you building that this is your preference over merely logging the error and returning a 500? This is the complete opposite of what I want, especially since it's completely unnecessary.

@trevnorris

@CrabDude Thanks for the clarification. So I guess what you're saying is
that despite defensive programming you still run into errors that would
cause your application to crash?

I think the discussion of restarting the node process needs to end. It's
apparent that the reasons I'd expect my app to crash and those of yours are
completely different. Mainly because I don't expect my app to crash often
enough that restarting it would be a performance hit.

@bjouhier
bjouhier commented Apr 2, 2013

@trevnorris @CrabDude
Different people have different needs and I think that node should accomodate that.

If you have a small application with very tight code you can probably bring it to a level which is almost bug free. Then the cost of restarting it everytime there is an exception is probably acceptable.

But if you have a large application with lots of contributors and lots of rules, you have to live with the fact that it may contain bugs and you don't want these bugs to crash the application every time. What you typically do then is put a try/catch in your HTTP request dispatcher, return a 500 to your users / clients and log message and stracktrace if you catch an exception, and you let the system go on with the other requests. Usually these bugs are really silly things, like forgetting to test for a null pointer. They don't corrupt state. So there is no point in bringing the whole system down because of them.

@trevnorris

@bjouhier I don't disagree with you. What I do disagree with is wrapping core events in try/catch, and definitely the idea that node itself runs within a domain (think that idea's been presented). There has been short discussion about allowing node to coerce more instead of throwing, but overall it's possible to prevent node from crashing despite how it throws.

I think what I'm missing is code for a real world example where this was absolutely necessary. I mean, other than errors like "developer A accidentally parsed data incorrectly so the function parameters were wrong".

@bjouhier
bjouhier commented Apr 3, 2013

@trevnorris Agree.

If you do not catch errors in your callbacks and don't trap uncaughtException node will crash on every error. But if you catch in every callback you can build a robust server that recovers from errors. May be a bit tedious (hint: much easier with a preprocessor) but this is feasible.

So we don't need no new API to do that. We don't even need domains (I know it because we've been doing it since 0.2).

@CrabDude
CrabDude commented Apr 4, 2013

@trevnorris

Preventing a crash is trivial, uncaughtException. Preventing the need to restart, is much more complex.

First, you need to wrap all your application logic in try/catch in case of an TRESUR error, so it doesn't place core in an undefined state. This can be tedious, or requires shimming core APIs.

Second, you need to parse caught errors to ensure they didn't come from core. If they did, you need to determine if it's a TRESUR error or an invalid input, which is exacerbated by the fact that core uses TypeError for invalid input, so now you need to parse the message as well based on some (?) pattern, which is of course brittle.

Alternatively, you could do your best to prevalidate and avoid errors, and restart otherwise. In practice though, this is a bad idea. LinkedIn mobile runs a single forked server. The larger the server gets, the longer a restart takes. Additionally, the frequency of errors is indeterminate as well since any push could potentially introduce an untested code path.

I ran the cluster example with an error on every 1000th request, and even with the lightest server imaginable there was a 33% degradation in performance (though on every 10,000th it was negligible). So I increased the process pool count to double the # of CPUs, which mostly eliminated it, but this is a manual and indeterminate process. One option would be to just fork CPUs*5? 10?, but now you have context switching overhead. I don't have any hard numbers, but I suspect they're worse than the overhead of try/catch, and the number of processes you run is still a manual and difficult to predict. You could mitigate this by having a pool of idle processes, but you stil have the additional overhead of restarting a large process and some buffer of processes. Breaking the single server process into many in theory would be a good idea to further limit the surface area of restarts, but is completely infeasible.

Conversely, if my recommendation is followed or I work around it myself, wrapping everything in try/catch creates a predictable marginal decrease in performance (on the order of 1%), with no performance impact or race to stay above the error rate with your process pool. Restarts then become rare, and running something like #CPUs+2 processes is sufficient. In other words, it's a much more sane solution.

@trevnorris

@CrabDude As I said earlier, discussion of the "solution" to restart the process should end. Honestly I don't think anyone would agree that's a plausible solution for the types of errors I think you're describing. As far as what/when/where you'd expect node to/not to crash, I really need to see some code to understand what you're getting at. Or, if you work out of Mountain View then we can just meet up over lunch and hammer this out.

@CrabDude
CrabDude commented Apr 5, 2013

@trevnorris Sent you message on your site's contact form.

@CrabDude
CrabDude commented May 6, 2013

For reference, on the distinction of errors vs exceptions, see the Mozilla ECMAScript discussion on conditional catches:
http://mozilla.6506.n7.nabble.com/Conditional-catch-clause-td267035.html

TLDR;

  • There is no confusion about the distinction amongst @BrendanEich, @dherman, et al.
  • There is no disagreement on the value of typed error catching, only on whether ECMAScript needs sugar.

If re-throwing involves undesirable side-effects, I would encourage you and anyone with an opinion on the issue to bring them up in discussions like the above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment