Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in wrap_js_stream.js during doClose #27258

Open
Tracked by #34108
kotsss opened this issue Apr 16, 2019 · 10 comments
Open
Tracked by #34108

Crash in wrap_js_stream.js during doClose #27258

kotsss opened this issue Apr 16, 2019 · 10 comments
Labels
stream Issues and PRs related to the stream subsystem.

Comments

@kotsss
Copy link

kotsss commented Apr 16, 2019

  • Version: v10.15.3
  • Platform: Linux zagentXXXX 4.4.0-141-generic Graceful FS by default #167~14.04.1-Ubuntu SMP Mon Dec 10 13:20:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Subsystem: net

I've debated whether this belongs in node/help, but due to potential of being a node bug I've decided to post here (apologize in advance if it turns out otherwise).

We have a crash happening on our production servers a few times a day (especially during high load times):

TypeError: Cannot read property 'finishWrite' of null
    at JSStreamWrap.finishWrite (internal/wrap_js_stream.js:195:12)
    at Immediate.setImmediate (internal/wrap_js_stream.js:218:12)
    at runCallback (timers.js:705:18)
    at tryOnImmediate (timers.js:676:5)
    at processImmediate (timers.js:658:5)

I was unable to see any possible code path that could cause this._handle to be null before the call to doClose as it's only triggered by calling this._handle.close().
Async paths printing would have come very useful here, however, this is node 10.x.

I've done grep on all our code as well as all node-modules for anything that could relate to 'games' with "_handle = " or doClose, nothing pop'ed up.

I've put DEBUG_NODE=net,stream_wrap on one of the servers to avoid overall network degradation (waiting for crash to re-occur).

Note: server is a large traffic networking proxy application involving http/s/2.

I would like to give more info (specifically setImmediate async callstack), however, I'm unsure how to continue, would appreciate any advice.

@addaleax
Copy link
Member

How easy is this to reproduce for you? Can you modify the Node.js source code? It might be helpful to add an assertion in doClose() that checks that this._handle is still alive at that point, so that the crash is observable more easily?

@kotsss
Copy link
Author

kotsss commented Apr 16, 2019

Putting a modified Node.js compilation on the server and waiting to reproduce was something we raised as an option but were hopping to for it to be plan B.
Plan A was to find way (using this post) to alter the existing version:

  • override file loaded from node_javascript (I remember that old node versions had ./lib js files on the disk and could be modified directly - is this something that still exists?)
  • patch code
  • undocumented way to get async trace

But if you're saying that compiling is the way then I'll get working on that.

@kotsss
Copy link
Author

kotsss commented Apr 17, 2019

Ok, I've got a modified node running and got new stacktrace:

AssertionError [ERR_ASSERTION]: Handle null - very bad
Error
    at JSStreamWrap.doClose (internal/wrap_js_stream.js:214:9) // assert(_this.handle)
    at JSStreamWrap.handle.close (internal/wrap_js_stream.js:39:12)
    at TLSWrap.close (_tls_wrap.js:395:23)
    at TLSSocket.Socket._destroy (net.js:595:18)
    at TLSSocket._destroy (/var/agent/util/net.js:379:37) // net.Socket.prototype hook for stats
    at TLSSocket.destroy (internal/streams/destroy.js:32:8)
    at JSStreamWrap.TLSSocket.wrap.once (_tls_wrap.js:317:35)
    at Object.onceWrapper (events.js:277:13)
    at JSStreamWrap.emit (events.js:189:13)
    at _handle.close (net.js:597:12)
    at Immediate.setImmediate (internal/wrap_js_stream.js:231:7) // cb();
    at runCallback (timers.js:705:18)
    at tryOnImmediate (timers.js:676:5)
    at processImmediate (timers.js:658:5)
Async
    at JSStreamWrap.doClose (internal/wrap_js_stream.js:212:30) // save first doClose stack on 'this'
    at JSStreamWrap.handle.close (internal/wrap_js_stream.js:39:12)
    at JSStreamWrap.Socket._destroy (net.js:595:18)
    at JSStreamWrap._destroy (/var/agent/util/net.js:379:37) // net.Socket.prototype hook for stats
    at JSStreamWrap.destroy (internal/streams/destroy.js:32:8)
    at Duplex.JSStreamWrap.stream.once (internal/wrap_js_stream.js:76:12)
    at Object.onceWrapper (events.js:277:13)
    at Duplex.emit (events.js:189:13)
    at emitCloseNT (internal/streams/destroy.js:59:8)
    at process._tickCallback (internal/process/next_tick.js:63:19)

So TLSWrap._parent points to Socket._handle that shows how the invalid call is possible, but this doesn't reproduce to a million other tls sockets used by the app.

Sounds like a race condition of some sorts, but it's beyond my code understanding to figure it out.

Edit: increased Error.stackTraceLimit and got full stack
Edit2: add async jump to get more stack
Edit3: managed to catch a non-crashing doClose stack for reference

STREAM_WRAP 13998: doClose stack
Error
    at JSStreamWrap.doClose (internal/wrap_js_stream.js:212:30)
    at JSStreamWrap.handle.close (internal/wrap_js_stream.js:39:12)
    at JSStreamWrap.Socket._destroy (net.js:595:18)
    at JSStreamWrap._destroy (/var/agent/util/net.js:379:37)
    at JSStreamWrap.destroy (internal/streams/destroy.js:32:8)
    at TLSWrap.close (_tls_wrap.js:393:29)
    at TLSSocket.Socket._destroy (net.js:595:18)
    at TLSSocket._destroy (/var/agent/util/net.js:379:37)
    at TLSSocket.destroy (internal/streams/destroy.js:32:8)
    at TLSSocket.onConnectEnd (_tls_wrap.js:1095:10)
    at Object.onceWrapper (events.js:277:13)
    at TLSSocket.emit (events.js:194:15)
    at endReadableNT (_stream_readable.js:1125:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)

@kotsss
Copy link
Author

kotsss commented Apr 21, 2019

We've been able to narrow the crash down to HTTP2 client sockets by correlating http2 usage and crash frequency.
We're trying to reproduce issue in our tests, no luck yet.
I'll be happy to run patches on the code to provide more information.

Will appreciate if someone can take a look at the stacktraces I provided and/or give more debugging instructions.

@gireeshpunathil
Copy link
Member

looks like difference between good and bad case is the former is handling end while the later, close event, but definitely not enough to make any conclusions.

cc @nodejs/streams

@omgaz
Copy link

omgaz commented Feb 6, 2020

Since upgrading to Node 10 from Node 8 we've been seeing the following stack appear. I'm having trouble tracking down what's causing it - but could it be related to this issue?

{
  "date": "Thu Feb 06 2020 12:05:47 GMT+1100 (Australian Eastern Daylight Time)",
  "process": {
    "pid": 339,
    "uid": 496,
    "gid": 494,
    "cwd": "/var/app/current",
    "execPath": "/node-install/node-v10.18.1-linux-x64/bin/node",
    "version": "v10.18.1",
    "argv": [
      "/node-install/node-v10.18.1-linux-x64/bin/node",
      "/var/app/current/packages/app/lib/server.js"
    ],
    "memoryUsage": {
      "rss": 217214976,
      "heapTotal": 111587328,
      "heapUsed": 85340080,
      "external": 62857751
    }
  },
  "os": {
    "loadavg": [
      0,
      0,
      0
    ],
    "uptime": 89121
  },
  "trace": [
    {
      "column": 23,
      "file": "_tls_wrap.js",
      "function": "TLSWrap.close",
      "line": 395,
      "method": "close",
      "native": false
    },
    {
      "column": 18,
      "file": "net.js",
      "function": "TLSSocket.Socket._destroy",
      "line": 605,
      "method": "_destroy",
      "native": false
    },
    {
      "column": 8,
      "file": "internal/streams/destroy.js",
      "function": "TLSSocket.destroy",
      "line": 37,
      "method": "destroy",
      "native": false
    },
    {
      "column": 20,
      "file": "events.js",
      "function": "Object.onceWrapper",
      "line": 286,
      "method": "onceWrapper",
      "native": false
    },
    {
      "column": 13,
      "file": "events.js",
      "function": "TLSSocket.emit",
      "line": 198,
      "method": "emit",
      "native": false
    },
    {
      "column": 23,
      "file": "domain.js",
      "function": "TLSSocket.EventEmitter.emit",
      "line": 466,
      "method": "emit",
      "native": false
    },
    {
      "column": 14,
      "file": "_stream_writable.js",
      "function": "finishMaybe",
      "line": 646,
      "method": null,
      "native": false
    },
    {
      "column": 5,
      "file": "_stream_writable.js",
      "function": "ShutdownWrap.stream._final [as callback]",
      "line": 624,
      "method": "_final [as callback]",
      "native": false
    },
    {
      "column": 8,
      "file": "net.js",
      "function": "ShutdownWrap.afterShutdown [as oncomplete]",
      "line": 374,
      "method": "afterShutdown [as oncomplete]",
      "native": false
    }
  ],
  "stack": [
    "TypeError: this._parent.close is not a function",
    "    at TLSWrap.close (_tls_wrap.js:395:23)",
    "    at TLSSocket.Socket._destroy (net.js:605:18)",
    "    at TLSSocket.destroy (internal/streams/destroy.js:37:8)",
    "    at Object.onceWrapper (events.js:286:20)",
    "    at TLSSocket.emit (events.js:198:13)",
    "    at TLSSocket.EventEmitter.emit (domain.js:466:23)",
    "    at finishMaybe (_stream_writable.js:646:14)",
    "    at ShutdownWrap.stream._final [as callback] (_stream_writable.js:624:5)",
    "    at ShutdownWrap.afterShutdown [as oncomplete] (net.js:374:8)"
  ],
  "level": "error",
  "message": "uncaughtException: this._parent.close is not a function"
}

@mcollina
Copy link
Member

mcollina commented Feb 6, 2020

@omgaz have you been able to reproduce this reliably? could you post a script to reproduce?

@omgaz
Copy link

omgaz commented Feb 9, 2020

@omgaz have you been able to reproduce this reliably? could you post a script to reproduce?

Not yet, it's proving taxing tracking down what could potentially be trying to end this connection. I'll update when/if I can repro the script.

@jasnell jasnell added the stream Issues and PRs related to the stream subsystem. label Jun 26, 2020
@ronag ronag mentioned this issue Jun 28, 2020
25 tasks
@pimterry
Copy link
Member

I've also been seeing a lot of this, with HTTP/2 traffic on both node v10.15.1 and v12.8.3, but I've just been testing v14.6.0 and it doesn't seem to reproduce at all there.

@omgaz
Copy link

omgaz commented Aug 13, 2020

I haven't been able to create an isolated example for this, but believe I'm getting this when using custom certificate authorities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stream Issues and PRs related to the stream subsystem.
Projects
None yet
Development

No branches or pull requests

7 participants