Connection reset by peer #76
@phillipp thanks for posting this! Do you have any other sample logs like the one at the bottom of this issue? Having additional info may help a lot!
Sure, here we go:
This one has two frontend connections, just in case this is relevant
I was noticing just now that the after_read_cb gives -104 sometimes on the backend and sometimes on the frontend. In our log since this morning that's 832 -104 errors on the frontend and 154 on the backend (I would expect the frontend to have way more errors). Do you need more info? In this case a traffic dump is a bit hard, because it is a production system sending possibly sensitive data...
Thank you @phillipp. I'll look into this.
@phillipp do you happen to see anything like this in nginx logs?
Have you tried enabling debug-level logging on nginx? (http://nginx.org/en/docs/debugging_log.html)
I haven't seen any PROXY-related error messages in the error log. But we actually applied a patch to fix an nginx problem with the proxyline support; I'll have to look that up. The nginx debug log is now enabled. I'll post more information on backend connection resets when we have it in both logs. Here we have another one (the customer complained about this specific request, it's on the frontend side):
@phillipp yeah, the frontend side is definitely unrelated to the proxyline. Do you think it may be possible to obtain an strace?
Do you need the strace output of bud or nginx? I have an error log of the nginx side of a connection reset on the backend now. I fear that I have to heavily redact the log, but everything before the pasted lines seems to be internal processing anyway:
@phillipp sorry, but so far nothing rings a bell... Perhaps it may be a good idea to take a look at the tcpdumps yourself, to see what kind of data is exchanged by the two sides right before this error. Maybe there is some corrupted memory, or something unrelated to bud at all! I can only say that this error is usually not bud's fault, and means that the other side killed the connection. Very sorry to be unable to provide you any further suggestions on this.
I am not as good as you two are, but it seems quite similar to my problem: #63
I'll go out on a limb here, but memory corruption seems unlikely because this is server hardware with ECC scrubbing and it occurs on multiple machines. I now got lucky and intercepted a connection between bud and nginx with a connection reset. The nginx debug log is as follows:
I can't export the wireshark trace of it for privacy reasons, but the summary looks like this: [edit] The nginx error log shows no errors for that request and/or time frame. I fear I can't show the bud log because stdout isn't timestamped and our syslog for bud debug is heavily rate-limited; I'm not able to correlate client ids or process ids. I often wondered why you decided not to timestamp the stdout messages. Any good reason?
@phillipp is there anything odd in wireshark? An invalid HTTP request maybe?
@phillipp stupid question, but could you be running out of memory on this box?
@phillipp if you have access to the HTTP request data - may I ask you to post the proxystring from it? (You may redact the address in it, but please keep the size the same.)
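For reference, a PROXY protocol v1 header is a single human-readable line prepended to the payload; here is a made-up example shown as a C string (the addresses and ports are placeholders, not the redacted values requested above):
/* Format: PROXY <TCP4|TCP6> <source addr> <dest addr> <source port> <dest port>\r\n */
static const char example_proxyline[] =
    "PROXY TCP4 203.0.113.7 192.0.2.10 51234 443\r\n";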
@phillipp do you see anything like
It is just pretty odd that it doesn't print … It reads the data from the socket, tries to parse the PROXY line, and then goes straight into … Btw, what nginx version are you on?
Nope, wireshark has just the HTTP request and that direct ACK and then ACK, RST. The nginx error log shows exactly nothing. The server is not out of memory. And there is no TCP port reuse warning in wireshark either. The HTTP request does not have any PROXYLINE, it's just a plain HTTP request. That may be the problem: nginx doesn't complain because we patched that error message out, so we would normally see that error message but don't, because we removed it (otherwise the log would be flooded with errors, since bud only sends the proxyline on the first request and all other requests showed that error). Would it be possible that bud didn't inject the proxyline?
nginx is openresty/1.7.10.1
Yeah, right, we changed the
Then the error is only logged if the header is missing and no IP address is set from previous requests.
Ugh, that explains why the connection is closed. That returns NULL, and NULL leads to a closed connection 😱 I'll fix that and see if that resolves the problem.
With the patched/fixed nginx we still see the problem; I'll try to get more debug info on why that's the case.
@phillipp may I still ask you to post a PROXYLINE from that failed request?
I can't, because there is none in the whole TCP connection.
@phillipp oh sorry! Looks like I forgot the context when replying to this issue (switching between several projects atm). May I ask you to publish the HEX data of the first TCP packet's body?
Looks like fixing our nginx problem solved the main issue, but there is still an open issue. We see regular requests from bud that are mangled in a strange way, like this:
So you can see there is some stuff injected into the Cookie header. We see that around 2,000 times per day and from random clients, even from monitoring requests. The nginx server then closes the connection, which leads to the connection reset by peer message on the backend. We previously ignored all these errors, but we see them again now that we fixed our nginx patch. A possible next step here would be to compile with address sanitizer (it may be a memory corruption) and dump the original request in bud so we can check the incoming buffer against what's going out. Any other ideas?
@phillipp this sounds like a good plan. I hope ASAN will give us more information. Thanks for the follow-up!
Starting bud with ASAN makes it hang. strace shows the following:
Compiled with
@phillipp does the debug log print anything at all?
Nope, nothing. CPU is stuck at ~100%.
@phillipp very interesting... May I ask you to try
Very strange. I tried bud with ASAN on two machines, and on both it generates a lot of bud processes and basically kills the system by forking. What works is starting a new docker container (based on the ubuntu:trusty tag), compiling, and running. Perf on the forking version shows 26k cycles and 0 events. Does that help?
And what exactly is the difference?
It is built using
Alright, got that running now, with ASAN and some more debug output. I added a log of the buffer after SSL_read at client.c ~ L630 so I can see the buffer after it is read from the connection. I noticed that there are requests that don't get a PROXYLINE, and now I can see the content in bud:
Notice that there are two ^M (carriage returns). Possibly x-forward.c doesn't handle injection of the header correctly in that case? There are plenty of clients that send that. I'll see if I can find out what is causing the garbled requests, because it looks like some other parts of the buffer are written to the backend as well, but I haven't figured out how.
Oh gosh! Yeah, this is quite unexpected. I guess I will just add support for this, thank you for figuring it out @phillipp!
@phillipp thinking more about it, I don't think that many servers accept two CRs as a CRLF. I just tried it against Google and node.js, and it doesn't seem to work at all. I will definitely make the x-forward thing less strict, by just looking at the following sequence
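As a sketch of the kind of lenient scan that could work here (a hypothetical helper, not bud's actual x-forward code): treat the first LF as the end of the request line no matter how many CRs precede it, so a request line terminated by two carriage returns and a newline is still recognized.
#include <stddef.h>

/* Return the offset just past the end of the first line, or 0 if no line
 * terminator has arrived yet. Any run of CRs before the LF is tolerated,
 * so "GET / HTTP/1.1\r\r\n" parses the same as "GET / HTTP/1.1\r\n". */
static size_t first_line_end(const char* data, size_t len) {
  size_t i;
  for (i = 0; i < len; i++) {
    if (data[i] == '\n')
      return i + 1;
  }
  return 0;
}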
@phillipp you may use this snippet to log the data that is being written to the backend:
diff --git a/src/client.c b/src/client.c
index e6da79a..a8a6c49 100644
--- a/src/client.c
+++ b/src/client.c
@@ -727,8 +727,11 @@ bud_client_error_t bud_client_send(bud_client_t* client,
DBG(side, "uv_write(%ld) iovcnt: %ld", side->write_size, count);
side->write_req.data = client;
- for (i = 0; i < count; i++)
+ for (i = 0; i < count; i++) {
+ if (side == &client->backend)
+ fprintf(stderr, "\"%.*s\"\n", size[i], out[i]);
buf[i] = uv_buf_init(out[i], size[i]);
+ }
/* Try writing without queueing first */
r = uv_try_write((uv_stream_t*) &side->tcp, buf, count);
Actually even
Ok, pushed out the fix for x-forwarded-for. Unfortunately, it won't help your case, though.
I have found a fix for the garbled requests, but I couldn't figure out WHY the fix works exactly. I figured out that bud_client_send sends more data than the client read:
(dbg) [2331] client 0xd92230 on frontend SSL_read() => 518
Which is okay, because the PROXYLINE should be prepended. SHOULD is important, because nginx clearly shows that there is no PROXYLINE. So I added debug output to check how long that PROXYLINE is, and the size matches exactly the difference between the SSL_read() and the uv_write(). That is to be expected. But when I look at the request nginx is receiving, there are these garbled headers at the end, and their size is exactly the number of bytes the PROXYLINE should take. So I figured that maybe the proxyline is written to the ringbuffer but the pointer is somehow advanced in a wrong way or something; I couldn't figure that out yet. But it is clear from inspecting the buffer SSL_read() reads into that the bytes in the garbled requests are leftovers from previous requests.
What I did is I replaced the
Maybe you can figure out what could be the problem?
BTW: it looks like that only happens when the SSL handshake fires an SNI request.
Very interesting! I will review the ringbuffer code, thanks for the tip and the thorough investigation!
@phillipp while I look into it - may I ask you to give this patch a try?
diff --git a/deps/ringbuffer/ringbuffer.c b/deps/ringbuffer/ringbuffer.c
index 0f3341e..fac2c28 100644
--- a/deps/ringbuffer/ringbuffer.c
+++ b/deps/ringbuffer/ringbuffer.c
@@ -287,6 +287,8 @@ char* ringbuffer_write_ptr(ringbuffer* rb, size_t* length) {
int ringbuffer_write_append(ringbuffer* rb, size_t length) {
rb->write_head->write_pos += length;
rb->length += length;
+ if (rb->write_head->write_pos > RING_BUFFER_LEN)
+ abort();
assert(rb->write_head->write_pos <= RING_BUFFER_LEN);
/* Allocate new buffer if write head is full,
Gosh, disregard this patch. The fix is:
diff --git a/deps/ringbuffer/ringbuffer.c b/deps/ringbuffer/ringbuffer.c
index 0f3341e..c609f71 100644
--- a/deps/ringbuffer/ringbuffer.c
+++ b/deps/ringbuffer/ringbuffer.c
@@ -315,6 +315,7 @@ int ringbuffer_insert(ringbuffer* rb,
bufent* start;
size_t left;
size_t offset;
+ size_t to_alloc;
int r;
assert(length < RING_BUFFER_LEN);
@@ -334,9 +335,18 @@ int ringbuffer_insert(ringbuffer* rb,
b = rb->write_head;
/* Ensure that we have enough space for shift */
- r = ringbuffer_write_append(rb, length);
- if (r != 0)
- return r;
+ to_alloc = length;
+ while (to_alloc != 0) {
+ size_t avail;
+
+ avail = to_alloc;
+ ringbuffer_write_ptr(rb, &avail);
+ r = ringbuffer_write_append(rb, avail);
+ if (r != 0)
+ return r;
+
+ to_alloc -= avail;
+ }
next = rb->write_head;
while (left > 0) {
Please let me know if it works for you.
No, I fear that doesn't fix the problem. Still the same symptoms: it's as if the ring buffer's current read position is off by the size of the proxyline.
If that helps: the code path is that the
Oh, never mind. That's not the case.
@phillipp thanks, looking further
@phillipp are you using SNI-based backend selection?
@phillipp what about this patch?
diff --git a/deps/ringbuffer/ringbuffer.c b/deps/ringbuffer/ringbuffer.c
index 0f3341e..862a8ab 100644
--- a/deps/ringbuffer/ringbuffer.c
+++ b/deps/ringbuffer/ringbuffer.c
@@ -315,6 +315,7 @@ int ringbuffer_insert(ringbuffer* rb,
bufent* start;
size_t left;
size_t offset;
+ size_t to_alloc;
int r;
assert(length < RING_BUFFER_LEN);
@@ -334,9 +335,18 @@ int ringbuffer_insert(ringbuffer* rb,
b = rb->write_head;
/* Ensure that we have enough space for shift */
- r = ringbuffer_write_append(rb, length);
- if (r != 0)
- return r;
+ to_alloc = length;
+ while (to_alloc != 0) {
+ size_t avail;
+
+ avail = to_alloc;
+ ringbuffer_write_ptr(rb, &avail);
+ r = ringbuffer_write_append(rb, avail);
+ if (r != 0)
+ return r;
+
+ to_alloc -= avail;
+ }
next = rb->write_head;
while (left > 0) {
@@ -373,7 +383,7 @@ int ringbuffer_insert(ringbuffer* rb,
delta = off + length - RING_BUFFER_LEN;
memcpy(start->data + off, data, length - delta);
- memcpy(next->data, data + length - delta, delta);
+ memcpy(start->next->data, data + length - delta, delta);
} else {
memcpy(start->data + off, data, length);
}
I have a clue about what's going on. The ringbuffer is accessed concurrently: I added some log statements in client.c and this is my output for a failed request:
So:
That would explain why the workaround (sending the ring buffer) works: the buffer itself is okay, but SSL_read overwrites the PROXYLINE data because the write pointer is stale. Do you see any easy fix for that, or is sending the data just fine?
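A toy illustration of that hazard, using a plain array instead of bud's real ringbuffer API (the PROXY line, addresses, and names are made up): cache a write position, insert bytes at the front of the buffered data, and the next write through the cached pointer lands on top of the inserted PROXY line instead of after the shifted request.
#include <stdio.h>
#include <string.h>

int main(void) {
  char buf[256] = "GET / HTTP/1.1\r\n";
  size_t used = strlen(buf);

  /* 1. Cache the write position (analogous to holding a ringbuffer write pointer). */
  char* stale = buf + used;

  /* 2. Insert a PROXY line in front of the request (analogous to
   *    ringbuffer_insert()); everything already buffered shifts right. */
  const char proxyline[] = "PROXY TCP4 203.0.113.7 192.0.2.10 51234 443\r\n";
  size_t plen = sizeof(proxyline) - 1;
  memmove(buf + plen, buf, used);
  memcpy(buf, proxyline, plen);
  used += plen;

  /* 3. The next chunk of client data is written through the stale pointer
   *    (analogous to the next SSL_read()): it clobbers the middle of the
   *    PROXY line instead of being appended after the request. */
  const char more[] = "Host: example.com\r\n\r\n";
  memcpy(stale, more, sizeof(more) - 1);

  fwrite(buf, 1, used, stdout);  /* the dump shows the mangled result */
  return 0;
}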
Aaargh. You're so right about it, I'll create a fix in a bit.
@phillipp I have pushed both fixes to master, may I ask you to give them a try before I release?
Great, I could not reproduce the problem with your master :-)
Ok, published as 1.3.5. Thank you so much! You nailed most of the problem!
I mean, all of it :)
What's your twitter handle?
Great, thanks for the fast response and the great software! And I thought I was going crazy because nobody else reported the problem. My twitter handle is: https://twitter.com/phillipp
Thank you!
Hi,
we have bud in production for quite some time now and we're pretty happy with it (still waiting for a core dump on #74).
There is one customer who complains about regular "Connection reset by peer" errors. We were unable to reproduce that with our monitoring, but our monitoring is not as demanding as a browser.
I'm pretty clueless as to how we could best debug this. There are quite a few -104 errors in the logs:
If I read that correctly, the client and/or bud closes or drops the connection, then bud tries to communicate on that connection, gets the -104 error, and forcefully closes the client and backend connections.
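For context, -104 is the Linux errno ECONNRESET, which libuv reports as UV_ECONNRESET; here is a minimal sketch (not bud's actual callback) of how that error typically surfaces in a libuv read callback and why both sides then get force-closed:
#include <stdio.h>
#include <uv.h>

void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf) {
  (void) buf;
  if (nread == UV_ECONNRESET) {
    /* The peer sent an RST; all the proxy can do is log the error and tear
     * down both sides of the proxied connection. */
    fprintf(stderr, "read error: %s (%zd)\n", uv_strerror((int) nread), nread);
    uv_close((uv_handle_t*) stream, NULL);
    return;
  }
  /* nread > 0 would be data, nread == UV_EOF a clean shutdown. */
}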
I checked with nginx (backend):
I'm pretty baffled because with plain HTTP we don't see any connection resets by nginx, and I can't find any clues in the nginx logs. Dropped connections by iptables or conntrack etc. are unlikely because we monitor the relevant stats. The timing from the logs (everything from backend connect and frontend new to the force close happens within the same second) suggests that no timeouts are involved.
Sample logs:
Any ideas how to debug this further?