
TCP handle leakage for "stale" connections #94

Closed · shsmith opened this issue Jan 2, 2017 · 31 comments

@shsmith (Contributor) commented Jan 2, 2017:

I have noticed that over time the number of IP handles reported by lsof continues to grow and far exceeds the number of connections reported by the sessions RPC.

One IP in particular appears over 1000 times, even though there is no active session for the IP.

Here are the logs from 2 days ago when this IP briefly connected:

INFO:ElectrumX:[21536] SSL 41.144.95.105:52782, 666 total
INFO:ServerManager:closing stale connections [19859, 21536, 20904, 21120, 6123, 21278, 21097, 21124, 21481]

This IP does not appear at all after that.

Possibly related: the "subs" value in the getinfo RPC is much higher than the sum of subs for the sessions RPC. I suspect these are related to the stuck TCP handles.

Live Munin charts: http://vps.hsmiths.com:49001/munin/electrumx-day.html

A snapshot from near block 446220:

full lsof output: http://vps.hsmiths.com:49001/lsof.txt
getinfo output: http://vps.hsmiths.com:49001/getinfo
sessions output: http://vps.hsmiths.com:49001/sessions

@kyuupichan (Owner) commented:

I just noticed this, having gotten back from a 10-day break. My subs are over-counted, and my groups list way more sessions than actually exist.

@kyuupichan (Owner) commented:

As this was a bug, I've done a backport to 0.9.23. I don't anticipate backporting anything else to 0.9.x.

@shsmith (Contributor, Author) commented Jan 10, 2017:

I think the leak is still happening.
http://h.hsmiths.com:49001/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png

Using this script, I can see that there are currently over 500 IPv4 connections that do not show up in the 'sessions' RPC:

# Report lsof output for electrumx, omitting known sessions (looking for leakage).

# Get the electrumx pid.
pid=$(electrumx_rpc.py getinfo | grep pid | grep -o '[0-9]*')

# Get the list of session IP:PORT pairs and convert newlines to " -e "
# so the list can be spliced into grep as alternate exclusion patterns.
iplist=$(electrumx_rpc.py sessions | grep -v Peer | sed -E 's/\s+/ /g' | cut -d' ' -f 12 | tr '\n' ' ' | sed 's/ / -e /g')

# Show open IPv handles, omitting session addresses and listening ports.
# $iplist must stay unquoted: it ends in "-e ", so the trailing LISTEN
# becomes the final exclusion pattern.
lsof -n -p "$pid" | grep IPv | grep -v -e $iplist LISTEN

This is from the raspberry pi, which I restarted this morning.
http://h.hsmiths.com:49004/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png

You can see that for the first hour or so the open IPv count tracked sessions, but after that the IPv count pulled ahead of sessions. It looks like some sessions are ending without actually closing their sockets.

kyuupichan reopened this Jan 10, 2017
@shsmith (Contributor, Author) commented Jan 10, 2017:

I searched my logs for the leaked addresses reported by the script and found very few hits.
I suspect these are not electrum connections, but something else scanning the port and getting stuck.

As a test, I did a simple telnet to port 50002 on one server and telnet just sat there with a blank screen. ElectrumX did not log this as a session.
I waited over 10 minutes and the telnet session was still open and waiting for input.

Maybe we need some sort of initial handshake timeout: if a connection is accepted but no valid SSL handshake and/or no RPC commands are received within x seconds, the socket should be closed.

@kyuupichan (Owner) commented:

I cannot see any way asyncio allows one to control this. A plain TCP connection goes straight through to connection_made (and is therefore logged) on creation, but because the socket is wrapped in SSL, connection_made isn't called until the handshake completes. The Protocol class is constructed when the initial socket connection is made, but no details of that socket are available to my application until the handshake completes. If it never does, the connection seems to just sit there forever.
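
A minimal sketch of the behaviour being described, using illustrative names (EchoProtocol and the cert paths are assumptions, not ElectrumX code):

import asyncio
import ssl

class EchoProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        # With ssl= passed to create_server below, this callback only
        # fires after the TLS handshake completes; a raw TCP connect
        # that never starts a handshake is never seen here.
        print('connected:', transport.get_extra_info('peername'))

    def data_received(self, data):
        pass  # protocol logic would go here

loop = asyncio.get_event_loop()
ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
ctx.load_cert_chain('server.crt', 'server.key')  # assumed cert paths
server = loop.run_until_complete(
    loop.create_server(EchoProtocol, '0.0.0.0', 50002, ssl=ctx))
loop.run_forever()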

@shsmith (Contributor, Author) commented Jan 11, 2017:

I wonder if something like this would work:
socket.setdefaulttimeout(env.session_timeout)

@kyuupichan (Owner) commented Jan 11, 2017 via email

@shsmith (Contributor, Author) commented Jan 11, 2017:

I thought it was a module-level function in the socket module that applies to all newly created sockets.


@shsmith (Contributor, Author) commented Jan 13, 2017:

shsmith@c84f68c didn't help because asyncio uses nonblocking sockets.

I suspect the stuck sockets are looping here:
https://github.com/python/asyncio/blob/374930d766c347f126c88d61a75c17a7dd3a2450/asyncio/sslproto.py#L201

This code block does not have any kind of timeout, but one could be added to make use of self._handshake_start_time.
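
A hedged sketch of how such a check might look (HANDSHAKE_TIMEOUT and the function name are illustrative; _handshake_start_time, _in_handshake, _transport, and _loop are asyncio sslproto internals, and _handshake_start_time is only recorded in some configurations, e.g. loop debug mode):

HANDSHAKE_TIMEOUT = 60.0  # seconds; an illustrative value

def check_handshake_timeout(proto):
    # proto is an asyncio sslproto.SSLProtocol instance.
    if getattr(proto, '_handshake_start_time', None) is None:
        return  # handshake timing was never recorded
    elapsed = proto._loop.time() - proto._handshake_start_time
    if proto._in_handshake and elapsed > HANDSHAKE_TIMEOUT:
        proto._transport.abort()  # give up on the stalled handshake

Something like this could be scheduled with loop.call_later() when the handshake starts.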

kyuupichan added the asyncio label and removed the bug label Jan 17, 2017
@kyuupichan (Owner) commented:

I noticed there was a commit to asyncio late last year that might help with this - if you have 3.6 installed can you see if the issue still exists there?

@shsmith (Contributor, Author) commented Mar 1, 2017:

if you have 3.6 installed can you see if the issue still exists there?

I am still seeing the leakage on both my systems running Python 3.6.
For example: http://imgur.com/a/4e75b

After about 10:00 you can see that the number of IPv4 handles advances beyond the number of sessions.
This is similar to the leakage pattern I saw with Python 3.5.2 on the same machine.

@kyuupichan (Owner) commented:

Do you have the one line from python/asyncio@d84a8cb in sslproto.py? I can't tell whether that got into 3.6 or not.

@shsmith (Contributor, Author) commented Mar 1, 2017:

Yes, I found that line in /usr/lib/python3.6/asyncio/sslproto.py.
While browsing that code I found that various debugging messages can be enabled by setting the PYTHONASYNCIODEBUG environment variable. I'm running with it set and will check back later to see if it provides any clues.
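
For reference, the same debug mode can also be enabled in code, equivalent to setting the environment variable:

import asyncio

loop = asyncio.get_event_loop()
# Equivalent to running with PYTHONASYNCIODEBUG=1: logs slow callbacks,
# unretrieved exceptions, and coroutines that were never awaited.
loop.set_debug(True)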

@kyuupichan (Owner) commented:

Shame, I was hoping that was it.

Be prepared for very verbose output!

@mmouse- (Contributor) commented Mar 21, 2017:

I'm hitting the same problem running an ElectrumX 1.0 server. Unfortunately, one client seems to be exploiting this to intentionally exhaust server resources.
The cause appears to be bug python/asyncio#483.
But I wonder whether there shouldn't be a general limit on accepted connections per peer IP, so that when a peer already has three or four connections open, any additional connection attempt is refused before yet another socket is opened.
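
A hedged sketch of that idea at the application layer (the names are illustrative, not ElectrumX code):

import asyncio
from collections import defaultdict

MAX_CONNS_PER_IP = 4  # illustrative cap
conns_per_ip = defaultdict(int)

class LimitedProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        # Count open connections per peer IP; refuse the excess.
        peer = transport.get_extra_info('peername')
        self.peer_ip = peer[0] if peer else None
        conns_per_ip[self.peer_ip] += 1
        if conns_per_ip[self.peer_ip] > MAX_CONNS_PER_IP:
            transport.abort()  # refuse the excess connection

    def connection_lost(self, exc):
        conns_per_ip[self.peer_ip] -= 1

Note that for SSL sessions connection_made only runs after the handshake completes, so a check like this would not catch the stuck pre-handshake sockets; a firewall-level limit (see the iptables rule suggested later in this thread) acts earlier in the connection lifecycle.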

@shsmith (Contributor, Author) commented Mar 29, 2017:

I applied this patch and the leakage is much reduced. The patch works on both Python 3.5.2 and 3.6:
python/cpython#480

@shsmith (Contributor, Author) commented Mar 31, 2017:

After a longer run time the leak still exists with python/cpython#480 applied.
PR 480 may help, but it does not account for the majority of the leakage.

http://electrum.hsmiths.com:49001/munin/hsmiths.com/electrum.hsmiths.com/electrumx_lsof-week.png

http://vps.hsmiths.com:49001/munin/hsmiths.com/vps.hsmiths.com/electrumx_lsof-week.png

When I applied the patch I added a warning so I could be sure the patch was taking effect.
The warning is logged frequently. Here is an example:

Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] SSL 38.81.65.42:59465, 569 total
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] cannot decode JSON: Expecting value: line 1 column 1 (char 0)
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9845] SSL 69.63.165.252:50987, 570 total
Mar 31 07:37:23 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9846] SSL 92.30.192.55:53631, 564 total
Mar 31 07:37:28 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9847] SSL 209.222.77.220:36122, 563 total
Mar 31 07:37:29 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9848] SSL 157.192.81.193:49231, 562 total
Mar 31 07:37:31 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9849] SSL 209.222.77.220:44942, 563 total
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9850] SSL 62.250.99.225:51011, 564 total
Mar 31 07:37:37 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:44 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9851] SSL 37.187.100.2:38500, 563 total
Mar 31 07:37:46 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9852] SSL 62.210.246.163:43342, 564 total
Mar 31 07:37:47 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9853] SSL 87.182.127.196:56363, 564 total
Mar 31 07:37:48 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9854] SSL 62.210.246.163:43914, 565 total

@mocmocamoc commented:

Has there been any further progress? At the moment lsof is showing hundreds of connections from one IP that never appears in the log at all. The previous instance exhausted its 10,000 file handles, presumably for the same reason. Are regular restarts the only solution?

@kyuupichan (Owner) commented:

Until it is fixed upstream in Python, yes, the only solution is to restart.

@erorus commented Dec 11, 2017:

To mitigate the issue, you can use iptables to limit each IP to N concurrent connections (I use 8):

iptables -A INPUT -p tcp --syn --dport 50002 -m connlimit --connlimit-above 8 -j REJECT

Also, as a super ugly hack, you can periodically use lsof to compare open and established connections against electrumx_rpc sessions, then use gdb to force the spare connections closed. It's not guaranteed to be side-effect-free, but I've been using it for a while and it seems to help. Here's the PHP script I wrote to do that (I told you it was super ugly..): https://gist.github.com/erorus/57f63ff486413746915127cdfa30dc9e

@mocmocamoc commented:

Thanks @erorus, I have that iptables rule in place now and it's helping keep things under control.

I know it needs to be fixed in Python, but is there a Python bug report active at the moment? I know that python/asyncio#483 was opened by @kyuupichan, but that hasn't moved since the proposed fix (python/cpython#480), and from @shsmith's comments above the problem persists even with that applied. In fact, the asyncio repo containing that issue has been closed since March, so it seems unlikely ever to be revisited - new asyncio bugs are meant to be raised at https://bugs.python.org.

@kyuupichan (Owner) commented:

https://bugs.python.org/issue29970

It's not getting much attention.

@mocmocamoc commented:

Thanks, I've suggested a fix on that issue.
Here's a monkey patch of that idea in case anyone wants to try it: https://gist.github.com/mocmocamoc/8544c288a486e95514fc982a7a0e8620
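
The gist is the authoritative version; as a rough illustration of the shape of such a patch (the timeout value and names here are assumptions, and _in_handshake is an asyncio internal):

import asyncio
from asyncio import sslproto

HANDSHAKE_TIMEOUT = 30.0  # seconds; an illustrative value

_orig_connection_made = sslproto.SSLProtocol.connection_made

def _patched_connection_made(self, transport):
    _orig_connection_made(self, transport)

    def _abort_if_stuck():
        # _in_handshake stays True until the TLS handshake completes;
        # if it never does, abort instead of leaking the handle.
        if getattr(self, '_in_handshake', False):
            transport.abort()

    self._loop.call_later(HANDSHAKE_TIMEOUT, _abort_if_stuck)

sslproto.SSLProtocol.connection_made = _patched_connection_made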

@shsmith (Contributor, Author) commented Dec 13, 2017:

@mocmocamoc where does this patch go? In one of the electrumx files? Could you post a diff of the patched file? Thanks!

@shsmith (Contributor, Author) commented Dec 13, 2017:

@mocmocamoc I found your notes at https://bugs.python.org/issue29970 and was able to apply the monkey patch to /usr/lib/python3.6/asyncio/sslproto.py.
I will report back with test results in a few days.

@shsmith (Contributor, Author) commented Dec 14, 2017:

Good news! The leakage is completely stopped with this insertion: https://github.com/python/cpython/pull/4825/files#diff-0827a8b032e7f279fa8f66eee271f6ceR559

Thanks @mocmocamoc! This should be merged upstream.

@mpdas108 commented:

I can confirm that merging PR #4825 fixed the handle leakage. The issue is independent of electrumx, but I feel there should be a mention of the issue, and its solution, in the docs.

@kyuupichan (Owner) commented:

Once it is fixed upstream I will make that version of Python the recommended minimum.

@mocmocamoc commented:

python/cpython#4825 has been merged so this should be fixed as of Python 3.7.0a4.

@kyuupichan (Owner) commented:

As this will be fixed in Python 3.7 I will close this issue.
