
TCP handle leakage for "stale" connections #94

Closed · shsmith opened this issue Jan 2, 2017 · 31 comments

@shsmith (Contributor) commented Jan 2, 2017:

I have noticed that over time the number of IP handles reported by lsof continues to grow and far exceeds the number of connections reported by the sessions RPC.

One IP in particular appears over 1000 times, even though there is no active session for the IP.

Here are the logs from 2 days ago when this IP briefly connected:

INFO:ElectrumX:[21536] SSL 41.144.95.105:52782, 666 total
INFO:ServerManager:closing stale connections [19859, 21536, 20904, 21120, 6123, 21278, 21097, 21124, 21481]

This IP does not appear at all after that.

Possibly related: the "subs" value in the getinfo RPC is much higher than the sum of subs for the sessions RPC. I suspect these are related to the stuck TCP handles.

Live Munin charts: http://vps.hsmiths.com:49001/munin/electrumx-day.html

A snapshot from near block 446220:

full lsof output: http://vps.hsmiths.com:49001/lsof.txt
getinfo output: http://vps.hsmiths.com:49001/getinfo
sessions output: http://vps.hsmiths.com:49001/sessions

@kyuupichan (Owner) commented:

I just noticed this, having gotten back from a 10-day break. My subs are over-counted, and my groups list way more sessions than actually exist.

@kyuupichan (Owner) commented:

As this was a bug, I've done a backport to 0.9.23. I don't anticipate backporting anything else to 0.9.x.

@shsmith (Contributor, Author) commented Jan 10, 2017:

I think the leak is still happening.
http://h.hsmiths.com:49001/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png

Using this script, I can see that there are currently over 500 IPv4 connections that do not show up in the 'sessions' RPC:

# Report lsof output for electrumx, omitting known sessions (looking for leakage).

# Get the electrumx pid.
pid=$(electrumx_rpc.py getinfo | grep pid | grep -o '[0-9]*')

# Get the list of session IP:PORT pairs and convert newlines to " -e "
# so the list can be spliced into grep as alternate exclusion patterns.
iplist=$(electrumx_rpc.py sessions | grep -v Peer | sed -E 's/\s+/ /g' | cut -d' ' -f 12 | tr '\n' ' ' | sed 's/ / -e /g')

# Show open IPv handles, omitting session addresses and listening ports.
# $iplist must stay unquoted: it ends in "-e ", so the trailing LISTEN
# becomes the final exclusion pattern.
lsof -n -p "$pid" | grep IPv | grep -v -e $iplist LISTEN

This is from the raspberry pi, which I restarted this morning.
http://h.hsmiths.com:49004/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png

You can see that for the first hour or so the open IPv count tracked sessions, but after that the IPv count pulled ahead of sessions. It looks like some sessions are ending without actually closing their sockets.

kyuupichan reopened this Jan 10, 2017
@shsmith (Contributor, Author) commented Jan 10, 2017:

I searched my logs for the leaked addresses reported by the script and found very few hits.
I suspect these are not electrum connections, but something else scanning the port and getting stuck.

As a test, I did a simple telnet to port 50002 on one server and telnet just sat there with a blank screen. ElectrumX did not log this as a session.
I waited over 10 minutes and the telnet session was still open and waiting for input.

Maybe we need some sort of initial handshake timeout: if a connection is accepted but no valid SSL handshake and/or no RPC commands are received within x seconds, the socket should be closed.

@kyuupichan (Owner) commented:

I cannot see any way asyncio allows one to control this. A plain TCP connection goes straight through to connection_made (and is therefore logged) on creation, but because the socket is wrapped in SSL, connection_made isn't called until the handshake completes. The Protocol class is constructed when the initial socket connection is made, but no details of that socket are available to my application until the handshake completes. If it never does, the connection seems to just sit there forever.
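
A minimal sketch of the behaviour being described, using illustrative names (EchoProtocol and the cert paths are assumptions, not ElectrumX code):

import asyncio
import ssl

class EchoProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        # With ssl= passed to create_server below, this callback only
        # fires after the TLS handshake completes; a raw TCP connect
        # that never starts a handshake is never seen here.
        print('connected:', transport.get_extra_info('peername'))

    def data_received(self, data):
        pass  # protocol logic would go here

loop = asyncio.get_event_loop()
ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
ctx.load_cert_chain('server.crt', 'server.key')  # assumed cert paths
server = loop.run_until_complete(
    loop.create_server(EchoProtocol, '0.0.0.0', 50002, ssl=ctx))
loop.run_forever()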

@shsmith (Contributor, Author) commented Jan 11, 2017:

I wonder if something like this would work:
socket.setdefaulttimeout(env.session_timeout)

@kyuupichan (Owner) commented Jan 11, 2017 via email

@shsmith (Contributor, Author) commented Jan 11, 2017:

I thought it was a module-level function in the socket module that applies to all newly created sockets.


@shsmith (Contributor, Author) commented Jan 13, 2017:

shsmith@c84f68c didn't help because asyncio uses nonblocking sockets.

I suspect the stuck sockets are looping here:
https://github.com/python/asyncio/blob/374930d766c347f126c88d61a75c17a7dd3a2450/asyncio/sslproto.py#L201

This code block does not have any kind of timeout, but one could be added to make use of self._handshake_start_time.
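
A hedged sketch of how such a check might look (HANDSHAKE_TIMEOUT and the function name are illustrative; _handshake_start_time, _in_handshake, _transport, and _loop are asyncio sslproto internals, and _handshake_start_time is only recorded in some configurations, e.g. loop debug mode):

HANDSHAKE_TIMEOUT = 60.0  # seconds; an illustrative value

def check_handshake_timeout(proto):
    # proto is an asyncio sslproto.SSLProtocol instance.
    if getattr(proto, '_handshake_start_time', None) is None:
        return  # handshake timing was never recorded
    elapsed = proto._loop.time() - proto._handshake_start_time
    if proto._in_handshake and elapsed > HANDSHAKE_TIMEOUT:
        proto._transport.abort()  # give up on the stalled handshake

Something like this could be scheduled with loop.call_later() when the handshake starts.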

kyuupichan added the asyncio label and removed the bug label Jan 17, 2017
@kyuupichan (Owner) commented:

I noticed there was a commit to asyncio late last year that might help with this - if you have 3.6 installed can you see if the issue still exists there?

@shsmith (Contributor, Author) commented Mar 1, 2017:

if you have 3.6 installed can you see if the issue still exists there?

I am still seeing the leakage on both my systems running Python 3.6.
For example: http://imgur.com/a/4e75b

After about 10:00 you can see that the number of IPv4 handles advances beyond the number of sessions.
This is similar to the leakage pattern I saw with Python 3.5.2 on the same machine.

@kyuupichan (Owner) commented:

Do you have the one line from python/asyncio@d84a8cb in sslproto.py? I can't tell whether that got into 3.6 or not.

@shsmith (Contributor, Author) commented Mar 1, 2017:

Yes, I found that line in /usr/lib/python3.6/asyncio/sslproto.py.
While browsing that code I found that various debugging messages can be enabled by setting the PYTHONASYNCIODEBUG environment variable. I'm running with it set and will check back later to see if it provides any clues.
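
For reference, the same debug mode can also be enabled in code, equivalent to setting the environment variable:

import asyncio

loop = asyncio.get_event_loop()
# Equivalent to running with PYTHONASYNCIODEBUG=1: logs slow callbacks,
# unretrieved exceptions, and coroutines that were never awaited.
loop.set_debug(True)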

@kyuupichan (Owner) commented:

Shame, I was hoping that was it.

Be prepared for very verbose output!

@mmouse- (Contributor) commented Mar 21, 2017:

I'm hitting the same problem running an ElectrumX 1.0 server. Unfortunately, one client seems to be exploiting this to intentionally exhaust server resources.
The cause appears to be bug python/asyncio#483.
But I wonder whether there shouldn't be a general limit on accepted connections per peer IP, so that when a peer already has three or four connections open, any additional connection attempt is refused before yet another socket is opened.
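
A hedged sketch of that idea at the application layer (the names are illustrative, not ElectrumX code):

import asyncio
from collections import defaultdict

MAX_CONNS_PER_IP = 4  # illustrative cap
conns_per_ip = defaultdict(int)

class LimitedProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        # Count open connections per peer IP; refuse the excess.
        peer = transport.get_extra_info('peername')
        self.peer_ip = peer[0] if peer else None
        conns_per_ip[self.peer_ip] += 1
        if conns_per_ip[self.peer_ip] > MAX_CONNS_PER_IP:
            transport.abort()  # refuse the excess connection

    def connection_lost(self, exc):
        conns_per_ip[self.peer_ip] -= 1

Note that for SSL sessions connection_made only runs after the handshake completes, so a check like this would not catch the stuck pre-handshake sockets; a firewall-level limit (see the iptables rule suggested later in this thread) acts earlier in the connection lifecycle.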

@shsmith (Contributor, Author) commented Mar 29, 2017:

I applied this patch and the leakage is much reduced. The patch works on both Python 3.5.2 and 3.6:
python/cpython#480

@shsmith (Contributor, Author) commented Mar 31, 2017:

After a longer run time the leak still exists with python/cpython#480 applied.
PR 480 may help, but it does not account for the majority of the leakage.

http://electrum.hsmiths.com:49001/munin/hsmiths.com/electrum.hsmiths.com/electrumx_lsof-week.png

http://vps.hsmiths.com:49001/munin/hsmiths.com/vps.hsmiths.com/electrumx_lsof-week.png

When I applied the patch I added a warning so I could be sure the patch was taking effect.
The warning is logged frequently. Here is an example:

Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] SSL 38.81.65.42:59465, 569 total
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] cannot decode JSON: Expecting value: line 1 column 1 (char 0)
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9845] SSL 69.63.165.252:50987, 570 total
Mar 31 07:37:23 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9846] SSL 92.30.192.55:53631, 564 total
Mar 31 07:37:28 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9847] SSL 209.222.77.220:36122, 563 total
Mar 31 07:37:29 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9848] SSL 157.192.81.193:49231, 562 total
Mar 31 07:37:31 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9849] SSL 209.222.77.220:44942, 563 total
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9850] SSL 62.250.99.225:51011, 564 total
Mar 31 07:37:37 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:44 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9851] SSL 37.187.100.2:38500, 563 total
Mar 31 07:37:46 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9852] SSL 62.210.246.163:43342, 564 total
Mar 31 07:37:47 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9853] SSL 87.182.127.196:56363, 564 total
Mar 31 07:37:48 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9854] SSL 62.210.246.163:43914, 565 total

@mocmocamoc commented:

Has there been any further progress? At the moment lsof is showing hundreds of connections from one IP that never appears in the log at all. The previous instance exhausted its 10,000 file handles, presumably for the same reason. Are regular restarts the only solution?

@kyuupichan (Owner) commented:

Until it is fixed upstream in Python, yes, the only solution is to restart.

@erorus commented Dec 11, 2017:

To mitigate the issue, you can use iptables to limit each IP to N concurrent connections (I use 8):

iptables -A INPUT -p tcp --syn --dport 50002 -m connlimit --connlimit-above 8 -j REJECT

Also, as a super ugly hack, you can periodically use lsof to compare open and established connections against electrumx_rpc sessions, then use gdb to force the spare connections closed. It's not guaranteed to be side-effect-free, but I've been using it for a while and it seems to help. Here's the PHP script I wrote to do that (I told you it was super ugly..): https://gist.github.com/erorus/57f63ff486413746915127cdfa30dc9e

@mocmocamoc commented:

Thanks @erorus, I have that iptables rule in place now and it's helping keep things under control.

I know it needs to be fixed in Python, but is there a Python bug report active at the moment? I know that python/asyncio#483 was opened by @kyuupichan, but that hasn't moved since the proposed fix (python/cpython#480), and from @shsmith's comments above the problem persists even with that applied. In fact, the asyncio repo containing that issue has been closed since March, so it seems unlikely ever to be revisited - new asyncio bugs are meant to be raised at https://bugs.python.org.

@kyuupichan (Owner) commented:

https://bugs.python.org/issue29970

It's not getting much attention.

@mocmocamoc commented:

Thanks, I've suggested a fix on that issue.
Here's a monkey patch of that idea in case anyone wants to try it: https://gist.github.com/mocmocamoc/8544c288a486e95514fc982a7a0e8620
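
The gist is the authoritative version; as a rough illustration of the shape of such a patch (the timeout value and names here are assumptions, and _in_handshake is an asyncio internal):

import asyncio
from asyncio import sslproto

HANDSHAKE_TIMEOUT = 30.0  # seconds; an illustrative value

_orig_connection_made = sslproto.SSLProtocol.connection_made

def _patched_connection_made(self, transport):
    _orig_connection_made(self, transport)

    def _abort_if_stuck():
        # _in_handshake stays True until the TLS handshake completes;
        # if it never does, abort instead of leaking the handle.
        if getattr(self, '_in_handshake', False):
            transport.abort()

    self._loop.call_later(HANDSHAKE_TIMEOUT, _abort_if_stuck)

sslproto.SSLProtocol.connection_made = _patched_connection_made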

@shsmith (Contributor, Author) commented Dec 13, 2017:

@mocmocamoc where does this patch go? In one of the electrumx files? Could you post a diff of the patched file? Thanks!

@shsmith (Contributor, Author) commented Dec 13, 2017:

@mocmocamoc I found your notes at https://bugs.python.org/issue29970 and was able to apply the monkey patch to /usr/lib/python3.6/asyncio/sslproto.py.
I will report back with test results in a few days.

@shsmith (Contributor, Author) commented Dec 14, 2017:

Good news! The leakage is completely stopped with this insertion: https://github.com/python/cpython/pull/4825/files#diff-0827a8b032e7f279fa8f66eee271f6ceR559

Thanks @mocmocamoc! This should be merged upstream.

@mpdas108 commented:

I can confirm that merging PR #4825 fixed the handle leakage. The issue is independent of electrumx, but I feel there should be a mention of the issue, and its solution, in the docs.

@kyuupichan (Owner) commented:

Once it is fixed upstream I will make that version of Python the recommended minimum.

@mocmocamoc commented:

python/cpython#4825 has been merged so this should be fixed as of Python 3.7.0a4.

@kyuupichan (Owner) commented:

As this will be fixed in Python 3.7 I will close this issue.
