TCP handle leakage for "stale" connections #94
Comments
I just noticed this having gotten back from a 10 day break. My subs are over-counted, and my groups list way more sessions than exist. |
As this was a bug, I've done a backport to 0.9.23. I don't anticipate backporting anything else to 0.9.x. |
I think the leak is still happening. Using this script I can see that there are currently over 500 IPv4 connections that do not show up in the 'sessions' rpc:
This is from the Raspberry Pi, which I restarted this morning. You can see that for the first hour or so the open IPv4 count tracked sessions, but after that IPv4 got ahead of sessions. Looks like some sessions are ending but not actually closing the socket. |
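The comparison described above (open sockets versus sessions known to the server) can be sketched roughly as follows. This is not the script referred to in the comment; the function name `untracked_connections` and the sample addresses are hypothetical, and it assumes the peer IPs have already been collected from `lsof` and from the 'sessions' RPC by some other means.

```python
from collections import Counter


def untracked_connections(socket_peers, session_peers):
    """Return {ip: count} for open sockets that have no matching session.

    socket_peers:  peer IPs of open TCP handles (e.g. parsed from lsof).
    session_peers: peer IPs of sessions the server reports.
    Both inputs are plain lists of strings; how they are gathered is
    outside the scope of this sketch.
    """
    # Counter subtraction keeps only positive counts, i.e. sockets
    # that outnumber the sessions for the same IP.
    return dict(Counter(socket_peers) - Counter(session_peers))


# Hypothetical sample data using documentation-range addresses.
sockets = ["192.0.2.1", "192.0.2.1", "192.0.2.1", "192.0.2.7"]
sessions = ["192.0.2.1", "192.0.2.7"]
leaked = untracked_connections(sockets, sessions)
```

An IP with a large count here and no recent log entries would match the pattern reported in this thread.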
I searched my logs for the leaked addresses reported by the script and found very few hits. As a test, I did a simple telnet to port 50002 on one server and telnet just sat there with a blank screen. ElectrumX did not log this as a session. Maybe we need some sort of an initial handshake timeout. |
I cannot see any way asyncio allows one to control this. A TCP connection goes through immediately to connection_made (and therefore is logged) on creation, but because SSL is wrapped, it doesn't call connection_made until the handshake is complete. The Protocol class is constructed when the initial socket connection is made, but there are no details available of that socket to my application until the handshake completes. If it doesn't, it seems to just sit there forever. |
I wonder if something like this would work: |
The problem I have is I don't have a socket.
On Thu, 12 Jan 2017, 08:20 shsmith wrote:
I wonder if something like this would work:
socket.setdefaulttimeout(env.session_timeout)
|
I thought it was a static method in the socket class and applied to all newly created sockets. |
shsmith@c84f68c didn't help because asyncio is using nonblocking sockets. I suspect the stuck sockets are looping here: This code block does not have any kind of timeout, but one could be added to make use of self._handshake_start_time. |
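The idea of a deadline on a connection that never makes progress can be illustrated on a plain TCP protocol. This is only a sketch of the technique (a `call_later` timer that aborts the transport), not the asyncio SSL code discussed above; the class name `FirstByteDeadline` and the helper `silent_peer_is_dropped` are hypothetical. For SSL connections the equivalent timer would have to live inside asyncio's SSL layer, since the application protocol is not notified until the handshake completes.

```python
import asyncio


class FirstByteDeadline(asyncio.Protocol):
    """Abort a connection that sends nothing within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.transport = None
        self._deadline = None

    def connection_made(self, transport):
        self.transport = transport
        # Arm the timer as soon as the socket is accepted.
        loop = asyncio.get_running_loop()
        self._deadline = loop.call_later(self.timeout, self._expire)

    def data_received(self, data):
        # The peer made progress, so the deadline no longer applies.
        self._cancel_deadline()

    def connection_lost(self, exc):
        self._cancel_deadline()

    def _cancel_deadline(self):
        if self._deadline is not None:
            self._deadline.cancel()
            self._deadline = None

    def _expire(self):
        self._deadline = None
        # abort() closes the socket immediately and releases the handle.
        self.transport.abort()


async def silent_peer_is_dropped(timeout=0.2):
    """Connect without sending anything; report whether the server hung up."""
    loop = asyncio.get_running_loop()
    server = await loop.create_server(
        lambda: FirstByteDeadline(timeout), "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        data = await asyncio.wait_for(reader.read(1), timeout + 5)
        dropped = data == b""          # EOF means the server closed
    except ConnectionError:
        dropped = True                 # a reset also counts as dropped
    except asyncio.TimeoutError:
        dropped = False
    writer.close()
    server.close()
    return dropped
```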
I noticed there was a commit late last year to asyncio that might help this - if you have 3.6 installed can you see if the issue still exists there? |
I am still seeing the leakage on both my systems running Python 3.6. After about 10:00 you can see that the number of ipv4 handles advances beyond the number of sessions. |
Do you have the one line from python/asyncio@d84a8cb in sslproto.py? I can't see if that got into 3.6 or not |
Yes, I found that line in /usr/lib/python3.6/asyncio/sslproto.py. |
Shame, I was hoping that was it. Be prepared for very verbose output! |
I'm hitting the same problem when running an Electrumx 1.0 server. Unfortunately one client seems to use this to intentionally exhaust server resources. |
I applied this patch and leakage is much reduced. Patch works on both Python 3.5.2 and 3.6. |
After a longer run time the leak still exists with python/cpython#480 applied. http://electrum.hsmiths.com:49001/munin/hsmiths.com/electrum.hsmiths.com/electrumx_lsof-week.png http://vps.hsmiths.com:49001/munin/hsmiths.com/vps.hsmiths.com/electrumx_lsof-week.png When I applied the patch I added a warning so I could be sure the patch was going into effect.
|
Has there been any further progress? At the moment lsof is showing hundreds of connections from one IP that never appears in the log at all. The previous instance exhausted its 10,000 file handles presumably for the same reason. Are regular restarts the only solution? |
Until it is fixed upstream in Python, yes, the only solution is to restart. |
To mitigate the issue, you can use iptables to restrict IPs to N connections per IP (I use 8)
Also, as a super ugly hack, you can periodically use |
Thanks @erorus, I have that iptables rule in place now and it's helping keep things under control. I know it needs to be fixed in Python, but is there a Python bug report active at the moment? I know that python/asyncio#483 was opened by @kyuupichan but that hasn't moved since the proposed fix (python/cpython#480), and from @shsmith's comments above the problem persists after that is applied. In fact the asyncio repo containing that issue has been closed since March, so it seems unlikely to ever be revisited - new asyncio bugs are meant to be raised at https://bugs.python.org. |
https://bugs.python.org/issue29970 It's not getting much attention |
Thanks, I've suggested a fix on that issue. |
@mocmocamoc where does this patch go? in one of the electrumx files? could you post a diff of the patched file. Thanks! |
@mocmocamoc I found your notes here: https://bugs.python.org/issue29970 and was able to apply the monkeypatch to /usr/lib/python3.6/asyncio/sslproto.py. |
Good news! The leakage is completely stopped with this insertion: https://github.com/python/cpython/pull/4825/files#diff-0827a8b032e7f279fa8f66eee271f6ceR559 Thanks @mocmocamoc! This should be merged upstream. |
I can also concur that merging pull #4825 fixed the handle leakage. It is an issue independent of electrumx, but I feel there should be a mention of the issue, and the solution, in the docs. |
Once it is fixed upstream I will make that version of Python the recommended minimum. |
python/cpython#4825 has been merged so this should be fixed as of Python 3.7.0a4. |
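For reference, Python 3.7 and later expose this as the `ssl_handshake_timeout` argument to `loop.create_server()` and `asyncio.start_server()`. The following is a minimal sketch of how a server might set it, not ElectrumX's actual code; `handle_session`, `start_tls_server` and `silent_client_is_dropped` are hypothetical names, and the demo uses an SSL context with no certificate loaded only because the silent client never starts a handshake. A real server would call `load_cert_chain()` first.

```python
import asyncio
import ssl


async def handle_session(reader, writer):
    # Only reached once the TLS handshake has completed.
    writer.close()


async def start_tls_server(ssl_context, host="127.0.0.1", port=0,
                           handshake_timeout=10.0):
    """Start a TLS server that aborts connections stuck in the handshake."""
    return await asyncio.start_server(
        handle_session, host, port,
        ssl=ssl_context,
        ssl_handshake_timeout=handshake_timeout,  # Python 3.7+
    )


async def silent_client_is_dropped(timeout=0.5):
    """Open a raw TCP connection, send nothing, and see if it gets closed."""
    # Assumption for the demo: no certificate is loaded, which is enough
    # here because the client never sends a ClientHello.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    server = await start_tls_server(ctx, handshake_timeout=timeout)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        data = await asyncio.wait_for(reader.read(1), timeout + 5)
        dropped = data == b""          # EOF: server gave up on us
    except ConnectionError:
        dropped = True                 # reset by the server
    except asyncio.TimeoutError:
        dropped = False                # socket still open: the leak
    writer.close()
    server.close()
    return dropped
```

This mirrors the telnet test described earlier in the thread: a client that connects to the SSL port and sends nothing should be disconnected once the timeout passes instead of holding a handle indefinitely.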
As this will be fixed in Python 3.7 I will close this issue. |
I have noticed that over time the number of IP handles reported by lsof continues to grow and far exceeds the number of connections reported by the sessions RPC.
One IP in particular appears over 1000 times, even though there is no active session for the IP.
Here are the logs from 2 days ago when this IP briefly connected:
This IP does not appear at all after that.
Possibly related: the "subs" value in the getinfo RPC is much higher than the sum of subs for the sessions RPC. I suspect these are related to the stuck TCP handles.
Live Munin charts: http://vps.hsmiths.com:49001/munin/electrumx-day.html
A snapshot from near block 446220: