socket leak (suspected due to redirection) #771

Closed
eranrund opened this Issue · 47 comments

8 participants

@eranrund

It seems requests forgets to properly close a socket when a redirection occurs.
See the following gist for the code that causes the issue (at least here...):
https://gist.github.com/3313868

@kennethreitz

Have you tried the latest release? v0.13.6 fixed a bunch of socket leaks.

@slingamn slingamn reopened this
@slingamn

Session.close() was added in the new release, so it's possible that there's a problem.

However, I forked the gist to run lsof automatically, as in bluehorn's test, and didn't see any problem under the master or develop branches: [https://gist.github.com/3318253]

@eranrund

This is how it looks here, unfortunately :(

$ cat 3318253/gistfile1.txt
import requests
import os
import subprocess

pid = os.getpid()
reqsess = requests.session()
try:
    r = reqsess.get(r'http://www.mouser.com/Search/Refine.aspx?Keyword=test')
    d = r.text
finally:
    reqsess.close()

print 'done, calling lsof'
subprocess.call("lsof -p%d -a -iTCP" % (pid,), shell=True)
print 'done'
$ git clone https://github.com/kennethreitz/requests.git
Cloning into 'requests'...
remote: Counting objects: 8196, done.
remote: Compressing objects: 100% (2795/2795), done.
remote: Total 8196 (delta 5593), reused 7911 (delta 5344)
Receiving objects: 100% (8196/8196), 1.53 MiB | 570 KiB/s, done.
Resolving deltas: 100% (5593/5593), done.
$ PYTHONPATH=requests python ./3318253/gistfile1.txt
done, calling lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 1686 eran 3u IPv4 1394237 0t0 TCP 192.168.0.107:38028->a92-122-135-23.deploy.akamaitechnologies.com:http (CLOSE_WAIT)
done
$

@eranrund

If you are interested, I will gladly provide access to this machine for experimenting.
Also, if you want me to add some logging in any specific places, let me know.

Thanks!

@Lukasa
Collaborator

Using @slingamn's Gist, I can also reproduce the bug.

Using Python with the following build info:

Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin

And using the develop branch of Requests @ 27b55a7, I get the output:

done, calling lsof
COMMAND   PID   USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
python  12214 lukasa   3u   IPv4 0xffffff8012a634e0      0t0  TCP 192.168.1.4:61042->2.16.231.23:http (CLOSE_WAIT)
done
@slingamn

That's really weird; I can't reproduce this under Python 2.6.7 or 2.7.3.

Anyone know if that open socket is to the first server (the one that sent the redirect) or the last server?

@h4ck3rm1k3

Debian stable, Python 2.6.6: it works with no leaking.
On commit 27b55a7, branch develop.

According to wireshark:

GET /Search/Refine.aspx?Keyword=test HTTP/1.1
Host: www.mouser.com
Accept-Encoding: identity, deflate, compress, gzip
Accept: */*
User-Agent: python-requests/0.13.6 CPython/2.6.6 Linux/2.6.32-5-amd64

HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=utf-8
Location: http://de.mouser.com/Search/Refine.aspx?Keyword=test
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Tue, 14 Aug 2012 07:32:08 GMT
Connection: close
Vary: Accept-Encoding
Set-Cookie: ASP.NET_SessionId=12vbkqzdf4o4y502nbwd14vf; domain=.mouser.com; path=/; HttpOnly
Set-Cookie: CARTCOOKIEUUID=84b5edd5-864f-4f48-aea5-e267cd86c0df; domain=.mouser.com; expires=Wed, 14-Aug-2013 07:28:31 GMT; path=/
Set-Cookie: preferences=; domain=.mouser.com; path=/
Set-Cookie: TS604485=55d4f36503ed1bff5312fc5334dfee9022925ff451cc8b8c5029fef836af3f5ef6da005fbefdd30bfdbd1f58fdf092e3ffffffff; Path=/
Cache-Control: private

Object moved

Object moved to here.

Then the redirect is followed:

GET /Search/Refine.aspx?Keyword=test HTTP/1.1
Host: de.mouser.com
Cookie: CARTCOOKIEUUID=84b5edd5-864f-4f48-aea5-e267cd86c0df; ASP.NET_SessionId=12vbkqzdf4o4y502nbwd14vf; preferences=
Accept-Encoding: identity, deflate, compress, gzip
Accept: */*
User-Agent: python-requests/0.13.6 CPython/2.6.6 Linux/2.6.32-5-amd64

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Vary: Accept-Encoding
Content-Encoding: gzip
Date: Tue, 14 Aug 2012 07:32:10 GMT
Transfer-Encoding:  chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: preferences=ps=de&pl=de-DE&pc_de=EUR; domain=.mouser.com; path=/
Set-Cookie: TS604485=c6aab6eabd6798c9eb3a18f9f40a1e622b6b6b5dc26917a75029fef9fdf092e3cb3bd022; Path=/
Cache-Control: private

... data follows

@Lukasa
Collaborator

Ugh. I was wondering if this was an OS X versus Linux issue, so I tested it on my Ubuntu 12.04 LTS box.

Python build info:

Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2

Using commit 27b55a7 build of Requests, I get:

done, calling lsof
lsof: WARNING: can't stat() fuse.gvfs-fuse-daemon file system /var/lib/lightdm/.gvfs
      Output information may be incomplete.
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python  6820 lukasa    3u  IPv4  17429      0t0  TCP Sweden.local:40976->a23-33-215-23.deploy.akamaitechnologies.com:http (CLOSE_WAIT)
done

Still leaking.

@slingamn

Oh, heh, the reason it doesn't reproduce for me is that the page returns me a 200 OK and sends back a body. It probably depends on your IP address.

http://www.gogle.com sends me a 301 to http://www.google.com/, but when I use that URL in the test, I still can't reproduce the bug. Can anyone reproduce the bug with this URL?

@Lukasa
Collaborator

It must, because from the UK I get a 302 and a redirect to gb.mouser.com.

If I try http://www.gogle.com I get a 301 to http://www.google.com/ followed by a 302 to http://www.google.co.uk. The test script does not indicate a leak.

@slingamn

Maybe someone who can reproduce this could try running the script under strace -e network,file -s 4096?
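
For instance (a hypothetical invocation; -e trace=... is the long form of the flags above, and -o just captures the output to a file):

$ strace -e trace=network,file -s 4096 -o trace.log python ./3318253/gistfile1.txt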

@eranrund

strace output where the issue reproduces: https://gist.github.com/3368251

@slingamn

Heh, apparently strace doesn't consider close(2) "file-related" (or network-related). That's silly.

My fork has a branch leak_patch [https://github.com/slingamn/requests/tree/leak_patch] which adds some debugging prints for the connection cleanup code. Could you try it and post the output?

@eranrund

Here it is:

adding new connection for www.mouser.com
returning to pool: www.mouser.com adding new connection for il.mouser.com
returning to pool: il.mouser.com 4
closing requests.sessions.Session
Pool is empty: www.mouser.com
Closing FD il.mouser.com 4
Pool is empty: il.mouser.com
done, calling lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 20466 eran 3u IPv4 2341665 0t0 TCP 192.168.0.107:35936->a92-122-135-23.deploy.akamaitechnologies.com:http (CLOSE_WAIT)
done

@kennethreitz

Is this a leak from Requests or Urllib3?

@slingamn

Unfortunately this is quite challenging to debug, because the relevant code extends across Requests, urllib3, and the standard-library httplib. Also, it's hard to reproduce :-/

I think some hints about how to inspect the problem are in the debugging patch mentioned above. Basically: identify every place a socket is created, figure out who creates and stores it, and print its fileno whenever possible; then correlate that with lsof and/or strace output describing what the program does with its file descriptors. An interactive debugger (pdb, ipdb, pudb) might help.
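
For illustration, a minimal sketch of the kind of socket instrumentation I mean (untested, and not the code from the leak_patch branch):

import socket
import traceback

_real_socket = socket.socket

class LoggingSocket(_real_socket):
    # Log every socket creation and close with its FD and a stack trace,
    # so the output can be correlated with lsof/strace.
    def __init__(self, *args, **kwargs):
        _real_socket.__init__(self, *args, **kwargs)
        print '--> new sock %d' % self.fileno()
        traceback.print_stack()

    def close(self):
        print '--> close sock %d' % self.fileno()
        traceback.print_stack()
        _real_socket.close(self)

socket.socket = LoggingSocket  # install before requests opens any sockets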

@keves this is hard, but we'd be really appreciative if you could investigate the issue :-)

One thing I thought was interesting from the debugging output was:

returning to pool: www.mouser.com adding new connection for il.mouser.com

Seemed like there should have been a newline in between the "returning" and "adding" messages. I think I saw a similar missing newline when I was playing with google.com's redirection, but I didn't have a corresponding leak.

@eranrund

Well, this turns out to be very ugly. I also patch the close() method on the socket object, and I see it gets called twice, once for each file descriptor.

Here is the code I used and its output: https://gist.github.com/3665659


--> new sock 3

--> close sock 3
RELEASE: False
teh release!!!
--> new sock 4

RELEASE: False
--> close sock 4
done, calling lsof
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python  11087 eran    3u  IPv4  33696      0t0  TCP 192.168.0.105:44490->a88-221-75-23.deploy.akamaitechnologies.com:http (CLOSE_WAIT)
done
@slingamn

Oh, excellent --- patching socket was a great idea.

You can use import traceback; traceback.print_stack() to find the exact places in requests/urllib3/httplib where the sockets are being created and closed.

@slingamn

Ah, I see you already tried that.

Well, this is suspicious; the output claims that we closed file descriptor 3, but it ends up in the lsof output anyway. Maybe a different FD was opened after the close and given the number 3?

Another thing you can play with is os.close(), which lets you close arbitrary FDs by number --- if you put an os.close(3) before the lsof, do you still see CLOSE_WAIT?
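
That is, something along these lines right before the subprocess.call (hypothetical snippet):

import os

# Hypothetical check: forcibly close FD 3 right before the lsof call.
# An OSError (EBADF) would mean the descriptor was already closed.
try:
    os.close(3)
except OSError as e:
    print 'os.close(3) failed: %s' % e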

@eranrund

Some updates:
1) Calling os.close() explicitly before calling lsof fixes the issue (as expected)
2) The following shows a snippet of strace running the test Python script in the gist https://gist.github.com/3665659:


socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffe16e66000
write(1, "--> new sock 3\n", 15)        = 15
write(1, "\n", 1)                       = 1
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR)               = 0
connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("88.221.75.23")}, 16) = 0
sendto(3, "GET /Search/Refine.aspx?Keyword="..., 209, 0, NULL, 0) = 209
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR)               = 0
recvfrom(3, "HTTP/1.1 302 Moved Temporarily\r\n"..., 8192, 0, NULL, NULL) = 877
write(1, "--> before close sock 3\n", 24) = 24
write(1, "--> after close sock\n", 21)  = 21

Look closely at the gist script and at the two write(1, ...) lines at the bottom. The close() syscall should have appeared between them!
Looking further down the strace:

write(1, "--> before close sock 4\n", 24) = 24
close(4)                                = 0
write(1, "--> after close sock\n", 21)  = 21

The first socket doesn't get closed, the second one does. What's wrong with the socket's close() method?

3) The combination of the two discoveries above made me think that maybe sock.close() isn't calling os.close(), but further investigation revealed that the actual socket is implemented in C and is closed from there by means of reference counting. The socket is closed when the object is finalized.

I then used objgraph to print the backrefs to the actual _sock object (the C object inside the Pythonic wrapper in socket.py from stdlib) and noticed the difference between the socket that gets properly closed and the one that is left open: The broken one has an httplib.HTTPResponse instance which references it through its fp member.

Digging deeper, I eventually found out that r.history (r is the response returned from reqsess.get() in the test script) references the 302 redirection response. If I clear r.history, the garbage socket gets disposed...
I think now would be a good time for people familiar with requests code to jump in...

TL;DR: r.history keeps a reference to the redirection response, which in turn keeps a reference to the httplib.HTTPResponse, which in turn keeps a reference to the socket, and this is why we're seeing the 'junk' socket. Perhaps someone familiar enough with requests/urllib3 internals could weigh in on this?
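
For anyone who wants to poke at the chain by hand, here's a sketch (raw, _fp, and fp are the internal attribute names in this era of requests/urllib3/httplib):

import requests

# Sketch: walk the suspected reference chain by hand after the redirect.
reqsess = requests.session()
r = reqsess.get('http://www.mouser.com/Search/Refine.aspx?Keyword=test')
redirect = r.history[0]           # the 302 response kept in r.history
httplib_resp = redirect.raw._fp   # underlying httplib.HTTPResponse
print httplib_resp.fp             # non-None means the socket is still referenced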

@slingamn

This is extremely helpful, thanks so much!

It looks like socket.close() is not guaranteed to close the socket, but socket.shutdown() can help with that: [http://docs.python.org/library/socket.html#socket.socket.close]. But I'm not clear on the mechanics of why this happens.
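
(Roughly, the docs' suggestion amounts to the following sketch:)

import socket

# Shut down both directions of the connection, then close the wrapper.
sock = socket.create_connection(('www.mouser.com', 80))
sock.shutdown(socket.SHUT_RDWR)
sock.close()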

In general we are hoping not to rely on the semantics of the garbage collector for socket disposal, since we are also targeting pypy, which doesn't use reference counting and thus has less predictable/eager garbage collection behavior. So ideally the fix would be something other than trying to zero out the refcount of the socket object. However, as per this comment in socket.py:

# Wrapper around platform socket objects. This implements
# a platform-independent dup() functionality. The
# implementation currently relies on reference counting
# to close the underlying socket object.

there may be no other way. This patch should drop the reference to the socket from the history list; could you see if it fixes the problem for you? [https://gist.github.com/5ae6b972b067d7b81e96]
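
In spirit, the patch amounts to something like this (a sketch of the idea; the gist has the exact change):

def drop_history_raw(response):
    # Sketch (see the gist for the real patch): once the redirect
    # responses kept in .history have been consumed, drop their
    # reference to the raw urllib3/httplib response so the socket's
    # refcount can reach zero.
    for old in response.history:
        old.raw = None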

I'd like to think more about this, in particular to understand why the cpython socket module does things this way, and to figure out what the best practices are for writing socket code for both cpython and pypy.
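
To illustrate what that socket.py comment means in practice on CPython 2 (a hedged, stdlib-only sketch):

import os
import socket

# Two Python wrappers can share one underlying C socket object; the
# OS-level FD closes only when the last wrapper lets go of it.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
d = s.dup()      # second wrapper around the same _sock
fd = s.fileno()
s.close()        # this wrapper is closed, but d still holds the C socket
os.fstat(fd)     # succeeds: the descriptor is still open
d.close()        # last reference dropped; now the FD really closes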

P.S. Taking the cpython docs at their word, here's a patch that calls shutdown on the sockets prior to disposal, but it probably doesn't help: [https://gist.github.com/32f5030728b9d241d128]

@sigmavirus24
Collaborator

@slingamn my first instinct about the docs telling you to use shutdown first is that it changes the behavior of the socket, just as shutdown does with C sockets. I agree with you that the shutdown call is not likely to help, but it is worth trying.

@eranrund

Hello.

The suggested patch (setting raw to None) fixes the issue. The shutdown patch doesn't, as expected.

@slingamn

I was looking into why I couldn't reproduce the bug locally, and it looks like we are supposed to be handling this already. Basically, the reference chain r.history[0].raw._fp.fp is supposed to be broken at the last ., by _fp (an httplib.HTTPResponse) having .close() called on it when the content is consumed. This is supposed to set _fp.fp to None.

See this traceback:

  File "./buggist.py", line 25, in <module>
    r = reqsess.get(r'http://www.google.in/')
  File "/home/shivaram/workspace/requests/requests/sessions.py", line 265, in get
    return self.request('get', url, **kwargs)
  File "/home/shivaram/workspace/requests/requests/sessions.py", line 252, in request
    r.send(prefetch=prefetch)
  File "/home/shivaram/workspace/requests/requests/models.py", line 640, in send
    self._build_response(r)
  File "/home/shivaram/workspace/requests/requests/models.py", line 233, in _build_response
    r.content  # Consume socket so it can be released
  File "/home/shivaram/workspace/requests/requests/models.py", line 793, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/home/shivaram/workspace/requests/requests/models.py", line 739, in generate
    chunk = self.raw.read(chunk_size)
  File "/home/shivaram/workspace/requests/requests/packages/urllib3/response.py", line 165, in read
    return self._fp.read(amt)
  File "/usr/lib64/python2.7/httplib.py", line 565, in read
    self.close()

which I obtained using a monkeypatching technique similar to yours: [https://gist.github.com/955bfeaded8099b1a99f]. The crucial step is that _build_response calls r.content, which reads the entire response, triggering a call to close(). (In my gist I had to copy and paste the original code of HTTPResponse.close into my instrumented subclass --- it's an old-style class and I was having some headaches getting the superclass call to work.)
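
In outline, the instrumentation looks roughly like this (a sketch; the real version is in the gist above):

import httplib
import traceback

class InstrumentedResponse(httplib.HTTPResponse):
    def close(self):
        print '--> HTTPResponse.close, fp=%r' % (self.fp,)
        traceback.print_stack()
        # Body copied from httplib.HTTPResponse.close; it's an old-style
        # class, so no super() call here.
        if self.fp:
            self.fp.close()
            self.fp = None

# httplib consults this attribute when constructing responses:
httplib.HTTPConnection.response_class = InstrumentedResponse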

Do you have any idea why this isn't happening correctly on your system?

@slingamn

I can't reproduce using the Mouser URL, since it returns a 200 OK for me (possibly because of my IP address, I think).

If you have time, could you look into the question of why r.content isn't breaking the reference chain, by setting _fp.fp to None? I think that's the core question here, since it is intended to prevent exactly this situation.

@shazow
Collaborator

Should urllib3's _fp be set to None? If so, then that also supports shazow/urllib3#103.

@slingamn

@shazow interesting.

It is sufficient for this ticket for _fp.fp to be set to None, as it should be after the call to r.content. I think we'd only need to have urllib3 set raw._fp to None as well if we were trying to release the socket without reading all the content.

However, I suspect this is not a good use case; if you try to reuse a connection while it still has unread data from the last request, I think it may not work correctly. (Possibly depending on implementation details of the server.) Thoughts?

@shazow
Collaborator

I don't recall what the actual behaviour is if you try to re-use a socket with unread data. Would be nice if someone wrote a test. :)

@slingamn

Quick and dirty test: [https://gist.github.com/53a467d3523fbc21354b]

It looks like httpbin, at least, will get stuck if you try to reuse the connection without reading the full response. This may be cryptically alluded to in the httplib comments as "protocol synchronisation" being lost.

I'm not sure if this is a property of all HTTP servers, or whether it depends on the implementation. Someone more familiar with the HTTP spec could probably clarify...
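
The desynchronisation is easy to see at the raw-socket level, though; a hedged sketch, independent of the gist:

import socket

# Send a second request without draining the first response: the next
# bytes read still belong to response #1, so the client is out of sync.
s = socket.create_connection(('httpbin.org', 80))
s.sendall('GET /bytes/1024 HTTP/1.1\r\nHost: httpbin.org\r\n\r\n')
print s.recv(64)   # only a fragment of response 1
s.sendall('GET /get HTTP/1.1\r\nHost: httpbin.org\r\n\r\n')
print s.recv(64)   # still response 1's bytes, not response 2
s.close()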

@piotr-dobrogost

http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8

8.1.2.2 Pipelining
(...)
A client wishing to send a non-idempotent request SHOULD wait to send that request until it has received the response status for the previous request.

According to this, a client may choose not to wait for the body, let alone read it.

Side note: although the above statement is in the Pipelining section, it seems to describe the more general situation of sending requests over the same connection (which doesn't have to be pipelining). This can be inferred from another statement in the same section: "Clients SHOULD NOT pipeline requests using non-idempotent methods (...)"

@slingamn

Oh, thanks, that's very relevant.

I think this is not a behavior we can realistically support in Requests, even for idempotent requests. It sounds like under the pipelining semantics, sending a new request before the old response has been read SHOULD cause the server to queue up the new response behind the old response. Thus, either way, the new response will not be available until the old response has been consumed.

@eranrund

So, are we stuck on this? :(

@slingamn

@keves I was unclear, sorry. There's an action item for you in this comment: [#771 (comment)]

@eranrund

Oh, sorry. Missed that.
I'll try looking into it during the weekend.

@eranrund

I have a speculation I would like to discuss as I wonder what I am missing.
Inside httplib.HTTPResponse, we have the following call:

File "/usr/lib64/python2.7/httplib.py", line 565, in read
    self.close()

which, to my understanding, is what breaks the _fp.fp chain. Assuming that is indeed the case, what happens when the response doesn't have a content-length header?
In that case, self.length will be None, and the code in read() that calls self.close() will never run:

        s = self.fp.read(amt)
        if self.length is not None:
            self.length -= len(s)            
            if not self.length:                
                self.close()

        return s

I sniffed the traffic and indeed I see no content-length header; this was further confirmed by printing self.length. When I explicitly set it to the correct length, the behavior was as expected.
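
(The confirmation was just a monkeypatch along these lines, a sketch from memory rather than my exact code:)

import httplib

_orig_read = httplib.HTTPResponse.read

def logging_read(self, amt=None):
    # self.length is the remaining content length; None means the server
    # sent no content-length header (the case at issue here).
    print 'HTTPResponse.read: self.length=%r' % (self.length,)
    return _orig_read(self, amt)

httplib.HTTPResponse.read = logging_read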

I hope I am not way off here... :)

CC shazow/urllib3#103 - might be relevant

@sigmavirus24
Collaborator

You don't seem to be off at all. This looks like an incredibly sneaky bug.

@slingamn

Wow, nice work.

For a similar but unrelated issue, see #223 and its upstream counterpart [http://bugs.python.org/issue14721].

@eranrund

Thanks :)
I don't see why the issues you mentioned are similar, though; from skimming them I understand they relate to the content-length sent by the client (i.e. requests/httplib/urllib/whatever). Here our problem is that the server isn't replying with a content-length header (unless you want to emphasize that content-length handling is broken all across the board :)).

Who should we address with this? The Python bug tracker? Is there a test suite for httplib? Perhaps it would be simple to write a test case for this issue and increase the chances of someone looking at it.

btw, this could be related: http://bugs.python.org/issue15633

@sigmavirus24
Collaborator

@slingamn thanks for reminding me about #223 ;)

@slingamn

@keves yeah, similar in the sense that it's a subtle bug in httplib related to the content-length header.

It's possible that this is in fact the root cause of [http://bugs.python.org/issue15633]. I think reporting this to Python upstream will clarify what exactly the problem is, but we'll likely have to add a workaround to Requests in any case (because the release cycle for Python upstream is slow, and because the fix will never hit Python 2.6).

It's up to you whether you want to report this on [http://bugs.python.org/issue15633] or on a new ticket.

@eranrund

Opened a new issue [http://bugs.python.org/issue16298] and referenced the one you mentioned.
Let's see what happens :)

@Lukasa
Collaborator

Looks like this is confirmed as being 'Not our fault'. Hurrah!

Thanks all for your work on this gnarly little bug, I know it took a lot of your time. =)

@Lukasa Lukasa closed this