
Connection reset by peer when sending POST #4937

Closed
rdgoite opened this issue Jan 14, 2019 · 26 comments

Comments

@rdgoite

rdgoite commented Jan 14, 2019

Client code that uses Requests module to send data via HTTP POST encounters a ConnectionResetError. The entire operation (composed of multiple POST requests to a small set of service endpoints) can sometimes succeed, but most of the time, it fails with this error.

Expected Result

Operation succeeds (or fails) without connection issues.

Actual Result

The operation fails with ConnectionResetError.

Additional Information

It's a little difficult to provide basic reproduction for the issue as we're running into this problem with a test payload that's specific to our system. The server (peer) is a Java application that's configured to terminate/reset connection after a given time of no use (idle). The client sends multiple one-off POST requests, but it seems like internally, the connections are being reused similar to issue #4506, and the operation eventually runs into a connection that's already been reset and raises the error. However, unlike #4506, we are not using Sessions.

Here are some tracebacks that could hopefully describe the problem better:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

...

Traceback (most recent call last):
  ...
  File "../client-code.py", line 322, in createFile
    r = requests.post(fileSubmissionsUrl, data=json.dumps(fileToCreateObject), headers=self.headers)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.3.1"
  },
  "idna": {
    "version": "2.6"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.1"
  },
  "platform": {
    "release": "4.14.42-61.37.amzn2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1000211f",
    "version": "18.0.0"
  },
  "requests": {
    "version": "2.20.1"
  },
  "system_ssl": {
    "version": "20000000"
  },
  "urllib3": {
    "version": "1.24.1"
  },
  "using_pyopenssl": true
}
@evahteev

+1 same issue here


@lucas03

lucas03 commented May 12, 2019

We are also experiencing a lot of errors (14%) on some servers. The POST call is not retried by requests since it's not idempotent.

After installing requests[security], the error changed to ('Connection aborted.', OSError("(104, 'ECONNRESET')",)).

@lucas03

lucas03 commented May 24, 2019

I can reproduce a single ECONNRESET error. If there is no connection for 10 minutes, the next connection will raise ECONNRESET.

@lucas03

lucas03 commented Jul 3, 2019

I've got more info here. This is a message from our internal GitLab:

I have an idea I wanna put to test:

there are 10 connections in connection pool per server/worker.

We have a Slack notifier that makes about 10 requests per minute. We have 8 workers, which I assume share the same session (connection pool).

These connections are randomly assigned to workers. If a worker receives a task, it will pick a connection from the pool and use it. If it receives two tasks at the same time, it will pick two connections and use them, returning them to the pool. So something like this can happen for a single worker:

  1. minute: 2 simultaneous connections used, returned to pool.
  2. minute: max 1 simultaneous connection used, returned to pool.
  3. minute: max 1 simultaneous connection used, returned to pool.
  4. minute: max 1 simultaneous connection used, returned to pool.
  5. minute: max 1 simultaneous connection used, returned to pool.
    the server closed the second connection, as it was idle for 5 minutes.
  6. minute: 2 simultaneous connections used. When the second connection is picked, it's already closed and we receive the error.

When I moved the initialization of the session into the method that does the requests, the connection errors disappeared, though that effectively doesn't reuse the session and will consume more resources.

Shouldn't the requests lib be aware of these closed connections and retry them automatically?
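One possible mitigation (not from this thread, and only safe if duplicate POSTs are acceptable to your endpoint) is to opt in to retrying non-idempotent requests via urllib3's Retry. Setting allowed_methods=False disables the method filter so even POSTs are retried; older urllib3 releases (pre-1.26, such as the 1.24.1 in the report above) call this parameter method_whitelist instead. A sketch; the URL is a placeholder:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# allowed_methods=False turns off the "idempotent methods only" filter,
# so connection errors on POST are also retried. backoff_factor adds a
# growing delay between attempts.
retry = Retry(total=3, backoff_factor=0.5, allowed_methods=False)
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)

# Hypothetical usage:
# session.post("http://example.internal/notify", json={"text": "hi"})
```

Whether a request that dies mid-response is retried still depends on urllib3's internals, so this is a sketch rather than a guaranteed fix.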

@jeicoo

jeicoo commented Jul 23, 2019

Been experiencing this issue as well. In my case, a session is being used and we are doing the actual request inside a coroutine (loop.run_in_executor).

@shankari

shankari commented Aug 21, 2019

I ran into this as well, and I can provide a way to reproduce 😄

To reproduce

Run any of the notebooks in
https://github.com/e-mission/e-mission-eval-public-data
The notebooks can be launched using binder:
Update: The issue does not occur on notebooks launched via binder, only on my laptop.
Update 2: The issue is intermittent even on notebooks launched on my laptop. When it works, it works consistently. When it fails, it fails for hours at a time.
Update 3: The issue does not occur on binder, but occurs 90% of the time on my laptop

Current code

The current code is in https://github.com/e-mission/e-mission-eval-public-data/blob/master/emeval/input/spec_details.py#L24 and uses the quickstart version without any sessions

        response = requests.post(self.DATASTORE_URL+"/datastreams/find_entries/timestamp", headers=close_headers, json=post_msg, stream=False)

Workarounds tried

I have tried multiple workarounds, none of which have worked:

  • setting stream=False (as you can see above) DOES NOT WORK
  • using with requests.post(...) as response DOES NOT WORK
  • manually closing the response by adding response.close() DOES NOT WORK
         print("Found %d entries" % len(ret_list))
         response.close()
         return ret_list
    
  • tried using a session with keep_alive = False DOES NOT WORK
         s = requests.Session()
         s.keep_alive = False
         response = s.post(self.DATASTORE_URL+"/datastreams/find_entries/timestamp", headers=close_headers, json=post_msg, stream=False)
    
  • tried using a session with retries, DOES NOT WORK
         s = requests.Session()
         adapter = requests.adapters.HTTPAdapter(max_retries=2)
         s.mount('http://', adapter)
         response = s.post(self.DATASTORE_URL+"/datastreams/find_entries/timestamp", headers=close_headers, json=post_msg, stream=False)
    
  • tried using a session with max connections = 1 in an attempt to disable pooling DOES NOT WORK
         s = requests.Session()
         adapter = requests.adapters.HTTPAdapter(pool_connections=1, pool_maxsize=1,
             max_retries=5, pool_block=True)
         s.mount('http://', adapter)
         response = s.post(self.DATASTORE_URL+"/datastreams/find_entries/timestamp", json=post_msg)
    
  • tried manually closing the session and the adapter DOES NOT WORK
         print("Found %d entries" % len(ret_list))
         adapter.close()
         s.close()
         return ret_list
    

I haven't tried using sessions throughout, but I am not sure why I need to. If I use the quickstart version and ensure that the response is closed, shouldn't the pooling be effectively disabled (as hinted at in #4937 (comment))?

At this point, I am tempted to give up on requests and drop down to urllib directly

@shankari

Also, I control both the client and the server, and I can confirm that the request that fails never makes it to the server.

On the client:

About to retrieve data for ucb-sdb-ios-3 from 1563764533.214622 -> 1563788061.999584
About to retrieve messages using {'user': 'ucb-sdb-ios-3', 'key_list': ['background/motion_activity'], 'start_time': 1563764533.214622, 'end_time': 1563788061.999584}
response = <Response [200]>
Found 11 entries
Retrieved 11 entries with timestamps [1563764540.2866454, 1563764653.7736874, 1563764657.5883818, 1563768002.560504, 1563771602.48612, 1563775203.1804562, 1563778802.582345, 1563782402.60982, 1563786002.906196, 1563788054.78684]...


About to retrieve data for ucb-sdb-ios-3 from 1563788059.5701566 -> 1563788061.999584
About to retrieve messages using {'user': 'ucb-sdb-ios-3', 'key_list': ['background/motion_activity'], 'start_time': 1563788059.5701566, 'end_time': 1563788061.999584}

---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)

On the server, last call to /datastreams/find_entries is as below. Note that this is from the 11 entry successful call. The call generating the ConnectionResetError is not even making it to the server.

2019-08-20 19:58:20,794:DEBUG:140171047577344:START POST /datastreams/find_entries/timestamp
2019-08-20 19:58:20,794:DEBUG:140171047577344:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>
2019-08-20 19:58:20,794:DEBUG:140171047577344:Using the skip method to verify id token ucb-sdb-ios-3 of length 13
2019-08-20 19:58:20,795:DEBUG:140171047577344:retUUID = 7ed80490-6853-433d9d20-838fe4d3d71b
2019-08-20 19:58:20,796:DEBUG:140171047577344:curr_query = {'user_id': UUID('7ed80490-6853-433d-9d20-838fe4d3d71b'), '$or': [{'metadata.key': 'background/motion_activity'}], 'metadata.write_ts': {'$lte': 1563788061.999584, '$gte': 1563764533.214622}}, sort_key = metadata.write_ts
2019-08-20 19:58:20,796:DEBUG:140171047577344:orig_ts_db_keys = ['background/motion_activity'], analysis_ts_db_keys = []
2019-08-20 19:58:20,798:DEBUG:140171047577344:finished querying values for ['background/motion_activity'], count = 0
2019-08-20 19:58:20,798:DEBUG:140171047577344:finished querying values for [], count = 0
2019-08-20 19:58:20,800:DEBUG:140171047577344:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-08-20 19:58:20,802:DEBUG:140171047577344:Found 11 messages in response to query {'user_id': UUID('7ed80490-6853-433d-9d20-838fe4d3d71b'), '$or': [{'metadata.key': 'background/motion_activity'}], 'metadata.write_ts': {'$lte': 1563788061.999584, '$gte': 1563764533.214622}}
2019-08-20 19:58:20,807:DEBUG:140171047577344:END POST /datastreams/find_entries/timestamp 7ed80490-6853-433d-9d20-838fe4d3d71b 0.01349186897277832

@shankari

ok, so after poking around a bit more, I actually see an error on the server

  File ".../lib/python3.6/_pyio.py", line 1001, in read
    return self._read_unlocked(size)
  File ".../lib/python3.6/_pyio.py", line 1041, in _read_unlocked
    chunk = self.raw.read(wanted)
  File ".../lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

This seems to imply that the client is dropping the connection to the server. But I am no longer sure that dropping down to urllib directly is going to solve the problem.

@shankari

shankari commented Aug 21, 2019

A workaround that did work was to simply catch the error and retry. I am still not sure whether the client or the server is dropping the connection, and why they are doing so, but hopefully this workaround helps somebody else.

Note that both the request creation (requests.post) and the response read/parse (response.json()) are within the try block.

        try:
            response = requests.post(self.DATASTORE_URL+"/datastreams/find_entries/timestamp", json=post_msg)
            print("response = %s" % response)
            response.raise_for_status()
            ret_list = response.json()["phone_data"]
        except Exception as e:
            # Hacky copy-paste of original code, TODO refactor into separate function
            print("Got %s error %s, retrying" % (type(e).__name__, e))
            time.sleep(10)
            response = requests.post(self.DATASTORE_URL+"/datastreams/find_entries/timestamp", json=post_msg)
            print("response = %s" % response)
            response.raise_for_status()
            ret_list = response.json()["phone_data"]
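The copy-pasted retry above could be factored into a small helper, as the TODO suggests. A sketch (the function name and parameters are mine, not from the thread):

```python
import time

def fetch_with_retry(make_request, retries=1, delay=10):
    """Call make_request(), parse the JSON body, and retry on any failure.

    make_request is a zero-argument callable returning a requests-style
    response object (hypothetical helper, not part of the requests API).
    Both the request and the response parsing stay inside the try block,
    matching the original workaround.
    """
    for attempt in range(retries + 1):
        try:
            response = make_request()
            print("response = %s" % response)
            response.raise_for_status()
            return response.json()["phone_data"]
        except Exception as e:
            if attempt == retries:
                raise
            print("Got %s error %s, retrying" % (type(e).__name__, e))
            time.sleep(delay)
```

The original call site would then become something like `ret_list = fetch_with_retry(lambda: requests.post(self.DATASTORE_URL + "/datastreams/find_entries/timestamp", json=post_msg))`.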

@ArondeParon

Experiencing this same issue as well. It seems to be happening 50% of the time.

@isaiah-lyra

isaiah-lyra commented Sep 9, 2019

+1, experiencing ~35% of the time and I am using urllib3 directly.

@loic-bellinger

Experiencing this issue as well. Added exponential backoff to bypass it, but that seems somewhat hacky.

@azin634

azin634 commented Nov 16, 2019

I'm having the same issue. What version is it happening on? I'm using 2.21.0.

@Hackeron

@LouisBellinger: Can you elaborate on your exponential backoff mechanism to bypass the issue?
Struggling to solve this issue while doing ETL. Losing 5-10% of the files in transit.

@vlade-rc

Hi everyone, any update on this issue? I'm having the same problem. Do you know of any hotfix?

@eliuha

eliuha commented Jun 4, 2020

Is there a workaround?

@shankari

shankari commented Jun 5, 2020

for me, a single retry has worked pretty reliably:
https://github.com/MobilityNet/mobilitynet-analysis-scripts/blob/master/emeval/input/spec_details.py#L18

For exponential backoff, you would just catch the exceptions multiple times, sleeping longer every time.
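The exponential backoff shankari describes — catching the exception multiple times and sleeping longer each time — might look like this (the attempt count and delays are my own choices, not from the thread):

```python
import time

def post_with_backoff(do_post, max_attempts=4, base_delay=1.0):
    """Retry do_post() with exponentially growing sleeps: 1s, 2s, 4s, ...

    do_post is a zero-argument callable (e.g. a lambda wrapping
    requests.post). Re-raises the last exception once attempts run out.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return do_post()
        except Exception as e:
            if attempt == max_attempts:
                raise
            print("Attempt %d failed (%s), sleeping %.1fs" % (attempt, e, delay))
            time.sleep(delay)
            delay *= 2  # double the wait before the next attempt
```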

@hedza06

hedza06 commented Jan 14, 2021

Any solutions?

@jakermx

jakermx commented Feb 4, 2021

Hello... The persistent connections feature does not handle the client side when a Connection: keep-alive is requested, because at least on Linux-based OSes the TCP keep-alive feature is not enabled, or when enabled it is set to start working only after 2 hours of idle time. So I found that the problem is that the OS is not handling the flow control correctly. You can fix it in a few ways...

For Python in Linux

import requests
import socket
from urllib3.connection import HTTPConnection

HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),  # enables the TCP keep-alive feature
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),    # idle seconds before the stack starts sending keep-alives on a persistent connection
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),   # seconds between successive keep-alive probes
    (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),      # unanswered probes before the connection is dropped
]

Now you can instantiate your Sessions or make your requests.

Without TCP Keep Alive: [packet capture screenshot]

With TCP Keep Alive: [packet capture screenshot]

If the client reaches the idle session timeout enforced by the server, the server will now close the connection cleanly, and the connection gets released back to the pool instead of being dropped...

[packet capture screenshot]

There are some other ways to handle it, depending on your OS and your needs.
I hope this solves some of your issues.

@jakermx

jakermx commented Feb 4, 2021

I found this, but I don't know how to wire setsockopt into requests... maybe you know, and that would fix it for all OSes.

https://gist.github.com/shi-yan/611cc0221eeff1644797

@enote-kane

enote-kane commented Mar 2, 2021

I was also having this issue (and not just on POST requests) while running stress tests of services using locust, which uses requests.

The very good investigation and fix provided by @jakermx really did it for me as well:

import socket
from urllib3.connection import HTTPConnection
# ...
HTTPConnection.default_socket_options = (
    HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
        (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
        (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
        (socket.SOL_TCP, socket.TCP_KEEPCNT, 6)
    ]
)

Insert that code somewhere that gets loaded early by your application/script and it will change the urllib3 default socket options globally.
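If changing the global default is too blunt, the same options can be scoped to a single Session by passing socket_options through a custom adapter; urllib3's PoolManager forwards them to each connection it opens. A sketch under that assumption — note the TCP_KEEP* constants are Linux-specific, so they're guarded for portability:

```python
import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection

# Start from urllib3's defaults (e.g. TCP_NODELAY) and add keep-alive.
KEEPALIVE_OPTIONS = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
]
if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-only constants
    KEEPALIVE_OPTIONS += [
        (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
        (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
        (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
    ]

class KeepAliveAdapter(HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        # Extra keyword arguments here are forwarded to urllib3's
        # PoolManager, which applies socket_options per connection.
        kwargs["socket_options"] = KEEPALIVE_OPTIONS
        super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("http://", KeepAliveAdapter())
session.mount("https://", KeepAliveAdapter())
```

Only this session's connections get keep-alive probes; other code using requests in the same process is unaffected.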

This isn't a bug in requests or urllib3, since both offer solutions.

Also, urllib3 makes it pretty clear that, especially in this case, there is a simple detection issue that prevents automatic handling.

@jakermx

jakermx commented Mar 10, 2021

By default, the OS TCP/IP stack only starts sending TCP keep-alive packets after 2 hours of idle activity. This works for out-of-the-box web server installations, but if your content provider sets lower limits, the TCP connection will be dropped before those 2 hours, and when your app tries to use the "persistent" connection it will fail with Connection Reset by Peer, because it is no longer a valid connection. Setting the TCP options will avoid some errors, but not all. No matter whether you set the keep-alive header at L7, if the OS doesn't send keep-alive packets at the TCP layer, the only header that will avoid this condition is Connection: close.

This is an application, HTTP API, and IETF-level issue.
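For one-off requests where connection reuse doesn't matter, the Connection: close escape hatch mentioned above would look like this (the URL and payload are placeholders, not from the thread):

```python
import requests

# Asking the server to close the socket after each response sidesteps
# stale pooled connections entirely, at the cost of a fresh TCP (and TLS)
# handshake for every request.
headers = {"Connection": "close"}

# Hypothetical usage:
# response = requests.post("http://service.internal/endpoint",
#                          json={"key": "value"}, headers=headers)
```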

@elv1z

elv1z commented Mar 11, 2021

@jakermx you are my hero. Thank you for solution.
I killed almost 2 days trying to fix it.

@jakermx

jakermx commented Mar 11, 2021

There are no heroes, pal... just a great community workforce.

@nateprewitt
Member

There are several issues that have been tacked onto the original problem here because people are conflating multiple exceptions. I believe @jakermx has explained what is happening at the TCP level and the urllib3 issue #944 discussing that we're not able to distinguish between a dropped connection and RST has been linked. I'm going to resolve this as there isn't more information to provide at this point.
