
Various Intermittent Request Errors #3845

Closed
lukas-gitl opened this issue Feb 2, 2017 · 11 comments
@lukas-gitl

When using requests.post (I have not tried other methods), I get the following errors intermittently.

Errors

  • ('Connection aborted.', BadStatusLine("''",))
  • ('Connection aborted.', error(104, 'Connection reset by peer'))
  • ("bad handshake: SysCallError(104, 'ECONNRESET')",)
  • EOF occurred in violation of protocol (_ssl.c:661)
  • HTTPSConnectionPool(host='...', port=443): Max retries exceeded with url: ... (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x...>: Failed to establish a new connection: [Errno 110] Operation timed out',))
  • Sometimes I just get a 502 or 504 back, but those might be remote issues

How often this happens varies, but it is at least a few times every hour (we are sending a few thousand logs per hour). I wasn't able to reproduce these errors locally or reliably.

I have experienced this with various target hosts, namely the APIs provided by loggly.com, logentries.com, and logz.io. Since they all show the issue, I don't think it is a server configuration problem.

This issue first appeared about a year ago when we set this up. However, it seems to have become more problematic as we send more requests.

Setup
Running Python on Elastic Beanstalk inside a Single Container Docker Configuration behind an Elastic Load Balancer. The outgoing requests happen in various threads (not just in the main thread).

More information that might be useful:

  • EC2 Instance OS (cat /etc/*-release)
NAME="Amazon Linux AMI"
VERSION="2016.09"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2016.09"
PRETTY_NAME="Amazon Linux AMI 2016.09"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2016.09:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"
Amazon Linux AMI release 2016.09
  • Docker Version (docker --version)
Docker version 1.12.6, build 7392c3b/1.12.6
  • Docker Container OS (cat /etc/*-release)
3.4.6
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.4.6
PRETTY_NAME="Alpine Linux v3.4"
HOME_URL="http://alpinelinux.org"
BUG_REPORT_URL="http://bugs.alpinelinux.org"
  • Python 2.7.13
  • pip freeze --local results:

alembic==0.8.10
aniso8601==1.2.0
apns==2.0.1
awsebcli==3.9.0
backports.ssl-match-hostname==3.5.0.1
blessed==1.9.5
boto==2.45.0
botocore==1.5.7
cement==2.8.2
cffi==1.9.1
click==6.7
colorama==0.3.7
contextlib2==0.5.4
coverage==4.3.4
cryptography==1.7.2
docker-py==1.7.2
dockerpty==0.4.1
docopt==0.4.0
docutils==0.13.1
enum34==1.1.6
Faker==0.7.7
Flask==0.12
Flask-Compress==1.4.0
Flask-Limiter==0.9.3
Flask-Migrate==2.0.3
Flask-RESTful==0.3.5
-e flask-restful-swagger==0.15.5
Flask-Script==2.0.5
Flask-SQLAlchemy==2.1
Flask-Testing==0.6.1
funcsigs==1.0.2
GeoAlchemy2==0.4.0
idna==2.2
ipaddress==1.0.18
itsdangerous==0.24
Jinja2==2.9.5
jmespath==0.9.1
limits==1.2.1
Mako==1.0.6
mandrill==1.0.57
MarkupSafe==0.23
mock==2.0.0
nose==1.3.7
pathspec==0.5.0
pbr==1.10.0
pep8==1.7.0
pexpect==4.2.1
psycopg2==2.6.2
ptyprocess==0.5.1
pyasn1==0.1.9
pybars3==0.9.2
pycparser==2.17
pycrypto==2.6.1
pymemcache==1.4.0
PyMeta3==0.5.1
pyOpenSSL==16.2.0
python-dateutil==2.6.0
python-editor==1.0.3
python-gcm==0.1.5
pytz==2016.10
PyYAML==3.12
requests==2.13.0
rollbar==0.13.10
semantic-version==2.5.0
six==1.10.0
SQLAlchemy==1.1.5
tabulate==0.7.5
uWSGI==2.0.14
vcrpy==1.10.5
wcwidth==0.1.7
websocket-client==0.40.0
Werkzeug==0.11.15
wrapt==1.10.8

Solutions that did not work
I have found similar issues here, but was not able to fix this with any of the described solutions. I have tried a lot of different approaches, none of which seem to work. I would much appreciate advice on what to try next.

  • Added a small sleep, since I thought this might be related to this GIL issue
  • Tried running requests in separate processes instead of threads, since the GIL issue implied this might help
  • Tried using a session per thread so as not to open too many TCP connections, but saw no difference in error rate
  • Tried installing various libraries (pyopenssl, pyasn1, and ndg-httpsclient) as recommended here and here
  • Tried injecting from requests.packages.urllib3.contrib import pyopenssl; pyopenssl.inject_into_urllib3() as described here
  • Switched from an Ubuntu to an Alpine Docker container
  • Updated everything to the latest version (including the EC2 AMI and Docker)
  • Some other things that I can't remember at this point

What else should I try? Any idea what might be causing this? What tools should I use to debug this? I'm kind of at a loss here.

@Lukasa
Member

Lukasa commented Feb 2, 2017

So the first four errors all amount to "the connection was closed by the remote peer at a time we didn't expect". The fifth one is a standard connection timeout.

While it's possible that some of these reflect underlying problems (for example, a connection closed during a TLS handshake usually indicates that your TLS configuration is not supported by the server), in general all of these can be lumped together as what I'd call "network noise". Essentially, when using keep-alive connections it is possible for all kinds of wacky stuff to go on at the connection management layer.

The simplest way to handle this, and something I highly recommend, is to turn on retries. This can be done at the transport adapter level like this:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

If you want even more control you can pass a urllib3 Retry object as shown in urllib3's documentation instead of the integer value.
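For example, a minimal sketch of mounting a Retry object rather than an integer (this uses the standalone urllib3 import path; in requests 2.13 the same class is vendored as requests.packages.urllib3.util.retry.Retry):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,             # overall cap on retry attempts
    connect=3,           # retries for connection errors
    read=3,              # retries for read errors on an established connection
    backoff_factor=0.3,  # sleep 0.3s, 0.6s, 1.2s, ... between attempts
    status_forcelist=[500, 502, 503, 504],  # also retry on these status codes
)

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=retry))
s.mount('https://', HTTPAdapter(max_retries=retry))
```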

However, without more specificity we can't help you with the individual problems. If you can get down to single reproducing cases that'd be fab.

@Lukasa Lukasa closed this as completed Feb 2, 2017
@lukas-gitl
Author

lukas-gitl commented Feb 6, 2017

@Lukasa "If you want even more control you can pass a urllib3 Retry object as shown in urllib3's documentation instead of the integer value." - This does not seem to work. Is this broken?

The code doesn't seem to handle this correctly. Reference: https://github.com/kennethreitz/requests/blob/master/requests/adapters.py#L110

I'll try to set it on the adapter after initialisation with:

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retries = Retry(
    total=gitl.app.config['MAX_REQUEST_SESSION_RETRIES'],
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504],
)
adapter = HTTPAdapter()
adapter.max_retries = retries
self.session.mount('http://', adapter)
self.session.mount('https://', adapter)

and see if that works...

It seems to be passed through correctly. However, I still get the errors. Very odd...

Edit: Turns out POST requests are not automatically retried...

@Lukasa
Member

Lukasa commented Feb 7, 2017

Indeed, this is because POST requests are not idempotent, so it is not safe to automatically retry all POST requests. If you know POST requests are safe in this case, you can use urllib3's detailed Retry configuration object to allow retries for POST requests.
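If POST is known to be safe to retry for your endpoints, a sketch of that configuration (newer urllib3 releases call the parameter allowed_methods; older releases used method_whitelist):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Explicitly allow POST to be retried; by default Retry only retries
# methods considered idempotent (GET, HEAD, PUT, DELETE, ...).
retry = Retry(
    total=3,
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504],
    allowed_methods=frozenset(['GET', 'POST']),
)

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))
```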

@lukas-gitl
Author

Yes, that makes perfect sense. This now seems to be working fine. I'm just wondering how I would get urllib3's retry to also handle ("bad handshake: SysCallError(104, 'ECONNRESET')",). It does not currently seem to retry on that error. Any idea?

@Lukasa
Member

Lukasa commented Feb 7, 2017

Hrm, do you have the full traceback? I suspect that's a bug.

@lukas-gitl
Author

Kk, I'll try to get more information the next time it happens!

@lukas-gitl
Author

Here is the full stack trace:

Traceback (most recent call last):
  File "...", line ..., in ...
    r = self.session.post(..., data=...)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 535, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 497, in send
    raise SSLError(e, request=request)
SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)

@Lukasa
Member

Lukasa commented Feb 8, 2017

This is one of those moments where I'd have liked to see more data out of the stack trace. I swear, Python 3's chained exceptions only show up in my life when they aren't helpful, and never when they would be. 😉

However, I think this is a real bug. Looking at urllib3's code, it simply does not consider SSL errors to be worth retrying on. I think that instinct is wrong: SSL errors are almost certainly worth a retry attempt. I recommend you raise this as a bug over on the urllib3 repository, where we'll deal with it.

For when you raise the bug, please attach the following repro code. Specifically, if the following server is run:

import time
import socket

s = socket.socket()
s.bind(('', 4433))
s.listen(10)

while True:
    # Accept the TCP connection, then close it before completing the
    # TLS handshake, so the client sees an unexpected EOF mid-handshake.
    news, _ = s.accept()
    time.sleep(0.5)
    news.close()

Then this code will exhibit the issue:

import urllib3
http = urllib3.PoolManager()
http.request('GET', 'https://localhost:4433/', retries=5)

That code should fire off a MaxRetryError, but instead gets an SSLError.

@lukas-gitl
Author

For anyone reading this: forking Python processes doesn't seem to play nicely with requests. Ideally, a single worker handles the outgoing requests on behalf of all processes. This seems to have resolved all our errors here.
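A sketch of one way to avoid the fork problem (the worker function and URL here are hypothetical; the key point is that each Session is created inside its own process, after the fork, so pooled TCP connections are never shared across processes):

```python
import multiprocessing

import requests

_session = None

def get_session():
    # Lazily create one Session per process. A Session's pooled TCP
    # connections must not be shared across a fork, so each worker
    # process builds its own on first use.
    global _session
    if _session is None:
        _session = requests.Session()
    return _session

def worker(url):
    session = get_session()
    # session.post(url, data=...)  # real work would go here
    return id(session)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(worker, ['https://example.invalid'] * 2)
```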

@JDFagan

JDFagan commented Feb 16, 2018

I've noticed a pattern: I receive requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(104, 'ECONNRESET')",)) just beyond 5 minutes after the request is initiated, and I've verified that the server I'm talking to (IIS 8.5) has its server-side connection timeout set to 5 minutes. I have not been able to convince the admin of this server to increase the IIS connection timeout from 5 to 10 or 15 minutes, even though one of their REST APIs can return quite a large payload that sometimes takes longer than the timeout threshold; when it does, I get this error. Here is the full stack trace. Any ideas how I can resolve this in my case? Let me know if this seems different enough from this issue's thread and I'll open another issue:

Traceback (most recent call last):
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 285, in recv_into
    raise SocketError(str(e))
OSError: (104, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 285, in recv_into
    raise SocketError(str(e))
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError("(104, 'ECONNRESET')",))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "apollo/waypoint/etl/mri/cli.py", line 150, in <module>
    cli()
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "apollo/waypoint/etl/mri/cli.py", line 54, in etl
    etl.etl(route_set=route_set, routes=route, routes_params=route_params)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/waypoint/etl/mri/api_etl.py", line 95, in etl
    self.extract(route_set=route_set, routes=routes, routes_params=routes_params)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/waypoint/etl/mri/api_etl.py", line 171, in extract
    responses[extractor.name] = extractor.extract(params=params)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/waypoint/etl/mri/api_extractor.py", line 113, in extract
    responses = self.route.get_all(params=params)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/accounting/__init__.py", line 15, in wrapper
    result = f(self, *arg, **kw)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/accounting/mri/mri_api.py", line 432, in get_all
    response = self.get(params=params, stream=stream)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/accounting/__init__.py", line 15, in wrapper
    result = f(self, *arg, **kw)
  File "/home/jdfagan/Repositories/Waypoint/apollo/apollo/accounting/mri/mri_api.py", line 480, in get
    response = self.session.get(self.url, params=params, headers=self.headers, stream=stream)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
    return self.request('GET', url, **kwargs)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/home/jdfagan/.miniconda/apollo/lib/python3.6/site-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(104, 'ECONNRESET')",))

@sigmavirus24
Contributor

@JDFagan

  1. Reviving issues that have been closed for over a year because the exception looks vaguely similar isn't helpful to anyone subscribed to the issue or the issue tracker.

  2. You have a completely different issue, and there's literally nothing you can do to avoid ECONNRESET based on what you said. Basically you have to either convince the server not to take 5 minutes or get it to return the data within 5 minutes.

  3. Finally, you're asking for help on a defect tracker. Requests for help with requests belong on StackOverflow, and this is well documented.
