OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_RECORD', 'decryption failed or bad record mac')] #1906

Closed
ssbarnea opened this Issue Feb 7, 2014 · 48 comments

@ssbarnea
Contributor
ssbarnea commented Feb 7, 2014

It seems that the latest requests (2.2.1) is also affected by this bug:
OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_RECORD', 'decryption failed or bad record mac')]

There seems to be a workaround here: http://stackoverflow.com/questions/21497591/urllib2-reading-https-url-failure, but I don't know how to apply it to requests.

@Lukasa
Collaborator
Lukasa commented Feb 7, 2014

Thanks for this!

Yeah, this isn't really a Requests bug, as the SO question highlights: it's a Debian or OpenSSL bug.

With that said, a possible workaround would be an extension of the transport adapter demonstrated on my blog, here: https://lukasa.co.uk/2013/01/Choosing_SSL_Version_In_Requests/
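For reference, the adapter from that post is roughly the following sketch (the class name is mine, and I import urllib3 directly; the post used the vendored requests.packages.urllib3 path):

```python
import ssl

import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager  # vendored as requests.packages.urllib3 at the time


class SSLAdapter(HTTPAdapter):
    """Transport adapter that pins the SSL/TLS version used for HTTPS."""

    def __init__(self, ssl_version=None, **kwargs):
        self.ssl_version = ssl_version
        super(SSLAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        # Pass the pinned version down to urllib3's connection pool.
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=self.ssl_version,
                                       **pool_kwargs)


session = requests.Session()
session.mount('https://', SSLAdapter(ssl.PROTOCOL_TLSv1))
```

After mounting, every https:// request made through this session negotiates with the pinned protocol version.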

@ssbarnea
Contributor
ssbarnea commented Feb 7, 2014

It's a real problem, and I don't have confirmation that the workaround works. What is even more awkward is that my automation script, which used to query a server a few hundred times every hour, suddenly started to fail without my changing anything on the machine. I guess they may have changed the configuration of the web server.

Still, the problem occurs on the latest Ubuntu distro with all patches applied, and the latest OpenSSL release is a year old. We need to implement a workaround for this.

Also, I tried the workaround suggested on the OpenSSL forums, but it doesn't work; I will try your approach and see.

@Lukasa
Collaborator
Lukasa commented Feb 7, 2014

I agree that it's a real problem. I do not necessarily agree that Requests needs a workaround for every bug in any of our dependencies.

What is not clear to me at this time is how severe this bug is, how widespread it is, and how easy it is to work around. It's also not easy for me to find those things out today: I'm at a conference and won't have much time near a laptop. It would be useful if @sigmavirus24 could take a look; otherwise I'll have to dig into it tomorrow.

The key thing to know is that OpenSSL has had other bugs in the past that we haven't worked around in the core Requests code. Most notably, 0.9.8f (I think) that ships on OS X 10.8 has a similar bug in it that we have never worked around.

We are not required to fix every bug in all of our dependencies, especially implicit ones. We may fix this one, but I don't know yet.

Note that another workaround is to use a different version of OpenSSL. Just saying.

@ssbarnea
Contributor
ssbarnea commented Feb 7, 2014

Your workaround didn't work this time; I tried it and got the same error. And regarding using a newer OpenSSL: that's not easy at all, since OpenSSL is tightly tied into the system, and I imagine installing another one would only bring more problems.

If we had a workaround that actually worked, that would be fine; I don't mind applying a patch to this script, but it doesn't work. I tried with ssl_version=ssl.PROTOCOL_TLSv1

... I am trying now to build a minimal test-case.

@ssbarnea
Contributor
ssbarnea commented Feb 7, 2014

Oops, it seems that was a red herring; it's not related to this at all. Instead it seems to be related to multiprocessing. I was using multiprocessing to perform requests on 10 threads, to speed up the process. This worked well for weeks but suddenly stopped working a couple of days ago... with this error.

@Lukasa
Collaborator
Lukasa commented Feb 8, 2014

What does your multiprocessing code look like? If you're sharing Session objects across processes, bad stuff will happen. =)

@ssbarnea
Contributor
ssbarnea commented Feb 8, 2014

I was not sharing any session. In fact, I was calling a function on each thread, passing a parameter; inside this function my code created a session, performed 2-3 requests, and returned. I know, strange.

@Lukasa
Collaborator
Lukasa commented Feb 8, 2014

That's very odd. Is it easily reproducible?

@sigmavirus24
Collaborator

@ssbarnea can you at least share the URL so that we can attempt to reproduce it? So far I think we have almost sufficient information; we just need the URL or the characteristics of the server to reproduce it:

  • Multiprocessing using 10 threads
  • Function that instantiates a Session and makes more than one request (ostensibly to the same server) with it.
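A minimal skeleton matching that description might look like the following (the URL, pool size, and request count are placeholders, not the reporter's actual setup):

```python
import multiprocessing

import requests

URL = 'https://httpbin.org/ip'  # placeholder for the server being queried


def fetch(url):
    # Each worker creates its own Session, makes a few requests, and returns;
    # no Session object is shared between workers.
    with requests.Session() as s:
        return [s.get(url).status_code for _ in range(3)]


def main():
    # Ten workers, per the description above. Requires network access.
    with multiprocessing.Pool(10) as pool:
        return pool.map(fetch, [URL] * 10)
```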
@crizCraig

Started getting this as well running https://github.com/edx/dyno-slayer

Problem started when changing:

DEFAULT_MIN_TIMING_WINDOW = 60
DEFAULT_MAX_TIMING_WINDOW = 120

to:

DEFAULT_MIN_TIMING_WINDOW = 20
DEFAULT_MAX_TIMING_WINDOW = 30
@sigmavirus24
Collaborator

@crizCraig I'm not familiar with that project. What is the significance of those values? Are they:

  • Configuration file values?
  • Environment values?
  • Other?
@crizCraig

Sorry, those just define a length of time to sample heroku logs via their API. The entire program is this file:

https://github.com/edx/dyno-slayer/blob/master/scripts/slayer.py

@Lukasa
Collaborator
Lukasa commented Mar 6, 2014

We need substantially better diagnostics than that, I'm afraid. =) Can you provide us a traceback, for instance?

@sigmavirus24
Collaborator

I wonder if someone who works @heroku could give us a hand. They might have some insight. @catsby can you give us some info about the SSL configuration on a default Heroku box? Can you point us at someone who can?

@Lukasa
Collaborator
Lukasa commented Mar 23, 2014

@crizCraig Ping. =)

@zvodd
zvodd commented Dec 16, 2014

I'm experiencing this issue in Windows 7 with Python 2.7.9 x32

SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:581)
@Lukasa
Collaborator
Lukasa commented Dec 16, 2014

@zvodd Have you tried forcing SSL negotiation at different versions, as per this article?

@Lukasa
Collaborator
Lukasa commented Jan 19, 2015

Closed for inactivity.

@Lukasa Lukasa closed this Jan 19, 2015
@maxcountryman
Contributor

Ran into this issue using multiprocessing. Have not been able to fix it.

@sigmavirus24
Collaborator

@maxcountryman can you provide any of the details we've been asking for?

@maxcountryman
Contributor

@sigmavirus24 well, I'm not using Heroku. What other details did you want?

@t-8ch
Contributor
t-8ch commented Feb 12, 2015

@maxcountryman

  • Your versions of Python and OpenSSL
  • Your operating system and version
  • A minimal breaking code snippet
  • If possible a public URL which triggers the bug (together with the mentioned code snippet)
@ulandj
ulandj commented Mar 17, 2015

Hi @Lukasa @sigmavirus24 @t-8ch ,
Is your article (https://lukasa.co.uk/2013/01/Choosing_SSL_Version_In_Requests/) compatible with Python 3? And is it right that this issue will be solved by updating Python to version 2.7.9?

@Lukasa
Collaborator
Lukasa commented Mar 17, 2015

@ulandj The article should be Python 3 compatible. Upgrading to Python 2.7.9 will solve a lot of problems.

@ulandj
ulandj commented Mar 17, 2015

@Lukasa i.e. after upgrading to Python 2.7.9 I don't need to use the adapter from your article, right? And

import requests
conn = requests.Session()
conn.put(url, data=body, headers=headers)
conn.delete(url, data=body, headers=headers)

will work with any number of processes?

@Lukasa
Collaborator
Lukasa commented Mar 17, 2015

If you're encountering this problem with multiprocessing, I've never been given a repro scenario for it, so I don't actually know what's happening. If you can demonstrate the problem with a bit of sample code I'd like to see it.

@ulandj
ulandj commented Mar 17, 2015

Well, we'll first upgrade to Python 2.7.9 and then try running with multiprocessing. If the error appears again, I will let you know. Thanks.

@ulandj
ulandj commented Mar 18, 2015

@Lukasa our customer says that they have the following error:

Traceback (most recent call last):
  File "/opt/SketchSync/WorkersApp.py", line 230, in run
    self.producer.release_job(self.worker_id, payload, done)
  File "/opt/SketchSync/SketchSyncProducer.py", line 88, in release_job
    self.queue.delete_job(payload)
  File "/opt/SketchSync/SketchSyncProducer.py", line 177, in delete_job
    self.queue.delete(mid)
  File "/usr/local/lib/python2.7/site-packages/iron_mq.py", line 58, in delete
    result = self.client.delete(url)
  File "/usr/local/lib/python2.7/site-packages/iron_core.py", line 233, in delete
    retry=retry, body=body)
  File "/usr/local/lib/python2.7/site-packages/iron_core.py", line 152, in request
    r = self._doRequest(url, method, body, headers)
  File "/usr/local/lib/python2.7/site-packages/iron_core.py", line 117, in _doRequest
    r = self.conn.delete(url, data=body, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 527, in delete
    return self.request('DELETE', url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 382, in send
    raise SSLError(e, request=request)
SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1750)

Does that tell you anything?

Using:
python (2.7.9)
requests (2.3.0)

@sigmavirus24
Collaborator

@ulandj sounds like you're running into http://stackoverflow.com/a/3724938/1953283

@ulandj
ulandj commented Mar 18, 2015

@sigmavirus24
do you think that if I move this line of code (from the class constructor) - https://github.com/iron-io/iron_core_python/blob/master/iron_core.py#L162 - to here - https://github.com/iron-io/iron_core_python/blob/master/iron_core.py#L188 - it will work without errors?

@maxcountryman
Contributor

@ulandj fwiw the code that breaks in my application involves that same IronIO wrapper.

@maxcountryman maxcountryman added a commit to maxcountryman/iron_core_python that referenced this issue Mar 18, 2015
@maxcountryman maxcountryman move conn to request method
This moves the connection session into the request method in order to
facilitate better support for multiprocessing. In particular, a Requests
Session object used across processes will fail to properly encrypt data
via SSL[0][1]. By encapsulating the session creation in the request
method itself we work around this issue at the limited expense of
creating the session on demand.

A version bump, from 1.1.7 to 1.1.8, is included as well.

Fixes errors that look like: [Errno 1] _ssl.c:1429

[0] http://stackoverflow.com/a/3724938/1953283
[1] kennethreitz/requests#1906
6543505
@maxcountryman
Contributor

@ulandj I've made the changes you suggested and so far things seem to be working fine on my local development machine. I should be able to roll this out to staging soon where the errors are more prevalent.

@maxcountryman
Contributor

@ulandj rolled the above changes out to staging and it's been smooth sailing ever since. I doubt they'll be merged upstream since you totally lose the benefit of the session but at any rate it does seem to fix it.

@ulandj
ulandj commented Mar 19, 2015

@maxcountryman thanks,
We suggested our customer install iron_core from the ulan-multiprocessing-sslerror branch. We will keep an eye on it. Hopefully moving the requests Session creation will resolve the SSL error under multiprocessing.

@stickystyle stickystyle pushed a commit to stickystyle/python-couchdblogger that referenced this issue Sep 14, 2015
Ryan Parrish Sessions cannot be shared when doing SSL kennethreitz/requests#1906 a0a99fd
@trehn trehn added a commit to trehn/bundlewrap-teamvault that referenced this issue Mar 19, 2016
@trehn trehn fix session reuse across processes 6fd17db
@dwfreed
dwfreed commented Mar 25, 2016

@Lukasa simple repro description:

  1. Make session in master process
  2. Make some requests to an SSL-enabled site
  3. Start 2 multiprocessing processes
  4. In the first, make some more requests to the same SSL-enabled site (so connection pooling kicks in)
  5. In the second, wait a little bit, then make a request to the same SSL-enabled site

I can't post my script that's hitting this, but I'll see if I can make a simple script that reproduces this with httpbin.

Why this happens: the two child processes initially share the SSL state and use the same socket, but once one of them makes a request the state becomes desynchronized, so when the other tries to use it, SSL decryption fails.

Some possible fixes:

  • Throw out SSL connections from the connection pool upon unpickling the session and close their associated socket objects
    • This is fine because closing the socket object only closes that process's file descriptor; the underlying connection will not be closed until all file descriptors pointing to it are closed
    • Do it on unpickling instead of preventing the connections from making it into the pickle so that you can clean up the file descriptors in the child processes
  • Throw out the whole connection pool on unpickling
    • Probably the best idea, because things are probably going to break
  • Warn/explode on unpickling a session that has SSL/any connections in its connection pool
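The first idea can be sketched as a user-level helper rather than a requests change (after_fork is a hypothetical name, not an actual requests API):

```python
import requests


def after_fork(session):
    """Hypothetical helper: call in a freshly forked child before reusing
    an inherited Session.

    HTTPAdapter.close() empties the urllib3 connection pool and closes the
    pooled sockets; since file descriptors are per-process after fork, only
    this process's descriptors are closed and the parent's copies survive.
    Subsequent requests through the session open fresh connections.
    """
    for adapter in session.adapters.values():
        adapter.close()
    return session
```

A forked worker would call after_fork(session) once at startup, before making any requests, instead of reusing the parent's desynchronized connections.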
@dwfreed
dwfreed commented Mar 25, 2016

Reproduction script:

#!/usr/bin/env python3.4

import multiprocessing
import requests
import signal
import sys

session = None
stop = None

def sub_1():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while not stop.wait(2):
        session.get('https://httpbin.org/ip')

def sub_2():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while not stop.wait(20):
        session.get('https://httpbin.org/ip')

if __name__ == '__main__':
    session = requests.Session()
    session.get('https://httpbin.org/ip')
    stop = multiprocessing.Event()
    sub_1_process = multiprocessing.Process(target=sub_1)
    sub_1_process.start()
    sub_2_process = multiprocessing.Process(target=sub_2)
    sub_2_process.start()
    while True:
        try:
            sub_1_process.join()
            sub_2_process.join()
            sys.exit(0)
        except (KeyboardInterrupt, SystemExit):
            if not stop.is_set():
                stop.set()
            else:
                raise

Resulting traceback after roughly 20 seconds:

Process Process-2:
Traceback (most recent call last):
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 378, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/http/client.py", line 1227, in getresponse
    response.begin()
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/http/client.py", line 386, in begin
    version, status, reason = self._read_status()
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/http/client.py", line 348, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/socket.py", line 378, in readinto
    return self._sock.recv_into(b)
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/ssl.py", line 748, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/ssl.py", line 620, in read
    v = self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 588, in urlopen
    raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 19, in sub_2
    session.get('https://httpbin.org/ip')
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/sessions.py", line 480, in get
    return self.request('GET', url, **kwargs)
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/Users/dwfreed/.pyenv/versions/3.4.4/lib/python3.4/site-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1748)
@dwfreed
dwfreed commented Mar 25, 2016

I did the reproduction script in 3.4, but it's probably the same issue in every version supported by requests and multiprocessing. This script is basically what my code does, except pointed at httpbin instead of what I'm actually hitting.

@Lukasa
Collaborator
Lukasa commented Mar 25, 2016

I'm a bit confused by your diagnosis here: I don't think it's quite right.

The reason is that the connection pool is not carried across when pickling/unpickling. We explicitly call init_poolmanager when unpickling to give us a new connection pool. However, thanks for providing the repro script: I'll dive in and see if I can find what's going on.

@dwfreed
dwfreed commented Mar 25, 2016

@Lukasa I'm MITMing the connection using a transparent proxy. Only 1 connection is ever made.

@dwfreed
dwfreed commented Mar 25, 2016

Aha, the session isn't getting pickled at all, because the default start method for multiprocessing on Unix systems is fork, which relies on copy-on-write semantics instead. If you switch to the spawn start method (and fix the script to pass the session as a parameter), the session is pickled, which clears the connection pool.
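The pickling behaviour can be seen without multiprocessing at all. A sketch, assuming Session.__getstate__/__setstate__ behaves as described (configuration is carried over, while adapters rebuild empty connection pools on unpickle):

```python
import pickle

import requests

# Pickling a Session, which the 'spawn' start method does when handing one
# to a child process, goes through Session.__getstate__: configuration such
# as headers survives, while the adapters rebuild empty connection pools.
original = requests.Session()
original.headers['X-Example'] = 'demo'  # hypothetical header, for illustration

clone = pickle.loads(pickle.dumps(original))
print(clone.headers['X-Example'])  # prints: demo
```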

@Lukasa
Collaborator
Lukasa commented Mar 25, 2016

Yup, my testing just discovered this here too. =) So this issue is just a more fundamental problem about sockets.

When you fork, either the parent or the child must close its file descriptors, because using them from both processes at once is really bad. This basically means that either you shouldn't share a session across processes at all, or you should ensure that when you pass a session off to another process you drop it in the original process that held it.

I don't believe there's much we can do about this.

@dwfreed
dwfreed commented Mar 25, 2016

Or manually clean up the connection pool in child processes. To fix this in requests, you would need an after-fork hook, which I don't believe Python exposes (some stdlib modules have after-fork code that gets called from C after the fork, but they're special). It might be nice to have an after_fork() function on Session objects, documented as something to call after any os.fork(), including when using multiprocessing with the fork start method.

@Lukasa
Collaborator
Lukasa commented Mar 25, 2016

I explicitly did not mention that approach because I think it's impossible for us to do it sensibly.

For example, one thing you might want to do is configure a Session on the main process and then pass it to the child: in that instance, you want the child instance to keep hold of its connections and the parent to lose them.

I think this is something that a user needs to decide to do: if you're sharing sessions across processes at all (which I'd recommend not doing if at all possible), you need to make a call about what to do when you fork.

@dwfreed
dwfreed commented Mar 25, 2016

Well, you could mention in the documentation the cases where you'd want to call the after_fork(), and when you wouldn't want to. In my case, all I care about is setting some specific headers, but I need some data from the API before I can start the children (the timeouts for the stop.wait() calls in the children are actually derived from the API, for example), and it seems wasteful for me to have to set up a session 3 times with the exact same setup code.

I suppose one option in my case would be shipping PreparedRequests to the main process to send, then shipping the Responses back via queues. Or I could just go back to threading, since my actual processing workload is quite light; that seems easier in my case (but may not be the best solution for others).

@Lukasa
Collaborator
Lukasa commented Mar 25, 2016

@dwfreed Session setup costs are very low: sessions don't have to do a whole lot to get set up. In particular, given that you have to wait for the API to return before setting up the children I think you can probably just wrap the setup as part of the function you invoke in the background process and it becomes essentially unnoticeable.

This gets at why I'm nervous about writing documentation for this: I don't think we can provide much more than general advice, which would boil down to: don't share Sessions across processes; consider carefully whether you want to share Sessions across threads; and when writing concurrent code, try to adopt good concurrent design patterns.

Anything more than that is just too restrictive as far as advice goes.

@phobologic phobologic referenced this issue in remind101/stacker May 9, 2016
Closed

Running into error when tailing #151

@jbbarth
jbbarth commented Jul 1, 2016

I know this is an old issue and won't be solved in the requests project, but when you're stuck googling this error with Python code you always come back to this very issue.

In my case it was happening with boto (which uses requests under the hood). It turns out that boto tries to do smart connection pooling, which breaks badly when forking. That pooling is disabled when you run on Google App Engine, and boto detects App Engine by the presence of the following three environment variables. Setting them makes the bug vanish entirely: USER_IS_ADMIN=fake CURRENT_VERSION_ID=fake APPLICATION_ID=fake.
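In code, that workaround amounts to the following (variable names and values verbatim from the comment above; whether current boto versions still check them may vary):

```python
import os

# boto detects Google App Engine via these three variables and disables
# its cross-fork connection pooling when they are present.
os.environ.update({
    'USER_IS_ADMIN': 'fake',
    'CURRENT_VERSION_ID': 'fake',
    'APPLICATION_ID': 'fake',
})
```

Set them before boto is first imported so the detection sees them.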

Again, excuse the spam here, I hope it will be useful to somebody :-) And keep up the great work with requests!!!

@fuzzyami

@jbbarth you're my hero. This error just started happening to me with boto when I added multiprocessing, but your workaround isn't working for me (yet). Which requests version are you using?

@jbbarth
jbbarth commented Nov 20, 2016

@fuzzyami I think it was 2.10, along with boto 2.38 or 2.39. The heuristics in boto may have changed, I recommend you look into their connection pooling mechanism or grep the environment variables I was referring to. Good luck! :)
