Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.3.1] urllib3>=2 causes vcrpy to save/read gzip responses uncompressed #719

Closed
Terseus opened this issue Jun 18, 2023 · 8 comments · Fixed by #723
Closed

[4.3.1] urllib3>=2 causes vcrpy to save/read gzip responses uncompressed #719

Terseus opened this issue Jun 18, 2023 · 8 comments · Fixed by #723

Comments

@Terseus
Copy link
Contributor

Terseus commented Jun 18, 2023

1. Observed behavior

When using urllib3==2.0.3, the responses saved in the cassettes are saved uncompressed.

Also, when reading a cassette, the responses will be interpreted as decompressed, even though the "Content-Encoding" is "gzip".

The unfortunate consequence is that a test suite with cassettes recorded using urllib3 1.x doesn't work when upgrading urllib3 to 2.x, which defeates the purpose of the test suite.

2. How to reproduce

  • In a virtualenv with vcrpy==4.3.1 and urllib3==2.0.3, run the following script:
import vcr
import requests


with vcr.use_cassette("urllib3-v2.yaml"):
    requests.get("https://httpbin.org/gzip")
  • Check that the response body is in string format, not in gzip compressed binary data:
  response:
    body:
      string: "{\n  \"gzipped\": true, \n  \"headers\": {\n    \"Accept\": \"*/*\",
        \n    \"Accept-Encoding\": \"gzip, deflate\", \n    \"Host\": \"httpbin.org\",
        \n    \"User-Agent\": \"python-requests/2.31.0\", \n    \"X-Amzn-Trace-Id\":
        \"Root=1-648f439e-7634146d3ce1ea581683263d\"\n  }, \n  \"method\": \"GET\",
        \n  \"origin\": \"139.47.88.20\"\n}\n"
  • Install urllib3<2, rename the cassette file and run the script again.
  • Check that now the response body in the new cassette is in gzip compressed binary data:
  response:
    body:
      string: !!binary |
        H4sIAKxDj2QC/z2PPW/DIBCGd/8KxBgFgm1a40odPERt1yiRshJ8wUgNUCBDE+W/BxzJ4z3vx93d
        K4SwvhnvYcQfKIUrrFFhE8gRQszsnscMBqXApzzj1WaFZ9NCydYqNxqri1za1miE869MsBi/XZzD
        U0r+ZCx1QS/aIUIggwY7O/x/mpwlAf6uEFPcNLStKVvMRzJcbpbsg1RAfsrReOdc+qzJOxdn3kpF
        eNv1rOmAvTEhBesafpKyFwrnhsfrvQvkJXP4a7t/dWMXjDa2sLrtKe+oELRhuHpUT6SXOIEmAQAA

3. Expected behavior

The cassettes should be written and readed the same with urllib3 2.x and urllib3 1.x.

@hartwork
Copy link
Collaborator

Hi @Terseus, thanks for the report!

I confirm the change in cassette storage when playing with your example locally. I would like to add that gzip response do work for new cassettes on the outside, but it does makes existing cassettes break on upgrade as you reported, true.

@hartwork
Copy link
Collaborator

Hi @Terseus,

I looked into the issue more and it turns out that there are two interwoven issues here: one is that the behavior changed on cassette YAML level and another is that non-default VCR.py mode decode_compressed_response=True will crash trying to to decompress bytes as gzip compressed that have already been uncompressed by urllib3 earlier with urllib3 v2 and gzipped responses.

It will be tricky and potentially not the best move to force urllib3 into the exact old behavior. What I consider feasible and not the worst approach is to:

  • accept that VCR.py persists what it received from urllib3 (and that decode_compressed_response=False just disables additional decompression by VCR.py),
  • accepting that urllib3 was at liberty to change content auto-decompression from v1 to v2 since it's a major version bump with regard to semver,
  • make VCR.py robust towards trying to decompress already decompressed input, by urllib3 or else,to repair decode_compressed_response=True for urllib3 v2,
  • treat the the cassette-incompatibility problem as a one-time task and sole it with automation: a small script to open the YAML file, decompress the response body, put it back in, re-write the YAML file.

I have created pull request #723 now to demo making decompression more robust in case it is considered agreeable.

What do you think?

CC @kevin1024 @jairhenrique

@hartwork hartwork added the bug label Jun 21, 2023
hartwork added a commit that referenced this issue Jun 22, 2023
Make decompression robust towards already decompressed input (arguably fixes #719)
Harmon758 added a commit to Harmon758/tweepy that referenced this issue Jun 23, 2023
This is necessary for tests to pass, until there is a new vcrpy release, with kevin1024/vcrpy#723, to fix kevin1024/vcrpy#719.
@Terseus
Copy link
Contributor Author

Terseus commented Jun 24, 2023

Hi @hartwork,

Looks like I'm late to the party as the PR has already been merged.

If I'm understanding you correctly the bug is still present, since previously working cassettes will break now and need manual intervention to fix them, decompressing the responses and rebuilding the cassettes.

The big problem I see with this approach is that it's impossible to have a test suite that can be tested with both urllib3 v1 and v2, blocking a project to support both use cases, something I think it's important to many libraries right now.

If I'm correct, please consider reopening this issue and lets look for a solution that maintains backward compatibility.

In any case I would like to take a look at this after some days (I'm starting my vacations right now ^_^), can you point me at where may I start looking at any other extra discovery you did about this?

It will be tricky and potentially not the best move to force urllib3 into the exact old behavior.

I'm with you here, let urllib3 do its thing, we must adapt to it.

@hartwork
Copy link
Collaborator

hartwork commented Jun 24, 2023

Hello @Terseus,

Looks like I'm late to the party as the PR has already been merged.

no worries, that pull requests didn't close any doors on this topic technically.

If I'm understanding you correctly the bug is still present, since previously working cassettes will break now and need manual intervention to fix them, decompressing the responses and rebuilding the cassettes.

I think that's arguable and a question of view but I sure understand your view.

The big problem I see with this approach is that it's impossible to have a test suite that can be tested with both urllib3 v1 and v2, blocking a project to support both use cases, something I think it's important to many libraries right now.

That's a good point, at least for library-type Open Source projects.
"impossible" is a strong word and there are probably even multiple ways to making that work, e.g. one way would be to have two cassettes per test — one cassette for urllib3 v1 and one for v2 — and to use the major version of urllib3 in the filename of the cassette. Would be trivial and work already today.

If I'm correct, please consider reopening this issue and lets look for a solution that maintains backward compatibility.

I'm not sure if re-opening would be a good reflection on reality. I have spent multiple hours on playing with this problem and my own time budget for this is used up for now and I am so far not aware of anyone else with plans to work or help out with pull requests on this. If @kevin1024 or @jairhenrique want to re-open the ticket, alright, but before someone wants to realistically work on this more, I'm not in favor of re-opening personally. If you find ideas and time for a clean and tested pull request, that's cool, but it doesn't depend on the state of this ticket.

In any case I would like to take a look at this after some days (I'm starting my vacations right now ^_^), can you point me at where may I start looking at any other extra discovery you did about this?

I believe the relevant starting bits would be:

  • cassette.responses[0]["body"]["string"]
  • the tests touched in Make decompression robust towards already decompressed input (arguably fixes #719) #723
  • inspecting the results for the three ways to get responses (as can be seen in some tests):
    1. without any VCR.py involvement
    2. with use_cassette, no cassette around i.e. recording
    3. with use_cassette, matching cassette around i.e. playback
  • connection._response_options = connection._response_options._replace(decode_content=False) may or may not be handy when playing with getresponse and urllib3 v2
  • be sure to use master not 4.3.1 e.g. since merged Fix for #174 to prevent filters from corrupting request #702 may change the picture
  • note that assert_is_json expects bytes not a string, that's too easy to be mis-leading. While I'm saying that, let me fix that with a pull request now, it's too far of as is. Done: assertions.py: Fix mis-leading assert_is_json #725
  • note that the resulting response body should stay the same, just the in-cassette response body would change to bytes
  • note where header content-encoding is cut down after decoding content and where not
  • keep decode_compressed_response=(True|False) in mind

It will be tricky and potentially not the best move to force urllib3 into the exact old behavior.

I'm with you here, let urllib3 do its thing, we must adapt to it.

"must" is a strong word. I think it's arguable whether VCR.py must.

@Terseus
Copy link
Contributor Author

Terseus commented Jun 24, 2023

Hi @hartwork,

Thanks a lot for all the information, I'll take a look at it whenever I find time for it, if it happens to be so complex I'll just modify the cassettes in my current project.

I'm not sure if re-opening would be a good reflection on reality. I have spent multiple hours on playing with this problem and my own time budget for this is used up for now and I am so far not aware of anyone else with plans to work or help out with pull requests on this.

I completely understant your point of view, there's no way I'm going to ask you to spend more time in this, our (free) time is extremely limited; seriously, thanks a lot for your work, I'll try to help fix my own problems here :)

Cheers, and please excuse me if I did sound harsh, english is not my first language.

@hartwork
Copy link
Collaborator

@Terseus no worries, didn't read you as harsh, we're good, thanks!

PS: Regarding "must" RFC 2119 could be of interest.

@salomvary
Copy link

salomvary commented Aug 18, 2023

I'm also late to the party and only noticed now that existing cassettes are broken in recent versions. Would it make sense to add this as a breaking change notice to the release notes of whichever version introduced this change? Would save folks even more late from the party from a bit of head-scratching... ;)

Mr0grog added a commit to edgi-govdata-archiving/wayback that referenced this issue Sep 25, 2023
Unfortunately, VCR currently records cassettes differently with urllib3 v1 vs v2 (in v2, it records decompressed bodies rather than raw bytes). To handle this, we need to have separate cassettes for the different versions of urllib3. More details in kevin1024/vcrpy#719.
Mr0grog added a commit to edgi-govdata-archiving/wayback that referenced this issue Sep 26, 2023
Unfortunately, VCR currently records cassettes differently with urllib3 v1 vs v2 (in v2, it records decompressed bodies rather than raw bytes). To handle this, we need to have separate cassettes for the different versions of urllib3. More details in kevin1024/vcrpy#719.
Mr0grog added a commit to edgi-govdata-archiving/wayback that referenced this issue Sep 26, 2023
Back in #118 we “fixed” things for urllib3 v2 by marking this package as only compatible with v1, so users wouldn't wind up with bad dependency combinations. This adds real, proper support for urllib3 v2. Fixes #116.

For the most part, urllib3 v2 just works, but there were two significant changes I had to make here:

1. We had a funky hack to deal with Wayback’s broken gzip handling that used the `from_httplib()` static method. That method no longer exists, and the new equivalent combines a bunch of other behavior that isn't really reasonable to override. Instead, we patch the `HTTPHeadersDict` class in v2.

2. VCR.py writes different, incompatible cassette files for v1 and v2 of urllib3 (see kevin1024/vcrpy#719). To address this in a way that doesn't make future contributions too hard, I added a custom serializer to make VCR behave compatibly between the two versions of urllib3, so you can record or read the same cassette files regardless of which version you are working with. It is no longer perfectly accurate to the response you received if using urllib3 v2, but it's generally good enough for our needs.
@ericbuehl
Copy link

  • treat the the cassette-incompatibility problem as a one-time task and sole it with automation: a small script to open the YAML file, decompress the response body, put it back in, re-write the YAML file.

If it is helpful for others, here is a script I used to decompress past responses:

#!/usr/bin/env python3

import sys
import yaml
import zlib
try:
    # use the same dumper config as vcr to minimize entropy in the diffs
    from yaml import CDumper as Dumper
except ImportError:
    from yaml import Dumper

with open(sys.argv[1], 'r') as file:
    data = yaml.safe_load(file)

for interaction in data['interactions']:
    response = interaction.get('response', {})
    headers = response.get('headers', {})
    contentType = headers.get('content-encoding') or headers.get('Content-Encoding')
    compressed_string = response['body']['string']
    if contentType and contentType[0] == 'gzip':
        response['body']['string'] = zlib.decompress(compressed_string, zlib.MAX_WBITS | 16).decode('utf-8')

with open(sys.argv[1], 'w') as file:
    yaml.dump(data, file, Dumper=Dumper)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants