Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError after following a chain of redirects #6026

Open
wodim opened this issue Jan 1, 2022 · 4 comments
Open

UnicodeDecodeError after following a chain of redirects #6026

wodim opened this issue Jan 1, 2022 · 4 comments

Comments

@wodim
Copy link

wodim commented Jan 1, 2022

#6006

Something confuses requests (or urllib3?) along the way

Actual Result
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 92: invalid continuation byte

Reproduction Steps
import requests
requests.get("https://www.lavozdegalicia.es/noticia/deportes/2021/12/13/psg-juve-united-nuevos-rivales-espa%C3%B1oles-champions/00031639396272418389372.htm")

System Information
$ python -m requests.help
{
"chardet": {
"version": "4.0.0"
},
"charset_normalizer": {
"version": "2.0.9"
},
"cryptography": {
"version": "36.0.0"
},
"idna": {
"version": "3.3"
},
"implementation": {
"name": "CPython",
"version": "3.8.10"
},
"platform": {
"release": "4.4.0-17763-Microsoft",
"system": "Linux"
},
"pyOpenSSL": {
"openssl_version": "101010cf",
"version": "21.0.0"
},
"requests": {
"version": "2.26.0"
},
"system_ssl": {
"version": "1010106f"
},
"urllib3": {
"version": "1.26.7"
},
"using_charset_normalizer": false,
"using_pyopenssl": true
}

@sigmavirus24 you have been too harsh on this one.

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 677, in send
history = [resp for resp in gen]
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 677, in
history = [resp for resp in gen]
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 150, in resolve_redirects
url = self.get_redirect_target(resp)
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 116, in get_redirect_target
return to_native_string(location, 'utf8')
File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests_internal_utils.py", line 25, in to_native_string
out = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 92: invalid continuation byte

The guilty part there is return to_native_string(location, 'utf8') Trying to decode the URL to native utf8 when it should rather "URL encode" it. I am not an HTTP expert, but this exception should be handled more gracefully anyway.

The location given by the remote for redirection is as follow:

b'http://www.lavozdegalicia.es/noticia/deportes/2021/12/13/psg-juve-united-nuevos-rivales-espa\xf1oles-champions/00031639396272418389372.htm'

Should the \xf1 be decoded as follow %F1 ?

I can see that this behavior is already followed by Chrome.

@nateprewitt
Copy link
Member

Hi @wodim,

To actually address your issue, the URI being returned by this redirect is invalid. All byte sequences that aren't listed as unreserved or sub-delim MUST be percent encoded in the path (ref). The website is doing this correctly for other paths, so it appears this is a defect in this specific resource.

For Requests, we may be able to be more tolerant of this behavior. Any solution we implement is purely guessing and trying to support broken behavior though, which we typically avoid. We could potentially try to percent-encode this when the UnicodeDecodeError is raised. Given this is the first report of the issue in at least the last 5 years, I'm not sure it's a common enough defect to special case.

@rhettlunn
Copy link

I'm encountering this error as well for https://www.liveinternet.ru/tags/%EF%F0%E5%E7%E8%E4%E5%ED%F2%FB%2B%D1%D8%C0/ (and various other pages on that site)

I think a good fix would be to catch the malformed redirect URL (or any similar invalid header) and raise something that inherits from RequestException, rather than raising a UnicodeDecodeError.

@vkruoso
Copy link

vkruoso commented Mar 3, 2023

This is happening to me as well. One source site is now returning those malformed URLs during redirects, and there is now way to bypass it as far as I understand it.

Ousret added a commit to jawah/niquests that referenced this issue Sep 24, 2023
**Added**
- Static type annotations thorough the whole package.
- `cert` argument for client authentication with certificate can now pass the password/passphrase using a 3-values tuple (cert, key, password).
  The three parameters in the tuple must be of type `str`.
- `verify` argument behavior has been extended and now accept your CA bundle as `str` instead of a path. It also accepts your CA bundle as `bytes` directly.
  This help when you do not have access to the fs.
- Operating system truststore will be used instead of `certifi`. Root CAs are automatically grabbed from your computer configuration.
- Oriented-object headers. Access them through the new property `oheaders` in your `Response`.
- Propagated the argument `retries` in `niquests.api` for all functions.
- Added argument `retries` in the `Session` constructor.
- Property `conn_info` to the `PreparedRequest` and `Response` that hold a reference to a `ConnectionInfo`.
  This class exposes the following properties: `certificate_der` _(bytes)_, `certificate_dict` _(dict)_ as provided by the standard
  library (ssl), `destination_address` _(tuple[ipAddress, portNumber])_, `cipher` _(str)_, `tls_version` _(TLSVersion)_, and `http_version`.
- Two hooks, namely `pre_send` and `pre_request`. The `pre_request` event is fired just after the initial construction of
  a `PreparedRequest` instance. Finally, the `pre_send` will be triggered just after picking a (live) connection
  for your request. The two events receive a `PreparedRequest` instance.

**Changed**
- Calling the method `json` from `Response` when no encoding was provided no longer relies on internal encoding inference.
  We fall back on `charset-normalizer` with a limited set of charsets allowed (UTF-8/16/32 or ASCII).
- No longer will the `text` method from `Response` return str if content cannot be decoded. It returns None instead.
- If specified charset in content-type does not exist (LookupError) the `text` method from `Response` will rely on charset detection.
- If specified charset in content-type is not made for text decoding (e.g. base64), the `text` method from `Response` returns None.
- With above four changes, the `json` method will raise `RequestsJSONDecodeError` when the payload (body) cannot be decoded.
- Passing invalid `files` description no longer _just skip_ invalid entries, it raises `ValueError` from now on.
- Non-str HTTP-Verb are refused.
- Passing `files` with minimal description (meaning no tuple but _just_ the fp) no longer guess its name when `fp.name` return bytes.
- No longer will the default timeout be unset, thus making you waiting indefinitely.
  Functions `get`, `head`, and `options` ships with a default of 30 seconds.
  Then `put`, `post`, `patch` and `delete` uses a default of 120 seconds.
  Finally, the `request` function also have 120 seconds.
- Basic authorization username and password are now encoded using utf-8 instead of latin-1 prior to being base64 encoded.


**Removed**
- Property `apparent_encoding` in favor of a discrete internal inference.
- Support for the legacy `chardet` detector in case it was present in environment.
  Extra `chardet_on_py3` is now unavailable.
- **requests.compat** no longer hold reference to _chardet_.
- Deprecated `requests.packages` that was meant to avoid breakage from people importing `urllib3` or `chardet` within this package.
  They were _vendored_ in early versions of Requests. A long time ago.
- Deprecated function `get_encodings_from_content` from utils.
- Deprecated function `get_unicode_from_response` from utils.
- BasicAuth middleware no-longer support anything else than `bytes` or `str` for username and password.
- `requests.compat` is stripped of every reference that no longer vary between supported interpreter version.
- Charset fall back **ISO-8859-1** when content-type is text and no charset was specified.
- Main function `get`, `post`, `put`, `patch`, `delete`, and `head` no longer accept **kwargs**. They have a fixed list of typed argument.
  It is no longer possible to specify non-supported additional keyword argument from a `Session` instance or directly through `requests.api` functions.
  e.g. function `delete` no-longer accept `json`, or `files` arguments. as per RFCs specifications. You can still override this behavior through the `request` function.
- Mixin classes `RequestEncodingMixin`, and `RequestHooksMixin` due to OOP violations. Now deported directly into child classes.
- Function `unicode_is_ascii` as it is part of the stable `str` stdlib on Python 3 or greater.
- Alias function `session` for `Session` context manager that was kept for BC reasons since the v1.
- pyOpenSSL/urllib3 injection in case built-in ssl module does not have SNI support as it is not the case anymore for every supported interpreters.
- Constant `DEFAULT_CA_BUNDLE_PATH`, and submodule `certs` due to dropping `certifi`.
- Function `extract_zipped_paths` because rendered useless as it was made to handle an edge case where `certifi` is "zipped".
- Extra `security` when installing this package. It was previously emptied in the previous major.
- Warning emitted when passing a file opened in text-mode instead of binary. urllib3.future can overrule
  the content-length if it detects an error. You should not encounter broken request being sent.
- Support for `simplejson` if was present in environment.
- Submodule `compat`.

**Fixed**
- An invalid content-type definition would cause the charset being evaluated to `True`, thus making the program crash.
- Given `proxies` could be mutated when environment proxies were evaluated and injected. This package should not modify your inputs.
  For context see psf#6118
- A server could specify a `Location` header that does not comply to HTTP specifications and could lead to an unexpected exception.
  We try to fall back to Unicode decoding if the typical and expected Latin-1 would fail. If that fails too, a proper exception is raised.
  For context see psf#6026
- Top-level init now specify correctly the exposed api. Fixes mypy error `.. does not explicitly export attribute ..`.
@SamStephens
Copy link

I have another instance of this issue, which unfortunately I cannot share as its private to my employer. Firefox and Chrome both handle the URL I've encountered cleanly as per the behavior @wodim describes.

I see that that [the fork niquests) has a bugfix for this issue](jawah#20; it could be worth borrowing their change. Alternatively a more meaningful exception than UnicodeDecodeError would be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants