Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests #5560

baharev · 2020-08-18T09:12:45Z

This issue is about sending an HTTP request, and not about the response.

    headers = { 'Content-Type': 'application/json; charset=utf-8' }
    response = requests.post(url, headers=headers, data=my_native_python3_utf8_string)

Expected Result

I expected the request to be sent as UTF-8, after all, everything in the Python 3 ecosystem uses UTF-8 by default.

Actual Result

My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:

https://github.com/python/cpython/blob/c3dd7e45cc5d36bbe2295c2840faabb5c75d83e4/Lib/http/client.py#L1312

In my opinion, RFC 2616 Section 3.7.1 does not apply here.
The whole Python 3 ecosystem uses UTF-8 by default, so I find it utterly confusing that a library silently re-encodes a UTF-8 string as ISO-8859-1.

Suggestion

I suggest that if requests takes a native UTF-8 string as data argument, it makes sure that it is handled as a UTF-8 string all the way, and won't be silently re-encoded as ISO-8859-1. It can cause really obscure bugs otherwise.

Reproduction Steps

See above.

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.8"
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.4"
  },
  "platform": {
    "release": "4.15.0-107-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010104f",
    "version": "19.1.0"
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": true
}

The text was updated successfully, but these errors were encountered:

dhdeangelis · 2020-10-29T16:58:03Z

I may have encountered this problem too. I wonder if there is any status change.

python3 -m requests.help
{
"chardet": {
"version": "3.0.4"
},
"cryptography": {
"version": "3.1.1"
},
"idna": {
"version": "2.10"
},
"implementation": {
"name": "CPython",
"version": "3.8.5"
},
"platform": {
"release": "5.9.1-1-default",
"system": "Linux"
},
"pyOpenSSL": {
"openssl_version": "1010108f",
"version": "19.1.0"
},
"requests": {
"version": "2.24.0"
},
"system_ssl": {
"version": "1010108f"
},
"urllib3": {
"version": "1.25.10"
},
"using_pyopenssl": true
}

sigmavirus24 · 2020-10-29T20:13:56Z

Quoting the original issue:

My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:

In other words, not code that Requests controls and not a bug we can fix.

baharev · 2020-10-29T20:45:06Z

In other words, not code that Requests controls and not a bug we can fix.

I disagree. You could fix it if you wanted.

Since you are wrapping that code, you could handle the issue in your wrapper so that at least your clients are not hurt by this issue.

In other words: Requests could offer an API that is consistent with the Python ecosystem, and handle such silly issues internally if needed, before calling http.client.

sethmlarson · 2020-10-29T20:48:25Z

@baharev Requests is under feature freeze with no room for breaking changes so this change is extremely unlikely to land, regardless of its utility. It is a best practice to provide bytes instead of str for the body so that no underlying library needs to guess at what encoding you want.

baharev · 2020-10-29T20:57:37Z

@sethmlarson

It is a best practice to provide bytes instead of str for the body so that no underlying library needs to guess at what encoding you want.

Yes, I have figured that out the hard way.

The problem is that it is incredibly easy to forget an .encode() call (accidentally pass the string unencoded). To add salt to the wound, it will even "work" with ASCII characters and you won't even notice you have a bug (you forgot an .encode()) until you get a more exotic character in the string. Ouch.

In other words, I find the current interface a loaded gun, and you can easily shoot yourself in the foot.

Requests is under feature freeze with no room for breaking changes so this change is extremely unlikely to land, regardless of its utility.

OK, I understand.

baharev · 2020-10-29T21:08:44Z

@sethmlarson +1s are great, I really appreciate it, but what should I do so that this issue is eventually addressed?

If this issue remains closed, it will be forgotten.

A "fix" could be as simple as throwing TypeError if the data argument requires encoding (a string for example). I don't think there is a fix that is not a breaking change.

baharev changed the title ~~Do not re-encode a UTF-8 string as ISO-8859-1 silently~~ Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests Aug 18, 2020

sigmavirus24 closed this as completed Oct 29, 2020

github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests #5560

Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests #5560

baharev commented Aug 18, 2020 •

edited

dhdeangelis commented Oct 29, 2020 •

edited

sigmavirus24 commented Oct 29, 2020

baharev commented Oct 29, 2020

sethmlarson commented Oct 29, 2020

baharev commented Oct 29, 2020

baharev commented Oct 29, 2020 •

edited

Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests #5560

Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests #5560

Comments

baharev commented Aug 18, 2020 • edited

Expected Result

Actual Result

Suggestion

Related

Reproduction Steps

System Information

dhdeangelis commented Oct 29, 2020 • edited

sigmavirus24 commented Oct 29, 2020

baharev commented Oct 29, 2020

sethmlarson commented Oct 29, 2020

baharev commented Oct 29, 2020

baharev commented Oct 29, 2020 • edited

baharev commented Aug 18, 2020 •

edited

dhdeangelis commented Oct 29, 2020 •

edited

baharev commented Oct 29, 2020 •

edited