Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests #5560

Closed
baharev opened this issue Aug 18, 2020 · 6 comments
Closed

Comments

@baharev
Copy link

baharev commented Aug 18, 2020

This issue is about sending an HTTP request, and not about the response.

    headers = { 'Content-Type': 'application/json; charset=utf-8' }
    response = requests.post(url, headers=headers, data=my_native_python3_utf8_string)

Expected Result

I expected the request to be sent as UTF-8, after all, everything in the Python 3 ecosystem uses UTF-8 by default.

Actual Result

My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:

https://github.com/python/cpython/blob/c3dd7e45cc5d36bbe2295c2840faabb5c75d83e4/Lib/http/client.py#L1312

  • In my opinion, RFC 2616 Section 3.7.1 does not apply here.
  • The whole Python 3 ecosystem uses UTF-8 by default, so I find it utterly confusing that a library silently re-encodes a UTF-8 string as ISO-8859-1.

Suggestion

I suggest that if requests takes a native UTF-8 string as data argument, it makes sure that it is handled as a UTF-8 string all the way, and won't be silently re-encoded as ISO-8859-1. It can cause really obscure bugs otherwise.

Related

This issue is about sending a request. Related open issues are mostly concerned with the response:

Reproduction Steps

See above.

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.8"
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.4"
  },
  "platform": {
    "release": "4.15.0-107-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010104f",
    "version": "19.1.0"
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": true
}
@baharev baharev changed the title Do not re-encode a UTF-8 string as ISO-8859-1 silently Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests Aug 18, 2020
@dhdeangelis
Copy link

dhdeangelis commented Oct 29, 2020

I may have encountered this problem too. I wonder if there is any status change.

python3 -m requests.help
{
"chardet": {
"version": "3.0.4"
},
"cryptography": {
"version": "3.1.1"
},
"idna": {
"version": "2.10"
},
"implementation": {
"name": "CPython",
"version": "3.8.5"
},
"platform": {
"release": "5.9.1-1-default",
"system": "Linux"
},
"pyOpenSSL": {
"openssl_version": "1010108f",
"version": "19.1.0"
},
"requests": {
"version": "2.24.0"
},
"system_ssl": {
"version": "1010108f"
},
"urllib3": {
"version": "1.25.10"
},
"using_pyopenssl": true
}

@sigmavirus24
Copy link
Contributor

Quoting the original issue:

My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:

In other words, not code that Requests controls and not a bug we can fix.

@baharev
Copy link
Author

baharev commented Oct 29, 2020

In other words, not code that Requests controls and not a bug we can fix.

I disagree. You could fix it if you wanted.

Since you are wrapping that code, you could handle the issue in your wrapper so that at least your clients are not hurt by this issue.

In other words: Requests could offer an API that is consistent with the Python ecosystem, and handle such silly issues internally if needed, before calling http.client.

@sethmlarson
Copy link
Member

@baharev Requests is under feature freeze with no room for breaking changes so this change is extremely unlikely to land, regardless of its utility. It is a best practice to provide bytes instead of str for the body so that no underlying library needs to guess at what encoding you want.

@baharev
Copy link
Author

baharev commented Oct 29, 2020

@sethmlarson

It is a best practice to provide bytes instead of str for the body so that no underlying library needs to guess at what encoding you want.

Yes, I have figured that out the hard way.

The problem is that it is incredibly easy to forget an .encode() call (accidentally pass the string unencoded). To add salt to the wound, it will even "work" with ASCII characters and you won't even notice you have a bug (you forgot an .encode()) until you get a more exotic character in the string. Ouch.

In other words, I find the current interface a loaded gun, and you can easily shoot yourself in the foot.

Requests is under feature freeze with no room for breaking changes so this change is extremely unlikely to land, regardless of its utility.

OK, I understand.

@baharev
Copy link
Author

baharev commented Oct 29, 2020

@sethmlarson +1s are great, I really appreciate it, but what should I do so that this issue is eventually addressed?

If this issue remains closed, it will be forgotten.

A "fix" could be as simple as throwing TypeError if the data argument requires encoding (a string for example). I don't think there is a fix that is not a breaking change.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants