Use appropriate charset in `body_string()` #108

goto-bus-stop · 2019-11-20T10:47:24Z

This is an initial attempt at addressing #101. Opening for comments—feel free to critique the direction etc and tell me to throw it all out if it's not what you're looking for! Leaving a few checkboxes…

A new default feature "encoding" opts in to using the appropriate character encoding.

In native binaries, encoding_rs is used. If you know you're only talking to an API that always uses utf-8, you can turn off this feature and save the binary size that is taken up by the encoding tables.
In WebAssembly, it calls out to the TextDecoder API. AFAIK This always makes a copy, so you can turn it off if you don't want to copy big strings between JS and wasm.
- We could optimize this for the utf-8 case by using String::from_utf8() directly if the charset is utf-8, and only use TextDecoder for other less common encodings.

Turning off the "encoding" feature reverts to the current behaviour, which is assuming utf-8.

This should likely check that the charset is utf-8, and refuse to decode it if it is anything else, to avoid garbage output.

This patch also assumes utf-8 if no charset is specified in the response headers.

By spec the default encoding should be iso-8859-1, but in practice it's not. Need to figure out the best approach here.

This patch introduces an error type that's quite unlike anything Surf has right now, so I'm not sure if that direction fits in with the design. I also wrapped the new error type in an io::ErrorKind::InvalidData everywhere which is mostly cargo culting, I don't think that's actually necessary? I carried it over from the old body_string() implementation, where it did make sense.

probably get rid of that wrapping

yoshuawuyts

Oh I love this a lot. Thanks so much! 💯

goto-bus-stop · 2019-11-22T12:55:41Z

Found a website that doesn't use UTF-8 in the year of our lord: https://shop-pro.jp/

Unfortunately, it still doesn't work with this PR, because it has a bunch of bytes that probably are UTF-8 or some other encoding…

Another fallback strategy that folks could use would be to use encoding_rs to just decode it with the detected encoding anyways and accept that there will be UTF-8 replacement characters in the output, like Firefox does:

I've had some trouble finding good examples… a lot of sites that use non-UTF-8 encodings just don't add a charset= parameter to the mime type. This one uses the shift-jis encoding and identifies itself as such: https://wowma.jp/ On master, body_string() returns a UTF-8 error, and with this patch it works, so that's good ✨

For comparison, the cURL CLI doesn't handle encoding at all, so both of the above sites "work" but output garbage if you're using a terminal emulator that uses utf8.

goto-bus-stop · 2019-11-22T13:05:10Z

I'm not sure how to get the DecodeError out of the current surf::Exception type, since I think downcast() only exists on Box<dyn Any>? So maybe the DecodeError type is not useful yet, but if surf switches to anyhow I think its downcast() method would work ok.

CryZe · 2019-11-22T13:21:05Z

but if surf switches to anyhow I think its downcast() method would work ok.

anyhow is not the right abstraction for libraries, thiserror / snafu are.

goto-bus-stop · 2019-11-22T13:21:07Z

So thinking about the above I think this is probably a good start as is:

For pages that contain incorrect bytes, like the shop-pro.jp site, failing is likely better than silently producing lots of UTF8 replacement characters. Typically you'll use surf for APIs and not full web pages, which are more likely to contain incorrect bytes. There could be a body_string_lossy() method in the future for those use cases. Users who need to be super robust and deal with sites that use uncommon encodings but do not specify an encoding in Content-Type can manually check the content-type beforehand and decide to do character type detection at that point
The DecodeError problem will likely be addressed by the solution to surf::Exception does not implement std::error::Error #86
Since the vast vast majority of sites uses UTF-8, it is likely a better default than ISO-8859-1 would be

yoshuawuyts · 2019-11-22T13:32:43Z

anyhow is not the right abstraction for libraries

@CryZe it's the right fit here though, as middleware can return arbitrary errors which are up to the user, and we don't want to enforce a generic param to capture any specific error.

@goto-bus-stop the summary you provided makes sense! Thanks for checking this against real-world examples; agree with utf-8 being a sensible default. Thank you!

goto-bus-stop added 5 commits November 20, 2019 11:09

Sketch encoding-aware body_string() implementation

2de188a

Add some tests for the encodings

8a25c0f

Fix ownership of byte array in utf-8-only path

7ef3e00

make wasm32 impl of decode_body compile

7f9679c

Error if "encoding" feature is disabled and body is not utf-8

60f23fc

yoshuawuyts approved these changes Nov 22, 2019

View reviewed changes

custom Debug impl for DecodeError to truncate the data bytes

9bcf573

document encoding behaviour of body_string()

f989dc4

goto-bus-stop marked this pull request as ready for review November 22, 2019 13:14

goto-bus-stop merged commit 24a3c01 into http-rs:master Nov 22, 2019

goto-bus-stop deleted the encoding branch November 22, 2019 13:37

goto-bus-stop mentioned this pull request Nov 22, 2019

Response method body_string may use wrong character encoding #101

Closed

yoshuawuyts mentioned this pull request Dec 12, 2019

content-encoding aware string parsing & encoding http-rs/http-types#34

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use appropriate charset in `body_string()` #108

Use appropriate charset in `body_string()` #108

goto-bus-stop commented Nov 20, 2019 •

edited

Loading

yoshuawuyts left a comment

goto-bus-stop commented Nov 22, 2019

goto-bus-stop commented Nov 22, 2019

CryZe commented Nov 22, 2019

goto-bus-stop commented Nov 22, 2019

yoshuawuyts commented Nov 22, 2019

Use appropriate charset in body_string() #108

Use appropriate charset in body_string() #108

Conversation

goto-bus-stop commented Nov 20, 2019 • edited Loading

yoshuawuyts left a comment

Choose a reason for hiding this comment

goto-bus-stop commented Nov 22, 2019

goto-bus-stop commented Nov 22, 2019

CryZe commented Nov 22, 2019

goto-bus-stop commented Nov 22, 2019

yoshuawuyts commented Nov 22, 2019

Use appropriate charset in `body_string()` #108

Use appropriate charset in `body_string()` #108

goto-bus-stop commented Nov 20, 2019 •

edited

Loading