Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use appropriate charset in body_string() #108

Merged
merged 7 commits into from
Nov 22, 2019

Conversation

goto-bus-stop
Copy link
Member

@goto-bus-stop goto-bus-stop commented Nov 20, 2019

This is an initial attempt at addressing #101. Opening for comments—feel free to critique the direction etc and tell me to throw it all out if it's not what you're looking for! Leaving a few checkboxes…

A new default feature "encoding" opts in to using the appropriate character encoding.

  • In native binaries, encoding_rs is used. If you know you're only talking to an API that always uses utf-8, you can turn off this feature and save the binary size that is taken up by the encoding tables.
  • In WebAssembly, it calls out to the TextDecoder API. AFAIK This always makes a copy, so you can turn it off if you don't want to copy big strings between JS and wasm.
    • We could optimize this for the utf-8 case by using String::from_utf8() directly if the charset is utf-8, and only use TextDecoder for other less common encodings.

Turning off the "encoding" feature reverts to the current behaviour, which is assuming utf-8.

  • This should likely check that the charset is utf-8, and refuse to decode it if it is anything else, to avoid garbage output.

This patch also assumes utf-8 if no charset is specified in the response headers.

  • By spec the default encoding should be iso-8859-1, but in practice it's not. Need to figure out the best approach here.

This patch introduces an error type that's quite unlike anything Surf has right now, so I'm not sure if that direction fits in with the design. I also wrapped the new error type in an io::ErrorKind::InvalidData everywhere which is mostly cargo culting, I don't think that's actually necessary? I carried it over from the old body_string() implementation, where it did make sense.

  • probably get rid of that wrapping

Copy link
Member

@yoshuawuyts yoshuawuyts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I love this a lot. Thanks so much! 💯

@goto-bus-stop
Copy link
Member Author

Found a website that doesn't use UTF-8 in the year of our lord: https://shop-pro.jp/

Unfortunately, it still doesn't work with this PR, because it has a bunch of bytes that probably are UTF-8 or some other encoding…

Another fallback strategy that folks could use would be to use encoding_rs to just decode it with the detected encoding anyways and accept that there will be UTF-8 replacement characters in the output, like Firefox does:
image

I've had some trouble finding good examples… a lot of sites that use non-UTF-8 encodings just don't add a charset= parameter to the mime type. This one uses the shift-jis encoding and identifies itself as such: https://wowma.jp/ On master, body_string() returns a UTF-8 error, and with this patch it works, so that's good ✨

For comparison, the cURL CLI doesn't handle encoding at all, so both of the above sites "work" but output garbage if you're using a terminal emulator that uses utf8.

@goto-bus-stop
Copy link
Member Author

I'm not sure how to get the DecodeError out of the current surf::Exception type, since I think downcast() only exists on Box<dyn Any>? So maybe the DecodeError type is not useful yet, but if surf switches to anyhow I think its downcast() method would work ok.

@goto-bus-stop goto-bus-stop marked this pull request as ready for review November 22, 2019 13:14
@CryZe
Copy link

CryZe commented Nov 22, 2019

but if surf switches to anyhow I think its downcast() method would work ok.

anyhow is not the right abstraction for libraries, thiserror / snafu are.

@goto-bus-stop
Copy link
Member Author

So thinking about the above I think this is probably a good start as is:

  • For pages that contain incorrect bytes, like the shop-pro.jp site, failing is likely better than silently producing lots of UTF8 replacement characters. Typically you'll use surf for APIs and not full web pages, which are more likely to contain incorrect bytes. There could be a body_string_lossy() method in the future for those use cases. Users who need to be super robust and deal with sites that use uncommon encodings but do not specify an encoding in Content-Type can manually check the content-type beforehand and decide to do character type detection at that point

  • The DecodeError problem will likely be addressed by the solution to surf::Exception does not implement std::error::Error #86

  • Since the vast vast majority of sites uses UTF-8, it is likely a better default than ISO-8859-1 would be

@yoshuawuyts
Copy link
Member

anyhow is not the right abstraction for libraries

@CryZe it's the right fit here though, as middleware can return arbitrary errors which are up to the user, and we don't want to enforce a generic param to capture any specific error.


@goto-bus-stop the summary you provided makes sense! Thanks for checking this against real-world examples; agree with utf-8 being a sensible default. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants