
Make parseUrlHttp[s] take in a Text value #67

Closed
tysonzero opened this issue Aug 26, 2019 · 3 comments · Fixed by #75

Comments

@tysonzero

ByteString semantically represents a sequence of 8-bit octets, not a series of ASCII characters, while URLs are semantically sequences of characters. You can see this distinction in how Char8 is too big to store only ASCII, in how often ByteString is used to store non-ASCII data, and in the non-ASCII encodings typically used when outputting to stdout.

Currently, if we have a Text value that we want to convert into a URL, we are in the awkward position of deciding how to convert it to a ByteString. We can T.unpack it and pack the characters back with Char8, in which case any non-ASCII text essentially becomes garbage. We can T.encodeUtf8 it, which is probably the most sensible thing to do, as parseUrlHttps appears to decode that UTF-8 back into the appropriate Text values. We could also mistakenly use some other non-ASCII-compatible encoding like UTF-16 and break everything.
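The two conversion paths above can be sketched like this (a minimal sketch assuming the standard text and bytestring packages; the example URL is made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let url = "https://example.com/π" :: T.Text
  -- Round-tripping through String and Char8.pack truncates every Char
  -- to its low 8 bits, so π (U+03C0) silently becomes the garbage byte 0xC0.
  print (B8.pack (T.unpack url))
  -- encodeUtf8 yields the UTF-8 octets, which a UTF-8-aware parser
  -- can decode back into the original Text.
  print (TE.encodeUtf8 url)
```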

It seems to make the most sense to just have it be a Text value from the start, particularly given the fact that parseUrlHttp[s] already seems to do proper UTF-8 decoding.

@tysonzero
Author

The current implementation already seems to basically do the right thing on a UTF-8-encoded ByteString: it tries to leave the URL alone except where it is truly invalid (e.g. a raw π half-way through) and then URL-encodes those invalid characters.

It should just be more transparent about the fact that it works on Unicode/UTF-8 by working on Text directly, helping to avoid a variety of errors in the process.

@mrkkrp
Owner

mrkkrp commented Sep 1, 2019

ByteString semantically represents a sequence of 8-bit octets, not a series of ASCII characters, while URLs are semantically sequences of [ASCII] characters. You can see this distinction in how Char8 is too big to store only ASCII, in how often ByteString is used to store non-ASCII data, and in the non-ASCII encodings typically used when outputting to stdout.

Even so, Text is composed of elements that are even "bigger", allowing all sorts of Unicode in them. So it's hardly a better option, is it?

The type we take as the argument of the parseUrlHttp(s) functions is dictated by the http-types package, because we build on functions like decodePathSegments and parseQueryText, both of which take ByteStrings and produce Text pieces from them. If you want to change this, then perhaps the right place to start is the http-types issue tracker.
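Those two http-types functions look roughly like this in use (a sketch assuming the http-types package; the example path and query are made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Types.URI (decodePathSegments, parseQueryText)

main :: IO ()
main = do
  -- ByteString in, [Text] out: percent-escapes are decoded as UTF-8.
  print (decodePathSegments "/foo/b%C3%A9r")
  -- ByteString in, [(Text, Maybe Text)] out.
  print (parseQueryText "a=1&b=%CF%80")
```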

Anyway, AFAIU, neither Text nor ByteString is perfect:

  • In the case of ByteString: 1) its elements are too big; 2) encoding issues may introduce bugs.
  • In the case of Text: 1) its elements are even bigger, admitting Unicode symbols which cannot appear unescaped in URLs (correct me if I'm wrong, I may have forgotten the subtleties).

Since both options are not ideal and http-types uses ByteString, I think we should stick with it.

@tysonzero
Author

The flaws in the types are more or less what you said, but I would adjust the phrasing slightly to show why Text really is much less problematic than ByteString:

The two main issues are:

  1. Encoding issues: It's relatively easy to have bugs that slip through the type checker if you just use the wrong encoding/decoding function. This issue is unique to ByteString.

  2. Too many codepoints: Suffered by both ByteString and Text, but I would actually argue that ByteString is even worse in this regard. At least with Text, these overly large codepoints have a pretty sane default behavior, namely UTF-8 encode + percent-encode (e.g. what your browser does when you type arbitrary Unicode into the URL bar).
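That default (UTF-8 encode the Text, then percent-encode the resulting octets) can be sketched with urlEncode from http-types; the helper name textToPathPiece is my own, not part of any library:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Network.HTTP.Types.URI (urlEncode)

-- Hypothetical helper: UTF-8 encode a Text path piece, then
-- percent-encode the bytes (False = encode as a path element,
-- not as a query string).
textToPathPiece :: T.Text -> B.ByteString
textToPathPiece = urlEncode False . TE.encodeUtf8

main :: IO ()
main = print (textToPathPiece "π")  -- π (U+03C0) becomes "%CF%80"
```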

The http-types aspect is interesting, and my follow-up to that is that http-types is also doing the wrong thing: it is currently worrying about encoding and assuming it to be UTF-8, when it should really only be dealing with codepoints. Occasionally it makes sense to do both, such as in XML, where you need to guess the encoding from the encoding parameter and change the parser accordingly, but usually it does not.

@mrkkrp mrkkrp added this to the 3.0.0 milestone Nov 2, 2019
@mrkkrp mrkkrp closed this as completed in #75 Nov 4, 2019