
Make parseUrlHttp[s] take in a Text value #67

Closed
tysonzero opened this issue Aug 26, 2019 · 3 comments · Fixed by #75

Comments

@tysonzero

ByteString semantically represents a sequence of 8-bit octets, not a series of ASCII characters, while URLs are semantically sequences of characters. You can see this distinction in how Char8 is too big to store only ASCII, in how often ByteString is used to store non-ASCII data, and in the non-ASCII encodings typically used when outputting to stdout.

Currently, if we have a Text value that we want to convert into a URL, we are in the awkward position of deciding how to convert it to a ByteString. We can T.unpack it and pack the characters back with Char8, in which case any non-ASCII text essentially becomes garbage. We can T.encodeUtf8 it, which is probably the most sensible thing to do, as parseUrlHttps appears to decode that UTF-8 back into the appropriate Text values. We could also mistakenly use some other non-ASCII-compatible encoding like UTF-16 and break everything.
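The two conversion paths above can be sketched like this (a minimal sketch assuming the standard text and bytestring packages; the example URL is made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let url = "https://example.com/π" :: T.Text
  -- Round-tripping through String and Char8.pack truncates every Char
  -- to its low 8 bits, so π (U+03C0) silently becomes the garbage byte 0xC0.
  print (B8.pack (T.unpack url))
  -- encodeUtf8 yields the UTF-8 octets, which a UTF-8-aware parser
  -- can decode back into the original Text.
  print (TE.encodeUtf8 url)
```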

It seems to make the most sense to just have it be a Text value from the start, particularly given the fact that parseUrlHttp[s] already seems to do proper UTF-8 decoding.

@tysonzero
Author

The current implementation already seems to basically do the right thing on a UTF-8-encoded ByteString: it tries to leave the URL alone except where it is truly invalid (e.g. a raw π half-way through) and then URL-encodes those invalid characters.

It should just be more transparent about the fact that it works on Unicode/UTF-8 by working on Text directly, helping to avoid a variety of errors in the process.

@mrkkrp
Owner

mrkkrp commented Sep 1, 2019

ByteString semantically represents a sequence of 8-bit octets, not a series of ASCII characters, while URLs are semantically sequences of [ASCII] characters. You can see this distinction in how Char8 is too big to store only ASCII, in how often ByteString is used to store non-ASCII data, and in the non-ASCII encodings typically used when outputting to stdout.

Even so, Text is composed of elements that are even "bigger", allowing all sorts of Unicode in them. So it's hardly a better option, is it?

The type we take as the argument of the parseUrlHttp(s) functions is dictated by the http-types package, because we build on functions like decodePathSegments and parseQueryText, both of which take ByteStrings and produce Text pieces from them. If you want to change this, then perhaps the right place to start is the http-types issue tracker.
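Those two http-types functions look roughly like this in use (a sketch assuming the http-types package; the example path and query are made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Types.URI (decodePathSegments, parseQueryText)

main :: IO ()
main = do
  -- ByteString in, [Text] out: percent-escapes are decoded as UTF-8.
  print (decodePathSegments "/foo/b%C3%A9r")
  -- ByteString in, [(Text, Maybe Text)] out.
  print (parseQueryText "a=1&b=%CF%80")
```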

Anyway, AFAIU, neither Text nor ByteString is perfect:

  • In the case of ByteString: 1) its elements are too big; 2) encoding issues may introduce bugs.
  • In the case of Text: 1) its elements are even bigger, admitting Unicode symbols which cannot appear unescaped in URLs (correct me if I'm wrong, I may have forgotten the subtleties).

Since both options are not ideal and http-types uses ByteString, I think we should stick with it.

@tysonzero
Author

The flaws in the types are more or less what you said, but I would adjust the phrasing slightly to show why Text really is much less problematic than ByteString:

The two main issues are:

  1. Encoding issues: It's relatively easy to have bugs that slip through the type checker if you just use the wrong encoding/decoding function. This issue is unique to ByteString.

  2. Too many codepoints: Suffered by both ByteString and Text, but I would actually argue that ByteString is even worse in this regard. At least with Text, these overly large codepoints have a pretty sane default behavior, namely UTF-8 encode + percent-encode (e.g. what your browser does when you type arbitrary Unicode into the URL bar).
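That default (UTF-8 encode the Text, then percent-encode the resulting octets) can be sketched with urlEncode from http-types; the helper name textToPathPiece is my own, not part of any library:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Network.HTTP.Types.URI (urlEncode)

-- Hypothetical helper: UTF-8 encode a Text path piece, then
-- percent-encode the bytes (False = encode as a path element,
-- not as a query string).
textToPathPiece :: T.Text -> B.ByteString
textToPathPiece = urlEncode False . TE.encodeUtf8

main :: IO ()
main = print (textToPathPiece "π")  -- π (U+03C0) becomes "%CF%80"
```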

The http-types aspect is interesting, and my follow-up to that is that http-types is also doing the wrong thing: it is currently worrying about encoding and assuming it to be UTF-8, when it should really only be dealing with codepoints. Occasionally it makes sense to do both, such as in XML, where you need to guess the encoding from the encoding parameter and change the parser accordingly, but usually it does not.

@mrkkrp mrkkrp added this to the 3.0.0 milestone Nov 2, 2019
@mrkkrp mrkkrp closed this as completed in #75 Nov 4, 2019