New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surprising behavior of ByteString literals via IsString #140
Comments
The behavior you're experiencing is documented at http://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Char8.html However, it may make sense to attach a docstring to the IsString instance since the IsString instance transcends module barriers. |
It seems unusual to me that |
I don't understand the rationale of having to export |
Ah, sorry, I didn't notice the |
This issue is still confusing users today: https://stackoverflow.com/questions/58777439/haskell-data-yaml-utf-8-decoding |
This is a landmine to anyone using RawFilePath. Consider: {-# LANGUAGE OverloadedStrings #-} I agree that this seems to assume the source encoding will be used. |
Since the |
@sjakobi Since this instance has been in place for as long as I can remember, it'd be interesting in order to inform a decision to know how many packages would be affected (and also whether those packages were in full compliance with the PVP by having the mandated upper bounds in place) on Hackage if this instance was removed in a future release of |
@hvr I'm sure that a removal would cause tons of upgrade hassle. Maybe it would be better to change the instance to UTF8-encode the input – that should limit any "breakage" to code that intentionally or more likely unintentionally triggered the current truncating behaviour. |
I think the instance should be removed. It isn't well defined. Carrying such pitfalls around for backwards compatibility reasons isn't what the haskell ecosystem should be about in my opinion. It should be about correctness. We could mark it deprecated and remove it in 1 year. |
Yet another alternative to removal: Raise an error if any |
As much as I'd like for I'm in favor of removing the instance over a long-enough timeline. With In stages:
|
@parsonsmatt I really like the idea of offering quasiquoters as an alternative to literals as a first step. The
|
`bs "foo"` is syntactically lighter-weight than quasiquotation, and
trivial to implement:
bs :: Text -> ByteString
bs = Data.Text.Encoding.encodeUtf8
…--
see shy jo
|
@joeyh I guess it would be convenient if |
To clarify: The The documentation could direct users who want UTF-8 encoding to the |
Turns out that there is some UTF-8 functionality in In fact the bytestring/Data/ByteString/Builder.hs Lines 452 to 453 in 95fe6bd
That's a big surprise to me, and I think it changes the picture. I think we should aim for consistency with this (superior) instance then, and change the instances for strict and lazy Thoughts on this? |
My problem with this is that it's not a good API. ByteString has less structure than String and I'd rather have people use those utf8 functions explicitly, so they write less bugs. The semantics between IsString instances are just not clear enough, IMO. If you need additional documentation for an instance, then it's a good sign you shouldn't have that instance in the first place. Maybe someone will expect it to truncate... and someone expect it to be utf8. |
@parsonsmatt @sjakobi Fortunately for you guys I had this epiphany years ago and already implemented the required typed TH in a library (https://hackage.haskell.org/package/validated-literals) that makes it very simple to turn these problems into compile time errors :p In fact, a compile time checked ByteString is literally one of the examples in that library ;) |
One issue with changing the
I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated. I would be very interested in knowing what @hvr and @dcoutts think about the idea of changing the instances to UTF-8. |
A lot of people I know (including me) have shot their feet by this instance. As a user of Japanese writing system, I strongly support changing the instance to use UTF-8 (also to be consistent with the instance for Builder). |
The point is, you are proposing to change behavior without renaming the function and you have no idea about the bugs in other peoples codebases, which may depend on unexpected behavior. PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on). Removing the instance is safe. Old code will stop compiling, people will be forced to change it and think about its semantics. |
The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers which can't be bothered to honor it. Quite frankly, I recommend to stay away from those people's packages as they don't instill me with great confidence and obviously it makes little sense for a PVP-compliant package to depend upon packages which don't themselves make any effort to uphold the PVP contract. That being said, luckily almost all packages I rely on are maintained by people who care to write correct software and therefore also appreciate the goals and benefits of the PVP formal framework. |
To summarise the basic options,
The nice property of 1. is that you can roundtrip the ByteString literals serialized by The original complaint about silent failure could be addressed by turning it into a non-silent runtime exception. This probably wouldn't add any significant overhead since currently string literals have a O(n) overhead as they need to be transcoded; we'd just need to signal errors instead of truncating code-points during this transcoding that needs to occur anyway. Variant 2. would make it consistent with The appeal of variant 3. is to turn the silent failures into compile-time failures but also all currently legitimate uses of bytestring literals! thereby throwing out the baby with the bathwater. Note however that TH or QQ is not a proper replacement for the lack of string literals. Not all GHC platforms support TH (and also for those that do, TH adds quite a bit of compile-time overhead); so I'd hate to see packages starting depending on TH support for something common such as bytestring literals which should be IMO provided by GHC w/o the need for the quite heavy TH machinery. And if we go one step further by making the instance illegal (i.e. so you can't define your own local instance), it'd effectively represent a regression to me as I couldn't rely on TH for libraries (which I strive to be portable, which means avoiding TH when possible). Consequently, I'd strongly object with to any TypeError which would primarily recommend the use of TH/QQ as replacement as I'd consider this harmful for the ecosystem. In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio; after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test-coverage via I could also see
could be something that |
Well, I think it's fair to raise this issue here. PVP says:
This doesn't say anything about changing semantics (not types) of existing functions. There is no way for a maintainer to know when this has happened. In fact, even a minor version bump may introduce changed semantics of a function as part of a "bugfix" (since it's even debatable what a bug is). And if that "bug" has been "fixed" after a few years, your chances of knowing the impact are very low. In this instance, PVP doesn't give us any guarantees whatsoever. The only proper way for that is:
This is the only way that makes a significant change in semantics visible to your users. They will see the deprecation warning in their build logs, especially with
I think this is the most dangerous one. An error may crash someones backend. We don't know that. Just imagine |
@hasufell You're are right about the incomplete wording in the PVP! I.e. that the current wording of the PVP could be interpreted that way; even though most people have interpreted the intent behind the PVP to mean that obviously semantically observable changes are considered breaking changes. And we've been discussing already a clarification (see haskell/pvp#30) of the wording to state this detail more explicit so there's really no doubt about what the intent is (the enumerated rules merely encode the general principle from the POV of a consumer; i.e. they follow from that principle; not the other way around -- but that's a different topic). In any case, I'd suggest to take the PVP specific discussion to haskell/pvp#30 where we're trying to address this minor oversight. |
How realistic is this scenario? For one, typically you have proper exception handling in place for critical components which take untrusted input -- we're specifically talking about the PS: There's still the option of leaving the IsString instance as-is and accept the truncating behaviour legitimate if for the sake of having a total function (at the cost of being non-injective). It's not like this is a "bug" of The whole point of the discussion at hand is gauge the cost/benefit ratios of a change from the current status quo to inform whether a change is worth doing, and most importantly, all choices are merely tradeoffs; there doesn't appear to be one variant which is absolutely superior to the others to me. Otoh, I partly blame that GHC or the core Haskell standard didn't anticipate the need to have a richer support for annotating the purpose/subtype of string literals (but then again; the Haskell Report considered a "String" merely an alias for |
I've been bitten by changing instances before. The last major release of Personally, I really want the A runtime exception with an informative error message would be surprising, but at least it should point to exactly where you need to fix your code. I usually expect to find some weird behavior when I do a major version upgrade (typically a new Removing the instance would be costly, and I wouldn't want to do it without a I wish there was a good way of reaching out and polling the community about their preferences on this - it seems pretty important! |
Yes, but the module can now follow a normal deprecation schedule. |
This isn't really about modelling the hierarchy of encodings though. The purpose of the |
I was just bitten by this issue for the first time. I assumed (so deeply that I didn't even notice I was relying on it as an assumption) that this instance used UTF-8 encoding. I was quite surprised to find out that it did not. I have no new proposals, just want to record myself as one more person who wishes this instance used UTF-8 encoding. |
I'll also record that I had relied on (Specifically ones produced by |
Yes quite unfortunately ByteString's |
GHC 9.10 will support warnings on instances (https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0575-deprecated-instances.rst) and warnings may now include categories (https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0541-warning-pragmas-with-categories.rst). So perhaps in GHC 9.10+ it would be worth adding a custom warning to |
What will the warning suggest to use instead? Imagine a user with a file with a few dozends clearly harmless ASCII ByteString literals (e.g. when implementing a ASCII-based protocol) all over the place. Is the suggestion to wrap them all in Data.ByteString.Char8.pack? Or maybe we need a proper form for byte literals before we should bug the poor user with warnings? |
ByteString literals are kinda odd, especially since they don't account for the platform. We might be able to shift to the quasiquoter provided in os-string: https://hackage.haskell.org/package/os-string-2.0.0/docs/System-OsString.html#v:osstr These modules also provide large sets of functions allowing to convert to and from string (predictably). String and Text are both "platform agnostic" and are converted to the expected format at the outer ffi layer. With ByteString you're guessing, at least if you interact with FFI (which is one of the primary use cases of the type). But if you're dealing with e.g. HTTP data or other shenanigans that may have different encodings, you'll have to decide anyway:
So there's no simple answer, because the situation isn't simple. |
That's the nice thing about warnings with categories - the user can disable the warning at the module level if they are confident their use case is fine as-is. It now also occurs to me that one can use |
I understand, but thats not a particularly great experience, since the warning is not precise and the fix of least resistance doesn't really improve the code. (“The code you wrote may or may not have a problem. Jump through these hoops to disable this annoying message”). I'm wondering if our users are better served with a less principled (because of the weird coupling) but much more useful warning that only warns (or errs!) if the string isn't pure ASCII? Yes, GHC would have to treat types called ByteString specially, and that's ugly. But maybe better? (Imagine ByteString happens to be defined in Base, like fixed width numbers. We wouldn't hesitate to have special warning support when literals don't fit the type? Having a worse DX as a consequence of which types happen to end up in which packages is maybe not a great design guide, and for most users it's all ”part of Haskell”) |
This has been discussed above and IIRC the MR was rejected by GHC developers, because they do not want to wire a third-party type into the compiler. It should not be too difficult to move |
This would be a good step forward in terms of UX indeed! |
FWIW, after much discussion above, my sense is that the problem is not with the
And then gradually discourage all use of |
Is there a workaround for this, in nowadays? |
My take, per the previous comment, is that someone sufficiently motivated should contribute something along the lines of the suggested The One precedent is the |
Quasi quotes are not a solution. Suppose you want to use decode "날씨 좋은데 나갈까?" Later, you realize it's Quasi quotes does not solve this kind of problem. Because it's (kind of) Template Haskell, replacing decode [r|날씨 좋은데 나갈까?|] Looks like These problems looks barely happen if
Why should keep broken behavior when you can just remove it? |
Removal of |
How many Stackage packages will be broken exactly? Can't we make tracking issues for affected packages? |
@xnuk It's difficult because it's not only the code that they use but also the API that they provide. |
|
Simply removing the instances without a deprecation period is very unappealing. And the first release of ghc to allow proper deprecation of instances will be ghc-9.10.1, which does not yet exist. It will be quite some time before most of our users are working with a new-enough compiler to be able to see deprecation warnings at use-sites even if we add them today. |
@clyring I don't think we should excessively worry about this happening, I trust the maintainers to be extra careful with the timeline, should the proposal go forward. |
@xnuk I don't understand your objection to quasiquotes. It might help if you explained what I think everyone here already agrees that The problem with a |
At work, we discovered a somewhat surprising behavior of ByteString's
IsString
instance and interaction withOverloadedStrings
.The following REPL session demonstrates the issue:
The
IsString
instance callspackChars
which callsc2w
, which silently truncates the bytes.I'd be happy to put together a PR to document the behavior of the
IsString
instance.I think I expected it to encode the string using the source encoding. I don't know whether or not that's a feasible or desirable change.
The text was updated successfully, but these errors were encountered: