Surprising behavior of ByteString literals via IsString #140

Open
parsonsmatt opened this issue Oct 13, 2017 · 126 comments
Labels: blocked: ghc (This is blocked on a feature or primitive not yet available in a released GHC version), documentation, pitfall

Comments

@parsonsmatt

At work, we discovered a somewhat surprising behavior of ByteString's IsString instance and interaction with OverloadedStrings.

The following REPL session demonstrates the issue:

λ> BS.unpack $ T.encodeUtf8 ("bla語" :: Text)
[98,108,97,232,170,158]
λ> BS.unpack $ ("bla語" :: BS.ByteString)
[98,108,97,158]
λ> T.decodeUtf8 $ ("bla語" :: BS.ByteString)
*** Exception: Cannot decode byte '\x9e': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8

The IsString instance calls packChars which calls c2w, which silently truncates the bytes.
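To make the truncation concrete, here is a minimal model of the per-character conversion using only base. The names charToByte and truncatedBytes are made up for this sketch; they are meant to mirror what c2w does to each Char, not to reproduce the library's internals.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Model of the per-character conversion: keep only the low 8 bits of the
-- code point (fromIntegral wraps modulo 256 when the target is Word8).
charToByte :: Char -> Word8
charToByte = fromIntegral . ord

-- Model of what a literal turns into, one byte per character.
truncatedBytes :: String -> [Word8]
truncatedBytes = map charToByte

main :: IO ()
main = do
  print (ord '語')                -- 35486, i.e. 0x8A9E
  print (truncatedBytes "bla語")  -- [98,108,97,158], same as the session above
```

The last byte, 158 (0x9E), is just the low byte of U+8A9E, which is why the result is not valid UTF-8.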

I'd be happy to put together a PR to document the behavior of the IsString instance.

I think I expected it to encode the string using the source encoding. I don't know whether or not that's a feasible or desirable change.

@hvr
Member

hvr commented Oct 15, 2017

The behavior you're experiencing is documented at http://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Char8.html

However, it may make sense to attach a docstring to the IsString instance since the IsString instance transcends module barriers.

@chris-martin
Contributor

chris-martin commented Mar 12, 2018

It seems unusual to me that packChars is the basis for fromString yet is not part of the API. I think packChars ought to be exported, and the IsString instance haddock can just link to the packChars doc, and the packChars doc should explain what it does.

@hvr
Member

hvr commented Mar 12, 2018

I don't understand the rationale of having to export packChars if we already have access to it via other means (i.e. fromString & pack) which alias it.

@chris-martin
Contributor

Ah, sorry, I didn't notice the pack function in the Char8 module. Yeah, the important thing is for the IsString instance doc to have a link to Char8.pack.

@sjakobi
Member

sjakobi commented Nov 9, 2019

This issue is still confusing users today: https://stackoverflow.com/questions/58777439/haskell-data-yaml-utf-8-decoding

@joeyh

joeyh commented Dec 13, 2019

This is a landmine to anyone using RawFilePath. Consider:

{-# LANGUAGE OverloadedStrings #-}
import System.Posix.Directory.ByteString
main = removeDirectory "bla語"

I agree that this seems to assume the source encoding will be used.

@sjakobi
Member

sjakobi commented Dec 15, 2019

Since the IsString instance is still confusing both beginners and experienced users, how about removing it?

@hvr
Member

hvr commented Dec 15, 2019

@sjakobi Since this instance has been in place for as long as I can remember, it'd be interesting in order to inform a decision to know how many packages would be affected (and also whether those packages were in full compliance with the PVP by having the mandated upper bounds in place) on Hackage if this instance was removed in a future release of bytestring.

@sjakobi
Member

sjakobi commented Dec 15, 2019

@hvr I'm sure that a removal would cause tons of upgrade hassle.

Maybe it would be better to change the instance to UTF8-encode the input – that should limit any "breakage" to code that intentionally or more likely unintentionally triggered the current truncating behaviour.

@hasufell
Member

I think the instance should be removed. It isn't well defined. Carrying such pitfalls around for backwards compatibility reasons isn't what the haskell ecosystem should be about in my opinion. It should be about correctness.

We could mark it deprecated and remove it in 1 year.

@sjakobi
Member

sjakobi commented Dec 16, 2019

Yet another alternative to removal: Raise an error if any Char in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…
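A rough sketch of what such a check could look like, written as a standalone function rather than a change to the instance. packLatin1 is a hypothetical name, not something bytestring exports, and it returns an Either instead of throwing so that the caller decides how to fail.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.List (find)

-- Hypothetical helper, not part of bytestring: pack a String only if every
-- character fits in one byte, otherwise report the first offending Char.
packLatin1 :: String -> Either Char B.ByteString
packLatin1 s = case find (> '\xFF') s of
  Just c  -> Left c
  Nothing -> Right (BC.pack s)

main :: IO ()
main = do
  print (fmap B.unpack (packLatin1 "bla"))    -- Right [98,108,97]
  print (fmap B.unpack (packLatin1 "bla語"))  -- Left '\35486'
```

An instance built on the same check would have to turn the Left case into a call to error, since fromString has no way to report failure in its type.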

@parsonsmatt
Author

Yet another alternative to removal: Raise an error if any Char in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…

As much as I'd like for "bla語" :: ByteString to be a compile-time error, I'd rather have a runtime error telling me why something is wrong rather than having this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this 😄

I'm in favor of removing the instance over a long-enough timeline. With qq-literals and the statically checked overloaded strings trick, the fix could be as simple as inserting the relevant quasiquoter ([bs| ... |]) or quotation $$(...) (or even $$"..." in 8.12!). This is an annoyance but would dramatically improve safety.

In stages:

  1. Replace the fromString function with one that throws a runtime error in the next major version bump.
  2. Add a warning to the IsString instance in the next major version bump with migration instructions
  3. Add a TypeError to the IsString instance in the next major version bump with instructions on how to migrate away.
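For reference, the final stage could look roughly like the sketch below. It uses a made-up newtype Bytes so that it does not clash with the real instance, and the message text and the fromLatin1 name are placeholders, not a proposal for the actual wording.

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, TypeOperators, UndecidableInstances #-}
import qualified Data.ByteString.Char8 as BC
import Data.String (IsString (..))
import GHC.TypeLits (ErrorMessage (..), TypeError)

-- Hypothetical stand-in for ByteString.
newtype Bytes = Bytes BC.ByteString
  deriving (Eq, Show)

-- The instance exists, but selecting it is reported as a custom type error,
-- so every overloaded literal of type Bytes fails to compile with this text.
instance TypeError
           ('Text "String literals of type Bytes are not supported."
            ':$$: 'Text "Use an explicit conversion such as fromLatin1 instead.")
         => IsString Bytes where
  fromString = error "unreachable: this instance can never be selected"

-- Hypothetical explicit replacement that the message points to.
fromLatin1 :: String -> Bytes
fromLatin1 = Bytes . BC.pack

main :: IO ()
main = print (fromLatin1 "GET / HTTP/1.1")
```

With OverloadedStrings on, writing ("abc" :: Bytes) in a module that imports this should then be rejected at compile time with the two-line message.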

@sjakobi
Member

sjakobi commented Dec 17, 2019

@parsonsmatt I really like the idea of offering quasiquoters as an alternative to literals as a first step. The $$(...) and $$"..." quotations seem slightly less readable to me.

bs and lbs seem like good names to me. Does everyone agree or have better suggestions?

@joeyh

joeyh commented Dec 17, 2019 via email

@sjakobi
Member

sjakobi commented Dec 17, 2019

@joeyh text depends on bytestring, so bytestring can't have a dependency on text.

I guess it would be convenient if bytestring could UTF8-encode Strings itself – I would be surprised if that hadn't been discussed before!

@sjakobi
Member

sjakobi commented Dec 17, 2019

To clarify: The bs and lbs quasiquoters would only validate the input, rejecting any input containing non-Latin-1 characters, i.e. characters that don't fit into a single byte.
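A validating quasiquoter along these lines might be sketched as follows. This is only an illustration of the idea: firstWideChar is a made-up helper, the error messages are placeholders, and only the expression context is supported.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import qualified Data.ByteString.Char8 as BC
import Data.List (find)
import Language.Haskell.TH (Q)
import Language.Haskell.TH.Quote (QuasiQuoter (..))

-- First character that does not fit in a single byte, if any.
firstWideChar :: String -> Maybe Char
firstWideChar = find (> '\xFF')

-- Sketch of a `bs` quasiquoter: reject non-Latin-1 input at compile time,
-- otherwise expand to a call to Data.ByteString.Char8.pack on the literal.
bs :: QuasiQuoter
bs = QuasiQuoter
  { quoteExp = \s -> case firstWideChar s of
      Just c  -> fail ("bs: character " ++ show c ++ " does not fit in one byte")
      Nothing -> [| BC.pack s |]
  , quotePat  = unsupported "pattern"
  , quoteType = unsupported "type"
  , quoteDec  = unsupported "declaration"
  }
  where
    unsupported :: String -> String -> Q a
    unsupported ctx _ = fail ("bs: not usable in a " ++ ctx ++ " context")

main :: IO ()
main = do
  print (firstWideChar "GET /")   -- Nothing
  print (firstWideChar "bla語")   -- Just '\35486'
```

Because of the stage restriction, [bs|...|] can only be used from a different module than the one defining bs, with QuasiQuotes enabled there.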

The documentation could direct users who want UTF-8 encoding to the utf8-string package.

@sjakobi
Member

sjakobi commented Dec 18, 2019

I guess it would be convenient if bytestring could UTF8-encode Strings itself – I would be surprised if that hadn't been discussed before!

Turns out that there is some UTF-8 functionality in bytestring – for Builder:

In fact the IsString instance for Builder supports UTF-8 via stringUtf8:

instance IsString Builder where
    fromString = stringUtf8

That's a big surprise to me, and I think it changes the picture.

I think we should aim for consistency with this (superior) instance then, and change the instances for strict and lazy ByteString to support UTF-8. That should be much more convenient than having to resort to Builder, text or utf8-string to get a UTF-8-encoded ByteString.
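To see the two behaviours side by side, a small comparison using only functions that bytestring already exports (stringUtf8 and toLazyByteString from Data.ByteString.Builder, pack from Data.ByteString.Char8):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- What the Builder route produces: UTF-8 encoded bytes.
viaBuilder :: String -> [B.ByteString]
viaBuilder = BL.toChunks . BB.toLazyByteString . BB.stringUtf8

main :: IO ()
main = do
  -- UTF-8 via Builder: the three-byte encoding of U+8A9E is preserved.
  print (B.unpack (B.concat (viaBuilder "bla語")))  -- [98,108,97,232,170,158]
  -- Char8.pack, which the strict instance is based on: one truncated byte.
  print (B.unpack (BC.pack "bla語"))                -- [98,108,97,158]
```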

Thoughts on this?

@hasufell
Member

Thoughts on this?

My problem with this is that it's not a good API. ByteString has less structure than String and fromString is just misleading here. Once you turned it into a bytestring, you lost explicit information (about the encoding) and must be aware that it is utf8 and not something else.

I'd rather have people use those utf8 functions explicitly, so they write fewer bugs. The semantics between IsString instances are just not clear enough, IMO. If you need additional documentation for an instance, then it's a good sign you shouldn't have that instance in the first place.

Maybe someone will expect it to truncate... and someone expect it to be utf8.

@merijn

merijn commented Dec 18, 2019

As much as I'd like for "bla語" :: ByteString to be a compile-time error, I'd rather have a runtime error telling me why something is wrong rather than having this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this

@parsonsmatt @sjakobi Fortunately for you guys I had this epiphany years ago and already implemented the required typed TH in a library (https://hackage.haskell.org/package/validated-literals) that makes it very simple to turn these problems into compile time errors :p

In fact, a compile time checked ByteString is literally one of the examples in that library ;)

@sjakobi
Member

sjakobi commented Dec 19, 2019

One issue with changing the ByteString IsString instances to UTF-8 is that users would need to be careful to have proper lower bounds on bytestring to avoid truncation with older versions.

@hasufell

Maybe someone will expect it to truncate...

I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.

I would be very interested in knowing what @hvr and @dcoutts think about the idea of changing the instances to UTF-8.

@fumieval
Contributor

A lot of people I know (including me) have shot their feet by this instance. As a user of Japanese writing system, I strongly support changing the instance to use UTF-8 (also to be consistent with the instance for Builder).

@hasufell
Member

hasufell commented Dec 31, 2019

I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.

The point is, you are proposing to change behavior without renaming the function and you have no idea about the bugs in other people's codebases, which may depend on unexpected behavior. PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).

Removing the instance is safe. Old code will stop compiling, people will be forced to change it and think about its semantics.

@hvr
Member

hvr commented Dec 31, 2019

PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).

The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers which can't be bothered to honor it. Quite frankly, I recommend to stay away from those people's packages as they don't instill me with great confidence and obviously it makes little sense for a PVP-compliant package to depend upon packages which don't themselves make any effort to uphold the PVP contract.

That being said, luckily almost all packages I rely on are maintained by people who care to write correct software and therefore also appreciate the goals and benefits of the PVP formal framework.

@hvr
Member

hvr commented Dec 31, 2019

To summarise the basic options, instance IsString ByteString could be either

  1. Decoding as ISO-8859-1 (status quo)
  2. Decoding as UTF-8
  3. not supported

The nice property of 1. is that you can roundtrip the ByteString literals serialized by Show; especially those that aren't valid UTF-8 encodings -- after all, ByteString is about binary data, which is a superset of valid UTF-8 encodings. There's also code out there which relies on the IsString instance to efficiently (still O(n) but with a relatively small factor; but way better than using pack on [Word8] -- but see also @phadej and @andrewthad's ghc-proposals/ghc-proposals#292) embed binary blobs into Haskell code (NB: to make things more fun to reason about; GHC currently uses a modified UTF8 encoding to represent string literals as CStrings; and there's been a long-time desire to have the data-length be known statically at compile time... but I've seen @andrewthad make some progress in this department recently).
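The round-trip property can be checked with a few lines. The blob below is an arbitrary example that is deliberately not valid UTF-8, and roundTrips is just a name chosen for this sketch.

```haskell
import qualified Data.ByteString as B

-- Arbitrary binary data; the bytes 255 and 128 make it invalid as UTF-8.
blob :: B.ByteString
blob = B.pack [0, 255, 128, 10]

-- Show renders the bytes as a Latin-1 string literal; Read parses it back.
roundTrips :: B.ByteString -> Bool
roundTrips b = read (show b) == b

main :: IO ()
main = do
  putStrLn (show blob)     -- the literal form, with escapes for each byte
  print (roundTrips blob)  -- True
```

Under a UTF-8 instance, pasting that shown literal back into source would give a different value, since each character above 127 would expand to two bytes.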

The original complaint about silent failure could be addressed by turning it into a non-silent runtime exception. This probably wouldn't add any significant overhead since currently string literals have a O(n) overhead as they need to be transcoded; we'd just need to signal errors instead of truncating code-points during this transcoding that needs to occur anyway.


Variant 2. would make it consistent with instance IsString Builder which currently uses UTF-8 encoding (however, it could be argued that IsString Builder should be using ISO-8859-1 as well!). But other than that, it's still a weird choice to pick UTF-8 for a binary ByteString type. Also note that you still have to deal with transcoding GHC's modified UTF8 serialization into proper UTF-8 as well as dealing with improper UTF-8 strings (either silently or by runtime exceptions). And you lose the ability to represent all ByteString values as string literals.


The appeal of variant 3. is that it turns the silent failures into compile-time failures, but it does the same to all currently legitimate uses of bytestring literals, thereby throwing out the baby with the bathwater.

Note however that TH or QQ is not a proper replacement for the lack of string literals. Not all GHC platforms support TH (and also for those that do, TH adds quite a bit of compile-time overhead); so I'd hate to see packages starting depending on TH support for something common such as bytestring literals which should be IMO provided by GHC w/o the need for the quite heavy TH machinery.

And if we go one step further by making the instance illegal (i.e. so you can't define your own local instance), it'd effectively represent a regression to me as I couldn't rely on TH for libraries (which I strive to be portable, which means avoiding TH when possible). Consequently, I'd strongly object to any TypeError which would primarily recommend the use of TH/QQ as replacement as I'd consider this harmful for the ecosystem.


In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio; after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test-coverage via hpc) as part of your CI, you'll be able to force those runtime errors. And typically you don't write that many ByteString literals anyway -- and if you do, I'd expect you to be mindful about why you're using ByteString literals in the first place -- and if you're abusing ByteString literals for encoding Unicode text instead of using text/utf8-text/short-text/..., I'd have little sympathy if you're not being careful. ;-)

I could also see hlint be able to detect silent code-point truncation when the type is trivially inferrable -- if it doesn't already support this; i.e. code like

s :: ByteString
s = "€uro"

could be something that hlint could heuristically detect IMO.

@hasufell
Member

hasufell commented Dec 31, 2019

The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers which can't be bothered to honor it.

Well, I think it's fair to raise this issue here. PVP says:

Breaking change. If any entity was removed, or the types of any entities or the definitions of datatypes or classes were changed, or orphan instances were added or any instances were removed, then the new A.B MUST be greater than the previous A.B. Note that modifying imports or depending on a newer version of another package may cause extra orphan instances to be exported and thus force a major version change.

This doesn't say anything about changing semantics (not types) of existing functions. There is no way for a maintainer to know when this has happened. In fact, even a minor version bump may introduce changed semantics of a function as part of a "bugfix" (since it's even debatable what a bug is). And if that "bug" has been "fixed" after a few years, your chances of knowing the impact are very low.

In this instance, PVP doesn't give us any guarantees whatsoever. The only proper way for that is:

  1. deprecation phase
  2. removing the function and providing a new one

This is the only way that makes a significant change in semantics visible to your users. They will see the deprecation warning in their build logs, especially with -Werror and after a year or two they will experience compilation failure and will have to look up the documentation/ChangeLog.


In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio

I think this is the most dangerous one. An error may crash someone's backend. We don't know that. Just imagine IsString used in the context of foreign input (from a user, another http server). Now you have a DoS? 😨

@hvr
Member

hvr commented Dec 31, 2019

@hasufell You're right about the incomplete wording in the PVP! I.e. the current wording of the PVP could be interpreted that way, even though most people have interpreted the intent behind the PVP to mean that obviously semantically observable changes are considered breaking changes. And we've already been discussing a clarification (see haskell/pvp#30) of the wording to state this detail more explicitly so there's really no doubt about what the intent is (the enumerated rules merely encode the general principle from the POV of a consumer; i.e. they follow from that principle; not the other way around -- but that's a different topic). In any case, I'd suggest taking the PVP-specific discussion to haskell/pvp#30 where we're trying to address this minor oversight.

@hvr
Member

hvr commented Dec 31, 2019

In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio

I think this is the most dangerous one. An error may crash someones backend. We don't know that. Just imagine IsString used in the context of foreign input (from a user, another http server). Now you have a DoS?

How realistic is this scenario? For one, typically you have proper exception handling in place for critical components which take untrusted input -- we're specifically talking about the IsString which is primarily intended for string literals; so you'd have to assume that somebody had been using code that was incorrect to begin with and silently truncated code-points of a string literal. Or you might be considering the unintended use of IsString for non-static input from untrusted input. But then again, you'd have to assume that code would be doing input validation given that IsString has . I feel like you're trying to come up with pathological scenarios which may be technically correct but not very common in practice. At least I can't think of code of mine I've written over the years where my components would crash and burn without deserving it if non-ISO-8859-1 ByteString literals would start triggering runtime exceptions. That being said, I'd find ways to cope if the IsString instance went missing (and without an annoying TypeError-fake-instance in its place), resulting in some busywork finding ways to express the previous concise ByteString in less verbose forms; I just fail to see whether this is worth the cost (especially since to me personally I can't remember the last time I encountered this class of error).

PS: There's still the option of leaving the IsString instance as-is and accepting the truncating behaviour as legitimate, if only for the sake of having a total function (at the cost of being non-injective). It's not like this is a "bug" of IsString ByteString; it's a documented behaviour that's been in place for a very long time; and yes, it may bite people; yes it's not aligned with the Builder instance (but you can also argue that the Builder instance having been added later is at fault here for diverging); but you can just as well justify the current semantics being legitimate for a specific tradeoff.

The whole point of the discussion at hand is to gauge the cost/benefit ratios of a change from the current status quo to inform whether a change is worth doing, and most importantly, all choices are merely tradeoffs; there doesn't appear to be one variant which is absolutely superior to the others to me.

Otoh, I partly blame that GHC or the core Haskell standard didn't anticipate the need to have a richer support for annotating the purpose/subtype of string literals (but then again; the Haskell Report considered a "String" merely an alias for [Char]); other languages I've worked with had some kind of charset annotation (and most importantly without requiring the heavy machinery of something like QQ/TH for something so basic); even Python whose typesystem is in a totally different corner of the typesystem space has modifiers to annotate and distinguish regex, bytestrings and text literals...

@parsonsmatt
Author

I've been bitten by changing instances before. The last major release of time changed the instance Read UTCTime silently (accidentally?) to require the trailing time zone. This broke a decent amount of code across the internet with *** Prelude.read: no parse exceptions, which are a bunch of fun to debug.

Personally, I really want the instance IsString ByteString to do UTF-8 decoding. That's what I initially expected, and it seems like the expectation of many other people as well. It is possible that people are relying on the current behavior somehow (accidentally?), but I'd suspect that more people have silent/hidden bugs based on this truncation. That belief is based on the fact that the bug that inspired this issue was in production for quite some time before it was eventually discovered.

A runtime exception with an informative error message would be surprising, but at least it should point to exactly where you need to fix your code. I usually expect to find some weird behavior when I do a major version upgrade (typically a new stackage resolver), and a runtime exception like this would be satisfactory. Especially if that runtime exception noted that a future version of the library would change the behavior to use UTF-8 encoding instead of truncation, and provided a pointer to functions that could either a) have the old truncation behavior or b) have the future encoding behavior.

Removing the instance would be costly, and I wouldn't want to do it without a TypeError that gave you a note saying "Please just use this function instead to fix this error." Unfortunately, we can't attach a warning to an instance - this might be a good GHC feature to add.

I wish there was a good way of reaching out and polling the community about their preferences on this - it seems pretty important!

@hasufell
Member

Then migration would be possible to very easily automate (adding one import).

And almost everyone will have that instance visible again.

Yes, but the module can now follow a normal deprecation schedule.

@enobayram

@szabi

If that is the intended semantics, then IsASCIIString cannot be a constraint for IsNarrowCharString, as there are plenty of 8-bit encodings ("code points are in range 0-255") that are not a superset of ASCII. (e.g. EBCDIC, CCSID 899, ...)

This isn't really about modelling the hierarchy of encodings though. The purpose of the IsString class is to provide a first-class mechanism for customizing the interpretation of the string literals, by extension, the purpose of these IsXXXString classes is to narrow down which literals you are prepared to handle. So, I think it's technically incorrect that "IsASCIIString cannot be a constraint for IsNarrowCharString" because every Haskell literal containing only ASCII characters is also a literal that only contains characters with Unicode code points between 0-255. So, if your type is prepared to consume the latter then it should always be able to consume the former as well. You could argue against this Unicode-centric perspective, but I think that perspective is already deeply rooted for Haskell Strings and Chars anyway. You could also argue that ASCII isn't an important enough subset of the 0-255 range to eternally burn it into the IsString hierarchy and I wouldn't object to that.

@jeremyschlatter

I was just bitten by this issue for the first time. I assumed (so deeply that I didn't even notice I was relying on it as an assumption) that this instance used UTF-8 encoding. I was quite surprised to find out that it did not.

I have no new proposals, just want to record myself as one more person who wishes this instance used UTF-8 encoding.

@phadej
Contributor

phadej commented Mar 8, 2023

I'll also record that I had relied on IsString ByteString instance being able to produce invalid UTF-8, but perfectly valid byte strings.

(Specifically ones produced by Show ByteString!)

@Kleidukos
Member

Yes quite unfortunately ByteString's IsString instance is an extremely bad usage vector for UTF-8 strings. Text should be the prime gateway for these.

@clyring added the "blocked: ghc" label (This is blocked on a feature or primitive not yet available in a released GHC version) on Jun 13, 2023
@adamgundry
Member

GHC 9.10 will support warnings on instances (https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0575-deprecated-instances.rst) and warnings may now include categories (https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0541-warning-pragmas-with-categories.rst). So perhaps in GHC 9.10+ it would be worth adding a custom warning to instance IsString ByteString that warns about this issue?

@nomeata

nomeata commented Dec 5, 2023

What will the warning suggest to use instead? Imagine a user with a file with a few dozen clearly harmless ASCII ByteString literals (e.g. when implementing an ASCII-based protocol) all over the place. Is the suggestion to wrap them all in Data.ByteString.Char8.pack? Or maybe we need a proper form for byte literals before we should bug the poor user with warnings?

@hasufell
Member

hasufell commented Dec 5, 2023

What will the warning suggest to use instead?

ByteString literals are kinda odd, especially since they don't account for the platform.

We might be able to shift to the quasiquoter provided in os-string: https://hackage.haskell.org/package/os-string-2.0.0/docs/System-OsString.html#v:osstr

These modules also provide large sets of functions allowing to convert to and from string (predictably).

String and Text are both "platform agnostic" and are converted to the expected format at the outer ffi layer. With ByteString you're guessing, at least if you interact with FFI (which is one of the primary use cases of the type). OsString fills this gap.

But if you're dealing with e.g. HTTP data or other shenanigans that may have different encodings, you'll have to decide anyway:

So there's no simple answer, because the situation isn't simple.

@adamgundry
Member

What will the warning suggest to use instead? Imagine a user with a file with a few dozen clearly harmless ASCII ByteString literals (e.g. when implementing an ASCII-based protocol) all over the place.

That's the nice thing about warnings with categories - the user can disable the warning at the module level if they are confident their use case is fine as-is.

It now also occurs to me that one can use OverloadedLabels to have #"foo" :: ByteString give a syntax for literals that reports a type error if any of the characters don't fit in 8 bits (https://gist.github.com/adamgundry/a1d050be7508dd0a9289011099535159). Though I worry the compile-time performance may be poor...

@nomeata

nomeata commented Dec 11, 2023

That's the nice thing about warnings with categories - the user can disable the warning at the module level if they are confident their use case is fine as-is.

I understand, but that's not a particularly great experience, since the warning is not precise and the fix of least resistance doesn't really improve the code. (“The code you wrote may or may not have a problem. Jump through these hoops to disable this annoying message”).

I'm wondering if our users are better served with a less principled (because of the weird coupling) but much more useful warning that only warns (or errs!) if the string isn't pure ASCII? Yes, GHC would have to treat types called ByteString specially, and that's ugly. But maybe better?

(Imagine ByteString happens to be defined in Base, like fixed width numbers. We wouldn't hesitate to have special warning support when literals don't fit the type? Having a worse DX as a consequence of which types happen to end up in which packages is maybe not a great design guide, and for most users it's all ”part of Haskell”)

@Bodigrim
Contributor

Yes, GHC would have to treat types called ByteString specially, and that's ugly. But maybe better?

(Imagine ByteString happens to be defined in Base, like fixed width numbers.

This has been discussed above and IIRC the MR was rejected by GHC developers, because they do not want to wire a third-party type into the compiler.

It should not be too difficult to move ByteString definition and instances into base (or ghc-internals or whatever), which would allow GHC to implement such warning.

@Kleidukos
Member

This would be a good step forward in terms of UX indeed!

@vdukhovni
Contributor

FWIW, after much discussion above, my sense is that the problem is not with the bytestring instance of IsString, but rather with over-use of OverloadedStrings in the first place. This extension is being asked to do too much, there are better (more specific/flexible) alternatives...

  • A bytestring is a (pinned for use with FFI) packed byte array. It is NOT a "string".
  • If one wants a byte array that faithfully represents a Unicode text string, well ..., we have Text exactly for that purpose.
    • The Text instance of IsString, will faithfully capture all Unicode code points
    • Text has had, for some time now, a rather cheap conversion to a UTF-8 bytestring requiring only a copy to ForeignPtr to the content of a pinned byte array (because it is already UTF-8 encoded).
  • One way forward is perhaps for bytestring to expose some quasi-quoters that explicitly capture a literal string as either an octet-string or some encoding of a text string.
    [utf8bytes|Le cœur du problème|]    -- utf8
    [octets|Le c\189ur du probl\232me|] -- iso-8859-15

And then gradually discourage all use of OverloadedStrings for bytestring.
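Until such quasi-quoters exist, the Text route mentioned above already works with what text and bytestring export today. A small sketch contrasting it with Char8.pack (the function names utf8Bytes and truncatedBytes are made up for the comparison):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Explicit route: go through Text and encode as UTF-8.
utf8Bytes :: String -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8 . T.pack

-- What Char8.pack (and hence the IsString instance) does instead.
truncatedBytes :: String -> [Word8]
truncatedBytes = B.unpack . BC.pack

main :: IO ()
main = do
  print (utf8Bytes "cœur")       -- [99,197,147,117,114]
  print (truncatedBytes "cœur")  -- [99,83,117,114], i.e. "cSur"
```

U+0153 (œ) keeps only its low byte 0x53 under truncation, which happens to be ASCII 'S'.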

@xnuk

xnuk commented Mar 23, 2024

Is there a workaround for this nowadays? bytestring is still a popular package, OverloadedStrings is still so widely used that some packages even recommend it, and most people here agree that ("안둥쉬크롯,술걷덯칲구뵤!" :: ByteString) == "Hello, world!" is just wrong. It's so easy to misuse. Rather than waiting more years for more discussions, I really need a workaround for disabling instance IsString ByteString.

@vdukhovni
Contributor

My take, per the previous comment, is that someone sufficiently motivated should contribute something along the lines of the suggested octets and utf8bytes quasi-quoters.

The IsString instance cannot possibly meet both the legacy and also desirable use-cases, and lacking a path forward, should just remain unchanged (backwards-compatible). This does mean that with OverloadedStrings code similar to "Жаль" :: ByteString will continue to not produce what the programmer might have intended, and typically :: Text should have been used instead, or some day a new [utf8bytes|...|] quasiquoter.

One precedent is the r quasiquoter for raw strings.

@xnuk

xnuk commented Mar 24, 2024

Quasi quotes are not a solution. Suppose you want to use a decode function that you thought was Text -> _:

decode "날씨 좋은데 나갈까?"

Later, you realize it's ByteString -> _, not Text -> _, and that causes a runtime bug. This could be caught at compile time if there were no IsString ByteString instance.

Quasi quotes do not solve this kind of problem. Because it's (kind of) Template Haskell, replacing "..." with [r|...|] fixes nothing; it's just the same as the previous example (if OverloadedStrings is turned on):

decode [r|날씨 좋은데 나갈까?|]

It looks like [utf8bytes|...|] doesn't solve the problem either. It seems to be syntactic sugar for encodeUtf8 [r|...|], which is a ByteString. But this usage only happens if the user is already careful about the IsString ByteString behavior and already knows it's ByteString -> _. If they don't, they'll stick with decode [r|...|] instead.

These problems would barely happen if the IsString ByteString instance just didn't exist, or if everyone stopped using OverloadedStrings and embraced [Char] -> _ conversions.

should just remain unchanged (backwards-compatible)

"Жаль" :: ByteString will continue to not produce what the programmer might have intended

Why keep broken behavior when you can just remove it?

@Bodigrim
Contributor

Why keep broken behavior when you can just remove it?

Removal of instance IsString ByteString will break countless packages; Stackage will never upgrade to use a new version of bytestring.

@xnuk

xnuk commented Mar 24, 2024

Removal of instance IsString ByteString will break countless packages; Stackage will never upgrade to use a new version of bytestring.

How many Stackage packages will be broken exactly? Can't we make tracking issues for affected packages?

@Kleidukos
Member

@xnuk It's difficult because it's not only the code that they use but also the API that they provide.

@xnuk

xnuk commented Mar 24, 2024

It's difficult because it's not only the code that they use but also the API that they provide.

  1. Packages that rely on an IsString-based API will refuse to compile, so can't we track them that way?
  2. I think it's okay if they just stop recommending direct use of IsString ByteString (and bump the version)?

@clyring
Member

clyring commented Mar 24, 2024

Simply removing the instances without a deprecation period is very unappealing. And the first release of ghc to allow proper deprecation of instances will be ghc-9.10.1, which does not yet exist.

It will be quite some time before most of our users are working with a new-enough compiler to be able to see deprecation warnings at use-sites even if we add them today.

@Kleidukos
Member

@clyring I don't think we should excessively worry about this happening; I trust the maintainers to be extra careful with the timeline, should the proposal go forward.

@Lysxia
Contributor

Lysxia commented Mar 27, 2024

Quasi quotes are not a solution. Suppose you want to use a decode function that you thought was Text -> _ (...)

These problems would barely happen if the IsString ByteString instance just didn't exist, or if everyone stopped using OverloadedStrings and embraced [Char] -> _ conversions.

@xnuk I don't understand your objection to quasiquotes. It might help if you explained what decode does in your examples.

I think everyone here already agrees that IsString ByteString behaves poorly and should be avoided. The remaining question is what the deprecation message should tell people to do instead. I think Viktor's answer makes sense in that context.

The problem with a [Char] -> _ function is that it will only be called at run time. A quasiquote can inspect and validate a literal at compile time, without baking the logic into GHC. This would better reflect the different expectations that people may have from their bytestring literals.
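For contrast, a run-time checked [Char] -> _ conversion might look like the sketch below. The helper packLatin1 is hypothetical (it is not part of bytestring); the point is only that a bad literal is discovered when the call executes, not when the module is compiled.

```haskell
module Main (main) where

import qualified Data.ByteString as BS
import Data.Char (ord)

-- Hypothetical helper: succeeds only if every character fits in one byte.
packLatin1 :: String -> Either String BS.ByteString
packLatin1 s =
  case filter ((> 0xFF) . ord) s of
    [] -> Right (BS.pack (map (fromIntegral . ord) s))
    bad -> Left ("code points above 0xFF: " ++ show bad)

main :: IO ()
main = do
  -- All characters are below 0x100, so this is accepted.
  print (fmap BS.unpack (packLatin1 "bla"))
  -- prints Right [98,108,97]

  -- U+8A9E does not fit in a byte; the failure surfaces only when
  -- this line actually runs.
  putStrLn (either (const "rejected at run time") (const "accepted") (packLatin1 "bla語"))
  -- prints rejected at run time
```

A quasi-quoter can run the same check inside the Q monad and turn the Left case into a compile error.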
