Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conversion routines for FilePath #403

Merged
merged 2 commits into from
Jul 1, 2021
Merged

Conversation

luke-clifton
Copy link
Contributor

Despite FilePath being a type alias for String, the type is actually
quite distinct. An argument of type FilePath is expected to be encoded
using the file system encoding, which, by default, can be converted to a bytestring and
back exactly.

There already exist many incorrect re-implementations of these functions, usually assuming that the bytestrings should be UTF-8 encoded. [1]

I feel like having something more discoverable here might provide some value. What do you think?

Despite FilePath being a type alias for String, the type is actually
quite distinct. An argument of type FilePath is expected to be encoded
using the file system encoding, and can be converted to a bytestring and
back exactly.
@Bodigrim
Copy link
Contributor

Makes sense to me. @sjakobi, what is your take on this?

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There already exist many incorrect re-implementations of these functions, usually assuming that the bytestrings should be UTF-8 encoded. [1]

Is this [1] supposed to be a reference to something? I'd like to see it.

The addition makes sense to me. I think the documentation should explicitly mention the filesystem encoding though.

@luke-clifton
Copy link
Contributor Author

Indeed, I was going to post a list, but decided it was too hard. It's made worse by people wanting to avoid String (and I guess FilePath as well) for performance reasons. So I lot of FilePaths end up being converted to Text which will mess up the encoding. I'll list a few that I found. I'm sure there are plenty more.

I'll expand on the documentation.


shelly for example ends up going via Text, which will mangle FilePaths


cabal-install has it's own function for encoding FilePath

https://github.com/haskell/cabal/blob/master/cabal-install/src/Distribution/Client/Utils.hs#L241-L272

which seems to be the replacement for ByteString.pack

haskell/cabal@a60f439

Though, I haven't followed the code all the way through, so maybe it's important still.


tar-conduit assumes utf-8 encoding:

https://github.com/snoyberg/tar-conduit/blob/master/src/Data/Conduit/Tar/Types.hs#L150-L156


(general confusion about bytestring arguments on unix systems) pcapriotti/optparse-applicative#368


I know of a few more for some closed source projects.

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cheers!

@Bodigrim
Copy link
Contributor

Looks good, but I cannot make my mind whether it belongs to Data.ByteString or Data.ByteString.Char8 or to both?

@luke-clifton
Copy link
Contributor Author

I chose not to put it in Char8 because the encoding used might not be very Char8ish on some platforms. e.g. on windows it depends on the code page being used, which I believe could result in a UTF-16 encoded blob. Best to treat this as a blob of meaningless bytes.

@Bodigrim Bodigrim added this to the 0.11.2.0 milestone Jul 1, 2021
@Bodigrim Bodigrim merged commit 28b235f into haskell:master Jul 1, 2021
@Bodigrim
Copy link
Contributor

Bodigrim commented Jul 1, 2021

Thanks @luke-clifton!

Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Jul 1, 2021
* Add conversion routines for FilePath

Despite FilePath being a type alias for String, the type is actually
quite distinct. An argument of type FilePath is expected to be encoded
using the file system encoding, and can be converted to a bytestring and
back exactly.

* Expand on to/fromFilePath documentation
@joeyh
Copy link

joeyh commented Aug 10, 2021

I'm late to this but it would be more convenient if these new functions did not need IO. My own versions of these use unsafePerformIO.

When I'm writing code that uses ByteString filepaths, I often have places where conversion is often needed in pure code, or in places where lifting to IO would be awkward.

@Bodigrim
Copy link
Contributor

It's probably not too late to redesign; I assume getFileSystemEncoding does not change during program's lifetime.

@joeyh
Copy link

joeyh commented Aug 11, 2021

Well it's possible that someone uses setFileSystemEncoding. Which would break referential transparency, in theory, I suppose..

@luke-clifton
Copy link
Contributor Author

luke-clifton commented Aug 11, 2021

Yeah, the very existence of setFileSystemEncoding makes this require IO. It's unfortunate, as it's something that is pretty likely to stay constant.

I did consider adding an initFileSystemEncoding to GHC.IO.Encoding (see the similar initLocaleEncoding) and then creating a pure version based on that. But that comes with it's own set of potentially confusing issues.

@Bodigrim
Copy link
Contributor

Let's keep it as is then. Client apps, which are in control of encoding (and presumably set it only once), may wish to wrap it into unsafePerformIO, but for a core library it's safer to leave it explicit.

@luke-clifton
Copy link
Contributor Author

Perhaps we could add to the documentation that it's safe to use unsafePerformIO as long as the encoding doesn't change.

@Bodigrim
Copy link
Contributor

@luke-clifton a PR, improving documentation in this aspect, would be much appreciated.

@luke-clifton
Copy link
Contributor Author

Not sure if GitHub alerts on that reference, so... I raised a PR to mention unsafePerformIO in the docs.

@Bodigrim
Copy link
Contributor

Bodigrim commented Sep 7, 2021

@luke-clifton I've been recently running bytestring test suite on a s390x machine with GHC.IO.Encoding.getFileSystemEncoding set to ASCII, which caused fromFilePath >>= toFilePath property to fail with recoverEncode error about a Unicode symbol. As far as I understand, the reason is that on such machine FilePath should have been ASCII-only, but our QuickCheck generator has no idea about it and provides full-fledged Unicode strings. Is my understanding correct? What is the best way to amend tests?

@luke-clifton
Copy link
Contributor Author

Ugh. FilePath really should have been a system dependent newtype...

getFileSystemEncoding is supposed to be a Unicode encoding according to the documentation.

Given that there are systems that don't follow that rule, perhaps selectively disabling those tests might make sense?

The other option would be to change the QuickCheck tests to just a handful of unit tests covering some fairly basic cases or modifying it to produce ASCII strings only. I don't think we really need to be testing it too much given the simplicity of the functions. We can hopefully rely on GHC to be testing the encoding more thoroughly upstream anyway.

@Bodigrim
Copy link
Contributor

Bodigrim commented Sep 8, 2021

I filed a documentation issue upstream https://gitlab.haskell.org/ghc/ghc/-/issues/20344 and pushed a fix for test suite at #419

noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
* Add conversion routines for FilePath

Despite FilePath being a type alias for String, the type is actually
quite distinct. An argument of type FilePath is expected to be encoded
using the file system encoding, and can be converted to a bytestring and
back exactly.

* Expand on to/fromFilePath documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants