Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AFP] provide an (AbstractFilePath -> Data.ByteString.Builder.Builder) #157

Closed
hasufell opened this issue Jun 22, 2022 · 7 comments
Closed

Comments

@hasufell
Copy link
Member

In GitLab by @frasertweedale on Jun 22, 2022, 09:54

The main API only gives you a way to convert to String, but doesn't give you any
convenient and efficient way of getting at the bytes (e.g. to print or send to a
handle).

It is unfortunate to have to dig into System.OsString.Internal.Types and get at
the inner values via the constructors just to print/send AbstractFilePath values
without intermediate conversion to String. Furthermore, without Coercions
(or equivalent) it is tricky to write cross-platform code - because you don't know
if you have PosixString or WindowsString "under the hood".

Ideally there should be a function in the main module that gives you a Builder for
the AbstractFilePath. Perhaps also Maybe Coercions to the underlying types (in the
Internal module).

@hasufell
Copy link
Member Author

In GitLab by @maerwald on Jun 22, 2022, 17:05

It is unfortunate to have to dig into System.OsString.Internal.Types and get at
the inner values via the constructors just to print/send AbstractFilePath values
without intermediate conversion to String

What is the use case to converting to bytestring?

The main reason we make this hard is because windows and unix filepaths have different encoding (and underlying syscall types) and converting to bytes has difficult to understand semantics. If you want to print the bytes, that may be fine, but it may lead people into thinking they can use some bytestring APIs safely.

This way you're forced to use internal modules, at least.

On the other hand, I already added https://hackage.haskell.org/package/filepath-2.0.0.3/candidate/docs/System-AbstractFilePath-Internal.html#v:bytesToAFP so a variant that does the opposite could be added too, but definitely only in internal modules to signal this is unsafe.

Furthermore, without Coercions
(or equivalent) it is tricky to write cross-platform code - because you don't know
if you have PosixString or WindowsString "under the hood".

Absolutely not! The entire point of AFPP is to make writing cross platform code safer.

https://hackage.haskell.org/package/filepath-2.0.0.3/candidate/docs/System-OsString-Internal-Types.html

  1. Windows string is a newtype: newtype WindowsString = WS { unWFP :: BS.ShortByteString }
  2. Posix string is a newtype: newtype PosixString = PS { unPFP :: BS.ShortByteString }
  3. Platform string is a platform ifdefed type synonym, which means you can't accidentally use the wrong constructor on the wrong platform!
#if defined(mingw32_HOST_OS) || defined(__MINGW32__)
type PlatformString = WindowsString
#else
type PlatformString = PosixString
#endif

So matching on WS on a PlatformString on unix is a type error!

This translates to AbstractFilePath, because:

newtype OsString = OsString PlatformString

type AbstractFilePath = OsString

OsString/AbstractFilePath hide those implementation details behind a newtype.

So, if you write cross platform code, you have to do it like this:

fromPlatformStringEnc :: TextEncoding
                      -> AbstractFilePath
                      -> Either EncodingException String
#if defined(mingw32_HOST_OS) || defined(__MINGW32__)
fromPlatformStringEnc winEnc (OsString (WS ba)) = unsafePerformIO $ do
  r <- try @SomeException $ BS8.useAsCStringLen ba $ \fp -> GHC.peekCStringLen winEnc fp
  evaluate $ force $ first (flip EncodingError Nothing . displayException) r
#else
fromPlatformStringEnc unixEnc (OsString (PS ba)) = unsafePerformIO $ do
  r <- try @SomeException $ BS.useAsCStringLen ba $ \fp -> GHC.peekCStringLen unixEnc fp
  evaluate $ force $ first (flip EncodingError Nothing . displayException) r
#endif

This allows to safely write API for AbstractFilePath (whose semantics depend on the current platform), as well as modules that are platform specific and work the same across all platforms (see https://hackage.haskell.org/package/filepath-2.0.0.3/candidate/docs/System-AbstractFilePath-Windows.html).

So if you want to deal with tar archives, you simply use PosixFilePath and the appropriate module. And you can be sure that you're never accidentally pattern matching on a windows filepath.

So no, we don't want any coercions.

I will explain this more in depth in a blog post after release.


Also note: using Data.ByteString.Short on windows filepaths will not work, because they are wide characters (Word16). That's why there is https://hackage.haskell.org/package/filepath-2.0.0.3/candidate/docs/System-AbstractFilePath-Data-ByteString-Short-Word16.html

@hasufell
Copy link
Member Author

In GitLab by @frasertweedale on Jun 22, 2022, 21:21

To clarify, the Coercions I want are Maybe (Coercion AbstractFilePath PosixFilePath) and Maybe (Coercion AbstractFilePath WindowsFilePath). This would allow to write cross-platform code that can produce or manipulate AbstractFilePath in platform-specific ways without the CPP.

Alternatively (equivalently?), something like (PosixFilePath -> a) -> (WindowsFilePath -> a) -> AbstractFilePath -> a or some other way to witness the underlying type.

@hasufell
Copy link
Member Author

In GitLab by @frasertweedale on Jun 22, 2022, 21:57

What is the use case to converting to bytestring?

There must be many. Here's one: browse filesystem, choose a file to attach to mail, set Content-Disposition: attachment; filename=I-was-converted-from-an-AbstractFilePath.pdf.

Granted, you do not want to attach the bytes as-is. In general they must be decoded first, then re-encoded to what the application requires. But filepath only gives a way to convert to String. Wouldn't it be a performance issue to convert to String, then again to Text, or re-encode to a ByteString? Using lazy Text/ByteString won't help as most paths would be shorter than the default chunk size. Only Builders might help (e.g. Data.ByteString.Builder.stringUtf8) as the intermediate [Char] can be consumed lazily.

So thinking this through more, I agree that offering a conversion to ByteString is not something this library ought to do, except perhaps the likely very common case of utf8. An efficient conversion to Text would seem useful but I note that filepath does not depend on text and I doubt this is a strong enough reason to add it.

Maybe all of these performance issues are all in my head as paths are typically not that long, and all those fully evaluated Strings resulting from conversions to Text or other types will immediately get eaten by the GC.

@hasufell
Copy link
Member Author

In GitLab by @maerwald on Jun 22, 2022, 22:34

My thinking was:

  1. The general user should not be required to deal with these low-level things
  2. When you have such platform specific code, chances are high you'll need bindings from unix or Win32. You will already need CPP to import these, because they're incompatible with the other platform.

I'm not sure I entirely understand how Coercion API will work here. The alternative sounds possible to implement, but as said, I'm not sure that's really more ergonomic. I wouldn't mind adding that to internal modules though.

@hasufell
Copy link
Member Author

In GitLab by @maerwald on Jun 22, 2022, 22:43

Ok, so from what I understand, you basically want a function like this:

reEncode :: (TextEncoding, TextEncoding)  -- ^ unix    (decoder, encoder)
         -> (TextEncoding, TextEncoding)  -- ^ windows (decoder, encoder)
         -> AbstractFilePath
         -> Either EncodingException ShortByteString

So basically re-encoding. E.g. if you want all filepaths to be converted to UTF-8, regardless of platform. And then you want the intermediate common representation String to fuse, so that you don't pay for it.

ByteString is really boring. It's meant for large FFI blobs. Let's stick to ShortByteString.


I don't think we can fuse it, because we can't use accursedUnutterablePerformIO since we're using bracketed functions for encoding/decoding like withCStringLen/useAsCStringLen, which would leak.

@hasufell
Copy link
Member Author

In GitLab by @frasertweedale on Jun 22, 2022, 22:49

Fair enough. The more I think about it, I think what I'm asking for is premature optimisation and I just ought to use the AFP as it spreads through enough of the core libraries. The pain points and performance issues, if any, will reveal themselves in time.

Thanks for all your feedback; I'll close this ticket now.

@hasufell
Copy link
Member Author

In GitLab by @maerwald on Jun 23, 2022, 05:42

Well, I'm glad for input. Not many users so far have commented on this whole thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant