Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract filepath proposal (request for help) #35

Closed
hasufell opened this issue Jun 19, 2022 · 37 comments
Closed

Abstract filepath proposal (request for help) #35

hasufell opened this issue Jun 19, 2022 · 37 comments

Comments

@hasufell
Copy link
Contributor

hasufell commented Jun 19, 2022

Motivation

The abstract filepath proposal strives to improve filepath handling in Haskell. The original proposal is written down in the GHC wiki and suggested implementing this in base.

Since this would break a lot of real world code, we (CLC and filepath maintainers) decided to implement this in "user-space", as in: only in filepath (a core library shipped with GHC). That means base will not use these new filepath types for the foreseeable future, but other core-libraries can.

The main benefits compared to type FilePath = String are:

  1. it is more efficient and avoids memory fragmentation (uses unpinned 'ShortByteString' under the hood)
  2. it is more type-safe (newtype over 'ShortByteString')
  3. avoids round-tripping issues by not converting to String (which is not total and loses the encoding)
  4. abstracts over unix and windows while keeping the original bytes

Previous discussions

Previous work

  1. extending bytestring API wrt ShortByteString: Merge shortbytestring package back into bytestring wrt #444 haskell/bytestring#471
  2. AFPP merged into filepath: https://gitlab.haskell.org/haskell/filepath/-/merge_requests/103

Current state

  1. a release candidate is available: https://hackage.haskell.org/package/filepath-2.0.0.3/candidate
  2. a companion library that allows file reading/writing is available here: https://github.com/hasufell/file-io
  3. pending Win32 PR: Add WindwowsString/WindowsFilePath support wrt AFPP haskell/win32#198
  4. pending unix PR: Add PosixFilePath and friends support (for AFPP) haskell/unix#202
  5. GHC issue that needs to be solved: https://gitlab.haskell.org/ghc/ghc/-/issues/21738

The code is ready for release, PRs 3. and 4. can be merged in a timely manner (those are the minimum requirements to be able to use it).

Next steps

The idea is to get this release into GHC 9.4 or 9.6 (@Bodigrim I actually forgot which one). Users on older GHC will however be able to upgrade to the newer filepath anyway, because it's not tied to base.

After that, we need to market this new API and request core libraries maintainers to support AFPP (either as additional API or as the main API).

Some dicussions and PRs:

  1. Add AFPP support haskell/directory#136 (has gone stale)
  2. Support AbstractFilePath haskell/directory#138
  3. Support AbstractFilePath haskell/process#252
  4. AbstractFilePath support kowainik/relude#401
  5. Support AbstractFilePath commercialhaskell/path#189

This isn't a formal tech proposal, since the new API has been merged and will be part of the next filepath release. However, migrating the ecosystem can only be a joint effort. I'd imagine the next packages that need adoption are directory and process at least. I can't do all these migrations on my own, unless I intend to burn out.

I'm not sure what exactly I'm asking, but I talked with @david-christiansen during ZuriHac that this needs wider support. So I'm dropping this thread here.

I intend to write a blog post after the release that summarizes the work done, outlines how to use the new API and suggests migration strategies.

@Bodigrim
Copy link
Collaborator

The idea is to get this release into GHC 9.4 or 9.6 (@Bodigrim I actually forgot which one).

It's GHC 9.6.

@Bodigrim
Copy link
Collaborator

FWIW IMO AFPP is "the next big thing" for Haskell core libraries, huge thanks to @hasufell for leading this project.

@frasertweedale
Copy link
Contributor

Love your work, this looks great. I think there's a need for a AbstractFilePath -> Data.ByteString.Builder.Builder in the main module, and I filed a ticket to track that.

@Ericson2314
Copy link
Contributor

Ericson2314 commented Jun 22, 2022

filesystem has a copy of ShortByteString inlined into it? Err I see it still depends on bytestring, but System.AbstractFilePath.Data.ByteString.Short is no mere reexport. I am confused.

Also I opened up https://gitlab.haskell.org/haskell/filepath/-/issues/117 with some questions that might be good to resolve before the release.

@hasufell
Copy link
Contributor Author

filesystem has a copy of ShortByteString inlined into it? Err I see it still depends on bytestring, but System.AbstractFilePath.Data.ByteString.Short is no mere reexport. I am confused.

It is a re-export. What you're looking at is a Word16 based API variant that was not accepted in bytestring, because it's too specific. This one is needed to deal with wide character bytestrings (UCS-2LE) on windows: https://gitlab.haskell.org/haskell/filepath/-/blob/master/System/AbstractFilePath/Data/ByteString/Short/Word16.hs

@Ericson2314
Copy link
Contributor

Mmm IMO having to conform to bytestring's scope is another reason why it was a mistake for ShortByteString to be merged back into bytestring.

If it was a separate repo, it would be very easy to use some build time tricks to get e.g. short-bytestring and short-bytestring-16 libraries built deduplicated code, and then bytestring can only only depend on short-bytestring not short-bytestring-16.

@mpickering
Copy link

Has it been considered not doing a major version bump to filepath-2.0?

It is clearly an improvement to be able to use this abstract datatype for FilePath manipulation but I am worried that it will cause disruption in the ecosystem for a project which doesn't yet have significant buy-in when the same feature could be introduced with a minor version bump.

The API of FilePath has been largely stable for a number of years so disrupting this I think requires great care.

@Ericson2314
Copy link
Contributor

I agree major bumps without breaking changes to force them are probably not worth the churn. It would be nice if there was a separate mechanism to signal "exciting release!" than that to signal "breaking release!", but that doesn't yet exist.,

@hasufell
Copy link
Contributor Author

Has it been considered not doing a major version bump to filepath-2.0?

It's possible. My reasoning was mainly to get more attention. And a major bump will definitely do that.

PVP doesn't disallow this, at least. @Bodigrim opinions?

@hasufell
Copy link
Contributor Author

hasufell commented Jun 22, 2022

Mmm IMO having to conform to bytestring's scope is another reason why it was a mistake for ShortByteString to be merged back into bytestring.

If it was a separate repo, it would be very easy to use some build time tricks to get e.g. short-bytestring and short-bytestring-16 libraries built deduplicated code, and then bytestring can only only depend on short-bytestring not short-bytestring-16.

Yes, I suggested something along those lines here haskell/bytestring#444

This was rejected and I think we're fine for now.

@Ericson2314
Copy link
Contributor

Yes I said "another" because I had already offered a different reason there :). I would strongly advocate revisiting that decision.

@hasufell
Copy link
Contributor Author

I've added some examples that prove that the current base library is unsound wrt filepaths when interacting with CP932 (which is the most common japanese encoding it seems): https://gist.github.com/hasufell/c600d318bdbe010a7841cc351c835f92

This also demonstrates that abstract filepath fixes these issues.

@Bodigrim
Copy link
Collaborator

It's possible. My reasoning was mainly to get more attention. And a major bump will definitely do that.

PVP doesn't disallow this, at least. @Bodigrim opinions?

Correct, PVP does not prohibit bumping a major version even if there is no material breakage.

Dunno, as long as a downstream package compiles, more often than not maintainers just slap a revision to relax upper bounds and be with it, without exploring new API available.

@tomjaguarpaw
Copy link
Contributor

Yes, if you want to get more attention I recommend just telling people. Trying to communicate through major version numbers seems likely to be ineffective. I can't imagine many users doing more than just bumping the upper bound, as @Bodigrim says.

@hasufell
Copy link
Contributor Author

https://www.reddit.com/r/haskell/comments/vivjdo/abstract_filepath_coming_soon/

@goldfirere
Copy link
Collaborator

The Reddit conversation has a decent amount of discussion about the name of the new type. (Writing this here because I find these comments are more easily found than Reddit ones.) I won't put my thumb on the scales of that discussion -- mostly because I don't trust my own instincts around naming -- but I observe that if we had local modules, this would be much easier. I've long exhausted my time budget in working on that proposal, but I would be thrilled if someone took over and got it over the line. I'm very happy to advise about what the next steps there are (but please post there, not here).

@hasufell
Copy link
Contributor Author

@goldfirere I went with OsPath just FYI: https://gitlab.haskell.org/haskell/filepath/-/issues/120

@emilypi
Copy link
Contributor

emilypi commented Jul 7, 2022

I'd rather have AFPP in with the current incarnation than wait for GHC Proposals to land (a minimum of what.. 2 years ish?), and the current spec doesn't really preclude a migration to any later GHC proposals that may land and make the authorship experience cleaner. The end user is the one that matters. Whatever you need @hasufell

@Rufflewind
Copy link

Is instance IsString OsPath in scope of the proposal? Are there any thoughts on how users would construct OsPath from literals without doing some complicated dance involving encodeUtf/FS?

It may be valuable to write a tutorial for a beginner in Haskell and see if there are any wrinkles there.

@hasufell
Copy link
Contributor Author

Is instance IsString OsPath in scope of the proposal? Are there any thoughts on how users would construct OsPath from literals without doing some complicated dance involving encodeUtf/FS?

There's no way to write a total IsString OsPath instance. You either have to truncate like the bytestring instance, which is heavily debated or you have to accept that fromString may crash via error.

It may be valuable to write a tutorial for a beginner in Haskell and see if there are any wrinkles there.

That's already done: hasufell/hasufell.github.io#4

@tomjaguarpaw
Copy link
Contributor

And to perhaps make the point more starkly, if there were a safe IsString instance then we wouldn't even need OsPath, right?

@hasufell
Copy link
Contributor Author

hasufell commented Jul 10, 2022

And to perhaps make the point more starkly, if there were a safe IsString instance then we wouldn't even need OsPath, right?

The main problem is that there is encoding/decoding at the FFI boundary, converting between CString and String.

This encoding depends on getFileSystemEncoding on unix and can cause horrendous bugs. There is a way to make the decoding CString -> String total (via PEP-383 and enforcing UTF-8), but that causes other issues. Even with this encoding enforced, you can't get a total conversion from String -> CString, because Char includes surrogate chars, which are rejected by strict UTF-8 and only a subset of those are used by the lax PEP-383 UTF-8 escaping trick.

All of this is explained in the linked blog post, which will be posted after the final filepath release.

To put it differently: even if we changed the IsString class to be safe (e.g. by failing at compile time for literals and returning Either at runtime), we would still want OsPath.

@Rufflewind
Copy link

you have to accept that fromString may crash via error.

  • Doesn't that only happen if you try to deliberately put surrogates into a literal string? That seems like an extremely rare edge case that shouldn't happen in practice, so an error seems like a more reasonable tradeoff over sacrificing ergonomics.

    Since the fromString is intended to be called on literals, any unexpected surrogates should be detected when that section of the code is evaluated for the first time, so it seems rather unlikely for such errors to go undetected. Some warnings in the documentation to clarify the limitations of the IsString instance would be valuable of course.

  • Using a quasiquoter seems quite overkill for this. Is TemplateHaskell widely adopted these days or still somewhat niche?

  • If we can't use OverloadedStrings for ostensibly simple cases like this then it seems like the OverloadedStrings feature is a bit lacking. Maybe if OverloadedStrings had the ability to detect invalid conversions at compile time we wouldn't need all this baggage?

@Ericson2314
Copy link
Contributor

@Rufflewind I would say first make it correct, then worry about ergonomics. Ergonomics are important but worrying about everything at once just makes for a confusing conversation. And correctness is the proper foundation.

If we can't use OverloadedStrings for ostensibly simple cases like this then it seems like the OverloadedStrings feature is a bit lacking.

Yes exactly. We need to rethink string and list literals for a variety of reasons and we should do that, but we should do that after this so as to build concensus on what the use-cases that are not handled are.

Maybe if OverloadedStrings had the ability to detect invalid conversions at compile time we wouldn't need all this baggage?

We would still need OsPath for the reasons @hasufell says: there is nothing wrong with OS paths alone, they just are not the same as plain text.

@Rufflewind
Copy link

Ergonomics are important but worrying about everything at once just makes for a confusing conversation.

What is proposed here is an alternative to an existing system. Migration is already hard, but convincing people to migrate to a less ergonomic solution is even harder. More likely, people will inertially cling to the older and broken solution and "correctness" of the ecosystem as a whole will not improve as one would expect.

We would still need OsPath for the reasons @hasufell says: there is nothing wrong with OS paths alone, they just are not the same as plain text.

I never advocated for the removal of OsPath.

@Ericson2314
Copy link
Contributor

Even if no one migrates, this is still better to build consensus about what is needed from better overloaded literals. But, I think people will migrate.

The design here is very much in line with what Python, Rust, etc. have done: we're not breaking new ground. People use the stuff over there without complaint also with less than perfect ergonomics, and so they should with Haskell too.

@Rufflewind
Copy link

The design here is very much in line with what Python, Rust, etc. have done:

  • Python accepts plain string literals just fine: open('/tmp/myfile.txt'). Surrogates will result in UnicodeEncodeError at run time.
  • Rust has a total OsStr::new constructor, by virtue of str / char being free of surrogates. Most Rust APIs accept impl AsRef<OsStr>, so you can just pass in literals without even invoking the constructor directly.

@hasufell
Copy link
Contributor Author

Doesn't that only happen if you try to deliberately put surrogates into a literal string? That seems like an extremely rare edge case that shouldn't happen in practice, so an error seems like a more reasonable tradeoff over sacrificing ergonomics.

Since the library is about improving correctness... adding fromString seems largely backwards. Users of libraries can re-implement this as an orphan instance or a custom function, can decide to truncate non-ascii chars or use replacement chars for surrogates etc.

Until fromString supports expressing failure, I don't want to add another defunct instance that can crash users code.

Since the fromString is intended to be called on literals, any unexpected surrogates should be detected when that section of the code is evaluated for the first time, so it seems rather unlikely for such errors to go undetected. Some warnings in the documentation to clarify the limitations of the IsString instance would be valuable of course.

fromString can be called on dynamic values just fine. It's more general than OverloadedStrings. That's just one use case.

Using a quasiquoter seems quite overkill for this. Is TemplateHaskell widely adopted these days or still somewhat niche?

TH can cause issues with cross-compilation. That's why I avoid them in core libraries. In applications this should be largely fine. They allow us to fail at compile time. We can even provide quasi quoters that only accept ascii literals.

If we can't use OverloadedStrings for ostensibly simple cases like this then it seems like the OverloadedStrings feature is a bit lacking. Maybe if OverloadedStrings had the ability to detect invalid conversions at compile time we wouldn't need all this baggage?

Well, this is not in my scope. I'm not going to work on GHC improving OverloadedStrings to get this project finished.

What is proposed here is an alternative to an existing system. Migration is already hard, but convincing people to migrate to a less ergonomic solution is even harder. More likely, people will inertially cling to the older and broken solution and "correctness" of the ecosystem as a whole will not improve as one would expect.

That's fine. If people don't want to migrate, then they don't. But many users have expressed interest. That's one of the reasons filepath-bytestring exists. Except that's very hard to get right in a cross-platform manner. This proposal fixes that. And provides very convenient and clearly documented functions for conversion.

@hasufell
Copy link
Contributor Author

The design here is very much in line with what Python, Rust, etc. have done:

  • Python accepts plain string literals just fine: open('/tmp/myfile.txt'). Surrogates will result in UnicodeEncodeError at run time.
  • Rust has a total OsStr::new constructor, by virtue of str / char being free of surrogates. Most Rust APIs accept impl AsRef<OsStr>, so you can just pass in literals without even invoking the constructor directly.

Yes, but haskell has type String = [Char] and Char is a unicode codepoint, not a scalar value. We can't fix that either. And I'm not interested in fixing GHC or the Haskell report. That's outside of the scope of this proposal.

@nomeata
Copy link

nomeata commented Jul 11, 2022

More likely, people will inertially cling to the older and broken solution and "correctness" of the ecosystem as a whole will not improve as one would expect.

That's fine. If people don't want to migrate, then they don't. But many users have expressed interest.

Sounds like we are in a “Correctness - Ergonomics - No Fragmentation, choose 2” - situation? Are libraries with APIs dealing with paths now also expected to provide all functions twice, to please either crowd, and if they don't, programmers will have to (perceived sillily) convert? This sounds like a repetition of the String/Text/Lazy Text situation?

It may be distracting, and probably hard, but a API that is easy enough that can and will be used in then common case, while still improving correctness in the corner case, so that we wouldn't need the other one (save for backwards compat) would be the bigger prize.

More concretely about OverloadedStrings: my impression is that partiality of fromString is an accepted fact and that the explicit use of it on dynamic data is not the intended use case. With that I would not be surprised about an instance of it for abstract file paths, and may be surprised that it is not. But I admit that this is contestable.

@hasufell
Copy link
Contributor Author

More likely, people will inertially cling to the older and broken solution and "correctness" of the ecosystem as a whole will not improve as one would expect.

That's fine. If people don't want to migrate, then they don't. But many users have expressed interest.

Sounds like we are in a “Correctness - Ergonomics - No Fragmentation, choose 2” - situation? Are libraries with APIs dealing with paths now also expected to provide all functions twice, to please either crowd, and if they don't, programmers will have to (perceived sillily) convert? This sounds like a repetition of the String/Text/Lazy Text situation?

Library authors can decide to support only OsPath, since users can convert to and from String easily. See the API: https://hackage.haskell.org/package/filepath-1.4.99.5/candidate/docs/System-OsPath.html#g:2

I don't know what you mean with repetition of Text. Text is not the same as String and definitely not the right choice for filepaths. If you want a language with only one String type, Haskell is probably not the right choice. There is more than one String type. Haskell is about types.

It may be distracting, and probably hard, but a API that is easy enough that can and will be used in then common case, while still improving correctness in the corner case, so that we wouldn't need the other one (save for backwards compat) would be the bigger prize.

I'm repeating myself, because this has been discussed over a year ago already. This proposal doesn't deprecate FilePath for backwards compatibility reasons. The original proposal is over a decade old and clearly, such a breaking change is impossible to sell to the community and I have no interest in doing so.

If you think that's a good idea, you're free to create your own proposal.

The work I've put into this approach is already borderline, given that it looks like a small change. I don't want to imagine how much work it is to change base. Feel free to try, though.

More concretely about OverloadedStrings: my impression is that partiality of fromString is an accepted fact and that the explicit use of it on dynamic data is not the intended use case. With that I would not be surprised about an instance of it for abstract file paths, and may be surprised that it is not. But I admit that this is contestable.

The point of OsPath is not only to fix correctness issues, but also educate users. The lack of OverloadedStrings may be beneficial, since it forces users to think about their flepath literals. Unicode filepath literals are already questionable. They'll have to read the API. With quasiquoters we can make things more explicit and users can still define their own IsString orphan instance and choose the behavior they want. And there are multiple possible choices.

@hasufell
Copy link
Contributor Author

Released: https://hackage.haskell.org/package/filepath-1.4.100.0

Blog: https://hasufell.github.io/posts/2022-06-29-fixing-haskell-filepaths.html

@kokobd
Copy link

kokobd commented Jul 18, 2022

Has anyone successfully built filepath-1.4.100.0 when your project depends on ghc or ghc-boot?

Many GHC bundled libraries do not have new versions released to Hackage (for example ghci and hpc, their versions on hackage is very old). Thus, cabal solver fails if I add filepath ^>= 1.4.100.0 to a project that depends on ghc, which then depends on ghci. The installed ghci links against filepath-1.4.2.2 in this case, and cabal can not rebuild a ghci with filepath ^>= 1.4.100.0, as the source code is not found on hackage.

@hasufell
Copy link
Contributor Author

@kokobd ghc-9.6 is supposed to ship with the new filepath

@hasufell
Copy link
Contributor Author

I'm also experimenting with an adjusted GHC-8.10.7 source dist that includes filepath-1.4.100.0. If it builds fine, I'll upload it somewhere.

@Ericson2314
Copy link
Contributor

IMO old GHC + newer libraries should be a thing we do on a regular basis, so long as we are supporting older GHCs for all. There should be regular workflow for that.

@hasufell
Copy link
Contributor Author

This is done.

GHC 9.6 will hopefully ship with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests