Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gather code and design reviews #10

Open
hasufell opened this issue Aug 17, 2021 · 20 comments
Open

Gather code and design reviews #10

hasufell opened this issue Aug 17, 2021 · 20 comments
Labels
discussion help wanted Extra attention is needed task

Comments

@hasufell
Copy link
Owner

hasufell commented Aug 17, 2021

@hasufell hasufell added help wanted Extra attention is needed discussion task labels Aug 17, 2021
@hasufell
Copy link
Owner Author

pinging some folks @snoyberg @emilypi @Kleidukos @sclv @gbaz @phadej @bgamari

@bgamari
Copy link
Contributor

bgamari commented Aug 17, 2021

Ccing @Mistuke

Thanks for putting the time into pushing this forward, @hasufell! Some initial thoughts:

  • Should AbstractFilePath rather be a newtype? Afterall, not all strings are paths. If it were a newtype we would need to also provide oss :: QuasiQuoter and associated combinators
  • Is the AbstractFilePath module expected to be imported qualified? If so, I think that:
    • (un)?packAFP should rather be called (un)?pack (for consistency with bytestring and text)
    • joinPath should just be called join
    • splitPath should just be called split
  • IMHO bsToAFP shouldn't be abbreviated; rather it should be spelled byteStringToAFP (for consistency with, e.g., the Data.ByteString.Builder combinators
  • the semantics of makeValid are very vague; if we want this operation it should be well-documented and stable
  • the name OsWord is a bit confusing since it actually refers to a codepoint (although under POSIX it is a codepoint in the trivial ASCII encoding)
  • Shouldn't fromChar return a Maybe? (e.g. what is fromChar 'ᝆ' under POSIX?)
  • Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.
  • What representation, if any, are named streams given on Windows?
  • Nit: s/windows/Windows

@hasufell
Copy link
Owner Author

hasufell commented Aug 17, 2021

@bgamari

Should AbstractFilePath rather be a newtype? Afterall, not all strings are paths. If it were a newtype we would need to also provide oss :: QuasiQuoter and associated combinators

I had given this some thought, the options were as follows:

  1. A type synonym: type AbstractFilePath = OsString (or vice versa). This is similar to type FilePath = String. The disadvantage is that we lose the type level distinction and it's more like a hint.
  2. Use another newtype. This may be unergonomic, given that they actually maintain the same invariants. AbstractFilePath does not ensure filepath validity anyway (since there is no definite list of what is allowed, only hints. Even the filesystem may have an impact on what is valid. As such we only maintain the encoding). This approach may be more interesting if AbstractFilePath had more invariants, such as relative vs absolute on type level (compare with Path and hpath libraries).
  3. Provide only one type for both use cases and rename to e.g. OsString.
  4. Convert non-filepath things to String or Text. This kinda goes against the original proposal of not losing the encoding information and not roundtripping through types.

So my argument is: AbstractFilePath doesn't maintain further restrictions over OsString and it probably shouldn't (since these are on e.g. windows fuzzy and not strict, depend on filesystem used, possibly windows version etc etc).
This library/proposal is only about encoding correctness and memory efficiency. Having more guarantees about filepaths on type level is out of scope, IMO. A user-space library can provide these things (there are hpath, path and others and they make different trade-offs, the design space isn't small).

Is the AbstractFilePath module expected to be imported qualified?

I have no strong opinion, what's your verdict?

the semantics of makeValid are very vague; if we want this operation it should be well-documented and stable

True. However, most (not all) functions are ported from filepath: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Posix.html#v:makeValid

the name OsWord is a bit confusing since it actually refers to a codepoint (although under POSIX it is a codepoint in the trivial ASCII encoding)

What name do you suggest?

Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.

Yes and I believe it does. The filepath package handles it the same way: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Windows.html#v:pathSeparators

Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:

"The name of the file or device to be created or opened. You may use either forward slashes (/) or backslashes (\) in this name."

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

@bgamari
Copy link
Contributor

bgamari commented Aug 17, 2021

So my argument is: AbstractFilePath doesn't maintain further restrictions over OsString and it probably shouldn't (since these are on e.g. windows fuzzy and not strict, depend on filesystem used, possibly windows version etc etc).

Fair.

Is the AbstractFilePath module expected to be imported qualified?

I have no strong opinion, what's your verdict?

I think it should be treated as a qualified import.

True. However, most (not all) functions are ported from filepath: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Posix.html#v:makeValid

Fair, but I think we should take this opportunity to clarify the semantics (or at least clearly document that the function will merely make a fuzzy "best-effort" attempt at fixing up the path.

What name do you suggest?

OsChar perhaps? This slightly conflates characters and codepoints, but this is a fairly common conflation and I would argue that it's clearer than OsWord.

Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:

Ahh, good catch.

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

See https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams. In short, files in Windows are in fact more like directories, containing multiple named "streams" of content (vaguely reminiscent of extended attributes in the POSIX world). The "default" stream is named $DATA. As far as I know this is a quite rarely used feature so we would probably be fine with ignoring it (as, I believe, filepath does).

@Mistuke
Copy link

Mistuke commented Aug 17, 2021

Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.

Yes and I believe it does. The filepath package handles it the same way: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Windows.html#v:pathSeparators

Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:

"The name of the file or device to be created or opened. You may use either forward slashes (/) or backslashes () in this name."

This is only true when using the APIs in "legacy" mode. In "legacy" mode the Win32 API calls will translate paths from their legacy form e.g. C:\Windows\foo into the form the driver actually understands \\?\C:\Windows\foo. One of the main differences is that the path is no longer pre-processed by the user-mode part of the Window API[0].

This results in accesses being faster, but also in an application not being restricted to legacy issues (e.g. MAX_PATH). But this also means various things become invalid, primarily that / in paths are no longer seen as path separators and that relative paths are invalid, or even just using . and .. in paths as the driver is not aware of the program's PWD so it cannot possibly resolve them.

Haskell programs use these filepaths internally, but at the moment we constantly have to translate between the "legacy" ones and the ones we use internally. I would expect a new API to hopefully remove this requirement.

Note that Windows has a number of these namespaces \\?\, \\.\, \\??\, \Device\ etc. Also drive letters are a user-mode concept, so a path such as \\?\Volume{4321eefsf-57fa-44sc-a5b4-d3bf28a54dft}\Windows\foo is perfectly valid.

A modern version of GHC will accept these and do the right thing, but filepath can't be used to maintain them so we do so in very low levels of the runtime.

[0] https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

They're represented by using :<stream-name> after the file name. So foo.c:h1 and foo.c:h2 refer to the same file but different streams. Currently filepath doesn't handle them which causes failures when doing things such as

Prelude System.FilePath> takeExtension "hello.c:foo"
".c:foo"

which is invalid.

@Mistuke
Copy link

Mistuke commented Aug 17, 2021

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

See https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams. In short, files in Windows are in fact more like directories, containing multiple named "streams" of content (vaguely reminiscent of extended attributes in the POSIX world). The "default" stream is named $DATA. As far as I know this is a quite rarely used feature so we would probably be fine with ignoring it (as, I believe, filepath does).

Yes, filepath largely ignores it, but does corrupt things when they're used. e.g takeExtension for instance is broken. They are actually used more often then one would think. For instance have you ever wondered how Windows knows when a file was downloaded from the internet with a Microsoft browser? It's data stored in a stream. Though I would agree that it's a pretty obscure feature :) I'd argue the package doesn't have to be able to manipulate it. just do things correct.

The blame for this feature though lies with Mac. Data streams were added to NTFS to support interop with Macs. Macs also support this concept in HFS. The neutral term for this is called File Forks. See https://en.wikipedia.org/wiki/Fork_(file_system)

@phadej
Copy link

phadej commented Aug 17, 2021

I don't like the design where depending on the build target the API changes. I'd need modules for Windows and POSIX paths (like filepath has). Having a module which is either depending on the target system is fine and e.g. directory primitives would use it. It's common enough that a *nix program has to process windows file paths and vice versa, but doesn't do any IO with them.

In perfect world e.g. tarball/zipfile internal paths could be fitted into the same story, etc. though POSIX paths are close enough, maybe?

@hasufell
Copy link
Owner Author

hasufell commented Aug 17, 2021

@phadej

I'd need modules for Windows and POSIX paths

There is. Half of the library is dealing with exactly this concern:

@hasufell
Copy link
Owner Author

@Mistuke @bgamari

I would expect a new API to hopefully remove this requirement.

Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs

@Mistuke
Copy link

Mistuke commented Aug 18, 2021

I don't like the design where depending on the build target the API changes. I'd need modules for Windows and POSIX paths (like filepath has). Having a module which is either depending on the target system is fine and e.g. directory primitives would use it. It's common enough that a *nix program has to process windows file paths and vice versa, but doesn't do any IO with them.

In perfect world e.g. tarball/zipfile internal paths could be fitted into the same story, etc. though POSIX paths are close enough, maybe?

But the reality is that this doesn't match the real world. At some point you have to have platform abstractions. Not doing so is how Windows builds ended up being so much slower than the rest.

Assuming a POSIX view of the world has caused so much trouble on non-POSIX systems that really I wish we wouldn't. The world is just not POSIX.

To me doing this honestly provides no extra benefits at all over using paths as strings. I have to heavily process them anyway for any use, so I might as well use a representation that's easily read from C then.

@Mistuke
Copy link

Mistuke commented Aug 18, 2021

@Mistuke @bgamari

I would expect a new API to hopefully remove this requirement.

Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs

@hasufell I don't believe anything has to necessarily change from the API point of view (obviously ideally a path doesn't have to keep repeatedly testing what kind of path it is).

But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.

It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1 is perfectly valid and points to the first COM port.

It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\ but then makeValid doesn't seem to.

So it's the inconsistencies that is problematic. What I'm asking for is that any path out of the functions are semantically and syntactically valid..

@Mistuke
Copy link

Mistuke commented Aug 18, 2021

@hasufell so having read the proposal, (I believe that I've discussed this with @hvr before) I really like that the filepaths on Windows are represented as Utf16. This will save an immense amount of work..

The only thing I am wondering about is whether OsString should be a datatype so you can have multiple constructors or a different way to represent what kind of Path something is.

The rational is that today whenever a native Windows path is used I can't assume that each manipulation by filepaths keeps a valid path. Which means at the usage point I have to ensure valid.

But if I can assume that usages of afp keeps them valid that'll save a lot of processing. Of course I don't want that processing to move into afp (so that one has to check what kind of path something is on every command).

If not possible then fine, the Utf16 representation is already a major improvement!

@hasufell
Copy link
Owner Author

hasufell commented Aug 18, 2021

@Mistuke

@hasufell so having read the proposal, (I believe that I've discussed this with @hvr before) I really like that the filepaths on Windows are represented as Utf16. This will save an immense amount of work..

The only thing I am wondering about is whether OsString should be a datatype so you can have multiple constructors or a different way to represent what kind of Path something is.

I don't think so. OsString is an abstraction around the current platforms syscall encoding.

If you want different codepaths, you can use WindowsString and PosixString respectively.

If you write a low-level library based on AFPP, you simply ifdef around the platform and both codepaths use different inner cosntructors WS and PS, see here

toPlatformString :: String -> PLATFORM_STRING
#ifdef WINDOWS
toPlatformString = WS . encodeUtf16LE
#else
toPlatformString = PS . encodeUtf8
#endif

You can't mix them up if you're wrapping them inside OsString, it will be a compile error. That's the point.

The rational is that today whenever a native Windows path is used I can't assume that each manipulation by filepaths keeps a valid path. Which means at the usage point I have to ensure valid.

But if I can assume that usages of afp keeps them valid that'll save a lot of processing. Of course I don't want that processing to move into afp (so that one has to check what kind of path something is on every command).

If not possible then fine, the Utf16 representation is already a major improvement!

On unix, only NUL bytes are invalid. Everything else goes.

From my research, windows doesn't have a definite answer about what is a valid filepath and it may depend on filesystem, windows API or the phase of the moon. As such I don't think this library should try to enforce such properties. You should be free to pass junk to syscalls if you like so.

https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions looks like just conventions, but some invalid paths may still succeed, but then fail in other ways. This is up to user space library to figure out.

@hasufell
Copy link
Owner Author

hasufell commented Aug 18, 2021

@Mistuke

@Mistuke @bgamari

I would expect a new API to hopefully remove this requirement.

Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs

@hasufell I don't believe anything has to necessarily change from the API point of view (obviously ideally a path doesn't have to keep repeatedly testing what kind of path it is).

But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.

It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1 is perfectly valid and points to the first COM port.

It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\ but then makeValid doesn't seem to.

So it's the inconsistencies that is problematic. What I'm asking for is that any path out of the functions are semantically and syntactically valid..

You clearly know more about these things than me, so I'd appreciate a PR.

I took filepath simply as a base. I'm fine to deviate from it.

@hasufell
Copy link
Owner Author

hasufell commented Aug 22, 2021

@hasufell
Copy link
Owner Author

I've finished the Win32 migration (only the subset of the functions that had their API changed are different, the rest are re-exports): 5256773

@hasufell
Copy link
Owner Author

hasufell commented Nov 2, 2021

CCing @Bodigrim

I'll re-iterate and update the state of the efforts briefly.


Context

So far, the CLC has "ack'ed" these efforts, but I have no explicit consent from Win32 or unix package maintainers. These will be the first core packages that need to support AFPP (as additional API). The types need to be put somewhere as well, @snoyberg suggested primitive as a "not-quite-base".

I put temporary haddocks here: http://files.hasufell.de/AFPP/

Status

I believe the core implementation is done.

We now have these utility packages that I already put on hackage:

  • word16: needed for Word16 literals, this is a rather simple package
  • shortbytestring: this is the core of the low-level primitives... since bytestring has a very small API for ShortByteString, this added a lot of new stuff. Additionally, the quirky Data.ByteString.Short.Word16 module was added, which is used by the windows callstack. This needs special attention.

AFPP packages are pretty much done:

  • abstract-filepath-types: contains the core types
  • abstract-filepath: contains construction/deconstruction, but also contains the entirety of the filepath API (may not be 100% the same, but I don't see why it should... @Mistuke also made some comments about changes that could be made for windows, but I'm still waiting for a PR)
  • abstract-filepath-unix: contains the subset of the unix API that needed adjustment for AFPP as new module variants... care was taken to use the same internal types (other than filepath), which required some unsafeCoerce in a few places
  • abstract-filepath-Win32: contains the subset of the Win32 API that needed adjustment for AFPP as new module variants... care was taken to use the same internal types (other than filepath), which required some unsafeCoerce in a few places

abstract-filepath-unix and abstract-filepath-Win32 are complementory packages, they don't replace unix or Win32.

I've started some example migrations of my own packages to AFPP:

  • Use abstract-filepath streamly-posix#2 (easy migration, small package)
  • Abstract filepath hpath#43 (this is effectively an alternative to the directory and path packages and provides filesystem operations as well as strict path types... this will be my testbed for benchmarks. The Win32 portion of the API is still WIP, but should be finished soon)

How to move forward

Ultimately, at least abstract-filepath-unix and abstract-filepath-Win32 need to be moved into the respective packages, because maintaining these as subset forks will be time consuming. As said, here I don't have explicit consent, so I'm not sure how or if this can move forward.

After I finish migrating hpath, I will start conducting benchmarks. I'm not too familiar with that. I've put up a ticket #5. This will likely require looking at https://well-typed.com/blog/2021/01/fragmentation-deeper-look/ and ghc-debug to make useful assessments about improvement in memory fragmentation.

@hasufell
Copy link
Owner Author

hasufell commented Nov 4, 2021

@Mistuke ...I'm currently taking a deeper look into windows paths

But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.

Can you be more specific? Afais there are these types:

  1. C:\
  2. \\?\C:\
  3. \\.\C:\
  4. \\share
  5. \\?\UNC\share
  6. \\.\UNC\share
> fmap isAbsolute ["C:\\","\\\\?\\C:\\","\\\\.\\C:\\","\\\\share","\\\\?\\UNC\\share","\\\\.\\UNC\\share"]
[True,True,True,True,True,True]
> fmap isValid ["C:\\","\\\\?\\C:\\","\\\\.\\C:\\","\\\\share","\\\\?\\UNC\\share","\\\\.\\UNC\\share"]
[True,True,False,True,False,True]

So isValid chokes only on the 5th (and by accident returns True for the 6th and 3rd I think). Is that correct?

It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1 is perfectly valid and points to the first COM port.
It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\ but then makeValid doesn't seem to.

Ok, so I think isValid and makeValid should just be no-ops when the long-path prefix is discovered? Because windows API disables parsing for these and just passes the string unchanged to the syscall. Of course, that one may choke, but I think we can assume that \\?\ denotes "I know what I'm doing".

This also aligns with the discussion at haskell/filepath#56

@Bodigrim
Copy link

I wonder what's the best way forward with shortbytestring. Integrate it into bytestring codebase? Make it an upstream dependency of bytestring? Leave it as a separate "utility" package in between of bytestring and filepath?

I'm exceedingly busy with text-2.0 at the moment, but hopefully will get time to ponder about it in January.

@hasufell
Copy link
Owner Author

I wonder what's the best way forward with shortbytestring. Integrate it into bytestring codebase? Make it an upstream dependency of bytestring? Leave it as a separate "utility" package in between of bytestring and filepath?

I think there's some code sharing in the test suite, where I had to cut out all things that aren't shortbytestring relevant. Not sure that's a good argument for merging them. It might make sense to have a split. I think small scope is good for PVP, so messing with shortbytestring (which is maybe more likely when it's started to be heavily used as a new dependency of filepath) doesn't impact the version of bytestring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion help wanted Extra attention is needed task
Projects
None yet
Development

No branches or pull requests

5 participants