Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: use WTF-8 on Windows #2970

Merged
merged 16 commits into from
May 23, 2023
Merged

fs: use WTF-8 on Windows #2970

merged 16 commits into from
May 23, 2023

Conversation

StefanKarpinski
Copy link
Contributor

I got tired of waiting on #2192, so I just applied the code review suggestions made by @vtjnash. Hopefully I'm doing the return error codes right, but if not, let me know what else to do here.

There were a couple of formatting recommendations for long function signatures that I did not follow since lining wrapped up arguments with the opening parens does not seem to be the existing style in this file. Instead, I indented the wrapped arguments to be indented once after the normal function body level, which seems to be how the rest of this file is formatted. If the other format is preferable, it would be better to make a separate cosmetic PR to fix that.

@StefanKarpinski
Copy link
Contributor Author

Update: this now threads ERROR_INVALID_NAME back to the caller if any path name is invalid WTF-8. I also made the types used a bit more specific: code points are always int32_t, path lengths are always int (plenty for paths on all platforms); the types are signed because -1 is used to signal invalid WTF-8 data. I put some asserts in there that are strictly not necessary since they're after functions that cannot return invalid values (i.e. going from WTF-16 to WTF-8), but it seemed better to put the asserts in anyway with a comment.

src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
@StefanKarpinski
Copy link
Contributor Author

Ok, I applied your suggestions but I ended up rewriting fs__decode_wtf8_char entirely to be simpler.

src/win/fs.c Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
@seishun
Copy link
Contributor

seishun commented Aug 27, 2020

The function fs__decode_wtf8_char includes validation (albeit incomplete), which was deliberately removed from the spec (see SimonSapin/wtf-8@c6a271d). This implies one of the following:

  1. The spec's author was wrong in limiting the algorithm to well-formed WTF-8
  2. libuv is a special case

Which one is it?

@StefanKarpinski
Copy link
Contributor Author

The code raises an error if the WTF-8 is not well-formed, which is exactly what the spec suggests.

@seishun
Copy link
Contributor

seishun commented Aug 27, 2020

Where does it suggest that?

@StefanKarpinski
Copy link
Contributor Author

StefanKarpinski commented Aug 27, 2020

The spec doesn't actually address ill-formed WTF-8 at all as far as I can tell, but it does suggest when converting strictly from WTF-8 to UTF-8 that one return an error if the WTF-8 is not valid UTF-8. The other option that it suggests is replacing unparied surrogates with replacement characters. The situation going from arbitrary bytes to WTF-8 is analogous and the viable choices are the same: raise an error or replace the ill-formed sequences with replacement characters. Since the spec explicitly doesn't address anything besides well-formed WTF-8, we're free to do which of these we feel is appropriate. The most conservative thing to do is to raise an error with the caller, which is precisely what this pull-request does.

@seishun
Copy link
Contributor

seishun commented Aug 27, 2020

The spec doesn't actually address ill-formed WTF-8 at all as far as I can tell

Correct, and the note in http://simonsapin.github.io/wtf-8/#decode-from-wtf-8 explains why.

but it does suggest when converting strictly from WTF-8 to UTF-8 that one return an error if the WTF-8 is not valid UTF-8

From well-formed WTF-8. This is a different situation.

Since the spec explicitly doesn't address anything besides well-formed WTF-8, we're free to do which of these we feel is appropriate.

It doesn't address ill-formed WTF-8 because it assumes that this situation will never occur. One could see it as a litmus test for suitability of WTF-8 for a given purpose. The fact that you seem to argue that this situation can occur suggests that WTF-8 is not being used for its intended purpose.

@StefanKarpinski
Copy link
Contributor Author

StefanKarpinski commented Aug 27, 2020

You can view this PR as doing two separate things:

  1. Verify that caller-supplied data is well-formed WTF-8; if it isn't raise an error to the caller.
  2. Given well-formed WTF-8, convert it to a sequence of code points as per the WTF-8 spec.

The spec only applies to the second step, and that step is done precisely according to spec. The first step is, by the spec's own statement, outside of scope of the spec. The spec explicitly says that it has no position about how this step should be handled. That doesn't mean that we can't apply the WTF-8 spec to the second step anymore than the fact that the Unicode spec doesn't discuss gzip means that we can't decompress a gzipped file and then interpret the decompressed data as UTF-8. If the gzip file is malformed, then we throw an error before trying to decode the UTF-8 data. Your argument is like saying that because gzip files can be incorrectly formatted, we can't possibly apply the Unicode standard to interpret the contents of a decompressed gzip file.

Even though the WTF-8 spec doesn't address ill-formed WTF-8 data, we can extrapolate from the options it gives for handling converting WTF-8 to UTF-8. There are two approaches:

  1. The strict, non-lossy approach where you throw an error upon encountering any unpaired surrogate;
  2. The lossy, non-strict approach where you convert unpaired surrogate code points to replacement characters.

The same options exist for handling step one above:

  1. The strict, non-lossy approach where you throw an error upon encountering any ill-formed WTF-8;
  2. The lossy, non-strict approach where you convert ill-formed WTF-8 sequences to replacement characters.

Again: the spec doesn't address this explicitly, but the fact that step one is out of scope for the WTF-8 spec doesn't mean that we can't apply the WTF-8 spec to step two. In this PR, we take the more conservative approach here and raise an error to the caller when they pass ill-formed WTF-8.

@StefanKarpinski
Copy link
Contributor Author

StefanKarpinski commented Aug 27, 2020

It doesn't address ill-formed WTF-8 because it assumes that this situation will never occur.

Btw, this is your interpretation, not something the spec says. I suspect the spec authors are aware that ill-formed WTF-8 can and does occur, they just consider it out of scope of the WTF-8 spec to tell you how to handle it. That doesn't mean that the problem goes away, or that if you have a situation where it can occur you just stick your head in the sand and ignore the possiblity. It just means that it's up to you to decide how to handle it.

@seishun
Copy link
Contributor

seishun commented Aug 27, 2020

I don't know how you ended up with that interpretation if you've read the note I linked. The analogy is flawed because validation is an integral part of any encoding spec where potentially ill-formed input can occur (e.g. https://encoding.spec.whatwg.org/#utf-8-decoder) - and WTF-8 isn't one.

ill-formed WTF-8 can and does occur

How can it occur in a self-contained system?

@vtjnash
Copy link
Member

vtjnash commented May 16, 2023

This is ready to merge

src/win/fs.c Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Show resolved Hide resolved
src/win/fs.c Outdated Show resolved Hide resolved
src/win/fs.c Show resolved Hide resolved
src/win/fs.c Show resolved Hide resolved
test/test-fs.c Outdated Show resolved Hide resolved
@vtjnash vtjnash merged commit 8f32a14 into libuv:v1.x May 23, 2023
31 of 34 checks passed
vtjnash pushed a commit to vtjnash/libuv that referenced this pull request May 23, 2023
This allows working with valid filenames that are not well-formed
UTF-16. This is a superset of UTF-8, which does not error when it
encounters an unpaired surrogate but simply allows it.

Fixes: libuv#2048
Refs: https://simonsapin.github.io/wtf-8/
Replaces: libuv#2192 by Nikolai Vavilov <vvnicholas@gmail.com>
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
(cherry picked from commit 8f32a14)
@StefanKarpinski StefanKarpinski deleted the sk/wtf-8 branch May 23, 2023 18:41
@StefanKarpinski
Copy link
Contributor Author

w00t!

santigimeno added a commit to santigimeno/node that referenced this pull request Jun 30, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048
nodejs-github-bot pushed a commit to nodejs/node that referenced this pull request Jul 2, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: #48618
Fixes: #48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
RafaelGSS pushed a commit to nodejs/node that referenced this pull request Jul 3, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: #48618
Fixes: #48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
vtjnash added a commit that referenced this pull request Jul 13, 2023
We forgot to mask off the high bits from the first byte, so we ended up
always failing the subsequent range check.

Refs: #2970
Fixes: nodejs/node#48673
Ceres6 pushed a commit to Ceres6/node that referenced this pull request Aug 14, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: nodejs#48618
Fixes: nodejs#48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
Ceres6 pushed a commit to Ceres6/node that referenced this pull request Aug 14, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: nodejs#48618
Fixes: nodejs#48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
arsnyder16 pushed a commit to arsnyder16/node that referenced this pull request Sep 10, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: nodejs#48618
Fixes: nodejs#48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
ruyadorno pushed a commit to nodejs/node that referenced this pull request Sep 11, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: #48618
Backport-PR-URL: #49591
Fixes: #48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
PR-URL: #48078
ruyadorno pushed a commit to nodejs/node that referenced this pull request Sep 13, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: #48618
Backport-PR-URL: #49591
Fixes: #48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
PR-URL: #48078
ruyadorno pushed a commit to nodejs/node that referenced this pull request Sep 17, 2023
Notable changes
- fs: use WTF-8 on Windows: libuv/libuv#2970
- linux: add some more iouring backed fs ops: libuv/libuv#4012

Important bugs fixed
- linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059
- src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

PR-URL: #48618
Backport-PR-URL: #49591
Fixes: #48512
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
PR-URL: #48078
vtjnash added a commit that referenced this pull request Oct 29, 2023
As promised in #2970, this attempts to migrate code to a common set of
utilities in a common place in the code and use them everywhere. This
also exports the functionality, since the Windows API with
WideCharToMultiByte is fairly verbose relative to what libuv and
libuv's clients typically need, so it is useful not to require clients
to reimplement this conversion logic unnecessarily (and because Windows
is not 64-bit ready here, but this implementation is.)
richardlau added a commit to richardlau/node-1 that referenced this pull request Mar 4, 2024
Original commit message:

    fs: fix WTF-8 decoding issue (nodejs#4092)

    We forgot to mask off the high bits from the first byte, so we ended up
    always failing the subsequent range check.

    Refs: libuv/libuv#2970
    Fixes: nodejs#48673
richardlau added a commit to richardlau/node-1 that referenced this pull request Mar 5, 2024
Original commit message:

    fs: fix WTF-8 decoding issue (nodejs#4092)

    We forgot to mask off the high bits from the first byte, so we ended up
    always failing the subsequent range check.

    Refs: libuv/libuv#2970
    Fixes: nodejs#48673
richardlau added a commit to nodejs/node that referenced this pull request Mar 15, 2024
Original commit message:

    fs: fix WTF-8 decoding issue (#4092)

    We forgot to mask off the high bits from the first byte, so we ended up
    always failing the subsequent range check.

    Refs: libuv/libuv#2970
    Fixes: #48673

PR-URL: #51976
Refs: #48673
Reviewed-By: Rafael Gonzaga <rafael.nunu@hotmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Marco Ippolito <marcoippolito54@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ulises Gascón <ulisesgascongonzalez@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants