fs: use WTF-8 on Windows #2970
This allows working with filenames that are not well-formed UTF-16. Fixes: libuv#2048 Refs: https://simonsapin.github.io/wtf-8/
Update: this now threads |
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Ok, I applied your suggestions but I ended up rewriting |
The function
Which one is it?
The code raises an error if the WTF-8 is not well-formed, which is exactly what the spec suggests.
Where does it suggest that?
The spec doesn't actually address ill-formed WTF-8 at all as far as I can tell, but it does suggest that when converting strictly from WTF-8 to UTF-8, one should return an error if the WTF-8 is not valid UTF-8. The other option that it suggests is replacing unpaired surrogates with replacement characters. The situation going from arbitrary bytes to WTF-8 is analogous and the viable choices are the same: raise an error or replace the ill-formed sequences with replacement characters. Since the spec explicitly doesn't address anything besides well-formed WTF-8, we're free to do whichever of these we feel is appropriate. The most conservative thing to do is to raise an error with the caller, which is precisely what this pull request does.
Correct, and the note in http://simonsapin.github.io/wtf-8/#decode-from-wtf-8 explains why.
From well-formed WTF-8. This is a different situation.
It doesn't address ill-formed WTF-8 because it assumes that this situation will never occur. One could see it as a litmus test for suitability of WTF-8 for a given purpose. The fact that you seem to argue that this situation can occur suggests that WTF-8 is not being used for its intended purpose.
You can view this PR as doing two separate things:
The spec only applies to the second step, and that step is done precisely according to spec. The first step is, by the spec's own statement, outside the scope of the spec. The spec explicitly says that it has no position about how this step should be handled. That doesn't mean that we can't apply the WTF-8 spec to the second step, any more than the fact that the Unicode spec doesn't discuss gzip means that we can't decompress a gzipped file and then interpret the decompressed data as UTF-8. If the gzip file is malformed, then we throw an error before trying to decode the UTF-8 data. Your argument is like saying that because gzip files can be incorrectly formatted, we can't possibly apply the Unicode standard to interpret the contents of a decompressed gzip file. Even though the WTF-8 spec doesn't address ill-formed WTF-8 data, we can extrapolate from the options it gives for handling converting WTF-8 to UTF-8. There are two approaches:
The same options exist for handling step one above:
Again: the spec doesn't address this explicitly, but the fact that step one is out of scope for the WTF-8 spec doesn't mean that we can't apply the WTF-8 spec to step two. In this PR, we take the more conservative approach here and raise an error to the caller when they pass ill-formed WTF-8.
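For readers following the thread, here is a minimal sketch of what "raise an error on ill-formed WTF-8" can look like. It is not the code from this PR: the name `wtf8_validate` and the `0` / `-1` return convention are assumptions made for illustration (a real caller would likely map `-1` to an error code such as `EINVAL`). The rules checked are the usual generalized UTF-8 ones (no stray continuation bytes, no overlong forms, nothing above U+10FFFF) plus the WTF-8 restriction that a lead surrogate may not be immediately followed by a trail surrogate, since such a pair must use the four-byte form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libuv's API): returns 0 if the `len` bytes at `s`
 * are well-formed WTF-8, or -1 otherwise. */
int wtf8_validate(const unsigned char* s, size_t len) {
  size_t i = 0;
  int prev_lead = 0; /* was the previous code point a lead surrogate? */

  assert(s != NULL || len == 0);
  while (i < len) {
    unsigned char b = s[i];
    uint32_t cp;
    uint32_t min; /* smallest code point allowed for this sequence length */
    size_t extra; /* number of continuation bytes expected */
    size_t k;

    if (b < 0x80) {
      cp = b; extra = 0; min = 0;
    } else if ((b & 0xE0) == 0xC0) {
      cp = b & 0x1F; extra = 1; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      cp = b & 0x0F; extra = 2; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      cp = b & 0x07; extra = 3; min = 0x10000;
    } else {
      return -1; /* stray continuation byte or invalid first byte */
    }

    if (len - i - 1 < extra)
      return -1; /* sequence is truncated */

    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80)
        return -1; /* expected a continuation byte */
      cp = (cp << 6) | (uint32_t) (s[i + k] & 0x3F);
    }

    if (cp < min || cp > 0x10FFFF)
      return -1; /* overlong encoding or out of range */

    /* Surrogates are allowed on their own, but a lead surrogate followed by
     * a trail surrogate must be encoded as one four-byte sequence. */
    if (prev_lead && cp >= 0xDC00 && cp <= 0xDFFF)
      return -1;
    prev_lead = (cp >= 0xD800 && cp <= 0xDBFF);

    i += extra + 1;
  }
  return 0;
}
```

With this sketch, the bytes `ED A0 80` (a lone lead surrogate) are accepted, while `ED A0 BD ED B8 80` (a surrogate pair written as two three-byte sequences) are rejected. The alternative discussed above, substituting replacement characters, would turn each `return -1` into an output step instead.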
Btw, this is your interpretation, not something the spec says. I suspect the spec authors are aware that ill-formed WTF-8 can and does occur; they just consider it out of scope of the WTF-8 spec to tell you how to handle it. That doesn't mean that the problem goes away, or that if you have a situation where it can occur you just stick your head in the sand and ignore the possibility. It just means that it's up to you to decide how to handle it.
I don't know how you ended up with that interpretation if you've read the note I linked. The analogy is flawed because validation is an integral part of any encoding spec where potentially ill-formed input can occur (e.g. https://encoding.spec.whatwg.org/#utf-8-decoder) - and WTF-8 isn't one.
How can it occur in a self-contained system?
This is ready to merge
This allows working with valid filenames that are not well-formed UTF-16. This is a superset of UTF-8, which does not error when it encounters an unpaired surrogate but simply allows it. Fixes: libuv#2048 Refs: https://simonsapin.github.io/wtf-8/ Replaces: libuv#2192 by Nikolai Vavilov <vvnicholas@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> (cherry picked from commit 8f32a14)
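To make the "unpaired surrogate is simply allowed" point concrete, the following is a rough sketch of encoding potentially ill-formed UTF-16 as WTF-8. It is not the implementation merged here; `utf16_to_wtf8` and its buffer-size convention are hypothetical names chosen for the example. A properly paired lead and trail surrogate become one four-byte sequence, and any surrogate left on its own is encoded as a three-byte sequence rather than being rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libuv's API): encode `len` UTF-16 code units from
 * `src` as WTF-8 into `dst`, which has room for `cap` bytes. Returns the
 * number of bytes written, or (size_t) -1 if `dst` is too small. */
size_t utf16_to_wtf8(const uint16_t* src, size_t len,
                     unsigned char* dst, size_t cap) {
  size_t i = 0;
  size_t n = 0;

  assert(src != NULL || len == 0);
  assert(dst != NULL || cap == 0);
  while (i < len) {
    uint32_t cp = src[i++];
    size_t need;

    /* A lead surrogate directly followed by a trail surrogate forms one
     * supplementary code point. */
    if (cp >= 0xD800 && cp <= 0xDBFF && i < len &&
        src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
      cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t) (src[i++] - 0xDC00);
    }
    /* Otherwise cp is a BMP scalar value or an unpaired surrogate; WTF-8
     * encodes both the same way, so no error is raised. */

    need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (cap - n < need)
      return (size_t) -1;

    if (need == 1) {
      dst[n++] = (unsigned char) cp;
    } else if (need == 2) {
      dst[n++] = (unsigned char) (0xC0 | (cp >> 6));
      dst[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else if (need == 3) {
      dst[n++] = (unsigned char) (0xE0 | (cp >> 12));
      dst[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      dst[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else {
      dst[n++] = (unsigned char) (0xF0 | (cp >> 18));
      dst[n++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      dst[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      dst[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    }
  }
  return n;
}
```

For example, the code units `0x0041 0xD800` encode to `41 ED A0 80`, whereas a strict UTF-8 encoder would have to fail or substitute U+FFFD for the lone `0xD800`.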
w00t!
Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048
Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048 PR-URL: #48618 Fixes: #48512 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: #2970 Fixes: nodejs/node#48673
Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048 PR-URL: nodejs#48618 Fixes: nodejs#48512 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048 PR-URL: #48618 Backport-PR-URL: #49591 Fixes: #48512 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com> PR-URL: #48078
As promised in #2970, this attempts to migrate code to a common set of utilities in a common place in the code and use them everywhere. This also exports the functionality, since the Windows API with WideCharToMultiByte is fairly verbose relative to what libuv and libuv's clients typically need, so it is useful not to require clients to reimplement this conversion logic unnecessarily (and because Windows is not 64-bit ready here, but this implementation is.)
Original commit message: fs: fix WTF-8 decoding issue (nodejs#4092) We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: libuv/libuv#2970 Fixes: nodejs#48673
Original commit message: fs: fix WTF-8 decoding issue (#4092) We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: libuv/libuv#2970 Fixes: #48673 PR-URL: #51976 Refs: #48673 Reviewed-By: Rafael Gonzaga <rafael.nunu@hotmail.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Marco Ippolito <marcoippolito54@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Ulises Gascón <ulisesgascongonzalez@gmail.com>
I got tired of waiting on #2192, so I just applied the code review suggestions made by @vtjnash. Hopefully I'm doing the return error codes right, but if not, let me know what else to do here.
There were a couple of formatting recommendations for long function signatures that I did not follow, since lining up wrapped arguments with the opening parenthesis does not seem to be the existing style in this file. Instead, I indented the wrapped arguments one level deeper than the function body, which seems to be how the rest of this file is formatted. If the other format is preferable, it would be better to make a separate cosmetic PR to fix that.