fs: use WTF-8 on Windows #2970

StefanKarpinski · 2020-08-26T14:02:05Z

I got tired of waiting on #2192, so I just applied the code review suggestions made by @vtjnash. Hopefully I'm doing the return error codes right, but if not, let me know what else to do here.

There were a couple of formatting recommendations for long function signatures that I did not follow since lining wrapped up arguments with the opening parens does not seem to be the existing style in this file. Instead, I indented the wrapped arguments to be indented once after the normal function body level, which seems to be how the rest of this file is formatted. If the other format is preferable, it would be better to make a separate cosmetic PR to fix that.

This allows working with filenames that are not well-formed UTF-16. Fixes: libuv#2048 Refs: https://simonsapin.github.io/wtf-8/

StefanKarpinski · 2020-08-26T16:45:04Z

Update: this now threads ERROR_INVALID_NAME back to the caller if any path name is invalid WTF-8. I also made the types used a bit more specific: code points are always int32_t, path lengths are always int (plenty for paths on all platforms); the types are signed because -1 is used to signal invalid WTF-8 data. I put some asserts in there that are strictly not necessary since they're after functions that cannot return invalid values (i.e. going from WTF-16 to WTF-8), but it seemed better to put the asserts in anyway with a comment.

src/win/fs.c

Co-authored-by: Jameson Nash <vtjnash@gmail.com>

StefanKarpinski · 2020-08-26T20:31:54Z

Ok, I applied your suggestions but I ended up rewriting fs__decode_wtf8_char entirely to be simpler.

src/win/fs.c

seishun · 2020-08-27T15:45:55Z

The function fs__decode_wtf8_char includes validation (albeit incomplete), which was deliberately removed from the spec (see SimonSapin/wtf-8@c6a271d). This implies one of the following:

The spec's author was wrong in limiting the algorithm to well-formed WTF-8
libuv is a special case

Which one is it?

StefanKarpinski · 2020-08-27T16:13:11Z

The code raises an error if the WTF-8 is not well-formed, which is exactly what the spec suggests.

seishun · 2020-08-27T16:21:17Z

Where does it suggest that?

StefanKarpinski · 2020-08-27T16:31:46Z

The spec doesn't actually address ill-formed WTF-8 at all as far as I can tell, but it does suggest when converting strictly from WTF-8 to UTF-8 that one return an error if the WTF-8 is not valid UTF-8. The other option that it suggests is replacing unparied surrogates with replacement characters. The situation going from arbitrary bytes to WTF-8 is analogous and the viable choices are the same: raise an error or replace the ill-formed sequences with replacement characters. Since the spec explicitly doesn't address anything besides well-formed WTF-8, we're free to do which of these we feel is appropriate. The most conservative thing to do is to raise an error with the caller, which is precisely what this pull-request does.

seishun · 2020-08-27T17:59:28Z

The spec doesn't actually address ill-formed WTF-8 at all as far as I can tell

Correct, and the note in http://simonsapin.github.io/wtf-8/#decode-from-wtf-8 explains why.

but it does suggest when converting strictly from WTF-8 to UTF-8 that one return an error if the WTF-8 is not valid UTF-8

From well-formed WTF-8. This is a different situation.

Since the spec explicitly doesn't address anything besides well-formed WTF-8, we're free to do which of these we feel is appropriate.

It doesn't address ill-formed WTF-8 because it assumes that this situation will never occur. One could see it as a litmus test for suitability of WTF-8 for a given purpose. The fact that you seem to argue that this situation can occur suggests that WTF-8 is not being used for its intended purpose.

StefanKarpinski · 2020-08-27T18:39:56Z

You can view this PR as doing two separate things:

Verify that caller-supplied data is well-formed WTF-8; if it isn't raise an error to the caller.
Given well-formed WTF-8, convert it to a sequence of code points as per the WTF-8 spec.

The spec only applies to the second step, and that step is done precisely according to spec. The first step is, by the spec's own statement, outside of scope of the spec. The spec explicitly says that it has no position about how this step should be handled. That doesn't mean that we can't apply the WTF-8 spec to the second step anymore than the fact that the Unicode spec doesn't discuss gzip means that we can't decompress a gzipped file and then interpret the decompressed data as UTF-8. If the gzip file is malformed, then we throw an error before trying to decode the UTF-8 data. Your argument is like saying that because gzip files can be incorrectly formatted, we can't possibly apply the Unicode standard to interpret the contents of a decompressed gzip file.

Even though the WTF-8 spec doesn't address ill-formed WTF-8 data, we can extrapolate from the options it gives for handling converting WTF-8 to UTF-8. There are two approaches:

The strict, non-lossy approach where you throw an error upon encountering any unpaired surrogate;
The lossy, non-strict approach where you convert unpaired surrogate code points to replacement characters.

The same options exist for handling step one above:

The strict, non-lossy approach where you throw an error upon encountering any ill-formed WTF-8;
The lossy, non-strict approach where you convert ill-formed WTF-8 sequences to replacement characters.

Again: the spec doesn't address this explicitly, but the fact that step one is out of scope for the WTF-8 spec doesn't mean that we can't apply the WTF-8 spec to step two. In this PR, we take the more conservative approach here and raise an error to the caller when they pass ill-formed WTF-8.

StefanKarpinski · 2020-08-27T19:19:11Z

It doesn't address ill-formed WTF-8 because it assumes that this situation will never occur.

Btw, this is your interpretation, not something the spec says. I suspect the spec authors are aware that ill-formed WTF-8 can and does occur, they just consider it out of scope of the WTF-8 spec to tell you how to handle it. That doesn't mean that the problem goes away, or that if you have a situation where it can occur you just stick your head in the sand and ignore the possiblity. It just means that it's up to you to decide how to handle it.

seishun · 2020-08-27T20:02:04Z

I don't know how you ended up with that interpretation if you've read the note I linked. The analogy is flawed because validation is an integral part of any encoding spec where potentially ill-formed input can occur (e.g. https://encoding.spec.whatwg.org/#utf-8-decoder) - and WTF-8 isn't one.

ill-formed WTF-8 can and does occur

How can it occur in a self-contained system?

vtjnash · 2023-05-16T01:21:11Z

This is ready to merge

src/win/fs.c

test/test-fs.c

This allows working with valid filenames that are not well-formed UTF-16. This is a superset of UTF-8, which does not error when it encounters an unpaired surrogate but simply allows it. Fixes: libuv#2048 Refs: https://simonsapin.github.io/wtf-8/ Replaces: libuv#2192 by Nikolai Vavilov <vvnicholas@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> (cherry picked from commit 8f32a14)

StefanKarpinski · 2023-05-23T18:41:20Z

w00t!

Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048

Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048 PR-URL: #48618 Fixes: #48512 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>

We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: #2970 Fixes: nodejs/node#48673

Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048 PR-URL: nodejs#48618 Fixes: nodejs#48512 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>

Notable changes - fs: use WTF-8 on Windows: libuv/libuv#2970 - linux: add some more iouring backed fs ops: libuv/libuv#4012 Important bugs fixed - linux: work around io_uring IORING_OP_CLOSE bug: libuv/libuv#4059 - src: don't run timers if loop is stopped/unref'd: libuv/libuv#4048 PR-URL: #48618 Backport-PR-URL: #49591 Fixes: #48512 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com> PR-URL: #48078

As promised in #2970, this attempts to migrate code to a common set of utilities in a common place in the code and use them everywhere. This also exports the functionality, since the Windows API with WideCharToMultiByte is fairly verbose relative to what libuv and libuv's clients typically need, so it is useful not to require clients to reimplement this conversion logic unnecessarily (and because Windows is not 64-bit ready here, but this implementation is.)

Original commit message: fs: fix WTF-8 decoding issue (nodejs#4092) We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: libuv/libuv#2970 Fixes: nodejs#48673

Original commit message: fs: fix WTF-8 decoding issue (#4092) We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: libuv/libuv#2970 Fixes: #48673 PR-URL: #51976 Refs: #48673 Reviewed-By: Rafael Gonzaga <rafael.nunu@hotmail.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Marco Ippolito <marcoippolito54@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Ulises Gascón <ulisesgascongonzalez@gmail.com>

seishun and others added 3 commits February 12, 2019 20:29

fs: use WTF-8 on Windows

8943842

This allows working with filenames that are not well-formed UTF-16. Fixes: libuv#2048 Refs: https://simonsapin.github.io/wtf-8/

fs: WTF-8 code, apply code review suggestions

d6fe3fa

Merge branch 'v1.x' into sk/wtf-8

7cd60d8

StefanKarpinski mentioned this pull request Aug 26, 2020

fs: use WTF-8 on Windows #2192

Closed

StefanKarpinski added 2 commits August 26, 2020 11:32

fs: return ERROR_INVALID_NAME on invalid WTF-8

a7e7c82

fs: more specific sized types for UTF-8

906bfd8

vtjnash reviewed Aug 26, 2020

View reviewed changes

src/win/fs.c Outdated Show resolved Hide resolved

src/win/fs.c Outdated Show resolved Hide resolved

src/win/fs.c Outdated Show resolved Hide resolved

src/win/fs.c Outdated Show resolved Hide resolved

src/win/fs.c Outdated Show resolved Hide resolved

vtjnash reviewed Aug 26, 2020

View reviewed changes

src/win/fs.c Outdated Show resolved Hide resolved

apply review recommendations

e996002

Co-authored-by: Jameson Nash <vtjnash@gmail.com>

StefanKarpinski force-pushed the sk/wtf-8 branch from 49bc411 to 88f5d6b Compare August 26, 2020 20:31

vtjnash reviewed Aug 27, 2020

View reviewed changes

src/win/fs.c Show resolved Hide resolved

src/win/fs.c Outdated Show resolved Hide resolved

StefanKarpinski added 2 commits August 27, 2020 10:04

use lowercase hex literals

b3411cf

rewrite fs__decode_wtf8_char

0f34185

StefanKarpinski force-pushed the sk/wtf-8 branch 2 times, most recently from 30bdf09 to dae1a63 Compare August 27, 2020 15:22

check for leading continuation byte (invalid)

bc51ad1

StefanKarpinski force-pushed the sk/wtf-8 branch from dae1a63 to bc51ad1 Compare August 27, 2020 15:23

add a bunch of comments

ddb4e33

StefanKarpinski force-pushed the sk/wtf-8 branch from 52042b1 to ddb4e33 Compare August 27, 2020 15:33

vtjnash approved these changes Aug 27, 2020

View reviewed changes

fix comparison values, tests, and syntax

b8d7f2e

vtjnash force-pushed the sk/wtf-8 branch from aa671b4 to b8d7f2e Compare May 16, 2023 00:34

bnoordhuis reviewed May 16, 2023

View reviewed changes

address comments

e427c27

vtjnash merged commit 8f32a14 into libuv:v1.x May 23, 2023
31 of 34 checks passed

StefanKarpinski deleted the sk/wtf-8 branch May 23, 2023 18:41

vtjnash mentioned this pull request May 23, 2023

[RFC] export WTF8 conversion utilities #4021

Merged

santigimeno mentioned this pull request May 25, 2023

fs: cannot interact with invalid UTF-16 filenames on Windows, even with Buffers nodejs/node#23735

Closed

bnoordhuis mentioned this pull request Jun 20, 2023

Release proposal: v1.46.0 #4060

Closed

santigimeno mentioned this pull request Jun 30, 2023

deps: upgrade to libuv 1.46.0 nodejs/node#48618

Merged

justinmk mentioned this pull request Jul 1, 2023

build(deps): bump libuv to v1.46.0 neovim/neovim#24218

Merged

vtjnash mentioned this pull request Jul 12, 2023

fs: fix WTF-8 decoding issue #4092

Merged

vtjnash added a commit that referenced this pull request Jul 13, 2023

fs: fix WTF-8 decoding issue (#4092)

d09441c

We forgot to mask off the high bits from the first byte, so we ended up always failing the subsequent range check. Refs: #2970 Fixes: nodejs/node#48673

kLabz mentioned this pull request Oct 4, 2023

[ci] stay on node 18.17.1 for windows CI HaxeFoundation/haxe#11325

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fs: use WTF-8 on Windows #2970

fs: use WTF-8 on Windows #2970

StefanKarpinski commented Aug 26, 2020

StefanKarpinski commented Aug 26, 2020

StefanKarpinski commented Aug 26, 2020

seishun commented Aug 27, 2020

StefanKarpinski commented Aug 27, 2020

seishun commented Aug 27, 2020

StefanKarpinski commented Aug 27, 2020 •

edited

Loading

seishun commented Aug 27, 2020

StefanKarpinski commented Aug 27, 2020 •

edited

Loading

StefanKarpinski commented Aug 27, 2020 •

edited

Loading

seishun commented Aug 27, 2020

vtjnash commented May 16, 2023

StefanKarpinski commented May 23, 2023

fs: use WTF-8 on Windows #2970

fs: use WTF-8 on Windows #2970

Conversation

StefanKarpinski commented Aug 26, 2020

StefanKarpinski commented Aug 26, 2020

StefanKarpinski commented Aug 26, 2020

seishun commented Aug 27, 2020

StefanKarpinski commented Aug 27, 2020

seishun commented Aug 27, 2020

StefanKarpinski commented Aug 27, 2020 • edited Loading

seishun commented Aug 27, 2020

StefanKarpinski commented Aug 27, 2020 • edited Loading

StefanKarpinski commented Aug 27, 2020 • edited Loading

seishun commented Aug 27, 2020

vtjnash commented May 16, 2023

StefanKarpinski commented May 23, 2023

StefanKarpinski commented Aug 27, 2020 •

edited

Loading

StefanKarpinski commented Aug 27, 2020 •

edited

Loading

StefanKarpinski commented Aug 27, 2020 •

edited

Loading