New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: decode filenames using UTF-8 in fs.watch #3401

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
9 participants
@seishun
Member

seishun commented Oct 16, 2015

Fixes #2088.

cc @bnoordhuis

@mscdex mscdex added the fs label Oct 16, 2015

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

This is the correct fix for Windows but not necessarily for other platforms.

@seishun

This comment has been minimized.

Member

seishun commented Oct 17, 2015

@bnoordhuis UTF-8 is assumed for other platforms in e.g. readdir.

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

Yes, that's wrong.

@seishun

This comment has been minimized.

Member

seishun commented Oct 17, 2015

It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong, or wait until we change literally all of the fs API, which will probably never happen because no one sane would want to use buffers for paths?

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

So what I think we need to do is bite the bullet and move the file name / path name munging into a class that contains platform-specific or file system-specific logic for encoding and decoding them. That's going to be a fair bit of churn but that's better than the halfhearted approach where everything is broken in different ways all the time.

@seishun

This comment has been minimized.

Member

seishun commented Oct 17, 2015

a class that contains platform-specific or file system-specific logic for encoding and decoding them

What logic could there be for Unices other than assuming an encoding? And if it assumes UTF-8, then the logic is exactly the same for all platforms.

where everything is broken in different ways all the time

The only thing objectively broken right now is fs.watch. The theoretical breakage of non-UTF8 filenames on Unices is consistent across all of fs APIs as far as I'm aware.

I don't understand why you insist on making global changes before making fs.watch consistent with the rest of the fs API, whatever they might be.

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

What logic could there be for Unices other than assuming an encoding?

Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.

fs.watch() doesn't produce correct results now either but the one-byte encoding means you can at least patch it up post facto.

I don't understand why you insist on making global changes before making fs.watch consistent with the rest of the fs API, whatever they might be.

Because it's just swapping one bug for another. Maybe it's going to work better now for a majority of users but that's small consolation for the minority whose workflow just broke. I don't feel comfortable signing off on that.

@seishun

This comment has been minimized.

Member

seishun commented Oct 17, 2015

Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.

Shouldn't it be libuv's job to convert that into proper UTF-8?

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border.

@seishun

This comment has been minimized.

Member

seishun commented Oct 17, 2015

It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border.

It does convert on Windows. Why should OS X be treated any differently? I thought libuv was supposed to abstract away platform differences.

Anyway, my point still stands. You can argue that using UTF-8 on OS X is wrong, but that's how all of fs works. I don't see why fs.watch should be an exception (and in an extremely unintuitive way, too). If there really is an issue with OS X filenames (and so far I haven't found any source confirming that their filenames aren't valid UTF-8), then it should be fixed globally for all of fs at some point.

but that's small consolation for the minority whose workflow just broke

If their workflow would break from that, then why don't I see their complaints about other fs functions?

@nodejs/collaborators other opinions?

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

It does convert on Windows.

Windows is special. Libuv converts to UTF-16 so it can use the wide-char OS functions. The ANSI functions simply don't cut it in many cases.

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 17, 2015

If there really is an issue with OS X filenames (and so far I haven't found any source confirming that their filenames aren't valid UTF-8),

It's UTF-8 with a twist, that's why iconv for example calls it utf-8-mac. From here:

HFS Plus (Mac OS Extended) uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed

If you google around for "OS X NFD" or "HFS NFD", you'll find a great many SO questions on the topic.

then it should be fixed globally for all of fs at some point.

That's what I was suggesting!

@silverwind

This comment has been minimized.

Contributor

silverwind commented Oct 17, 2015

We had a lengthy discussion about normalization for OSX in #2165, with the gist being that we don't want to normalize to a specific form because of a possible data loss.

As for this issue, I think a class that abstracts away platform inconsistencies and returns UTF-8 in all cases would be prefered and be best done in libuv.

cc: @jorangreef

@seishun

This comment has been minimized.

Member

seishun commented Oct 18, 2015

@bnoordhuis from what I gather, it's still valid UTF-8. In which case, I don't see any problem with decoding it as UTF-8.

@seishun

This comment has been minimized.

Member

seishun commented Oct 18, 2015

@silverwind

with the gist being that we don't want to normalize to a specific form because of a possible data loss.

Which means we just want to pass filenames as-is, right? Which is exactly what this PR does, just like everywhere else in fs.

@silverwind

This comment has been minimized.

Contributor

silverwind commented Oct 19, 2015

Yes, the closest representation to the original file system would be prefered. What's the issue exactly on Unices if we do it like in this PR?

@seishun

This comment has been minimized.

Member

seishun commented Oct 19, 2015

I assume this question was directed at @bnoordhuis, because I have no idea.

@jorangreef

This comment has been minimized.

Contributor

jorangreef commented Oct 19, 2015

@bnoordhuis is right. Although 95% of filesystems use utf-8 as their encoding, one can't assume that every fileysystem is utf-8 or that just because the platform is Windows or Darwin that the encoding is therefore utf-8. Node currently makes this assumption without providing a way to work with filesystems which use other encodings.

The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user. If fs.watch converted to utf-8 then this would lose data in some rare edge cases when the encoding is not in fact utf-8. If the utf-8 decoding then replaced some unknown sequences with the replacement character then there would be no way for the user to get back to the actual filename as used by the filesystem.

@bnoordhuis, does the current utf-8 decoding also normalize form or does it just bring in whatever utf-8 form is being used in the data (whether mixed, NFC, NFD, HFS+ NFD etc.)? I thought it would do the latter, and if so then it should be safe to use with HFS+ so long as the decoding process does not also normalize form at the same time (this should never be done).

Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this.

@seishun

This comment has been minimized.

Member

seishun commented Oct 19, 2015

What's the likelihood of someone using fs.watch on an exotic filesystem while disregarding the fact that they are completely screwed by the rest of fs? (And discovering the Buffer hack instead of giving up or submitting an issue when fs.watch gave them garbage)

The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user.

No, it's not better. It "supports" the likely irrelevant edge case in the worst way possible (basically by passing a "binary string", which should have been part of history for about 3 years now), while being inconsistent with the rest of fs. What kind of a first impression will a new user of Node.js get when they are told that fs.readdir passes correct filenames while fs.watch requires an undocumented hack with a deprecated encoding?

@trevnorris

This comment has been minimized.

Contributor

trevnorris commented Oct 19, 2015

"And discovering the Buffer hack"
"requires an undocumented hack"

I'm curious what you're referring to?

(basically by passing a "binary string", which should have been part of history for about 3 years now)

Do you mean removed from fs or removed completely?

with a deprecated encoding

I'd completely forgot about that line in the docs. It's incorrect. Thanks for the reminder (#3441)

@seishun

This comment has been minimized.

Member

seishun commented Oct 20, 2015

I'm curious what you're referring to?

Using new Buffer(str, 'binary') to convert it back to a byte array.

Do you mean removed from fs or removed completely?

Removed from all APIs.

@jorangreef

This comment has been minimized.

Contributor

jorangreef commented Oct 20, 2015

@seishun I think you may have skipped over the last part of my comment:

Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this.

@seishun

This comment has been minimized.

Member

seishun commented Oct 20, 2015

@jorangreef Yes, it might be a good idea to have in the future, but fs.watch shouldn't remain in its current state until then.

@seishun

This comment has been minimized.

Member

seishun commented Oct 21, 2015

@bnoordhuis What's the status on this?

@seishun seishun added the relnotes label Oct 22, 2015

@srl295

This comment has been minimized.

Member

srl295 commented Oct 23, 2015

I don't think I have enough context to ±1 this specific change here, but UTF-8-only from libuv or the JS⇄C++ boundary would make sense.

It is fact that not all file systems are in UTF-8.

It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong

And thus a breaking change for 1% (the number may actually be higher)

@seishun

This comment has been minimized.

Member

seishun commented Oct 23, 2015

And thus a breaking change for 1% (the number may actually be higher)

The number is irrelevant. Even if such users exist, they won't be able to read or write any files, as all of fs assumes UTF-8. I doubt they'd be able to run Node.js scripts at all. fs.watch is the least of their concern, singling it out doesn't make sense.

@jorangreef

This comment has been minimized.

Contributor

jorangreef commented Oct 24, 2015

Actually, we have a sync app in production that expects a binary string from fs.watch and converts it to utf-8. The proposed change would break our sync and probably quite a few apps and modules using fs.watch.

@seishun your arguments do not follow, and do not add positively to the discussion. It's not as simple as you think.

@seishun

This comment has been minimized.

Member

seishun commented Oct 24, 2015

Actually, we have a sync app in production that expects a binary string from fs.watch and converts it to utf-8. The proposed change would break our sync and probably quite a few apps and modules using fs.watch.

Fine, make it semver-major then. Why is it a problem? You'll just have to replace new Buffer(filename, 'binary').toString() with filename.

@seishun your arguments do not follow, and do not add positively to the discussion. It's not as simple as you think.

Unfounded statements aren't useful to the discussion either. Care elaborating why it's "not as simple as I think"?

@silverwind

This comment has been minimized.

Contributor

silverwind commented Oct 24, 2015

I'm generally in support of this, but don't understand enough about binary vs. unicode encoding to sign it off. I'll add a TSC label here, so it can be discussed next meeting. Also labeled as breaking as mentioned.

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 25, 2015

Why is it a problem?

Conversion to UTF-8 is not lossless. Invalid byte sequences get replaced with U+FFFD.

@jasnell

This comment has been minimized.

Member

jasnell commented Oct 25, 2015

Let's take a step back and look at what is needed to solve the problem more
generally as @bnoordhuis suggests. A breaking change for 1% of the users is
still a breaking change. If we're going to do it, let's do it right.
On Oct 25, 2015 7:04 AM, "Ben Noordhuis" notifications@github.com wrote:

Why is it a problem?

Conversion to UTF-8 is not lossless. Invalid byte sequences get replaced
with U+FFFD.


Reply to this email directly or view it on GitHub
#3401 (comment).

@seishun

This comment has been minimized.

Member

seishun commented Oct 25, 2015

Conversion to UTF-8 is not lossless.

That's obviously not a problem that affects @jorangreef.

Let's take a step back and look at what is needed to solve the problem more generally as @bnoordhuis suggests.

Alright, can you link some of the issues that the users of non-UTF-8 Unices have submitted about fs.open, fs.readFile, fs.stat, fs.link and all the other fs functions that are broken for them? That would be a more appropriate place to discuss global fs changes.

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Oct 25, 2015

Taking that step back, I think we all agree that node is buggy, we just disagree on how to fix it. I don't want to derail this PR further so I filed #3519 with a recap of the current situation and a call for proposals.

@seishun

This comment has been minimized.

Member

seishun commented Oct 27, 2015

One thing is clear - there will be no breaking changes to fs as the result of #3519. So I don't see any reason why a fix to fs.watch must be postponed until that.

@rvagg rvagg removed the tsc-agenda label Nov 4, 2015

@Trott Trott force-pushed the nodejs:master branch to 082cc8d Dec 27, 2015

@MylesBorins MylesBorins referenced this pull request Jan 11, 2016

Merged

V5.4.1 propose #4626

@seishun seishun referenced this pull request Mar 9, 2016

Closed

fs: add Buffer support in fs methods #5616

0 of 4 tasks complete

jasnell added a commit to jasnell/node that referenced this pull request Mar 25, 2016

fs: Buffer and encoding enhancements to fs API
This makes several changes:

1. Allow path/filename to be passed in as a Buffer on fs methods
2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink,
   fs.readlinkSync and fs.watch.
3. Documentation updates

For 1... it's now possible to do:

```js
fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { });
```

For 2...
```js
fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { });

fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { });
```

encoding can also be passed as a string

```js
fs.readdir('/fs/foo/bar', 'hex', (err,list) => { });
```

The default encoding is set to UTF8 so this addresses the
discrepency that existed previously between fs.readdir and
fs.watch handling filenames differently.

Fixes: nodejs#2088
Refs: nodejs#3519
Alternate: nodejs#3401
@jasnell

This comment has been minimized.

Member

jasnell commented Apr 2, 2016

#5616 landed... which includes decoding filenames in watch as utf8 by default

@jasnell jasnell closed this Apr 2, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment