Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node occasionally gives multiple files/folders the same inode #12115

Closed
alutman opened this issue Mar 29, 2017 · 63 comments

Comments

@alutman
Copy link

commented Mar 29, 2017

Moderator's note (@Fishrock123): Off-topic comments have and will be deleted.


  • Version: v7.2.1
  • Platform: Windows 10 Pro 64bit, Windows 7 Enterprise 64bit (Both using NTFS)
  • Subsystem: File System

Node sometimes reports different files/folders to have identical ino values.
I can't reproduce this consistently and copying/reproducing a folder structure that contains dupes elsewhere doesn't replicate the issue.

I did encounter lots of duplicates under C:\Users\%USER%\AppData but it may be different for other people

Example

Specific example I encountered

# Structure
│   ServerStore.jsx
│
├───constants
│       Api.jsx
│
└───stores
        UserStore.jsx

> fs.lstatSync("stores").ino
5910974511014218
> fs.lstatSync("stores/UserStore.jsx").ino
24206847997202570
> fs.lstatSync("constants").ino //Duplicate
9851624184963316
> fs.lstatSync("constants/Api.jsx").ino //Duplicate
9851624184963316
> fs.lstatSync("ServerStore.jsx").ino
3659174697792238

Test Script

Here's a hacky node script to loop through a directory and look for duplicate inodes.
Running it on most of my other folders didn't yield a result, until I ran it on C:\Users\%USER%\AppData where I encounted loads of duplicates

Usage: node dupe.js [dir]

var fs = require('fs');
var path = require('path');
var process = require('process');

// Recursively walks a directory and looks for duplicate inodes
// loop from http://stackoverflow.com/questions/5827612/node-js-fs-readdir-recursive-directory-search

var dir = process.argv[2];
if (dir == undefined) {
    dir = '.';
}

var walk = function(dir, done) {

  var results = [];
  fs.readdir(dir, function(err, list) {

    if (err) return done(err);

    var pending = list.length;
    if (!pending) return done(null, results);

    list.forEach(function(file) {

      file = path.resolve(dir, file);
      fs.stat(file, function(err, stat) {
        if(stat && stat.ino) {
            results.push({
                file: file,
                ino: stat.ino
            });
        }

        if (stat && stat.isDirectory()) {
          walk(file, function(err, res) {
            if(res) {
               results = results.concat(res);
            }
            if (!--pending) done(null, results);
          });
        } 
        else {
          if (!--pending) done(null, results);
        }
      });
    });
  });
};

walk(dir, function(err, results) {
    var merge = {};
    results.forEach(function(it) {
        if (!merge[it.ino]) {
            merge[it.ino] = [];
        }
        merge[it.ino].push(it.file);
    });
    var dupes = Object.keys(merge).filter(key => merge[key].length > 1);

    dupes.forEach(it => console.log(it, merge[it]));
})

@mscdex mscdex added the fs label Mar 29, 2017

@bnoordhuis

This comment has been minimized.

Copy link
Member

commented Mar 29, 2017

I suppose it could be either a Windows bug or quirk1 but it could also be caused by node.js converting libuv's 64 bits st_ino field to a double, which won't be lossless for large numbers.

I expect it's the latter that is happening here. You should be able to verify that by instrumenting src/node_file.cc if you are so inclined. Grep for 'st_ino', it's used in only one location.

1 The inode number is derived from NtQueryInformationFile(FileAllInformation). It has some documented quirks and probably a few undocumented as well, see e.g. the section 'Remarks' on this page.

@alutman

This comment has been minimized.

Copy link
Author

commented Mar 30, 2017

I had a little play with that Windows API and ran it directly against files that node reported with the same ino (Using C# DLL import)

Example

Node 
> fs.lstatSync('one.gif').ino
9851624185071828
> fs.lstatSync('two.gif').ino
9851624185071828

Windows 7 NTFS via NtQueryInformationFile
(QuadPart is what is used in libuv)

file: one.gif
type: LARGE_INTEGER
QuadPart: 9851624185071827

file: two.gif
type: LARGE_INTEGER
QuadPart: 9851624185071829

Looks like you were right about the problem being in the double conversion. Here's a C++ snippet highlighting it.

#include <iostream>

int main() {
    long long a = 9851624185071827;
    long long b = 9851624185071829;

    std::cout.precision(100);
    std::cout << a << std::endl; //9851624185071827
    double da = a;
    std::cout << da << std::endl; //9851624185071828

    std::cout << b << std::endl; //9851624185071829
    double db = b;
    std::cout << db << std::endl; //9851624185071828

    return 0;
}

Closest possible fix I could think of is to change the fs_stats_field_array type from double to long long but that sounds like it would just make more problems.

@bnoordhuis

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

Yes, that wouldn't be a solution. JS doesn't have a 64 bits integral type, only a 64 bits floating-point type. Values >= 253 and <= -253 cannot be represented exactly and get rounded up or down.

We could detect such out-of-range values and present them as strings instead of numbers but it's an open question if a type change like that qualifies as a bug fix or a backwards-incompatible semver-major change.

cc @mscdex since you worked on that code recently.

@mscdex

This comment has been minimized.

Copy link
Contributor

commented Mar 30, 2017

There really is no good solution for this until JS gets 64-bit types. IIRC this isn't the only place where node hopes for the best by using a double for 64-bit integer values.

@Fishrock123

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

Best thing to do might be make a native module that copies the core part that you need but returns a string?

@addaleax

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

Alternatively we might think about having a second Stats constructor that treats the uint64_t fields returned by libuv faithfully… we’ve also had people asking for nanosecond resolution in the timer fields (that is otherwise lost by new Date()), this could be a good chance to tackle that, too.

@mscdex

This comment has been minimized.

Copy link
Contributor

commented Mar 30, 2017

Alternatively we might think about having a second Stats constructor

How would that work (having two different constructors at the same time)? Does someone opt-in to the new one with a flag passed to the fs.*stat*() methods or something?

that treats the uint64_t fields returned by libuv faithfully

Define 'faithfully.' Is this an array with the upper and lower 32-bits? A string? An 8-byte Buffer? A new BigInt class of our own? Something else?

we’ve also had people asking for nanosecond resolution in the timer fields (that is otherwise lost by new Date())

I am all for getting rid of the Date instances, it only adds more overhead if users just end up extracting the integer timestamp out of it anyway. At least if we just provide the number, end users could just pass it to new Date() if they need that kind of functionality and everyone else would get a speedup.

@addaleax

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

How would that work (having two different constructors at the same time)?

Yes, a flag (not a fan) or a new set of functions on fs (more of a fan because that won’t slow down any existing code).

Define 'faithfully.' Is this an array with the upper and lower 32-bits?

Something like that. It’s not actually important to me which of your suggestions – they all make sense – but generally something that people could reimplement the current Stats upon in any way they want.

@mscdex

This comment has been minimized.

Copy link
Contributor

commented Mar 30, 2017

something that people could reimplement the current Stats upon in any way they want

I still don't understand this, how they are changing/providing Stats? Are they monkey patching/overwriting fs.Stats with their own implementation? Should we really be encouraging that?

@addaleax

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

@mscdex Sorry, probably not expressing myself clearly. Basically, what I’d imagine is a 2nd API on Node’s side, not changing the current API’s behaviour. If somebody wanted, they could build the current API on top of that as a userland module – I’m not suggesting that they monkey-patch anything (or “change” fs.Stats at all).

@jorangreef

This comment has been minimized.

Copy link
Contributor

commented Mar 31, 2017

Personally, I would prefer if Node fixed this properly going forward.

I think passing the inode as a String would be appropriate since an inode is in essence an identifier. It happens to look like an integer, but it may as well be hex for all intents and purposes. I'm not sure if passing the inode as a String would have any performance implications? AFAIK it's a minimum of 24 bytes allocation compared to an integer? I wouldn't mind a Buffer either if that's lighter.

Referring to @mscdex's comment, it would be terrific if we could do away with instantiating thousands of Date objects, when users really just need the timestamp.

@kkoopa

This comment has been minimized.

Copy link

commented Apr 2, 2017

Just split the 64-bit word into a high and low 32-bit word and leave arithmetic to the user.

@jorangreef

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2017

Just another idea, the 64-bit inode integer returned by Windows could be remapped through a hash into the 53-bit address space supported by JS integers. This should give less collisions than the current folding.

@stickywes

This comment has been minimized.

Copy link

commented Apr 17, 2017

nodejs/node-v0.x-archive#2670

Here is a conversation from a few years ago about what to do with Windows inodes. It is interesting to see the same conversation about the issue happening all over again!

@stickywes

This comment has been minimized.

Copy link

commented Apr 17, 2017

It seems like a combination of the string solution @jorangreef advocates (and @piscisaureus advocated nearly 2 years ago) and the 64 bit high/low fields @kkoopa suggests is the way to go, to me. Or, rather, I don't see why we need to choose between the two.

A string will allow Node to provide a consistent cross-platform representation at the cost of performance during comparisons. Switching an integer to a string would be a breaking change, though, so you'd need to account for that.

The 64 high/low fields would allow you to do performant inode comparisons if you need them.

@sebadoom

This comment has been minimized.

Copy link

commented Apr 18, 2017

I still find it odd no big-integer library is part of standard Node.js. It would have solved problems like this until JS got proper support for 64-bit integers.

@Fishrock123

This comment has been minimized.

Copy link
Member

commented Apr 18, 2017

I still find it odd no big-integer library is part of standard Node.js. It would have solved problems like this until JS got proper support for 64-bit integers.

I suppose we could have done that like with Buffer. I'm not sure why we didn't but I am quite sure there would have been some history of that coming up that pre-dates me.

Just split the 64-bit word into a high and low 32-bit word and leave arithmetic to the user.

I agree with this, we could just return a Uint32Array of 2.

The questions then become:

  • Is "BigNum" support far enough out that this is worth it?
  • Can we transition easier to "BigNum" from the existing APIs, or from from a new API, or will we end up with 3?
@Fishrock123

This comment has been minimized.

Copy link
Member

commented Apr 18, 2017

Actually, on second look we'd be adding a property to fs.Stats, which seems generally less troublesome.

@dchest

This comment has been minimized.

Copy link

commented Apr 18, 2017

Why is there even discussion about big or small numbers? inode is a number only by accident (because it was an index into some internal array on early UNIX), in reality and semantically it's just an arbitrary identifier. Arithmetic on it is useless, so make it a string, object, whatever, it doesn't matter, just make sure it can be converted back and forth into the OS representation internally.

@deckar01

This comment has been minimized.

Copy link

commented Apr 19, 2017

FWIW comparing hex encoded strings is actually faster not that much slower than comparing UInt32Arrays in V8. Theoretically string comparison should be ~6.67% slower, but V8 optimizations make it 17% faster.

https://jsperf.com/64bit-comparison
https://jsperf.com/64bit-comparison-es-5

Strings also make better keys than arrays. 😉

@TimothyGu

This comment has been minimized.

Copy link
Member

commented Apr 19, 2017

@deckar01 your benchmark is a bit flawed as it uses the let syntax which probably forces V8 to use TurboFan, which does not optimize TypedArrays as well as Crankshaft currently does. However, even with that, my Chrome 57 still runs faster on the TypedArray case than on strings. And if I eliminate the use of let, TypedArray is quite a bit faster than strings (158 vs. 102), which makes sense.

@dchest

This comment has been minimized.

Copy link

commented Apr 19, 2017

Typed arrays are mutable though, so string is a better choice for immutable identifiers. (This is a kind of a design decision that should be considered and discussed before doing benchmarks.)

Strings are also transparently encoded and decoded as JSON, unlike typed arrays, so programs that store inodes in JSON-encoded structures will continue doing so without issues. Finally, strings can be compared with ===, just like numbers.

@Fishrock123

This comment has been minimized.

Copy link
Member

commented Apr 19, 2017

I, personally, am totally fine with a string if that works well.

@jasnell

This comment has been minimized.

Copy link
Member

commented Apr 19, 2017

given the lack of anything else more suitable, I'm good with string also

@Memnarch

This comment has been minimized.

Copy link

commented Apr 20, 2017

I'd just have gone with low/high 32bit or 8Byte array.

Btw just to mention, since we are at using strings anyway as it seems: ReFS uses 128Bit identifiers. The 64bit part is not unique there:

The ReFS file system, introduced with Windows Server 2012, includes 128-bit file identifiers. To retrieve the 128-bit file identifier use the GetFileInformationByHandleEx function with FileIdInfo to retrieve the FILE_ID_INFO structure. The 64-bit identifier in this structure is not guaranteed to be unique on ReFS.

It seems that NtQueryInformationFile FILE_INTERNAL_INFORMATION structure and GetFileInformationByHandle(Ex) deliver the FileIDs, except the later one supports returning the 128Bit one in FILE_ID_INFO structure

FILE_INTERNAL_INFORMATION
https://msdn.microsoft.com/de-de/library/windows/hardware/ff540318(v=vs.85).aspx
FILE_ID_INFO
https://msdn.microsoft.com/en-us/library/hh802691(v=vs.85).aspx

@decal

This comment has been minimized.

Copy link

commented Apr 20, 2017

I did encounter lots of duplicates under C:\Users%USER%\AppData but it may be different for other people

I would venture a guess that this is because the AppData folder contains many files that are <= the block size; causing new file control block creation in large numbers, and therefore inodes as well.. This almost sounds like a race condition in the underlying file system driver? shrug

@ghost

This comment has been minimized.

Copy link

commented Apr 20, 2017

@decal it's probably more because of the large number of config files, caches, and whatnot stored in there for the whole zoo of Windows applications. Create a lot of files in the same directory and some of their inodes are bound to be close.

@bnoordhuis

This comment has been minimized.

Copy link
Member

commented Apr 20, 2017

Off-topic comment deleted. Next offender gets a ban.

(Added a note to the OP -@Fishrock123)

shisama added a commit to shisama/node that referenced this issue Jun 12, 2018

doc: document BigInt support in fs.Stats
PR-URL: nodejs#20220
Fixes: nodejs#12115
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

targos added a commit that referenced this issue Jun 13, 2018

fs: support BigInt in fs.*stat and fs.watchFile
Add the `bigint: true` option to all the `fs.*stat` methods and
`fs.watchFile`.

PR-URL: #20220
Fixes: #12115
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

targos added a commit that referenced this issue Jun 13, 2018

doc: document BigInt support in fs.Stats
PR-URL: #20220
Fixes: #12115
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

shisama added a commit to shisama/node that referenced this issue Jun 17, 2018

fs: support BigInt in fs.*stat and fs.watchFile
Add the `bigint: true` option to all the `fs.*stat` methods and
`fs.watchFile`.

PR-URL: nodejs#20220
Fixes: nodejs#12115
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

shisama added a commit to shisama/node that referenced this issue Jun 17, 2018

doc: document BigInt support in fs.Stats
PR-URL: nodejs#20220
Fixes: nodejs#12115
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@verdy-p

This comment has been minimized.

Copy link

commented Nov 14, 2018

I suppose it could be either a Windows bug or quirk1 but it could also be caused by node.js converting libuv's 64 bits st_ino field to a double, which won't be lossless for large numbers.

conversions of 64-bit integers to a double (Number in Javascript) is necessarily lossy. Of course a double is represented by a 64-bit field, but it is then normalized for some values of the exponent part:

  • the special exponent field value representing infinite and NaN values where all the mantissa bits are cleared except the 2 highest bits used to distinguish infinites from NaN, and to distinguish signaling and non-signaling NaN which is set to non-signaling according to the native FPU
  • the special exponent field value representing zeroes and denormals: Javascript normalize zeroes to discard its sign, and normalize denormals to zeroes.

So basic unsigned 64-bit integer values represented in hex by

  • 0x8n...nn to 0xFn..nnn are normalized to 0x8n..nn or 0xAn..nn (as a negative, non-NaN, non-infinite, non-denormal double)
  • 0x0n...nn or 0x7n..nnn are normalized to 0x0n..nn or 0x3n..nn (as a negative, non-NaN, non-infinite, non-denormal double)

The best you can get is not preserving the full 64 bits, but only the sign bit, then force the next bit to 0 (avoids zeros/denormals/Nans,infinites where this bit is necessarily 1) and the 62 lowest bits (note that this transform will not preserve the apparent numeric value of this coerced "double" which will be very different from the value of the 63-bit integer (note also that the 63-bit is also not fully preserved the most negative value is also forbidden): you can only represent a 62-bit signed or unsigned integer this way (to convert that "strange" double to a computable and comparable value, you still need to use a BigInt): the unsigned 64-bit 0xFFFF_FFFF_FFFF_FFFF (-1 if signed) is not stored exactly, you loose 2 bits, you can only store 0x9FFF_FFFF_FFFF_FFFF (which would also be a negative double). With such 62-bit field, you can't perform any arithmetic (all you can do is to compare them, but even a double addition or substraction or multiplication or division, or just a negation, will not give you the correct result in this representation).

So your inodes are necessarily limited to 62 bit: if the two highest bits of the 64-bit integer inode are significant, they will be lost. So you get the bug where two distinct files on the same volume (not hardlinked to the same store, so with possibly different contents and different security attributes) will apparently have the same inode in Javascript (the bug can be reproduced only on very large volumes which can hold many inodes but does not affect mosts users using harddisks or small RAID arrays: it occurs on very large cloud storages where there's a single virtual volume; it does not affect clients of these clouds that have partitioned their space in multiple virtual devices, but affects the very large volumes needed for very database storage of several petabytes). The only safe work around for now is to use a second Number to represent a 64-bit inode... or a BigInt.

If inodes are represented as Javascript numbers with identical values, then max inodes cannot be more than MAX_SAFE_INTEGER=(2**53-1) per volume, i.e. you cannot have more than ~9 billions millions inodes per volume (i.e. per mountable filesystem device), and Javascript should be safe: this is already a very considerable space that only a few very large cloud providers like Google can offer in their big farms to their business customers (but something that will be useful for example to create a filesystem-based hashstore for RDF, with lot of extremely small files, typically about 1KB each, i.e. a filesystem total size slightly above 10^16 KB when counting the extra filesytem overhead for managing its own filename indexes, transaction logs and security attributes; with typical harddisks of 1TB each, such giant volume would require more than 100,000 harddisks, not counting redundant disks for RAID 5/6, but for Google that manages millions of customers and allows allocating several Gigabytes for each, such quantity is not demesurated, but Google would normally partition this space with one virtual volume per customer)! Even for Youtube, all videos don't need to be in the same volume and have to be in the storage space of one client

So I have serious doubt that the 53 bit limit of inodes seen in Javascript is a real problem, except if you use javascript as a virtual connector to millions remote volumes, seen in Javascript as if it was only one: in that case, it's jsut best to split this huge space by creating multiple virtual volumes.

The bug may be visible however if inodes returned to applications are not linear but are actually weak hashes (such as MD5) of effective remote inodes.

Or if inodes are used not just to locate a file in the volume, but also to track other information such as a transaction id or version number, like in ReFS (not all nodes in a volume may be accessible simultaneously or with the same security context), in which case you need more bits to store that info (which may be encrypted along with the location in the inode exposed to clients, that must then use inode values exactly, without any loss).

An inode number may even be an opaque 128-bit UUID (created randomly and associated to real inodes on servers, or with a strong hash like SHA), where most inode numbers on the same volume are invalid (inaccessible to the client) and the rest is distributed completely randomly over the 128-bit space: truncating this to 53 bit would necessarily reduce the usable space on the volume to just a tiny fraction of the volume before collisions in 53-bit values start occuring for very distinct 128-bit values.

My opinion is that even Unix/Linux/POSIX will reform the datatype used for "inode numbers" to allow much longer pseudo-integer types (inode_t), and that it may as well be a long structure containing multiple words whose actual storage length/precision will depend on each device/volume, even if 64-bit will remain largely enough for any physical storage volumes like harddisks or RAID arrays or most clouds. So the BigInt representation in Javascript will still be a solution, even if Javascript is extended with native 64-bit integer support.

@refack

This comment has been minimized.

Copy link
Member

commented Nov 14, 2018

@verdy-p fs.stats can now return a struct of BigInt - https://nodejs.org/api/fs.html#fs_fs_stat_path_options_callback

Ref: #23821

@verdy-p

This comment has been minimized.

Copy link

commented Nov 14, 2018

Just another idea, the 64-bit inode integer returned by Windows could be remapped through a hash into the 53-bit address space supported by JS integers. This should give less collisions than the current folding.

This is identical to the solution of converting inode numbers to strings: strings are already hashed; but strings at least can detect collisions and ensure uniqueness.

@verdy-p

This comment has been minimized.

Copy link

commented Nov 14, 2018

@verdy-p fs.stats can now return a struct of BigInt - https://nodejs.org/api/fs.html#fs_fs_stat_path_options_callback

Ref: #23821

Unfortunately, BigInts are easily mutable, and not suitable as atomic identifiers. And they suffer from serilization locking problem (performance cost) when all we need is an atomic store, which is what strings offer natively (Javascript strings do not have to be displayable or even valid UTF-16, they are just immutable vectors of unsigned 16-bit code units; if we want to display these identifiers, we can still reencode them easily into other valid UTF-16 strings). Strings are also faster to compare (at least for equality: this is almost immediate by comparing only the references, if they are stored in a hashtable). But you may argue that we'll suffer from the storage cost of the hastable to make them atomic if we want to manage large collections of inodes: they would slow down all the rest of the API using strings. In terms of memory management and garbage collection, their cost will be equivalent to BigInts which also need to be allocated (the hashing of inode strings is not absolutely necessary as these strings are short enough to be compared directly, without comparing first their hash then looking up their actual location in the hashtable).
I'm not sure that BigInts are solutions, it is largely possible and very likely that file identifiers will be strings with variable length, just like URIs: consider WebDav filesystems...

@ljharb

This comment has been minimized.

Copy link
Member

commented Nov 14, 2018

BigInts are primitives; how are they mutable?

@refack

This comment has been minimized.

Copy link
Member

commented Nov 14, 2018

Adding a string inode field will require some non trivial changes to our marshaling logic. ATM we pass the data as using a globaly mapped array of either double or BigInt:

node/src/node_file.h

Lines 193 to 219 in 9827858

template <typename NativeT, typename V8T>
constexpr void FillStatsArray(AliasedBuffer<NativeT, V8T>* fields,
const uv_stat_t* s, const size_t offset = 0) {
fields->SetValue(offset + 0, s->st_dev);
fields->SetValue(offset + 1, s->st_mode);
fields->SetValue(offset + 2, s->st_nlink);
fields->SetValue(offset + 3, s->st_uid);
fields->SetValue(offset + 4, s->st_gid);
fields->SetValue(offset + 5, s->st_rdev);
#if defined(__POSIX__)
fields->SetValue(offset + 6, s->st_blksize);
#else
fields->SetValue(offset + 6, 0);
#endif
fields->SetValue(offset + 7, s->st_ino);
fields->SetValue(offset + 8, s->st_size);
#if defined(__POSIX__)
fields->SetValue(offset + 9, s->st_blocks);
#else
fields->SetValue(offset + 9, 0);
#endif
// Dates.
fields->SetValue(offset + 10, ToNative<NativeT>(s->st_atim));
fields->SetValue(offset + 11, ToNative<NativeT>(s->st_mtim));
fields->SetValue(offset + 12, ToNative<NativeT>(s->st_ctim));
fields->SetValue(offset + 13, ToNative<NativeT>(s->st_birthtim));
}

Adding another string value will also have a performance penalty.

We could add make this an opt-in new option, something like fs.stat({withStringINode: true})

@refack refack added this to backlog/orphan in feature requests via automation Nov 14, 2018

@verdy-p

This comment has been minimized.

Copy link

commented Nov 14, 2018

fs.stat({withStringINode: true})
I see that more like:

  • fs.stat({as: {ino: 'string'}}), or
  • fs.stat({as: {ino: 'BigInt'}}), the default option being
  • fs.stat({as: {ino: 'Number'}})
    notably if other fields will need to change size in the future (e.g. device numbers, mode, nlink, uid/gid, size/blksize/blocks, dates, ...)

As well some of the returned properties returned by fs.stat() may not be necessary, but costly to get from the underlying OS, and a script may just require a specific set of values, or just one, leaving others possibly unset in the returned Javascript object (also if these properties make no sense for the remote filesystem):

  • fs.stat({req: {ino: true}, as: {ino: 'Number'}})
    If the filesystem returns signatures/hashes or version/view/transaction identifiers or other properties, we could as well request them:
  • fs.stat({req: {hash: 'SHA1', version: true}, as: {hash: 'string', version: 'Number'}})
    And the set of properties that can be returned could be queried as well and returned as an array of strings (or as a single space-separated string):
  • fs.stat({req: {propset: true}, as: {propset: 'Array'}})
    Another solution would be to pass an object prefilled with properties of the expected type:
  • props = {ino = newBigInt()}; fs.stat({into: props}); then use props.ino (if the query cannot be satisfied, the property would be unset).
    Some properties may require passing an authorization token (to allow elevation of privilege), e.g. to query concurrent locks on a file, and some info about lockers. Though I still don't know how to get this token with fs.stat() itself, I can imagine that the "securityTokens" given below may work like a modifiable jar with getters/setters (allowing to first store before in it calling fs.stat, a token that will prompt the user securely like for UAC on windows, or similar prompts by webbrowsers, store the reply as another token in the jar to avoid asking the user repeatedly, and then use that accept/deny token to process further queries with the same tokens jar). Or specify a maximum timeout (if the costly query may be long to process):
  • props = {ino = newBigInt()}; fs.stat({into: props, with:{security: securityTokens, timeout: 1000}});
@refack

This comment has been minimized.

Copy link
Member

commented Nov 14, 2018

So I'm thinking iteratively.
ATM we have:

  • fs.stat() - regular (i.e. values as Number)
  • fs.stat({bigint}) - values as BigInt

What I'm suggesting is adding an independent { inodeString } option that will be orthogonal to bigint true or false, i.e.

  • fs.stat({bigint:true/false, inodeString}) - that returns the same old stat but with a new stat.inodeString.

This is a nice server-minor change. It's opt-in so those who don't want it don't pay for it.

As for unneeded fields, on POSIX we use stat syscall, so we get everything at the same price.
If we're considering backward compatibility, it's safer to keep that as it is.

@verdy-p

This comment has been minimized.

Copy link

commented Nov 14, 2018

Note that you speak about POSIX, not actual systems. Even if most OSes implement more or less the POSIX API, filling all the fields may be costly, and not enough to get file info for all filesystems. In the web context we can speak about virtual filesystems with web APIs that need security enforcement, and where getting all the infos at once may be costly or simply denied: under POSIX rules they would be set to pseudovalues like you do in your code by setting arbitrarily some fields arbitrarily to 0.
A file may itself have also several contents: a main content with size (all other fields are not relevant), and possibly an additional directory content which is enumeratable (in NTFS this could be a set of streams, on the web it would be roughly the same as adding a "/" to an URL) and you could query that content with fs.stat({props: "dir"}): the property "dir" returned would be a directory object, not a simple string or number. There can be potentially a lot of properties for files, like versions, views, streams; users/groups are not necessarily reduced to just a single integer; modes are not necesssarily limited to just a few bits and we may need to query modes for specific groups/users, or according to the securityTokens provided. Filesize and dates may eventually be not defined (for continuous live streams). Another info is if the file can be seeked to random positions or not, or if it can be seeked before its initial start position (a live stream may have a record of the last hour before the first position you get when opening it; file positions you can seek to are not necesssarily byte offsets but may be timestamps and two successive timestamps may be separated by unbreakable records with variable byte sizes.
Conceptually the POSIX rules just tries to mimic what was in the minimum common set of features implemented in the first filesystems for Unix; not taking into account new filesystems/data services that have appeared over time or those that existed since long in other systems like former mainframes, then VMS, and then NT, and the web (many of these are now integrable in Unix/Linux as well).

@dchest

This comment has been minimized.

Copy link

commented Nov 14, 2018

Why is there's a theoretical discussion about future and past file systems, and request for string option? The original issue was solved by introducing BigInt for inode numbers, which is good enough for existing file systems on all platforms supported by Node, in line with the purpose of fs module: "The fs module provides an API for interacting with the file system in a manner closely modeled around standard POSIX functions", can represent arbitrary data in an immutable value, and can be compared with ==. (The only downside compared to strings is that it can't currently be directly encoded with JSON.stringify, but that's a problem for those who encode them to solve.) I only proposed to use string or any other immutable object because we didn't have BigInt, but now that we have them, it's as good as anything else: a value referencing an inode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.