Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

minimize fs calls #2

Closed
es128 opened this issue Dec 10, 2015 · 3 comments
Closed

minimize fs calls #2

es128 opened this issue Dec 10, 2015 · 3 comments

Comments

@es128
Copy link
Collaborator

es128 commented Dec 10, 2015

For performance reasons, the less calls to fs methods, the better.

Currently, path entries go through a chain of stat -> realPath -> readdir (the last one only for confirmed dirs). But I think it would be better to refactor to something more like:

fs.readdir(path, function(err, entries) {
  if (err.Error === 'ENOTDIR') {
    // path exists, and it is a file or other non-dir fs entry
  } else if (err) {
    return callback(err);
  } else {
    // this is a dir, recurse to the entries
  }
});

Then after that do stat and/or realPath if requested based on opt-in parameters (or keep it simple and just leave it up to the consumer to do on their own).

It could even be possible to allow a really aggressive strategy of assuming that any entry that is named like a file (/\.\w{1,4}$/) is a file and skipping the fs.readdir.

I discussed the same sort of idea in paulmillr/readdirp#24, but when I really looked at it it turned out to seem to be too complex a change to slide into readdirp. I had planned to work on it as a new module soon after knocking down some of the bugs in chokidar that had been reported during my less OSS-active period this year (one of those bugs btw was directly caused by readdirp always using realPath, which is another motivation for not doing it here or making it optional). And then you came along and made this.

It would represent a really significant change to the strategy employed here so far (almost a complete rewrite possibly), but if you're cool with the concept I'd be happy to collaborate on moving in this direction here rather than work on it in a new project.

@bpasero
Copy link

bpasero commented Mar 11, 2016

@es128 @kmalakoff I have had bad experiences doing eager fs.readdir() calls when using more recent versions of node.js. It seems to me that readdir is doing a whole bunch of things you do not want for files. We have similar code to walk a file system and found the current approach to work quite efficient (feedback welcome):

  • always begin with a lstat
  • if file is a link, stat it (you need this because lstat will not tell you if the file is a directory or not unless you stat it)
  • if file is a link, realpath it to get the real target and prevent endless loops if you have links pointing to each other (you will need to add this to walk-filtered to fix this issue)
  • continue with readdir for folders

@kmalakoff
Copy link
Owner

kmalakoff commented Jan 3, 2018

I tried both of your advice, but I couldn't get a faster running verion.

  1. original
Default options x 6.03 ops/sec ±3.70% (33 runs sampled)
Serial (fs) x 3.31 ops/sec ±1.34% (21 runs sampled)
Parallel (fs) x 5.67 ops/sec ±1.81% (31 runs sampled)
Parallel (gfs) x 5.35 ops/sec ±3.06% (30 runs sampled)
Parallel limit (fs, 10) x 6.07 ops/sec ±2.14% (33 runs sampled)
Parallel limit (fs, 50) x 5.99 ops/sec ±2.24% (33 runs sampled)
Parallel limit (fs, 100) x 5.48 ops/sec ±7.40% (31 runs sampled)
Parallel limit (gfs, 10) x 5.93 ops/sec ±2.04% (33 runs sampled)
Parallel limit (gfs, 50) x 5.68 ops/sec ±2.91% (31 runs sampled)
Parallel limit (gfs, 100) x 5.74 ops/sec ±2.23% (32 runs sampled)
  1. No lstat using failing readdir with a check for error.code === 'ENOTDIR'
    https://github.com/kmalakoff/walk-filtered/tree/no-stats
Default options x 1.32 ops/sec ±7.84% (11 runs sampled)
Serial (fs) x 0.62 ops/sec ±19.78% (8 runs sampled)
Parallel (fs) x 1.15 ops/sec ±2.16% (10 runs sampled)
Parallel (gfs) x 1.12 ops/sec ±2.33% (10 runs sampled)
Parallel limit (fs, 10) x 1.21 ops/sec ±2.61% (11 runs sampled)
Parallel limit (fs, 50) x 1.22 ops/sec ±2.08% (10 runs sampled)
Parallel limit (fs, 100) x 1.19 ops/sec ±2.35% (10 runs sampled)
Parallel limit (gfs, 10) x 1.24 ops/sec ±1.50% (11 runs sampled)
Parallel limit (gfs, 50) x 1.22 ops/sec ±2.19% (10 runs sampled)
Parallel limit (gfs, 100) x 1.20 ops/sec ±2.98% (10 runs sampled)
  1. always begin with a lstat, etc from above
    https://github.com/kmalakoff/walk-filtered/tree/optimizations
Default options x 5.98 ops/sec ±3.07% (33 runs sampled)
Serial (fs) x 3.31 ops/sec ±1.53% (21 runs sampled)
Parallel (fs) x 5.67 ops/sec ±2.65% (32 runs sampled)
Parallel (gfs) x 5.55 ops/sec ±3.07% (31 runs sampled)
Parallel limit (fs, 10) x 5.95 ops/sec ±2.84% (33 runs sampled)
Parallel limit (fs, 50) x 5.79 ops/sec ±3.65% (32 runs sampled)
Parallel limit (fs, 100) x 5.87 ops/sec ±2.68% (32 runs sampled)
Parallel limit (gfs, 10) x 5.88 ops/sec ±2.52% (33 runs sampled)
Parallel limit (gfs, 50) x 5.83 ops/sec ±3.25% (32 runs sampled)
Parallel limit (gfs, 100) x 5.71 ops/sec ±2.92% (32 runs sampled)

The original still seems to perform the best. Any ideas? Feel free to restructure and submit a pull request to try to beat it.

@kmalakoff
Copy link
Owner

I've published a refactored library using an asyncIterator API. See fs-iterator

Also, I ran performance tests against readdirp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants