Skip to content
This repository has been archived by the owner on May 4, 2018. It is now read-only.

uv_fs_readdir and uv_fs_readdir_next should use readdir #1430

Closed
tjfontaine opened this issue Aug 19, 2014 · 16 comments
Closed

uv_fs_readdir and uv_fs_readdir_next should use readdir #1430

tjfontaine opened this issue Aug 19, 2014 · 16 comments
Labels

Comments

@tjfontaine
Copy link
Contributor

The design intent of readdir is to handle streaming iteration of a large directory, the current implementation of uv_fs_readdir and uv_fs_readdir_next (while allocating less) do not actually fill this design need.

Seems like uv_fs_readdir is the only function that's needed here, and can be invoked multiple times, both on and off the threadpool (like with the current other implementations). And then you should be able to call uv_close on the readdir resource. We should treat like it were just another stream.

@tjfontaine tjfontaine added the API label Aug 19, 2014
@saghul
Copy link
Contributor

saghul commented Aug 19, 2014

You mean like making uv_fs_readdir an actual stream? The idea with uv_fs_readdir_next was that thet abstraction can be built outside of libuv. Using readdir underneath will probably give us the memory consuption goals we have.

@txdv
Copy link
Contributor

txdv commented Aug 20, 2014

scandir uses readir.

@saghul
Copy link
Contributor

saghul commented Sep 19, 2014

I don't enjoy the idea of making the readdir request a handle-like thing at all.We do have uv_fs_req_cleanup, which would cleanup whatever necessary. I'm not sure about how the API would look like (yet), but I don't see this happening before 1.0.0. See my comment on #1431 regarding the change of name to uv_fs_scandir.

@tjfontaine
Copy link
Contributor Author

I had @jclulow looking at this earlier, hopefully he will be able to show an example of what I'm thinking of

@jclulow
Copy link

jclulow commented Sep 19, 2014

We should be exposing the POSIX opendir(), readdir() and closedir() functions as uv_fs_ wrappers, probably with a readdir-specific handle type (analogous to DIR *). This set of interfaces may also be emulated on Windows platforms with FindFirstFile, FindNextFile and FindClose, as we do in libuv today.

This pattern is the only way to get a scalable, streams-like interface that allows for reading very large directories without blocking threads for an unreasonable period of time, and/or blowing out the available heap memory in the process. I'm working on a patch to move to this interface, and to preserve the existing (essentially broken) interfaces by renaming them uv_fs_scandir, et al, as per #1431.

We need to get this done before committing to an interface. Releasing a major version with the status quo means libuv is simply unfit for reading large directories.

@saghul
Copy link
Contributor

saghul commented Sep 19, 2014

We should be exposing the POSIX opendir(), readdir() and closedir() functions as uv_fs_ wrappers, probably with a readdir-specific handle type (analogous to DIR *). This set of interfaces may also be emulated on Windows platforms with FindFirstFile, FindNextFile and FindClose, as we do in libuv today.

We could instead have uv_fs_readdir and uv_fs_readdir_next. The first one would do opendir + readdir X entries, and the latter would readdir X more entries. When end it reached it would closedir. If the end is never called because the user doesn't finish reading we would do it in uv_fs_req_cleanup.

This pattern is the only way to get a scalable, streams-like interface that allows for reading very large directories without blocking threads for an unreasonable period of time, and/or blowing out the available heap memory in the process. I'm working on a patch to move to this interface, and to preserve the existing (essentially broken) interfaces by renaming them uv_fs_scandir, et al, as per #1431.

Please do so as separate patches / PRs. Also I'd like for us to discuss the API further before you jump into writing code, as I mentioned above I don't like the idea of exposing the 3 APIs much.

We need to get this done before committing to an interface. Releasing a major version with the status quo means libuv is simply unfit for reading large directories.

It means that version is not ideal. There are many things we can improve, but then we'd wait for ever. This was never communicated as a priority, and I really don't want to rush it in the last minute, sorry.

@jclulow
Copy link

jclulow commented Sep 19, 2014

I don't think it makes sense to conflate the opendir() and the readdir() steps behind one magical function call. They are two orthogonal operations, with different sets of expected error conditions. I do agree that we should provide support for batching N different readdir() calls into one uv_fs_readdir() call -- batch size ought to be an option to uv_fs_opendir(), so that we may pre-allocate the pool of memory for directory entry structures.

@saghul
Copy link
Contributor

saghul commented Sep 19, 2014

There is no use for exposing opendir() other than calling readdir, so what's the point of it? My suggestion makes the API simpler to use IMHO. Also, I assume your proposal would reuse the same req object across calls to keep state, which not consistent with the current fs API.

I'd like to hear some more opinions here: /cc @indutny @piscisaureus @bnoordhuis

@jclulow
Copy link

jclulow commented Sep 19, 2014

Opening a directory is an operation that has particular failure modes; e.g. the directory does not exist, or you do not have permission to read entries from it. Reading entries from that directory has different failure modes, many of which are less clear; e.g. a particular directory entry could not be read by a process with the current data model (32-bit vs 64-bit), or the directory is partially corrupt, after a certain number of directory entries have been read.

Also, it emphatically does not make sense to produce an iterator (which this is) function where you receive a set of results from two different entry points. You should begin the iteration with some begin function (uv_fs_opendir()) and you should receive records through only one function (uv_fs_readdir()).

@saghul
Copy link
Contributor

saghul commented Sep 19, 2014

Opening a directory is an operation that has particular failure modes; e.g. the directory does not exist, or you do not have permission to read entries from it. Reading entries from that directory has different failure modes, many of which are less clear; e.g. a particular directory entry could not be read by a process with the current data model (32-bit vs 64-bit), or the directory is partially corrupt, after a certain number of directory entries have been read.

So? We'd return the error in the callback as per usual.

Also, it emphatically does not make sense to produce an iterator (which this is) function where you receive a set of results from two different entry points. You should begin the iteration with some begin function (uv_fs_opendir()) and you should receive records through only one function (uv_fs_readdir()).

My proposal already addresses that: uv_fs_readdir would do the opendir and possibly readdir a few entries which would be cached, then uv_fs_readdir_next would be called in the callback called after uv_fs_readdir to fetch results. uv_fs_readdir_next would also need a callback because it needs to do a roundtrip to the threadpool to fetch more results. Once there are no more results, the callback would be called with UV_EOF and closedir called internally.

@tjfontaine
Copy link
Contributor Author

The spirit of the APIs here is that you have some form of an initialize and then the subsequent operations that happen on that resource, consider uv_tcp_init or uv_fs_open. They are separate things from read or write, and have different concerns.

A contrived example to demonstrate the ideal API:

int readdir_sync() {
  uv_fs_t req;
  uv_dir_t dirh;
  uv_dirent_t entry;

  assert(uv_fs_opendir(uv_default_loop(),
                                    &req,
                                    &dirh,
                                    "/tmp",
                                    UV_DIR_FLAGS_NONE,
                                    NULL) == UV_OK);

  while((uv_fs_readir(uv_default_loop(), &req, &dirh, &entry, NULL) != UV_EOF)) {
   assert(req.result == UV_OK);
    //do something with entry
  }

  uv_close(dirh);
}
uv_fs_t req;
uv_dir_t dirh;
uv_dirent_t entry;

void readdir_cb(uv_fs_t* req) {
  assert(req->ptr == &entry);
  assert(req->handle == &dirh);
  if (req->result == UV_EOF) {
   uv_close(&dirh);
  } else {
  // do something with entry
  // have the option to do multiple *sync* readdir operations
  // or defer back to the threadpool
     uv_fs_readdir(uv_default_loop(),
                         &req,
                         &dirh,
                         &entry,
                         readdir_cb);
  }
}

void opendir_cb(uv_fs_t* req) {
  assert(req->ptr == &dirh);
  uv_fs_readdir(uv_default_loop(),
                         &req,
                         &dirh,
                         &entry,
                         readdir_cb);
}

int readdir_async() {
  uv_fs_opendir(uv_default_loop(),
                          &req,
                          &dirh,
                          "/tmp",
                          UV_DIR_FLAGS_NONE,
                          opendir_cb);
}

Having opendir allows us in the future to pass flags that could be useful for implementing filtering or some other platform specific extensions to these methods.

@bnoordhuis
Copy link
Contributor

Opening a directory is an operation that has particular failure modes; e.g. the directory does not exist, or you do not have permission to read entries from it. Reading entries from that directory has different failure modes, many of which are less clear; e.g. a particular directory entry could not be read by a process with the current data model (32-bit vs 64-bit), or the directory is partially corrupt, after a certain number of directory entries have been read.

I agree with you on the first part but readdir() and readdir_r() cannot really fail absent libuv bugs.

@saghul
Copy link
Contributor

saghul commented Sep 22, 2014

I gave this more thought during the weekend and I think I was wrong, exposing the 3 APIs seems to be the sanest thing to do.

Here are some things I mentally noted, some obvious:

  • We need some platform dependent uv_dir_t, a typedef should do
  • uv_fs_t will probably need at least a new field for storing/returning a uv_dir_t
  • We probably want to preallocate an array of uv_dirent_t and pass it to uv_fs_readdir for them to be filled in
  • @tjfontaine you mentioned something about flags passed to opendir, can you elaborate on that? I don't see it getting flags (on Linux, that is)

API:

  • int uv_fs_opendir(uv_loop_t* loop, uv_fs_t* req, const char* path, uv_fs_cb cb)
  • int uv_fs_readdir(uv_loop_t* loop, uv_fs_t* req, const uv_dir_t* dir, uv_dirent_t[] entries, unsigned int nentries, uv_fs_cb cb)
  • int uv_fs_closedir(uv_loop_t* loop, uv_fs_t* req, const uv_dir_t* dir, uv_fs_cb cb)

The result field in uv_fs_t would contain the number of entries which were filled. Or < 0 in case of error.
We can use the ptr field to hold the pointer to the array of dirents.

NOTE: Each API call requires a fresh request.

Open questions:

  • Regardless of the API accepting a single entry or an array of them, space for the name needs to be allocated, maybe have an extra function to help with that?
  • What happens if we create 2 readdir requests for the same directory? On Unix we should be fine because we'd use readdir_r, but how about windows?

misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 9, 2014
Tested on Linux, MacOS X and SmartOS.

Fixes joyent#1430
@misterdjules
Copy link

@tjfontaine @saghul I've submitted an initial PR that implements @tjfontaine's proposal. Further details are in the PR's comments.

misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 10, 2014
Tested on Linux, MacOS X and SmartOS.

Fixes joyent#1430
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 10, 2014
Tested on Linux, MacOS X and SmartOS.

Fixes joyent#1430
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 14, 2014
Tested on Linux, MacOS X and SmartOS.

Fixes joyent#1430
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 14, 2014
Tested on Linux, MacOS X and SmartOS.

Fixes joyent#1430
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 14, 2014
Tested on Linux, MacOS X and SmartOS.

Fixes joyent#1430
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Oct 14, 2014
Tested on Linux, MacOS X, SmartOS and Windows.

Fixes joyent#1430
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Nov 11, 2014
Tested on Linux, MacOS X, SmartOS and Windows.

Fixes joyent#1430.
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Nov 12, 2014
Tested on Linux, MacOS X, SmartOS and Windows.

Fixes joyent#1430.
@misterdjules
Copy link

@saghul @tjfontaine I submitted #1574, a new PR that fixes this issue and implements @saghul's latest design proposal.

@saghul
Copy link
Contributor

saghul commented Nov 26, 2014

Closing, work as begun at #1574

@saghul saghul closed this as completed Nov 26, 2014
misterdjules pushed a commit to misterdjules/libuv that referenced this issue Dec 3, 2014
Tested on Linux, MacOS X, SmartOS and Windows.

Fixes joyent#1430.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

6 participants