Add recursive option to POSIX filesystem handler #57
This commit adds recursive mode to list in posix
b05a0e4 to baa77c1
4561607 to fd8e12f
chainerio/filesystems/posix.py
Outdated
# use len instead of len + 1 as root in os.walk does not end
# with "/"
prefix_end_index = len(path_or_prefix)
for root, dirs, files in os.walk(path_or_prefix):
Listing a single directory can be very slow (e.g. on an overloaded NFS server), so I'd prefer calling os.scandir() recursively here.
I think os.walk() is already implemented with os.scandir() recursively.
https://github.com/python/cpython/blob/b9877cd2cc47b6f3512c171814c4f630286279b9/Lib/os.py#L350
I checked the implementation, but it is width-first (breadth-first) rather than depth-first, which would feel very slow when a single directory holds a huge number of entries. The behaviour we want is depth-first.
I am not sure I understand why depth-first is faster. My idea of a depth-first implementation would need a list of the directories inside the current directory to iterate into. Getting that list requires calling os.scandir against the current directory, which is the same as width-first.
This would be depth-first listing code (and is virtually what the hdfs filesystem list does):

    import os

    def rec_list(path):
        # yield each entry as soon as it is read, then descend right away
        for c in os.scandir(path):
            path1 = os.path.join(path, c.name)
            yield path1
            if c.is_dir():
                yield from rec_list(path1)

The reason I prefer depth-first listing is the case of a directory with, say, a million children on a file system overloaded with metadata requests, where any metadata lookup is slow. Width-first listing would have to wait until all directory entries are fetched from disk, across that many inodes, before it could start listing each subdirectory. Meanwhile, our way of yielding files is depth-first, as in both hdfs and zip, and so does
I see your point, but I wonder whether depth-first would perform worse when the hierarchy is deep.
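To make the two approaches in this thread concrete, here is a minimal sketch that lists the same tree once with os.walk and once with a recursive os.scandir generator. The function names and the demo tree are hypothetical and are not taken from chainerio; the sketch only assumes that both approaches should produce the same set of paths and differ in when entries become available.

```python
import os
import tempfile


def list_with_walk(root):
    """List everything under root with os.walk (top-down)."""
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk has read the whole directory before it yields this tuple
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def list_with_scandir(root):
    """List everything under root depth-first, yielding entries as they are read."""
    with os.scandir(root) as entries:
        for entry in entries:
            yield entry.path
            # do not follow symlinks, matching the default of os.walk
            if entry.is_dir(follow_symlinks=False):
                yield from list_with_scandir(entry.path)


def make_demo_tree():
    """Create a small hypothetical tree: root/file1 and root/a/b/file2."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("file1", os.path.join("a", "b", "file2")):
        with open(os.path.join(root, rel), "w"):
            pass
    return root
```

Both generators return the same paths; the difference under discussion is only latency to the first results when one directory is very large.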
chainerio/filesystems/posix.py
Outdated
path_or_prefix.rstrip("/")
# use len instead of len + 1 as root in os.walk does not end
# with "/"
prefix_end_index = len(path_or_prefix)
Needs +1 here.
My mistake, sorry. The above line should be path_or_prefix = path_or_prefix.rstrip("/")
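The two review points above (assign the result of rstrip, and add 1 to skip the separator) can be combined as in the following sketch. The helper name relative_to_prefix is hypothetical, and the sketch assumes the goal is to turn a full path into one relative to the given prefix; the actual handler code may be structured differently.

```python
def relative_to_prefix(full_path, path_or_prefix):
    """Return full_path relative to path_or_prefix (hypothetical helper)."""
    # str is immutable: rstrip returns a new string, so the result must be assigned
    path_or_prefix = path_or_prefix.rstrip("/")
    # + 1 skips the "/" that separates the prefix from the rest of the path
    prefix_end_index = len(path_or_prefix) + 1
    return full_path[prefix_end_index:]
```

With the assignment in place, the prefix behaves the same with or without a trailing slash.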
# | - nested_dir1
# | | - nested_dir3
# | _ nested_dir2
test_dir_name = "testlsdir/"
A directory name without a trailing slash would have caught the bug above...
self.assertIn(self.tmpfile_name, file_list)
self.assertNotIn(nested_dir_path2_relative, file_list)

file_list = list(chainerio.list(self.dir_name, recursive=True))
ditto on trailing slash
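One way to cover both spellings of the directory name in a single test is unittest's subTest. The sketch below is only an illustration: list_recursive is a stand-in written for this example, not chainerio's implementation, and the test class name is hypothetical.

```python
import os
import tempfile
import unittest


def list_recursive(path_or_prefix):
    """Stand-in for the handler under test (not chainerio's implementation)."""
    path_or_prefix = path_or_prefix.rstrip("/")
    prefix_end_index = len(path_or_prefix) + 1
    for root, dirs, files in os.walk(path_or_prefix):
        for name in dirs + files:
            yield os.path.join(root, name)[prefix_end_index:]


class TestListTrailingSlash(unittest.TestCase):
    def test_both_spellings(self):
        with tempfile.TemporaryDirectory() as tmp:
            os.makedirs(os.path.join(tmp, "testlsdir", "nested_dir1"))
            # run the same assertion with and without the trailing slash
            for dir_name in ("testlsdir", "testlsdir/"):
                with self.subTest(dir_name=dir_name):
                    found = list(list_recursive(os.path.join(tmp, dir_name)))
                    self.assertEqual(found, ["nested_dir1"])
```

If the prefix handling regresses, the failing subTest names the spelling that broke.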
The possible performance degradation in depth-first recursion would come from stack overflow or from the cost of keeping directory file descriptors open. As I state below, the maximum depth of recursion is limited, and both the max stack size and the max number of file descriptors can be raised, although those numbers won't affect system performance that much. On the max depth of depth-first recursion, these URLs ( https://eklitzke.org/path-max-is-tricky , https://stackoverflow.com/q/7140575 ) are very interesting. Practically, on Linux and Mac we can assume a rather small max depth of recursion, like 255~4k. On Linux I saw
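If recursion depth ever became a concern, the same depth-first order could be kept with an explicit stack of open directory iterators instead of recursive generator calls. This is a sketch under that assumption, not what the PR implements; the name rec_list_iterative is hypothetical, and unlike a plain is_dir() check it does not follow symlinks. It still holds one open directory handle per level, so the file descriptor cost discussed above remains.

```python
import os


def rec_list_iterative(path):
    """Depth-first listing with an explicit stack instead of recursion."""
    stack = [os.scandir(path)]  # one open directory iterator per level
    try:
        while stack:
            entry = next(stack[-1], None)
            if entry is None:
                # this directory is exhausted: release its file descriptor
                stack.pop().close()
                continue
            yield entry.path
            if entry.is_dir(follow_symlinks=False):
                # descend immediately, before reading the next sibling
                stack.append(os.scandir(entry.path))
    finally:
        # close anything still open if the caller stops iterating early
        for it in stack:
            it.close()
```

The number of simultaneously open directories equals the current depth, so it is bounded by the depth of the tree rather than by its total size.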
This PR adds a recursive mode to list in posix. This resolves the third TODO in #46.