Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add recursive option to POSIX filesystem handler #57

Merged
merged 8 commits into from
Sep 20, 2019

Conversation

belldandyxtq
Copy link
Member

This PR adds recursive mode to list in posix. This resolve the third todo in #46.

@belldandyxtq belldandyxtq added cat:enhancement Implementation that does not break interfaces. cat:feature Implementation that introduces new interfaces. and removed cat:enhancement Implementation that does not break interfaces. labels Sep 17, 2019
@belldandyxtq belldandyxtq force-pushed the add_posix_recursive branch 2 times, most recently from 4561607 to fd8e12f Compare September 17, 2019 12:23
# use len instead of len + 1 as root in os.walk does not end
# with "/"
prefix_end_index = len(path_or_prefix)
for root, dirs, files in os.walk(path_or_prefix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single directory listing can be very slow (e.g. overloaded NFS server) and thus I'd prefer recursively calling os.scandir() here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the os.walk() is implemented with os.scandir() recursively.
https://github.com/python/cpython/blob/b9877cd2cc47b6f3512c171814c4f630286279b9/Lib/os.py#L350

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the impl, but it's width-first and not depth-first, which would make us feel very slow when number of entries per single directory is huge, while our desired behaviour is depth-first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand why depth-first is fast. My idea of implementing the depth-first would need a list of directories inside current directory to iterate into. While getting the list of directories needs to call os.scandir against current directory, which is the same as width-first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a depth-first listing code (and virtually done in hdfs filesystem list):

def rec_list(path):
    for c in os.scandir(path):
        path1 = os.path.join(path, c.name)
        yield path1

        if c.isdir():
            yield from rec_list(path1)

@kuenishi kuenishi changed the title Add posix recursive Add recursive option to POSIX filesystem handler Sep 18, 2019
@belldandyxtq belldandyxtq mentioned this pull request Sep 18, 2019
3 tasks
@kuenishi
Copy link
Member

The reason why I prefer depth-first listing is for a case with a directory that includes, say, million children and file system is overloaded by too much meta data, which causes any metadata reference slow. Width-first listing would need waiting for all directory entries fetched from disk across so many inodes, and then listing each directories. While our way of yielding files are depth-first as in both hdfs and zip, and so does ls(1). I believe depth-first printing is more natural and intuitive.

@belldandyxtq
Copy link
Member Author

belldandyxtq commented Sep 18, 2019

I see your point, but I wonder if we have a deep hierarchy, then the depth-first would give a worse performance.

path_or_prefix.rstrip("/")
# use len instead of len + 1 as root in os.walk does not end
# with "/"
prefix_end_index = len(path_or_prefix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs +1 here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, sorry. The above line should be path_or_prefix = path_or_prefix.rstrip("/")

# | - nested_dir1
# | | - nested_dir3
# | _ nested_dir2
test_dir_name = "testlsdir/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A directory name without trailing slash would have found the bug above...

self.assertIn(self.tmpfile_name, file_list)
self.assertNotIn(nested_dir_path2_relative, file_list)

file_list = list(chainerio.list(self.dir_name, recursive=True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto on trailing slash

@kuenishi
Copy link
Member

Possible performance degrade in depth-first recursion could be stack overflow or cost of keeping directory file descriptors. As I state below maximum order of recursion is limited, both max stack size and max number of file descriptors can be extended, although those numbers won't affect system performance that much.

On max depth of depth-first recursion, these urls ( https://eklitzke.org/path-max-is-tricky , https://stackoverflow.com/q/7140575 ) are very interesting. Practically on Linux and Mac we can assume some rather small max length of the of recursion like 255~4k . On Linux I saw

$ find /usr/include/linux -type f| xargs grep PATH_MAX | grep '#define' 
/usr/include/linux/nfs3.h:#define NFS3_MAXPATHLEN               PATH_MAX
/usr/include/linux/un.h:#define UNIX_PATH_MAX   108
/usr/include/linux/btrfs.h:#define BTRFS_INO_LOOKUP_PATH_MAX 4080
/usr/include/linux/btrfs.h:#define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080 - BTRFS_VOL_NAME_MAX - 1)
/usr/include/linux/limits.h:#define PATH_MAX        4096        /* # chars in a path name including nul */
/usr/include/linux/nfs4.h:#define NFS4_MAXPATHLEN               PATH_MAX
/usr/include/linux/netfilter/xt_bpf.h:#define XT_BPF_PATH_MAX           (XT_BPF_MAX_NUM_INSTR * sizeof(struct sock_filter))
/usr/include/linux/netfilter/xt_cgroup.h:#define XT_CGROUP_PATH_MAX     512

@pfn-ci-bot
Copy link

  [RESOURCE_EXHAUSTED] up to 5 commands can be accepted
  2019-09-19 10:14:47.940864 github_issue_comment.go:84] up to 5 commands can be accepted
  
  Stack trace:
    github.com/pfnet/imosci/util/frontend/handler/apihandler.(*githubWebhookIssueCommentFlow).Do (github_issue_comment.go:84)
    github.com/pfnet/imosci/util/frontend/handler/apihandler.githubIssueCommentHandler (github_issue_comment.go:45)
    runtime.call64 (asm_amd64.s:523)
    reflect.Value.call (value.go:447)
    reflect.Value.Call (value.go:308)
    github.com/pfnet/imosci/util/frontend/core.RegisterAPIHandlerInternal.func1 (handler.go:417)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func1 (handler.go:173)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2.1 (handler.go:275)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2 (handler.go:280)
    net/http.HandlerFunc.ServeHTTP (server.go:1964)
    net/http.(*ServeMux).ServeHTTP (server.go:2361)
    github.com/pfnet/imosci/util/api.callInternal.func2 (call.go:196)
    github.com/pfnet/imosci/util/api.callInternal (call.go:204)
    github.com/pfnet/imosci/util/api.Call (call.go:120)
    github.com/pfnet/imosci/util/api.GithubIssueComment (call.go:492)
    github.com/pfnet/imosci/util/frontend/handler/xternalhandler.githubWebhookHandler (github_webhook.go:118)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func1 (handler.go:173)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2.1 (handler.go:275)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2 (handler.go:280)
    net/http.HandlerFunc.ServeHTTP (server.go:1964)
    net/http.(*ServeMux).ServeHTTP (server.go:2361)
    google.golang.org/appengine/internal.executeRequestSafely (api.go:162)
    google.golang.org/appengine/internal.handleHTTP (api.go:121)
    net/http.HandlerFunc.ServeHTTP (server.go:1964)
    net/http.serverHandler.ServeHTTP (server.go:2741)
    net/http.(*conn).serve (server.go:1847)
    runtime.goexit (asm_amd64.s:1333)

@kuenishi kuenishi merged commit 353ff24 into master Sep 20, 2019
@kuenishi kuenishi deleted the add_posix_recursive branch September 20, 2019 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:feature Implementation that introduces new interfaces.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants