Add recursive option to POSIX filesystem handler #57

belldandyxtq · 2019-09-17T08:39:02Z

This PR adds a recursive mode to list in the POSIX handler. This resolves the third TODO in #46.

This commit adds recursive mode to list in posix

kuenishi · 2019-09-18T00:35:12Z

chainerio/filesystems/posix.py

+            # use len instead of len + 1 as root in os.walk does not end
+            # with "/"
+            prefix_end_index = len(path_or_prefix)
+            for root, dirs, files in os.walk(path_or_prefix):


Listing a single directory can be very slow (e.g. on an overloaded NFS server), so I'd prefer calling os.scandir() recursively here.

I think os.walk() is already implemented by calling os.scandir() recursively.
https://github.com/python/cpython/blob/b9877cd2cc47b6f3512c171814c4f630286279b9/Lib/os.py#L350

I checked the implementation, but it is breadth-first rather than depth-first, in the sense that it reads every entry of a directory before yielding any of them. That feels very slow when a single directory holds a huge number of entries, while the behaviour we want is depth-first.

I am not sure I understand why depth-first is faster. My idea of a depth-first implementation still needs the list of subdirectories in the current directory to descend into, and getting that list requires calling os.scandir() on the current directory, which is the same work as breadth-first.

This would be a depth-first listing (which is virtually what list in the hdfs filesystem already does):

    import os

    def rec_list(path):
        # Yield each entry as soon as it is read, then descend into it.
        for c in os.scandir(path):
            path1 = os.path.join(path, c.name)
            yield path1
            if c.is_dir():
                yield from rec_list(path1)

kuenishi · 2019-09-18T07:56:17Z

The reason I prefer depth-first listing is the case of a directory that contains, say, a million children on a file system overloaded with metadata requests, where every metadata lookup is slow. Breadth-first listing has to wait until all directory entries have been fetched from disk across that many inodes before it can list each subdirectory. Our other handlers, hdfs and zip, already yield files depth-first, and so does ls(1). I believe depth-first output is more natural and intuitive.
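To make the two approaches in this thread concrete, here is a minimal sketch that puts them side by side. The names walk_list and rec_list are illustrative only and are not the code in this PR. os.walk() builds the complete dirs and files lists for a directory before yielding it, whereas the os.scandir() generator can hand back an entry as soon as it is read.

```python
import os

def walk_list(top):
    """Flatten os.walk(); each directory is fully read before it is yielded."""
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            yield os.path.join(root, name)

def rec_list(top):
    """Yield each entry as soon as os.scandir() returns it, then descend."""
    with os.scandir(top) as it:
        for entry in it:
            yield entry.path
            # Do not follow symlinks, matching the os.walk() default.
            if entry.is_dir(follow_symlinks=False):
                yield from rec_list(entry.path)
```

Both generators should produce the same set of paths; only the order and the point at which the first entry becomes available differ.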

belldandyxtq · 2019-09-18T09:19:04Z

I see your point, but I wonder whether depth-first would perform worse when the hierarchy is deep.

kuenishi · 2019-09-19T00:50:17Z

chainerio/filesystems/posix.py

+            path_or_prefix.rstrip("/")
+            # use len instead of len + 1 as root in os.walk does not end
+            # with "/"
+            prefix_end_index = len(path_or_prefix)


Needs +1 here.

My mistake, sorry. The above line should be path_or_prefix = path_or_prefix.rstrip("/")
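A small sketch of the two fixes discussed in this review, assuming the listed paths have the form "<prefix>/<name>". The helper relative_names is hypothetical and exists only to isolate the string handling; it is not a function in chainerio.

```python
def relative_names(path_or_prefix, found_paths):
    # str.rstrip() returns a new string, so the result must be assigned;
    # calling it without assignment leaves path_or_prefix unchanged.
    path_or_prefix = path_or_prefix.rstrip("/")
    # +1 skips the "/" that separates the prefix from the entry name.
    prefix_end_index = len(path_or_prefix) + 1
    return [p[prefix_end_index:] for p in found_paths]

# The result is the same with or without a trailing slash on the prefix.
print(relative_names("testlsdir/", ["testlsdir/nested_dir1"]))  # ['nested_dir1']
print(relative_names("testlsdir", ["testlsdir/nested_dir1"]))   # ['nested_dir1']
```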

kuenishi · 2019-09-19T00:55:25Z

tests/filesystem_tests/test_posix_handler.py

+        # | - nested_dir1
+        # |   | - nested_dir3
+        # | _ nested_dir2
+        test_dir_name = "testlsdir/"


A directory name without a trailing slash would have found the bug above...

kuenishi · 2019-09-19T00:56:36Z

tests/test_context.py

+        self.assertIn(self.tmpfile_name, file_list)
+        self.assertNotIn(nested_dir_path2_relative, file_list)
+
+        file_list = list(chainerio.list(self.dir_name, recursive=True))


ditto on trailing slash

kuenishi · 2019-09-19T01:14:47Z

The possible performance degradation in depth-first recursion would come from stack overflow or from the cost of keeping directory file descriptors open. As I state below, the maximum recursion depth is bounded; both the maximum stack size and the maximum number of file descriptors can be raised, although those limits won't affect system performance that much.

On the maximum depth of depth-first recursion, these URLs ( https://eklitzke.org/path-max-is-tricky , https://stackoverflow.com/q/7140575 ) are very interesting. Practically, on Linux and Mac we can assume a rather small maximum recursion depth, somewhere around 255 to 4k. On Linux I saw:

$ find /usr/include/linux -type f| xargs grep PATH_MAX | grep '#define' 
/usr/include/linux/nfs3.h:#define NFS3_MAXPATHLEN               PATH_MAX
/usr/include/linux/un.h:#define UNIX_PATH_MAX   108
/usr/include/linux/btrfs.h:#define BTRFS_INO_LOOKUP_PATH_MAX 4080
/usr/include/linux/btrfs.h:#define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080 - BTRFS_VOL_NAME_MAX - 1)
/usr/include/linux/limits.h:#define PATH_MAX        4096        /* # chars in a path name including nul */
/usr/include/linux/nfs4.h:#define NFS4_MAXPATHLEN               PATH_MAX
/usr/include/linux/netfilter/xt_bpf.h:#define XT_BPF_PATH_MAX           (XT_BPF_MAX_NUM_INSTR * sizeof(struct sock_filter))
/usr/include/linux/netfilter/xt_cgroup.h:#define XT_CGROUP_PATH_MAX     512
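If the recursion depth itself were ever a concern (CPython's default recursion limit is 1000 frames, adjustable with sys.setrecursionlimit), the same depth-first order can be produced with an explicit stack. The following is only a sketch under that assumption, not the implementation merged in this PR, and iter_list is a hypothetical name. Note that it still keeps one open os.scandir() iterator per directory level, so the file descriptor cost mentioned above is unchanged.

```python
import os

def iter_list(path):
    """Depth-first listing that uses an explicit stack instead of recursion."""
    stack = [os.scandir(path)]
    try:
        while stack:
            entry = next(stack[-1], None)
            if entry is None:
                # Current directory is exhausted: close it and go back up.
                stack.pop().close()
                continue
            yield entry.path
            if entry.is_dir(follow_symlinks=False):
                # Descend immediately; the parent iterator resumes later.
                stack.append(os.scandir(entry.path))
    finally:
        # Release any directory handles left open if the caller stops early.
        for it in stack:
            it.close()
```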

pfn-ci-bot · 2019-09-19T01:14:48Z

  [RESOURCE_EXHAUSTED] up to 5 commands can be accepted
  2019-09-19 10:14:47.940864 github_issue_comment.go:84] up to 5 commands can be accepted
  
  Stack trace:
    github.com/pfnet/imosci/util/frontend/handler/apihandler.(*githubWebhookIssueCommentFlow).Do (github_issue_comment.go:84)
    github.com/pfnet/imosci/util/frontend/handler/apihandler.githubIssueCommentHandler (github_issue_comment.go:45)
    runtime.call64 (asm_amd64.s:523)
    reflect.Value.call (value.go:447)
    reflect.Value.Call (value.go:308)
    github.com/pfnet/imosci/util/frontend/core.RegisterAPIHandlerInternal.func1 (handler.go:417)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func1 (handler.go:173)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2.1 (handler.go:275)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2 (handler.go:280)
    net/http.HandlerFunc.ServeHTTP (server.go:1964)
    net/http.(*ServeMux).ServeHTTP (server.go:2361)
    github.com/pfnet/imosci/util/api.callInternal.func2 (call.go:196)
    github.com/pfnet/imosci/util/api.callInternal (call.go:204)
    github.com/pfnet/imosci/util/api.Call (call.go:120)
    github.com/pfnet/imosci/util/api.GithubIssueComment (call.go:492)
    github.com/pfnet/imosci/util/frontend/handler/xternalhandler.githubWebhookHandler (github_webhook.go:118)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func1 (handler.go:173)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2.1 (handler.go:275)
    github.com/pfnet/imosci/util/frontend/core.RegisterHandler.func2 (handler.go:280)
    net/http.HandlerFunc.ServeHTTP (server.go:1964)
    net/http.(*ServeMux).ServeHTTP (server.go:2361)
    google.golang.org/appengine/internal.executeRequestSafely (api.go:162)
    google.golang.org/appengine/internal.handleHTTP (api.go:121)
    net/http.HandlerFunc.ServeHTTP (server.go:1964)
    net/http.serverHandler.ServeHTTP (server.go:2741)
    net/http.(*conn).serve (server.go:1847)
    runtime.goexit (asm_amd64.s:1333)

belldandyxtq added the cat:enhancement (Implementation that does not break interfaces.) and cat:feature (Implementation that introduces new interfaces.) labels and removed the cat:enhancement label Sep 17, 2019

belldandyxtq added 2 commits September 17, 2019 18:05

Add recursive to list in posix

4cd297f

This commit adds recursive mode to list in posix

add tests for recursive list

baa77c1

belldandyxtq force-pushed the add_posix_recursive branch from b05a0e4 to baa77c1 Compare September 17, 2019 09:05

belldandyxtq added 2 commits September 17, 2019 19:05

add recursive to list in context

3046cd2

add recursive test to list in context

fd8e12f

belldandyxtq force-pushed the add_posix_recursive branch 2 times, most recently from 4561607 to fd8e12f Compare September 17, 2019 12:23

kuenishi requested changes Sep 18, 2019

kuenishi changed the title ~~Add posix recursive~~ Add recursive option to POSIX filesystem handler Sep 18, 2019

belldandyxtq mentioned this pull request Sep 18, 2019

Add recursive to list #46

Closed

3 tasks

belldandyxtq added 2 commits September 18, 2019 19:14

change to use dfs

03f11ca

replace 'with' with 'for'

8f544bf

kuenishi requested changes Sep 19, 2019

belldandyxtq added 2 commits September 19, 2019 20:29

fix bug with trailing slash

ce13d82

add slash directory check

b85732f

kuenishi approved these changes Sep 20, 2019

kuenishi merged commit 353ff24 into master Sep 20, 2019

kuenishi deleted the add_posix_recursive branch September 20, 2019 03:47
