Add recursive support to list in hdfs #49

belldandyxtq · 2019-08-21T05:56:53Z

This PR adds recursive support to list in HDFS to align with the API.
This PR solves the 2nd todo in #46

kuenishi

It works, but there is clear room to improve the performance, IMHO.

kuenishi · 2019-09-12T02:21:13Z

chainerio/filesystems/hdfs.py

+
+    def _recursive_list(self, prefix, path):
+        for _file in self.connection.ls(path):
+            yield _file[_file.find(prefix):]


What is this code for?

>>> s = 'list'
>>> s[s.find('li'):]
'list'

As explained in ad279e8, this converts the full URI into a path relative to the prefix:
"hdfs://nameservice/prefix_dir/testfile" => "prefix_dir/testfile"

Depending on string matching is not good here because it can introduce bugs (or even become a source of vulnerability) in a case like this; it would be better to obtain the "hdfs://nameservice" part from somewhere else. For example, suppose there is a directory hdfs://nameservice/user/kota/nameservice. What happens with chainerio.list("nameservice", recursive=True)?
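To make the concern concrete, here is a minimal sketch of the failure mode. The helper names `relative_by_find` and `relative_by_root` are hypothetical, and the root string assumes the listed directory sits under the home directory /user/kota; neither is taken from the actual patch.

```python
def relative_by_find(full_uri, prefix):
    """Mirror the reviewed expression: slice from the first match of prefix."""
    return full_uri[full_uri.find(prefix):]


def relative_by_root(full_uri, root):
    """Hypothetical alternative: strip a root that is known independently."""
    if not full_uri.startswith(root):
        raise ValueError("{!r} is not under {!r}".format(full_uri, root))
    return full_uri[len(root):]


# Assumed layout: the listed directory lives under /user/kota.
uri = "hdfs://nameservice/user/kota/nameservice/testfile"
root = "hdfs://nameservice/user/kota/"

# find() matches the authority part of the URI first, not the directory.
print(relative_by_find(uri, "nameservice"))
# -> nameservice/user/kota/nameservice/testfile

# Stripping a known root gives the path relative to the listed directory.
print(relative_by_root(uri, root))
# -> nameservice/testfile
```

Note also that `str.find` returns -1 when the prefix is absent, so the slice silently returns the last character instead of failing.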

kuenishi · 2019-09-17T01:21:20Z

chainerio/filesystems/hdfs.py

-            yield os.path.basename(_dir)
+        target_dir = self.connection.info(path_or_prefix)
+        if "directory" != target_dir['kind']:
+            return None


Why not align the behaviour with os.scandir() on a POSIX filesystem?
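For reference, os.scandir() raises NotADirectoryError when given a regular file rather than returning nothing. A rough sketch of mirroring that for an info dict follows; `ensure_directory` is a hypothetical helper, and the dict shape (keys 'kind' and 'path') is assumed from the diff above.

```python
import os
import tempfile


def ensure_directory(info):
    """Raise like os.scandir() when info does not describe a directory.

    `info` is assumed to be a dict with 'kind' and 'path' keys.
    """
    if info["kind"] != "directory":
        raise NotADirectoryError(info["path"])


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "testfile")
    with open(file_path, "w") as f:
        f.write("data")

    try:
        os.scandir(file_path)  # a regular file, not a directory
    except NotADirectoryError:
        print("os.scandir raised NotADirectoryError")

try:
    ensure_directory({"kind": "file", "path": "/user/kota/testfile"})
except NotADirectoryError:
    print("ensure_directory raised NotADirectoryError")
```

One thing to keep in mind: if the check lives inside a generator function, the exception only surfaces when the caller starts iterating, not when the function is called.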

kuenishi · 2019-09-17T01:26:02Z

chainerio/filesystems/hdfs.py

+
+        target_path = target_dir['path'] + "/"
+        if not path_or_prefix.endswith("/"):
+            path_or_prefix = path_or_prefix + "/"


Stripping the trailing / at the beginning of this method would be a bit smarter than appending one here IMO, as it would be clearer.
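A small sketch of what stripping up front could look like. `normalize_dir_path` is a hypothetical name, and keeping a bare "/" intact is an assumption about the desired behaviour for the root path.

```python
def normalize_dir_path(path):
    """Drop trailing slashes; keep a bare root "/" intact."""
    stripped = path.rstrip("/")
    # rstrip turns "/" into "", so fall back to the first character.
    return stripped if stripped else path[:1]


print(normalize_dir_path("prefix_dir/"))  # -> prefix_dir
print(normalize_dir_path("prefix_dir"))   # -> prefix_dir
print(normalize_dir_path("/"))            # -> /
```

With the path normalized once, later code can append "/" unconditionally where a separator is needed.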

kuenishi · 2019-09-17T01:30:39Z

chainerio/filesystems/hdfs.py

+        if not path_or_prefix.endswith("/"):
+            path_or_prefix = path_or_prefix + "/"
+
+        prefix_index = len(target_path[:-len(path_or_prefix)])


Both "prefix" and "index" in this name are confusing: the value is really the length of a prefix, namely the prefix of the full path (like hdfs://...), which is different from path_or_prefix. It should be named differently, e.g. print_index, prefix_end_index, or full_path_prefix_len.
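To illustrate what the value represents, here is a sketch using one of the suggested names. The example paths are assumptions (including that the 'path' field of the info result carries the full hdfs:// URI), and the function is not the patch's actual code.

```python
def full_path_prefix_len(target_path, path_or_prefix):
    """Length of the part of target_path that precedes path_or_prefix."""
    if not target_path.endswith(path_or_prefix):
        raise ValueError("target_path does not end with path_or_prefix")
    return len(target_path) - len(path_or_prefix)


# Assumed example values; both already end with "/".
target_path = "hdfs://nameservice/user/kota/prefix_dir/"
path_or_prefix = "prefix_dir/"
cut = full_path_prefix_len(target_path, path_or_prefix)

entry = "hdfs://nameservice/user/kota/prefix_dir/testfile"
print(entry[:cut])  # -> hdfs://nameservice/user/kota/
print(entry[cut:])  # -> prefix_dir/testfile
```

Subtracting lengths also sidesteps the `[:-0]` pitfall: with an empty path_or_prefix, `target_path[:-len(path_or_prefix)]` evaluates to an empty string rather than the whole path.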

belldandyxtq added the cat:feature Implementation that introduces new interfaces. label Aug 21, 2019

belldandyxtq force-pushed the recursive_hdfs_list branch from 4c2c83d to 06ee401 Compare August 21, 2019 05:58

kuenishi requested changes Sep 12, 2019

kuenishi requested changes Sep 17, 2019

belldandyxtq added 7 commits September 17, 2019 18:17

Add recursive support to hdfs

59a9426

This commit adds recursive support to HDFS.

add tests for recursive list on hdfs

d50ae26

improve performance with detail

3181732

add comments

79956eb

get the nameservice from info

df0499b

strip trailing /

4194400

raise NotADirectoryError when the path is not a directory

6da7b9e

belldandyxtq force-pushed the recursive_hdfs_list branch from 77ffc64 to 6da7b9e Compare September 17, 2019 09:24

kuenishi approved these changes Sep 18, 2019

kuenishi merged commit 234ed34 into master Sep 18, 2019

kuenishi deleted the recursive_hdfs_list branch September 18, 2019 00:41

belldandyxtq mentioned this pull request Sep 18, 2019

Add recursive to list #46

Closed

3 tasks


belldandyxtq commented Aug 21, 2019

kuenishi left a comment

kuenishi Sep 12, 2019

belldandyxtq Sep 12, 2019

kuenishi Sep 13, 2019

kuenishi Sep 17, 2019

kuenishi Sep 17, 2019

kuenishi Sep 17, 2019
