Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add recursive support to list in hdfs #49

Merged
merged 7 commits into from
Sep 18, 2019
Merged

Conversation

belldandyxtq
Copy link
Member

This PR adds recursive support to list in HDFS to align with the API.
This PR solves the 2nd todo in #46

@belldandyxtq belldandyxtq added the cat:feature Implementation that introduces new interfaces. label Aug 21, 2019
Copy link
Member

@kuenishi kuenishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works, but there's an explicit room to improve the performance IMHO.

chainerio/filesystems/hdfs.py Outdated Show resolved Hide resolved

def _recursive_list(self, prefix, path):
for _file in self.connection.ls(path):
yield _file[_file.find(prefix):]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this code for?

>>> s = 'list'
>>> s[s.find('li'):]
'list'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained in ad279e8, this is to convert the full URI to relative path from prefix.
"hdfs://nameservice/prefix_dir/testfile" => "prefix_dir/testfile"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not good to depend on string matching because it may include potential bugs (or even a source of vulnerability) in a case like this, but it'd better obtain the "hdfs://nameservice" from somewhere else. For example, given that there's a directory hdfs://nameservice/user/kota/nameservice and what happens withchainerio.list("nameservice", recursive=True) ?

yield os.path.basename(_dir)
target_dir = self.connection.info(path_or_prefix)
if "directory" != target_dir['kind']:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you align the behaviour with POSIX filesystem os.scandir() ?


target_path = target_dir['path'] + "/"
if not path_or_prefix.endswith("/"):
path_or_prefix = path_or_prefix + "/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripping trailing / at the beginning of this method is a bit smarter than appending itself IMO, as it would be clearer.

if not path_or_prefix.endswith("/"):
path_or_prefix = path_or_prefix + "/"

prefix_index = len(target_path[:-len(path_or_prefix)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the name prefix and index is confusing here, as it is rather "length" of a prefix, prefix of full path like hdfs://... and that is different from path_or_prefix. Should be named differently like print_index, prefix_end_index, or full_path_prefix_len .

@kuenishi kuenishi merged commit 234ed34 into master Sep 18, 2019
@kuenishi kuenishi deleted the recursive_hdfs_list branch September 18, 2019 00:41
@belldandyxtq belldandyxtq mentioned this pull request Sep 18, 2019
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:feature Implementation that introduces new interfaces.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants