-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add recursive support to list in hdfs #49
Conversation
4c2c83d
to
06ee401
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works, but there's an explicit room to improve the performance IMHO.
chainerio/filesystems/hdfs.py
Outdated
|
||
def _recursive_list(self, prefix, path): | ||
for _file in self.connection.ls(path): | ||
yield _file[_file.find(prefix):] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this code for?
>>> s = 'list'
>>> s[s.find('li'):]
'list'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As explained in ad279e8, this is to convert the full URI to relative path from prefix.
"hdfs://nameservice/prefix_dir/testfile" => "prefix_dir/testfile"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not good to depend on string matching because it may include potential bugs (or even a source of vulnerability) in a case like this, but it'd better obtain the "hdfs://nameservice" from somewhere else. For example, given that there's a directory hdfs://nameservice/user/kota/nameservice
and what happens withchainerio.list("nameservice", recursive=True)
?
chainerio/filesystems/hdfs.py
Outdated
yield os.path.basename(_dir) | ||
target_dir = self.connection.info(path_or_prefix) | ||
if "directory" != target_dir['kind']: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you align the behaviour with POSIX filesystem os.scandir()
?
chainerio/filesystems/hdfs.py
Outdated
|
||
target_path = target_dir['path'] + "/" | ||
if not path_or_prefix.endswith("/"): | ||
path_or_prefix = path_or_prefix + "/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stripping trailing /
at the beginning of this method is a bit smarter than appending itself IMO, as it would be clearer.
chainerio/filesystems/hdfs.py
Outdated
if not path_or_prefix.endswith("/"): | ||
path_or_prefix = path_or_prefix + "/" | ||
|
||
prefix_index = len(target_path[:-len(path_or_prefix)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the name prefix and index is confusing here, as it is rather "length" of a prefix, prefix of full path like hdfs://...
and that is different from path_or_prefix
. Should be named differently like print_index
, prefix_end_index
, or full_path_prefix_len
.
This commit adds recursive support to HDFS.
77ffc64
to
6da7b9e
Compare
This PR adds recursive support to list in HDFS to align with the API.
This PR solves the 2nd todo in #46