-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
s3,gs: don't append "/" to prefixed path when listing objects #4149
Conversation
dvc/remote/gs.py
Outdated
def walk_files(self, path_info, **kwargs): | ||
for fname in self._list_paths(path_info / "", **kwargs): | ||
for fname in self._list_paths(path_info, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For S3-like remotes that aren't actually filesystems with directory trees, the presence or absence of the trailing slash here makes a difference, since sometimes we want to walk a string prefix (without slash) and sometimes we want to walk a "directory" (with slash). So I think it's better for us to rely on the caller to explicitly append the trailing slash as needed when using remote tree walk_files
.
This change doesn't affect anything besides unit test usage right now, since we only walk DVC remote/cache trees, but it will potentially impact use cases like external workspaces: for a tree containing
foo.txt
foo/bar.txt
s3_tree.walk_files("foo")
would return both foo.txt
and foo/bar.txt
instead of only the contents of foo/
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla I'm worried about that inconsistency in behaviour. If we think about it, we could totally just check if self.isdir(path_info)
here and if it is a dir - append "/", otherwise - raise or even return empty list like os.walk() does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are on our way to unifying tree API and this inconsistency will be very problematic, as the user shouldn't be aware that he needs to add /
himself just because of specific tree implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even simple assert self.isdir(path_info)
here would do the trick. Remote size estimation is remote-specific, right? If so, that is where we should take care about the specific remote's implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop I'm fine with checking isdir
here and appending the slash for directories, but for remote trees I think it still makes sense to allow non-directories (rather than raise or return empty) and support walking prefixes for remotes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the size estimation logic is not remote specific, remote trees can specify how many characters in the hash prefix we should use when estimating, but the actual estimation is all done in the base tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise we will go back to the old behavior where we had a list_cache_paths
method that was just duplicated walk_files
but with support for a prefix rather than a directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla I feel like prefix
-based behavior is not really intuitive (at least for me) to be the default behavior for walk_files
. Imagine yourself using tree somewhere and trying to list files in a data
dir, and then it suddenly returns you also all the files that start with data
suffix in addition to normal dir-like results.
How about we introduce an explicit flag for walk_files
that would tell it to allow prefix-like behavior? E.g. walk_files(path_info, prefix=True)
for s3(and other clouds that support prefix-like behaviour) would treat path_info
as prefix, but other remotes that don't support prefix behaviour would error out if path_info
is not a directory?
* if prefix is True, path will be walked as a prefix * if prefix is False, path will be treated as a directory (trailing slash will be appended for S3-like remotes)
@@ -187,7 +187,9 @@ def _list_paths(self, path_info, max_items=None): | |||
) | |||
|
|||
def walk_files(self, path_info, **kwargs): | |||
for fname in self._list_paths(path_info / "", **kwargs): | |||
if not kwargs.pop("prefix", False): | |||
path_info = path_info / "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop after this change we'll treat paths as directories and append the trailing slash for S3-like remote walk_files
unless prefix
kwarg is enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @pmrowla ! π
Will fix #4141.
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
β I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. π