Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3,gs: don't append "/" to prefixed path when listing objects #4149

Merged
merged 3 commits into from
Jul 2, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jul 2, 2020

Will fix #4141.

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@pmrowla pmrowla added the bug Did we break something? label Jul 2, 2020
@pmrowla pmrowla self-assigned this Jul 2, 2020
dvc/remote/gs.py Outdated
Comment on lines 146 to 147
def walk_files(self, path_info, **kwargs):
for fname in self._list_paths(path_info / "", **kwargs):
for fname in self._list_paths(path_info, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For S3-like remotes that aren't actually filesystems with directory trees, the presence or absence of the trailing slash here makes a difference, since sometimes we want to walk a string prefix (without slash) and sometimes we want to walk a "directory" (with slash). So I think it's better for us to rely on the caller to explicitly append the trailing slash as needed when using remote tree walk_files.

This change doesn't affect anything besides unit test usage right now, since we only walk DVC remote/cache trees, but it will potentially impact use cases like external workspaces: for a tree containing

foo.txt
foo/bar.txt

s3_tree.walk_files("foo") would return both foo.txt and foo/bar.txt instead of only the contents of foo/.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla I'm worried about that inconsistency in behaviour. If we think about it, we could totally just check if self.isdir(path_info) here and if it is a dir - append "/", otherwise - raise or even return empty list like os.walk() does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are on our way to unifying tree API and this inconsistency will be very problematic, as the user shouldn't be aware that he needs to add / himself just because of specific tree implementation details.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even simple assert self.isdir(path_info) here would do the trick. Remote size estimation is remote-specific, right? If so, that is where we should take care about the specific remote's implementation details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop I'm fine with checking isdir here and appending the slash for directories, but for remote trees I think it still makes sense to allow non-directories (rather than raise or return empty) and support walking prefixes for remotes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the size estimation logic is not remote specific, remote trees can specify how many characters in the hash prefix we should use when estimating, but the actual estimation is all done in the base tree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise we will go back to the old behavior where we had a list_cache_paths method that was just duplicated walk_files but with support for a prefix rather than a directory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla I feel like prefix-based behavior is not really intuitive (at least for me) to be the default behavior for walk_files. Imagine yourself using tree somewhere and trying to list files in a data dir, and then it suddenly returns you also all the files that start with data suffix in addition to normal dir-like results.

How about we introduce an explicit flag for walk_files that would tell it to allow prefix-like behavior? E.g. walk_files(path_info, prefix=True) for s3(and other clouds that support prefix-like behaviour) would treat path_info as prefix, but other remotes that don't support prefix behaviour would error out if path_info is not a directory?

@pmrowla pmrowla requested review from efiop, pared and skshetry July 2, 2020 06:09
* if prefix is True, path will be walked as a prefix
* if prefix is False, path will be treated as a directory (trailing
  slash will be appended for S3-like remotes)
@@ -187,7 +187,9 @@ def _list_paths(self, path_info, max_items=None):
)

def walk_files(self, path_info, **kwargs):
for fname in self._list_paths(path_info / "", **kwargs):
if not kwargs.pop("prefix", False):
path_info = path_info / ""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop after this change we'll treat paths as directories and append the trailing slash for S3-like remote walk_files unless prefix kwarg is enabled

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @pmrowla ! πŸ™

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something?
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc pull does not seem to work for files whose md5 starts with 00
3 participants