Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

external outputs: broken if pipeline output doesn't exist during stage initialization #8757

Closed
dberenbaum opened this issue Jan 4, 2023 · 5 comments · Fixed by iterative/dvc-objects#176 or #9574
Assignees
Labels
p1-important Important, aka current backlog of things to do regression Ohh, we broke something :-(

Comments

@dberenbaum
Copy link
Contributor

Bug Report

Description

S3 external outputs are broken for pipelines since 7211bd0 because of a bug in s3fs (and probably in other filesystems). They will only break if running a stage for which an output doesn't already exist. When initializing the stage, DVC will try to remove the nonexistent output and raise a FileNotFound error.

Reproduce

dvc repro will break if there is an external output and that output does not exist yet.

In a new repo, using some <s3_path> that doesn't exist yet, do this:

$ echo 'foo' > foo
$ dvc stage add --external -n foo -d foo -O <s3_path> 'aws s3 cp params.yaml <s3_path>'
$ dvc repro -v

Expected

dvc repro shouldn't fail while removing outputs. In this case, it fails because of what seems like a bug or at least inconsistent behavior in fsspec. Like mentioned in #5961 (comment), output.remove for s3fs and other async filesystems calls _expand_path. When the path doesn't exist and recursive=True, _expand_path raises FileNotFoundError. When recursive=False, it returns the path. It also returns the path for the LocalFileSystem regardless of whether recursive=True, so not sure if it was intended to raise an error only for this specific scenario.

@dberenbaum dberenbaum added the regression Ohh, we broke something :-( label Jan 4, 2023
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p1-important Important, aka current backlog of things to do labels Jun 9, 2023
@dberenbaum dberenbaum reopened this Jun 9, 2023
@dberenbaum
Copy link
Contributor Author

Not sure if this should be a new issue, but I'm seeing the same problem now because we have recursive=True in

self.fs.remove(self.fs_path, recursive=True)
.

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Jun 9, 2023
@dberenbaum
Copy link
Contributor Author

@daavoo Do you recall from #8825 if we need recursive=True?

@daavoo
Copy link
Contributor

daavoo commented Jun 9, 2023

@daavoo Do you recall from #8825 if we need recursive=True?

I didn't test its removal but it was supposed to be a legacy option no longer needed iterative/dvc-objects#176 (comment)

@daavoo
Copy link
Contributor

daavoo commented Jun 9, 2023

So @dberenbaum to see if I got it right, is this because fsspec raises FileNotFound error?
I don't know if this is a bug in fsspec or expected behavior

@dberenbaum
Copy link
Contributor Author

Correct. It also wasn't clear to me whether it's intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Important, aka current backlog of things to do regression Ohh, we broke something :-(
Projects
No open projects
Archived in project
2 participants