Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

task: make workdir.output more intuitive #414

Closed
Tracked by #391
casperdcl opened this issue Mar 1, 2022 · 25 comments · Fixed by #435
Closed
Tracked by #391

task: make workdir.output more intuitive #414

casperdcl opened this issue Mar 1, 2022 · 25 comments · Fixed by #435
Assignees
Labels
discussion Waiting for team decision p1-important High priority resource-task iterative_task TF resource

Comments

@casperdcl
Copy link
Contributor

casperdcl commented Mar 1, 2022

right now task destroy does cp remote://input local://output but I think we need cp remote://output local://output

  • more intuitive
  • consistent with task apply (cp local://input remote://input)

detailed before-and-after spec

@casperdcl casperdcl added p1-important High priority discussion Waiting for team decision resource-task iterative_task TF resource labels Mar 1, 2022
@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 7, 2022

(Seeing that nobody jumps in)

I like this idea and believe that it's a good compromise between the two opposed approaches we've been discussing all this time:

  1. Don't impose any specific project structure and overwrite the whole working directory.
  2. Require users to specify a separate path for outputs.

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 7, 2022

Examples of increasing complexity following this proposal:

resource "iterative_task" "example" {
  ···
  storage {
    workspace = "." # input
    artifacts = "./output" # output
  }
}
resource "iterative_task" "example" {
  ···
  storage {
    workspace = "."
    artifacts = ["./output", "./models/**.hdf5"]
  }
}
resource "iterative_task" "example" {
  ···
  storage {
    workspace   = "."
    artifacts   = ["./output", "./models/**.hdf5"]
    output = "./experiments/1"
  }
}

Don't get distracted by naming, it's subject to change.

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 7, 2022

workspace — what to upload
artifacts — what to download (also known as output in the original post)
output (optional) — where to download artifacts; defaults to workspace; relative to workspace

Uploads of workspace exclude local artifacts and output when it's contained on workspace

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 7, 2022

All of this functionality can be trivially implemented with the rclone Go module. If we don't have to support parallel downloads, the first example would work like a charm.

@DavidGOrtega DavidGOrtega removed their assignment Mar 7, 2022
@DavidGOrtega
Copy link
Contributor

DavidGOrtega commented Mar 7, 2022

All of this functionality can be trivially implemented with the rclone Go module.

That's exciting. Are we talking about also of running within a go executor? 😃

@0x2b3bfa0
Copy link
Member

No, we are talking about using rclone as we already do, but taking advantage of its include/exclude functionality.

@0x2b3bfa0
Copy link
Member

By the way, @DavidGOrtega, you were assigned to this issue because of the discussion. What do you think about these proposals? Not about naming, but functionality.

@casperdcl
Copy link
Contributor Author

strongly against art[ie]facts. Far too complicated and a major anti-pattern IMO (user code writes data to path X, but iff run via TPI we'll move it from X to Y). That would be for discussion in a separate issue anyway.

I'd rather just change the current behaviour of workdir.output for now.

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 7, 2022

Sounds fair. @casperdcl, [how] would we solve #306?

@0x2b3bfa0

This comment was marked as off-topic.

@casperdcl
Copy link
Contributor Author

I... don't know what #306 is. At some point everyone agreed it was a duplicate of #307 (which is fully supported by #414 here).

However @DavidGOrtega changed his mind with a comment I couldn't follow.

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 8, 2022

#306/#307 is (or should be) an error that occurs when several rclone copy processes try to write the same file at the same time. For example, when users try to download results from several tasks at once, saving them to the same directory,

@0x2b3bfa0
Copy link
Member

How does #414 (self) address this case? Can you share an example HCL block and the desired behavior?

@casperdcl
Copy link
Contributor Author

casperdcl commented Mar 8, 2022

before #414

resource "iterative_task" "example1" {
  script = "./run.py --epochs=1 --output=."
  storage {
    input  = "."
    output = "one"
  }
}

resource "iterative_task" "example2" {
  script = "./run.py --epochs=2 --output=."
  storage {
    input  = "."
    output = "two"
  }
}

results in:

  • remote1://model.pkl
  • remote2://model.pkl
  • local://one/model.pkl
  • local://two/model.pkl

while running example1.script && example2.script locally would give a conflict:

  • local://model.pkl (--epochs=1)
  • local://model.pkl (--epochs=2)

after #414

resource "iterative_task" "example1" {
  script = "./run.py --epochs=1 --output=./one"
  storage {
    input  = "."
    output = "one"
  }
}

resource "iterative_task" "example2" {
  script = "./run.py --epochs=2 --output=./two"
  storage {
    input  = "."
    output = "two"
  }
}

results in:

  • remote1://one/model.pkl
  • remote2://two/model.pkl
  • local://one/model.pkl
  • local://two/model.pkl

and running locally would give a consistent:

  • local://one/model.pkl
  • local://two/model.pkl

@casperdcl
Copy link
Contributor Author

Users would be responsible of storing results under {workdir}/{output} for every task.

Yes, exactly as they would intuitively do when running jobs in parallel locally (before clouds existed).

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 8, 2022

Crystal clear. Thank you very much! ☁️

@casperdcl casperdcl mentioned this issue Mar 8, 2022
8 tasks
casperdcl added a commit that referenced this issue Mar 8, 2022
- depends on #414
@DavidGOrtega
Copy link
Contributor

@0x2b3bfa0 any stopper that you might discuss?

@0x2b3bfa0
Copy link
Member

Fortunately, I haven't found any major stoppers. Thanks anyway! 🙏🏼

@0x2b3bfa0
Copy link
Member

Still I like the idea of having several output globs, but it's more of a future–facing API choice than a real need for basic use cases.

resource "iterative_task" "example1" {
  script = "./run.py --metrics=metrics.json --output=./output"
  storage {
    input  = "."
    outputs = ["output", "metrics.json"]
  }
}

Also, a bikesheddish remark: should we rename input to workspace or workdir?

@casperdcl
Copy link
Contributor Author

casperdcl commented Mar 10, 2022

I think workdir/workspace is a good idea for future compatibility with Git/DVC-workspace-awareness.

In terms of outputs being a list, does the schema support entries that are Union[str, List[str]], or must we chose between list & str?

@0x2b3bfa0
Copy link
Member

No, it would have to be a list, even if users only want to specify a single directory.

resource "iterative_task" "example1" {
  script = "./run.py --metrics=metrics.json --output=./output"
  storage {
    workspace  = "."
    outputs = ["output"]
  }
}

@casperdcl
Copy link
Contributor Author

In that case I'd prefer UNIX & Windows-style path separators (:), i.e. "output:metrics.json:another_dir" in some future follow-up PR.

@0x2b3bfa0

This comment was marked as resolved.

@0x2b3bfa0
Copy link
Member

What if we ship output: string with this pull request and add outputs: list in the future? Functionally, output would only be an alias for outputs = [output]

storage {
  workspace  = "."
  output = "output"
}
storage {
  workspace  = "."
  outputs = ["one", "two", "three/**.json"]
}

@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 10, 2022

This kind of behavior has several precedents in official providers. For example:

  • scope - (Required) The scope at which the Role Definition applies to. It is recommended to use the first entry of the assignable_scopes.
  • assignable_scopes - (Optional) One or more assignable scopes for this Role Definition [...] The value for scope is automatically included in this list if no other values supplied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Waiting for team decision p1-important High priority resource-task iterative_task TF resource
Projects
None yet
3 participants