Advice on adding parallelism #1710

gregfriedland · 2019-03-11T16:56:28Z

Hello,
First off, thanks for your work on this project! It seems very close to what I need for a project I'm working on at my company to build an automated, reproducible, versioned pipeline for training ML models to be used in our production workflows.

The only thing missing is parallelism, which we need because our model training takes up to 36h on an instance, and we have many of these training jobs we'd like to do simultaneously. My thought is that the pipeline jobs could be launched remotely (e.g. using kubernetes) and we'd wait for completion before continuing onto the next pipeline step. Thus, no local resources would be used and many pipelines (without interdependencies of course) could be run in parallel from a single node.
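A minimal sketch of this launch-and-wait idea, under assumptions not stated in the thread: `run_remote` is a hypothetical callable that submits one step to a remote backend (for example a Kubernetes job) and blocks until it finishes, and `pipelines` maps a pipeline name to its ordered steps. Independent pipelines run concurrently while the steps inside each pipeline stay sequential.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pipelines(pipelines, run_remote, max_parallel=8):
    """Run independent pipelines concurrently from a single node.

    pipelines:  dict mapping pipeline name -> ordered list of steps (hypothetical shape)
    run_remote: hypothetical callable (name, step) -> result that blocks
                until the remote job for that step has completed
    """
    def run_one(name, steps):
        # Steps of one pipeline run in order; each call waits for completion.
        return [run_remote(name, step) for step in steps]

    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            pool.submit(run_one, name, steps): name
            for name, steps in pipelines.items()
        }
        for future in as_completed(futures):
            # result() re-raises any exception from the remote step.
            outcomes[futures[future]] = future.result()
    return outcomes
```

Threads are enough here because the local node only waits on remote work, so it uses almost no local CPU.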

I saw some discussion of this in other issues, but it doesn't seem like there's a remote branch that tackles this yet, so I wanted to ask A) whether this work has already been started, and B) whether you have guidance on how to put together a PR that you would accept.

Best,
Greg

ghost · 2019-03-11T21:27:07Z

Hello, @gregfriedland !

A workflow that is currently supported for remote computing is to dvc run with --no-exec (to only create a stage file without actually running any command), then ssh into your remote beefy instance with 128G of RAM, and run dvc repro to run the command. It is quite cumbersome, tho.
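A rough sketch of that workflow as command builders, in case it helps. The helper names, the stage file name, the host, and the repository path are all hypothetical, and the `git pull` on the remote side assumes the stage file was committed and pushed first. Only `dvc run --no-exec` and `dvc repro` come from the description above; `-f`, `-d`, and `-o` are the usual `dvc run` options for the stage file, dependencies, and outputs.

```python
import shlex


def no_exec_command(stage_file, deps, outs, command):
    """Local step: create the stage file without running the command."""
    args = ["dvc", "run", "--no-exec", "-f", stage_file]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    return args + shlex.split(command)


def remote_repro_command(host, repo_dir, stage_file):
    """Remote step: ssh into the big instance and reproduce the stage there.

    Assumes the repository is already cloned at repo_dir on the host and
    that the stage file has been committed and pushed (hence the git pull).
    """
    remote = "cd {} && git pull && dvc repro {}".format(
        shlex.quote(repo_dir), shlex.quote(stage_file)
    )
    return ["ssh", host, remote]
```

Both helpers return argument lists that can be passed to `subprocess.run(args, check=True)`.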

> I saw some discussion of this in other issues, but it doesn't seem like there's a remote branch that yet tackles this yet so I wanted to inquire whether A) this work had already been started, and B) whether you have guidance for how to put together a PR that you would accept?

There's no current effort to tackle this, as far as I know.
For guidance, there's a contribution guide that can get you started with the project. Feel free to ask anything related to the code on the Discord #need-help channel!

By the way, @gregfriedland, thank you for your interest in contributing; it is really appreciated :)

dmpetrov · 2019-03-11T23:27:59Z

Hi @gregfriedland

Thank you for the feature request! Yeah, a few people have already asked for parallel execution (like #755). It looks like quite a heavy feature at first glance, but we would be really happy to help with the implementation if you are interested: to narrow the scope, break this feature request down into smaller ones, and implement them. Please let us know if you are ready to commit to this.

Another thing that I'd like to clarify...

> Thus, no local resources would be used and many pipelines (without interdependencies of course) could be run in parallel from a single node.

It seems like the method that @MrOutis proposed might actually work quite well for you: create a "run-branch" with no execution, execute it on a remote machine, and then merge it into a local one. It would be great to hear your opinion about this approach. Are there any issues that you see?

gregfriedland · 2019-03-14T17:58:06Z

Thanks for your responses.

The main issue I see with the suggested approach is that then I need to write my own orchestrator to run all these pipelines and it gets tricky if a pipeline step depends on the results of n other pipelines.
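To illustrate why this gets tricky, here is a minimal sketch of the scheduling part of such an orchestrator. It assumes Python 3.9+ for `graphlib`, a hypothetical `run_step` callable that blocks until a step finishes, and a `dependencies` mapping from each step to the set of steps it needs. It is not how dvc, Airflow, or Argo work internally. A step starts only once all n of its upstream steps have completed.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(dependencies, run_step, max_parallel=4):
    """Run steps in dependency order, in parallel where possible.

    dependencies: dict mapping step -> set of steps it depends on
    run_step:     hypothetical callable(step) that blocks until the step is done
    Returns the steps in the order they finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    running = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        while sorter.is_active():
            # Launch every step whose upstream steps have all completed.
            for step in sorter.get_ready():
                running[pool.submit(run_step, step)] = step
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise failures instead of hiding them
                step = running.pop(future)
                finished.append(step)
                sorter.done(step)  # may unblock downstream steps
    return finished
```

Even this sketch leaves out retries, partial failures, and resuming after a crash, which is much of what dedicated orchestrators provide.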

Thanks for your offer to help break down the tasks. After thinking about it more, I think dvc's pipelines may not be the best choice for my application, since orchestration and parallelism are so important for me. I will try to get the orchestration/pipelining going in Airflow or Argo and use plain S3 or dvc for the data management.

ghost added the "enhancement" (Enhances DVC) and "feature request" (Requesting a new feature) labels Mar 11, 2019

gregfriedland closed this as completed Mar 14, 2019

dmpetrov mentioned this issue Dec 3, 2019

dvc: release locks when running a command #2584

Merged

6 tasks
