Advice on adding parallelism #1710

Closed
gregfriedland opened this issue Mar 11, 2019 · 3 comments
Labels
enhancement (Enhances DVC), feature request (Requesting a new feature)

Comments

@gregfriedland

Hello,
First off, thanks for your work on this project! It seems very close to what I need for a project I'm working on at my company to build an automated, reproducible, versioned pipeline for training ML models to be used in our production workflows.

The only thing missing is parallelism, which we need because our model training takes up to 36h on an instance, and we have many of these training jobs we'd like to do simultaneously. My thought is that the pipeline jobs could be launched remotely (e.g. using kubernetes) and we'd wait for completion before continuing onto the next pipeline step. Thus, no local resources would be used and many pipelines (without interdependencies of course) could be run in parallel from a single node.

I saw some discussion of this in other issues, but it doesn't seem like there's a remote branch that tackles this yet, so I wanted to ask A) whether this work has already been started, and B) whether you have guidance on how to put together a PR that you would accept.

Best,
Greg

ghost added the enhancement and feature request labels on Mar 11, 2019
@ghost

ghost commented Mar 11, 2019

Hello, @gregfriedland !

A workflow that is currently supported for remote computing is to call dvc run with --no-exec (which only creates a stage file without actually running any command), then ssh into your beefy remote instance with 128G of RAM and run dvc repro there to execute the command. It is quite cumbersome, though.
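As a rough illustration of the two steps above, here is a minimal Python sketch that only builds the two command lines. The stage file name, dependency, output, host, and repository path are all hypothetical placeholders, and it assumes the stage file has been committed and pulled on the remote machine in between. The `dvc run` flags shown are the ones available in DVC releases around the time of this thread; later releases reworked that command, so check the documentation for your version.

```python
import shlex


def no_exec_stage_cmd(stage_file, deps, outs, command):
    """Build a `dvc run --no-exec` call, which writes the stage file only."""
    cmd = ["dvc", "run", "--no-exec", "-f", stage_file]
    for dep in deps:
        cmd += ["-d", dep]  # declare each dependency of the stage
    for out in outs:
        cmd += ["-o", out]  # declare each output of the stage
    return cmd + list(command)


def remote_repro_cmd(host, repo_dir, stage_file):
    """Build an ssh call that runs `dvc repro` inside the remote checkout."""
    remote = f"cd {shlex.quote(repo_dir)} && dvc repro {shlex.quote(stage_file)}"
    return ["ssh", host, remote]


# Hypothetical names, for illustration only.
local_cmd = no_exec_stage_cmd(
    "train.dvc", ["data/train.csv"], ["model.pkl"], ["python", "train.py"]
)
remote_cmd = remote_repro_cmd("trainer@gpu-box", "/srv/project", "train.dvc")

print(shlex.join(local_cmd))
print(shlex.join(remote_cmd))
```

Either list could be passed to `subprocess.run(cmd, check=True)` to actually execute it; the sketch stops at printing so nothing is launched by accident.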

> I saw some discussion of this in other issues, but it doesn't seem like there's a remote branch that tackles this yet, so I wanted to ask A) whether this work has already been started, and B) whether you have guidance on how to put together a PR that you would accept.

There's no current effort to tackle this, as far as I know.
For guidance, there's a contribution guide that can get you started with the project. Feel free to ask anything related to the code on the Discord #need-help channel!

By the way, @gregfriedland, thank you for your interest in contributing. It is really appreciated :) !

@dmpetrov
Member

Hi @gregfriedland

Thank you for the feature request! Yeah, a few people have already asked for parallel execution (see #755). At first glance it is quite a heavy feature, but we would be really happy to help if you are interested: we can help narrow the scope, decompose this feature request into smaller ones, and work through the implementation. Please let us know if you are ready to commit to this.

Another thing that I'd like to clarify...

> Thus, no local resources would be used and many pipelines (without interdependencies of course) could be run in parallel from a single node.

It seems like the method that @MrOutis proposed might actually work quite well for you: create a "run-branch" with no execution, execute it on a remote machine, and then merge it into a local one. It would be great to hear your opinion about this approach. Are there any issues that you see?
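One possible reading of the "run-branch" idea, written out as an ordered plan of commands per machine. This is only a sketch: the branch names, the stage file name, and the use of `dvc push` and `dvc pull` (which assume a DVC remote storage is configured) are assumptions, not something the thread specifies.

```python
def run_branch_plan(branch, base_branch, stage_file):
    """Return ordered (machine, command) steps for a run-branch workflow."""
    return [
        # Locally: record the stage without executing it, on its own branch.
        ("local", f"git checkout -b {branch}"),
        ("local", f"git add {stage_file}"),
        ("local", f"git commit -m 'Add {stage_file} without executing it'"),
        ("local", f"git push origin {branch}"),
        # On the remote machine: execute the stage and publish the results.
        ("remote", "git fetch origin"),
        ("remote", f"git checkout {branch}"),
        ("remote", f"dvc repro {stage_file}"),
        ("remote", f"git commit -am 'Reproduce {stage_file}'"),
        ("remote", "dvc push"),
        ("remote", f"git push origin {branch}"),
        # Locally again: merge the executed branch and fetch its outputs.
        ("local", f"git checkout {base_branch}"),
        ("local", "git fetch origin"),
        ("local", f"git merge origin/{branch}"),
        ("local", "dvc pull"),
    ]


# Hypothetical names, for illustration only.
plan = run_branch_plan("run-branch", "master", "train.dvc")
for machine, command in plan:
    print(f"[{machine}] {command}")
```

Keeping the plan as data makes it easy to review before running anything, and the same list could be fed to a small runner that dispatches each step to the right machine.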

@gregfriedland
Author

Thanks for your responses.

The main issue I see with the suggested approach is that I would then need to write my own orchestrator to run all these pipelines, and it gets tricky if a pipeline step depends on the results of n other pipelines.
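To make the fan-in case concrete, here is a minimal Python sketch of a step that waits on n independent pipelines. It is only an illustration under stated assumptions: `run_pipeline` is a hypothetical stand-in for whatever would launch a remote job and block until it completes, and the pipeline names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(name):
    """Hypothetical stand-in: a real version would launch a remote job
    (for example over ssh or on Kubernetes) and block until it finishes."""
    return f"{name}:done"


def fan_in(upstream, downstream, max_workers=4):
    """Run independent upstream pipelines in parallel, then run the
    downstream step once all of their results are available."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `upstream`.
        results = list(pool.map(run_pipeline, upstream))
    return downstream(results)


# Hypothetical pipelines feeding one dependent step.
summary = fan_in(["model_a", "model_b", "model_c"], lambda results: sorted(results))
print(summary)
```

Threads are enough here because each worker would mostly be waiting on a remote job rather than using local CPU. Retries, failure handling, and deeper dependency graphs are where a dedicated orchestrator starts to pay off.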

Thanks for your offer to help break down the tasks. After thinking about it more, I think dvc's pipelines may not be the best choice for my application, since orchestration and parallelism are so important for me. I will try to get the orchestration/pipelining going in Airflow or Argo and use plain S3 or dvc for the data management.
