dvc: consider introducing build matrix #1018

Open
efiop opened this issue Aug 14, 2018 · 9 comments

Comments

@efiop
Member

commented Aug 14, 2018

#973 (comment)

I.e. something like:

matrix:
  include:
    - workdir: runs/gs1
    - workdir: runs/gs2
cmd: process.py input output
deps:
  - path: input
outs:
  - path: output
    cache: True
@efiop

Member Author

commented Nov 15, 2018

Also maybe something like:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

it will produce output.experiment1, output.experiment2 and so on for the stages down the pipeline.
So basically, output files down the pipeline will have suffixes corresponding to the experiment that they belong to. Or maybe, instead of suffixes, directories would be created automatically to store those outputs for each experiment.
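A minimal Python sketch of how such a matrix could be expanded into concrete stages. This is not DVC's implementation; `expand_matrix` and the dictionary layout are hypothetical, and the sketch only computes the suffixed output names without deciding how the command itself would be redirected to them.

```python
from string import Template


def expand_matrix(cmd, outs, matrix):
    """Expand one templated stage into one concrete stage per matrix entry."""
    stages = []
    for entry in matrix:
        stages.append({
            "name": entry["name"],
            # Substitute $PARAMS in the command template.
            "cmd": Template(cmd).substitute(PARAMS=entry["params"]),
            # Suffix every output with the experiment name.
            "outs": [f"{out}.{entry['name']}" for out in outs],
        })
    return stages


matrix = [
    {"name": "experiment1", "params": "--option 1"},
    {"name": "experiment2", "params": "--option 2"},
]
stages = expand_matrix("mycmd input output $PARAMS", ["output"], matrix)
for stage in stages:
    print(stage["name"], "->", stage["cmd"], stage["outs"])
```

Switching from suffixes to per-experiment directories would only change the line that builds `outs`, for example to `f"{entry['name']}/{out}"`.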

@prihoda

Contributor

commented Nov 22, 2018

If I understand it correctly, this can already be handled by outputting a directory and using a command that contains a for loop, right?

Something like this:

mkdir output; for i in {0..100}; do mycmd input/gs${i}/options.json output/gs${i}; done

This approach also makes it possible to run all tasks in parallel, if you are able to submit asynchronously and wait for all tasks to finish:

dvc run -d input -o output 'mkdir output; for i in {1..100}; do mycmd input/gs${i}/options.json output/gs${i} & done; wait_for_results gs{1..100}'

# Formatted script:
mkdir output
for i in {1..100}; do
    # "&" already terminates the command; "&;" is a bash syntax error
    mycmd input/gs${i}/options.json output/gs${i} &
done
# wait_for_results is a placeholder for whatever blocks until all tasks finish
wait_for_results gs{1..100}

The problem with outputting a directory is that when you want to run an additional experiment, or if some of your experiments fail, you have to rerun all of the other ones as well. Therefore I think it's better to think in terms of one experiment = one DVC file. Making it possible to run these tasks in parallel (#755) would make that usable.

For example:

mkdir output; 
# Move to output directory to create DVC files there
cd output;
for i in {1..100}; do 
    # Would have to execute in parallel
    dvc run -d ../input/gs${i}/options.json -o gs${i} mycmd ../input/gs${i}/options.json gs${i}; 
done; 
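
To illustrate why one unit of work per experiment helps, here is a rough Python sketch that runs independent commands in parallel and reports only the failed ones for a rerun. The commands are stand-ins built from the Python interpreter itself (`gs2` is made to fail on purpose); `run_all` is a hypothetical helper, and nothing here implies that `dvc run` can currently be invoked concurrently.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each experiment's command in its own process; return exit codes by name."""
    def run_one(item):
        name, argv = item
        return name, subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, commands.items()))


# Stand-in commands: a real setup would call mycmd once per experiment.
commands = {}
for i in range(1, 4):
    exit_code = 1 if i == 2 else 0  # pretend that gs2 fails
    commands[f"gs{i}"] = [sys.executable, "-c", f"raise SystemExit({exit_code})"]

results = run_all(commands)
failed = sorted(name for name, code in results.items() if code != 0)
print("rerun only:", failed)
```

Because every experiment has its own command and its own exit code, a failure in one of them does not force the others to be repeated.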
@efiop

Member Author

commented Nov 23, 2018

@prihoda Great point! #1214 should be useful for such scenarios as well, since you will be able to tell dvc not to remove outputs before reproduction.

@dmpetrov

Member

commented Dec 31, 2018

I'm still trying to understand the build matrix stuff. And I think we cannot solve this problem without introducing a concept of reconfigurable stages. Let me explain this.

Parallelism

First, the build matrix looks like a special case of the parallel execution problem (#755): the parallel steps are specified within a single stage, as a build matrix with a certain level of parallelism.

However, an ideal parallelization solution should be able to run commands even from different stages. So, I'd discuss the parallel execution problem and build-matrix problem separately.

Reconfiguration

Second, there are many issues that point to the build matrix. Most of them are related to the reconfiguration of a step or a pipeline:

  1. In #973 @Hong-Xiang was asking about reusing (reconfigurable) pipelines.
  2. #1416 parametrize pipeline / step: not a config file, just parameters.
  3. #1119 repetitive commands. I see a similarity with parametrizable commands where only a single output is in use and without creating a separate directory for each experiment (./output.p instead of gs1/output.p).

To make a stage reconfigurable, many questions have to be answered (how to pass configs and params, how to specify inputs and outputs) and some assumptions have to be made. Reconfigurable steps are the problem that we need to solve first, before introducing a build matrix and before trying to implement something like this:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

Only after that will we be able to introduce a build matrix, or decide to just use loops as @prihoda suggested.

I've created a new issue #1462 for reconfigurable stages.

@hhoeflin


commented Aug 8, 2019

Have you looked into getting this behaviour using e.g. snakemake and building this into a dvc run step?

@efiop

Member Author

commented Aug 8, 2019

@hhoeflin We didn't. Could you elaborate, please? How would that look? 🙂

@hhoeflin


commented Aug 9, 2019

@efiop My own experience with snakemake is limited, so take this with a grain of salt. But the command is basically just a rule. Outputs are the targets, where the experiment name would be encoded in the directory name of the output or in the target suffix (as you suggested before). Snakemake allows you to easily parse these target names into their subcomponents. For the parameters, you could use a dictionary that injects different parameters into the rule depending on the experiment name.
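
The idea can be sketched in plain Python (this is not Snakemake syntax, just an illustration of the mechanism). The pattern, the `PARAMS` dictionary and `command_for` are hypothetical names, and the layout `<experiment>/<filename>` is an assumption.

```python
import re

# Hypothetical per-experiment parameters, keyed by experiment name.
PARAMS = {
    "experiment1": "--option 1",
    "experiment2": "--option 2",
}

# Assumed target layout: <experiment>/<filename>
TARGET_PATTERN = re.compile(r"^(?P<experiment>[^/]+)/(?P<filename>.+)$")


def command_for(target):
    """Parse the experiment name out of a target path and build its command."""
    match = TARGET_PATTERN.match(target)
    if match is None:
        raise ValueError(f"target does not look like <experiment>/<file>: {target}")
    experiment = match["experiment"]
    # The dictionary injects the parameters that belong to this experiment.
    return f"mycmd input {target} {PARAMS[experiment]}"


print(command_for("experiment1/output"))
print(command_for("experiment2/output"))
```

Requesting a target such as `experiment2/output` is then enough to determine both where the result goes and which parameters are used.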

Hope this helps. The documentation of snakemake is really good. Have a look there.

@efiop

Member Author

commented Aug 12, 2019

@hhoeflin Thanks for elaborating! 🙂 I don't think we will be able to natively integrate snakemake into dvcfiles, since we use pure yaml, but we could definitely check it out to see if we can draw some conclusions from it and use them while implementing our own feature.

@hhoeflin


commented Aug 14, 2019

@efiop
One other interesting project to look into would be makepp (http://makepp.sourceforge.net/)

It is a make program that tracks inputs and outputs using md5 checksums that are stored inside a project in .makepp directories. It is a "drop-in" replacement for make.
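
As a rough illustration of checksum-based change tracking in general (not makepp's actual storage format or algorithm), the following Python sketch records md5 digests in a JSON file and reports which inputs changed since the last check. The file names and the `changed_files` helper are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the md5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_files(paths, state_file):
    """Return paths whose md5 differs from the recorded one, then update the record."""
    state_file = Path(state_file)
    recorded = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {str(p): md5_of(p) for p in paths}
    changed = [p for p, digest in current.items() if recorded.get(p) != digest]
    state_file.write_text(json.dumps(current))
    return changed


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "input.txt"
    state = Path(tmp) / "checksums.json"
    data.write_text("v1")
    first = changed_files([data], state)   # never seen before
    second = changed_files([data], state)  # unchanged since the last check
    data.write_text("v2")
    third = changed_files([data], state)   # content changed
    print(len(first), len(second), len(third))
```

A step whose inputs all come back unchanged can be skipped, which is the property that makes checksum tracking attractive for rerunning only the affected experiments.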
