Pipeline variables from params file #3633
Comments
In my case I stored these global variables (vocabulary file, cleaned data file, label name file) in a configuration file as defaults, and the processing CLI can accept all of these variables to override the defaults. But the configuration file parsing and merging is coded inside the program, and had to be re-coded in every program. |
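For reference, the kind of per-program config-merging boilerplate described above might look roughly like this; a minimal sketch with hypothetical file and flag names, not code taken from the comment:

```python
import argparse
import yaml

def load_config(defaults_path: str = "config.yaml") -> dict:
    """Load default settings from YAML, then let CLI flags override individual keys."""
    with open(defaults_path) as f:
        config = yaml.safe_load(f)

    # argparse stores --vocab-file as 'vocab_file', so the YAML keys use underscores too.
    parser = argparse.ArgumentParser()
    parser.add_argument("--vocab-file")
    parser.add_argument("--cleaned-data-file")
    parser.add_argument("--label-name-file")
    args = parser.parse_args()

    # Merge: any flag that was actually passed overrides the default value.
    config.update({k: v for k, v in vars(args).items() if v is not None})
    return config
```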
Before I knew about multiple-stage pipelines I used the following:

```yaml
data:
  url: s3://mydata.npz
  out_path: data/mydata.npz
data/-o:        # Each element in this list will be
                # added as a -o <arg>.
                # To access strings within the current stage
                # you can also use '.out_path'
  - data.out_path
data/script:
  - python src/cli/data_download.py
  - --data-size
  - .url        # equivalent to data.url
  - --out-path
  - .out_path   # equivalent to data.out_path
train:
  lr: 0.001
  ckpts: artefacts/ckpts
train/-d:
  - data.out_path
train/-o:
  - artefacts/tflogs
train/script:
  - python train.py
  - --data
  - data.out_path
  - --ckpts
  - .ckpts
  - --lr
  - .lr
```

I'm not suggesting the multi-stage pipeline changes. I guess all I'm saying is: …
After thought: after sleeping on it I thought, "I can use jinja2 for this":

```yaml
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
      - {{ args.data.out_path }}
  train:
    cmd: python src/cli/train.py -d {{ args.data.out_path }} --params {{ args.train.params }} --log-dir {{ args.train.logdir }} --metrics-path {{ args.train.metrics }} --ckpts {{ args.train.ckpt_dir }}
    wdir: ..
    deps:
      - {{ args.data.out_path }}
      - {{ args.train.params }}
    metrics:
      - {{ args.train.metrics }}
    outs:
      - {{ args.train.logdir }}
      - {{ args.train.ckpt_dir }}
```

I then load my … |
Unfortunately, YAML anchors can only solve part of this: You can use them to fill in repeating dependencies like this:
However, there doesn't seem to be an easy way to also "paste" them into a full command, unless we write some custom YAML parsing. |
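To make that limitation concrete, here is a small sketch (file contents and key names are made up): an anchor can stand in for a whole node, such as a repeated dependency path, but an alias written inside a command string is just text.

```python
import yaml

doc = """
data_out: &data_out data/mydata.npz

stages:
  train:
    deps:
      - *data_out                            # alias replaces the whole node
    cmd: python train.py --data *data_out    # inside a string it stays literal text
"""

parsed = yaml.safe_load(doc)
print(parsed["stages"]["train"]["deps"])  # ['data/mydata.npz']
print(parsed["stages"]["train"]["cmd"])   # 'python train.py --data *data_out'
```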
@dsuess can the jinja2-like approach from @tall-josh potentially solve this issue? |
@dmpetrov I like the idea, but I see two drawbacks:
To alleviate the first, maybe we can add a special section
and then only template the … For the second one, we could either look into disallowing certain functionality (like conditionals or loops) in jinja2, or implement our own simple templating engine that only supports a very limited set of features; so far, I only see variable substitution as necessary. Alternatively, we could implement our own PyYAML tag to support pasting, like this: |
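A rough sketch of what such a custom PyYAML tag could look like; the `!paste` tag name and the variable table are hypothetical, not an existing DVC feature:

```python
import yaml

# Hypothetical variable table; in a real design this would come
# from a vars/params section of the pipeline file.
VARS = {"out_path": "data/mydata.npz", "lr": "0.001"}

class PasteLoader(yaml.SafeLoader):
    """SafeLoader extended with a custom !paste tag."""

def paste_constructor(loader, node):
    # Fill {name} placeholders in the tagged scalar from the vars table.
    return loader.construct_scalar(node).format(**VARS)

PasteLoader.add_constructor("!paste", paste_constructor)

doc = 'cmd: !paste "python train.py --data {out_path} --lr {lr}"'
print(yaml.load(doc, Loader=PasteLoader)["cmd"])
# python train.py --data data/mydata.npz --lr 0.001
```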
@dsuess, good idea, but there is a question |
There are definitely use cases for both in-file variables and external variable files. I agree with @dsuess in that full templating is probably not the way to go as a proper solution.

```yaml
vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml
  data:
    size: 10
    out_path: ./data/
  ...
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ vars.data.size }} --out-path {{ vars.data.out_path }}
    wdir: ..
    outs:
      - {{ vars.data.out_path }}
  train:
    ...
    outs:
      - {{ somevars.something.x }}
      - {{ morevars.someotherthing.y }}
```
|
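As a rough sketch of how the IMPORT idea above might be resolved when the pipeline file is loaded, each imported file could become its own namespace in the template context (file layout and names are illustrative only):

```python
import yaml

def build_context(pipeline: dict) -> dict:
    """Build the template context: in-file vars plus one namespace per imported vars file."""
    vars_section = dict(pipeline.get("vars", {}))
    imports = vars_section.pop("IMPORT", {})
    context = {"vars": vars_section}
    for namespace, path in imports.items():
        with open(path) as f:
            context[namespace] = yaml.safe_load(f)  # e.g. {{ somevars.something.x }}
    return context
```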
@dsuess the variables section makes total sense. I see that as a simple form of variables with no params file associated. We definitely need it and this part has a higher priority. It looks like all of us are on the same page regarding Jinja - full support is too much. Also, I like @tall-josh's idea of importing, which unifies the explicit vars definition:
It looks like the high-level design is ready :) and we need to agree on the syntax and then a templating engine. On the syntax side, I see these options:
What are your thoughts on this? (3) seems like a more formalized way of doing (1) PS: |
Hi @dmpetrov |
@karajan1001 Agreed. I think this change makes the most sense with #3584.

@dmpetrov If we want to share variables between files, I think @tall-josh's solution is better than just implicitly passing variables through from the dependencies. That would also cover the use case of having a single variables file that's shared throughout the repo. However, I also think that not passing them between files wouldn't be a problem if the multi-stage files from #3584 are the new standard. This way, files would scope variables nicely.

Regarding the syntax: I don't have any preference as long as the files are still valid YAML. The first one will interfere with standard shell variable substitution in the command, which might be a good thing, as using non-tracked variables in the command will break reproducibility. It might also be a bad thing if there's a valid use case for having shell variables in the command. |
@dsuess I'm not sure multi-stage is quite at the point of being the new standard, though I think it will get there soon. Someone correct me if I'm wrong, but as of a few days ago multi-stage ( #3584 ) did not support |
@dsuess agree, defining variables from the command line is not the high-priority feature; we should have a great solution in the pipeline file. Also, a good point about env variables. @tall-josh it would be great to hear your params ideas! You can post them on the multi-stage issue or just here. |
@dmpetrov No probs, I've raised the issue (#3584), but it's not really an idea. It's more along the lines of keeping multi-stage consistent with the regular single-stage:

```
dvc run --params myparams.yaml:train "cat myparams.yaml"
```

produces a dvc file:

```yaml
md5: d978fc54ad4ee1ad65e241f0bd122482
cmd: cat myparams.yaml
deps:
- path: myparams.yaml
  params:
    train:
      epochs: 10
      lr: 0.001
```

Whereas using multi-stage:

```
dvc run -n stage1 --params myparams.yaml:train "cat myparams.yaml"
```

results in a DvcFile that looks like this, plus a lock file, neither of which refer to the params:

```yaml
stages:
  stage1:
    cmd: cat myparams.yaml
    deps:
    - myparams.yaml
```

I've directed @skshetry to this thread too as I think the two issues would benefit from one another. |
Hi @tall-josh. We now have params support for the pipeline file (as we are calling it at the moment). You have already seen the file format, so I'll just get into the discussion of variables. I'd propose … And, we can introduce a … So, it should look something like:

```yaml
variables:
  data:
    size: 10
    out_path: ./data/
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ data.size }} --out-path {{ data.out_path }}
    wdir: ..
    outs:
      - {{ data.out_path }}
```

Note that I'm not focusing on templating formats (that's likely going to be tied to the implementation), but we should try to make it logic-less. Providing logic will only make it complicated and ugly. |
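To illustrate what "logic-less" means in practice, plain variable substitution with no conditionals or loops is enough, as in this stand-in using Python's `string.Template` (the `$name` syntax here is purely for illustration, not DVC's):

```python
from string import Template

variables = {"data_size": 10, "data_out_path": "./data/"}

cmd = Template("python src/cli/data_download.py --data-size $data_size --out-path $data_out_path")
print(cmd.substitute(variables))
# python src/cli/data_download.py --data-size 10 --out-path ./data/
```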
Seems that most of you agree that reusable YAML variables should be defined directly in the pipeline file (I do too). So maybe you should rename the issue, @dmpetrov? |
@skshetry Amazing news! I can't wait to give it a crack!
|
@skshetry EDIT: Ahhh I misunderstood your comment. If I now understand correctly; There is now |
@tall-josh yes, the templating is still in progress. Right now we are busy releasing DVC 1.0 https://dvc.org/blog/dvc-3-years-and-1-0-release. The params feature is a good candidate for the next step, DVC 1.1 I think.

@elgehelge I think there is no need to rename this issue. It was created specifically to find a solution for using … But I agree that the "static" params are a super useful tool and should be the first priority (I had no doubt about that). I can create a separate issue for the "static" parameters case. I like how simple variables can be implemented:

```yaml
variables:
  data:
    size: 10
    out_path: ./data/
```

Why don't we use the same approach for importing all the params:

```yaml
params_variables: [params.yaml]
stages:
  mytrain:
    cmd: python train.py --batch-size {{ train.batch_size }} --out-path {{ train.out_path }}
    outs:
      - {{ train.out_path }}
```

As an option, we can import by param name or param section name:

```yaml
params_variables:
  - params.yaml: [process.threshold, train]
```

In this case:

```yaml
params:
  train.batch_size: 2048
  train.out_path: "out/model.p"
```

What do you think, folks? |
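A possible way to resolve such selective imports, where a dotted name picks a nested key out of params.yaml (purely illustrative, not DVC's implementation):

```python
import yaml

# Hypothetical params.yaml contents
params = yaml.safe_load("""
process:
  threshold: 0.5
train:
  batch_size: 2048
  out_path: out/model.p
""")

def select(tree, dotted):
    """Walk nested dicts by a dotted key, e.g. 'process.threshold'."""
    for part in dotted.split("."):
        tree = tree[part]
    return tree

# params_variables: [params.yaml: [process.threshold, train]] would then expose:
context = {name: select(params, name) for name in ("process.threshold", "train")}
print(context)
# {'process.threshold': 0.5, 'train': {'batch_size': 2048, 'out_path': 'out/model.p'}}
```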
@dmpetrov Thanks for the clarification. In the meantime, I had a crack at implementing a demo: https://github.com/tall-josh/dvc/tree/master/TEMPLATING_DEMO. The pipeline file looks like this:

```yaml
vars:
  data: data.txt
  base_dir: base_dir
  stage1_out: stage1_output.txt
  stage2_out: stage2_output.txt
  splits:
    train: 0.89
    eval: 0.3
stages:
  st1:
    deps:
      - {{ vars.data }}
    outs:
      - {{ vars.base_dir }}/{{ vars.stage1_out }}
    cmd: >-
      mkdir {{ vars.base_dir }};
      cat {{ vars.data }} > {{ vars.base_dir }}/{{ vars.stage1_out }}
  st2:
    deps:
      - {{ vars.base_dir }}/{{ vars.stage1_out }}
    outs:
      - {{ vars.base_dir }}/{{ vars.stage2_out }}
    cmd: >-
      echo "train: {{ vars.splits['train'] }}" > {{ vars.base_dir }}/{{ vars.stage2_out }};
      cat {{ vars.base_dir }}/{{ vars.stage1_out }} >>
      {{ vars.base_dir }}/{{ vars.stage2_out }}
```
|
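The core of a demo like this can be a few lines of jinja2: render the raw file against its own vars section, then parse the result. A simplified sketch, not the actual code in the linked branch:

```python
import yaml
from jinja2 import Template

def render_pipeline(path):
    """Render {{ vars.* }} placeholders using the file's own vars section."""
    with open(path) as f:
        raw = f.read()
    # The raw file is not valid YAML until rendered ({{ ... }} confuses the parser),
    # so parse only the untemplated text before the stages key to get the vars.
    variables = yaml.safe_load(raw.split("stages:", 1)[0])["vars"]
    rendered = Template(raw).render(vars=variables)
    return yaml.safe_load(rendered)

# pipeline = render_pipeline("dvc.yaml")
```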
Hello everyone! I'm a bit late to the party, but here are some ideas about using variables for looping (original discussion started in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471). In a nutshell, I want to use the same pipeline to train models using different groups or sets of data. The steps are the same, but the data is slightly different and the outputs have different paths. Having … From the original post:

```yaml
params:
  group1:
    name: "group 1"
    something_else: "foo"
  group2:
    name: "group 2"
    something_else: "meh"
  group3:
    name: "group 3"
    something_else: "gwah"
stages:
  get_some_data:
    cmd: ./get_data.sh {{params.name}}
    deps:
      - get_data.sh
      - query.sql
    outs:
      - data.csv
  train_model:
    cmd: Rscript train.R {{params.name}} {{params.something_else}}
    deps:
      - data.csv
    outs:
      - models/{{params.name}}/model.xgb
    metrics:
      - metrics/{{params.name}}/model_eval.json
```

No ideas at the moment regarding how to declare that some parameters should be looped on. I imagine … |
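One way to picture the looping idea: each group stamps out a copy of the stage template with that group's values filled in. A toy expansion in Python (the `stage@group` naming and the use of jinja2 here are illustrative, not the eventual DVC syntax):

```python
from jinja2 import Template

groups = {
    "group1": {"name": "group 1", "something_else": "foo"},
    "group2": {"name": "group 2", "something_else": "meh"},
}

cmd_template = Template("Rscript train.R '{{ params.name }}' '{{ params.something_else }}'")

# One concrete training command per group.
stages = {
    f"train_model@{group_id}": cmd_template.render(params=values)
    for group_id, values in groups.items()
}
print(stages)
# {'train_model@group1': "Rscript train.R 'group 1' 'foo'", ...}
```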
I've been mulling this over too. Would be super handy!! |
The biggest improvement of the parametrization feature I've experienced is that I can now scale the workflow by only modifying `params.yaml`. This defines the minimal repro workflow (for simplicity I've removed a lot of params here); I use this for my GitLab CI setup:

```yaml
# params.yaml
models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 2
datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12
testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12
```

In this setting, I trained one model and tested it on one dataset (Set12); the corresponding DAG is [screenshot]. I could easily repeat some of the contents in `params.yaml`:
```yaml
# params.yaml
models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 50
  gaussian_25:
    name: gaussian_25
    data:
      noise_level: 25
    train:
      epochs: 50
  gaussian_50:
    name: gaussian_50
    data:
      noise_level: 50
    train:
      epochs: 50
datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12
  Set68:
    name: Set68
    src: data/DnCNN.zip
    path: data/test/Set68
testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12
  set12_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set12
  set12_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set12
  set68_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set68
  set68_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set68
  set68_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set68
```

🎉 The workflow gets scaled automatically without touching any of the other files. It's still unclear to me what the best practice is for parameterizing the workflow, but I'm pretty satisfied with how it keeps the code clean. I'm still playing with dvc and the parametrization; when I feel ready, I might summarize it all in a blog post.

dvc.yaml:
```yaml
stages:
  extract_data:
    foreach: ${datasets}
    do:
      cmd: mkdir -p ${item.path} && unzip -q ${item.src} "${item.name}/*" -d tmp && mv tmp/${item.name}/* ${item.path}
      deps:
        - ${item.src}
      outs:
        - ${item.path}
  prepare:
    foreach: ${models}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=prepare -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=prepare \
        prepare/main.jl data/train/rawdata/ data/train/prepared \
        --noise-level ${item.data.noise_level} \
        --patch-stride ${item.data.patch_stride} \
        --patch-size ${item.data.patch_size} \
        --max-num-patches ${item.data.max_num_patches}
      deps:
        - data/train/rawdata
        - prepare
      outs:
        - data/train/prepared/${item.name}.h5
  train:
    foreach: ${models}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=train -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=train \
        train/main.jl data/train/prepared/${item.name}.h5 models/${item.name} \
        --epochs ${item.train.epochs} \
        --batch-size ${item.train.batch_size} \
        --initial-lr ${item.train.initial_learning_rate} \
        --lr-decay-ratio ${item.train.learning_rate_decay_ratio} \
        --lr-decay-freq ${item.train.learning_rate_decay_frequency} \
        --model-width ${item.train.model_width} \
        --model-depth ${item.train.model_depth} \
        --padding-size ${item.train.padding_size} \
        --kernel-size ${item.train.kernel_size} \
        --log-freq ${item.train.log_frequency} \
        --use-gpu ${item.train.use_gpu}
      deps:
        - data/train/prepared/${item.name}.h5
        - train
      outs:
        - models/${item.name}
  evaluate:
    foreach: ${testcases}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=evaluate -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=evaluate evaluate/main.jl \
        models/${item.model_name} \
        data/test/${item.dataset} \
        results/${item.model_name}/${item.dataset} \
        ${item.noise_level}
      deps:
        - data/test/${item.dataset}
        - models/${item.model_name}
        - evaluate
      outs:
        - results/${item.model_name}/${item.dataset}/processed_images
      plots:
        - results/${item.model_name}/${item.dataset}/checkpoints.csv
        - results/${item.model_name}/${item.dataset}/metrics.csv
  summary:
    cmd: >-
      julia --color=yes --startup=no --project=evaluate -e "using Pkg; Pkg.instantiate()" &&
      julia --color=yes --startup=no --project=evaluate evaluate/summary.jl results metrics.json
    deps:
      - results
      - evaluate
    metrics:
      - metrics.json
```

Even simpler …
|
Hi, everyone. The parametrization feature is complete, at least the things that we planned for the next release. We are hoping to release a pre-release version this week. So, it'd be really helpful if you guys could provide feedback. |
Are there docs or examples for this? Does it follow @johnnychen94's latest comment? If there are no docs but I can get a nudge in the right direction, I could put together a notebook or something. |
Ahhh think I found what I'm looking for in |
@tall-josh, it's in the wiki as mentioned here: https://github.com/iterative/dvc/wiki/Parametrization @johnnychen94's example also works.
What are you looking for exactly? I don't think it has anything to do with parameterization. |
Thanks @skshetry, that wiki is perfect. Yeah, the … |
With the introduction of the new multiple-stage pipeline, we will need to find a way of defining variables in the pipeline. For example, the intermediate file name `cleansed.csv` is used by two stages in the following pipeline and it needs to be defined as a variable.

We need to solve two problems here: … (the `train` stage), not from the command line (like in the `process` stage).

We can solve both of the problems using a single abstraction - a `parameters file variable`.

This feature is useful in the current DVC design as well. It is convenient to read file names from a params file and still define the dependency properly, like `dvc run -d params.yaml:input_file -o params.yaml:model.pkl`.
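To illustrate that last point, the stage script itself could read the same file names from the params file, so the pipeline definition and the code share a single source of truth. A sketch assuming params.yaml has top-level `input_file` and `model.pkl` entries as in the example; the `params.yaml:key` dependency syntax above is the proposal, not current behavior:

```python
import pickle
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

with open(params["input_file"]) as f:       # the same key DVC would track via params.yaml:input_file
    rows = f.readlines()

model = {"n_rows": len(rows)}               # stand-in for real training

with open(params["model.pkl"], "wb") as f:  # the same key declared as the output
    pickle.dump(model, f)
```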