Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline variables from params file #3633

Closed
dmpetrov opened this issue Apr 14, 2020 · 34 comments
Closed

Pipeline variables from params file #3633

dmpetrov opened this issue Apr 14, 2020 · 34 comments
Assignees
Labels
A: templating Related to the templating feature feature request Requesting a new feature

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Apr 14, 2020

With the introduction of the new multiple-stage pipeline, we will need to find a way of defining variables in the pipeline. For example, the intermediate file name cleansed.csv is used from two stages in the following pipeline and it needs to be defined into a variable:

stages:
    process:
        cmd: "./process.bin --input data --output cleansed.csv"
        deps:
             - path: data/
        outs:
             - path: cleansed.csv

    train:
        cmd: "python train.py"
        deps:
             - path: cleansed.csv
             - path: train.py
             - path: params.yaml
                params:
                     lr: 0.042
                     layers: 8
                     classes: 4
        outs:
             - path: model.pkl
             - path: log.csv
                cache: true
             - path: summary.json

We need to solve two problems here:

  1. Define a variable in one place and reuse it from multiple places/stages.
  2. Often users prefer to read file names from config files (like in the train stage), not from the command line (like in the process stage).

We can solve both of the problems using a single abstraction - parameters file variable:

stages:
    process:
        cmd: ./process.bin
        outs:
             - path: "params.yaml:cleansed_file_name"
        ....
    train:
        cmd: "python train.py"
        deps:
             - path: "params.yaml:cleansed_file_name"

This feature is useful in the current DVC design as well. It is convenient to read file names from params file and still define dependency properly like dvc run -d params.yaml:input_file -o params.yaml:model.pkl

@dmpetrov dmpetrov added the feature request Requesting a new feature label Apr 14, 2020
@karajan1001
Copy link
Contributor

In my case I stored these global variables (vocabulary file, cleaned data file, label name file) in a configuration file as default. And the processing cli can accept all these variables to override the defualt one. But my configuration file parsing and merging is coded inside the program, and had to be re-coded in every program.

@tall-josh
Copy link

tall-josh commented Apr 18, 2020

Before I knew about multiple-stage pipelines I used the params functionality and hacked together a script to build a richer dvc run command. I used the same dot.notation you can use in the params command line args ie: dvc run -p params.yaml:data.output,train ... In the end my params.yaml looks like:

data:
    url: s3://mydata.npz
    out_path: data/mydata.npz
data/-o:  # Each element in this list will be
          # added as a  -o <arg>
    # For access strings within the current stage
    # you can also use '.out_path'
    - data.out_path
data/script:
    - python src/cli/data_download.py
    - --data-size
    - .url  # equivalent to data.url
    - --out-path
    - .out_path # equivalent to data.out_path

train:
  lr: 0.001
  ckpts: arteftacts/ckpts
train/-d:
  - data.out_path
train/-o:
  - artefacts/tflogs
train/script:
  - python train.py
  - --data
  - data.out_path
  - --ckpts
  - .ckpts
  - --lr
  - .lr

I'm not suggesting the multi-stage pipeline changes. I guess all I'm saying is:

  • I like that dot.notation and,
  • I like the idea of self-referencing inside the same yaml file.

After Thought

After sleeping on it I thought, 'I can use jinja2 for this':

stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
    - {{ args.data.out_path }}
  train:
    cmd: python src/cli/train.py -d {{ args.data.out_path }} --params {{ args.train.params }} --log-dir {{ args.train.lodir }} --metrics-path {{ args.train.metrics }} --ckpts {{ args.train.ckpt_path }}
    wdir: ..
    deps:
    - {{ args.data.out_path }}
    - {{ args.train.params }}
    metrics:
    - {{ args.train.metrics }}
    outs:
    - {{ args.train.logdir }}
    - {{ args.train.ckpt_dir }}

I then load my args.yaml file and convert it into a namedtuple (so I can use the dot.notation). Bingo bango, we have our filled in .dvc file

@dsuess
Copy link

dsuess commented Apr 22, 2020

Unfortunately, YAML anchors can only solve part of this: You can use them to fill in repeating dependencies like this:

stages:
    process:
        outs:
             - path: &cleansed cleansed.csv
       ...

    train:
        deps:
             - path: *cleansed
        ...

However, there doesn't seem to be an easy way to also "paste" them into a full command, unless we write some custom YAML parsing.

@dmpetrov
Copy link
Member Author

@dsuess can jinja2-like approach from @tall-josh potentially solve this issue?

@dsuess
Copy link

dsuess commented Apr 22, 2020

@dmpetrov I like the idea, but I see two drawbacks:

  1. the necessary information is split across two files
  2. allowing full jinja-templating on the configuration file might be too much

To alleviate the first, maybe we can add a special section variables to the Dvcfile like this:

variables:
  data:
    size: 10
    out_path: ./data/
  ...
  
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
    - {{ args.data.out_path }}
  train:
     ...

and then only template the stages section with those variables. No templating would be allowed for that first section.

For the second one, we could either look into not allowing certain functionality (like conditionals or loops) in jinja2. Alternatively, we need to implement our own simple templating engine that only supports a very limited sets of features. So far, I only see variable substitution to be necessary.

Alternatively, we could implement our own PyYaml tag to support pasting like this

@karajan1001
Copy link
Contributor

karajan1001 commented Apr 22, 2020

@dsuess, good idea, but there is a questionvariables need to be shared among DVCfiles, many of them ( for me embedding size, vocabulary file, sequence length and etc.) must stay the same in all stages. Maybe we should make the DVCfile inherit parameters from its ancestors.

@tall-josh
Copy link

tall-josh commented Apr 22, 2020

There are definitely use cases for both in-file variables and external variable files. I agree with @dsuess in that full templating is probably not the way to go as a proper solution.
Maybe it's just a terminology thing but @karajan1001, if your thinking full blown inheritance then I think things could get messy. A simple way of importing from 1 or more files cloud be nice.

vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml

  data:
    size: 10
    out_path: ./data/
  ...
  
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ vars.data.size }} --out-path {{ vars.data.out_path }}
    wdir: ..
    outs:
    - {{ args.data.out_path }}
  train:
     ...
    outs:
    - {{ somevars.something.x }}
    - {{ morevars.someotherthing.y }}

@dmpetrov
Copy link
Member Author

@dsuess the variables section makes total sense. I see that as a simple form of variables with no param file associated. We definitely need it and this part has a higher priority.

It looks like all of us on the same page regarding Jinja - full support is too much.

Also, I like @tall-josh's idea of importing which unifies the explicit vars definition:

vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml

It looks like the high-level design is ready :) and we need to agree on the syntax and then a templating engine.

On the syntax side, I see these options:

  1. Special patter in strings like "$vars.somevar"
  2. jinja2-style variables: {{ vars.somevar }}
  3. PyYaml tag: !var [*vars.somevar]
  4. other options

What are your thoughts on this?

(3) seems like a more formalized way of doing (1)

PS: dvc run can use a bit different syntax or we can decide not to support variables on the command level - let's keep this discussion separately.

@tall-josh
Copy link

Hi @dmpetrov
Yep, we're on the same page. I like the first syntax simply because it's concise.

@dsuess
Copy link

dsuess commented Apr 22, 2020

@karajan1001 Agreed. I think this change make most sense with #3584

@dmpetrov If we want to share variables between files, I think @tall-josh solution is better than just implicitly passing variables through from the dependencies. That would also cover the use case of having a single variables-file that's shared throughout the repo. However, I also think that not passing them between files wouldn't be a problem, if the multi-stage files from #3584 are the new standard. This way, files would scope variables nicely.

Regarding the syntax: I don't have any preference as long as the files are still valid YAML. The first one will interfere with standard shell variable substitution in the command, which might be a good thing as using non-tracked variables in the command will break reproducibility. It might also be a bad thing if there's a valid use case for having shell-variables in the command.

@tall-josh
Copy link

@dsuess I'm not sure multi-stage is quite at the point of being the new standard, though I think it will get there soon. Someone correct me if I'm wrong, but as of a few days ago multi-stage ( #3584 ) did not support --params :-( . Something I'm been meaning to bring up their thread. I'll do that now I guess.

@dmpetrov
Copy link
Member Author

@dsuess agree, defining variables from command line is not the high priority feature. We should have a great solution in the pipeline file. Also, a good point about env variables.

@tall-josh it would be great to hear your params ideas! You can post it on the multi-stage issue or just here.

@tall-josh
Copy link

@dmpetrov. No probs, I've raised the issue ( #3584 ), but it's not really an idea. It's more along the lines of keeping the multi-stage consistent with the regular single-stage .dvc files. ie:

dvc run --params myparams.yaml:train "cat myparams.yaml"

Produces a dvc file:

md5: d978fc54ad4ee1ad65e241f0bd122482
cmd: cat myparams.yaml
deps:
- path: myparams.yaml
  params:
    train:
      epochs: 10
      lr: 0.001

Whereas using multi-stage

dvc run -n stage1 --params myparams.yaml:train "cat myparams.yaml"

Results in a DvcFile that looks like this. Plus a lock file. Neither of which refer to the train key in myparams.yaml

stages:
  stage1:
    cmd: cat myparams.yaml
    deps:
    - myparams.yaml

I've directed @skshetry to this thread too as I think the two issues would benefit from one and other.

@skshetry
Copy link
Member

skshetry commented May 4, 2020

Hi @tall-josh. We now have params support for the pipeline file (as we are calling it at the moment).
We are planning to release it soon (with a public beta in this week).

You have already seen the file format, so, I'll just enter to the discussions of variables. I'd propose
not having an indirection of reading variables from the params file, but from the pipeline file itself.
The pipeline file is very readable and clean, and we'll strive to make it so in the future.

And, we can introduce a variables section in the pipeline file, which can be used for reading params by the dvc.
This way, there can be clear separation between params required by your scripts vs required by
dvc so as not to duplicate information (but, again for read-only purpose, you can just add to
the pipeline file, no problem on that).

So, it should look something like:

variables:
  data:
    size: 10
    out_path: ./data/
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ data.size }} --out-path {{ data.out_path }}
    wdir: ..
    outs:
    - {{ data.out_path }}

Note that, I'm not focusing on templating formats (as it's likely going to be tied with the implementation), but we should try to make it logic-less. Providing logic will only make it complicated, and ugly.

cc @karajan1001 @dsuess

@elgehelge
Copy link
Contributor

elgehelge commented May 7, 2020

Seems that most of you agree that reusable yaml-variables should be defined directly in the pipeline file (I do too). So maybe you should rename the issue @dmpetrov ?

@tall-josh
Copy link

@skshetry Amazing news! I can't wait to give it a crack!

  • I agree logic-less is a good call
  • I'm in two minds about whether to allow loading vars from separate files. Leaning more towards as you've described above, but I could be convinced.

@tall-josh
Copy link

tall-josh commented May 29, 2020

@skshetry
I have finally got some overhead to give the params support for the pipeline file. Do you have an example somewhere?

EDIT:

Ahhh I misunderstood your comment. If I now understand correctly; There is now --params support for multistage dvc files, but you're still working on templating variables.

@dmpetrov
Copy link
Member Author

@tall-josh yes, the templating is still in process. Now we are busy by releasing DVC 1.0 https://dvc.org/blog/dvc-3-years-and-1-0-release The parameter is a good feature for the next step. DVC 1.1 I think.

@elgehelge I think there is no need for renaming this issue. It was created specifically to find a solution for using params.yaml in the pipeline. It is a useful approach when you need to localize all the configuration in a single file - a very common pattern. Also, it might be useful when some external tools can properly read and modify simple parameters files params.yaml but do not understand the more complicated format of dvc.yaml. And we already good a lot of good ideas on how to implement it.

But I agree that the "static" params is a super useful tool and should be the first priority (I had no doubt with that). I can create a separate issue for the "static" parameters case.

I like how simple variables can be implemented:

variables:
  data:
    size: 10
    out_path: ./data/

Why don't we use the same approach for importing all the params:

params_variables: [params.yaml]

stages:
  mytrain:
    cmd: python train.py --batch-size {{ train.batch_size }} --out-path {{ train.out_path }}
    outs:
        - {{ train.out_path }}

As an option, we can import by param name or param section name:

params_variables:
    - params.yaml: [process.threshold, train]

In this case, dvc.lock should include all the params that were used in this stage like:

    params:
         train.batch_size: 2048
         train.out_path: "out/model.p"

What do you think folks?

@tall-josh
Copy link

@dmpetrov Thanks for the clarification. In the meantime, I had a crack at implementing a vars section at the top of the multistage dvc.yaml file using Jinja2. It's a bit janky, but it appears to have the desired effect.

https://github.com/tall-josh/dvc/tree/master/TEMPLATING_DEMO

The dvc.yaml file looks like:

vars:
    data: data.txt
    base_dir: base_dir
    stage1_out: stage1_output.txt
    stage2_out: stage2_output.txt
    splits:
        train: 0.89
        eval: 0.3

stages:
  st1:
    deps:
    - {{ vars.data }}
    outs:
    - {{ vars.base_dir }}/{{ vars.stage1_out }}
    cmd: >-
        mkdir {{ vars.base_dir }};
        cat {{ vars.data }} > {{ vars.base_dir }}/{{ vars.stage1_out }}
  st2:
    deps:
    - {{ vars.base_dir }}/{{ vars.stage1_out }}
    outs:
    - {{ vars.base_dir }}/{{ vars.stage2_out }}
    cmd: >-
        echo "train: {{ vars.splits['train'] }}" > {{ vars.base_dir }}/{{ vars.stage2_out }};
        cat {{ vars.base_dir }}/{{ vars.stage1_out }} >>
        {{ vars.base_dir }}/{{ vars.stage2_out }}

@jcpsantiago
Copy link

Hello everyone! I'm a bit late to the party, but here are some ideas about using variables for looping (original discussion started in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471).

In a nutshell, I want to use the same pipeline to train models using different groups or sets of data. The steps are the same, but the data is slightly different and the outputs have different paths.

Having vars or a params.yaml file would be great! I like the idea of declaring how the DAG should look like and what parameters it needs to run

From the original post:

params:
  group1:
    name: "group 1"
    something_else: "foo"
  group2:
    name: "group 2"
    something_else: "meh"
  group3:
    name: "group 3"
    something_else: "gwah"

stages:
  get_some_data:
    cmd: ./get_data.sh {{params.name}}
    deps:
      - get_data.sh
      - query.sql
    outs:
      - data.csv
  train_model:
    cmd: Rscript train.R {{params.name}} {{params.something_else}}
    deps:
      - data.csv
    outs:
      - models/{{params.name}}/model.xgb
    metrics:
      - metrics/{{params.name}}/model_eval.json

No ideas at the moment regarding how to declare that some parameters should be looped on.

I imagine dvc repro would then run group-by-group the whole DAG from start to finish (as opposed to stage-by-stage).

@tall-josh
Copy link

tall-josh commented Aug 24, 2020

I've been mulling this over too. Would be super handy!!

@johnnychen94
Copy link
Contributor

johnnychen94 commented Dec 1, 2020

The biggest improvement of the parametrization feature I've experienced is that I can now scale the workflow by only modifying params.yaml:

This defines the minimal repro workflow (For simplicity I've removed a lot of params here), I use this for my gitlab CI setting:

# params.yaml

models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 2

datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12


testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12

In this setting, I trained one model, and test the model on one dataset (Set12).

the corresponding dag is

image

I could easily repeat some of the contents in params.yml. In my case, I now trained three models, and test each model on two datasets(Set12 and Set68).

# params.yaml
models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 50

  gaussian_25:
    name: gaussian_25
    data:
      noise_level: 25
    train:
      epochs: 50

  gaussian_50:
    name: gaussian_50
    data:
      noise_level: 50
    train:
      epochs: 50

datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12
  Set68:
    name: Set68
    src: data/DnCNN.zip
    path: data/test/Set68


testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12
  set12_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set12
  set12_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set12

  set68_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set68
  set68_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set68
  set68_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set68

🎉 the workflow gets scaled automatically without touching any of the other files:

image

It's still unclear to me what's the best practice to parameterized the workflow. But I'm pretty satisfied with how it makes the code clean.

I'm still playing with dvc and the parametrization. When I feel ready, I might summary them up as a blog.

dvc.yaml
stages:
  extract_data:
    foreach: ${datasets}
    do:
      cmd: mkdir -p ${item.path} && unzip -q ${item.src} "${item.name}/*" -d tmp && mv tmp/${item.name}/* ${item.path}
      deps:
      - ${item.src}
      outs:
      - ${item.path}
  prepare:
    foreach: ${models}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=prepare -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=prepare \
          prepare/main.jl data/train/rawdata/ data/train/prepared \
          --noise-level ${item.data.noise_level} \
          --patch-stride ${item.data.patch_stride} \
          --patch-size ${item.data.patch_size} \
          --max-num-patches ${item.data.max_num_patches}
      deps:
      - data/train/rawdata
      - prepare
      outs:
      - data/train/prepared/${item.name}.h5
  train:
    foreach: ${models}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=train -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=train \
          train/main.jl data/train/prepared/${item.name}.h5 models/${item.name} \
          --epochs ${item.train.epochs} \
          --batch-size ${item.train.batch_size} \
          --initial-lr ${item.train.initial_learning_rate} \
          --lr-decay-ratio ${item.train.learning_rate_decay_ratio} \
          --lr-decay-freq ${item.train.learning_rate_decay_frequency} \
          --model-width ${item.train.model_width} \
          --model-depth ${item.train.model_depth} \
          --padding-size ${item.train.padding_size} \
          --kernel-size ${item.train.kernel_size} \
          --batch-size ${item.train.batch_size} \
          --log-freq ${item.train.log_frequency} \
          --use-gpu ${item.train.use_gpu}
      deps:
      - data/train/prepared/${item.name}.h5
      - train
      outs:
      - models/${item.name}
  evaluate:
    foreach: ${testcases}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=evaluate -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=evaluate evaluate/main.jl \
          models/${item.model_name} \
          data/test/${item.dataset} \
          results/${item.model_name}/${item.dataset} \
          ${item.noise_level}
      deps:
      - data/test/${item.dataset}
      - models/${item.model_name}
      - evaluate
      outs:
      - results/${item.model_name}/${item.dataset}/processed_images
      plots:
      - results/${item.model_name}/${item.dataset}/checkpoints.csv
      - results/${item.model_name}/${item.dataset}/metrics.csv
  summary:
    cmd: >-
      julia --color=yes --startup=no --project=evaluate -e "using Pkg; Pkg.instantiate()" &&
      julia --color=yes --startup=no --project=evaluate evaluate/summary.jl results metrics.json
    deps:
    - results
    - evaluate
    metrics:
    - metrics.json

Even simpler params.yaml

The params.yaml can be further simplified if dvc borrows travis build matrix semantics, e.g.:

stages:
  build:
    foreach-matrix:
          - ${model}
          - ${dataset}
    do:
      ...

or ansible filter semantics, e.g.,:

stages:
  build:
    # item[0] := ${model}
    # item[1] := ${dataset}
    foreach: ${model} | zip(${dataset})
    do:
      ...

I don't have a preference here, I like both specifications.

  • Travis build matrix is very easy to use
  • ansible filter is more flexible, it also allows custom filters

For normal data science projects, it might be better to adopt the Travis way because data scientists can adapt it more easily; they normally are not good programmers 😨 I'm not sure of it, though.


running concurrent jobs

As the workflow can now scale up very easily, it becomes a pain to queue all jobs into one dvc repro sequentially. I still don't have a good sense of how this can be done programmatically. But having a loose read lock(#4979) sounds like a good start.

I haven't followed all the discussions in #755 yet; it can be quite challenging to let DVC figure out what is the best parallelization strategy. So my intuition is that, with the current dvc repro -s functionality, it might be easier to put into another YAML files:

Edit: This seems to be option 1 that @dmpetrov mentioned in #755 (comment)

# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1 # 1 means no concurrency, this can be set as default value
  - name: prepare@*
    limit: 4 # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary

extract_data@* follows the same glob syntax in #4976

P.S. I don't have access to slurm cluster so my ideas on parallelization are basically on a local machine. It's not very clear to me whether distributed computation should be handled by dvc or by concrete languages/toolboxes that dvc calls.

@skshetry
Copy link
Member

skshetry commented Dec 2, 2020

Hi, everyone. The parametrization feature is complete, at least the things that we planned for the next release. We are hoping to release a pre-release version this week. So, it'd be really helpful if you guys could provide feedback.

@skshetry skshetry unpinned this issue Dec 5, 2020
@skshetry skshetry closed this as completed Dec 8, 2020
@tall-josh
Copy link

Are there docs or examples for this? Does it follow @johnnychen94 's latest comment? If there are no docs but I can get a nudge in the right direction I could put together a Notebook or something.

@tall-josh
Copy link

Ahhh think I found what I'm looking for in tests/func/test_run_multistage.py:test_run_params_default()

@skshetry
Copy link
Member

skshetry commented Dec 8, 2020

Are there docs or examples for this?

@tall-josh, it's in the wiki as mentioned here: https://github.com/iterative/dvc/wiki/Parametrization

@johnnychen94's example also works.

Ahhh think I found what I'm looking for in tests/func/test_run_multistage.py:test_run_params_default()

What are you looking for exactly? I don't think it has anything to do with parameterization.

@tall-josh
Copy link

Thanks @skshetry that wiki perfect. Yeah, the test_run_params_default() did not give me what I want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature feature request Requesting a new feature
Projects
None yet
Development

No branches or pull requests

9 participants