
DVC feature requests #560

Closed
2 of 4 tasks
casperdcl opened this issue May 25, 2021 · 10 comments
Assignees
Labels
discussion Waiting for team decision dvc related to DVC epic Collection of sub-issues icebox p1-important High priority question User requesting support

Comments

@casperdcl
Contributor

casperdcl commented May 25, 2021

Collection of DVC issues which CML functionality needs

Potentially needs

Needs

@casperdcl casperdcl added question User requesting support p1-important High priority discussion Waiting for team decision epic Collection of sub-issues labels May 25, 2021
@0x2b3bfa0
Member

0x2b3bfa0 commented Jun 1, 2021

Naïve requirements (first edition)

Dulwich authentication

Commands like dvc exp pull and dvc exp push rely on authentication for interacting with the Git remote. In a headless setup like a continuous integration pipeline, these operations should not require any kind of user interaction, but dulwich (the library that provides DVC with Git capabilities) does not support many of the authentication hacks used by popular continuous integration tools.

GitHub Actions, as per the actions/checkout@v2 action, relies on a custom authorization header set through the local repository configuration:

[http "https://github.com/"]
        extraheader = AUTHORIZATION: basic ···

GitLab and others may be using different mechanisms, like SSH keys or credential helpers, so this would require further investigation. See jelmer/dulwich#873 and jelmer/dulwich#882 for a similar request.

Possible fixes

  • Improve dulwich to support common authentication methods 😌
  • Outsource push and pull operations to the git command-line tool 🙊
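For illustration, the extraheader mechanism described above could be reproduced for the git command-line tool roughly like this. This is only a sketch: the x-access-token username and the base64 payload are assumptions based on actions/checkout's documented behaviour, and `dummy` stands in for a real CI token.

```shell
#!/bin/bash
# Sketch: store a basic-auth header in the local repository configuration,
# the same hack actions/checkout@v2 uses, so plain `git` can push without
# prompting. GITHUB_TOKEN would come from the CI environment.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

token="${GITHUB_TOKEN:-dummy}"
# Header value is base64("x-access-token:<token>"); tr strips base64 wrapping.
header="AUTHORIZATION: basic $(printf 'x-access-token:%s' "$token" | base64 | tr -d '\n')"
git config --local "http.https://github.com/.extraheader" "$header"
```

A CLI fallback using this configuration would sidestep dulwich entirely for push/pull, at the cost of shelling out to git.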

Automatic dvc exp push on checkpoints

In the spot instance and limited execution time scenarios, we need to provide users with a way of saving their checkpoints to their DVC remote and the experiment references to the Git remote each time a given number of checkpoints is captured.

Possible fixes

Lazy experiment pull and apply

In order to keep CML workflows as simple as possible, we should probably abstract the differences between newly created tasks and resumed tasks. It would be interesting to have some DVC flags to avoid returning a non-zero exit code when the referenced experiment doesn't exist yet.

Possible fixes

  • Extend DVC to provide dvc exp pull --lazy <remote> <experiment> and dvc exp apply --lazy <experiment> 🤔
  • Use dvc exp list <remote> to determine if the experiment exists and only pull it in that case 🙊

All the suggestions above assume that we're going to use DVC experiments to track CML runs, and those experiments will be deterministically named after the commit & branch that triggered the run.
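The `dvc exp list` option could be sketched as below. The `run_dvc` wrapper and the `lazy_exp` function name are mine, introduced only so the control flow can be exercised without a real remote; the dvc subcommands are the ones suggested above.

```shell
#!/bin/bash
# Sketch of "check with `dvc exp list`, pull/apply only when present,
# otherwise run the experiment from scratch".
run_dvc() { dvc "$@"; }

lazy_exp() {
  local remote="$1" exp="$2"
  if run_dvc exp list "$remote" | grep -qxF "$exp"; then
    # Experiment already exists on the remote: resume it.
    run_dvc exp pull "$remote" "$exp"
    run_dvc exp apply "$exp"
  else
    # First run: nothing to pull, start fresh.
    run_dvc exp run --name "$exp"
  fi
}
```

Unlike the `pull && apply || run` one-liner, this only falls through to `exp run` when the experiment is genuinely absent, so network failures during pull still surface as errors.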

@casperdcl
Contributor Author

Lazy experiment pull and apply

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

@0x2b3bfa0
Member

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

Me neither; that's why I used the 🤔 emoji above. Nevertheless, we would be silently masking any error (even network ones) as the "experiment not found" case, and this might not be good in a continuous integration scenario where failing early is better than blindly using expensive resources.

@DavidGOrtega
Contributor

DavidGOrtega commented Jun 1, 2021

CML workflow stoppage or the endless training problem

If the workflow stops, the CML runner should be able to restart the workflow and continue the training from the last checkpoint.

We conduct a series of experiments assuming that the training generates incremental checkpoints, as TensorFlow does.
We could have saved the state as many other frameworks do; however, the chosen method is simple to implement and makes it easier to grasp what's going on.

TensorFlow example checkpoints

saver.save(sess, 'my_model', global_step=1000)

my_model-1000.index
my_model-1000.meta
my_model-1000.data-00000-of-00001
checkpoint

When

  • timeout (GitHub: 3 h, or 72 h with self-hosted runners)
  • spot instance or cloud runner termination

To simulate this stoppage we set a workflow timeout of 1 min

Expected

The workflow should be able to be restarted and continue training from the last checkpoint until completed.

Problem

With DVC as storage, all the experiments need to handle dvc.lock at some point before dying.
In some cases, like repro and exp run, dvc.lock is not accessible until the very end.
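One way to approach "handle dvc.lock before dying", sketched here under the assumption that the runner delivers SIGTERM/SIGINT with a grace period before reclaiming the machine, is to trap the signal and push whatever state exists. The push commands are commented out placeholders; a real job would run them.

```shell
#!/bin/bash
# Sketch: push partial state when the job is about to be killed
# (timeout or spot-instance reclaim), instead of losing it.
pushed=0
on_term() {
  pushed=1
  echo "terminating: pushing partial state before dying"
  # dvc push --run-cache
  # cml-pr dvc.lock .gitignore
}
trap on_term TERM INT

kill -TERM $$   # simulate the runner killing the job
```

This only helps with the cache; it does not solve the core problem that dvc.lock itself is written after repro finishes.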

Trials

DVC repro

  • Alters dvc.lock after repro, never before.

Implementation:
Our training process tries to push the models folder into DVC with different strategies:

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro

          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push
        fi
done

dvc commit & push

dvc commit is suggested by DVC itself. We knew in advance that this was not working, but it might be misleading for the user.

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc add models
            dvc commit
            dvc push
        fi
done

dvc push --run-cache

.github/workflows/cml.yaml

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push --run-cache
        fi
done

dvc commit & push --run-cache

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc add models
            dvc commit
            dvc push --run-cache
        fi
done

Problems:
We only have one dvc.lock file, and it is written only after repro.
Hence, if the training is stopped before dvc.lock is committed, DVC cannot recover the state in the next run and restarts from zero.
cml-pr is useless here.

dvc exp run

  • Alters dvc.lock after repro, never before.

Implementation:
This is just a variation of dvc repro.

Problems:
Exactly the same as with repro.

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run

          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push
        fi
done

dvc exp run with checkpoints:

  • ephemeral commits
  • updates dvc.lock
  • always resumes from the last checkpoint
dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        checkpoint: true
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push

            echo 'cml-pr'
            cml-pr '.gitignore' 'dvc.lock'
        fi
done

Problems

  • Checkpoints resume training from the last checkpoint. This will end up in endless training.
  • Using cml-pr generates many PRs

A plausible solution would be to merge the last PR, forcing the CI to restart and continue from there.
While our simple script, which checks for the existence of several files, would succeed, a real scenario would end up
in endless training if that check is not also done in the training itself.
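A minimal sketch of that missing check: a completion guard that makes train.sh terminate once all checkpoints exist. The `training_complete` helper is hypothetical, and FINAL_STEP mirrors the loop bound in the example script.

```shell
#!/bin/bash
# Sketch: a restarted workflow should stop when training is already done,
# instead of looping forever on resumed checkpoints.
FINAL_STEP=3

training_complete() {
  # The last checkpoint existing is taken as "training finished".
  [ -f "models/model-checkpoint-$FINAL_STEP" ]
}

cd "$(mktemp -d)"
mkdir -p models
if training_complete; then echo "already done"; else echo "resuming"; fi
touch "models/model-checkpoint-$FINAL_STEP"
if training_complete; then echo "already done"; else echo "resuming"; fi
```

The same guard would need to run before `dvc exp run` (or at the top of the training loop) so the CI restart cycle converges.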

@courentin
Contributor

My 2 cents: I feel the issue iterative/dvc#5369 is related to CML; I'd love to have the ability to check whether I need to repro the pipeline without having to spin up a self-hosted runner and pull the data.

@DavidGOrtega DavidGOrtega added p0-critical Max priority (ASAP) and removed p1-important High priority labels Jun 8, 2021
@DavidGOrtega
Contributor

SIGINT is not very effective when running dvc exp run; it has to be triggered several times.

@casperdcl
Contributor Author

Lazy experiment pull and apply

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

btw @0x2b3bfa0 would dvc exp run --pull satisfy you here? Or is there a different use case for dvc exp {pull,apply} --lazy?

@0x2b3bfa0
Member

btw @0x2b3bfa0 would dvc exp run --pull satisfy you here? Or is there a different use case for dvc exp {pull,apply} --lazy?

If we use the run cache to save checkpoints, that would be much more elegant than my earlier suggestion.

@casperdcl
Contributor Author

casperdcl commented Jun 16, 2021

potential workflow:

dvc exp run --name JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull
# ... (auto)kill via SIGINT after ~72h ... # CML does this 5 min early
dvc exp push # CML does this
dvc push # CML does this
# ... CML restarts the workflow

better alternative:

dvc exp run --name CI_JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull --push-every-checkpoint
# ... (auto)kill via SIGINT after ~72h ...
# ... CML restarts the workflow

Note that using COMMIT_SHA instead of CI_JOB_ID might not work in cases where the exp params are not stored in the commit (i.e. two job IDs with different params but the same commit SHA).
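A naming sketch consistent with that note, preferring a CI job identifier over the commit SHA. The variable names are illustrative (CI_JOB_ID is GitLab's, GITHUB_RUN_ID is GitHub's) and local-demo is just a fallback for running outside CI.

```shell
#!/bin/bash
# Sketch: derive a deterministic experiment name that is stable across a
# kill-and-restart of the same job, but distinct for two jobs that happen
# to share a commit SHA with different params.
exp_name="exp-${CI_JOB_ID:-${GITHUB_RUN_ID:-local-demo}}"
echo "$exp_name"
# dvc exp run --name "$exp_name" --pull   # as in the workflow sketch above
```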

@casperdcl casperdcl added p1-important High priority and removed p0-critical Max priority (ASAP) labels Jun 22, 2021
@casperdcl casperdcl added the dvc related to DVC label Jul 28, 2021
@dacbd dacbd added the icebox label Feb 17, 2023
@dacbd
Contributor

dacbd commented Feb 17, 2023

to be revisited

@dacbd dacbd closed this as not planned Feb 17, 2023