
DVC feature requests #560

Closed
2 of 4 tasks
casperdcl opened this issue May 25, 2021 · 10 comments
Assignees
Labels
discussion Waiting for team decision dvc related to DVC epic Collection of sub-issues icebox p1-important High priority question User requesting support

Comments

@casperdcl
Contributor

casperdcl commented May 25, 2021

Collection of DVC issues which CML functionality needs

Potentially needs

Needs

@casperdcl casperdcl added question User requesting support p1-important High priority discussion Waiting for team decision epic Collection of sub-issues labels May 25, 2021
@0x2b3bfa0
Member

0x2b3bfa0 commented Jun 1, 2021

Naïve requirements (first edition)

Dulwich authentication

Commands like dvc exp pull and dvc exp push rely on authentication for interacting with the Git remote. In a headless setup like a continuous integration pipeline, these operations should not require any kind of user interaction, but dulwich (the library that provides DVC with Git capabilities) does not support many of the authentication hacks used by popular continuous integration tools.

GitHub Actions, as per the actions/checkout@v2 action, relies on a custom authorization header set through the local repository configuration:

[http "https://github.com/"]
        extraheader = AUTHORIZATION: basic ···

GitLab and others may be using different mechanisms, like SSH keys or credential helpers, so this would require further investigation. See jelmer/dulwich#873 and jelmer/dulwich#882 for a similar request.

Possible fixes

  • Improve dulwich to support common authentication methods 😌
  • Outsource push and pull operations to the git command-line tool 🙊
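For illustration, the extraheader mechanism described above could be reproduced for the git command-line tool roughly like this. This is only a sketch: the x-access-token username and the base64 payload are assumptions based on actions/checkout's documented behaviour, and `dummy` stands in for a real CI token.

```shell
#!/bin/bash
# Sketch: store a basic-auth header in the local repository configuration,
# the same hack actions/checkout@v2 uses, so plain `git` can push without
# prompting. GITHUB_TOKEN would come from the CI environment.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

token="${GITHUB_TOKEN:-dummy}"
# Header value is base64("x-access-token:<token>"); tr strips base64 wrapping.
header="AUTHORIZATION: basic $(printf 'x-access-token:%s' "$token" | base64 | tr -d '\n')"
git config --local "http.https://github.com/.extraheader" "$header"
```

A CLI fallback using this configuration would sidestep dulwich entirely for push/pull, at the cost of shelling out to git.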

Automatic dvc exp push on checkpoints

In the spot instance and limited execution time scenarios, we need to provide users with a way of saving their checkpoints to their DVC remote and the experiment references to the Git remote each time a given number of checkpoints is captured.

Possible fixes

Lazy experiment pull and apply

In order to keep CML workflows as simple as possible, we should probably abstract the differences between newly created tasks and resumed tasks. It would be interesting to have some DVC flags to avoid returning a non-zero exit code when the referenced experiment doesn't exist yet.

Possible fixes

  • Extend DVC to provide dvc exp pull --lazy <remote> <experiment> and dvc exp apply --lazy <experiment> 🤔
  • Use dvc exp list <remote> to determine if the experiment exists and only pull it in that case 🙊

All the suggestions above assume that we're going to use DVC experiments to track CML runs, and those experiments will be deterministically named after the commit & branch that triggered the run.
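The `dvc exp list` option could be sketched as below. The `run_dvc` wrapper and the `lazy_exp` function name are mine, introduced only so the control flow can be exercised without a real remote; the dvc subcommands are the ones suggested above.

```shell
#!/bin/bash
# Sketch of "check with `dvc exp list`, pull/apply only when present,
# otherwise run the experiment from scratch".
run_dvc() { dvc "$@"; }

lazy_exp() {
  local remote="$1" exp="$2"
  if run_dvc exp list "$remote" | grep -qxF "$exp"; then
    # Experiment already exists on the remote: resume it.
    run_dvc exp pull "$remote" "$exp"
    run_dvc exp apply "$exp"
  else
    # First run: nothing to pull, start fresh.
    run_dvc exp run --name "$exp"
  fi
}
```

Unlike the `pull && apply || run` one-liner, this only falls through to `exp run` when the experiment is genuinely absent, so network failures during pull still surface as errors.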

@casperdcl
Contributor Author

Lazy experiment pull and apply

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

@0x2b3bfa0
Member

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

Me neither; that's why I used the 🤔 emoji above. Nevertheless, we would be silently masking any error (even network ones) as the "experiment not found" case, and this might not be good in a continuous integration scenario where failing early is better than blindly using expensive resources.

@DavidGOrtega
Contributor

DavidGOrtega commented Jun 1, 2021

CML workflow stoppage or the endless training problem

If the workflow stops, the CML runner should be able to restart the workflow and continue the training from the last checkpoint.

We conduct a series of experiments assuming that the training generates incremental checkpoints, as TensorFlow does.
We could have saved the state as many other frameworks do; however, the chosen method is simple to implement and makes it easier to grasp what's going on.

TensorFlow example checkpoints

saver.save(sess, 'my_model', global_step=1000)

my_model-1000.index
my_model-1000.meta
my_model-1000.data-00000-of-00001
checkpoint

When

  • timeout (GitHub: 3 h, or 72 h with self-hosted runners)
  • spot instance or cloud runner termination

To simulate this stoppage we set a workflow timeout of 1 min

Expected

The workflow should be able to be restarted and continue training from the last checkpoint until completed.

Problem

With DVC as storage, all the experiments need to handle dvc.lock at some point before dying.
In some cases, like repro and exp run, dvc.lock is not accessible until the very end.
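One way to approach "handle dvc.lock before dying", sketched here under the assumption that the runner delivers SIGTERM/SIGINT with a grace period before reclaiming the machine, is to trap the signal and push whatever state exists. The push commands are commented out placeholders; a real job would run them.

```shell
#!/bin/bash
# Sketch: push partial state when the job is about to be killed
# (timeout or spot-instance reclaim), instead of losing it.
pushed=0
on_term() {
  pushed=1
  echo "terminating: pushing partial state before dying"
  # dvc push --run-cache
  # cml-pr dvc.lock .gitignore
}
trap on_term TERM INT

kill -TERM $$   # simulate the runner killing the job
```

This only helps with the cache; it does not solve the core problem that dvc.lock itself is written after repro finishes.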

Trials

DVC repro

  • Alters dvc.lock after repro, never before.

Implementation:
Our training process tries to push the models folder into DVC with different strategies:

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro

          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push
        fi
done

dvc commit & push

dvc commit is suggested by DVC itself. We knew in advance that this was not working, but it might be misleading for the user.

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc add models
            dvc commit
            dvc push
        fi
done

dvc push --run-cache

.github/workflows/cml.yaml

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push --run-cache
        fi
done

dvc commit & push --run-cache

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc add models
            dvc commit
            dvc push --run-cache
        fi
done

Problems:
We only have one dvc.lock file, and it is written only after repro.
Hence, if the training is stopped before dvc.lock is committed, DVC cannot recover the state in the next run and restarts from zero.
cml-pr is useless here.

dvc exp run

  • Alters dvc.lock after repro, never before.

Implementation:
This is just a variation of dvc repro.

Problems:
Exactly the same as with repro.

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run

          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push
        fi
done

dvc exp run with checkpoints:

  • ephemeral commits
  • updates dvc.lock
  • always resumes from the last checkpoint
dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        checkpoint: true
train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push

            echo 'cml-pr'
            cml-pr '.gitignore' 'dvc.lock'
        fi
done

Problems

  • Checkpoints resume training from the last checkpoint. This will end up in endless training.
  • Using cml-pr generates many PRs

A plausible solution would be to merge the last PR, forcing the CI to restart and continue from there.
While our simple script, which checks for the existence of several files, would succeed, a real scenario would end up
in endless training if that check is not also done in the training itself.
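A minimal sketch of that missing check: a completion guard that makes train.sh terminate once all checkpoints exist. The `training_complete` helper is hypothetical, and FINAL_STEP mirrors the loop bound in the example script.

```shell
#!/bin/bash
# Sketch: a restarted workflow should stop when training is already done,
# instead of looping forever on resumed checkpoints.
FINAL_STEP=3

training_complete() {
  # The last checkpoint existing is taken as "training finished".
  [ -f "models/model-checkpoint-$FINAL_STEP" ]
}

cd "$(mktemp -d)"
mkdir -p models
if training_complete; then echo "already done"; else echo "resuming"; fi
touch "models/model-checkpoint-$FINAL_STEP"
if training_complete; then echo "already done"; else echo "resuming"; fi
```

The same guard would need to run before `dvc exp run` (or at the top of the training loop) so the CI restart cycle converges.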

@courentin
Contributor

My 2 cents: I feel the issue iterative/dvc#5369 is related to CML; I'd love to have the ability to check whether I need to repro the pipeline without having to spin up a self-hosted runner and pull the data.

@DavidGOrtega DavidGOrtega added p0-critical Max priority (ASAP) and removed p1-important High priority labels Jun 8, 2021
@DavidGOrtega
Contributor

SIGINT is not very effective when running dvc exp run; it has to be triggered several times.

@casperdcl
Contributor Author

Lazy experiment pull and apply

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

btw @0x2b3bfa0 would dvc exp run --pull satisfy you here? Or is there a different use case for dvc exp {pull,apply} --lazy?

@0x2b3bfa0
Member

btw @0x2b3bfa0 would dvc exp run --pull satisfy you here? Or is there a different use case for dvc exp {pull,apply} --lazy?

If we use the run cache to save checkpoints, that would be much more elegant than my earlier suggestion.

@casperdcl
Contributor Author

casperdcl commented Jun 16, 2021

potential workflow:

dvc exp run --name JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull
# ... (auto)kill via SIGINT after ~72h ... # CML does this 5 min early
dvc exp push # CML does this
dvc push # CML does this
# ... CML restarts the workflow

better alternative:

dvc exp run --name CI_JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull --push-every-checkpoint
# ... (auto)kill via SIGINT after ~72h ...
# ... CML restarts the workflow

Note that using COMMIT_SHA instead of CI_JOB_ID might not work in cases where the exp params are not stored in the commit (i.e. two job IDs with different params but the same commit SHA).
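A naming sketch consistent with that note, preferring a CI job identifier over the commit SHA. The variable names are illustrative (CI_JOB_ID is GitLab's, GITHUB_RUN_ID is GitHub's) and local-demo is just a fallback for running outside CI.

```shell
#!/bin/bash
# Sketch: derive a deterministic experiment name that is stable across a
# kill-and-restart of the same job, but distinct for two jobs that happen
# to share a commit SHA with different params.
exp_name="exp-${CI_JOB_ID:-${GITHUB_RUN_ID:-local-demo}}"
echo "$exp_name"
# dvc exp run --name "$exp_name" --pull   # as in the workflow sketch above
```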

@casperdcl casperdcl added p1-important High priority and removed p0-critical Max priority (ASAP) labels Jun 22, 2021
@casperdcl casperdcl added the dvc related to DVC label Jul 28, 2021
@dacbd dacbd added the icebox label Feb 17, 2023
@dacbd
Contributor

dacbd commented Feb 17, 2023

to be revisited

@dacbd dacbd closed this as not planned Feb 17, 2023