Repeat an experiment #8867

Open
pablo-campillo opened this issue Jan 23, 2023 · 3 comments
Labels
A: experiments Related to dvc exp feature request Requesting a new feature p2-medium Medium priority, should be done, but less important

Comments


pablo-campillo commented Jan 23, 2023

Sometimes experiments fail for reasons related to external factors. These experiments may have been launched from code using a random search grid, so repeating them is not easy. Two features would be handy:

  1. Repeating failed experiments, overwriting the appropriate parameters (data, metrics, etc.):
     dvc exp run --failed
  2. Repeating an experiment, overwriting its appropriate parameters, given its name:
     dvc exp run -n petete --repeat
@dberenbaum dberenbaum added A: experiments Related to dvc exp feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint labels Jan 23, 2023
dberenbaum (Contributor) commented

Putting this as p3 for now due to too many competing priorities, but seems pretty important for working with queued experiments and sweeps.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Feb 17, 2023

lefos99 commented Dec 13, 2023

I vote for this as well. Sometimes experiments don't finish due to hardware issues, and I would like to repeat or resume them conveniently!


henrypickler commented Dec 22, 2023

I will add to this and try to explain why it is a very important feature, for me at least. I tend to use spot machines to train models since they are cheaper. However, they may be terminated at any time, so multiple experiments fail at once because I run them in parallel. I do use grid search, but more often than not I run several different sweeps at a time, which makes it difficult to repeat the failed experiments.

To expand on this, it would be awesome if DVC also differentiated between an experiment that failed due to an internal error and one that failed due to something external, although I'm not sure how feasible this is.

Anyway, in hopes of helping anyone with this issue, I have a little script for helping me retry failed experiments. Be warned though, I haven't tested it much:

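#!/bin/bash
# Re-queue every experiment that `dvc queue status` reports as Failed.
# Assumes the experiment name is the second column of the status output.

dvc queue status | grep "Failed" | awk '{print $2}' | while read -r name; do
    # Apply the failed experiment to the workspace...
    dvc exp apply "$name"
    # ...then queue a new run from that workspace state.
    dvc exp run --queue
done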
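
A slightly more defensive variant of the script above, as a rough sketch rather than a tested tool. It matches on the status column instead of grepping the whole line, so an experiment whose name happens to contain "Failed" is not picked up by mistake. The function names `failed_names` and `requeue_failed` are made up for illustration. It assumes that `dvc queue status` prints whitespace-separated rows with the experiment name in the second column and the status in the last column, and that every queued experiment has a name.

```shell
#!/bin/bash
# Sketch only. Assumptions (not verified against every DVC version):
#   - `dvc queue status` prints whitespace-separated rows
#   - column 2 is the experiment name, the last column is the status
#   - every queued experiment has a name

# Read status rows on stdin and print the names of failed tasks.
failed_names() {
    awk '$NF == "Failed" { print $2 }'
}

# Apply each failed experiment to the workspace and queue it again.
# The run is only queued if the apply step succeeded.
requeue_failed() {
    dvc queue status | failed_names | while read -r name; do
        dvc exp apply "$name" && dvc exp run --queue
    done
}
```

After sourcing the file, calling `requeue_failed` from the repository root re-queues the failed experiments; `dvc queue status | failed_names` on its own only lists them, which is a cheap way to check the column assumption first.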
