Repeat an experiment #8867

pablo-campillo · 2023-01-23T14:49:21Z

Sometimes experiments fail for reasons related to external factors. These experiments may have been launched from code using a random search grid, so repeating them is not easy. Two features would be handy:

Repeating failed experiments, overwriting the appropriate parameters (data, metrics, etc.):

dvc exp run --failed

Repeating an experiment, overwriting its appropriate parameters, given its name:

dvc exp run -n petete --repeat


dberenbaum · 2023-01-23T20:12:02Z

Putting this as p3 for now due to too many competing priorities, but seems pretty important for working with queued experiments and sweeps.

lefos99 · 2023-12-13T17:16:10Z

I vote for this as well. Sometimes the experiment doesn't finish due to hardware issues and thus I would like to repeat/resume those conveniently!

henrypickler · 2023-12-22T13:31:26Z

I will add to this and try to explain why it is a very important feature, for me at least. I tend to use spot machines to train models since they are cheaper. However, they may be terminated at any time, and because I run experiments in parallel, several of them fail at once. I do use grid search, but more often than not I run multiple different sweeps at a time, which makes it difficult to repeat the failed experiments.

To expand on this, it would be awesome if DVC also differentiated between an experiment that failed due to an internal error and one that failed due to something external, although I'm not sure how feasible this is.

Anyway, in hopes of helping anyone with this issue, I have a little script for helping me retry failed experiments. Be warned though, I haven't tested it much:

#!/bin/bash

# Re-queue every experiment that `dvc queue status` reports as failed.
dvc queue status | grep "Failed" | awk '{print $2}' | while read -r name; do
    dvc exp apply "$name"
    dvc exp run --queue
done
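
A slightly more defensive variant of the script above, offered as a sketch rather than a tested tool. It assumes `dvc queue status` prints a table whose second column is the experiment name and whose last column is the status; that layout is an assumption, not a guaranteed interface, and rows without a name are not handled. The helper names `failed_names` and `requeue_failed` are hypothetical. Matching on the last column avoids picking up rows where "Failed" merely appears inside another field, and collecting the names before looping keeps `dvc` from reading the pipe that feeds the loop.

```shell
#!/bin/bash
# Sketch only. Assumes the `dvc queue status` table has the experiment name
# in column 2 and the status in the last column (an assumption about the
# output layout, which may differ between DVC versions).

# Read status output on stdin and print the name of each failed task.
failed_names() {
    awk '$NF == "Failed" { print $2 }'
}

# Hypothetical helper: re-queue every failed experiment. Needs dvc on PATH.
requeue_failed() {
    local names name
    # Collect the names first so the loop does not share a pipe with dvc.
    names=$(dvc queue status | failed_names)
    for name in $names; do
        # Skip experiments that cannot be applied instead of queueing stale state.
        dvc exp apply "$name" || continue
        dvc exp run --queue
    done
}
```

To use it, source the file and call `requeue_failed` from the root of the DVC project. Piping saved status output through `failed_names` first is a cheap way to check which experiments would be retried.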

dberenbaum added the A: experiments (Related to dvc exp), feature request (Requesting a new feature), and p3-nice-to-have (It should be done this or next sprint) labels Jan 23, 2023

dberenbaum mentioned this issue Feb 14, 2023

Queue: wrap up e2e workflow iterative/vscode-dvc#3178

Closed

dberenbaum added the p2-medium (Medium priority, should be done, but less important) label and removed the p3-nice-to-have (It should be done this or next sprint) label Feb 17, 2023
