
ML experiments and hyperparameters tuning #2799

Open

dmpetrov opened this issue Nov 16, 2019 · 38 comments

@dmpetrov commented Nov 16, 2019

Problem

There are a lot of discussions on how to manage ML experiments with DVC. Today's DVC design supports ML experiments through Git-based primitives such as commits and branches. This works nicely for large ML experiments, where writing and testing code is required. However, this model is too heavy for the hyperparameter-tuning stage, when the user makes dozens of small, one-line changes to config or code. Users don't want dozens of Git commits or branches.

Requirements

A lightweight abstraction needs to be created in DVC to support tiny, hyperparameter-style experiments without Git commits. The hyperparameter-tuning stage can be considered a separate user activity outside of the Git workflow, but the result of this activity still needs to be managed by Git, preferably in a single commit.

High-level requirements for the hyperparameter-tuning stage:

  1. Run. Run dozens of experiments without committing any results into Git, while keeping track of all the experiments. Each of the experiments includes a small config or code change (usually 1-2 lines).
  2. Compare. A user should be able to compare two experiments: see diffs for code (and probably metrics).
  3. Visualize. A user should be able to see all the experiments' results: the metrics that were generated. It might be a table with metrics or a graph. A CSV table needs to be supported for custom visualization.
  4. Propagate. Choose "the best" experiment (not necessarily the one with the highest metrics) and propagate it to the workspace (bring over all the config and code changes; important: without retraining). Then it can be committed to Git. This is the final result of the current hyperparameter-tuning stage. After that, the user can continue to work with the project in a regular Git workflow.
  5. Store. Some (or all) of the experiments might still be useful (in addition to "the best" one). A user should be able to commit them to Git as well, preferably in a single commit to keep the Git history clean.
  6. Clean. Experiments that are not useful should be removed, along with all the code and data artifacts that were created. A special subcommand of dvc gc might be needed.
  7. [*] Parallel. In some cases, the experiments can be run in parallel, which aligns with DVC's parallel-execution plans: #2212, #755. This might not be implemented now (in the 1st version of this feature), but it is important that this new lightweight abstraction support parallel execution.
  8. Group. Iterations of hyperparameter tuning might not be related to each other and may need to be managed and visualized separately. Experiments need to be grouped somehow.

What should NOT be covered by this feature?

This feature is NOT about hyperparameter grid search. In most cases, hyperparameter tuning is done manually by users, using "smart" assumptions and hypotheses about the hyperparameter space. Grid search can be implemented on top of this feature/command, using bash for example.

  1. The ability to run the experiments from bash might also be a requirement for this feature request.
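To sanity-check that claim, a grid search of this kind can indeed be sketched as a plain shell loop with one directory per run. Everything below is a hypothetical illustration: the config layout, the directory names, and the commented-out dvc call are assumptions, not existing DVC behaviour.

```shell
#!/bin/sh
# Hypothetical sketch: a manual "grid search" as a shell loop, one
# directory per experiment. Config format and dvc call are illustrative.
mkdir -p exp
for lr in 0.1 0.01 0.001; do
  dir="exp/tune_lr_$lr"
  mkdir -p "$dir"
  # write the single tweaked hyperparameter for this run
  printf 'learning_rate: %s\n' "$lr" > "$dir/config.yaml"
  # a real project would then run something like:
  #   (cd "$dir" && dvc repro train.dvc)
  echo "prepared $dir"
done
```

Each experiment's results would then live next to its config, ready to be compared or discarded as a unit.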

Possible implementations

This is an open question but many data scientists create directories for each of the experiments. In some cases, people create directories for a group of experiments and then experiments inside. We can use some of these ideas/practices to better align with users' experience and intuition.

Actions

This is a high-level feature request (epic). The requirements and an initial design need to be discussed and more feature requests need to be created. @iterative/engineering please share your feedback. Is something missing here?

EDITED:

Related issues

#2379
#2532
#1018 can be relevant (?)
Discussion

@casperdcl commented Nov 16, 2019

I think I almost-but-not-quite understand the aim here. I feel like I'm missing some key concept.

Run dozens of experiments without committing any results into Git while keeping track of all the experiments.

This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.

Each of the experiments includes a small config change or code change (usually, 1-2 lines).

This could be satisfied, for example, by a bash script looping through param choices with @nteract/papermill for notebook users. I think it would be quite hard to try to write a tool to do this in a language/platform-agnostic way. It's hard enough with papermill, which is pretty niche.

To be all-encompassing we'd have to wind up supporting multiple ways of passing in params: env vars, cli args, sed -r 's/<search>/<repl>/g', and (nightmare) language-specific ways.
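For the `sed` flavour specifically, a minimal sketch of substituting a parameter into a config file per run might look like this (file names and keys are made up for illustration):

```shell
# Rewrite one hyperparameter in a YAML-like config for each run,
# leaving the rest of the file untouched. Illustrative names only.
printf 'learning_rate: 0.1\nepochs: 10\n' > config.yaml
for lr in 0.01 0.001; do
  sed -E "s/^learning_rate: .*/learning_rate: $lr/" config.yaml \
    > "config_$lr.yaml"
done
```

This works, but it only covers one of the param-passing styles listed above, which is exactly the fragility being pointed out.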

see diffs for code (and probably metrics)

Again, a papermill-like approach (a bash script spawning multiple notebooks and kernels, each with different params, each outputting a dvc metrics-like file) could do this.

some table with metrics or a graph. CSV table needs to be supported for custom visualization.

Would need to create a formal metrics specification, or at least be very intelligent about automatically interpreting and visualising whatever the end-users throw at us.

Choose "the best" experiment (not necessarily the highest metrics) and propagate it to the workspace

Not sure how "best" can be automated with "not necessarily the highest metrics"

  5. Store. / 6. Clean. / 7. [*] Parallel.

All could be handled by the bash script.

Experiments need to be grouped somehow.

Probably part of any potential formal metrics spec.

This feature is NOT about the hyperparameter grid-search

and

create directories for each of the experiments [...] directories for a group of experiments

Really seems like end-users writing bash/batch scripts would solve this.


Overall I feel like this has two requirements:

  1. implement (or create) a formal metrics spec (which we can then use for visualisations etc)
  2. document/add a tutorial for writing scripts to manage multiple experiments

I'd be against designing (1) from scratch owing to:

Also vaguely related maybe worth considering org-wide project boards (https://github.com/orgs/iterative/projects) for managing epics as well as cross-repo issues (e.g. iterative/dvc.org#765 and iterative/example-versioning#5)

@dmpetrov commented Nov 17, 2019

@casperdcl good questions but let's start with the major one:

Run dozens of experiments without committing any results into Git while keeping track of all the experiments.

This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.

Let's imagine you are jumping into the hyperparameter-tuning stage. You need to run a few experiments. You don't know in advance how many experiments are needed. Usually it takes 10-20, but it might easily take 50-100.

Questions:

  1. What abstraction would you choose? Commits to master? A new branch and commits in the branch? Is it okay for you to have 50 commits in a row?
  2. You end up having 50 commits. How do you get all the results and compare them to find the best one?
  3. If Git abstractions work and new standards are not needed, why does a big portion of data scientists (including ex-developers) not use this, preferring to create 50 dirs instead of 50 commits?

@casperdcl commented Nov 17, 2019

Run dozens of experiments without committing any results into Git while keeping track of all the experiments.

This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.

Ah I think we were both not using accurate language :) You do indeed want to commit results in some form (metrics for each experiment/summary of metrics/metadata to allow easy reproduction of experiments - which could just be the looping script). You don't necessarily want to commit runs (saved models, generated tweaked source code).

And when I said commit separately I should've just said commit. (Separately implies multiple commits, which isn't necessary unless you want to save each model and its outputs... which may actually still be useful. 1. Run multiple experiments 2. Save each in a separate branch commit 3. Collate metrics and use them to delete most branches. No clear advantage of this over multiple dirs. Maybe if you want to save the 2 best models on two different branches which will then fork?)

I think the rest of my comment dealt with the multi-dir, single-commit approach anyway (which as I understand is what you also intended).

@dmpetrov commented Nov 17, 2019

Yeah :) Sorry, I put the description in a very abstract form so as not to push toward any particular solution. This abstract form leaves a lot of room for different interpretations, which is probably the root cause of the misunderstanding. To be clear, I don't see any other solutions besides dirs yet, but it would be great if we could consider other options.

I definitely want to give users the ability to commit the results (both metrics and runs) but not necessarily all the results (it's up to the user).

I think the rest of my comment dealt with the multi-dir, single-commit approach anyway (which as I understand is what you also intended).

👍

@Suor commented Nov 18, 2019

Preferably in a single commit to keep the Git history clean.

Doesn't sound clean to me. It would be a very messy commit, and if the experiments involve code changes it would be way easier to have a commit for each; this way you can git checkout it.

Additionally, if we have a git commit for each experiment we want to save then it would be very easy to save associated artifacts too.

Not useful experiments should be removed with all the code and data artifacts that were created

We might simply dvc run --no-commit, and then there would be no need to gc anything in the end.

Parallel. ... which aligns with DVC parallel execution plans:

Not necessarily. If we make a dir copy for each experiment, then that would be a different dvc repo, and we won't need any parallel processing for a single repo.

@pared commented Nov 18, 2019

What worries me the most is the weight of the project. If we decide to go with the dir approach, we need to either make a repo copy for each experiment or somehow link/use dvc-controlled artifacts from the original repo. I think a copy is fine for the first version, but later we need to come up with something that does not duplicate our artifacts. That would probably align with the parallel-execution plans too.

@casperdcl commented Nov 18, 2019

About the whole dir copy thing... The bash loop + papermill workflow I gave as an example (granted only works for python notebooks) would create one dir per test, and said dir would only contain a notebook with one different parameter cell, as well as potentially some outputs. All notebooks would use (i.e. import) the same code and data from the root directory. And all you'd need to commit is the bash loop script & the metrics files from the output directories in order to reproduce/track what happened. May need random.seed(1337) or similar to reproduce identically but you get the idea.

My main concern is this all seems very language-, code layout-, and OS-specific and best left to the user to figure out. I think it would be helpful if we gave a concrete example of how dvc could assist in a workflow (e.g. this dummy C++ program training on this MNIST data on linux with a bash script subbing in (or passing in via CLI params) these 10 different params for 10 output dirs, running 2 jobs at a time, outputting metrics.csv, etc...)

I feel like trying to create an app to automate this process in generic scenarios is a bit like trying to create an app to help people use a computer. Sounds more like a manual/course than a product.

@dmpetrov commented Nov 19, 2019

Preferably in a single commit to keep the Git history clean.

Doesn't sound like clean to me. It would be a very messy commit and if that experiments involve code changes it would be way easier to have a commit for each, this way you can git checkout it.

Additionally, if we have a git commit for each experiment we want to save then it would be very easy to save associated artifacts too.

@Suor you are right, but ideally, it should be a user's choice - some folks are very against 50 commits and it would be great to provide some options to avoid this (if we can :) ).

In the dir-per-experiment paradigm, all the experiments might be easily saved in a single commit with all the artifacts (changed files and outputs) since they are separated. What do you think about this approach?

ADDED:

We might simply dvc run --no-commit, and no need to gc anything in the end.

Yep. An additional, experiment-specific option might be helpful, like dvc repro --exp tune_lr

Parallel. ... which aligns with DVC parallel execution plans:

Not necessarily, if we make a dir copy for each experiment, than that would be a different dvc repo, and we won't need any parallel processing for single repo.

First, it looks like we have slightly different opinions regarding implementation. I assume that we copy all the artifacts into an experiment dir, which gives us the ability to commit experiments (one by one or in bulk). You assume that we clone the repo into a dir. We can discuss the pros and cons of these methods. I won't be surprised if we find more options.

Thus, it depends on implementation. If it is a separate repo as a dir, then we cannot commit it in the main repo. In this case, you are right above - separate commits will be required.

If we run in a separate dir with no cloning (just copying and instantiating data artifacts), then parallel-run support might be required.

@dmpetrov commented Nov 19, 2019

we need to either make a repo copy for each experiment or somehow link/use dvc controlled artifacts

@pared you are right. I don't think we can afford to make a copy of data artifacts. So, there is only one option - the most complicated one, unfortunately.

@dmpetrov commented Nov 19, 2019

My main concern is this all seems very language-, code layout-, and OS-specific and best left to the user to figure out.

Exactly. A notebook is kind of a specific language. I'd suggest building a language-agnostic version first, based on config-file or code-file changes: copy all the code into a dir, instantiate all the data files, and run the experiment. Later we can introduce something more language/notebook specific.

I think it would be helpful if we gave a concrete example of how dvc could assist in a workflow

Totally! We definitely need an example. This issue was created to initiate the discussion and collect the initial set of requirements. But the development process of MVP should be example-driven.

I feel like trying to create an app to automate this process in generic scenarios is a bit like trying to create an app to help people use a computer. Sounds more like a manual/course than a product.

I see this as an attempt to help users use one of the "best practices" - save all the experiments (in dirs :) ) and compare the results.

@jorgeorpinel commented Nov 19, 2019

What about trying to automatically generate a Git submodule for experiments? 1. Somehow mark code or data files as "under experimentation". 2. Watch those files and make a commit every time they're written (similar to IPython Notebook checkpoints) 3. Tell DVC to stop watching this experiment.

And do we have any ideas on what the interface would look like? Another command, a separate tool, a UI?

a big portion of data scientists (including ex-developers)... prefer to create 50 dirs instead of 50 commits

If this is the case, perhaps a file-linking system or UI that shows the user a growing set of virtual dirs simultaneously, one per experiment - based either on the single-commit, multiple-dir strategy, the Git submodule, or something else.

@pared commented Nov 19, 2019

@dmpetrov Do you think we could restrict the experiments feature (at least in the beginning) to systems where linking is possible? That would eliminate the risk of experiments eating up disk space. Also, in that case the implementation does not seem too hard. We would just need to create a repo with a default *-link cache type and point the cache to the master project's cache.
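Whether the lightweight-link route is available can be probed cheaply. A rough sketch of such a check (GNU `cp` assumed for the reflink part; the probe directory name is arbitrary):

```shell
# Probe which lightweight link types the current filesystem supports,
# roughly what a link-restricted first version would need to detect.
mkdir -p linkprobe
echo data > linkprobe/original
# hardlinks work on virtually every local filesystem
ln linkprobe/original linkprobe/hard 2>/dev/null \
  && echo "hardlink: ok" || echo "hardlink: unavailable"
# reflinks need copy-on-write support (btrfs, XFS, APFS); GNU cp only
cp --reflink=always linkprobe/original linkprobe/ref 2>/dev/null \
  && echo "reflink: ok" || echo "reflink: unavailable"
```

If neither link type is available, the feature would have to fall back to full copies, with the disk-space risk described above.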

@dmpetrov commented Nov 20, 2019

What about trying to automatically generate a Git submodule for experiments?

@jorgeorpinel Maybe I didn't get the idea, but a Git submodule means a separate repo. So we end up having 50 Git repositories instead of 50 commits. It looks like an even heavier approach than what we currently have.

And do we have any ideas on what the interface would look like? Another command, a separate tool, a UI?

Initially, the command-line one. I see that as part of repro. Like vi config.yaml && dvc repro --exp tune_lr - it will create a dir with the changed files and new outputs.
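To make that hypothetical `--exp` flow concrete, here is one possible reading of it, sketched with plain git and coreutils: snapshot only the files changed since the last commit into a per-experiment directory. All names are illustrative; this is not an existing DVC command.

```shell
# Simulate: edit one line, then snapshot only the changed files into a
# per-experiment directory, leaving the workspace as-is. Sketch only.
mkdir -p proj
(cd proj && git init -q \
  && printf 'learning_rate: 0.1\n' > config.yaml \
  && echo 'print("train")' > train.py \
  && git add . \
  && git -c user.email=dev@example.com -c user.name=dev commit -qm base)
# the user's one-line edit
sed -i 's/0\.1/0.01/' proj/config.yaml
# "dvc repro --exp tune_lr" could then copy just the diff
mkdir -p proj/exp/tune_lr
(cd proj && git diff --name-only) | while read -r f; do
  cp "proj/$f" "proj/exp/tune_lr/$f"
done
```

Only `config.yaml` lands in the experiment dir here; unchanged code would be linked or resolved from the main workspace in a real implementation.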

@jorgeorpinel commented Nov 20, 2019

No, just one submodule with a single copy of the source code, and 50 commits in it. Although now that I think about it, it's similar to just making a branch, and the latter is probably easier...

@dmpetrov commented Nov 20, 2019

@pared No restriction is needed. We should use the link type specified by the user. My point is: we cannot create a copy if the user prefers reflinks. Also, I don't think we need to create any repo. Experiments should work in an existing repo.

@pared commented Nov 20, 2019

@dmpetrov I agree that experiments should work in an existing repo. What I had in mind by "creating the new repo" was that I imagined we would store each experiment as a "copy" of the current repo in some special directory, like .dvc/exp/tune_lr_v1 and so on. Are we on the same page here? Or do you imagine it differently?

@dmpetrov commented Nov 21, 2019

@pared yes, it is very likely we will need to store a copy of the current repo. It might be directly in the project root dir, e.g. tune_lr_v1/.

@alexvoronov commented Nov 21, 2019

By the way, I have not seen anyone mention MLflow. I haven't tried it myself yet, but the description promises to manage the ML lifecycle, including experimentation and reproducibility. How did they solve this issue? Any chance to just integrate/build on top of that or some other similar tool? Or an API for integrating third-party ML lifecycle tools?

@Suor commented Nov 21, 2019

In the dir-per-experiment paradigm, all the experiments might be easily saved in a single commit with all the artifacts (changed files and outputs) since they are separated. What do you think about this approach?

I thought of those dirs as copies of a git/dvc repo. So if you commit its state, probably to a branch, as a separate commit, you might access all the artifacts easily. It will work with gc seamlessly, and so on. A copied dvc repo also retains all the functionality: you may cd into it and explore it. You can diff it against the original with any dir-diff tool, like meld. These copies are supposed to share the cache and use some lightweight links.

Do you suggest a copy of everything in a subdir, but still being the same git/dvc repo? And then committing the whole structure. Not sure how this will work, but I haven't thought about that much.

And yes, if that is the same repo, you most probably need parallelized runs.

The thing is with subdirs in a single repo, we can't refer to different versions of an artifact by changing rev, we will also need to change path. And those paths won't be consistent between revs. This might be an issue or not.

Also, how do you mainline some experiment then? Do we need some specific dvc command for that?

If we run in a separate dir with no cloning (just copying and instantiating data artifacts), then parallel-run support might be required.

Checking out artifacts is an issue both implementations have. We can simply check out artifacts for a new copy if we use fast links. But if we use copies, we might want to make some lightweight links to the already-checked-out copies in the original dir. This could be ignored, or at least wait for a while, though.

We have @slow_link_guard to at least keep people informed about that.

What about trying to automatically generate a Git submodule for experiments?

I don't see any advantage of a git submodule over a simple clone. Why should we complicate this?

Initially, the command-line one. I see that as part of repro. Like vi config.yaml && dvc repro --exp tune_lr - will create a dir with changed files and new outputs.

I see that a basic building block is creating a dir copy (a clone or just a copy) and checking out artifacts there. Maybe cd there. Then a user may do whatever they want inside:

dvc exp tune_some_thing      # creates the dir and cds into it
dvc repro some_stage.dvc
cd ../..

# later
cd exp/tune_some_thing
vim ...
dvc repro some_stage.dvc

Or maybe it's ok to bundle it from the start, like @dmitry envisions. Not sure whether --exp under repro or a separate command is better:

dvc experiment <experiment-name> some_stage.dvc
# or
dvc exp <experiment-name> some_stage.dvc
# or even
dvc try <experiment-name> some_stage.dvc  

We will need commands to manage all these, probably. If these are just dirs, then we can commit everything as-is, which is a plus. But we will still need something to diff, to compare metrics, and to mainline an experiment.

Since these are just dirs (and clones are mostly dirs too) we get some of these for free, which I like a lot:

meld . exp/tune_some_thing  # compare dirs
rm -rf exp/tune_some_thing  # discard experiment
cp -r exp/tune_some_thing . # mainline, not sure this one is correct

@casperdcl commented Nov 21, 2019

Regarding the MLflow idea - it looks like an augmented conda env.yml file which supports tracking input CLI params, and you need to use their Python API for logging outputs/results/metrics.

They do have a nice web UI for visualising said logs, though.

@jorgeorpinel commented Nov 21, 2019

I don't see any advantage of a git submodule over a simple clone.

The thing is that if you clone a Git repo inside a Git repo and add it, Git just ignores the inner repo's contents. I think it stages a dummy empty file with the name of the embedded repo's dir. So we may be forced to use submodules, depending on the specific needs. Here's Git's output when you clone a repo inside a repo and stage it:

warning: adding embedded git repository: {INNER_REPO}
hint: You've added another git repository inside your current repository.
hint: Clones of the outer repository will not contain the contents of
hint: the embedded repository and will not know how to obtain it.
hint: If you meant to add a submodule, use:
hint: 
hint: 	git submodule add <url> {INNER_REPO}
hint: 
hint: If you added this path by mistake, you can remove it from the
hint: index with:
hint: 
hint: 	git rm --cached {INNER_REPO}
hint: 
hint: See "git help submodule" for more information.
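The behaviour described above can be reproduced with a few commands: staging an inner repo records only a "gitlink" (mode 160000) index entry, not the inner files. Directory and file names here are arbitrary.

```shell
# Create an outer repo with a committed inner repo, stage the inner repo,
# and inspect what actually lands in the outer index.
mkdir -p outer/inner
(cd outer && git init -q)
(cd outer/inner && git init -q && echo x > file.txt && git add file.txt \
  && git -c user.email=dev@example.com -c user.name=dev commit -qm init)
# "git add inner" prints the embedded-repo warning quoted above (stderr)
# but still succeeds, staging a single gitlink entry.
(cd outer && git add inner 2>embed_warning.txt \
  && git ls-files -s > staged.txt)
```

The outer index ends up with one `160000`-mode entry for `inner`, and `inner/file.txt` is not tracked, which is exactly why a plain clone-in-a-clone loses the experiment contents.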

@dmpetrov commented Nov 22, 2019

I thought of those dirs as copies of a git/dvc repo.

@Suor it looks like we are on the same page with that.

Do you suggest a copy of everything in a subdir, but still being the same git/dvc repo? And then committing the whole structure. Not sure how this will work, but I haven't thought about that much.

And, yes if that is the same repo you most probably need parallelized runs.

Right. Yes, I think we should consider this subdir-in-the-same-repo option. This allows a user to commit many subdirs in a single commit or just remove subdirs using a regular rm -rf tune_lr_v1/.

The thing is with subdirs in a single repo, we can't refer to different versions of an artifact by changing rev, we will also need to change path. And those paths won't be consistent between revs. This might be an issue or not.

If you copy the whole structure and change the paths in dvc-files, it should not be an issue, except in cases where an absolute path was used, like /Users/dmitry/src/myproj/file.txt. I don't think we should care about this case.

Also, how do you mainline some experiment then? Do we need som specific dvc command for that?

🤷‍♂️ The option I like the most so far: dvc repro --exp tune_lr_v1 . A separate command is fine: dvc exp tune_lr_v1

We can simply checkout artifacts for a new copy if we use fast links. But if we use copy we might want to make some lightweight links to already checked out copies in the original dir.

I don't think we need to invent something new here. We should use the same data-file linking strategy as specified in the repo. From the file-management point of view, the experiment subdirs play the same role as branches and commits and should use the same strategy.

Or maybe it's ok to bundle it from the start, like @dmitry envisions. Not sure --exp under repro or a separate command is better:

Yeah. I'd prefer to create and execute an experiment as a single, simple command. No matter if it is repro or a dedicated one.

Since these are just dirs (and clones are mostly dirs too) we get some of these for free, which I like a lot:

meld . exp/tune_some_thing  # compare dirs
rm -rf exp/tune_some_thing  # discard experiment
cp -r exp/tune_some_thing . # mainline, not sure this one is correct

Exactly! We will get a lot of stuff for free. More than that - it should align well with data scientists' intuition of creating dirs for experiments.

The last command (cp -r exp/...) won't work, unfortunately - we might need a new command, dvc exp propagate exp/tune_some_thing (to the current dir by default).

@dmpetrov commented Nov 22, 2019

Any chance to just integrate/build on top of that or some other similar tool? Or an API for integrating third-party ML lifecycle tools?

@alexvoronov the integration itself is a good idea. Unfortunately, DVC experiments cannot be built on top of MLflow because MLflow has a different purpose and focuses on metrics visualization. But the visualization part can be nicely implemented on top of existing solutions. There are a few more MLflow analogs: Weights & Biases, Comet.ml, and others. It would be great to create a unified integration with these tools.

@casperdcl brought a good point about conda env.yml. It might be another integration.

We should definitely keep the UI and visualization in mind but I would not start with that.

@pared commented Nov 28, 2019

Yeah. I'd prefer to create and execute an experiment as a single, simple command. No matter if it is repro or a dedicated one.

@dmpetrov how would this work? What I have in mind is:

  • doing some changes in my repo, not necessarily committing them
  • running dvc repro --exp tune_lr train_model.dvc
  • dvc takes care of creating an experiment directory, moving all the stuff there, and also running it

What I don't like about incorporating experiments into repro is that it assumes we want to run the experiment. Will that always be the case?

What if I want to prepare a few experiment "drafts" by editing my repo, and then, at the end of the day, just dvc experiment run tune_lr tune_lr_v1 tune_lr_v2, go home and get back to finished tasks?
I think the experiment should be a separate command that has three main steps:

  • create a directory with an experiment
  • run the experiment(s)
  • choose experiment(s) which you would like to preserve

The first two could be joined into one with some flag, like dvc experiment --run tune_lr


I want to get back to creating the experiment directory:

@pared yes, it is very likely we will need to store a copy of the current repo. It might be directly in project root dir tune_lr_v1/.

I think it should not be in the project root dir:

  • In the case of several dozen experiments, the root dir will look terrible
  • Removing unwanted experiments will be hell if someone uses "creative" naming

If experiments live in a dedicated directory (.experiments, .dvc/exp, or whatever):

  • easy to git- and dvc-ignore
  • finished with experimenting? No problem, just rm -rf .experiments
@dashohoxha commented Nov 29, 2019

@pared first of all let me make the disclaimer that I have not followed this discussion very carefully and I am not sure that I understand all the ideas presented here. So, it is quite possible that I don't know what I am talking about.

What if I want to prepare a few experiment "drafts" by editing my repo, and then, at the end of the day, just dvc experiment run tune_lr tune_lr_v1 tune_lr_v2, go home and get back to finished tasks?
I think the experiment should be a separate command that has three main steps:

  • create a directory with an experiment
  • run the experiment(s)
  • choose experiment(s) which you would like to preserve

Using a command like dvc experiment ... seems interesting to me.
@pared is it possible to show with a simple bash script or with a simple example what the command dvc experiment create ... is supposed to do? Or is it possible to explain how we could do this manually without using dvc experiment?

If experiments will be in dedicated directory (.experiments, .dvc/exp or whatever)

If these experiments are going to be managed transparently (meaning that the users only use dvc experiment ... to manage them, don't touch them manually), then it seems a good idea to use something like .dvc/experiments/.

@pared commented Nov 29, 2019

@dashohoxha
I will try to explain first; if that is not enough, I can try to prepare some draft:

  1. The user makes some changes inside the repo to adjust the repo state to their experiment (e.g. change fully-connected model code to a CNN in some image-recognition project, change the number of layers, change the learning-rate adjustment algorithm). These will probably not be committed to the current branch, but that's up to further discussion.

  2. The user runs dvc experiment create {ename}; dvc copies the current repo state to .dvc/experiment/{ename} and links artifacts properly

  3. The user can run the experiment with another command.

  4. There is a set of commands allowing to manage experiments (choose "the winner" and move it to the current repo, choose a few "winners" and [for example] make a branch from each one).

So, in a few words, experiment create would be an advanced cp . .dvc/experiment/{ename}.
What do you think about that?

@dashohoxha dashohoxha commented Nov 29, 2019

So, in a few words, experiment create would be an advanced cp . .dvc/experiment/{ename}.

So, basically you want to clone all the data and DVC-files to an experiment directory, which can use the same cache (.dvc/cache/) as the main project. With a deduplicating/reflink filesystem this should work.
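With GNU cp, for example, such a cheap clone can be sketched with --reflink=auto, which uses copy-on-write where the filesystem (e.g. Btrfs, XFS) supports it and silently falls back to a plain copy otherwise:

```shell
# Reflink-aware copy helper: data blocks are shared until modified, so
# cloning a multi-GB workspace costs almost nothing on a CoW filesystem.
clone_tree() {
  cp -R --reflink=auto "$1" "$2"
}
```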

It is not clear whether you modify the pipeline (or the parameters) of an experiment before you create it or after you create it, and how you are going to do it.

[By the way, rsync might be a better option than cp in this case, but this is not relevant to the discussion.]

@pared pared commented Nov 29, 2019

It is not clear whether you modify the pipeline (or the parameters) of an experiment before you create it or after you create it, and how you are going to do it.

I believe the user should be able to fix something after it is created, but I think the main use case should focus on modifying before experiment creation. Otherwise, we would just be copying and making the user manually enter a particular dir, which does not sound too welcoming.
I think a flow like edit -> create experiment -> git reset -> edit ... would be better, especially in use cases where one develops in an IDE or a notebook.

@dashohoxha dashohoxha commented Nov 29, 2019

I think the main use case should focus on modifying before experiment creation

But how would you track these changes (so that you can reproduce the experiment, if needed)? By committing them to Git?

@pared pared commented Nov 30, 2019

I think we don't want to commit them in any way until the user decides to. They are stored in their own experiment repos, and when the user is satisfied with their performance, they just run dvc experiment pick {ename}. Only then would it be committed to the main repo.

@dashohoxha dashohoxha commented Dec 1, 2019

I think we don't want to commit them in any way until the user decides to. They are stored in their own experiment repos, and when the user is satisfied with their performance, they just run dvc experiment pick {ename}. Only then would it be committed to the main repo.

Seems reasonable to me.

However, instead of dvc experiment create {ename} I would prefer dvc clone {src_exp} {dst_exp}, where {src_exp} is the directory of the source experiment and {dst_exp} is the directory of the destination. This would also cover the case where we want to base an experiment on another one (modify an existing experiment to create another one).

It would basically be like rsync, but some paths may need to be fixed in the DVC-files, configuration files, etc.

Deleting an experiment may be as easy as rm -rf {exp_dir}.

@dmpetrov dmpetrov commented Dec 2, 2019

  • doing some changes in my repo, not necessarily committing them
  • running dvc repro --exp tune_lr train_model.dvc
  • dvc takes care of creating an experiment directory and moving all the stuff there, also running

Yes. That's what I have in my mind.

What I don't like about incorporating experiments into repro is that it assumes that we want to run the experiment. Will it always be the case?

Sure. It might be a different command if we have a good reason.

What if I want to prepare a few experiment "drafts" by editing my repo, and then, at the end of the day, just dvc experiment run tune_lr tune_lr_v1 tune_lr_v2, go home and get back to finished tasks?

It seems like a different scenario - auto hyperparameter tuning or a grid search with a custom grid. We should keep these scenarios in mind, but the primary use case for this issue is:

  1. make a small change
  2. push a button
  3. get a result in a dir
  4. propagate the result to master if needed.

And you are absolutely right! It might be beneficial to give an option of separating experiment creation from the experiment run. It might be dvc exp --dry-run or something similar. This should help us in future scenarios.

@pared yes, it is very likely we will need to store a copy of the current repo. It might be directly in project root dir tune_lr_v1/.

I think it should not be in the project root dir.

  • In the case of several dozens of experiments, the root dir will look terrible
  • removing unwanted experiments will be hell if someone uses "creative" naming

Right. Users should be able to "hide" them in some dir, like exp/tune_lr_v1.
I probably didn't communicate that properly. What I meant is: we should not prevent users from using custom directories (even the project root) and shouldn't restrict them to a single dir like .dvc/exp/ without a good reason.

If experiments will be in dedicated directory (.experiments, .dvc/exp or whatever):

  • easy to .git and .dvc ignore

Should the experiments be ignored? I'd expect to have at least some of them in my Git history.

  • finished with experimenting? no problem just rm -rf .experiments

The same applies if you store experiments in the project root.

@Suor Suor commented Dec 12, 2019

If we don't say where people should put their experiments, then experiments need to be referenced by their path for all uses:

  • listing experiments with some metric
  • reintegrating an experiment

If we don't put them in a standard location like <repo-root>/experiments/<exp-name>, then people might do:

cd some-topic
dvc repro thing.dvc --exp try_this
# ... hack ...
dvc repro thing.dvc --exp exp/t1
# ... hack ...
dvc repro thing.dvc --exp exp/t2
# ... hack hack ...
dvc repro thing.dvc --exp exp/new-approach

# list experiments
ls exp try_this # kind of manual
# list metrics, some repetition
dvc metric show exp/*/some-topic/metric.json try_this/some-topic/metric.json
# list diffs
... ???
# reintegrate
dvc integrate exp/new-approach  # better command name?

Note that we need to use the full path to metric.json, which might be surprising to the user. We can't copy only part of the repo because something outside some-topic might be referenced.

Overall, though, an exp dir referenced as a dir works OK.

@elgehelge elgehelge commented Feb 21, 2020

Interesting. I kind of wanna help out, but the current way DVC works is just too far from my own mental model, so my ideas would probably require very serious changes that most contributors are probably not willing to buy into. I will try to describe it anyways, and then you can take it or leave it :)

As I see it there are two underlying questions: one is about parameterised pipelines, and the other is about doing fast iterations locally. Let's take them in that order:

Parameterised pipelines: At the very core, I think DVC fails at distinguishing between "raw data" and "produced data". A raw dataset is a constant, immutable thing. You might want to change it at some point, and this is why you need data versioning in the first place. This aligns with the "Data is immutable" principle here https://drivendata.github.io/cookiecutter-data-science/#data-is-immutable and the definition of constants here https://www.brandonsmith.ninja/blog/three-types-of-data. However, a dataset or a model or any other artifact that was produced from raw data, code, and parameters is just a deterministic product (in a broad understanding of "deterministic"). Produced data should not be manually committed, since it is a direct consequence of other stuff that is already committed. Instead, it should just be computed and cached using hashes. If you can switch to this way of viewing an experiment, then the solution to the problem is straightforward: just add parameters to the dependency graph.
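As a toy illustration of this "deterministic product" view (the cache layout and the produce_model helper are invented for this sketch, not DVC's actual scheme): hash the inputs, and rerun the expensive step only on a cache miss.

```shell
# Toy content-addressed build: the artifact is a pure function of
# (raw data, code, params), so a hash over those inputs keys the cache
# and the produced file never needs to be hand-committed.
produce_model() {
  key=$(cat raw.csv train.sh params.txt | cksum | cut -d' ' -f1)
  cached="cache/$key"
  if [ ! -f "$cached" ]; then
    sh train.sh > "$cached"   # expensive step: runs only on a cache miss
  fi
  cp "$cached" model.bin      # materialize the produced artifact
}
```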

Local development: Everything that dmpetrov describes is what I would call "local development", so this should also not be tracked. We should still be able to benefit from the cached upstream "produced data", though. At some point, when the developer is ready for an actual code change (to be committed, shared and reviewed by peers), he/she commits these changes. To a branch. Branches are the perfect fit for experiments from my point of view. If the experiment is successful it should be merged back into master at some point.

@casperdcl casperdcl commented Feb 21, 2020

@elgehelge though you should bear in mind that you may indeed want to commit produced data if you view them as useful artefacts that are time-consuming to "deterministically" re-generate (e.g. a trained machine learning model, a binary package, a picture to display on a webpage). I think DVC does indeed distinguish between "produced data" (artefacts, outputs) and "raw data" (immutable, inputs) if you look at pipelines (or e.g. the arguments of dvc run, which distinguish between outs and deps).
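That deps/outs split shows up in the single-stage .dvc files of that era; a minimal sketch (paths and command are illustrative; checksum fields that DVC records are omitted):

```yaml
# train_model.dvc (sketch): deps are the inputs, outs the produced,
# cached artefacts that dvc repro regenerates when a dep changes.
cmd: python train.py
deps:
- path: data/raw.csv
- path: train.py
outs:
- path: model.pkl
  cache: true
```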

@skylogic004 skylogic004 commented Feb 21, 2020

I'm looking for ways to tune hyperparameters too. I have a pipeline that first does some pre-processing then trains a model at the end. One of the pre-processing steps requires a threshold to be set, and so I want to be able to try many different values to see how the final trained model is affected. There are thousands of thresholds I may want to try.

A parameterized pipeline (@elgehelge) sounds like exactly what I'm looking for. I envision something like this: dvc repro train_model.dvc -my_threshold 123. But I'm not sure what I'd expect DVC to do with the file paths, since they are not dynamic.

One hack I'm considering, in the meantime, is to put my parameter values (thresholds) into a text file and add it as a dependency (to the stage that uses it). Then I would do a repro, change the value in the text file, and repeat. As for the output files, I would copy the files I want to keep into a unique directory (not under DVC control) before running repro again. The procedure would look something like this (and of course I'd write a script to automate it):

  • set parameter value in my_threshold.txt (e.g. threshold = 10)
  • run dvc repro train_model.dvc
  • (DVC would detect that one of the pre-processing stage's dependencies has changed and would rerun it along with the stages that follow; it would know not to rerun any earlier stages that don't depend on the parameter)
  • run mv output/trained_model_and_metrics_etc.zip tuning_experiment/0001/
  • set new parameter value in my_threshold.txt (e.g. threshold = 20)
  • run dvc repro train_model.dvc
  • run mv output/trained_model_and_metrics_etc.zip tuning_experiment/0002/
  • and repeat...
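The loop above can be scripted; a sketch with the repro command passed in as a parameter (file names are taken from the comment; the run_sweep helper is invented):

```shell
# Sweep a list of thresholds: write each one into the dependency file,
# rerun the pipeline, and file the outputs under a numbered directory.
run_sweep() {
  repro="$1"; shift            # e.g. "dvc repro train_model.dvc"
  i=0
  for t in "$@"; do
    i=$((i + 1))
    echo "$t" > my_threshold.txt
    $repro                     # unquoted on purpose: splits into a command
    dest=$(printf 'tuning_experiment/%04d' "$i")
    mkdir -p "$dest"
    mv output/trained_model_and_metrics_etc.zip "$dest/"
  done
}
```

Usage would be something like run_sweep "dvc repro train_model.dvc" 10 20 30, leaving one numbered directory per threshold to plot from afterwards.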

Once done, I'd write a script to grab all the files in tuning_experiment and plot the thresholds vs performance metrics.

While I think this solution will work for me, it feels wrong because I'm throwing away the ability for DVC to track the models and metrics. I'm essentially using DVC as a caching mechanism and nothing more. Lastly, at the end of the day, the final output I'm most interested in is the plot of thresholds vs performance metrics, not any of the pipeline's outputs themselves.

(p.s. I just started using DVC recently, and I really like it so far. I hope my use case will help with the discussion and move this feature forward!)

@dmpetrov dmpetrov commented Feb 22, 2020

@elgehelge thank you for sharing your mental model! I really appreciate it. Your thoughts correlate with ours much more than you might think. You just need to accept that not everything is implemented yet, and in some directions we still have open questions.

Parameterised pipelines: We understand the pipeline parametrization problem #1462. We even had discussions about extracting the results of runs into a separate run/build cache #1234 (note, the description contains "deterministic" as you suggested). Unfortunately, we haven't had time to move in this direction yet. In the last year, we were busy making DVC stable, optimizing the existing "dataset management" scenarios, and adding other "dataset" scenarios like dvc import/get. Today, pipeline parametrization can be done through some workarounds: a config file as a dependency (see @skylogic004's comment) or making a pipeline more lightweight by extracting the dataset (and then dvc import-ing it) and modifying the lightweight pipeline. As far as I understand you (and @skylogic004 from GH), you suggest a simplified version of #1462 - that looks like a very good option - I'll create an issue.

Local development: You are absolutely correct that this issue is mostly about the local experimentation experience. It aims to simplify experimentation and solve the "1000-branches" problem in the hyperparameters tuning case. However, we de-prioritized this issue in favor of the CI/CD direction we are working on right now (this project is outside of core DVC and not open yet, but you might find some related issues in core-dvc: #2995, #2998, #2994). In CI/CD these experimentation problems (including the 1000-branches problem) could be solved at a higher abstraction level without introducing additional concepts to DVC - pretty much as you described.

To me, it looks like we don't have any disagreements in the DVC roadmap. We just need to implement all the stuff we have in our plans - "The film is ready. It remains only to shoot it" (Rene Clair) 😄

@dmpetrov dmpetrov commented Feb 22, 2020

@skylogic004 thank you for your reply. I responded to @elgehelge above re most of your questions, but I'd like to mention that...

While I think this solution will work for me, it feels wrong because I'm throwing away the ability for DVC to track the models and metrics. I'm essentially using DVC as a caching mechanism and nothing more.

... you are right - DVC support might be very helpful at that stage. And we should come up with a solution.

Lastly, at the end of the day, the final output I'm most interested in is the plot of thresholds vs performance metrics, not any of the pipeline's outputs themselves.

This is a very good insight. Thank you for pointing it out!
