User Guide / Manage Experiments #828

Closed
wants to merge 11 commits into from
23 changes: 23 additions & 0 deletions src/Documentation/sidebar.json
Original file line number Diff line number Diff line change
@@ -120,6 +120,29 @@
"label": "Managing External Data",
"slug": "managing-external-data"
},
{
"label": "Managing Experiments",
"slug": "experiments",
"source": "experiments/index.md",
"children": [
{
"label": "Tags",
"slug": "tags"
},
{
"label": "Branches",
"slug": "branches"
},
{
"label": "Directories",
"slug": "dirs"
},
{
"label": "Mixed",
"slug": "mixed"
}
]
},
{
"label": "Contributing",
"slug": "contributing",
119 changes: 119 additions & 0 deletions static/docs/user-guide/experiments/branches.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
# How to Manage Experiments by Branches

You can use a different Git branch for each experiment.

<p align="center">
<img src="/static/img/user-guide/experiments/branches.png" />
</p>

This is usually more flexible than managing experiments by tags, since you can
easily base a new experiment on any of the previous experiments.

## Examples

An example of managing experiments by branches can be seen in the
[Deep Dive Tutorial](https://dvc.org/doc/tutorials/deep/reproducibility).

These interactive tutorials also manage experiments by branches:

- [Pipelines](https://katacoda.com/dvc/courses/tutorials/pipelines) - Using DVC
commands to build a simple ML pipeline.
- [MNIST](https://katacoda.com/dvc/courses/tutorials/mnist) - Classify images of
hand-written digits using the MNIST dataset.

## How it works

### Commit and branch

Let's say that we are working on the branch `master` and at the end of the
experiment we want to save it on a branch named `unigrams`. We can do it like
this:

```dvc
$ git commit -am 'Evaluate'
$ dvc commit    # just to make sure all the data is committed
$ git checkout -b unigrams
$ git checkout master
```

Review comments on `$ git checkout -b unigrams`:

- Reviewer: why not `git branch unigrams`?
- @jorgeorpinel (Contributor, Dec 17, 2019): Actually, git branch is pretty new
  and we don't use it in the docs. But we do use git checkout a lot. I would
  open a separate terminology-focused issue to decide whether to change all
  instances or not.

This comment was marked as resolved.

Now we can continue working on `master` for another experiment. When we are
done, we can create another branch for it in the same way.
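
One reviewer asked why not use `git branch unigrams` here. The following is a
minimal sketch of that alternative, run in a throwaway repository (the temporary
directory, the `demo` identity and the file `eval.txt` are assumptions made for
illustration). `git branch <name>` creates the branch at the current commit
without switching to it, so no `git checkout master` is needed afterwards. In a
real DVC project, `dvc commit` would still come first, as above.

```shell
# Throwaway repository, only for illustration.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master

echo "AUC: 0.624652" > eval.txt
git add eval.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'Evaluate'

# Save the experiment: create the branch at the current commit, stay on master.
git branch unigrams

current=$(git rev-parse --abbrev-ref HEAD)
echo "current branch: $current"
```

Both branches point at the same commit at this stage; they diverge as soon as
the next experiment is committed on `master`.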

### New experiment based on another one

Suppose that we want to start a new experiment based on another one, instead of
starting from `master`. We can first switch to that branch and then start a new
experiment on top of it:

```dvc
$ git checkout unigrams
$ dvc checkout
$ git checkout -b bigrams
```

Review comments on these commands:

- @ghost (Dec 17, 2019): why not `git checkout -b bigrams unigrams` (go to the
  experiment of bigrams based on the unigrams)
- Contributor: This would depend on your previous suggestion, #828 (comment) ?
  Perhaps you can replace these 2 comments for a single one for the full
  thing? 🙂

Now we can continue to make the necessary changes for the bigrams experiment.
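
A reviewer suggested doing this in one step with
`git checkout -b bigrams unigrams`. Here is a sketch of that form in a throwaway
repository (the temporary directory, the `demo` identity, the `commit` helper
and the empty commits are assumptions for illustration). The second argument is
the start point, so the new branch begins at `unigrams` no matter which branch
is currently checked out. In a DVC project, `dvc checkout` would presumably
still be needed afterwards to bring the data files in line with the new branch.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: an empty commit with a fixed demo identity.
commit() {
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "$1"
}

commit 'Baseline'                 # on master
git checkout -q -b unigrams
commit 'Unigrams experiment'      # only on unigrams
git checkout -q master

# Create bigrams on top of unigrams and switch to it, in one step.
git checkout -q -b bigrams unigrams

git log --oneline -1              # shows the 'Unigrams experiment' commit
```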

### Compare the metrics

To find out which experiment has the best performance (the best metrics) we use
the command `dvc metrics show` with the option `-a, --all-branches`:

```dvc
$ dvc metrics show -a

bigrams:
data/eval.txt: AUC: 0.624727

unigrams:
data/eval.txt: AUC: 0.624652
```
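
With many branches, reading this list by eye gets tedious. Below is a rough
sketch for picking the highest AUC, assuming the output has exactly the layout
shown above (a `branch:` line followed by a line that ends with the AUC value).
The sample text is pasted in rather than produced by DVC, and the variable names
are made up.

```shell
# Sample output copied from above; in a project it would come from
# `dvc metrics show -a`.
metrics_output='bigrams:
    data/eval.txt: AUC: 0.624727

unigrams:
    data/eval.txt: AUC: 0.624652'

best=$(printf '%s\n' "$metrics_output" | awk '
  # A line such as bigrams: starts a new branch section.
  /^[^ \t].*:$/ { branch = substr($0, 1, length($0) - 1) }
  # Remember the branch with the largest AUC seen so far.
  /AUC:/ && ($NF + 0 > max) { max = $NF + 0; value = $NF; winner = branch }
  END { print winner, value }
')
echo "$best"    # prints: bigrams 0.624727
```

This only works for a single metric that is better when higher; anything more
elaborate depends on how the metrics files of the project are structured.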

### Check out an experiment

Let's first list all the branches:

```dvc
$ git branch -a
bigrams
unigrams
...
```

To switch to the experiment `unigrams` we can do:

```dvc
$ git checkout unigrams
$ dvc checkout
```

Switching back to `master`:

```dvc
$ git checkout master
$ dvc checkout
```

In any case, the command `dvc repro` should not have to re-run anything and
should finish quickly, provided all the data of the experiments has been
committed properly.

### Move the best experiment to master

> Usually it is not necessary to move the best experiment to master, since we
> can easily switch to any of the branches.

When we do move it, what we usually want is to completely replace the master
branch with the experiment branch. Using `git merge` is not the best option in
such a situation, since it will usually result in a mixture of the two branches
(the master branch and the experiment branch). Instead we should copy the
branches, like this:

```dvc
$ git checkout bigrams
$ git branch -c master old-master   # -c: copy master to a backup branch
$ git branch -C bigrams master      # -C: copy bigrams over master (forced)
$ git push -f origin master         # force-update master on the remote
$ git branch -D old-master          # delete the backup
$ git checkout master
$ git diff bigrams                  # should show no differences
```

Review comments on `$ git branch -c master old-master`:

- Member: add branch, diff and other commands that are not highlighted to the
  list here:
- Reviewer: what's the -c and -C for? I couldn't follow up with a lot of
  branching going on 😅 would be great if you could add some comments
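
Reviewers asked what `-c` and `-C` do. The sketch below replays the sequence in
a throwaway repository (the temporary directory, the `demo` identity, the
`commit` helper and the empty commits are assumptions; the push is left out
because there is no remote). `git branch -c` copies a branch, and `-C` does the
same but overwrites the target if it already exists. Both options appeared
around Git 2.15, so older installations may need `git branch -f master bigrams`
instead. Git refuses to overwrite the branch that is currently checked out,
which is why the sequence starts from `bigrams`.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: an empty commit with a fixed demo identity.
commit() {
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "$1"
}

commit 'Baseline'
git checkout -q -b bigrams
commit 'Bigrams experiment'
old_master=$(git rev-parse master)

# We are on bigrams, so master is free to be overwritten.
git branch -c master old-master     # -c: copy master to a backup branch
git branch -C bigrams master        # -C: copy bigrams over master, overwriting it
# (git push -f origin master would go here if there were a remote)

git checkout -q master
git diff --stat bigrams             # no output: master and bigrams are identical
backup=$(git rev-parse old-master)  # still points at the previous master
git branch -q -D old-master         # drop the backup once everything looks fine
```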
97 changes: 97 additions & 0 deletions static/docs/user-guide/experiments/dirs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
# How to Manage Experiments by Directories

Using a separate directory for each experiment is the most intuitive solution
for managing experiments and is the first thing that comes to mind. Most DS/ML
practitioners are already familiar with this approach.

<p align="center">
<img src="/static/img/user-guide/experiments/dirs.png" />
</p>

This approach is most suitable when the experiments being managed do not differ
significantly in their code or pipeline, but differ in their input datasets,
processing parameters, configuration settings, etc.

Often it is possible to generate these experiment directories automatically (or
almost automatically) from the code of the main project (using the parameters or
configuration settings), so keeping them in Git is not interesting or useful.
What we would like to track instead are just the parameters that were used to
generate the experiment directory and the results of the evaluation (metrics),
so that we can figure out which parameters give the best results.

## Examples

There is a very basic example of using directories for each experiment at the
end of
[this interactive tutorial](https://katacoda.com/dvc/courses/basics/pipelines).

## How it works

Review comments on the following paragraph:

- Member: except the pipeline - what else does this directory contain? what
  about code? do we copy the whole project?
- Contributor Author: I think this is too project specific and it is impossible
  to explain such details on a section like this. The best that can be done is
  to have a concrete example somewhere else and to link it from the section of
  the examples.
- Member: there are a few recommendations we can make - clean data before
  copying? use a simple JSON conf? use DVC-files that are setup properly to be
  relative. Those ^^ are general problems. Without giving some hints at least
  it's not very actionable. People who can come with a solution to those
  problems don't even probably need a section like this (and I can send you a
  few names).
- Contributor Author (quoting "there are a few recommendations we can make -
  clean data before copying? use a simple JSON conf? use DVC-files that are
  setup properly to be relative."): All these seem common sense to me. Anyway I
  don't see how they can be explained without having an example as a reference.
  Maybe someone can give it a try and let's see whether they make sense.
- Member: Since there is no understanding from our end on how to give a guidance
  let's remove this page then for now and wait until the ticket on DVC core is
  resolved and wait for someone to take the #159.
- Member (quoting "All these seem common sense to me."): If they are common
  sense why would it be so hard to explain/mention them?

If we have a directory named `experiment1/` which contains the pipeline of the
first experiment, and we want to create another experiment in `experiment2/`,
which is based on the first one, often it is as easy as:

```dvc
$ cp --reflink -R experiment1/ experiment2/
```

Review comments on `$ cp --reflink -R experiment1/ experiment2/`:

- Member: reflinks are not supported on majority of systems. So, it's not that
  easy to copy it w/o hitting some performance problems by copying data. We
  should think a bit more on what should we do here.
- Contributor Author (quoting "reflinks are not supported on majority of
  systems"): "reflink" is a filesystem feature, and all the systems support
  filesystems with reflink. For example in Linux there are XFS, Btrfs, ZFS etc.
  The option "--reflink" here is a hint or a reminder that they need to use a
  filesystem that supports reflinks (like XFS), which they should be doing
  anyway since the start of the project.
- Member: see my bigger comment in the review. We can't rely on XFS, etc - for
  now 80-90% of systems are not configured to use them by default. And it won't
  change anytime soon. So, if we want to provide a meaningful solution to
  experiments management we should at least mention something about data, code
  and other things.
- Contributor Author (quoting "80-90% of systems are not configured to use them
  by default"): That's right, but they are not configured to use DVC by default
  either. Installing and using XFS is actually much easier than installing and
  using DVC. What I am trying to say is that if someone is working on a data
  project, he can and should install and use XFS (or some other deduplicating
  filesystem). This is a basic requirement.
- Member: It's very very far from the reality. There are a lot (majority?) of
  cases when you don't have a choice. Your argument applies only to freelancers
  and students.
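
On the reflink concern raised in review: with GNU `cp`, a bare `--reflink` means
`--reflink=always`, which fails on filesystems without reflink support. One
possible alternative is `--reflink=auto`, which makes a lightweight clone where
the filesystem allows it and falls back to a regular copy otherwise (at the cost
of time and disk space when the data is large). The sketch assumes GNU coreutils
and uses a temporary directory with a made-up `params.json`.

```shell
work=$(mktemp -d)
mkdir "$work/experiment1"
echo '{"ngrams": 1}' > "$work/experiment1/params.json"

# Clone if the filesystem supports reflinks, otherwise copy normally.
cp --reflink=auto -R "$work/experiment1" "$work/experiment2"

ls "$work/experiment2"
```

This does not settle the question of what else to copy (code, cached data,
configuration); it only keeps the command from failing on common filesystems.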

Then we can continue with modifying `experiment2/`, and finally we can produce
its results with:

```dvc
$ dvc repro -R experiment2/
```

The most important DVC commands, like `dvc commit`, `dvc checkout`, `dvc repro`,
`dvc pull`, `dvc push`, etc. can take the option `-R, --recursive` which is very
convenient for experiment directories.

The command `dvc metrics show` can take this option as well:

```dvc
$ dvc metrics show -R experiment2/
```

However, if we use just `dvc metrics show`, without any options or targets, it
will show the metrics of all the experiments, so that we can compare them.

Deleting an experiment is as easy as:

```dvc
$ rm -rf experiment2/
```

However, we should first make sure to save the parameters that we used for this
experiment and its metrics (results).
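
A minimal sketch of that order of operations, assuming each experiment keeps its
parameters in `params.json` and its metrics in `eval.txt` (both file names, and
the `results/` directory, are hypothetical): copy the two files out first, and
delete the directory only if the copy succeeded.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for a finished experiment.
mkdir experiment2
echo '{"ngrams": 2}' > experiment2/params.json
echo 'AUC: 0.624727' > experiment2/eval.txt

# Keep the parameters and metrics, then remove the experiment directory.
mkdir -p results/experiment2
cp experiment2/params.json experiment2/eval.txt results/experiment2/ &&
  rm -rf experiment2/

ls results/experiment2
```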

<details>

### 💡 Use a script to create experiments

When we build a pipeline we have to use some long `dvc run` commands, with lots
of options, to define stages. Doing all this manually is tedious and
error-prone. The recommended Linux practice in such cases is to record all the
commands in a bash script, which can then be used to build the whole pipeline at
once.

Some of the benefits of this approach are:

- Typing mistakes while building the pipeline are avoided.
- Modification of the pipeline becomes easier and more consistent (for example
  using find/replace).
- Building pipelines becomes flexible (for example bash variables can be used).
- Pipelines become reusable (other projects can copy/paste and customize them).

Using a script to create a pipeline is also very convenient when we want to
manage experiments with directories, because it allows us to customize the
experiment based on some options and parameters that we pass to the script.

This can further automate the process of creating a new experiment, producing
its results, saving them, and finally deleting the experiment directory. This
way we can, for example, automatically iterate over a large number of
hyper-parameters and save the corresponding results.

The implementation details actually depend on the specifics of each project.
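
As a rough illustration only, a script along these lines could generate one
directory per hyper-parameter value. The layout, the `new_experiment` function,
the `params.json` file and the learning-rate values are all hypothetical, and
the DVC step is left as a comment because it needs a real project.

```shell
work=$(mktemp -d)
cd "$work"
mkdir base                        # stand-in for the first experiment's pipeline
echo 'stage definitions go here' > base/README

# Create experiment directory $1 from base/ with learning rate $2.
new_experiment() {
  cp -R base "$1"
  printf '{"learning_rate": %s}\n' "$2" > "$1/params.json"
  # In a real project the results would be produced here, for example:
  #   dvc repro -R "$1"
}

for lr in 0.1 0.01 0.001; do
  new_experiment "exp-lr-$lr" "$lr"
done

ls -d exp-lr-*
```

Keeping the parameters in a file inside each directory means they can be saved
together with the metrics before the directory is deleted.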

</details>
22 changes: 22 additions & 0 deletions static/docs/user-guide/experiments/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Managing Experiments

The data science process is inherently iterative and R&D-like. Data scientists
may try many different approaches, different hyper-parameter values, and "fail"
many times before the required level of a metric is achieved. Even failed
experiments can be a useful source of information in ML.

DVC makes it easy to iterate on your project, providing ways to try different
ideas, keep track of them, switch back and forth, compare their performance
through metrics, and find the best experiment. It stores all the context
necessary to reproduce an experiment easily and efficiently: data, pipeline
stages, parameters, models, etc. That way, someone else (or you yourself 3
months from now) can check out and inspect all the details of an experiment.

You can manage experiments in several ways, which are described in this
section. Which one is most suitable for you depends on your preferences and also
on the kind and complexity of your project.

- [How to Manage Experiments by Tags](/doc/user-guide/experiments/tags)
- [How to Manage Experiments by Branches](/doc/user-guide/experiments/branches)
- [How to Manage Experiments by Directories](/doc/user-guide/experiments/dirs)
- [How to Manage Experiments by Several Methods](/doc/user-guide/experiments/mixed)

Review comments on the last list item:

- Member: by a Mix of Methods? Several methods sounds strange a bit
- Contributor Author: I don't actually mind, choose the wording that seems right
  to you.
- Member: @jorgeorpinel what's your take on this and on the one #828 (comment)
  here? what the best way to word it?
- Contributor: Maybe **How to Manage Experiments with a Mix of Methods** or use
  the word "Combination", as in that doc's intro.

26 changes: 26 additions & 0 deletions static/docs/user-guide/experiments/mixed.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# How to Manage Experiments by Several Methods

Review comments on the title:

- Member: Mix Methods to Manage Experiments?
- Contributor Author: That's fine too.
- Member: 👍
- Contributor: Maybe **How to Manage Experiments with a Mix of Methods** or use
  the word "Combination", as in the intro paragraph.

On complex projects, you can combine the methods that we have seen so far to
manage experiments.

<p align="center">
<img src="/static/img/user-guide/experiments/mixed.png" />
</p>

If you want to change different aspects of your ML pipeline, like input
datasets, featurization, learning algorithm, hyper-parameters, etc., you can
manage these changes with different methods. For example, let's say that you
create a different branch for each learning algorithm, and a tag for each input
dataset or featurization. Then you can create different experiment directories
for different hyper-parameters.

There is no standard solution that fits all the cases. The way that you might
combine the different experiment management methods depends on the concrete
problem that you are trying to solve and the details of the project.

In order to compare all the experiments, you can use the options
`-a, --all-branches` and `-T, --all-tags`, like this:

```dvc
$ dvc metrics show -aT
```