User Guide / Manage Experiments #828

Closed
wants to merge 11 commits into from

dashohoxha
Contributor

Close #816, Close #159

@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 November 29, 2019 13:37 Inactive
@ghost
ghost commented Dec 2, 2019

@dashohoxha, the "manage experiments by directories" image could be improved a lot by using more meaningful names.

What I've seen is that users create directories to try different hypotheses, not exactly different versions 🤔

@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 3, 2019 08:45 Inactive
@dashohoxha
Contributor Author

the "manage experiments by directories" image could be improved a lot by using more meaningful names

What I've seen is that users create directories to try different hypotheses, not exactly different versions

@MrOutis I have used dummy names just to give an idea of what it should look like (and also to show that sub-experiments can be used as well, that is, experiments that are based on other experiments).
I have done the same thing (using dummy names) for tags and branches as well, just to give an idea.

However I completely agree with you. If you (or someone else) could provide some more realistic example names for directories and subdirectories, I can fix that image.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 3, 2019 11:49 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 3, 2019 21:44 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 4, 2019 10:21 Inactive
@dashohoxha dashohoxha changed the title from "[WIP] User Guide / Manage Experiments" to "User Guide / Manage Experiments" Dec 4, 2019
@dashohoxha
Contributor Author

I am aware that it is not complete or perfect, especially the pages of directory management and combined management. However this is the best that I could do. If someone has any ideas for improving it further this would be great.

@shcheklein
Member

It looks great, @dashohoxha !

A few comments from the first pass I've done:

  • can we make the image with directories the same style as tags and commits? they look better, maybe because they have at least some color to them?
  • Manage -> Managing - otherwise we have to rename all other UG sections: Updating ...., Managing ..., Versioning, etc
  • I would say commits are another way to do experiments. Maybe we can mention this somewhere?

I'll read it more carefully today.

@dashohoxha
Contributor Author

  • can we make the image with directories the same style as tags and commits? they look better, maybe because they have at least some color to them?

Unfortunately this is not easy with the tool that I am using (umlet).

  • Manage -> Managing - otherwise we have to rename all other UG sections: Updating ...., Managing ..., Versioning, etc

I used "Manage Experiments" as a shorter version of "How To Manage Experiments". It seems smoother to read than "Managing Experiments". However, I agree with your consistency argument.

  • I would say commits are another way to do experiments. Maybe we can mention this somewhere?

This seems like the case of tags. If you do a single commit for each experiment, then why not put a tag on each commit? It is true that you don't need the tags to switch back to a previous commit (you can look at the log and find the commit ID), but it is easier with tags. Besides, the command dvc metrics show works with tags and branches, but does not work with commits, does it?

@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 5, 2019 05:54 Inactive
Contributor

@pared pared left a comment

Nice read. I also really like the last points for Branches and Tags. Moving the best experiment to master is definitely not a beginner's use case.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 5, 2019 18:19 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 5, 2019 18:25 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-828 December 5, 2019 18:27 Inactive

```dvc
$ git checkout bigrams
$ git branch -c master old-master
```

Member

add branch, diff and other commands that are not highlighted to the list here:

@ghost

what's the -c and -C for?

I couldn't follow up with a lot of branching going on 😅 would be great if you could add some comments
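
To make the question about `-c` and `-C` concrete, here is a minimal sketch in a throwaway repository. It assumes Git 2.15 or newer (where `git branch --copy` is available); the repository, identity and commit messages are made up for illustration and are not taken from the PR.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; names and messages are illustrative only.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "Baseline"
git checkout -q -b bigrams
git commit -q --allow-empty -m "Bigrams experiment"

# -c (--copy): create old-master as a copy of master.
git branch -c master old-master

# Plain -c refuses to overwrite a branch that already exists...
if git branch -c bigrams old-master 2>/dev/null; then
  echo "unexpected: -c overwrote old-master"
fi

# ...while -C (--copy --force) overwrites it.
git branch -C bigrams old-master
git branch --list
```

In other words, `-c` keeps a backup copy of a branch, and `-C` does the same but replaces the target if it exists, which is why it deserves a warning in the guide.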

```dvc
$ git add .
$ git commit -m 'Baseline experiment'
$ dvc commit
$ git tag baseline-experiment
```

Member

it is a lightweight tag. Why do we use it here as opposed to the regular one below? Are you sure that metrics show works well with it?

Contributor Author

This one is a "normal" tag, the other one is an annotated tag. It is just an example.
As far as I know metrics show works with all the tags.

Member

it's not even about what can be called normal. It's more about - why do we use a lightweight here and an annotated below? If there are some reasons then explanation is required when should I use what type.

Contributor Author

I don't think there is any reason. They are both tags and both of them can be used.

Member

so, why use both? let's simplify. Though, there are some guidelines behind those. I think people tend to use lightweight tags for things that they won't be sharing/saving. Maybe we can use that difference as a guideline? Not sure that we need to put in so many details - I would probably just keep the simple version in both cases if it works fine.
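
For readers following the lightweight versus annotated discussion, the difference can be seen directly in Git. This is only a sketch in a temporary repository; the tag name `bigrams-experiment` and its message are hypothetical, and nothing here says how `dvc metrics show` treats either kind of tag.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; identity and tag names are illustrative only.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git commit -q --allow-empty -m "Baseline experiment"

# Lightweight tag: just a name that points at the commit.
git tag baseline-experiment

# Annotated tag: a separate tag object with tagger, date and message.
git tag -a bigrams-experiment -m "Bigrams experiment"

git cat-file -t baseline-experiment   # prints "commit"
git cat-file -t bigrams-experiment    # prints "tag"

# Both names still resolve to the same underlying commit.
git rev-parse "baseline-experiment^{commit}" "bigrams-experiment^{commit}"
```

One practical consequence in plain Git is that `git describe` only considers annotated tags unless `--tags` is passed, which fits the suggestion above to treat lightweight tags as private or temporary markers.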

```dvc
$ git checkout baseline-experiment
$ dvc checkout
$ dvc repro evaluate.dvc
```

Member

why do we need to run repro here?

Contributor Author

Maybe to make sure that dvc checkout has retrieved all the correct data? I don't know.

Member

dvc status is better then and makes more sense?

Contributor Author

You are right dvc status would make more sense.

Member

🤝

```dvc
$ git diff --cached
$ git revert --continue

# delete the old tag and add it to the current version
```

Member

please split it into two blocks

move comments out and put them as some human readable explanation to the blocks
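
As an illustration of what "delete the old tag and add it to the current version" could look like once it is split out into its own block, here is a hedged sketch. It runs in a temporary repository with made-up commits; only the tag name `baseline-experiment` comes from the PR.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; commits are illustrative only.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "Baseline experiment"
git tag baseline-experiment
git commit -q --allow-empty -m "Fix the baseline"

# Delete the tag that still points at the old commit...
git tag -d baseline-experiment

# ...and create it again on the current commit.
git tag baseline-experiment

git log --oneline --decorate
```

The same result can be had in one step with `git tag -f baseline-experiment`. Either way, a tag that was already pushed would also have to be updated on the remote.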


### Move the best experiment on top

Let's say that `baseline-experiment` has the best performance and we want to

Member

Those are destructive, right? Let's put a note with ! emoji.

Contributor Author

What do you mean by "destructive"?

Member

Yep, right. I probably wanted to put this comment to the branches way of doing experiments.

Member

So, how about branches part? In that place where we do force push and stuff - let's put a comment there?


```dvc
$ git tag
$ git log --oneline
```

Member

show output here?

Contributor Author

I don't think it will help to improve the explanation in this case.

Member

It would improve it for me, as I was reading it for the first time. Reading it for the second time, I don't even understand why we need git log --oneline everywhere. You are not saying a word about it. By the way, I meant the output of git tag:

```dvc
$ git tag
baseline
bigrams
```

it would be easier to read and it is what the previous paragraph is about.

Contributor Author

Both git tag and git log --oneline are ways to check what tags are available. Different from git tag, git log --oneline also shows the position of the tags in the history of commits and their order, which one is first and which one is last.

Member

This explanation makes it 10x better! You see, you can do it :) And I'm not kidding - I didn't know the difference between those related to the tags.

I still have some questions though: would log include all commits? If you have a lot of commits (a very regular thing in a real project), the tags would get lost.

Is there an option for the git tag command to sort output?

I would still prefer to simplify it tbh. And put some output - it will make way easier to understand and read. At least you have one reader who is saying that it would be easier to read. Don't rely on your opinion here.

which is based on the first one, often it is as easy as:

```dvc
$ cp --reflink -R experiment1/ experiment2/
```

Member

reflinks are not supported on majority of systems. So, it's not that easy to copy it w/o hitting some performance problems by copying data. We should think a bit more on what should we do here.

Contributor Author

reflinks are not supported on majority of systems

"reflink" is a filesystem feature, and all the systems support filesystems with reflink. For example in Linux there are XFS, Btrfs, ZFS etc. The option "--reflink" here is a hint or a reminder that they need to use a filesystem that supports reflinks (like XFS), which they should be doing anyway since the start of the project.

Member

see my bigger comment in the review. We can't rely on XFS, etc - for now 80-90% of systems are not configured to use them by default. And it won't change anytime soon. So, if we want to provide a meaningful solution to experiments management we should at least mention something about data, code and other things.

Contributor Author

80-90% of systems are not configured to use them by default

That's right, but they are not configured to use DVC by default either. Installing and using XFS is actually much easier than installing and using DVC.
What I am trying to say is that if someone is working on a data project, he can and should install and use XFS (or some other deduplicating filesystem). This is a basic requirement.

Member

It's very very far from the reality. There are a lot (majority?) of cases when you don't have a choice. Your argument applies only to freelancers and students.
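
One detail that bears on this disagreement: with GNU `cp`, a bare `--reflink` means `--reflink=always`, which fails outright on filesystems without reflink support, whereas `--reflink=auto` falls back to a regular copy. The sketch below assumes GNU coreutils (it will not work with BSD/macOS `cp`), and the files inside `experiment1/` are invented for the example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway working directory; file names and contents are illustrative only.
work="$(mktemp -d)"
cd "$work"
mkdir -p experiment1/data
printf 'a,b\n1,2\n' > experiment1/data/train.csv
printf 'ngrams: 1\n' > experiment1/params.yaml

# --reflink=auto: clone blocks where the filesystem allows it,
# otherwise silently fall back to an ordinary copy.
cp --reflink=auto -R experiment1/ experiment2/

# Probe whether a real reflink is possible on this filesystem.
if cp --reflink=always experiment1/data/train.csv probe.tmp 2>/dev/null; then
  echo "reflinks are supported here: the copy shares data blocks"
else
  echo "no reflink support here: the copy duplicated the data"
fi
rm -f probe.tmp
```

This does not settle the performance concern: on systems without reflinks the fallback still duplicates all the data, so guidance on cleaning or excluding large files before copying would still be needed.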

- [How to Manage Experiments by Tags](/doc/user-guide/experiments/tags)
- [How to Manage Experiments by Branches](/doc/user-guide/experiments/branches)
- [How to Manage Experiments by Directories](/doc/user-guide/experiments/dirs)
- [How to Manage Experiments by Several Methods](/doc/user-guide/experiments/mixed)

Member

by a Mix of Methods? Several methods sounds strange a bit

Contributor Author

I don't actually mind, choose the wording that seems right to you.

Member

@jorgeorpinel what's your take on this and on the one #828 (comment) here? what's the best way to word it?

Contributor

Maybe **How to Manage Experiments with a Mix of Methods**
or use the word "Combination", as in that doc's intro.

@@ -0,0 +1,26 @@
# How to Manage Experiments by Several Methods

Member

Mix Methods to Manage Experiments?

Contributor Author

That's fine too.

Member

👍

Contributor

Maybe **How to Manage Experiments with a Mix of Methods**
or use the word "Combination", as in the intro paragraph.


## How it works

If we have a directory named `experiment1/` which contains the pipeline of the

Member

except the pipeline - what else does this directory contain? what about code? do we copy the whole project?

Contributor Author

I think this is too project specific and it is impossible to explain such details on a section like this.
The best that can be done is to have a concrete example somewhere else and to link it from the section of the examples.

Member

there are a few recommendations we can make - clean data before copying? use a simple JSON conf? use DVC-files that are setup properly to be relative.

Those ^^ are general problems. Without giving some hints at least it's not very actionable. People who can come with a solution to those problems don't even probably need a section like this (and I can send you a few names).

Contributor Author

there are a few recommendations we can make - clean data before copying? use a simple JSON conf? use DVC-files that are setup properly to be relative.

All these seem common sense to me. Anyway I don't see how they can be explained without having an example as a reference.
Maybe someone can give it a try and let's see whether they make sense.

Member

Since there is no understanding on our end of how to give guidance, let's remove this page for now, wait until the ticket on DVC core is resolved, and wait for someone to take #159.

Member

All these seem common sense to me.

If they are common sense why would it be so hard to explain/mention them?

Member

@shcheklein shcheklein left a comment

It looks good. Directory image is not a blocker - disregard the comment. Commits - agreed, we can update it later when we have some interface ready.

Did a full pass review. Did some minor modifications. Mostly minor comments that should be easy to address. The one major concern is the logic behind dirs experiments - it makes sense on the high level but can be tricky to achieve in reality, or at least some recommendations should be put in that section:

  • reflinks are not supported on a majority of systems. We should do something with data on those systems. Clean before copying the dir?
  • code - how do we copy and/or do modification to it? or do we copy only some JSON config? (which is a good solution that should be mentioned!)
  • pipeline - what are considerations/requirements to make it copiable?

@dashohoxha
Contributor Author

Regarding the "dirs" page, I agree that in practice there are certain "tricks" that are needed to make it work, but I don't see how they can be explained in a meaningful way without a concrete example.

@dashohoxha
Contributor Author

By the way, the new command dvc experiment that is being discussed might be a good way to encapsulate or hide most of the tricks and details that are needed for the case of experiment-by-directory to work. This way the user will not have to worry about them and we will not have to explain them :)

@ghost ghost left a comment

partially reviewed it, left some comments.

```dvc
$ git commit -am 'Evaluate'
$ dvc commit # just to make sure all the data is committed
$ git checkout -b unigrams
```

@ghost

why not git branch unigrams?

Contributor

@jorgeorpinel jorgeorpinel Dec 17, 2019

Actually, git branch is pretty new and we don't use it in the docs. But we do use git checkout a lot. I would open a separate terminology-focused issue to decide whether to change all instances or not.

This comment was marked as resolved.
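
For reference, the practical difference between the two commands in this thread is whether the working tree switches to the new branch. A minimal sketch in a temporary repository follows; the branch name `trigrams` is hypothetical and used only to show the one-step form.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; names are illustrative only.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"
git commit -q --allow-empty -m "Evaluate"

# git branch only creates the branch; HEAD stays where it was.
git branch unigrams
echo "after git branch: $(git rev-parse --abbrev-ref HEAD)"

# A second command is needed to actually switch to it.
git checkout -q unigrams

# git checkout -b creates the branch and switches in one step.
git checkout -q master
git checkout -q -b trigrams
echo "after git checkout -b: $(git rev-parse --abbrev-ref HEAD)"
```

So `git branch unigrams` would only fit the guide's snippet if the following steps are meant to stay on the original branch.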


```dvc
$ git checkout unigrams
$ dvc checkout
```

@ghost ghost Dec 17, 2019

why not git checkout -b bigrams unigrams (go to the bigrams experiment, based on unigrams)

Contributor

This would depend on your previous suggestion, #828 (comment) ? Perhaps you can replace these 2 comments for a single one for the full thing? 🙂
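
The suggested `git checkout -b bigrams unigrams` form takes a start point as its last argument, so the new branch begins at `unigrams` regardless of the branch currently checked out. The sketch below shows this in a temporary repository with made-up commits; in a DVC project the guide's own snippet would presumably still follow it with `dvc checkout`, which is not run here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; commits are illustrative only.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "Baseline"
git checkout -q -b unigrams
git commit -q --allow-empty -m "Unigrams experiment"
git checkout -q master

# Create bigrams starting from unigrams and switch to it, in one step,
# even though master is the branch currently checked out.
git checkout -q -b bigrams unigrams

git log --oneline --decorate -n 1
```

This replaces the two-step sequence of checking out `unigrams` and then creating `bigrams` from it.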

static/docs/user-guide/experiments/branches.md (outdated, resolved)

per #828 (comment)

Co-Authored-By: Mr. Outis <mroutis@protonmail.com>
@shcheklein
Member

closing this as stale

@shcheklein shcheklein closed this Mar 14, 2020
@efiop efiop deleted the user-guide/experiments branch March 15, 2020 20:18
@efiop efiop restored the user-guide/experiments branch March 15, 2020 20:18
@jorgeorpinel jorgeorpinel deleted the user-guide/experiments branch May 5, 2020 22:39
Successfully merging this pull request may close these issues.

- guide: Managing Experiments section(s)
- user-guide: add "folders" way of experimentation
4 participants