example-dvc-experiments: Include CML configuration #83

Closed
Tracked by #85
iesahin opened this issue Sep 3, 2021 · 10 comments
Labels
A: example-get-started-experiments DVC Experiment, DVCLive examples enhancement New feature or request priority-p1 Immediate pool of tickets to take and work as part of the next sprint

Comments

@iesahin
Contributor

iesahin commented Sep 3, 2021

@casperdcl what do we need to make this repo useful for the CML happy path?

@iesahin @casperdcl as we discussed, can we make it example-experiments, covering basic scenarios with a predefined language (Python) and a predefined framework (let's say TensorFlow for now)? It would have a CML action from the first (?) commit that could be run if needed (and maybe even runs automatically).

Then it'll be a good repo that we can even meaningfully present in Studio?

What do we need to make it substantially useful for CML?

Originally posted by @shcheklein in #79 (comment)

@iesahin iesahin added enhancement New feature or request priority-p1 Immediate pool of tickets to take and work as part of the next sprint labels Sep 3, 2021
@casperdcl
Contributor

casperdcl commented Sep 3, 2021

@shcheklein:

  • the CML use case (specifically auto-push checkpoints) is significantly different from any of the DVC use cases (CC @DavidGOrtega)
    • it also requires significant additional config (workflow.yaml to both run and also auto-create more (!) branches & open PRs, creds for cloud compute, creds for remote storage, additional env vars, code written for max num epochs)
  • placing the CML example in the same repo as DVC examples is very confusing to users. Switching branches is an unnecessary complexity on top of an already complex example.

I'd strongly suggest the CML case is pushed to a different repo (example-cml-experiments, with separate folders/branches for with & without dvc) rather than more branches in example-dvc-experiments/example-experiments.

@shcheklein
Member

the CML use case (specifically auto-push checkpoints) is significantly different from any of the DVC use cases

how is it different in terms of the project? let's try to scope it here

also, let's scope the "happy-path", get started experience, etc ... what is the purpose of the repo for CML - tutorial, use case, get started? what are things that we'd like to show?

it also requires significant additional config

this should not be a problem to my mind, additional GH action config is totally fine to have (people won't see it unless you point to it)

placing the CML example in the same repo as DVC examples is very confusing to users.

agreed, if we talk about CML in general (when DVC is not being used at all). If we talk about DVC+CML - I'm not sure why that would be confusing?

And to clarify, name should be generic here in that case - example-experiments.

with separate branches

example per branch is bad for a lot of reasons - branches are first-class citizens in the DVC workflow and mixing them this way is bad to my mind (think about connecting such a repo to DVC Studio, or running a command like dvc metrics show -a, etc.)

rather than more branches in example-dvc-experiments/example-experiments

yep, agreed - I would not do branches. See above. It should be a simple repo like the existing get-started one that covers the happy path across DVC, CML, DVCLive ... to clarify, I also don't think that it will cover everything ... but we should all be optimizing for simplicity and try hard to have a common ground where all tools integrate nicely

@casperdcl
Contributor

casperdcl commented Sep 3, 2021

  • the scope of the CML "config" stuff: I'd put this in brackets (workflow.yaml to both run and also auto-create more (!) branches & open PRs, creds for cloud compute, creds for remote storage, additional env vars, code written for max num epochs)
  • one that covers happy path across DVC, CML, DVCLive ...: ah, I agree this is a nice thing to have; one example repo that uses best-practice-of-everything. However, I think that is a separate issue. I thought we were talking about just CML example repos here (for use in https://cml.dev/doc/X where X is use-cases, user-guide, how-to, tutorial, example, blog, etc.)

@shcheklein
Member

the scope of the CML "config" stuff:

this scope sounds good to me, that's what we do for the get-started-example, and there is no contradiction so far. I see only benefits in this.

I thought we were talking about just CML example repos

yes, but this discussion started when we were trying to use the mnist repo (and codify it) for CML as far as I understand?

Ideally I would then plan a bit - what kind of repositories will you need for CML, which of them will you need to codify, etc? No doubt there will be a lot of smaller repos (considering that we have GitLab/GitHub/Bitbucket + different clouds + different scenarios like TensorBoard). It's a separate question how you want to build them, which of them to codify, etc. Same with DVCLive - if we want to cover all possible integrations we'll need a separate repo(s) to do that.

Here we are talking more about get started experience I think.


Back to my initial question - would it be useful/possible to create an example-experiments repo that will be used in all the docs related to experiments, at least for the happy path and get-started-like material? (maybe we'll have to do GitLab/Bitbucket versions, and learn how to push to three platforms).

@iesahin
Contributor Author

iesahin commented Sep 6, 2021

I'd propose to determine the most common cases (i.e. happy path?) for the related technologies and bundle them in a common repository, and additionally have smaller repositories that may be used as a showcase.

In the CML case, the most common setup seems to be GitHub configuration with AWS. This can be the default in example-experiments, and we can use other repositories such as example-experiments-gitlab for the rest. These custom repositories can be used for testing and as templates for new user projects.

Codification for the configuration is straightforward. We just need to determine at which stage it's most relevant to configure.

@iesahin iesahin changed the title example-dvc-experiments: Improve to use for CML example-dvc-experiments: Include CML configuration Sep 6, 2021
@iesahin
Contributor Author

iesahin commented Nov 9, 2021

After reviewing this again, I think providing a repository generator (a la example-repos-dev) is more appropriate. We can have a "get-started" script that initializes a repository per the user's needs, after prompting for them.

...
$ Do you want to include CML configuration? (y/N)
y
$ For which Git provider do you want to set up CML?
1: GitHub
2: GitLab
3: Bitbucket
3
...

Otherwise, it will be difficult to keep track of a separate repository for every possible setup. Also, I'm not sure we know the happy path for all kinds of users; some may want a simple repository, others may want bells and whistles.
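
The prompting idea above can be sketched roughly as follows. This is only an illustration of the proposal, not an existing script: `RepoOptions`, `parse_answers`, `planned_files` and the base file list are hypothetical names, and the CI config paths are simply each platform's conventional location.

```python
from dataclasses import dataclass
from typing import List, Optional

# Conventional CI config locations per Git provider (assumed targets for
# a hypothetical generator; nothing here is an existing tool).
CI_CONFIG_PATHS = {
    "github": ".github/workflows/cml.yaml",
    "gitlab": ".gitlab-ci.yml",
    "bitbucket": "bitbucket-pipelines.yml",
}

# Order matches the numbered menu shown to the user (1, 2, 3).
PROVIDER_MENU = ["github", "gitlab", "bitbucket"]


@dataclass
class RepoOptions:
    include_cml: bool = False
    provider: Optional[str] = None


def parse_answers(include_cml_answer: str, provider_choice: str = "") -> RepoOptions:
    """Turn raw prompt answers into options; anything but y/yes means 'no'."""
    if include_cml_answer.strip().lower() not in ("y", "yes"):
        return RepoOptions()
    index = int(provider_choice) - 1  # raises ValueError on non-numeric input
    if not 0 <= index < len(PROVIDER_MENU):
        raise ValueError(f"unknown provider choice: {provider_choice!r}")
    return RepoOptions(include_cml=True, provider=PROVIDER_MENU[index])


def planned_files(options: RepoOptions) -> List[str]:
    """List the files the generator would write for the chosen options."""
    files = ["README.md", "dvc.yaml"]  # hypothetical base project files
    if options.include_cml and options.provider is not None:
        files.append(CI_CONFIG_PATHS[options.provider])
    return files
```

With the answers from the example session, `planned_files(parse_answers("y", "3"))` would add `bitbucket-pipelines.yml` to the base files. Keeping the answer parsing separate from the file planning would let the same logic back either an interactive script or a cookiecutter-style template.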

@dberenbaum
Collaborator

Sounds a lot like creating our own cookiecutter. Having a fork of https://github.com/drivendata/cookiecutter-data-science could be a way to get users started quickly.

@DavidGOrtega

DavidGOrtega commented Nov 10, 2021

In the CML case, the most common setup seems to be GitHub configuration with AWS. This can be the default in example-experiments, and we can use other repositories such as example-experiments-gitlab for the rest. These custom repositories can be used for testing and as templates for new user projects.

I have a repo that is a full example (and also an integration tester) of DVC-CML for GL, BB and GH.
It mirrors every change to the other vendors.
I'm giving it the final touches and I will give it back to Iterative.

@iesahin
Contributor Author

iesahin commented Nov 15, 2021

Sounds a lot like creating our own cookiecutter. Having a fork of https://github.com/drivendata/cookiecutter-data-science could be a way to get users started quickly.

That's a better idea. @dberenbaum

@shcheklein shcheklein added the A: example-get-started-experiments DVC Experiment, DVCLive examples label May 11, 2022
@casperdcl
Contributor

casperdcl commented May 17, 2022

srry haven't followed this since Sept 2021 🙈 😅

See the list at the top of #100 for the current CML example repo layout.

So it's a lot of potential complexity. In terms of "single example happy path showcase of all products" I'd suggest 2 options:

  1. With extra credentials required
  2. No extra creds required
    • DVC (data CRUD, pipelines, plots + metrics, DVC_EXP_AUTO_PUSH for spot recovery)
    • CML (runners, spot recovery, reports, tensorboard-dev)
    • DVCLive (live reports?, saving epoch statefile for spot recovery)
    • GHActions (CI)
    • AWS (storage CRUD, runners)
    • badges
    • Studio
    • Codespaces + VSCode extension

I don't know whether this is within the scope of example-dvc-experiments from dvc exp getting started.
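
To make the "saving epoch statefile for spot recovery" and "code written for max num epochs" points concrete, here is a minimal sketch of a resumable training loop. It is not DVCLive's or CML's API: the JSON state file layout and the `load_last_epoch`, `save_epoch` and `train` helpers are assumptions for illustration, and pushing the state to remote storage is out of scope.

```python
import json
import os


def load_last_epoch(state_path):
    """Return the last completed epoch, or 0 if no state file exists yet."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)["epoch"]


def save_epoch(state_path, epoch):
    """Record a completed epoch; write to a temp file first so an
    interruption mid-write cannot leave a truncated state file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"epoch": epoch}, f)
    os.replace(tmp_path, state_path)


def train(state_path, max_epochs, train_one_epoch):
    """Run the remaining epochs up to max_epochs and return those executed.

    Because the loop is bounded by an absolute max_epochs (not "N more
    epochs"), a restarted runner continues where the previous one stopped.
    """
    executed = []
    start = load_last_epoch(state_path)
    for epoch in range(start + 1, max_epochs + 1):
        train_one_epoch(epoch)  # hypothetical user-supplied training step
        save_epoch(state_path, epoch)
        executed.append(epoch)
    return executed
```

If a spot instance disappears during epoch 3 of 5, a fresh runner calling `train` with the same state file would run only epochs 3 to 5, and a further call would do nothing.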


5 participants