start: independent trails #2496
Comments
Thanks for the summary, very informative. Pre-Q
Meaning https://dvc.org/doc/user-guide/experiment-management/checkpoints? Seems to use https://github.com/iterative/checkpoints-tutorial . Is that one basically https://github.com/iterative/dvc-checkpoints-mnist ( On the motivation for this
Probably the most important thing and first step here would be to decide on the content. What new content are we looking to add? Which one needs changes? That should drive the sample repos (@dberenbaum has mentioned this too I think).
Looks like all the existing repos are already stand-alone right? Just trying to understand whether we have a working setup or if there's some urge to keep iterating on sample repos before content. Great work so far BTW! On the discussion points
As high-level as possible while supporting hands-on steps.
People from all backgrounds who need to learn (certain parts of) DVC fast — get a good grasp of basic concepts, problems/solutions, while trying the major commands (UX).
I'd focus on DVC for now 🙂 Long answer: depends on the product:
|
Thanks @iesahin . I agree with @jorgeorpinel on the audience/level (I would keep it as is)/other projects (let's not do this for now- I would think about Some thoughts on the content (acknowledging that this is the most important one anyway). First, to remind a bit on why the existing example get started is considered suboptimal:
Bottom line: we want to make experiments (and potentially pipelines?) first-class citizens in the get started.
I would not do this. I would start with two entry points - Data & Models Versioning (?), Experiments Management - something like that?
Ideally, they should understand from the GS + Use Cases where DVC fits and how it works, at a high level |
A comment on the HN recent submission, that should drive changes to the experiments section at least:
|
This comment is very valuable 👍🏼 |
Is there a way to get user feedback on GS pages somehow within the pages? PHP had comments in documentation maybe 10+ years ago. Could we have, for example, links to discussions, comments, or some sort of feedback facility in the pages themselves? |
@iesahin it's a very good point, I would love to see some nice way to collect feedback, discussions, etc. We can create a separate ticket for this and prioritize. |
Yes, This checkpoints guide has several drawbacks, IMO:
All checkpoints guides (for either the GS or the UG) should assume experiments as a starting point. Checkpoints rely on The reason I proposed #2518 is this. There should be a GS-level guide after experiments that introduces checkpoints. |
I'd rather have an evolutionary approach in these. What are the most obvious, glaring points that we miss from the Get Started section?
I propose three starting points for the GS docs:
These 3 documents should be independent from each other. They can have subsections that we use to link from each other. Each should take at most ~1 hour to read and understand the subject matter. Also we can employ Studio in several places, especially in experiments, if you don't mind hijacking open source software documentation for SaaS promotion. |
Example repos are actually shaped by the tutorial and showcase requirements. The reason I'm trying to bring forward a GS Experiments document in #2497 is to shape the repository in iterative/example-repos-dev#44 according to the reviews. I've updated, e.g., almost all the parameters, the pipeline, etc. in These |
What do you mean by high level exactly? @jorgeorpinel I have some ideas but would like to learn yours first. |
IMHO "all backgrounds" is a set that is a bit too large. We need to profile the users and decide on their goals in using DVC. We need to make assumptions on the following criteria (and more)
The most important: What do they ask from DVC?
I'd like to have 3-5 distinct personas for whom we write our content. We can review the documents through their eyes. Without a set of concrete personas, I think content production becomes a moving target. I can write for myself and you can review for yourself, but our goal is not to document the software for ourselves. |
I think presenting visual aspects using Studio is much easier at first. For example using plots and showing how those plots are generated in |
Then, it looks like we can use Write another document for Experiments that contains (1) Experiment Management, (2) Plots and Metrics, (3) Sharing Experiments, and (4) Checkpoints. Readers may start from Versioning and proceed to Experiments, or start from Experiments and hop to Versioning. I think we need another one for the Pipelines, or write the pipelines as an addendum to each of these. Pipelines are a bit orthogonal to the other aspects. I would like to read/tell one thing at a time, in each section. So adding pipelines to the mix may reduce the overall focus of the documents. We can have a shorter Pipelines document that we link from each of these. |
I think I can create 3 distinct profiles from these: (1) an industry person with Git knowledge looking for ML production tools, (2) a graduate student with ML experiments looking for experiment tracking, (3) a DevOps guy working in an ML environment with lots of data. If we can keep these profiles as distinct as possible while making their union cover our user base, we can check the docs in these profiles' eyes and see the omissions easily. I need to have some direction here about the typical users, Alice, Bob and Charlie. |
BTW, I'm using some ideas from Martin Lindstrom's Small Data about this profiling idea. I read the book a few years back and I remember how he produces marketing material using profiling. I remember the book saying there are a finite number of profiles that we should be thinking about and people belong to these categories, instead of each having a unique character. |
What they are looking to do with our tools depends on the doc. For example, I would assume that a get started doc for experiments would target someone doing ML experiments and needing to organize, compare, and track them to decide which experiment is best. Rather than define global profiles, maybe we should define a profile for each get started doc? |
Get Started pages are not tutorials. (See "Master Dict" in https://www.notion.so/iterative/wip-Lost-in-Translation-17a263187e2b40e88072ce041a5be4e1)
https://dvc.org/doc/user-guide/experiment-management/checkpoints and https://github.com/iterative/dvc-checkpoints-mnist (linked from a few places)
This is a good point. But it's hard for me to envision combining both topics since there's so much material in https://dvc.org/doc/start/metrics-parameters-plots
I think it makes sense to keep Access separate though.
This is a good Q. I don't think we've measured the read/try time before. I'm hoping it's much less than 1h — not sure that qualifies as "quick" (assuming Get Started = Quick Start).
Studio is a separate product and has its own docs. I can see adding a layer to switch from terminal to Studio in many examples but again, I wouldn't further complicate this discussion with that for now.
Great Q actually. By high-level I understand that the GS will cover all of DVC's features but only enough to establish what main problem/solution they represent. In this sense it's a relatively shallow kind of doc. Again, its goal is to cover lots of ground quickly, provide an overall impression, basic UX experience, and awaken curiosity (link to guides, refs, etc. for more deets).
I don't think it's too broad. People will filter themselves out. If you intentionally ended in the GS, you probably have a good reason, and fit our target audience.
Sounds good but I think the GS is the one place where we may not need to worry about that too much. Let's make a separate issue or discuss separately? (I have a metadoc about this here) |
A lot of information there :) I think Emre you are right about 3 entry points. Since we have 2 (mixed now into get-started), I would focus first on the 3rd one - experiments management. Primary persona for that one - tech-savvy ML engineer, hands-on ML manager, industry (not students, not DevOps - or only if their team asked them to check for other tools). It doesn't mean that we should disregard simplicity. But we should not be educating people on how to use Git. Level: the purpose of get started is to have a document that people can get the idea from really quickly. It's more like a quick start. Thus - simple commands, hiding long explanations, etc. |
Here there is also a I'd rather rename the sections like (1) Quick Start to Data Management (2) Quick Start to Experiment Management (3) Quick Start to Pipelines and have at most 3000 words for each. (~10 minutes reading.) Another 20 minutes for trying commands and in around 30 minutes, the user should get a gist of the subject. |
I think, yes, ~1 hour is too much. Aiming for a soft limit of 10 minutes / 3000 words is better, probably. |
This is more or less what I understand too, but I think we should aim for 80% of the features that our users may need in their day-to-day activities. Instead of presenting DVC features, we should be thinking about which commands they use most and in what order. Once they have started, they can come back and read the UG for details or other features. |
I think we need to control who filters themselves out. If we don't want some kind of audience, e.g., managers who never saw a command line before, filtering out is fine. But if someone who might be among our users filters themselves out, IMO that's not OK. "Let's throw all features at the wall and see which users stick" may not be a good strategy here. |
This goal and presenting all DVC features might conflict from time to time, and in that case I'd prefer this ⬆️ goal for GS and presenting DVC features in UG. |
A Notion document seems fine for discussion, but I don't believe this is unimportant. I think GS docs are the most restrictive place we have when it comes to thinking about the audience. It's like a glass shop window where you present your most interesting items. We have limited space and we need to think about who might stop and take a look at these items. |
Alright we need a summary of this please @iesahin . Is #3050 all that's missing to get to a first milestone here? What else is left and can we consolidate (ideally into a new issue) with #1943 and #2474? That would be really helpful to get a sense of where we are with the GS and what a future milestone may look like. -- On the "trails" idea, as discussed offline we should probably keep a plain, curated structure for the Get Started so no need to decompose into these atomic doc units that can be reorganized into many trails. We could however use that strategy for the User Guide, where a complex structure like that could actually be beneficial! Related to #144 and #3128 -- let's move that discussion there? |
UPDATE: From call with @iesahin we agreed he'll submit a proposal (draft PR?) that reorganizes existing GS content mostly as-is (with possible overlap/repetition) into 3 or 4 simple usage-based "trails" (instead of the 5 feature-based pages we have now). We didn't specify which basic usage cases, but here are my initial suggestions:
WDYT? |
I think we are complicating this a bit again. We clearly have two projects, two logical trails at the moment, and I would go with a simple restructuring around those two - Data and Experiments. |
Sure, those were just initial suggestions to get feedback. Just Data and Experiments sounds good to me too. It may be harder to keep all of the existing content with only minor editing to merge 5 pages into 2 (and not end up with extremely long tutorials) but we can try and see how it looks. The idea is that this draft/proposal shouldn't take too much effort, maybe a day or 2 (after #3050). |
why do we have to merge it? :) just keep it as-is. by that I mean - just make one extra level. |
Ok yeah that's an easy first step. Not sure what it achieves... Let's see what @iesahin comes up with for now! |
OK so now we have 2 trails (data mgmt & experiment mgmt). @shcheklein @dberenbaum going back to #2496 (comment) (and other recent comments): should we separate most of the contents of Data Pipelines (and maybe metrics, etc.) into a different trail? I.e. can we prioritize #2857 now? This would include creating some new content as well as simplifying the existing one keeping in mind the original goal of having "trails": each one is comprehensive i.e. it covers all the major features of DVC appropriate for that point of view (even if there's some repetition).
|
Wrt the original title/intention of the OP:
AFAICS we use https://github.com/iterative/dataset-registry in Data Versioning and https://code.dvc.org/get-started for Data Pipelines (both incorporated into https://github.com/iterative/example-get-started), and https://github.com/iterative/example-dvc-experiments for Exp Mgmt (both pages). Do we need any more? Who owns the existing example repos now? (Should we involve CSE?)
|
IMHO we shouldn't spend much more effort on get started trails right now because:
|
OK @dberenbaum, thanks for the detailed reasoning and I mostly agree with you. Just for the record though, here's what @shcheklein and I discussed would be the top problems with the current GS trails:
Yes but that's pretty much the blocker/ critical task at this point.
That depends on what other things we can do. For now I think indeed there's lower hanging fruit of similar impact like following up on #144 (comment) and #3833 but we can also start organizing and planning the wider team to address some of the above problems. |
* start: add index for Exp Mgmt
* start: complete GS trail instructions in in index pages
* start: fix refs to example repos per #2496 (comment)
* start: bring tip out of details (indices)
* Update content/docs/start/data-management/index.md
* nav: roll back change
* Update content/docs/start/index.md
* Update content/docs/start/index.md
* Update content/docs/start/index.md
* Restyled by prettier (#4194)
Co-authored-by: Restyled.io <commits@restyled.io>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <commits@restyled.io>
Co-authored-by: Thomas Kunwar <yathomasi@gmail.com>
Closing this, we have done the first part - split into two trails and we can come back to this later when we need the next iteration (e.g. on pipelines). |
UPDATE: Jump to #2496 (comment).
Status
The docs use example-get-started for:
Checkpoints guide uses dvc-checkpoints-mnist
Goals
We want to improve the content with minimum changes to the existing documents. Adding more content to the already available material is desired.
We want to have a common/similar project for the tutorials. A single showcase project to contain all DVC features seems a bit artificial. A set of similar projects may be a better tradeoff for maintenance and usability.
DVC has different use cases for different people and we want to emphasize these:
** Data Versioning
** Data Access
** Sharing Models
** Presenting Models with Metrics and Plots
** Experiment Management and Sharing
** Checkpoints (which may be under "experiment management".)
There should be more than one entry point for the tutorials, e.g., experiment management should be a first-class citizen.
Discussion and Research Points
Current documentation is mostly pipelines-based. Almost all features revolve around dvc.yaml and the pipelines. How can we present DVC as an experiment management system without first telling about the pipelines?
How high-level should the GS docs be? We also have UC and UG documents and most of the material in GS is also relevant to these sections.
Who is our audience for GS? (ML Engineers? DevOps Engineers? DS Researchers? Students? Software Engineers?) What can we assume about them? What do we want to tell them without much low-level stuff while also staying relevant? What are their daily usage patterns?
How to evolve the example projects for each of the use cases?
How can we (or should we) present other relevant projects like Studio/CML/VSCode extension to people reading the GS pages?
Decisions and Tickets
Personas that make up the audience