use-cases: improve Data Registry case #795

Closed
5 of 6 tasks
Suor opened this issue Nov 15, 2019 · 16 comments · Fixed by #818
Labels: A: docs (Area: user documentation, gatsby-theme-iterative); C: cases (Content of /doc/use-cases); type: enhancement (Something is not clear, small updates, improvement suggestions)

Comments

@Suor
Contributor

Suor commented Nov 15, 2019

I took a look at the new Data Registry page. Congrats @jorgeorpinel on compiling it, and please don't be angry if you get a bit more than you expected :)

Here go some considerations big and small as they occur in the text.


Furthermore, the version of the data file imported to B can be an older iteration than what's currently used in A.

  • Time is somewhat non-linear when branches are involved, so it might not be correct to talk about an "older iteration". Should we just say that B might refer to any version of an artifact from A?

Keeping this in mind, we could build a

could -> can

This way we would have a repository that has all the metadata and change history for the project's data.

  • I don't understand this sentence. "we would ... change history", what does it mean? Why do we want to change any history? What does "history of projects data" mean?

Also, why do we push data-registry from the start? Can't one repo simply depend on the other repo? Registry is only one scenario. The majority of advantages listed do not require data registry. I understand that this use case is named "Data Registry", but maybe we can lay some trail to it?

features such as change history

  • What does "change history feature" mean?

Example

  • This story about wget and the past brings confusion. May we just state the problem (the current state of it) and solve it? The whole story then starts from the middle, we present commands that do something else than expected and then say so in some bubble. I needed to reread it several times to get what this is about.

A dataset we use for several of our examples and tutorials is one containing

  • ... tutorials contains ...

We partitioned the dataset in two ...

@shcheklein shcheklein changed the title Notes on Data Registry page notes on Data Registry page Nov 15, 2019
@shcheklein shcheklein added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions use-cases labels Nov 15, 2019
@shcheklein shcheklein changed the title notes on Data Registry page improve Data Registry use case Nov 15, 2019
@jorgeorpinel jorgeorpinel changed the title improve Data Registry use case use-cases: improve Data Registry case Nov 19, 2019
@jorgeorpinel
Contributor

jorgeorpinel commented Nov 20, 2019

could -> can

I would agree if this was a tutorial or get started chapter with reproducible commands but here we are actually talking in a more hypothetical way.

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 20, 2019

why do we push data-registry from the start? Can't one repo simply depend on the other repo? Registry is only one scenario. The majority of advantages listed do not require data registry.

Actually, we first mention that "Instead of adding it to both projects, B can simply import it from A." (implying simple project dependency). Like you noticed, this use case is about data registries, so that's why we focus on them.

As for the list of advantages: I had the exact same concern with @shcheklein at first haha. Most of them are not specific to data registries. The reason we have them all listed there, though, is that we hope use cases can serve as a bit of marketing, since we imagine they can be landing pages for some users who reach the use case directly from a search engine (the first web page they ever see on our website). So we are selling DVC as a whole here.

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 20, 2019

We split the dataset into two ...

What's wrong with "partitioned"? We then use the word "parts" several times.

@jorgeorpinel
Contributor

@Suor please notice my comments on some of your feedback above. I've also started addressing your comments in #805, but maybe wait a little before reviewing that, until we have some more agreement in the discussions here. (Only the larger point about the example I haven't addressed or replied to yet.)

@Suor
Contributor Author

Suor commented Nov 20, 2019

I would agree if this was a tutorial or get started chapter with reproducible commands but here we are actually talking in a more abstract/hypothetical way.

  • "can" is used all over the place there, this "could" stands out.

Actually, we first mention that "Instead of adding it to both projects, B can simply import it from A." (implying simple project dependency). Like you noticed, this use case is about data registries, so that's why we focus on them.

It is presented in too abstract a way, I guess: you read through it and have nothing to fix your mind on. And then you make a huge conceptual jump with "Keeping this in mind ...". There is no way that, keeping this (the mere possibility of import) in mind, I would come to Data Registry in one step. So the whole Data Registry looks like a solution without a problem.

  • Maybe we should present/state a problem in-between, like a graph of dependencies becoming messy when you have lots of dvc repos depending on each other chaotically.
    Or a need to have some data catalog, if that is literally a data registry and not, say, a model one.

What's wrong with "partitioned"? We then use the word "parts" several times.

  1. Partitioning has complicated uses like partitioning disk or database tables. Maybe it's just me, but it brings all those connotations for no use.
  2. The phrase train/test split is a common pattern in data science, e.g. it's called train_test_split() in sklearn.
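
For illustration only, here is a minimal standard-library sketch of what a train/test split does. It is not sklearn's implementation: the function name `simple_train_test_split` and its parameters are hypothetical and only mirror the general idea.

```python
import random


def simple_train_test_split(items, test_size=0.25, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists.

    Hypothetical helper for illustration; not sklearn's train_test_split().
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    # Reserve at least one item for the test part.
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = simple_train_test_split(range(10), test_size=0.2)
```

The two returned lists are disjoint and together cover the whole input, which is what "split" suggests to a data science reader.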

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 21, 2019

"can" is used all over the place there, this "could" stands out.

OK. I'm changing the later "can" words in this same hypothetical context to "could". Notice that there's also a "would". I rephrased other parts of the paragraph too. See #805 (review).

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 21, 2019

It is presented in a too abstract way I guess (nothing to fix your mind on). And then you make a huge conceptual jump with "Keeping this in mind"... the whole Data Registry looks like a solution without a problem.

@Suor "Keeping this in mind" is supposed to refer to "DVC also includes the dvc get, dvc import, and dvc update commands." (before the abstract project A and B example). I don't think it's a huge jump, but I guess it could definitely be rewritten more clearly. We just want to avoid making this text too long, so a separate paragraph giving a full concrete example of simple project dependency (which is not the topic of the use case) is a bit problematic. Also, we want to insert a data registry diagram near the top, but it may not make sense until the paragraph where it's actually introduced.

Perhaps we should just remove the abstract example altogether and find another place to talk about project dependency?

  • Or change the dependency mention so it's not confusing...

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 21, 2019

I see you also left a clear suggestion for this @Suor... (I missed that comment before. 😅)

Maybe we should present/state a problem in-between, like a graph of dependencies becoming messy when you have lots of dvc repos depending on each other chaotically.

So yes, the problem here is that we don't want to talk about project dependency but about data registries. And keep it as short as possible. I think we're assuming people will know/understand the problems that can use this solution. What do you think @shcheklein? Also about my suggested way to address this concern:

Perhaps we should just remove the abstract example altogether and find another place to talk about project dependency? Or change the dependency mention so it's not confusing...

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 21, 2019

Me again. OK, while you guys think about it, I simplified the project inter-dependency mention per the following ideas above:

Maybe we should present/state a problem in-between, like a graph of dependencies becoming messy
change the dependency mention so it's not confusing

You may see the exact changes and continue this discussion on #805 (review).

@jorgeorpinel
Contributor

Last item pending here @Suor:

This story about wget and the past brings confusion... I needed to reread it several times to get what this is about. May we just state the problem (the current state of it) and solve it?

I agree that the intro to the example is a bit weird... It's similar to the old note project A and B example where we tried to just kind of mention something but in no more than one paragraph, so it ended up being too brief perhaps. I like how your suggestion of just stating problem and solving it sounds, but I'm not sure how exactly that would look. In a way this story is meant to state the problem. I'll think about this...

The whole story then starts from the middle

It actually starts from scratch though: 1) dataset split in 2 on a storage server, parts downloaded with wget 2) same files moved to dvc repo, to download with dvc get instead. Then the next paragraph goes into changing from 2 files to 2 versions instead (proper data registry).

we present commands that do something else than expected and then say so in some bubble

Again, since it's not a tutorial or get started chapter, we don't intend to provide end-to-end reproducible commands. We decided to add the expandable sections in case someone actually ran them and didn't get the expected results.

@jorgeorpinel
Contributor

Alright. I tried to improve the logic of the example, not a major rewriting but significant rephrasing involved. Please review PR#805 and let's move this discussion over there. Please open reviews/comments there as needed.

@shcheklein
Member

shcheklein commented Nov 25, 2019

Also, why do we push data-registry from the start? Can't one repo simply depend on the other repo? Registry is only one scenario. The majority of advantages listed do not require data registry. I understand that this use case is named "Data Registry", but maybe we can lay some trail to it?

@jorgeorpinel has answered this already, but I think the problem also comes from the way we introduce it. We go from using regular imports/gets to setting up a dedicated data registry. Instead, we should be comparing no DVC at all for data management (that is, ad-hoc conventions and a total mess on S3) with the DVC Data Registry, which effectively provides some "meta" information for the same data on S3.

It means that I would not emphasize that it's a mess to chain multiple imports/gets. It's a mess to not use anything to organize data. And a data registry is just one of the ways to organize it.

@jorgeorpinel
Contributor

I would not emphasize that it's a mess to chain multiple imports/gets. It's a mess to not use anything to organize data.

Good catch. Will review.
