Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce new data:status command #7943

Merged
merged 4 commits into from
Jul 22, 2022
Merged

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jun 28, 2022

@skshetry skshetry marked this pull request as ready for review June 28, 2022 17:19
@skshetry skshetry requested a review from a team as a code owner June 28, 2022 17:19
@skshetry skshetry requested a review from pmrowla June 28, 2022 17:19
@skshetry

This comment was marked as outdated.

@dberenbaum
Copy link
Contributor

dberenbaum commented Jul 1, 2022

Very minor suggestion: Can we make the "uncommitted changes" red instead of green? And make the "committed changes" green?

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

Very minor suggestion: Can we make the "uncommitted changes" red instead of green?

You mean the header or the complete section?

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

I don't think uncommitted changes has to be red, it should be yellow at most. It's more useful for not-in-cache. :)

dvc/commands/data.py Outdated Show resolved Hide resolved
@dberenbaum
Copy link
Contributor

Need to think about how to handle --untracked-files:

  • Should we show them by default? Seems a bit different from Git because:
    • Without normal option, it will get really busy.
    • Users don't expect to track the entire repo with DVC.
    • If we suggest doing dvc data status and git status as a pair, they become redundant for untracked files.
  • Since it's different from Git, should we try not to copy the same syntax? What about just a boolean flag? What about a slightly different name (--untracked)?

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

I'm finding the untracked section quite useful. I think in parallel, we should see if we can implement untracked-files=normal easily.
If it's too slow, we can roll back to not showing untracked files by default. If it's difficult to implement, then we can drop normal and don't show by default.

@dberenbaum
Copy link
Contributor

If I have a tracked directory and am missing the cache, it will show as added even if the .dvc file hasn't changed:

$ mkdir data
$ dvc add data
$ git add .
$ git commit -m "add data"
$ rm -rf .dvc/cache
$ dvc data status
Not in cache:
  (use "dvc pull <file>..." to update your local storage)
        data

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        added: data/

@dberenbaum
Copy link
Contributor

Can we ignore pipelines outputs where cache: false?

For example, if I run an experiment in https://github.com/iterative/example-get-started, the output looks like:

$ dvc data status
DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: data/prepared/
        modified: data/features/
        modified: model.pkl
        modified: scores.json
        modified: prc.json
        modified: roc.json
(there are other changes not tracked by dvc, use "git status" to see)
$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   dvc.lock
        modified:   params.yaml
        modified:   prc.json
        modified:   roc.json
        modified:   scores.json

no changes added to commit (use "git add" and/or "git commit -a")

Ideally, dvc data status would not show scores.json, prc.json, or roc.json.

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

Can we ignore pipelines outputs where cache: false?

It should be ignored, I tried fixing it in 231805a. I don't see in example-get-started locally. Have you fetched the latest commit from this branch?

https://github.com/skshetry/dvc/blob/de58c695516908b49cd62fe22ed7b17c64339850/dvc/commands/data.py#L259

@dberenbaum
Copy link
Contributor

Looks amazing @skshetry!

If we can decide on the default behavior for untracked files and address #7943 (comment) and #7943 (comment), I think it's ready to merge.

For untracked files, I still think we should exclude by default. Even if we can support normal mode (and we can't for now), the other reasons still apply. I agree it's useful when debugging, but in real usage I don't think it is. Most people treat DVC-tracked outputs as special for large data and wouldn't expect DVC to show the state of the whole repo, especially since they still need to use git status alongside it.

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

If I have a tracked directory and am missing the cache, it will show as added even if the .dvc file hasn't changed:

I think this is #7661. I have also noticed that. Will require a fix upstream.

https://github.com/skshetry/dvc/blob/de58c695516908b49cd62fe22ed7b17c64339850/dvc/commands/data.py#L85-L86

@dberenbaum
Copy link
Contributor

Can we ignore pipelines outputs where cache: false?

It should be ignored, I tried fixing it in 231805a. I don't see in example-get-started locally. Have you fetched the latest commit from this branch?

Yeah, unfortunately it looks like it's still an issue.

$ dvc -V
2.11.1.dev63+g4e22e4178

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

Can we ignore pipelines outputs where cache: false?

Sorry, this one is different. Even though it's a pipeline output, it's data being tracked by DVC, right?

@dberenbaum
Copy link
Contributor

Sorry, this one is different. Even though it's a pipeline output, it's data being tracked by DVC, right?

Right, I just noticed that was about not-in-cache.

🤔 I see what you mean, but I think we should ignore cache: false outputs. They are being tracked for pipeline purposes. Since we are trying to differentiate between pipeline and data management functionality here, I don't think it belongs. DVC isn't managing those file versions.

For example:

$ dvc data status
DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: data/prepared/
        modified: data/features/
        modified: model.pkl
        modified: scores.json
        modified: prc.json
        modified: roc.json
(there are other changes not tracked by dvc, use "git status" to see)

$ git restore dvc.lock

$ dvc checkout
M       data/prepared/
M       model.pkl
M       data/features/

$ dvc data status
DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        modified: scores.json
        modified: prc.json
        modified: roc.json
(there are other changes not tracked by dvc, use "git status" to see)

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

thinking I see what you mean, but I think we should ignore cache: false outputs. They are being tracked for pipeline purposes. Since we are trying to differentiate between pipeline and data management functionality here, I don't think it belongs. DVC isn't managing those file versions.

Fixed in 29c2fa1.

@skshetry
Copy link
Member Author

skshetry commented Jul 1, 2022

For untracked files, I still think we should exclude by default. Even if we can support normal mode (and we can't for now), the other reasons still apply. I agree it's useful when debugging, but in real usage I don't think it is. Most people treat DVC-tracked outputs as special for large data and wouldn't expect DVC to show the state of the whole repo, especially since they still need to use git status alongside it.

With b3dd1e020, --untracked-files are disabled by default.

@skshetry skshetry changed the title [wip]: data status command introduce new data:status command Jul 1, 2022
@skshetry skshetry self-assigned this Jul 1, 2022
@skshetry skshetry added feature is a feature A: cli Related to the CLI A: data-management Related to dvc add/checkout/commit/move/remove labels Jul 1, 2022
@skshetry skshetry requested review from efiop and removed request for pmrowla July 1, 2022 19:53
@jorgeorpinel
Copy link
Contributor

Just a quick thought on

@mattseddon @dberenbaum, I have added --with-dirs

Shouldn't be hard to merge the JSON output of both calls (with and without --granular) as an alternative. Not sure about the performance though

@skshetry skshetry force-pushed the data-status branch 2 times, most recently from 8c21f82 to 7bd6a18 Compare July 21, 2022 13:19
dvc/repo/data.py Outdated Show resolved Hide resolved
@skshetry skshetry requested a review from efiop July 21, 2022 17:08
@skshetry
Copy link
Member Author

@efiop, this should be ready for review/merge. :)

setup.cfg Show resolved Hide resolved
@mattseddon
Copy link
Member

mattseddon commented Jul 22, 2022

Started on moving to the new command. Found an issue with untracked files. In the following output untracked is relative to the git root and not the dvc repository root (dvc root is in the demo subdirectory):

{                                                                                                                                                                                                    
  "uncommitted": {
    "modified": [
      "model.pt",
      "training_metrics/scalars/loss.tsv",
      "training_metrics/scalars/acc.tsv",
      "training_metrics/",
      "misclassified.jpg",
      "predictions.json"
    ]
  },
  "untracked": [
    "demo/test.pl"
  ]
}

LMK if this was deliberate. Thanks.

@skshetry
Copy link
Member Author

Started on moving to the new command. Found an issue with untracked files. In the following output untracked is relative to the git root and not the dvc repository root (dvc root is in the demo subdirectory):

LMK if this was deliberate. Thanks.

Great catch. That's an oversight on my part. Will fix it. Thanks.

@skshetry
Copy link
Member Author

@mattseddon, I have tried fixing the sub dir untracked files issue. Let me know if it fixes for you.

@mattseddon
Copy link
Member

@mattseddon, I have tried fixing the sub dir untracked files issue. Let me know if it fixes for you.

LGTM

@efiop efiop merged commit c3ac4f8 into iterative:main Jul 22, 2022
@skshetry skshetry deleted the data-status branch July 22, 2022 13:31
@dberenbaum
Copy link
Contributor

Great work on this @skshetry!

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some string suggestions after checking iterative/dvc.org#3812.

"unchanged": "DVC unchanged files",
}
HINTS = {
"not_in_cache": 'use "dvc pull <file>..." '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we used back ticks in these texts? E.g.

Suggested change
"not_in_cache": 'use "dvc pull <file>..." '
"not_in_cache": 'Use `dvc pull <file>` '

}
HINTS = {
"not_in_cache": 'use "dvc pull <file>..." '
"to update your local storage",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • storage -> cache?
  • Missing sentence period .

Copy link
Member Author

@skshetry skshetry Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have made changes to this in successive PRs. Now it says:

(use "dvc fetch <file>..." to download files)

It's a hint, so we don't need a period.

Comment on lines +35 to +36
"committed": "git commit the corresponding dvc files "
"to update the repo",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"committed": "git commit the corresponding dvc files "
"to update the repo",
"committed": "`git commit` the corresponding DVC metafiles "
"to update the repo.",

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer quotes, it's easier on eyes. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a convention already to code-inline this kind of things?

Copy link
Member Author

@skshetry skshetry Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do use backticks on help messages, but I am not sure about the reasoning. The only other place within CLI (except help) is when we print commands in git add hints where we just indent those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only rec. here is to aim for consistency if possible. Probably not major (up to you) but it's a product quality question.

"to update your local storage",
"committed": "git commit the corresponding dvc files "
"to update the repo",
"uncommitted": 'use "dvc commit <file>..." to track changes',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"uncommitted": 'use "dvc commit <file>..." to track changes',
"uncommitted": 'Use `dvc commit <file>` to track changes.',

Comment on lines +38 to +39
"untracked": 'use "git add <file> ..." or '
'dvc add <file>..." to commit to git or to dvc',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"untracked": 'use "git add <file> ..." or '
'dvc add <file>..." to commit to git or to dvc',
"untracked": 'Use `git add <file>` or '
'`dvc add <file>` to track with Git or DVC.',

Comment on lines +40 to +41
"git_dirty": "there are {}changes not tracked by dvc, "
'use "git status" to see',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"git_dirty": "there are {}changes not tracked by dvc, "
'use "git status" to see',
"git_dirty": "There are {}changes not tracked by DVC. "
'Use `git status` to see them.',

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hint, no need to upper case it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why hints don't need capitalization or punctuation. Are they part of a more complete sentence?

if not result:
no_changes = "No changes"
if git_info.get("is_empty", False):
no_changes += " in an empty git repo"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
no_changes += " in an empty git repo"
no_changes += " in an empty Git repo"

Comment on lines +128 to +130
DATA_STATUS_HELP = (
"Show changes between the last git commit, "
"the dvcfiles and the workspace."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and migrating discussion from iterative/dvc.org#3812 (review) cc @skshetry @shcheklein:

Suggested change
DATA_STATUS_HELP = (
"Show changes between the last git commit, "
"the dvcfiles and the workspace."
DATA_STATUS_HELP = (
"Show changes between the last Git commit, "
"DVC metafiles, and the workspace."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metafile is only docs-concept
this term has not been introduced in DVC (maybe for good reasons).

I think the only reason is that we don't have a system to ensure consistency in terminology between docs and help output but we probably should. It would improve the UX.

dvcfiles is also a weird term
we should... not introduce internal terms into help / docs.

And it sounds like .dvc files specifically. What about dvc.lock?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be further simplified though, to just:

Show changes in DVC-tracked data between the last Git commit and the workspace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metafile is a confusing term. Tbh I don't even know what it is, I can guess that it's a file having metadata about something. Also, it's a logical concept, not a physical one like .dvc and dvc.yaml, so I find it to be an unnecessary redirection. Unlike docs, we can afford repetition which is not that many.

Here, I think we can just avoid mentioning the files at all.

Show changes in the data tracked by DVC in the workspace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show changes in DVC-tracked data between the last Git commit and the workspace.

yep, we can avoid metafiles, dvcfiles.


my 2cs - I find both terms suboptimal, but would probably prefer metafiles - since, yes they are technically metafiles - they contain a spec for data, yes metadata about data. At least it sounds reasonable to me. Plus we already use it in docs. May be we can avoid using this low lever terminology in help messages at all btw- that would the best option for end users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I think we can just avoid mentioning the files at all.
we can avoid using this low lever terminology in help messages

➕➕

Comment on lines +49 to +50
if old_obj is None:
return {"added": [root], **d}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of one more thing: This should be called new I think, because you can dvc add an updated file and it will be listed as modified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, it's interesting Git does "new file" ... VS Code does "added"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this one tbh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS Code does "added"

I see A (added) in both Explorer and Source Control views:

image

image

Comment on lines +158 to +165
data_status_parser.add_argument(
"--unchanged",
action="store_true",
default=False,
help="Show unmodified DVC-tracked files.",
)
data_status_parser.add_argument(
"--untracked-files",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more...

Should the 2nd option be just --untracked? Unclear why -files is only in that name (plus it can be directories too).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, let's introduce this alternative now, and deprecate the previous option, drop it later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git uses --untracked-files, so that's a major reason why this is this way. Also, --untracked-files at the moment is recursive, so the files is more correct. There are some questions regarding it's behaviour: #8061, so I'd defer it until then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK (up to you) but remember that Git's UI is not very good. We should not aim to copy it just because it's Git. Consistency is more important IMO. And "untracked" can refer to plural (files) so both are correct, I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cli Related to the CLI A: data-management Related to dvc add/checkout/commit/move/remove feature is a feature
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

8 participants