Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metadata fields: label, type to data #8232

Merged
merged 1 commit into from
Sep 8, 2022
Merged

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Sep 2, 2022

Related: #8214, Closes #8243

This PR:

  • adds labels - a list type, and type - a string type to .dvc schema.
  • adds support for --type flag in add/import/import-url.
  • adds support for --labels flag in add/import/import-url. You can specify the flag multiple times, and also can specify labels as comma-separated list.
    $ dvc add model.pkl --labels model,get-started --labels dataset-registry
  • type/labels are preserved on rewrites/overwrites.

Example .dvc file

$ dvc add model.pkl  --desc "My model" --type model --labels get-started,dataset-registry > /dev/null
$ cat model.pkl.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: model.pkl
  desc: My model
  type: model
  labels:
  - get-started
  - dataset-registry

@skshetry skshetry added feature is a feature A: data-management Related to dvc add/checkout/commit/move/remove labels Sep 2, 2022
@skshetry skshetry self-assigned this Sep 2, 2022
@skshetry skshetry mentioned this pull request Sep 2, 2022
4 tasks
@skshetry skshetry changed the title Add metadata fields: label, type to data [DO NOT MERGE] Add metadata fields: label, type to data Sep 2, 2022
@dmpetrov
Copy link
Member

dmpetrov commented Sep 5, 2022

@skshetry looks amazing! thank you for taking an extra mile with data ls!

  • Do we need a separate command to only add/update metadata?

Not right now. Let's wait for more feedback.

  • Should we keep adding these metadata at the top level? Should we collect them under a different key?

Good question. It might be safer to roll it out under meta as it is now. Later we can "promote" these to top level. Please share your opinion.

  • What is the requirement for showing data and metadata?

It is needed for discovery. Your data ls command is a good example for a local repository. This functionality is even more useful across repositories - from a SaaS with a human UI.

An additional questions.

  1. Can we introduce a custom meta fields (it might be a separate requirement\PR).

For example. User needs a summary stats like:

$ dvc add data.csv  --desc "User data" --type data --labels user,online \
         --custom-type summary=`Rscript -e 'summary (as.numeric (readLines ("stdin")))' < data.csv`

$ cat data.csv.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 237894
  path: data.csv
  desc: User data
  type: data
  labels:
  - user
  - online
  custom:
     summary: "
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.25    3.50    3.50    4.75    6.00 "
 

$ dvc add model.pkl  --desc "My model" --type model --labels get-started,dataset-registry \
         --custom-type final-accuracy=0.82745
  1. Make sure the description can be multiline

@skshetry skshetry force-pushed the metadata branch 3 times, most recently from 10abb31 to 4aa8a28 Compare September 6, 2022 06:20
@skshetry skshetry changed the title [DO NOT MERGE] Add metadata fields: label, type to data add metadata fields: label, type to data Sep 6, 2022
@skshetry
Copy link
Member Author

skshetry commented Sep 6, 2022

I have dropped dvc data ls command from this PR. That command, support for custom metadata and dvc.yaml support will come in successive PRs.

This should be ready for a review.

@dberenbaum
Copy link
Contributor

@dmpetrov Let's separate the summary stats elsewhere since it goes into other topics like metrics. Seems like this already supports arbitrary metadata under meta, which should be enough for now I think.

@skshetry Looks great! A few minor UI notes:

I don't think we can have --meta consume multiple arguments or it creates issues like this:

$ dvc add --meta custom=val foo.txt
ERROR: the following arguments are required: targets

In other places like dvc exp run --set-param or dvc stage add, the flag has to be passed multiple times (dvc stage add -d foo.txt -d bar.txt). Can we do the same here?

Similarly, comma-separated --labels is not necessarily problematic but maybe it's better to support --label first --label second for consistency with the rest of the UI.

Finally, do the descriptions all need to end in (optional)? Other options like --file are also optional, but I don't think it needs to be explicitly stated in the help text.

@dberenbaum
Copy link
Contributor

  • Should we keep adding these metadata at the top level? Should we collect them under a different key?

Good question. It might be safer to roll it out under meta as it is now. Later we can "promote" these to top level. Please share your opinion.

No strong opinion here from me. I don't think I know enough yet about this usage to make an informed decision. I'm fine with the current implementation for now.

@skshetry
Copy link
Member Author

skshetry commented Sep 8, 2022

$ dvc add --meta custom=val foo.txt
ERROR: the following arguments are required: targets

Looks like a bug in python: python/cpython#53584, which consumes all positional arguments. Indeed, looks like we have to go with multiple --meta invocations.

Similarly, comma-separated --labels is not necessarily problematic but maybe it's better to support --label first --label second for consistency with the rest of the UI.

No strong opinion here, happy to go with --label option. While working with this, I was getting tired of typing --label too many times, that I felt that we need that.

Finally, do the descriptions all need to end in (optional)? Other options like --file are also optional, but I don't think it needs to be explicitly stated in the help text.

That's how it was in --desc. I can remove that.

@skshetry
Copy link
Member Author

skshetry commented Sep 8, 2022

Note that we do use --outs/--deps kind of naming in places where you can specify flags multiple times (except --set-param).

@aguschin
Copy link
Contributor

aguschin commented Mar 2, 2023

Hi @skshetry! We're getting back to this with @dberenbaum since we're thinking about removing artifacts.yaml part of GTO in favor of what you implemented in DVC.

You implemented this:

$ dvc data ls
 Path                            Type     Labels                              Description                               
 data.xml                        data     data-registry,get-started           imported code                             
 data/data.xml                   data     data-registry,get-started           imported                                  
 foo                             mytype1  model,get-started,dataset-registry  foo                                       
 scripts/innosetup/dvc.ico       -        -                                                                             
 scripts/innosetup/dvc_left.bmp  -        -                                                                             
 scripts/innosetup/dvc_up.bmp    -        -              
                                                                
$ dvc data ls --type data --labels model,data-registry
 Path           Type  Labels                     Description                                                            
 data.xml       data  data-registry,get-started  imported code                                                          
 data/data.xml  data  data-registry,get-started  imported

Why not simply dvc ls?

E.g., dvc ls --annotation can show what dvc data ls shows, and dvc ls -a --type model will show what dvc data ls --type data does?

@skshetry
Copy link
Member Author

skshetry commented Mar 2, 2023

Good point @aguschin. The problem is that dvc ls is data-level/file-level whereas these labels are part of a dataset.

@skshetry
Copy link
Member Author

skshetry commented Mar 2, 2023

I should note that this command is hidden/undocumented until we figure out what we need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove feature is a feature
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add labels/type/desc support in .dvc file and dvc add/import/import-url.
5 participants