How to manage repetitive dvc run commands (like unpacking of many zip files)? #1119

Closed
sotte opened this issue Sep 14, 2018 · 11 comments
Labels: enhancement, feature request, p3-nice-to-have

@sotte
Contributor

sotte commented Sep 14, 2018

Assume you download a bunch of zips.

data/
  download/
    file1.zip
    ...
    fileN.zip

Now you need to unpack the files to, let's say, data/unpacked/file{i}. Something like this:

data/
  unpacked/
    file1/
    ...
    fileN/

When I use a Makefile to manage my pipeline, I write a simple rule that unpacks the zips into their target folders. What is the best approach for something like this with DVC? Right now I just type a bunch of dvc run commands, but that does not scale well.

@efiop
Member

efiop commented Sep 14, 2018

Hi @sotte !

Very interesting scenario! We have been thinking about introducing a build matrix (#1018) to handle a slightly different scenario, but I think it could be useful in your case as well. For example, a Dvcfile placed in data/unpacked could look like:

cmd: tar -xf {deps} -C {outs}
matrix:
  - deps:
      - path: file1.tgz
    outs:
      - path: file1
  - deps:
      - path: file2.tgz
    outs:
      - path: file2
  ....

Would something like that suit you? Please feel free to share any thoughts or suggestions.

Thanks,
Ruslan

@sotte
Contributor Author

sotte commented Sep 14, 2018

That looks quite nice!
Wildcards would be nice. Something like this:

matrix:
  - deps:
      - path: file*.csv.tgz
    outs:
      - path: file*.csv

But I guess the deps and outs have to be specified explicitly.
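To illustrate the "specified explicitly" point: until wildcards exist, the explicit entries could be generated rather than typed. The sketch below is only an assumption about how that might look. The `matrix` layout is the proposal from this thread, not an implemented DVC format, and `expand_matrix` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def expand_matrix(download_dir, suffix=".csv.tgz", out_suffix=".csv"):
    """Expand a wildcard-style rule into explicit deps/outs entries.

    Mirrors the *proposed* matrix layout from this thread; it is not a
    real DVC file format. One entry is produced per matching archive.
    """
    entries = []
    for archive in sorted(Path(download_dir).glob(f"*{suffix}")):
        # Strip the archive suffix to get the shared stem, e.g. "file1".
        stem = archive.name[: -len(suffix)]
        entries.append(
            {
                "deps": [{"path": archive.name}],
                "outs": [{"path": stem + out_suffix}],
            }
        )
    return entries


if __name__ == "__main__":
    # Demo on a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("file1.csv.tgz", "file2.csv.tgz"):
            (Path(tmp) / name).touch()
        for entry in expand_matrix(tmp):
            print(entry)
```

The generated list would still have to be written out and regenerated whenever archives are added, which is exactly the bookkeeping a built-in wildcard would remove.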

@efiop
Member

efiop commented Sep 14, 2018

Glad you like it! We will take a closer look at implementing this in the near future.

Yeah, wildcards raise more questions than answers. I can't promise that they will be implemented this way, but we will certainly think about suitable ways to incorporate them in the future.

Thanks,
Ruslan

@efiop efiop self-assigned this Sep 14, 2018
@efiop efiop added the enhancement Enhances DVC label Sep 14, 2018
@efiop efiop added this to the Queue milestone Sep 14, 2018
@efiop
Member

efiop commented Sep 14, 2018

Though without wildcards or something similar, we still leave the pain point of having to specify deps and outs explicitly. We definitely need to think about optimizing this as well.

@sotte
Contributor Author

sotte commented Sep 14, 2018

True, but if you add the wildcard feature, then people (including myself ;)) would want more data pipeline features. How about rewrite rules? Parallel execution? Feature parity with Makefiles... at least ;) Do you want to go down that route?

I'm not saying that you should or should not. Just that it's a tricky balance.

@efiop
Member

efiop commented Sep 14, 2018

Well, we are already sort of going that way with the build matrix and upcoming parallel execution, so we might as well (and I think we should) take another step 🙂 Btw, what do you mean by "rewrite rules"?

@sotte
Contributor Author

sotte commented Sep 14, 2018

Wrong term, sorry. I was thinking of substitution references in Makefiles. See https://www.gnu.org/software/make/manual/make.html#Substitution-Refs
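For readers who don't know the make feature: a substitution reference such as `$(var:%.zip=%)` rewrites each word of a variable by pattern. Here is a rough Python analogue, simplified and not a full reimplementation of make's rules; `subst_ref` is a made-up name for illustration.

```python
def subst_ref(words, pattern, replacement):
    """Simplified analogue of GNU make's $(var:pattern=replacement).

    As in make, a pattern without '%' is treated as a suffix, i.e.
    $(var:a=b) behaves like $(patsubst %a,%b,$(var)).
    Words that do not match the pattern are returned unchanged.
    """
    if "%" not in pattern:
        pattern, replacement = "%" + pattern, "%" + replacement
    prefix, _, suffix = pattern.partition("%")

    result = []
    for word in words:
        matches = (
            word.startswith(prefix)
            and word.endswith(suffix)
            and len(word) >= len(prefix) + len(suffix)
        )
        if matches:
            # The stem is whatever '%' matched in the pattern.
            stem = word[len(prefix): len(word) - len(suffix)]
            result.append(replacement.replace("%", stem, 1))
        else:
            result.append(word)
    return result


if __name__ == "__main__":
    zips = ["data/download/file1.zip", "data/download/file2.zip"]
    print(subst_ref(zips, "data/download/%.zip", "data/unpacked/%"))
```

Applied to the layout from the first post, this maps each downloaded zip to its target folder, which is the mapping a Makefile rule would express in one line.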

@efiop
Member

efiop commented Sep 14, 2018

Ah, got it. Yes, I think it might be necessary as well in the long term.

@efiop
Member

efiop commented Sep 15, 2018

Btw, @shcheklein noticed that we had a similar discussion recently (https://discuss.dvc.org/t/creating-an-aggregate-dvc-file/93) and thought it worth mentioning that you can work around this by using a bash script, just like in the linked discussion.
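As a rough idea of what such an aggregate script might do, here is a minimal sketch written in Python instead of bash. The directory names come from the first post; the function name `unpack_all` and the file name `unpack_all.py` are assumptions, and this is not the script from the linked discussion.

```python
import zipfile
from pathlib import Path


def unpack_all(download_dir="data/download", unpacked_dir="data/unpacked"):
    """Extract every <name>.zip in download_dir into unpacked_dir/<name>/.

    Returns the list of target folders that were written.
    """
    targets = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        # file1.zip -> data/unpacked/file1
        target = Path(unpacked_dir) / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        targets.append(target)
    return targets


if __name__ == "__main__":
    for target in unpack_all():
        print(f"unpacked {target}")
```

Saved as `unpack_all.py`, it could be tracked as one stage, for example `dvc run -d data/download -o data/unpacked python unpack_all.py`. The trade-off is that DVC then sees the whole folder as a single output, so changing one archive re-runs the unpacking for all of them.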

@sotte
Contributor Author

sotte commented Sep 18, 2018

That could work. It feels like a dirty workaround though and not very intuitive.

I'll try it out and let you know how it works out in the long run.

@dmpetrov
Member

Great discussion! The first implementation is coming in #4734. It does not completely solve this use case, but it is a foundation for the next step, #331.

Let's close this issue and move all the discussions to "umbrella" issues #3633 & #331.
