This issue was moved to a discussion.


Reconfigurable modules #1472

Closed
dmpetrov opened this issue Jan 7, 2019 · 2 comments
Labels
feature request Requesting a new feature

Comments

@dmpetrov
Member

dmpetrov commented Jan 7, 2019

This is the comment #1462 (comment) extracted as a separate feature request. Also, @kskyten opened submodels issue #301.

Users should be able to create a "library" of reconfigurable pipelines (#1462) and reuse them across projects. Pipeline import could work through copying, Git submodules, or git clone https://my-dvc-repo.

An analogy with programming:

| DVC | Programming | Goal |
| --- | --- | --- |
| Stage | An operator (line of code) | The simplest unit |
| Pipeline (the current DVC implementation) | A function with no parameters | Organize and manage complexity in a project |
| Reconfigurable pipelines #1462 | A function with parameters | Organize and manage complexity in a project |
| Reconfigurable modules (this issue) | A function with parameters in a library (lib, so, jar file) | Reuse from different projects |
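
To make the analogy concrete, here is a minimal Python sketch of the first three rows. It is only an illustration: the function names and the toy featurization logic are made up for this example and are not part of DVC.

```python
# Stage: a single operation, like one line of code.
tokens = "DVC tracks data".lower().split()


# Pipeline (current DVC): a function with no parameters.
# Every choice (input text, lowercasing) is hard-coded.
def featurize_fixed():
    text = "DVC tracks data"
    return sorted(set(text.lower().split()))


# Reconfigurable pipeline: the same steps with the choices exposed as parameters.
def featurize(text, lowercase=True, max_features=None):
    words = text.lower().split() if lowercase else text.split()
    vocab = sorted(set(words))
    return vocab if max_features is None else vocab[:max_features]


# Reconfigurable module: the same parameterized function, but shipped in a
# separate library and imported by many projects, for example
# `from text_to_bag_of_words import featurize` (hypothetical package name).
```

The last row changes only where the function lives, not what it does, which is why it maps to reuse across projects rather than to a new way of managing complexity.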

UPDATE: Added a link to @kskyten's issue.

@dmpetrov dmpetrov added the feature request Requesting a new feature label Jan 7, 2019
@dmpetrov
Member Author

dmpetrov commented Jan 9, 2019

A possible interface to reconfigurable modules is below. This example is based on https://dvc.org/doc/get-started

# Clone a repository and pull its data (for reconfigurable modules that ship with data)
$ dvc clone https://github.com/iterative/so-dataset-posts-25K

$ dvc run -d prepare.py -d so-dataset-posts-25K/data.xml \
        -o data.tsv -o data-test.tsv \
        python prepare.py so-dataset-posts-25K/data.xml

$ dvc clone https://github.com/iterative/text-to-bag-of-words

# Run cloned module instead of:
# dvc run -d featurization.py -d data.tsv -o matrix.pkl \
#              python featurization.py data.tsv matrix.pkl
# -d1 - pass a file as the first module input/dependency (since a module can have several)
# -o1 - instantiate (create a hardlink to) the first module output as a data file
$ dvc sub text-to-bag-of-words -d1 data.tsv -o1 matrix.pkl \
              -p columns=1,2 -p lowercase=true -p max_features=9000

# Just a regular run
$ dvc run -d train.py -d matrix.pkl \
              -o model.pkl \
              python train.py matrix.pkl model.pkl

Details

dvc sub should not execute the module from its own directory, text-to-bag-of-words, since a single run might not be enough. Instead, each run should create a separate module instance in some directory (say, .dvc/inst/) with a unique suffix (for example, .dvc/inst/text-to-bag-of-words_8bf3cfe).
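
A rough Python sketch of the per-run instance idea, assuming an instance is simply a copy of the cloned module directory. The `create_instance` helper and the copy strategy are assumptions made for illustration; the proposal does not say how an instance would be materialized.

```python
import shutil
from pathlib import Path


def create_instance(module_dir: Path, project_root: Path, suffix: str) -> Path:
    """Copy a cloned module into <project_root>/.dvc/inst/<module>_<suffix>.

    Hypothetical helper: the cloned module directory itself is never run,
    so several differently configured runs can coexist.
    """
    inst_root = project_root / ".dvc" / "inst"
    inst_root.mkdir(parents=True, exist_ok=True)
    target = inst_root / f"{module_dir.name}_{suffix}"
    if not target.exists():  # reuse an instance that already exists
        shutil.copytree(module_dir, target)
    return target
```

Calling it twice with different suffixes yields two independent directories, while the cloned text-to-bag-of-words directory stays untouched.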

Connection to build cache issue

The module's unique suffix can be based on the module instance config file (not shown in the example above) and the set of params. That way DVC can easily identify similar runs, and the same mechanism can be reused as a build cache (#1234) for regular runs (not modules).
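
One way such a suffix could be derived, sketched in Python. The choice of SHA-256, the JSON canonicalization, and the 7-character length (matching the look of the 8bf3cfe example) are all assumptions; the proposal only says the suffix depends on the config file and the params.

```python
import hashlib
import json


def instance_suffix(params: dict, config_text: str = "", length: int = 7) -> str:
    """Derive a deterministic suffix from the params and config contents.

    Hypothetical helper: keys are sorted so that the order in which
    -p options are given does not change the result.
    """
    payload = json.dumps(
        {"config": config_text, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def instance_dir(module: str, params: dict, config_text: str = "") -> str:
    """Build the instance path, e.g. .dvc/inst/<module>_<suffix>."""
    return f".dvc/inst/{module}_{instance_suffix(params, config_text)}"
```

Because identical inputs always map to the same directory, checking whether that directory already exists is enough to detect a repeated run, which is the property a build cache would rely on.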

@dmpetrov
Member Author

@kskyten I'd love to hear your feedback on this.

This was referenced Feb 25, 2019
@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021
