import: allow "chaining" imports somehow (except circular?) #2610

Closed
jorgeorpinel opened this issue Feb 11, 2020 · 24 comments · Fixed by #2705

Comments

@jorgeorpinel
Contributor

More context around https://discordapp.com/channels/485586884165107732/485596304961962003/676405501940072453

As of now, imported data is not cached by default, so you won't be able to import any imported data:

repo1 -> ✔️ dvc import data -> repo2 🙂 -> ✖️ dvc import data -> repo3 🙁

Somehow allowing this could be useful for the case where you're building a data registry based on other previous smaller DVC repos, for example. Right now you have to dvc get and then dvc add those artifacts from scratch in the data registry (so they can be imported in further DVC repos).
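The get-then-add workaround mentioned above could be scripted roughly as follows. This is a minimal sketch, assuming the `dvc` CLI is installed and on PATH; the helper names and arguments are hypothetical, and only the `dvc get <url> <path> -o <out>` and `dvc add <path>` invocations come from the DVC command line.

```python
import subprocess


def registry_copy_commands(source_repo_url, path, out):
    """Build the two commands that re-add an artifact from scratch.

    Hypothetical helper: it only assembles argument lists. Note that the
    resulting .dvc file no longer records source_repo_url, which is the
    drawback this issue is about.
    """
    return [
        ["dvc", "get", source_repo_url, path, "-o", out],
        ["dvc", "add", out],
    ]


def run_workaround(source_repo_url, path, out, cwd="."):
    # Run both commands inside the data registry checkout (cwd).
    for cmd in registry_copy_commands(source_repo_url, path, out):
        subprocess.run(cmd, cwd=cwd, check=True)
```

Keeping the command construction separate from execution makes it easy to inspect what would run before touching the registry.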

Ruslan mentioned something about using "links" to implement this (on Discord).

UPDATE: Go to https://github.com/iterative/dvc/issues/3305#issuecomment-836503176

@jorgeorpinel
Contributor Author

except circular

This means repo1 -> import into repo2 -> import back to repo1 of course cannot be allowed.

So maybe the solution is that when you import an import stage, you simply copy the DVC-file as-is (with its original source repo URL, rev, etc.), and if the original rev_lock exists in the present repo, the import recognizes a circular import and fails.

@efiop
Member

efiop commented Feb 11, 2020

Just a note that we need to be careful about this and consider all the possible corner cases (e.g. circular dependencies). If we allow this, dvc will have to behave like a proper package manager when resolving dependencies, which is very hard (remember pip dep resolution PEP?).

@tadejsv

tadejsv commented Dec 16, 2020

This would be a very important feature for me. The use case is the following: files (datasets, pretrained models...) are generated across many different repositories, so I need a central data registry to easily catalogue data. Also, in case some of the original data-creating repositories are renamed or merged together, I only need to change the import in the data registry and not have to track down every single user of the data.

@pared
Contributor

pared commented Jan 7, 2021

It seems like this issue consists of two stages:

EDIT: we actually have an open issue for the first one:
iterative/dvc#2710

@pared
Contributor

pared commented May 10, 2021

Seems like #2079 was fixed by iterative/dvc#5324. Need to verify how it influences this issue.

  • Is chaining imports working now?
  • Do we have an error if we try to do "circular" import?

@pared
Contributor

pared commented Jun 2, 2021

OK, so to check whether chaining imports works, I created the following test:

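import os
import shutil

import pytest


def cleanup_repo(repo_dir):
    # Drop the repo's cache and its workspace copy of the data so that
    # later imports cannot be served from local files.
    cache = repo_dir.dvc.config["cache"]["dir"]
    shutil.rmtree(cache)
    os.remove(repo_dir / "data")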


def add_remote_and_push(repo_dir):
    repo_dir.add_remote(
        name="str", url=str(repo_dir) + "_storage", default=True
    )
    repo_dir.dvc.push()


@pytest.mark.parametrize("with_cleanup", [0, 1])
@pytest.mark.parametrize("with_remote", [0, 1])
def test_chained_import(
    tmp_dir, scm, dvc, erepo_dir, make_tmp_dir, with_cleanup, with_remote
):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen("data", "data content", commit="add data")

    repos = []
    for i in range(5):
        repos.append(make_tmp_dir(f"another_{i}", scm=True, dvc=True))

    previous = erepo_dir
    for index, r in enumerate(repos):
        with r.chdir():
            stage = r.dvc.imp(str(previous), "data", out="data")
            r.scm.add([stage.dvcfile.relpath])
            r.scm.commit("import data")
            previous = r

            if with_remote:
                add_remote_and_push(r)

            # to check if chained dependencies are resolved to source,
            # get rid of data and caches for intermediate repos
            if with_cleanup:
                cleanup_repo(r)

    from funcy import last

    latest = last(repos)
    stage = dvc.imp(str(latest), "data", out="imported_data")
    scm.add([stage.dvcfile.relpath])
    scm.commit("add data")

    assert (tmp_dir / "imported_data").read_text() == "data content"

    # check circular import
    with erepo_dir.chdir():
        erepo_dir.dvc.imp(str(tmp_dir), "imported_data", out="circular_import")
        assert (erepo_dir / "circular_import").read_text() == "data content"

So, answering my own questions from the previous comment.
TL;DR:

  1. Do we support chaining imports now? Yes, but in a very limited scope.
  2. Do we have an error if we try to do a "circular" import? No.

Some more context:

intermediate repo - repository that is one of the links between source repo and our last repo

  1. Chaining imports works only if we do not have default remotes in intermediate repos, and the cache for the last link in the "chain" is present and contains the imported assets (in our test, that is when both parameters are set to 0). Lack of remote causes an error, most likely due to iterative/dvc#4527 ("import: Allow pushing imported files to remote").

  2. In the special case where chaining imports is allowed (as described in 1.), we do not get an error on a circular import. That is not dangerous, because the import sources its data from the target repo (the last link in the chain of imports), and not from the source repo (the first link).

In conclusion, the issue has not been fixed. Only a very specific case has been fixed in iterative/dvc#5324.

During research for this comment, I came to the conclusion that, depending on how we decide to resolve the import assets, iterative/dvc#4527 and iterative/dvc#2599 might be prerequisites for this issue. That will be the case if we decide to "follow the latest link", i.e. import from the target repository, rather than try to resolve whole chain of imports.
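To make the difference between the two options concrete, here is a minimal sketch of which repos would have to be reachable under each one. The function and the strategy names are hypothetical labels for the two approaches discussed here, not DVC options.

```python
def repos_needed(chain, strategy):
    """Return the repos that must be reachable to perform an import.

    chain lists repos from the original source to the target we import
    from, e.g. ["repo1", "repo2", "repo3"].
    """
    if not chain:
        raise ValueError("chain must contain at least the target repo")
    if strategy == "follow_latest_link":
        # Only the target is consulted, so it must store the imported
        # data itself (which is what iterative/dvc#4527 asks for).
        return chain[-1:]
    if strategy == "resolve_whole_chain":
        # Every link is consulted on the way back to the source.
        return list(chain)
    raise ValueError(f"unknown strategy: {strategy}")
```

For a three-repo chain, the first strategy needs only `repo3`, while the second needs all three.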

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jun 2, 2021

rather than try to resolve whole chain

Wouldn't that be the easiest approach? And is it very difficult? I imagine all you need is all the .dvc files in the chain, and the configuration of the first link. Assuming you can connect to all the Git repos, it seems doable?

That doesn't mean iterative/dvc#4527 and iterative/dvc#2599 aren't valuable too but this way they're separate concerns.

Circular deps can be prevented by not allowing .dvc files to repeat (e.g. by md5) when rebuilding the chain. There could also be a reasonable set limit of links, say 8. Are there other edge cases e.g. race conditions or something?
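A rough sketch of that idea, assuming the chain can be walked one link at a time. `next_link` and `fingerprint` are hypothetical callables standing in for real repo access, and the limit of 8 is just the number floated above, not a DVC setting.

```python
MAX_LINKS = 8  # the "reasonable set limit" suggested above


class ChainError(Exception):
    pass


def walk_chain(start, next_link, fingerprint, max_links=MAX_LINKS):
    """Follow import links from start back to the original source.

    next_link(stage) returns the upstream stage, or None at the source.
    fingerprint(stage) returns a hashable id such as the .dvc file md5.
    """
    seen = set()
    chain = []
    stage = start
    while stage is not None:
        fp = fingerprint(stage)
        if fp in seen:
            raise ChainError(f"circular import: {fp} repeats in the chain")
        if len(chain) >= max_links:
            raise ChainError(f"import chain exceeds {max_links} links")
        seen.add(fp)
        chain.append(stage)
        stage = next_link(stage)
    return chain
```

The repeat check catches cycles as soon as a fingerprint is seen twice; the length cap only bounds very long, non-circular chains.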

@pared
Contributor

pared commented Jun 3, 2021

Wouldn't that be the easiest approach?

@jorgeorpinel do you mean resolving the whole chain? I think that the easiest would be to import from target repo, but that would need iterative/dvc#4527. That way the rules of import would be easy: we import from the target repo, that's all. In the case of resolving the whole chain, if any link is missing, the import fails.

@jorgeorpinel jorgeorpinel changed the title import: allow chaining imports somehow? (except circular) import: allow "chaining" imports somehow? (except circular) Jun 3, 2021
@jorgeorpinel jorgeorpinel changed the title import: allow "chaining" imports somehow? (except circular) import: allow "chaining" imports somehow? (except circular?) Jun 3, 2021
@jorgeorpinel jorgeorpinel changed the title import: allow "chaining" imports somehow? (except circular?) import: allow "chaining" imports somehow (except circular?) Jun 3, 2021
@jorgeorpinel
Contributor Author

jorgeorpinel commented Jun 3, 2021

the easiest would be to import from target repo, but that would need iterative/dvc#4527

You're right, that would be simpler in terms of behavior. Even circular imports could happen and be fine under that approach.

But it wouldn't really constitute "chaining". Let's call it "cascading" for now. My point is that it doesn't really answer this issue fully:

What if the target repo link doesn't keep imports in remote storage by choice (e.g. to save on storage costs)? dvc import would still fail, and we may have this same request again (to look for the import in previous links).

That said, maybe it's a good limitation, to prevent people from inadvertently importing from underlying sources with unknown reputation, thinking it's coming from a project they trust (unless a clear warning/confirmation is given).

Idk what the answer is. I'd check with @dberenbaum et al at this point 🙂

@dberenbaum
Contributor

I looked quickly at the Discord message but didn't really get the context. It sounded like it's pretty easy to work around this by doing get/add, so I'm unclear how much this is a need? Maybe the simplest solution is to improve the messaging so that it's clear what's happening when someone tries to do a chained import, and to explain what to do.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jun 3, 2021

Yeah it's an old chat. I wouldn't worry too much about that particular case but indeed it's unclear whether this is needed. At some point it was prioritized as p2 so we probably thought it was somewhat important.

easy to work around this by doing get/add

I don't think that's a workaround because you can't even get an import. The Q is if there's a repo somewhere (which you don't control) with imported data, and you want to get/import from there. The workaround would be to rebuild the import chain yourself (analyzing all .dvc files, assuming you can access all the repos in the chain) and import from the first link.

@pared
Contributor

pared commented Jun 7, 2021

It seems that this particular use case is not "that" popular. It seems that it could be solved by simply implementing iterative/dvc#4527. I think what I am looking for is if there is a reason/use case when we would like to resolve whole import chain and import data from source, rather than import it from last link of import chain.

@jorgeorpinel
Contributor Author

OK, agree.

@dberenbaum
Contributor

What's the status of this after iterative/dvc#6109?

@pmrowla
Contributor

pmrowla commented Jun 30, 2021

After iterative/dvc#6109, chained imports should work, and if there is a circular import in the chain, DVC will error out.

It currently works by resolving the entire import chain each time, and requires that DVC be able to access all of the repos in the chain (and all of the default remotes for each repo in the chain). This requirement remains true even after the initial dvc import (so to be able to pull or update the import, you still need access to each of the repos & remotes).

This example script shows how it works: https://gist.github.com/pmrowla/c1e86bc41acc05d06a3752d8e8700e4a

(The script creates 4 repos that each have their own separate default remotes)

We start with 2 repos that each contain a single file:

/Users/pmrowla/git/scratch/import-test/repo/a
├── foo
└── foo.dvc

/Users/pmrowla/git/scratch/import-test/repo/b
├── bar
└── bar.dvc

Next we add a 3rd repo:

/Users/pmrowla/git/scratch/import-test/repo/c
└── dir
    ├── bar
    ├── bar.dvc
    ├── foo
    ├── foo.dvc
    ├── subdir
    │   └── baz
    └── subdir.dvc

In this repo, dir contains imports for the files from repos A and B, as well as its own DVC-tracked directory (dir/subdir).

# dir/foo.dvc
md5: d652071a8f0fd9f5c74d9348a468dec5
frozen: true
deps:
- path: foo
  repo:
    url: /Users/pmrowla/git/scratch/import-test/repo/a
    rev_lock: 32ab3ddc8a0b5cbf7ed8cb252f93915a34b130eb
outs:
- md5: acbd18db4cc2f85cedef654fccc4a4d8
  size: 3
  path: foo
# dir/subdir.dvc
outs:
- md5: 630bd47b538d2a513c7d267d07e0bc44.dir
  size: 3
  nfiles: 1
  path: subdir
# dir/bar.dvc
md5: 214f215367e03b341128764728577ae1
frozen: true
deps:
- path: bar
  repo:
    url: /Users/pmrowla/git/scratch/import-test/repo/b
    rev_lock: 2e72278fd0f097bf932932a60bfe75a4dd019e8b
outs:
- md5: 37b51d194a7513e45b56f6524f2d51f2
  size: 3
  path: bar

In the final repo, we import dir from the 3rd repo. This will import dir/subdir from repo C, as well as the chained imports from A and B.

/Users/pmrowla/git/scratch/import-test/repo/d
├── dir
│   ├── bar
│   ├── foo
│   └── subdir
│       └── baz
└── dir.dvc

# dir.dvc
md5: bfdac3f7c77bdd89dcd1f6d22f5e39c5
frozen: true
deps:
- path: dir
  repo:
    url: /Users/pmrowla/git/scratch/import-test/repo/c
    rev_lock: 15136ed84b59468b68fd66b8141b41c5be682ced
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
  size: 27
  nfiles: 4
  path: dir

Note that the DVC file for the final import only references repo C. We do not record that foo and bar are "chained", and will always need to look up the contents of dir from repo C. The chain resolution happens each time you run import/update/pull, so repo D always needs access to everything else (repos and remotes) in the chain on import/update/pull.
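The point that dir.dvc records only its direct source can be checked mechanically. A small sketch, assuming the .dvc file has already been loaded from YAML into a plain dict; the helper name is hypothetical.

```python
def import_sources(dvcfile):
    """List (url, rev_lock, path) for each repo dependency in a .dvc file.

    dvcfile is the mapping obtained by loading the YAML. An empty result
    means the file is a plain `dvc add` stage, not an import.
    """
    sources = []
    for dep in dvcfile.get("deps", []):
        repo = dep.get("repo")
        if repo:
            sources.append((repo["url"], repo.get("rev_lock"), dep["path"]))
    return sources


# Contents of dir.dvc from repo D, as listed above.
dir_dvc = {
    "md5": "bfdac3f7c77bdd89dcd1f6d22f5e39c5",
    "frozen": True,
    "deps": [
        {
            "path": "dir",
            "repo": {
                "url": "/Users/pmrowla/git/scratch/import-test/repo/c",
                "rev_lock": "15136ed84b59468b68fd66b8141b41c5be682ced",
            },
        }
    ],
    "outs": [
        {
            "md5": "e784c380dd9aa9cb13fbe22e62d7b2de.dir",
            "size": 27,
            "nfiles": 4,
            "path": "dir",
        }
    ],
}
```

Calling `import_sources(dir_dvc)` yields a single entry pointing at repo C; repos A and B appear nowhere in the file.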

# in repo D
$ rm -rf .dvc/cache
$ dvc pull -v
2021-06-30 11:02:27,560 DEBUG: Check for update is disabled.
2021-06-30 11:02:27,595 DEBUG: Creating external repo /Users/pmrowla/git/scratch/import-test/repo/c@15136ed84b59468b68fd66b8141b41c5be682ced
2021-06-30 11:02:27,595 DEBUG: erepo: git clone '/Users/pmrowla/git/scratch/import-test/repo/c' to a temporary dir
...
2021-06-30 11:02:27,981 DEBUG: erepo: git clone '/Users/pmrowla/git/scratch/import-test/repo/b' to a temporary dir
...
2021-06-30 11:02:28,070 DEBUG: erepo: git clone '/Users/pmrowla/git/scratch/import-test/repo/a' to a temporary dir
...
2021-06-30 11:02:28,156 DEBUG: Downloading '../../remote/c/63/0bd47b538d2a513c7d267d07e0bc44.dir' to '.dvc/cache/63/0bd47b538d2a513c7d267d07e0bc44.dir'
2021-06-30 11:02:28,170 DEBUG: state save (114461256, 1625018548170011136, 266) e784c380dd9aa9cb13fbe22e62d7b2de.dir
...
2021-06-30 11:02:28,179 DEBUG: Downloading '../../remote/b/37/b51d194a7513e45b56f6524f2d51f2' to '.dvc/cache/37/b51d194a7513e45b56f6524f2d51f2'
...
2021-06-30 11:02:28,187 DEBUG: Downloading '../../remote/a/ac/bd18db4cc2f85cedef654fccc4a4d8' to '.dvc/cache/ac/bd18db4cc2f85cedef654fccc4a4d8'
...
2021-06-30 11:02:28,196 DEBUG: Downloading '../../remote/c/73/feffa4b7f6bb68e44cf984c85f6e88' to '.dvc/cache/73/feffa4b7f6bb68e44cf984c85f6e88'
...
4 files fetched
2021-06-30 11:02:28,211 DEBUG: Analytics is disabled.

Note that the full import chain is resolved on dvc pull (all 3 of the imported repos end up being cloned), and that individual files are pulled from the original remotes.

  • .dir cache for repo C's dir/subdir is fetched from remote C
  • baz is fetched from remote C
  • foo is fetched from remote A
  • bar is fetched from remote B

If the user importing/pulling into repo D did not have access to all 3 of the remotes, the pull would fail.
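The fetch pattern listed above can be reproduced with a toy model. This is only an illustration of the described behaviour, not DVC's implementation; the `REPOS` structure and both function names are made up for the example.

```python
# Each repo maps a tracked path either to None (data it owns, stored in
# its own default remote) or to (upstream_repo, upstream_path) for an import.
REPOS = {
    "a": {"foo": None},
    "b": {"bar": None},
    "c": {
        "dir/foo": ("a", "foo"),
        "dir/bar": ("b", "bar"),
        "dir/subdir/baz": None,
    },
}


def origin_remote(repos, repo, path):
    """Follow imports until reaching the repo that owns the data."""
    while repos[repo][path] is not None:
        repo, path = repos[repo][path]
    return repo


def fetch_plan(repos, repo, prefix):
    """Map every file under prefix in repo to the remote it comes from."""
    return {
        path: origin_remote(repos, repo, path)
        for path in sorted(repos[repo])
        if path.startswith(prefix + "/")
    }
```

In this model, a repo that is missing from `repos` raises a `KeyError` during resolution, which mirrors the pull failing when any repo or remote in the chain is inaccessible.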

The example script also shows what happens for circular imports. If we now go into repo A, and try to import from repo D, it will form a circular import and fail (because D imports from C which imports from A)

# in repo A
$ dvc import /Users/pmrowla/git/scratch/import-test/repo/d dir
Importing 'dir (/Users/pmrowla/git/scratch/import-test/repo/d)' -> 'dir'
ERROR: failed to import 'dir' from '/Users/pmrowla/git/scratch/import-test/repo/d'. - 'dir (/Users/pmrowla/git/scratch/import-test/repo/d)' contains invalid circular import. DVC repo '/Users/pmrowla/git/scratch/import-test/repo/d' already imports from '/Users/pmrowla/git/scratch/import-test/repo/a'.
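The check behind this error can be pictured as a reachability test over the import graph. The sketch below is an assumption about the general shape of such a check, not the code from iterative/dvc#6109; `imports` is a hypothetical, pre-collected mapping from each repo to the repos its .dvc files import from.

```python
class CircularImportError(Exception):
    pass


def upstream_repos(imports, repo, _seen=None):
    """Return every repo that repo imports from, directly or transitively."""
    seen = set() if _seen is None else _seen
    for source in imports.get(repo, ()):
        if source not in seen:
            seen.add(source)
            upstream_repos(imports, source, seen)
    return seen


def check_import(imports, current_repo, target_repo):
    """Refuse to import from a target whose chain leads back to us."""
    reachable = upstream_repos(imports, target_repo)
    if current_repo == target_repo or current_repo in reachable:
        raise CircularImportError(
            f"DVC repo '{target_repo}' already imports from '{current_repo}'."
        )
```

With `imports = {"d": {"c"}, "c": {"a", "b"}}`, importing from `d` into `a` is rejected, while importing from `d` into an unrelated repo passes.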

@pmrowla
Contributor

pmrowla commented Jun 30, 2021

I'm not sure if this implementation meets the needs for closing this issue or not.

I think there was some discussion before regarding whether or not the final import into D should only require access to repo C? (this would require pushing all of C's imported files into C's default remote, rather than fetching them from their original locations in remotes A & B)

I'm also not sure whether or not we want to document that this is actually a supported use case

@dberenbaum @jorgeorpinel

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jun 30, 2021

chained imports should work, and if there is a circular import in the chain, DVC will error out.
It currently works by resolving the entire import chain each time, and requires that DVC be able to access all of the repos in the chain (and all of the default remotes

That qualifies as "allowing chaining imports somehow" to me! I.e. it may close this ticket (condition explained below).

not sure whether or not we want to document that this is actually a supported use case

I'm just also curious whether the current behavior is what we want long term, given earlier comments e.g. https://github.com/iterative/dvc/issues/3305#issuecomment-855792862 - "it could be solved by simply implementing iterative/dvc#4527... what I am looking for is if there is a reason/use case when we would like to resolve whole import chain" cc @pared @dberenbaum

If this import chain feature (based on iterative/dvc#6109) is definitive, let's document it. E.g. it can be a mention and examples in import/update initially, and a how-to or part of future guides and use cases later.

@pmrowla
Contributor

pmrowla commented Jun 30, 2021

The one other thing to note would be that running dvc update dir from repo D will only check if dir has changed in repo C.

So if foo has changed in repo A, but no one has run dvc update dir/foo in repo C, the dvc update dir from repo D won't do anything (since the intermediate import .dvc file for foo in repo C will be unchanged)
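A toy model of that behaviour, assuming each import simply pins one upstream revision. The revision names (`a1`, `c1`, ...) and the `dvc_update` function are invented for illustration and do not reflect DVC internals.

```python
def dvc_update(pin, upstream_head):
    """Toy model of `dvc update`: compare the stored rev_lock with the
    upstream repo's current revision and re-pin if they differ."""
    changed = pin != upstream_head
    return upstream_head, changed


# Repo A has moved on to a new commit, but repo C still pins the old one.
a_head, c_head = "a2", "c1"
c_pin_on_a, d_pin_on_c = "a1", "c1"

# Updating in D looks only at C, so nothing changes yet.
d_pin_on_c, changed_before = dvc_update(d_pin_on_c, c_head)

# Someone runs `dvc update dir/foo` in C and commits, giving C a new revision.
c_pin_on_a, _ = dvc_update(c_pin_on_a, a_head)
c_head = "c2"

# Now the same update in D picks up the change.
d_pin_on_c, changed_after = dvc_update(d_pin_on_c, c_head)
```

The first update in D is a no-op because C's revision is unchanged; only after C itself is updated does D see anything new.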

@dberenbaum
Contributor

I'm just also curious whether the current behavior is what we want long term, given earlier comments e.g. #3305 (comment) - "it could be solved by simply implementing iterative/dvc#4527... what I am looking for is if there is a reason/use case when we would like to resolve whole import chain" cc @pared @dberenbaum

Hmm, I'm not sure what's the best approach. It's probably best to document the chained imports for now since that's the current implementation and revisit whether imports should be "backed up" later if it becomes a more obvious need.

@dberenbaum dberenbaum transferred this issue from iterative/dvc Jul 6, 2021
@dberenbaum
Contributor

Moving to dvc.org since it seems like the remaining work is documentation.

@jorgeorpinel We might be able to better support iterative/dvc#4527 in the future, but until then, IMO it would be useful to be clearer that imports rely on the original source remote and users need access to that, including for chained imports. Thoughts?

@pmrowla Can you take this one if we need to document it? Is there someone else who should be assigned?

@pmrowla
Contributor

pmrowla commented Jul 6, 2021

@dberenbaum I'm guessing it will have to be me since I did the current implementation, you can just assign it to me if/when needed

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jul 6, 2021

Thanks guys (and no rush), we'll take it over as soon as we have all the basic info.

running dvc update dir from repo D will only check if dir has changed in repo C

This part may deserve its own issue in iterative/dvc though?

@dberenbaum
Contributor

My thoughts: The current behavior could be expected or unexpected depending on the circumstances. There's not a lot of user feedback outside of the one user in iterative/dvc#4527, so I'm not sure it's worth keeping the discussion going. Can we document what we have and revisit if users complain or are confused?

@pmrowla
Contributor

pmrowla commented Jul 7, 2021

running dvc update dir from repo D will only check if dir has changed in repo C

This part may deserve its own issue in iterative/dvc though?

This seems like the expected behavior to me. In this case, I am essentially importing a pinned/frozen stage from repo C. When I run dvc update in repo D, I want to check whether or not the frozen stage in repo C has been updated (meaning my final import into D should only be changed if/when someone has run dvc update in repo C and git pushed that update to repo C).

@efiop efiop added this to To do in DVC 13 July - 26 July 2021 via automation Jul 11, 2021
@efiop efiop moved this from To do to Done in DVC 29 June - 12 July 2021 Jul 11, 2021
@efiop efiop moved this from Done to To do in DVC 29 June - 12 July 2021 Jul 11, 2021
@pmrowla pmrowla moved this from To do to Done in DVC 29 June - 12 July 2021 Jul 13, 2021
@efiop efiop removed this from To do in DVC 13 July - 26 July 2021 Jul 26, 2021
@efiop efiop added this to To do in DVC 27 Jul - 10 Aug via automation Jul 26, 2021
@pmrowla pmrowla moved this from To do to Review in progress in DVC 27 Jul - 10 Aug Aug 10, 2021
@pmrowla pmrowla moved this from Review in progress to Done in DVC 27 Jul - 10 Aug Aug 10, 2021
6 participants