-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
import: allow "chaining" imports somehow (except circular?) #2610
Comments
This means repo1 -> import into repo2 -> import back to repo1 of course cannot be allowed. So maybe the solution is that when you import an import stage, you simply copy the DVC-file as-is (with it's original source |
Just a note that we need to be careful about this and consider all the possible corner cases (e.g. circular dependencies). If we allow this, dvc will have to behave like a proper package manager when resolving dependencies, which is very hard (remember pip dep resolution PEP?). |
This would be a very important feature for me. The use case is the following: files (datasets, pretrained models...) is generated across many different repositories, so I need a central data registry to easily catalogue data. Also, in case some of the original data creating repositories are renamed or merged together, I only need to change the import in data registry and not have to track down every single user of the data. |
It seems like this issue consist of two stages:
EDIT: we actually have an open issue for first one: |
Seems like #2079 was fixed by iterative/dvc#5324. Need to verify how it influences this issue.
|
Ok, so to check if chaining imports works I created following test: def cleanup_repo(repo_dir):
cache = repo_dir.dvc.config["cache"]["dir"]
shutil.rmtree(cache)
os.remove(repo_dir / "data")
def add_remote_and_push(repo_dir):
repo_dir.add_remote(
name="str", url=str(repo_dir) + "_storage", default=True
)
repo_dir.dvc.push()
@pytest.mark.parametrize("with_cleanup", [0, 1])
@pytest.mark.parametrize("with_remote", [0, 1])
def test_chained_import(
tmp_dir, scm, dvc, erepo_dir, make_tmp_dir, with_cleanup, with_remote
):
with erepo_dir.chdir():
erepo_dir.dvc_gen("data", "data content", commit="add data")
repos = []
for i in range(5):
repos.append(make_tmp_dir(f"another_{str(i)}", scm=True, dvc=True))
previous = erepo_dir
for index, r in enumerate(repos):
with r.chdir():
stage = r.dvc.imp(str(previous), "data", out="data")
r.scm.add([stage.dvcfile.relpath])
r.scm.commit("import data")
previous = r
if with_remote:
add_remote_and_push(r)
# to check if chained dependencies are resolved to source,
# get rid of data and caches for intermediate repos
if with_cleanup:
cleanup_repo(r)
from funcy import last
latest = last(repos)
stage = dvc.imp(str(latest), "data", out="imported_data")
scm.add([stage.dvcfile.relpath])
scm.commit("add data")
assert (tmp_dir / "imported_data").read_text() == "data content"
# check circular import
with erepo_dir.chdir():
erepo_dir.dvc.imp(str(tmp_dir), "imported_data", out="circular_import")
assert (erepo_dir / "circular_import").read_text() == "data content" So, answering my own questions from previous comment
Some more context:
In conclusion, the issue has not been fixed. Only very specific case has been fixed in iterative/dvc#5324. During research for this comment, I came to the conclusion that, depending on how we decide to resolve the import assets, iterative/dvc#4527 and iterative/dvc#2599 might be prerequisistes for this issue. That will be the case if we decide to "follow latest link" - import from target repository, rather than try to resolve whole chain of imports. |
Wouldn't that be the easiest approach? And is it very difficult? I imagine all you need is all the .dvc files in the chain, and the configuration of the first link. Assuming you can connect to all the Git repos, it seems doable? That doesn't mean iterative/dvc#4527 and iterative/dvc#2599 aren't valuable too but this way they're separate concerns. Circular deps can be prevented by not allowing .dvc files to repeat (e.g. by md5) when rebuilding the chain. There could also be a reasonable set limit of links, say 8. Are there other edge cases e.g. race conditions or something? |
@jorgeorpinel do you mean resolving the whole chain? I think that the easiest would be to import from target repo, but that would need iterative/dvc#4527. That way rules of import would be easy: We import from target repo, thats all. In case of resolving whole chain, well any link is missing, we are done. |
You're right, that would be simpler in terms of behavior. Even circular imports could happen and be fine under that approach. But it wouldn't really constitute "chaining". Let's can call it "cascading" for now. My point is that it doesn't really answer this issue fully: What if the target repo link doesn't keep imports in remote storage by choice (e.g. to save on storage costs)? That said maybe it's a good limitation, to avoid people from inadvertently importing from underlying sources with unknown reputation, thinking it's coming from a project they trust (unless a clear warning/confirmation is given). Idk what the answer is. I'd check with @dberenbaum et al at this point 🙂 |
I looked quickly at the discord message but didn't really get the context. It sounded like it's pretty easy to work around this be doing |
Yeah it's an old chat. I wouldn't worry too much about that particular case but indeed it's unclear whether this is needed. At some point it was prioritized as p2 so we probably thought it was somewhat important.
I don't think that's a workaround because you can't even |
It seems that this particular use case is not "that" popular. It seems that it could be solved by simply implementing iterative/dvc#4527. I think what I am looking for is if there is a reason/use case when we would like to resolve whole import chain and import data from source, rather than import it from last link of import chain. |
OK, agree. |
What's the status of this after iterative/dvc#6109? |
After iterative/dvc#6109, chained imports should work, and if there is a circular import in the chain, DVC will error out. It currently works by resolving the entire import chain each time, and requires that DVC be able to access all of the repos in the chain (and all of the default remotes for each repo in the chain). This requirement remains true even after the initial This example script shows how it works: https://gist.github.com/pmrowla/c1e86bc41acc05d06a3752d8e8700e4a (The script creates 4 repos that each have their own separate default remotes) We start with 2 repos that each contain a single file:
Next we add 3rd repo:
In this repo,
In the final repo, we import
Note that the DVC file for final import only references repo C. We do not save that
Note that the full import chain is resolved on
If the user importing/pulling into repo D did not have access to all 3 of the remotes, the pull would fail. The example script also shows what happens for circular imports. If we now go into repo A, and try to import from repo D, it will form a circular import and fail (because D imports from C which imports from A)
|
I'm not sure if this implementation meets the needs for closing this issue or not. I think there was some discussion before regarding whether or not the final import into D should only require access to repo C? (this would require pushing all of C's imported files into C's default remote, rather than fetching them from their original locations in remotes A & B) I'm also not sure whether or not we want to document that this is actually a supported use case |
That qualifies as "allowing chaining imports somehow" to me! I.e. it may close this ticket (condition explained below).
I'm just also curious whether the current behavior is what we want long term, given earlier comments e.g. https://github.com/iterative/dvc/issues/3305#issuecomment-855792862 - "it could be solved by simply implementing iterative/dvc#4527... what I am looking for is if there is a reason/use case when we would like to resolve whole import chain" cc @pared @dberenbaum If this import chain feature (based on iterative/dvc#6109) is definitive, let's document it. E.g. it can be a mention and examples in |
The one other thing to note would be that running So if |
Hmm, I'm not sure what's the best approach. It's probably best to document the chained imports for now since that's the current implementation and revisit whether imports should be "backed up" later if it becomes a more obvious need. |
Moving to dvc.org since it seems like the remaining work is documentation. @jorgeorpinel We might be able to better support iterative/dvc#4527 in the future, but until then it IMO it would be useful to be more clear that imports rely on the original source remote and users need access to that, including for chained imports. Thoughts? @pmrowla Can you take this one if we need to document it? Is there someone else who should be assigned? |
@dberenbaum I'm guessing it will have to be me since I did the current implementation, you can just assign it to me if/when needed |
Thanks guys (and no rush), we'll take it over as soon as we have all the basic info.
This part may deserve it's own issue in iterative/dvc though? |
My thoughts: The current behavior could be expected or unexpected depending on the circumstances. There's not a lot of user feedback outside of the one user in iterative/dvc#4527, so I'm not sure it's worth keeping the discussion going. Can we document what we have and revisit if users complain or are confused? |
This seems like the expected behavior to me. In this case, I am essentially importing a pinned/frozen stage from repo C. When I run |
As of now, imported data is not cached by default, so you won't be able to import any imported data:
repo1 -> ✔️ dvc import data -> repo2 🙂 -> ✖️ dvc import data -> repo3 🙁
Somehow allowing this could be useful for the case where you're building a data registry based on other previous smaller DVC repos, for example. Right now you have to
dvc get
and thendvc add
those artifacts from scratch in the data registry (so they can be imported in further DVC repos).Ruslan mentioned something about using "links" to implement this (on Discord).
UPDATE: Go to https://github.com/iterative/dvc/issues/3305#issuecomment-836503176
The text was updated successfully, but these errors were encountered: