Skip to content

Conversation

@sharidas
Copy link

@sharidas sharidas commented Jan 12, 2020

Fix the import issue when target does not have
remote config.

Signed-off-by: Sujith H sujith.h@gmail.com

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@sharidas sharidas changed the title Fix import issue without remote config in the target [WIP] Fix import issue without remote config in the target Jan 12, 2020
@sharidas
Copy link
Author

This PR is still a work in progress. In this PR I have tried to capture the NoRemoteError execption and then look for the file/dir inside the dvc repo where we are importing. If the file/dir is found then no further action required as its available in the cache.

I have not added condition to check: what if the cache is not available in the reciever of the import. I need to check the same.

Any feedback regarding the approach is appreciated.

@sharidas sharidas changed the title [WIP] Fix import issue without remote config in the target Fix import issue without remote config in the target Jan 15, 2020
@sharidas
Copy link
Author

Would appreciate review for this PR.

@shcheklein
Copy link
Member

@sharidas should we add a test for this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is describing what the code is doing, but I would appreciate more a comment about why it is doing such thing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comment. Let me know if this looks ok?

@sharidas
Copy link
Author

@sharidas should we add a test for this?

Updated the PR by adding test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are already inside of a tmp_dir, no need to chdir. 🙂

Comment on lines 67 to 70
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are adding by full path, which will make dvc treat it as an external output. What you want is:

tmp_dir.dvc_gen({dst: "hello"})

looks like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: using open + write + close is a bad style even without pathlib, where you should use with open() instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
os.remove(tmp_dir / dst)
(tmp_dir / dst).unlink()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use

Suggested change
if os.path.exists(self.repo.cache.local.get(out.checksum)):
if not self.repo.cache.local.changed_cache(out.checksum):

to make it fully work with directories and be more robust in general. There is nothing wrong with your implementation, just an obscure internal stuff that is more appropriate here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't think this affects the test at all, because dvc import clones erepo and non-commited changes won't be cloned anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens in this test currently is that dvc automatically sets default remote to erepo_dir's .dvc/cache and pulls from there. https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 Need to somehow disable it for this test or workaround it. 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop If my understanding is correct https://github.com/iterative/dvc/blob/master/tests/dir_helpers.py#L282-L283 is the code which adds remote configuration. Or am I missing something. I didn't quite get the point here. I was thinking: if removal of config from the .dvc/config from the erepo would help to replicate the scenario, while writing the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharidasan Oh, good point! I've totally forgotten about that one. We used to use that before https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 logic. But still, as you can see that file is commited and here you are only removing it from the workspace, not committing it, so clone of this erepo would still have that config intact.

@sharidas
Copy link
Author

@efiop I have updated the change again. In the latest change set I have introduced a new fixture ( similar to erepo_dir ) which differs from erepo_dir where no config data is present. So when the erepo is cloned an empty config file is created. Let me know if this looks kind of ok ...

Comment on lines 298 to 313
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately removing that config part still won't help, as https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 is still in place, which automatically sets it back when cloning 🙁

How about we simply mock cloud.pull to raise NoRemoteError in our test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am bit confused here. When I execute:

python -m tests tests/func/test_import.py::test_import_cached_file

I do see a config file under location /tmp/pytest-of-sujith/pytest-0/popen-gw0/erepo0/.dvc/config in my machine. The file is actually empty. So while cloning it sets back but the content is empty.. Kindly correct me if my obeservation is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharidasan It creates .dvc/config.local though, not .dvc/config.

Comment on lines +95 to +102
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a side note: we might later consider making pull in general not require remote if everything is already locally available. This is out of focus for this PR, current implementation works great for what we need it for 👍 So no actions required here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a correct logic for dir outs, it should be just out.changed_cache().

However, I don't understand the overall approach, we just created that repo it has empty cache. What's the point?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, this is for local import/shared cache case. I would say we should change .pull() implementation to check local cache first and only go for upstream when something is missing in local cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor please see #3120 (comment) . out.change_cache() would be better, true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an overkill creating a fixture for single test. Any of the following would be better:

  • use existing erepo_dir, drop upstream in a test
  • change erepo_dir to not have upstream, fix any issues it causes (if that is not too hard)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor Def don't need that new fixture, but change erepo_dir to not have upstream won't help, see #3120 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, this is for local import/shared cache case. I would say we should change .pull() implementation to check local cache first and only go for upstream when something is missing in local cache.

@sharidas
Copy link
Author

I would say we should change .pull() implementation to check local cache first and only go for upstream when something is missing in local cache.

This case is related get/import failure when remote config is missing. Meaning when we get/import the file/dir from the remote url. I am sharing my understanding with respect to the ticket here :)

Fix the import issue when target does not have
remote config.

Signed-off-by: Sujith H <sujith.h@gmail.com>
@sharidas
Copy link
Author

I have updated the test. Also verified the expected code flow to trigger the excecption for NoRemoteError. Let me know if this change set looks ok? @efiop @shcheklein

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sharidas ! 🙏

@efiop efiop merged commit cf01357 into iterative:master Jan 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get/import: don't require default remote for get/import in certain cases

5 participants