Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc get/import shows temporary paths #2605

Closed
dmpetrov opened this issue Oct 13, 2019 · 5 comments · Fixed by #2798
Closed

dvc get/import shows temporary paths #2605

dmpetrov opened this issue Oct 13, 2019 · 5 comments · Fixed by #2798
Labels
p1-important Important, aka current backlog of things to do ui user interface / interaction

Comments

@dmpetrov
Copy link
Member

$ dvc -V
0.62.1
$ dvc get https://github.com/iterative/example-get-started data/data.xml
Multi-Threaded:
 79%|███████▉  |../../../../private/var/folders/hq/h82mgrhx37nc89_1rcg6gm30081024/37916850 [00:08<00:03,    2.25MB/s]

$ dvc get https://github.com/iterative/example-get-started model.pkl
Multi-Threaded:
 62%|██████▏   |../../../../private/var/folders/hq/h82mgrhx37nc89_1rcg6gm283866624/6262877 [00:00<00:02 ...

Is there a reason to show temporary paths and what actions we can make on this? I'd expect to see just a file name like model.pkl?

@dmpetrov dmpetrov added the ui user interface / interaction label Oct 13, 2019
@efiop efiop added the p1-important Important, aka current backlog of things to do label Oct 14, 2019
@efiop
Copy link
Contributor

efiop commented Oct 15, 2019

For the record: we can either chdir() into the temporary repo root or to make a global variable that we can adjust that will make PathInfo.str() use it for relpath (Kudos @Suor).

@Suor
Copy link
Contributor

Suor commented Nov 1, 2019

This is not only about dvc get. Import are also affected, see #2691

@Suor
Copy link
Contributor

Suor commented Nov 1, 2019

An addition about global variable, which might be used as base for PathInfo.__str__(), it should be:

  • really a thread local, not a simple variable
  • not accessed nor changed outside dvc/path_info.py directly
  • managed with a context manager:
with path_info_base(some_dir):
    # ... path_infos are stringified differently here

This will protect us from usual global var issues making it a dynamic scope variable if you know what I mean ;)

@Suor Suor changed the title dvc get shows temporary paths dvc get/import shows temporary paths Nov 5, 2019
@Suor
Copy link
Contributor

Suor commented Nov 5, 2019

From discord:


@kurianbenoy:
I am interested to fix that issue, even if it takes more time to understand the codebase

@efiop:
Glad to hear that! :slight_smile: Got it. Okay, let's think about some clever ways to fix that then.
ok, so we are using named cache for that. We could make it check if we are inside of a repo or not and if not then print a relpath computed relative to the repo.root_dir instead of cwd.
the piece of the code that is responsible for that is https://github.com/iterative/dvc/blob/0.66.4/dvc/output/base.py#L410

specifically str(self) that is passed as a name
so if we could do something like:

if os.getcwd().startswith(self.stage.repo.root_dir):
    name = str(self)
else:
    name = relpath(self.path_info, self.stage.repo.root_dir)

and then use that name instead of str(self) in that line, then it would automatically start working in an acceptable way.

@Suor: Some outs might be outside repo dir, but still "in repo", referencing them simply by name would be confusing

@Suor
Copy link
Contributor

Suor commented Nov 5, 2019

So the core culprit is stringifying PathInfo objects, which don't know about repo and so can't show sane paths, stringifying output objects is, however, ok. So:

  • it's ok to use str(self) where @efiop mentioned
  • OutputLocal.__str__() should be fixed though
  • the code Ruslan showed above is almost right, the issue is that a dvc file might refer outside repo dir with a rel path

I propose this logic for OutputLocal.__str__():

if not self.is_in_repo:
    return self.def_path
elif <curdir isin self.repo.root_dir>:
    return relpath(self.path_info, curdir)
else:
    return relpath(self.path_info, self.repo.root_dir)

Which means we show relative paths for the repo we are in and rel to repo root paths for all other repos. Remotes and absolute paths are always shown as is.

On <curdir isin ...> this not quire .startswith() as that is wrong for paths:

"a/b/c" isin "a/b" == True
"a/bc" isin "a/b" == False

This would fix the issue and we won't luckily need chdirs nor a global var.

P.S. As a second cleanup/refactor stage I would propose completely stop using PathInfo.__str__(), since it fundamentally doesn't know how to stringify itself. Raise TypeError() there and fix all the usages. This is definitely separate, don't think about it until the issue itself is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Important, aka current backlog of things to do ui user interface / interaction
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants