Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

push: hangs after collecting information for a couple of minutes #1975

Closed
mroutis opened this issue May 9, 2019 · 4 comments

Comments

2 participants
@mroutis
Copy link
Collaborator

commented May 9, 2019

version: 0.40.0+6408b5

❯ dvc push -r ssh
Preparing to upload data to 'ssh://localhost/tmp/data-storage'
Preparing to collect status from ssh://localhost/tmp/data-storage
[##############################] 100% Collecting information

The user shouldn't think that the process is misbehaving (thinking it is dead because it is not giving any signal of processing stuff -- without checking top)

@mroutis mroutis added the ui label May 9, 2019

@efiop

This comment has been minimized.

Copy link
Member

commented May 9, 2019

So what is the cause of that? Did you find out? You are working on that status optimization, maybe it is related to that?

@mroutis

This comment has been minimized.

Copy link
Collaborator Author

commented May 9, 2019

I still don't know, @efiop , just opened the issue to not forget about it

@mroutis mroutis self-assigned this May 9, 2019

@mroutis mroutis added this to To do in Weekly tasks via automation May 9, 2019

@mroutis mroutis moved this from To do to In progress in Weekly tasks May 11, 2019

@mroutis

This comment has been minimized.

Copy link
Collaborator Author

commented May 11, 2019

@efiop , A quick update, the problem is with this method:

# dvc/remote/local/__init__.py
    def _fill_statuses(self, checksum_info_dir, local_exists, remote_exists):
        for md5, info in checksum_info_dir.items():
            status = STATUS_MAP[(md5 in local_exists, md5 in remote_exists)]
            info["status"] = status

Usually the operation shouldn't take that much but here, when checksum_info_dir is like 100,002
it can take --on my slow computer-- ~120 seconds (2 minutes), even if it is not that much, it is not a good UI to not notify the user about what's going on.

We can optimize this by using a set because they have O(1) lookup instead of O(n). The difference is outstanding, taking ~0.04 seconds to compute:

    def _fill_statuses(self, checksum_info_dir, local_exists, remote_exists):
        # Using sets because they are way faster for lookups
        local = set(local_exists)
        remote = set(remote_exists)

        for md5, info in checksum_info_dir.items():
            status = STATUS_MAP[(md5 in local, md5 in remote)]
            info["status"] = status

@mroutis mroutis added the performance label May 11, 2019

@efiop

This comment has been minimized.

Copy link
Member

commented May 11, 2019

@mroutis Amazing catch! 🎉

mroutis added a commit to mroutis/dvc that referenced this issue May 11, 2019

@mroutis mroutis referenced this issue May 11, 2019

Merged

status: optimize the _fill_statuses method #1981

2 of 2 tasks complete

@efiop efiop closed this in #1981 May 11, 2019

Weekly tasks automation moved this from In progress to Done May 11, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.