pull from external repo: support --j to limit the number of parallel connections #3396

Closed
pommedeterresautee opened this issue Feb 24, 2020 · 6 comments · Fixed by #3413
Labels
feature request Requesting a new feature good first issue

Comments

@pommedeterresautee

We are introducing DVC at our company and were quite happy with it until we started using it on a large project containing a few hundred thousand files totaling approximately 300 GB.
We use S3 as storage.
When someone on our team ran dvc pull on this project, it consumed the entire internet bandwidth of our office.

We tried to mitigate the issue by limiting the number of concurrent jobs to 1 (option -j 1), but it was not enough.
Our IT Ops team told us that dvc opened hundreds of concurrent connections to download files from our S3 bucket, which explains how it was able to consume most of the bandwidth.

Is there an option other than --jobs that we should use to limit the number of parallel connections?
Is there an existing workaround for this situation?
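
For reference, the behavior expected from -j can be sketched as a worker pool whose size caps the number of transfers in flight. This is only an illustration under stated assumptions, not DVC's actual code: download_all, fetch_one and ConcurrencyProbe are hypothetical names, and the sleep stands in for network I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def download_all(paths, fetch_one, jobs=1):
    """Run fetch_one over paths with at most `jobs` transfers in flight."""
    # The pool size is the only thing bounding concurrency here.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(fetch_one, paths))


class ConcurrencyProbe:
    """Fake downloader (hypothetical) that records peak simultaneous calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, path):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for network I/O
        with self._lock:
            self.active -= 1
        return path
```

With jobs=1 the probe should never observe more than one transfer at a time, which is what we expected to see on the network side as well.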

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 24, 2020
@shcheklein
Member

@pommedeterresautee I was not able to reproduce it 🤔

could you please run dvc version?

also, when you run dvc pull -j 1 and it starts downloading, how many progress bars do you see?

@pommedeterresautee
Author

pommedeterresautee commented Feb 25, 2020

DVC version: 0.86.2
Python version: 3.6.9
Platform: Linux-5.3.0-40-generic-x86_64-with-Ubuntu-19.10-eoan
Binary: False
Package: snap
Cache: reflink - not supported, hardlink - supported, symlink - supported

I see quite a lot of progress bars, too many for my terminal, which scrolls the output like crazy:

dvc pull -j 1

[screenshot of the terminal output]

@shcheklein
Member

@iterative/engineering someone who is using Linux, could you please check really quick that -j is being propagated properly?

@pommedeterresautee could you check if you have anything in your DVC config files related to the number of jobs? They are .dvc/config and .dvc/config.local.
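
One quick way to do that check is a plain text scan of both files. The sketch below is an assumption-laden helper, not part of DVC: find_jobs_settings is a hypothetical name, and it only looks for the word "jobs" in each line without interpreting the config format. Whether this DVC version reads any such setting is not confirmed in this thread.

```python
from pathlib import Path


def find_jobs_settings(repo_root="."):
    """Return (file, line number, text) for config lines mentioning 'jobs'."""
    hits = []
    for name in (".dvc/config", ".dvc/config.local"):
        path = Path(repo_root) / name
        if not path.is_file():
            continue  # .dvc/config.local is often absent
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "jobs" in line.lower():
                hits.append((name, lineno, line.strip()))
    return hits
```

An empty result means neither file mentions jobs, so the config can be ruled out as the source of the extra connections.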

@casperdcl
Contributor

Had a quick look; not sure if this is the issue but repo.fetch._fetch_external() doesn't get a jobs argument.
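
A minimal sketch of the kind of bug this would be, using hypothetical names throughout (DEFAULT_JOBS, _download, fetch_buggy, fetch_fixed are illustrative and not DVC's real functions or defaults): the top-level fetch accepts jobs, but the external-repo helper has no such parameter, so the limit is dropped and the low-level download falls back to its default.

```python
DEFAULT_JOBS = 16  # hypothetical default, not DVC's real value


def _download(files, jobs=None):
    """Hypothetical low-level fetch: returns the worker count it would use."""
    return jobs or DEFAULT_JOBS


def _fetch_external_buggy(files):
    # No `jobs` parameter, so the user's limit cannot reach _download.
    return _download(files)


def _fetch_external_fixed(files, jobs=None):
    # Fix pattern: accept the limit and forward it.
    return _download(files, jobs=jobs)


def fetch_buggy(local, external, jobs=None):
    """Return (workers used for local files, workers used for external files)."""
    return _download(local, jobs=jobs), _fetch_external_buggy(external)


def fetch_fixed(local, external, jobs=None):
    """Same as fetch_buggy, but the limit is threaded through both paths."""
    return _download(local, jobs=jobs), _fetch_external_fixed(external, jobs=jobs)
```

In this sketch, fetch_buggy([], [], jobs=1) honors the limit for local files only, while fetch_fixed applies it on both paths.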

@shcheklein
Member

@casperdcl great catch! I just realized that it pulls (fetches) from external repos! So yes, it looks like we definitely need to pass -j down to it.

@efiop it looks like the reason for this is clear, can we prioritize and add this?

@shcheklein shcheklein changed the title from "Limit the number of parallel connections" to "pull from external repo: support --j to limit the number of parallel connections" Feb 26, 2020
@shcheklein shcheklein added feature request Requesting a new feature good first issue labels Feb 26, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 26, 2020
@casperdcl
Contributor

> great catch!

Never underestimate the debugging power of Debian on a phone :)
