More efficient downloading #118

Open
jbusecke opened this issue Mar 29, 2024 · 2 comments


@jbusecke
Collaborator

https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-28_20_29_34-3856017896282669695;logsSeverity=INFO?project=leap-pangeo&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))&authuser=1&supportedpurview=project

We really need a less expensive way to cache data. The job above wasted 12 DCU only to find out that one of the files wasn't available.

If we can, we should restrict the number of workers and download within threads on a single worker (see pangeo-forge/pangeo-forge-recipes#713). The scaling only seems to be efficient if we have fast downloads?
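To make the "download within threads on a single worker" idea concrete, here is a minimal sketch using only the Python standard library. It is not the pangeo-forge-recipes implementation; `download_one`, `download_all`, `cache_dir`, and `max_threads` are hypothetical names chosen for illustration.

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlsplit


def download_one(url: str, cache_dir: Path) -> Path:
    """Stream a single URL into cache_dir and return the local path (hypothetical helper)."""
    target = cache_dir / Path(urlsplit(url).path).name
    with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
        shutil.copyfileobj(resp, out)
    return target


def download_all(urls, cache_dir, max_threads: int = 8):
    """Download all URLs from one process, using a bounded pool of threads."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order and re-raises the first download error
        return list(pool.map(lambda u: download_one(u, cache_dir), urls))
```

Because downloads are mostly I/O-bound, threads on a single worker can keep several transfers in flight without paying for additional workers that would mostly sit waiting on the network.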

@jbusecke
Collaborator Author

jbusecke commented May 21, 2024

My wish here would be for a stage that does the following:

  • If possible, check all URLs first, and fail fast if one of them is not available.
    • Absolute 💎 bonus feature: if I could pass multiple lists of URLs, determine at runtime which list is available, and maybe even pick the fastest connection based on an initial ping... this might be too complicated.
  • Have an upper limit on how many connections can be established to a given server (I believe this is partially implemented as the 'max_concurrency' argument in OpenWithFsspec).
  • Use as few workers as possible to download several files in parallel, or download parts of large files in parallel using threads on the workers.
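
A rough sketch of what the first two wishes could look like, using only the Python standard library. Everything here is an assumption for illustration: `head_ok`, `find_unavailable`, `pick_available_list`, `max_per_host`, and `max_threads` are hypothetical names, not part of pangeo-forge-recipes or pangeo-forge-esgf, and a HEAD request is only one possible availability check (some servers may not answer HEAD reliably).

```python
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds (assumed availability check)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs
        return False


def find_unavailable(urls, check=head_ok, max_per_host: int = 4, max_threads: int = 16):
    """Check every URL up front and return the ones that are not available."""
    urls = list(urls)
    # One semaphore per server caps simultaneous connections to that host
    limits = {urlsplit(u).netloc: threading.Semaphore(max_per_host) for u in urls}

    def guarded(url):
        with limits[urlsplit(url).netloc]:
            return check(url)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(guarded, urls))
    return [u for u, ok in zip(urls, results) if not ok]


def pick_available_list(url_lists, **kwargs):
    """Return the first candidate list whose URLs are all available."""
    for urls in url_lists:
        urls = list(urls)
        if not find_unavailable(urls, **kwargs):
            return urls
    raise RuntimeError("no candidate list has all URLs available")
```

Running `find_unavailable` before launching the expensive job would let the pipeline fail within seconds when a file is missing. Picking the fastest mirror based on an initial ping is left out of this sketch.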

@jbusecke
Collaborator Author

jbusecke commented Jun 6, 2024

I have added concurrent downloads in #172 (and in the accidental push to main before that 🙈).

After discussion with the PGF folks, I think I should implement a check for whether the URLs are available as part of the async-client in pangeo-forge-esgf.
