Improve resource utilization/efficiency of file caching #713

Open
jbusecke opened this issue Mar 19, 2024 · 1 comment

Comments

@jbusecke
Contributor

Nothing super specific here, but wanted to brain dump and get a broader discussion going.

As part of my CMIP work, my recipes often download many files from servers that are sometimes slow. This seems to take a very long time and frequently scales up to many workers, which increases cost.

Looking at the Dataflow resource metrics (screenshot not preserved here), it seems like there is one worker spun up per file? There is a spike in CPU usage initially, but then the worker mostly sits idle.

Can we maybe modify the level of concurrency here and have one worker download/cache multiple files via threads to improve performance and/or save costs?

Perhaps something to chat about on Thursday, @ranchodeluxe @moradology?

@cisaacstern
Member

> Can we maybe modify the level of concurrency here and have one worker download/cache multiple files via threads to improve performance and/or save costs?

Yes! I think this may do what you want:

https://github.com/google/xarray-beam/blob/main/xarray_beam/_src/threadmap.py
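To make the idea concrete, here is a rough sketch of what "one worker caches several files via threads" could look like, using only the Python standard library. This is not the xarray-beam `threadmap` implementation and not how the caching stage currently works; `cache_many` and `slow_fetch` are hypothetical names made up for illustration, and `slow_fetch` just sleeps to imitate a slow server. Since the work is I/O-bound, threads spend most of their time waiting on the network, so several downloads can overlap on a single worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def cache_many(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run `fetch` for every URL on a thread pool and return {url: result}.

    Hypothetical helper: `fetch` stands in for whatever actually
    downloads and caches a single file.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order; a failed call's
        # exception is re-raised when its result is retrieved.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


def slow_fetch(url: str) -> str:
    """Hypothetical stand-in for a download from a slow server."""
    time.sleep(0.2)  # pretend we are waiting on the network
    return f"cached:{url}"


if __name__ == "__main__":
    example_urls = [f"https://example.com/file_{i}.nc" for i in range(8)]
    start = time.perf_counter()
    cached = cache_many(example_urls, slow_fetch, max_workers=8)
    print(f"{len(cached)} files in {time.perf_counter() - start:.2f}s")
```

In a pipeline, the per-worker input would presumably be a batch of URLs rather than a single one, and `max_workers` would need tuning against what the remote servers tolerate.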


2 participants