s3: inline bucket region caching #6406
Conversation
Statistics

Here are the performance results. It seems it is consistently ~1.2x faster. And here are more general scenarios with datasets:

Previously:

Current PR:

tl;dr for the statistics: overall it is ~1.2x faster in almost every case except those with only a single stage, and for those cases it is a regression. My evaluation is that we need to make some clever assumptions to remove the regression for single-target options. If we can do that, this is a clear win ranging from a 20% to 80% speedup. @efiop do you think it's worth spending time optimizing the initial call?
@isidentical What's the status here? Is this still a draft?
For reference: after a discussion with @efiop, we decided to merge this as-is and focus on potential optimizations in the future, since this is a clear win for a lot of scenarios.
This patch adds a new option (`cache_regions`) to enable bucket region caching in s3fs (fsspec/s3fs#495). It caches the region of each bucket in memory for the duration of the program's runtime. If you don't specify a region in your config, you end up resolving the same bucket's region over and over, which is a noticeable burden for short-running programs.

For `info()` calls on the same bucket, these are the stats I was able to gather without bucket region caching:

As can be seen, a single `info()` call costs about ~1.2 seconds. However, if we cache bucket regions, the stats look like this:

This removes the overhead of resolving the bucket's region after the first call, so both the per-call time and the total time drop substantially (almost 2x) after a few invocations. At around 16 calls, the difference becomes almost 3x (1.2s vs 0.45s).
Obviously we are trying to reduce the number of `info()` calls we make, and from what I can see, basic workflows only make a few of them; but this can still yield some benefit for `status -c` (I will run benchmarks and share them here).

Resolves #5969