
s3: inline bucket region caching #6406

Merged
merged 1 commit into iterative:master on Sep 6, 2021

Conversation

@isidentical (Contributor) commented Aug 10, 2021

This patch adds a new option (cache_regions) to enable bucket region caching in s3fs (fsspec/s3fs#495). It caches the region of each bucket in memory for the duration of the program's runtime. If you don't specify a region in your config, you end up resolving the same bucket's region over and over, which is a noticeable burden for short-running programs.
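
On the s3fs side, the behaviour this option enables looks roughly like the following (a minimal sketch: the exact way DVC plumbs cache_regions into its filesystem config is not shown here, and the bucket path is a placeholder):

```python
import s3fs

# With cache_regions enabled, the region of each bucket is resolved once and
# kept in an in-memory cache for the lifetime of the process (fsspec/s3fs#495).
fs = s3fs.S3FileSystem(cache_regions=True)

# Subsequent calls against the same bucket reuse the cached region instead of
# re-resolving it on every request.
print(fs.info("my-bucket/path/to/file"))  # placeholder path, not from the PR
```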

For info() calls on the same bucket, these are the stats I was able to gather without bucket region caching:

total invocations: 1, total time: 2.765854, time/invoc: 2.765854
total invocations: 2, total time: 3.100261, time/invoc: 1.550131
total invocations: 3, total time: 4.330655, time/invoc: 1.443552
total invocations: 4, total time: 5.416423, time/invoc: 1.354106
total invocations: 5, total time: 6.585422, time/invoc: 1.317084
total invocations: 6, total time: 12.837526, time/invoc: 2.139588
total invocations: 7, total time: 9.008174, time/invoc: 1.286882
total invocations: 8, total time: 9.980438, time/invoc: 1.247555
total invocations: 9, total time: 11.341123, time/invoc: 1.260125
total invocations: 10, total time: 12.406109, time/invoc: 1.240611

As can be seen, a simple info() call costs about ~1.2 seconds. However, if we cache bucket regions, the stats look like this:

total invocations: 1, total time: 2.944956, time/invoc: 2.944956
total invocations: 2, total time: 3.025406, time/invoc: 1.512703
total invocations: 3, total time: 3.395653, time/invoc: 1.131884
total invocations: 4, total time: 3.698849, time/invoc: 0.924712
total invocations: 5, total time: 3.924782, time/invoc: 0.784956
total invocations: 6, total time: 4.052484, time/invoc: 0.675414
total invocations: 7, total time: 4.380276, time/invoc: 0.625754
total invocations: 8, total time: 4.702067, time/invoc: 0.587758
total invocations: 9, total time: 4.911772, time/invoc: 0.545752
total invocations: 10, total time: 5.013062, time/invoc: 0.501306

This essentially removes the overhead of resolving the bucket region after the first call; the per-call time as well as the total time drops substantially (almost 2x) after a few invocations. At some point (around 16 calls), the difference becomes almost 3x (1.2s vs 0.45s).
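
For reference, per-invocation averages of this shape can be gathered with a small timing loop along these lines (a sketch only, not the actual benchmark from this PR: the object path is a placeholder, and the invalidate_cache() call is my assumption for making every info() hit S3 rather than s3fs's listing cache):

```python
import time
import s3fs

# Flip cache_regions to False to measure the uncached behaviour.
fs = s3fs.S3FileSystem(cache_regions=True)

total = 0.0
for i in range(1, 11):
    fs.invalidate_cache()  # drop the listing cache; assumed not to drop the region cache
    start = time.perf_counter()
    fs.info("my-bucket/some/object")  # placeholder object
    total += time.perf_counter() - start
    print(f"total invocations: {i}, total time: {total:.6f}, time/invoc: {total / i:.6f}")
```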

Obviously we are trying to reduce the number of info() calls we make, and from what I can see, basic workflows only make a few of them, but this can still yield some benefit for status -c (I will run benchmarks and share them here).

Resolves #5969

@isidentical isidentical self-assigned this Aug 10, 2021
@isidentical isidentical added the "fs: s3" (Related to the S3 filesystem) and "optimize" (Optimizes DVC) labels Aug 10, 2021
@isidentical (Contributor Author) commented:

Statistics

Here are the performance results for status -c calls (collected through this script), broken down by the number of individual stages:

num_files=1 slower 1.26x
num_files=2 faster 1.18x
num_files=3 faster 1.18x
num_files=4 faster 1.20x
num_files=5 faster 1.19x
num_files=6 faster 1.19x
num_files=7 faster 1.21x
num_files=8 faster 1.19x
num_files=9 faster 1.17x
num_files=10 faster 1.18x
num_files=11 faster 1.17x
num_files=12 faster 1.19x
num_files=13 faster 1.18x
num_files=14 faster 1.19x
num_files=15 faster 1.18x
num_files=16 faster 1.21x
num_files=17 faster 1.20x
num_files=18 faster 1.18x
num_files=19 faster 1.18x
num_files=20 faster 1.19x
num_files=21 faster 1.16x
num_files=22 faster 1.18x
num_files=23 faster 1.23x
num_files=24 faster 1.18x
num_files=25 faster 1.19x
num_files=26 faster 1.19x
num_files=27 faster 1.20x
num_files=28 faster 1.18x
num_files=29 faster 1.20x
num_files=30 faster 1.21x
num_files=31 faster 1.19x

Beyond the single-stage case, it is consistently faster by ~1.2x.

And here are more general scenarios with datasets:

Previously:

=======================================s3=======================================                                            
    Story: cloud status
        fresh status (nothing missing, 1024 files on the remote) took 2.0016 seconds (best: 1.9944, worst: 2.0108)
        status (1k files missing, 1024 files on the remote) took 6.9438 seconds (best: 6.9271, worst: 6.9573)
        push only new files (1024 new small files / 1024 existing small files) took 10.0141 seconds (best: 9.978, worst: 10.0749)
        fresh status (nothing missing, 2k + 4k files on the remote) took 2.0907 seconds (best: 2.0743, worst: 2.1023)
        status (1 missing file, 2k + 4k files on the remote) took 3.6332 seconds (best: 3.6189, worst: 3.6436)
        push only new files (1 new small file, 2k + 4k files on the remote) took 6.4358 seconds (best: 6.387, worst: 6.4623)
        status (51 missing file, 2k + 4k files on the remote) took 3.6709 seconds (best: 3.5951, worst: 3.8077)
        push only new files (51 new small file, 2k + 4k files on the remote) took 6.4229 seconds (best: 6.3705, worst: 6.4841)

Current PR:

=======================================s3=======================================                                            
    Story: cloud status
        fresh status (nothing missing, 1024 files on the remote) took 2.6278 seconds (best: 2.6035, worst: 2.6443)
        status (1k files missing, 1024 files on the remote) took 3.9694 seconds (best: 3.9521, worst: 3.9883)
        push only new files (1024 new small files / 1024 existing small files) took 6.7651 seconds (best: 6.7512, worst: 6.7908)
        fresh status (nothing missing, 2k + 4k files on the remote) took 2.7187 seconds (best: 2.6719, worst: 2.7891)
        status (1 missing file, 2k + 4k files on the remote) took 3.0316 seconds (best: 3.0046, worst: 3.0687)
        push only new files (1 new small file, 2k + 4k files on the remote) took 5.0051 seconds (best: 4.9621, worst: 5.0555)
        status (51 missing file, 2k + 4k files on the remote) took 3.0583 seconds (best: 3.0238, worst: 3.0979)
        push only new files (51 new small file, 2k + 4k files on the remote) took 5.1756 seconds (best: 5.0236, worst: 5.2988)

tl;dr for statistics: overall it is ~1.2x faster in almost every case besides those with only a single stage, where it is ~1.25x slower. In some cases the speedup rises to ~1.8x (the "push only new files" story), and the overall mean is about ~1.2x.

My evaluation is that we need to make some clever assumptions and remove the regression caused for single-target cases. If we can do that, this is a clear win, ranging from a 20% to an 80% speedup. @efiop, do you think it is worth spending time on optimizing the initial call?

@efiop (Member) commented Sep 2, 2021

@isidentical What's the status here? Is this still a draft?

@isidentical isidentical marked this pull request as ready for review September 6, 2021 08:16
@isidentical isidentical requested a review from a team as a code owner September 6, 2021 08:16
@isidentical (Contributor Author) commented:

For reference: after a discussion with @efiop, we decided to merge this as is and focus on potential optimizations in the future, since this is a clear win for a lot of scenarios.

@isidentical isidentical merged commit 348dfc0 into iterative:master Sep 6, 2021
Linked issue that may be closed by this PR: s3: performance implications of s3fs