Expensive daily s3 to disk replication #2889

Closed · kardaj opened this issue Jan 4, 2019 · 12 comments

kardaj commented Jan 4, 2019

Hey there! This is an updated version of a post from Slack a couple of weeks ago.

I have a couple of S3 buckets with hundreds of thousands of objects each, and I'm syncing them to a local server on a daily basis. I had been using aws s3 sync for a couple of years, and in the last few months I moved to rclone to improve sync reliability. This switch came with a noticeable bump in my bill. After breaking down the costs, each daily sync session is costing an average of $10:

  • 6 million HeadObject ($2)
  • 1.2 million ListBucket ($8)
  • 3.5 thousand GetObject (<$0.01)

From what I've noticed, rclone is quite aggressive with ListBucket and HeadObject operations. Is there a way to tone down this behaviour through configuration?

Here's some setup information:

$ rclone version
rclone v1.44
- os/arch: linux/amd64
- go version: go1.11.1

The commands I'm using look like this: rclone sync s3://bucket /s3/bucket --verbose

ncw commented Jan 4, 2019

The first thing to do is to use --size-only or --checksum as the syncing method. This stops rclone reading the object metadata to discover the modification time of the files, which will help with the HeadObject calls.

If you want to reduce the ListBucket calls then try --fast-list. This will use more memory, but it will make far fewer ListBucket calls.

If you want to see what operations rclone is doing then do -vv --dump headers.
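For illustration, a sketch of how those suggestions could be combined for this S3-to-disk case (s3:bucket and /s3/bucket are placeholder paths; use either --size-only or --checksum, not both):

rclone sync --fast-list --size-only s3:bucket /s3/bucket
rclone sync --fast-list --checksum -vv --dump headers s3:bucket /s3/bucket

The second form adds the debugging flags so you can see exactly which requests are being made.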

There are some hints about these things in the rclone docs, but maybe there should be an "optimising for S3" section in the s3 docs?

kardaj commented Jan 7, 2019

Thank you for the answer. I'll be testing the effects of this optimisation in the next few weeks. I think this is important information that should be in the documentation; personally, the bill took me by surprise. There was other feedback on the Slack channel from someone who had a similar experience with Google Storage. Is the behaviour the same for all providers?

ncw commented Jan 7, 2019

> Thank you for the answer. I'll be testing the effects of this optimisation in the next few weeks.

Great - let us know how it goes

> I think this is important information that should be in the documentation; personally, the bill took me by surprise.

Sorry :-( Let's try to draft some more words for the docs - do you want to have a go?

> There was other feedback on the Slack channel from someone who had a similar experience with Google Storage. Is the behaviour the same for all providers?

No, annoyingly the providers are all slightly different! A constant is that --fast-list will help. --size-only will help on s3 & swift, but not on b2 or google cloud storage (if I remember correctly).
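To make that caveat concrete, a rough sketch (remote names and paths are placeholders, and this just restates the caveat above rather than tested behaviour):

rclone sync --fast-list --size-only s3:bucket /local/path    # s3 and swift: both flags help
rclone sync --fast-list gcs:bucket /local/path               # b2 / google cloud storage: rely on --fast-list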

dertel commented Jul 3, 2019

@kardaj did this save you money? I'm experiencing similarly high costs for requests and I use rclone for daily S3 backups.

kardaj commented Jul 5, 2019

@dertel Yes! Using --fast-list and --size-only drastically reduced my bill.

ncw commented Jul 5, 2019

I'd quite like to put a section on reducing costs in the s3 docs.

What do you think of this?

Reducing costs

By default rclone will use the modification time of objects stored in S3 for syncing. This is stored in object metadata which unfortunately takes an extra HEAD request to read, which can be expensive. To avoid this when using rclone copy/move/sync, use --size-only or --checksum. (Note that using --checksum will MD5 all the files in the sync, which may take some time.) Another solution is to use --update and --use-server-modtime - there is a section about this LINK. Eg

rclone sync --fast-list --size-only /path/to/source s3:bucket
rclone sync --fast-list --update --use-server-modtime /path/to/source s3:bucket

Rclone's default directory traversal is to traverse each directory individually, which takes one transaction per directory. Using the --fast-list flag will read all the files into memory first using a small number of transactions (one per 1000 objects). See the section on --fast-list LINK.
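(As a rough illustration, assuming the 1000-objects-per-listing page size above: a bucket with 600,000 objects takes about 600 list transactions with --fast-list, however many directories it contains, whereas the default traversal needs at least one transaction per directory.)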

Note that if you are only copying a small number of files into a big repository then --no-traverse is a good idea. You can do a "top-up" sync very cheaply by using --max-age and --no-traverse to copy only recent files, eg

rclone copy --max-age 24h --no-traverse /path/to/source s3:bucket

If using rclone mount or any command using the VFS (eg most of rclone serve) then you might want to consider --no-modtime, which will stop rclone reading the modification time of every object, saving a HEAD request per object.
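For example, a sketch of a mount using that flag (remote name, bucket and mount point are placeholders):

rclone mount --no-modtime s3:bucket /mnt/s3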

ncw added this to the v1.49 milestone Jul 5, 2019

kardaj commented Jul 5, 2019

I think there should be an additional warning at a visible place in the documentation that points to the "reducing costs" section. The thing is, since the defaults are different from the aws cli, users will be surprised by the behaviour and the bump in the bill.

ncw commented Jul 5, 2019

> The thing is, since the defaults are different from the aws cli, users will be surprised by the behaviour and the bump in the bill.

What are the defaults for the aws cli? I suspect it won't support modified times on files.

OnGitHubSchatz commented

Defaults according to the aws s3 cli docs:

> A local file will require uploading if the size of the local file is different than the size of the s3 object, the last modified time of the local file is newer than the last modified time of the s3 object, or the local file does not exist under the specified bucket and prefix.

The --size-only option uses file size as the only criterion.
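For comparison, a sketch of the aws cli form with that flag (bucket and local path are placeholders):

aws s3 sync s3://bucket /local/path --size-only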

dertel commented Jul 19, 2019

@kardaj thanks for the tip, worked for me.

ncw modified the milestones: v1.49, v1.50 Aug 27, 2019
ncw modified the milestones: v1.50, v1.51 Nov 14, 2019
ncw modified the milestones: v1.51, v1.52 Feb 1, 2020
ncw modified the milestones: v1.52, v1.53 May 29, 2020
ncw modified the milestones: v1.53, v1.54 Sep 5, 2020
ncw closed this as completed in 5063423 Nov 26, 2020

ncw commented Nov 26, 2020

I have added a section to the s3 docs all about reducing costs in 5063423.

This solves the immediate problem here.

In future, keeping a local db of changes or implementing change detection might help too, but neither of those is s3 specific.

mrxvt commented May 21, 2023

Can I ask why --checksum isn't just the default?
