Expensive daily s3 to disk replication #2889

Closed · kardaj opened this issue Jan 4, 2019 · 12 comments

kardaj commented Jan 4, 2019

Hey there! This is an updated version of a post from Slack a couple of weeks ago.

I have a couple of S3 buckets with hundreds of thousands of objects each, and I'm syncing them to a local server on a daily basis. I had been using aws s3 sync for a couple of years, and in the last few months I moved to rclone to improve sync reliability. This switch came with a noticeable bump in my bill. After breaking down the costs, each daily sync session is costing an average of $10:

  • 6 million HeadObject ($2)
  • 1.2 million ListBucket ($8)
  • 3.5 thousand GetObject (<$0.01)

From what I've noticed, rclone is quite aggressive with ListBucket and HeadObject operations. Is there a way to tone down this behaviour through configuration?

Here's some setup information:

$ rclone version
rclone v1.44
- os/arch: linux/amd64
- go version: go1.11.1

The commands I'm using look like this: rclone sync s3://bucket /s3/bucket --verbose

ncw commented Jan 4, 2019

The first thing to do is to use --size-only or --checksum as the syncing method. This stops rclone reading the object metadata to discover the modification time of the files, which will help with the HeadObject calls.

If you want to reduce the ListBucket calls then try --fast-list. This will use more memory, but it will make far fewer ListBucket calls.

If you want to see what operations rclone is doing then do -vv --dump headers.
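For illustration, a sketch of how those suggestions could be combined for this S3-to-disk case (s3:bucket and /s3/bucket are placeholder paths; use either --size-only or --checksum, not both):

rclone sync --fast-list --size-only s3:bucket /s3/bucket
rclone sync --fast-list --checksum -vv --dump headers s3:bucket /s3/bucket

The second form adds the debugging flags so you can see exactly which requests are being made.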

There are some hints about these things in the rclone docs, but maybe there should be an "optimising for S3" section in the s3 docs?

kardaj commented Jan 7, 2019

Thank you for the answer. I'll be testing the effects of this optimisation in the next few weeks. I think this is important information that should be in the documentation; personally, the bill took me by surprise. There was other feedback on the Slack channel from someone who had a similar experience with Google Storage. Is the behaviour the same for all providers?

ncw commented Jan 7, 2019

> Thank you for the answer. I'll be testing the effects of this optimisation in the next few weeks.

Great - let us know how it goes

> I think this is important information that should be in the documentation; personally, the bill took me by surprise.

Sorry :-( Let's try to draft some more words for the docs - do you want to have a go?

> There was other feedback on the Slack channel from someone who had a similar experience with Google Storage. Is the behaviour the same for all providers?

No, annoyingly the providers are all slightly different! A constant is that --fast-list will help. --size-only will help on s3 & swift, but not on b2 or google cloud storage (if I remember correctly).
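To make that caveat concrete, a rough sketch (remote names and paths are placeholders, and this just restates the caveat above rather than tested behaviour):

rclone sync --fast-list --size-only s3:bucket /local/path    # s3 and swift: both flags help
rclone sync --fast-list gcs:bucket /local/path               # b2 / google cloud storage: rely on --fast-list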

dertel commented Jul 3, 2019

@kardaj did this save you money? I'm experiencing similarly high costs for requests and I use rclone for daily S3 backups.

kardaj commented Jul 5, 2019

@dertel Yes! Using --fast-list and --size-only drastically reduced my bill.

ncw commented Jul 5, 2019

I'd quite like to put a section on reducing costs in the s3 docs.

What do you think of this?

Reducing costs

By default rclone will use the modification time of objects stored in S3 for syncing. This is stored in object metadata which unfortunately takes an extra HEAD request to read, which can be expensive. To avoid this when using rclone copy/move/sync, use --size-only or --checksum. (Note that using --checksum will MD5 all the files in the sync, which may take some time.) Another solution is to use --update and --use-server-modtime - there is a section about this LINK. Eg

rclone sync --fast-list --size-only /path/to/source s3:bucket
rclone sync --fast-list --update --use-server-modtime /path/to/source s3:bucket

Rclone's default directory traversal is to traverse each directory individually, which takes one transaction per directory. Using the --fast-list flag will read all the files into memory first using a small number of transactions (one per 1000 objects). See the section on --fast-list LINK.
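(As a rough illustration, assuming the 1000-objects-per-listing page size above: a bucket with 600,000 objects takes about 600 list transactions with --fast-list, however many directories it contains, whereas the default traversal needs at least one transaction per directory.)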

Note that if you are only copying a small number of files into a big repository then --no-traverse is a good idea. You can do a "top-up" sync very cheaply by using --max-age and --no-traverse to copy only recent files, eg

rclone copy --max-age 24h --no-traverse /path/to/source s3:bucket

If using rclone mount or any command using the VFS (eg most of rclone serve) then you might want to consider --no-modtime, which will stop rclone reading the modification time of every object, saving a HEAD request per object.
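For example, a sketch of a mount using that flag (remote name, bucket and mount point are placeholders):

rclone mount --no-modtime s3:bucket /mnt/s3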

ncw added this to the v1.49 milestone Jul 5, 2019

kardaj commented Jul 5, 2019

I think there should be an additional warning at a visible place in the documentation that points to the "reducing costs" section. The thing is, since the defaults are different from the aws cli, users will be surprised by the behaviour and the bump in the bill.

ncw commented Jul 5, 2019

> The thing is, since the defaults are different from the aws cli, users will be surprised by the behaviour and the bump in the bill.

What are the defaults for the aws cli? I suspect it won't support modified times on files.

OnGitHubSchatz commented

Defaults according to the aws s3 cli docs:

> A local file will require uploading if the size of the local file is different than the size of the s3 object, the last modified time of the local file is newer than the last modified time of the s3 object, or the local file does not exist under the specified bucket and prefix.

The --size-only option uses file size as the only criterion.
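For comparison, a sketch of the aws cli form with that flag (bucket and local path are placeholders):

aws s3 sync s3://bucket /local/path --size-only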

dertel commented Jul 19, 2019

@kardaj thanks for the tip, worked for me.

ncw modified the milestones: v1.49, v1.50 Aug 27, 2019
ncw modified the milestones: v1.50, v1.51 Nov 14, 2019
ncw modified the milestones: v1.51, v1.52 Feb 1, 2020
ncw modified the milestones: v1.52, v1.53 May 29, 2020
ncw modified the milestones: v1.53, v1.54 Sep 5, 2020
ncw closed this as completed in 5063423 Nov 26, 2020

ncw commented Nov 26, 2020

I have added a section to the s3 docs all about reducing costs in 5063423.

This solves the immediate problem here.

In future, keeping a local db of changes or implementing change detection might help too, but neither of those is s3 specific.

mrxvt commented May 21, 2023

Can I ask why --checksum isn't just the default?
