
rclone backup: new command with incremental strategy #18

Open
briceburg opened this issue Jan 19, 2015 · 28 comments
@briceburg

briceburg commented Jan 19, 2015

Hi Nick,

I'm currently using s3ql to mount remote S3/GCS data as a local filesystem (through FUSE), and then use a shell script to implement rsync-based backups to this filesystem.

I'd like to make use of rclone-based backups in ansible-pcd -- for its simplicity and for user-friendliness (e.g. browsing on the remote end will display the backed-up files themselves -- versus s3ql, which displays indecipherable filesystem metadata).

For the majority of "backups", it's important to have incremental functionality that allows you to restore a file from "yesterday" without having "today's" changes override it and make that impossible. E.g. snapshots/rotations/&c.

Most incremental strategies also make subsequent backups efficient by limiting what gets backed up to changed components only, speeding up runtime and transfer time, and saving space at the backup destination.

Protecting against file corruption making its way into downstream [snapshots/rotations/&c] is also a plus.

I think rclone handles the syncing component well -- although I don't see anything apparent regarding incremental snapshotting. Do you plan to implement this, or could you share your thoughts on this feature?

@briceburg
Author

My initial thought is to use remote-to-remote for this, e.g.

First Backup ("base")

rclone copy /path/to/backup remote:/backups/base

Subsequent Backups

date=`date "+%Y%m%d_%H:%M:%S"`
rclone sync remote:/backups/base remote:/backups/$date
rclone sync /path/to/backup remote:/backups/$date

Not sure about the efficiency of the remote-to-remote. Bad idea?

Also, the README indicates:

[sync] Deletes any files that exist in source that don't exist in destination.

I'm used to behavior that deletes files which exist in the destination but do not exist in the source. I'm worried that the rclone behavior would remove [new] files from /path/to/backup ...

Thanks!

@ncw
Member

ncw commented Jan 19, 2015

Your idea for the remote-to-remote copy is how I would approach it.

This has one disadvantage with rclone as it stands today, in that it will effectively download the data and re-upload it. However, I have been thinking about allowing bucket-to-bucket copies, which would be exactly what you want. S3, Swift and GCS all allow this. Here are the docs for GCS.

So if I were to implement that, then the copy-to-backup-first approach would work quite well, I think.

As for

[sync] Deletes any files that exist in source that don't exist in destination.

I think it is badly worded; it "deletes files in the destination that don't exist in the source", as you would expect. I'll fix the wording.

@briceburg
Author

Nick,

Great! This is pretty exciting. Bucket-to-bucket copying sounds promising. What about this approach as well:

rclone sync /path/to/backup remote:/backups/base remote:/backups/changes.2015-01-19

Where rclone would compare /path/to/backup against remote:/backups/base, and copy changes to remote:/backups/changes.2015-01-19

Obviously this would mess with the delete behavior, which could be dealt with by adding a flag that removes deleted files from remote:/backups/base and optionally preserves them elsewhere (e.g. copying them to remote:/backups/deleted-files). We could then run a janitorial command that removes files older than X days from remote:/backups/deleted-files, and also take advantage of bucket-to-bucket copying without incurring the cost of doubling storage space with each snapshot.
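
As a rough illustration only -- the deleted-files path is the example from above, the 30-day window stands in for "X days", and this leans on rclone's --min-age filter -- such a janitorial pass could look like:

# illustrative sketch: purge preserved deletions older than 30 days
rclone delete remote:/backups/deleted-files --min-age 30d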

@ncw
Member

ncw commented Jan 19, 2015

Interesting idea!

I think I'd simplify the logic slightly and make it a new rclone command

rclone sync3 /path/to/backup remote:/backups/base remote:/backups/changes.2015-01-19
  • for every file in /path/to/backup
    • if it is in base unchanged - skip
    • if it is modified in base
      • copy the file from base to changes if it exists in base
      • upload the file to base
  • for every file in base but not in backup
    • move it from base to changes

This would mean that base would end up with a proper sync of backup, but changes would have any old files which changed or were deleted. It would then effectively be a delta, and you would have all the files at both points in time.

You could re-create the old filesystem easily, except if you uploaded new files into base - there would be no way of telling, just by looking at base and changes, whether those new files were new or just unchanged old files. This may or may not be a problem!
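
For illustration only (the flag did not exist when this comment was written; it arrives later in this thread): the per-file logic above corresponds closely to a single sync with a backup directory, e.g.

rclone sync /path/to/backup remote:/backups/base --backup-dir remote:/backups/changes.2015-01-19

so base ends up as a mirror of the source and changes receives the old copies of anything modified or deleted.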

@briceburg
Author

Nick,

Right -- totally agree re: new rclone command for this. The precursor is getting bucket-to-bucket functionality.

Just FYI, I'm super excited about rclone && am loving your work. For now, I'm going to go with an incredibly simple approach to backups with rclone.

Basically I'll have two targets: one that is synced weekly, and one that is synced daily. E.g. my cron will look like:

10     2     *     *     *  rclone sync ~/VAULT google:vault/nesta/daily
10     4     *     *     0  rclone sync ~/VAULT google:vault/nesta/weekly

This will [hopefully] preserve deleted files in the weekly snapshot. Could also add a monthly &c.

I think this will work for now, but certainly interested in helping with bucket-to-bucket and incremental strategies. If I can help, please let me know. May need to learn Go :)

@briceburg
Author

I've added ansible scripts to 1) install rclone and 2) implement the above backup strategy on a crontab based system (still need to make a systemd timer compatible version for archlinux &c). Sharing for fun.

Install rclone:

---

- name: install rclone 
  hosts: all
  sudo: true
  sudo_user: root

  vars:
    # check http://rclone.org/downloads/ for latest...
    rclone_version: 1.07
    rclone_vstr: rclone-v{{ rclone_version }}-linux-amd64
    rclone_target: /opt/rclone/{{ rclone_vstr }}

  pre_tasks:
    - stat: path={{ rclone_target }}
      register: stat_rclone 

  tasks:
    - name: download rclone
      uri:
        dest=/tmp/
        follow_redirects=all
        url=http://downloads.rclone.org/{{ rclone_vstr }}.zip
      when: not stat_rclone.stat.exists

    - name: unpack rclone
      command: unzip /tmp/{{ rclone_vstr }}.zip -d /opt/rclone
        creates={{ rclone_target }}

    - name: add rclone to path
      file:
        state=link
        dest=/usr/local/bin/rclone
        src={{ rclone_target }}/rclone

Backup Strategy

---
- name: vault backup 
  hosts: all

  vars:
    vault_base: "google:iceburg-vault/{{ TARGET_USER }}"
    vault_daily: "{{ vault_base }}/daily" 
    vault_weekly: "{{ vault_base }}/weekly"

  tasks:
    - name: $HOME/.rclone.conf
      file:
        state=link
        dest={{ TARGET_USER_HOME }}/.rclone.conf
        src={{ DOTFILES_DIR }}/.rclone.conf
        force={{ FORCE_LINKS }}

    - name: fetch vault
      command: rclone copy {{ vault_daily }} ~/VAULT
        creates=~/VAULT

    - name: schedule daily vault backup
      cron:
        name="daily vault backup"
        minute=40
        hour=4
        job="rclone sync ~/VAULT {{ vault_daily }}"

    - name: schedule weekly vault backup
      cron:
        name="weekly vault backup"
        minute=40
        hour=5
        job="rclone sync ~/VAULT {{ vault_weekly }}"

@briceburg
Author

Nick,

I've been playing with Syncthing of late. It uses the very cool idea of "versions", I believe derived from Dropbox and/or BitTorrent Sync. Versus the incremental ideas outlined above -- perhaps an incremental versioning scheme is preferred and easier to implement?

The "simple" Versioning scheme in Syncthing allows you to specify a folder name and number of copies you would like to preserve. E.g.

  1. During a sync, if a file is changed, copy the original version to the "versioned" folder.
    E.g. <remote>:/.versions/<path>/filename.
  2. If more than X versions of a file exist, delete the oldest.

So for the sync

rclone sync-versioned /path/to/backup remote:/backups

If remote:/backups/apache/virtualhost.a was found, but deleted or changed from /path/to/backup/apache/virtualhost.a, rclone would (a rough sketch follows this list):

  • make sure remote:/backups/.versions/apache folder exists (assuming .versions is the configured folder name)
  • copy remote:/backups/apache/virtualhost.a to remote:/backups/.versions/apache/virtualhost.a
    • if remote:/backups/.versions/apache/virtualhost.a exists, apply versioning scheme. E.g. rename older backups to remote:/backups/.versions/apache/virtualhost.a.[1-4] if configured to preserve 5 versions of a file.
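
A very rough shell sketch of that rotation for a single file, using today's rclone subcommands -- the paths, the .versions name and the five-version limit are just the example values above, and a real implementation would presumably live inside rclone itself:

# hypothetical rotation for one changed/deleted file, keeping at most 5 old versions
SRC=remote:/backups/apache/virtualhost.a
VER=remote:/backups/.versions/apache/virtualhost.a
rclone deletefile $VER.4 2>/dev/null || true        # drop the oldest version, if present
for i in 3 2 1; do
  rclone moveto $VER.$i $VER.$((i+1)) 2>/dev/null || true   # shift older versions down one slot
done
rclone moveto $VER $VER.1 2>/dev/null || true       # the previously saved copy becomes version 1
rclone copyto $SRC $VER                             # preserve the outgoing copy (a move would do for a deletion)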

Personally I think versions may be more accessible, and they don't involve deltas. What do you think?

@ncw ncw added the enhancement label Feb 4, 2015
@ncw
Member

ncw commented Feb 13, 2015

Sorry, I missed your last comment.

Yes, versions sound like they would be simpler for people to understand.

The renaming scheme needs a bit of thought - Windows doesn't deal well with files with funny extensions.

Implementation-wise, it is quite similar to the schemes above.


@ncw
Member

ncw commented Feb 10, 2016

I'll just note that rclone now has bucket-to-bucket copy and sync, which may be helpful!

@ncw ncw added this to the Unplanned milestone Feb 10, 2016
@leocrawford

leocrawford commented Aug 8, 2016

A feature along the lines of #18 or #98 would be very welcome. I agree that it is desirable to store full files rather than diffs, for simplicity and ease of restoration, but I wonder if we could improve on the versioned-folders idea?

The main drawback of that approach is that when a file is moved (or repeatedly removed and created) we get a lot of copies of the same file. Instead, if we treated the .backup directory as content-addressable storage, such that each backed-up file was stored using its MD5 hash as a filename, we would only need a little metadata stored to allow a restore.

I'd suggest that what we would need to store for each version is a JSON file that contains a line for each filesystem change, along the lines of:

operation, metadata, blob

here:

  • Operation would be add, delete, mkdir or similar (probably to match operations in fs)
  • Metadata would contain chmod, date, etc.
  • blob would be the MD5 hash of the file in question

I'd suggest that the version file itself is named after the MD5 of its contents and contains a reference (probably in the first line) to the previous backup. The most recent backup would be tracked by writing its MD5 to a file called HEAD in the .backup directory; this would be the only file that would ever need to change. (In effect we're creating a Merkle tree.)
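
To make that concrete, a purely hypothetical version file -- the hashes, paths and field names below are invented for illustration -- might contain something like:

9b1c2d3e...                                                   (first line: MD5 of the previous backup's version file)
add    {"path":"apache/virtualhost.a","mode":"0644","mtime":"2016-08-08T10:00:00Z"}  5d41402abc4b2a76b9719d911017c592
delete {"path":"old/notes.txt"}                                                       -
mkdir  {"path":"new-dir","mode":"0755"}                                               -

with the file itself stored under the MD5 of these contents, and .backup/HEAD holding that MD5.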

The advantage of this approach is that as well as restoring files we can restore other changes readily, by returning to any arbitrary point in the history (including deleted files, metadata, etc.), and it could cope with multi-way syncing with a little work. I also believe this approach could support a full two-way sync more readily than simple versioning, as the metadata allows us to determine what changes have been made since the last sync, improving our ability to determine which update to propagate rather than simply having to mark a potential conflict.

In practice, the easiest way of doing a restore is to allow the source to have an optional version specified (either the MD5 hash or simply an integer representing the number of steps back to go), so a restore could simply be a copy from the (old) destination.

One interesting way to implement this would be to provide a SourceVersionWrapper and a DestinationVersionWrapper which wrap any existing fs object; SourceVersionWrapper would allow an arbitrary version to be specified, and DestinationVersionWrapper would simply create the .backup metadata and blobs.

The advantage of this would be that if you did implement FUSE support (#494) then you would have, in effect, created a versioned filesystem for free. :-)

@thibaultmol
Contributor

New feature from Backblaze for B2: https://www.backblaze.com/blog/backblaze-b2-lifecycle-rules/

(might be relevant)


@ncw
Member

ncw commented Nov 10, 2017

rclone now supports --backup-dir which, with a tiny amount of scripting, gives all the tools necessary for incremental backups.

I keep meaning to wrap this into an rclone backup command, but I haven't got round to it yet!
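
The "tiny amount of scripting" can be as small as generating a dated backup directory per run. A minimal sketch, with the paths and layout purely as examples:

#!/bin/sh
SNAP=$(date +%Y-%m-%d_%H%M%S)
rclone sync /path/to/data remote:backup/current --backup-dir remote:backup/archive/$SNAP
# old snapshots under remote:backup/archive/ can later be pruned, e.g. with rclone purge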


@ncw
Member

ncw commented Jan 29, 2018

@navotera --backup-dir does a server-side move (or possibly a server-side copy followed by a delete if server-side move isn't available).


@guestisp

So, by using something like:

rclone sync /path/to/local remote:current --backup-dir remote:$(date)

remote:current will hold the latest backup (thus, the "current" version of files), and every change between the current version and the previous one would be stored in remote:$(date), resulting in something like rsnapshot?

In other words, if yesterday I had a file called "foo" that was deleted today, then with today's rclone run this file will be removed from the current remote and placed in the remote with yesterday's date, right?

Isn't it easier to run a remote copy before a new sync? Like the following:

rclone sync remote:current remote:yesterday
rclone sync /path/to/local remote:current

Exactly like rsnapshot

@ncw
Member

ncw commented Aug 10, 2018

@guestisp

So, by using something like:

rclone sync /path/to/local remote:current --backup-dir remote:$(date)

remote:current will hold the latest backup (thus, the "current" version of files), and every change between the current version and the previous one would be stored in remote:$(date), resulting in something like rsnapshot?

Yes, that is right.

In other words, if yesterday I had a file called "foo" that was deleted today, then with today's rclone run this file will be removed from the current remote and placed in the remote with yesterday's date, right?

Yes.

Isn't it easier to run a remote copy before a new sync? Like the following:

That will use a lot more storage - you'll have a complete copy for yesterday and a complete copy for current.

@guestisp

But with --backup-dir, do I have to search for a file in every backup directory, or is each directory a complete copy, like with rsnapshot and hardlinks?

@ncw
Member

ncw commented Aug 11, 2018

But with --backup-dir, do I have to search for a file in every backup directory, or is each directory a complete copy, like with rsnapshot and hardlinks?

Yes, searching will be necessary, as not many cloud providers support hard links. (A few, like Google Drive, do.)

I intend to make an rclone backup command which hides this from the user at some point, though.

@guestisp

I'm trying to use the suggested method (--backup-dir) but something is not working as expected.

This is a simple script that I'm running:

#!/bin/sh

BACKUP_DIR=$(/bin/date +'%F_%R')
for dir in /etc /var/www /var/backups /var/spool/backups; do 
   rclone sync $dir amazon_s3:mybuket/current/$dir --backup-dir amazon_s3:mybuket/${BACKUP_DIR} --exclude '*/storage/logs/*' --stats 2s --log-level ERROR
done

I would expect that on the first run everything would be synced to mybuket/current (and this is working properly), and then on every subsequent run changed files should be moved to mybuket/${BACKUP_DIR}, but this is not working. Files are still synced to current.

I would like to have something like rsnapshot: current should hold the latest sync, and any changes between the latest sync and the previous one should be moved to the backup-dir.
For example, yesterday I had file1 and file2. These are synced to current. Today I remove file2 and change file1's content. On the next run, today's versions should be synced to current and yesterday's versions should be moved to 20180817_0930.

@ncw
Member

ncw commented Aug 28, 2018

What should happen is that any files which are changed or deleted get moved to the backup-dir, which I think is what you are asking for.

Here is a simple example

$ tree src
src
└── file1

0 directories, 1 file
$ rclone sync src dst/current --backup-dir dst/backup1
$ tree dst
dst
└── current
    └── file1

1 directory, 1 file
$ date > src/file1
$ date > src/file2
$ rclone sync src dst/current --backup-dir dst/backup1
$ tree dst
dst
├── backup1
│   └── file1
└── current
    ├── file1
    └── file2

2 directories, 3 files
$ rm src/file1
$ rclone sync src dst/current --backup-dir dst/backup2
$ tree dst
dst
├── backup1
│   └── file1
├── backup2
│   └── file1
└── current
    └── file2

3 directories, 3 files
$ 

I would say also that amazon_s3:mybuket/${BACKUP_DIR} in your script should be amazon_s3:mybuket/${BACKUP_DIR}/$dir to fit in with the naming scheme.

@guestisp

So, removing file1 results in removal from current and a copy stored in backup1, right?

I'm trying to figure out a proper naming scheme; for example, I would like to create hourly backups. Currently I'm testing this:

BACKUP_DIR=$(/bin/date +'%F_%R' -d '1 hour ago')
rclone sync $dir amazon_s3:mybucket/current$dir --backup-dir amazon_s3:mybucket/${BACKUP_DIR}$dir

in an hourly cron.
This should move any old file from current to the last hour's directory and keep the current backup in current, so if file1 is removed now (15:10), on the next run (16:00) current will lose file1 but the 15:00 directory will keep it. Right?

There is a huge drawback: the backup-dir will only hold changes, not the full tree like rsnapshot does via links. Probably the following would create something more similar to rsnapshot:

rclone sync remote:current remote:1-hour-ago
rclone sync /path/to/local remote:current

but using much more space.

@ncw
Member

ncw commented Aug 31, 2018

So, removing file1 results in removal from current and a copy stored in backup1, right?

That is correct.

This should move any old file from current to the last hour's directory and keep the current backup in current, so if file1 is removed now (15:10), on the next run (16:00) current will lose file1 but the 15:00 directory will keep it. Right?

Yes, that sounds correct too.

There is a huge drawback: the backup-dir will only hold changes, not the full tree like rsnapshot does via links

I intend to fix this with a dedicated backup command at some point but we are not there yet.

rclone sync remote:current remote:1-hour-ago
rclone sync /path/to/local remote:current

Yes, that would work. The first rclone command would use server-side copies, so it would be relatively quick too. It does use a lot more space though. Some might say that is a good thing, as you then have two actually independent backups.

@ivandeex ivandeex changed the title incremental strategy rclone backup: new command with incremental strategy Dec 2, 2020

@ivandeex
Member

@ncw
The last comment here is 3 years old
Do you think that rclone backup is still a viable idea?

@hmoffatt

hmoffatt commented Jul 28, 2022

Could you use --compare-dest with a list of all the directories since the last full backup in order to make an incremental backup?

  • Full backup: possibly use --copy-dest from all of the previous incrementals to avoid uploading again
  • Incremental backup: --compare-dest all the incrementals + the last full
  • Differential backup: --compare-dest the last full backup only
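
A hedged sketch of those three cases -- the bucket layout and dates are invented, and this relies on --compare-dest/--copy-dest accepting multiple directories, which newer rclone versions allow:

# incremental: upload only what is absent from the last full and the later incrementals
rclone copy /data remote:backup/inc-2022-07-28 \
  --compare-dest remote:backup/full-2022-07-01 \
  --compare-dest remote:backup/inc-2022-07-14

# differential: compare against the last full only
rclone copy /data remote:backup/diff-2022-07-28 --compare-dest remote:backup/full-2022-07-01

# next full: server-side copy unchanged files from previous backups instead of re-uploading
rclone copy /data remote:backup/full-2022-08-01 \
  --copy-dest remote:backup/full-2022-07-01 \
  --copy-dest remote:backup/inc-2022-07-28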
