
Backup data pipelining #241

Closed
phene opened this Issue · 6 comments

Geoffrey Hichborn

I've come to realize how disconnected each stage of the backup currently is, and I wanted to open a dialog on refactoring this gem so that we can stream data from one stage to the next rather than performing a separate read and write to disk at each stage.

Take this configuration as an example:

Backup::Model.new(:database_and_assets, 'full database and asset backup') do

  database MySQL do |database|
    database.host               = DB_CONFIG['host']
    database.name               = DB_CONFIG['database']
    database.username           = DB_CONFIG['username']
    database.password           = DB_CONFIG['password']
    database.additional_options = ['--single-transaction']
  end

  archive :assets do |archive|
    archive.add "/path/to/asset/data"
  end

  compress_with Gzip do |compression|
    compression.best    = true
    compression.fast    = false
  end

  encrypt_with OpenSSL do |encryption|
    encryption.password = "my_password"
    encryption.salt     = true
    encryption.base64   = false
  end

  store_with Local do |local|
    local.path          = "/some/mount/of/remote/device"
    local.keep          = 4
  end
end

When I trigger the backup, the following occurs:

  1. Mysql dump to .sql file
  2. Tar asset data
  3. Tar results of 1 and 2 together
  4. Gzip tar file from 3
  5. Encrypt tar.gz from 4
  6. Copy tar.gz.enc to destination path
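
Roughly, that sequence amounts to six separate shell steps, each one reading and rewriting the full data set (a simplified sketch only; the gem's actual temp paths, filenames, and options differ):

# Simplified sketch of the current, non-pipelined flow (not the gem's actual
# commands; paths, database name, and cipher options are placeholders):
steps = [
  "mysqldump --single-transaction mydb > tmp/MySQL.sql",                                                    # 1. dump database
  "tar -cf tmp/assets.tar /path/to/asset/data",                                                             # 2. tar asset data
  "tar -cf tmp/backup.tar tmp/MySQL.sql tmp/assets.tar",                                                    # 3. package 1 and 2
  "gzip --best tmp/backup.tar",                                                                             # 4. compress -> tmp/backup.tar.gz
  "openssl enc -aes-256-cbc -salt -pass pass:my_password -in tmp/backup.tar.gz -out tmp/backup.tar.gz.enc", # 5. encrypt
  "cp tmp/backup.tar.gz.enc /some/mount/of/remote/device/",                                                 # 6. copy to destination
]
steps.each { |cmd| system(cmd) } # each step reads and rewrites the full data set on disk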

It seems to me that we could easily combine steps 3, 4, 5, and 6, so that instead of performing a full read and write of the data 4 times, we only do it once. This could be accomplished by the following (a rough sketch follows the list):

  1. Enabling the compress_with Gzip statement to inform the packager that the -z option should be used
  2. Enabling the encrypt_with OpenSSL statement to inform the packager that it should be piped into the openssl command with the correct options
  3. Enabling the store_with Local statement to inform the packager where the result should be written, instead of writing to a tmp file and then copying it.
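
As a minimal sketch of that idea (a hypothetical Packager interface, not the gem's actual API): each stage hands the packager a command fragment, and the packager joins everything into a single pipe so the data is only read and written once.

# Hypothetical sketch: compressor/encryptor/storage each contribute a fragment,
# and the packager assembles a single pipeline.
class Packager
  def initialize
    @stages = []
  end

  # Called by compress_with / encrypt_with to register a pipeline stage.
  def add_stage(command)
    @stages << command
  end

  # Called by store_with to run the whole pipeline straight to its destination.
  def run(sources, destination)
    pipeline = (["tar -cf - #{sources.join(' ')}"] + @stages).join(' | ')
    system("#{pipeline} > #{destination}")
  end
end

packager = Packager.new
packager.add_stage("gzip --best")                                           # from compress_with Gzip
packager.add_stage("openssl enc -aes-256-cbc -salt -pass pass:my_password") # from encrypt_with OpenSSL
packager.run(["tmp/MySQL.sql", "tmp/assets.tar"],
             "/some/mount/of/remote/device/backup.tar.gz.enc")              # from store_with Local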

These enhancements may become critical as backups grow larger and disk/network I/O time cannot be fully controlled.

I don't have any specific code change proposals yet, but I wanted to get the rest of the community's thoughts on this problem.

Michael van Rooijen
Owner

Yup. This is what I've been wanting to do for a while now, but I wasn't sure what a good approach would be. Thanks for bringing it up. This is definitely something I want to have incorporated into Backup, because as you said, the larger the backups, the heavier the I/O, CPU and disk usage will be.

If we could stream/convert File A -> File B without leaving a copy of File A behind, that would be great. (And File B -> C -> D, to stream through all the stages to get to the final result.)

I am certainly willing to get something like that going; it has been bugging me for quite a while. It won't make it into 3.0.20 (the next release), but I'd like to implement it as soon as possible after that. It should be a seamless upgrade, since it doesn't change the DSL or the end result for the user, so it can be incorporated at any time.

I'm currently holding off on 3.0.20 because I want to get the last few tickets resolved and bugs fixed so we have a clean base to work from. I hope to get 3.0.20 out in the next few days, maybe this weekend if all goes well!

Cheers and thanks for the suggestion!

Michael van Rooijen
Owner

Note:

This would have to work with every compressor and encryptor, not just gzip and openssl, but that shouldn't be a problem, I think?

Geoffrey Hichborn

As long as each tool supports reading from stdin and writing to stdout, it should be easy to build a command pipeline with any of the tools involved.

tar cf - /path/to/my/files | gzip/bzip | openssl/gpg > destination.tar.gz/bz.enc

Michael van Rooijen
Owner

Right, we'll have to see if bzip2, pbzip2 and lzma (compressors) and gpg (encryptor) support it; if so, we should be able to get it going. There is also another step you missed because it isn't in the latest gem yet (it is in HEAD@develop): the Splitter. The Splitter basically uses the split utility to split the archive into multiple chunks, but I'm pretty sure split supports reading/writing from stdin/stdout.

https://github.com/meskyanichi/backup/blob/develop/lib/backup/splitter.rb

This will be added to Backup in 3.0.20, which will hopefully be out in a few days.
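
For what it's worth, split does read from stdin when the input file is given as "-", so it should slot onto the end of such a pipeline. An illustrative command only (not Backup's actual invocation; the chunk size and prefix are made up):

# split(1) reads stdin when the input file is "-"; chunk size/prefix are examples.
system("tar -cf - /path/to/asset/data | gzip | split -b 250M - backup.tar.gz-")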

Deleted user

This is currently being worked on. It will not be in the next release, but it will be in the following release, hopefully not too far off :)

As it stands, the new process will be:

  • Each archive configured will pipe its tar output through the configured Compressor.
  • Each MySQL and PostgreSQL database configured will pipe its dump output through the configured Compressor. (I'm still looking at the other databases...)

So, from your example, you would end up with:
database_and_assets/archive/assets.tar.gz
database_and_assets/MySQL/#{DB_CONFIG['database']}.sql.gz

The final packaging tar command would be piped through the Encryptor, giving you:
#{time}.database_and_assets.tar.enc
Or, if the new Splitter is also used, the tar output will be piped through the Encryptor and split, resulting in:
#{time}.database_and_assets.tar.enc-aa
#{time}.database_and_assets.tar.enc-ab
...etc...
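
In shell terms, the new flow for the example configuration would look roughly like this (illustrative commands only; the exact options Backup builds, the database name, and the timestamp format are placeholders):

# Per-stage pipelines: each archive / database dump is piped through the
# configured Compressor ("mydb" is a placeholder database name).
system("tar -cf - /path/to/asset/data | gzip > database_and_assets/archive/assets.tar.gz")
system("mysqldump --single-transaction mydb | gzip > database_and_assets/MySQL/mydb.sql.gz")

# Final packaging: tar the per-stage outputs, pipe through the Encryptor, and
# optionally split into chunks (producing the -aa, -ab, ... suffixes).
time = Time.now.strftime("%Y.%m.%d.%H.%M.%S")
system("tar -cf - database_and_assets | openssl enc -aes-256-cbc -salt -pass pass:my_password | split -b 250M - #{time}.database_and_assets.tar.enc-")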

Deleted user

Pipeline changes have been merged into the develop branch: 9be16f7

This issue was closed by ghost (deleted user).