Horcrux - On Demand, Version controlled access to your Data
Docker containers offer developers with the agility and flexibility of replicating the production/test setup in their development environment. So now, developers can develop features, unit test, fix issues in their local setup before pushing it to the test/production environment. Since containers are state-less, it can be moved anywhere easily, say within datacenter or into clouds etc. But in most cases, containers has to access/modify data that traditionally lives in a centralized storage. For example, one of the popular container stack LEMP, needs to access MySQL database. In order to access these data, Docker developed the concept of volume plugins, which can be associated with a container when it is created.
Now the container can move to a different location as long as the volume plugin works there as well. This solves the problem in production/test cases, but if developers have to access the centralized data, that again restricts their flexibily. At the same time, with ever increasing size of data, it is not possible to give each developer a separate copy of the data (say database). That would be prohibitively expensive. To solve this, we give you "Horcrux"...
What Horcrux provides?
- Horcrux provides you (developer) a local view of the whole centralized Data (database etc), so you can develop/test your application without worrying about messing up your precious central repository.
- Centralized repository can be located anywhere (local servers that provide scp access, Minio servers etc.) or in cloud (Amazon AWS S3, Microsoft Azure, Google Cloud etc.), so you are free to access it from anywhere (within your office, at home, in-flight (just kidding)...
- The data volume is visible as a local FUSE filesystem in the developer/test environment.
- When the data is accessed by the application (containers), only the particular chunk of data needed is fetched from the remote repository on-demand and stored locally (in the cache). The whole access is transparent to the applicaiton/container.
- Since only portion of data accessed is retrieved and stored locally, you don't have to buy terabytes of storage for each developers setup or test machine.
- When the working data set is accessed next time around, it is served from the local cache, blazingly fast (almost, as fast as the local file system/storage :))
- The local view provided by Horcrux is a read/write view, so the application/container can modify the data locally.
- and can view, at any time, what is changed.
In future versions, we will add git like capabilities, so you can:
- Commit the local changes, and push it to the centralized repo with a comment (only modified portion is pushed).
- Browse through changes in the remote repository and,
- Access (mount) any version of remote data locally (roll back/forward) to develop/ troubleshoot issues with ease.
- Bestof all, you don't have to do any evil spell (just a few good ones).
We would like to call it a git for DB (but technically it is not the same :), since it will provide all git compatible commands (may even provide a git extension) so you can do pretty much all things with your data that you are already doing with git for your source code.
Generate a Horcrux version for your central data
Place the Horcrux version of your data anywhere you like (local servers within your LAN, AWS S3 etc). We suggest putting it in more than one place. If you don't know yet, check out a cool project, Minio object store server. It can be used to store Horcrux as well.
In the development or test environment: Create Docker volumes using Horcrux volume driver and specifying where the remote data is stored
Now the volumes can be used within your containers as data volumes.
#To Generate Horcrux version of the data:
Step 1: Install Horcrux
Horcrux consists of two binaries, horcrux-cli and horcrux-dv
- horcrux-cli: Used to generate Horcrux version of the data.
- horcrux-dv: A volume driver plugin for Docker.
Download the latest binary copy of horcrux-cli from:
NOTE: Need help to generate binaries for OSX
From Source Code:
Step 2: Generate Horcrux version of the data [Reducto Spell]
NAME: ./horcrux-cli generate - [options] <name> <in-dir> <out-dir> USAGE: ./horcrux-cli generate [command options] [arguments...] OPTIONS: --chunksize, -s "64M" Chunk Size
Lets consider an example of MySQL database stored in database server "kural". We name the database as "AMCC" (some meaningful name).
- Original Database Server: kural
- Name of the database: AMCC
- Location of MySQL files: /var/lib/mysql
For this example we use all files including log files :)
[Optional] Validate the generated Horcrux (in the Database server):
- Server used to validate: kural
- Mount point used: /mnt/horcrux
- Horcrux location: /opt/horcrux
- Access Method: Local "cp" (a.k.a Locket)
NOTE: full path after "cp://", so three slashes in total
Distribute Horcrux to remote repositories
- For AWS S3, you can use "aws s3 sync" on /opt/horcrux-amcc
- You can also replicate the Horcrux to a local server and give SSH access to your developers
That's it... now Horcrux can be accessed by multiple developers simultaneously, and with minimal storage in their development/test machines.
Steps to do to work with the Horcrux generated above
If a developer wants to access the Horcrux, he/she needs to do the "Revelo" spell as follows
Step 1: Install Horcrux
- Please refer to Install section above
Step 2: Install and Configure FUSE
- FUSE should be installed default in most distros (Ubuntu/Fedora/CentOS/RHEL). If not, please check your distro doc on how to install FUSE packages.
- Edit /etc/fuse.conf and add "user_allow_other" line
NOTE: Right now we use "Allow Other" FUSE mount option so the mount point can be accessed by any user within the local system. We need this so the apps within containers can access the bind mountpoint. It may not be a problem right now since developers machine are mostly used by one (or few trusted) user(s). Will be addressed in future release.
- Make /etc/fuse.conf readable
# chmod a+r /etc/fuse.conf
Step 3: Start horcrux-dv volume plugin
# horcrux-dv >& /var/log/horcrux-dv.log &
Step 4: Create Docker volumes using Horcrux Volume driver
Docker volume "v1" that uses SCP access from remote server kural
- Here docker volume name is "v1" - "-d": specifies the Docker volume driver as horcrux - We use the -o to pass options to our horcrux volume driver - first option: "--name=AMCC", here we give the same name that was used in generate step - second option: "--access=scp://muthu@kural:/opt/horcrux-mysql-amcc" specifies the access method as SCP and the remote location as "kural:/opt/horcrux-mysql-amcc"
Docker volume "v2" that uses AWS S3 as remote location
- Here the bucket name is "muthu.horcrux" - Region is "us-west-1"
NOTE: AWS credentials are in ~/.aws/credentials
Step 5: Create Docker containers with volume v1
Container test-scp can now access the all MySQL files inside /data directory
That's pretty much it...
Muthukumar. R - m u t h u r AT g m a i l DOT c o m
Many thanks to:
- Docker folks for the excellent volume plugin interface and the documents - and answering my questions in #docker-dev
- Bazil folks for their Go FUSE on top of which this is built.
- Minio team for their excellent object store server.
- An excellent doc on scp protocol:
- And the golang folks.
- This is version 00.03-rc, so that pretty much explains it (but not bad at all, give it a shot and let me know)
- Only tested on Linux systems (latest version of Fedora, Ubuntu, CentOS)
- Some of local FS calls (Rename) is not there yet - just a matter of adding it, let me know if its needed badly :)
- Open flags like O_EXCL is b0rked (not hard to fix though, next version)
- With scp access, volumes are not visible inside the container consistently. If you experience this, you can workaround by creating a temp container with that volume and leave it running while you create/manage other containers for that volume.
- Cache is not cleaned up after "docker volume rm". If it grows big, please clean it up manually for now.
- Use it and let me know if it helps you..
- Help in supporting other access methods (Google Cloud, Azure, etc..)
- More testing and bug reports (see reporting issues)
- Mac support
- E m a i l m e : m u t h u r AT g m a i l DOT c o m
Version 00.02 (02/06/2016)
- Support for Docker 1.10+ volume plugin changes (Get, List)
Version 00.01 (02/05/2016)
- Supported access methods - CP, SCP, MINIO, AWS S3
- Tested only on Linux systems (Fedora, Ubuntu, CentOS)
- Versioning is not yet there (coming soon...)