Architecture
During my time at Indeed, many of our systems leveraged a producer-consumer architecture. In this pattern, services load an artifact containing data into memory in order to better serve requests. These artifacts could be consumed by a single service or shared across many services. Eventually, this developed into a platform called RAD (short for resilient artifact distribution).
Sometime after Indeed developed RAD internally, we saw a similar system open sourced by Netflix called Hollow. Hollow is a Java library for distributing in-memory datasets. Unlike RAD's file-system-based approach, Hollow stores everything in S3. While I have not used Hollow myself, I can see the utility it provides to the Java ecosystem.
As I've run more workloads on my Raspberry Pis, I've found myself wanting a different type of file system. Since my Pis don't have a lot of disk capacity, I want to be able to distribute parts of my total data across the cluster. In addition, I want to deduplicate blocks of data that repeat between versions to help keep the data footprint low.
Dataset
At Indeed, we referred to these as "artifacts," but I often found the term to be too generic in conversation. In AetherFS, we refer to collections of information as a dataset. Datasets can be tagged, similar to containers. This allows publishers to manage their own history and channels for consumers.
NOTE: In theory, this system could be used as a generalized package manager, in which the term "artifact" would be appropriate. However, that is not my intent for this project.
Tag
A tag in AetherFS is a pointer to the set of blocks that make up the associated dataset. Tags can mark concrete versions of a dataset (like a semantic or calendar version) or act as floating pointers (like `stable` or `next`). Using a floating pointer allows consumers to stay up to date with the latest version of the dataset without requiring a redeploy.
Blocks
When a publisher pushes a dataset into AetherFS, its contents are broken up into fixed-size blocks. This allows smaller files to be stored within a single block and larger ones to be split across several. The goal is to reduce the amount of data transferred between versions and the number of calls made to the backend.
Blocks are immutable, which allows them to be cached amongst agents in your clusters. This allows hot data to be read from your peers instead of making a call to your underlying storage tier.
Signature
Each block stored in S3 is given a unique signature that represents the contents of the block (i.e. a cryptographic hash). Signatures allow clients to check whether a block already exists, to download a block, and to upload a block.
This document focuses on the design of a highly available, partition-tolerant virtual file system for small to medium datasets.
- Efficiently store and query information in AWS S3 (or equivalent).
- Information should be encrypted in transit and at rest.
- Authenticate clients (users and services) using common schemes (OIDC, Basic).
- Enforce access controls around datasets.
- Provide numerous interfaces to manage and access information in the system.
- Built-in developer tools to help producers understand the performance of their datasets.
AetherFS is implemented to allow for a variety of deployments, attempting to meet workloads where they are today. Out of the box, AetherFS provides a replicated hub, but it can run as a DaemonSet, sidecar, or init container depending on the needs of an application.
AetherFS does not make any guarantees around the consistency of data as this is dependent on the underlying S3 store.
While there are a few moving components to AetherFS, we distribute everything as part of a single binary.
The command line interface (CLI) is the primary mechanism for interacting with datasets stored in AetherFS. Operators use it to run the various infrastructure components and engineers use it to download datasets on demand. It can even be run as an init container to initialize a file system for a Kubernetes Pod.
$ aetherfs -h
NAME:
aetherfs - A virtual file system for small to medium sized datasets (MB or GB, not TB or PB).
USAGE:
aetherfs [options] <command>
COMMANDS:
auth Manage authentication to AetherFS servers
pull Pulls a dataset from AetherFS
push Pushes a dataset into AetherFS
run Runs the AetherFS process
version Print the binary version information
GLOBAL OPTIONS:
--log_level value adjust the verbosity of the logs (default: "info") [$LOG_LEVEL]
--log_format value configure the format of the logs (default: "json") [$LOG_FORMAT]
--help, -h show help (default: false)
COPYRIGHT:
Copyright 2021 The AetherFS Authors - All Rights Reserved
An AetherFS hub is a replicated deployment of AetherFS instances. In addition to providing the core APIs, hubs typically have the web interface enabled, allowing users to interact with contents of datasets in a browser. They're horizontally scalable and can be auto-scaled based on CPU utilization. Memory can be tuned to allow more data to be cached at the edge, reducing round trips to the underlying S3 store.
$ aetherfs run --web_enable -h
NAME:
aetherfs run - Runs the AetherFS process
USAGE:
aetherfs run command [command options] [arguments...]
COMMANDS:
OPTIONS:
--config_file value specify the location of a file containing the run configuration [$CONFIG_FILE]
--port value which port the HTTP server should be bound to (default: 8080) [$PORT]
--tls_enable whether or not TLS should be enabled (default: false) [$TLS_ENABLE]
--tls_cert_path value where to locate certificates for communication [$TLS_CERT_PATH]
--tls_ca_file value override the ca file name (default: "ca.crt") [$TLS_CA_FILE]
--tls_cert_file value override the cert file name (default: "tls.crt") [$TLS_CERT_FILE]
--tls_key_file value override the key file name (default: "tls.key") [$TLS_KEY_FILE]
--tls_reload_interval value how often to reload the config (default: 5m0s) [$TLS_RELOAD_INTERVAL]
--nfs_enable enable NFS support (default: false) [$NFS_ENABLE]
--nfs_port value which port the NFS server should be bound to (default: 2049) [$NFS_PORT]
--agent_enable enable the agent API (default: false) [$AGENT_ENABLE]
--agent_shutdown_enable enables the agent API to initiate a shutdown (default: false) [$AGENT_SHUTDOWN_ENABLE]
--storage_driver value configure how information is stored (default: "s3") [$STORAGE_DRIVER]
--storage_s3_endpoint value location of s3 endpoint (default: "s3.amazonaws.com") [$STORAGE_S3_ENDPOINT]
--storage_s3_tls_enable whether or not TLS should be enabled (default: false) [$STORAGE_S3_TLS_ENABLE]
--storage_s3_tls_cert_path value where to locate certificates for communication [$STORAGE_S3_TLS_CERT_PATH]
--storage_s3_tls_ca_file value override the ca file name (default: "ca.crt") [$STORAGE_S3_TLS_CA_FILE]
--storage_s3_tls_cert_file value override the cert file name (default: "tls.crt") [$STORAGE_S3_TLS_CERT_FILE]
--storage_s3_tls_key_file value override the key file name (default: "tls.key") [$STORAGE_S3_TLS_KEY_FILE]
--storage_s3_tls_reload_interval value how often to reload the config (default: 5m0s) [$STORAGE_S3_TLS_RELOAD_INTERVAL]
--storage_s3_access_key_id value the access key id used to identify the client [$STORAGE_S3_ACCESS_KEY_ID]
--storage_s3_secret_access_key value the secret access key used to authenticate the client [$STORAGE_S3_SECRET_ACCESS_KEY]
--storage_s3_region value the region where the bucket exists [$STORAGE_S3_REGION]
--storage_s3_bucket value the name of the bucket to use [$STORAGE_S3_BUCKET]
--web_enable enable the web ui (default: false) [$WEB_ENABLE]
--help, -h show help (default: false)
AetherFS can be run in an `agent` mode. This exposes additional APIs for managing datasets on the local file system. In this mode, AetherFS acts like the CLI (leveraging things like local keyrings for auth) while also providing the common core APIs.
$ aetherfs run --agent_enable [--agent_shutdown_enable]
This project exposes a number of interfaces over up to two ports.
- `8080` - HTTP Protocol - provides `grpc`, `http`, `rest`, `ui`, `webdav`
- `2049` - NFS Protocol - provides `nfs` (optional)
Why so many interfaces? Most of these were added over time to support different use cases. For example, WebDAV was added for easy server mounts in OSX. NFS was added to support more native mounts at the *nix layer, which is needed for the small computers this will be running on (Raspberry Pis).
gRPC is the primary form of communication between AetherFS components. While it has some sharp edges, supporting communication through an HTTP ingress is a requirement for end users to be able to download data. gRPC also provides a convenient way to write stream-based APIs, which are critical for large datasets.
DatasetAPI
The `DatasetAPI` allows callers to interact with the various datasets stored within AetherFS. Dataset manifests contain a complete list of files within the dataset, their sizes, and last-modified timestamps. The manifests also contain a list of blocks that are required to construct the dataset. Using these components, clients can piece together the underlying files.
BlockAPI
The `BlockAPI` gives callers direct access to block data. You must use the `DatasetAPI` to obtain block references. The `BlockAPI` does not provide callers with the ability to list blocks (an intentional design decision). Callers can read an entire block at a time, or just part of one.
AetherFS provides a REST interface via grpc-gateway. It's offered primarily for end users to interact with when no gRPC client has been generated or gRPC is unavailable. For example, the web interface is built against the REST interface since gRPC isn't available in browsers. Because our API contains streaming calls, supporting them over REST is difficult.
All REST endpoints are under the `/api` prefix.
Using Golang's http.FileServer, AetherFS provides a quick prototype of a file system implementation. Using HTTP range requests, callers can read segments of large files that may be too large to fit in memory all at once. Agents can still make use of caching to keep the data local to the process.
This currently resides under the `/fs` prefix.
Alternatively, you can mount the `/webdav` endpoint. WebDAV extends HTTP to support various file system operations. OSX can easily connect to WebDAV servers using Finder. For native file system mounts, use the NFS interface.
Instead of implementing a FUSE file system, I wound up implementing an NFS server interface. After spending some time fighting FUSE on OSX, I found implementing an NFS server to be much easier. An added benefit of this interface is that it reduces requirements on the clients. Many systems (such as Kubernetes) support NFS mounts out of the box, making it easy to integrate AetherFS as a read-only file system.
mkdir [mountpoint]
# on OSX
mount -o port=2049,mountport=2049 -t nfs localhost:/mount [mountpoint]
# on Linux
mount -o port=2049,mountport=2049,nfsvers=3,noacl,tcp -t nfs localhost:/mount [mountpoint]
For an example on how to mount AetherFS using a Kubernetes Persistent Volume, see the aetherfs-datasets chart.
A web interface is available under the `/ui` prefix. It allows you to explore datasets, their tags, and even files within the dataset from a graphical interface. If you include a `README.md` file in the root of your dataset, we're able to render it inline in the browser. Eventually, it will even be able to provide insight into how well your dataset performs between versions.
In addition to providing a way to explore datasets, it serves a Swagger UI for engineers to explore the API in more detail.
There aren't a ton of bells and whistles to how data is managed within the AetherFS architecture. We expect the storage provider to offer durability guarantees. Caching will play a more important role in the next release and will be covered in more detail then.
When uploading files to AetherFS, we pack all files found in a target directory, zip, or tarball into a single contiguous blob. This large blob is broken into smaller blocks that are ideally sized for your storage layer. For example, Amazon Athena documentation suggests using S3 objects between 256MiB and 1GiB to optimize network bandwidth.
Each dataset can choose its own block size, ideally striving to get the most reuse between versions. While producers control the size of the blocks stored in AetherFS, they do not control the size of the cacheable parts. This allows consumers of datasets to tune their usage by adding more memory, disk, or peers where they need to. To help explain this a little more, let us consider the following example manifest.
{
"dataset": {
"files": [{
"name": "HYP_HR_SR_W_DR.VERSION.txt",
"size": "5",
"lastModified": "2012-11-08T06:34:01Z"
}, {
"name": "HYP_HR_SR_W_DR.prj",
"size": "147",
"lastModified": "2012-09-19T13:56:32Z"
}, {
"name": "HYP_HR_SR_W_DR.tfw",
"size": "173",
"lastModified": "2009-12-22T20:48:30Z"
}, {
"name": "HYP_HR_SR_W_DR.tif",
"size": "699969826",
"lastModified": "2012-07-16T16:23:30Z"
}, {
"name": "README.html",
"size": "30763",
"lastModified": "2012-11-08T06:34:01Z"
}],
"blockSize": 268435456,
"blocks": [
"43fbwtgpsbh6naab7cevphlcxvp6xi3etkwataxtvxgkltcpky4q====",
"feufuim3ogmo34aqv5htwkww4ovll5weurt3nc6p233irbjlwjwq====",
"7egqp74hvvfdau5vy6tyynvzs6mlxoikit5bc4fdxeiuwwclcd4q===="
]
}
}
This manifest is based on BitTorrent and the modifications Indeed needed to make to support RAD. While not immediately obvious, it contains all the information needed to reconstruct the original file system AetherFS took a snapshot of. For example, the pseudocode below can be used to identify which blocks need to be read in order to reconstruct a specific file.
offset = 0
size = 0
for file in files {
  if file.name == desiredFile {
    size = file.size
    break
  }
  offset = offset + file.size
}
block = blocks[offset / blockSize]
blockOffset = offset % blockSize
// read size bytes starting at blockOffset within block
// note: size may be greater than blockSize, so a read can span multiple blocks
Keep in mind that blocks can contain many files, and a single file can require many blocks. This is an important detail when reconstructing data.
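A runnable version of that lookup might look like the following. This is a sketch based on the example manifest above, using my own type and function names rather than the project's actual code.

```go
package main

import "fmt"

type file struct {
	name string
	size int64
}

// locate returns the first block index, the byte offset within that block, and
// the number of bytes to read for the named file. Files are packed contiguously
// in manifest order, so a file's offset is the sum of the sizes before it.
func locate(files []file, name string, blockSize int64) (block, blockOffset, size int64, ok bool) {
	var offset int64
	for _, f := range files {
		if f.name == name {
			return offset / blockSize, offset % blockSize, f.size, true
		}
		offset += f.size
	}
	return 0, 0, 0, false
}

func main() {
	// Files and sizes taken from the example manifest above.
	files := []file{
		{"HYP_HR_SR_W_DR.VERSION.txt", 5},
		{"HYP_HR_SR_W_DR.prj", 147},
		{"HYP_HR_SR_W_DR.tfw", 173},
		{"HYP_HR_SR_W_DR.tif", 699969826},
		{"README.html", 30763},
	}
	block, off, size, _ := locate(files, "README.html", 268435456)
	fmt.Println(block, off, size) // README.html starts partway into the third block
}
```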
AetherFS persists data in an S3-compatible solution. Internally, it uses the MinIO Golang client to communicate with the S3 API. We support reading common AWS environment variables, MinIO environment variables, or command line options. Similar to Git's object store and Docker's blob store, the AetherFS block store persists blocks in a directory prefixed by the first two letters of the signature. Meanwhile, datasets live in a separate key space that allows hubs to list datasets and their tags without the need for a metadata file (i.e. through prefix scans). For example:
blocks/{sha[:2]}/{sha[2:]}
blocks/1e/0d0f7ab123836cd69efb28cc62908c03002a11
blocks/1e/67d8b5474d6141c12da7798e494681e3d87440
blocks/1e/d939b2a6636c5a6085d0e885df0ac997c0d0a7
blocks/1e/fc7af3a2e5dfdd23163bd9e8c5b97059f56a37
datasets/{name}/{tag}
datasets/test/v21.10
datasets/test/latest
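The key layout above is straightforward to reproduce. The helpers below are illustrative (my own function names, not AetherFS's), mirroring Git's two-character prefix convention:

```go
package main

import "fmt"

// blockKey places a block under a two-character prefix of its signature,
// mirroring Git's object store layout.
func blockKey(signature string) string {
	return fmt.Sprintf("blocks/%s/%s", signature[:2], signature[2:])
}

// datasetKey addresses a tagged dataset so hubs can list datasets and
// their tags via prefix scans.
func datasetKey(name, tag string) string {
	return fmt.Sprintf("datasets/%s/%s", name, tag)
}

func main() {
	fmt.Println(blockKey("1e0d0f7ab123836cd69efb28cc62908c03002a11"))
	// prints: blocks/1e/0d0f7ab123836cd69efb28cc62908c03002a11
	fmt.Println(datasetKey("test", "v21.10"))
	// prints: datasets/test/v21.10
}
```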
Since we store this information separately, expiring and cleaning up older blocks is relatively easy. Some datasets, like MaxMind's, have terms of use requiring that new versions replace old ones in a timely manner and that older copies be deleted upon request.
Not yet implemented.
This is an initial release; we'll add more on this later. Generally speaking, I like to take a reasonable approach toward security. To me, that means shipping security features as we need them instead of trying to force them in right out of the gate. That said, while our initial release does not support authentication, we're looking at including it in our second.
So stay tuned until then... or not. I'm sure you know how these things work...
Not yet implemented.
Not yet implemented.
For the most part, AetherFS expects your blob storage solution to provide this functionality. After an initial search, it seems most solutions provide some form of encryption at rest. Later on, we may add end-to-end encryption support (assuming interest).
Where possible, our systems leverage TLS certificates to encrypt communication between processes.