doc: finish arch (#4)
mjpitz committed Oct 20, 2021
1 parent a0a4a9e commit 9e57412
ARCHITECTURE.md (185 additions, 91 deletions)

* [Requirements](#requirements)
* [Implementation](#implementation)
  * [Components](#components)
    * [AetherFS CLI](#aetherfs-cli)
    * [AetherFS Hub](#aetherfs-hub)
    * [AetherFS Agent](#aetherfs-agent)
  * [Interfaces](#interfaces)
    * [gRPC](#grpc)
    * [REST](#rest)
    * [HTTP File Server](#http-file-server)

Sometime after Indeed developed RAD internally, we saw a similar system open sourced from [Netflix][] called [Hollow][].
Hollow is a Java library used to distribute in-memory datasets. Unlike RAD's file-system based approach, Hollow stored
everything in S3. While I have not used Hollow myself, I see the utility it provides to the Java ecosystem.

[Indeed]: https://www.indeed.com
[RAD]: https://www.youtube.com/watch?v=lDXdf5q8Yw8

### Motivation

As I've run more workloads on my Raspberry Pis, I've found myself wanting a different type of file system. Since my
Pis don't have much disk capacity, I want to be able to distribute parts of my total data across the cluster. In
addition to that, I want to deduplicate blocks of data that repeat between versions of a dataset to help keep the
storage footprint low.

### Concepts

At Indeed, we referred to these as "artifacts," but I often found the term to be too generic. In AetherFS, we refer to
collections of information as a _dataset_. Datasets can be tagged, similar to containers. This allows publishers to
manage their own history and channels for consumers.

_**NOTE:** In theory, this system could be used as a generalized package manager, in which case the term "artifact"
would be appropriate. However, that is not my intent for this project._

**Tag**

A Tag in AetherFS is a pointer to a set of blocks that make up the associated dataset. Tags can be used to mark
concrete versions of a dataset (like a [semantic][] or [calendar][] version) or act as a floating pointer (like
`stable` or `next`). Using a floating pointer allows consumers to stay up to date with the latest version of the
dataset without requiring a redeploy.

[semantic]: https://semver.org/
[calendar]: https://calver.org/
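
To make the concept a little more concrete, here is a minimal Go sketch of a tag as a named pointer to block
signatures. The `Tag` type and its fields are purely illustrative assumptions, not AetherFS's actual schema.

```go
package main

import "fmt"

// Tag is an illustrative sketch of the concept above: a named pointer that
// resolves to the set of block signatures backing one version of a dataset.
type Tag struct {
	Dataset string   // e.g. "test"
	Name    string   // e.g. "v21.10" (concrete) or "stable" (floating)
	Blocks  []string // signatures of the blocks that make up this version
}

func main() {
	// A concrete tag and a floating tag can point at the same blocks.
	v2110 := Tag{Dataset: "test", Name: "v21.10", Blocks: []string{"43fbwt…", "feufui…"}}
	stable := Tag{Dataset: "test", Name: "stable", Blocks: v2110.Blocks}

	// Consumers that follow "stable" pick up new versions without redeploying.
	fmt.Println(stable.Dataset, stable.Name, len(stable.Blocks), "blocks")
}
```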

**Blocks**

When a publisher pushes a dataset into AetherFS, its contents are broken up into fixed-size _blocks_. This allows
smaller files to be stored as a single block and larger ones to be broken up into multiple smaller ones. In the end,
the goal is to reduce the amount of data transferred between versions and the number of calls made to the backend.

Blocks are immutable, which allows them to be cached amongst agents in your clusters. As a result, hot data can be
read from your peers instead of requiring a call to your underlying storage tier.
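
A short Go sketch of why immutability makes that safe: when blocks are addressed by a content signature, a copy served
by a peer is interchangeable with one read from storage. The use of SHA-256 and base32 here is an assumption for
illustration, not a statement about AetherFS's actual signature format.

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// blockSignature derives a content address for a block: the same bytes always
// produce the same signature, so cached copies never go stale.
func blockSignature(block []byte) string {
	sum := sha256.Sum256(block)
	return base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	cache := map[string][]byte{} // signature -> block bytes

	block := []byte("fixed size chunk of a dataset")
	sig := blockSignature(block)

	if _, ok := cache[sig]; !ok {
		cache[sig] = block // miss: fetch from a peer or the storage tier
	}
	fmt.Println("cached block", sig[:8], "…", len(cache[sig]), "bytes")
}
```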

**Signature**

- Provide numerous interfaces to manage and access information in the system.
- Built in developer tools to help producers understand the performance of their datasets.

[AWS S3]: https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html

## Implementation

<a href="docs/assets/seen-stored-cached.png">
<img src="https://aetherfs.tech/assets/overview.png" align="right" width="60%"/>
<a href="docs/assets/overview.png">
<img src="https://aetherfs.tech/assets/overview.png" align="right" width="40%"/>
</a>

AetherFS deploys as a simple client-server architecture, with a few caveats.

The AetherFS Agent process can cluster into a mesh of nodes, allowing it to share data amongst its peers. This allows
agents to reduce calls to the database by re-using blocks of data that are likely already within the cluster.
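
The sketch below illustrates one plausible read path consistent with this design. The `fetcher` interface and
`readBlock` helper are hypothetical names, not AetherFS's API: check a local cache first, then peers, and only fall
back to the storage tier on a miss.

```go
package main

import (
	"errors"
	"fmt"
)

// fetcher is an illustrative abstraction over anything that may hold a block
// by signature: a local cache, a peer agent, or the storage tier.
type fetcher interface {
	Fetch(signature string) ([]byte, error)
}

type mapFetcher map[string][]byte

func (m mapFetcher) Fetch(sig string) ([]byte, error) {
	if b, ok := m[sig]; ok {
		return b, nil
	}
	return nil, errors.New("not found")
}

// readBlock prefers copies that are already in the cluster and only calls the
// storage tier when no peer holds the block.
func readBlock(sig string, local fetcher, peers []fetcher, storage fetcher) ([]byte, error) {
	if b, err := local.Fetch(sig); err == nil {
		return b, nil
	}
	for _, p := range peers {
		if b, err := p.Fetch(sig); err == nil {
			return b, nil
		}
	}
	return storage.Fetch(sig)
}

func main() {
	peer := mapFetcher{"abc": []byte("hot block already in the cluster")}
	b, err := readBlock("abc", mapFetcher{}, []fetcher{peer}, mapFetcher{})
	fmt.Println(string(b), err)
}
```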

### Components

While there are a few moving components to AetherFS, we distribute everything as part of a single binary.

#### AetherFS CLI

![role: client](https://img.shields.io/badge/role-client-white?style=for-the-badge)
![interfaces: command line, fuse](https://img.shields.io/badge/interfaces-command%20line,%20fuse-white?style=for-the-badge)

The command line interface (CLI) is the primary mechanism for interacting with datasets stored in AetherFS. Operators
use it to run the various infrastructure components and engineers use it to download datasets on demand. It can even be
run as an init container to initialize a file system for a Kubernetes Pod.

```
$ aetherfs -h
NAME:
   aetherfs - A virtual file system for small to medium sized datasets (MB or GB, not TB or PB).

USAGE:
   aetherfs [options] <command>

COMMANDS:
   pull     Pulls a dataset from AetherFS
   push     Pushes a dataset into AetherFS
   run      Run the various AetherFS processes
   version  Print the binary version information

GLOBAL OPTIONS:
   --log_level value   adjust the verbosity of the logs (default: "info") [$LOG_LEVEL]
   --log_format value  configure the format of the logs (default: "json") [$LOG_FORMAT]
   --help, -h          show help (default: false)

COPYRIGHT:
   Copyright 2021 The AetherFS Authors - All Rights Reserved
```

#### AetherFS Hub

![role: server](https://img.shields.io/badge/role-server-white?style=for-the-badge)
![interfaces: grpc, file server, rest, web](https://img.shields.io/badge/interfaces-grpc,%20file%20server,%20rest,%20web-white?style=for-the-badge)

AetherFS hubs host datasets for download. They're horizontally scalable, making it easy to scale up when additional
CPU or memory is needed. Hubs are responsible for managing the underlying storage tier (S3) and verifying that
authenticated clients have access to datasets. AetherFS itself does not implement an identity provider, but will
eventually work with [dex](https://dexidp.io) or other OIDC providers.

```
$ aetherfs run hub -h
NAME:
   aetherfs run hub - Runs the AetherFS Hub process
...
OPTIONS:
   ...
```

#### AetherFS Agent

![role: client, server](https://img.shields.io/badge/role-client,%20server-white?style=for-the-badge)
![interfaces: grpc, file server, fuse, rest](https://img.shields.io/badge/interfaces-grpc,%20file%20server,%20fuse,%20rest-white?style=for-the-badge)

AetherFS agents support a variety of use cases. At the end of the day, an agent's goal is to serve information from
AetherFS as quickly as possible through a number of mechanisms. In addition to the File Server and FUSE interfaces, it
can manage a local file system path. This allows applications to interact with files on disk just as if they were
packaged locally.

_Not yet implemented._

```
$ aetherfs run agent -h
```

### Interfaces

#### REST & gRPC
This project exposes numerous interfaces for interaction. Why? For the most part, everyone has their own preference. In
this case, a lot of it is offered out of convenience. For example, instead of implementing a `FileAPI`, I implemented a
[file server](#http-file-server) instead. Beyond the initial HTTP File System, this core set of logic can be reused to
write the FUSE interface later on.

#### gRPC

gRPC is the primary form of communication between AetherFS components. While it has some sharp edges, supporting
communication through an HTTP ingress is a requirement so that end users can download data. It also provides a
convenient way to write stream-based APIs, which are critical for large datasets.

**DatasetAPI**

**BlockAPI**

The `BlockAPI` gives callers direct access to the block data.
The `BlockAPI` does not provide callers with the ability to list blocks (intentional design decision). An entire block
can be read at a time, or just part of one.

#### REST

AetherFS provides a REST interface via grpc-gateway. It's offered primarily for end users to interact with when no
gRPC client has been generated or gRPC is unavailable. For example, the web interface is built against the REST
interface since gRPC isn't natively available in browsers. Because our API contains streaming calls, supporting them
over REST is difficult.

All REST endpoints are under the `/api` prefix.

#### HTTP File Server

Using [Golang's http.FileServer][], AetherFS was able to provide a quick prototype of a FileSystem implementation.

### Data Management

<a href="docs/assets/seen-stored-cached.png">
<img src="https://aetherfs.tech/assets/seen-stored-cached.png" align="right" width="40%"/>
</a>

There isn't a ton of bells and whistles to how data is managed within the AetherFS architecture. We expect the storage
provider to offer durability guarantees. Caching will play a more important role in the next release and will warrant
more detail then.

#### Packing

When uploading files to AetherFS, we pack all files found in a target directory, zip, or tarball into a single
contiguous blob. This large blob is broken into smaller blocks that are ideally sized for your storage layer. For
example, Amazon Athena documentation suggests using S3 objects between 256MiB and 1GiB to optimize network bandwidth.
<!-- needs citation -->
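
As a rough sketch of the idea (the helper name and types are mine, not AetherFS's implementation), packing amounts to
concatenating file contents into one blob and slicing the result into fixed-size pieces, where only the last piece may
be short:

```go
package main

import (
	"bytes"
	"fmt"
)

// packAndSplit concatenates file contents into a single contiguous blob and
// then slices it into fixed-size blocks.
func packAndSplit(files [][]byte, blockSize int) [][]byte {
	var blob bytes.Buffer
	for _, f := range files {
		blob.Write(f)
	}

	var blocks [][]byte
	data := blob.Bytes()
	for start := 0; start < len(data); start += blockSize {
		end := start + blockSize
		if end > len(data) {
			end = len(data)
		}
		blocks = append(blocks, data[start:end])
	}
	return blocks
}

func main() {
	files := [][]byte{[]byte("README.html contents"), []byte("HYP_HR_SR_W_DR.tif contents")}
	for i, b := range packAndSplit(files, 16) {
		fmt.Printf("block %d: %d bytes\n", i, len(b))
	}
}
```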

Each dataset can choose its own block size, ideally striving to get the most reuse between versions. While producers
have control over the size of the blocks that are stored in AetherFS, they do not control the size of the cacheable
parts. This allows consumers of datasets to tune their usage by adding more memory, disk, or peers where they need to.
To help explain this a little more, let us consider the following example manifest.

```json
{
  "dataset": {
    "files": [{
      "name": "HYP_HR_SR_W_DR.VERSION.txt",
      "size": "5",
      "lastModified": "2012-11-08T06:34:01Z"
    }, {
      "name": "HYP_HR_SR_W_DR.prj",
      "size": "147",
      "lastModified": "2012-09-19T13:56:32Z"
    }, {
      "name": "HYP_HR_SR_W_DR.tfw",
      "size": "173",
      "lastModified": "2009-12-22T20:48:30Z"
    }, {
      "name": "HYP_HR_SR_W_DR.tif",
      "size": "699969826",
      "lastModified": "2012-07-16T16:23:30Z"
    }, {
      "name": "README.html",
      "size": "30763",
      "lastModified": "2012-11-08T06:34:01Z"
    }],
    "blockSize": 268435456,
    "blocks": [
      "43fbwtgpsbh6naab7cevphlcxvp6xi3etkwataxtvxgkltcpky4q====",
      "feufuim3ogmo34aqv5htwkww4ovll5weurt3nc6p233irbjlwjwq====",
      "7egqp74hvvfdau5vy6tyynvzs6mlxoikit5bc4fdxeiuwwclcd4q===="
    ]
  }
}
```

This manifest is based on [BitTorrent][] and the modifications Indeed needed to make to support RAD. While not
immediately obvious, it contains all the information needed to reconstruct the original file system AetherFS took a
snapshot of. For example, the snippet of pseudocode below can be used to identify which blocks need to be read in
order to reconstruct a specific file.

[BitTorrent]: https://en.wikipedia.org/wiki/BitTorrent

```
offset = 0
size = 0

for file in files {
  if file.name == desiredFile {
    size = file.size
    break
  }
  offset = offset + file.size
}

block = blocks[offset / blockSize]
blockOffset = offset % blockSize

// read `size` bytes starting at blockOffset within block
// note: size may exceed blockSize, so the read can span multiple blocks ;-)
```

Keep in mind that blocks can contain many files, and a single file can require many blocks. This is an important
detail when reconstructing data.
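
Here is a small, self-contained Go sketch of the same lookup that also handles files spanning multiple blocks. The
`file` and `blockRange` types are illustrative, not AetherFS's actual structures; the sizes come from the example
manifest above.

```go
package main

import "fmt"

type file struct {
	name string
	size int64
}

// blockRange describes a read within a single block: which block, where to
// start, and how many bytes to take.
type blockRange struct {
	block  int
	offset int64
	length int64
}

// rangesForFile finds the file's offset within the packed blob, then walks
// forward block by block until the whole file has been covered. A block can
// hold many files, and a file can span many blocks.
func rangesForFile(files []file, blockSize int64, name string) []blockRange {
	var offset, size int64
	size = -1
	for _, f := range files {
		if f.name == name {
			size = f.size
			break
		}
		offset += f.size
	}
	if size < 0 {
		return nil // not part of this dataset
	}

	var out []blockRange
	for size > 0 {
		block := offset / blockSize
		blockOffset := offset % blockSize
		length := blockSize - blockOffset
		if length > size {
			length = size
		}
		out = append(out, blockRange{block: int(block), offset: blockOffset, length: length})
		offset += length
		size -= length
	}
	return out
}

func main() {
	files := []file{
		{"HYP_HR_SR_W_DR.VERSION.txt", 5},
		{"HYP_HR_SR_W_DR.prj", 147},
		{"HYP_HR_SR_W_DR.tfw", 173},
		{"HYP_HR_SR_W_DR.tif", 699969826},
		{"README.html", 30763},
	}
	for _, r := range rangesForFile(files, 268435456, "HYP_HR_SR_W_DR.tif") {
		fmt.Printf("block %d: read %d bytes at offset %d\n", r.block, r.length, r.offset)
	}
}
```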

#### Persistence

AetherFS persists data in an S3-compatible solution. Internally, it uses the MinIO Golang client to communicate with
the S3 API. Credentials can be provided through common AWS environment variables, MinIO environment variables, or
command line options. Similar to Git's object store and Docker's blob store, the AetherFS block store persists blocks
in a directory prefixed by the first two letters of the signature. Meanwhile, datasets are kept in a separate key
space that allows hubs to list datasets and their tags without the need for a metadata file (i.e., through prefix
scans). For example:

```
blocks/{sha[:2]}/{sha[2:]}
blocks/1e/0d0f7ab123836cd69efb28cc62908c03002a11
blocks/1e/67d8b5474d6141c12da7798e494681e3d87440
blocks/1e/d939b2a6636c5a6085d0e885df0ac997c0d0a7
blocks/1e/fc7af3a2e5dfdd23163bd9e8c5b97059f56a37
datasets/{name}/{tag}
datasets/test/v21.10
datasets/test/latest
```
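
A small Go sketch of how such keys might be constructed and how a prefix scan can yield a dataset's tags. The helper
names are illustrative, and the in-memory listing stands in for a real prefix scan against S3 (e.g. via the MinIO
client).

```go
package main

import (
	"fmt"
	"strings"
)

// blockKey and tagKey mirror the layout shown above.
func blockKey(signature string) string {
	return fmt.Sprintf("blocks/%s/%s", signature[:2], signature[2:])
}

func tagKey(dataset, tag string) string {
	return fmt.Sprintf("datasets/%s/%s", dataset, tag)
}

// listTags filters keys by the dataset prefix, the same idea a hub would
// apply with an S3 prefix scan instead of an in-memory slice.
func listTags(keys []string, dataset string) []string {
	prefix := "datasets/" + dataset + "/"
	var tags []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			tags = append(tags, strings.TrimPrefix(k, prefix))
		}
	}
	return tags
}

func main() {
	keys := []string{
		blockKey("1e0d0f7ab123836cd69efb28cc62908c03002a11"),
		tagKey("test", "v21.10"),
		tagKey("test", "latest"),
	}
	fmt.Println(keys[0])                // blocks/1e/0d0f7ab123836cd69efb28cc62908c03002a11
	fmt.Println(listTags(keys, "test")) // [v21.10 latest]
}
```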

Since we store this information separately, expiring and cleaning up older blocks is relatively easy to implement.
Some datasets, like MaxMind's, have terms of use that require new versions to replace old ones in a timely manner and
older copies to be deleted upon request.

#### Caching

_Not yet implemented._

### Security & Privacy

This is an initial release. We'll add more on this later. Generally speaking, I like to take a reasonable approach
toward security. To me, that means shipping security features as we need them instead of trying to force them in
right out of the gate. That being said, while our initial release has no plans to support authentication, we're
looking at including it in our second.

So stay tuned until then... or not. I'm sure you know how these things work...

#### Authentication
