
Apache Parquet

The idea is to use Apache Parquet for storing group relations between Runtimes and Applications. Apache Parquet is a column-oriented data storage format that could hold the needed pieces of information. The file would be stored at the root of the tenant's directory in the S3 bucket.

The whole structure on S3 would be as follows:

{tenant_id}/
├── applications/
│   ├── {application_id}/
│   │  └── data.json
├── runtimes/
│   ├── {runtime_id}/
│   │  └── data.json
└── db.parquet

To minimize the file size, the db.parquet file would store all label data for runtimes and applications. We would run the following queries (a possible row layout is sketched after the list):

  • all labels
  • all applications with a specific label
  • all runtimes with a specific label
  • all labels with a specific key
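
A minimal sketch of how a db.parquet row could be laid out to answer all four queries, assuming the xitongsys/parquet-go library; the column names (entity_type, entity_id, label_key, label_value) are illustrative assumptions, not a decided schema:

```go
package storage

// LabelRow is a hypothetical row layout for db.parquet: one row per
// (entity, label) pair. Each query above then reduces to filtering on
// at most two columns.
type LabelRow struct {
	EntityType string `parquet:"name=entity_type, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"` // "application" or "runtime"
	EntityID   string `parquet:"name=entity_id, type=BYTE_ARRAY, convertedtype=UTF8"`
	LabelKey   string `parquet:"name=label_key, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	LabelValue string `parquet:"name=label_value, type=BYTE_ARRAY, convertedtype=UTF8"`
}
```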

However, every read/write operation would need multiple S3 requests. For example, to read all runtimes (when we have 10 of them), we would need to do the following (a sketch of this flow follows the list):

  • fetch the listing of the runtimes directory (1 request)
  • download the data.json file of every runtime (10 requests)
  • download the db.parquet file (1 request), query labels for all runtimes, and merge the results with the data above
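
For illustration, the flow above could look roughly like this with the AWS SDK for Go (aws-sdk-go); the bucket layout follows the tree above, while the Runtime shape and function names are assumptions:

```go
package storage

import (
	"encoding/json"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// Runtime is an assumed shape of the data.json content.
type Runtime struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

func listRuntimes(bucket, tenantID string) ([]Runtime, error) {
	svc := s3.New(session.Must(session.NewSession()))

	// Request 1: list the runtimes/ prefix to discover all data.json keys
	// (pagination omitted for brevity).
	list, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(fmt.Sprintf("%s/runtimes/", tenantID)),
	})
	if err != nil {
		return nil, err
	}

	// Requests 2..11 (for 10 runtimes): one GetObject per data.json.
	var runtimes []Runtime
	for _, obj := range list.Contents {
		out, err := svc.GetObject(&s3.GetObjectInput{
			Bucket: aws.String(bucket),
			Key:    obj.Key,
		})
		if err != nil {
			return nil, err
		}
		body, err := io.ReadAll(out.Body)
		out.Body.Close()
		if err != nil {
			return nil, err
		}
		var r Runtime
		if err := json.Unmarshal(body, &r); err != nil {
			return nil, err
		}
		runtimes = append(runtimes, r)
	}

	// Final request: download db.parquet and merge label data into the
	// runtimes (parquet handling omitted here).
	return runtimes, nil
}
```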

We could take an approach similar to the one described here and write labels for runtimes and applications as files under the {tenant_id}/{applications or runtimes}/{application_id or runtime_id}/ directory.

Pros

  1. Efficient data store

    Very good compression: for example, a 1 TB CSV file shrinks to about 130 GB when stored as Parquet.

  2. Good performance for reading data

    Fetching the values of a specific column does not require reading the entire file.

Cons

  1. Doing queries

    For Go, there is the parquet-go library, which handles reading and writing Parquet files.

    In order to run queries against Apache Parquet, we would need a query engine, such as Apache Drill, which is written in Java.

    If we wanted to avoid using such an engine, we would need to rethink how to structure the data so that all our use cases can be covered with just the ability to read a specific column (see the sketch below).
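
For illustration, a query without an engine could be a plain scan-and-filter in application code, assuming the xitongsys/parquet-go library and the hypothetical LabelRow layout sketched earlier:

```go
package storage

import (
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

// runtimesWithLabel scans db.parquet and filters in application code,
// since there is no query engine in front of the file.
func runtimesWithLabel(path, key, value string) ([]string, error) {
	fr, err := local.NewLocalFileReader(path)
	if err != nil {
		return nil, err
	}
	defer fr.Close()

	pr, err := reader.NewParquetReader(fr, new(LabelRow), 4)
	if err != nil {
		return nil, err
	}
	defer pr.ReadStop()

	// Load all rows; for larger files we would read in chunks instead.
	rows := make([]LabelRow, pr.GetNumRows())
	if err := pr.Read(&rows); err != nil {
		return nil, err
	}

	var ids []string
	for _, row := range rows {
		if row.EntityType == "runtime" && row.LabelKey == key && row.LabelValue == value {
			ids = append(ids, row.EntityID)
		}
	}
	return ids, nil
}
```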

  2. Concurrent writes to the same file

    We could enable object versioning on the bucket (for example, in GCS). However, we would need a mechanism that always fetches the latest version of the file and retries saving until it succeeds, as sketched below.
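
A sketch of such a retry loop, assuming GCS and its generation-match preconditions; the modify callback and the object name are illustrative:

```go
package storage

import (
	"context"
	"errors"
	"io"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/googleapi"
)

// updateParquet performs an optimistic read-modify-write of db.parquet.
// The write only succeeds if the object generation is still the one we
// read, so concurrent writers never silently overwrite each other.
func updateParquet(ctx context.Context, bucket *storage.BucketHandle, modify func([]byte) ([]byte, error)) error {
	obj := bucket.Object("db.parquet")
	for {
		// Always fetch the latest version and remember its generation.
		r, err := obj.NewReader(ctx)
		if err != nil {
			return err
		}
		gen := r.Attrs.Generation
		data, err := io.ReadAll(r)
		r.Close()
		if err != nil {
			return err
		}

		updated, err := modify(data)
		if err != nil {
			return err
		}

		// Conditional write: rejected if someone wrote a newer generation.
		w := obj.If(storage.Conditions{GenerationMatch: gen}).NewWriter(ctx)
		if _, err := w.Write(updated); err != nil {
			w.Close()
			return err
		}
		err = w.Close()
		if err == nil {
			return nil // saved successfully
		}
		var gerr *googleapi.Error
		if errors.As(err, &gerr) && gerr.Code == http.StatusPreconditionFailed {
			continue // another writer won the race; retry with the new version
		}
		return err
	}
}
```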

  3. Time-consuming operations

    For example, every time a user adds an application to a group, we would need to download the Parquet file, modify it, and upload the modified version. We cannot cache the file, as we always need its latest version, especially once we use cross-region replicas.

    However, AWS offers S3 Select for Parquet, which eliminates the need to download the whole file for read-only operations; MinIO supports S3 Select as well. Still, every write means downloading the file first. A read-only S3 Select query is sketched below.
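
For illustration, a read-only query via S3 Select with the AWS SDK for Go; the SQL and column names assume the hypothetical LabelRow schema from earlier:

```go
package storage

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// selectRuntimeLabels asks S3 to filter db.parquet server-side instead of
// downloading the whole file. The caller reads matching records from
// the event stream in the returned output.
func selectRuntimeLabels(bucket, tenantID string) (*s3.SelectObjectContentOutput, error) {
	svc := s3.New(session.Must(session.NewSession()))
	return svc.SelectObjectContent(&s3.SelectObjectContentInput{
		Bucket:         aws.String(bucket),
		Key:            aws.String(tenantID + "/db.parquet"),
		ExpressionType: aws.String(s3.ExpressionTypeSql),
		Expression:     aws.String(`SELECT s.entity_id, s.label_key, s.label_value FROM S3Object s WHERE s.entity_type = 'runtime'`),
		InputSerialization: &s3.InputSerialization{
			Parquet: &s3.ParquetInput{},
		},
		OutputSerialization: &s3.OutputSerialization{
			JSON: &s3.JSONOutput{},
		},
	})
}
```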

Summary

Apache Parquet is a good solution for a different use case: storing large amounts of data with a complex structure. In our case, we can store labels using standard S3 capabilities, that is, creating and removing files and directories. We can design the directory structure on S3 to enable querying by groups.