AvroX

AvroX makes Avro-formatted data discoverable in a closed system. The idea behind it was to create a method for encoding structured data for NATS messages in a more compact format than JSON + JSON Schema.

This package is still experimental!

TL;DR: Don't use it for anything important!

This package is a work in progress (WIP). Feel free to try it out, but don't expect it to be suitable for production use, and anticipate daily changes to the API. We publish this primarily because some of the tools we also publish utilize this package, and we believe it could become very useful eventually.

Take note of our disclaimer below!

What it delivers:

  • Highly concise binary encoding for small data sizes.
  • A JSON schema with additional data documentation.
  • Optional further data compression (we are currently using Snappy, but we support up to 7 types with AvroX).
  • Avro has good native types for time, date, and binary data. This works in Go thanks to the wonderful hamba/avro/v2 package.
  • A usage experience similar to other marshaller implementations.
  • A three-level versioning identifier (similar to semver) of the form N.S.V, which stands for namespace, schema, version.
  • Support for unmarshalling to a given list of schemas (unions) where the destinations can be a nil type or a concrete type. After identifying which schema is used in the source data, it returns either a newly allocated value or uses the given storage.
  • Some basic types like string, int, map[string]any can be directly marshalled, while also utilizing Avro.
  • The unmarshaller automatically detects JSON (for manual debugging) as an alternative to Avro data (may get removed soon).
  • Seamless integration with the NATS CLI Tool --translate option through the use of our message converter tool msgcvt. This will also eventually support schema storage within NATS.
  • The schema can be used in an interpreted or compiled manner (we do not use compiled Avro so far).
  • Schema registry with namespace support, accommodating both public and private schemas.
  • AvroX data can be discovered in a binary stream (although this is just an experiment).
  • Avro schemas can also be autogenerated through avscgen, which is currently still proprietary and may eventually be released to the public.
  • We are also working on an auto indexer that can generate indexes for the messages in a stream based on indexing information that can be added to a schema's fields (a bit like adding indexes when using a database).

(This list is not exhaustive...)
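
To make the N.S.V identifier from the list above more concrete, here is a minimal sketch in Go. It assumes a dotted textual form such as "1.2.3" with unsigned numeric components. The type SchemaID and the function ParseSchemaID are hypothetical names for illustration only; they are not part of the AvroX API, and the way AvroX actually stores the identifier may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SchemaID is a hypothetical representation of an N.S.V identifier:
// namespace, schema, version. It is not the AvroX type.
type SchemaID struct {
	Namespace uint32
	Schema    uint32
	Version   uint32
}

// ParseSchemaID parses the assumed dotted form "N.S.V".
func ParseSchemaID(s string) (SchemaID, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return SchemaID{}, fmt.Errorf("want N.S.V, got %q", s)
	}
	var n [3]uint32
	for i, p := range parts {
		// ParseUint rejects signs and empty or non-numeric components.
		v, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return SchemaID{}, fmt.Errorf("invalid component %q in %q", p, s)
		}
		n[i] = uint32(v)
	}
	return SchemaID{Namespace: n[0], Schema: n[1], Version: n[2]}, nil
}

// String renders the identifier back into the dotted form.
func (id SchemaID) String() string {
	return fmt.Sprintf("%d.%d.%d", id.Namespace, id.Schema, id.Version)
}

func main() {
	id, err := ParseSchemaID("1.2.3")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("namespace:", id.Namespace, "schema:", id.Schema, "version:", id.Version)
}
```

Keeping the three parts as separate numeric fields makes it cheap to compare identifiers or to look up a schema by namespace first, which fits the idea of a registry with namespace support.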

How we arrived at the current stage:

During our research into alternative formats for storing a large volume of small data in a NATS JetStream, we examined various formats:

  • JSON + JSON Schema was our original idea. However, the overhead became rather significant when storing millions of messages. Ensuring the schema and JSON were in sync also required extra steps during implementation and testing. We believed there must be a more elegant solution, which led us to begin our search.
  • Gob was our first alternative, but it quickly became apparent that it actually increased data size when used with numerous individual messages and struct tags. It also required recompilation and lacked discoverability. Additionally, Go code is not inherently a schema. Parsing structs and struct tags to generate documents was quite cumbersome.
  • ProtoBuf necessitated recompilation and a considerable amount of additional tooling, and it generated excessive code. We previously used it alongside Twirp before deciding to employ NATS for messaging at the border too (see: https://github.com/oderwat/go-nats-app). Twirp inspired us to consider supporting JSON as an alternative at the endpoints.
  • CBOR, with its Go package fxamacker/cbor, looked promising and somewhat reduced data size, but not significantly enough. It also lacked a robust schema representation. However, it could be parsed without the schema, like JSON. While working on this, we realized that a shareable, simple text schema was what we needed.
  • BSON was briefly considered but quickly ruled out.

As we experimented with various implementations and formats, our desired features became increasingly clear.

  1. It should have a small storage size.
  2. We want a mandatory schema for documentation and discovery.
  3. It should be very easy to use and plug in, just like other marshallers.
  4. Debugging messages should be possible without recompiling the used tools.
  5. It should not hinder prototyping or the creation of quick tools.
  6. There should be a way to bypass it and revert to using JSON.
  7. It should be safe and performant.
  8. While an interpreted schema is beneficial, there should also be a way to generate specialized code for increased performance.

Disclaimer

This code and documentation are works in progress, and everything may change without further notice. We are sure there are bugs to fix and optimisations to make. This project utilizes the GPT-4 language model for generating some of its content.

MIT License / Copyright 2023 by METATEXX GmbH
