[Break change] Support 64bit length, add various types and typed containers #311

Open
wants to merge 3 commits into
base: master
@cmpute cmpute commented Jun 16, 2021

This is a proposal for a number of modifications to the current spec. Part 1 breaks backward compatibility (specifically, it deprecates the old fixext fields), but I think it's well worth it: it drastically improves space and time efficiency in certain applications. The main modifications are:

Top-level types (Part 1)

Add variable length fixext

fixext 1, 2, 4, 8 and 16 are unified into a more compact format, as proposed in #310. The benefits include:

  1. The new format frees 4 top-level type codes, which are used to add the new types below.
  2. It also saves bytes when the ext payload size is not a power of 2.
    For example, to store 3 bytes of ext data, ext 8 currently needs 0xc7 + 0x03 + type byte + 3-byte payload = 6 bytes. With the proposed format it needs only 5 bytes (about 16.7% less).
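To make the byte counting easy to check, here is a minimal sketch that sizes an ext value under the current spec (fixext for payloads of 1, 2, 4, 8 or 16 bytes, ext 8/16/32 otherwise). The function name `ext_size_current` is just for illustration and is not part of any msgpack library. The 5-byte figure for the proposed format is the one quoted in this PR; it is not derived from the #310 layout here.

```python
def ext_size_current(payload_len: int) -> int:
    """Total encoded size (bytes) of an ext value under the current spec."""
    if payload_len < 0:
        raise ValueError("payload length must be non-negative")
    if payload_len in (1, 2, 4, 8, 16):
        return 1 + 1 + payload_len        # fixext: format byte + type byte + data
    if payload_len < 2**8:
        return 1 + 1 + 1 + payload_len    # ext 8: 0xc7 + 1 length byte + type + data
    if payload_len < 2**16:
        return 1 + 2 + 1 + payload_len    # ext 16: 0xc8 + 2 length bytes + type + data
    if payload_len < 2**32:
        return 1 + 4 + 1 + payload_len    # ext 32: 0xc9 + 4 length bytes + type + data
    raise ValueError("payload too large for ext 32")


# Figure quoted in this PR for a 3-byte payload in the proposed format.
PROPOSED_3_BYTE_SIZE = 5

current = ext_size_current(3)
saving = 1 - PROPOSED_3_BYTE_SIZE / current
print(f"current: {current} bytes, proposed: {PROPOSED_3_BYTE_SIZE} bytes, saving: {saving:.1%}")
```

Padding the 3-byte payload to 4 bytes and using fixext 4 gives the same total as ext 8, so there is no cheaper encoding for it in the current spec.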

Add complex numbers

The most common complex types, 64-bit and 128-bit, are added.

Complex numbers are the only primitive type missing from the msgpack format. They are natively supported by most general-purpose and scientific programming languages. Adding complex numbers as top-level types will help serialize scientific data with the typed containers proposed below.
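As an illustration of what the two sizes mean, here is a small sketch of a complex payload: a 64-bit complex is a pair of 32-bit floats and a 128-bit complex is a pair of 64-bit floats. The byte order (big-endian, as used elsewhere in msgpack) and the real-then-imaginary ordering are assumptions made for this example, and the type code byte is left out on purpose. The helper names are hypothetical.

```python
import struct


def pack_complex(z: complex, bits: int = 128) -> bytes:
    """Pack only the payload of a complex number (no type code byte).

    ASSUMPTION: big-endian, real part first, then imaginary part.
    """
    fmt = {64: ">ff", 128: ">dd"}[bits]   # two float32 or two float64
    return struct.pack(fmt, z.real, z.imag)


def unpack_complex(payload: bytes) -> complex:
    """Inverse of pack_complex; the width is inferred from the payload size."""
    fmt = {8: ">ff", 16: ">dd"}[len(payload)]
    real, imag = struct.unpack(fmt, payload)
    return complex(real, imag)


z = 1.5 - 2.25j
print(len(pack_complex(z, 64)), len(pack_complex(z, 128)))
print(unpack_complex(pack_complex(z, 64)))
```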

Add bin 64, ext 64

This feature has been requested by many people (#214, #190, #268). 64-bit length support is added to bin and ext, which should cover most of these requests.

On modern computers, RAM is often larger than 4 GB (up to terabytes in data centers), so loading all the data into memory is very common. Chunking the data is inconvenient and costs performance when large data is stored. Moreover, the spec currently says nothing about how to chunk data in msgpack. With the 4 type codes freed by variable-length fixext, this can easily be added to the specification.

In my opinion, msgpack is simple and clean enough to be used for storing large data as well, which serves more needs than network communication alone.
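To show where the 4 GB limit comes from, here is a rough sketch of a bin header encoder. The bin 8/16/32 branches follow the current spec (0xc4, 0xc5, 0xc6). The 64-bit branch is hypothetical: `bin64_code` is a caller-supplied placeholder for whichever freed type code the spec would assign, and the layout (type code followed by a big-endian 64-bit length) is an assumption modeled on bin 32.

```python
import struct
from typing import Optional


def bin_header(n: int, bin64_code: Optional[int] = None) -> bytes:
    """Header bytes that precede an n-byte bin payload."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 2**8:
        return struct.pack(">BB", 0xC4, n)    # bin 8
    if n < 2**16:
        return struct.pack(">BH", 0xC5, n)    # bin 16
    if n < 2**32:
        return struct.pack(">BI", 0xC6, n)    # bin 32
    # The current spec stops here: nothing can describe 2**32 bytes or more.
    if bin64_code is None or n >= 2**64:
        raise ValueError("payload too large for the available bin formats")
    # HYPOTHETICAL bin 64: type code + 8-byte big-endian length.
    return struct.pack(">BQ", bin64_code, n)


print(bin_header(70000).hex())
```

With `bin64_code` left unset, a 4 GiB payload is rejected, which is the situation that forces chunking today.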

More ext types (Part 2)

Add bigint, bigfloat

This proposal is modified from #249, fixing #206 and #292. Only integer and floating-point numbers are added. Large decimal and fraction types are rarely needed, in my opinion. int 128, float 16 and float 128 are also covered by this format, at a cost of only 2 extra bytes thanks to the variable-length fixext.

Add UUID

UUIDs are widely used nowadays. Officially supporting UUID by assigning it an extension type seems like a good idea to me. This will fix #222 and #239.

With UUID, bigint and bigfloat supported, 4 ext types remain within the fixext capacity for future use.

Add typed containers

Motivated by #267 and #268, I added support for typed containers, specifically typed array, typed map and typed n-d array. Typed containers reduce the overhead of per-element type bytes and allow zero-copy access. The "structured array" proposed in #267 is not added, since it is a lot more complicated for parsers to implement than the formats proposed in this PR.

Note that the size of a container is not explicitly stored in the proposed format; it should be calculated as (payload size - overhead size) / (element size).
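A minimal sketch of that calculation, with a check that the payload divides evenly. The function name and the example numbers (1 byte of overhead, 4-byte elements) are assumptions for illustration only; the actual overhead depends on the container layout in this PR.

```python
def element_count(payload_size: int, overhead_size: int, element_size: int) -> int:
    """Number of elements in a typed container, derived from its payload size."""
    if element_size <= 0:
        raise ValueError("element size must be positive")
    body = payload_size - overhead_size
    if body < 0:
        raise ValueError("payload is smaller than the container overhead")
    count, remainder = divmod(body, element_size)
    if remainder:
        # A parser should reject this: the payload ends in a partial element.
        raise ValueError("payload does not hold a whole number of elements")
    return count


# Assumed example: 41-byte payload, 1 overhead byte, 4-byte elements.
print(element_count(41, 1, 4))
```

A parser that skips the remainder check would silently drop trailing bytes, so treating a non-zero remainder as malformed input seems the safer choice.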


This is a big proposal; comments, suggestions and modifications are welcome!

- Add variable length fixext
- Add complex numbers
- Add bin 64, ext 64
- Add bigint, bigfloat
- Add UUID
- Add typed containers
@cmpute
Author

cmpute commented Jun 16, 2021

A possible guideline for parsers to handle the backward compatibility:

Serialization

  • Serialize complex / bin 64 / ext 64: throw error for old version
  • Serialize fixext: force user to explicitly specify the msgpack version
  • Serialize other types: same as before

Deserialization

  • Deserialize complex / bin 64 / ext 64 / fixext: force user to explicitly specify the msgpack version
  • Deserialize other types: same as before
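The guideline above could be sketched as a pair of checks that a library runs before encoding or decoding a value. Everything here is hypothetical: the kind names, the version numbers (1 for the current spec, 2 for the proposed one) and the choice to treat an unspecified version as the old one when serializing the new types.

```python
from typing import Optional

NEW_ONLY = {"complex", "bin64", "ext64"}   # kinds that exist only in the proposed spec
OLD, NEW = 1, 2                            # hypothetical spec version numbers


def check_serialize(kind: str, spec_version: Optional[int] = None) -> None:
    """Raise if `kind` cannot be serialized under the requested spec version."""
    if kind in NEW_ONLY:
        # ASSUMPTION: an unspecified version means the old spec.
        if spec_version != NEW:
            raise TypeError(f"{kind} cannot be serialized with the old spec")
    elif kind == "fixext":
        if spec_version is None:
            raise ValueError("fixext layout differs by version; specify one explicitly")
    # All other kinds: same as before, no check needed.


def check_deserialize(kind: str, spec_version: Optional[int] = None) -> None:
    """Raise if `kind` is ambiguous to decode without an explicit version."""
    if (kind in NEW_ONLY or kind == "fixext") and spec_version is None:
        raise ValueError(f"specify the msgpack version explicitly to decode {kind}")


check_serialize("str")              # unaffected types need no version
check_serialize("fixext", OLD)      # fine once the version is explicit
check_deserialize("complex", NEW)
```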

@mincequi

This is really nice. UUIDs and complex types would greatly help for my purposes.
What I would also like to see is a float16 data type. For my use case, this would mean 40% less space.

@cmpute
Author

cmpute commented Nov 13, 2021

> This is really nice. UUIDs and complex types would greatly help for my purposes. What i would also like to see is a float16 data type. For my use case, this would mean 40% less space.

@mincequi I already included the spec for float16 in the bigfloat type, since ideally bigfloat can represent a floating-point number of any size. For a more efficient way to store float16, a modification to the array types could be useful.

@Saiv46

Saiv46 commented Mar 12, 2023

I think everything can be described as extensions. Other things (such as the deprecation of fixext) are harmful to backward compatibility (nobody wants to unexpectedly find out that libraries on two platforms implement different versions of the spec).

@benatkin

@Saiv46 no, extension types have the length defined in the same way

Development

Successfully merging this pull request may close these issues.

A special format for UUID
4 participants