Support configurable Zstd levels #81

Merged
merged 18 commits into from Nov 6, 2023
Conversation

Contributor

@jmacd jmacd commented Nov 3, 2023

Part of #35.
This is somewhat complex because gRPC supports only static registration of compression objects, while collectors naturally provide per-component configuration. Per-component compression levels are supported, with a caveat, as follows:

There are 10 distinct compression levels, 1 through 10. Any of them can be configured through any component, but only one global encoder configuration is permitted per level, and the decoder configuration is singular (to keep it simple; it could be per level as well, but I didn't go that way).

Updated README to explain the caveat.

Note the implementation in mru.go is derived from a production-tested implementation used internally. I've edited it for minor details, added a Reset method, and written new tests.

These configurations are independent of the OTel-collector-wide compression setting controlled by https://github.com/mostynb/go-grpc-compression. Unlike that library, the new support is thread-safe, and reconfiguration at runtime resets the MRU cache.

The names used by these compression settings are "zstdarrow1" through "zstdarrow10".

collector/exporter/otelarrowexporter/README.md
```yaml
level: 10                   # gRPC-level compression level (default 5)
payload_compression: zstd   # Arrow-IPC compression (default "none")
otelarrow/fastest:
  compression: zstd
```
Contributor

Random question - curious what would happen if compression (grpc level) was not zstd and payload_compression uses zstd? i.e. should these compression methods always match?

Contributor Author

These two settings are independent. The Arrow library exposes only a Zstd option, and it's OK to "double compress". It sounds funny, but we've seen double compression outperform either method by itself, so we are leaving the option open to experiment. However, now that Zstd levels are configurable, I believe the gRPC compression will start to be successful on its own; this deserves to be validated before we change defaults.

collector/compression/zstd/zstd.go
```go
defer staticInstances.lock.Unlock()

for level := MinLevel; level <= MaxLevel; level++ {
	updateOne(&staticInstances.byLevel[level].dec)
}
```
Contributor

Hmm, in `SetEncoderCfg(cfg EncoderConfig)` we only do `updateOne(&staticInstances.byLevel[cfg.Level].enc)`, but here for the decoder we are updating the config for all levels. Why update all levels for the decoder config?

Contributor Author

Fine question! I have gone both ways on this during development. Firstly, encoders have levels, but decoders only have limits.

As structured, the decoder level used in this code is an artifact of allowing the encoder to configure levels, because the name of the PAIR changes when the encoder level changes.

Given that the notion of a decoder level serves only to match decoder limits with encoder limits, it's worth asking whether that would be useful. Imagine configuring permissive limits for high compression levels and smaller limits for lower levels. This makes sense, but it might not protect the user in a hostile environment. The goal of these limits is to protect the user, so having one limit is just easier.

I originally coded multiple limits, then reasoned through what the user would see. The OTel-Arrow receiver component would be able to configure different limits for every level, so would have a slice of decoder configs. This would make it hard to simply configure a limit, because you'd have to configure them all.

The code could be restructured to not have 10 distinct decoders, or it could be extended back in the other direction, to allow distinct decoder configurations. For now, for simplicity, I chose this middle ground.

collector/exporter/otelarrowexporter/factory.go
pkg/otel/arrow_record/consumer.go

- `level`: in the range 1-10, determines a number of defaults (default 5)
- `window_size_mib`: size of the Zstd window in MiB; 0 means derive it from the level (default 0)
- `concurrency`: controls background CPU used for compression; 0 lets the `zstd` library decide (default 1)

The exporter supports configuring compression at the [Arrow
columnar-protocol
level](https://arrow.apache.org/docs/format/Columnar.html#format-ipc).

- `payload_compression`: compression applied at the Arrow IPC level, "none" by default, "zstd" supported.
Contributor

Hmm, I thought that we used zstd compression at the Arrow level by default.

Contributor Author

I have evidence that, once the Zstd level and window size are configured in the encoder, the payload compression is not actually helping very much. All of our earlier experiments with gRPC-level Zstd compression had inadvertently used a relatively small window size, which defeated the gRPC-level compression and made Arrow-level payload compression relatively important.

Coincidentally, I audited the Arrow code that sets up Zstd and recognized that it uses single-record encode and decode operations: no Zstd streams are built up, and there is no configurability. I wonder whether the Arrow architects have considered using Zstd streams within the Arrow IPC streams, but it seems not to be that way presently.

Contributor Author

Since #82 merged I will repeat my experiment that led to this conclusion.

Contributor

Sounds good.

Contributor Author

Confirming that for the 30GB uncompressed data sample you and I have been using, @lquerel, for a single stream lasting the entire input (with #82), single-zstd at the gRPC level outperforms double-zstd at various levels.

For the uncompressed size 2.9875869937e+10 bytes, the results:

| configuration | level | compressed bytes | ratio |
| --- | --- | --- | --- |
| single-zstd | 10 | 2.540397625e+09 | 0.085 |
| double-zstd | 10 | 2.663619163e+09 | 0.089 |
| single-zstd | 8 | 2.627058159e+09 | 0.088 |
| single-zstd | 5 | 2.701691178e+09 | 0.090 |
| double-zstd | 5 | 2.870220537e+09 | 0.096 |

@jmacd jmacd merged commit 5176e5b into open-telemetry:main Nov 6, 2023
2 checks passed
@jmacd jmacd deleted the jmacd/zstdopts branch November 6, 2023 23:22