
[Performance Idea] Granular Choice for Stored Fields Compression Algorithm #11605

Open

mgodwan opened this issue Dec 13, 2023 · 13 comments

Labels: enhancement (Enhancement or improvement to existing feature or request), Indexing:Performance, Indexing (Indexing, Bulk Indexing and anything related to indexing)

Comments

mgodwan (Member) commented Dec 13, 2023

Is your feature request related to a problem? Please describe.
Today, an operator can set the index.codec setting, which applies the configured compression technique (e.g. zstd, deflate, lz4). Whenever a segment is written, Lucene applies the configured compression algorithm to the stored fields.

For use cases where the customer updates or retrieves recently ingested documents, it may be better not to compress, since we need to retrieve the stored _source field, which must be decompressed and may require additional CPU usage.

Once the segments are merged into larger segments, and the documents are no longer frequently accessed, we can compress to take advantage of the reduced storage size and to lower write amplification due to the smaller size.

The lever can be based on different parameters such as FieldInfo, segment size, etc., and can be incorporated into the merge policy or a per-field codec configuration, rather than being based on temporal locality alone.
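To make the "lever" concrete, here is a minimal, hypothetical sketch (names and thresholds are mine, not OpenSearch or Lucene API) of a policy that picks a stored-fields compression mode from segment size:

```java
// Hypothetical sketch: choose a stored-fields compression mode per segment.
// CompressionMode, CompressionPolicy, and the 64 MB threshold are illustrative
// assumptions, not actual OpenSearch/Lucene types or defaults.
enum CompressionMode { NONE, BEST_ZSTD }

class CompressionPolicy {
    private final long sizeThresholdBytes; // tunable, e.g. 64 MB

    CompressionPolicy(long sizeThresholdBytes) {
        this.sizeThresholdBytes = sizeThresholdBytes;
    }

    // Small (recent) segments skip compression to keep retrieval cheap;
    // large (merged) segments get storage-optimizing compression.
    CompressionMode forSegment(long segmentSizeBytes) {
        return segmentSizeBytes < sizeThresholdBytes
                ? CompressionMode.NONE
                : CompressionMode.BEST_ZSTD;
    }

    public static void main(String[] args) {
        CompressionPolicy policy = new CompressionPolicy(64L * 1024 * 1024);
        System.out.println(policy.forSegment(8L * 1024 * 1024));   // small segment
        System.out.println(policy.forSegment(128L * 1024 * 1024)); // large segment
    }
}
```

A real integration would plug such a decision into the codec's stored-fields format or the merge policy; the sketch only captures the size-based branch.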

Describe the solution you'd like
This is a rough idea; I'm looking for feedback on whether it is worth exploring.

Additional context

I was analyzing the performance of an update benchmark workload and observed that recently ingested documents were being retrieved for updates, spending CPU cycles in decompression.

[Screenshot: profiling results, 2023-12-13]

@mgodwan mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 13, 2023
mgodwan (Member Author) commented Dec 13, 2023

@shwetathareja @backslasht @msfroh Looking forward to your thoughts on this.

@mgodwan mgodwan added Indexing Indexing, Bulk Indexing and anything related to indexing Indexing:Performance and removed untriaged labels Dec 13, 2023
backslasht (Contributor):

Thanks @mgodwan for the proposal.

For use cases where the customer updates or retrieves recently ingested documents, it may be better not to compress, since we need to retrieve the stored _source field, which must be decompressed and may require additional CPU usage.

Trying to understand the use case better: what would be a real-world scenario for this?

mgodwan (Member Author) commented Dec 13, 2023

What would be a real world scenario for this?

One example I can think of is eCommerce order/payment transaction data, where clients start by creating a document, and as the lifecycle of the order/transaction changes (e.g. the payment gets authorized/captured), updates are applied to the same document within a few seconds.

shwetathareja (Member):

Once the segments are merged into larger segments, and the documents are no longer frequently accessed, we can compress to take advantage of the reduced storage size and to lower write amplification due to the smaller size.

Merging is a continuous process. Along with immediate-update use cases (which are done within a very short period, say seconds, after document creation), this should also benefit the merging of newly created segments, as they go through multiple merge cycles and would save on decompression/compression overhead.
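The "multiple merge cycles" point can be made concrete with a back-of-envelope sketch. Assuming a simple tiered merge scheme with a fixed merge factor (an illustrative model, not OpenSearch's actual merge policy), every merge generation rewrites, i.e. decompresses and recompresses, each stored document once:

```java
// Illustrative model: with a fixed merge factor, count how many merge
// generations (and thus recompression passes per document) are needed to
// reduce an initial set of flushed segments down to a single segment.
class MergeMath {
    static int recompressionPasses(int mergeFactor, int flushedSegments) {
        int passes = 0;
        int segments = flushedSegments;
        while (segments >= mergeFactor) {
            // one merge generation: every mergeFactor segments become one
            segments = (int) Math.ceil((double) segments / mergeFactor);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // 1000 flushed segments, merge factor 10 -> 3 generations,
        // so each document is decompressed/recompressed 3 times.
        System.out.println(recompressionPasses(10, 1000)); // 3
    }
}
```

Skipping compression for the small early-tier segments would avoid most of these passes, since the bulk of merge generations happen while segments are still small.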

backslasht (Contributor):

Along with immediate-update use cases (which are done within a very short period, say seconds, after document creation), this should also benefit the merging of newly created segments, as they go through multiple merge cycles and would save on decompression/compression overhead.

@shwetathareja - Interesting thought. Are you suggesting to do compression only when the segment under creation is of size greater than X bytes?

@mgodwan - is this only applicable to zstd, since its decompression is comparatively slower than LZ4's?

Trying to gauge the advantage it provides against the complexity it adds to the decision logic of whether to compress (and/or to compress using a faster algorithm).

mgodwan (Member Author) commented Jan 2, 2024

Are you suggesting to do compression only when the segment under creation is of size greater than X bytes?

@backslasht That was the intended solution in this proposal: over time, older small segments are merged into larger ones. The solution can be to either skip compression or use a faster compression for smaller/recent segments, and use a storage-optimizing compression for older/larger segments.

is this only applicable for zstd as the decompression is comparatively slower when compared to LZ4?

I'll need to measure the trade-off between disk latency/IOPS and CPU usage, but characteristically speaking, we will need to analyse how this behaves for all compression algorithms. LZ4 usually takes less CPU for compression/decompression, but the overall indexing throughput observed with it is lower as well.
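The speed-versus-ratio trade-off can be demonstrated with the JDK's built-in deflate codec standing in for the LZ4/zstd pair (an illustrative stand-in; the actual algorithms and numbers in OpenSearch differ): a fast compression level trades some ratio for CPU, a high level does the opposite.

```java
import java.util.zip.Deflater;

// Illustrative demo of the speed/ratio trade-off using java.util.zip's
// deflate levels as a stand-in for faster (LZ4-like) vs. storage-optimizing
// (zstd-like) codecs. Sample document is a made-up JSON payload.
public class CompressionTradeoff {
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] out = new byte[data.length * 2 + 64];
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(out, n, out.length - n);
        }
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] doc = "{\"order_id\": 42, \"status\": \"captured\"}"
                .repeat(1000).getBytes();
        int fast = compressedSize(doc, Deflater.BEST_SPEED);
        int best = compressedSize(doc, Deflater.BEST_COMPRESSION);
        System.out.println("input=" + doc.length + " bytes, BEST_SPEED="
                + fast + " bytes, BEST_COMPRESSION=" + best + " bytes");
    }
}
```

On repetitive data like this, both levels compress heavily; the interesting part of the trade-off in practice is the CPU time spent per call, which is what the proposal aims to avoid on hot, recently ingested segments.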

backslasht (Contributor):

Thanks @mgodwan for the explanation.

I agree this idea is worth exploring. I would lean towards the size of the segment (instead of the recency of the document) to decide on the compression logic.

sarthakaggarwal97 (Contributor):

Coming here from #13110. Sharing some numbers around the experiments.

Benchmarks:

Workload: NYC Taxis
We tested three hybrid-compression size thresholds: 16mb, 32mb and 64mb.
The results below are for 64mb (which looked best among them).

| Operation | Codec | Refresh Interval | Disk Throughput (Configured) | Throughput | Latency | Write IOPS |
|---|---|---|---|---|---|---|
| Index | default | 1s | 593 mb/s | 4.50% | 5.50% | -11% |
| Index | default | 30s | 593 mb/s | 4.20% | 5% | -220% |
| Index | default | 30s | 250 mb/s | 6% | 14% | -220% |

| Operation | Codec | Refresh Interval | Disk Throughput (Configured) | Throughput | Latency | Read IOPS |
|---|---|---|---|---|---|---|
| Update | default | 1s | 593 mb/s | 3.50% | 5.50% | -12% |
| Update | default | 30s | 593 mb/s | 6% | 8.50% | -300% |
| Update | default | 30s | 250 mb/s | 15% | 16% | -300% |

Note: +ve means improvement, -ve means degradation from the current behaviour.

Variance in Storage during the indexing of NYC Taxis Workload

There are steeper dips in disk storage, but it quickly recovers as segments reach the 64 mb size threshold.

Hybrid Compression:
[Chart: disk storage variance with hybrid compression, 2024-03-31]

Default Compression:
[Chart: disk storage variance with default compression, 2024-03-31]

Benchmarking Setup:

  1. 3 Dedicated Master Nodes - r6x.xlarge
  2. 1 Data Node - r6x.2xlarge
  3. EBS Configuration:
    1. Storage: 1000gb
    2. IOPS: 3000
    3. Throughput: 593 mb/s and 250 mb/s

backslasht (Contributor):

Thanks @sarthakaggarwal97 for the experiments. It is interesting to see the impact on read and write IOPS when compression is disabled for smaller segments. A couple of follow-up questions:

  1. What is the impact on search performance with this change?
  2. Would the numbers look different if the benchmark is performed on NVMe-based storage?

sarthakaggarwal97 (Contributor):

Sharing some benchmarks on NVMe-based storage.

Summary:

  • 4.7% improvement in throughput, 5.5% improvement in latency for append-only
  • 2-3% improvement in update workloads

NYC Taxis: Append Only Workload

[Charts: append-only workload results, 2024-04-10]

NYC Taxis: Update Workload

[Charts: update workload results, 2024-04-10]

I did not observe significant deviation in disk operations and writes with these instances across different thresholds; disk writes and ops were only slightly higher for hybrid merges.

Benchmarking Setup:

  1. 3 Dedicated Master Nodes - i3x.xlarge
  2. Data Node - i3.2xlarge instance

sarthakaggarwal97 (Contributor):

Using a benchmarking setup similar to the earlier runs, I employed a custom workload to trigger indexing and search simultaneously.

Sharing the benchmarks around it:

[Charts: concurrent index/search benchmark results, 2024-05-08]

We are introducing hybrid compression for stored fields, and most of the queries available across different workloads (nyc_taxis, http_logs, etc.) do not directly read the fdt files. Moreover, search would be more performant for recently indexed data, since we save the decompression compute. For data that was already indexed and compressed during merges, we did not see any regression.

sarthakaggarwal97 (Contributor):

Explored introducing hybrid compression during flushes, as is done for merges.

Broadly, a stored-fields flush can be triggered for two reasons: by external factors such as a refresh or a translog flush, or internally by the stored fields writer when the chunk-size and number-of-documents thresholds are met.

Here, whenever the flush was triggered by internal factors, the data was not compressed, and the chunk-size threshold was increased from the present 8kb to a few mbs. If the flush happened due to external factors, we checked whether data had already been written to the segment's fdt file; if yes, we kept the compression type already chosen (codec compression or no compression), else we compressed using the codec as the default.
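The flush-time decision above can be sketched as a small decision function (a simplification with hypothetical names; this is not the actual POC code, which lives in the linked changes):

```java
// Sketch of the flush-time compression decision described above.
// FlushTrigger and FlushCompressionDecider are illustrative names.
enum FlushTrigger { INTERNAL, EXTERNAL }

class FlushCompressionDecider {
    // trigger: what caused the stored-fields flush.
    // segmentHasUncompressedData: whether earlier chunks of this segment
    // were already written to the fdt file without compression.
    static boolean shouldCompress(FlushTrigger trigger,
                                  boolean segmentHasUncompressedData) {
        if (trigger == FlushTrigger.INTERNAL) {
            // chunk-size / doc-count thresholds hit: defer compression
            // to merge time, write the chunk uncompressed
            return false;
        }
        // external flush (refresh, translog flush): stay consistent with
        // whatever format was already written to this segment's fdt file,
        // otherwise compress with the codec by default
        return !segmentHasUncompressedData;
    }

    public static void main(String[] args) {
        System.out.println(shouldCompress(FlushTrigger.INTERNAL, false));
        System.out.println(shouldCompress(FlushTrigger.EXTERNAL, false));
    }
}
```

The consistency check matters because a segment's stored-fields file must be readable with one format: mixing compressed and uncompressed chunks arbitrarily would require per-chunk format metadata.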

Sharing the POC implementation for Hybrid Flushes and Merges:

  1. OpenSearch Code Changes
  2. Lucene Code Changes

sarthakaggarwal97 (Contributor):

Sharing some numbers and analysis around the benchmarks with hybrid compression enabled in both flushes and merges.

[Charts: benchmark results with hybrid compression in flushes and merges, 2024-04-29]

It is visible that we see a significant regression when we enable hybrid compression in flushes alongside merges. Let's dive into why that is happening.

The left run is without hybrid compression and the right one is with hybrid compression.

  1. The increase in JVM heap usage is expected, since we're storing more documents in the buffer, but the increase is significant.
    [Chart: JVM heap usage comparison, 2024-05-06]

This increase leads to twice the number of YoungGC collections, alongside an almost 100% increase in GC latencies.

  2. CPU is higher than the baseline for the runs.
    [Chart: CPU utilization comparison, 2024-05-06]

While comparing profiles, I found that there are additional merges going on during the hybrid compression runs, which would definitely affect tail latencies.

[Chart: merge activity comparison, 2024-05-06]

  3. Effects on disks:
  • The disk queue looks almost the same, so enough disk is available to process the incoming traffic.
    [Chart: disk queue depth, 2024-05-06]

  • There is an increase in write IOPS and disk write latency, owing to writing more data, since we are not compressing in some cases.
    [Chart: write IOPS, 2024-05-06]

[Chart: disk write latency, 2024-05-06]

Benchmarking Setup:

  1. 3 Dedicated Master Nodes - r6x.xlarge
  2. 1 Data Node - r6x.2xlarge
  3. Index Refresh Interval: 1s
  4. EBS Configuration:
    1. Storage: 1000gb
    2. IOPS: 3000
    3. Throughput: 593 mb/s and 250 mb/s
