
[Performance Idea] Granular Choice for Stored Fields Compression Algorithm #11605

Open

mgodwan opened this issue Dec 13, 2023 · 13 comments

Labels: enhancement (Enhancement or improvement to existing feature or request), Indexing:Performance, Indexing (Indexing, Bulk Indexing and anything related to indexing)

Comments

mgodwan (Member) commented Dec 13, 2023

Is your feature request related to a problem? Please describe.
Today, an operator can set the index.codec setting, which applies the configured compression technique (e.g. zstd, deflate, lz4). Whenever a segment is written, Lucene applies the configured compression algorithm to the stored fields.

For use cases where the customer updates or retrieves recently ingested documents, it may be better not to compress, since we need to retrieve the stored _source field, which must be decompressed and may require additional CPU usage.

Once the segments are merged into larger segments, and the documents are no longer frequently accessed, we can compress to take advantage of the reduced storage size and to lower write amplification due to the smaller size.

The lever can be based on different parameters such as FieldInfo, segment size, etc., and can be incorporated into the merge policy or a per-field codec configuration, rather than being based on temporal locality alone.
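To make the "lever" concrete, here is a minimal, hypothetical sketch (names and thresholds are mine, not OpenSearch or Lucene API) of a policy that picks a stored-fields compression mode from segment size:

```java
// Hypothetical sketch: choose a stored-fields compression mode per segment.
// CompressionMode, CompressionPolicy, and the 64 MB threshold are illustrative
// assumptions, not actual OpenSearch/Lucene types or defaults.
enum CompressionMode { NONE, BEST_ZSTD }

class CompressionPolicy {
    private final long sizeThresholdBytes; // tunable, e.g. 64 MB

    CompressionPolicy(long sizeThresholdBytes) {
        this.sizeThresholdBytes = sizeThresholdBytes;
    }

    // Small (recent) segments skip compression to keep retrieval cheap;
    // large (merged) segments get storage-optimizing compression.
    CompressionMode forSegment(long segmentSizeBytes) {
        return segmentSizeBytes < sizeThresholdBytes
                ? CompressionMode.NONE
                : CompressionMode.BEST_ZSTD;
    }

    public static void main(String[] args) {
        CompressionPolicy policy = new CompressionPolicy(64L * 1024 * 1024);
        System.out.println(policy.forSegment(8L * 1024 * 1024));   // small segment
        System.out.println(policy.forSegment(128L * 1024 * 1024)); // large segment
    }
}
```

A real integration would plug such a decision into the codec's stored-fields format or the merge policy; the sketch only captures the size-based branch.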

Describe the solution you'd like
This is a rough idea; I'm looking for feedback on whether it is worth exploring.

Additional context

I was analyzing the performance of an update benchmark workload and observed that recently ingested documents were being retrieved for updates, spending CPU cycles in decompression.

[Screenshot: profiling results, 2023-12-13]

@mgodwan mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 13, 2023
mgodwan (Member Author) commented Dec 13, 2023

@shwetathareja @backslasht @msfroh Looking forward to your thoughts on this.

@mgodwan mgodwan added Indexing Indexing, Bulk Indexing and anything related to indexing Indexing:Performance and removed untriaged labels Dec 13, 2023
backslasht (Contributor):

Thanks @mgodwan for the proposal.

For use cases where the customer updates or retrieves recently ingested documents, it may be better not to compress, since we need to retrieve the stored _source field, which must be decompressed and may require additional CPU usage.

Trying to understand the use case better: what would be a real-world scenario for this?

mgodwan (Member Author) commented Dec 13, 2023

What would be a real world scenario for this?

One example I can think of is eCommerce order/payment transaction data, where clients start by creating a document, and as the lifecycle of the order/transaction changes (e.g. the payment gets authorized/captured), updates are applied to the same document within a few seconds.

shwetathareja (Member):

Once the segments are merged into larger segments, and the documents are no longer frequently accessed, we can compress to take advantage of the reduced storage size and to lower write amplification due to the smaller size.

Merging is a continuous process. Along with immediate-update use cases (which are done within a very short period, say seconds, after document creation), this should also benefit the merging of newly created segments, as they go through multiple merge cycles and would save on decompression/compression overhead.
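The "multiple merge cycles" point can be made concrete with a back-of-envelope sketch. Assuming a simple tiered merge scheme with a fixed merge factor (an illustrative model, not OpenSearch's actual merge policy), every merge generation rewrites, i.e. decompresses and recompresses, each stored document once:

```java
// Illustrative model: with a fixed merge factor, count how many merge
// generations (and thus recompression passes per document) are needed to
// reduce an initial set of flushed segments down to a single segment.
class MergeMath {
    static int recompressionPasses(int mergeFactor, int flushedSegments) {
        int passes = 0;
        int segments = flushedSegments;
        while (segments >= mergeFactor) {
            // one merge generation: every mergeFactor segments become one
            segments = (int) Math.ceil((double) segments / mergeFactor);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // 1000 flushed segments, merge factor 10 -> 3 generations,
        // so each document is decompressed/recompressed 3 times.
        System.out.println(recompressionPasses(10, 1000)); // 3
    }
}
```

Skipping compression for the small early-tier segments would avoid most of these passes, since the bulk of merge generations happen while segments are still small.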

backslasht (Contributor):

Along with immediate-update use cases (which are done within a very short period, say seconds, after document creation), this should also benefit the merging of newly created segments, as they go through multiple merge cycles and would save on decompression/compression overhead.

@shwetathareja - Interesting thought. Are you suggesting to do compression only when the segment under creation is of size greater than X bytes?

@mgodwan - is this only applicable to zstd, since its decompression is comparatively slower than LZ4's?

Trying to gauge the advantage it provides against the complexity it adds to the decision logic of whether to compress (and/or to compress using a faster algorithm).

mgodwan (Member Author) commented Jan 2, 2024

Are you suggesting to do compression only when the segment under creation is of size greater than X bytes?

@backslasht That was the intended solution in this proposal: over time, older small segments are merged into larger ones. The solution can be to either skip compression or use a faster compression for smaller/recent segments, and use a storage-optimizing compression for older/larger segments.

is this only applicable for zstd as the decompression is comparatively slower when compared to LZ4?

I'll need to measure the trade-off between disk latency/IOPS and CPU usage, but characteristically speaking, we will need to analyse how this behaves for all compression algorithms. LZ4 usually takes less CPU for compression/decompression, but the overall indexing throughput observed with it is lower as well.
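The speed-versus-ratio trade-off can be demonstrated with the JDK's built-in deflate codec standing in for the LZ4/zstd pair (an illustrative stand-in; the actual algorithms and numbers in OpenSearch differ): a fast compression level trades some ratio for CPU, a high level does the opposite.

```java
import java.util.zip.Deflater;

// Illustrative demo of the speed/ratio trade-off using java.util.zip's
// deflate levels as a stand-in for faster (LZ4-like) vs. storage-optimizing
// (zstd-like) codecs. Sample document is a made-up JSON payload.
public class CompressionTradeoff {
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] out = new byte[data.length * 2 + 64];
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(out, n, out.length - n);
        }
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] doc = "{\"order_id\": 42, \"status\": \"captured\"}"
                .repeat(1000).getBytes();
        int fast = compressedSize(doc, Deflater.BEST_SPEED);
        int best = compressedSize(doc, Deflater.BEST_COMPRESSION);
        System.out.println("input=" + doc.length + " bytes, BEST_SPEED="
                + fast + " bytes, BEST_COMPRESSION=" + best + " bytes");
    }
}
```

On repetitive data like this, both levels compress heavily; the interesting part of the trade-off in practice is the CPU time spent per call, which is what the proposal aims to avoid on hot, recently ingested segments.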

backslasht (Contributor):

Thanks @mgodwan for the explanation.

I agree this idea is worth exploring. I would lean towards the size of the segment (instead of the recency of the document) to decide on the compression logic.

sarthakaggarwal97 (Contributor):

Coming here from #13110. Sharing some numbers around the experiments.

Benchmarks:

Workload: NYC Taxis
We tested three hybrid-compression size thresholds: 16mb, 32mb and 64mb.
The results below are for 64mb (which looked best among them).

| Operation | Codec | Refresh Interval | Disk Throughput (Configured) | Throughput | Latency | Write IOPS |
|---|---|---|---|---|---|---|
| Index | default | 1s | 593 mb/s | 4.50% | 5.50% | -11% |
| Index | default | 30s | 593 mb/s | 4.20% | 5% | -220% |
| Index | default | 30s | 250 mb/s | 6% | 14% | -220% |

| Operation | Codec | Refresh Interval | Disk Throughput (Configured) | Throughput | Latency | Read IOPS |
|---|---|---|---|---|---|---|
| Update | default | 1s | 593 mb/s | 3.50% | 5.50% | -12% |
| Update | default | 30s | 593 mb/s | 6% | 8.50% | -300% |
| Update | default | 30s | 250 mb/s | 15% | 16% | -300% |

Note: +ve means improvement, -ve means degradation from the current behaviour.

Variance in Storage during the indexing of NYC Taxis Workload

There are steeper dips in disk storage, but it quickly recovers as segments reach the 64 mb size threshold.

Hybrid Compression:
[Chart: disk storage variance with hybrid compression, 2024-03-31]

Default Compression:
[Chart: disk storage variance with default compression, 2024-03-31]

Benchmarking Setup:

  1. 3 Dedicated Master Nodes - r6x.xlarge
  2. 1 Data Node - r6x.2xlarge
  3. EBS Configuration:
    1. Storage: 1000gb
    2. IOPS: 3000
    3. Throughput: 593 mb/s and 250 mb/s

backslasht (Contributor):

Thanks @sarthakaggarwal97 for the experiments. It is interesting to see the impact on read and write IOPS when compression is disabled for smaller segments. A couple of follow-up questions:

  1. What is the impact on search performance with this change?
  2. Would the numbers look different if the benchmark is performed on NVMe-based storage?

sarthakaggarwal97 (Contributor):

Sharing some benchmarks on NVMe-based storage.

Summary:

  • 4.7% improvement in throughput, 5.5% improvement in latency for append-only
  • 2-3% improvement in update workloads

NYC Taxis: Append Only Workload

[Charts: append-only workload results, 2024-04-10]

NYC Taxis: Update Workload

[Charts: update workload results, 2024-04-10]

I did not observe significant deviation in disk operations and writes with these instances across different thresholds; disk writes and ops were only slightly higher for hybrid merges.

Benchmarking Setup:

  1. 3 Dedicated Master Nodes - i3x.xlarge
  2. Data Node - i3.2xlarge instance

sarthakaggarwal97 (Contributor):

Using a benchmarking setup similar to the earlier runs, I employed a custom workload to trigger indexing and search simultaneously.

Sharing the benchmarks around it:

[Charts: concurrent index/search benchmark results, 2024-05-08]

We are introducing hybrid compression for stored fields, and most of the queries available across different workloads (nyc_taxis, http_logs, etc.) do not directly read the fdt files. Moreover, search would be more performant for recently indexed data, since we save the decompression compute. For data that was already indexed and compressed during merges, we did not see any regression.

sarthakaggarwal97 (Contributor):

Explored introducing hybrid compression during flushes, as is done for merges.

Broadly, a stored-fields flush can be triggered for two reasons: by external factors such as a refresh or a translog flush, or internally by the stored fields writer when the chunk-size and number-of-documents thresholds are met.

Here, whenever the flush was triggered by internal factors, the data was not compressed, and the chunk-size threshold was increased from the present 8kb to a few mbs. If the flush happened due to external factors, we checked whether data had already been written to the segment's fdt file; if yes, we kept the compression type already chosen (codec compression or no compression), else we compressed using the codec as the default.
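The flush-time decision above can be sketched as a small decision function (a simplification with hypothetical names; this is not the actual POC code, which lives in the linked changes):

```java
// Sketch of the flush-time compression decision described above.
// FlushTrigger and FlushCompressionDecider are illustrative names.
enum FlushTrigger { INTERNAL, EXTERNAL }

class FlushCompressionDecider {
    // trigger: what caused the stored-fields flush.
    // segmentHasUncompressedData: whether earlier chunks of this segment
    // were already written to the fdt file without compression.
    static boolean shouldCompress(FlushTrigger trigger,
                                  boolean segmentHasUncompressedData) {
        if (trigger == FlushTrigger.INTERNAL) {
            // chunk-size / doc-count thresholds hit: defer compression
            // to merge time, write the chunk uncompressed
            return false;
        }
        // external flush (refresh, translog flush): stay consistent with
        // whatever format was already written to this segment's fdt file,
        // otherwise compress with the codec by default
        return !segmentHasUncompressedData;
    }

    public static void main(String[] args) {
        System.out.println(shouldCompress(FlushTrigger.INTERNAL, false));
        System.out.println(shouldCompress(FlushTrigger.EXTERNAL, false));
    }
}
```

The consistency check matters because a segment's stored-fields file must be readable with one format: mixing compressed and uncompressed chunks arbitrarily would require per-chunk format metadata.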

Sharing the POC implementation for Hybrid Flushes and Merges:

  1. OpenSearch Code Changes
  2. Lucene Code Changes

sarthakaggarwal97 (Contributor):

Sharing some numbers and analysis around the benchmarks with hybrid compression enabled in both flushes and merges.

[Charts: benchmark results with hybrid compression in flushes and merges, 2024-04-29]

It is visible that we see a significant regression when we enable hybrid compression in flushes alongside merges. Let's dive into why that is happening.

The left run is without hybrid compression and the right one is with hybrid compression.

  1. The increase in JVM heap usage is expected, since we're storing more documents in the buffer, but the increase is significant.
    [Chart: JVM heap usage comparison, 2024-05-06]

This increase leads to twice the number of YoungGC collections, alongside an almost 100% increase in GC latencies.

  2. CPU is higher than the baseline for the runs.
    [Chart: CPU utilization comparison, 2024-05-06]

While comparing profiles, I found that there are additional merges going on during the hybrid compression runs, which would definitely affect tail latencies.

[Chart: merge activity comparison, 2024-05-06]

  3. Effects on disks:
  • The disk queue looks almost the same, so enough disk is available to process the incoming traffic.
    [Chart: disk queue depth, 2024-05-06]

  • There is an increase in write IOPS and disk write latency, owing to writing more data, since we are not compressing in some cases.
    [Chart: write IOPS, 2024-05-06]

[Chart: disk write latency, 2024-05-06]

Benchmarking Setup:

  1. 3 Dedicated Master Nodes - r6x.xlarge
  2. 1 Data Node - r6x.2xlarge
  3. Index Refresh Interval: 1s
  4. EBS Configuration:
    1. Storage: 1000gb
    2. IOPS: 3000
    3. Throughput: 593 mb/s and 250 mb/s
