
Timestamp ordering isn't maintained the way corpus is segmented into chunks among indexing clients #365

Open
rishabhmaurya opened this issue Aug 21, 2023 · 8 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

rishabhmaurya commented Aug 21, 2023

Is your feature request related to a problem? Please describe.

The corpus is divided among the indexing clients at a fixed chunk boundary (typically 50,000 documents); each client begins ingesting from its own offset. From the perspective of OpenSearch's ingestion sequence, this randomizes the chronological order of timestamps. For time-dependent workloads, that divergence from real-world ingestion order is not representative.

As an illustration, consider the structure of http_logs data:

➜  opensearch-benchmark git:(main) ✗ cat ~/.benchmark/benchmarks/data/http_logs/documents-181998.json.offset
50000;6643195
100000;13279424
150000;19946710
200000;26630234
250000;33311767
300000;39990235
350000;46672840

Describe the solution you'd like
A more realistic workload simulation that preserves timestamp order more accurately when ingestion is distributed among multiple indexing clients. This is not straightforward, since enforcing inter-client ordering could slow down ingestion. One option is to split and precompute chunks in a way that, when run concurrently, keeps the timestamp order as close to sorted as possible. Open to comments and suggestions here.
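One way the precomputed-chunks idea could look (a hypothetical sketch, not the current implementation; the chunk size and client count below are made-up values): stripe small chunks across clients round-robin, so all clients advance through the corpus roughly in lockstep and concurrently indexed documents differ by at most about one chunk's worth of timestamps.

```python
# Hypothetical sketch, not opensearch-benchmark code: round-robin striping
# of small chunks across clients to approximate global timestamp order.

def striped_assignment(total_docs, num_clients, chunk_size):
    """Return, per client, the list of (start, end) document ranges it ingests."""
    ranges = [[] for _ in range(num_clients)]
    for i, start in enumerate(range(0, total_docs, chunk_size)):
        end = min(start + chunk_size, total_docs)
        # Chunk i goes to client i % num_clients, so clients interleave
        # through the corpus instead of each taking one contiguous slice.
        ranges[i % num_clients].append((start, end))
    return ranges

ranges = striped_assignment(400_000, 8, 5_000)
# Client 0 processes chunks starting at 0, 40_000, 80_000, ...;
# client 1 processes chunks starting at 5_000, 45_000, 85_000, ...
```

The trade-off is that smaller chunks track the timestamp order more closely but increase coordination and seek overhead, which is why the ordering can only be approximate without slowing ingestion.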

Describe alternatives you've considered
None. I'm currently generating such a dataset manually for my local testing.

Additional context
This is in relation to change of merge policy from Tiered to LogByteSize for time dependent workloads, more details - opensearch-project/OpenSearch#9241 (comment)


rishabhmaurya commented Aug 22, 2023

This is how the timestamp distribution looks when 8 clients index the http_logs logs-181998 index concurrently:

[chart: timestamp vs. ingestion order for http_logs with 8 concurrent indexing clients]

The difference between the timestamp processed by the first and the last client is quite significant.


itiyama commented Aug 23, 2023

@rishabhmaurya Wouldn't this issue be introduced from Opensearch server which has more threads indexing in parallel on a shard, say for a machine with 32 cores?


rishabhmaurya commented Aug 23, 2023

@rishabhmaurya Wouldn't this issue be introduced from Opensearch server which has more threads indexing in parallel on a shard, say for a machine with 32 cores?

I'm not sure I get your point. This shouldn't have any relation to the server side at all. We want the client behavior to be as close as possible to the real-world scenario, where logs emitted by different clients at the same time shouldn't have significant differences in their timestamps.

@rishabhmaurya

@bbarani @rishabh6788 @gkamat @IanHoang what do you guys think of adding this capability? This is blocking the testing of opensearch-project/OpenSearch#9241.

For now, in my custom setup, I have created 6 benchmark clients on different machines (each with bulk_indexing_clients:1), and they all process the same log file in order. This creates 6x more documents, but they are all ingested in timestamp order.
In order to finally merge the PR, I would like the official benchmark to support the capability described in this issue.

@jordarlu

@gkamat @IanHoang, please help take a look and comment as well. Thanks a lot!


gkamat commented Aug 29, 2023

There is no straightforward way in the short term to achieve the end result desired, apart from using a single client as you are currently doing.

To ingest in a strictly monotonically increasing order with multiple clients requires all of them to process only one document at a time, before going back to the coordinator process to identify the next document in the sequence they should process. Essentially, the offset file would reduce to one document per chunk, which makes it redundant. This mode of operation would be very inefficient.

Furthermore, the chunk size is currently hardcoded to 50K documents. The mechanism described above could be implemented at some point, but it would not be a promising approach.
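To make the cost concrete, a strictly monotonic global order effectively forces per-document coordination like the following (illustrative sketch only, not opensearch-benchmark code): every client must go through a shared, locked cursor to claim the next document, so claims are serialized and the effective chunk size is one.

```python
# Illustrative sketch of why strict global ordering serializes ingestion
# (not opensearch-benchmark code). All clients funnel through one lock.

import threading

class Coordinator:
    def __init__(self, total_docs):
        self._next = 0
        self._total = total_docs
        self._lock = threading.Lock()

    def claim_next(self):
        # Each call hands out exactly one document index, in corpus order.
        # The lock means only one client can claim at a time, which is the
        # bottleneck that makes this approach so inefficient.
        with self._lock:
            if self._next >= self._total:
                return None
            doc = self._next
            self._next += 1
            return doc
```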


bbarani commented Aug 31, 2023

@gkamat @rishabhmaurya Can we look in to https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/README.md#traffic-replayer?

rishabh6788 added the "good first issue" label on Sep 13, 2023
@rishabhmaurya

I'm not sure how the traffic replayer would help here.
