
Timestamp ordering isn't maintained the way corpus is segmented into chunks among indexing clients #365

Open
rishabhmaurya opened this issue Aug 21, 2023 · 8 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

rishabhmaurya commented Aug 21, 2023

Is your feature request related to a problem? Please describe.

The corpus is divided among the indexing clients at a fixed chunk boundary (typically 50,000 documents); each client begins ingesting from its own offset. From the perspective of OpenSearch's ingestion sequence, this randomizes the chronological order of timestamps. For time-dependent workloads, that divergence from real-world ingestion order is not representative.

As an illustration, consider the structure of http_logs data:

➜  opensearch-benchmark git:(main) ✗ cat ~/.benchmark/benchmarks/data/http_logs/documents-181998.json.offset
50000;6643195
100000;13279424
150000;19946710
200000;26630234
250000;33311767
300000;39990235
350000;46672840

Describe the solution you'd like
A more realistic workload simulation that preserves timestamp order more accurately when ingestion is distributed among multiple indexing clients. This is not straightforward, since enforcing inter-client ordering could slow down ingestion. One option is to split and precompute chunks in a way that, when run concurrently, keeps the timestamp order as close to sorted as possible. Open to comments and suggestions here.
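One way the precomputed-chunks idea could look (a hypothetical sketch, not the current implementation; the chunk size and client count below are made-up values): stripe small chunks across clients round-robin, so all clients advance through the corpus roughly in lockstep and concurrently indexed documents differ by at most about one chunk's worth of timestamps.

```python
# Hypothetical sketch, not opensearch-benchmark code: round-robin striping
# of small chunks across clients to approximate global timestamp order.

def striped_assignment(total_docs, num_clients, chunk_size):
    """Return, per client, the list of (start, end) document ranges it ingests."""
    ranges = [[] for _ in range(num_clients)]
    for i, start in enumerate(range(0, total_docs, chunk_size)):
        end = min(start + chunk_size, total_docs)
        # Chunk i goes to client i % num_clients, so clients interleave
        # through the corpus instead of each taking one contiguous slice.
        ranges[i % num_clients].append((start, end))
    return ranges

ranges = striped_assignment(400_000, 8, 5_000)
# Client 0 processes chunks starting at 0, 40_000, 80_000, ...;
# client 1 processes chunks starting at 5_000, 45_000, 85_000, ...
```

The trade-off is that smaller chunks track the timestamp order more closely but increase coordination and seek overhead, which is why the ordering can only be approximate without slowing ingestion.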

Describe alternatives you've considered
None. I'm currently generating such a dataset manually for my local testing.

Additional context
This is in relation to change of merge policy from Tiered to LogByteSize for time dependent workloads, more details - opensearch-project/OpenSearch#9241 (comment)


rishabhmaurya commented Aug 22, 2023

This is how the timestamp distribution looks when 8 clients index the http_logs logs-181998 index concurrently:

[chart: timestamp vs. ingestion order for http_logs with 8 concurrent indexing clients]

The difference between the timestamp processed by the first and the last client is quite significant.


itiyama commented Aug 23, 2023

@rishabhmaurya Wouldn't this issue be introduced from Opensearch server which has more threads indexing in parallel on a shard, say for a machine with 32 cores?


rishabhmaurya commented Aug 23, 2023

@rishabhmaurya Wouldn't this issue be introduced from Opensearch server which has more threads indexing in parallel on a shard, say for a machine with 32 cores?

I'm not sure I get your point. This shouldn't have any relation to the server side at all. We want the client behavior to be as close as possible to the real-world scenario, where logs emitted by different clients at the same time shouldn't have significant differences in their timestamps.

@rishabhmaurya

@bbarani @rishabh6788 @gkamat @IanHoang what do you guys think of adding this capability? This is blocking the testing of opensearch-project/OpenSearch#9241.

For now, in my custom setup, I have created 6 benchmark clients on different machines (each with bulk_indexing_clients:1), and they all process the same log file in order. This creates 6x more documents, but they are all ingested in timestamp order.
In order to finally merge the PR, I would like the official benchmark to support the capability described in this issue.

@jordarlu

@gkamat @IanHoang, please help take a look and comment as well. Thanks a lot!


gkamat commented Aug 29, 2023

There is no straightforward way in the short term to achieve the end result desired, apart from using a single client as you are currently doing.

To ingest in a strictly monotonically increasing order with multiple clients requires all of them to process only one document at a time, before going back to the coordinator process to identify the next document in the sequence they should process. Essentially, the offset file would reduce to one document per chunk, which makes it redundant. This mode of operation would be very inefficient.

Furthermore, the chunk size is currently hardcoded to 50K documents. The mechanism described above could be implemented at some point, but it would not be a promising approach.
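To make the cost concrete, a strictly monotonic global order effectively forces per-document coordination like the following (illustrative sketch only, not opensearch-benchmark code): every client must go through a shared, locked cursor to claim the next document, so claims are serialized and the effective chunk size is one.

```python
# Illustrative sketch of why strict global ordering serializes ingestion
# (not opensearch-benchmark code). All clients funnel through one lock.

import threading

class Coordinator:
    def __init__(self, total_docs):
        self._next = 0
        self._total = total_docs
        self._lock = threading.Lock()

    def claim_next(self):
        # Each call hands out exactly one document index, in corpus order.
        # The lock means only one client can claim at a time, which is the
        # bottleneck that makes this approach so inefficient.
        with self._lock:
            if self._next >= self._total:
                return None
            doc = self._next
            self._next += 1
            return doc
```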


bbarani commented Aug 31, 2023

@gkamat @rishabhmaurya Can we look in to https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/README.md#traffic-replayer?

rishabh6788 added the "good first issue" label on Sep 13, 2023
@rishabhmaurya

I'm not sure how the traffic replayer would help here.
