Timestamp ordering isn't maintained the way corpus is segmented into chunks among indexing clients #365
Comments
@rishabhmaurya Wouldn't this issue also be introduced on the OpenSearch server side, which has multiple threads indexing in parallel on a shard, say on a machine with 32 cores?
I'm not sure I get your point. This shouldn't have any relation to the server side at all. We want the client behavior to be as close as possible to the real-world scenario, where logs emitted by different clients at the same time shouldn't have significant differences in their timestamps.
@bbarani @rishabh6788 @gkamat @IanHoang what do you think of adding this capability? It is blocking the testing of opensearch-project/OpenSearch#9241. For now, in my custom setup, I have created 6 benchmark clients on different machines (with
There is no straightforward way in the short term to achieve the desired end result, apart from using a single client as you are currently doing. Ingesting in a strictly monotonically increasing order with multiple clients requires all of them to process only one document at a time before returning to the coordinator process to identify the next document in the sequence. Essentially, the offset file would reduce to one document per chunk, which makes it redundant. This mode of operation would be very inefficient. Furthermore, the chunk size is currently hardcoded to 50K documents. The mechanism described above could be implemented at some point, but it would not be a promising approach.
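To illustrate why strict ordering defeats parallelism, here is a minimal sketch (hypothetical, not the benchmark's actual code): a coordinator hands out one document index at a time, and the indexing call itself has to sit inside the critical section, so the clients are effectively serialized.

```python
import threading

class Coordinator:
    """Hands out exactly one document at a time; chunk size collapses to 1."""

    def __init__(self, total_docs):
        self._next = 0
        self._total = total_docs
        self._lock = threading.Lock()

    def ingest_next(self, index_fn):
        """Index one document under the lock; returns False when exhausted."""
        with self._lock:
            if self._next >= self._total:
                return False
            index_fn(self._next)  # stand-in for the real bulk-index call
            self._next += 1
            return True

ingested = []
coord = Coordinator(12)

def client():
    while coord.ingest_next(ingested.append):
        pass

threads = [threading.Thread(target=client) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Order is perfectly monotonic, but only one client ever indexes at a time.
print(ingested == list(range(12)))
```

The lock guarantees global timestamp order, but it also means the three clients provide no more throughput than one, which is the inefficiency described above.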
I'm not sure how the traffic replayer would help here.
Is your feature request related to a problem? Please describe.
The corpus is allocated across distinct indexing clients at predefined offsets (typically 50,000 documents apart), from which each client begins ingesting documents. This introduces randomness into the chronological order of timestamps as seen from OpenSearch's ingestion sequence. For time-dependent workloads, this divergence from the natural ingestion order is not representative of real-world scenarios.
As an illustration, consider the structure of http_logs data:
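The sample document appears to have been dropped from this page. For reference, a representative http_logs document looks like the following (field names follow the standard http_logs workload schema; the values here are illustrative). The corpus file is sorted by `@timestamp`, so splitting it into large contiguous per-client chunks hands each client a disjoint time range:

```python
import json

# Illustrative http_logs document; @timestamp is epoch seconds.
doc = json.loads(
    '{"@timestamp": 893964617, "clientip": "40.135.0.0",'
    ' "request": "GET /images/hm_bg.jpg HTTP/1.0",'
    ' "status": 200, "size": 24736}'
)
print(doc["@timestamp"])
```

With 50K-document chunks, client 2's very first document carries a timestamp far ahead of client 1's, so the server sees the two time ranges interleaved out of order.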
Describe the solution you'd like
More realistic workload simulation, with more accurate timestamp ordering when ingestion is distributed among multiple indexing clients. This is not straightforward, as enforcing inter-client ordering could slow down ingestion. One option is to split and precompute chunks in such a way that, when ingested concurrently, they maintain timestamp order as closely as possible. Open to comments and suggestions here.
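One possible shape for that precomputation (a sketch of the idea, not the project's implementation): instead of one large contiguous block per client, assign small slices of the time-sorted corpus round-robin across clients. Concurrent clients then read from nearby positions, so the merged ingestion order stays close to timestamp order. The `max_disorder` helper below (hypothetical) measures the worst positional gap between documents ingested in the same round:

```python
def contiguous_offsets(total, clients):
    """Current behavior: each client starts at one large contiguous offset."""
    per = total // clients
    return [list(range(c * per, (c + 1) * per)) for c in range(clients)]

def interleaved_offsets(total, clients, slice_size):
    """Sketch: client c gets slices c, c + clients, c + 2*clients, ..."""
    out = [[] for _ in range(clients)]
    for s in range(0, total, slice_size):
        out[(s // slice_size) % clients].extend(
            range(s, min(s + slice_size, total)))
    return out

def max_disorder(assignments):
    """Worst gap between corpus positions ingested in the same round."""
    rounds = max(len(a) for a in assignments)
    worst = 0
    for r in range(rounds):
        batch = [a[r] for a in assignments if r < len(a)]
        worst = max(worst, max(batch) - min(batch))
    return worst

print(max_disorder(contiguous_offsets(600, 3)))       # 400: far out of order
print(max_disorder(interleaved_offsets(600, 3, 10)))  # 20: nearly in order
```

Smaller slices tighten the ordering at the cost of more seeks in the corpus file, so the slice size would be a tuning knob between ordering fidelity and ingestion throughput.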
Describe alternatives you've considered
None. I'm currently generating such a dataset manually for my local testing.
Additional context
This relates to changing the merge policy from Tiered to LogByteSize for time-dependent workloads; more details in opensearch-project/OpenSearch#9241 (comment).