Another TPCH fix: Wait for all shards#5106
Conversation
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
📝 WalkthroughSummary by CodeRabbit
WalkthroughTest utilities now always send an index JSON body with Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
|
nvm, this hangs because it waits forever for replicas on a single node. But I can work around that... |
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
(cherry picked from commit 8073b4e) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
| // loadIndex() could directly return when isIndexExist()=true, | ||
| // e.g. the index is created in the cluster but data hasn't been flushed. | ||
| // We block loadIndex() until data loaded to resolve | ||
| // https://github.com/opensearch-project/sql/issues/4261 | ||
| int countDown = 3; // 1500ms timeout | ||
| while (countDown != 0 && getDocCount(client, indexName) == 0) { | ||
| try { | ||
| Thread.sleep(500); | ||
| countDown--; | ||
| } catch (InterruptedException e) { | ||
| throw new IOException(e); | ||
| } | ||
| } |
There was a problem hiding this comment.
for safety, I think we can keep this logic to mitigate the impacts.
There was a problem hiding this comment.
This logic also caused build fleet testing times to go from 2 hours to 5 hours (which I think means it was broken in general), we should avoid hardcoded sleep durations anyway if we need to have some sort of check. At worst maybe a retry policy directly on the TPCH ITs.
Description
TPCH is flaky again in CI (#5103 / #4261), but I haven't seen it failing in
mainfor a while and I see we have the workaround that waits for docs to be reported.I think the situation is that the CI server is running a multi-node integ test, while our ITs normally are single-node. For multiple nodes, it may be the case that one node reports documents while the other doesn't. On that hypothesis, this change forces the tests to wait for all shards to accept new documents.
I tried just
wait_for_active_shards=allbut this causes integ tests to hang on single-node setups because there's nowhere to put the replica. So in addition I added a helper that sets the replica count to 0 on integ test indices, which should fix both the waiting issue and the above routing issue.Related Issues
Resolves #5103
Resolves #4261 (I hope)
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.