Another TPCH fix: Wait for all shards by Swiddis · Pull Request #5106 · opensearch-project/sql

Swiddis · 2026-02-03T17:52:03Z

Description

TPCH is flaky again in CI (#5103 / #4261), but I haven't seen it failing in main for a while and I see we have the workaround that waits for docs to be reported.

I think the situation is that the CI server is running a multi-node integ test, while our ITs normally are single-node. For multiple nodes, it may be the case that one node reports documents while the other doesn't. On that hypothesis, this change forces the tests to wait for all shards to accept new documents.

I first tried only wait_for_active_shards=all, but this causes integ tests to hang on single-node setups because there is nowhere to allocate the replica. So in addition I added a helper that sets the replica count to 0 on integ test indices, which should fix both the hang and the routing issue described above.
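
As a rough illustration of the replica helper described above, the sketch below forces `settings.index.number_of_replicas` to 0 on an index-creation body. It is not the code from this PR: the class name `ReplicaSettings`, the method `withZeroReplicas`, and the use of plain nested `Map`s instead of the project's JSON type are all assumptions made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not the actual TestUtils code from this PR.
final class ReplicaSettings {
    private ReplicaSettings() {}

    // Returns a copy of the index-creation body with
    // settings.index.number_of_replicas forced to 0.
    // A null body is treated as an empty one.
    static Map<String, Object> withZeroReplicas(Map<String, Object> indexBody) {
        Map<String, Object> body = new HashMap<>();
        if (indexBody != null) {
            body.putAll(indexBody);
        }
        Map<String, Object> settings = copyChild(body, "settings");
        Map<String, Object> index = copyChild(settings, "index");
        index.put("number_of_replicas", 0);
        settings.put("index", index);
        body.put("settings", settings);
        return body;
    }

    // Copies the nested object stored under key, or returns a new empty map
    // when the key is missing or does not hold an object.
    @SuppressWarnings("unchecked")
    private static Map<String, Object> copyChild(Map<String, Object> parent, String key) {
        Map<String, Object> copy = new HashMap<>();
        Object child = parent.get(key);
        if (child instanceof Map) {
            copy.putAll((Map<String, Object>) child);
        }
        return copy;
    }
}
```

With no replicas configured, every shard copy is a primary, so a request that waits for all active shards has nothing left to wait for on a single-node cluster.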

Related Issues

Resolves #5103
Resolves #4261 (I hope)

Check List

New functionality includes testing.
New functionality has been documented.
New functionality has javadoc added.
New functionality has a user manual doc added.
New PPL command checklist all confirmed.
API changes companion pull request created.
Commits are signed per the DCO using --signoff or -s.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>

coderabbitai · 2026-02-03T17:52:22Z

📝 Walkthrough

Summary by CodeRabbit

Tests
- Updated test index creation to set replicas to zero
- Modified bulk data insertion refresh behavior for improved synchronization
- Streamlined test setup by removing unnecessary wait conditions

Walkthrough

Test utilities now always send an index JSON body with settings.index.number_of_replicas = 0 when creating indices. Bulk loads use refresh=wait_for&wait_for_active_shards=all. The integration test's wait-for-data loop after index creation was removed, so the load function returns immediately when the index already exists.

Changes

Cohort / File(s)	Summary
Test utilities: index creation & bulk load `integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java`	Always build/send an index JSON body when creating an index; added private helper to set `settings.index.number_of_replicas = 0`. Changed bulk insert URL from `?refresh=true` to `?refresh=wait_for&wait_for_active_shards=all`.
Integration test flow: remove post-create wait `integ-test/src/test/java/org/opensearch/sql/legacy/SQLIntegTestCase.java`	Removed the post-index-creation polling loop that waited for documents (and its InterruptedException handling); the function now returns immediately after loading the index if it exists.
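
The bulk URL change in the table above can be sketched as follows. Only the two query parameters come from the PR summary; the class name `BulkEndpoint`, the method `bulkPath`, and the `/<index>/_bulk` path layout are assumptions for illustration, and the real TestUtils code may build the URL differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not the actual TestUtils code from this PR.
final class BulkEndpoint {
    private BulkEndpoint() {}

    // Builds a bulk path whose request returns only after a refresh has made
    // the documents searchable (refresh=wait_for) and which requires every
    // shard copy to be active before indexing (wait_for_active_shards=all).
    static String bulkPath(String indexName) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("refresh", "wait_for");
        params.put("wait_for_active_shards", "all");
        String query = params.entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining("&"));
        return "/" + indexName + "/_bulk?" + query;
    }
}
```

For example, `BulkEndpoint.bulkPath("my_index")` yields `/my_index/_bulk?refresh=wait_for&wait_for_active_shards=all`, where `my_index` is a placeholder name.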

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Another TPCH fix: Wait for all shards' accurately reflects the main change: modifying shard synchronization behavior in integration tests to address TPCH flakiness.
Description check	✅ Passed	The description clearly explains the problem (TPCH flakiness in CI), the hypothesis (multi-node vs single-node environments), and the implemented solutions (waiting for all shards and setting replicas to 0).
Linked Issues check	✅ Passed	The changes directly address the root causes identified in `#5103` and `#4261`: waiting for all shards to prevent timing issues in multi-node setups and setting replicas to 0 to avoid hangs on single-node test environments.
Out of Scope Changes check	✅ Passed	All changes are focused on fixing TPCH integration test reliability: modifying bulk insert refresh behavior, adding replica count helper, and removing unnecessary wait-for-docs loop, all directly scoped to the linked issues.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Swiddis · 2026-02-03T19:13:59Z

nvm, this hangs because it waits forever for replicas on a single node. But I can work around that...

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>

(cherry picked from commit 8073b4e) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

(cherry picked from commit 8073b4e) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

LantaoJin · 2026-02-05T03:31:05Z

integ-test/src/test/java/org/opensearch/sql/legacy/SQLIntegTestCase.java

-    // loadIndex() could directly return when isIndexExist()=true,
-    // e.g. the index is created in the cluster but data hasn't been flushed.
-    // We block loadIndex() until data loaded to resolve
-    // https://github.com/opensearch-project/sql/issues/4261
-    int countDown = 3; // 1500ms timeout
-    while (countDown != 0 && getDocCount(client, indexName) == 0) {
-      try {
-        Thread.sleep(500);
-        countDown--;
-      } catch (InterruptedException e) {
-        throw new IOException(e);
-      }
-    }


For safety, I think we can keep this logic to mitigate the impact.

Swiddis · Feb 5, 2026

This logic also caused build fleet testing times to go from 2 hours to 5 hours (which I think means it was broken in general). We should avoid hardcoded sleep durations anyway if we need some sort of check. At worst, maybe add a retry policy directly on the TPCH ITs.
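
A retry policy of the kind suggested here could look roughly like the sketch below. It is only an illustration, assuming a hypothetical `RetryPolicy.withRetries` helper that is not part of this PR or of the test framework: it reruns a test action up to a fixed number of attempts and rethrows the last failure, without sleeping between attempts.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of this PR.
final class RetryPolicy {
    private RetryPolicy() {}

    // Runs the action until it succeeds or maxAttempts is reached.
    // Both exceptions and assertion failures count as a failed attempt;
    // the last failure is rethrown when all attempts are exhausted.
    static <T> T withRetries(int maxAttempts, Callable<T> action) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Throwable last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception | AssertionError e) {
                last = e;
            }
        }
        if (last instanceof Error) {
            throw (Error) last;
        }
        throw (Exception) last;
    }
}
```

A test body would be wrapped as `RetryPolicy.withRetries(3, () -> runQuery())`, where `runQuery` stands in for whatever the test executes. Unlike the removed polling loop, this only costs extra time when an attempt actually fails.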

Another TCPH fix: Wait for all shards

c73e8eb

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>

Swiddis added the skip-changelog label Feb 3, 2026

RyanL1997 previously approved these changes Feb 3, 2026

Swiddis added the backport 3.5 label Feb 3, 2026

Swiddis enabled auto-merge (squash) February 3, 2026 18:17

Set replicas to 0 for integ test mappings by default

2357a4f

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>

Swiddis dismissed RyanL1997’s stale review via 2357a4f February 3, 2026 19:34

Add more ctx to a comment

410a638

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>

RyanL1997 approved these changes Feb 3, 2026

mengweieric approved these changes Feb 3, 2026

Swiddis merged commit 8073b4e into opensearch-project:main Feb 3, 2026
37 checks passed

opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 3, 2026

Another TCPH fix: Wait for all shards (#5106)

932bd32

(cherry picked from commit 8073b4e) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

opensearch-trigger-bot bot mentioned this pull request Feb 3, 2026

[Backport 3.5] Another TPCH fix: Wait for all shards #5108

Merged

RyanL1997 added the flaky-test Flaky build or test issue label Feb 3, 2026

LantaoJin reviewed Feb 5, 2026

Swiddis merged 3 commits into opensearch-project:main from Swiddis:bugfix/tpch-flaky
