Skip to content

compile: allow local notify streams for shuffle dispatch#24159

Merged
mergify[bot] merged 3 commits into
matrixorigin:mainfrom
LeftHandCold:fix/send-notify-self-connection-main
Apr 20, 2026
Merged

compile: allow local notify streams for shuffle dispatch#24159
mergify[bot] merged 3 commits into
matrixorigin:mainfrom
LeftHandCold:fix/send-notify-self-connection-main

Conversation

@LeftHandCold
Copy link
Copy Markdown
Contributor

@LeftHandCold LeftHandCold commented Apr 20, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

fixes #24158

What this PR does / why we need it:

This fixes the self-connection bug in the compile/remoterun notification path.

Scope.RemoteRun already handles the local-scope case with ipAddrMatch, but Scope.sendNotifyMessage still creates a pipeline RPC stream for every RemoteReceivRegInfo.FromAddr. When one dispatch operator is scheduled on the same CN as the coordinator, fromAddr equals the local service address. The old pipelineClient.NewStream rejected that loopback connection with remote run pipeline in local, so the prepare-done notification path could not be established for the local dispatch operator.

That can break remote dispatch/shuffle coordination for queries such as secondary-index UPDATE plans using shuffle join with serial_full(...) and surface as failure or hang.

This PR keeps the existing RemoteRun local-execution guard intact, but allows the notification stream itself to connect to the local CN address. That is the smallest fix that restores the existing dispatch notification protocol without refactoring local/remote receiver handling.

Also included:

  • a unit test that pins the real regression point in pipelineClient.NewStream

Validation

  • go test ./pkg/cnservice/cnclient ./pkg/sql/compile -count=1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a compile/remoterun shuffle-dispatch coordination bug where sendNotifyMessage could fail/hang when a dispatch operator is scheduled on the same CN as the coordinator by allowing the pipeline notification RPC stream to connect to the local CN address.

Changes:

  • Remove the self-connection rejection in pipelineClient.NewStream, allowing loopback/local backend streams.
  • Add a focused unit test ensuring local backend addresses are forwarded to the underlying morpc client.
  • Add a root-cause write-up documenting the failure mode and why the fix is safe.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
pkg/cnservice/cnclient/client.go Removes the local-backend guard so notification streams can connect to the local CN address.
pkg/cnservice/cnclient/client_test.go Adds a regression test verifying NewStream permits local backend connections and passes lock=true.
doc/issue_24158_send_notify_self_connection.md Documents root cause, failure mechanism, and rationale for the narrow fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 20, 2026

Merge Queue Status

  • Entered queue2026-04-20 10:05 UTC · Rule: main
  • Checks skipped · PR is already up-to-date
  • Merged2026-04-20 10:05 UTC · at fb9c5868863dc762cd4776ef17974427fb980586 · squash

This pull request spent 16 seconds in the queue, including 1 second running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage

@mergify mergify Bot merged commit 3864540 into matrixorigin:main Apr 20, 2026
43 of 45 checks passed
@LeftHandCold LeftHandCold deleted the fix/send-notify-self-connection-main branch April 20, 2026 10:46
fengttt pushed a commit to fengttt/matrixone that referenced this pull request May 7, 2026
…n#24159)

This fixes the self-connection bug in the compile/remoterun notification path.

`Scope.RemoteRun` already handles the local-scope case with `ipAddrMatch`, but `Scope.sendNotifyMessage` still creates a pipeline RPC stream for every `RemoteReceivRegInfo.FromAddr`. When one dispatch operator is scheduled on the same CN as the coordinator, `fromAddr` equals the local service address. The old `pipelineClient.NewStream` rejected that loopback connection with `remote run pipeline in local`, so the prepare-done notification path could not be established for the local dispatch operator.

That can break remote dispatch/shuffle coordination for queries such as secondary-index UPDATE plans using shuffle join with `serial_full(...)` and surface as failure or hang.

This PR keeps the existing `RemoteRun` local-execution guard intact, but allows the notification stream itself to connect to the local CN address. That is the smallest fix that restores the existing dispatch notification protocol without refactoring local/remote receiver handling.

Also included:
- a unit test that pins the real regression point in `pipelineClient.NewStream`

Approved by: @XuPeng-SH
LeftHandCold added a commit to LeftHandCold/matrixone that referenced this pull request Jun 4, 2026
…n#24159)

This fixes the self-connection bug in the compile/remoterun notification path.

`Scope.RemoteRun` already handles the local-scope case with `ipAddrMatch`, but `Scope.sendNotifyMessage` still creates a pipeline RPC stream for every `RemoteReceivRegInfo.FromAddr`. When one dispatch operator is scheduled on the same CN as the coordinator, `fromAddr` equals the local service address. The old `pipelineClient.NewStream` rejected that loopback connection with `remote run pipeline in local`, so the prepare-done notification path could not be established for the local dispatch operator.

That can break remote dispatch/shuffle coordination for queries such as secondary-index UPDATE plans using shuffle join with `serial_full(...)` and surface as failure or hang.

This PR keeps the existing `RemoteRun` local-execution guard intact, but allows the notification stream itself to connect to the local CN address. That is the smallest fix that restores the existing dispatch notification protocol without refactoring local/remote receiver handling.

Also included:
- a unit test that pins the real regression point in `pipelineClient.NewStream`

Approved by: @XuPeng-SH
mergify Bot pushed a commit that referenced this pull request Jun 7, 2026
…4854)

This fixes the self-connection bug in the compile/remoterun notification path.

`Scope.RemoteRun` already handles the local-scope case with `ipAddrMatch`, but `Scope.sendNotifyMessage` still creates a pipeline RPC stream for every `RemoteReceivRegInfo.FromAddr`. When one dispatch operator is scheduled on the same CN as the coordinator, `fromAddr` equals the local service address. The old `pipelineClient.NewStream` rejected that loopback connection with `remote run pipeline in local`, so the prepare-done notification path could not be established for the local dispatch operator.

That can break remote dispatch/shuffle coordination for queries such as secondary-index UPDATE plans using shuffle join with `serial_full(...)` and surface as failure or hang.

This PR keeps the existing `RemoteRun` local-execution guard intact, but allows the notification stream itself to connect to the local CN address. That is the smallest fix that restores the existing dispatch notification protocol without refactoring local/remote receiver handling.

Also included:
- a unit test that pins the real regression point in `pipelineClient.NewStream`

Approved by: @XuPeng-SH
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/S Denotes a PR that changes [10,99] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: sendNotifyMessage fails with self-connection when shuffle dispatch operator is on local CN

4 participants