Fix goroutine leak in SSH DialContext#3882

Open
Elvand-Lie wants to merge 2 commits into
knative:mainfrom
Elvand-Lie:fix-ssh-dialer-leak-3881

Conversation

@Elvand-Lie
Contributor

@Elvand-Lie Elvand-Lie commented Jun 6, 2026

Changes

Fixes #3881

Why is this needed?

The previous DialContext implementation spawned a background goroutine that watched ctx.Done() and force-closed the connection when the context expired. This had two problems:

  1. Goroutine leak: If the caller closed the connection manually while the context was still alive (e.g., context.Background()), the goroutine blocked forever on <-ctx.Done(), leaking one goroutine per connection.

  2. Violated net.Dialer.DialContext semantics: The Go docs state: "Once successfully connected, any expiration of the context will not affect the connection." The old code did the opposite: it killed established connections when the context expired. The k8s dialer in this same repo already documents this contract correctly (pkg/k8s/dialer.go L80-84).
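
The first problem can be reproduced with the standard library alone. The sketch below is not the code from pkg/ssh: leakyWatch and countLeaked are hypothetical names, and net.Pipe stands in for an SSH channel. It assumes nothing else starts or stops goroutines while it runs.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"runtime"
	"time"
)

// leakyWatch mimics the old pattern (hypothetical helper): one goroutine per
// connection that blocks on ctx.Done() and then closes the connection.
func leakyWatch(ctx context.Context, conn net.Conn) {
	go func() {
		<-ctx.Done() // never fires for context.Background()
		conn.Close()
	}()
}

// countLeaked opens and closes n in-memory connections under a context that
// is never cancelled and reports how many extra goroutines are still alive.
func countLeaked(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		client, server := net.Pipe()
		leakyWatch(context.Background(), client)
		// The caller closes the connection itself; the watcher never notices.
		client.Close()
		server.Close()
	}
	// Give any goroutine that could exit a chance to do so.
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines still blocked:", countLeaked(10))
}
```

Each watcher stays parked on the context's Done channel, so the count grows with the number of connections even though every connection has been closed.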

The fix

The golang.org/x/crypto/ssh library already provides Client.DialContext which handles context-cancellable dialing correctly:

  • Context cancellation during the dial phase → returns error
  • Context expiration after connection is established → does not affect the connection

So the fix simply delegates to d.sshClient.DialContext(ctx, d.network, d.addr) instead of calling d.Dial() + spawning a broken monitoring goroutine. The connWithDone wrapper and sync import are removed as dead code.
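
The contract the fix relies on can be observed with the standard library's net.Dialer, used here as a stand-in for the SSH client (an assumption made so the sketch runs without golang.org/x/crypto/ssh). echoAfterCancel is a hypothetical helper: it dials a local echo server, cancels the context after the dial has succeeded, and then checks that the connection is still usable.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
)

// echoAfterCancel dials a local listener with a cancellable context, cancels
// the context once connected, and then round-trips msg over the connection.
func echoAfterCancel(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// Echo server: accept one connection and copy its bytes back.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		io.Copy(c, c)
	}()

	ctx, cancel := context.WithCancel(context.Background())
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", ln.Addr().String())
	if err != nil {
		cancel()
		return "", err
	}
	defer conn.Close()

	// The context ends only after the dial has completed.
	cancel()

	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := echoAfterCancel("ping")
	fmt.Println(got, err)
}
```

A dialer that follows this contract needs no per-connection goroutine watching the context, which is why delegating removes the leak as well.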

Testing performed

  • go test ./pkg/ssh/... passes.
  • The change is a net deletion of 30 lines: it removes incorrect behavior and adds no new logic.
Fixed a goroutine leak and incorrect context handling in the SSH dialer. DialContext now correctly delegates to ssh.Client.DialContext, respecting standard Go context semantics.
NONE

@knative-prow knative-prow Bot added the kind/bug Bugs label Jun 6, 2026
@knative-prow knative-prow Bot requested review from dsimansk and jrangelramos June 6, 2026 08:29
@knative-prow

knative-prow Bot commented Jun 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Elvand-Lie
Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow Bot added size/S 🤖 PR changes 10-29 lines, ignoring generated files. needs-ok-to-test 🤖 Needs an org member to approve testing labels Jun 6, 2026
@knative-prow

knative-prow Bot commented Jun 6, 2026

Hi @Elvand-Lie. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov

codecov Bot commented Jun 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.63%. Comparing base (a6ba689) to head (1e0117a).

❗ There is a different number of reports uploaded between BASE (a6ba689) and HEAD (1e0117a). Click for more details.

HEAD has 6 uploads less than BASE
| Flag | BASE (a6ba689) | HEAD (1e0117a) |
|---|---|---|
| e2e | 3 | 2 |
| e2e rust | 1 | 0 |
| e2e go | 1 | 0 |
| e2e python | 1 | 0 |
| e2e quarkus | 1 | 0 |
| integration | 1 | 0 |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3882      +/-   ##
==========================================
- Coverage   54.03%   48.63%   -5.41%     
==========================================
  Files         200      200              
  Lines       23567    23558       -9     
==========================================
- Hits        12734    11457    -1277     
- Misses       9604    11062    +1458     
+ Partials     1229     1039     -190     

| Flag | Coverage Δ | |
|---|---|---|
| e2e | 21.63% <0.00%> (-11.86%) | ⬇️ |
| e2e go | ? | |
| e2e node | 25.74% <0.00%> (+0.01%) | ⬆️ |
| e2e python | ? | |
| e2e quarkus | ? | |
| e2e rust | ? | |
| e2e springboot | 24.01% <0.00%> (+<0.01%) | ⬆️ |
| e2e typescript | 25.85% <0.00%> (+0.01%) | ⬆️ |
| e2e-config-ci | 26.96% <0.00%> (+0.01%) | ⬆️ |
| integration | ? | |
| unit macos-14 | 42.89% <100.00%> (-0.04%) | ⬇️ |
| unit macos-latest | 42.89% <100.00%> (-0.04%) | ⬇️ |
| unit ubuntu-24.04-arm | 43.21% <100.00%> (-0.04%) | ⬇️ |
| unit ubuntu-latest | 43.75% <100.00%> (-0.04%) | ⬇️ |
| unit windows-latest | 42.95% <100.00%> (-0.04%) | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

@matejvasek
Contributor

Looking at the code, I am not sure whether even the old version is correct:

https://pkg.go.dev/net#Dialer.DialContext

The provided Context must be non-nil. If the context expires before the connection is complete, an error is returned. Once successfully connected, any expiration of the context will not affect the connection.

I guess we should simply ignore the ctx and spawn no goroutine at all.

Contributor

Copilot AI left a comment

Pull request overview

Fixes a goroutine leak in pkg/ssh’s DialContext by ensuring the background context-monitoring goroutine can also exit when the returned connection is manually closed (not only when the context is cancelled). This aligns the implementation with the lifecycle patterns described in Issue #3881 for long-lived contexts (e.g., context.Background()).

Changes:

  • Introduced a connWithDone wrapper that signals a done channel on Close().
  • Updated the monitoring goroutine to select on both ctx.Done() and the connection’s done signal.
  • Returned the wrapped connection from DialContext to ensure caller-initiated Close() triggers goroutine cleanup.
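
The approach described in this overview might look roughly like the sketch below. The names connWithDone and done appear in the PR; the sync.Once field, the wrapAndWatch helper, and the exited channel are assumptions added so that the goroutine's exit can be observed, and net.Pipe stands in for an SSH channel.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

// connWithDone wraps a net.Conn and signals done when Close is called.
type connWithDone struct {
	net.Conn
	done chan struct{}
	once sync.Once // assumption: guards against closing done twice
}

func (c *connWithDone) Close() error {
	c.once.Do(func() { close(c.done) })
	return c.Conn.Close()
}

// wrapAndWatch (hypothetical) starts the monitoring goroutine and returns the
// wrapped connection plus a channel that is closed when the goroutine exits.
func wrapAndWatch(ctx context.Context, conn net.Conn) (*connWithDone, <-chan struct{}) {
	wrapped := &connWithDone{Conn: conn, done: make(chan struct{})}
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		select {
		case <-ctx.Done():
			conn.Close()
		case <-wrapped.done:
			// Connection was closed by the caller; exit cleanly.
		}
	}()
	return wrapped, exited
}

// monitorExitsOnClose reports whether the goroutine exits after a manual
// Close() while the context is never cancelled.
func monitorExitsOnClose() bool {
	client, server := net.Pipe()
	defer server.Close()
	wrapped, exited := wrapAndWatch(context.Background(), client)
	wrapped.Close()
	select {
	case <-exited:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("monitor exited:", monitorExitsOnClose())
}
```

This variant stops the leak, but the ctx.Done() branch still closes an established connection when the context expires, which is the semantic issue raised elsewhere in this conversation.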

Comment thread pkg/ssh/ssh_dialer.go Outdated
Comment on lines +151 to +155
select {
case <-ctx.Done():
conn.Close()
case <-wrapped.done:
// Connection was closed by the caller; exit cleanly.
Comment thread pkg/ssh/ssh_dialer.go Outdated
Comment on lines +149 to +153
go func() {
if ctx != nil {
<-ctx.Done()
conn.Close()
select {
case <-ctx.Done():
conn.Close()
The monitoring goroutine in DialContext violated net.Dialer.DialContext semantics:
context should only govern connection establishment, not connection lifetime.
The goroutine also leaked when connections were closed manually while the
context remained open.

Remove the goroutine and connWithDone wrapper entirely. Delegate to
ssh.Client.DialContext which already handles context-cancellable dialing
correctly per the Go contract.
@Elvand-Lie
Contributor Author

Elvand-Lie commented Jun 6, 2026

@matejvasek I've removed the goroutine entirely instead of patching it since the goroutine is a bit redundant. x/crypto/ssh already has Client.DialContext that handles context cancellation during dial correctly, so I'm just delegating to it now. Net -30 lines. Should we have a regression test for this, or is the deletion self-evident enough?

Labels

kind/bug Bugs needs-ok-to-test 🤖 Needs an org member to approve testing size/S 🤖 PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pkg/ssh: DialContext leaks goroutines if context is never cancelled

3 participants