Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues in SPDY RoundTripper #109651

Merged
merged 2 commits into from May 15, 2022
Merged

Conversation

ash2k
Copy link
Member

@ash2k ash2k commented Apr 25, 2022

What type of PR is this?

/kind bug
/sig api-machinery

What this PR does / why we need it:

  • Feature: Pass context when doing TLS dials
  • Bugfix: Don't leak connection when an error happens after a successful TLS dial
  • Cleanup: Don't clone headers twice
  • Cleanup: No need for MultiReader

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 25, 2022
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 25, 2022
netD := s.Dialer
if netD == nil {
netD = &net.Dialer{
Timeout: 30 * time.Second,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not set a timeout here. The socks proxy dialer is ~new, so copying its timeout to this old code path is not compelling... since we're now plumbing the context, inheriting any timeout associated with it seems sufficient to me.

if err != nil {
return nil, err
}
defer func() {
if !success {
c.Close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection is now always closed on error. It was leaked before (on error).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure is leaked, it looks to me that it will be closed later

// Dial implements k8s.io/apimachinery/pkg/util/net.Dialer.
func (s *SpdyRoundTripper) Dial(req *http.Request) (net.Conn, error) {
	conn, err := s.dial(req)
	if err != nil {
		return nil, err
	}

	if err := req.Write(conn); err != nil {
		conn.Close()
		return nil, err
	}

	return conn, nil
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will not be - s.dial(req) returns an error and function returns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yeah, you are right, do you mind if we leave the connection leaking for other PR?
it seems a whole topic in its own and I can't see tests for those

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I still think the conn will not leak

func (s *SpdyRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	header := utilnet.CloneHeader(req.Header)
	header.Add(httpstream.HeaderConnection, httpstream.HeaderUpgrade)
	header.Add(httpstream.HeaderUpgrade, HeaderSpdy31)

	var (
		conn        net.Conn
		rawResponse []byte
		err         error
	)

	clone := utilnet.CloneRequest(req)
	clone.Header = header
	conn, err = s.Dial(clone)
	if err != nil {
		return nil, err
	}

is not that it runs in a goroutine, once the RoundTrip exits it will be destroyed, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A connection is a kind of object that needs to get it's lifetime managed by the program. GC will collect the object, but I don't know if there is a finalizer to clean up the unreleased underlying OS resources. It's poor practice to rely on finalizers anyway. I don't understand what is your concern. Why not close the connection if it's not needed (because an error is returned)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this code is too old that there are a lot of assumptions that we don't know and we may break something without noticing, I'm sure you are right, but what I'm not sure is about the consequences of the changes, that is why I'm trying to be conservative ... is not the first time that for fixing something a bug arise in other place of the stack, and then is very hard to solve these bugs ... we can open a bug for the conns leaks and follow up later, I think that is not blocking your problem, and I can work on the test if you want

@aojea
Copy link
Member

aojea commented Apr 25, 2022

too many different changes in a single commit?

@ash2k
Copy link
Member Author

ash2k commented Apr 26, 2022

@aojea They are all one the same lines of code, so it was easier to just clean it all up in a single pass.

Need to fix tests...

@aojea
Copy link
Member

aojea commented Apr 26, 2022

Feature: Pass context when doing TLS dials
Bugfix: Don't leak connection when an error happens after a successful TLS dial
Cleanup: Don't clone headers twice
Cleanup: No need for MultiReader

I personally think that each item should be an independent commit because it is easier to backport the changes or revert if required.

Also, I think that the new feature and the bugfix should have unit tests

@fedebongio
Copy link
Contributor

/assign @aojea
(thank you Antonio!)
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 26, 2022
@ash2k
Copy link
Member Author

ash2k commented Apr 29, 2022

@aojea I've split it into 3 commits, should be cleaner now. PTAL.

@aojea
Copy link
Member

aojea commented Apr 29, 2022

LGTM if we remove the "connection leaking" code so we can tackle that problem in a different PR verifying the leaks with tests, :)

@ash2k
Copy link
Member Author

ash2k commented Apr 29, 2022

LGTM if we remove the "connection leaking" code so we can tackle that problem in a different PR verifying the leaks with tests, :)

I'm sorry but I'm not going to come back to this. The only reason I'm looking at this code is because I want to fix #89163 and I was learning how things work currently. I don't want to polish this code to look perfectly, I want to make it obsolete and eventually delete it. It's clear that there is a leak and to test that I'd need to inject a stub dialer (to return a mock connection and then check that Close() has been called). Honestly, I don't think it's worth the effort. Maybe you have a better idea of how to test it?

@aojea
Copy link
Member

aojea commented Apr 29, 2022

LGTM if we remove the "connection leaking" code so we can tackle that problem in a different PR verifying the leaks with tests, :)

I'm sorry but I'm not going to come back to this. The only reason I'm looking at this code is because I want to fix #89163 and I was learning how things work currently. I don't want to polish this code to look perfectly, I want to make it obsolete and eventually delete it. It's clear that there is a leak and to test that I'd need to inject a stub dialer (to return a mock connection and then check that Close() has been called). Honestly, I don't think it's worth the effort. Maybe you have a better idea of how to test it?

@ash2k I will work on the tests, I think that your commits add value, give me some days to come back to you

Regarding websockets, this comment will be interesting for you #7452 (comment)

@ash2k
Copy link
Member Author

ash2k commented Apr 30, 2022

👍

@aojea
Copy link
Member

aojea commented May 2, 2022

the tls.Dial already performs the handshake and closes the underline connection if it fails https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/crypto/tls/tls.go;l=156-161;drc=860704317e02d699e4e4a24103853c4782d746c1

	rawConn, err := netDialer.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}

	colonPos := strings.LastIndex(addr, ":")
	if colonPos == -1 {
		colonPos = len(addr)
	}
	hostname := addr[:colonPos]

	if config == nil {
		config = defaultConfig()
	}
	// If no ServerName is set, infer the ServerName
	// from the hostname we're connecting to.
	if config.ServerName == "" {
		// Make a copy to avoid polluting argument or default.
		c := config.Clone()
		c.ServerName = hostname
		config = c
	}

	conn := Client(rawConn, config)
	if err := conn.HandshakeContext(ctx); err != nil {
		rawConn.Close()
		return nil, err
	}
	return conn, nil

Another thing is that the code that verify the hostname seems not needed anymore, https://groups.google.com/g/golang-nuts/c/4vnt7NdLvVU/m/b1SJ4u0ikb0J

You can play with the different combinations using https://gist.github.com/aojea/4e172d7f3835cd3ffc6ea4f83eea6608

I've filed a new PR for that adding you as co-author #109750

@ash2k
Copy link
Member Author

ash2k commented May 2, 2022

I'll rebased once your PR is merged. Cheers.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2022
CloneRequest() clones headers too
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2022
@ash2k
Copy link
Member Author

ash2k commented May 7, 2022

@aojea Rebased, PTAL.

/remove-kind bug
/kind cleanup

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 7, 2022
@aojea
Copy link
Member

aojea commented May 7, 2022

/retest
Unrelated failure k8s.io/kubernetes/pkg/kubelet/cm/devicemanager

/Lgtm
/assign @liggitt

For approval

Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2022
require.NoError(t, err)
spdyTransport := NewRoundTripper(&tls.Config{})
_, err = spdyTransport.Dial(req)
assert.EqualError(t, err, "dial tcp 127.0.0.1:1233: operation was canceled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to check the type of this error rather than the string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dialer returns a *net.OpError but it's a generic thing. Within it there is an Err field which is the original error. In Go 1.18.x it's NOT possible to check it, but in the yet unreleased 1.19 it will be possible via errors.Is(). See golang/go#51428 (commit golang/go@85b5f86). In any case, I think it's not that important as it's a test only. String comparison is probably good enough for a test.

netD := s.Dialer
if netD == nil {
netD = &net.Dialer{
Timeout: 30 * time.Second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not set a timeout here. The socks proxy dialer is ~new, so copying its timeout to this old code path is not compelling... since we're now plumbing the context, inheriting any timeout associated with it seems sufficient to me.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2022
@ash2k
Copy link
Member Author

ash2k commented May 11, 2022

/retest

@ash2k ash2k requested a review from liggitt May 13, 2022 06:49
@liggitt
Copy link
Member

liggitt commented May 15, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2022
@k8s-ci-robot k8s-ci-robot merged commit dfee09a into kubernetes:master May 15, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone May 15, 2022
@ash2k ash2k deleted the ash2k/spdy-cleanup branch May 15, 2022 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants