
Conversation

@cheftako
Contributor

Currently we read from the remote and, on the same goroutine, we write
to the konn server. This means that subsequent reads from the remote
are effectively blocked on the write to the konn server. This PR
decouples the read and write: all writes are funneled onto a channel,
and the writes happen on their own goroutine, which reads from that
channel.
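
For illustration, here is a minimal, self-contained sketch of that pattern: reads queue packets onto a buffered channel, and a single writer goroutine drains it, so reads are no longer serialized behind each write. The packet, sender, and slowSender names and the xfrChannelSize value are stand-ins, not the project's actual types (the real code uses client.Packet and the agent's gRPC stream).

package main

import (
	"fmt"
	"io"
	"time"
)

// packet stands in for the project's client.Packet type.
type packet struct {
	connID uint64
	data   []byte
}

// sender stands in for the gRPC stream to the proxy server.
type sender interface {
	Send(*packet) error
}

type slowSender struct{}

func (slowSender) Send(p *packet) error {
	time.Sleep(50 * time.Millisecond) // simulate a slow write to the server
	fmt.Printf("sent %d bytes for conn %d\n", len(p.data), p.connID)
	return nil
}

func main() {
	const xfrChannelSize = 10 // buffered, so reads are not blocked on writes
	ch := make(chan *packet, xfrChannelSize)
	done := make(chan struct{})

	// Writer goroutine: the only caller of Send, draining the channel.
	go func() {
		defer close(done)
		var s sender = slowSender{}
		for pkt := range ch {
			if err := s.Send(pkt); err != nil && err != io.EOF {
				return // the stream is broken; stop writing
			}
		}
	}()

	// Reader side: packets read from the remote are queued rather than sent inline.
	for i := 0; i < 5; i++ {
		ch <- &packet{connID: 1, data: []byte("payload")}
	}
	close(ch)
	<-done
}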

@cheftako cheftako requested a review from jkh52 December 13, 2021 23:09
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 13, 2021
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2021
@k8s-triage-robot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 14, 2021
Contributor

@mainred mainred left a comment


Thanks for your work, left a few comments.

defer a.sendLock.Unlock()

err := a.stream.Send(pkt)
/*err := a.stream.Send(pkt)
Contributor

Should we just remove this commented-out code?

Contributor Author

Yes. (My bad for not removing it before sending this out.)

if err != nil && err != io.EOF {
metrics.Metrics.ObserveFailure(metrics.DirectionToServer)
a.cs.RemoveClient(a.serverID)
a.serverError = err
Contributor

@mainred mainred Dec 14, 2021


When will serverError be recovered/reset?

Contributor Author

I'm assuming this channel/connection to the server is no longer functional. By removing the serverID from the ClientSet we are enabling

if err := cs.connectOnce(); err != nil {
to establish a new channel to the server. Anyone using the old channel/connection will get an error, and all new requests should use the new channel/connection once it has been established. This does mean in-flight requests will fail, but that is supposed to be an exceptional occurrence.
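
To make that recovery flow concrete, here is a rough, self-contained sketch with assumed names (clientSet, proxyClient, removeClient, connectOnce), not the project's actual ClientSet/Client types: a failed send records the error and removes the client from the set, after which a reconnect step can register a fresh client for the same server ID.

package main

import (
	"errors"
	"fmt"
)

// clientSet is a stand-in for the agent's ClientSet, keyed by server ID.
type clientSet struct {
	clients map[string]*proxyClient
}

// proxyClient is a stand-in for the agent's per-server client.
type proxyClient struct {
	serverID    string
	serverError error // once set, this client is considered dead
}

func (cs *clientSet) removeClient(serverID string) {
	delete(cs.clients, serverID)
}

// connectOnce stands in for dialing a fresh channel to the proxy server.
func (cs *clientSet) connectOnce(serverID string) error {
	cs.clients[serverID] = &proxyClient{serverID: serverID}
	return nil
}

func main() {
	cs := &clientSet{clients: map[string]*proxyClient{
		"server-1": {serverID: "server-1"},
	}}

	// A send to server-1 fails: record the error and drop the client from the set.
	c := cs.clients["server-1"]
	c.serverError = errors.New("stream send failed")
	cs.removeClient(c.serverID)

	// The agent's sync loop later notices the missing server and reconnects;
	// in-flight requests on the old channel fail, new requests use the new one.
	if err := cs.connectOnce("server-1"); err == nil {
		fmt.Println("new channel to server-1 established")
	}
}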

Contributor

Thanks for your explanation. Do you mind adding a log line when the server ID removal happens, like around https://github.com/kubernetes-sigs/apiserver-network-proxy/pull/310/files#diff-c2eff45b2c7db0ae864aecb2e8b733dfbdfe7070ffe6f7feb1204e086c403a0aR277

Contributor

I can add a UT later to cover the path that removes a server and adds it back.

Contributor Author

Done

Data: buf[:n],
ConnectID: connID,
}}
data := make([]byte, 0, n)
Contributor

This decoupling is based on the fact that we can store/cache the data received from the remote before sending it to the server.
If the connection to the server is unhealthy or closed, it might in some sense be a good idea to block receiving more data from the remote until a healthy connection to the server is back; otherwise the cached data won't be sent, because the connection is closed.

Contributor Author

That's a really nice idea for an improvement. However, it would require implementing resume all the way through: being able to move in-flight connections from one channel to a newer channel, which means coordinating that in both agent and server. While I really like this idea, I think it's better done in a subsequent change.


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, mainred

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

a.conn = conn
a.stream = stream
a.serverChannel = make(chan *client.Packet, xfrChannelSize)
a.serverError = nil
Contributor

serverError could deserve some comments about why it doesn't need lock protection, and that once it is set the client is effectively dead.

Contributor Author

Done.
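
For reference, here is a sketch of the kind of field comments being asked for, using placeholder types and the field names from the excerpt above; this is illustrative and the locking assumption is inferred from the design, not the wording that landed in the PR.

package agent

// packet is a stand-in for client.Packet.
type packet struct{}

// stream is a stand-in for the gRPC stream to the proxy server.
type stream interface {
	Send(*packet) error
}

type client struct {
	stream stream

	// serverChannel carries packets from the per-connection reader goroutines
	// to the single writer goroutine, which is the only caller of stream.Send.
	serverChannel chan *packet

	// serverError is assumed to be written only from the writer goroutine,
	// which is why it needs no lock; once set, this client is effectively dead
	// and the agent removes the server from its ClientSet and establishes a
	// new connection.
	serverError error
}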

@jkh52
Contributor

jkh52 commented Jan 14, 2022

/hold

(to allow my single comment to be addressed)

/lgtm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 14, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2022
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

return serverCount, nil
}

func (a *Client) writeToKonnServer() {
Contributor

I think it is already hard to understand the naming in the code (proxy, server, agent, client, frontend, backend, remote, etc.), which is why I think writeToKonnServer is not the best function name; so far the Konn name does not appear anywhere else in the code. That's why I would like to suggest another name like writeToProxyServer or something similar. What do you think? Thanks! Adam

Contributor Author

Done

Currently we read from the remote and, on the same goroutine, we write
to the konn server. This means that subsequent reads from the remote
are effectively blocked on the write to the konn server. This PR
decouples the read and write: all writes are funneled onto a channel,
and the writes happen on their own goroutine, which reads from that
channel.

Factored in suggested changes from mainred.
Factoring in comment from jkh52 and andrewsykim.
Factoring in comment from mihivagyok.
Renamed writeToKonnServer as writeToProxyServer.
@andrewsykim
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 18, 2022
// It just means we do not know yet if it will fail.
// Slight back-flips here to ensure the write is closing the channel.
a.cleanChannel.Do(func() {
klog.V(2).InfoS("Data channel to server has errored out", "serverID", a.serverID)
Member

Should this be klog.Error including a.serverError as one of the keys?
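
For reference, a small, self-contained sketch of what this suggestion could look like, assuming the serverID and serverError names from the excerpt above; klog.ErrorS is the structured error-level call in k8s.io/klog/v2.

package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	serverID := "server-1"
	serverError := errors.New("stream send failed")
	// Error-level structured log that carries the underlying error,
	// instead of a V(2) info message.
	klog.ErrorS(serverError, "Data channel to server has errored out", "serverID", serverID)
}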

if panicInfo := recover(); panicInfo != nil {
klog.V(2).InfoS("Exiting writeToProxyServer with recovery", "panicInfo", panicInfo, "serverID", a.serverID)
} else {
klog.V(2).InfoS("Exiting writeToProxyServer", "serverID", a.serverID)
Member

v=2 seems a little low for this, doesn't it? I would expect this to be v=4 at least.

for pkt := range a.serverChannel {
klog.V(5).InfoS("writeToProxyServer recevied packet to send to KonnServer", "serverID", a.serverID)
err := a.stream.Send(pkt)
if err != nil && err != io.EOF {
Member

Should we be cleaning up this goroutine if err == io.EOF?
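
One possible shape for that cleanup, as a self-contained sketch rather than the PR's actual code (packet, sender, eofSender, and this writeToProxyServer are placeholders): treat io.EOF as a signal to stop the writer loop instead of continuing to drain the channel on a dead stream.

package main

import (
	"errors"
	"fmt"
	"io"
)

// packet and sender are stand-ins for client.Packet and the gRPC stream.
type packet struct{ connID uint64 }

type sender interface {
	Send(*packet) error
}

// eofSender returns io.EOF after a couple of sends, as if the server
// closed its side of the stream.
type eofSender struct{ sent int }

func (s *eofSender) Send(p *packet) error {
	s.sent++
	if s.sent > 2 {
		return io.EOF
	}
	return nil
}

// writeToProxyServer drains the channel; on io.EOF (or any other error) it
// returns, so the goroutine stops consuming packets on a dead stream.
func writeToProxyServer(stream sender, ch <-chan *packet) error {
	for pkt := range ch {
		if err := stream.Send(pkt); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // clean shutdown: stop writing and let the caller reconnect
			}
			return err // real failure: surface it for metrics and reconnect logic
		}
	}
	return nil
}

func main() {
	ch := make(chan *packet, 4)
	for i := uint64(1); i <= 4; i++ {
		ch <- &packet{connID: i}
	}
	close(ch)
	fmt.Println(writeToProxyServer(&eofSender{}, ch))
}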

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cheftako
Contributor Author

cheftako commented Sep 9, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 9, 2022
@cheftako cheftako reopened this Sep 9, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 9, 2022
@k8s-ci-robot
Contributor

@cheftako: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@cheftako: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-apiserver-network-proxy-docker-build-arm64 ec4745f link true /test pull-apiserver-network-proxy-docker-build-arm64
pull-apiserver-network-proxy-test ec4745f link true /test pull-apiserver-network-proxy-test
pull-apiserver-network-proxy-docker-build-amd64 ec4745f link true /test pull-apiserver-network-proxy-docker-build-amd64
pull-apiserver-network-proxy-make-lint ec4745f link true /test pull-apiserver-network-proxy-make-lint

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 7, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
