
Improve dynamic server count detection logic of agent #358

Open
zqzten opened this issue May 31, 2022 · 11 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@zqzten
Member

zqzten commented May 31, 2022

Currently, if we want the agent to dynamically detect server count changes, we have to enable the syncForever mode. However, with this mode on, the agent just calls the server's Connect method constantly, and in most cases it only grabs the server count and closes the connection (since the server count does not change that frequently). This causes a lot of garbage error logs (specifically server.go:767] "stream read failure" err="rpc error: code = Canceled desc = context canceled" on the server side) and wastes networking and computing resources.

So would it be feasible to introduce a lightweight ServerCount method on the server and let the agent call that method to sync the server count instead of the heavier Connect?
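
For illustration, a minimal sketch of what agent-side polling against such a method might look like. This is an assumption-heavy sketch: the ServerCounter interface, syncServerCount function, and polling interval are hypothetical, not part of the current protocol.

```go
// Hypothetical sketch of the proposed lightweight server count sync.
// ServerCounter and syncServerCount are illustrative names, not part of
// the actual apiserver-network-proxy API.
package agent

import (
	"context"
	"log"
	"time"
)

// ServerCounter is the assumed client side of the proposed ServerCount RPC.
type ServerCounter interface {
	ServerCount(ctx context.Context) (int, error)
}

// syncServerCount polls the lightweight RPC and calls onChange whenever the
// reported count differs from the last observed value, avoiding the cost of
// opening and tearing down full Connect streams.
func syncServerCount(ctx context.Context, c ServerCounter, interval time.Duration, onChange func(int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	last := -1
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			count, err := c.ServerCount(ctx)
			if err != nil {
				log.Printf("server count sync failed: %v", err)
				continue
			}
			if count != last {
				onChange(count)
				last = count
			}
		}
	}
}
```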

@ipochi
Contributor

ipochi commented Aug 18, 2022

@cheftako @tallclair

I'd be interested to know how others in the community feel about this, and whether there is scope/priority for a better solution.

I'd like to collaborate with relevant folks and formulate an approach.

@tallclair
Contributor

I agree that the current approach is problematic. Introducing a new lightweight method makes sense to me, but I'd like someone with more experience on this project to weigh in.

@cheftako
Contributor

I'd be OK with adding a ServerCount method to the server; presumably the agent would then poll it. I'm not sure whether we would need to secure the ServerCount method, and if we did, I'm not sure it would save us much over calling Connect. Another option would be to add an UpdateMetaData packet to our protocol, so the servers could push an updated server count to the agents.

One thing we haven't addressed yet is how the server gets that value. Today it gets it from https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/cmd/server/app/options/options.go#L103. One advantage of the method you mentioned is that you can just add a new server with the new value behind our LB and eventually the agent will notice. It does mean that if the agent gets conflicting answers, it should always use the larger of the two values. Extending the protocol means you would want a mechanism to update the serverCount on the running server. That is more work, but it means the agent would respond more quickly, and it also lets you support downsizing. Mechanisms for lowering the count could be an admin command to the server, dynamic config, or loading it from a CRD on the KAS.
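
To make the conflicting-answers rule concrete, here is a minimal sketch of an agent-side tracker that records per-server reports and acts on the largest one. The countTracker type and observe method are hypothetical names, not the project's actual API.

```go
// Hypothetical sketch of reconciling server counts reported by different
// servers: the agent always acts on the largest count it has seen.
package agent

import "sync"

type countTracker struct {
	mu       sync.Mutex
	byServer map[string]int // last count reported by each server ID
}

// observe records one server's report and returns the count the agent
// should act on: the maximum across all reporting servers.
func (t *countTracker) observe(serverID string, count int) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.byServer == nil {
		t.byServer = make(map[string]int)
	}
	t.byServer[serverID] = count
	max := 0
	for _, c := range t.byServer {
		if c > max {
			max = c
		}
	}
	return max
}
```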

@zqzten
Member Author

zqzten commented Aug 25, 2022

One thing we haven't addressed yet is how the server gets that value.

Good point. I've noticed there is another rotting issue, #273, tracking this. We may want to consider it together with this issue.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 23, 2022
@zqzten
Member Author

zqzten commented Dec 23, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 23, 2022
@tallclair
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 4, 2023
@jkh52
Contributor

jkh52 commented May 26, 2023

Good point. I've noticed there is another rotting issue, #273, tracking this. We may want to consider it together with this issue.

We could keep this issue open to track improving how the agents learn the server count; there are some ideas here that weren't considered there. #273 can track the remaining server-side half of the feature.

jkh52 added a commit to jkh52/apiserver-network-proxy that referenced this issue May 31, 2023
This is a complaint at kubernetes-sigs#358

Note that this extends an existing pattern.
@carreter
Contributor

carreter commented Jun 13, 2024

FYI, if an agent receives an increased server count from a server it is already connected to, it doesn't reset the backoff and just keeps using the last backoff duration. I'm guessing this isn't the desired behavior, and that the agent should immediately reset the backoff and shift into "fast sync" mode if it is supposed to be connected to more servers than it currently is.

This will only be a problem once #273 is implemented as there is currently no way to update the server count on the server side.
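
As a sketch of the behavior described above, the agent could reset its reconnect delay whenever a server reports more servers than the agent has connections. The clientSet fields and onServerCount method below are hypothetical, assuming a simple backoff model rather than the project's actual implementation.

```go
// Hypothetical sketch: drop back to the fast-sync interval when the agent
// learns it is under-connected. Names are illustrative, not the actual API.
package agent

import "time"

type clientSet struct {
	clients     int           // connections currently established
	backoff     time.Duration // current reconnect delay
	baseBackoff time.Duration // initial "fast sync" delay
}

// onServerCount handles a server count reported by a connected server.
func (cs *clientSet) onServerCount(count int) {
	if cs.clients < count {
		// Under-connected: reset to the base delay instead of waiting out
		// the previously accumulated backoff.
		cs.backoff = cs.baseBackoff
	}
}
```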

EDIT: Here are some logs. Note that the connection attempts are separated by 30 seconds even after the server count gets updated to be greater than the client count.

I0613 20:40:17.950671       1 clientset.go:186] "duplicate server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" serverCount=1 clientsCount=1
I0613 20:40:47.953658       1 client.go:474] "received DATA" connectionID=2
I0613 20:40:47.953719       1 client.go:474] "received DATA" connectionID=1
I0613 20:40:47.953777       1 client.go:600] "write to remote" connectionID=2 lastData=39 dataSize=39
I0613 20:40:47.953810       1 client.go:600] "write to remote" connectionID=1 lastData=39 dataSize=39
I0613 20:40:48.612637       1 client.go:215] "Connect to server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029"
I0613 20:40:48.612694       1 clientset.go:216] "Server count change suggestion by server" current=1 serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" actual=4 # SERVER COUNT GETS UPDATED HERE
I0613 20:40:48.613038       1 clientset.go:186] "duplicate server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" serverCount=4 clientsCount=1
I0613 20:40:48.615871       1 client.go:474] "received DATA" connectionID=2
I0613 20:40:48.615974       1 client.go:600] "write to remote" connectionID=2 lastData=35 dataSize=35
I0613 20:41:18.616285       1 client.go:474] "received DATA" connectionID=1
I0613 20:41:18.616399       1 client.go:600] "write to remote" connectionID=1 lastData=39 dataSize=39
I0613 20:41:18.617362       1 client.go:474] "received DATA" connectionID=2
I0613 20:41:18.617413       1 client.go:600] "write to remote" connectionID=2 lastData=39 dataSize=39
I0613 20:41:19.330032       1 client.go:215] "Connect to server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029"
I0613 20:41:19.330344       1 clientset.go:186] "duplicate server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" serverCount=4 clientsCount=1

@carreter
Contributor

Fixed by #643!
