
Improve dynamic server count detection logic of agent #358

Open
zqzten opened this issue May 31, 2022 · 11 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@zqzten
Member

zqzten commented May 31, 2022

Currently, if we want the agent to dynamically detect server count changes, we have to enable the syncForever mode. However, with this mode on, the agent just calls the server's Connect method constantly, and in most cases it only grabs the server count and closes the connection (since the server count does not change that frequently). This causes a lot of garbage error logs (specifically server.go:767] "stream read failure" err="rpc error: code = Canceled desc = context canceled" on the server side) and wastes networking and computing resources.

So would it be feasible to introduce a lightweight ServerCount method on the server and let the agent call that method to sync the server count instead of the heavier Connect?
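
For illustration, a minimal sketch of what agent-side polling against such a method might look like. This is an assumption-heavy sketch: the ServerCounter interface, syncServerCount function, and polling interval are hypothetical, not part of the current protocol.

```go
// Hypothetical sketch of the proposed lightweight server count sync.
// ServerCounter and syncServerCount are illustrative names, not part of
// the actual apiserver-network-proxy API.
package agent

import (
	"context"
	"log"
	"time"
)

// ServerCounter is the assumed client side of the proposed ServerCount RPC.
type ServerCounter interface {
	ServerCount(ctx context.Context) (int, error)
}

// syncServerCount polls the lightweight RPC and calls onChange whenever the
// reported count differs from the last observed value, avoiding the cost of
// opening and tearing down full Connect streams.
func syncServerCount(ctx context.Context, c ServerCounter, interval time.Duration, onChange func(int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	last := -1
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			count, err := c.ServerCount(ctx)
			if err != nil {
				log.Printf("server count sync failed: %v", err)
				continue
			}
			if count != last {
				onChange(count)
				last = count
			}
		}
	}
}
```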

@ipochi
Contributor

ipochi commented Aug 18, 2022

@cheftako @tallclair

I'd be interested to know how others in the community feel about this, and whether there is scope/priority for a better solution.

I'd like to collaborate with relevant folks and formulate an approach.

@tallclair
Contributor

I agree that the current approach is problematic. Introducing a new lightweight method makes sense to me, but I'd like someone with more experience on this project to weigh in.

@cheftako
Contributor

I'd be OK with adding a ServerCount method to the server; presumably the agent would then poll it. I'm not sure whether we would need to secure the ServerCount method, and if we did, I'm not sure it would save us much over calling Connect. Another option would be to add an UpdateMetaData packet to our protocol, so the servers could push an updated server count to the agents.

One thing we haven't addressed yet is how the server gets that value. Today it gets it from https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/cmd/server/app/options/options.go#L103. One advantage of the method you mentioned is that you can just add a new server with the new value behind our LB and eventually the agent will notice. It does mean that if the agent gets conflicting answers, it should always use the larger of the two values. Extending the protocol means you would want a mechanism to update the serverCount on the running server. That is more work, but it means the agent would respond more quickly, and it also lets you support downsizing. Mechanisms for lowering the count could be an admin command to the server, dynamic config, or loading it from a CRD on the KAS.
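
To make the conflicting-answers rule concrete, here is a minimal sketch of an agent-side tracker that records per-server reports and acts on the largest one. The countTracker type and observe method are hypothetical names, not the project's actual API.

```go
// Hypothetical sketch of reconciling server counts reported by different
// servers: the agent always acts on the largest count it has seen.
package agent

import "sync"

type countTracker struct {
	mu       sync.Mutex
	byServer map[string]int // last count reported by each server ID
}

// observe records one server's report and returns the count the agent
// should act on: the maximum across all reporting servers.
func (t *countTracker) observe(serverID string, count int) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.byServer == nil {
		t.byServer = make(map[string]int)
	}
	t.byServer[serverID] = count
	max := 0
	for _, c := range t.byServer {
		if c > max {
			max = c
		}
	}
	return max
}
```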

@zqzten
Member Author

zqzten commented Aug 25, 2022

One thing we haven't addressed yet is how the server gets that value.

Good point. I've noticed there is another rotting issue, #273, tracking this. We may want to consider it together with this issue.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 23, 2022
@zqzten
Member Author

zqzten commented Dec 23, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 23, 2022
@tallclair
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 4, 2023
@jkh52
Contributor

jkh52 commented May 26, 2023

Good point. I've noticed there is another rotting issue, #273, tracking this. We may want to consider it together with this issue.

We could keep this issue open to track improving how the agents learn the server count; there are some ideas here that weren't considered there. #273 can track the remaining server-side half of the feature.

jkh52 added a commit to jkh52/apiserver-network-proxy that referenced this issue May 31, 2023
This is a complaint at kubernetes-sigs#358

Note that this extends an existing pattern.
@carreter
Contributor

carreter commented Jun 13, 2024

FYI, if an agent receives an increased server count from a server it is already connected to, it doesn't reset the backoff and just keeps using the last backoff duration. I'm guessing this isn't the desired behavior, and that the agent should immediately reset the backoff and shift into "fast sync" mode if it is supposed to be connected to more servers than it currently is.

This will only be a problem once #273 is implemented as there is currently no way to update the server count on the server side.
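
As a sketch of the behavior described above, the agent could reset its reconnect delay whenever a server reports more servers than the agent has connections. The clientSet fields and onServerCount method below are hypothetical, assuming a simple backoff model rather than the project's actual implementation.

```go
// Hypothetical sketch: drop back to the fast-sync interval when the agent
// learns it is under-connected. Names are illustrative, not the actual API.
package agent

import "time"

type clientSet struct {
	clients     int           // connections currently established
	backoff     time.Duration // current reconnect delay
	baseBackoff time.Duration // initial "fast sync" delay
}

// onServerCount handles a server count reported by a connected server.
func (cs *clientSet) onServerCount(count int) {
	if cs.clients < count {
		// Under-connected: reset to the base delay instead of waiting out
		// the previously accumulated backoff.
		cs.backoff = cs.baseBackoff
	}
}
```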

EDIT: Here are some logs. Note that the connection attempts are separated by 30 seconds even after the server count gets updated to be greater than the client count.

I0613 20:40:17.950671       1 clientset.go:186] "duplicate server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" serverCount=1 clientsCount=1
I0613 20:40:47.953658       1 client.go:474] "received DATA" connectionID=2
I0613 20:40:47.953719       1 client.go:474] "received DATA" connectionID=1
I0613 20:40:47.953777       1 client.go:600] "write to remote" connectionID=2 lastData=39 dataSize=39
I0613 20:40:47.953810       1 client.go:600] "write to remote" connectionID=1 lastData=39 dataSize=39
I0613 20:40:48.612637       1 client.go:215] "Connect to server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029"
I0613 20:40:48.612694       1 clientset.go:216] "Server count change suggestion by server" current=1 serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" actual=4 # SERVER COUNT GETS UPDATED HERE
I0613 20:40:48.613038       1 clientset.go:186] "duplicate server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" serverCount=4 clientsCount=1
I0613 20:40:48.615871       1 client.go:474] "received DATA" connectionID=2
I0613 20:40:48.615974       1 client.go:600] "write to remote" connectionID=2 lastData=35 dataSize=35
I0613 20:41:18.616285       1 client.go:474] "received DATA" connectionID=1
I0613 20:41:18.616399       1 client.go:600] "write to remote" connectionID=1 lastData=39 dataSize=39
I0613 20:41:18.617362       1 client.go:474] "received DATA" connectionID=2
I0613 20:41:18.617413       1 client.go:600] "write to remote" connectionID=2 lastData=39 dataSize=39
I0613 20:41:19.330032       1 client.go:215] "Connect to server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029"
I0613 20:41:19.330344       1 clientset.go:186] "duplicate server" serverID="d5f7af4c-9540-4fb9-a9a3-066a655ee029" serverCount=4 clientsCount=1

@carreter
Contributor

Fixed by #643!
