Idle connections are stuck after establishing connection. They never send PacketType_DATA #311
This commit builds on top of the previous commit. It was noticed that the memory consumption of the Konnectivity server increases when idle connections pile up. The issue is described in more detail in kubernetes-sigs#311. To mitigate the issue, we create a separate goroutine that tracks all the connections per stream; if there is no activity in the stream after a specified timeout (configured via --grpc-max-idle-time), we close the stream, thereby releasing the resources. The idea behind this approach is that in a successful connection lifecycle the apiserver sends a CLOSE_REQ and the proxy-server responds with a CLOSE_RSP, but for idle connections this never happens. To simulate the behaviour of the apiserver sending the CLOSE_REQ, the proxy-server fires up a goroutine that tracks inactive connections and, upon finding one, itself sends a CLOSE_REQ to the same channel on which it listens to the apiserver. Upon receiving the simulated CLOSE_REQ, the proxy-server follows the code path of the successful lifecycle of the packet, i.e. it forwards the CLOSE_REQ to the agent, the agent responds with a CLOSE_RSP, and the proxy-server forwards the CLOSE_RSP back to the apiserver. There are also connections which are stuck because they never receive PacketType_DATA; for those connections, we close the stream once the timeout occurs. This also fixes the issue of leaking file descriptors in kubernetes-sigs#276. Signed-off-by: Imran Pochi <imran@kinvolk.io>
@ipochi is it possible that you tested this without #290? I think that might be the case, and that is why it seems the dial rsp was received, although it was an invalid one. Let me elaborate. The diff of the k8s code you show doesn't include the return added in PR #290, and if you look carefully at this part of the diff you pasted:
You print that you received the response before checking whether it is a valid response or not. See the rest of the context here: https://github.com/kubernetes-sigs/apiserver-network-proxy/pull/290/files#diff-a37bb159e5baec4208de32e78a3905ecef6282ceaf095115842e4e3738d96310L109-R118. Just after that, we check whether the dialResp is recognized. That makes it print that we received a dialResp even though it is not recognized and is ignored. Therefore, no PacketType_DATA will be sent here, and by those prints it will appear that the DialResp was received although it was not valid. So, if my analysis is correct, it would be entirely possible for you to see this without patch #290. If you apply it, you should probably not see this issue.
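The ordering problem described here can be demonstrated with a small runnable sketch (hypothetical types and field names, not the vendored konnectivity-client code): the handler logs receipt of the dial response unconditionally, and only afterwards checks whether the response matches a pending dial, so the log line alone does not prove the dial succeeded.

```go
package main

import "fmt"

// dialResp is a stand-in for the proxy protocol's dial response: "random"
// identifies the original dial request, connectID the established connection.
type dialResp struct{ random, connectID int64 }

// handleDialResp logs before validating, mirroring the misleading print order.
func handleDialResp(pending map[int64]bool, resp dialResp) bool {
	fmt.Printf("received DIAL_RSP, connectionID=%d\n", resp.connectID) // logged unconditionally
	if !pending[resp.random] {
		// Unrecognized response: it is dropped, so no PacketType_DATA will
		// ever follow, despite the log line above suggesting success.
		return false
	}
	return true
}

func main() {
	pending := map[int64]bool{1: true} // only dial request 1 is outstanding
	stale := dialResp{random: 2, connectID: 505}
	fmt.Println(handleDialResp(pending, stale)) // logged, then rejected
}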
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@ipochi any chance we can test this again with the recent fixes that were merged?
Thanks @andrewsykim for the fixes. I'll look into it sometime next week and circle back with results.
UPDATE: I still see many idle connections stuck after establishing a connection. I followed the above reproduction steps on the v0.0.30 tag. In all, 1412 dial requests were sent, as identified by
Out of those, 90 were missing close responses, and none of those 90 connections had sent PacketType_DATA.
@ipochi sorry, but which Kubernetes version are you using? That is a key thing, as the bug fixes are in the vendored files from k8s. It would be good to make sure the k8s version you are using contains all the ANP fixes too.
@rata latest master, which has the v0.0.30 vendored konnectivity-client. k8s version: master branch, which includes the v0.0.30 konnectivity-client.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
While exploring issue #276 further, it was noticed that idle connections are somehow stuck in a weird state where the connection is established, i.e. the kube-apiserver receives a DIAL_RSP from the proxy-server, but thereafter there is complete inactivity from the apiserver, i.e. no packets of PacketType_DATA are sent.
In order to reproduce this, I added some debug statements to the
konnectivity-client
library, which is vendored in Kubernetes, and built a local cluster. This is the diff in the kubernetes repo to build a local cluster:
Once the local cluster is up with
./hack/local-up-cluster.sh
, you can start the proxy-server and agent as local processes in separate terminals with:
Stress the cluster with concurrent kubectl log requests.
Create some workloads
get-logs.sh script:
run.sh script:
Execute the run.sh script after saving it and giving it execute permissions:
To verify the above hypothesis, I've created a small script that prints those connection IDs which are idle, i.e. a
CLOSE_REQ
has not been sent by the api-server and hence, as a result, a
CLOSE_RSP
was never received from the proxy-server.
check.sh script:
We are interested in the case of
Both missing from $connection_id
. All of the connection IDs printed for this case, if cross-checked against the proxy-server logs, have only the following log snippet, for example:
connectionID=505
You can also verify if
PacketType_DATA
is sent by the apiserver for the same connection ID.
The problematic aspect of this is that in order to close/terminate these idle connections we need a connectionID, which is only obtained when a DIAL_RSP is sent by the agent and received by the proxy-server, and when PacketType_DATA is sent by the apiserver to the proxy-server.
/cc @iaguis @rata