
Dataplane not getting information from the ControlPlane in Hybrid mode #13076

Closed
lays147 opened this issue May 23, 2024 · 6 comments
Labels
pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc...

Comments

@lays147

lays147 commented May 23, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.6.1

Current Behavior

I have a Kong Cluster running in AWS ECS in Hybrid Mode.
The connection between the DP and CP is made with AWS Service Discovery/Cloudmap.

When the cluster scales out, the new data plane node appears to be unable to retrieve its configuration from the Control Plane.
I see a handful of different types of errors, and I don't know where to investigate further, since Kong is not my area of expertise.

2024/05/23 18:35:25 [error] 1336#0: *73509 [lua] control_plane.lua:482: serve_cluster_listener(): [clustering] failed to receive the first 2 bytes: closed [id: a4941ffd-1759-4329-85b5-7bf2c3ff2ba0, host:.ec2.internal, ip: , version: 3.6.1], client: , server: kong_cluster_listener, request: "GET /v1/outlet?node_id=a4941ffd-1759-4329-85b5-7bf2c3ff2ba0&node_hostname=.ec2.internal&node_version=3.6.1 HTTP/1.1", host: "controlplane.internal.prd:8005" 
2024/05/23 18:45:25 [error] 1364#0: *139 [lua] data_plane.lua:158: communicate(): [clustering] connection to control plane wss://controlplane..prd:8005/v1/outlet?node_id=7a076aad-7de8-44bd-8c3e-f17602feec39&node_hostname=.ec2.internal&node_version=3.6.1 broken: failed to connect: connection refused (retrying after 7 seconds) [controlplaneinternal.prd:8005], context: ngx.timer
2024/05/23 18:28:48 [error] 1364#0: *7 [lua] data_plane.lua:263: [clustering] unable to update running config: unable to open DB for access: MDB_NOTFOUND: No matching key/data pair found, context: ngx.timer

In these cases, the Data Plane appears unable to retry the connection with the Control Plane, so I need to force a restart of the CP so that the new nodes can update themselves.

I would appreciate your guidance in investigating this problem, since it is causing disruptions in our production environment.
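
For what it's worth, this is the kind of quick reachability check I can run from inside a data plane container when the "connection refused" error shows up (a sketch: the hostname is a placeholder for my Cloud Map name, and it only tests the TCP connect, not the mTLS handshake that the cluster listener actually performs):

    # reachability_check.py -- quick test of the control plane cluster listener
    # Only verifies the TCP connect that precedes the wss:// handshake.
    import socket
    import sys

    CP_HOST = "controlplane.internal.prd"   # placeholder for the Service Discovery name
    CP_PORT = 8005                          # Kong cluster listener port

    try:
        with socket.create_connection((CP_HOST, CP_PORT), timeout=5):
            print(f"TCP connect to {CP_HOST}:{CP_PORT} succeeded")
            sys.exit(0)
    except OSError as exc:
        print(f"TCP connect to {CP_HOST}:{CP_PORT} failed: {exc}")
        sys.exit(1)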

Expected Behavior

New Data Plane nodes should get their configuration from the Control Plane without issues.

@ADD-SP
Contributor

ADD-SP commented May 27, 2024

@lays147 Thanks for your report. It seems the LMDB file is corrupted. Kong now uses LMDB to store entities on the data plane. Do you have a deployment that might share or change the kong_prefix directory across different data_plane instances?

@ADD-SP
Contributor

ADD-SP commented May 27, 2024

And do you have steps that would help us reproduce this issue locally?

ADD-SP added the pending author feedback label May 27, 2024
@lays147
Author

lays147 commented May 27, 2024

@ADD-SP

Do you have some deployment that might share/change the kong_prefix directory across different data_plane instances?

No, there are no configuration changes between deployments.

And do you have steps to help us to reproduce this issue locally?

The scenario is as I described in the bug report: I have one control plane and two data planes in my production environment running on ECS. The autoscaling configuration (based on CPU and memory consumption) is very sensitive, so it scales out very aggressively (from 3 to 5).

When this happens, some or all of the new containers show one or more errors like the ones I included above, and do not load the configuration from the control plane. The containers appear to be deadlocked.

I don't know if this can be reproduced locally. To offer a suggestion, maybe you could try some tooling such as k6, give the data plane node very little resources, and force it to scale out in the cluster to see if these issues happen. The problem occurs when adding new data plane nodes to an existing environment.
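
If it helps, a rough way to generate that kind of load without k6 is a plain-Python loop like the one below (a sketch: the proxy URL, request count, and worker count are guesses for a setup like mine, not something I have verified against it):

    # load_sketch.py -- crude load generator to push data plane CPU and trigger scale-out
    import concurrent.futures
    import urllib.request

    PROXY_URL = "http://dataplane.internal.prd:8000/"  # placeholder proxy endpoint
    REQUESTS = 20_000
    WORKERS = 64

    def hit(_):
        # Fire one request against the proxy; count failures as None.
        try:
            with urllib.request.urlopen(PROXY_URL, timeout=5) as resp:
                return resp.status
        except Exception:
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        statuses = list(pool.map(hit, range(REQUESTS)))

    print("completed:", sum(s is not None for s in statuses), "of", REQUESTS)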

One thing that might stabilize my services is changing the health check from the /status route to the /status/ready route, which would prevent data planes without control plane communication from serving traffic. I'm going to deploy this change today and hope that it stabilizes everything.
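
For reference, the readiness check I'm thinking of wiring into the ECS health check boils down to something like this (a sketch: it assumes status_listen is enabled on 127.0.0.1:8100 and that /status/ready only returns 200 once the data plane has loaded a configuration; adjust the address for your container):

    # readiness_check.py -- container health check against the Kong Status API
    import sys
    import urllib.request
    import urllib.error

    URL = "http://127.0.0.1:8100/status/ready"  # assumed status_listen address

    try:
        with urllib.request.urlopen(URL, timeout=3) as resp:
            # 200 means the node is ready to serve traffic; anything else is unhealthy.
            sys.exit(0 if resp.status == 200 else 1)
    except (urllib.error.URLError, OSError):
        # 503 (not ready) or a connection error both surface here.
        sys.exit(1)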

ADD-SP removed the pending author feedback label May 28, 2024
@StarlightIbuki
Contributor

The first two errors look like a network glitch; if they do not persist, it should be fine to ignore them.
Could you check and share the status of the LMDB cache file?
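
If it helps, something like the snippet below can dump the LMDB stats from inside a data plane container (a sketch: it assumes the py-lmdb Python package is available and that the cache lives at the default dbless.lmdb directory under the Kong prefix; both are assumptions about your image):

    # lmdb_stat.py -- inspect the data plane's LMDB cache
    import lmdb

    KONG_PREFIX = "/usr/local/kong"            # assumed Kong prefix inside the container
    DB_PATH = f"{KONG_PREFIX}/dbless.lmdb"     # assumed default lmdb_environment_path

    env = lmdb.open(DB_PATH, readonly=True, lock=False)
    print("env info:", env.info())             # map size, last txn id, reader slots
    print("env stat:", env.stat())             # page size, tree depth, number of entries
    env.close()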

StarlightIbuki added the pending author feedback label Jun 17, 2024
@lays147
Author

lays147 commented Jun 17, 2024

@StarlightIbuki I don't have access to the underlying infrastructure to check that file.

And apparently, changing the health check fixed this issue for me during autoscaling events.

@StarlightIbuki
Contributor

@StarlightIbuki I don't have access to the underlying infrastructure to check that file.

And apparently, changing the health check fixed this issue for me during autoscaling events.

Good to know it works for you.
