Fix nodebalancer config rebuilds to include IDs of preexisting nodebalancer nodes #192

Conversation

@akaokunc (Contributor) commented Mar 14, 2024

Currently nodebalancer configs are rebuilt when nodes change, but that causes the nodebalancer to reload, which can have some impact on performance. For nodes which didn't change this is not necessary, but we are not sending node IDs. This PR bumps linodego to 1.30 (to allow sending these IDs), and before rebuilding the nodebalancer config it makes an API request to obtain the IDs of the current nodes.
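
To illustrate the approach, here is a minimal Go sketch (not the PR's actual code): it lists the nodes currently attached to the config, maps their addresses to IDs, and sets the ID on matching entries before the rebuild call. The helper name rebuildWithExistingIDs is made up, and the ID field on the rebuild node options is assumed to be the one linodego 1.30 exposes.

package linode

import (
	"context"
	"fmt"

	"github.com/linode/linodego"
)

// rebuildWithExistingIDs is a hypothetical helper: before rebuilding a config,
// it looks up the nodes currently attached to it and copies their IDs onto the
// matching entries in the rebuild request, so unchanged backends keep their IDs.
func rebuildWithExistingIDs(ctx context.Context, client *linodego.Client, nbID, configID int, opts linodego.NodeBalancerConfigRebuildOptions) error {
	existing, err := client.ListNodeBalancerNodes(ctx, nbID, configID, nil)
	if err != nil {
		return fmt.Errorf("listing existing nodebalancer nodes: %w", err)
	}

	// Map "ip:port" address to the node's current ID.
	idByAddress := make(map[string]int, len(existing))
	for _, n := range existing {
		idByAddress[n.Address] = n.ID
	}

	// Nodes that already exist get their ID set; new nodes are sent without one.
	for i := range opts.Nodes {
		if id, ok := idByAddress[opts.Nodes[i].Address]; ok {
			opts.Nodes[i].ID = id // assumed: ID on rebuild nodes is what linodego 1.30 adds
		}
	}

	_, err = client.RebuildNodeBalancerConfig(ctx, nbID, configID, opts)
	return err
}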

Here is a recorded POST request after this change. Node 192.168.242.85:31059 was added while the others were kept from the original config, so they have IDs set.

POST  /v4/nodebalancers/xxxx/configs/xxxx/rebuild  HTTP/1.1
HOST   : censored
HEADERS:
	Accept: application/json
	Authorization: Bearer censored
	Content-Type: application/json
	User-Agent: linode-cloud-controller-manager linodego/v1.30.0 https://github.com/linode/linodego
BODY   :
{
   "port": 80,
   "protocol": "tcp",
   "proxy_protocol": "none",
   "check": "connection",
   "check_interval": 5,
   "check_attempts": 2,
   "check_passive": true,
   "check_timeout": 3,
   "nodes": [
      {
         "address": "192.168.191.200:31059",
         "label": "okunc-c5-okunc-c1-3",
         "weight": 100,
         "mode": "accept",
         "id": 96539997
      },
      {
         "address": "192.168.242.85:31059",
         "label": "okunc-c5-okunc-c1-4",
         "weight": 100,
         "mode": "accept"
      },
      {
         "address": "192.168.242.39:31059",
         "label": "okunc-c5-okunc-c1-1",
         "weight": 100,
         "mode": "accept",
         "id": 96540073
      },
      {
         "address": "192.168.242.45:31059",
         "label": "okunc-c5-okunc-c1-2",
         "weight": 100,
         "mode": "accept",
         "id": 96540074
      }
   ]
}

General:

  • Have you removed all sensitive information, including but not limited to access keys and passwords?
  • Have you checked to ensure there aren't other open or closed Pull Requests for the same bug/feature/question?

Pull Request Guidelines:

  1. Does your submission pass tests?
  2. Have you added tests?
  3. Are you addressing a single feature in this PR?
  4. Are your commits atomic, addressing one change per commit?
  5. Are you following the conventions of the language?
  6. Have you saved your large formatting changes for a different PR, so we can focus on your work?
  7. Have you explained your rationale for why this feature is needed?
  8. Have you linked your PR to an open issue?

@akaokunc (Contributor, Author) commented:

To get all the CI steps to pass I had to drop support for 1.20, which is unable to build the project after bumping the k8s deps and tidying the modules.

I'm not really sure what we should do when the node_controller or service_controller is unable to register the informers; for now I'm just logging the error. See the implementation of the change which introduced these errors: kubernetes/client-go@ecdc8bf
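
For context, here is a minimal sketch of the "just log it" approach, assuming typical client-go informer wiring; startNodeWatch and the handler body are illustrative, not the controllers' actual code. With the referenced client-go change, AddEventHandler returns a registration handle and an error rather than nothing.

package linode

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// startNodeWatch is a hypothetical example of registering an event handler
// after kubernetes/client-go@ecdc8bf: AddEventHandler now returns
// (ResourceEventHandlerRegistration, error) instead of nothing.
func startNodeWatch(factory informers.SharedInformerFactory) {
	nodeInformer := factory.Core().V1().Nodes().Informer()

	_, err := nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			node, ok := newObj.(*v1.Node)
			if !ok {
				return
			}
			klog.V(3).Infof("node %s updated", node.Name)
		},
	})
	if err != nil {
		// Registration failed; for now the error is only logged, as described above.
		klog.Errorf("failed to register node event handler: %v", err)
	}
}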

@akaokunc force-pushed the okunc/fix_nodebalancer_config_rebuild_2nd_attempt branch from e5f931c to 50ecce4 on March 26, 2024 15:05
@okokes-akamai (Contributor) left a comment:

A few minor comments here or there.

One more important point from me is that I'd ideally like to see this reflected in (unit) tests: what the behaviour was like before this change and what we're trying to achieve here. Just to avoid regressions.

@rahulait (Contributor) commented:

Can we have the CI tests pass for this PR? Also, I'm not sure if we have tested building the image and running it on a k8s cluster. I am seeing pods crashlooping with this branch, hence checking. It looks like it's complaining due to the change in main.go:

      --vmodule pattern=N,...          comma-separated list of pattern=N settings for file-filtered logging (only works for text log format)

error: unknown flag: --port

@akaokunc (Contributor, Author) commented Apr 3, 2024

> Can we have the CI tests pass for this PR? Also, I'm not sure if we have tested building the image and running it on a k8s cluster. I am seeing pods crashlooping with this branch, hence checking. It looks like it's complaining due to the change in main.go:
>
>       --vmodule pattern=N,...          comma-separated list of pattern=N settings for file-filtered logging (only works for text log format)
>
> error: unknown flag: --port

Is that still the case? I believe it was some glitch during rebasing; now the changes in main.go are only those really needed to bump linodego and, subsequently, the k8s deps.

@akaokunc (Contributor, Author) commented Apr 8, 2024

E2E tests passed.

Ran 22 of 23 Specs in 467.720 seconds
SUCCESS! -- 22 Passed | 0 Failed | 0 Pending | 1 Skipped
PASS

I had to make a manifest change:

<             - --leader-elect-resource-lock=endpoints
>             - --leader-elect-resource-lock=leases

@rahulait (Contributor) commented Apr 9, 2024

Can we also add the following diff to this PR so that helm installs work fine after the change?

diff --git a/deploy/chart/templates/daemonset.yaml b/deploy/chart/templates/daemonset.yaml
index 6a38e6e..6931591 100644
--- a/deploy/chart/templates/daemonset.yaml
+++ b/deploy/chart/templates/daemonset.yaml
@@ -29,9 +29,8 @@ spec:
           imagePullPolicy: {{ .Values.image.pullPolicy }}
           name: ccm-linode
           args:
-            - --leader-elect-resource-lock=endpoints
+            - --leader-elect-resource-lock=leases
             - --v=3
-            - --port=0
             - --secure-port=10253
             {{- if .Values.linodegoDebug }}
             - --linodego-debug={{ .Values.linodegoDebug }}

@rahulait (Contributor) commented Apr 9, 2024

Also, once I switch to leases instead of endpoints, I see the following error on all CCM pods:

E0409 04:33:57.324416       1 leaderelection.go:332] error retrieving resource lock kube-system/cloud-controller-manager: leases.coordination.k8s.io "cloud-controller-manager" is forbidden: User "system:serviceaccount:kube-system:ccm-linode" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0409 04:34:01.398603       1 leaderelection.go:332] error retrieving resource lock kube-system/cloud-controller-manager: leases.coordination.k8s.io "cloud-controller-manager" is forbidden: User "system:serviceaccount:kube-system:ccm-linode" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0409 04:34:05.004898       1 leaderelection.go:332] error retrieving resource lock kube-system/cloud-controller-manager: leases.coordination.k8s.io "cloud-controller-manager" is forbidden: User "system:serviceaccount:kube-system:ccm-linode" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"

One can test whether helm install works by following these steps:

  1. Uninstall existing linode-ccm
  2. Clone the linode-ccm repo with branch checked out
  3. Run helm install to install ccm
export LINODE_API_TOKEN=<linodeapitoken>
export REGION=<linoderegion>
helm install ccm-linode --set apiToken=$LINODE_API_TOKEN,region=$REGION deploy/chart/

@rahulait (Contributor) commented Apr 9, 2024

Not sure why it's not happening in the cluster you are testing. Can we add these changes as well? This seems to fix my issue:

diff --git a/deploy/chart/templates/clusterrole-rbac.yaml b/deploy/chart/templates/clusterrole-rbac.yaml
index 2953213..3d18d68 100644
--- a/deploy/chart/templates/clusterrole-rbac.yaml
+++ b/deploy/chart/templates/clusterrole-rbac.yaml
@@ -6,6 +6,9 @@ rules:
   - apiGroups: [""]
     resources: ["endpoints"]
     verbs: ["get", "watch", "list", "update", "create"]
+  - apiGroups: ["coordination.k8s.io"]
+    resources: ["leases"]
+    verbs: ["get", "watch", "list", "update", "create"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["get", "watch", "list", "update", "delete", "patch"]

@rahulait (Contributor) left a comment:

LGTM

@akaokunc merged commit 4899055 into linode:main on Apr 9, 2024
5 checks passed