
verbose debugging #95

Closed
davem-git opened this issue Oct 12, 2022 · 30 comments

@davem-git

davem-git commented Oct 12, 2022

I've installed KubeMod with a basic setup. I'd like to be able to add taints to nodes as they come online. So far I haven't gotten it to do anything: I don't see any errors or any indication of what's going on at all. I've reverted to a basic, simple match. It's possible that's still incorrect.

I've run the full setup, including the patch with extra objects, though that seems to be the default now.

I've tried this:

apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: add-cilium-taint
  namespace: kubemod-system
spec:
  type: Patch
  match:
    - select: '$.kind'
      matchValue: Node
    ## make this rule idempotent
    - select: '$.spec.taints[*].key'
      matchValue: node.cilium.io/agent-not-ready
      negate: true
  patch:
    - op: add
      path: /spec/taints/-1
      value: |-
        effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"

I've even tried a more basic one:

apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: add-cilium-taint
  namespace: kubemod-system
spec:
  type: Patch
  match:
    - select: '$.kind'
      matchValue: Node
  patch:
    - op: add
      path: /spec/taints/-1
      value: |-
        effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"

I don't get anything useful in the logs:

 kubectl -n kubemod-system logs kubemod-operator-6484b8d847-m56rw
{"level":"info","ts":"2022-10-11 22:35:40.391Z","logger":"webapp-setup","msg":"web app server is starting to listen","addr":":8081"}
{"level":"info","ts":"2022-10-11 22:35:40.695Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8082"}
{"level":"info","ts":"2022-10-11 22:35:40.695Z","logger":"operator-setup","msg":"health server is starting to listen","addr":":8083"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"api.kubemod.io/v1beta1, Kind=ModRule","path":"/mutate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"api.kubemod.io/v1beta1, Kind=ModRule","path":"/validate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"operator-setup","msg":"registering core mutating webhook"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"controller-runtime.webhook","msg":"registering webhook","path":"/dragnet-webhook"}
{"level":"info","ts":"2022-10-11 22:35:40.696Z","logger":"operator-setup","msg":"starting manager"}
{"level":"info","ts":"2022-10-11 22:35:40.697Z","logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":"2022-10-11 22:35:40.697Z","logger":"controller","msg":"Starting EventSource","reconcilerGroup":"api.kubemod.io","reconcilerKind":"ModRule","controller":"modrule","source":"kind source: /, Kind="}
{"level":"info","ts":"2022-10-11 22:35:40.697Z","logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":"2022-10-11 22:35:40.698Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2022-10-11 22:35:40.789Z","logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":"2022-10-11 22:35:40.790Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"info","ts":"2022-10-11 22:35:40.897Z","logger":"controller","msg":"Starting Controller","reconcilerGroup":"api.kubemod.io","reconcilerKind":"ModRule","controller":"modrule"}
{"level":"info","ts":"2022-10-11 22:35:40.897Z","logger":"controller","msg":"Starting workers","reconcilerGroup":"api.kubemod.io","reconcilerKind":"ModRule","controller":"modrule","worker count":1}

Any help would be appreciated.

I'm on AKS. Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"b39bf148cd654599a52e867485c02c4f9d28b312", GitTreeState:"clean", BuildDate:"2022-09-21T21:46:51Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

with the latest install of kubemod

@vassilvk
Member

Hi @davem-git,

At first glance I see no issues with your modrule or the way you are using it.

Can you remove the idempotency check select and try again?

I wonder if:

  1. There is an issue with detecting nodes being brought online...or
  2. The idempotency check is failing the intercept

@vassilvk
Member

Also, I don't see a difference between your two modrules.

@davem-git
Author

davem-git commented Oct 12, 2022

Oops, I removed the bottom part of the selection; I just didn't copy and paste it properly. I've corrected the sample above.

@vassilvk
Member

Ok, so KubeMod does not detect the Node-related event.
Can you please elaborate on what you mean by "come online"?
Are you rebooting the nodes? Or draining/removing/adding them back to the cluster?

@davem-git
Author

I'm causing the nodes to scale out and in. I've even deleted the user node pool and added a brand new one back. I haven't tried rebooting. Ideally this will pick up nodes as they come online through autoscaling events.
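
For context, the scale out/in I'm doing is roughly equivalent to this via the Azure CLI (resource group and cluster names are placeholders):

az aks nodepool scale --resource-group <rg> --cluster-name <cluster> --name pool1az0 --node-count 5
az aks nodepool delete --resource-group <rg> --cluster-name <cluster> --name pool1az0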

@vassilvk
Member

Got it, thanks.
I don't have access to an AKS cluster at this time, but I will play around with node modrules through node updates when I have some time.

@davem-git
Author

davem-git commented Oct 12, 2022

Thank you! If there's any way I can help debug this, or any flags I can enable for more verbose logging, please let me know.

@vassilvk
Member

vassilvk commented Oct 12, 2022

Logging every CREATE and UPDATE event that passes through KubeMod wouldn't be reasonable, as KubeMod's interception is quite promiscuous and there may be hundreds of events per second while large application stacks are being stood up, torn down, or scaled up/down.

It may make sense to add flags to log events for specific resources (for example nodes).

Let me see what I can find in my node modrule tests.
If nothing comes out, I'll ship a new version with event logging for specific resources.

@davem-git
Author

Great! Sounds like a good plan. Thanks again!

@vassilvk
Member

vassilvk commented Oct 14, 2022

Hi @davem-git,

I have an update. I tried your ModRule (first version above) on my Docker for Windows deployment of Kubernetes.

Then I triggered an UPDATE event on the docker-desktop node by setting a random label, for example:

kubectl.exe label nodes docker-desktop color=blue

This triggered your ModRule and applied the taint on the node as expected, issuing the following log line in kubemod-operator:

{"level":"info","ts":"2022-10-14 19:19:01.689Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"542a5404-92e8-4833-a2d4-08a536d3e3ce","namespace":"","resource":"nodes/docker-desktop","operation":"UPDATE","patch":[{"op":"add","path":"/spec/taints","value":[{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}]}]}

Can you try the same on your end?
Keep the ModRule in the system, and trigger a node UPDATE event (by assigning a label) against one of your currently running taint-free nodes.

See if that triggers the ModRule patch.
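
Concretely, something like this (the node name is a placeholder; pick one of your currently taint-free nodes):

kubectl label nodes <your-node-name> color=blue
kubectl get node <your-node-name> -o jsonpath='{.spec.taints}'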

This is obviously not your use case - I am only trying to test if node-related events are being intercepted by KubeMod in your AKS cluster.

@davem-git
Author

I'll give it a try. Thanks!

@davem-git
Author

Still nothing. I uploaded my YAML as a .MD, since the yaml format isn't supported as an attachment, in case I have something wrong, but I don't think I do, as there wasn't much to customize for my environment.
kubemod.MD

Labels:             agentpool=pool1az0
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_E4ds_v5
                    beta.kubernetes.io/os=linux
                    color=blue
❯ kubectl -n kubemod-system logs kubemod-operator-d595864b8-dz25h
{"level":"info","ts":"2022-10-14 20:30:56.834Z","logger":"webapp-setup","msg":"web app server is starting to listen","addr":":8081"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8082"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"operator-setup","msg":"health server is starting to listen","addr":":8083"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"api.kubemod.io/v1beta1, Kind=ModRule","path":"/mutate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"api.kubemod.io/v1beta1, Kind=ModRule","path":"/validate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-api-kubemod-io-v1beta1-modrule"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"operator-setup","msg":"registering core mutating webhook"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.webhook","msg":"registering webhook","path":"/dragnet-webhook"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"operator-setup","msg":"starting manager"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller","msg":"Starting EventSource","reconcilerGroup":"api.kubemod.io","reconcilerKind":"ModRule","controller":"modrule","source":"kind source: /, Kind="}
{"level":"info","ts":"2022-10-14 20:30:57.149Z","logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":"2022-10-14 20:30:57.150Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2022-10-14 20:30:57.150Z","logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":"2022-10-14 20:30:57.150Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"info","ts":"2022-10-14 20:30:57.250Z","logger":"controller","msg":"Starting Controller","reconcilerGroup":"api.kubemod.io","reconcilerKind":"ModRule","controller":"modrule"}
{"level":"info","ts":"2022-10-14 20:30:57.250Z","logger":"controller","msg":"Starting workers","reconcilerGroup":"api.kubemod.io","reconcilerKind":"ModRule","controller":"modrule","worker count":1}

@vassilvk
Member

vassilvk commented Oct 14, 2022

I wonder if there is some timing issue involved.

Please make sure that your test steps are in this order:

  1. Install KubeMod (I assume you've already done that)
  2. Deploy your modrule using kubectl
  3. Make sure the target node doesn't have the taint created by the modrule
  4. Make the change to the node label (which should trigger an UPDATE event, which should be intercepted by KubeMod)
  5. Observe the kubemod-operator logs

If this is the order in which you've performed your test, and you still don't see any activity in the log, then I think there might be some configuration in your environment that prevents KubeMod from receiving node-related admission webhook requests.

Can you test if KubeMod works against other resources in your cluster?

For example, you can test by deploying the following sample modrule to any namespace...

kubectl apply -f https://raw.githubusercontent.com/kubemod/kubemod/master/samples/modrules/modrule-1.yaml

... and then deploy the following sample NGINX deployment in the same namespace:

kubectl apply -f https://raw.githubusercontent.com/kubemod/kubemod/master/samples/stack/nginx-deployment.yaml

The above should produce a KubeMod log item like this:

{"level":"info","ts":"2022-10-14 21:45:08.596Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"f384b4a7-9a3e-4e2e-a6d3-2827c6fce7c4","namespace":"default","resource":"deployments/nginx","operation":"CREATE","patch":[{"op":"replace","path":"/metadata/annotations/kubectl.kubernetes.io~1last-applied-configuration","value":"{\"apiVersion\":\"apps/v1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"nginx\",\"color\":\"whatever\"},\"name\":\"nginx\",\"namespace\":\"default\"},\"spec\":{\"replicas\":1,\"selector\":{\"matchLabels\":{\"app\":\"nginx\"}},\"template\":{\"metadata\":{\"labels\":{\"app\":\"nginx\"}},\"spec\":{\"containers\":[{\"image\":\"bitnami/nginx:1.14.2\",\"name\":\"nginx\",\"ports\":[{\"containerPort\":8080,\"protocol\":\"TCP\"}],\"resources\":{\"limits\":{\"cpu\":\"500m\",\"memory\":\"1Gi\"}}},{\"command\":[\"sh\",\"-c\",\"while true; do sleep 5; done;\"],\"image\":\"alpine:3\",\"name\":\"injected\"}]}}}}"},{"op":"add","path":"/metadata/labels/color","value":"whatever"},{"op":"add","path":"/spec/template/spec/containers/1","value":{"command":["sh","-c","while true; do sleep 5; done;"],"image":"alpine:3","name":"injected"}},{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"bitnami/nginx:1.14.2"},{"op":"replace","path":"/spec/template/spec/containers/0/ports/0/containerPort","value":8080}]}

You can then clean up by deleting the modrule and deployment:

kubectl delete -f https://raw.githubusercontent.com/kubemod/kubemod/master/samples/stack/nginx-deployment.yaml
kubectl delete -f https://raw.githubusercontent.com/kubemod/kubemod/master/samples/modrules/modrule-1.yaml

Please let me know if the above produces logs.
This is the simplest smoke test - if this does not generate activity, we need to look further into what's causing this.
If it does generate activity, then we know there's something specific to node events.

@davem-git
Author

No luck. That was the order of operations I was using. I didn't have any luck with the nginx rule test either.
I noticed the sample modrule doesn't include a namespace, but I am adding one on mine. Do I not need to? Could that cause any issues?

I'm going to remove KubeMod and try again, to see if I missed something in the setup.

@davem-git
Author

After doing that, the modrule kicked off for the nginx test! I'll go back and try the node rules again.

{"level":"info","ts":"2022-10-17 15:04:04.010Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"9e7e0c3b-90bd-468b-b474-0683620f05d7","namespace":"default","resource":"deployments/nginx","operation":"CREATE","patch":[{"op":"replace","path":"/metadata/annotations/kubectl.kubernetes.io~1last-applied-configuration","value":"{\"apiVersion\":\"apps/v1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"nginx\",\"color\":\"whatever\"},\"name\":\"nginx\",\"namespace\":\"default\"},\"spec\":{\"replicas\":1,\"selector\":{\"matchLabels\":{\"app\":\"nginx\"}},\"template\":{\"metadata\":{\"labels\":{\"app\":\"nginx\"}},\"spec\":{\"containers\":[{\"image\":\"bitnami/nginx:1.14.2\",\"name\":\"nginx\",\"ports\":[{\"containerPort\":8080,\"protocol\":\"TCP\"}],\"resources\":{\"limits\":{\"cpu\":\"500m\",\"memory\":\"1Gi\"}}},{\"command\":[\"sh\",\"-c\",\"while true; do sleep 5; done;\"],\"image\":\"alpine:3\",\"name\":\"injected\"}]}}}}"},{"op":"add","path":"/metadata/labels/color","value":"whatever"},{"op":"add","path":"/spec/template/spec/containers/1","value":{"command":["sh","-c","while true; do sleep 5; done;"],"image":"alpine:3","name":"injected"}},{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"bitnami/nginx:1.14.2"},{"op":"replace","path":"/spec/template/spec/containers/0/ports/0/containerPort","value":8080}]}

@davem-git
Author

No luck. It doesn't seem to be running any node rules; at least the label I added hasn't triggered the rule to run.

@davem-git
Author

I'm not sure what I did, but it seems to be working?

{"level":"info","ts":"2022-10-17 15:31:13.070Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"8deefcc6-6235-40dc-9720-e7313e0fe876","namespace":"","resource":"nodes/aks-pool0-38869568-vmss000000","operation":"UPDATE","patch":[{"op":"add","path":"/spec/taints","value":[{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}]}]}
{"level":"info","ts":"2022-10-17 15:31:13.108Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"3c0bbb55-be4e-4152-993d-efb72d865d56","namespace":"","resource":"nodes/aks-pool1az0-57119049-vmss000003","operation":"UPDATE","patch":[{"op":"add","path":"/spec/taints","value":[{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}]}]}
{"level":"info","ts":"2022-10-17 15:31:13.142Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"a33e82fb-65ba-4312-8ef4-bd6c31c10f89","namespace":"","resource":"nodes/aks-pool1az0-57119049-vmss000004","operation":"UPDATE","patch":[{"op":"add","path":"/spec/taints","value":[{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}]}]}
{"level":"info","ts":"2022-10-17 15:31:13.174Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"f43b0106-6fa9-4091-aa43-1ef2b3a3edde","namespace":"","resource":"nodes/aks-pool1az0-57119049-vmss000005","operation":"UPDATE","patch":[{"op":"add","path":"/spec/taints","value":[{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}]}]

@vassilvk
Member

Hmm, it does feel (again) like timing has something to do with it.
Several points:

  • When you create a new ModRule, it does take some time (usually less than a second) for KubeMod to acquire the ModRule. If you create both the ModRule and the resources it targets in very quick succession (for example, from a script), this latency may lead to KubeMod missing the resources. I know this is probably not related to your experiments, but I thought it important to note.
  • When you create Kubernetes resources using kubectl, if you don't specify the namespace in the manifest, the resource is created in the default namespace, or in the namespace specified by the -n switch.
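
For example (modrule.yaml is just a placeholder file name):

kubectl apply -f modrule.yaml                     # created in the current context's default namespace
kubectl apply -f modrule.yaml -n kubemod-system   # created in kubemod-system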

I wonder if we can reset and retest by following these steps:

  1. Uninstall KubeMod:
kubectl delete -f https://raw.githubusercontent.com/kubemod/kubemod/v0.15.3/bundle.yaml
  2. Install KubeMod again:
kubectl delete job kubemod-crt-job -n kubemod-system
kubectl label namespace kube-system admission.kubemod.io/ignore=true --overwrite
kubectl apply -f https://raw.githubusercontent.com/kubemod/kubemod/v0.15.3/bundle.yaml
  3. Create your node ModRule in namespace kubemod-system
  4. Wait a minute (just in case)
  5. Test by changing a node's label

If this works, then test your ModRule by scaling your node pool up and down.

@davem-git
Author

Seems like things are mostly working. Everything besides getting the rule to apply only once: it seems to be running over and over again. I'll play around with the

    - select: '$.spec.taints[*].key'
      matchValue: node.cilium.io/agent-not-ready
      negate: true

@davem-git
Author

I think I found this to work better. Still running some tests:

apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: add-cilium-taint
  namespace: kubemod-system
spec:
  type: Patch
  match:
    - select: '$.kind'
      matchValue: Node
    - select: '$.metadata.labels["kube-moderule-applied"]'
      negate: true
  patch:
    - op: add
      path: /metadata/labels/kube-moderule-applied
    - op: add
      path: /spec/taints/-1
      value: |-
        effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"
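
To check whether it applies only once, I'm verifying the marker label and the taint on the nodes with something like:

kubectl get nodes -L kube-moderule-applied
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'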

@vassilvk
Member

@davem-git any update?

@davem-git
Author

That rule seems to have fixed it. I think this is safe to close. I'm going to delete my cluster and start again to ensure my results are repeatable. I was pulled away onto something else that I should be finishing today; I had hoped to test this again by now.

@vassilvk
Member

Great - thanks!

@davem-git
Author

Hmm, weird. The re-install didn't work again on a new cluster. Seems inconsistent; trying to figure out what's different.

@davem-git
Author

It seems to be an order-of-operations issue. My company uses Kustomize, so I've broken the deployment down to fit our standard. Kustomize seems to try to deploy the modrule before ModRule is a registered object type on the cluster. It errors, in which case I normally run it again and it applies cleanly.
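
For context, the base boils down to roughly this kustomization (a sketch; bundle.yaml is a local copy of the upstream v0.15.3 bundle, and kb is our alias for kustomize build):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - bundle.yaml            # local copy of https://raw.githubusercontent.com/kubemod/kubemod/v0.15.3/bundle.yaml
  - add-cilium-taint.yaml  # the ModRule above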

kb workload/kube-mod/base  | kubectl apply -f -
namespace/kubemod-system created
customresourcedefinition.apiextensions.k8s.io/modrules.api.kubemod.io created
role.rbac.authorization.k8s.io/kubemod-crt created
clusterrole.rbac.authorization.k8s.io/kubemod-crt created
clusterrole.rbac.authorization.k8s.io/kubemod-manager created
rolebinding.rbac.authorization.k8s.io/kubemod-crt created
clusterrolebinding.rbac.authorization.k8s.io/kubemod-crt created
clusterrolebinding.rbac.authorization.k8s.io/kubemod-manager created
service/kubemod-webapp-service created
service/kubemod-webhook-service created
deployment.apps/kubemod-operator created
cronjob.batch/kubemod-crt-cron-job created
job.batch/kubemod-crt-job created
mutatingwebhookconfiguration.admissionregistration.k8s.io/kubemod-mutating-webhook-configuration created
validatingwebhookconfiguration.admissionregistration.k8s.io/kubemod-validating-webhook-configuration created
error: unable to recognize "STDIN": no matches for kind "ModRule" in version "api.kubemod.io/v1beta1"
❯ kb workload/kube-mod/base  | kubectl apply -f -
namespace/kubemod-system unchanged
customresourcedefinition.apiextensions.k8s.io/modrules.api.kubemod.io configured
role.rbac.authorization.k8s.io/kubemod-crt unchanged
clusterrole.rbac.authorization.k8s.io/kubemod-crt unchanged
clusterrole.rbac.authorization.k8s.io/kubemod-manager configured
rolebinding.rbac.authorization.k8s.io/kubemod-crt unchanged
clusterrolebinding.rbac.authorization.k8s.io/kubemod-crt unchanged
clusterrolebinding.rbac.authorization.k8s.io/kubemod-manager unchanged
service/kubemod-webapp-service unchanged
service/kubemod-webhook-service unchanged
deployment.apps/kubemod-operator unchanged
cronjob.batch/kubemod-crt-cron-job unchanged
modrule.api.kubemod.io/add-cilium-taint created
job.batch/kubemod-crt-job unchanged
mutatingwebhookconfiguration.admissionregistration.k8s.io/kubemod-mutating-webhook-configuration configured
validatingwebhookconfiguration.admissionregistration.k8s.io/kubemod-validating-webhook-configuration configured

Some order-of-operations issue with this error causes KubeMod to not get any events, at least from nodes; my testing suggests not from anything at this point.

If I remove KubeMod and re-install it without the rule, then add the rule afterwards, it works fine. Not sure what's going on. I'm working on a way to ensure the rule always runs last, to see if that fixes it.

@vassilvk
Member

vassilvk commented Oct 27, 2022

Ah, this makes sense.

So what's happening is that you are installing KubeMod and an instance of a ModRule (add-cilium-taint) in quick succession.

Part of the KubeMod installation registers ModRule as a CRD with the K8S API server (third line from the top of your log).
I suspect that this is an asynchronous operation for Kubernetes, and the ModRule CRD registration has not yet propagated through the K8S API server by the time you hit it with the add-cilium-taint ModRule.

Please see this as a description of your issue: kubernetes/kubectl#1117

It makes sense to split your installation into three parts:

  1. Install KubeMod
  2. Wait for the ModRule CRD to be registered: kubectl wait --for condition=established crd modrules.api.kubemod.io
  3. Install your ModRules

And to be 100% safe, you can wait for the kubemod operator to be up and running before you apply your modrule.
To do that, you can add the following wait before point 3 above:

kubemod_pod=$(kubectl get pod -l app.kubernetes.io/component=kubemod-operator -n kubemod-system -o jsonpath="{.items[0].metadata.name}")
kubectl wait --for=condition=ready pod/$kubemod_pod -n kubemod-system --timeout=360s
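
Putting it all together, a sequenced install script would look roughly like this (the ModRule file name is a placeholder):

#!/usr/bin/env bash
set -euo pipefail

# 1. Install KubeMod
kubectl apply -f https://raw.githubusercontent.com/kubemod/kubemod/v0.15.3/bundle.yaml

# 2. Wait for the ModRule CRD to be registered and the operator pod to be ready
kubectl wait --for condition=established crd modrules.api.kubemod.io
kubemod_pod=$(kubectl get pod -l app.kubernetes.io/component=kubemod-operator -n kubemod-system -o jsonpath="{.items[0].metadata.name}")
kubectl wait --for=condition=ready pod/$kubemod_pod -n kubemod-system --timeout=360s

# 3. Install your ModRules
kubectl apply -f add-cilium-taint.yaml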

@davem-git
Author

Apparently that's easier said than done with the tools I have at my disposal. Ideally this would be something we can deploy with ArgoCD, as the rest of our applications use it, but the method above really doesn't fit that format; it doesn't seem to have a way to wait. Normally this isn't a problem, as ArgoCD will try the deploy again and apply whatever is missing.

That doesn't work in this case, as it seems to permanently break KubeMod with no logs, which requires manual removal and redeployment without the rule.

Since this will be used as part of provisioning, I think I can remove the Argo aspect and find some way of manually installing it when new clusters are provisioned. It would be nice to know why this order-of-operations issue can't be fixed without a reinstall.

@vassilvk
Member

I think it is clear that you cannot deploy an instance of a CRD before the CRD has been registered with the cluster.
Since the creation of a CRD is async, the solution (waiting) belongs to the client deploying the K8S resources.

ArgoCD provides the following solution: argoproj/argo-cd#1999
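
If that ends up being the sync-wave mechanism, a rough sketch would be annotating the ModRule so it syncs in a later wave than the KubeMod CRD and operator (the wave number is arbitrary):

apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: add-cilium-taint
  namespace: kubemod-system
  annotations:
    # Apply this resource after the wave that installs the KubeMod bundle.
    argocd.argoproj.io/sync-wave: "5"
spec:
  # ... same spec as the rule above ...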

@davem-git
Author

Oh, good to know about ArgoCD. Thanks for all your help. This can be closed out!

@vassilvk
Member

Thanks! Closing :)
