Document supported k8s versions #1257

Closed
Agalin opened this issue Jul 14, 2023 · 10 comments · Fixed by #1277
Labels
kind/bug, kind/feature

Comments

Agalin commented Jul 14, 2023

What would you like to be added:

Document the currently compatible versions of k8s in the readme or docs, along with the last version of NFD that works on a particular k8s version.

Why is this needed:

I'm currently working on deploying NFD in our environment. We have multiple clusters running k8s 1.23 and 1.27. To my surprise, a Helm deployment that works just fine on the 1.27 cluster fails on the 1.23 clusters with the master pod crashlooping. It seems that the recent Kubernetes API library version bump is the problem: master works fine on 0.13.1.

Ensuring compatibility with old k8s versions (I'm aware that 1.23 is no longer supported) makes no sense, but a compatibility matrix would be a welcome addition and would have saved me a few hours of debugging. 😄
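
Even a small table in the README along these lines would do; the rows below are placeholders for illustration, not a claim about which versions actually work together:

| NFD version | Compatible k8s versions |
| ----------- | ----------------------- |
| vX.Y        | 1.N - 1.M               |
| vX.(Y-1)    | 1.(N-1) - 1.(M-1)       |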

Agalin added the kind/feature label Jul 14, 2023
marquiz (Contributor) commented Jul 17, 2023

Hi @Agalin. NFD should work on practically any Kubernetes version that is still in use, as we only use stable APIs. Bumping the patch version of the k8s deps really shouldn't break anything.

Could you share more details about your deployment and the failure?

Agalin (Author) commented Jul 17, 2023

It seems to be some kind of incompatibility between the latest version of the Kubernetes API library and k8s 1.23 (1.23.10 to be precise).

Master deployed from 0.13.2 crashloops with the following error in the logs:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1724548]

goroutine 14 [running]:
sigs.k8s.io/node-feature-discovery/pkg/nfd-master.(*nfdMaster).nfdAPIUpdateOneNode(0xc000482a00, {0xc000158f60, 0xc})
        /go/node-feature-discovery/pkg/nfd-master/nfd-master.go:637 +0x108
sigs.k8s.io/node-feature-discovery/pkg/nfd-master.(*nfdMaster).nfdAPIUpdateAllNodes(0xc000482a00)
        /go/node-feature-discovery/pkg/nfd-master/nfd-master.go:627 +0x170
sigs.k8s.io/node-feature-discovery/pkg/nfd-master.(*nfdMaster).nfdAPIUpdateHandler(0xc000482a00)
        /go/node-feature-discovery/pkg/nfd-master/nfd-master.go:338 +0x1cc
created by sigs.k8s.io/node-feature-discovery/pkg/nfd-master.(*nfdMaster).Run
        /go/node-feature-discovery/pkg/nfd-master/nfd-master.go:228 +0x6f8

Same config works just fine on k8s 1.27.

Looking at the logs, it breaks somewhere between line 689 and line 710. It reaches the single-node update function but prints neither the NodeFeature retrieval error nor the message about node processing being initiated by the NodeFeature API.
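
For reference, here's a minimal Go sketch (purely illustrative, not the actual NFD code; all names are made up) of the failure mode a trace like this usually points to: an API or lister lookup silently returns a nil pointer, which is then dereferenced. A guard like the one below avoids the crash:

package main

import "fmt"

// NodeFeature is a stand-in for the CRD object, not the real NFD type.
type NodeFeature struct {
	Labels map[string]string
}

// getNodeFeature mimics an API/lister lookup that can return a nil pointer
// without an error, e.g. when the object is missing on the cluster.
func getNodeFeature(node string) *NodeFeature {
	return nil // hypothetical "not found" result
}

func updateOneNode(node string) {
	nf := getNodeFeature(node)
	// Without this guard, nf.Labels below panics with
	// "invalid memory address or nil pointer dereference".
	if nf == nil {
		fmt.Printf("no NodeFeature object found for node %q\n", node)
		return
	}
	fmt.Println(nf.Labels)
}

func main() {
	updateOneNode("worker-1")
}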

marquiz (Contributor) commented Jul 17, 2023

Hmm, strange, it segfaults in nfd-master on line 637 🤔 What exact instructions do you use for deploying NFD? Do you have the NodeFeature API enabled?
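
If you want to double-check on the 1.23 cluster, something along these lines should show whether the NodeFeature CRD and any NodeFeature objects are present (take the exact resource name as approximate):

# check whether the NodeFeature CRD is installed
kubectl get crds | grep -i nodefeature

# list NodeFeature objects in all namespaces
kubectl get nodefeatures.nfd.k8s-sigs.io -A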

marquiz (Contributor) commented Jul 17, 2023

/kind bug

k8s-ci-robot added the kind/bug label Jul 17, 2023
Agalin (Author) commented Jul 17, 2023

I'm using the Helm chart as a subchart deployed through Skaffold (it's the only subchart; it's done like that so we can add our CI and possibly additional subcharts in the future), with the following values applied:

node-feature-discovery:
  nameOverride: nfd
  master:
    replicaCount: 3
    podSecurityContext: &podSecurityContext
      runAsUser: 10000
      runAsGroup: 10000
      fsGroup: 10000
    securityContext: &securityContext
      runAsUser: 10000
      runAsGroup: 10000
    resources: &resources
      requests:
        memory: 256Mi
        cpu: 1
      limits:
        memory: 256Mi
        cpu: 1
    tolerations:
      - key: "node-role.kubernetes.io/etcd"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
      - key: "node-role.kubernetes.io/controlplane"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

  worker:
    podSecurityContext: *podSecurityContext
    securityContext: *securityContext
    resources: *resources
    config:
      core:
        featureSources:
          - cpu
      sources:
        cpu:
          cpuid:
            attributeBlacklist: []
            attributeWhitelist: []

and a nodeSelector defined in a cluster-specific file.
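
For completeness, the cluster-specific file is just a small values overlay along these lines (the selector below is only an example, not our real one):

node-feature-discovery:
  master:
    nodeSelector:
      node-role.kubernetes.io/control-plane: ""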

Agalin (Author) commented Jul 17, 2023

I'm not sure if it's worth debugging; k8s 1.23 is deprecated and even 1.24 will reach EOL next week.

marquiz (Contributor) commented Jul 17, 2023

There is some bug in the codebase that v1.23 reveals. I'd like to understand what it is.

marquiz (Contributor) commented Jul 17, 2023

I might've found it: #1259 should fix the problem (or #1258 once backported).

marquiz (Contributor) commented Jul 21, 2023

@Agalin the issue you were facing should now be fixed in the just-released v0.13.3.

But, back to your original request about documenting supported k8s versions: I created #1277.
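
For a subchart setup like yours, picking up the fix should just be a matter of bumping the dependency version, roughly along these lines (double-check the repository URL against the NFD docs):

# Chart.yaml of the wrapper chart (illustrative)
dependencies:
  - name: node-feature-discovery
    version: 0.13.3
    repository: https://kubernetes-sigs.github.io/node-feature-discovery/charts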

Agalin (Author) commented Jul 21, 2023

Thanks! Hopefully I'll be able to test it next week.
