Unexpected Kubernetes node feature-labeling as sgx-capable when SGX feature is turned off in BIOS and host has no /dev/sgx_* devices #638

Closed
MustDie95 opened this issue Nov 2, 2021 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@MustDie95

What happened:

The Kubernetes node is labeled feature.node.kubernetes.io/custom-intel.sgx: 'true' even though SGX support was explicitly turned off in the BIOS.

What you expected to happen:

The Kubernetes node is not labeled as SGX-capable, or the label is removed, when SGX is turned off in the BIOS and there are no /dev/sgx_* devices on the host (even if the feature is supported by the CPU and the OS).

How to reproduce it (as minimally and precisely as possible):

Install the server, set up Kubernetes, and deploy Node Feature Discovery with 'kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.22.0'

All nodes are labeled feature.node.kubernetes.io/custom-intel.sgx: 'true' regardless of the BIOS setting.

As a result, the Intel sgxdeviceplugin-sample pod keeps trying to start on this node (because its nodeSelector is based on this label) and keeps failing.

Anything else we need to know?:

nfd-worker.conf:

sources:
  custom:
    - name: "intel.sgx"
      matchOn:
        - kConfig: ["X86_SGX"]
          cpuId: ["SGX", "SGXLC"]

I think we should add more rules to make sure Intel SGX is really functional on the host. For example, examine the presence of /dev/sgx_* devices
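
As a rough illustration of the device check suggested above, here is a minimal Python sketch that looks for /dev/sgx_* nodes. The function name `sgx_device_nodes` and its parameters are hypothetical and not part of NFD or the SGX device plugin; it only shows what such a check could look like inside a hook or sidecar.

```python
import glob
import os
import stat


def sgx_device_nodes(dev_dir="/dev", require_char_device=True):
    """Return sorted paths under dev_dir whose names match sgx_*.

    With require_char_device=True, only character devices are kept,
    so a stray regular file named sgx_something is not counted.
    """
    nodes = []
    for path in sorted(glob.glob(os.path.join(dev_dir, "sgx_*"))):
        if require_char_device:
            try:
                mode = os.stat(path).st_mode
            except OSError:
                # The node vanished or is not accessible; skip it.
                continue
            if not stat.S_ISCHR(mode):
                continue
        nodes.append(path)
    return nodes


if __name__ == "__main__":
    found = sgx_device_nodes()
    if found:
        print("SGX device nodes present:", ", ".join(found))
    else:
        print("No /dev/sgx_* device nodes found")
```

An empty result on a host whose CPU advertises SGX would be a hint that the feature is not actually usable there, which is the situation described in this issue.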

Environment:

  • Kubernetes version: 1.22.2
  • Cloud provider or hardware configuration: Intel M50CYP1UR212 server (2 x Intel Xeon Platinum 8352S @ 2.2GHz, 512GB RAM)
  • OS: Ubuntu Server 20.04.3 LTS
  • Kernel: 5.11.0-38-generic
@MustDie95 MustDie95 added the kind/bug Categorizes issue or PR as related to a bug. label Nov 2, 2021
@MustDie95 MustDie95 changed the title from "Unexpected Kubernetes node feature-labeling as sgx-capable when SGX feature is turned off in BIOS and host has not /dev/sgx_* devices" to "Unexpected Kubernetes node feature-labeling as sgx-capable when SGX feature is turned off in BIOS and host has no /dev/sgx_* devices" Nov 2, 2021
@marquiz
Contributor

marquiz commented Nov 2, 2021

@mythi any thoughts?

In principle, this is not a bug in NFD itself, but in the rule configuration. Where is that config maintained? However, there might be something worth adding to NFD to make proper detection of SGX easier (that is, possible without custom hooks or sidecar containers).

@MustDie95
Author

@marquiz, I think the rule mentioned above in nfd-worker.conf was added by the "Intel Software Guard Extensions (SGX) device plugin for Kubernetes" (https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/sgx_nfd/nfd-worker.conf), so you are right: NFD itself is not responsible for this behavior.

@MustDie95
Author

It looks like I should have addressed this issue to the SGX device plugin maintainers rather than NFD.

@marquiz
Contributor

marquiz commented Nov 2, 2021

It looks like I should have addressed this issue to the SGX device plugin maintainers rather than NFD.

Yeah, I suggest submitting an issue there, too. But we can keep this one open until a solution is found.

@MustDie95
Author

https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/docs/get-started/features.md#custom

By the way, I can't find a way to detect the presence of /dev/sgx_* devices with the attributes/methods described in this doc.
So plugin developers have no way to detect whether SGX is actually functional in a case like mine.

@marquiz
Contributor

marquiz commented Nov 2, 2021

By the way, I can't find a way to detect the presence of /dev/sgx_* devices with the attributes/methods described in this doc.

That's correct. It's not possible without using hooks or side-car containers (doing the dev node detection).

So plugin developers have no way to detect whether SGX is actually functional in a case like mine.

I wonder if there is some other reliable way of detecting this apart from looking at the devices directly. @mythi ??

@mythi
Contributor

mythi commented Nov 2, 2021

I wonder if there is some other reliable way of detecting this apart from looking at the devices directly

You can enumerate cpuid leaf 12h for SGX EPC sections; a non-zero size means the BIOS has set aside memory for SGX. klauspost/cpuid supports this, so it could be added to NFD pretty easily.

Our sgx_epchook NFD hook does that, but it runs only on nodes that already carry the intel-sgx: true label, so a "0 EPC sections" result is not taken into account when the label is created.
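
A minimal sketch of the EPC check described above, assuming the register layout that the Intel SDM documents for CPUID leaf 12h, sub-leaves 2 and up: the low four bits of EAX give the sub-leaf type (1 for a valid EPC section, 0 for invalid), and ECX/EDX carry the section size. The function names are hypothetical, and the register values have to be obtained elsewhere (for example through a library such as klauspost/cpuid), since Python cannot issue the cpuid instruction directly.

```python
def decode_epc_subleaf(eax, ebx, ecx, edx):
    """Decode one CPUID.(EAX=12H, ECX>=2) sub-leaf.

    Returns (base_address, size) for a valid EPC section, or None when
    the sub-leaf type in EAX[3:0] is not 1.
    """
    if eax & 0xF != 1:
        return None
    # Bits 31:12 come from the low register, bits 51:32 from the high one.
    base = (eax & 0xFFFFF000) | ((ebx & 0x000FFFFF) << 32)
    size = (ecx & 0xFFFFF000) | ((edx & 0x000FFFFF) << 32)
    return base, size


def total_epc_size(subleaves):
    """Sum EPC section sizes, stopping at the first invalid sub-leaf.

    `subleaves` is an iterable of (eax, ebx, ecx, edx) tuples for
    sub-leaf 2, 3, ... in order.
    """
    total = 0
    for regs in subleaves:
        section = decode_epc_subleaf(*regs)
        if section is None:
            break
        total += section[1]
    return total


if __name__ == "__main__":
    # Made-up register values for illustration only.
    example = [(0x70200001, 0x0, 0x05D80001, 0x0), (0, 0, 0, 0)]
    size = total_epc_size(example)
    print("EPC bytes reserved:", size)
    print("SGX usable:", size > 0)
```

A total of zero would correspond to the "0 EPC sections" case, where the CPU advertises SGX but the BIOS has reserved no memory for it.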

@marquiz
Contributor

marquiz commented Nov 3, 2021

Hmm, sounds like we'd need to add more capabilities to the cpu source 🧐

@mythi
Contributor

mythi commented Nov 3, 2021

cpu source is probably easy but any suggestions how to make it available to custom?

@marquiz
Contributor

marquiz commented Nov 3, 2021

cpu source is probably easy but any suggestions how to make it available to custom

#468 (of which #464 should be enough)

@Walnux

Walnux commented Nov 4, 2021

I found some interesting behaviour on the NFD side after running some tests.

I have a server with working SGX support (the BIOS is set up properly and the kernel driver works). After I start NFD the same way as described by @MustDie95, everything works fine.
Check the label:
$ kubectl get nodes -o json | jq '.items[].metadata.labels' | grep sgx
"feature.node.kubernetes.io/custom-intel.sgx": "true",
Check the CPU flags:
$ cat /proc/cpuinfo | grep -i SGX
flags : fpu ... sgx bmi1 ... sgx_lc ... arch_capabilities
This is as expected, since my kernel has built-in SGX driver support and the CPU reports the "sgx" and "sgx_lc" flags.
That matches the config:
  - name: "intel.sgx"
    matchOn:
      - kConfig: ["X86_SGX"]
        cpuId: ["SGX", "SGXLC"]

Then I disabled SGX in the BIOS and rebooted the system.
The boot messages show: SGX is disabled by BIOS
Start NFD and check the label:
$ kubectl get nodes -o json | jq '.items[].metadata.labels' | grep sgx
The label is still true:
"feature.node.kubernetes.io/custom-intel.sgx": "true",
Check the CPU flags:
$ cat /proc/cpuinfo | grep -i SGX
The sgx and sgx_lc flags are gone.

This looks like an issue to me.
I had assumed that NFD would update the label "feature.node.kubernetes.io/custom-intel.sgx" when the CPU flags change, but it doesn't. Please correct me if I have misunderstood cpuId: ["SGX", "SGXLC"].

Lastly, I did the same thing on another server (which runs the same host OS as the server with SGX support). SGX has never been enabled on this server.
After starting NFD the same way and checking the label, there is no output.
Check the CPU flags:
$ cat /proc/cpuinfo | grep -i SGX
No output.
I think this is also correct.

It looks like NFD works fine as long as the CPU flags do not change. But if the CPU flags change, NFD doesn't update the label accordingly.
Thanks!

@mythi
Contributor

mythi commented Nov 4, 2021

$ cat /proc/cpuinfo | grep -i SGX
sgx and sgx_lc flags are gone.

That's because the kernel also checks the MSR registers for "BIOS enabled", whereas the cpuid package checks the cpuid leaves only.

@marquiz
Contributor

marquiz commented Nov 4, 2021

Yeah, NFD's cpuid detection does not parse /proc/cpuinfo; it uses the cpuid instruction instead.
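
To illustrate the kernel-side view discussed above, here is a small Python sketch that parses /proc/cpuinfo-style text for the sgx flag. The helper names are hypothetical. It reflects what the kernel reports after its own checks, not what the cpuid instruction returns, which is why the two sources can disagree when SGX is disabled in the BIOS.

```python
def cpuinfo_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def kernel_reports_sgx(cpuinfo_text):
    """True if the kernel lists the exact 'sgx' flag (not just 'sgx_lc')."""
    return "sgx" in cpuinfo_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not on Linux, or /proc is not mounted.
        text = ""
    print("kernel reports sgx:", kernel_reports_sgx(text))
```

Because the flags are compared as whole tokens, a line that contains only sgx_lc does not count as sgx, unlike a plain `grep -i SGX`.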

@mythi
Contributor

mythi commented Dec 1, 2021

Fixed by #647 (not sure why Fixes did not work)

/close

@k8s-ci-robot
Contributor

@mythi: You can't close an active issue/PR unless you authored it or you are a collaborator.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marquiz
Contributor

marquiz commented Dec 1, 2021

Closing this now as fixed/implemented. @MustDie95 please report back if you still have concerns

/close

@k8s-ci-robot
Contributor

@marquiz: Closing this issue.

