GPU plugin fails on CoreOS #260

Closed
poussa opened this issue Jan 14, 2020 · 2 comments · Fixed by #272

poussa commented Jan 14, 2020

kubectl get pod -n kube-system intel-gpu-plugin-gktbr -o wide
NAME                     READY   STATUS             RESTARTS   AGE   IP          NODE             NOMINATED NODE   READINESS GATES
intel-gpu-plugin-gktbr   0/1     CrashLoopBackOff   7          13m   10.44.0.1   w-1-k8s-node-1   <none>           <none>
spoussa@cloud-manager:~$ kubectl logs -n kube-system intel-gpu-plugin-gktbr
GPU device plugin started
Device scan failed: open /sys/class/drm: no such file or directory
Can't read sysfs folder
main.(*devicePlugin).scan
	/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/gpu_plugin.go:83
main.(*devicePlugin).Scan
	/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/gpu_plugin.go:69
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).Run.func1
	/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:96
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1357
w-1-k8s-master ~ # uname -a
Linux w-1-k8s-master 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019 x86_64 QEMU Virtual CPU version 2.5+ GenuineIntel GNU/Linux
w-1-k8s-master ~ # cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2303.3.0
VERSION_ID=2303.3.0
BUILD_ID=2019-12-02-2049
PRETTY_NAME="Container Linux by CoreOS 2303.3.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

mythi commented Jan 20, 2020

Maybe the plugin shouldn't crash/exit when the sysfs directory is missing, but report no devices instead? The error comes from this check:

files, err := ioutil.ReadDir(dp.sysfsDir)
if err != nil {
	return nil, errors.Wrap(err, "Can't read sysfs folder")
}


mythi commented Jan 22, 2020

/cc @grahamwhaley

grahamwhaley pushed a commit to grahamwhaley/intel-device-plugins-for-kubernetes that referenced this issue Jan 29, 2020
If we fail to scan for GPU devices (note, that is potentially
different from not finding any devices during a scan), then
warn on it, and go around the poll loop again. Do not treat
it as a fatal error or we might end up in a re-launch death
deploy loop...

Of course, getting a warning in your logs every 5s could also
be annoying, but is somewhat 'less fatal'.

Fixes: intel#260
Fixes: intel#230

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
askervin pushed a commit to askervin/intel-device-plugins-for-kubernetes that referenced this issue May 6, 2020
memtier: fix a number of bugs related to scoring, restoring saved state, and memory accounting.