Oryp7 Nvidia GPU issues "RmInitAdapter failed!" #113

Open
bflanagin opened this issue Jul 19, 2021 · 15 comments

Comments

@bflanagin
Contributor

bflanagin commented Jul 19, 2021

Based on issues reported by support and on internal testing, the discrete video card is failing on the oryp7 and the system is reverting to the integrated video card. When this happens, system76-driver still reports that the system is in nvidia mode.

Dmesg reports:

NVRM: GPU 000:01:00.0: RmInitAdapter failed! (0x23:0x56:643)
NVRM: GPU 000:01:00.0: rm_init_adapter failed, device minor number 0

The issue occurs randomly after reboot or power cycle and can be remedied the same way.

The issue can be replicated on Pop!_OS 20.10 and 21.04, as well as on Ubuntu 20.10 and Windows 10.
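
For anyone who wants to check whether a given boot hit this failure, here is a minimal sketch. The helper name `nvrm_init_failures` is hypothetical and not part of any System76 or NVIDIA tool; it only counts the `RmInitAdapter failed` lines quoted above in whatever kernel log text is piped into it.

```shell
# Hypothetical helper: count "RmInitAdapter failed" lines in kernel log
# text read from stdin. Prints the count; returns 0 if at least one
# failure line was found, 1 otherwise.
nvrm_init_failures() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the count without aborting the function.
    nvrm_count=$(grep -c 'RmInitAdapter failed' || true)
    echo "$nvrm_count"
    [ "$nvrm_count" -gt 0 ]
}
```

Possible usage: `sudo dmesg | nvrm_init_failures`. A non-zero count suggests the discrete GPU failed to initialize on this boot, even if the system still reports nvidia mode.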

@arnaudsj

I second that. It has been a problem since Nvidia driver 465 for me. On my end, I also get loud fan noise when the Nvidia discrete GPU is not detected. I can confirm that it only happens in NVIDIA mode (not hybrid mode).
A typical reboot does not always fix it for me; however, I have found that delaying the boot by a few seconds (by pressing Esc to get the boot menu) solves the problem. So it appears to be a timing issue (the Nvidia card not initializing fast enough?).

@leviport
Member

Well I can't seem to make it happen with #114 on 20.04. I'll keep trying, but this is looking promising.

@mitchelljohnmartel1

oryp7 with a 3070
dmesg.txt

@bflanagin
Contributor Author

Here are the logs created by nvidia-bug-report.

The line that includes "Failed to allocate NvKmsKapiDevice" may be relevant to the issue as it only appears in the not_working logs.

oryp7-3070-nvidia-bug-report.not_working.log.gz
oryp7-3070-nvidia-bug-report.log.gz

@cstrahan-blueshift

Is there a workaround for this?

$ sudo dmesg | grep nvidia
[sudo] password for cstrahan: 
[   13.658667] nvidia: module license 'NVIDIA' taints kernel.
[   13.705633] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[   13.708498] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[   13.778779] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.86  Tue Oct 26 21:46:51 UTC 2021
[   13.799315] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   15.264593] nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[   15.265282] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[   15.265579] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[   15.272756] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   15.275985] nvidia-uvm: Loaded the UVM driver, major device number 506.
[   16.865132] audit: type=1400 audit(1644425595.427:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1030 comm="apparmor_parser"
[   16.865139] audit: type=1400 audit(1644425595.427:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1030 comm="apparmor_parser"

$ sudo dmesg | grep NVRM
[   13.754980] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.86  Tue Oct 26 21:55:45 UTC 2021
[   15.265145] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   15.265221] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   30.103488] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   30.103565] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   30.112983] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   30.113054] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.746389] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   38.746444] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.746717] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   38.746761] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

@cstrahan-blueshift

I saw this comment on the Nvidia forums: https://forums.developer.nvidia.com/t/bug-470-42-01-1-dgpu-can-not-be-initialized/183627/3

Same here with 470.57.02. Happened after a PopOS update. Intel graphics only and non-functioning HDMI. Same computer (4k OLED version). At first I thought the NVIDIA chip was broken, but I put back the original Windows 10 SSD and everything was running fine.

Uninstalled PopOS NVIDIA drivers:
sudo apt remove nvidia*

Installed version 465.31, downloaded from nvidia.com:
chmod +x NVIDIA-Linux-x86_64-465.31.run
sudo ./NVIDIA-Linux-x86_64-465.31.run

And after a reboot PopOS is running perfectly with NVIDIA GTX 1650 Max-Q graphics again.

Going to try downgrading to 465.31, downloaded from here: https://download.nvidia.com/XFree86/Linux-x86_64/465.31/

@cstrahan-blueshift

Instead of downgrading to 465.31, I've decided to try upgrading to 495.46:

sudo apt remove 'nvidia-*'
chmod +x NVIDIA-Linux-x86_64-*
sudo ./NVIDIA-Linux-x86_64-495.46.run

Restarted, and now I see this:

[   20.347610] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[   20.381479] NVRM: This can occur when a driver such as: 
               NVRM: nouveau, rivafb, nvidiafb or rivatv 
               NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[   20.381482] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[   20.381485] NVRM: No NVIDIA devices probed.

and

$ lsmod | grep nouveau
nouveau              2269184  1
mxm_wmi                16384  1 nouveau
wmi                    32768  2 mxm_wmi,nouveau
drm_ttm_helper         16384  1 nouveau
i2c_algo_bit           16384  2 i915,nouveau
ttm                    86016  3 drm_ttm_helper,i915,nouveau
drm_kms_helper        307200  2 i915,nouveau
drm                   606208  11 drm_kms_helper,drm_ttm_helper,i915,ttm,nouveau
video                  53248  2 i915,nouveau

Going to try blacklisting nouveau and see how things go.

@cstrahan-blueshift

Success! Blacklisted nouveau like so:

sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo update-initramfs -u
sudo kernelstub

After a restart, when I run NVIDIA X Server Settings the window is no longer empty.

$ sudo lsmod | grep nvidia
nvidia_drm             65536  5
nvidia_modeset       1150976  5 nvidia_drm
nvidia              36917248  219 nvidia_modeset
drm_kms_helper        307200  2 nvidia_drm,i915
drm                   606208  10 drm_kms_helper,nvidia,nvidia_drm,i915,ttm
$ sudo dmesg | grep 'NVRM\|nvidia'
[   13.636879] nvidia: module license 'NVIDIA' taints kernel.
[   13.720349] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[   13.726867] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[   13.780378] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  495.46  Wed Oct 27 16:31:33 UTC 2021
[   13.819784] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  495.46  Wed Oct 27 16:22:48 UTC 2021
[   13.830860] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   16.311752] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[   17.130636] audit: type=1400 audit(1644430025.691:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1040 comm="apparmor_parser"
[   17.130642] audit: type=1400 audit(1644430025.691:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1040 comm="apparmor_parser"

With the nvidia driver now actually working, my external monitor is working once again.

NOTE: when installing the driver, I had to explicitly opt into having DKMS set up when prompted ("No" is highlighted by default when the prompt comes up).
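
To confirm after a reboot that the blacklist actually took effect, one option is to check the first column of `lsmod` output rather than grepping the whole line, since `nouveau` can also appear in the "used by" column of other modules. A minimal sketch follows; the function name `module_loaded` is hypothetical.

```shell
# Hypothetical helper: succeeds if the named module is listed as loaded
# in lsmod-format text on stdin. Only the first column (the module name)
# is compared, so "used by" entries on other lines do not count.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

Possible usage: `lsmod | module_loaded nouveau && echo "nouveau is still loaded"`, or `lsmod | module_loaded nvidia_drm` to check that the proprietary driver came up.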

@jacobgkau
Member

@cstrahan-blueshift Thank you for sharing your results!

Testing with #134 (NVIDIA driver 510.54), I still saw the issue occur when rebooting in NVIDIA mode on oryp7. However, with pop-os/linux#122 (Linux kernel 5.15.23), I am not currently seeing the issue occur on either driver version (although it's hard to rule anything out since it's intermittent.)

@cstrahan-blueshift

Just rebooted earlier, hoping that might resolve Zoom issues that have been plaguing me for the past couple weeks or so. Attached monitor just showed a blinking _. Realized I must have been on integrated graphics somehow, so disconnected displayport, went into the settings and switched to nvidia graphics; rebooted. Same thing. Settings show I'm still on integrated graphics.

Going to try to install NVIDIA-Linux-x86_64-510.60.02.run, and see if I have any luck.

Feeling a bit embarrassed at work, as I was the one that requested my oryp7, but my productivity has been hampered by graphics driver problems 😞.

@cstrahan-blueshift

That appears to have worked. Though now I'm wondering:

  • Was it the new version that fixed things? Or
  • Was it that a recent system update pulled in a new kernel and initramfs but failed to do the DKMS stuff for nvidia, so the real fix was just manually running sudo update-initramfs -u?

Don't know how I'd figure that out.
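
One possible way to narrow down the second theory is to ask DKMS whether it has an nvidia module installed for the running kernel. The sketch below is only that: the function name `nvidia_dkms_installed_for` is hypothetical, and it assumes `dkms status` prints lines that start with `nvidia,` or `nvidia/` and end in a status such as `installed`, which may differ between DKMS versions.

```shell
# Hypothetical helper: succeeds if dkms-status-style text on stdin has a
# line for the nvidia module that mentions the given kernel release and
# the word "installed". The line format is an assumption about dkms output.
nvidia_dkms_installed_for() {
    awk -v k="$1" '
        $0 ~ "^nvidia[,/]" && index($0, k) && /installed/ { ok = 1 }
        END { exit !ok }
    '
}
```

Possible usage: `dkms status | nvidia_dkms_installed_for "$(uname -r)" || echo "no nvidia DKMS module for this kernel"`. If that check fails right after a kernel update, it would point toward the DKMS build rather than the driver version.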

@cstrahan-blueshift

Actually, scratch what I last wrote. The kernel module loaded successfully, but I couldn't use my external monitor and the display settings didn't show the monitor. I think some xserver components must have got mangled when I tried to get rid of the old nvidia packages to install NVIDIA-Linux-x86_64-510.60.02.run.

Decided to try out the packaged nvidia drivers again, following what was described: #144 (comment)

Now everything is confirmed to be working again.

This is what I have installed presently:

$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-common-510/impish,impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 all [installed,automatic]
libnvidia-compute-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-compute-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-decode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-decode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-egl-wayland1/impish,now 1:1.1.7-2build1 amd64 [installed,automatic]
libnvidia-encode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-encode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-extra-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-fbc1-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-fbc1-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-gl-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-gl-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
nvidia-compute-utils-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-dkms-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-driver-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed]
nvidia-kernel-common-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-kernel-source-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-settings/impish-updates,now 470.57.01-0ubuntu3.1~0.21.10.1 amd64 [installed,automatic]
nvidia-utils-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
xserver-xorg-video-nvidia-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]

@crubel

crubel commented Apr 7, 2022 via email

@crubel

crubel commented Oct 11, 2022 via email

@leviport
Member

There is a firmware update in the works that should make this bug go away. I don't have an ETA at this time, but I'm hoping it will be ready soon.
