
[Issue]: Can't use the GPU in Jellyfin QNAP Docker image for hardware acceleration #9806

Closed
DrakeHamString opened this issue May 24, 2023 · 14 comments
Labels: bug

DrakeHamString commented May 24, 2023

Please describe your bug

Hi there,
I can't get Jellyfin to use my GPU (NVIDIA Quadro P400) on my QNAP TS-673A, apparently because of the NVIDIA driver that is persistently loaded in the kernel.
I was following your guide here: https://jellyfin.org/docs/general/administration/hardware-acceleration/
and working on the command line inside the Docker container.

I'm already passing the GPU to the container, and it's recognized:

root@Jellyfin:/# lshw -C display
  *-display                 
       description: VGA compatible controller
       product: GP107GL [Quadro P400]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:91 memory:f6000000-f6ffffff memory:e0000000-efffffff memory:f0000000-f1ffffff ioport:f000(size=128) memory:c0000-dffff

OS inside the container:

root@Jellyfin:/# cat /etc/*rel*
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Your guide also pointed me to this site, where I followed the instructions to install the GPU driver for Debian 11 "Bullseye":
https://wiki.debian.org/NvidiaGraphicsDrivers

From that, I added this line to /etc/apt/sources.list:

deb http://deb.debian.org/debian/ bullseye main contrib non-free

and then ran these commands as root:

apt update
apt install -y jellyfin-ffmpeg5
apt install -y nvidia-driver firmware-misc-nonfree
apt install -y libnvcuvid1 libnvidia-encode1

While installing, I got this message:

Mismatching nvidia kernel module loaded

The NVIDIA driver that is being installed (version 470.182.03) does not match the nvidia kernel module currently loaded (version 515.48.07).

The X server, OpenGL, and GPGPU applications may not work properly.

The easiest way to fix this is to reboot the machine once the installation has finished. You can also stop the X server (usually by stopping the login manager, e.g. gdm3, sddm, or xdm), manually unload the module ("modprobe -r nvidia"), and restart the X server.

(None of those suggestions worked.)

When executing nvidia-smi I get this:

Failed to initialize NVML: Driver/library version mismatch

So I suspect that a newer version of the NVIDIA driver is already embedded in the kernel of this Docker image, which I can't work with.
Kernel info:

root@Jellyfin:/# uname -a
Linux Jellyfin 5.10.60-qnap #1 SMP Fri Apr 21 07:36:35 CST 2023 x86_64 GNU/Linux
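
A quick way to confirm this kind of mismatch from inside the container is to compare the loaded kernel module's version with the installed user-mode packages; a sketch, assuming the Debian packaging used above:

# version of the NVIDIA kernel module loaded by the (shared) host kernel
cat /proc/driver/nvidia/version
# versions of the user-mode driver packages installed in the container
dpkg -l | grep -E 'nvidia-driver|libnvcuvid|libnvidia-encode'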

I already tried purging everything (apt purge nvidia* libnvidia*) and installing again. Same error.

With the driver installed:

root@Jellyfin:/# dkms status
nvidia-current, 470.182.03, 5.10.0-23-amd64, x86_64: installed

(it outputs nothing when no driver is installed)

And no, a reboot did not help either (I can't do more than stop and start the container).

When I try to enable hardware encoding in the Jellyfin settings and start a video, I get the iconic error message about incompatible media.

I hope you can help me with this issue.

Jellyfin Version

10.8.10

Environment

- OS: Debian 11 Bullseye
- Linux Kernel: Linux Jellyfin 5.10.60-qnap #1 SMP Fri Apr 21 07:36:35 CST 2023 x86_64 GNU/Linux
- Virtualization: Docker
- Clients: Browser
- Browser: Chrome, Edge
- FFmpeg Version: 5.1.3-Jellyfin
- Playback Method: Transcode
- Hardware Acceleration: NVENC (Not Working)
- GPU Model: Nvidia Quadro P400
- Plugins: none
- Reverse Proxy: Nginx Proxy Manager
- Base URL: --
- Networking: Bridged with own IP address
- Storage: Jellyfin runs on a NAS

Jellyfin logs

No response

FFmpeg logs

/usr/lib/jellyfin-ffmpeg/ffmpeg -analyzeduration 200M -init_hw_device cuda=cu:0 -filter_hw_device cu -hwaccel cuda -hwaccel_output_format cuda -threads 1 -autorotate 0 -i file:"/media/Filme/Lorem.mkv" -autoscale 0 -map_metadata -1 -map_chapters -1 -threads 0 -map 0:0 -map 0:2 -map -0:0 -codec:v:0 h264_nvenc -preset p1 -b:v 27690523 -maxrate 27690523 -bufsize 55381046 -profile:v:0 high -g:v:0 72 -keyint_min:v:0 72 -filter_complex "[0:5]scale=s=1920x1080:flags=fast_bilinear,format=yuva420p,hwupload=derive_device=cuda[sub];[0:0]setparams=color_primaries=bt709:color_trc=bt709:colorspace=bt709,scale_cuda=format=yuv420p[main];[main][sub]overlay_cuda=eof_action=endall:shortest=1:repeatlast=0" -start_at_zero -codec:a:0 libfdk_aac -ac 6 -ab 640000 -copyts -avoid_negative_ts disabled -max_muxing_queue_size 2048 -f hls -max_delay 5000000 -hls_time 3 -hls_segment_type mpegts -start_number 0 -hls_segment_filename "/config/transcodes/619157c251631fe5937ebb12ac305373%d.ts" -hls_playlist_type vod -hls_list_size 0 -y "/config/transcodes/619157c251631fe5937ebb12ac305373.m3u8"


ffmpeg version 5.1.3-Jellyfin Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 10 (Debian 10.2.1-6)
  configuration: --prefix=/usr/lib/jellyfin-ffmpeg --target-os=linux --extra-libs=-lfftw3f --extra-version=Jellyfin --disable-doc --disable-ffplay --disable-ptx-compression --disable-static --disable-libxcb --disable-sdl2 --disable-xlib --enable-lto --enable-gpl --enable-version3 --enable-shared --enable-gmp --enable-gnutls --enable-chromaprint --enable-libdrm --enable-libass --enable-libfreetype --enable-libfribidi --enable-libfontconfig --enable-libbluray --enable-libmp3lame --enable-libopus --enable-libtheora --enable-libvorbis --enable-libopenmpt --enable-libdav1d --enable-libwebp --enable-libvpx --enable-libx264 --enable-libx265 --enable-libzvbi --enable-libzimg --enable-libfdk-aac --arch=amd64 --enable-libsvtav1 --enable-libshaderc --enable-libplacebo --enable-vulkan --enable-opencl --enable-vaapi --enable-amf --enable-libmfx --enable-ffnvcodec --enable-cuda --enable-cuda-llvm --enable-cuvid --enable-nvdec --enable-nvenc
  libavutil      57. 28.100 / 57. 28.100
  libavcodec     59. 37.100 / 59. 37.100
  libavformat    59. 27.100 / 59. 27.100
  libavdevice    59.  7.100 / 59.  7.100
  libavfilter     8. 44.100 /  8. 44.100
  libswscale      6.  7.100 /  6.  7.100
  libswresample   4.  7.100 /  4.  7.100
  libpostproc    56.  6.100 / 56.  6.100
[AVHWDeviceContext @ 0x55da004dc800] cu->cuInit(0) failed -> CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
Device creation failed: -542398533.
Failed to set value 'cuda=cu:0' for option 'init_hw_device': Generic error in an external library
Error parsing global options: Generic error in an external library

Please attach any browser or client logs here

No response

Please attach any screenshots here

No response

@nyanmisaka (Member)

I suspect you can't use the standard NVIDIA driver package from the distro. Instead, you should install it according to QNAP's documentation.

@DrakeHamString (Author)

On the QNAP host, an NVIDIA driver is already installed automatically, and it's working.

@nyanmisaka (Member)

I see. So you shouldn't install the user-mode driver from the repo, since the versions are mismatched.

sudo apt-get purge nvidia-driver libnvcuvid1 libnvidia-encode1

Remove them, and check the extra Docker CLI options in this doc: https://www.qnap.com/en-us/how-to/tutorial/article/how-to-use-tensorflow-with-container-station
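
A sketch of what those extra Docker CLI options amount to, based on the device nodes and driver mount that appear later in this thread (the container name and image are illustrative):

# pass the GPU device nodes plus the QNAP driver tree into the container,
# mirroring the mappings shown later in this thread
docker run -d --name jellyfin \
  -p 8096:8096 \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  -v /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr:/usr/local/nvidia:ro \
  jellyfin/jellyfin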

@DrakeHamString (Author)

Thanks, that's what I also tried.

I also checked the extra devices and mappings from the doc; everything is already there in my container:

root@Jellyfin:/# ls -l /dev/nv*
crw-rw-rw- 1 root root 195,   0 May 24 12:44 nvidia0
crw-rw-rw- 1 root root 195, 255 May 24 12:44 nvidiactl
crw-rw-rw- 1 root root 511,   0 May 24 12:44 nvidia-uvm
root@Jellyfin:/# ls -l /usr/local/nv*
drwxrwxrwx  8 1000 1000  4096 May 23 17:55 nvidia
drwxrwxrwx  7 1000 1000  4096 May 23 15:10 nvidia.u18.04

@nyanmisaka (Member) commented May 24, 2023

Maybe this one helps? https://gist.github.com/weshofmann/620b924cde5dd498880e9315e48e793b?permalink_comment_id=4487458#gistcomment-4487458

[/share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr/bin] # ./nvidia-smi

should give the correct output.

@prahal commented May 25, 2023

@grossmaul There is no kernel in a container.
So the issue is with your host setup (though parts of the driver live in userspace, so those parts may still have to be installed in the container).
Conversely, the libraries that access the host's kernel device files are loaded from inside the container.
So:

nvidia-smi:

Failed to initialize NVML: Driver/library version mismatch

means that the NVIDIA library version inside the container must match the kernel driver version on the host.
It seems NVIDIA binds these dependencies strictly.
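
A quick way to check both sides of that requirement; a sketch, assuming the driver libraries are visible to the container's dynamic linker:

# on the host: version of the loaded kernel driver
cat /proc/driver/nvidia/version
# in the container: which libnvidia-ml.so the dynamic linker resolves
ldconfig -p | grep libnvidia-ml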

@DrakeHamString (Author)

Yeah, yesterday I learned that I shouldn't edit anything inside the container, so I reverted everything and went to the host instead.

The command from @nyanmisaka gives this output:

[/share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr/bin] # ./nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
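
One way to check whether the library is actually present under the QPKG tree (and merely missing from the linker's search path) is a plain find; a sketch using the install location shown above:

find /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV -name 'libnvidia-ml.so*'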

So I highly suspect that it's a driver issue on QNAP's side. There is only one driver (plus the kernel driver) I can install, and nothing else...

[screenshots of the available driver packages]

I have now opened a support case with QNAP.

@DrakeHamString (Author) commented May 25, 2023

Finally found the solution! Had to use this fix.

My solution works on a QNAP TS-673A (and should apply to every TS-x76A, such as the TS-476A and TS-876A) with a Quadro P400, but other cards should work too, since the solution is independent of the GPU model.
I think that this solution might work on other QNAP NAS models too.

Problem I want to fix:

[/share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr/bin] # ./nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

This prevents containers from using the GPU for hardware acceleration (HWA).

Step One:

  1. Install the drivers on QNAP as usual
  2. SSH into the QNAP
  3. vi /etc/ld.so.conf
  4. Add this line: /opt/NVIDIA_GPU_DRV/usr/lib (you may have to adjust it)
  5. Run ldconfig (or use the non-interactive sketch below)
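
The same change without an interactive editor; a one-liner sketch (adjust the path as noted above):

echo '/opt/NVIDIA_GPU_DRV/usr/lib' >> /etc/ld.so.conf
ldconfig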

After this you get a proper output from nvidia-smi:

[/share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr/bin] # ./nvidia-smi
Thu May 25 12:18:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:01:00.0 Off |                  N/A |
| 34%   36C    P8    N/A /  N/A |      2MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Step Two:

Now you have to do the same inside the container, but swap the path for the one it is mapped to in your container's environment.
In my case, /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr is mapped to /usr/local/nvidia in the container. So I did this:

  1. In Container Station: Connect to console with /bin/bash
  2. vi /etc/ld.so.conf (vim was not installed, so I had to do apt update && apt install -y vim)
  3. Add this line: /usr/local/nvidia/lib
  4. ldconfig
  5. Test it:
root@Jellyfin:/# /usr/local/nvidia/bin/nvidia-smi
Thu May 25 10:44:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:01:00.0 Off |                  N/A |
| 34%   44C    P5    N/A /  N/A |      2MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  6. Enable HWA in Jellyfin

Done!

HWA works in Jellyfin now!
I just found out that Step One is not actually necessary for HWA to work inside the container. But I think you're covered for future applications if you did it anyway.
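
For reference, Step Two as the same kind of non-interactive sketch. Note that changes made inside a container survive restarts but are lost when the container is recreated, which is why an environment-variable approach like the one in the next comment is more durable:

echo '/usr/local/nvidia/lib' >> /etc/ld.so.conf
ldconfig
/usr/local/nvidia/bin/nvidia-smi    # quick sanity check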

@deejayexe commented Sep 28, 2023

Hi!
This can be easier if you add this to the environment section of your compose file; it is permanent in the linuxserver/jellyfin Docker image:

environment:
  LD_LIBRARY_PATH: /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr:/usr/local/nvidia/lib

@Eniot666 commented Nov 21, 2023

Good morning,
I am able to do Step One, but when I get to Step Two I get this error:

Failed to initialize NVML: Unknown Error

Could you share the Docker Compose file that you are using?

Here is mine:

version: '3'
services:
  jellyfin:
    container_name: jellyfin
    image: lscr.io/linuxserver/jellyfin:latest
    hostname: jellyfin
    ports:
      - "8096:8096"
    volumes:
      - config:/config/
      - /etc/localtime:/etc/localtime:ro
      - /share/Multimedia:/multimedia
      - /share/MD0_DATA/.qpkg/NVIDIA_GPU_DRV/usr/:/usr/local/nvidia:ro
    devices:
      #- /dev/dri:/dev/dri
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/card0:/dev/dri/card0
    restart: unless-stopped
    environment:
        - TZ=Europe/Paris
        - NVIDIA_VISIBLE_DEVICES=all
        - LD_LIBRARY_PATH=/usr/local/nvidia/lib
volumes:
  config:

Eniot

@deejayexe

Hi @Eniot666, this is my compose:

version: '3'

services:
  jellyfin:
    container_name: jellyfin
    network_mode: "dockernetwork"
    restart: always
    ports:
      - 8096:8096
      - 8619:8920
      - 7359:7359
      - 1900:1900
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-uvm
    volumes:
      - /share/DockerVolumesMM2/jellyfin/:/config
      - /share/Series:/tv
      - /opt/NVIDIA_GPU_DRV/usr:/usr/local/nvidia:ro
    environment:
      PUID: 1002
      PGID: 1000
      JELLYFIN_PublishedServerUrl: https://jellyfin.domain.com #optional
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      JELLYFIN_FFmpeg__probesize: 50000000
      JELLYFIN_FFmpeg__analyzeduration: 50000000
      LD_LIBRARY_PATH: /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/usr:/usr/local/nvidia/lib
    image: lscr.io/linuxserver/jellyfin:latest

Note that you have MD0_DATA where I have CACHEDEV1_DATA.

@Eniot666 commented Nov 22, 2023

Hi @deejayexe,

Thanks. I had the wrong devices.

See my final compose YAML:

version: '3'
services:
  jellyfin:
    container_name: jellyfin
    image: lscr.io/linuxserver/jellyfin:latest
    hostname: jellyfin
    ports:
      - "8096:8096"
    volumes:
      - config:/config/
      - /etc/localtime:/etc/localtime:ro
      - /share/Multimedia:/multimedia
      - /opt/NVIDIA_GPU_DRV/usr:/usr/local/nvidia:ro
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-caps
    restart: unless-stopped

    environment:
        - TZ=Europe/Paris
        #- NVIDIA_VISIBLE_DEVICES=all
        #- NVIDIA_DRIVER_CAPABILITIES=all
        - LD_LIBRARY_PATH=/usr/local/nvidia/lib

The result:

root@jellyfin:/# /usr/local/nvidia/bin/nvidia-smi
Wed Nov 22 08:56:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T400         Off  | 00000000:01:00.0 Off |                  N/A |
| 38%   35C    P8    N/A /  31W |      1MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@jellyfin:/# echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib

@Eniot666
This evening I tried to start transcoding a film, and I get an error:

[AVHWDeviceContext @ 0x55f4b6b40a00] cu->cuInit(0) failed -> CUDA_ERROR_UNKNOWN: unknown error
Device creation failed: -542398533.
Failed to set value 'cuda=cu:0' for option 'init_hw_device': Generic error in an external library
Error parsing global options: Generic error in an external library

Do you know if you also need to set up the NVIDIA Container Toolkit:
https://github.com/NVIDIA/nvidia-container-toolkit

Have you done anything else on QNAP that could explain why it doesn't work?
On my config I have /dev/nvidia-caps instead of /dev/nvidia-uvm, and I don't understand why...

@Eniot666 commented Nov 23, 2023

Good morning,

To answer my own question: it is also necessary to follow
https://www.qnap.com/en-us/how-to/tutorial/article/how-to-use-tensorflow-with-container-station

Everything works fine now:

Output #0, hls, to '/config/data/transcodes/e461531fdf6c3a308c7b7077c96b6dbc.m3u8':
   Metadata:
     encode: Lavf59.27.100
   Stream #0:0: Video: h264 (Main), cuda(tv, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 4001 kb/s, 23.98 fps, 90k tbn (default)
     Metadata:
       encode: Lavc59.37.100 h264_nvenc
     Side data:
       cpb: bitrate max/min/avg: 4001968/0/4001968 buffer size: 8003936 vbv_delay: N/A
   Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp (default)

My new Docker Compose file:

version: '3'
services:
  jellyfin:
    container_name: jellyfin
    image: lscr.io/linuxserver/jellyfin:latest
    hostname: jellyfin
    ports:
      - "8096:8096"
    volumes:
      - config:/config/
      - /etc/localtime:/etc/localtime:ro
      - /share/Multimedia:/multimedia
      - /opt/NVIDIA_GPU_DRV/usr:/usr/local/nvidia:ro
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-uvm
    restart: unless-stopped
    environment:
        - TZ=Europe/Paris
        - NVIDIA_VISIBLE_DEVICES=all
        - NVIDIA_DRIVER_CAPABILITIES=all
        - LD_LIBRARY_PATH=/usr/local/nvidia/lib
volumes:
  config:

On the other hand, note that no process is listed here, even though the GPU is clearly in use:

Every 2.0s: /usr/local/nvidia/bin/nvidia-smi                                                                                                                                                                     jellyfin: Thu Nov 23 12:31:36 2023

Thu Nov 23 12:31:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T400         Off  | 00000000:01:00.0 Off |                  N/A |
| 38%   52C    P0    N/A /  31W |    292MiB /  2048MiB |     69%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The TensorFlow container, once created, can be deleted.

Edit: you can delete it afterwards, but after a reboot you also have to create it again.
