Hardware acceleration doesn't work when the container's (hardcoded) render group's GID doesn't match the host's #2739

Closed
melyux opened this issue Sep 29, 2022 · 12 comments
Labels: bug (Something isn't working), invalid (Works as documented or cannot be reproduced)

melyux commented Sep 29, 2022

1. What is not working as documented?

My video playback was extremely slow with non-HEVC videos, so I looked at the logs and saw a bunch of:

h264_qsv: failed transcoding

messages. I turned on debug logging and kept seeing

Error initializing an internal MFX session: unsupported (-3)

Hardware transcoding was not working at all. I tried a lot of things that didn't work, but finally ran ls -l /dev/dri inside the container and saw that the group owner of the renderD128 device was shown as a bare numeric GID instead of the render group's name. In other words, no group with that GID exists inside the container, which seems to be the problem.
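The check described above can be scripted. This is a minimal sketch, assuming a Linux environment with GNU or BusyBox stat and getent available; the function name describe_device_group is made up for illustration and is not part of PhotoPrism.

```shell
#!/bin/sh
# Print the numeric GID that owns a device node and the group name it
# resolves to in the current environment ("<unresolved>" if none does).
# describe_device_group is a hypothetical helper name.
describe_device_group() {
    dev=$1
    gid=$(stat -c '%g' "$dev") || return 1
    # getent prints "name:x:gid:members" when a group with this GID exists.
    name=$(getent group "$gid" | cut -d: -f1)
    echo "$gid ${name:-<unresolved>}"
}

# Example (run inside the container); the path is the one from this report:
# describe_device_group /dev/dri/renderD128
```

Running it on the host and again inside the container makes a mismatch visible: the numeric GID is the same in both places, but the name it resolves to may differ or be missing.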

So on the host, I ran chmod 777 /dev/dri/renderD128 and restarted the docker container. This time, no more errors and I could see in intel_gpu_top that the GPU was working!

This, however, is not sustainable because it resets on host reboot, and other containers work with hardware acceleration without doing host permission changes (like Plex).

2. How can we reproduce it?

Steps to reproduce the behavior:

  1. Use the :preview Docker image on an Intel machine that supports QSV and whose /dev/dri/renderD128 is owned by a render group with a GID other than 115, the value PhotoPrism seems to hardcode in create-users.sh.
  2. Enable FFmpeg's Intel encoder in the Docker options.
  3. Try to stream some video files that require transcoding.
  4. Check the log to see that it fails to transcode using the hardware encoder.

3. What behavior do you expect?

The docker container should handle GIDs for the render group that aren't 115. I think.

4. What could be the cause of your problem?

The create-users.sh file hardcodes the render group's GID as 115, so there's a mismatch between the container and the host that stops the container from accessing the /dev/dri/renderD128 device. The logs don't clearly indicate this as the cause, so it requires lots of debugging.

Someone had a similar issue on the Plex Linuxserver container (linuxserver/docker-plex#207), and it was solved by Linuxserver changing their user/group addition logic to be more dynamic, it seems (https://github.com/linuxserver/docker-plex/blob/master/root/etc/cont-init.d/50-gid-video).
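To illustrate what a more dynamic approach could look like, here is a rough sketch. It is not the Linuxserver script and not PhotoPrism code: the function name plan_device_group, the fallback group name prefix hwaccel, and the user name photoprism are all assumptions. It only prints the groupadd and usermod commands an entrypoint running as root might execute, so it is safe to try without changing anything.

```shell
#!/bin/sh
# Sketch: derive the group setup from the device's actual GID instead of
# a hardcoded value. Prints the commands rather than running them.
# plan_device_group and the "hwaccel" prefix are hypothetical names.
plan_device_group() {
    dev=$1
    user=$2
    gid=$(stat -c '%g' "$dev") || return 1
    # Reuse an existing group with that GID if the container has one.
    name=$(getent group "$gid" | cut -d: -f1)
    if [ -z "$name" ]; then
        name="hwaccel$gid"
        echo "groupadd --gid $gid $name"
    fi
    # Add the service user to the group as a supplementary member.
    echo "usermod -a -G $name $user"
}

# Example, assuming a service user called "photoprism":
# plan_device_group /dev/dri/renderD128 photoprism
```

Because the GID is read from the device at container start, the same image would work on hosts where the render group has different numeric IDs.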

Giving everything in /dev/dri 777 permissions "fixes" the problem, pointing to a permissions issue.

5. Can you provide us with example files for testing, error logs, or screenshots?

See above for the ffmpeg errors.

6. Which software versions do you use?

(a) PhotoPrism Architecture & Build Number: AMD64, 220919-cc8bab446

(b) Database Type & Version: MariaDB, latest

(c) Operating System Types & Versions: Linux

(d) Browser Types & Versions: Safari on Mac

(e) Ad Blockers, Browser Plugins, and/or Firewall Software? No

7. On what kind of device is PhotoPrism installed?

(a) Device / Processor Type: Intel Core i7-7700K

(b) Physical Memory & Swap Space in GB: 16GB + 8GB swap

(c) Storage Type: SSD + HDD

(d) Anything else that might be helpful to know?

I'm also using vGPU for a Windows VM on the same machine. Plex works with hw acceleration in another container.

8. Do you use a Reverse Proxy, Firewall, VPN, or CDN?

No

@melyux melyux added the bug (Something isn't working) label Sep 29, 2022
@melyux melyux changed the title from "Hardware acceleration doesn't work when the container's render group's GID doesn't match the host's" to "Hardware acceleration doesn't work when the container's (hardcoded) render group's GID doesn't match the host's" Sep 29, 2022
@lastzero
Member

@lastzero lastzero added the please-test (Ready for acceptance test) and help wanted (Well suited for external contributors!) labels Oct 19, 2022
Member

lastzero commented Nov 2, 2022

Can anyone confirm that this has been fixed/implemented as best as possible?

Member

lastzero commented Nov 2, 2022

Otherwise, if no one has time to test it, I would go ahead and close this issue since it seems to be solved with the pull request referenced above...

Member

lastzero commented Nov 4, 2022

I'll close this since we received no more feedback. We welcome contributions to better handle permissions and groups if needed.

@lastzero lastzero closed this as completed Nov 4, 2022
@lastzero lastzero added the invalid (Works as documented or cannot be reproduced) label and removed the help wanted (Well suited for external contributors!) and please-test (Ready for acceptance test) labels Nov 4, 2022
Author

melyux commented Dec 30, 2022

Sorry about the tardiness; I didn't get notifications from this thread. I'm not sure if there's anything I need to change in the docker compose file in light of the above pull request. I tried again with the latest PhotoPrism and it still has the same problem.

@lastzero
Member

Maybe I also didn't fully understand what the PR does and does not do. For example, it might still be necessary to start the container with the correct group ID, e.g. by using the "user:" property in your docker-compose.yml or the PHOTOPRISM_UID and PHOTOPRISM_GID variables.
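One way to start a container with a matching group ID is Docker's --group-add option, which adds a supplementary group to the container process. The sketch below is an assumption and not something confirmed in this thread: the helper name device_run_args is made up, and whether PhotoPrism's entrypoint keeps supplementary groups after switching to PHOTOPRISM_UID and PHOTOPRISM_GID is not verified here.

```shell
#!/bin/sh
# Print docker run options that share a device and add its owning GID
# as a supplementary group. device_run_args is a hypothetical helper.
device_run_args() {
    dev=$1
    gid=$(stat -c '%g' "$dev") || return 1
    echo "--device $dev --group-add $gid"
}

# Example on the host, assuming the photoprism/photoprism:preview image
# and the device path from this report (all other options omitted):
# docker run $(device_run_args /dev/dri/renderD128) photoprism/photoprism:preview
```

The numeric GID is used on purpose, since the group name may not exist inside the container.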

Author

melyux commented Jan 1, 2023

I'm setting the UID and GID to those of the non-root user at the moment, same as in all other containers including Plex (where GPU usage works). What would be the right UID to use?

Member

lastzero commented Jan 1, 2023

In the end, you need to share the rendering device with the container and have access to that device. In addition, the FFmpeg parameters, including the target codec and resolution, must be supported by your CPU. I don't know at what point it fails for you specifically, or whether that has to do with permissions or with one of the other requirements mentioned above. Enabling trace mode will show additional logs.
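The first two requirements (the device is shared, and the process can access it) can be told apart with a quick check run as the same user the application runs as. This is a minimal sketch; check_device_access is a made-up helper name, and it says nothing about codec or resolution support.

```shell
#!/bin/sh
# Report whether the current user can open a device node for reading
# and writing. check_device_access is a hypothetical helper.
check_device_access() {
    dev=$1
    if [ ! -e "$dev" ]; then
        echo "missing"        # the device was not shared with the container
    elif [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "accessible"
    else
        echo "denied"         # present, but permissions block this user
    fi
}

# Example (inside the container, as the application user):
# check_device_access /dev/dri/renderD128
```

A result of "denied" points at permissions, while "accessible" suggests looking at the FFmpeg parameters instead.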

Author

melyux commented Feb 28, 2023

Which requirements did you mean? It's definitely permissions, because running chmod 777 on the host's /dev/dri devices fixes the problem. It's the mismatch I mentioned in the original post: PP hardcodes the GID for the /dev/dri devices and doesn't use the one in the docker compose file. I'm not well versed in Linux permissions, though.
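The mismatch can be tested directly by comparing the device's owning GID with the groups of the current process. This is only a sketch under the assumption that GNU or BusyBox stat is available; in_device_group is a hypothetical name.

```shell
#!/bin/sh
# Succeed (exit status 0) if the current process's group list, as printed
# by id -G, contains the GID that owns the given device node.
# Returns 1 if it does not, and 2 if the device cannot be inspected.
in_device_group() {
    dev=$1
    gid=$(stat -c '%g' "$dev") || return 2
    for g in $(id -G); do
        [ "$g" = "$gid" ] && return 0
    done
    return 1
}

# Example (inside the container, as the application user):
# in_device_group /dev/dri/renderD128 && echo "group matches" || echo "group mismatch"
```

If this reports a mismatch while chmod 777 makes transcoding work, the group ID is the likely culprit rather than the encoder settings.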

@lastzero lastzero moved this to Released 🌈 in Roadmap 🚀✨ Jun 8, 2023
Author

melyux commented Aug 7, 2023

Saw that this was "Released". Was any change made to fix the GID hardcoding for the /dev/dri devices?

Member

lastzero commented Aug 7, 2023

What exactly would that look like?

Author

melyux commented Aug 7, 2023

The Linuxserver Plex container does this thing to make sure the hardware acceleration stuff has the right permissions: https://github.com/linuxserver/docker-plex/blob/master/root/etc/s6-overlay/s6-rc.d/init-plex-gid-video/run
