
Conversation

@fat-tire
Contributor

I was intrigued by the Docker support and decided to try it. When it comes to containers, I always prefer Podman running "rootless" rather than Docker running as root, so I made a few changes to support this as well.

This was tested on both Podman and Docker.

Notes:

  • To run w/Podman, just set CONTAINER_ENGINE="podman" (the default is "docker") for build.sh and run.sh (see the example after these notes). Otherwise, everything should hopefully run as before.
  • For fun, it was also built on a Raspberry Pi 4 w/Debian 11. No, I couldn't generate any images as the 8GB of RAM quickly filled up, but at least it built and got the web interface up and running.
  • I do have Podman running at full speed with CUDA on my machine, but I didn't want to dramatically break any of the current Docker behavior (and wasn't really sure how CUDA was even working as-is). My CUDA Dockerfile uses the nvidia/cuda Ubuntu image as a base, not the Python base image. If there is interest I can clean up that other Dockerfile and provide it; I'm guessing it will work with Docker as well.
  • To support Podman, the last commit f70fb02 pulls out two Docker features that I really wanted to keep but that Podman doesn't yet support. These changes shouldn't break Docker's build, but may make subsequent builds less efficient. I hope Podman 4.0 will support at least some of them. See the notes in that commit.
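
For example, assuming env.sh picks CONTAINER_ENGINE up from the environment (as it does in this PR), that could look like:

CONTAINER_ENGINE=podman ./docker/build.sh
CONTAINER_ENGINE=podman ./docker/run.sh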

Could someone with podman and/or docker test it? Even if y'all don't want ALL the commits here, hopefully some of it will be of value for Docker users too.

Enjoy!

…fault

Debian/Ubuntu/etc have a package called python-is-python3 tho.
…er.io

podman requires specifying docker.io-- doesn't default to it.
…cker"

build.sh:

When creating the volume, grab its path and do a "podman unshare"
to give the running "appuser" access to the volume's files.  Without it, you'd
have to run the rootless user as "root" in the container, and no one wants
that.
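
A rough sketch of that approach (the volume name is illustrative; this is the general rootless-Podman idiom, not necessarily the exact lines in the commit):

# grab the rootless volume's host path, then chown it inside the user namespace
VOLUME_PATH=$(podman volume inspect --format '{{.Mountpoint}}' "${VOLUMENAME}")
# map container uid/gid 1000 (appuser) onto the volume's files
podman unshare chown -R 1000:1000 "${VOLUME_PATH}"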

run.sh:

"--platform" seems to have problems with podman. This may actually
be an issue with arm64 as I'm trying both at the same time.  Regardless,
I unset it on podman to keep behavior the same for docker.
Instead of chmodding w/a mountpoint path, set the ownership of /data and
/data/outputs from inside the container w/a one-time root login.

This is the "proper" method that doesn't rely on a hard path, for future
compatibility.

Moved the creation of ./outputs to here, as it should probably only be done 1x and because it needs to exist to set up the correct ownership for Podman.

This removes some docker/buildkit-specific lines that break on
Podman... for now.

See:  containers/buildah#4325
      containers/buildah#3815

This is a painful patch, but without using another Dockerfile for Podman,
I couldn't find a non-convoluted alternative.
fat-tire requested a review from mauwii as a code owner on February 19, 2023 04:40
--mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
apt-get update \
RUN apt-get update \
Contributor

why do you remove the build cache?

This question of course also applies to all further removals of the build cache, but I'm not adding the same comment x times 😅

Contributor Author

See the notes above and in f70fb02; it breaks Podman, unfortunately.

build-essential=12.9 \
gcc=4:10.2.* \
python3-dev=3.9.*
python3-dev>=${PYTHON_VERSION}
Contributor

last time I checked there were no Python 3.10 headers available for the slim image; has this changed?

Contributor Author

Dunno-- it builds fine :) I'll change back to 3.9 tho

# syntax=docker/dockerfile:1

ARG PYTHON_VERSION=3.9
ARG PYTHON_VERSION=3.10
Contributor

Since the security score of the Python 3.9 container is much better than that of 3.10, I would prefer to stay on 3.9 until they fix this

Contributor Author

Okay-- wasn't sure why it was at 3.9, because the other docs said it was tested on 3.10 and known to be working, but I can change it back.

Contributor

It is indeed working (even when using the 3.9.* python-dev package on Python 3.10), but the 3.9 image had only 25% of the security issues I got with 3.10. Since there are people who use this image in the cloud, security should be a high prio 😅

Contributor Author

Gotcha... I'm going back to 3.9....

Quick question-- I presume you have this running w/a GPU. To run it on my machine w/nvidia support I needed to rely on a base image that supported CUDA, namely nvidia/cuda... I was trying to figure out: where/how does this Docker image get its GPU support from?

python3-dev>=${PYTHON_VERSION}

# prepare pip for buildkit cache
ARG APPNAME=InvokeAI
Contributor

already defined in previous stage

Contributor Author

Yes, but in podman the args get cleared out at every stage. I don't know why but it took me longer than I care to admit to figure that out.

Contributor

But this is producing a "huge" maintenance overhead (one spot to change a default value vs three spots).
So if it's really necessary, then please create env variables from the arguments in the base image and reuse those in the later stages, while cleaning them out so they don't pollute the container
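
A minimal sketch of that ARG-to-ENV pattern, assuming the later stages build FROM the base stage (stage and value names are illustrative, not the exact Dockerfile here):

FROM python:3.9-slim AS python-base
ARG APPDIR=/usr/src
ARG APPNAME=InvokeAI
# promote the build args to env vars so stages built FROM python-base inherit them
ENV APPDIR=${APPDIR} APPNAME=${APPNAME}

FROM python-base AS pyproject-builder
# no ARG redeclaration needed; ${APPDIR} and ${APPNAME} come from the inherited ENV
WORKDIR ${APPDIR}/${APPNAME}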

Contributor Author

ok I can try that.

--upgrade-deps

# copy sources
COPY --link . .
Contributor

why do you remove the --link?

Contributor Author

See f70fb02. Breaks podman. I hate it too :(


# Create a new user
ARG APPDIR=/usr/src
ARG APPNAME=InvokeAI
Contributor

both already defined in previous stage

-U \
"${UNAME}"
"${UNAME}" \
-u 1000
Contributor

why do we need to define the UID, isn't it enough to have a user group created with the same name as the user?

Contributor Author

Not for podman-- the way podman works with rootless containers is that it uses another uid that is distinct from the uid of the user running the container on the host. In this case, the uid/gid of 1000:1000 in the container is actually something like 100999:100999 in the mounts/volumes. Or at least that's what you WANT it to be. So I needed to explicitly set that when the account is created so that the user can run correctly. (I could be wrong and there might be another way to do this, but if there is I don't know it. I could run invokeai in the container as the "root" user rather than as appuser, at which point there would no longer be file access issues, but this seems like a bad idea.)
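
For a rough illustration of that mapping (the subuid range below is just an example; the actual numbers depend on /etc/subuid):

$ podman unshare cat /proc/self/uid_map
         0       1000          1
         1     100000      65536

i.e. container uid 0 maps to the host user (1000 here), and container uid 1000 lands at host uid 100999 (100000 + 1000 - 1).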

Contributor

I just built the current main branch locally and executed some commands:

./docker/build.sh
Activated virtual environment: /Users/mauwii/git/mauwii/InvokeAI/.venv
You are using these values:

Dockerfile:		./Dockerfile
index-url:		https://download.pytorch.org/whl/cpu
Volumename:		invokeai_data
Platform:		linux/arm64
Container Registry:	ghcr.io
Container Repository:	mauwii/invokeai
Container Tag:		main-cpu
Container Flavor:	cpu
Container Image:	ghcr.io/mauwii/invokeai:main-cpu

Volume already exists

[+] Building 178.5s (23/23) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 2.77kB                                                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 35B                                                                                                                                                     0.0s
 => resolve image config for docker.io/docker/dockerfile:1                                                                                                                           2.3s
 => [auth] docker/dockerfile:pull token for registry-1.docker.io                                                                                                                     0.0s
 => docker-image://docker.io/docker/dockerfile:1@sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14                                                             3.8s
 => => resolve docker.io/docker/dockerfile:1@sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14                                                                 0.0s
 => => sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14 8.40kB / 8.40kB                                                                                       0.0s
 => => sha256:7f44e51970d0422c2cbff3b20b6b5ef861f6244c396a06e1a96f7aa4fa83a4e6 482B / 482B                                                                                           0.0s
 => => sha256:a28edb2041b8f23c38382d8be273f0239f51ff1f510f98bccc8d0e7f42249e97 2.90kB / 2.90kB                                                                                       0.0s
 => => sha256:9d0cd65540a143ce38aa0be7c5e9efeed30d3580d03667f107cd76354f2bee65 10.82MB / 10.82MB                                                                                     3.1s
 => => extracting sha256:9d0cd65540a143ce38aa0be7c5e9efeed30d3580d03667f107cd76354f2bee65                                                                                            0.6s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => [internal] load metadata for docker.io/library/python:3.9-slim                                                                                                                   0.0s
 => [internal] load build context                                                                                                                                                    0.1s
 => => transferring context: 3.36MB                                                                                                                                                  0.0s
 => [python-base 1/4] FROM docker.io/library/python:3.9-slim                                                                                                                         0.0s
 => CACHED [python-base 2/4] RUN rm -f /etc/apt/apt.conf.d/docker-clean   && echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' >/etc/apt/apt.conf.d/keep-cache               0.0s
 => CACHED [python-base 3/4] RUN   --mount=type=cache,target=/var/cache/apt,sharing=locked   --mount=type=cache,target=/var/lib/apt,sharing=locked   apt-get update   && apt-get in  0.0s
 => CACHED [python-base 4/4] WORKDIR /usr/src                                                                                                                                        0.0s
 => CACHED [pyproject-builder 1/6] RUN   --mount=type=cache,target=/var/cache/apt,sharing=locked   --mount=type=cache,target=/var/lib/apt,sharing=locked   apt-get update   && apt-  0.0s
 => CACHED [pyproject-builder 2/6] RUN mkdir -p /var/cache/buildkit/pip                                                                                                              0.0s
 => CACHED [pyproject-builder 3/6] RUN --mount=type=cache,target=/var/cache/buildkit/pip,sharing=locked   python3 -m venv "InvokeAI"   --upgrade-deps                                0.0s
 => [pyproject-builder 4/6] COPY --link . .                                                                                                                                          0.1s
 => [pyproject-builder 5/6] RUN --mount=type=cache,target=/var/cache/buildkit/pip,sharing=locked   "InvokeAI/bin/pip" install .                                                    148.7s
 => [pyproject-builder 6/6] RUN python3 -c "from patchmatch import patch_match"                                                                                                      3.9s
 => CACHED [runtime 1/3] RUN useradd   --no-log-init   -m   -U   "appuser"                                                                                                           0.0s
 => CACHED [runtime 2/3] RUN mkdir -p "/data"   && chown -R "appuser" "/data"                                                                                                        0.0s
 => [runtime 3/3] COPY --chown=appuser --from=pyproject-builder /usr/src/InvokeAI InvokeAI                                                                                          11.0s
 => exporting to image                                                                                                                                                               4.8s
 => => exporting layers                                                                                                                                                              4.7s
 => => writing image sha256:d274038b0dd470a06f4bcfb8da22fb1fbe071c73ca947d96ef82c5e346dbf62b                                                                                         0.0s
 => => naming to ghcr.io/mauwii/invokeai:main-cpu                                                                                                                                    0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
 ~/git/mauwii/InvokeAI   main ±  docker run --rm --interactive --tty --entrypoint=/bin/bash ghcr.io/mauwii/invokeai:main-cpu
appuser@299ed35c86f9:/usr/src$ id -u
1000
appuser@299ed35c86f9:/usr/src$ id -g
1000
appuser@299ed35c86f9:/usr/src$ whoami
appuser
appuser@299ed35c86f9:/usr/src$ apt-get update
Reading package lists... Done
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)
appuser@299ed35c86f9:/usr/src$ sudo apt-get update
bash: sudo: command not found
appuser@299ed35c86f9:/usr/src$
  • no sudo for building the container
  • appuser already has uid 1000
  • no sudo inside the container

Contributor Author

inside the container the uid/gid is 1000. Same as w/podman. But if this is rootless docker, try to touch /data/outputs/testfile inside the container, then jump out and look at the uid/gid of the file in ./outputs. With podman, it's something other than 1000, like some big #. I believe with rootless docker it's the same, which is why you'd use newuidmap and newgidmap. Although I don't know much about rootless docker.
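
For illustration, here's what that check looks like on my podman side (container name and the exact uid are illustrative):

$ podman exec invokeai touch /data/outputs/testfile
$ ls -ln ./outputs/testfile
-rw-r--r-- 1 100999 100999 0 Feb 19 12:00 ./outputs/testfile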

Contributor Author

Also, just out of curiosity, are you running docker run from user 1000/1000? What happens if you run it rootlessly from 1001/1001? Will that affect the file ownership of files you create?

Contributor

since /data/outputs is mounted as a bind mount, on my local file system the file permissions are set to my current user (501:20 / mauwii:staff), while in the container they are mounted with permissions set to 1000:1000 / appuser:appuser.

Contributor Author

Huh... that's very different from rootless podman, where multiple users running in the container would each have their own uid/gid. If you added a second user, 1001:1001, in the container and created a file, would it still appear as 501:20 outside the container?

Contributor

why should it be different from creating a new user with 1000:1000?

Contributor

@fat-tire is correct: if the user in the container is 1000, then in general the files it creates on the bind-mounted volume will also be owned by uid=1000. Is it possible that Docker on Mac changes ownership to the current user for convenience? If so, that's not a standard or generally expected behaviour.

then
ARCH=arm64
fi
if [ $ARCH == "x86_64" ]
Contributor

I had no problems with either aarch64 or x86_64 (tested on an M1 and an i7)

Contributor Author

Not sure why, but both could not find the repository when I tried those arch names, until I changed them to the ones displayed on the Docker Hub site. Also, on podman you have to specify docker.io fwiw

Contributor

the missing docker.io in the base image tag is totally my fault, and the Dockerfile is much "cleaner" with the registry prepended to the base-image tag 😅

Contributor Author

heh sounds good

docker/env.sh Outdated
# docker is the default container engine, but should work with "podman" (rootless) too!
if [[ -z "${CONTAINER_ENGINE}" ]]; then
CONTAINER_ENGINE="docker"
fi
Contributor

why not using CONTAINER_ENGINE=${CONTAINER_ENGINE:-docker}?

Contributor Author

Because I'm a dummy. I can fix that!

Contributor

Nah - didn't mean it like that, but since I am always interested in learning new tricks I thought there could be a reason for the if statement 😅

Contributor Author

No, no I really am a dummy. Standby for a new commit coming soon!

docker run \
if [[ "${CONTAINER_ENGINE}" == "podman" ]]; then
PODMAN_ARGS="--user=appuser:appuser"
unset PLATFORM #causes problems
Contributor

if PLATFORM causes problems, then maybe runpod is not buildkit compatible?

Contributor Author

It looks like podman is still behind when it comes to buildkit. The good news is that podman 4.1.1 has cache mount support, for example. The bad news is that Debian 11 includes 3.0.1, and even Ubuntu 22.10 (the latest release) uses podman 3.4.4. The author of that article even suggests using actual buildkit with podman, then says wait, never mind, it doesn't work very well.

(Since you have security as a primary concern, I recommend considering podman, as the container is run and managed by a local user (and the container's user is ALSO a local user). So even if someone breaks out of the container's user to the root user, and then breaks out of the container entirely, they're STILL constrained within a user process.)

Contributor

I run docker rootless and the user in the container runtime also doesn't have root permissions 🙈

Please check whether your problems are resolved when pulling the built image from https://docs.docker.com/engine/security/rootless/, which would be much better than removing all those features from the Dockerfile 😅

Contributor Author

Okay I'll try it. I'm also adding a commit with everything discussed so far that I can change and not break the build.

@mauwii
Contributor

mauwii commented Feb 19, 2023

I was intrigued by the Docker support and decided to try it. When it comes to containers, I always prefer Podman running "rootless" rather than Docker running as root, so I made a few changes to support this as well.

Docker can also be used rootless: https://docs.docker.com/engine/security/rootless/

  • I do have Podman running at full speed with CUDA on my machine, but I didn't want to dramatically break any of the current Docker behavior (and wasn't really sure how CUDA was even working as-is). My CUDA Dockerfile uses the nvidia/cuda Ubuntu image as a base, not the Python base image. If there is interest I can clean up that other Dockerfile and provide it; I'm guessing it will work with Docker as well.

People were already using this image with the CUDA runtime: https://invoke-ai.github.io/InvokeAI/installation/040_INSTALL_DOCKER/

  • To support Podman, the last commit f70fb02 pulls out two Docker features that I really wanted to keep but that Podman doesn't yet support. These changes shouldn't break Docker's build, but may make subsequent builds less efficient. I hope Podman 4.0 will support at least some of them. See the notes in that commit.

But it is really not an option to remove the build cache, which is used not only by our CI/CD. Also it would be nice to keep the linked copy job and not need to change the default value for one build argument in three stages.

And btw: I always make sure that the built image is compatible with https://www.runpod.io, so maybe your problems could already be solved by pulling the built image from https://hub.docker.com/r/invokeai/invokeai instead of building it locally

@fat-tire
Contributor Author

Does runpod use podman or docker?

I can try to pull the full image from Docker Hub and see what happens. The build issues at least should not be a factor. But I think I'll be stuck w/cpu until I can figure out how to use the existing image with the cuda runtime.

In the meantime, I may as well push the requested changes here, and you can decide if you want to use any parts of it. If not, I can always just host a "Podman/CUDA"-specific version for my own use. No big deal.

@mauwii
Contributor

mauwii commented Feb 19, 2023

Does runpod use podman or docker?

I can try to pull the full image from Docker Hub and see what happens. The build issues at least should not be a factor. But I think I'll be stuck w/cpu until I can figure out how to use the existing image with the cuda runtime.

The latest tag is built for CUDA

${PLATFORM+--platform="${PLATFORM}"} \
--name="${REPOSITORY_NAME,,}" \
--hostname="${REPOSITORY_NAME,,}" \
--mount=source="${VOLUMENAME}",target=/data \
Contributor Author

"mount=source=" is that valid syndax? Podman was confused by it.

Contributor

#2722 (comment)

Was working there, never used Podman 😅

Contributor Author

okay, I looked in the documentation and didn't see it... maybe type defaults to volume and the second = is the equivalent of a comma...

Contributor

docker run -d \
  --name=nginxtest \
  --mount source=nginx-vol,destination=/usr/share/nginx/html,readonly \
  nginx:latest

So replace the equals sign in --mount= with a space and you get the same form the Docker docs are referring to. So it can be changed if necessary.
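
For reference, the fully explicit key=value form, which both engines should accept (names illustrative):

--mount type=volume,source=invokeai_data,target=/data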

@ebr
Contributor

ebr commented Feb 20, 2023

[...] (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image [...]

@fat-tire: The torch CUDA-enabled distribution bundles the entirety of its CUDA dependencies (cuda, cudnn, cublas, etc.). That is the reason for its ~1.8GB installed size on Linux (w/ cuda), vs mere megabytes for Mac (and Windows is somewhere in between). So you'll be happy to know that the massive nvidia/cuda image is redundant when using CUDA-supporting torch - you can use any minimal image as needed. Hope this saves you ~2GB of base layer :)
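
As a quick sanity check of which wheel actually got installed (these are just the standard torch introspection calls, nothing specific to this image):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# CUDA wheel prints something like: 1.13.1+cu117 11.7 True
# CPU-only wheel prints something like: 1.13.1+cpu None False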

@ebr
Contributor

ebr commented Feb 20, 2023

Does runpod use podman or docker?

Runpod uses Kubernetes as far as I can tell (so it's containerd most likely), but generally any OCI-compliant image should work equally well with either docker, podman, or k8s. Not sure of the rootless implications specifically, but could build-time compatibility issues (buildkit syntax support, etc) be avoided by building with docker and running with podman? Or does that undermine anything for your use case, specifically needing to run docker as root?

@fat-tire
Contributor Author

@fat-tire: The torch cuda-enabled distribution bundles the entirety of its CUDA dependencies (cuda,cudnn, cublas etc.) That is the reason for its ~1.8GB installed size for Linux (w/ cuda), vs mere megabytes for Mac (and Windows is somewhere in between). So you'll be happy to know that the massive nvidia/cuda image is redundant when using cuda-supporting torch - you can use any minimal image as needed. hope this saves you 2GB of base layer :)

Hey, thanks for the response! I did try building with "cuda" set as the flavor (manually, just to be sure), and did notice a ton of packages coming in, and as you suggest, at first I assumed I had everything I needed to run w/cuda (as the docs said). But no matter what I did and how I started it (again, this is with Podman), it kept coming up "cpu". Going to the container shell, running python, importing torch, and checking if the GPU is available, I constantly got "False". I thought maybe the problem was that the container didn't have access to the nvidia hardware, so I tried adding to the run.sh file all the

   --device /dev/dri \
     --device /dev/input \
     --device /dev/nvidia0 \
     --device /dev/nvidiactl \
     --device /dev/nvidia-modeset \
     --device /dev/nvidia-uvm \
     --device /dev/nvidia-uvm-tools \

stuff, and I played with giving it --privileged permissions, added GPU_FLAGS=ALL, and manually added --gpus set to all as well. Tried a bunch of combinations. Nothing was working.

As a last resort, I tried the nvidia/cuda repo as a base and boom, it came up. Well, once I also installed the nvidia driver matching my host driver, that is.

As for your suggestion to not worry about building in Podman and just use the prebuilt images-- sure, that would be fine, and that way there's no need to pull all those things that aren't working in Podman yet. I've yet to try pulling/running the pre-built docker-built container rather than building it myself, but I'll give it a try in the next day or so and report back.

Thx again!

@fat-tire
Contributor Author

fat-tire commented Feb 26, 2023

Had some time to do a Podman test with the officially prebaked image (latest tag) on Docker Hub-- here's the command to run it (note I'm using both ./mounts/outputs and ./mounts/invokeai for /data/outputs and /data just because I like to organize things in local folders rather than as a volume, but this should make no difference really).

Command to run:

GPU_FLAGS=all podman run --interactive --tty --rm --name=invokeai --hostname=invokeai \
--mount type=bind,source=./mounts/invokeai,target=/data \
--mount type=bind,source=./mounts/outputs,target=/data/outputs \
--publish=9090:9090 --user=appuser:appuser --gpus=all \
--device /dev/dri --device /dev/input --device /dev/nvidia0 \
--device /dev/nvidiactl --device /dev/nvidia-modeset \
--device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
--cap-add=sys_nice invokeai/invokeai

Note I'm including explicit access to the nvidia devices just in case it's needed when running rootless (it's what I use when I'm running the nvidia/cuda image, so I didn't want to make any changes since I know they provide the access the container needs. For all I know --gpus=all might be enough, but I didn't want to introduce anything different from the known-working command).

Running this command downloads the image from docker hub and runs, resulting in the (expected) error:

PermissionError: [Errno 13] Permission denied: '/data/models'

This was expected due to the uid/gid ownership issue discussed above.

The workaround was to run this 1x. It can only be run after the image is downloaded and the files/volumes are created:

#!/bin/bash
CONTAINER_ENGINE="podman"

# Podman only:  set ownership for user 1000:1000 (appuser) the right way
# this fixes PermissionError: [Errno 13] Permission denied: '/data/models'
if [[ ${CONTAINER_ENGINE} == "podman" ]] ; then
   echo Setting ownership for container\'s appuser on /data and /data/outputs
   podman run \
      --mount type=bind,source="$(pwd)/mounts/invokeai",target=/data \
      --user root --entrypoint "/bin/chown" "invokeai" \
      -R 1000:1000 /data
   podman run \
      --mount type=bind,source="$(pwd)"/mounts/outputs,target=/data/outputs \
      --user root --entrypoint "/bin/chown" "invokeai" \
      -R 1000:1000 /data/outputs
fi

Now that the mounted directories are correct, the above podman run command works, the various models are loaded, and the web server starts.

The problem though--

$ ./run.sh 
Setting ownership for container's appuser on /data and /data/outputs
* Initializing, be patient...
>> Initialization file /data/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.1
>> InvokeAI runtime directory is "/data"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cpu
>> xformers not installed
>> Initializing NSFW checker

CPU, not CUDA.

From the container:

root@invokeai:/usr/src# python3
Python 3.9.16 (main, Feb  9 2023, 05:40:23) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

Also:

root@invokeai:/usr/src# pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1
root@invokeai:/usr/src#  pip list | grep nvidia
nvidia-cublas-cu11          11.10.3.66
nvidia-cuda-nvrtc-cu11      11.7.99
nvidia-cuda-runtime-cu11    11.7.99
nvidia-cudnn-cu11           8.5.0.96

Contrast with my nvidia/cuda image-based container:

root@invokeai:/usr/src# python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True

and

* Initializing, be patient...
>> Initialization file /data/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.1
>> InvokeAI runtime directory is "/data"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cuda

and

root@invokeai:/usr/src# pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1+cu117
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1+cu117
# pip list | grep nvidia
root@invokeai:/usr/src# 

Most obviously, torch/torchvision have cu117 in my version, but for some reason this didn't make it into the upstream container (or it wasn't properly downloaded if it was supposed to be).

Thoughts?

@fat-tire
Contributor Author

fat-tire commented Feb 28, 2023

For a moment I thought perhaps I was just using the wrong tag and getting the cpu version as a result, but trying the :main-cuda tag came up "cpu" as well:

>> Using device_type cpu

If I pip uninstall torch torchvision and then pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117 I can get the right version of torch in there. But it's not there by default.

root@invokeai:/data# pip list | grep cu117
torch                       1.13.1+cu117
torchaudio                  0.13.1+cu117
torchvision                 0.14.1+cu117

Unfortunately, this didn't work. I also tried installing cu113 (no dice) and the cuda package from nvidia directly in the base image per nvidia's instructions.

This didn't work either.

@ebr
Contributor

ebr commented Mar 4, 2023

I don't know podman at all, sadly, just looked over some of the docs... but this is very curious indeed if the nvidia/cuda-based images work correctly for you, but not an image with torch...+cu117 installed. Because as I mentioned above, torch bundles all of the required CUDA dependencies and works out of the box - that's been well tested, albeit only in Docker and Kubernetes.

I see you're mapping quite a lot of devices in one of the above commands. Are you certain you're running the invoke image identically to the known-working nvidia/cuda one?

Curious as to the minimal set of podman arguments required to get the nvidia/cuda image to see the GPU, and whether the same works with the invoke image... does podman run --device nvidia.com/gpu0 work for you? The docs seem to suggest this is the correct way

@fat-tire
Contributor Author

fat-tire commented Mar 4, 2023

Thanks for taking a look. Yeah, I'm sure I am running it the same way, as I copy/pasted the run command from the nvidia/cuda run.sh, minus some QT_FONT_DPI cruft I had that I'm pretty sure isn't needed. I literally added an echo to the front of the run command, then used the output of that with the invokeai prebaked image, so I knew I was running it just as the run.sh file would. Also tried prepending GPU_FLAGS=all. It always comes up "cpu" no matter what I seem to do.

Trying with --device nvidia.com/gpu0 returned an Error: stat nvidia.com/gpu0: no such file or directory failure. I hadn't seen this syntax before, but it didn't seem to work.

Curious why the main-cuda tagged image includes versions 1.13.1 and 0.14.1 of torch/torchvision by default rather than the cu117 versions? Again, after running, then manually removing these versions and re-installing with pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117, they did appear in pip, but again, no dice getting "cuda" to appear when running invokeai --web right afterwards:

appuser@invokeai:/usr/src$ pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1+cu117
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1+cu117
appuser@invokeai:/usr/src$ python
Python 3.9.16 (main, Feb  9 2023, 05:40:23) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> quit()

@kriptolix


I don't know if it's still necessary, but the last time I needed to use graphics acceleration inside a podman container (fedora silverblue, to use an old version of invoke) I had to install the nvidia-container-runtime package (which only has a version for RHEL 8.3, but it works on fedora) on the host system, plus the respective gpu drivers inside the container as well. Also, I had to edit /etc/nvidia-container-runtime/config.toml and set no-cgroups = true.

I don't have an Nvidia GPU anymore, so I don't have anything to test with, but maybe this is the way to go.

@fat-tire
Contributor Author

fat-tire commented Mar 7, 2023

Well, I have some good news... I got it running from the pre-built image on podman/rootless! What's nice is that the image is significantly smaller, even with the nvidia drivers installed, than the previous container I was using, and rebuilding will be VERY fast now.

Steps:

  1. Modify the Dockerfile to start FROM the upstream prebuilt image. Then apt-get install the curl and kmod packages, then curl/download and install the Nvidia driver into the container. (The container driver version must match the host driver version.) Then clean up the installer.
  2. Create a new build.sh to determine the correct nvidia driver version (since it will change in future), then build the image with buildah bud, passing in the driver info.
  3. Create a run.sh script which MUST explicitly allow access to the various nvidia devices (or else nvidia-smi gives a "can't connect" error). Before every run it makes sure the mounted volumes/binds have correct 1000:1000 ownership for the local user, to avoid PermissionError: [Errno 13] Permission denied: '/data/models' podman errors.

I have all of the above working with the latest "3.0.0+a0" version.

Does anyone want this code... or what should I do with it? Since the real work is done in the prebuilt image, these are all very small, simple files. With a $CONTAINER_ENGINE flag, it could be integrated into the source in this repo... but I can also make a dedicated podman_invokeai repository solely for this purpose (it would be maybe 4 files and a README). I don't know if there are other podman users who would be interested or not.

I didn't have to touch anything in /etc/nvidia-container-runtime/, and I didn't even have to install the nvidia-container-runtime package on the host at all, though I am running the 525 driver, so I think it's included now.


@ebr
Contributor

ebr commented Mar 7, 2023

Great to hear you got it working! If there were a way to run this without installing the nvidia driver into the image, that would be ideal, in my opinion. Generally you really want to be using the driver that is already loaded by the kernel. But perhaps that's a hard limitation due to podman's rootless nature - I'm not sure.

I think your work here is valuable for supporting users who wish to run in a rootless container. Is there a way to do this without maintaining a separate Dockerfile and build/run scripts? Will leave it up to @mauwii to make the call on how to proceed next.

@mauwii
Contributor

mauwii commented Mar 7, 2023

I already addressed a lot of changes (see the 11 unresolved conversations) and made clear that I would not want to remove the caching 😅

@fat-tire
Contributor Author

fat-tire commented Mar 8, 2023

Thanks for addressing the changes in the other convos--

You wouldn't have to remove the caching, as podman now runs with the prebaked image (built with caching, --link, etc.). My working Dockerfile just adds the NVIDIA drivers on top. It basically looks like this:

FROM docker.io/invokeai/invokeai:main-cuda

ARG ARCH=x86_64
ARG NVIDIA_VERSION=525.85

USER 0
RUN apt update && apt install -y kmod curl
RUN cd /tmp && curl https://us.download.nvidia.com/XFree86/Linux-${ARCH}/${NVIDIA_VERSION}/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run -o /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run \
       && bash /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run --no-kernel-module --no-kernel-module-source --run-nvidia-xconfig --no-backup --no-questions --accept-license --ui=none \
       && rm -f /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run \
       && rm -rf /tmp/*
RUN apt remove --purge kmod curl -y && apt-get clean

The build.sh looks something like:

#!/bin/bash

CONTAINER_BUILD="buildah bud"
TAG=latest

if [ -z "$RESOLVE_NVIDIA_VERSION" ]; then
   export NVIDIA_VERSION=`nvidia-smi --query-gpu=driver_version --format=csv,noheader`
else
   export NVIDIA_VERSION="${RESOLVE_NVIDIA_VERSION}"
fi

${CONTAINER_BUILD} -t "invokeai:${TAG}" -t "invokeai" --build-arg ARCH=`uname -m` --build-arg NVIDIA_VERSION="${NVIDIA_VERSION}" .

As you can see, it passes the current NVIDIA driver version and ARCH to the build command. The container's driver has to match the host's, so this may be an issue for making any generic image for rootless.

I did try avoiding installing the nvidia driver, and instead tried using only the nvidia-container-runtime package in the Dockerfile. (Notes for automating that installation are here.)

RUN apt update && apt install gpg curl -y
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
RUN apt update && apt install -y nvidia-container-runtime
RUN apt remove --purge kmod curl gpg -y && apt-get clean

But this did NOT work, at least not for me.

The run.sh is pretty much the same, except that these --device lines were needed, plus the couple of lines in this PR that fix up the user permissions of the mount/bind volumes/directories, which I guess is a rootless thing to make sure that the uid and gid of the user running in the container can access the volumes/directories.

  --device /dev/dri --device /dev/input --device /dev/nvidia0 \
  --device /dev/nvidiactl --device /dev/nvidia-modeset \
  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \

That's basically all I got! Just happy it's working now, and, while I don't know what, if anything, above can be integrated, hopefully someone else running rootlessly can find value in it.

@fat-tire
Contributor Author

Hey it's been a few weeks and I'm inclined to close this just to not clutter up the PR area. Is there anything anyone wants from here? I've got podman running locally w/the method outlined above-- one thing I did add recently is:

  --env TRANSFORMERS_CACHE=/data/.cache \

to the podman run command, as it wasn't set anywhere else.

I've also noticed I get this:

 Server error: [Errno 18] Invalid cross-device link:

When trying to delete an image via the trash can icon in the web UI, because image files apparently can't be moved from /data/outputs/ to /data/.Trash-1000/files/ (likely a cross-filesystem rename, since /data and /data/outputs are separate mounts here). I had this problem previously. Dunno if it's podman-only or an issue with the mounts or what.

@ebr
Contributor

ebr commented Jul 28, 2023

@fat-tire I'm going to close this PR as outdated - we'll have it for reference if/when implementing Podman support, as discussed in #3587

ebr closed this on Jul 28, 2023