
Cannot connect to dalle when run in docker #18

Closed
AstrocyteTaki opened this issue May 18, 2022 · 10 comments

Comments

@AstrocyteTaki

Hello, thanks for sharing this wonderful project.

I ran into a problem: I ran it in Docker and tried to access it locally. The Docker build and run process was smooth, but when I started the client and tried to connect, this error occurred:

`ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment at address 0.0.0.0:49336. Head or worker may be down.`

I checked the port, and it appears to be the port of the dalle deployment:

`gateway/rep-0@60 adding connection for deployment dalle/heads/0 to grpc://0.0.0.0:49336`

Any idea how I could fix this? Thank you so much.

@nthomsencph

nthomsencph commented May 18, 2022

Having the same issue on the AWS Deep Learning AMI GPU PyTorch 1.11.0 (Amazon Linux 2) 20220328 (CUDA 11.6).

@hanxiao
Member

hanxiao commented May 21, 2022

We recently fixed the Dockerfile in #20; you could give it a try.

This should solve the problem, as we have successfully run it on a p2.8xlarge. @jina-ai/engineering will share more details next week.

@alaeddine-13
Contributor

alaeddine-13 commented May 21, 2022

Hey @nthomsencph, do `nvcc --version`, `nvidia-smi`, and `torch.cuda.device_count()` print results correctly both on the EC2 instance and inside the Docker image?
(To get a shell inside the Docker image and run the commands, you can do `docker run -it --entrypoint /bin/bash jinaai/dalle-flow`.)
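A minimal sketch of those checks, assuming the image tag `jinaai/dalle-flow` from the README and that `python` with PyTorch is on the image's PATH:

```bash
# on the EC2 host: driver, CUDA toolkit, and PyTorch should all see the GPU
nvidia-smi
nvcc --version
python -c "import torch; print(torch.cuda.device_count())"

# inside the image: open a shell with GPU access enabled
docker run --rm -it --gpus all --entrypoint /bin/bash jinaai/dalle-flow
#   (then repeat nvidia-smi and the torch check at the container prompt)
```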

@nthomsencph

Now it works 🔥

Rebooted the EC2 instance, ran `docker system prune -a`, pulled the repo, and ran the instructions. Thanks!

@spuliz

spuliz commented May 23, 2022

Hi @nthomsencph, I'm having the same issue on a g5.xlarge. Which EC2 instance are you using? Which instructions did you follow to install the NVIDIA container toolkit for Docker? How did you install cuDNN 8 inside Docker?

Thanks!

@nthomsencph

Hi @spuliz. We sprang for an AWS Deep Learning AMI (one that comes with CUDA 11.6 and more; see above) to skip the hassle of configuring this.
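For anyone who can't use the Deep Learning AMI, a rough sketch of the manual route NVIDIA documented for Ubuntu at the time (the `nvidia-docker2` package; treat the exact repo paths and package names as assumptions on other distributions — cuDNN itself normally ships inside the container via the CUDA base image, not on the host):

```bash
# register NVIDIA's container-runtime package repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# install the runtime and restart the Docker daemon
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# sanity check: a CUDA base image should see the GPU
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```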

@spuliz

spuliz commented May 26, 2022

Thanks @nthomsencph, which EC2 instance did you use? I am having an issue with the Tesla K-series GPUs, as your AMI does not come with the NVIDIA drivers already installed. The problem is that I cannot find an AMI with CUDA 11.6 installed.

@nthomsencph

We used a p1.large with a 16 GB GPU. No more is necessary, since we don't expect too many requests. The Deep Learning AMI we use for this has CUDA 11.6 preinstalled.

On honeymoon, so that's all the help I can offer ☀️

@hanxiao
Member

hanxiao commented Jun 11, 2022

Did you try building the Docker image and running it as a container? I just rebuilt and ran it without any issue.

https://github.com/jina-ai/dalle-flow#run-in-docker

git clone https://github.com/jina-ai/dalle-flow.git
cd dalle-flow

# build the image with your own UID/GID so the mounted cache stays writable
docker build --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/dalle-flow .

# expose the gateway on port 51005, mount the model cache, and give the container GPU access
docker run -p 51005:51005 -v $HOME/.cache:/home/dalle/.cache --gpus all jinaai/dalle-flow
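A quick way to confirm the gateway actually came up (generic Docker/netcat checks, not project-specific tooling):

```bash
# the container should be listed as running
docker ps --filter ancestor=jinaai/dalle-flow

# once the Flow is ready, the gateway port should accept TCP connections
nc -zv localhost 51005
```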

@delgermurun
Contributor

I believe this issue has been resolved. Feel free to reopen if the problem occurs again.
