docker run error for image_segmentation/pytorch test following the guide #689

Closed
gaowayne opened this issue Nov 24, 2023 · 3 comments

Comments

@gaowayne

gaowayne commented Nov 24, 2023

The guide I am following is
image_segmentation/pytorch

When I try to run the container, I get the error below, which says that the nvidia runtime does not exist. Could you please shed some light?

[stg@oq1 pytorch]$ sudo docker run --ipc=host -it --rm --runtime=nvidia -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/raw-data-dir:/raw_data -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/data:/data -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/results:/results unet3d:latest /bin/bash
[sudo] password for stg: 
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
[stg@oq1 pytorch]$ 
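
This error usually means the Docker daemon has no runtime named `nvidia` registered, which is normally set up by installing the NVIDIA Container Toolkit. One way to confirm this is to inspect the runtimes that `docker info` reports. The sketch below is a minimal check, assuming `docker` is on the PATH and that `docker info --format '{{json .Runtimes}}'` prints a JSON object keyed by runtime name. The helper names `has_runtime` and `docker_runtimes_json` are hypothetical.

```python
import json
import subprocess


def has_runtime(runtimes_json: str, name: str = "nvidia") -> bool:
    """Return True if `name` is a key in the JSON object of Docker runtimes."""
    runtimes = json.loads(runtimes_json)
    return isinstance(runtimes, dict) and name in runtimes


def docker_runtimes_json() -> str:
    """Ask the local Docker daemon for its registered runtimes as JSON."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    try:
        ok = has_runtime(docker_runtimes_json())
    except (OSError, ValueError, subprocess.CalledProcessError) as exc:
        # docker missing, daemon unreachable, or unexpected output
        print(f"could not query docker: {exc}")
    else:
        print("nvidia runtime registered" if ok else "nvidia runtime missing")
```

If the check reports the runtime as missing, `--runtime=nvidia` will keep failing until the toolkit is installed and the daemon is restarted.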

I am using Fedora 37. I failed to install CUDA container support because the install_cuda_docker.sh script does not support Fedora: it downloads Ubuntu 20.04 packages and relies on dpkg and apt.

[stg@oq1 training]$ sudo sh install_cuda_docker.sh 
[sudo] password for stg: 
--2023-11-23 19:31:16--  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2004.pin’

cuda-ubuntu2004.pin                                   100%[======================================================================================================================>]     190  --.-KB/s    in 0s      

2023-11-23 19:31:16 (10.6 MB/s) - ‘cuda-ubuntu2004.pin’ saved [190/190]

mv: cannot move 'cuda-ubuntu2004.pin' to '/etc/apt/preferences.d/cuda-repository-pin-600': No such file or directory
--2023-11-23 19:31:16--  https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2681112370 (2.5G) [application/x-deb]
Saving to: ‘cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb’

cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_am 100%[======================================================================================================================>]   2.50G  21.1MB/s    in 2m 36s  

2023-11-23 19:33:53 (16.4 MB/s) - ‘cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb’ saved [2681112370/2681112370]

sudo: dpkg: command not found
sudo: apt-key: command not found
sudo: apt-get: command not found
sudo: apt-get: command not found
sudo: apt-get: command not found
gpg: can't create '/usr/share/keyrings/docker-archive-keyring.gpg': No such file or directory
gpg: no valid OpenPGP data found.
gpg: dearmoring failed: No such file or directory
curl: (23) Failed writing body
install_cuda_docker.sh: line 15: dpkg: command not found
install_cuda_docker.sh: line 15: lsb_release: command not found
tee: /etc/apt/sources.list.d/docker.list: No such file or directory
sudo: apt: command not found
sudo: apt-get: command not found
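
The log shows that `install_cuda_docker.sh` downloads Ubuntu 20.04 packages and then calls `dpkg`, `apt-key`, and `apt-get`, none of which exist on Fedora. A guard along the following lines could fail early instead of downloading 2.5 GB first. This is a sketch and not part of the repository: it assumes the standard `/etc/os-release` file, and `parse_os_release` and `is_apt_based` are hypothetical helper names.

```python
import shlex
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in os-release format into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # Values may be shell-quoted, e.g. NAME="Fedora Linux".
        parts = shlex.split(raw)
        info[key] = parts[0] if parts else ""
    return info


def is_apt_based(info: dict) -> bool:
    """Heuristic: Debian-family systems list 'debian' or 'ubuntu' in ID or ID_LIKE."""
    ids = {info.get("ID", "")} | set(info.get("ID_LIKE", "").split())
    return bool(ids & {"debian", "ubuntu"})


if __name__ == "__main__":
    path = Path("/etc/os-release")
    if path.exists():
        info = parse_os_release(path.read_text())
        if not is_apt_based(info):
            print(f"{info.get('ID', 'unknown')}: apt/dpkg not expected here")
```

On a non-apt system the container toolkit has to be installed through the distribution's own package manager instead of this script.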
@gaowayne
Author

I installed nvidia-docker on Fedora and can now start the container,
but when I run the next step it fails with the error below. How can I fix this?

root@6ec7b9c99e06:/# ls
bin  boot  data  dev  etc  home  lib  lib64  media  mnt  opt  proc  raw_data  results  root  run  sbin  srv  sys  tmp  usr  var  workspace
root@6ec7b9c99e06:/# cd workspace/unet3d/
root@6ec7b9c99e06:/workspace/unet3d# python3 preprocess_dataset.py --data_dir /raw_data --results_dir /data
Preprocessing /raw_data
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Mean value: nan, std: nan, d: nan, h: nan, w: nan
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 127, in verify_dataset
    assert len(source) == len(os.listdir(results_dir))
AssertionError
root@6ec7b9c99e06:/workspace/unet3d# 
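
The `Mean of empty slice` warning and the `nan` statistics suggest that the script found no cases to process, which is what would happen if `/raw_data` is empty or laid out differently from what the script expects. A quick way to check is to count the case directories before running preprocessing. The sketch below assumes the KiTS19 layout of one `case_XXXXX` directory per case holding `imaging.nii.gz` and `segmentation.nii.gz`. That layout and the `find_cases` helper are assumptions, not taken from this thread.

```python
import os
import tempfile


def find_cases(data_dir, required=("imaging.nii.gz", "segmentation.nii.gz")):
    """Return sorted names of case_* subdirectories containing all required files."""
    cases = []
    for name in sorted(os.listdir(data_dir)):
        case_dir = os.path.join(data_dir, name)
        if not (name.startswith("case_") and os.path.isdir(case_dir)):
            continue
        if all(os.path.isfile(os.path.join(case_dir, f)) for f in required):
            cases.append(name)
    return cases


if __name__ == "__main__":
    # Demonstrate on a throwaway directory; point find_cases at /raw_data in practice.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "case_00000"))
        for f in ("imaging.nii.gz", "segmentation.nii.gz"):
            open(os.path.join(root, "case_00000", f), "w").close()
        os.makedirs(os.path.join(root, "case_00001"))  # incomplete case, skipped
        print(find_cases(root))
```

An empty result for `/raw_data` inside the container would point at the `-v` mount or the dataset download rather than at the preprocessing code.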

@gaowayne
Author

gaowayne commented Dec 6, 2023

I reinstalled the host OS with Ubuntu 22.04 and I still see this error. Could you please shed some light?

dcg@oq1:/mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch$ sudo docker run --ipc=host -it --rm --runtime=nvidia -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/raw-data-dir:/raw_data -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/data:/data -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/results:/results unet3d:latest /bin/bash
root@7f2d8fc3d617:/workspace/unet3d# ls
Dockerfile  LICENCE  README.md  checksum.json  data_loading  evaluation_cases.txt  main.py  model  oldREADME.md  preprocess_dataset.py  requirements.txt  run_and_time.sh  runtime
root@7f2d8fc3d617:/workspace/unet3d# python3 preprocess_dataset.py --data_dir /raw_data --results_dir /data
Preprocessing /raw_data
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Mean value: nan, std: nan, d: nan, h: nan, w: nan
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 127, in verify_dataset
    assert len(source) == len(os.listdir(results_dir))
AssertionError
root@7f2d8fc3d617:/workspace/unet3d# 
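
The bare `assert` in `verify_dataset` only reports that the file counts differ. A more informative check would list which expected files are missing from the results directory. The sketch below is hypothetical: it assumes `source` is a mapping from output file name to checksum (for example, loaded from the `checksum.json` visible in the directory listing), which this thread does not confirm. The file names used in the demo are made up.

```python
import os
import tempfile


def missing_outputs(expected_names, results_dir):
    """Return (missing, unexpected) file names in results_dir, each sorted."""
    present = set(os.listdir(results_dir))
    expected = set(expected_names)
    return sorted(expected - present), sorted(present - expected)


if __name__ == "__main__":
    # Hypothetical expected names; a real run would take them from the keys of `source`.
    expected = {"case_00000_x.npy": "abc", "case_00000_y.npy": "def"}
    with tempfile.TemporaryDirectory() as results:
        open(os.path.join(results, "case_00000_x.npy"), "w").close()
        missing, unexpected = missing_outputs(expected, results)
        print("missing:", missing)
        print("unexpected:", unexpected)
```

Since the same failure appears on both the Fedora and the Ubuntu host, the host OS is unlikely to be the cause. If every expected file is reported missing, the input data is the more likely culprit.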

@ShriyaPalsamudram
Contributor

Sorry, but the unet3d benchmark has been dropped from the training benchmark suite, so this issue cannot be addressed at this time.
