
Example script hanging without any output or any error hint. #99

Closed
seonwoo-min opened this issue Mar 16, 2022 · 8 comments

Comments

@seonwoo-min

I have completed the installation with Conda (using the install.sh script), but could not successfully run the example script.
When I run the pcqv2.sh script, it just hangs without any output or any error message.
I'm not sure whether anybody else has faced the same problem. Could you give me some advice on resolving the issue?

For more information, I'm using GCP with NVIDIA V100 GPUs and CUDA 11.1.
In the same environment, I have verified that the fairseq NMT example code and the Graphormer v1 code both run without errors.

@zhengsx
Collaborator

zhengsx commented Mar 16, 2022

Thanks for using Graphormer.

If there is no error message, could you please provide a minimal Docker image and the code/script/instructions that reproduce your problem?

@seonwoo-min
Author

@zhengsx, thanks for the quick reply.

I have been using the Conda environment without Docker.
I tried the following Dockerfile to create an image and still faced the same problem.
(By the way, sorry for the messy Dockerfile; I'm not very experienced with Docker.)

FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
ENV DEBIAN_FRONTEND=noninteractive

# Base tools and the deadsnakes PPA for Python 3.9
RUN apt-get update && apt-get install -y git sudo software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa

# Python 3.9 plus build dependencies
RUN apt-get update && apt-get install --no-install-recommends -y \
    python3.9 python3.9-dev python3.9-distutils python3-pip python3-dev \
    python-dev python-setuptools python3-setuptools \
    gfortran libopenblas-dev liblapack-dev gcc g++

# Make python/pip point at Python 3.9
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1

RUN python -m pip install --upgrade wheel setuptools pip distlib

Within the Docker image, I simply followed the instructions to install Graphormer.
Then, I used the example script pcqv2.sh as follows.

sudo docker run -it --gpus all --ipc=host --name=graphormer graphormer:latest bash
git clone --recursive https://github.com/microsoft/Graphormer.git
cd Graphormer
bash install.sh
cd examples/property_prediction
bash pcqv2.sh

Still, it just hangs without any output or any error message.

@skye95git

@mswzeus @zhengsx I'm hitting the same problem. Have you solved it yet?
Sometimes there is no response for half an hour after running the command bash zinc.sh:
(screenshot of the hung terminal omitted)

Sometimes errors are reported:

    from torch_geometric.data import InMemoryDataset
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/__init__.py", line 5, in <module>
    import torch_geometric.data
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_sparse/__init__.py", line 41, in <module>
    from .tensor import SparseTensor  # noqa
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_sparse/tensor.py", line 13, in <module>
    class SparseTensor(object):
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/jit/_script.py", line 1128, in script
    _compile_and_register_class(obj, _rcb, qualified_name)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/jit/_script.py", line 138, in _compile_and_register_class
    script_class = torch._C._jit_script_class_compile(qualified_name, ast, defaults, rcb)
RuntimeError:
object has no attribute sparse_csr_tensor:
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_sparse/tensor.py", line 511
            value = torch.ones(self.nnz(), dtype=dtype, device=self.device())

        return torch.sparse_csr_tensor(rowptr, col, value, self.sizes())
               ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
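The `sparse_csr_tensor` failure above usually means the installed torch_sparse wheel was built against a newer PyTorch than the one in the environment (`torch.sparse_csr_tensor` only exists in recent PyTorch releases). A quick, hedged way to list which expected attributes a build is missing, sketched here with a stand-in namespace instead of a real torch import:

```python
import types

def missing_attrs(module, names):
    """Return the attributes a module is expected to expose but doesn't."""
    return [n for n in names if not hasattr(module, n)]

# Stand-in for an older torch build that lacks sparse_csr_tensor;
# with a real install you would pass the imported torch module instead.
old_torch = types.SimpleNamespace(ones=lambda *a, **k: None)
print(missing_attrs(old_torch, ["ones", "sparse_csr_tensor"]))
# -> ['sparse_csr_tensor']
```

If `sparse_csr_tensor` is missing from the real torch module, the usual remedy is reinstalling torch_sparse with a wheel built for the installed PyTorch/CUDA combination.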

@seonwoo-min
Author

@zhengsx @skye95git Fortunately, a friend told me that it appears to be a setuptools issue.
After the installation, I removed setuptools with pip uninstall setuptools.
I'm not sure why it caused the problem, but the example script now runs without raising any errors.

@skye95git

skye95git commented Mar 29, 2022

> @zhengsx @skye95git Fortunately, a friend told me that it appears to be a setuptools issue. After the installation, I removed setuptools with pip uninstall setuptools. I'm not sure why it caused the problem, but the example script now runs without raising any errors.

Thanks very much! I tried it and it works. After removing setuptools with pip uninstall setuptools, there is an error No module named '_distutils_hack', but it does not affect running.

But I can't download the molecules and ZINC datasets directly when I run zinc.sh. Can I download and load them manually? If so, what should I modify?

File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/datasets/zinc.py", line 86, in download
    path = download_url(self.url, self.root)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/data/download.py", line 34, in download_url
    data = urllib.request.urlopen(url, context=context)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/urllib/request.py", line 1389, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
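When the download URL is unreachable, one option (assuming torch_geometric's usual dataset layout) is to fetch the archive by other means and drop it into the dataset's `raw/` directory: `InMemoryDataset` only calls `download()` when an expected raw file is missing. A minimal sketch of that check, with `molecules.zip` as an assumed file name:

```python
import os
import tempfile

def needs_download(root, raw_files):
    """Mimic torch_geometric's InMemoryDataset check: download() is only
    triggered when one of the expected raw files is missing under <root>/raw."""
    raw_dir = os.path.join(root, "raw")
    return not all(os.path.exists(os.path.join(raw_dir, f)) for f in raw_files)

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "raw"), exist_ok=True)
print(needs_download(root, ["molecules.zip"]))   # True: nothing placed yet
open(os.path.join(root, "raw", "molecules.zip"), "w").close()
print(needs_download(root, ["molecules.zip"]))   # False: download is skipped
```

The actual file names the ZINC loader expects are listed in its `raw_file_names` property, so check there before placing files manually.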

Update: I have solved it. Thanks.

@skye95git

skye95git commented Mar 30, 2022

> @zhengsx @skye95git Fortunately, a friend told me that it appears to be a setuptools issue. After the installation, I removed setuptools with pip uninstall setuptools. I'm not sure why it caused the problem, but the example script now runs without raising any errors.

Hi, pcqv1.sh now runs without any errors, but it seems to stall after printing some information. It stops at the following position:

/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/cuda/__init__.py:106: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | distributed init (rank 0): tcp://localhost:16416
/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/cuda/__init__.py:106: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | distributed init (rank 2): tcp://localhost:16416
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/cuda/__init__.py:106: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/cuda/__init__.py:106: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | distributed init (rank 3): tcp://localhost:16416
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | distributed init (rank 1): tcp://localhost:16416
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for 4 nodes.
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | initialized host dgx079.scc.idea as rank 0
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for 4 nodes.
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | initialized host dgx079.scc.idea as rank 2
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for 4 nodes.
2022-03-30 10:30:27 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for 4 nodes.
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | initialized host dgx079.scc.idea as rank 1
2022-03-30 10:30:27 | INFO | fairseq.distributed.utils | initialized host dgx079.scc.idea as rank 3
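The repeated warning above means this PyTorch wheel was compiled without sm_80 kernels, so an A100 cannot run them even though CUDA itself works. On a real install you would compare `torch.cuda.get_device_capability()` against `torch.cuda.get_arch_list()`; here is a pure-Python sketch of that comparison using the arch list from the warning:

```python
def gpu_kernels_available(capability, arch_list):
    """Check whether a GPU compute capability (major, minor) has matching
    sm_* kernels in the arch list a PyTorch build was compiled for."""
    major, minor = capability
    sms = {int(a.split("_")[1]) for a in arch_list if a.startswith("sm_")}
    return major * 10 + minor in sms

# Arch list from the warning above: the build tops out at sm_75.
build_archs = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75", "compute_37"]
print(gpu_kernels_available((8, 0), build_archs))  # A100 (sm_80) -> False
print(gpu_kernels_available((7, 0), build_archs))  # V100 (sm_70) -> True
```

The fix is a PyTorch build compiled for CUDA 11.x, which includes sm_80 kernels, rather than the cu102-era wheel implied by this arch list.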

Have you finished running pcqv1.sh and obtained any results?

New error:

Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/graphormer/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq_cli/train.py", line 528, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 344, in call_main
    torch.multiprocessing.spawn(
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 322, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 272, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1206, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1623448238472/work/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
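For an NCCL failure like the one above, a common first debugging step (these are standard NCCL/CUDA environment variables, not Graphormer-specific) is to rerun the script with verbose logging so the underlying CUDA error becomes visible:

```shell
# Standard debugging knobs for NCCL init failures:
export NCCL_DEBUG=INFO          # print NCCL's internal log to stderr
export CUDA_LAUNCH_BLOCKING=1   # surface the actual failing CUDA call
echo "NCCL_DEBUG=$NCCL_DEBUG CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```

With these set, rerunning the failing training script usually pins the error to a specific device or driver mismatch instead of the generic ncclUnhandledCudaError.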

Update: I have solved it. Thanks.

@zhengsx
Collaborator

zhengsx commented Mar 30, 2022

Hi @skye95git, I just had a glance at your error message; it looks like a usage problem caused by a PyTorch build that is incompatible with your GPU card. I have set up the Discussions forum, where you can ask usage questions about Graphormer and anything else you want to discuss, together with community members who have hit similar problems.

Let's keep the issue tracker for actual bugs and feature requests for Graphormer. What do you think?

@skye95git

skye95git commented Mar 30, 2022

> Hi @skye95git, I just had a glance at your error message; it looks like a usage problem caused by a PyTorch build that is incompatible with your GPU card. I have set up the Discussions forum, where you can ask usage questions about Graphormer and anything else you want to discuss, together with community members who have hit similar problems.
>
> Let's keep the issue tracker for actual bugs and feature requests for Graphormer. What do you think?

OK, thanks. It seems people are used to discussing problems in issues; I hope they will join the Discussions forum.
