Example from doc hangs on enumerate on CPU only machines #148

@alcinos

Hello,

I am trying to run the simple example provided in the documentation.
I created a fresh conda environment as advised. Note that I'm testing on a host without a GPU.
Relevant info:

Environment info
PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (GCC) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:24:11)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.18.0-305.28.1.el8_4.x86_64-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: 11.3.58
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] pytorch-pfn-extras==0.5.6
[pip3] torch==1.10.2
[pip3] torchvision==0.11.3
[conda] blas                      2.113                       mkl    conda-forge
[conda] blas-devel                3.9.0            13_linux64_mkl    conda-forge
[conda] cudatoolkit               11.3.1              ha36c431_10    conda-forge
[conda] libblas                   3.9.0            13_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            13_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            13_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            13_linux64_mkl    conda-forge
[conda] mkl                       2022.0.1           h8d4b97c_803    conda-forge
[conda] mkl-devel                 2022.0.1           ha770c72_804    conda-forge
[conda] mkl-include               2022.0.1           h8d4b97c_803    conda-forge
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.21.5           py39haac66dc_0    conda-forge
[conda] pytorch                   1.10.2          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-pfn-extras        0.5.6                    pypi_0    pypi
[conda] torchvision               0.11.3               py39_cu113    pytorch

For reference, here is the full code I'm running (taken from the documentation):

Demo code
from ffcv.writer import DatasetWriter
import numpy as np
from ffcv.fields import NDArrayField, FloatField
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder, FloatDecoder
from ffcv.transforms import ToTensor

class LinearRegressionDataset:
    def __init__(self, N, d):
        self.X = np.random.randn(N, d)
        self.Y = np.random.randn(N)
    def __getitem__(self, idx):
        return (self.X[idx].astype('float32'), self.Y[idx])
    def __len__(self):
        return len(self.X)

N, d = (10, 6)
dataset = LinearRegressionDataset(N, d)

writer = DatasetWriter("/tmp/new.beton", {
    'covariate': NDArrayField(shape=(d,), dtype=np.dtype('float32')),
    'label': FloatField(),
}, num_workers=16)
writer.from_indexed_dataset(dataset)

loader = Loader('/tmp/new.beton',
                batch_size=2,
                num_workers=1,
                order=OrderOption.RANDOM,
                pipelines={
                  'covariate': [NDArrayDecoder(), ToTensor()],
                  'label': [FloatDecoder(), ToTensor()]
                })

print(len(loader))

for l in loader:
    print(l)
The printed length is correct (5), but the code freezes completely when it reaches the for loop. Ctrl+C doesn't work, which suggests a multiprocessing issue.

top shows 0% CPU usage, along with a process whose command line is: python -c from multiprocessing.resource_tracker import main;main(7)

Let me know if you need additional debugging information. I'm not sure how to obtain a traceback in this case, so ideas are welcome.
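
One possible way to get a traceback out of a frozen Python process is the standard-library faulthandler module, which writes from a watchdog thread or a signal handler and so may still work when Ctrl+C does not. The sketch below is only a suggestion, not something taken from the FFCV documentation; the helper names (arm_hang_watchdog, disarm_hang_watchdog, enable_signal_dump) are hypothetical, and it is untested against this particular hang.

```python
import faulthandler
import signal
import sys


def arm_hang_watchdog(timeout_s, file=sys.stderr):
    """Dump the tracebacks of all threads if the process is still
    running timeout_s seconds from now (hypothetical helper name).

    `file` must stay open and expose fileno(), because faulthandler
    writes directly to the underlying file descriptor.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)


def disarm_hang_watchdog():
    """Cancel a pending dump once the suspicious section has finished."""
    faulthandler.cancel_dump_traceback_later()


def enable_signal_dump(file=sys.stderr):
    """Dump all thread tracebacks when the process receives SIGUSR1.

    Returns False on platforms without SIGUSR1 (for example Windows).
    """
    if not hasattr(signal, "SIGUSR1"):
        return False
    faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
    return True
```

In the demo script this would mean calling arm_hang_watchdog(30) just before the for loop and disarm_hang_watchdog() after it. With enable_signal_dump() called at startup, running kill -USR1 <pid> from another shell should print the stack of every Python thread in that process. Worker threads blocked inside native code will only show the Python frame that entered it.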

Best

Labels

bug (Something isn't working), wontfix (This will not be worked on)
