Why doesn't MultiProcessMapData() stop? #1501

Open
hsinhaoyu opened this issue Dec 16, 2020 · 2 comments

Comments

@hsinhaoyu

I tried something very simple with MultiProcessMapData():

from tensorpack.dataflow import DataFlow, MultiProcessMapData

class MyFlow(DataFlow):
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i

    def __len__(self):
        return self.n

def f(i):
    return i*10

d0 = MyFlow(10)
d1 = MultiProcessMapData(d0, num_proc=4, map_func=f, buffer_size=10, strict=False)
d1.reset_state()

for i in d1:
    print(i)
print("end")

In this example, the loop never stops; it just keeps producing numbers. If I set strict to True, the code produces 5 numbers (0, 10, 20, 30, 40) and then freezes. Is this the expected behaviour? I am using the latest version of Tensorpack on macOS. Thank you.

@ppwwyyxx
Collaborator

ppwwyyxx commented Dec 16, 2020

In this example, the loop never stops. It just produces more and more numbers

This is expected and documented at https://tensorpack.readthedocs.io/modules/dataflow.html#tensorpack.dataflow.MultiProcessMapData: RepeatedData(MapData(df, ...), -1) means an infinite repeat.
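
For readers unfamiliar with the "infinite repeat" semantics, here is a plain-Python analogy. It is only a sketch of the behaviour described above, not tensorpack's implementation: `repeat_mapped`, `times_ten`, and `source` are hypothetical names, and a `repeats` value of -1 is assumed to mean "forever", mirroring the -1 in the expression above.

```python
from itertools import count, islice


def repeat_mapped(source, map_func, repeats=-1):
    """Yield map_func(x) for each x in source, restarting `repeats` times.

    A negative `repeats` restarts forever (assumed analogue of the -1 above).
    """
    passes = count() if repeats < 0 else range(repeats)
    for _ in passes:
        for item in source:
            yield map_func(item)


def times_ten(i):
    # Stand-in for the map function `f` in the question.
    return i * 10


# Stand-in for the 10-element dataflow in the question; a range is re-iterable.
source = range(10)

# A single finite pass stops on its own after len(source) items.
one_pass = list(repeat_mapped(source, times_ten, repeats=1))

# An infinite repeat never raises StopIteration, so the consumer must bound it.
first_25 = list(islice(repeat_mapped(source, times_ten), 25))

print(one_pass)
print(first_25[-5:])
```

The point of the sketch is that a `for` loop over an infinitely repeating stream cannot terminate by itself; the consumer has to decide how many items to take, for example with `itertools.islice`.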

If I set strict to True, the code produces 5 numbers (0, 10, 20, 30, 40) and then freezes

This is not expected. The code works on Linux so this seems like a bug specific to macOS. I'll take a look some time.

@hsinhaoyu
Author

Thank you for the answer. Much appreciated. About the behaviour of strict=True: I ran the same code on a Linux computer, and it also hung after it produced a few numbers. Here is what the error looks like:

Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import test6_simplified
[1216 16:09:01 @argtools.py:138] WRN Starting a process with 'fork' method is efficient but not safe and may cause deadlock or crash.Use 'forkserver' or 'spawn' method instead if you run into such issues.See https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods on how to set them.
[1216 16:09:01 @argtools.py:138] WRN "import prctl" failed! Install python-prctl so that processes can be cleaned with guarantee.
0
10
20
30
40
50
60
70
80
90
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hhyuaur/learn/tensorpack/test6_simplified.py", line 22, in <module>
    for i in d1:
  File "/home/hhyuaur/anaconda3/envs/nndev/lib/python3.8/site-packages/tensorpack/dataflow/parallel_map.py", line 310, in __iter__
    yield from super(MultiProcessMapDataZMQ, self).__iter__()
  File "/home/hhyuaur/anaconda3/envs/nndev/lib/python3.8/site-packages/tensorpack/dataflow/parallel_map.py", line 87, in __iter__
    yield from self.get_data_strict()
  File "/home/hhyuaur/anaconda3/envs/nndev/lib/python3.8/site-packages/tensorpack/dataflow/parallel_map.py", line 79, in get_data_strict
    dp = self._recv_filter_none()
  File "/home/hhyuaur/anaconda3/envs/nndev/lib/python3.8/site-packages/tensorpack/dataflow/parallel_map.py", line 45, in _recv_filter_none
    ret = self._recv()
  File "/home/hhyuaur/anaconda3/envs/nndev/lib/python3.8/site-packages/tensorpack/dataflow/parallel_map.py", line 304, in _recv
    msg = self.socket.recv_multipart(copy=False)
  File "/home/hhyuaur/anaconda3/envs/nndev/lib/python3.8/site-packages/zmq/sugar/socket.py", line 566, in recv_multipart
    parts = [self.recv(flags, copy=copy, track=track)]
  File "zmq/backend/cython/socket.pyx", line 783, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 821, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 170, in zmq.backend.cython.socket._recv_frame
  File "zmq/backend/cython/checkrc.pxd", line 13, in zmq.backend.cython.checkrc._check_rc
KeyboardInterrupt
>>>

Looks like it's waiting for a process to return a value.

I discovered that if I changed num_proc = 4 to num_proc = 8, the code worked. What do you think is causing this behaviour? Thank you.
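
The warning in the log above recommends trying the 'forkserver' or 'spawn' start method instead of 'fork'. A minimal sketch of selecting one with the standard library follows. Whether this affects the hang reported here is untested and purely an assumption; `choose_start_method` is a hypothetical helper, not part of tensorpack.

```python
import multiprocessing as mp


def choose_start_method(preferred=("forkserver", "spawn")):
    """Select the first preferred start method available on this platform."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            # force=True overrides a start method that was already set.
            mp.set_start_method(method, force=True)
            return method
    # None of the preferred methods exist here; keep the current default.
    return mp.get_start_method()


if __name__ == "__main__":
    print("start method:", choose_start_method())
    # Build and iterate the dataflow only after this point.
```

With 'spawn' or 'forkserver', worker processes do not inherit the parent's memory, so the map function and dataflow generally need to be picklable and defined at module level, and the script needs the `if __name__ == "__main__":` guard shown.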
