
Avoid forking HDFS connection with multiprocessing #81

Closed
belldandyxtq opened this issue Nov 13, 2019 · 1 comment
@belldandyxtq

Problem statement

After forking, if the main process has already made HDFS calls, the program freezes on any subsequent HDFS call in the child process.
For example:

import chainerio
import multiprocessing


def func(hdfs_handler):
    # freeze here (2)
    print(list(hdfs_handler.list()))


hdfs_handler = chainerio.create_handler("hdfs")

# create HDFS connection internally (1)
print(list(hdfs_handler.list()))

p = multiprocessing.Process(target=func, args=(hdfs_handler, ))
p.start()
p.join()

Cause
ChainerIO accesses HDFS internally through the pyarrow module, and pyarrow in turn uses the HDFS Java module. The HDFS connection is pooled internally; if the connection is first created (implicitly, through HDFS calls such as (1)) and the process is then forked, the child inherits the pooled connection state, which breaks the pooling and causes subsequent HDFS calls to freeze.

Solution
Fork before the HDFS connection is created, i.e. fork before making any HDFS call such as (1).
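A minimal runnable sketch of that pattern: each process creates its own connection only after the fork, so no pooled state crosses the process boundary. `Connection` here is a stand-in class, not chainerio's API; with chainerio you would call `chainerio.create_handler("hdfs")` inside the child rather than before `Process.start()`.

```python
import multiprocessing
import os


class Connection:
    """Stand-in for a pooled HDFS connection; remembers the pid that made it."""

    def __init__(self):
        self.owner_pid = os.getpid()

    def list(self):
        # A real pooled connection misbehaves when used from a different
        # process than the one that created it; we model that as an assert.
        assert self.owner_pid == os.getpid(), "connection used across fork"
        return ["file_a", "file_b"]


def worker(queue):
    conn = Connection()  # created after the fork, inside the child
    queue.put(conn.list())


def main():
    # Use the fork start method explicitly, matching the scenario in
    # this issue (fork is the default on Linux anyway).
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    p = ctx.Process(target=worker, args=(queue,))
    p.start()
    child_result = queue.get()
    p.join()
    parent_conn = Connection()  # the parent makes its own connection too
    return child_result, parent_conn.list()


if __name__ == "__main__":
    print(main())
```

The key point is simply ordering: `Connection()` (and hence any implicit HDFS call) happens strictly after `p.start()` in each process that needs it.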

@kuenishi

kuenishi commented Mar 5, 2021

Closing in favor of the v2 API, which detects forks automatically via a _check_fork() call that raises an exception.
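A hedged sketch of such a fork guard, in the spirit of the `_check_fork()` mentioned above: record the creating pid and raise if the handler is later used from a forked child. The class and exception names here are illustrative assumptions, not chainerio's actual v2 implementation.

```python
import os


class ForkedError(RuntimeError):
    """Raised when a handler is used in a process other than its creator."""


class HdfsHandler:
    def __init__(self):
        # Remember which process created this handler.
        self._pid = os.getpid()

    def _check_fork(self):
        # Called at the top of every operation: if the current pid differs
        # from the creating pid, we must be in a forked child, so fail
        # loudly instead of freezing on a stale pooled connection.
        if os.getpid() != self._pid:
            raise ForkedError(
                "handler created in pid %d but used in pid %d; "
                "create a new handler after forking" % (self._pid, os.getpid())
            )

    def list(self):
        self._check_fork()
        return []  # a real handler would query HDFS here
```

Raising eagerly turns a silent hang into an immediate, debuggable error.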

@kuenishi kuenishi closed this as completed Mar 5, 2021