Skip to content

Commit

Permalink
adds note on fixing muiltiprocessing issues
Browse files Browse the repository at this point in the history
  • Loading branch information
edublancas committed Oct 20, 2021
1 parent d49de72 commit b0a6b4e
Show file tree
Hide file tree
Showing 3 changed files with 104 additions and 1 deletion.
93 changes: 93 additions & 0 deletions doc/user-guide/faq/multiprocessing.rst
@@ -0,0 +1,93 @@
Multiprocessing errors on macOS and Windows
-------------------------------------------

:ref:`Show me the solution <solution-1-add-name-main>`.

By default, Ploomber executes :class:`ploomber.tasks.PythonCallable`
(i.e., function tasks) in a child process using the `multiprocessing <https://docs.python.org/3/library/multiprocessing.html>`_
library. On macOS and Windows, Python uses the `spawn <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_
method to create child processes; this isn't an issue if you're
running your pipeline from the command-line (i.e., ``ploomber build``), but you'll
encounter the following issue if running from a script:

.. code-block:: text
An attempt has been made to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
This happens if you store a script (say ``run.py``):

.. code-block:: python
:class: text-editor
from ploomber.spec import DAGSpec
dag = DAGSpec('pipeline.yaml').to_dag()
# This fails on macOS and Windows!
dag.build()
And call your pipeline with:


.. code-block:: console
python run.py
There are two ways to solve this problem.

.. _solution-1-add-name-main:

Solution 1: Add ``__name__ == '__main__'``
******************************************

To allow correct creation of child processes using ``spawn``, run your pipeline
like this:


.. code-block:: python
:class: text-editor
from ploomber.spec import DAGSpec
if __name__ == '__main__':
dag = DAGSpec('pipeline.yaml').to_dag()
# calling build under this if statement allows
# correct creation of child processes
dag.build()
Solution 2: Disable multiprocessing
***********************************

You can disable multiprocessing in your pipeline like this:

.. code-block:: python
:class: text-editor
from ploomber.spec import DAGSpec
from ploomber.executor import Serial
dag = DAGSpec('pipeline.yaml').to_dag()
# overwrite executor regardless of what the pipeline.yaml
# says in the 'executor' field
dag.executor = Serial(build_in_subprocess=False)
dag.build()
2 changes: 2 additions & 0 deletions doc/user-guide/faq_index.rst
Expand Up @@ -13,6 +13,8 @@ FAQ and Glossary

.. include:: faq/autoreload.rst

.. include:: faq/multiprocessing.rst

Glossary
--------

Expand Down
10 changes: 9 additions & 1 deletion src/ploomber/executors/serial.py
Expand Up @@ -207,7 +207,15 @@ def build_in_current_process(task, build_kwargs, reports_all):

def build_in_subprocess(task, build_kwargs, reports_all):
if callable(task.source.primitive):
p = Pool(processes=1)

try:
p = Pool(processes=1)
except RuntimeError as e:
# this is most likely due to child processes created with
# spawn (mac/windows) outside if __name__ == '__main__'
raise RuntimeError('For help solving this, go to: '
'https://ploomber.io/h/mp') from e

res = p.apply_async(func=task._build, kwds=build_kwargs)
# calling this make sure we catch the exception, from the docs:
# Return the result when it arrives. If timeout is not None and
Expand Down

0 comments on commit b0a6b4e

Please sign in to comment.