
Multi-GPU distributed training with PyTorch #1838

Open
bouachalazhar opened this issue Apr 18, 2024 · 0 comments

Issue Type

Documentation Bug

Source

source

Keras Version

Keras 3.2.1

Custom Code

No

OS Platform and Distribution

Linux Ubuntu 22.04.3

Python version

No response

GPU model and memory

No response

Current Behavior?

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
x_train shape: (60000, 28, 28, 1)
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
<ipython-input-14-31fe06a97fdd> in <cell line: 1>()
      1 if __name__ == "__main__":
      2     # We use the "fork" method rather than "spawn" to support notebooks
----> 3     torch.multiprocessing.start_processes(
      4         per_device_launch_fn,
      5         args=(num_gpu,),

1 frames
/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    156         msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    157         msg += original_trace
--> 158         raise ProcessRaisedException(msg, error_index, failed_process.pid)
    159 
    160 

ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "<ipython-input-13-9a6bcf1473c9>", line 47, in per_device_launch_fn
    model = get_model()
  File "<ipython-input-9-4d33f81f3022>", line 14, in get_model
    x = keras.layers.Conv2D(filters=12, kernel_size=3, padding="same", use_bias=False)(
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 288, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
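
The error message itself points at a possible fix: starting the per-device workers with the "spawn" start method instead of "fork". A minimal sketch of that change, assuming the per_device_launch_fn and num_gpu names from the linked guide (the worker body is only a placeholder here; with "spawn", the launched function must be picklable/importable, so it likely needs to live in a .py script rather than a notebook cell, which is why the guide uses "fork"):

import torch

def per_device_launch_fn(current_gpu_index, num_gpu):
    # Placeholder for the guide's worker: in the guide it sets up
    # torch.distributed, builds the model with get_model(), and runs
    # the training loop on the GPU given by current_gpu_index.
    ...

if __name__ == "__main__":
    num_gpu = torch.cuda.device_count()
    torch.multiprocessing.start_processes(
        per_device_launch_fn,
        args=(num_gpu,),
        nprocs=num_gpu,
        join=True,
        start_method="spawn",  # "fork" cannot re-initialize CUDA in the child process
    )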

Standalone code to reproduce the issue or tutorial link

https://keras.io/guides/distributed_training_with_torch/

Relevant log output

No response
