
CUDA initialized before forking #115

Closed
berceanu opened this issue May 23, 2019 · 18 comments
Labels
enhancement New feature or request expertise needed Extra attention is needed
Milestone

Comments

@berceanu
Contributor

berceanu commented May 23, 2019

Description

I am trying to integrate fbpic, a well-known CUDA code (based on Python + Numba) for laser-plasma simulation with signac. The integration repo is signac-driven-fbpic.

I managed to successfully run on a single GPU, via python3 src/project.py run from inside the signac folder, but if I add --parallel I get

numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking

The goal is to get 8 (independent) copies of fbpic (with different input params) running in parallel on the 8 NVIDIA P100 GPUs that are on the same machine.

To reproduce

Clone the signac-driven-fbpic repo and follow the install instructions. Then go to the signac subfolder, and do

conda activate signac-driven-fbpic
python3 src/init.py
python3 src/project.py run --parallel

Error output

(signac-driven-fbpic) andrei@ServerS:~/Development/signac-driven-fbpic/signac$ python3 src/project.py run --parallel --show-traceback
Using environment configuration: UnknownEnvironment
Serialize tasks|##############################################################################################|100%
ERROR: Encountered error during program execution: 'CUDA initialized before forking'
Execute with '--show-traceback' or '--debug' to get more information.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2727, in _fork_with_serialization
    project._fork(project._loads_op(operation))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1467, in _fork
    self._operation_functions[operation.name](operation.job)
  File "src/project.py", line 172, in run_fbpic
    verbose_level=2,
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/main.py", line 232, in __init__
    n_guard, n_damp, None, exchange_period, use_all_mpi_ranks )
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/boundaries/boundary_communicator.py", line 267, in __init__
    self.d_left_damp = cuda.to_device( self.left_damp )
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 212, in _require_cuda_context
    return fn(*args, **kws)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/api.py", line 103, in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 683, in auto_device
    devobj = from_array_like(obj, stream=stream)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 621, in from_array_like
    writeback=ary, stream=stream, gpu_data=gpu_data)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 102, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 697, in memalloc
    self._attempt_allocation(allocator)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 680, in _attempt_allocation
    allocator()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 695, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 290, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 324, in _check_error
    raise CudaDriverError("CUDA initialized before forking")
numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/project.py", line 238, in <module>
    Project().main()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2721, in main
    _exit_or_raise()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2689, in main
    args.func(args)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2414, in _main_run
    run()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/legacy.py", line 193, in wrapper
    return func(self, jobs=jobs, names=names, *args, **kwargs)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1597, in run
    np=np, timeout=timeout, progress=progress)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1421, in run_operations
    pool, cloudpickle, operations, progress, timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1458, in _run_operations_in_parallel
    result.get(timeout=timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking

Relevant numba link.

System configuration

  • Operating System: Ubuntu 16.04
  • Version of Python: 3.6.8
  • Version of signac: 1.1.0
  • Version of signac-flow: 0.7.1
  • NVIDIA Driver Version: 410.72
@csadorf csadorf added the bug Something isn't working label May 23, 2019
@csadorf
Contributor

csadorf commented May 23, 2019

Hey, thanks for reporting this issue! Have you tried moving anything fbpic-related from module-level imports to the operation level? Importing those packages usually triggers the CUDA initialization.

@vyasr
Contributor

vyasr commented May 23, 2019

I concur, it's likely the from fbpic.main import Simulation that's doing the GPU initialization, I would try moving that into the operation. If that isn't enough, try the other imports as well.
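The suggested fix can be sketched in miniature with the standard library alone (fbpic itself is not imported here; `run_fbpic` and the `math` import are stand-ins). The point is that the heavy, CUDA-initializing import is deferred into the operation body, so it executes only inside each forked worker process, never in the parent that creates the pool.

```python
import multiprocessing as mp

def run_fbpic(job_id):
    # Previously at module level: `from fbpic.main import Simulation`.
    # Moved here, CUDA would first be initialized *after* the fork.
    import math  # stand-in for the heavy, CUDA-initializing import
    return math.factorial(job_id)  # stand-in for running the simulation

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # Each worker performs the import independently in its own process.
        print(pool.map(run_fbpic, range(4)))  # [1, 1, 2, 6]
```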

@berceanu
Contributor Author

You guys are awesome, it worked :)
Now I've stumbled into a different issue (should I open a new one?): all the runs are launched on the same GPU (index 0 of 8), instead of each of them claiming a separate card.

Looking at this fbpic example, they use MPI for parallel parameter scans on multiple GPUs, but I want to use signac of course! ;)

@csadorf
Contributor

csadorf commented May 23, 2019

While GitHub issues are usually not meant for tech support, I suggest we troubleshoot this as part of this issue, because it is a problem we need to solve generally.

The issue is that each operation is executed completely independently so there is no way to tell each operation what GPU to use. One way we could mitigate that is to assign each process some kind of "task number". This task number could then be stored for instance in an environment variable, read by the operation and used to compute which GPU to run on. In your example, that would look like this:

gpu = int(os.environ['SIGNAC_FLOW_TASK_ID']) % 8

Would any of the @glotzerlab/signac-developers want to give it a shot? This would be an alternative solution to the aggregation approach explored by @jglaser .
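The proposed mechanism could look like the following sketch. Note that SIGNAC_FLOW_TASK_ID is a hypothetical, not-yet-existing environment variable, and that environment variables are strings, so the integer conversion must happen before the modulo.

```python
import os

def select_gpu(n_gpus: int = 8) -> int:
    """Map a hypothetical per-process task number to a GPU index."""
    # SIGNAC_FLOW_TASK_ID is assumed to be set by signac-flow per worker.
    task_id = int(os.environ.get("SIGNAC_FLOW_TASK_ID", "0"))
    return task_id % n_gpus

# Example: task 11 on an 8-GPU machine would land on GPU 3.
os.environ["SIGNAC_FLOW_TASK_ID"] = "11"
print(select_gpu())  # 3
```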

@csadorf csadorf added enhancement New feature or request and removed bug Something isn't working labels May 23, 2019
@csadorf
Contributor

csadorf commented May 24, 2019

@joaander Josh, this issue is similar to one that you brought up a while back. I believe that what I'm suggesting here is in line with what you proposed back then?

@csadorf csadorf added the expertise needed Extra attention is needed label May 24, 2019
@joaander
Member

@csadorf Your proposed solution would provide efficient scheduling provided that (1) the number of parallel tasks is limited to the number of GPUs in the system and (2) all tasks take exactly the same amount of time. If either of these requirements is not met, this solution will result in situations where some GPUs may go unused at times and/or some GPUs may have multiple tasks assigned at times. This may or may not be desirable.

signac is not a resource manager or job scheduler and is not aware of the hardware on the system, the time it takes to run tasks, or what users are on the system. Such a system (i.e. SLURM in conjunction with the signac-flow submit functionality) would be required to obtain ideal scheduling on a multi-user system.

@berceanu If you are on a single-user workstation, you could consider enabling compute exclusive mode on your GPUs so the CUDA driver can auto-assign tasks to free GPUs. You would need to limit the amount of parallelism to the number of GPUs in the system.
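Enabling compute exclusive mode might look like the following (commands require root; the query fields are per the nvidia-smi documentation, but flags can vary across driver versions, so treat this as a sketch):

```shell
# Set all GPUs to exclusive-process compute mode, so each GPU accepts
# at most one CUDA process at a time and the driver auto-assigns tasks.
sudo nvidia-smi -c EXCLUSIVE_PROCESS

# Verify the setting per GPU.
nvidia-smi --query-gpu=index,compute_mode --format=csv
```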

@berceanu
Contributor Author

berceanu commented May 24, 2019

@joaander I set the compute mode on all 8 GPUs to "E. Process".

Now I get this error after the first operation completes on the first GPU:

ERROR: Encountered error during program execution: '[101] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_INVALID_DEVICE'
Execute with '--show-traceback' or '--debug' to get more information.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2727, in _fork_with_serialization
    project._fork(project._loads_op(operation))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1467, in _fork
    self._operation_functions[operation.name](operation.job)
  File "src/project.py", line 129, in run_fbpic
    from fbpic.main import Simulation
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/main.py", line 20, in <module>
    mpi_select_gpus( MPI )
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/utils/cuda.py", line 138, in mpi_select_gpus
    cuda.select_device(i_gpu)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/api.py", line 302, in select_device
    context = devices.get_context(device_id)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 194, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 162, in get_or_create_context
    return self.push_context(self.gpus[devnum])
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 130, in push_context
    ctx = self._get_or_create_context(gpu)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 120, in _get_or_create_context
    ctx = gpu.get_primary_context()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 472, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 290, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 325, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [101] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_INVALID_DEVICE
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/project.py", line 239, in <module>
    Project().main()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2721, in main
    _exit_or_raise()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2689, in main
    args.func(args)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2414, in _main_run
    run()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/legacy.py", line 193, in wrapper
    return func(self, jobs=jobs, names=names, *args, **kwargs)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1597, in run
    np=np, timeout=timeout, progress=progress)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1421, in run_operations
    pool, cloudpickle, operations, progress, timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1458, in _run_operations_in_parallel
    result.get(timeout=timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
numba.cuda.cudadrv.driver.CudaAPIError: [101] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_INVALID_DEVICE

Can this be because I didn't limit the parallelism to the number of free GPUs? How do I do that?
I get the error even with --parallel=8.

@joaander
Member

With compute exclusive mode, attempting to acquire a CUDA context will result in an error if there are no free GPUs. Are you sure all 8 GPUs are free? Try a smaller number and see if that works. Check nvidia-smi to see which processes are using which GPUs.

@berceanu
Contributor Author

Yes, I tried, two are not free, so I reduced it to 6 but still get the same problem.

@joaander
Member

@csadorf Does signac-flow reuse processes for multiple tasks? This would explain this behavior. Is there a way to make it launch a new process for each task?

With reused processes you would need to clean up and destroy the CUDA context at the end of each task so the GPU is free for the next one. The library you are using would need to provide an API call to destroy the context.

@csadorf
Contributor

csadorf commented May 24, 2019

@csadorf Does signac-flow reuse processes for multiple tasks?

@joaander Whenever possible, yes, because avoiding forking is much faster for small operations, which is what the run sub-command is designed for.

However, it is possible to suppress that behavior by specifying the executable manually, e.g. with directives(executable='python'). We should probably consider adding a @fork decorator or similar that instructs signac-flow to fork without this work-around.
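What the work-around achieves can be illustrated with the standard library alone (a sketch, not signac-flow's actual implementation): each task runs in a brand-new interpreter, so no CUDA context survives from one task to the next.

```python
import subprocess
import sys

def run_in_fresh_process(code: str) -> str:
    """Run a snippet in a brand-new Python interpreter, analogous to
    what forcing an explicit executable does for each operation."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(run_in_fresh_process("print(6 * 7)"))  # 42
```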

@berceanu
Contributor Author

berceanu commented Jun 5, 2019

I documented the work-around here: glotzerlab/signac-docs#27

@berceanu
Contributor Author

berceanu commented Jun 7, 2019

I just noticed a big inconvenience in the above work-around: one has to run
python3 src/project.py submit --bundle=6 --parallel --test | /bin/bash
for every set of operations. What I mean is, each time this runs, it only executes the next eligible operations, not the whole workflow. For example, I now have two operations, one to run the simulations and another to plot the results, and I have to run this command twice: the first time it just runs the simulations and stops there, and the second time it does the plotting. This was not an issue with the usual python3 src/project.py run --parallel, which ran until all the operations were completed, not just the first batch.

@csadorf
Contributor

csadorf commented Jun 7, 2019

Yes, this inconvenience is currently being addressed in PR #114. I hope we will be able to release it soon. As a work-around until then, you could define a meta-operation manually by simply calling both functions from another function, which is the one you actually submit.
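The suggested meta-operation might be sketched like this (run_simulation and plot_results are illustrative stand-ins for the two real operations in src/project.py):

```python
def run_simulation(job):
    # Stand-in for the fbpic simulation operation.
    return f"simulated {job}"

def plot_results(job):
    # Stand-in for the plotting operation.
    return f"plotted {job}"

def simulate_and_plot(job):
    # Meta-operation: submitting this single function executes the whole
    # workflow in one go, instead of one batch of eligible operations
    # per submission.
    run_simulation(job)
    return plot_results(job)

print(simulate_and_plot("job-42"))  # plotted job-42
```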

@tcmoore3 tcmoore3 added this to the v1.0 milestone Jul 5, 2019
@bdice bdice mentioned this issue Sep 3, 2019
@vyasr
Contributor

vyasr commented Feb 26, 2020

@csadorf I think since we declined to merge glotzerlab/signac-docs#27 we have decided that handling GPU scheduling is out of scope for signac-flow. Are you fine with closing this issue? The fork directive is sufficient to prevent redundant CUDA context creation, and I think that's the most we should probably do here.

@csadorf
Contributor

csadorf commented Feb 27, 2020

Before we close the issue, I'd be interested to know whether it can be resolved with groups on the user side.

@vyasr
Contributor

vyasr commented Feb 27, 2020

That's reasonable, if there is such a solution we could at least document that.

@berceanu
Contributor Author

berceanu commented Feb 13, 2021

Solved via SLURM in #455.
