
CUDA initialized before forking #115

Closed
berceanu opened this issue May 23, 2019 · 18 comments
Labels
enhancement New feature or request expertise needed Extra attention is needed
Milestone

Comments

@berceanu
Contributor

berceanu commented May 23, 2019

Description

I am trying to integrate fbpic, a well-known CUDA code (based on Python + Numba) for laser-plasma simulation with signac. The integration repo is signac-driven-fbpic.

I managed to successfully run on a single GPU, via python3 src/project.py run from inside the signac folder, but if I add --parallel I get

numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking

The goal is to get 8 (independent) copies of fbpic (with different input params) running in parallel on the 8 NVIDIA P100 GPUs that are on the same machine.

To reproduce

Clone the signac-driven-fbpic repo and follow the install instructions. Then go to the signac subfolder, and do

conda activate signac-driven-fbpic
python3 src/init.py
python3 src/project.py run --parallel

Error output

(signac-driven-fbpic) andrei@ServerS:~/Development/signac-driven-fbpic/signac$ python3 src/project.py run --parallel --show-traceback
Using environment configuration: UnknownEnvironment
Serialize tasks|##############################################################################################|100%
ERROR: Encountered error during program execution: 'CUDA initialized before forking'
Execute with '--show-traceback' or '--debug' to get more information.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2727, in _fork_with_serialization
    project._fork(project._loads_op(operation))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1467, in _fork
    self._operation_functions[operation.name](operation.job)
  File "src/project.py", line 172, in run_fbpic
    verbose_level=2,
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/main.py", line 232, in __init__
    n_guard, n_damp, None, exchange_period, use_all_mpi_ranks )
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/boundaries/boundary_communicator.py", line 267, in __init__
    self.d_left_damp = cuda.to_device( self.left_damp )
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 212, in _require_cuda_context
    return fn(*args, **kws)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/api.py", line 103, in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 683, in auto_device
    devobj = from_array_like(obj, stream=stream)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 621, in from_array_like
    writeback=ary, stream=stream, gpu_data=gpu_data)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 102, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 697, in memalloc
    self._attempt_allocation(allocator)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 680, in _attempt_allocation
    allocator()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 695, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 290, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 324, in _check_error
    raise CudaDriverError("CUDA initialized before forking")
numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/project.py", line 238, in <module>
    Project().main()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2721, in main
    _exit_or_raise()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2689, in main
    args.func(args)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2414, in _main_run
    run()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/legacy.py", line 193, in wrapper
    return func(self, jobs=jobs, names=names, *args, **kwargs)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1597, in run
    np=np, timeout=timeout, progress=progress)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1421, in run_operations
    pool, cloudpickle, operations, progress, timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1458, in _run_operations_in_parallel
    result.get(timeout=timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking

Relevant numba link.

System configuration

  • Operating System: Ubuntu 16.04
  • Version of Python: 3.6.8
  • Version of signac: 1.1.0
  • Version of signac-flow: 0.7.1
  • NVIDIA Driver Version: 410.72
@csadorf csadorf added the bug Something isn't working label May 23, 2019
@csadorf
Contributor

csadorf commented May 23, 2019

Hey, thanks for reporting this issue! Have you tried moving anything fbpic-related from module-level imports to the operation level? Importing those packages usually triggers the CUDA initialization.

@vyasr
Contributor

vyasr commented May 23, 2019

I concur, it's likely the from fbpic.main import Simulation that's doing the GPU initialization, I would try moving that into the operation. If that isn't enough, try the other imports as well.
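The suggested fix can be sketched in miniature with the standard library alone (fbpic itself is not imported here; `run_fbpic` and the `math` import are stand-ins). The point is that the heavy, CUDA-initializing import is deferred into the operation body, so it executes only inside each forked worker process, never in the parent that creates the pool.

```python
import multiprocessing as mp

def run_fbpic(job_id):
    # Previously at module level: `from fbpic.main import Simulation`.
    # Moved here, CUDA would first be initialized *after* the fork.
    import math  # stand-in for the heavy, CUDA-initializing import
    return math.factorial(job_id)  # stand-in for running the simulation

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # Each worker performs the import independently in its own process.
        print(pool.map(run_fbpic, range(4)))  # [1, 1, 2, 6]
```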

@berceanu
Contributor Author

You guys are awesome, it worked :)
Now I've stumbled into a different issue (should I open a new one?): all the runs are launched on the same GPU (index 0 of 8), instead of each of them claiming a separate card.

Looking at this fbpic example, they use MPI for parallel parameter scans on multiple GPUs, but I want to use signac of course! ;)

@csadorf
Contributor

csadorf commented May 23, 2019

While GitHub issues are usually not meant for tech support, I suggest we troubleshoot this as part of this issue, because it is a problem we need to solve generally.

The issue is that each operation is executed completely independently so there is no way to tell each operation what GPU to use. One way we could mitigate that is to assign each process some kind of "task number". This task number could then be stored for instance in an environment variable, read by the operation and used to compute which GPU to run on. In your example, that would look like this:

gpu = int(os.environ['SIGNAC_FLOW_TASK_ID']) % 8

Would any of the @glotzerlab/signac-developers want to give it a shot? This would be an alternative solution to the aggregation approach explored by @jglaser .
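The proposed mechanism could look like the following sketch. Note that SIGNAC_FLOW_TASK_ID is a hypothetical, not-yet-existing environment variable, and that environment variables are strings, so the integer conversion must happen before the modulo.

```python
import os

def select_gpu(n_gpus: int = 8) -> int:
    """Map a hypothetical per-process task number to a GPU index."""
    # SIGNAC_FLOW_TASK_ID is assumed to be set by signac-flow per worker.
    task_id = int(os.environ.get("SIGNAC_FLOW_TASK_ID", "0"))
    return task_id % n_gpus

# Example: task 11 on an 8-GPU machine would land on GPU 3.
os.environ["SIGNAC_FLOW_TASK_ID"] = "11"
print(select_gpu())  # 3
```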

@csadorf csadorf added enhancement New feature or request and removed bug Something isn't working labels May 23, 2019
@csadorf
Contributor

csadorf commented May 24, 2019

@joaander Josh, this issue is similar to one that you brought up a while back. I believe that what I'm suggesting here is in line with what you proposed back then?

@csadorf csadorf added the expertise needed Extra attention is needed label May 24, 2019
@joaander
Member

@csadorf Your proposed solution would provide efficient scheduling provided that (1) the number of parallel tasks is limited to the number of GPUs in the system and (2) all tasks take exactly the same amount of time. If either of these requirements is not met, this solution will result in situations where some GPUs may go unused at times and/or some GPUs may have multiple tasks assigned at times. This may or may not be desirable.

signac is not a resource manager or job scheduler and is not aware of the hardware on the system, the time it takes to run tasks, or what users are on the system. Such a system (i.e. SLURM in conjunction with the signac-flow submit functionality) would be required to obtain ideal scheduling on a multi-user system.

@berceanu If you are on a single-user workstation, you could consider enabling compute exclusive mode on your GPUs so the CUDA driver can auto-assign tasks to free GPUs. You would need to limit the amount of parallelism to the number of GPUs in the system.
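Enabling compute exclusive mode might look like the following (commands require root; the query fields are per the nvidia-smi documentation, but flags can vary across driver versions, so treat this as a sketch):

```shell
# Set all GPUs to exclusive-process compute mode, so each GPU accepts
# at most one CUDA process at a time and the driver auto-assigns tasks.
sudo nvidia-smi -c EXCLUSIVE_PROCESS

# Verify the setting per GPU.
nvidia-smi --query-gpu=index,compute_mode --format=csv
```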

@berceanu
Contributor Author

berceanu commented May 24, 2019

@joaander I set the compute mode on all 8 GPUs to "E. Process".

Now I get this error after the first operation completes on the first GPU:

ERROR: Encountered error during program execution: '[101] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_INVALID_DEVICE'
Execute with '--show-traceback' or '--debug' to get more information.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2727, in _fork_with_serialization
    project._fork(project._loads_op(operation))
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1467, in _fork
    self._operation_functions[operation.name](operation.job)
  File "src/project.py", line 129, in run_fbpic
    from fbpic.main import Simulation
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/main.py", line 20, in <module>
    mpi_select_gpus( MPI )
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/fbpic/utils/cuda.py", line 138, in mpi_select_gpus
    cuda.select_device(i_gpu)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/api.py", line 302, in select_device
    context = devices.get_context(device_id)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 194, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 162, in get_or_create_context
    return self.push_context(self.gpus[devnum])
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 130, in push_context
    ctx = self._get_or_create_context(gpu)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 120, in _get_or_create_context
    ctx = gpu.get_primary_context()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 472, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 290, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 325, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [101] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_INVALID_DEVICE
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/project.py", line 239, in <module>
    Project().main()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2721, in main
    _exit_or_raise()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2689, in main
    args.func(args)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 2414, in _main_run
    run()
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/legacy.py", line 193, in wrapper
    return func(self, jobs=jobs, names=names, *args, **kwargs)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1597, in run
    np=np, timeout=timeout, progress=progress)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1421, in run_operations
    pool, cloudpickle, operations, progress, timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/site-packages/flow/project.py", line 1458, in _run_operations_in_parallel
    result.get(timeout=timeout)
  File "/home/andrei/anaconda3/envs/signac-driven-fbpic/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
numba.cuda.cudadrv.driver.CudaAPIError: [101] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_INVALID_DEVICE

Can this be because I didn't limit the parallelism to the number of free GPUs? How do I do that?
I get the error even with --parallel=8.

@joaander
Member

With compute exclusive mode, attempting to acquire a CUDA context will result in an error if there are no free GPUs. Are you sure all 8 GPUs are free? Try a smaller number and see if that works. Check nvidia-smi to see which processes are using which GPUs.

@berceanu
Contributor Author

Yes, I tried, two are not free, so I reduced it to 6 but still get the same problem.

@joaander
Member

@csadorf Does signac-flow reuse processes for multiple tasks? This would explain this behavior. Is there a way to make it launch a new process for each task?

With reused processes you would need to clean up and destroy the CUDA context at the end of each task so the GPU is free for the next one. The library you are using would need to provide an API call to destroy the context.

@csadorf
Contributor

csadorf commented May 24, 2019

@csadorf Does signac-flow reuse processes for multiple tasks?

@joaander Whenever possible, yes, because avoiding forking is much faster for small operations, which is what the run sub-command is designed for.

However, it is possible to suppress that behavior by specifying the executable manually, e.g. with directives(executable='python'). We should probably consider adding a @fork decorator or similar that instructs signac-flow to fork without this work-around.
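What the work-around achieves can be illustrated with the standard library alone (a sketch, not signac-flow's actual implementation): each task runs in a brand-new interpreter, so no CUDA context survives from one task to the next.

```python
import subprocess
import sys

def run_in_fresh_process(code: str) -> str:
    """Run a snippet in a brand-new Python interpreter, analogous to
    what forcing an explicit executable does for each operation."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(run_in_fresh_process("print(6 * 7)"))  # 42
```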

@berceanu
Contributor Author

berceanu commented Jun 5, 2019

I documented the work-around here: glotzerlab/signac-docs#27

@berceanu
Contributor Author

berceanu commented Jun 7, 2019

I just noticed a big inconvenience in the above work-around: one has to run
python3 src/project.py submit --bundle=6 --parallel --test | /bin/bash
for every set of operations. What I mean is, each time this runs, it only executes the next eligible operations, not the whole workflow. For example, I now have two operations, one to run the simulations and another to plot the results, and I have to run this command twice: the first time it just runs the simulations and stops there, and the second time it does the plotting. This was not an issue with the usual python3 src/project.py run --parallel, which ran until all the operations were completed, not just the first batch.

@csadorf
Contributor

csadorf commented Jun 7, 2019

Yes, this inconvenience is currently being addressed in PR #114. I hope we will be able to release it soon. As a work-around until then, you could define a meta-operation manually by simply calling both functions from another function, which is the one you actually submit.
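The suggested meta-operation might be sketched like this (run_simulation and plot_results are illustrative stand-ins for the two real operations in src/project.py):

```python
def run_simulation(job):
    # Stand-in for the fbpic simulation operation.
    return f"simulated {job}"

def plot_results(job):
    # Stand-in for the plotting operation.
    return f"plotted {job}"

def simulate_and_plot(job):
    # Meta-operation: submitting this single function executes the whole
    # workflow in one go, instead of one batch of eligible operations
    # per submission.
    run_simulation(job)
    return plot_results(job)

print(simulate_and_plot("job-42"))  # plotted job-42
```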

@tcmoore3 tcmoore3 added this to the v1.0 milestone Jul 5, 2019
@bdice bdice mentioned this issue Sep 3, 2019
@vyasr
Contributor

vyasr commented Feb 26, 2020

@csadorf I think since we declined to merge glotzerlab/signac-docs#27 we have decided that handling GPU scheduling is out of scope for signac-flow. Are you fine with closing this issue? The fork directive is sufficient to prevent redundant CUDA context creation, and I think that's the most we should probably do here.

@csadorf
Contributor

csadorf commented Feb 27, 2020

Before we close the issue, I'd be interested to know whether it can be resolved with groups on the user side.

@vyasr
Contributor

vyasr commented Feb 27, 2020

That's reasonable, if there is such a solution we could at least document that.

@berceanu
Contributor Author

berceanu commented Feb 13, 2021

Solved via SLURM in #455.
