
CUDA platform error: CUDA_ERROR_INVALID_PTX #2251

Closed · tristanic opened this issue Jan 28, 2019 · 4 comments
@tristanic (Contributor)

This seems to be a problem either with my specific environment or with how OpenMM has been ported into ChimeraX, but there's not a lot for me to go on in the traceback. This happens with both OpenMM 7.2.2 and 7.3, and isn't specific to a given simulation (I get the same error if I try to run OpenMM's benchmark.py using ChimeraX's Python). Really just looking for pointers on where to start with debugging.

Traceback (most recent call last):
  File "/home/tic20/.local/share/ChimeraX/0.9/site-packages/chimerax/isolde/isolde.py", line 2482, in _start_sim_or_toggle_pause
    self.start_sim()
  File "/home/tic20/.local/share/ChimeraX/0.9/site-packages/chimerax/isolde/isolde.py", line 2508, in start_sim
    sm.start_sim()
  File "/home/tic20/.local/share/ChimeraX/0.9/site-packages/chimerax/isolde/openmm/openmm_interface.py", line 632, in start_sim
    sh.start_sim()
  File "/home/tic20/.local/share/ChimeraX/0.9/site-packages/chimerax/isolde/openmm/openmm_interface.py", line 1393, in start_sim
    self._prepare_sim()
  File "/home/tic20/.local/share/ChimeraX/0.9/site-packages/chimerax/isolde/openmm/openmm_interface.py", line 1355, in _prepare_sim
    integrator, platform)
  File "/opt/UCSF/ChimeraX-daily/lib/python3.7/site-packages/simtk/openmm/app/simulation.py", line 103, in __init__
    self.context = mm.Context(self.system, self.integrator, platform)
  File "/opt/UCSF/ChimeraX-daily/lib/python3.7/site-packages/simtk/openmm/openmm.py", line 12231, in __init__
    this = _openmm.new_Context(*args)
Exception: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218)

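For reference, the failing step can be reproduced outside ChimeraX/ISOLDE with a few lines of Python. This is a minimal sketch, not ISOLDE's actual setup: the one-particle system and VerletIntegrator are placeholders, but Context creation is where OpenMM compiles and loads its CUDA kernels, so it exercises the same code path:

    from simtk import openmm as mm
    from simtk import unit

    # A trivial one-particle system; no input files needed.
    system = mm.System()
    system.addParticle(1.0 * unit.amu)
    integrator = mm.VerletIntegrator(1.0 * unit.femtoseconds)
    platform = mm.Platform.getPlatformByName('CUDA')

    # Context creation compiles the CUDA kernels and loads the resulting PTX
    # into the driver; this is the call that raises CUDA_ERROR_INVALID_PTX.
    context = mm.Context(system, integrator, platform)
    print('CUDA context created OK')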
@peastman (Member)

Sorry, not much idea. That's the sort of error that mostly just shouldn't happen. It could be an error in the CUDA compiler, a corrupted file on disk, an attempt to load an out-of-date file, linking against one CUDA toolkit while using the compiler from a different version, or various other things. For what it's worth, here's where that error message gets generated:

https://github.com/pandegroup/openmm/blob/master/platforms/cuda/src/CudaContext.cpp#L675

Perhaps you can figure out what's wrong with the PTX it's trying to load.
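One quick sanity check for the mismatched-toolkit theory is to compare the compiler and driver versions in the exact environment OpenMM runs in. A rough sketch, assuming nvcc and nvidia-smi are on PATH (standard CUDA tools, nothing OpenMM-specific):

    import subprocess

    # Toolkit/compiler that will be used to build the PTX.
    print(subprocess.check_output(['nvcc', '--version']).decode())

    # Installed display driver, which must be new enough to load that PTX.
    print(subprocess.check_output(
        ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader']
    ).decode())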

@tristanic (Contributor, Author)

Haven't dug into the code yet, but I have compiled OpenMM 7.3 against CUDA 8.0 on my Fedora 25 machine (it seems I'll have to update to Fedora 27 if I want CUDA 9.2) and installed it into the ChimeraX environment. That works fine. On the other hand, running from an installation of OpenMM 7.3 in a fresh Anaconda virtualenv on my CentOS 7 machine (with the CUDA 9.2 library and bin dirs first in LD_LIBRARY_PATH and PATH respectively) gives the same CUDA_ERROR_INVALID_PTX. So it's an environment problem, nothing to do with ChimeraX. I guess the question now is: is this a problem with CentOS 7 in general, or just with my machine? Will try some debugging and see.
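A useful first check in a fresh environment like that is to confirm which OpenMM build and which platforms are actually being picked up. A sketch using standard OpenMM API calls:

    import os
    from simtk import openmm as mm

    # Which OpenMM build and CUDA paths does this environment actually see?
    print('OpenMM version:', mm.version.full_version)
    print('LD_LIBRARY_PATH:', os.environ.get('LD_LIBRARY_PATH'))

    # List the platforms OpenMM managed to load (CUDA should be among them).
    for i in range(mm.Platform.getNumPlatforms()):
        p = mm.Platform.getPlatform(i)
        print('Platform:', p.getName(), '- speed', p.getSpeed())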

@tristanic (Contributor, Author)

Edited CudaContext.cpp to report the nvcc command and the filename it fails on to stderr, then die immediately so I could catch the temp files. It dies on the very first one (.cu, .log and .ptx attached). So it's building the kernel with the correct nvcc version and writing it safely, but failing when it reads the PTX back in.

ptx.zip
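For anyone who would rather not patch CudaContext.cpp: OpenMM 7.x exposes CUDA platform properties that let you pin the compiler and choose where the intermediate files are written. A sketch, assuming the documented CudaCompiler and CudaTempDirectory property names (the nvcc path below is hypothetical, and OpenMM may still clean the files up after compiling, so this is a convenience rather than a full substitute for the stderr patch):

    from simtk import openmm as mm

    platform = mm.Platform.getPlatformByName('CUDA')

    # Pin the compiler OpenMM invokes (path is hypothetical).
    platform.setPropertyDefaultValue('CudaCompiler',
                                     '/usr/local/cuda-9.2/bin/nvcc')

    # Write the generated .cu/.ptx files to a known, easy-to-inspect place.
    platform.setPropertyDefaultValue('CudaTempDirectory', '/tmp/openmm-ptx')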

@tristanic (Contributor, Author)

At the end of all that, it's a boring old driver incompatibility. Updated my display driver from 390 to 415, and all is well.
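That outcome fits the PTX evidence: a 390-series driver supports CUDA up to roughly 9.1 and so cannot JIT-compile the newer PTX that a CUDA 9.2 nvcc emits, while the 415 series supports CUDA 10.0 and later PTX. The ISA a kernel was built for is recorded in the .version directive near the top of the file; a quick sketch for checking a dumped PTX file (the path is hypothetical, and the version mapping is approximate):

    # CUDA 9.2's nvcc emits ".version 6.2"; a 390-series driver only
    # understands PTX ISA up to roughly 6.1, hence CUDA_ERROR_INVALID_PTX.
    with open('/tmp/openmm-ptx/kernel.ptx') as f:  # hypothetical path
        for line in f:
            if line.startswith('.version'):
                print(line.strip())  # e.g. ".version 6.2"
                break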
