
[BUG] cudaErrorMemoryAllocation with KOKKOS on Volta GPU #1473

Closed
danicholson opened this issue May 22, 2019 · 26 comments · Fixed by #1474

@danicholson
Collaborator

Summary

My KOKKOS/CUDA simulation crashes with a cudaErrorMemoryAllocation after ~14 million steps on a Titan V GPU. It is a molecular system without electrostatics.

LAMMPS Version and Platform

LAMMPS: 15 May 2019
OS: Centos 7
GCC: 4.8.5
CUDA: 9.1
GPU: Titan V
CPU: Xeon E5-2630 v4

compiled with:

cmake -D BUILD_MPI=no -D BUILD_OMP=no -D PKG_MOLECULE=yes -D KOKKOS_ARCH="BDW;Volta70" -D PKG_KOKKOS=yes -D KOKKOS_ENABLE_CUDA=yes -D KOKKOS_ENABLE_OPENMP=no -D KOKKOS_ENABLE_DEBUG=yes -D CMAKE_CXX_COMPILER=/home/david/git/lammps-clean/lib/kokkos/bin/nvcc_wrapper -D CMAKE_BUILD_TYPE=Debug ../../cmake/

Expected Behavior

The script runs without issue on CPUs.

Actual Behavior

The simulation crashes with the following error:

warning: Cuda API error detected: cudaCreateTextureObject returned (0x2)

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaCreateTextureObject( & tex_obj , & resDesc, & texDesc, NULL ) error( cudaErrorMemoryAllocation): out of memory /home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:290
Traceback functionality not available

Steps to Reproduce

Unfortunately I don't have a small system to reproduce this error quickly. I run the script below with default Kokkos settings "-sf kk -k on g 1".

Further Information, Files, and Links

files:
relax.in.txt
relax_440K.data.txt

Call stack from cuda-gdb:

#0  0x00002aaaacff0207 in raise () from /lib64/libc.so.6
#1  0x00002aaaacff18f8 in abort () from /lib64/libc.so.6
#2  0x00002aaaac7fb7d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00002aaaac7f9746 in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib64/libstdc++.so.6
#4  0x00002aaaac7f9773 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00002aaaac7f9993 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x00000000023a22ea in Kokkos::Impl::throw_runtime_exception (msg=...)
    at /home/david/git/lammps-clean/lib/kokkos/core/src/impl/Kokkos_Error.cpp:72
#7  0x00000000023ab169 in Kokkos::Impl::cuda_internal_error_throw (e=cudaErrorMemoryAllocation, 
    name=0x24ed160 <Kokkos::(anonymous namespace)::AllowPadding+792> "cudaCreateTextureObject( & tex_obj , & resDesc, & texDesc, NULL )", 
    file=0x24ece68 <Kokkos::(anonymous namespace)::AllowPadding+32> "/home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp", 
    line=290) at /home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:129
#8  0x00000000017e12ba in Kokkos::Impl::cuda_internal_safe_call (e=cudaErrorMemoryAllocation, 
    name=0x24ed160 <Kokkos::(anonymous namespace)::AllowPadding+792> "cudaCreateTextureObject( & tex_obj , & resDesc, & texDesc, NULL )", 
    file=0x24ece68 <Kokkos::(anonymous namespace)::AllowPadding+32> "/home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp", 
    line=290) at /home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp:58
#9  0x00000000023a9148 in Kokkos::Impl::SharedAllocationRecord<Kokkos::CudaSpace, void>::attach_texture_object (sizeof_alias=4, 
    alloc_ptr=0x2aaadd022a00, alloc_size=60136) at /home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:290
#10 0x000000000197b1f0 in Kokkos::Impl::SharedAllocationRecord<Kokkos::CudaSpace, void>::attach_texture_object<int> (this=0x23737ea0)
    at /home/david/git/lammps-clean/lib/kokkos/core/src/Kokkos_CudaSpace.hpp:776
#11 0x0000000001978e2e in Kokkos::Impl::CudaTextureFetch<int const, int>::CudaTextureFetch<Kokkos::CudaSpace> (this=0x7fffffffd2e0, 
    arg_ptr=0x2aaadd022a80, record=0x23737ea0) at /home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_Cuda_View.hpp:129
#12 0x000000000197230d in Kokkos::Impl::ViewDataHandle<Kokkos::ViewTraits<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<2u> >, void>::assign (arg_data_ptr=0x2aaadd022a80, arg_tracker=...)
    at /home/david/git/lammps-clean/lib/kokkos/core/src/Cuda/Kokkos_Cuda_View.hpp:301
#13 0x000000000196ce22 in Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<2u> >, Kokkos::ViewTraits<int*, Kokkos::LayoutLeft, Kokkos::Cuda, void>, void>::assign(Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<2u> ><void> >&, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::LayoutLeft, Kokkos::Cuda, void><void> > const&, Kokkos::Impl::SharedAllocationTracker const&) (dst=..., src=..., 
    src_track=...) at /home/david/git/lammps-clean/lib/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3009
#14 0x000000000196b6ef in Kokkos::View<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<2u> >::operator=<int*, Kokkos::LayoutLeft, Kokkos::Cuda, void>(Kokkos::View<int*<Kokkos::LayoutLeft, Kokkos::Cuda, void> > const&) (this=0x22354df0, rhs=...)
    at /home/david/git/lammps-clean/lib/kokkos/core/src/Kokkos_View.hpp:1985
#15 0x00000000019511cb in LAMMPS_NS::NeighBondKokkos<Kokkos::Cuda>::build_topology_kk (this=0x22354a10)
    at /home/david/git/lammps-clean/src/KOKKOS/neigh_bond_kokkos.cpp:223
#16 0x00000000019351fe in LAMMPS_NS::NeighborKokkos::build_topology (this=0x22353530)
    at /home/david/git/lammps-clean/src/KOKKOS/neighbor_kokkos.cpp:388
#17 0x0000000001938f26 in LAMMPS_NS::NeighborKokkos::build_kokkos<Kokkos::Cuda> (this=0x22353530, topoflag=1)
    at /home/david/git/lammps-clean/src/KOKKOS/neighbor_kokkos.cpp:322
#18 0x0000000001934e24 in LAMMPS_NS::NeighborKokkos::build (this=0x22353530, topoflag=1)
    at /home/david/git/lammps-clean/src/KOKKOS/neighbor_kokkos.cpp:237
#19 0x0000000001d35b0f in LAMMPS_NS::VerletKokkos::run (this=0x223716b0, n=20000000) at /home/david/git/lammps-clean/src/KOKKOS/verlet_kokkos.cpp:397
#20 0x00000000017cf780 in LAMMPS_NS::Run::command (this=0x7fffffffd990, narg=1, arg=0x22361f10) at /home/david/git/lammps-clean/src/run.cpp:183
#21 0x0000000001671353 in LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run> (lmp=0xcfb7710, narg=1, arg=0x22361f10)
    at /home/david/git/lammps-clean/src/input.cpp:873
#22 0x000000000166b1b8 in LAMMPS_NS::Input::execute_command (this=0x22315490) at /home/david/git/lammps-clean/src/input.cpp:856
#23 0x000000000166898c in LAMMPS_NS::Input::file (this=0x22315490) at /home/david/git/lammps-clean/src/input.cpp:243
#24 0x00000000013f00b3 in main (argc=9, argv=0x7fffffffdc48) at /home/david/git/lammps-clean/src/main.cpp:64
@stanmoore1
Contributor

@danicholson you are running out of memory: error( cudaErrorMemoryAllocation): out of memory. The Titan V only has 12 GB of GPU memory, which is probably much less than your CPU has. From your stack trace, it runs out of memory when building an atom map. Currently the Kokkos package only supports the "array" style atom map, which is more memory-intensive than the "hash" style. That said, with only 15000 atoms I wouldn't expect OOM. I'll take a look; the Kokkos library has some nice memory profiling tools.
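
For illustration only (this is not the LAMMPS implementation, and the names below are hypothetical), the memory difference between the two map styles is roughly the following:

// Illustrative sketch only; not the LAMMPS code.
#include <unordered_map>
#include <vector>

// "array" style: direct indexing by global atom tag. Memory scales with the
// largest tag value in the system, even for atoms this rank never touches.
std::vector<int> map_array;              // map_array[tag] = local index, size = max_tag + 1

// "hash" style: memory scales only with owned + ghost atoms on this rank.
std::unordered_map<int, int> map_hash;   // key = global tag, value = local index

int lookup_array(int tag) {
  return map_array[tag];
}

int lookup_hash(int tag) {
  auto it = map_hash.find(tag);
  return (it == map_hash.end()) ? -1 : it->second;
}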

@danicholson
Collaborator Author

@stanmoore1 Thanks for the reply. In a previous test, nvidia-smi did not report high memory usage. I wrote a script to check it every minute or so, and the usage was ~600 MiB at the last check before the error occurred. I also used the KOKKOS tool to check the GPU memory usage for a shorter simulation and it was stationary around a similar value. Could this error be due to memory fragmentation?

@stanmoore1
Contributor

usage was ~600 MiB

Yeah, that seems reasonable for your system size. The error mentions texture memory (i.e. Kokkos randomread memory), which that atom map variable uses; perhaps switching to regular global memory would fix the issue.

@stanmoore1
Contributor

This issue sounds very similar to #542.

@stanmoore1
Contributor

I've checked with Valgrind and the Kokkos profiling tools and I don't see any memory growth over time. Can you describe how you are getting the LAMMPS source code? Are you cloning the Git repo? Just to be absolutely certain, did you use either make pu or make yes-kokkos after you updated your repo?

@stanmoore1
Contributor

I guess you are using cmake, so the comment about make pu wouldn't apply.

@danicholson
Collaborator Author

For this test I cloned the repo fresh and checked out the unstable branch. I built using cmake (the command is in the issue description above) so I did not execute make pu or make yes-kokkos.

@danicholson
Collaborator Author

I can do a test run using global memory rather than texture memory. Would this just require declaring map_array as typename AT::t_int_1d rather than typename AT::t_int_1d_randomread in neigh_bond_kokkos.h?

@stanmoore1
Contributor

Yes, though looking back at #542, it also failed with the exact same error, and the root cause there was a memory leak, not texture memory, so I'm doubtful that will help.
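
For reference, a rough sketch of the two view types being discussed; these are not the actual LAMMPS typedefs, which live in the KOKKOS package headers and may differ in layout details:

#include <Kokkos_Core.hpp>

// Plain device view: reads go through regular global memory.
using t_int_1d = Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::CudaSpace>;

// RandomAccess ("randomread") view: on CUDA this is read through a texture
// object, which is what attach_texture_object() in the stack trace sets up.
using t_int_1d_randomread =
    Kokkos::View<const int*, Kokkos::LayoutLeft, Kokkos::CudaSpace,
                 Kokkos::MemoryTraits<Kokkos::RandomAccess>>;

// The change discussed above amounts to swapping the member declaration in
// neigh_bond_kokkos.h from
//   typename AT::t_int_1d_randomread map_array;
// to
//   typename AT::t_int_1d map_array;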

@stanmoore1
Contributor

I'm running this on a V100 GPU to see if I can reproduce the OOM. It will take several hours to reach 14 million timesteps, and it may need to run even longer since the V100 has 16 GB of memory versus 12 GB for the Titan V.

@stanmoore1
Contributor

Other than a memory leak, something else in the simulation could be blowing up, which then leads to a large memory allocation.

You could try writing a restart file every 1 million timesteps, and then, after it fails at 14 million timesteps, read back in the latest restart file from before the failure and try running it again. If it fails at the same spot as before, then something is probably wrong with the simulation. If it runs for another 14 million timesteps and then fails, it could be a memory leak.

@danicholson
Collaborator Author

Thank you very much for your attention to this issue. I will run the test that you suggested. Based on a CPU-only run with the same input, this system should reach a stable equilibrium in a few nanoseconds, but I agree that it is worth checking.

@stanmoore1
Contributor

Based on a CPU-only run with the same input, this system should reach a stable equilibrium in a few nanoseconds

Yes, I mean there may be a bug, triggered by very rare events, that causes an atom to explode out of the box, or something like that.

@stanmoore1
Contributor

This looks more like a memory leak than that type of bug, though; I just can't find any evidence yet.

@danicholson
Collaborator Author

A few things:

  1. Starting from a restart at 12 million steps, the script runs for 14 million steps and then quits with the same error.

  2. The GPU memory usage and host RSS increase with time, but never approach the host/device limits (attached plots: mem_GPU_randread, RSS_host_randread).

  3. When using device memory for both map_array and sametag, the script runs without error. I did not monitor memory usage for the whole run, only the last few minutes. For this period, the RSS and GPU memory usage were constant at values of 680 MB and 530 MiB respectively. These are close to the starting values of the runs performed with texture memory for these arrays.

Based on the profiling tool, it looks like most of the large allocations during the run are for sametag and map_array, hence the decision to use device memory for both rather than just map_array. To me, it looks like there is some sort of leak related to the texture memory.

@stanmoore1
Contributor

I can reproduce this on V100. It failed just before 14 million timesteps.

@stanmoore1
Contributor

@crtrott any ideas?

@stanmoore1
Contributor

@danicholson I submitted another job on V100 with memory profiling to see if I can reproduce the memory growth you saw.

@danicholson
Collaborator Author

@stanmoore1 Just to clarify, I did not use the Kokkos profiling tools to monitor the memory. I used pmap and nvidia-smi.

@stanmoore1
Contributor

@danicholson understood. The Kokkos tools use getrusage to get total host memory, which should show the same RSS growth you saw. For the GPU memory, the Kokkos tools will tell whether or not the leak is in Kokkos Views.
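
As an aside, a minimal sketch of sampling host memory via getrusage, the mechanism mentioned above; this is not the Kokkos tools code itself:

#include <sys/resource.h>
#include <cstdio>

// Report the peak resident set size of the calling process.
// On Linux, ru_maxrss is in kilobytes.
long peak_rss_kb() {
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  return usage.ru_maxrss;
}

int main() {
  std::printf("peak RSS so far: %ld kB\n", peak_rss_kb());
  return 0;
}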

@stanmoore1
Contributor

I do see significant host RSS growth over 12 million timesteps. However, the memory in Kokkos Views stays constant, at least according to the Kokkos profiling tool.

@stanmoore1
Contributor

@danicholson I can confirm that the leak goes away if I don't use texture memory for map_array and sametag. This looks like a bug outside LAMMPS, i.e. in Kokkos or CUDA. That said, the code shouldn't be reallocating those arrays every time the neighbor list is built; I'm guessing that change would also fix the problem.
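
A sketch of the reallocate-only-when-needed pattern being described here; the actual change is in #1474, and the helper and names below are hypothetical:

#include <Kokkos_Core.hpp>

using t_int_1d = Kokkos::View<int*, Kokkos::CudaSpace>;

// Only reallocate when the required size exceeds the view's current extent,
// instead of creating a fresh allocation on every neighbor-list build.
void grow_if_needed(t_int_1d &view, size_t needed, const char *label) {
  if (view.extent(0) < needed)
    view = t_int_1d(Kokkos::view_alloc(Kokkos::WithoutInitializing, label), needed);
}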

@stanmoore1
Contributor

@danicholson my test shows #1474 fixes this issue. It could help performance a little too, since the code isn't reallocating as often.

@stanmoore1
Contributor

I created a small reproducer and reported this to the Kokkos library developers: kokkos/kokkos#2155. @danicholson thanks for the bug report.
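
The actual reproducer is the one linked in kokkos/kokkos#2155 and may differ; for context, a sketch of the allocation pattern implicated by the stack trace above, namely repeatedly binding a RandomAccess alias (which attaches a CUDA texture object) to a freshly allocated view:

#include <Kokkos_Core.hpp>

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  {
    using plain_view = Kokkos::View<int*, Kokkos::CudaSpace>;
    using randomread_view =
        Kokkos::View<const int*, Kokkos::CudaSpace,
                     Kokkos::MemoryTraits<Kokkos::RandomAccess>>;

    // Mimic the per-build reallocation of map_array/sametag: allocate a view,
    // bind a texture-backed alias to it, then drop both each iteration.
    for (int i = 0; i < 100000; ++i) {
      plain_view v("v", 15000);
      randomread_view r = v;   // attach_texture_object() happens here on CUDA
      // v and r fall out of scope; the allocation and its texture object
      // should be released every iteration
    }
  }
  Kokkos::finalize();
  return 0;
}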

@danicholson
Collaborator Author


@stanmoore1 Great, thank you for the investigation and the fix. I'm running it now, and the memory usage looks stable.

@stanmoore1
Contributor

@danicholson sure, let us know if you see other issues.
