Training Halts when Using CuArrays #691
How does training halt? You need to be getting some kind of error or reason it halts, right? Can you profile the code and see where the time goes? A typical problem is memory: GPUs have much less RAM, which doesn't compose well with Julia's GC when running close to the memory limit. Yours has 11GB though, so unless your model is huge that should generally work fine. Also, which version of Julia?
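To see whether a run is creeping up against the memory limit, it can help to log device memory each epoch. A minimal sketch, assuming `CUDAdrv.Mem.info()` returns `(free, total)` in bytes as in the CUDAdrv versions contemporary with CuArrays 2.x (check your installed version):

```julia
using CUDAdrv

# Log how much device memory is in use, e.g. once per epoch, to see
# whether the training loop is approaching the 11 GB limit.
# Assumption: CUDAdrv.Mem.info() -> (free, total) in bytes.
function report_gpu_memory(epoch)
    free, total = CUDAdrv.Mem.info()
    used_mb = (total - free) / 2^20
    @info "GPU memory" epoch used_mb total_mb = total / 2^20
end
```

If usage climbs steadily, forcing `GC.gc()` periodically inside the loop is a crude way to test whether GC pressure is the culprit.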
I have tested this on Julia 1.3.1 and Julia 1.4.1, and both have this problem. I also wonder if this is related to #350.
Ah, so that problem again. I thought training exited, but it hangs, which is consistent with the GC taking up all the time. This is a tough problem, but it's good to have another (small-ish) reproducer. You can also try using the new, WIP, memory pool.
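For reference, a sketch of selecting a different pool. The environment variable name `CUARRAYS_MEMORY_POOL` and the `split` pool value are assumptions based on the CuArrays 2.x README; the stacktrace below shows the default binned pool (`memory/binned.jl`) in use. Check the README for your installed version:

```julia
# Select the experimental "split" memory pool instead of the default
# binned pool. The variable must be set before CuArrays is loaded,
# e.g. at the top of the script or in the shell environment.
# Assumption: CuArrays 2.x reads CUARRAYS_MEMORY_POOL at load time.
ENV["CUARRAYS_MEMORY_POOL"] = "split"
using CuArrays
```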
Hey @maleadt, I work with @lpjiang97 and spent a bit of time looking into this. For what it's worth, I was able to replicate this on my machine (information below) in 3/5 attempts, each in a fresh Julia session. When it does lock up, the stacktrace shows that it's waiting on the lock in either alloc or free:

```
Stacktrace:
 [1] top-level scope at /home/colinxs/workspace/dev/Experiments/flux/foo/debug0.jl:29
 [2] lock(::Base.Threads.SpinLock) at ./locks-mt.jl:71
 [3] macro expansion at ./lock.jl:181 [inlined]
 [4] free(::CUDAdrv.CuPtr{Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory/binned.jl:393
 [5] macro expansion at /home/colinxs/.julia/packages/TimerOutputs/NvIUx/src/TimerOutput.jl:245 [inlined]
 [6] macro expansion at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory.jl:231 [inlined]
 [7] macro expansion at ./util.jl:234 [inlined]
 [8] free at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory.jl:230 [inlined]
 [9] _unsafe_free!(::CuArray{Float32,2,Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/array.jl:51
 [10] unsafe_free!(::CuArray{Float32,2,Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/array.jl:40
```

Single GPU (1050):

```
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_DOWNLOAD = /home/colinxs/pkg/installed/julia
  JULIA_NUM_THREADS = 6
  JULIA_PKG_DEVDIR = /home/colinxs/workspace/juliadev

julia> Pkg.status()
Status `~/workspace/dev/Experiments/flux/foo/Project.toml`
  [3895d2a7] CUDAapi v4.0.0
  [be33ccc6] CUDAnative v3.0.4
  [3a865a2d] CuArrays v2.1.0
  [587475ba] Flux v0.10.4
```
I should've checked the open issues first; it appears you're already well aware of this: #685
Correct, I suspected a performance issue, but the backtrace is useful in identifying the actual issue. I'll have a look at the deadlock, since a couple of users have been running into this.
I'm fairly new to `CuArrays` and `Flux`, and I met this problem of training halting after some epochs. There is no CUDA out-of-memory error, but GPU utilization is extremely high for this simple linear model (99.97% on a 1080 Ti). The code would sometimes finish all 500 epochs without problems, but other times halt around epoch 150.

I'm having this problem on Ubuntu 18.04, using `CuArrays` v2.1.0. I would appreciate some pointers on this.
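The original script is not included in the report. A minimal sketch of the kind of loop described, a simple linear `Flux` model trained on the GPU for 500 epochs; the model shape, the random data, and the use of `Flux.train!` are all assumptions for illustration, not the reporter's actual code:

```julia
using Flux, CuArrays

# Hypothetical stand-in for the reported setup: a single Dense layer
# trained on random data, moved to the GPU with gpu().
X = gpu(rand(Float32, 100, 1000))
Y = gpu(rand(Float32, 10, 1000))
model = gpu(Dense(100, 10))
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM()

for epoch in 1:500
    Flux.train!(loss, Flux.params(model), [(X, Y)], opt)
    # Training reportedly hangs nondeterministically around epoch 150.
    @info "epoch" epoch loss(X, Y)
end
```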