
Connect Four training must be restarted about every 24 hours due to an OOM error #1

Closed
jonathan-laurent opened this issue Mar 23, 2020 · 2 comments
@jonathan-laurent
Owner

When training the connect four agent, the training process crashes with an out of memory error about every 24 hours and must then be restarted.

It would be interesting to see if this happens again after updating the dependencies and/or switching to Flux as a DL backend.
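A workaround worth trying in the meantime (a sketch, not a confirmed fix): periodically force a full GC sweep and ask CuArrays to release cached pool memory. Here `train_iteration!(env)` is a hypothetical stand-in for one AlphaZero.jl learning iteration, and the availability of `CuArrays.reclaim` in this CuArrays version is an assumption.

```julia
using CuArrays  # CuArrays v1.x; in newer stacks this lives in CUDA.jl

for i in 1:100  # hypothetical number of training iterations
    train_iteration!(env)  # placeholder for one learning iteration
    # Force a full GC sweep so that dead KnetArray/CuArray buffers are
    # finalized, then ask the pool to return freed blocks to the driver.
    GC.gc(true)
    CuArrays.reclaim()
end
```

If the leak is fragmentation rather than genuinely live memory, this at least delays the crash; it does not address the root cause.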

Configuration

  • GPU: 8GB Nvidia RTX 2070
  • Julia 1.5.0-DEV.11 (commit 18783434e9)
  • CUDAapi v2.1.0
  • CuArrays v1.6.0
  • Knet v1.3.3 (master)

Stacktrace

ERROR: LoadError: Out of GPU memory trying to allocate 21.000 MiB
Effective GPU memory usage: 99.73% (7.772 GiB/7.793 GiB)
CuArrays GPU memory usage: 6.767 GiB
SplittingPool usage: 2.207 GiB (2.179 GiB allocated, 28.807 MiB cached)
SplittingPool efficiency: 32.20% (2.179 GiB requested, 6.767 GiB allocated)

Stacktrace:                                                                                                                                                                                                 
 [1] alloc at /home/jonathan/.julia/packages/CuArrays/rNxse/src/memory.jl:162 [inlined]                                                                                                                     
 [2] CuArrays.CuArray{UInt8,1,P} where P(::UndefInitializer, ::Tuple{Int64}) at /home/jonathan/.julia/packages/CuArrays/rNxse/src/array.jl:90                                                               
 [3] CuArray at /home/jonathan/.julia/packages/CuArrays/rNxse/src/array.jl:98 [inlined]                                                                                                                     
 [4] CuArray at /home/jonathan/.julia/packages/CuArrays/rNxse/src/array.jl:99 [inlined]                                                                                                                     
 [5] KnetPtrCu(::Int64) at /home/jonathan/.julia/packages/Knet/FSBq5/src/cuarray.jl:90                                                                                                                      
 [6] KnetPtr at /home/jonathan/.julia/packages/Knet/FSBq5/src/kptr.jl:102 [inlined]                                                                                                                         
 [7] KnetArray at /home/jonathan/.julia/packages/Knet/FSBq5/src/karray.jl:82 [inlined]                                                                                                                      
 [8] similar at /home/jonathan/.julia/packages/Knet/FSBq5/src/karray.jl:164 [inlined]                                                                                                                       
 [9] similar at /home/jonathan/.julia/packages/Knet/FSBq5/src/karray.jl:167 [inlined]                                                                                                                       
 [10] broadcasted(::typeof(+), ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,4}) at /home/jonathan/.julia/packages/Knet/FSBq5/src/binary.jl:37                                                      
 [11] +(::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,4}) at /home/jonathan/.julia/packages/Knet/FSBq5/src/binary.jl:232                                                                            
 [12] forw(::Function, ::AutoGrad.Result{Knet.KnetArray{Float32,4}}, ::Vararg{AutoGrad.Result{Knet.KnetArray{Float32,4}},N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/jonathan/.julia/packages/AutoGrad/pTNVv/src/core.jl:66
 [13] forw at /home/jonathan/.julia/packages/AutoGrad/pTNVv/src/core.jl:65 [inlined]                                                                                                                        
 [14] +(::AutoGrad.Result{Knet.KnetArray{Float32,4}}, ::AutoGrad.Result{Knet.KnetArray{Float32,4}}) at ./none:0                                                                                             
 [15] (::AlphaZero.KNets.SkipConnection)(::AutoGrad.Result{Knet.KnetArray{Float32,4}}) at /home/jonathan/AlphaZero.jl/src/networks/knet/layers.jl:104                                                       
 [16] (::AlphaZero.KNets.Chain)(::AutoGrad.Result{Knet.KnetArray{Float32,4}}) at /home/jonathan/AlphaZero.jl/src/networks/knet/layers.jl:19 (repeats 2 times)                                               
 [17] forward(::ResNet{Game}, ::Knet.KnetArray{Float32,4}) at /home/jonathan/AlphaZero.jl/src/networks/knet.jl:148
 [18] evaluate(::ResNet{Game}, ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,2}) at /home/jonathan/AlphaZero.jl/src/networks/network.jl:285
 [19] losses(::ResNet{Game}, ::LearningParams, ::Float32, ::Float32, ::Tuple{Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,4},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2}}) at /home/jonathan/AlphaZero.jl/src/learning.jl:62
 [20] (::AlphaZero.var"#loss#50"{AlphaZero.Trainer})(::Knet.KnetArray{Float32,2}, ::Vararg{Any,N} where N) at /home/jonathan/AlphaZero.jl/src/learning.jl:102
 [21] (::Knet.var"#693#694"{Knet.Minimize{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},AlphaZero.Util.var"#7#9"{AlphaZero.var"#47#51"{AlphaZero.Trainer}}}},Tuple{Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,4},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2}}})() at /home/jonathan/.julia/packages/AutoGrad/pTNVv/src/core.jl:205
 [22] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/jonathan/.julia/packages/AutoGrad/pTNVv/src/core.jl:144
 [23] differentiate at /home/jonathan/.julia/packages/AutoGrad/pTNVv/src/core.jl:135 [inlined]
 [24] iterate at /home/jonathan/.julia/packages/Knet/FSBq5/src/train.jl:23 [inlined]
 [25] iterate at ./iterators.jl:140 [inlined]
 [26] iterate at ./iterators.jl:139 [inlined]
 [27] train!(::AlphaZero.var"#49#53"{Array{Float32,1}}, ::ResNet{Game}, ::Adam, ::Function, ::Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},AlphaZero.Util.var"#7#9"{AlphaZero.var"#47#51"{AlphaZero.Trainer}}}) at /home/jonathan/AlphaZero.jl/src/networks/knet.jl:119
 [28] training_epoch!(::AlphaZero.Trainer) at /home/jonathan/AlphaZero.jl/src/learning.jl:113
 [29] macro expansion at ./util.jl:302 [inlined]
 [30] learning!(::Env{Game,ResNet{Game},StaticArrays.SArray{Tuple{7,6},UInt8,2,42}}, ::Session{Env{Game,ResNet{Game},StaticArrays.SArray{Tuple{7,6},UInt8,2,42}}}) at /home/jonathan/AlphaZero.jl/src/training.jl:165
 [31] macro expansion at ./util.jl:302 [inlined]
 [32] macro expansion at /home/jonathan/AlphaZero.jl/src/report.jl:241 [inlined]
 [33] train!(::Env{Game,ResNet{Game},StaticArrays.SArray{Tuple{7,6},UInt8,2,42}}, ::Session{Env{Game,ResNet{Game},StaticArrays.SArray{Tuple{7,6},UInt8,2,42}}}) at /home/jonathan/AlphaZero.jl/src/training.jl:266
 [34] resume!(::Session{Env{Game,ResNet{Game},StaticArrays.SArray{Tuple{7,6},UInt8,2,42}}}) at /home/jonathan/AlphaZero.jl/src/ui/session.jl:384
 [35] top-level scope at /home/jonathan/AlphaZero.jl/scripts/alphazero.jl:68
 [36] include(::Module, ::String) at ./Base.jl:377
 [37] exec_options(::Base.JLOptions) at ./client.jl:288
 [38] _start() at ./client.jl:484
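Note that the SplittingPool efficiency reported above is only 32%, which suggests fragmentation rather than a plain leak. A hedged diagnostic sketch: try a different allocator pool and log memory snapshots between iterations. The `CUARRAYS_MEMORY_POOL` environment variable and `CuArrays.memory_status` follow CuArrays v1.x conventions, but their exact names may differ across versions (assumption).

```julia
# Select the allocator pool BEFORE loading CuArrays; the trace shows the
# SplittingPool active, so trying the binned pool may reduce fragmentation.
# (Env-var name assumed from CuArrays v1.x conventions.)
ENV["CUARRAYS_MEMORY_POOL"] = "binned"
using CuArrays

# Print a snapshot of driver and pool memory usage; this is the same kind
# of statistics shown in the OOM error message above.
CuArrays.memory_status()
```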
@jonathan-laurent jonathan-laurent self-assigned this Mar 23, 2020

jonathan-laurent commented Apr 8, 2020

Replicated with Julia 1.4.0, CuArrays 1.7.3, and Knet 1.3.4.
Looking at the generated performance reports, the ratio of time spent in GC for each iteration grows until the whole thing crashes (after about 40 hours in this case).
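For reference, the per-iteration GC fraction can be measured directly with `@timed` (on Julia 1.4 it returns a tuple including `gctime`). `train_iteration!(env)` below is a hypothetical stand-in for one learning iteration:

```julia
# Measure the fraction of wall time spent in GC for one iteration.
# On Julia 1.4, @timed returns (value, time, bytes, gctime, gcstats).
val, elapsed, bytes, gctime, gcstats = @timed train_iteration!(env)
@info "GC fraction" frac = gctime / elapsed
```

A GC fraction that grows monotonically across iterations is consistent with the pool fragmenting over time.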

@jonathan-laurent
Owner Author

Does not seem to happen anymore since v0.3.
