UndefVarError: lib not defined when training a connect four agent #5

Closed
brianprichardson opened this issue Apr 8, 2020 · 9 comments

@brianprichardson

I am trying to train an agent by following the instructions in the "Training a Connect Four Agent" section.
Setup: Ubuntu 18.04 with an RTX 2080 Ti.
At first I thought it might be a Julia version issue.
I tried both 1.4.0 and 1.3.1, but both produce an error (1.4.0 prints more warning-type output).
Perhaps I'm doing something wrong:

brian@1920x-Ubuntu:~$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia>
brian@1920x-Ubuntu:~$ git clone https://github.com/jonathan-laurent/AlphaZero.jl.git
Cloning into 'AlphaZero.jl'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 5859 (delta 15), reused 47 (delta 15), pack-reused 5812
Receiving objects: 100% (5859/5859), 8.56 MiB | 12.84 MiB/s, done.
Resolving deltas: 100% (3141/3141), done.
brian@1920x-Ubuntu:~$ cd AlphaZero.jl/
brian@1920x-Ubuntu:~/AlphaZero.jl$ julia --project -e "import Pkg; Pkg.instantiate()"
Updating registry at ~/.julia/registries/General
Updating git-repo https://github.com/JuliaRegistries/General.git
brian@1920x-Ubuntu:~/AlphaZero.jl$ julia --project --color=yes scripts/alphazero.jl --game connect-four train
CuArrays.jl SplittingPool statistics:
  • 0 pool allocations: 0 bytes in 0.0s
  • 0 CUDA allocations: 0 bytes in 0.0s
CuArrays.jl SplittingPool statistics:
  • 0 pool allocations: 0 bytes in 0.0s
  • 0 CUDA allocations: 0 bytes in 0.0s

Initializing a new AlphaZero environment

Initial report

Number of network parameters: 617,480
Number of regularized network parameters: 617,408
Memory footprint per MCTS node: 380 bytes

Running benchmark: AlphaZero against MCTS (1000 rollouts)

UndefVarError: lib not defined
Stacktrace:
[1] broadcasted(::typeof(NNlib.relu), ::Knet.KnetArray{Float32,4}) at /home/brian/.julia/packages/Knet/vxHRi/src/unary.jl:17
[2] (::AlphaZero.KNets.BatchNorm)(::Knet.KnetArray{Float32,4}) at /home/brian/AlphaZero.jl/src/networks/knet/layers.jl:85
[3] (::AlphaZero.KNets.Chain)(::Knet.KnetArray{Float32,4}) at /home/brian/AlphaZero.jl/src/networks/knet/layers.jl:19
[4] forward(::ResNet{Game}, ::Knet.KnetArray{Float32,4}) at /home/brian/AlphaZero.jl/src/networks/knet.jl:148
[5] evaluate(::ResNet{Game}, ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,2}) at /home/brian/AlphaZero.jl/src/networks/network.jl:288
[6] evaluate_batch(::ResNet{Game}, ::Array{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},1}) at /home/brian/AlphaZero.jl/src/networks/network.jl:313
[7] inference_server(::AlphaZero.MCTS.Env{Game,StaticArrays.SArray{Tuple{7,6},UInt8,2,42},ResNet{Game}}) at ./util.jl:288
[8] macro expansion at /home/brian/AlphaZero.jl/src/util.jl:64 [inlined]
[9] (::AlphaZero.MCTS.var"#21#23"{AlphaZero.MCTS.Env{Game,StaticArrays.SArray{Tuple{7,6},UInt8,2,42},ResNet{Game}}})() at ./task.jl:333

(The run hangs at this point, so I pressed Ctrl-C:)

^C
signal (2): Interrupt
in expression starting at /home/brian/AlphaZero.jl/scripts/alphazero.jl:70
epoll_pwait at /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/epoll_pwait.c:42
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:270
uv_run at /workspace/srcdir/libuv/src/unix/core.c:359
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:448
poptaskref at ./task.jl:660
wait at ./task.jl:667
wait at ./condition.jl:106
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2135 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
_wait at ./task.jl:238
sync_end at ./task.jl:278
macro expansion at ./task.jl:319 [inlined]
macro expansion at /home/brian/AlphaZero.jl/src/mcts.jl:427 [inlined]
macro expansion at ./util.jl:212 [inlined]
explore_async! at /home/brian/AlphaZero.jl/src/mcts.jl:426
explore! at /home/brian/AlphaZero.jl/src/mcts.jl:452 [inlined]
think at /home/brian/AlphaZero.jl/src/play.jl:176 [inlined]
#play_game#90 at /home/brian/AlphaZero.jl/src/play.jl:246
#play_game at ./none:0 [inlined]
#pit#93 at /home/brian/AlphaZero.jl/src/play.jl:296
unknown function (ip: 0x7efca1f99dd9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2141 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
#pit at ./none:0
unknown function (ip: 0x7efca1f99a4a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2141 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
macro expansion at /home/brian/AlphaZero.jl/src/benchmark.jl:111 [inlined]
macro expansion at ./util.jl:288 [inlined]
run at /home/brian/AlphaZero.jl/src/benchmark.jl:110
run_duel at /home/brian/AlphaZero.jl/src/ui/session.jl:252
run_benchmark at /home/brian/AlphaZero.jl/src/ui/session.jl:275
zeroth_iteration! at /home/brian/AlphaZero.jl/src/ui/session.jl:285
#Session#126 at /home/brian/AlphaZero.jl/src/ui/session.jl:356
Type at ./none:0
unknown function (ip: 0x7efca1f42f79)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2141 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1631 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:328
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:417
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:368 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:778
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:888
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7efcbc3d6c0f)
unknown function (ip: 0x7)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:897
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:814
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:873
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:878
include at ./boot.jl:328 [inlined]
include_relative at ./loading.jl:1105
include at ./Base.jl:31
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2135 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
exec_options at ./client.jl:287
_start at ./client.jl:460
jfptr__start_2084.clone_1 at /opt/julia-1.3.1/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2135 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2305
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310
unknown function (ip: 0x4015d4)
unknown function (ip: 0xffffffffffffffff)
Allocations: 159067857 (Pool: 159028147; Big: 39710); GC: 99
CuArrays.jl SplittingPool statistics:
  • 87 pool allocations: 6.728 MiB in 1.18s
  • 47 CUDA allocations: 4.368 MiB in 0.02s
brian@1920x-Ubuntu:~/AlphaZero.jl$
@jonathan-laurent
Owner

jonathan-laurent commented Apr 8, 2020

This looks like a Knet error. I can find several instances of "lib not defined" errors after a quick Google search, such as denizyuret/Knet.jl#411.

Would you mind telling me what version of Knet you are using? To figure it out, just run:

julia --project -e "import Pkg; Pkg.status()"

I know that Knet can be a bit tricky to install sometimes, but I have mostly heard of problems from Windows users. Have you ever managed to get any Knet example to work on your machine?

@brianprichardson
Author

brianprichardson commented Apr 8, 2020

This is my first experience with Julia and Knet.
I did not realize that Knet also needed to be installed separately.

brian@1920x-Ubuntu:~/AlphaZero.jl$ julia --project -e "import Pkg; Pkg.status()"
Project AlphaZero v0.1.0
Status ~/AlphaZero.jl/Project.toml
[c7e460c6] ArgParse v1.1.0
[3895d2a7] CUDAapi v3.1.0
[35d6a980] ColorSchemes v3.6.0
[5ae59095] Colors v0.11.2
[a8cc5b0e] Crayons v4.0.1
[3a865a2d] CuArrays v1.7.3
[864edb3b] DataStructures v0.17.10
[31c24e10] Distributions v0.23.1
[e30172f5] Documenter v0.24.7
[587475ba] Flux v0.10.3
[59287772] Formatting v0.4.1
[2535ab7d] JSON2 v0.3.1
[0f8b85d8] JSON3 v1.0.1
[1902f260] Knet v1.3.4
[91a5bcdd] Plots v0.29.8
[92933f4c] ProgressMeter v1.2.0
[90137ffa] StaticArrays v0.12.1
[9a3f8284] Random
[9e88b42a] Serialization
[10745b16] Statistics
brian@1920x-Ubuntu:~/AlphaZero.jl$

I tried again, but still get the error:

brian@1920x-Ubuntu:~/AlphaZero.jl$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Knet
ERROR: ArgumentError: Package Knet not found in current path:

  • Run import Pkg; Pkg.add("Knet") to install the Knet package.

Stacktrace:
[1] require(::Module, ::Symbol) at ./loading.jl:887

julia> using Pkg; Pkg.add("Knet")
Updating registry at ~/.julia/registries/General
Updating git-repo https://github.com/JuliaRegistries/General.git
Resolving package versions...
Installed CompilerSupportLibraries_jll ─ v0.3.3+0
Installed CodecZlib ──────────────────── v0.7.0
Installed Cthulhu ────────────────────── v1.0.1
Installed CUDAapi ────────────────────── v4.0.0
Installed DataStructures ─────────────── v0.17.11
Installed GPUArrays ──────────────────── v3.1.0
Installed CuArrays ───────────────────── v2.0.1
Installed ExprTools ──────────────────── v0.1.0
Installed CUDAdrv ────────────────────── v6.2.2
Installed CodeTracking ───────────────── v0.5.8
Installed CUDAnative ─────────────────── v3.0.3
Installed Knet ───────────────────────── v1.3.5
Installed JLD2 ───────────────────────── v0.1.3
Updating ~/.julia/environments/v1.3/Project.toml
[1902f260] + Knet v1.3.5
Updating ~/.julia/environments/v1.3/Manifest.toml
[621f4979] + AbstractFFTs v0.5.0
[79e6a3ab] + Adapt v1.0.1
[6710c13c] + AutoGrad v1.2.1
[b99e7846] + BinaryProvider v0.5.8
[fa961155] + CEnum v0.2.0
[3895d2a7] + CUDAapi v4.0.0
[c5f51814] + CUDAdrv v6.2.2
[be33ccc6] + CUDAnative v3.0.3
[da1fd8a2] + CodeTracking v0.5.8
[944b1d66] + CodecZlib v0.7.0
[e66e0078] + CompilerSupportLibraries_jll v0.3.3+0
[f68482b8] + Cthulhu v1.0.1
[3a865a2d] + CuArrays v2.0.1
[864edb3b] + DataStructures v0.17.11
[e2ba6199] + ExprTools v0.1.0
[5789e2e9] + FileIO v1.2.4
[0c68f7d7] + GPUArrays v3.1.0
[033835bb] + JLD2 v0.1.3
[1902f260] + Knet v1.3.5
[929cbde3] + LLVM v1.3.4
[1914dd2f] + MacroTools v0.5.5
[872c559c] + NNlib v0.6.6
[efe28fd5] + OpenSpecFun_jll v0.5.3+3
[bac558e1] + OrderedCollections v1.1.0
[189a3867] + Reexport v0.2.0
[ae029012] + Requires v1.0.1
[276daf66] + SpecialFunctions v0.10.0
[a759f4b9] + TimerOutputs v0.5.3
[3bb67fe8] + TranscodingStreams v0.9.5
[83775a58] + Zlib_jll v1.2.11+9
[2a0f44e3] + Base64
[ade2ca70] + Dates
[8ba89e20] + Distributed
[b77e0a4c] + InteractiveUtils
[76f85450] + LibGit2
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[56ddb016] + Logging
[d6f4376e] + Markdown
[a63ad114] + Mmap
[44cfe95a] + Pkg
[de0858da] + Printf
[3fa0cd96] + REPL
[9a3f8284] + Random
[ea8e919c] + SHA
[9e88b42a] + Serialization
[6462fe0b] + Sockets
[2f01184e] + SparseArrays
[10745b16] + Statistics
[8dfed614] + Test
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
Building Knet → ~/.julia/packages/Knet/bTNMd/deps/build.log

julia> using Knet
[ Info: Precompiling Knet [1902f260-5fb4-5aff-8c31-6271790ab950]

julia>
brian@1920x-Ubuntu:~/AlphaZero.jl$ julia --project --color=yes scripts/alphazero.jl --game connect-four train

Initializing a new AlphaZero environment

Initial report

Number of network parameters: 617,480
Number of regularized network parameters: 617,408
Memory footprint per MCTS node: 380 bytes

Running benchmark: AlphaZero against MCTS (1000 rollouts)

UndefVarError: lib not defined

@brianprichardson
Author

brianprichardson commented Apr 8, 2020

I also ran this test:

julia> using Knet; include(Knet.dir("test/gpu.jl"))
Knet.gpuCount() = 1
Knet.gpu() = 0
Knet.tk = ["/usr/local/cuda-10.1/targets/x86_64-linux", "/usr/local/cuda-10.1", "/usr/local/cuda-10.2"]
Knet.libknet8 = ""
Knet.cudartfound = true
Knet.cudaRuntimeVersion = 10010
Knet.cudaDriverVersion = 10020
Knet.cudaGetDeviceCount() = 1
Knet.cudaGetDevice() = 0
Knet.cudaMemGetInfo() = (10907549696, 11546656768)
Knet.cudaDeviceSynchronize() = nothing
Knet.nvmlfound = true
Knet.nvmlDriverVersion = "440.64.00"
Knet.nvmlVersion = "10.440.64.00"
Knet.nvmlDeviceGetMemoryInfo() = (11546656768, 10907549696, 639107072)
Knet.cublashandle() = Ptr{Nothing} @0x000000000991b120
Knet.cublasVersion = 10202
Knet.cudnnhandle() = Ptr{Nothing} @0x00000000023e3ef0
Knet.cudnnVersion = 7604
Knet.dir() = "/home/brian/.julia/packages/Knet/bTNMd"
readdir(Knet.dir("deps")) = [".deprecated", ".gitignore", "Makefile", "README.windows", "build.jl", "build.log", "cuda01.jl", "cuda1.jl", "cuda11.jl", "cuda12.jl", "cuda13.jl", "cuda14.jl", "cuda16.jl", "cuda17.jl", "cuda20.jl", "cuda21.jl", "cuda22.jl", "gamma.jl"]
gpu: Test Failed at /home/brian/.julia/packages/Knet/bTNMd/test/gpu.jl:39
Expression: !(isempty(Knet.libknet8))
Evaluated: !(isempty(""))
Stacktrace:
[1] top-level scope at /home/brian/.julia/packages/Knet/bTNMd/test/gpu.jl:39
[2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
[3] top-level scope at /home/brian/.julia/packages/Knet/bTNMd/test/gpu.jl:5
Test Summary: | Pass Fail Total
gpu | 19 1 20
ERROR: LoadError: Some tests did not pass: 19 passed, 1 failed, 0 errored, 0 broken.
in expression starting at /home/brian/.julia/packages/Knet/bTNMd/test/gpu.jl:3

julia>

brian@1920x-Ubuntu:~/AlphaZero.jl$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.243

@jonathan-laurent
Owner

From your last post, this does indeed look like a problem with the Knet installation.

In theory, Knet is supposed to be installed automatically by the Pkg.instantiate() command. For example, on my machine (Ubuntu 18.04 also, with CUDA 10.2 and Julia 1.4.0), the command lines given in the tutorial work in a completely new Julia environment.
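A note on environments, offered as a hedged sketch rather than a confirmed diagnosis: in the logs above, Pkg.add("Knet") updated ~/.julia/environments/v1.3/Project.toml (the default environment, which received Knet v1.3.5 under packages/Knet/bTNMd), whereas the training script runs with --project and its stack trace points at packages/Knet/vxHRi (Knet v1.3.4 per Pkg.status()). A check run from a plain julia session may therefore not exercise the same Knet install as the script. The helper below runs the same gpu.jl check inside the project environment. The name knet_gpu_check is hypothetical, and it assumes julia is on the PATH and that the argument is the AlphaZero.jl checkout.

```shell
#!/bin/sh
# Sketch: run Knet's quick GPU check inside a project's own environment.
# knet_gpu_check is a hypothetical helper name, not part of AlphaZero.jl or Knet.
knet_gpu_check() {
    project_dir="${1:-.}"
    # Refuse to run if the directory is not a Julia project.
    if [ ! -f "$project_dir/Project.toml" ]; then
        echo "no Project.toml found in $project_dir" >&2
        return 1
    fi
    # --project=<dir> makes Julia use that directory's Project.toml and
    # Manifest.toml instead of the default ~/.julia/environments/vX.Y one.
    julia --project="$project_dir" -e \
        'import Pkg; Pkg.status(); using Knet; include(Knet.dir("test/gpu.jl"))'
}
```

Calling knet_gpu_check ~/AlphaZero.jl would print the project's package versions first, which makes it easy to see which Knet version the check is actually loading.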

But installing all the CUDA-related dependencies that Knet needs, across all possible configurations, is not a trivial problem, and bugs like the one you observed still happen (especially for Windows users). It seems to me that Flux (another Julia ML framework) has been more successful in this regard.

My advice is to follow the steps proposed in the manual to install Knet:
https://denizyuret.github.io/Knet.jl/latest/install/#Setting-up-Knet-1

If this still results in a problem, you may want to open a Knet issue.

Also, please note that I am planning to add an option to switch to a Flux implementation of the networks library (see #2). Therefore, if you don't have time to debug Knet, you may want to wait a bit for this.

Finally, welcome to Julia! I am glad that AlphaZero.jl gave you an occasion to try out this great language and I hope you don't get discouraged by this initial bump. :-)

@brianprichardson
Author

I got a bit further. After doing the following, it looks like a CUDA/compiler issue:

julia> using Pkg

julia> Pkg.add("CUDAapi")
Updating registry at ~/.julia/registries/General
Updating git-repo https://github.com/JuliaRegistries/General.git
Resolving package versions...
Updating ~/.julia/environments/v1.3/Project.toml
[3895d2a7] + CUDAapi v4.0.0
Updating ~/.julia/environments/v1.3/Manifest.toml
[no changes]

julia> using CUDAapi

julia> CXX,CXXVER = CUDAapi.find_host_compiler()
ERROR: UndefVarError: find_host_compiler not defined
Stacktrace:
[1] getproperty(::Module, ::Symbol) at ./Base.jl:13
[2] top-level scope at REPL[4]:1

julia>

@jonathan-laurent
Owner

This looks like a bug in CUDAapi indeed. I recommend filing an issue.

@brianprichardson
Author

I worked on it for several hours.
With the "quick" test, using Knet; include(Knet.dir("test/gpu.jl")), all 20 checks pass, but the longer 10-minute test, using Pkg; Pkg.test("Knet"), fails. I tried to clean up my CUDA and cuDNN libraries, but the problem seems to go beyond that.

I have to say it is not really unexpected. The entire machine learning field is moving very fast, and compatibility between the various library stacks and frameworks is a moving target. My primary interest is Lc0 (Leela Chess): being able to train nets and compile the engine. Most of that is TensorFlow. I also got torch to work for some other things (A0Lite). As I mentioned, I am new to Julia and Knet. Perhaps TF will be supported at some point. I am reluctant to break things for Lc0 right now, but I may take another stab at it. Thank you for sharing your work.

@jonathan-laurent
Owner

Thanks for reporting back! In any case, you may want to file an issue because the maintainers of Knet and CuArrays may be interested in this.

I'll ping you after I fix the Flux backend in case you want to try again.

@elife33

elife33 commented Apr 11, 2020

I encountered a similar problem. After I removed ~/.julia (with rm) and reran
julia --project -e "import Pkg; Pkg.instantiate()"
julia --project --color=yes scripts/alphazero.jl --game connect-four train
it worked!
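
The reset described above can be made less destructive by moving the depot aside instead of deleting it. The following is a minimal sketch, not something posted in this thread: reset_depot is a hypothetical helper name, and ~/.julia is assumed to be the depot location (the default). It sets aside all installed packages and registries, so the next Pkg.instantiate() has to download everything again.

```shell
#!/bin/sh
# Sketch: set the Julia depot aside (with a timestamped name) rather than deleting it.
# reset_depot is a hypothetical helper; the depot path defaults to ~/.julia.
reset_depot() {
    depot="${1:-$HOME/.julia}"
    if [ ! -d "$depot" ]; then
        echo "nothing to do: $depot does not exist" >&2
        return 0
    fi
    backup="${depot}.bak.$(date +%Y%m%d%H%M%S)"
    # Move instead of rm -rf so the old depot can be restored if needed.
    mv "$depot" "$backup" || return 1
    # Print the backup location so the caller can restore or delete it later.
    echo "$backup"
}

# Afterwards, reinstall and retry from the AlphaZero.jl checkout
# (these are the two commands quoted in the comment above):
#   julia --project -e "import Pkg; Pkg.instantiate()"
#   julia --project --color=yes scripts/alphazero.jl --game connect-four train
```

If the fresh install works, the backup directory can be deleted; if it does not, moving it back restores the previous state.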
