
Issues running and changing the backend of the mnist example #84

Closed
mwbryant opened this issue Nov 7, 2022 · 6 comments · Fixed by #85

Comments


mwbryant commented Nov 7, 2022

Ubuntu 20.04 LTS, NVIDIA 3070 GPU (Driver 510.85.02, CUDA Version 11.6)

I am able to run the example as-is, and it trains successfully, but it is very slow and appears not to be fully utilizing all the cores on my CPU. However, at what appears to be the end of epoch 2 (the last progress printout reports Iteration 80, Epoch 2/6, with two full bars), it crashes with this message:

thread 'main' panicked at 'called Result::unwrap() on an Err value: SendError { .. }', burn/burn/src/train/checkpoint/async_checkpoint.rs:68:40
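(Not part of the original report, just context: a `SendError` from a Rust channel means the receiving end was dropped before the send, so a panic like the one above typically indicates the checkpoint consumer thread exited early. A minimal std-only sketch of that failure mode:)

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // Dropping the receiver closes the channel...
    drop(rx);

    // ...so any later send fails with SendError, and calling
    // `unwrap()` on that Result panics, much like the
    // async_checkpoint panic quoted above.
    let result = tx.send(42);
    assert!(result.is_err());
    println!("send failed: {:?}", result.unwrap_err());

    // A worker that checkpoints through a channel hits the same
    // error whenever the consumer thread has already terminated.
    let (tx2, rx2) = mpsc::channel::<u32>();
    let handle = thread::spawn(move || drop(rx2));
    handle.join().unwrap();
    assert!(tx2.send(1).is_err());
}
```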

I changed the example to use the Tch backend by changing main to this:

fn main() {
    use burn::tensor::backend::TchADBackend;

    let device = TchDevice::Cpu;
    training::run::<TchADBackend<f32>>(device);
    println!("Done.");
}

This appears to train using my full CPU at great speed, but it crashed on both tries, in two different ways. The first crash gave the same message as above; when run under the VS Code debugger, it crashed differently:

thread '' panicked at 'attempt to subtract with overflow', burn/burn/src/train/checkpoint/file.rs:41:60

In that case, epoch was 1 and self.num_keep was 2.
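(Side note, my illustration rather than the burn code: with `epoch = 1` and `num_keep = 2`, an unsigned subtraction like `epoch - num_keep` underflows. Rust panics on this in debug builds with exactly the "attempt to subtract with overflow" message above, while release builds wrap silently, which is why the bug only surfaces without `--release`. A minimal sketch of the failure mode and the usual fixes:)

```rust
fn main() {
    let epoch: usize = 1;
    let num_keep: usize = 2;

    // `epoch - num_keep` would panic here in a debug build with
    // "attempt to subtract with overflow" (and wrap in release).

    // checked_sub turns the underflow into an explicit Option...
    assert_eq!(epoch.checked_sub(num_keep), None);

    // ...and saturating_sub clamps at zero, which is often what a
    // "keep the last N checkpoints" policy actually wants.
    assert_eq!(epoch.saturating_sub(num_keep), 0);

    println!("ok");
}
```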

I changed the example main as follows to try to use my GPU:

fn main() {
    use burn::tensor::backend::TchADBackend;

    let device = TchDevice::Cuda(0);
    training::run::<TchADBackend<f32>>(device);
    println!("Done.");
}

My first question is: what does the magic number in TchDevice::Cuda(XXX) represent?

Even with various values there (0, 1, 1024), the application crashes on the line model.to_device(device);. I always get this error message, which I have been unable to solve:

thread 'main' panicked at 'called Result::unwrap() on an Err value: Torch("Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [Dense, Conjugate, Negative, UNKNOWN_TENSOR_TYPE_ID, QuantizedXPU, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseCPU, SparseCUDA, SparseHIP, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseXPU, UNKNOWN_TENSOR_TYPE_ID, SparseVE, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, NestedTensorCUDA, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID].\n\nCPU: registered at aten/src/ATen/RegisterCPU.cpp:37386 [kernel]\nMeta: registered at aten/src/ATen/RegisterMeta.cpp:31637 [kernel]\nQuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:1294 [kernel]\nBackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:726 [kernel]\nPython: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:133 [backend fallback]\nNamed: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]\nConjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]\nNegative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]\nZeroTensor: fallthrough registered at ../aten/src/ATen/ZeroTensorFallback.cpp:90 [kernel]\nADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]\nAutogradOther: registered at 
../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nUNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradMPS: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradIPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nUNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nTracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:14069 [kernel]\nAutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:481 [backend fallback]\nAutocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:324 [backend fallback]\nBatched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]\nVmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]\nFunctionalize: registered at 
../aten/src/ATen/FunctionalizeFallbackKernel.cpp:89 [backend fallback]\nPythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:137 [backend fallback]\n\nException raised from reportError at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:447 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x7f95aa2a79cb in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libc10.so)\nframe #1: c10::impl::OperatorEntry::reportError(c10::DispatchKey) const + 0x36b (0x7f95ab5e252b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #2: + 0x1b4df9b (0x7f95abe40f9b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #3: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::ArrayRef, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0xac (0x7f95ac011e6c in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #4: + 0x1fac735 (0x7f95ac29f735 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #5: at::_ops::empty_strided::call(c10::ArrayRef, c10::ArrayRef, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0x174 (0x7f95ac054114 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #6: at::empty_strided(c10::ArrayRef, c10::ArrayRef, c10::TensorOptions) + 0xd8 (0x55f15452c2a8 in 
/home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #7: at::native::_to_copy(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x1447 (0x7f95aba2cf97 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #8: + 0x21479e3 (0x7f95ac43a9e3 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x10d (0x7f95abd9d78d in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #10: + 0x1faef51 (0x7f95ac2a1f51 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x10d (0x7f95abd9d78d in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #12: + 0x2fd82be (0x7f95ad2cb2be in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #13: + 0x2fd883b (0x7f95ad2cb83b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #14: at::_ops::_to_copy::call(at::Tensor const&, 
c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x202 (0x7f95abe1a1e2 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #15: at::native::to(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, bool, c10::optionalc10::MemoryFormat) + 0x13e (0x7f95aba22dde in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #16: + 0x2251799 (0x7f95ac544799 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, bool, c10::optionalc10::MemoryFormat) + 0x216 (0x7f95abf47b26 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optionalc10::MemoryFormat) const + 0xf0 (0x55f1545286e4 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #19: + 0x247491 (0x55f154531491 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #20: + 0x225035 (0x55f15450f035 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #21: + 0x226137 (0x55f154510137 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #22: + 0xdaf55 (0x55f1543c4f55 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #23: + 0xaa848 (0x55f154394848 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #24: + 0x9ec37 (0x55f154388c37 in 
/home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #25: + 0x15bf7e (0x55f154445f7e in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #26: + 0x114627 (0x55f1543fe627 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #27: + 0x15b097 (0x55f154445097 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #28: + 0x1304b7 (0x55f15441a4b7 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #29: + 0x15b81b (0x55f15444581b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #30: + 0x12f1b7 (0x55f1544191b7 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #31: + 0x11c5d6 (0x55f1544065d6 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #32: + 0xa0170 (0x55f15438a170 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #33: + 0xb54cb (0x55f15439f4cb in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #34: + 0x130afe (0x55f15441aafe in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #35: + 0x133c81 (0x55f15441dc81 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #36: + 0x34b21f (0x55f15463521f in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #37: + 0x133c5a (0x55f15441dc5a in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #38: + 0xa01d1 (0x55f15438a1d1 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #39: __libc_start_main + 0xf3 (0x7f95a9eb9083 in /lib/x86_64-linux-gnu/libc.so.6)\nframe #40: + 0x7962e (0x55f15436362e in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\n")', /home/matthew/.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.8.0/src/wrappers/tensor_generated.rs:12977:27

@HoshigaIkaro

For the question about the number provided to TchDevice::Cuda, I was curious as well and looked through the documentation. According to the backend source, it is the device index.


mwbryant commented Nov 7, 2022

I think I saw that same source code, which prompted me to try 0 and 1. It's still a bit unclear from the libraries what the value should be; is there a way to list the valid device indices?

@nathanielsimard (Member)

I made a pull request that should fix the problem. The reason I did not see this before is that I always test using the --release flag, since it speeds up training quite a bit without a noticeable increase in compile time. I discovered that there was an underflow in the file checkpointer on debug builds.

I also added documentation on the TchDevice struct; the usize parameter is indeed the device index, which starts at zero. If you only have one GPU, it should just be zero.

I will work soon on error handling and proper logging to help understand those issues more clearly.


mwbryant commented Nov 8, 2022

Have you tried this trick in your Cargo.toml to get the speed optimizations without needing --release or losing debug info:

[profile.dev]
opt-level = 0

[profile.dev.package."*"]
opt-level = 3

@nathanielsimard (Member)

I didn't think about this, but yeah, this is probably a good default and something to include in the example!

@nathanielsimard (Member)

It turns out that you can't set the optimization profile in member packages under a workspace. I'm not sure how to make debug builds use a different optimization level only for the examples.
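(For reference, and not stated in the thread itself: Cargo only honors [profile.*] sections in the root manifest. In a workspace, profile settings in member crates' Cargo.toml files are ignored with a warning, so the override has to live in the workspace's top-level Cargo.toml, where it applies to every member:)

```toml
# Workspace root Cargo.toml (profiles in member manifests are ignored).
[profile.dev]
opt-level = 0

# Optimize all dependencies even in dev builds, keeping your own
# crates unoptimized and debuggable.
[profile.dev.package."*"]
opt-level = 3
```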
