Latest tensor squeeze impl makes CUDA matmul fail #1948
Comments
That's actually expected, #1884 is the change that made squeeze/unsqueeze more efficient at the expense of breaking some existing use cases. In order to fix this, you should just call |
@LaurentMazare the test code is actually modified from some code using candle_nn::Linear; the input tensor x is already contiguous, and w is a transposed version of the loaded weights, which can not be allowed to call
x shape:[1, 2048], stride:[16384, 1], is_contiguous:true
w shape:[2048, 32], stride:[1, 2048], is_contiguous:false
Error: WithBacktrace { inner: Cuda(MatMulNonContiguous { lhs_stride: [16384, 1], rhs_stride: [1, 2048], mnk: (1, 32, 2048) }), |
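The output above reports x as contiguous even though its leading stride is 16384 rather than 2048. A minimal sketch of how that can happen, assuming the contiguity check skips dimensions of size 1 (this is inferred from the printed values, not copied from candle; the function names here are hypothetical):

```rust
/// Row-major contiguity check that ignores dimensions of size 1.
/// Assumption: this mirrors the behaviour suggested by the log above.
fn is_contiguous_relaxed(shape: &[usize], stride: &[usize]) -> bool {
    if shape.len() != stride.len() {
        return false;
    }
    let mut expected = 1;
    // Walk from the innermost dimension outwards.
    for (&dim, &st) in shape.iter().zip(stride.iter()).rev() {
        if dim > 1 && st != expected {
            return false;
        }
        expected *= dim;
    }
    true
}

/// Strict variant: every stride, including those of size-1 dims, must match.
fn is_contiguous_strict(shape: &[usize], stride: &[usize]) -> bool {
    if shape.len() != stride.len() {
        return false;
    }
    let mut expected = 1;
    for (&dim, &st) in shape.iter().zip(stride.iter()).rev() {
        if st != expected {
            return false;
        }
        expected *= dim;
    }
    true
}

fn main() {
    // x from the log: accepted by the relaxed check, rejected by the strict one.
    println!("x relaxed: {}", is_contiguous_relaxed(&[1, 2048], &[16384, 1]));
    println!("x strict:  {}", is_contiguous_strict(&[1, 2048], &[16384, 1]));
    // w from the log: a transposed view, rejected by both checks.
    println!("w relaxed: {}", is_contiguous_relaxed(&[2048, 32], &[1, 2048]));
}
```

Under this reading, a matmul kernel that compares lhs_stride against the canonical strides can still reject x, even though the tensor itself reports is_contiguous:true.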
Not sure to understand, where is |
u can see that the found this issue by the code from |
Couldn't you just make |
@LaurentMazare but x is already contiguous, while invoke |
Oh I see, actually the notion of |
let x = x.i((.., seq_len - 1, ..))?.contiguous()?;
let x = (x + 0.0)?;
println!("#### x shape:{:?}, stride:{:?}, is_contiguous:{}",
x.shape(),
x.stride(),
x.is_contiguous()
);
let logits = self.lm_head.forward(&x)?;
#### x shape:[1, 2048], stride:[2048, 1], is_contiguous:true
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.10.0/src/driver/safe/core.rs:208:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.10.0/src/driver/safe/core.rs:208:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x563d5ce115a6 - std::backtrace_rs::backtrace::libunwind::trace::hbee8a7973eeb6c93
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
1: 0x563d5ce115a6 - std::backtrace_rs::backtrace::trace_unsynchronized::hc8ac75eea3aa6899
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x563d5ce115a6 - std::sys_common::backtrace::_print_fmt::hc7f3e3b5298b1083
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:68:5
3: 0x563d5ce115a6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hbb235daedd7c6190
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:44:22
4: 0x563d5ce3e5c0 - core::fmt::rt::Argument::fmt::h76c38a80d925a410
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/rt.rs:142:9
5: 0x563d5ce3e5c0 - core::fmt::write::h3ed6aeaa977c8e45
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/mod.rs:1120:17
6: 0x563d5ce0ebaf - std::io::Write::write_fmt::h78b18af5775fedb5
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/io/mod.rs:1810:15
7: 0x563d5ce11384 - std::sys_common::backtrace::_print::h5d645a07e0fcfdbb
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:47:5
8: 0x563d5ce11384 - std::sys_common::backtrace::print::h85035a511aafe7a8
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:34:9
9: 0x563d5ce12c07 - std::panicking::default_hook::{{closure}}::hcce8cea212785a25
10: 0x563d5ce12969 - std::panicking::default_hook::hf5fcb0f213fe709a
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:292:9
11: 0x563d5ce13098 - std::panicking::rust_panic_with_hook::h095fccf1dc9379ee
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:779:13
12: 0x563d5ce12f72 - std::panicking::begin_panic_handler::{{closure}}::h032ba12139b353db
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:657:13
13: 0x563d5ce11aa6 - std::sys_common::backtrace::__rust_end_short_backtrace::h9259bc2ff8fd0f76
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:171:18
14: 0x563d5ce12cd0 - rust_begin_unwind
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
15: 0x563d5c558e85 - core::panicking::panic_fmt::h784f20a50eaab275
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
16: 0x563d5c5593d3 - core::result::unwrap_failed::h03d8a5018196e1cd
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
17: 0x563d5c619994 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h84a62617683e748a
18: 0x563d5c62649c - alloc::sync::Arc<T,A>::drop_slow::h48352ef3f3a7aa1c
19: 0x563d5c6275c0 - alloc::sync::Arc<T,A>::drop_slow::hc81f0778974720ba
20: 0x563d5c670bb3 - lmsf_core::model_executor::layers::sampler::Sampler::forward::hc8d671348d974142 |
sorry, the crash seems produced by another |
Cool, it's a bit worrying though if the |
@LaurentMazare the crash seems to be another issue; it can be reproduced simply with this code:
use candle::{DType, Device, Tensor};

#[test]
fn test_mul() -> candle::Result<()> {
    // Requires a CUDA device at ordinal 0.
    let device = Device::new_cuda(0)?;
    let a = Tensor::ones((1, 256000), DType::F32, &device)?;
    let b = Tensor::ones((1, 256000), DType::F32, &device)?;
    // Element-wise product of two (1, 256000) tensors.
    let c = b.mul(&a)?;
    println!("{c}");
    Ok(())
}
--- after a fully |
The test worked ok for me. And indeed the building process is a bit flaky and could have caused this as there was a recent change in a |
tested success with latest commit |
The test code runs successfully with candle 0.4.1, but there x has a different stride: [2048, 1]. With PyTorch, x has the same stride as on the candle main branch, yet the matmul succeeds.
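To illustrate why the two x strides in this report describe the same data, here is a small sketch in plain Rust (no candle dependency; `canonical_strides` and `offset` are hypothetical helpers written for this example). The stride of a size-1 dimension is only ever multiplied by index 0, so it never affects which element is addressed:

```rust
/// Canonical row-major strides for a shape, in elements.
fn canonical_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Linear element offset of a multi-dimensional index under given strides.
fn offset(index: &[usize], stride: &[usize]) -> usize {
    index.iter().zip(stride).map(|(i, s)| i * s).sum()
}

fn main() {
    let shape = [1usize, 2048];
    let squeezed = [16384usize, 1]; // stride reported on the main branch
    let canonical = canonical_strides(&shape); // stride reported with 0.4.1

    // Every valid index has 0 in the first position, so both layouts agree.
    let same = (0..shape[1]).all(|j| offset(&[0, j], &squeezed) == offset(&[0, j], &canonical));
    println!("canonical strides: {:?}, same offsets: {}", canonical, same);
}
```

This suggests, without confirming how either library implements it, that a backend could normalize the strides of size-1 dimensions before validating matmul inputs.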