Recent revision for contiguous check has problems #1991
Comments
Yes, that's expected; on the positive side it also avoids some unnecessary copies in some cases. The idea is to explicitly call `contiguous` where a contiguous tensor is actually needed.
Meanwhile, the check for the non-contiguous squeezed tensor (1-dim) returns contiguous, meaning that we need to explicitly and immediately call the `contiguous` method if the tensor was squeezed into 1-dim?
Sorry, the tensor is actually contiguous, so there should be no need to call `contiguous`.
Related error: #1992. The annoying bit is that the matmul kernels consider the layout non-contiguous when it actually is contiguous; I'll have to make a couple of tweaks there, and hopefully this will avoid the unnecessary copy.
Can we consider removing the `if dim > 1` condition, as it seems to be causing issues in specific scenarios? In my case, I require the conversion of every tensor into a contiguous form because our custom DSA does not support non-contiguous kernels, which is different from CUDA. The Candle framework had been performing very well on our hardware until this condition was introduced (I found some 1-dim tensors that were squeezed from non-contiguous 2-dim tensors and falsely regarded as contiguous). The reason Candle has not yet encountered issues on GPU, in my view, is that each CUDA kernel implementation within Candle includes a corresponding non-contiguous version, and these non-contiguous versions do not rely on the `if dim > 1` condition (they use the stride), which makes the problem trivial on GPU.
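To illustrate the stride-based indexing the comment above refers to, here is a minimal sketch (in plain Rust, not Candle's actual CUDA code) of how a stride-aware kernel maps a logical row-major element index to a storage offset; this is why such kernels handle non-contiguous layouts regardless of what the contiguity flag says:

```rust
/// Map a logical (row-major) element index to a storage offset using the
/// tensor's dims and strides, the way a stride-aware kernel would.
fn strided_index(mut idx: usize, dims: &[usize], strides: &[usize]) -> usize {
    let mut offset = 0;
    // Walk dimensions from innermost to outermost, peeling off one
    // coordinate at a time and weighting it by that dimension's stride.
    for (&dim, &stride) in dims.iter().zip(strides.iter()).rev() {
        offset += (idx % dim) * stride;
        idx /= dim;
    }
    offset
}

fn main() {
    // A transposed [2, 3] tensor: shape [3, 2], strides [1, 3] -- non-contiguous.
    let (dims, strides) = (vec![3, 2], vec![1, 3]);
    let offsets: Vec<usize> = (0..6).map(|i| strided_index(i, &dims, &strides)).collect();
    assert_eq!(offsets, vec![0, 3, 1, 4, 2, 5]);
    println!("{:?}", offsets);
}
```

A kernel iterating logical indices through such a mapping reads the correct element no matter what the strides are, so the `if dim > 1` relaxation never affects correctness on that path.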
I found more problems with the revised versions of `tensor::squeeze` and `tensor::unsqueeze` introduced in #1884. I'm not sure about the benefits of the current revision, but it causes problems in certain cases, especially when dealing with non-contiguous tensors.

```rust
pub fn squeeze<D: Dim>(&self, dim: D) -> Result<Self> {
    // The PyTorch semantics are to return the same tensor if the target dimension
    // does not have a size of 1.
    let dims = self.dims();
    let dim = dim.to_index(self.shape(), "squeeze")?;
    if dims[dim] == 1 {
        let mut dims = dims.to_vec();
        let mut strides = self.stride().to_vec();
        dims.remove(dim);
        strides.remove(dim);
        let tensor_ = Tensor_ {
            id: TensorId::new(),
            storage: self.storage.clone(),
            layout: Layout::new(dims.into(), strides, self.layout.start_offset()),
            op: BackpropOp::new1(self, Op::Reshape),
            is_variable: false,
            dtype: self.dtype,
            device: self.device.clone(),
        };
        Ok(Tensor(Arc::new(tensor_)))
    } else {
        Ok(self.clone())
    }
}

pub fn unsqueeze<D: Dim>(&self, dim: D) -> Result<Self> {
    let mut dims = self.dims().to_vec();
    let mut strides = self.stride().to_vec();
    let dim = dim.to_index_plus_one(self.shape(), "unsqueeze")?;
    // Cannot panic because to_index_plus_one already checks dimensions.
    dims.insert(dim, 1);
    // Any stride would work here, but we pick one so as to maximize the
    // probability to remain C contiguous.
    let stride = if dim < strides.len() { strides[dim] } else { 1 };
    strides.insert(dim, stride);
    let tensor_ = Tensor_ {
        id: TensorId::new(),
        storage: self.storage.clone(),
        layout: Layout::new(dims.into(), strides, self.layout.start_offset()),
        op: BackpropOp::new1(self, Op::Reshape),
        is_variable: false,
        dtype: self.dtype,
        device: self.device.clone(),
    };
    Ok(Tensor(Arc::new(tensor_)))
}
```

The previous version works really well:

```rust
pub fn squeeze<D: Dim>(&self, dim: D) -> Result<Self> {
    // The PyTorch semantics are to return the same tensor if the target dimension
    // does not have a size of 1.
    let dims = self.dims();
    let dim = dim.to_index(self.shape(), "squeeze")?;
    if dims[dim] == 1 {
        let mut dims = dims.to_vec();
        dims.remove(dim);
        self.reshape(dims)
    } else {
        Ok(self.clone())
    }
}

pub fn unsqueeze<D: Dim>(&self, dim: D) -> Result<Self> {
    let mut dims = self.dims().to_vec();
    let dim = dim.to_index_plus_one(self.shape(), "unsqueeze")?;
    // Cannot panic because to_index_plus_one already checks dimensions.
    dims.insert(dim, 1);
    self.reshape(dims)
}
```
But the point is that these tensors are actually contiguous, so making a copy for them when calling `contiguous` is unnecessary.
Also, I relaxed the contiguity check in the same way in the function you mentioned on the CUDA side in #2000; all the tests passed fine, as did the couple of models I've tried. Certainly keen to see a breakage example on CUDA after this change too.
In my example, the introduction of the `if dim > 1` condition for the contiguity check causes some generation problems, e.g., the generated contents are slightly different from the expected. Have you checked the output of the model?

The expected output (without the `if dim > 1` condition):

> loaded the model in 2.834793587s
> Deep Learning: A Transformative Technology Deep learning (DL) is a subfield of machine learning (ML) that enables computers to learn from data without explicit programming. By mimicking the structure and function of the human brain, DL algorithms can discover patterns and relationships in data, enabling them to make accurate predictions and decisions. Architecture of a Deep Learning Model A typical deep learning model consists of multiple layers of interconnected nodes or neurons. Each layer receives input data, passes it through a series of transformations, and produces an output. The model can have different architectures, but they typically share these key components:
> Types of Deep Learning Models There are various types of DL models, each with its strengths and weaknesses:
> Applications of Deep Learning Deep learning has numerous applications in various domains, including:
> Advantages of Deep Learning
> Disadvantages of Deep Learning
> Conclusion Deep learning is a transformative technology that has revolutionized various industries. Its ability to learn from data without explicit programming has opened up new possibilities for problem-solving. While DL models can achieve high accuracy, they also have limitations that need to be considered. As deep learning continues to evolve, it will likely play an increasingly important role in shaping the future of technology.

The current output (with the `if dim > 1` condition):

> loaded the model in 2.801835715s
> Key Concepts of Deep Learning:
> Applications of Deep Learning:
> Benefits of Deep Learning:
> Challenges of Deep Learning:
> Future of Deep Learning:
> Conclusion: Deep learning is a transformative technology that has revolutionized various industries. Its ability to learn from data without explicit programming has opened up new possibilities for solving complex problems. While DL presents challenges such as data requirements, bias, and explainability, its future holds immense potential to further advance AI and unlock unprecedented advancements in various fields.
I checked the outputs for Mistral and Llama2 (7b), both quantized and non-quantized, and didn't see a change. I'll check again, but maybe you have a way for me to reproduce the issue?
I found where the problem is: for a given tensor like [13, 1], if we perform a Candle `to_dtype` on its broadcasted version (something like shape [1, 1, 13, 13] with a stride of [0, 0, 13, 1]), the previous contiguity check would regard it as non-contiguous, triggering the broadcast operation in my backend (making it contiguous implicitly), while the current check treats it as a contiguous tensor because of the size-1 dims whose strides in [0, 0, 13, 1] are ignored, and this causes a problem on my backend. In comparison, the cast CUDA kernel in Candle is able to deal with non-contiguous tensors and can cast them into contiguous ones with `get_strided_index`. Similarly, the `tensor::transpose` op does not change the stride of the original tensor, and the transposed tensor was treated as contiguous too in the revised version, causing problems in my backend (and also triggering `get_strided_index` in the underlying CUDA kernels to make it contiguous). It seems that the copy was not avoided even though the tensor was marked as contiguous (the underlying CUDA kernels do the dirty work using `get_strided_index`).
Thanks for investigating and for the details. I'm not sure I get the full picture of it, but to me it feels like the tensor layout you mention actually is contiguous in memory and so could be handled by contiguous kernels (so it would work well on CUDA post #2000); the issue might be some assumptions on your backend side, but maybe I'm missing something?
No, the tensor was not contiguous, as I stated earlier: "the `tensor::transpose` operation does not change the stride of the original tensor, and the transposed tensor was treated as contiguous." The transposed tensor (with dimensions 1 and 2 swapped) should be considered non-contiguous because its memory layout would need to be altered. The CUDA kernels implicitly made these transposed tensors contiguous, which is why no obvious issues were detected in the Candle GPU backend. However, the correct logic dictates that such a transpose results in a non-contiguous tensor, which was labeled as contiguous in the recent revision. I suspect that if all CUDA kernel contiguity checks adhere to this revision, it will lead to problems.
A tensor with shape [1, 1, 13, 13] and stride [0, 0, 13, 1] is contiguous in memory with a C layout.
The following code for the contiguous check in shape.rs will trigger problems for the squeezed tensor (n-dim to 1-dim) because of the `if dim > 1` condition (recently added).
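The check itself isn't quoted in the thread; a minimal sketch of a C-contiguity check with the `if dim > 1` relaxation discussed above (a standalone reconstruction, not the literal shape.rs source) might look like this:

```rust
/// C-contiguity check with the `dim > 1` relaxation: dimensions of size 1 are
/// skipped, since their stride never affects which element is addressed.
fn is_contiguous(dims: &[usize], stride: &[usize]) -> bool {
    if dims.len() != stride.len() {
        return false;
    }
    let mut acc = 1;
    // Walk from the innermost dimension outward, expecting each stride to
    // equal the product of all inner dimension sizes.
    for (&stride, &dim) in stride.iter().zip(dims.iter()).rev() {
        if dim > 1 && stride != acc {
            return false;
        }
        acc *= dim;
    }
    true
}

fn main() {
    // The broadcast-style layout from this thread passes under this check...
    assert!(is_contiguous(&[1, 1, 13, 13], &[0, 0, 13, 1]));
    // ...while a genuinely strided (transposed) layout is still rejected.
    assert!(!is_contiguous(&[3, 2], &[1, 3]));
}
```

Without the `dim > 1` guard, the stride of 0 on the leading size-1 dimensions would fail the `stride != acc` comparison, which is exactly the divergence between the old and new behavior described above.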