SVD operator #4416
Conversation
```cpp
// Copy over all dimensions but the last two
for (; dim_idx < A_dim_size - 2; ++dim_idx) {
  const auto dim = A_shape.dim(dim_idx);

  if (compute_uv) {
    *U_shape->add_dim() = dim;
    *Vh_shape->add_dim() = dim;
  }

  *S_shape->add_dim() = dim;
}
```
Confirming the dimension ordering here is correct? I.e., the dimension at index 0 is the highest (outermost) dimension and the dimension at index `A_shape.dim_size() - 1` is the lowest.
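For reference, a small NumPy sketch (my addition, not part of the PR) showing the batched-SVD convention the shape-inference loop relies on: index 0 is the outermost batch dimension, the last two dimensions are the matrix dimensions.

```python
import numpy as np

# Batched input: the leading dimensions (3, 5) are batch dims; SVD acts on
# the trailing (4, 2) matrices. The batch dims are copied to all outputs,
# then U/S/Vh each get their factorization-specific trailing dims.
A = np.random.default_rng(0).normal(size=(3, 5, 4, 2))

U, S, Vh = np.linalg.svd(A, full_matrices=False)   # "thin" factorization
print(U.shape, S.shape, Vh.shape)   # (3, 5, 4, 2) (3, 5, 2) (3, 5, 2, 2)

S_only = np.linalg.svd(A, compute_uv=False)        # singular values only
print(S_only.shape)                                # (3, 5, 2)
```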
I realize SVD is quite well-known, but it would be useful to mention example models that use SVD, if you are aware of any, as motivation.
I'm not aware of example models off the top of my head as I just picked it up by looking through the open issues, but I will do some digging :) cc @coltonpeltierSE since you had the original open issue, do you have any pointers on example models we could include in the PR description?
I found some models using SVD:

Thank you!
@williamberman - Hi, I'm actually not at Schneider Electric anymore (hence the "SE" in my old name), I've switched to Databricks 🥳 . So I didn't see you had tagged me in this. I can see @p-wysocki found some models which utilize SVD (thank you!). I'm not aware of any other public models which utilize the SVD off the top of my head, but internally to SE we were looking at using the SVD as part of the model to perform some de-noising on signals before classification.
Thank you for the context @coltonpeltier-db! Makes a ton of sense |
Bug fix changing the order of setting U's dimensions. Check for input shapes being absent. Check for dimensions being present/concrete.
@williamberman Thank you for your work! I'm having trouble exporting my model because of SVD right now. I hope this PR will be done soon. One comment on the PR though: shouldn't binary files be stored with Git LFS? For example, like that one
Of course @enuk1dze, hope to have it merged soon! Re: Git LFS, since the generated protobufs are pretty small, I would say it's ok to commit them into vanilla git, as I think they already are. However, that's probably up to the ONNX maintainers.
There are different flavors of numerical methods used in practice to compute SVD (direct vs iterative, deterministic vs stochastic) suited to matrices with different sizes and properties. Do we want to provide more info in the ONNX spec with respect to what method is used (and also the accuracy/tolerance in the case of iterative methods)?
@yuanyao-nv do you think docs on the method used to compute SVD might be more appropriate in the specific runtimes as opposed to the ONNX spec? If the spec mandates a particular accuracy/method which is not available on a particular platform, that might be an issue (I could be completely off base here). A good compromise to assist runtime implementors might be "here are a set of ways to compute SVD which have X properties" in the PR description? Regardless, happy to provide the level of documentation appropriate for the PR/spec :)
If they are going to produce different results, then the op should clarify which one is expected. What's the standard/default? I would assume that's what we want. If there is demand for more than one of these, we may need an attribute to distinguish between them. But all of this should be driven by the use-cases (motivating examples/models).
I think a good analogy here is matrix diagonalization, which also has many different methods suited to different scenarios. We don't yet have a diagonalization operator in ONNX. I agree with @gramalingam that it should be driven by use cases. Unfortunately, the PyTorch and TensorFlow doc pages also have no mention of what method they use, but the fact that they produce all singular values suggests a direct method is used. A quick search of the literature suggests the industry standard for small SVD problems is a two-phase method: first reduce to bidiagonal form, then to diagonal form, with small variations in both phases. So far, ONNX has not expanded into linear algebra computations - this would be a first. So I think it'd be worthwhile to think more carefully about how (and perhaps whether) we should handle such operations. And if SVD is included, it would seem incomplete not to include other ops, such as diagonalization and various matrix decompositions.
Here's what I could pull out of the source and docs. Happy to do more digging to go into more detail if it's helpful :)

TensorFlow - CPU: TF CPU uses Eigen and calls bidiagonal divide and conquer, which internally falls back to the Jacobi method for matrices with fewer than 16 columns.

TensorFlow - GPU: TF GPU uses cuSOLVER and calls gesvdj (Jacobi method) for batches of matrices smaller than 32x32. Additionally, the matrices must either be square or the full factorization must be computed for the Jacobi method; see the source for the full condition. Otherwise, TF GPU uses gesvd, which uses the QR algorithm.

PyTorch - CPU: PyTorch CPU uses LAPACK and calls gesdd, which uses divide and conquer.

PyTorch - GPU - MAGMA: When using MAGMA, PyTorch calls gesdd, which uses divide and conquer.

PyTorch - GPU - cuSOLVER: When using cuSOLVER, PyTorch tip actually lets you pass an option (`driver`) to pick the routine; see the doc string in the source. The behavior in the current stable release is to use gesvdj and fall back to gesvd without the `driver` option.
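One point worth noting alongside the backend survey above (a sketch I'm adding for illustration): singular vectors are only unique up to sign (and up to rotation within degenerate subspaces), so conforming implementations can legitimately disagree elementwise on U and V while agreeing on S and on the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, S, Vh = np.linalg.svd(A, full_matrices=False)

# Flip the sign of one left/right singular vector pair: still a valid SVD.
U2, Vh2 = U.copy(), Vh.copy()
U2[:, 0] *= -1
Vh2[0, :] *= -1

assert np.allclose(U @ np.diag(S) @ Vh, A)
assert np.allclose(U2 @ np.diag(S) @ Vh2, A)

# The singular values themselves are unique regardless of method:
assert np.allclose(S, np.linalg.svd(A, compute_uv=False))
```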
Unfortunately, none of the example models I found gave specifics on hard requirements for the SVD implementation they used. Deep Rotation Estimation released their model in both PyTorch and TensorFlow, if that's a relevant datapoint.
Closing this PR as SVD and potentially related op types seem to be outside the current scope of the ONNX spec :)
Is this a good time to reopen this? |
Sure. |
Semantics:
The SVD operator covers PyTorch, NumPy, and TensorFlow's SVD semantics.

NumPy and TensorFlow use the same `compute_uv` flag for computing just the singular values. PyTorch uses two different operations, `svd` and `svdvals`.

PyTorch and NumPy return the same conjugate transpose, `Vh`. TensorFlow returns `V` directly.

TensorFlow returns in the order `S U Vh` because `S` is the only non-optional return value. PyTorch and NumPy return in the order of the factorization, `U S Vh`.
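A small NumPy sketch of the conventions above (the TensorFlow `V`-vs-`Vh` comparison is expressed here in NumPy terms rather than by calling TensorFlow):

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)

# Full factorization: U, S, Vh (NumPy/PyTorch ordering and Vh convention).
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# Singular values only: NumPy's compute_uv=False; PyTorch spells this as
# the separate op torch.linalg.svdvals.
S_only = np.linalg.svd(A, compute_uv=False)
assert np.allclose(S, S_only)

# TensorFlow returns V rather than Vh; the two are conjugate transposes.
V = Vh.conj().T
assert np.allclose(U @ np.diag(S) @ V.conj().T, A)
```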
.Derivative
"thin"/"partial" vs "full", computing only singular values vs the whole factorization, and real vs complex inputs all change the derivative, impacting both its value and its numerical stability.
There are different resources documenting the different derivative variants. The implementations in well-known AD codebases (PyTorch, TensorFlow, and JAX) all differ slightly.
I consolidated the documentation of the different cases and provided example implementations in this python notebook
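As a concrete instance of the simplest case above: for a real matrix with distinct singular values, the derivative of the k-th singular value with respect to A is the rank-one matrix u_k v_k^T. A hedged NumPy sketch (a textbook identity check, not any framework's actual implementation) verifying this against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# For distinct singular values, d(sigma_k)/dA = u_k v_k^T; take k = 0.
analytic = np.outer(U[:, 0], Vh[0, :])

# Central finite differences on sigma_0, one input entry at a time.
eps, numeric = 1e-6, np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        sp = np.linalg.svd(A + E, compute_uv=False)[0]
        sm = np.linalg.svd(A - E, compute_uv=False)[0]
        numeric[i, j] = (sp - sm) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```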
Existing docs
Previous discussion on adding an SVD operator to ONNX
pytorch/pytorch#81084
#3839
Example models
An Analysis of SVD for Deep Rotation Estimation
SVD is used as a layer in a neural net for predicting rotation matrices. The layer is defined as $\mathrm{SVDO^+}(M) := U \Sigma' V^\top$ where $\Sigma' = \mathrm{diag}(1, ..., 1, \det(U V^\top))$ (see equation 2). There are two models, SVD-Train and SVD-Inference. SVD-Train uses $\mathrm{SVDO^+}$ as the final layer for both training and inference. SVD-Inference omits $\mathrm{SVDO^+}$ as the final layer in training, but it is used as the final layer during inference (see section 4, Methods).
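A minimal NumPy sketch of the $\mathrm{SVDO^+}$ layer as defined above (function and variable names are mine, not from the paper's released code):

```python
import numpy as np

def svdo_plus(M):
    """Project a square matrix M onto SO(n): U diag(1, ..., 1, det(U V^T)) V^T."""
    U, _, Vh = np.linalg.svd(M)
    d = np.linalg.det(U @ Vh)                       # +/- 1 sign correction
    D = np.diag([1.0] * (M.shape[0] - 1) + [d])
    return U @ D @ Vh

# The output is always a proper rotation matrix.
R = svdo_plus(np.random.default_rng(0).normal(size=(3, 3)))
assert np.allclose(R @ R.T, np.eye(3))              # orthogonal
assert np.isclose(np.linalg.det(R), 1.0)            # determinant +1
```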
The full network definition can be found on GitHub. See `regress_from_features` for the pre-SVD layer definitions.

Training Deep Networks with Structured Layers by Matrix Backpropagation
The image recognition layer called second-order pooling computes $\log(F F^\top + \epsilon I)$ where $F$ is a matrix of image features. Given the SVD of $F$, the layer can be simplified so $\log$ is computed elementwise over a diagonalized matrix. Given $F = U \Sigma V^\top$, the second-order pooling layer simplifies to $V \log(\Sigma^\top \Sigma + \epsilon I) V^\top$. See section 5.2.
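A hedged NumPy sketch of the SVD shortcut (my illustration, not the paper's code); it is written for the Gram matrix $F^\top F$, which diagonalizes with $V$; the $F F^\top$ case is symmetric, with $U$ in place of $V$:

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 4))          # hypothetical feature matrix
eps = 1e-3

# Direct route: matrix log via eigendecomposition of the symmetric PSD matrix.
G = F.T @ F + eps * np.eye(4)
w, Q = np.linalg.eigh(G)
direct = Q @ np.diag(np.log(w)) @ Q.T

# SVD shortcut: with F = U Sigma V^T, the log is applied elementwise to the
# shifted squared singular values, then conjugated by V.
U, S, Vh = np.linalg.svd(F, full_matrices=False)
shortcut = Vh.T @ np.diag(np.log(S**2 + eps)) @ Vh

assert np.allclose(direct, shortcut)
```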
Improving training of deep neural networks via Singular Value Bounding and Orthogonal Deep Neural Networks
Training proceeds by standard SGD, except that weight matrices are kept near orthogonal by bounding/clipping their singular values near 1. Weight matrix singular values are bounded within the range $[\frac{1}{1 + \epsilon}, 1 + \epsilon]$ every $T_{svb}$ iterations, where $\epsilon$ and $T_{svb}$ are hyperparameters. See Algorithm 1.
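A minimal NumPy sketch of the bounding step (the $\epsilon$ value and names below are illustrative, not from the paper):

```python
import numpy as np

def bound_singular_values(W, eps=0.05):
    """Clip the singular values of W into [1/(1+eps), 1+eps] and rebuild W."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    S = np.clip(S, 1.0 / (1.0 + eps), 1.0 + eps)
    return U @ np.diag(S) @ Vh

W = np.random.default_rng(0).normal(size=(4, 4))
Wb = bound_singular_values(W)

# All singular values of the bounded matrix now lie in the target range.
S = np.linalg.svd(Wb, compute_uv=False)
assert S.max() <= 1.05 + 1e-8 and S.min() >= 1 / 1.05 - 1e-8
```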
SVD-Softmax: Fast Softmax Approximation on Large Vocabulary Neural Networks
SVD-Softmax is a fast approximation of softmax that can be used during inference. The decomposition of the softmax weight matrix, $A = U \Sigma V^\top$, is used to create the matrix $B = U \Sigma$. A subset $W$ of the columns of $B$ is used to estimate the result of the softmax, where $W$ is a hyperparameter. The complete softmax is computed for the top $N$ approximations, where $N$ is a hyperparameter. See Algorithm 1.
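A hedged NumPy sketch of the preview-then-refine idea (all sizes and hyperparameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d, W, N = 1000, 64, 16, 50      # vocab, hidden, preview width, top-N
A = rng.normal(size=(V_size, d))        # softmax weight matrix
h = rng.normal(size=d)                  # hidden state

U, S, Vt = np.linalg.svd(A, full_matrices=False)
B = U * S                               # B = U @ diag(S)
h_tilde = Vt @ h                        # rotated hidden state

# Preview pass: cheap approximate logits from the first W columns of B,
# which carry the largest singular values.
z = B[:, :W] @ h_tilde[:W]
top = np.argsort(z)[-N:]

# Refinement: exact logits only for the top-N candidates (B @ h_tilde == A @ h).
z[top] = B[top] @ h_tilde

p = np.exp(z - z.max())
p /= p.sum()                            # approximate softmax distribution
```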
SVD-Embedded Deep Autoencoder for MIMO Communications
This model embeds the SVD factorization of the channel matrix into the DAE.
The singular values of the channel matrix are used as inputs to create part of the feature vector, $v_\gamma$ (equation 4). $v_\gamma$ is concatenated with the bit input to create the complete input to the Transmitter DAE (section III.A.2). $v_\gamma$ is also concatenated with the output of the Receiver Pre-processor to create the input to the Receiver DAE (section III.F).
The Transmitter Precoding adds one layer of non-trainable weights composed of the right singular vectors of the channel matrix (section III.C).
The Receiver Pre-processing adds two layers of non-trainable weights. One is the left-singular vectors of the channel matrix. The other is the pseudo-inverse of the matrix containing the singular values as its diagonal (section III.E).