
SVD operator #4416

Closed

wants to merge 3 commits into from

Conversation

@williamberman (Contributor) commented Aug 6, 2022:

Semantics:

The SVD operator covers the SVD semantics of pytorch, numpy, and tensorflow.

Numpy and tensorflow use the same compute_uv flag for computing just the singular values; pytorch instead uses two different operations, svd and svdvals.

Pytorch and numpy return the conjugate transpose, Vh. Tensorflow returns V directly.

Tensorflow returns in the order S, U, V because S is the only non-optional return value. Pytorch and numpy return in the order of the factorization: U, S, Vh.
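A quick illustration of the three call conventions (my sketch, using each library's current public API, not code from this PR):

```python
import numpy as np
import tensorflow as tf
import torch

A = np.random.randn(5, 3)

# numpy: factorization order U, S, Vh; compute_uv flag for values-only
U, S, Vh = np.linalg.svd(A, compute_uv=True)
S_only = np.linalg.svd(A, compute_uv=False)

# tensorflow: S comes first (the only non-optional output) and V is not transposed
s, u, v = tf.linalg.svd(tf.constant(A), compute_uv=True)
s_only = tf.linalg.svd(tf.constant(A), compute_uv=False)

# pytorch: two separate operations instead of a compute_uv flag
U_t, S_t, Vh_t = torch.linalg.svd(torch.from_numpy(A))
S_t_only = torch.linalg.svdvals(torch.from_numpy(A))
```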

Derivative

"thin"/"partial" vs "full", computing only singular values vs the whole factorization, and real vs complex inputs all change the derivative, impacting both its value and its numerical stability.

Various resources document the derivative variants, and the well-known AD codebases (pytorch, tensorflow, and jax) each implement the derivative slightly differently.

I consolidated the documentation of the different cases and provided example implementations in this python notebook
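One concrete case as a sanity check (my example, not from the notebook): for a real input with distinct singular values, $\frac{\partial \sigma_i}{\partial A} = u_i v_i^\top$, which is easy to verify against pytorch's autograd:

```python
import torch

A = torch.randn(5, 3, dtype=torch.double, requires_grad=True)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)

# Gradient of the largest singular value w.r.t. A should be u_0 v_0^T
S[0].backward()
expected = U[:, :1].detach() @ Vh[:1, :].detach()
print(torch.allclose(A.grad, expected))  # True, up to numerical tolerance
```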

Existing docs

Previous discussion on adding an SVD operator to ONNX

pytorch/pytorch#81084
#3839

Example models

An Analysis of SVD for Deep Rotation Estimation

SVD is used as a layer in a neural net for predicting rotation matrices. The layer is defined as $\mathrm{SVDO^+}(M) := U \Sigma' V^\top$ where $\Sigma' = \mathrm{diag}(1, \ldots, 1, \det(U V^\top))$ (see equation 2). There are two models, SVD-Train and SVD-Inference. SVD-Train uses $\mathrm{SVDO^+}$ as the final layer for both training and inference; SVD-Inference omits $\mathrm{SVDO^+}$ during training but uses it as the final layer during inference (see section 4, methods).

The full network definition can be found on github. See regress_from_features for the pre-SVD layer definitions.
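A rough pytorch sketch of the layer (svdo_plus is my name for it; batched 3x3 inputs assumed):

```python
import torch

def svdo_plus(M: torch.Tensor) -> torch.Tensor:
    """Project (..., 3, 3) matrices onto SO(3): U diag(1, 1, det(U V^T)) V^T."""
    U, _, Vh = torch.linalg.svd(M)
    d = torch.det(U @ Vh)                 # +1 or -1, up to numerical error
    ones = torch.ones_like(d)
    S_prime = torch.diag_embed(torch.stack([ones, ones, d], dim=-1))
    return U @ S_prime @ Vh
```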

Training Deep Networks with Structured Layers by Matrix Backpropagation

The image recognition layer called second-order pooling computes $\log(F^\top F + \epsilon I)$, where F is a matrix of image features. Given the SVD $F = U \Sigma V^\top$, the layer can be simplified so log is computed elementwise over a diagonalized matrix: since $F^\top F = V \Sigma^\top \Sigma V^\top$, the second-order pooling layer simplifies to $V \log(\Sigma^\top \Sigma + \epsilon I) V^\top$. See section 5.2.
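A single-matrix pytorch sketch of that simplification (deep_o2p_log is my name; F is m x n):

```python
import torch

def deep_o2p_log(F: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute log(F^T F + eps * I) elementwise over the diagonalized matrix."""
    _, S, Vh = torch.linalg.svd(F)      # full SVD, so Vh is n x n
    s2 = F.new_zeros(F.shape[-1])       # eigenvalues of F^T F, zero-padded past rank
    s2[: S.shape[-1]] = S * S
    return Vh.transpose(-2, -1) @ torch.diag_embed(torch.log(s2 + eps)) @ Vh
```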

Improving training of deep neural networks via Singular Value Bounding and Orthogonal Deep Neural Networks

Training proceeds by standard SGD, except that weight matrices are kept near-orthogonal by bounding/clipping their singular values near 1. Weight-matrix singular values are bounded within the range $[\frac{1}{1 + \epsilon}, 1 + \epsilon]$ every $T_{svb}$ iterations, where $\epsilon$ and $T_{svb}$ are hyperparameters. See Algorithm 1.
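A sketch of the bounding step (singular_value_bound is my name; $\epsilon$ is the paper's hyperparameter):

```python
import torch

@torch.no_grad()
def singular_value_bound(W: torch.Tensor, eps: float) -> torch.Tensor:
    """Clip W's singular values into [1/(1+eps), 1+eps] and reconstruct W."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    S = S.clamp(min=1.0 / (1.0 + eps), max=1.0 + eps)
    return U @ torch.diag_embed(S) @ Vh
```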

SVD-Softmax: Fast Softmax Approximation on Large Vocabulary Neural Networks

SVD-Softmax is a fast approximation of softmax for use during inference. The decomposition of the softmax weight matrix, $A = U \Sigma V^\top$, is used to create the matrix $B = U \Sigma$. The first W columns of B are used to estimate the softmax logits, where the window size W is a hyperparameter; the complete dot products are then recomputed for the top N candidates, where N is a hyperparameter. See Algorithm 1.
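A rough sketch of Algorithm 1 (names are mine; window and top_n stand for the hyperparameters W and N):

```python
import numpy as np

def svd_softmax_logits(A, bias, h, window, top_n):
    """Approximate the logits z = A @ h + bias using the SVD of A."""
    U, S, Vh = np.linalg.svd(A, full_matrices=False)  # done once, offline
    B = U * S                                         # B = U @ diag(S)
    h_tilde = Vh @ h
    z = B[:, :window] @ h_tilde[:window] + bias       # preview using W columns
    top = np.argpartition(z, -top_n)[-top_n:]         # refine the top N exactly
    z[top] = B[top] @ h_tilde + bias[top]
    return z
```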

SVD-Embedded Deep Autoencoder for MIMO Communications

This model embeds the SVD factorization of the channel matrix into the DAE.

The singular values of the channel matrix are used as inputs to create part of the feature vector, $v_\gamma$ (equation 4). $v_\gamma$ is concatenated with the bit input to create the complete input to the Transmitter DAE (section III.A.2). $v_\gamma$ is also concatenated with the output of the Receiver Pre-processor to create the input to the Receiver DAE (section III.F).

The Transmitter Precoding adds one layer of non-trainable weights composed of the right singular vectors of the channel matrix (section III.C).

The Receiver Pre-processing adds two layers of non-trainable weights. One is the left-singular vectors of the channel matrix. The other is the pseudo-inverse of the matrix containing the singular values as its diagonal (section III.E).
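A toy numpy sketch of how the fixed layers fit together (my construction, assuming a known square channel matrix H):

```python
import numpy as np

H = np.random.randn(4, 4)                 # channel matrix, assumed known
U, s, Vh = np.linalg.svd(H)

x = np.random.randn(4)                    # transmitter DAE output
tx = Vh.conj().T @ x                      # precoding: right singular vectors
y = H @ tx                                # pass through the channel
r = np.linalg.pinv(np.diag(s)) @ (U.conj().T @ y)  # receiver pre-processing
assert np.allclose(r, x)                  # the fixed layers equalize the channel
```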

Comment on lines +3410 to +3572
// Copy over all dimensions but the last two
for (; dim_idx < A_dim_size - 2; ++dim_idx) {
const auto dim = A_shape.dim(dim_idx);

if (compute_uv) {
*U_shape->add_dim() = dim;
*Vh_shape->add_dim() = dim;
}

*S_shape->add_dim() = dim;
}
@williamberman (Contributor, Author):

Confirming the dimension ordering here is correct? I.e. the dimension at index 0 is the outermost (highest) dimension and the dimension at index A_shape.dim_size() - 1 is the innermost (lowest).
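For reference, numpy's batched SVD keeps the leading (batch) dimensions in exactly that order (a quick sketch, added for illustration):

```python
import numpy as np

A = np.zeros((2, 7, 5, 3))         # two batch dims; the last two are the matrix
U, S, Vh = np.linalg.svd(A, full_matrices=False)
print(U.shape, S.shape, Vh.shape)  # (2, 7, 5, 3) (2, 7, 3) (2, 7, 3, 3)
```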

Resolved review threads (outdated): onnx/defs/math/defs.cc (two threads), onnx/defs/schema.h
@gramalingam (Contributor):

I realize SVD is quite well-known, but it would be useful to mention example models that use SVD, if you are aware of any, as motivation.

@williamberman (Contributor, Author):

> I realize SVD is quite well-known, but it would be useful to mention example models that use SVD, if you are aware of any, as motivation.

I'm not aware of example models off the top of my head as I just picked it up by looking through the open issues, but I will do some digging :)

cc @coltonpeltierSE since you had the original open issue, do you have any pointers on example models we could include in the PR description?

@williamberman changed the title from "SVD and SVDVals ops" to "SVD op" on Aug 18, 2022
@p-wysocki (Contributor):

I found some models using SVD:

@williamberman (Contributor, Author):

> I found some models using SVD:

Thank you!

@coltonpeltier-db:

@williamberman - Hi, I'm actually not at Schneider Electric anymore (hence the "SE" in my old name), I've switched to Databricks 🥳 .

So I didn't see you had tagged me in this. I can see @p-wysocki found some models which utilize SVD (thank you!).

I'm not aware of any other public models which utilize the SVD off the top of my head but internally to SE we were looking at using the SVD as part of the model to perform some de-noising on signals before classification.

@williamberman (Contributor, Author):

> @williamberman - Hi, I'm actually not at Schneider Electric anymore (hence the "SE" in my old name), I've switched to Databricks 🥳 .
>
> So I didn't see you had tagged me in this. I can see @p-wysocki found some models which utilize SVD (thank you!).
>
> I'm not aware of any other public models which utilize the SVD off the top of my head but internally to SE we were looking at using the SVD as part of the model to perform some de-noising on signals before classification.

Thank you for the context @coltonpeltier-db! Makes a ton of sense

@williamberman force-pushed the will/svd-op branch 2 times, most recently from 45849cb to 2abf95d on September 9, 2022 at 23:08
@williamberman changed the title from "SVD op" to "SVD operator" on Sep 10, 2022
@williamberman marked this pull request as ready for review on September 13, 2022 at 00:01
@williamberman requested review from a team as code owners on September 13, 2022 at 00:01
Signed-off-by: Will Berman <WLBberman@gmail.com>
Signed-off-by: Will Berman <WLBberman@gmail.com>
Bug fix changing order of setting U's dimensions

Check for input shapes being absent.

Check for dimensions being present/concrete.

Signed-off-by: Will Berman <WLBberman@gmail.com>
@enuk1dze:

@williamberman Thank you for your work! I'm having trouble exporting my model because of SVD right now. I hope this PR will be done.

One comment on the PR though: shouldn't binary .pb files be in git-lfs or something?

For example, this one: onnx/backend/test/data/node/test_svd_3d_partial_compute_uv_manually_set/test_data_set_0/output_1.pb

@williamberman (Contributor, Author):

> @williamberman Thank you for your work! I'm having trouble exporting my model because of SVD right now. I hope this PR will be done.
>
> One comment on the PR though: shouldn't binary .pb files be in git-lfs or something?
>
> For example, this one: onnx/backend/test/data/node/test_svd_3d_partial_compute_uv_manually_set/test_data_set_0/output_1.pb

Of course @enuk1dze, hope to have it merged soon!

Re: git-lfs, since the generated protobufs are pretty small, I'd say it's OK to commit them into vanilla git, as I think they already are. That's probably up to the onnx maintainers, though.

@yuanyao-nv (Contributor):

There are different flavors of numerical methods used in practice to compute SVD (direct vs iterative, deterministic vs stochastic) suited to matrices with different sizes and properties. Do we want to provide more info in the ONNX spec with respect to what method is used (and also the accuracy/tolerance in the case of iterative methods)?

@williamberman (Contributor, Author) commented Sep 26, 2022:

> There are different flavors of numerical methods used in practice to compute SVD (direct vs iterative, deterministic vs stochastic) suited to matrices with different sizes and properties. Do we want to provide more info in the ONNX spec with respect to what method is used (and also the accuracy/tolerance in the case of iterative methods)?

@yuanyao-nv do you think docs on the method used to compute SVD might be more appropriate in the specific runtimes as opposed to the ONNX spec? If the spec mandates a particular accuracy/method which is not available on a particular platform, that might be an issue (I could be completely off base here).

A good compromise to assist runtime implementors might be "here are a set of ways to compute SVD which have X properties" in the PR description?

Regardless happy to provide the level of documentation appropriate for the PR/spec :)

@gramalingam (Contributor):

> There are different flavors of numerical methods used in practice to compute SVD (direct vs iterative, deterministic vs stochastic) suited to matrices with different sizes and properties. Do we want to provide more info in the ONNX spec with respect to what method is used (and also the accuracy/tolerance in the case of iterative methods)?

If they are going to produce different results, then the op should clarify which one is expected. What's the standard/default? I would assume that's what we want. If there is a demand for more than one of these, we may need an attribute to distinguish between them. But all of these should be driven by the use-cases (motivating examples/models).

@yuanyao-nv (Contributor):

I think a good analogy here is with matrix diagonalization, which also has many different methods suited to different scenarios. We don't yet have a diagonalization operator in ONNX.

I agree with @gramalingam that it should be driven by use cases. Unfortunately, the pytorch and tensorflow doc pages have no mention of what method they use, but the fact that they produce all singular values suggests a direct method is used. A quick search of the literature suggests the industry standard for small SVD problems is a two-phase method: first reduce to bidiagonal form, then to diagonal form, with small variations in both phases.

So far, ONNX hasn't expanded into linear algebra computations - this would be a first. So I think it'd be worthwhile to think more carefully about how (and perhaps whether) we should handle such operations. And if SVD is included, it would seem incomplete not to include other ops, such as diagonalization and various matrix decompositions.

@williamberman (Contributor, Author) commented Sep 28, 2022:

> Unfortunately, the pytorch and tensorflow doc pages have no mention of what method they use, but the fact that they produce all singular values suggests a direct method is used. A quick search of the literature suggests the industry standard for small SVD problems is a two-phase method: first reduce to bidiagonal form, then to diagonal form, with small variations in both phases.

Here's what I could pull out of the source and docs. Happy to do more digging to go into more detail if it's helpful :)

Tensorflow - CPU

TF CPU uses eigen and calls bidiagonal divide and conquer, which internally falls back to the jacobi method for matrices with fewer than 16 columns.

Tensorflow - GPU

TF GPU uses cuSOLVER and calls gesvdj (jacobi method) for batches of matrices smaller than 32x32. Additionally, for the jacobi method the matrices must either be square or the full factorization must be requested. See the source for the full condition.

Otherwise, TF GPU uses gesvd which uses QR algorithm.

Pytorch - CPU

Pytorch CPU uses lapack and calls gesdd which uses divide and conquer.

Pytorch - GPU - MAGMA

When using magma, pytorch calls gesdd which uses divide and conquer.

Pytorch - GPU - cuSOLVER

When using cuSOLVER, pytorch tip lets you pass an option (driver) to choose between gesvd (QR), gesvdj (jacobi), or gesvda (approximate decompositions of tall, skinny matrices). If gesvdj or gesvda is chosen, the result is checked for convergence, with a fallback to gesvd. The default behavior is gesvdj with the gesvd fallback.

See the doc string in source.

The behavior in the current stable release is to use gesvdj with the gesvd fallback, without the driver option.
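For reference, driver selection on tip looks like this (my example; the keyword is CUDA-only):

```python
import torch

A = torch.randn(64, 32, device="cuda")
# Choose the cuSOLVER driver explicitly; omit it for the default behavior
U, S, Vh = torch.linalg.svd(A, full_matrices=False, driver="gesvdj")
```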

> I agree with @gramalingam that it should be driven by use cases.

Unfortunately, none of the example models I found gave specifics on hard requirements for the SVD implementation they used. Deep rotation estimation released their model in both pytorch and tensorflow, if that's a relevant datapoint.

@williamberman (Contributor, Author):

Closing this PR, as it seems SVD and potentially related op types are outside the current scope of the ONNX spec :)

@justinchuby (Contributor):

Is this a good time to reopen this?

@xadupre (Contributor) commented Apr 5, 2023:

Sure.
