
Why can the Hessian be obtained from activations ($H = XX^T$)? #31

Closed
Beatlesso opened this issue Apr 1, 2024 · 4 comments

Comments

@Beatlesso

Beatlesso commented Apr 1, 2024

I don't really understand this. Isn't the Hessian a matrix of second-order derivatives? So how can it be obtained by multiplying the activation matrix by its transpose ($H = XX^T$)?
Could you give a more detailed explanation?

@efrantar
Member

efrantar commented Apr 1, 2024

SparseGPT operates in a layer-wise manner, solving the corresponding layer-wise pruning problem, given by Equation (1) in the paper. This is essentially a masked linear squared error problem of which the Hessian is indeed $XX^T$; you are right that the whole network's Hessian includes second-order derivatives, but the SparseGPT algorithm does not require this Hessian.
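
For completeness, here is a short sketch of the derivation (my own notation, not taken verbatim from the paper): take the row-wise form of the layer-wise objective from Equation (1), with $w$ one row of the original weights, $\hat{w}$ the corresponding pruned row, and $X$ the calibration input activations.

```latex
% Sketch of the layer-wise Hessian (illustrative notation, assuming the
% row-wise form of the layer-wise reconstruction objective):
\begin{aligned}
L(\hat{w}) &= \lVert w X - \hat{w} X \rVert_2^2
            = (w - \hat{w})\, X X^T\, (w - \hat{w})^T \\
\nabla_{\hat{w}} L &= -2\,(w - \hat{w})\, X X^T \\
\nabla^2_{\hat{w}} L &= 2\, X X^T
\end{aligned}
% The Hessian is constant in \hat{w}; the factor 2 only rescales the error
% uniformly across rows, so it is usually dropped.
```

Since this Hessian depends only on the inputs $X$ and not on the weights, it can be accumulated directly from the calibration activations.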

@Beatlesso
Author

Beatlesso commented Apr 1, 2024

> SparseGPT operates in a layer-wise manner, solving the corresponding layer-wise pruning problem, given by Equation (1) in the paper. This is essentially a masked linear squared error problem of which the Hessian is indeed $XX^T$; you are right that the whole network's Hessian includes second-order derivatives, but the SparseGPT algorithm does not require this Hessian.

Thanks for the answer, but why is the Hessian of the masked linear squared error problem $XX^T$, and how is this result derived?

@Donyme
Contributor

Donyme commented Apr 2, 2024

@Beatlesso , check the least squares estimator here: https://global.oup.com/booksites/content/0199268010/samplesec3

[Screenshot: the least squares derivation from the linked chapter (Equations 3.7 and 3.10)]

The first-order derivative of the squared loss function is shown in Equation 3.7, and the second-order derivative in Equation 3.10 (which is essentially the same as $XX^T$ if you interchange the row and column dimensions).
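
A quick numerical sanity check of the same fact (my own sketch, not code from this repo; the shapes and variable names are illustrative):

```python
# Sketch: verify numerically that the Hessian of the per-row squared
# reconstruction error equals 2 * X @ X.T (illustrative, not repo code).
import torch

d, n = 8, 32                      # input features, calibration samples
X = torch.randn(d, n)             # layer input activations
w = torch.randn(d)                # original (dense) weight row, held fixed

def loss(w_hat):
    # squared error between original and reconstructed row outputs
    return ((w @ X - w_hat @ X) ** 2).sum()

H_autograd = torch.autograd.functional.hessian(loss, torch.randn(d))
H_analytic = 2 * X @ X.T

print(torch.allclose(H_autograd, H_analytic, atol=1e-3))  # expected: True
```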

@Beatlesso
Author

> @Beatlesso , check the least squares estimator here: https://global.oup.com/booksites/content/0199268010/samplesec3
>
> The first-order derivative of the squared loss function is shown in Equation 3.7, and the second-order derivative in Equation 3.10 (which is essentially the same as $XX^T$ if you interchange the row and column dimensions).

This is very useful, thanks!
