Why can the Hessian be obtained from the activations ($H = XX^T$)? #31
Comments
SparseGPT operates in a layer-wise manner, solving the corresponding layer-wise pruning problem given by Equation (1) in the paper. This is essentially a masked linear squared error problem, whose Hessian is indeed $H = XX^T$.
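For reference, here is a sketch of why the layer-wise squared error gives this Hessian (notation assumed: $X \in \mathbb{R}^{d \times n}$ holds the layer's input activations column-wise, $\mathbf{w} \in \mathbb{R}^{d}$ is one row of the weight matrix, and $\mathbf{y}$ is that row's original output):

$$
L(\mathbf{w}) = \lVert \mathbf{w}^\top X - \mathbf{y}^\top \rVert_2^2,
\qquad
\nabla L(\mathbf{w}) = 2\,X\,(X^\top \mathbf{w} - \mathbf{y}),
\qquad
\nabla^2 L(\mathbf{w}) = 2\,XX^\top .
$$

The Hessian is constant in $\mathbf{w}$ (the loss is quadratic) and identical for every output row, so the constant factor of $2$ can be dropped and $H = XX^T$ shared across the whole layer.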
Thanks for the answer, but why is the Hessian of the masked linear squared error problem equal to $XX^T$?
@Beatlesso, check the least squares estimator here: https://global.oup.com/booksites/content/0199268010/samplesec3 The first-order derivative of the squared loss function is shown in equation 3.7, and the second-order derivative in equation 3.10 (which is essentially the same as $XX^T$ if you interchange the row and column dimensions).
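A quick way to convince yourself is to compare the analytic Hessian $2XX^T$ against a finite-difference second derivative of the squared loss. This is a minimal sketch (the sizes, random data, and single-output-row setup are illustrative assumptions, not from the paper):

```python
import numpy as np

# Hypothetical small example: d input features, n samples.
rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.standard_normal((d, n))  # activations, one column per sample
y = rng.standard_normal(n)       # targets for a single output neuron

def loss(w):
    # Squared error of one linear output row: ||w @ X - y||^2
    r = w @ X - y
    return r @ r

# Analytic Hessian of the squared loss: 2 * X @ X.T
H_analytic = 2 * X @ X.T

# Finite-difference Hessian as an independent check
eps = 1e-5
w0 = rng.standard_normal(d)
H_numeric = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        e_i = np.zeros(d); e_i[i] = eps
        e_j = np.zeros(d); e_j[j] = eps
        H_numeric[i, j] = (loss(w0 + e_i + e_j) - loss(w0 + e_i)
                           - loss(w0 + e_j) + loss(w0)) / eps**2

print(np.max(np.abs(H_numeric - H_analytic)))  # tiny, up to rounding error
```

Since the loss is exactly quadratic in `w`, the finite-difference formula matches $2XX^T$ up to floating-point rounding, independent of the point `w0` at which it is evaluated.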
This is very useful, thanks!
I don't really understand: isn't the Hessian a matrix of second-order derivatives? How can it be obtained by multiplying the activation matrix by its transpose ($H = XX^T$)?
Could you give some more detailed instructions?