
Why can the Hessian be obtained from activations ($H = XX^T$)? #31

Closed
Beatlesso opened this issue Apr 1, 2024 · 4 comments

Comments

@Beatlesso

Beatlesso commented Apr 1, 2024

I don't really understand this. Isn't the Hessian a matrix of second-order derivatives? So how can it be obtained by multiplying the activation matrix by its transpose ($H = XX^T$)?
Could you give a more detailed explanation?

@efrantar
Member

efrantar commented Apr 1, 2024

SparseGPT operates in a layer-wise manner, solving the corresponding layer-wise pruning problem, given by Equation (1) in the paper. This is essentially a masked linear squared error problem of which the Hessian is indeed $XX^T$; you are right that the whole network's Hessian includes second-order derivatives, but the SparseGPT algorithm does not require this Hessian.
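
For completeness, here is a short sketch of the derivation (my own notation, not taken verbatim from the paper): take the row-wise form of the layer-wise objective from Equation (1), with $w$ one row of the original weights, $\hat{w}$ the corresponding pruned row, and $X$ the calibration input activations.

```latex
% Sketch of the layer-wise Hessian (illustrative notation, assuming the
% row-wise form of the layer-wise reconstruction objective):
\begin{aligned}
L(\hat{w}) &= \lVert w X - \hat{w} X \rVert_2^2
            = (w - \hat{w})\, X X^T\, (w - \hat{w})^T \\
\nabla_{\hat{w}} L &= -2\,(w - \hat{w})\, X X^T \\
\nabla^2_{\hat{w}} L &= 2\, X X^T
\end{aligned}
% The Hessian is constant in \hat{w}; the factor 2 only rescales the error
% uniformly across rows, so it is usually dropped.
```

Since this Hessian depends only on the inputs $X$ and not on the weights, it can be accumulated directly from the calibration activations.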

@Beatlesso
Author

Beatlesso commented Apr 1, 2024

> SparseGPT operates in a layer-wise manner, solving the corresponding layer-wise pruning problem, given by Equation (1) in the paper. This is essentially a masked linear squared error problem of which the Hessian is indeed $XX^T$; you are right that the whole network's Hessian includes second-order derivatives, but the SparseGPT algorithm does not require this Hessian.

Thanks for the answer, but why is the Hessian of the masked linear squared error problem $XX^T$, and how is this result derived?

@Donyme
Contributor

Donyme commented Apr 2, 2024

@Beatlesso , check the least squares estimator here: https://global.oup.com/booksites/content/0199268010/samplesec3

[Screenshot: the least squares derivation from the linked chapter (Equations 3.7 and 3.10)]

The first-order derivative of the squared loss function is shown in Equation 3.7, and the second-order derivative in Equation 3.10 (which is essentially the same as $XX^T$ if you interchange the row and column dimensions).
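
A quick numerical sanity check of the same fact (my own sketch, not code from this repo; the shapes and variable names are illustrative):

```python
# Sketch: verify numerically that the Hessian of the per-row squared
# reconstruction error equals 2 * X @ X.T (illustrative, not repo code).
import torch

d, n = 8, 32                      # input features, calibration samples
X = torch.randn(d, n)             # layer input activations
w = torch.randn(d)                # original (dense) weight row, held fixed

def loss(w_hat):
    # squared error between original and reconstructed row outputs
    return ((w @ X - w_hat @ X) ** 2).sum()

H_autograd = torch.autograd.functional.hessian(loss, torch.randn(d))
H_analytic = 2 * X @ X.T

print(torch.allclose(H_autograd, H_analytic, atol=1e-3))  # expected: True
```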

@Beatlesso
Author

> @Beatlesso , check the least squares estimator here: https://global.oup.com/booksites/content/0199268010/samplesec3
>
> The first-order derivative of the squared loss function is shown in Equation 3.7, and the second-order derivative in Equation 3.10 (which is essentially the same as $XX^T$ if you interchange the row and column dimensions).

This is very useful, thanks!
