
Approximation of Lipschitz constant for use in step size #7

Open
jolars opened this issue Jun 3, 2018 · 3 comments

@jolars
Owner

jolars commented Jun 3, 2018

Currently, we approximate the Lipschitz constant as scikit-learn does, using the maximum squared row norm, whilst the theoretically best choice, if I am not mistaken, is the largest eigenvalue of the Hessian. The reason for this, as I understand it, is that the eigendecomposition is computationally demanding, but it does seem suboptimal, particularly since we are using a constant step size.

Here is a short demonstration (in R) of the different methods for least squares:

set.seed(1)
x <- scale(matrix(rnorm(1000000), nrow = 10000, ncol = 100))

# largest squared row norm: a per-sample Lipschitz bound
rowNormsMax <- function(x) max(abs(rowSums(x^2)))
# largest eigenvalue of the full-batch Hessian crossprod(x) / n
hessianEigen <- function(x) max(eigen(crossprod(x), FALSE, TRUE)$values/nrow(x))

rowNormsMax(x)
#> [1] 164.6416
hessianEigen(x)
#> [1] 1.199674

microbenchmark::microbenchmark(rowNormsMax(x), hessianEigen(x))
#> Unit: milliseconds
#>             expr       min       lq      mean    median        uq       max neval
#>   rowNormsMax(x)  4.591024  7.10048  9.094845  7.317713  7.635671 127.13514   100
#>  hessianEigen(x) 11.155758 11.26153 11.678655 11.558702 11.881869  13.99339   100

I plan to run proper tests on this to figure out the best solution, but I thought I would put it up here so I remember to revisit it later.

Also, I wanted to check that I am not missing something, since the difference seems huge.

@michaelweylandt
Collaborator

michaelweylandt commented Jun 8, 2018

In the call to eigen, you can set symmetric=TRUE since crossprod(x) is symmetric. You can also drop the abs in rowNormsMax since the sum of squares is guaranteed to be positive.
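
For reference, a minimal sketch with both tweaks applied (just the same helpers rewritten, nothing new):

# abs() dropped: sums of squares are non-negative
rowNormsMax <- function(x) max(rowSums(x^2))
# symmetric = TRUE since crossprod(x) is symmetric; only.values skips the eigenvectors
hessianEigen <- function(x) max(eigen(crossprod(x), symmetric = TRUE, only.values = TRUE)$values) / nrow(x)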

I think the rowNormsMax approach is essentially optimal here (though I just checked the math quickly). My hunch about it being a loose bound was incorrect.

SGD works by conceiving of the loss as $E[\ell(X, y)] \approx \frac{1}{N} \sum_i \ell(x_i, y_i)$ and, for SAGA, the relevant Lipschitz constant is one which applies to each term in the loss separately. For each $x_i$, the loss $(y_i - x_i^T\beta)^2 / 2 = y_i^2 / 2 - y_i (x_i^T\beta) + (x_i^T\beta)^2 / 2$ has gradient $-x_i(y_i - x_i^T\beta)$ and Hessian $x_i x_i^T$, whose only nonzero eigenvalue is $\|x_i\|^2$. rowNormsMax is exactly the maximum value that this can take over the samples.

Put another way, hessianEigen gives the Lipschitz constant, and hence the largest safe step size, when you are dealing with all the data at once, while rowNormsMax bounds the "single-sample" Hessian and so governs the largest step you can take when you see only one sample at a time. (Note that in the SAGA paper, $L$ is the max Lipschitz constant of the gradient of each individual observation.)

hessianEigen lets you take larger steps because you get "cancellation" from the different samples, while rowNormsMax needs to take smaller step sizes to avoid overreacting to the sample it's currently using.
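
A quick numerical check of the single-sample view, reusing the x from above (this only verifies the algebra; i is just the worst-case row):

# the largest eigenvalue of a single-sample Hessian x_i x_i^T equals ||x_i||^2
i  <- which.max(rowSums(x^2))   # row with the largest squared norm
xi <- x[i, ]
max(eigen(tcrossprod(xi), symmetric = TRUE, only.values = TRUE)$values)
sum(xi^2)  # identical, and equal to rowNormsMax(x)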

@jolars
Owner Author

jolars commented Jun 9, 2018

Ah, yes, that makes sense. Thanks!

What do you think about using an adaptive step size, as the lightning package optionally does, @tdhock and @michaelweylandt?

https://github.com/scikit-learn-contrib/lightning/blob/e755e39c810fbf0e2c08e16bd3697c51554dcda2/lightning/impl/sag_fast.pyx#L386
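
For context, the mechanism there is a backtracking estimate of the Lipschitz constant, roughly as in the SAG paper's line search. A rough sketch of the idea for a single least-squares sample (illustrative names only, not lightning's actual code):

# Sketch: double the Lipschitz estimate L until the usual
# sufficient-decrease condition holds; the step size is then 1 / L.
lipschitzBacktrack <- function(xi, yi, beta, L) {
  g <- -xi * drop(yi - crossprod(xi, beta))       # gradient of the sample loss
  loss <- function(b) 0.5 * drop(yi - crossprod(xi, b))^2
  while (loss(beta - g / L) > loss(beta) - sum(g^2) / (2 * L)) {
    L <- 2 * L
  }
  L
}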

@michaelweylandt
Collaborator

Potentially interesting and might buy some speed, but I think there's probably more low-hanging fruit performance-wise elsewhere. The paper they reference describes the step-size search in a fully smooth setting (for a slightly different algorithm), so I'd want to check the math before implementing, but it seems like it would work.
