Explanation of initial state parameter? #4

Open · GabeAl opened this issue Nov 26, 2019 · 4 comments
Labels: documentation, good first issue


GabeAl commented Nov 26, 2019

Hello,

I'd like to use this package with extremely high-dimensional datasets, which aren't supported by glmnet because of its 4 GB integer/array size limitation. I would therefore like to know how to specify the initial state (the "init" parameter of htlr()) for the Markov chain, and what format this variable can take.

For example, I would like to use the biglasso package on millions of features and tens of thousands of samples, which is trivial for my system with its terabytes of RAM and hundreds of threads, but of course impossible for glmnet, which uses very old Fortran .Call bindings!

I also want to experiment with selecting initial subsets of variables using other variable-selection techniques such as partial SIS, not to restrict the analysis to those features, but just to initialize the chain. Random initialization takes a very long time and sometimes does not lead to a stable result, whereas with lasso initialization the setup takes a few seconds and each iteration completes in less than one second.

Thanks!

syumet added the documentation and good first issue labels on Nov 26, 2019
longhaiSK (Owner) commented

We did not design HTLR to outperform glmnet in handling larger datasets. One way to start the MCMC in HTLR is to use a lasso estimate. MCMC methods will indeed require more memory than glmnet; however, this is a direction we will work on in the future. You can try biglasso for handling very big datasets.
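For example, something like this (an untested sketch, not part of HTLR; it assumes `X` is an n x p numeric matrix and `y` a binary response):

```r
# Fit a lasso on a memory-mapped matrix with biglasso; the resulting
# coefficient vector could later be supplied to htlr() as the initial
# state (see the format described later in this thread).
library(bigmemory)
library(biglasso)

X.bm  <- as.big.matrix(X)                       # memory-mapped copy of X
cvfit <- cv.biglasso(X.bm, y, family = "binomial", nfolds = 5)
b     <- as.numeric(coef(cvfit))                # intercept + p coefficients at lambda.min
```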

Using SIS to downsize the dataset is a good idea, but be mindful of the feature-selection bias problem: one needs to re-select features with SIS in each fold of cross-validation to avoid false discoveries. More details are given in this thesis: https://math.usask.ca/~longhai/researchteam/theses/DONG-THESIS-2019.pdf
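Schematically (a simple marginal-correlation screen stands in for SIS here, and the htlr()/predict() calls are illustrative; check the exact arguments against the package):

```r
# Re-run the screen inside each CV fold, never on the full data, so the
# feature selection does not leak information across folds.
folds <- sample(rep(1:5, length.out = nrow(X)))
for (k in 1:5) {
  tr    <- folds != k
  score <- abs(cor(X[tr, ], as.numeric(y[tr])))    # screen on the training fold only
  keep  <- order(score, decreasing = TRUE)[1:200]  # keep the top 200 features
  fit   <- htlr(X[tr, keep], y[tr])
  pred  <- predict(fit, X[!tr, keep])
  # ... evaluate pred against y[!tr]
}
```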

GabeAl (Author) commented Dec 1, 2019

Thanks @longhaiSK !

This is all great advice. I'll avoid using it for ultra-high-dimensional datasets, then.

I am still wondering whether you could describe what the "init" parameter expects, beyond the string options. Can I provide other initial states?

Thanks again!

syumet (Collaborator) commented Dec 12, 2019

Yes, you can. For reference, you can take a look at our code that generates the lasso initial states first. We will come up with a clear description in the next release.
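Roughly, something like this with glmnet (a sketch only; the baseline-class convention should be verified against our source, and the subtraction below assumes class 1 is the reference):

```r
# Build a lasso initial state for a C-class response y.
library(glmnet)

cvfit <- cv.glmnet(X, y, family = "multinomial")
co    <- coef(cvfit, s = "lambda.min")      # list of C sparse (p+1) x 1 vectors
B     <- sapply(co, as.numeric)             # (p+1) x C coefficient matrix
init  <- B[, -1, drop = FALSE] - B[, 1]     # (p+1) x (C-1), class 1 as baseline
```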

About the dimensionality: to be honest, we haven't tried datasets of that size, but technically the memory-allocation limitation should not be a problem, as our core module is written in C++. You may give it a try once you have your initial state ready; we look forward to your feedback!

Best regards.

longhaiSK (Owner) commented

Hi GabeAl,

You can simply supply init with a matrix of regression coefficients found by another function. It takes the form of a (p + 1) x K matrix, where p is the number of features and K is the number of classes in y minus 1. When K = 1, you can also supply a vector of length p + 1.
We will update the documentation about this.
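For example, in the binary case (a minimal sketch based on the description above; check the argument names against ?htlr):

```r
# Use a lasso fit as the starting point of the Markov chain.
library(glmnet)
library(HTLR)

cvfit <- cv.glmnet(X, y, family = "binomial")
b     <- as.numeric(coef(cvfit, s = "lambda.min"))  # length p + 1, intercept first
fit   <- htlr(X, y, init = b)
```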
