Implement rank deficiency check in vector_auto_regression #123
Comments
Hi @witherscp, thanks for opening up an issue! Checking the matrix rank would not be sufficient: because of numerical precision, "almost every matrix is diagonalizable" (i.e., full rank) in floating point. A better value to check is the condition number (using numpy/scipy). Re the cross-validation step: I am okay with adding this in, but it raises open questions that are not that simple. What would the score be measured on? How well the model fits the training data, the "next epoch", or the next n time steps?
If I check the condition number using np.linalg.cond(), how do I know whether it would be considered too large? In my sample data, without a common average reference the condition numbers are around ~1000 per epoch, whereas they are ~10^15 when the common average reference is applied, which is ill-conditioned. For the cross-validation, I was planning on using 5-fold cross-validation with:
We could just come up with a sensible default (e.g. 1e6 or 1e7)?
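A minimal sketch of such a default check, assuming a single data matrix per epoch (the `is_ill_conditioned` name and the 1e7 cutoff are illustrative, not part of the library):

```python
import numpy as np

def is_ill_conditioned(X, threshold=1e7):
    """Return True when the condition number of X exceeds the threshold.

    The 1e7 default is a hypothetical cutoff, not a value from the library.
    """
    return bool(np.linalg.cond(X) > threshold)

well = np.eye(3)                                   # cond == 1
near_singular = np.array([[1.0, 2.0],
                          [2.0, 4.0000001]])       # rows nearly collinear
print(is_ill_conditioned(well))           # False
print(is_ill_conditioned(near_singular))  # True
```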
Sure, I think this makes sense. But I suspect we might need to split the epochs such that they predict some validation epoch (possibly shorter than the window lengths). Let's try it out in your PR: if you can show a good example on simulated (or real) data, that would help inform the settings. To make ill-conditioned data in simulation, you can just add redundant sensors with small noise or something. Feel free to open the PR early to get some initial feedback.
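The redundant-sensor idea above can be sketched as follows (shapes and the 1e-8 noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, n_ch = 500, 8
data = rng.standard_normal((n_times, n_ch))

# Add a redundant "sensor": a copy of channel 0 plus tiny noise. This drives
# the condition number up by orders of magnitude without making the matrix
# exactly singular -- a simple stand-in for the common-average-reference case.
redundant = data[:, [0]] + 1e-8 * rng.standard_normal((n_times, 1))
ill = np.hstack([data, redundant])

print(np.linalg.cond(data))  # modest: well-conditioned
print(np.linalg.cond(ill))   # very large: ill-conditioned
```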
Submitted the PR. The dynamic model seems to work well; the residuals are quite small, even smaller than with OLS. I'm not sure the averaged-epochs model works as well, because the optimal alpha is the absolute minimum of the np.logspace grid I provide, and if I widen the range the alpha keeps dropping until an ill-conditioned solver error occurs. So far, I have only tested with the ECoG dataset using a common average reference.
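One way to diagnose the "alpha lands on the grid edge" symptom is to widen the grid and refit whenever the selected alpha sits on a boundary. A hedged sketch (function name and expansion steps are hypothetical, not from the PR):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_with_adaptive_grid(X, Y, lo=-4, hi=4, num=9, max_expand=3):
    """Refit with a wider alpha grid whenever RidgeCV selects an alpha on
    the boundary of np.logspace(lo, hi, num). All names here are illustrative."""
    for _ in range(max_expand):
        alphas = np.logspace(lo, hi, num)
        model = RidgeCV(alphas=alphas).fit(X, Y)
        if alphas[0] < model.alpha_ < alphas[-1]:
            return model  # selected alpha is interior: grid is wide enough
        lo, hi, num = lo - 2, hi + 2, num + 4  # widen both ends and refit
    return model
```

If the alpha keeps drifting toward zero even on a wide grid, that itself is evidence the problem is effectively unregularized and the ill-conditioned solve error is expected.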
Describe the problem
The current version of vector_auto_regression has an option to manually input an l2_reg ridge penalty, but it does not check whether the matrix is rank deficient when the user keeps the default l2_reg of 0 (i.e., no regularization). If the matrix is rank deficient and the user does not realize it, the outputs are useless.
Describe your solution
I want to check the matrix rank to ensure that it matches the number of nodes across all epochs. If any epoch is rank deficient, then sklearn.linear_model.RidgeCV will be employed for every epoch to automatically search for the optimal alpha value, which will be used to regularize the matrix. This is more useful than asking the user to input an l2_reg, because RidgeCV can find the best value by cross-validation.
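A minimal sketch of the proposed fallback, assuming a design matrix `X` of lagged node signals and targets `Y` (the `fit_var_coefs` name and alpha grid are illustrative, not the actual implementation):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_var_coefs(X, Y, cv=5):
    """Use plain least squares when the design matrix is full column rank;
    otherwise let RidgeCV search for the penalty by cross-validation."""
    if np.linalg.matrix_rank(X) < X.shape[1]:
        model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=cv).fit(X, Y)
        return model.coef_
    return np.linalg.lstsq(X, Y, rcond=None)[0].T

# Toy usage: X holds lagged node signals, Y the next-step signals.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = X @ rng.standard_normal((5, 5))
A = fit_var_coefs(X, Y)  # full rank here, so the OLS branch is taken
```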
Describe possible alternatives
A possible alternative would be to apply Ridge regression to all input matrices, regardless of whether they are full rank. Another option would be to remove the l2_reg value and use the automated search for all matrices if the user chooses.
Additional context
This issue was created based on this discussion post.
I am going to work on this and submit a pull request with my progress soon!