Fix for Issue #148 #220
Conversation
feat: introduces the `retain.feats` parameter to `(mint).(block).spls(da)`. Allows for specific features to be retained in the model independent of `keepX`/`keepY`
feat: introduced `retain.feats` functionality to `spca()`. Additionally, fixed two minor issues causing incorrect AVE and weights in the `spls(da)` functions, as well as missing default values
doc: updated documentation for `spca`
Hi @aljabadi, I need a bit of advice for this one.

I've successfully implemented a way to retain user-specified features in sparse functions via the new `retain.feats` parameter. It can take either indices or names of the desired features to retain and is implemented for `spca`, `spls`, `splsda`, `mint.spls`, `mint.splsda`, `block.spls` and `block.splsda`.

I'll refer to features selected by the sparse method (via `keepX`) as "selected" features. Those specified by the user (via `retain.feats`) will be referred to as "retained".

Here's the problem: the loading values of these retained features are severely overestimated. This is because both the selected and retained features have their loadings reduced by `max(abs.a[!select_feature])` (see https://github.com/mixOmicsTeam/mixOmics/blob/0ff2d1671615bb32ac26f634b80833d4ab1bf0fb/R/internal_mint.block_helpers.R#L133). Using the below code as an example:

data(liver.toxicity)
X <- liver.toxicity$gene[, 1:200]
Y <- liver.toxicity$clinic
colnames(X)[6:10] <- c("A", "B", "C", "D", "E")
colnames(Y)[8:10] <- c("A", "B", "C")
retain.feats <- list(X = 6:10,
                     Y = 8:10)
spls.obj <- spls(X, Y, keepX = c(6,1), keepY = c(6,1), retain.feats = retain.feats)

So, the features to retain are named A -> E so they can be easily identified. In a given iteration, the loading values of the selected and retained features are as follows:

           A            B            C            D            E
    37.56952     13.64893     20.92075      3.42440     20.26891
A_42_P649672  A_43_P11568  A_43_P21626 A_42_P454114 A_42_P683537  A_43_P21372
    63.32374     68.51839     66.69514     66.56536     67.45634     66.31351

Unsurprisingly, the selected features (starting with `A_4*`) have the maximal absolute loadings, while the retained features have a range of values. These are then each reduced by the maximal loading of all non-selected features (`max(abs.a[!select_feature])` = 62.85361). Now the values look like:

           A            B            C            D            E
 -25.2840898  -49.2046839  -41.9328549  -59.4292085  -42.5846995
A_42_P649672  A_43_P11568  A_43_P21626 A_42_P454114 A_42_P683537  A_43_P21372
   0.4701328    5.6647843    3.8415326    3.7117548    4.6027321    3.4599011

Now if we look at the final loading values you can see the overestimation of the retained features (particularly on the X dataframe).

plotLoadings(spls.obj)

[plotLoadings output: the retained features A -> E dominate the X loading plot]

Created on 2022-06-22 by the reprex package (https://reprex.tidyverse.org) (v2.0.1)

Mathematically, these values are correct, but the scaling results in their inflation. I played around with removing the `- max(abs.a[!select_feature])`, but this obviously resulted in different values for other components of the method (e.g. weights and AVE).

My primary question is: seeing as the loading values are scaled down via the L2 norm (`sqrt(crossprod(loadings))`, see https://github.com/mixOmicsTeam/mixOmics/blob/0ff2d1671615bb32ac26f634b80833d4ab1bf0fb/R/internal_mint.block.R#L648), is the `- max(abs.a[!select_feature])` necessary? If not, removing this subtraction resolves the loading inflation issue, like so:

[plotLoadings output after removing the subtraction: retained and selected features on comparable scales]

Created on 2022-06-22 by the reprex package (https://reprex.tidyverse.org) (v2.0.1)

Seeing as the loading values (as well as other components, e.g. AVE) change to differing degrees when we remove this subtraction, I'll assume that this subtraction is necessary. Hence, where should I go from here? Is there a better way to implement the `retain.feats` functionality without this issue? Should I just forget about this feature?

Cheers
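For readers following the numbers above, here is a small toy illustration of the inflation being described. The values are copied from the loading output in this comment; this is simplified arithmetic, not the actual mixOmics soft-thresholding code.

```r
# Toy illustration of the inflation described above; loading values copied from
# the comment, simplified arithmetic rather than the actual mixOmics internals.
retained <- c(A = 37.56952, B = 13.64893, C = 20.92075, D = 3.42440, E = 20.26891)
selected <- c(A_42_P649672 = 63.32374, A_43_P11568 = 68.51839, A_43_P21626 = 66.69514,
              A_42_P454114 = 66.56536, A_42_P683537 = 67.45634, A_43_P21372 = 66.31351)
threshold <- 62.85361  # max(abs.a[!select_feature]) reported above

# Subtracting the threshold leaves the selected features with small values, but
# pushes the sub-threshold retained features far from zero, so after the L2
# normalisation they dominate the loading plot:
retained - threshold   # -25.28 -49.20 -41.93 -59.43 -42.58
selected - threshold   #   0.47   5.66   3.84   3.71   4.60   3.46
```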
My two cents is that mathematically it is not sound for the lasso to have this option (although one could consider it as a constrained lasso of some sort). That said, we had implemented it earlier in the package bootPLS for a slightly different purpose (where we realised there was then some overfitting). What I would advise users to do instead is to perform a sparse method, retrieve their selected variables, then run a classic non-sparse model with the selected features and their ‘favourite’ features. That would be the quick fix.
Regards,
Kim-Anh
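A minimal sketch of that suggested workaround, using the liver.toxicity example from the comment above. It assumes the existing mixOmics functions `selectVar()` and `pls()`; the exact accessor fields shown here may need adjusting.

```r
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene[, 1:200]
Y <- liver.toxicity$clinic

# 1. Fit the sparse model and retrieve the variables it selected on component 1
spls.obj   <- spls(X, Y, keepX = c(6, 1), keepY = c(6, 1))
selected.X <- selectVar(spls.obj, comp = 1)$X$name
selected.Y <- selectVar(spls.obj, comp = 1)$Y$name

# 2. Add the user's 'favourite' features to that selection
keep.X <- union(selected.X, colnames(X)[6:10])
keep.Y <- union(selected.Y, colnames(Y)[8:10])

# 3. Re-fit a classic (non-sparse) PLS model on the combined feature set
pls.obj <- pls(X[, keep.X], Y[, keep.Y])
```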
Thanks for the clarification @mixOmicsTeam. The results certainly didn't feel right. I'll inform users of this and put this PR aside. I'll do a bit of research and thinking as to whether there is a way this could be implemented, but I'll move on for now.
Users requested some time ago a feature within our sparse methods which allows them to pass in a specific set of variables guaranteed to be included in the model. Relevant user-facing functions (`spca`, `(mint).(block).spls(da)`) all now take the `retain.feats` parameter to allow this functionality. `Check.retain.feats()` handles checks and pre-processing of the input `retain.feats`. This parameter influences the regularisation that occurs within `soft_threshold_L1()`, called by `sparsity()`. Along with `select_feature`, all features included as part of `retain.feats` do NOT have their loadings reduced to 0.
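To make that description concrete, here is a hypothetical sketch of the behaviour it describes. The function name and internals are illustrative only; the actual changes live in `soft_threshold_L1()` and `sparsity()`.

```r
# Hypothetical sketch of the behaviour described above, not the actual mixOmics code.
# `a` is a loading vector, `keep` the number of features allowed by keepX/keepY, and
# `retain` a logical vector marking the features listed in retain.feats.
soft_threshold_with_retain <- function(a, keep, retain) {
  abs.a <- abs(a)
  # features picked by the sparsity constraint: the `keep` largest absolute loadings
  select_feature <- abs.a >= sort(abs.a, decreasing = TRUE)[keep]
  # retained features are exempt from being zeroed, alongside the selected ones
  keep_feature <- select_feature | retain
  # the subtraction discussed in the conversation above; retained features sitting
  # below this threshold end up with large negative values, hence the inflation
  threshold <- max(abs.a[!select_feature])
  a.new <- ifelse(keep_feature, sign(a) * (abs.a - threshold), 0)
  a.new / sqrt(sum(a.new^2))  # L2 normalisation of the loading vector
}
```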