Fix for Issue #148 #220
Conversation
feat: introduces the `retain.feats` parameter to `(mint).(block).spls(da)`. Allows for specific features to be retained in the model independent of `keepX`/`keepY`
feat: introduced `retain.feats` functionality to `spca()`. Additionally, fixed two minor issues causing incorrect AVE and weights in the `spls(da)` functions, as well as missing default values
doc: updated documentation for `spca`
Hi @aljabadi, I need a bit of advice for this one.

I've successfully implemented a way to retain user-specified features in sparse functions via the new `retain.feats` parameter. It can take either indices or names of the desired features to retain and is implemented for `spca`, `spls`, `splsda`, `mint.spls`, `mint.splsda`, `block.spls` and `block.splsda`.

I'll refer to features selected by the sparse method (via `keepX`) as "selected" features. Those specified by the user (via `retain.feats`) will be referred to as "retained".

Here's the problem: the loading values of these retained features are severely overestimated. This is because both the selected and retained features have their loadings reduced by `max(abs.a[!select_feature])` (see https://github.com/mixOmicsTeam/mixOmics/blob/0ff2d1671615bb32ac26f634b80833d4ab1bf0fb/R/internal_mint.block_helpers.R#L133). Using the below code as an example:

data(liver.toxicity)
X <- liver.toxicity$gene[, 1:200]
Y <- liver.toxicity$clinic
colnames(X)[6:10] <- c("A", "B", "C", "D", "E")
colnames(Y)[8:10] <- c("A", "B", "C")
retain.feats <- list(X = 6:10,
                     Y = 8:10)
spls.obj <- spls(X, Y, keepX = c(6,1), keepY = c(6,1), retain.feats = retain.feats)

So, the features to retain are named A -> E so they can be easily identified. In a given iteration, the loading values of the selected and retained features are as follows:

           A            B            C            D            E
    37.56952     13.64893     20.92075      3.42440     20.26891
A_42_P649672  A_43_P11568  A_43_P21626 A_42_P454114 A_42_P683537  A_43_P21372
    63.32374     68.51839     66.69514     66.56536     67.45634     66.31351

Unsurprisingly, the selected features (starting with `A_4*`) have the maximal absolute loadings, while the retained features have a range of values. These are then each reduced by the maximal loading of all non-selected features (`max(abs.a[!select_feature])` = 62.85361). Now the values look like:

           A            B            C            D            E
 -25.2840898  -49.2046839  -41.9328549  -59.4292085  -42.5846995
A_42_P649672  A_43_P11568  A_43_P21626 A_42_P454114 A_42_P683537  A_43_P21372
   0.4701328    5.6647843    3.8415326    3.7117548    4.6027321    3.4599011

Now if we look at the final loading values you can see the overestimation of the retained features (particularly on the X dataframe).

plotLoadings(spls.obj)

[plotLoadings output: the retained features A -> E dominate the X loading plot]

Created on 2022-06-22 by the reprex package (https://reprex.tidyverse.org) (v2.0.1)

Mathematically, these values are correct, but the scaling results in their inflation. I played around with removing the `- max(abs.a[!select_feature])`, but this obviously resulted in different values for other components of the method (e.g. weights and AVE).

My primary question is: seeing as the loading values are scaled down via the L2 norm (`sqrt(crossprod(loadings))`, see https://github.com/mixOmicsTeam/mixOmics/blob/0ff2d1671615bb32ac26f634b80833d4ab1bf0fb/R/internal_mint.block.R#L648), is the `- max(abs.a[!select_feature])` necessary? If not, removing this subtraction resolves the loading inflation issue, like so:

[plotLoadings output after removing the subtraction: retained and selected features on comparable scales]

Created on 2022-06-22 by the reprex package (https://reprex.tidyverse.org) (v2.0.1)

Seeing as the loading values (as well as other components, e.g. AVE) change to differing degrees when we remove this subtraction, I'll assume that this subtraction is necessary. Hence, where should I go from here? Is there a better way to implement the `retain.feats` functionality without this issue? Should I just forget about this feature?

Cheers
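For readers following the numbers above, here is a small toy illustration of the inflation being described. The values are copied from the loading output in this comment; this is simplified arithmetic, not the actual mixOmics soft-thresholding code.

```r
# Toy illustration of the inflation described above; loading values copied from
# the comment, simplified arithmetic rather than the actual mixOmics internals.
retained <- c(A = 37.56952, B = 13.64893, C = 20.92075, D = 3.42440, E = 20.26891)
selected <- c(A_42_P649672 = 63.32374, A_43_P11568 = 68.51839, A_43_P21626 = 66.69514,
              A_42_P454114 = 66.56536, A_42_P683537 = 67.45634, A_43_P21372 = 66.31351)
threshold <- 62.85361  # max(abs.a[!select_feature]) reported above

# Subtracting the threshold leaves the selected features with small values, but
# pushes the sub-threshold retained features far from zero, so after the L2
# normalisation they dominate the loading plot:
retained - threshold   # -25.28 -49.20 -41.93 -59.43 -42.58
selected - threshold   #   0.47   5.66   3.84   3.71   4.60   3.46
```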
My two cents is that mathematically it is not sound for the lasso to have this option (although one could consider it as a constrained lasso of some sort). That said, we had implemented it earlier in the package bootPLS for a slightly different purpose (where we realised there was then some overfitting). What I would advise users to do instead is to perform a sparse method, retrieve their selected variables, then run a classic non-sparse model with the selected features and their ‘favourite’ features. That would be the quick fix.
Regards,
Kim-Anh
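A minimal sketch of that suggested workaround, using the liver.toxicity example from the comment above. It assumes the existing mixOmics functions `selectVar()` and `pls()`; the exact accessor fields shown here may need adjusting.

```r
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene[, 1:200]
Y <- liver.toxicity$clinic

# 1. Fit the sparse model and retrieve the variables it selected on component 1
spls.obj   <- spls(X, Y, keepX = c(6, 1), keepY = c(6, 1))
selected.X <- selectVar(spls.obj, comp = 1)$X$name
selected.Y <- selectVar(spls.obj, comp = 1)$Y$name

# 2. Add the user's 'favourite' features to that selection
keep.X <- union(selected.X, colnames(X)[6:10])
keep.Y <- union(selected.Y, colnames(Y)[8:10])

# 3. Re-fit a classic (non-sparse) PLS model on the combined feature set
pls.obj <- pls(X[, keep.X], Y[, keep.Y])
```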
Thanks for the clarification @mixOmicsTeam. The results certainly didn't feel right. I'll inform users of this and put this PR aside. I'll do a bit of research and thinking as to whether there is a way this could be implemented, but I'll move on for now.
Users requested some time ago a feature within our sparse methods which allows them to pass in a specific set of variables guaranteed to be included in the model. Relevant user-facing functions (`spca`, `(mint).(block).spls(da)`) all now take the `retain.feats` parameter to allow this functionality. `Check.retain.feats()` handles checks and pre-processing of the input `retain.feats`. This parameter influences the regularisation that occurs within `soft_threshold_L1()`, called by `sparsity()`. Along with `select_feature`, all features included as part of `retain.feats` do NOT have their loadings reduced to 0.
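To make that description concrete, here is a hypothetical sketch of the behaviour it describes. The function name and internals are illustrative only; the actual changes live in `soft_threshold_L1()` and `sparsity()`.

```r
# Hypothetical sketch of the behaviour described above, not the actual mixOmics code.
# `a` is a loading vector, `keep` the number of features allowed by keepX/keepY, and
# `retain` a logical vector marking the features listed in retain.feats.
soft_threshold_with_retain <- function(a, keep, retain) {
  abs.a <- abs(a)
  # features picked by the sparsity constraint: the `keep` largest absolute loadings
  select_feature <- abs.a >= sort(abs.a, decreasing = TRUE)[keep]
  # retained features are exempt from being zeroed, alongside the selected ones
  keep_feature <- select_feature | retain
  # the subtraction discussed in the conversation above; retained features sitting
  # below this threshold end up with large negative values, hence the inflation
  threshold <- max(abs.a[!select_feature])
  a.new <- ifelse(keep_feature, sign(a) * (abs.a - threshold), 0)
  a.new / sqrt(sum(a.new^2))  # L2 normalisation of the loading vector
}
```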