DIABLO does not account for differently ordered variables in test set #192

Closed
Ning-L opened this issue Mar 16, 2022 · 2 comments · Fixed by #194
Labels
bug Something isn't working

Comments

Ning-L commented Mar 16, 2022

Hi mixOmics team,

Thank you for your hard work on this great package!

I use the DIABLO pipeline to perform my multi-omics analysis. After building my final model using multiblock sPLS-DA, I want to predict new samples with it.

When using the predict function, I got this error message: Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'. However, all the variables are present in the new data set.

After checking the source code, I found it was due to the order of variables in one block of my new data list, which is not exactly the same as in the training data set. Once I reordered the variables to match the training data set, everything went well.

I think the key point is just to ensure that all variables from the trained model are present in the new data; the order doesn't matter. So the check using all.equal is not appropriate here:

if(all.equal(lapply(newdata,colnames),lapply(X,colnames))!=TRUE)

I suggest replacing the if statement with the following code:

if (any(unlist(lapply(seq_along(X), function(i) length(setdiff(colnames(X[[i]]), colnames(newdata[[i]]))) > 0))))
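
To make the difference between the two checks concrete, here is a minimal sketch on toy data (the matrices and the names order_sensitive_fail and missing_vars are illustrative assumptions, not mixOmics code). It only shows that the all.equal comparison is sensitive to column order, while the setdiff comparison tests presence alone:

```r
# Toy stand-ins for one training block and one test block (hypothetical data).
X <- list(block1 = matrix(1:6, nrow = 2,
                          dimnames = list(NULL, c("a", "b", "c"))))
# Same variables, different column order.
newdata <- list(block1 = X$block1[, c("c", "a", "b")])

# Order-sensitive check: all.equal() returns a description of the mismatch
# (not TRUE) because the name vectors differ position by position.
order_sensitive_fail <- !isTRUE(all.equal(lapply(newdata, colnames),
                                          lapply(X, colnames)))

# Set-based check: TRUE only if a training variable is absent from newdata.
missing_vars <- any(unlist(lapply(seq_along(X), function(i) {
  length(setdiff(colnames(X[[i]]), colnames(newdata[[i]]))) > 0
})))

print(order_sensitive_fail)
print(missing_vars)
```

Note that the set-based check says nothing about whether the columns are in the training order; it only confirms that none are missing.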

Best, Lijiao

Ning-L commented Mar 17, 2022

Actually, I found that the order of variables does matter here, because a later step scales the new data based on the training data, as follows:

mixOmics/R/predict.R

Lines 385 to 389 in 2b6ab06

if (!is.null(attr(X[[1]], "scaled:center")))
    newdata[which(!is.na(ind.match))] = lapply(which(!is.na(ind.match)), function(x){sweep(newdata[[x]], 2, STATS = attr(X[[x]], "scaled:center"))})
if (scale)
    newdata[which(!is.na(ind.match))] = lapply(which(!is.na(ind.match)), function(x){sweep(newdata[[x]], 2, FUN = "/", STATS = attr(X[[x]], "scaled:scale"))})

So if all variables are present in the new data, just in a different order than in the training set, I think we can just add a step to sort them, such as 32e9ac6
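
The reason order matters in that step is that sweep() applies STATS by column position and ignores column names. A small sketch on made-up numbers (train, new, wrong and right are hypothetical names, and this is not the code from the commit mentioned above) shows the effect and how reordering by name avoids it:

```r
# Training block with two variables; centring stores the column means.
train <- matrix(c(1, 3, 10, 30), nrow = 2,
                dimnames = list(NULL, c("a", "b")))
centers <- attr(scale(train, center = TRUE, scale = FALSE), "scaled:center")

# New sample with the same variables in the opposite order.
new <- matrix(c(20, 2), nrow = 1, dimnames = list(NULL, c("b", "a")))

# Positional centring: the mean of "a" is subtracted from column "b" and
# vice versa, without any warning.
wrong <- sweep(new, 2, STATS = centers)

# Reordering by name first lines each column up with its own mean.
new_sorted <- new[, colnames(train), drop = FALSE]
right <- sweep(new_sorted, 2, STATS = centers)

print(wrong)
print(right)
```

Here the new sample sits exactly at the training means, so the correctly aligned result is all zeros while the positional one is not.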

Max-Bladen changed the title from "The if statement is not appropriate" to "DIABLO does not account for differently ordered variables in test set" Mar 20, 2022
Max-Bladen self-assigned this Mar 20, 2022
Max-Bladen added the bug and wip labels Mar 20, 2022

Max-Bladen (Collaborator) commented

For consistency, using the template to describe the bug

🐞 Describe the bug:

When using the predict() function on a DIABLO model, if one or more of the test dataframes is supplied with a variable order that differs from that of the corresponding training dataframe, the following error is raised:

Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'

While the order is important for the algorithm, having differing orders should not prevent the method from running.


🔍 reprex results from reproducible example including sessioninfo():

suppressMessages(library(mixOmics))

data(breast.TCGA) # load in the data

# extract data
X.train = list(mirna = breast.TCGA$data.train$mirna,
               mrna = breast.TCGA$data.train$mrna)

X.test = list(mirna = breast.TCGA$data.test$mirna,
              mrna = breast.TCGA$data.test$mrna)

Y.train = breast.TCGA$data.train$subtype

# use optimal values from the case study on mixOmics.org
optimal.ncomp = 2
optimal.keepX = list(mirna = c(10,5),
                     mrna = c(26, 16))

# set design matrix
design = matrix(0.1, ncol = length(X.train), nrow = length(X.train),
                dimnames = list(names(X.train), names(X.train)))
diag(design) = 0

# generate model
final.diablo.model = block.splsda(X = X.train, Y = Y.train, ncomp = optimal.ncomp, # set the optimised DIABLO model
                                  keepX = optimal.keepX, design = design)
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.


# create new test data with one dataframe being reordered
new.var.order = sample(1:dim(X.test$mirna)[2])
X.test.dup <- X.test
X.test.dup$mirna <- X.test.dup$mirna[, new.var.order]

predict.diablo = predict(final.diablo.model, newdata = X.test)

predict.diablo.reordered = predict(final.diablo.model, newdata = X.test.dup)
#> Error in predict.block.spls(final.diablo.model, newdata = X.test.dup): Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'

Created on 2022-03-21 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 Patched (2021-11-16 r81220)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Australia.1252
#>  ctype    English_Australia.1252
#>  tz       Australia/Sydney
#>  date     2022-03-21
#>  pandoc   2.14.2 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.1.3)
#>  BiocParallel   1.28.3  2021-12-09 [1] Bioconductor
#>  cli            3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.1.2)
#>  corpcor        1.6.10  2021-09-16 [1] CRAN (R 4.1.1)
#>  crayon         1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  DBI            1.1.2   2021-12-20 [1] CRAN (R 4.1.3)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr          1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
#>  ellipse        0.4.2   2020-05-27 [1] CRAN (R 4.1.2)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi          1.0.2   2022-01-14 [1] CRAN (R 4.1.2)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics       0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  ggplot2      * 3.3.5   2021-06-25 [1] CRAN (R 4.1.2)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.1.2)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.1.2)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
#>  igraph         1.2.11  2022-01-04 [1] CRAN (R 4.1.2)
#>  knitr          1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lattice      * 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  magrittr       2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
#>  MASS         * 7.3-54  2021-05-03 [2] CRAN (R 4.1.2)
#>  Matrix         1.3-4   2021-06-01 [2] CRAN (R 4.1.2)
#>  matrixStats    0.61.0  2021-09-17 [1] CRAN (R 4.1.2)
#>  mixOmics     * 6.18.1  2021-11-18 [1] Bioconductor (R 4.1.2)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.2)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  plyr           1.8.6   2020-03-03 [1] CRAN (R 4.1.2)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
#>  R.cache        0.15.0  2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3    1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo           1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils        2.11.0  2021-09-26 [1] CRAN (R 4.1.2)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  rARPACK        0.11-0  2016-03-10 [1] CRAN (R 4.1.2)
#>  RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 4.1.1)
#>  Rcpp           1.0.8.2 2022-03-11 [1] CRAN (R 4.1.2)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.1.2)
#>  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown      2.13    2022-03-10 [1] CRAN (R 4.1.3)
#>  RSpectra       0.16-0  2019-12-01 [1] CRAN (R 4.1.2)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  scales         1.1.1   2020-05-11 [1] CRAN (R 4.1.2)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
#>  styler         1.7.0   2022-03-13 [1] CRAN (R 4.1.2)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun           0.30    2022-03-02 [1] CRAN (R 4.1.2)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/Work/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2patched/library
#> 
#> ------------------------------------------------------------------------------

🤔 Expected behavior:

No error should be raised. The predict() function should handle this case and produce predictions.


💡 Possible solution:

Sort the variables of each test dataframe so that their order matches that of the corresponding training dataframe.
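
As a rough illustration of that idea from the user's side, the sketch below defines a hypothetical helper, align_blocks() (not part of mixOmics, and not necessarily how the eventual fix was implemented). It assumes both lists are named by block and reorders each test block by column name, stopping if a training variable is missing:

```r
# Hypothetical helper: reorder every block of `newdata` so that its columns
# follow the order of the matching training block.
align_blocks <- function(train, newdata) {
  stopifnot(!is.null(names(train)), all(names(newdata) %in% names(train)))
  blocks <- stats::setNames(names(newdata), names(newdata))
  lapply(blocks, function(b) {
    missing <- setdiff(colnames(train[[b]]), colnames(newdata[[b]]))
    if (length(missing) > 0) {
      stop("Block '", b, "' lacks variables: ", paste(missing, collapse = ", "))
    }
    # Indexing by name accepts any permutation of the columns.
    newdata[[b]][, colnames(train[[b]]), drop = FALSE]
  })
}

# Toy blocks (made-up data) with one shuffled block.
train <- list(
  mirna = matrix(0, 2, 3, dimnames = list(NULL, c("m1", "m2", "m3"))),
  mrna  = matrix(0, 2, 2, dimnames = list(NULL, c("g1", "g2")))
)
test <- list(
  mirna = matrix(1:6, 2, 3, dimnames = list(NULL, c("m3", "m1", "m2"))),
  mrna  = matrix(1:4, 2, 2, dimnames = list(NULL, c("g1", "g2")))
)
aligned <- align_blocks(train, test)
print(colnames(aligned$mirna))
```

With the reprex objects, a call such as predict(final.diablo.model, newdata = align_blocks(X.train, X.test.dup)) would be the analogous workaround. One design choice to be aware of: indexing by the training column names also drops any extra variables present only in the test block.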

Max-Bladen added this to Implemented Locally in bladen-devel-bugs Mar 20, 2022
This was linked to pull requests Mar 20, 2022
Max-Bladen removed a link to a pull request Mar 21, 2022
Max-Bladen moved this from Implemented Locally to PR Successful on Branch in bladen-devel-bugs Mar 21, 2022
Max-Bladen removed the wip label Mar 21, 2022
Max-Bladen added this to PR Successful on Branch in mixOmics-development Mar 29, 2022
Max-Bladen added a commit that referenced this issue Apr 25, 2022
    fix: predict function has updated error messages for when feature sets are different or in different order
Max-Bladen added a commit that referenced this issue Apr 25, 2022
    test: added test which catches the two next error messages that can be returned
Max-Bladen added the ready-to-review label Sep 7, 2022
Max-Bladen added a commit that referenced this issue Sep 15, 2022
    fix: predict function has updated error messages for when feature sets are different or in different order
Max-Bladen removed the ready-to-review label Sep 22, 2022
Max-Bladen moved this from Ready to Review to Needs closing in mixOmics-development Sep 22, 2022
Max-Bladen moved this from Needs closing to Merged in mixOmics-development Sep 22, 2022
2 participants