Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DModX values are normalized on RSD of new observations instead of RSD of the model? #22

Open
stephanieherman opened this issue Apr 11, 2023 · 1 comment

Comments

@stephanieherman
Copy link

Hello!

I ran upon a strange behavior of the DModX() function when projecting new data into my PCA model. I've pasted some code to reproduce this below. First, I train a PCA model on the majority of the observations in the mtcars dataset:

# load package and demo data
set.seed(32)
data("mtcars")
library(pcaMethods)

# split dataset
ind <- sample(1:nrow(mtcars), 5)
test <- mtcars[ind,]
train <- mtcars[-ind,]

# train pca model
pca.mod <- pca(train, scale = "uv", center = TRUE, nPcs = 3)

When evaluating the DModX values of new observations, I noticed that I get different results depending on how many new observations I evaluate:

> DModX(pca.mod, dat = test, newdata = TRUE, type = "normalized")
 Maserati Bora        Valiant      Merc 280C Toyota Corolla       Merc 230 
     3.6666773      0.9110368      1.7617564      1.7419626      1.6084993

> DModX(pca.mod, dat = test[1:3,], newdata = TRUE, type = "normalized")
Maserati Bora       Valiant     Merc 280C 
     4.218257      1.048084      2.026778 

You also get the same value for a single new observation (no matter which observation):

> DModX(pca.mod, dat = test[1,], newdata = TRUE, type = "normalized")
Maserati Bora 
     4.795832 

> DModX(pca.mod, dat = test[3,], newdata = TRUE, type = "normalized")
Merc 280C 
 4.795832 

Finally, it only seems to occur for normalized DModX values (type = "normalized"), which is the default setting. The example below show that the same DModX values are obtained when using absolute DModX values (type = "absolute").

> DModX(pca.mod, dat = test, newdata = TRUE, type = "absolute")
 Maserati Bora        Valiant      Merc 280C Toyota Corolla       Merc 230 
     27.268176       6.775156      13.101749      12.954547      11.962013 

> DModX(pca.mod, dat = test[1:3,], newdata = TRUE, type = "absolute")
Maserati Bora       Valiant     Merc 280C 
    27.268176      6.775156     13.101749

In the documentation you write that the "Normalized values are adjusted to the total RSD of the model". Hence, as I am providing the same PCA model, the generated DModX value of a single new observation should not vary depending on the set of new observations that are projected into the model space.

When looking at the function code (pasted below), I have a guess on where this issue might arise. In the calculation of the normalization factor s0 it looks like the squared residuals of the new observations are summed (sum(E2)). Should this not be the sum of the squared residuals of the training data (i.e., sum(resid(object, object@completeObs)^2))?

setMethod("DModX", "pcaRes",
          function(object, dat, newdata=FALSE, type=c("normalized","absolute"), ...) {
            type <- match.arg(type)
            if(missing(dat)) {
              if(!is.null(completeObs(object)))
                dat <- completeObs(object)
              else stop("missing data when calculating DModX")
            }
            A0 <- as.integer(centered(object))
            ny <- ifelse(newdata, 1,
                         sqrt(nObs(object) /
                              (nObs(object) - nP(object) - A0)))
            E2 <- resid(object, dat)^2
            s <- sqrt(rowSums(E2) / (nVar(object) - nP(object))) * ny
            if(type == "absolute")
              return(s)
            s0 <- sqrt(sum(E2) / ((nObs(object) - nP(object) - A0) *
                                  (nVar(object) - nP(object))))
            s / s0
          })
@stephanieherman stephanieherman changed the title Normalized DModX values are normalized on RSD of new observations instead of RSD of the model? DModX values are normalized on RSD of new observations instead of RSD of the model? Apr 11, 2023
@hredestig
Copy link
Owner

Thanks for the detailed report and analysis! This package is old and I haven't worked on it for ages. I had a look and agree with your suggestion, although I think there is a misunderstanding of what 'dat' was meant to be: the whole original training data. Passing a subset was not an intended to be suppported (no good reason, just happened like that).

Took me a while to get started again and found a number of issues with the package but for this particular problem I created a draft solution along your suggestion in #23 what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants