Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in clean method / utils.clean function? #210

Closed
jorainer opened this issue May 2, 2017 · 13 comments
Closed

Bug in clean method / utils.clean function? #210

jorainer opened this issue May 2, 2017 · 13 comments

Comments

@jorainer
Copy link
Collaborator

jorainer commented May 2, 2017

I was surprised by the result to a utils.clean call: seems the very first element and the very last element is always TRUE:

With a vector like

Test <- c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0)

I was expecting to get:

Expect <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

but the result of utils.clean is:

MSnbase:::utils.clean(Test)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

The first and last elements are thus always set to TRUE no matter what the next or previous elements are.

@lgatto @sgibb is this intentional?

@lgatto
Copy link
Owner

lgatto commented May 2, 2017

It is unrelated to the fact that the 0s are at the start/end, but with the fact that we remove 0s only when there are 3 or more:

> test0 <- c(0, 0, 0, 1, 0, 1, 0, 0, 0)
> test0[MSnbase:::utils.clean(test0)]
[1] 0 0 1 0 1 0 0
> test1 <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0)
> test1[MSnbase:::utils.clean(test1)]
[1] 0 0 1 0 0 1 0 0
> test2 <- c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)
> test2[MSnbase:::utils.clean(test2)]
[1] 0 0 1 0 0 1 0 0
> test3 <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
> test3[MSnbase:::utils.clean(test3)]
[1] 0 0 1 0 0 1 0 0

The reason for this is that we want to keep the 0s before and after ranges of non-0s.

@lgatto
Copy link
Owner

lgatto commented May 3, 2017

But actually, we probably could update the function to remove double leading/trailing zeros. Is there a good reason to do this?

@jorainer
Copy link
Collaborator Author

jorainer commented May 3, 2017

Reading the documentation I understood that zeros adjacent to peaks are kept while all those that are more than 1 position away from a non-zero value are removed. An alternative function to do this would be (note that this function would also handle NA, i.e. FALSE is returned for all NAs):

#' @description Expands stretches of TRUE values in \code{x} by one on both
#'     sides.
#'
#' @note The return value for a \code{NA} is always \code{FALSE}.
#' 
#' @param x \code{logical} vector.
#' 
#' @noRd
.grow_trues <- function(x) {
    previous <- NA
    x_new <- rep_len(FALSE, length(x))
    for (i in 1:length(x)) {
        if (is.na(x[i])) {
            previous <- NA
            next
        }
        ## If current element is TRUE
        if (x[i]) {
            x_new[i] <- TRUE
            ## if last element was FALSE, set last element to TRUE
            if (!is.na(previous) && !previous)
                x_new[i - 1] <- TRUE
        } else {
            ## if previous element was TRUE, set current to TRUE.
            if (!is.na(previous) && previous)
                x_new[i] <- TRUE
        }
        previous <- x[i]
    }
    x_new
}

This produces:

Test <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0)
.grow_trues(Test > 0)
 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Test[.grow_trues(Test > 0)]
[1] 0 1 1 0 0 1 1 1 0

While the function uses a for loop it is still faster:

library(microbenchmark)
library(MSnbase)
microbenchmark(.grow_trues(Test > 0), MSnbase:::utils.clean(Test))

Unit: microseconds
                        expr     min       lq      mean   median       uq
       .grow_trues(Test > 0)   6.570   7.5320  10.43929   9.6830  12.3165
 MSnbase:::utils.clean(Test) 353.821 368.2385 584.08324 423.5585 511.7995
       max neval cld
    30.389   100  a 
 11846.897   100   b

If you agree I could replace the original with this one.

@sgibb
Copy link
Collaborator

sgibb commented May 3, 2017

If we want to change the behaviour of utils.clean I would like to suggest the following vectorised solution:

.vclean <- function(x) {
  notZero <- x != 0 & !is.na(x)
  notZero | c(notZero[-1], FALSE) | c(FALSE, notZero[-length(notZero)])
}

It turns the numerical vector x into a logical one and moves this one an element to the left and to the right to add TRUEs left/right to the non-zero values.

Test <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0)
.vclean(Test)
#  [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
.vclean(c(NA, Test))
#  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

library(microbenchmark)
library(MSnbase)
microbenchmark(.vclean(Test), .grow_trues(Test > 0), MSnbase:::utils.clean(Test))
# Unit: microseconds
#                         expr     min       lq      mean   median       uq     max neval
#                .vclean(Test)   2.824   3.4340   5.60231   5.8980   6.3545  17.504   100
#        .grow_trues(Test > 0)  20.604  22.5175  28.62575  25.5705  34.1700  51.574   100
#  MSnbase:::utils.clean(Test) 455.928 475.5305 490.31268 486.5955 497.3280 767.833   100

@lgatto
Copy link
Owner

lgatto commented May 3, 2017

I am fine with changing the behaviour, and speed gain is impressive here - @sgibb, can you send a PR.

@jorainer
Copy link
Collaborator Author

jorainer commented May 3, 2017

Awesome!

@lgatto lgatto closed this as completed in aca7e3a May 4, 2017
lgatto pushed a commit that referenced this issue May 4, 2017
rewrite utils.clean to fix #210
@jorainer
Copy link
Collaborator Author

I am not quite happy how the current utils.clean handles NA. Example:

ints <- c(0, NA, 20, 0, 0, 0, 123, 124343, 3432, 0, 0, 0)
keep <- MSnbase:::utils.clean(ints)
ints[keep]
[1]     NA     20      0      0    123 124343   3432      0

Here I don't like two things:

  • the NA is still in there, because it's close to a peak
  • there are two 0 between 20 and 123, but I would expect to have only a single one there.

@jorainer
Copy link
Collaborator Author

Forgot to reopen the issue ;)

@jorainer jorainer reopened this Jun 16, 2017
@lgatto
Copy link
Owner

lgatto commented Jun 16, 2017

My 2 cents:

the NA is still in there, because it's close to a peak

Here, I would suggest to handle zeros and NA values independently. First, I would like to know if you have ever observed data with NA intensities. Knowing when/why they appear could shed some light as to how to treat them; we could omit NAs or replace them with 0.

there are two 0 between 20 and 123, but I would expect to have only a single one there.

This is, as far as I can remember, the expected behaviour, as each peak should keep it's own pair of surrounding zeros. The reason is that if these two peaks are far apart, when plotting, each should reach zero intensity next to it's maximum

> ints <- c(0, 10, 15, 2, 0, 0, 10, 30, 3, 0)
> mz <- c(1, 2, 3, 4, 5, 100, 101, 102, 103, 104)
> plot(mz, ints, type = "l")

rplot001

rather than

 plot(mz[-6], ints[-6], type = "l")

rplot001

@jorainer
Copy link
Collaborator Author

jorainer commented Jun 16, 2017

Agree on the two 0s.
Regarding the NA: what about an argument na.rm that, if TRUE removes first all NA and then cleans the vector?

@jorainer
Copy link
Collaborator Author

To clarify, when I extract a Chromatogram with a defined mz range it can well be that for a certain retention time there is no intensity value available in that mz range. Having the NAs in the first place if OK, as it tells me that spectra were measured, but nothing was measured for a rt. At some point it might however be nice to remove them.

@lgatto
Copy link
Owner

lgatto commented Jun 16, 2017

I'm fine with na.rm, possibly with FALSE as default.

sgibb added a commit that referenced this issue Jun 19, 2017
@sgibb
Copy link
Collaborator

sgibb commented Jun 19, 2017

Just a link to the na.rm discussion: f9326bc

@lgatto lgatto closed this as completed in af3d358 Jun 20, 2017
lgatto pushed a commit that referenced this issue Jun 20, 2017
lgatto pushed a commit that referenced this issue Sep 7, 2017
* master:
  merge Sebastian's improvement, bump version
  adapt MSnSet unit tests to new utils.clean implementation
  rewrite utils.clean; closes #210

From: Laurent <lg390@cam.ac.uk>

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase@129402 bc3139a8-67e5-0310-9ffc-ced21a209358
lgatto pushed a commit that referenced this issue Sep 7, 2017
From: Sebastian Gibb <mail@sebastiangibb.de>

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase@130545 bc3139a8-67e5-0310-9ffc-ced21a209358
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants