Fix robust_summary() #330

adder · 2018-04-05T18:26:19Z

Hey Laurent

Adriaan here.
I tested out your robust summarisation implementation on my data.
It failed when there are NA values in the expression matrix.

This is because I used .lm.fit instead of lm for speed reasons.
And .lm.fit does not handle NAs automatically.

I also avoided making a dataframe in this function because this can be costly in a long loop.

I hope this PR helps
best regards

Makes it work when expression values are missing

lgatto · 2018-04-05T19:35:43Z

I am not going to accept this PR, at least not now.

Firstly, it currently fails the test_robust unit test because

> ## Results from your PR
> tail(exprs(x), n = 2)
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4514   11673.59   11936.43   12090.93   12268.70
ENO       22671.74   19986.16   16679.57   16885.14
> ## Results when first filtering out peptides with NA, as in original code
> tail(exprs(xfilt), n = 2)
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4514   11673.59   11936.43   12090.93   12268.70
ENO       38262.18   32907.09   15000.11   10733.34

What would need to be done is to add an argument that specifies that NA values can be ignored (default would probably be FALSE, and would fail in case of missing data), and the difference in results (see above) would need to be documented in combineFeatures.Rd.

sgibb

Thanks for this PR. Additional to @lgatto's comments I like to suggest a few more changes.

Could you please change the !any(!id) into all(id) (I can't add a direct comment to this line because it is not part of this PR).

BTW: I still don't understand why everybody favours snake_cases over camelCase but using both in one function (robust_summary <- function(..., nIter, ...) is weird. Please choose one of both.

All in all I would suggest the following implementation (not tested):

robustSummary <- function(e, nIter = 100, residuals = FALSE, na.rm = FALSE, ...) {
    ## If there is only one 1 peptide for all samples return
    ## expression of that peptide
    if (nrow(e) == 1L) return(e)

    ## remove data points with missing expression values
    p <- !(is.na(e) & na.rm)
    expression <- e[p]
    sample <- as.factor(col(e)[p])
    feature <- as.factor(row(e)[p])

    ## Sum contrast on peptide level so sample effect will be mean
    ## over all peptides instead of reference level.
    X <- stats::model.matrix(~ -1L + sample + feature,
                             contrasts.arg = list(feature = 'contr.sum'))
    ## MASS::rlm breaks on singulare values.
    ## - Check with base lm if singular values are present.
    ## - If so, these coefficients will be zero, remove this collumn
    ##   from model matrix
    ## - Rinse and repeat on reduced modelmatrx till no singular
    ##   values are present
    repeat {
        fit <- stats::.lm.fit(X, expression)
        id <- fit$coefficients != 0L
        if (all(id)) break
        X <- X[ , id, drop = FALSE]
    }
    ## Last step is always rlm
    fit <- MASS::rlm(X, expression, maxit = nIter, ...)
    fit$coefficients[colSums(p) == nrow(e)]
}

sgibb · 2018-04-06T21:53:45Z

R/combineFeatures.R

-    sample <- x$sample
-    feature <- x$feature
+    ## remove data points with missing expression values
+    expression = as.numeric(as.matrix(e))


e should always be a matrix. So as.matrix is not needed here. as.numeric not needed at all.

Coding style: please never use = for assignment (just for arguments).

Is this true? When I removeld as. matrix. I got an error.
Didn't test removing both as.matrix and as.numeric.
Maybe @lgatto can comment on this? The as.numeric(as.matrix(e)) is from his code.

Yes, using = is my bad habit. My scripts are littered with them.
I know I can't use them for packages but they often slip through. :)

Regarding the as.matrix, that's my bad. Some earlier code erroneously used a data.frame as input, hence the as.matrix. As for as.numeric(), is could be replaced by as.vector(), but I think coercion to a vector is needed.

I think coercion to a vector is needed.

That's right but done automatically by e{p] (if p is logical). So no need for explicitly call as.vector/as.numeric.

Ah, ok. There was not logical subsetting in the original code, which is why it was needed then.

sgibb · 2018-04-06T21:55:20Z

R/combineFeatures.R

+    expression = as.numeric(as.matrix(e))
+    p = !is.na(expression)
+    expression = expression[p]
+    sample = rep(colnames(e), each = nrow(e))[p]


Could be replaced by as.factor(col(e)[p]) (the names of the factor are not user visible, or I am wrong?)

True, nothing here shoud be user visible.

But why converting everything to factors? is this needed?
It's not much but the there some (unneeded) computation time that can quickly add up when this function is called for every protein.
(Eg. as.character is much faster. But rownames always return characters, so no problems here)

Okay, apparently I didn't know what row() and col() do. This is indeed better!

character is converted to factor in model.matrix. But it doesn't matter if we use colnames(e)[col(e)[p]] or as.factor(col(e)[p]) here.

sgibb · 2018-04-06T21:55:47Z

R/combineFeatures.R

+    p = !is.na(expression)
+    expression = expression[p]
+    sample = rep(colnames(e), each = nrow(e))[p]
+    feature = rep(rownames(e), times = ncol(e))[p]


as.factor(row(e)[p])

Same comment as above.

sgibb · 2018-04-06T21:56:41Z

R/combineFeatures.R

-    feature <- x$feature
+    ## remove data points with missing expression values
+    expression = as.numeric(as.matrix(e))
+    p = !is.na(expression)


As @lgatto suggests there should be an argument (e.g. na.rm) to control whether the user want to remove/keep NA.

Fair enough. But I would argue that the default should be TRUE. See my comments below.

sgibb · 2018-04-06T21:57:42Z

R/combineFeatures.R


    ## modelmatrix breaks on factors with 1 level so make vector of
-    ## ones (this swill be intercept).
+    ## ones (intercept).
    if (length(unique(sample)) == 1L) sample <- rep(1, length(sample))


sample <- rep(1, length(sample)). If you want to replace the whole vector with the same value for each element you could use the []-operator: sample[] <- 1L. BTW if you use as.factor above the whole statement is not needed.

as.factor will not work.
modelmatrix` fails on factors with only 1 level.

BTW if you use as.factor above the whole statement is not needed.

You are right, my above statement was wrong. But with as.factor we could use:

if (nlevels(sample) == 1L) sample <- as.integer(sample)

sgibb · 2018-04-06T22:02:44Z

R/combineFeatures.R

+    ## Put NA for the samples without any expression value
+    present = apply(e,2,function(x){any(!is.na(x))})
+    out = rep(NA,length(present))
+    out[present] = fit$coefficients[sampleid]


If you avoid as.numeric in line 288 you could reuse p here: colSums(p) == nrow(e). The whole three lines could be simplified to fit$coefficients[colSums(p) == nrow(e)].

I don't think this will do.
I need to return a summarised value (the sample coefficients) for every sample in the msnset.
If that sample happen to have no expression values for the given protein, this value should be NA.

In your suggested solution, only values are returned for the samples without missing data and so MSnbase will probably throw an error (or should).

Right. I somehow messed up here.
Now I would suggest:

present <- rep(NA_logical_, ncol(e)) present[colSums(!is.na(e)) > 0] <- TRUE fit$coefficients[present]

sgibb · 2018-04-06T22:04:56Z

R/combineFeatures.R

+    present = apply(e,2,function(x){any(!is.na(x))})
+    out = rep(NA,length(present))
+    out[present] = fit$coefficients[sampleid]
+    return(out)


return is just needed if you want to jump out of a function (e.g. in an if-statement). If you use it at the end it is just an unnecessary function call. That means out would be enough.

adder · 2018-04-07T04:48:21Z

Thanks for the code review!

I didn't tested the changes yet but I'm going to look into this today and update my PR accordingly.

I allready posted some comments inline above on your suggested changes.

About camelCase() vs snake_case(). @lgatto choosed robust_summary. I'm not sure if MSnbase adhere to a certain style? I saw different uses in the MSnbase reference. (also point.case())
Out of curiosity @sgibb. What's your case against snake_case? :) I always find the snake_case more readible because they act like spaces so they look more like human language. :)

About the removing of NAs. It's ok to have this as an argument but I would argue that the default should be TRUE.
Every proteomics dataset I know of has an arguable high degree of missingness for the same sequences across samples.
For example I use a dataset with 27 samples (same sample but different treatments and different labs).
If for one sample there is a particular peptide missing, I don't think it's very sensible to also ignore the data in all other samples. There would not be much left for downstream analysis.
Correct me if I'm wrong but MSnbase would always throw an error by default on a biological shotgun proteomics dataset if the default of na.rm would be set to FALSE

A pro na.rm = FALSE argument could be that this error serves as a reminder to the user that there is high missingness in proteomics data and one should proceed carefully.
Maybe a solution is to let MSnbase throw an error but also put a message that this might be due to high missingness and one might consider to put na.rm to TRUE.
The error as it is, is not very helpfull for the average user.

I quickly checked the other functions in combineFeatures() and noticed that all defaults are set to FALSE.
In the prensence of missingness, medpolish also throws an error but mean and median 'fail sillently' (return NA for the failing samples).
So the behaviour accross the different summarisation functions in combineFeatures() is not very consistent.
My origninal proposesed robust function tried to be like mean and median. Give a summarised when possible, return NA if not.

Looking forward to your opinion on the matter :)

Thanks again for all suggestions!

lgatto · 2018-04-07T07:36:12Z

Regarding whether the handling of NAs should be default or not, what needs to be in place is a unit test that makes sure all results but for ENO are identical whatever the value of that argument and that the results for ENO are different, and to check the exact values of ENO against the anticipated results (above). What is also needed is a good explanation in the man page.

The behaviour of the different combineFeatures methods depends on the underlying code. We could try to make it more consistent, I suppose. More helpful messages and better documentation of effects of missing vales are definitely something that would be welcome.

My suggestion would be to

Document the effect of NA in combining features with different methods and remind them about data imputation.
First thing in combineFeatures, have a message that mentions the presence of missing values in the data and refer to the documentation for the user to see the effect thereof for different downstream methods.

Once this is in place, I am happy for robust to handle NAs gracefully and using all values.

lgatto · 2018-04-07T07:39:33Z

As for camel vs snake case, it's very much a historical thing. Camel case has been used in Bioconductor for longer than the snake case adoption in the very successful tidyverse packages, hence the usage of camel case in the visible API. Many older functions use a dot (in particular internal helper functions such as utils.*).

I don't have strong opinions as long as it is coherent within a function or within a set of related functions.

adder · 2018-04-07T07:53:13Z

@lgatto
If I understand correctly you would prefer to have a general message through message() when combinefeatures() is called and if there is at least one NA present.
This message mentions this and points to the help of combinefeatures for more info. In the help is explained what happens for every possible summarisation function.
Correct?

PS. Oops, I forgot to commit my comments on @sgibb `s inline code review. See above for my additional comments.

lgatto · 2018-04-07T08:54:20Z

If I understand correctly you would prefer to have a general message through message() when combineFeatures() is called and if there is at least one NA present.
This message mentions this and points to the help of combineFeatures for more info. In the help is explained what happens for every possible summarisation function.

Yes, indeed. I think informing is critical and giving them choices (with good defaults) essential.

I am happy to work on the documentation aspects. I hope I'll have some time beginning of next week.

sgibb · 2018-04-07T18:21:33Z

About camelCase() vs snake_case(). @lgatto choosed robust_summary. I'm not sure if MSnbase adhere to a certain style? I saw different uses in the MSnbase reference. (also point.case())
Out of curiosity @sgibb. What's your case against snake_case? :) I always find the snake_case more readible because they act like spaces so they look more like human language. :)

I don't want to start a flame war here and I really can live with snake_case but not with a mix of both.
I don't really want to use snake_case:

MSnbase uses camelCase.
snake_case doesn't follow the Bioc coding style and MSnbase is a bioc package.
snake_case needs n * _ additional keystroke(s).
snake_case is n * _ character(s) longer.
On my DE-keyboard-layout I have to hit "[Shift] + [-]" and "[-]" is in the lower row on the right and hard to reach (I can't understand why US keyboard users like snake_case because there the "[-]" is right in the most upper (numeric) row which is even harder to reach).
You may argue that code is written once and read often so writing can't be important. Maybe snake_case is more readable in general but I am used to camelCase and IMHO it easier to find the "image" of a variable/function name (instead of really reading it). In contrast snake_case breaks the "image" by an underscore so I am often forced to really read the name.

I really don't care whether we use camelCase or snake_case but we should choose one of them and don't mix them.
(I am sorry for blaming you instead of @lgatto for the snake_case/camelCase mix 😉.)

Every proteomics dataset I know of has an arguable high degree of missingness for the same sequences across samples.

Right but the might be imputed by various methods provided by MSnbase.

I quickly checked the other functions in combineFeatures() and noticed that all defaults are set to FALSE.
In the prensence of missingness, medpolish also throws an error but mean and median 'fail sillently' (return NA for the failing samples).
So the behaviour accross the different summarisation functions in combineFeatures() is not very consistent.

You are right. We should be more consistent here.

lgatto · 2018-04-07T19:10:50Z

I am sorry for blaming you instead of @lgatto for the snake_case/camelCase mix wink.

I take all the blame!

Re consistency (or lack thereof) with how NAs are handled, it also depends on downstream functions, and I am not sure we could or should impose a meaningful default behaviour for all aggregation methods. Which is why I suggest a message whenever NAs are present and better documentation, including reminding users about filtering or imputing missing values. May be we will figure out something later.

lgatto · 2018-04-07T20:20:59Z

I have pushed the following to master (possibly also to this branch, hopefully this won't affect further commits here)

Added a message at the beginning of combineFeatures to inform users of missing values
Update the man page to describe the different effects of missing values on aggregation - comments welcome.
(Another fix irrelevant to this PR)

TODO

Update/rename the robustSummary as suggested by @adder including the review proposed by @sgibb
Now that I have written the documentation, I don't think we need an na.rm argument. The message will inform the user, and the current behaviour can be by first filtering the data. See the example in the new Rd page for details.
~~The unit test needs to be updated accordingly~~ - done in bc6b0a3

@sgibb

Also add @sgibb to authors

adder · 2018-04-09T13:37:27Z

Okey,
I tried to merge @lgatto commits to my fork before doing my pull request with changes.
I hope I didn't mess up somewhere. Tell me if something is wrong.

I added the changes of @sgibb and added him to the authors.

I remove de na.rm parameter as discussed.
I also decided to remove the Niter parameter since this is just translated to de maxit parameter of rlm.
Since you can easily change the rlm parameters, I decided that it might be better to just go with the rlm defaults.
As an added bonus: the snake_case/camelCase problem is elegantly solved this way ;)
I also changed the Unit test accordingly. They succeeded for me.
Let me know if there are still some issues.

Cheers
Adriaan

lgatto · 2018-04-09T14:11:24Z

Oh dear, still using = instead of <- ;-). I'll iron these details out locally.

lgatto · 2018-04-09T15:40:11Z

I has been merged plus some small changes. I haven't included all @sgibb's code suggestion, as I felt it made some parts less readable (for me at least). Thank you both!

adder · 2018-04-09T16:02:50Z

Oeps, the = is my muscle memory working.
Sorry
Thanks for the merge

lgatto · 2018-04-09T16:06:38Z

Also on Bioc now.

Fix robust_summary()

9cc9572

Makes it work when expression values are missing

sgibb requested changes Apr 6, 2018

View reviewed changes

lgatto pushed a commit that referenced this pull request Apr 7, 2018

adapt robust unit test with regard to NAs, as discussed in issue #330

bc6b0a3

adder added 3 commits April 9, 2018 15:00

Update according code review

fd9fc09

Also add @sgibb to authors

Merge remote-tracking branch 'lgatto/master'

b2e723b

update test_robust.R with optional rlm parameter maxit

c6802c2

lgatto merged commit c6802c2 into lgatto:master Apr 9, 2018

jorainer mentioned this pull request Apr 12, 2018

Unit test "Robust summarisation" fails #335

Closed

Fix robust_summary() #330

Fix robust_summary() #330

Conversation

adder commented Apr 5, 2018

lgatto commented Apr 5, 2018 • edited Loading

sgibb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgatto Apr 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sgibb Apr 6, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adder commented Apr 7, 2018

lgatto commented Apr 7, 2018

lgatto commented Apr 7, 2018

adder commented Apr 7, 2018

lgatto commented Apr 7, 2018

sgibb commented Apr 7, 2018

lgatto commented Apr 7, 2018

lgatto commented Apr 7, 2018 • edited Loading

TODO

adder commented Apr 9, 2018

lgatto commented Apr 9, 2018

lgatto commented Apr 9, 2018

adder commented Apr 9, 2018

lgatto commented Apr 9, 2018

lgatto commented Apr 5, 2018 •

edited

Loading

lgatto Apr 7, 2018 •

edited

Loading

sgibb Apr 6, 2018 •

edited

Loading

lgatto commented Apr 7, 2018 •

edited

Loading