Fix robust_summary() #330
Makes it work when expression values are missing
I am not going to accept this PR, at least not now. Firstly, it currently fails the
> ## Results from your PR
> tail(exprs(x), n = 2)
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4514   11673.59   11936.43   12090.93   12268.70
ENO       22671.74   19986.16   16679.57   16885.14
> ## Results when first filtering out peptides with NA, as in original code
> tail(exprs(xfilt), n = 2)
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4514   11673.59   11936.43   12090.93   12268.70
ENO       38262.18   32907.09   15000.11   10733.34
What would need to be done is to add an argument that specifies that NA values can be ignored (the default would probably be FALSE, and the function would fail in case of missing data), and the difference in results (see above) would need to be documented in
Thanks for this PR. In addition to @lgatto's comments, I would like to suggest a few more changes.
Could you please change the !any(!id) into all(id)? (I can't add a direct comment to this line because it is not part of this PR.)
BTW: I still don't understand why everybody favours snake_case over camelCase, but using both in one function (robust_summary <- function(..., nIter, ...)) is weird. Please choose one of the two.
All in all I would suggest the following implementation (not tested):
robustSummary <- function(e, nIter = 100, residuals = FALSE, na.rm = FALSE, ...) {
    ## If there is only one peptide for all samples, return the
    ## expression of that peptide
    if (nrow(e) == 1L) return(e)
    ## remove data points with missing expression values
    p <- !(is.na(e) & na.rm)
    expression <- e[p]
    sample <- as.factor(col(e)[p])
    feature <- as.factor(row(e)[p])
    ## Sum contrast on peptide level so sample effect will be mean
    ## over all peptides instead of reference level.
    X <- stats::model.matrix(~ -1L + sample + feature,
                             contrasts.arg = list(feature = 'contr.sum'))
    ## MASS::rlm breaks on singular values.
    ## - Check with base lm if singular values are present.
    ## - If so, these coefficients will be zero, remove this column
    ##   from model matrix
    ## - Rinse and repeat on reduced model matrix till no singular
    ##   values are present
    repeat {
        fit <- stats::.lm.fit(X, expression)
        id <- fit$coefficients != 0L
        if (all(id)) break
        X <- X[ , id, drop = FALSE]
    }
    ## Last step is always rlm
    fit <- MASS::rlm(X, expression, maxit = nIter, ...)
    fit$coefficients[colSums(p) == nrow(e)]
}
sample <- x$sample
feature <- x$feature
## remove data points with missing expression values
expression = as.numeric(as.matrix(e))
e should always be a matrix, so as.matrix is not needed here. as.numeric is not needed at all.
Coding style: please never use = for assignment (just for arguments).
Is this true? When I removed as.matrix, I got an error.
I didn't test removing both as.matrix and as.numeric.
Maybe @lgatto can comment on this? The as.numeric(as.matrix(e)) is from his code.
Yes, using = is my bad habit. My scripts are littered with them.
I know I can't use them for packages but they often slip through. :)
Regarding the as.matrix, that's my bad. Some earlier code erroneously used a data.frame as input, hence the as.matrix. As for as.numeric(), it could be replaced by as.vector(), but I think coercion to a vector is needed.
I think coercion to a vector is needed.
That's right, but it is done automatically by e[p] (if p is logical). So there is no need to explicitly call as.vector/as.numeric.
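To illustrate the point about logical subsetting, here is a minimal sketch on a toy matrix (the matrix `e` and its dimnames are made up for illustration and are not taken from the PR). Indexing a matrix with a logical matrix of the same shape already returns a plain vector in column-major order, so no explicit coercion is needed:

```r
## Toy expression matrix: 3 peptides (rows) x 2 samples (columns).
## The values and names are illustrative assumptions only.
e <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3,
            dimnames = list(paste0("pep", 1:3), c("s1", "s2")))

## Logical mask of observed (non-missing) values, same shape as e
p <- !is.na(e)

## Logical subsetting drops the dim attribute and returns a vector
expression <- e[p]

is.matrix(expression)  # FALSE
is.numeric(expression) # TRUE
```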
Ah, ok. There was no logical subsetting in the original code, which is why it was needed then.
expression = as.numeric(as.matrix(e))
p = !is.na(expression)
expression = expression[p]
sample = rep(colnames(e), each = nrow(e))[p]
Could be replaced by as.factor(col(e)[p]) (the names of the factor are not user visible, or am I wrong?)
True, nothing here should be user visible.
But why convert everything to factors? Is this needed?
It's not much, but there is some (unneeded) computation time that can quickly add up when this function is called for every protein.
(E.g. as.character is much faster. But rownames always returns characters, so no problem here.)
Okay, apparently I didn't know what row() and col() do. This is indeed better!
character is converted to factor in model.matrix. But it doesn't matter if we use colnames(e)[col(e)[p]] or as.factor(col(e)[p]) here.
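The equivalence discussed in this thread can be checked on a small example. The sketch below uses a made-up matrix `e` (an assumption for illustration) and compares the rep()-based labels from the PR with the row()/col()-based ones suggested in the review:

```r
## Toy expression matrix; values and names are illustrative assumptions.
e <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3,
            dimnames = list(paste0("pep", 1:3), c("s1", "s2")))
p <- !is.na(e)

## Labels as built in the PR: repeat the names, then drop missing entries
sampleRep  <- rep(colnames(e), each = nrow(e))[as.vector(p)]
featureRep <- rep(rownames(e), times = ncol(e))[as.vector(p)]

## Labels via col()/row(): integer column/row index of every kept cell
sampleCol  <- colnames(e)[col(e)[p]]
featureRow <- rownames(e)[row(e)[p]]

identical(sampleRep, sampleCol)   # TRUE
identical(featureRep, featureRow) # TRUE
```

Since the labels only serve as grouping variables for the model matrix, as.factor(col(e)[p]) carries the same grouping without needing the names at all.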
p = !is.na(expression)
expression = expression[p]
sample = rep(colnames(e), each = nrow(e))[p]
feature = rep(rownames(e), times = ncol(e))[p]
as.factor(row(e)[p])
Same comment as above.
feature <- x$feature
## remove data points with missing expression values
expression = as.numeric(as.matrix(e))
p = !is.na(expression)
As @lgatto suggests, there should be an argument (e.g. na.rm) to control whether the user wants to remove or keep NA.
Fair enough. But I would argue that the default should be TRUE. See my comments below.
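One way such an argument could behave is sketched below. The helper name `keepObserved` and its error message are hypothetical and not part of MSnbase; the default of FALSE follows @lgatto's suggestion and is exactly the point under debate here:

```r
## Hypothetical helper, not part of MSnbase: returns the logical mask of
## values to keep, or fails when NAs are present and na.rm is FALSE.
keepObserved <- function(e, na.rm = FALSE) {
    p <- !is.na(e)
    if (!na.rm && !all(p))
        stop("missing values in 'e'; filter or impute them, ",
             "or set na.rm = TRUE")
    p
}

## Toy matrix with one missing value (illustrative assumption)
e <- matrix(c(1, NA, 3, 4), nrow = 2)

## With the strict default the call fails ...
strict <- tryCatch(keepObserved(e), error = function(err) err)
## ... while na.rm = TRUE returns the mask of observed values
lenient <- keepObserved(e, na.rm = TRUE)
e[lenient]
```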
## modelmatrix breaks on factors with 1 level so make vector of
## ones (this swill be intercept).
## ones (intercept).
if (length(unique(sample)) == 1L) sample <- rep(1, length(sample))
sample <- rep(1, length(sample)). If you want to replace the whole vector with the same value for each element, you could use the [] operator: sample[] <- 1L. BTW, if you use as.factor above, the whole statement is not needed.
as.factor will not work.
model.matrix fails on factors with only 1 level.
BTW if you use as.factor above the whole statement is not needed.
You are right, my above statement was wrong. But with as.factor we could use:
if (nlevels(sample) == 1L) sample <- as.integer(sample)
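The single-level case can be reproduced in isolation. This is a minimal sketch with a made-up `sample` factor (an assumption for illustration): model.matrix() is expected to refuse a factor with only one level, whereas the integer conversion above yields a single column of ones that acts as an intercept:

```r
## A sample factor with a single level (illustrative assumption)
sample <- factor(c("s1", "s1", "s1"))

## Contrasts cannot be set on a one-level factor, so this should fail
res <- tryCatch(stats::model.matrix(~ -1 + sample),
                error = function(err) err)
inherits(res, "error")

## Workaround from the review: turn the one-level factor into integers,
## which are all 1 and therefore behave like an intercept column
if (nlevels(sample) == 1L) sample <- as.integer(sample)
X <- stats::model.matrix(~ -1 + sample)
X
```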
## Put NA for the samples without any expression value
present = apply(e,2,function(x){any(!is.na(x))})
out = rep(NA,length(present))
out[present] = fit$coefficients[sampleid]
If you avoid as.numeric in line 288 you could reuse p here: colSums(p) == nrow(e). The whole three lines could be simplified to fit$coefficients[colSums(p) == nrow(e)].
I don't think this will do.
I need to return a summarised value (the sample coefficients) for every sample in the msnset.
If that sample happens to have no expression values for the given protein, this value should be NA.
In your suggested solution, values are only returned for the samples without missing data, so MSnbase will probably throw an error (or should).
Right. I somehow messed up here.
Now I would suggest:
present <- rep(NA_logical_, ncol(e))
present[colSums(!is.na(e)) > 0] <- TRUE
fit$coefficients[present]
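The requirement stated in this thread (one value per sample, NA where a sample has no data) can be sketched without fitting a model. In the example below the matrix `e` and the `coefs` vector are made up; `coefs` merely stands in for the sample coefficients a fit would return for the samples that have data:

```r
## Toy matrix: sample s2 has no observed values (illustrative assumption)
e <- matrix(c(1, 2, NA, NA, 5, 6), nrow = 2,
            dimnames = list(c("pep1", "pep2"), c("s1", "s2", "s3")))

## Stand-in for the fitted sample coefficients (hypothetical values);
## only samples with at least one observation get a coefficient
coefs <- c(s1 = 1.5, s3 = 5.5)

## Samples with at least one non-missing expression value
present <- colSums(!is.na(e)) > 0

## One entry per sample, NA for samples without any data
out <- rep(NA_real_, ncol(e))
names(out) <- colnames(e)
out[present] <- coefs
out
```

colSums(!is.na(e)) > 0 gives the same mask as the apply() call in the PR, in a vectorised form.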
present = apply(e,2,function(x){any(!is.na(x))})
out = rep(NA,length(present))
out[present] = fit$coefficients[sampleid]
return(out)
return is only needed if you want to jump out of a function early (e.g. in an if statement). If you use it at the end, it is just an unnecessary function call. That means out would be enough.
Thanks for the code review! I haven't tested the changes yet, but I'm going to look into this today and update my PR accordingly. I already posted some comments inline above on your suggested changes.
About the removing of
A pro
I quickly checked the other functions in
Looking forward to your opinion on the matter :) Thanks again for all suggestions!
Regarding whether the handling of NAs should be the default or not, what needs to be in place is a unit test that makes sure all results except for ENO are identical whatever the value of that argument and that the results for ENO are different, and that checks the exact values of ENO against the anticipated results (above). What is also needed is a good explanation in the man page. The behaviour of the different
My suggestion would be to
Once this is in place, I am happy for robust to handle NAs gracefully and use all values.
As for camel vs snake case, it's very much a historical thing. Camel case has been used in Bioconductor for longer than the snake case adoption in the very successful tidyverse packages, hence the usage of camel case in the visible API. Many older functions use a dot (in particular internal helper functions such as
I don't have strong opinions as long as it is coherent within a function or within a set of related functions.
@lgatto PS. Oops, I forgot to commit my comments on @sgibb's inline code review. See above for my additional comments.
Yes, indeed. I think informing is critical and giving them choices (with good defaults) essential. I am happy to work on the documentation aspects. I hope I'll have some time at the beginning of next week.
I don't want to start a flame war here and I really can live with snake_case but not with a mix of both.
I really don't care whether we use camelCase or snake_case but we should choose one of them and don't mix them.
Right, but they might be imputed by various methods provided by
You are right. We should be more consistent here.
I take all the blame! Re consistency (or lack thereof) with how NAs are handled, it also depends on downstream functions, and I am not sure we could or should impose a meaningful default behaviour for all aggregation methods. Which is why I suggest a message whenever NAs are present and better documentation, including reminding users about filtering or imputing missing values. Maybe we will figure out something later.
I have pushed the following to master (possibly also to this branch, hopefully this won't affect further commits here)
TODO
Okay, I added the changes of @sgibb and added him to the authors. I removed the
Cheers
Oh dear, still using
It has been merged plus some small changes. I haven't included all of @sgibb's code suggestions, as I felt they made some parts less readable (for me at least). Thank you both!
Oops, the
Also on Bioc now.
Hey Laurent
Adriaan here.
I tested out your robust summarisation implementation on my data.
It failed when there are NA values in the expression matrix. This is because I used .lm.fit instead of lm for speed reasons. And .lm.fit does not handle NAs automatically. I also avoided making a data frame in this function because this can be costly in a long loop.
I hope this PR helps
best regards
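The difference between lm and .lm.fit described above can be reproduced on a few made-up numbers (the data below are an illustrative assumption, not from the PR). lm() drops incomplete cases through its default na.action, whereas .lm.fit() passes the data straight to the QR routine and stops on non-finite values, so the missing rows have to be removed by hand:

```r
## Toy design matrix (intercept + one covariate) and a response with
## one missing value; all numbers are illustrative assumptions.
x <- cbind(1, c(1, 2, 3, 4))
y <- c(2.1, NA, 6.2, 7.9)

## .lm.fit does no NA handling and fails on the missing response
res <- tryCatch(stats::.lm.fit(x, y), error = function(err) err)
inherits(res, "error")

## Removing the incomplete rows first makes the fast fit work
keep <- !is.na(y)
fitFast <- stats::.lm.fit(x[keep, , drop = FALSE], y[keep])

## lm omits the NA row itself (default na.action) and agrees
fitLm <- stats::lm(y ~ x[, 2])
all.equal(unname(stats::coef(fitLm)), fitFast$coefficients)
```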