Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix robust_summary() #330

Merged
merged 4 commits into from
Apr 9, 2018
Merged

Fix robust_summary() #330

merged 4 commits into from
Apr 9, 2018

Conversation

adder
Copy link
Contributor

@adder adder commented Apr 5, 2018

Hey Laurent

Adriaan here.
I tested out your robust summarisation implementation on my data.
It failed when there are NA values in the expression matrix.

This is because I used .lm.fit instead of lm for speed reasons.
And .lm.fit does not handle NAs automatically.

I also avoided making a dataframe in this function because this can be costly in a long loop.

I hope this PR helps
best regards

Makes it work when expression values are missing
@lgatto
Copy link
Owner

lgatto commented Apr 5, 2018

I am not going to accept this PR, at least not now.

Firstly, it currently fails the test_robust unit test because

> ## Results from your PR
> tail(exprs(x), n = 2)
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4514   11673.59   11936.43   12090.93   12268.70
ENO       22671.74   19986.16   16679.57   16885.14
> ## Results when first filtering out peptides with NA, as in original code
> tail(exprs(xfilt), n = 2)
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4514   11673.59   11936.43   12090.93   12268.70
ENO       38262.18   32907.09   15000.11   10733.34

What would need to be done is to add an argument that specifies that NA values can be ignored (default would probably be FALSE, and would fail in case of missing data), and the difference in results (see above) would need to be documented in combineFeatures.Rd.

Copy link
Collaborator

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR. Additional to @lgatto's comments I like to suggest a few more changes.

Could you please change the !any(!id) into all(id) (I can't add a direct comment to this line because it is not part of this PR).

BTW: I still don't understand why everybody favours snake_cases over camelCase but using both in one function (robust_summary <- function(..., nIter, ...) is weird. Please choose one of both.

All in all I would suggest the following implementation (not tested):

robustSummary <- function(e, nIter = 100, residuals = FALSE, na.rm = FALSE, ...) {
    ## If there is only one 1 peptide for all samples return
    ## expression of that peptide
    if (nrow(e) == 1L) return(e)

    ## remove data points with missing expression values
    p <- !(is.na(e) & na.rm)
    expression <- e[p]
    sample <- as.factor(col(e)[p])
    feature <- as.factor(row(e)[p])

    ## Sum contrast on peptide level so sample effect will be mean
    ## over all peptides instead of reference level.
    X <- stats::model.matrix(~ -1L + sample + feature,
                             contrasts.arg = list(feature = 'contr.sum'))
    ## MASS::rlm breaks on singulare values.
    ## - Check with base lm if singular values are present.
    ## - If so, these coefficients will be zero, remove this collumn
    ##   from model matrix
    ## - Rinse and repeat on reduced modelmatrx till no singular
    ##   values are present
    repeat {
        fit <- stats::.lm.fit(X, expression)
        id <- fit$coefficients != 0L
        if (all(id)) break
        X <- X[ , id, drop = FALSE]
    }
    ## Last step is always rlm
    fit <- MASS::rlm(X, expression, maxit = nIter, ...)
    fit$coefficients[colSums(p) == nrow(e)]
}

sample <- x$sample
feature <- x$feature
## remove data points with missing expression values
expression = as.numeric(as.matrix(e))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e should always be a matrix. So as.matrix is not needed here. as.numeric not needed at all.

Coding style: please never use = for assignment (just for arguments).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? When I removeld as. matrix. I got an error.
Didn't test removing both as.matrix and as.numeric.
Maybe @lgatto can comment on this? The as.numeric(as.matrix(e)) is from his code.

Yes, using = is my bad habit. My scripts are littered with them.
I know I can't use them for packages but they often slip through. :)

Copy link
Owner

@lgatto lgatto Apr 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the as.matrix, that's my bad. Some earlier code erroneously used a data.frame as input, hence the as.matrix. As for as.numeric(), is could be replaced by as.vector(), but I think coercion to a vector is needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think coercion to a vector is needed.

That's right but done automatically by e{p] (if p is logical). So no need for explicitly call as.vector/as.numeric.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. There was not logical subsetting in the original code, which is why it was needed then.

expression = as.numeric(as.matrix(e))
p = !is.na(expression)
expression = expression[p]
sample = rep(colnames(e), each = nrow(e))[p]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be replaced by as.factor(col(e)[p]) (the names of the factor are not user visible, or I am wrong?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, nothing here shoud be user visible.

But why converting everything to factors? is this needed?
It's not much but the there some (unneeded) computation time that can quickly add up when this function is called for every protein.
(Eg. as.character is much faster. But rownames always return characters, so no problems here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, apparently I didn't know what row() and col() do. This is indeed better!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

character is converted to factor in model.matrix. But it doesn't matter if we use colnames(e)[col(e)[p]] or as.factor(col(e)[p]) here.

p = !is.na(expression)
expression = expression[p]
sample = rep(colnames(e), each = nrow(e))[p]
feature = rep(rownames(e), times = ncol(e))[p]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as.factor(row(e)[p])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

feature <- x$feature
## remove data points with missing expression values
expression = as.numeric(as.matrix(e))
p = !is.na(expression)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @lgatto suggests there should be an argument (e.g. na.rm) to control whether the user want to remove/keep NA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. But I would argue that the default should be TRUE. See my comments below.


## modelmatrix breaks on factors with 1 level so make vector of
## ones (this swill be intercept).
## ones (intercept).
if (length(unique(sample)) == 1L) sample <- rep(1, length(sample))
Copy link
Collaborator

@sgibb sgibb Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample <- rep(1, length(sample)). If you want to replace the whole vector with the same value for each element you could use the []-operator: sample[] <- 1L. BTW if you use as.factor above the whole statement is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as.factor will not work.
modelmatrix` fails on factors with only 1 level.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if you use as.factor above the whole statement is not needed.

You are right, my above statement was wrong. But with as.factor we could use:

if (nlevels(sample) == 1L) sample <- as.integer(sample)

## Put NA for the samples without any expression value
present = apply(e,2,function(x){any(!is.na(x))})
out = rep(NA,length(present))
out[present] = fit$coefficients[sampleid]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you avoid as.numeric in line 288 you could reuse p here: colSums(p) == nrow(e). The whole three lines could be simplified to fit$coefficients[colSums(p) == nrow(e)].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will do.
I need to return a summarised value (the sample coefficients) for every sample in the msnset.
If that sample happen to have no expression values for the given protein, this value should be NA.

In your suggested solution, only values are returned for the samples without missing data and so MSnbase will probably throw an error (or should).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I somehow messed up here.
Now I would suggest:

present <- rep(NA_logical_, ncol(e))
present[colSums(!is.na(e)) > 0] <- TRUE
fit$coefficients[present]

present = apply(e,2,function(x){any(!is.na(x))})
out = rep(NA,length(present))
out[present] = fit$coefficients[sampleid]
return(out)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return is just needed if you want to jump out of a function (e.g. in an if-statement). If you use it at the end it is just an unnecessary function call. That means out would be enough.

@adder
Copy link
Contributor Author

adder commented Apr 7, 2018

Thanks for the code review!

I didn't tested the changes yet but I'm going to look into this today and update my PR accordingly.

I allready posted some comments inline above on your suggested changes.

About camelCase() vs snake_case(). @lgatto choosed robust_summary. I'm not sure if MSnbase adhere to a certain style? I saw different uses in the MSnbase reference. (also point.case())
Out of curiosity @sgibb. What's your case against snake_case? :) I always find the snake_case more readible because they act like spaces so they look more like human language. :)

About the removing of NAs. It's ok to have this as an argument but I would argue that the default should be TRUE.
Every proteomics dataset I know of has an arguable high degree of missingness for the same sequences across samples.
For example I use a dataset with 27 samples (same sample but different treatments and different labs).
If for one sample there is a particular peptide missing, I don't think it's very sensible to also ignore the data in all other samples. There would not be much left for downstream analysis.
Correct me if I'm wrong but MSnbase would always throw an error by default on a biological shotgun proteomics dataset if the default of na.rm would be set to FALSE

A pro na.rm = FALSE argument could be that this error serves as a reminder to the user that there is high missingness in proteomics data and one should proceed carefully.
Maybe a solution is to let MSnbase throw an error but also put a message that this might be due to high missingness and one might consider to put na.rm to TRUE.
The error as it is, is not very helpfull for the average user.

I quickly checked the other functions in combineFeatures() and noticed that all defaults are set to FALSE.
In the prensence of missingness, medpolish also throws an error but mean and median 'fail sillently' (return NA for the failing samples).
So the behaviour accross the different summarisation functions in combineFeatures() is not very consistent.
My origninal proposesed robust function tried to be like mean and median. Give a summarised when possible, return NA if not.

Looking forward to your opinion on the matter :)

Thanks again for all suggestions!

@lgatto
Copy link
Owner

lgatto commented Apr 7, 2018

Regarding whether the handling of NAs should be default or not, what needs to be in place is a unit test that makes sure all results but for ENO are identical whatever the value of that argument and that the results for ENO are different, and to check the exact values of ENO against the anticipated results (above). What is also needed is a good explanation in the man page.

The behaviour of the different combineFeatures methods depends on the underlying code. We could try to make it more consistent, I suppose. More helpful messages and better documentation of effects of missing vales are definitely something that would be welcome.

My suggestion would be to

  1. Document the effect of NA in combining features with different methods and remind them about data imputation.
  2. First thing in combineFeatures, have a message that mentions the presence of missing values in the data and refer to the documentation for the user to see the effect thereof for different downstream methods.

Once this is in place, I am happy for robust to handle NAs gracefully and using all values.

@lgatto
Copy link
Owner

lgatto commented Apr 7, 2018

As for camel vs snake case, it's very much a historical thing. Camel case has been used in Bioconductor for longer than the snake case adoption in the very successful tidyverse packages, hence the usage of camel case in the visible API. Many older functions use a dot (in particular internal helper functions such as utils.*).

I don't have strong opinions as long as it is coherent within a function or within a set of related functions.

@adder
Copy link
Contributor Author

adder commented Apr 7, 2018

@lgatto
If I understand correctly you would prefer to have a general message through message() when combinefeatures() is called and if there is at least one NA present.
This message mentions this and points to the help of combinefeatures for more info. In the help is explained what happens for every possible summarisation function.
Correct?

PS. Oops, I forgot to commit my comments on @sgibb `s inline code review. See above for my additional comments.

@lgatto
Copy link
Owner

lgatto commented Apr 7, 2018

If I understand correctly you would prefer to have a general message through message() when combineFeatures() is called and if there is at least one NA present.
This message mentions this and points to the help of combineFeatures for more info. In the help is explained what happens for every possible summarisation function.

Yes, indeed. I think informing is critical and giving them choices (with good defaults) essential.

I am happy to work on the documentation aspects. I hope I'll have some time beginning of next week.

@sgibb
Copy link
Collaborator

sgibb commented Apr 7, 2018

About camelCase() vs snake_case(). @lgatto choosed robust_summary. I'm not sure if MSnbase adhere to a certain style? I saw different uses in the MSnbase reference. (also point.case())
Out of curiosity @sgibb. What's your case against snake_case? :) I always find the snake_case more readible because they act like spaces so they look more like human language. :)

I don't want to start a flame war here and I really can live with snake_case but not with a mix of both.
I don't really want to use snake_case:

  1. MSnbase uses camelCase.
  2. snake_case doesn't follow the Bioc coding style and MSnbase is a bioc package.
  3. snake_case needs n * _ additional keystroke(s).
  4. snake_case is n * _ character(s) longer.
  5. On my DE-keyboard-layout I have to hit "[Shift] + [-]" and "[-]" is in the lower row on the right and hard to reach (I can't understand why US keyboard users like snake_case because there the "[-]" is right in the most upper (numeric) row which is even harder to reach).
  6. You may argue that code is written once and read often so writing can't be important. Maybe snake_case is more readable in general but I am used to camelCase and IMHO it easier to find the "image" of a variable/function name (instead of really reading it). In contrast snake_case breaks the "image" by an underscore so I am often forced to really read the name.

I really don't care whether we use camelCase or snake_case but we should choose one of them and don't mix them.
(I am sorry for blaming you instead of @lgatto for the snake_case/camelCase mix 😉.)

Every proteomics dataset I know of has an arguable high degree of missingness for the same sequences across samples.

Right but the might be imputed by various methods provided by MSnbase.

I quickly checked the other functions in combineFeatures() and noticed that all defaults are set to FALSE.
In the prensence of missingness, medpolish also throws an error but mean and median 'fail sillently' (return NA for the failing samples).
So the behaviour accross the different summarisation functions in combineFeatures() is not very consistent.

You are right. We should be more consistent here.

@lgatto
Copy link
Owner

lgatto commented Apr 7, 2018

I am sorry for blaming you instead of @lgatto for the snake_case/camelCase mix wink.

I take all the blame!

Re consistency (or lack thereof) with how NAs are handled, it also depends on downstream functions, and I am not sure we could or should impose a meaningful default behaviour for all aggregation methods. Which is why I suggest a message whenever NAs are present and better documentation, including reminding users about filtering or imputing missing values. May be we will figure out something later.

@lgatto
Copy link
Owner

lgatto commented Apr 7, 2018

I have pushed the following to master (possibly also to this branch, hopefully this won't affect further commits here)

  • Added a message at the beginning of combineFeatures to inform users of missing values
  • Update the man page to describe the different effects of missing values on aggregation - comments welcome.
  • (Another fix irrelevant to this PR)

TODO

  • Update/rename the robustSummary as suggested by @adder including the review proposed by @sgibb
  • Now that I have written the documentation, I don't think we need an na.rm argument. The message will inform the user, and the current behaviour can be by first filtering the data. See the example in the new Rd page for details.
  • The unit test needs to be updated accordingly - done in bc6b0a3

@adder
Copy link
Contributor Author

adder commented Apr 9, 2018

Okey,
I tried to merge @lgatto commits to my fork before doing my pull request with changes.
I hope I didn't mess up somewhere. Tell me if something is wrong.

I added the changes of @sgibb and added him to the authors.

I remove de na.rm parameter as discussed.
I also decided to remove the Niter parameter since this is just translated to de maxit parameter of rlm.
Since you can easily change the rlm parameters, I decided that it might be better to just go with the rlm defaults.
As an added bonus: the snake_case/camelCase problem is elegantly solved this way ;)
I also changed the Unit test accordingly. They succeeded for me.
Let me know if there are still some issues.

Cheers
Adriaan

@lgatto
Copy link
Owner

lgatto commented Apr 9, 2018

Oh dear, still using = instead of <- ;-). I'll iron these details out locally.

@lgatto lgatto merged commit c6802c2 into lgatto:master Apr 9, 2018
@lgatto
Copy link
Owner

lgatto commented Apr 9, 2018

I has been merged plus some small changes. I haven't included all @sgibb's code suggestion, as I felt it made some parts less readable (for me at least). Thank you both!

@adder
Copy link
Contributor Author

adder commented Apr 9, 2018

Oeps, the = is my muscle memory working.
Sorry
Thanks for the merge

@lgatto
Copy link
Owner

lgatto commented Apr 9, 2018

Also on Bioc now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants