Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could we speedup validObject for MSnExp and Spectrum classes? #183

Closed
sgibb opened this issue Jan 18, 2017 · 13 comments
Closed

Could we speedup validObject for MSnExp and Spectrum classes? #183

sgibb opened this issue Jan 18, 2017 · 13 comments

Comments

@sgibb
Copy link
Collaborator

sgibb commented Jan 18, 2017

While working on synapter we recognise that creating MSnExp object takes a lot of time. Fairly enough the PLGS data format is stupid (an xml file with something like <spectrum>tab delim values</spectrum>) and we can't use OnDiskMSnExp. Reading the file and generating the Spectrum2 objects took 50 minutes (see minutes 22 to 70 on the figure below; ~ 25 minutes reading the file and ~ 25 minutes creating the Spectrum2 objects). Subsequently creating an MSnExp object from these Spectrum2 objects took 45 minutes (70 to 115 on the figure below). The vast majority is spend in validObject.

usage

For a smaller dataset (36000 spectra) validObject needs 40 seconds on my laptop. By not calling validObject it reads the spectra in 100 instead of 140 seconds.
It is eye-catching that we call msLevel 8 times (which takes nearly 8 * 0.5 seconds; 1. valid of pSet; multiple times in .header, e.g. precursorMz, precursorCharge ...). header needs nearly 10 seconds (which could maybe reduced a little bit).
But what eats most time is isVersioned (nearly 14 seconds). Unfortunately most of the time is spend in the S4 class system/method dispatching. So we can't change much ...

valid-graph

valid-data

Any ideas to improve this?

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

There is a C-level constructor, that we use for OnDiskMSnSet objects when we create the spectra on the fly: MSnbase:::Spectrum2, MSnbase:::Spectrum1, ...

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 18, 2017

I already use MSnbase:::Spectrum2_mz_sorted (and spectra creation is not too bad) but I call validObject(msnexp) explicitly after it was build.

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

I can't think of a solution other than not calling validObject. Would you want a validity method that checks everything but the content of the assay data?

@jorainer
Copy link
Collaborator

I think we do have something like that for the OnDiskMSnExp, i.e. the validObject calling some obvious stuff, without having to read the spectrum data and a method that could be called manually that performs a comprehensive check.

@jorainer
Copy link
Collaborator

An alternative would be to add a C-level validity check function for Spectrum objects.

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

we do have something like that for the OnDiskMSnExp, i.e. the validObject calling some obvious stuff

validObject,OnDiskMSnExp does very basic validity checking, without accessing the raw data. Then, there's the validateOnDiskMSnExp function, that calls the former, then creates the spectra and checks their validity and that their data is valid with respect to the feature meta-data.

We could have a validateMSnExp function that does everything by the assay data... but, wouldn't you want to make sure they are valid from the clumsy XML data? Or maybe an additional argument to that function that if data should be validated?

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 18, 2017

For synapter I think it would be enough to check the spectra in .createMsnbaseSpectrum2 (as we have done before): https://github.com/lgatto/synapter/blob/fa57c514b60420c6588148420e4c021e0449b4a9/R/spectrum-functions.R#L297-L308
But I will remove the additional MSnExp validation as you suggested. We don't need to check the spectra twice and I am not interested in the header etc, because we never use these information but only the spectra. Creating an MSnExp object instead of a list of Spectrum2 objects was a decision to use already well tested structures. And calling validObject(SynapterObject) will cause validObject(MSnExp) anyway.

For MSnbase in general I would suggest that we find a way to avoid at least the multiple msLevels calls (or reduce their number).

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

For MSnbase in general I would suggest that we find a way to avoid at least the multiple msLevels calls (or reduce their number).

Could you clarify here - just looked into validObject,pSet (there's actually no validObject,MSnExp) for multiple calls to msLevelbut couldn't fine them.

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 18, 2017

Sorry I meant validObject,pSet and indeed there is just one direct call to msLevel:
https://github.com/lgatto/MSnbase/blob/master/R/methods-pSet.R#L36
But than there is header:
https://github.com/lgatto/MSnbase/blob/master/R/methods-pSet.R#L81
which calls .header:
https://github.com/lgatto/MSnbase/blob/master/R/utils.R#L329-L365
and there all these functions call msLevel again:

Maybe I missed some call. I had a global counter that said 8. But maybe there was a msLevel call from another function.
I am wondering whether if (msLevel(object)[1] > 1) could be replaced by if (msLevel(object[[1]]) > 1) (in my test it seems to have no effect ...)

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

I am wondering whether if (msLevel(object)[1] > 1) could be replaced by if (msLevel(object[[1]]) > 1) (in my test it seems to have no effect ...)

This would be a good thing for MSnExp instances, but bad for OnDiskMSnExp, because the alternative is going to trigger extraction of the raw data. We could have a .firstMsLevel function that does the right (fast) thing; something like

.firstMsLevel <- function(object) {
  if (inherits(object, "OnDiskMSnExp")) msLevel(object)[1]
  else msLevel(object[[1]])
}

Comments, suggestions?

@jorainer
Copy link
Collaborator

great! was about to write about the possible problem with OnDiskMSnExp and you already provided the solution!

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

Have added .firstMsLevel to #059f3d5d2234742736ba910288e3a08266802072 - will replace as suggested above asap.

@lgatto
Copy link
Owner

lgatto commented Jan 18, 2017

Using .firstMSLevel in aba08a6

@lgatto lgatto closed this as completed Jan 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants