[MSData::Spectrum::getMZIntensityPairs()] Sizes do not match. #170

Closed
jorainer opened this issue Nov 17, 2016 · 42 comments

Comments

@jorainer
Collaborator

I'm experiencing some random errors again:
I have a set of 690 mzML files, select 12 of them for further analysis, filter on retention time and get the following error when calling spectra on the OnDiskMSnExp:

Error: BiocParallel errors
  element index: 6
  first error: [MSData::Spectrum::getMZIntensityPairs()] Sizes do not match.

At first I thought that it must be my files, but when I randomly selected 12 other files the same error occurred. So it's most likely not these specific files; also, the element index is not always 6.
I also get errors without filtering.

And what really puzzles me is that sometimes, especially if called repeatedly, the function works without errors.

@jorainer
Collaborator Author

It can't be the files, since I can read them without problems using readMSData.

@lgatto
Owner

lgatto commented Nov 17, 2016

Where does the BiocParallel error come from, what operation is done in parallel? Is this one OnDiskMSnExp with 12 files and the filtering is done in parallel over the fileNames()?

@jorainer
Collaborator Author

The spectra call simply reads the data in, creates the Spectrum1 objects and returns them. The most obvious difference from the way readMSData reads the file(s) is that readMSData iterates over the individual acquisitionNum and reads one spectrum at a time, while for OnDiskMSnExp I'm reading all spectra in one call (with mzR::peaks(fileh, idx), where idx is the index of all spectra after filtering by rt).
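The two access patterns described in this comment can be sketched roughly as follows. This is only an illustration that assumes the mzR package is installed; the function names read_one_by_one and read_all_at_once are hypothetical and are not MSnbase internals.

```r
## Hypothetical helpers, not MSnbase code. mzR is only needed when they are called.

## Read spectra one at a time: one mzR::peaks call per spectrum index.
read_one_by_one <- function(file, idx) {
    fh <- mzR::openMSfile(file)
    on.exit(mzR::close(fh))
    lapply(idx, function(i) mzR::peaks(fh, i))
}

## Read all requested spectra with a single mzR::peaks call.
read_all_at_once <- function(file, idx) {
    fh <- mzR::openMSfile(file)
    on.exit(mzR::close(fh))
    mzR::peaks(fh, idx)
}
```

Both functions take a file path and an integer vector of spectrum indices; only the number of calls into the backend differs.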

I'll try to do some more tests tomorrow, also with other files to exclude that it's the files.

@jorainer
Collaborator Author

jorainer commented Nov 18, 2016

Some updates:
Loading a single file as OnDiskMSnExp and calling spectra on that does not cause the error, also if this is done on all 12 files sequentially. I only get the error if I load the 12 files as one experiment and call spectra on this one OnDiskMSnExp.

@lgatto
Owner

lgatto commented Nov 18, 2016

This is consistent with the BiocParallel error. Progress, I suppose...

@jorainer
Collaborator Author

Actually, the error comes from ProteoWizard

mzR::peaks -> RcppWiz::getPeakList(x) which loads the spectrum and extracts the mz intensity pairs with getMZIntensityPairs . The error is thrown by this function if the sizes of mz and intensity arrays don't match. That's what I understand from the error.
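To make the meaning of the message concrete, here is a base R analogue of such a size check. It is an illustrative sketch only, not the ProteoWizard implementation; the function name pair_mz_intensity is hypothetical.

```r
## Illustrative analogue of the size check, written in base R.
## This is not the ProteoWizard code.
pair_mz_intensity <- function(mz, intensity) {
    if (length(mz) != length(intensity))
        stop("Sizes do not match.")
    cbind(mz = mz, intensity = intensity)
}

## Matching lengths give a two-column matrix of pairs.
ok <- pair_mz_intensity(c(100.1, 200.2), c(10, 20))

## Mismatched lengths raise the error; here only the message is captured.
bad <- tryCatch(pair_mz_intensity(c(100.1, 200.2), 10),
                error = function(e) conditionMessage(e))
```

So the error indicates that the decoded m/z and intensity arrays of a spectrum ended up with different lengths, which is consistent with the binary data being read or decoded inconsistently.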

@jorainer
Collaborator Author

Now it's getting strange: with the same files converted to mzML differently (using vendor settings and without zlib compression of the binary data), the error doesn't happen again.
Is this a zlib problem?

@jorainer
Collaborator Author

Even stranger: when I process the files all in one go I don't get the error, but when I save the OnDiskMSnExp, close R, load the object again and call spectra, I get the error!

Need to get to a reproducible example using test files from msdata.

@lgatto
Owner

lgatto commented Nov 20, 2016

Somehow, on-disk objects shouldn't be saved/loaded, although, in theory, it should work if the raw files haven't been modified/moved. I just tried with a single file, and, indeed, it works. It is really puzzling. Could it be that the saving/loading error is only a red herring, and the problem lies somewhere else, deeper (for example in your commit 9898ece9a70764fb7b748bca024877c0cea44623)?

@jorainer
Collaborator Author

Yes, I think it was only by chance that it worked and then failed again. So, saving/loading might not be it.
Also, I can't reproduce this error with other files than mine. I'll try some different settings tomorrow to convert the original wiff files into mzML.

The only thing I know so far is that I get a segfault if I use the ramp backend (memory not mapped) and the [MSData::Spectrum::getMZIntensityPairs()] Sizes do not match. if I use pwiz. So apparently the binary spectra data that is returned does somehow not match. Why it sometimes works (e.g. if I load one file at a time) and sometimes not (if I load all files at once) I can't explain.

@jorainer
Collaborator Author

GOT IT!
In the function to read the spectrum values on the fly (invoked by spectrapply) I was using the mzR::peaks method without first reading the spectra's headers. Now, if I add a mzR::header call before the mzR::peaks call (even without needing the header information), it works.
Possibly the C++ code then silently reads additional information (e.g. on how the spectrum data is encoded) that is missing if the header information is not read.
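A minimal sketch of this workaround, assuming the mzR package is installed. The helper name peaks_after_header is hypothetical; the actual change in MSnbase lives in the function called by spectrapply and may look different.

```r
## Hypothetical helper: read the headers first, then the peak data.
## mzR is only needed when the function is called.
peaks_after_header <- function(file, idx) {
    fh <- mzR::openMSfile(file)
    on.exit(mzR::close(fh))
    ## The header values are not used; the call itself is the workaround.
    hdr <- mzR::header(fh, idx)
    mzR::peaks(fh, idx)
}
```

The header result is deliberately discarded, which matches the observation that the call helps even though the header information itself is not needed.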

I'll do some more tests and push the changes once fixed.

jorainer added a commit that referenced this issue Nov 21, 2016
o Ensure that header information is read too if spectra data is loaded for
  OnDiskMSnExp objects.
@jorainer
Collaborator Author

Closing issue as it seems to be fixed for good. @lgatto could you possibly bump the version and push to svn?

@lgatto
Owner

lgatto commented Nov 24, 2016

Done. Version 2.1.2 on hedgehog

CHANGES IN VERSION 2.1.2
------------------------
 o Update readMSnSet2 to save filename <2016-11-09 Wed>
 o Ensure that header information is read too if spectra data is
   loaded for OnDiskMSnExp objects (see issue #170) <2016-11-24 Thu>

I still have to extract your #170 commit and push to release 3.4.

@lgatto
Owner

lgatto commented Nov 24, 2016

> I still have to extract your #170 commit and push to release 3.4.

Done too.

lgatto pushed a commit that referenced this issue Nov 27, 2016
o Ensure that header information is read too if spectra data is loaded for
  OnDiskMSnExp objects.

From: jotsetung <johannes.rainer@gmail.com>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase@124451 bc3139a8-67e5-0310-9ffc-ced21a209358
lgatto pushed a commit that referenced this issue Nov 27, 2016
* master:
  update news
  Fix issue #170
  Add spectrapply method and backend option
  Fix unit test error due to recent changes
  Add bpi method (issue #168)
  set filename only when input is a character
  Update readMSnSet2 to save filename
  Cite Lazar 2016 in vignette imputation section
  add imputatation paper to bib
  update news and description
  fix typo in impute man page
  new github devel version

From: Laurent <lg390@cam.ac.uk>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase@124452 bc3139a8-67e5-0310-9ffc-ced21a209358
lgatto pushed two commits that referenced this issue Apr 9, 2017 (same commit messages and git-svn revisions, 124451 and 124452, as the Nov 27, 2016 commits above)
jorainer added a commit that referenced this issue May 30, 2017
- Remove the code inserted to fix issue #170; fixes issue #216.
- Add tests to the torture.R script checking if the #170 error still happens.
@jorainer
Collaborator Author

jorainer commented Jul 17, 2017

Digging deeper into the pwiz code trying to understand why this error occurs. Calling mzR::header before mzR::peaks solves the issue but has a major effect on performance, especially on gzipped mzML files. Thus I'm investigating if I can fix the issue somehow within mzR.

@jorainer
Collaborator Author

Updates related to mzR code and changes are available in sneumann/mzR#112

@jorainer
Collaborator Author

After extensive tests and evaluation of multiple approaches, the only solution to this issue seems to be the original one, i.e. to call mzR::header before reading the data with mzR::peaks. It might however be slightly modified, as it does not seem to be required to read all headers.

@lgatto
Owner

lgatto commented Jul 19, 2017

Thanks!

@jorainer
Collaborator Author

After some tests (many more to come), the issue reported here seems to occur only on macOS and there also only on one specific set of mzML files. So, if all further tests run smoothly, my suggestion would be to make the fix an option, but to disable it by default (more explanations later).

Below are some benchmark tests for just reading data using mzR::peaks and the fixes that call in addition mzR::header:

library(mzR)
library(msdata)
library(microbenchmark)

## Define the functions to compare.
only_peaks <- function(x) {
    fh <- mzR::openMSfile(x)
    pks <- mzR::peaks(fh)
    mzR::close(fh)
}

peaks_with_all_headers <- function(x) {
    fh <- mzR::openMSfile(x)
    hdr <- mzR::header(fh)
    pks <- mzR::peaks(fh)
    mzR::close(fh)
}

peaks_with_last_header <- function(x) {
    fh <- mzR::openMSfile(x)
    hdr <- mzR::header(fh, length(fh))
    pks <- mzR::peaks(fh)
    mzR::close(fh)
}

## mzML
fl <- system.file("microtofq/MM14.mzML", package = "msdata")
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
               peaks_with_last_header(fl), times = 10)
Unit: milliseconds
                       expr      min       lq     mean   median       uq      max neval cld
             only_peaks(fl) 44.89906 45.89676 47.75040 47.15564 49.25066  51.4870    10  a
 peaks_with_all_headers(fl) 71.15074 73.36380 80.23435 74.95574 80.91604 106.6319    10   b
 peaks_with_last_header(fl) 66.75709 67.77629 80.98064 69.63443 74.46741 167.8683    10   b

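Not unexpectedly, the call without header is the fastest. For mzML files, reading only the header of the last spectrum has a lower median than reading all of them (69.6 ms vs. 75.0 ms), although both fall into the same cld group (b), so the difference is not significant in this run.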

Next on a gzipped mzML file:

## gzipped mzML
fl <- system.file("proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz", package = "msdata")
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
               peaks_with_last_header(fl), times = 10)
Unit: seconds
                       expr      min       lq     mean   median       uq      max neval cld
             only_peaks(fl) 13.39147 13.52713 13.64382 13.56836 13.80904 14.14337    10 a
 peaks_with_all_headers(fl) 27.62570 27.78864 28.13496 27.99689 28.56590 29.14699    10   c
 peaks_with_last_header(fl) 15.50221 15.67585 16.01584 15.87251 16.03055 17.84753    10  b

Now that's considerably slower. Reading the header information from all spectra has really poor performance (median 28.0 s vs. 13.6 s for the data alone), while reading only the last header is better (median 15.9 s).

Next we evaluate an mzXML file:

fl <- system.file("lockmass/LockMass_test.mzXML", package = "msdata")
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
               peaks_with_last_header(fl), times = 10)
Unit: milliseconds
                       expr       min       lq      mean    median        uq       max neval cld
             only_peaks(fl)  67.81239  68.1742  70.10934  68.55077  72.94026  75.26939    10 a
 peaks_with_all_headers(fl) 122.98311 126.4965 129.60679 127.83370 131.70625 139.63150    10   c
 peaks_with_last_header(fl) 100.18529 101.1152 104.02154 102.28500 108.03445 111.65219    10  b

Similar to the mzML file, reading just the data is fastest, data + last header is second, and data + all headers is about twice as slow as reading just the data.
Below we repeat the test on the same file, but compressed:

## At last with the same file but gzipped...
fl <- "/Users/jo/data/2017/mzXML/1405_blk1.mzXML.gz"
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
               peaks_with_last_header(fl), times = 10)
Unit: seconds
                       expr      min       lq     mean   median       uq      max neval cld
             only_peaks(fl) 15.58146 15.76791 15.97458 15.96889 16.06633 16.65828    10 a
 peaks_with_all_headers(fl) 30.09662 30.22185 30.64439 30.48337 30.82167 31.81355    10   c
 peaks_with_last_header(fl) 28.57678 28.68009 29.43082 28.90038 29.33533 33.78522    10  b

Also here, reading just the data using mzR::peaks is fastest. Reading the header from a single spectrum instead of all does not make a big difference here.

Summarizing:

  • performance-wise it might be helpful to remove the additional call to header.
  • Gzipped files have a very bad impact on performance - it's better to compress the (binary) data within the mzML files.
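To put numbers on the summary, the slowdown factors can be computed from the median timings reported in the four benchmark outputs. The values below are copied from those outputs; the row labels are my own shorthand for the four test files.

```r
## Median timings copied from the benchmark outputs in this comment.
## mzML and mzXML are in milliseconds, the gzipped files in seconds.
medians <- rbind(
    mzML     = c(only_peaks = 47.15564, all_headers = 74.95574,
                 last_header = 69.63443),
    mzML_gz  = c(13.56836, 27.99689, 15.87251),
    mzXML    = c(68.55077, 127.83370, 102.28500),
    mzXML_gz = c(15.96889, 30.48337, 28.90038)
)

## Slowdown relative to reading only the peaks; units cancel within a row.
slowdown <- medians / medians[, "only_peaks"]
round(slowdown, 2)
```

Reading all headers costs roughly a factor of 1.6 to 2.1 in these runs, while the cost of reading only the last header varies more between files.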

@jorainer
Collaborator Author

The first runs for my torture tests are ready:

library(mzR)
SN <- "/Users/jo/data/2016/2016-11/NoSN/"
## SN <- "/Users/jo/data/2017/2017_02/"
## SN <- "/Users/jo/data/2016/2016_06/"
## SN <- "/Users/jo/data/2017/nalden01/"

fl <- dir(SN, full.names = TRUE)

torture_test <- function(files, FUN, iterations = 10) {
    for (i in seq_len(iterations)) {
        cat("\nIteration", i, "of", iterations, "\n\n")
        ## Process every file once per iteration.
        for (j in seq_along(files)) {
            if (j %% 20 == 0)
                cat(j, "files processed\n")
            FUN(files[j])
        }
    }
}

fail_fun <- function(x) {
    fh <- mzR::openMSfile(x)
    pks <- mzR::peaks(fh)
    mzR::close(fh)
}
torture_test(fl, FUN = fail_fun)

In brief, the test opens each file, extracts the data from each spectrum in the file using mzR::peaks and closes the file again. This is repeated 10 times on the files in one folder.
The fail_fun represents the way we would usually read data, which is what caused the errors described at the top of this issue. I re-run all the tests on 4 different sets of mzML files:

  • 2016-11/NoSN/: 690 files on which I first got the error (on macOS).
  • 2016_06: 609 other mzML files from our lab.
  • 2017_02: 160 mzML files from our lab.
  • nalden01: 9 mzML files from another lab.

I'll run the torture tests on:

  • macOS
  • Linux
  • Windows

to evaluate if I get the error on all 3 platforms.

A note on the mzML files from our lab: they are converted from ABI wiff format to mzML using proteowizard on Windows 7.

@jorainer
Collaborator Author

Results for macOS:

  • 2016-11/NoSN: 1x OK, 3x FAIL. torture test failed 3 times (each time in the first iteration after ~ 40 files), and succeeded 1 time.
  • 2016_06: 2x OK, 1x FAIL.
  • 2017_02: 3x FAIL (first iteration after ~120 files).
  • nalden01: 4x OK.

As described above, the error occurs randomly, although more frequently on certain files - but not always.

sessionInfo:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin16.7.0/x86_64 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mzR_2.11.5   Rcpp_0.12.12

loaded via a namespace (and not attached):
[1] compiler_3.4.1      ProtGenerics_1.9.0  parallel_3.4.1     
[4] Biobase_2.37.2      codetools_0.2-15    BiocGenerics_0.23.0

jorainer added a commit that referenced this issue Jul 21, 2017
- Add an option fastLoad to disable the additional mzR::header call executed
  before each mzR::peaks call to fetch data on-demand for OnDiskMSnExp
  objects. This partially reverts the fix for issue #170 as this seems to be
  macOS and file specific.
- Add related unit tests and documentation.
@jorainer
Collaborator Author

jorainer commented Jul 21, 2017

Results for Linux:

  • 2016-11/NoSN: 2x OK.
  • 2016_06: 2x OK.
  • 2017_02: 4x OK.
  • nalden01: 4x OK.

Apparently, on Linux there is no problem using just mzR::peaks.

sessionInfo:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu/x86_64 (64-bit)
Running under: Linux Mint 18.1

Matrix products: default
BLAS: /home/jo/R/2017-07/R-3.4.1-BioC3.6-devel/lib/R/lib/x86_64/libRblas.so
LAPACK: /home/jo/R/2017-07/R-3.4.1-BioC3.6-devel/lib/R/lib/x86_64/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mzR_2.11.5   Rcpp_0.12.12

loaded via a namespace (and not attached):
[1] compiler_3.4.1      ProtGenerics_1.9.0  parallel_3.4.1     
[4] Biobase_2.37.2      codetools_0.2-15    BiocGenerics_0.23.0

@jorainer
Collaborator Author

Results for Windows:

  • 2016-11/NoSN: 2x OK.
  • 2016_06: 2x OK.
  • 2017_02: 2x OK.
  • nalden01: 2x OK.

Also on Windows, using mzR::peaks without a preceding mzR::header call works.

sessionInfo:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] mzR_2.11.4   Rcpp_0.12.11

loaded via a namespace (and not attached):
[1] compiler_3.4.1      ProtGenerics_1.9.0  parallel_3.4.1
[4] Biobase_2.37.2      codetools_0.2-15    BiocGenerics_0.23.0

@jorainer
Collaborator Author

Conclusion from these tests:

  • mzR::peaks without mzR::header fails only on macOS but works on Linux and Windows.
  • Add an option to MSnbase to enable/disable the additional reading of the spectrum headers. Enable this by default on macOS systems.

I will add this (and in addition remove the additional gc(); issue #151) in the function called by spectrapply and perform torture tests to ensure that all is properly working.
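A platform-dependent default of this kind could be sketched as below. This is a hypothetical illustration in base R, not the code that was added to MSnbase; the function name default_fast_load is an assumption.

```r
## Hypothetical default: skip the extra header read everywhere except macOS.
## `sysname` is a parameter only so the logic can be checked on any platform.
default_fast_load <- function(sysname = Sys.info()[["sysname"]]) {
    ## Sys.info() reports "Darwin" as the system name on macOS.
    !identical(sysname, "Darwin")
}
```

With this convention, TRUE means the header call is skipped (fast), and FALSE means the headers are read before the peak data.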

@jorainer
Collaborator Author

Next I'm running torture tests using MSnbase functions and methods:

library(MSnbase)

torturing <- function(x) {
    tmp <- readMSData2(x, msLevel. = 1)
    register(SerialParam())
    for (i in 1:10) {
        cat("--- ", i, " ---", "\n")
        cat("first spectrapply\n")
        sp <- MSnbase::spectrapply(tmp, FUN = function(z) {max(mz(z))})
        rm(sp)
        gc()
        cat("second spectrapply\n")
        sp <- MSnbase::spectrapply(tmp, FUN = function(z) {max(mz(z))})
        rm(sp)
        gc()
        tmp <- filterRt(tmp, rt = c(5, 500))
        cat("third spectrapply after filter rt\n")
        sp <- MSnbase::spectrapply(tmp, FUN = function(z) {max(mz(z))})
        cat("\n\n")
    }
}

This function is run on the same sets of test files on macOS, Linux and Windows.
Settings enabled in the spectrapply:

  • No additional call to gc() in the function called by spectrapply (issue #151, "Failing unit tests - memory leaks?").
  • On macOS the function reads the header of the last spectrum prior to reading the data, for all other systems this is skipped (as it does not seem to be required).

jorainer added a commit that referenced this issue Jul 25, 2017
- Update torture script to evaluate the fastLoad option (not reading header
  prior to read data) and the removal of the additional gc() call in
  spectrapply.
- Tune the functions called by spectrapply,OnDiskMSnExp.
- Automatically disable fastLoad on macOS.
@jorainer
Collaborator Author

torture test results for macOS:
Each run with fastLoad = TRUE failed with error message:

Error in object@backend$getPeakList(x) : 
  [MSData::Spectrum::getMZIntensityPairs()] Sizes do not match.

With fastLoad = FALSE:

  • 2016-11/NoSN: 2x OK.
  • 2017_02: 2x OK.
  • nalden01: 1x OK.
  • 2016_06: 2x OK.

So, for macOS we definitely have to use fastLoad = FALSE. Apart from that all seems to be fine.

@jorainer
Collaborator Author

jorainer commented Jul 26, 2017

torture test results for Linux:
The tests were run with fastLoad = TRUE:

  • 2016/NoSN: 2x OK.
  • 2017_02: 2x OK.
  • nalden01: 2x OK.
  • 2016_06: 2x OK.

For Linux there seems to be no need to call mzR::header before mzR::peaks which can speed up things considerably.

@jorainer
Collaborator Author

Finally, torture test results for Windows:
Tests run with fastLoad = TRUE:

  • 2016/NoSN: 2x OK.
  • 2017_02: 2x OK.
  • nalden01: 2x OK.
  • 2016_06: 2x OK.

Looks like also on Windows fastLoad = TRUE works nicely.

@jorainer jorainer closed this as completed Aug 1, 2017
lgatto pushed two commits that referenced this issue Sep 7, 2017 (same commit messages and revisions, 124451 and 124452, as the Nov 27, 2016 commits above; git-svn-id now under file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase)
@wmoldham

I just encountered the following error when using either chromatogram() or mz() functions:

Error: BiocParallel errors
  element index: 26
  first error: [MSData::Spectrum::getMZIntensityPairs()] Sizes do not match.

I am analyzing .mzML files generated by msconvert of Thermo .raw files on a Windows 10 device and analyzing with R 3.5.1 running in Rstudio 1.1.456 on a Mac. Of course, I don't get the error if I run readMSData() using mode = "inMemory". I read the above thread in detail and was wondering how to apply the solution?

Thanks for your help and apologies for key missing details; this is my first post in such a forum.

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-2   IPO_1.6.0            CAMERA_1.36.0        rsm_2.10            
 [5] xcms_3.2.0           MSnbase_2.6.4        ProtGenerics_1.12.0  mzR_2.14.0          
 [9] Rcpp_0.12.19         BiocParallel_1.14.2  Biobase_2.40.0       BiocGenerics_0.26.0 
[13] BiocInstaller_1.30.0

loaded via a namespace (and not attached):
 [1] vsn_3.48.1             splines_3.5.1          foreach_1.4.4          Formula_1.2-3         
 [5] assertthat_0.2.0       affy_1.58.0            stats4_3.5.1           latticeExtra_0.6-28   
 [9] RBGL_1.56.0            yaml_2.2.0             impute_1.54.0          pillar_1.3.0          
[13] backports_1.1.2        lattice_0.20-35        glue_1.3.0             limma_3.36.5          
[17] digest_0.6.18          checkmate_1.8.5        colorspace_1.3-2       htmltools_0.3.6       
[21] preprocessCore_1.42.0  Matrix_1.2-14          plyr_1.8.4             MALDIquant_1.18       
[25] XML_3.98-1.16          pkgconfig_2.0.2        zlibbioc_1.26.0        purrr_0.2.5           
[29] scales_1.0.0           RANN_2.6               affyio_1.50.0          tibble_1.4.2          
[33] htmlTable_1.12         IRanges_2.14.12        ggplot2_3.0.0          nnet_7.3-12           
[37] lazyeval_0.2.1         MassSpecWavelet_1.46.0 survival_2.42-6        magrittr_1.5          
[41] crayon_1.3.4           doParallel_1.0.14      MASS_7.3-51            foreign_0.8-71        
[45] graph_1.58.2           data.table_1.11.8      tools_3.5.1            stringr_1.3.1         
[49] S4Vectors_0.18.3       munsell_0.5.0          cluster_2.0.7-1        bindrcpp_0.2.2        
[53] pcaMethods_1.72.0      compiler_3.5.1         mzID_1.18.0            rlang_0.2.2           
[57] grid_3.5.1             iterators_1.0.10       rstudioapi_0.8         htmlwidgets_1.3       
[61] igraph_1.2.2           base64enc_0.1-3        gtable_0.2.0           codetools_0.2-15      
[65] multtest_2.36.0        R6_2.3.0               gridExtra_2.3          knitr_1.20            
[69] dplyr_0.7.7            bindr_0.1.1            Hmisc_4.1-1            stringi_1.2.4         
[73] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5     

@lgatto
Owner

lgatto commented Oct 16, 2018

Thank you for the report @wmoldham - Do you get the error on OSX and Windows, or only OSX?

@wmoldham

I have only attempted this on OSX. I don't have easy access to a Windows machine (!), but I can try to find one to reproduce it there.

@jorainer
Collaborator Author

I had the same error recently on a set of files too (on OSX). To me this happened randomly, i.e. if I called the same function a second time I did not get the error again. That made me think it might be related to garbage collection. Note also that this error is thrown by the proteowizard routines that are used in mzR for data import.

@lgatto
Owner

lgatto commented Oct 17, 2018

@jotsetung - is the fastLoad = TRUE parameter still available? If so, @wmoldham could set it to FALSE (at the cost of slowing down access) if the error persists.

@jorainer
Collaborator Author

On macOS it should always be FALSE, but you can check its value with isMSnbaseFastLoad(), @wmoldham.
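For reference, checking and changing the setting might look like this. It is a sketch that assumes the installed MSnbase version exports both isMSnbaseFastLoad() and setMSnbaseFastLoad(); the guard simply skips the calls if MSnbase is not available.

```r
## Assumption: MSnbase provides isMSnbaseFastLoad() and setMSnbaseFastLoad().
msnbase_available <- requireNamespace("MSnbase", quietly = TRUE)

if (msnbase_available) {
    ## Report the current setting (expected to be FALSE on macOS).
    print(MSnbase::isMSnbaseFastLoad())
    ## Force the slower reading that calls mzR::header before mzR::peaks.
    MSnbase::setMSnbaseFastLoad(FALSE)
    print(MSnbase::isMSnbaseFastLoad())
}
```

Setting the option to FALSE trades speed for the header-first reading discussed earlier in this thread.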

@wmoldham

Apologies for the delayed response. After restarting Rstudio, I spent yesterday working with the data, including repeating the processing steps that yielded the error previously, and I did not encounter this error again. No problems using the readMSData() using mode "onDisk". I can confirm that isMSnbaseFastLoad() == FALSE. I will get back in touch if I can find a way to reproduce the error! Thanks for your attention.

@lauzikaite

Hi all,

I've been experiencing the same issue on and off using the function findChromPeaks on an OnDiskMSnExp object.

Error: BiocParallel errors
  element index: 5, 6, 7, 8, 9, 10, ...
  first error: [MSData::Spectrum::getMZIntensityPairs()] Sizes do not match.

I am running this on macOS, with fastLoad = FALSE. I am struggling to reproduce this error, as it keeps coming up with different files, and if I call the same function a second time on the same file, the error is not produced.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] massFlowR_1.0       ggplot2_3.0.0       bindrcpp_0.2.2      xcms_3.2.0          MSnbase_2.6.2       ProtGenerics_1.12.0 mzR_2.14.0         
 [8] Rcpp_0.12.19        BiocParallel_1.14.2 Biobase_2.40.0      BiocGenerics_0.26.0

loaded via a namespace (and not attached):
 [1] viridis_0.5.1          vsn_3.48.1             tidyr_0.8.1            viridisLite_0.3.0      splines_3.5.1          foreach_1.4.4         
 [7] assertthat_0.2.0       affy_1.58.0            stats4_3.5.1           yaml_2.2.0             impute_1.54.0          pillar_1.3.0          
[13] lattice_0.20-35        glue_1.3.0             limma_3.36.2           digest_0.6.15          RColorBrewer_1.1-2     colorspace_1.3-2      
[19] preprocessCore_1.42.0  Matrix_1.2-14          plyr_1.8.4             MALDIquant_1.18        XML_3.98-1.16          pkgconfig_2.0.1       
[25] devtools_1.13.6        zlibbioc_1.26.0        purrr_0.2.5            scales_1.0.0           RANN_2.6               affyio_1.50.0         
[31] tibble_1.4.2           IRanges_2.14.10        withr_2.1.2            lazyeval_0.2.1         MassSpecWavelet_1.46.0 survival_2.42-6       
[37] magrittr_1.5           crayon_1.3.4           memoise_1.1.0          doParallel_1.0.11      MASS_7.3-50            xml2_1.2.0            
[43] BiocInstaller_1.30.0   tools_3.5.1            stringr_1.3.1          S4Vectors_0.18.3       munsell_0.5.0          pcaMethods_1.72.0     
[49] compiler_3.5.1         mzID_1.18.0            rlang_0.2.2            grid_3.5.1             iterators_1.0.10       rstudioapi_0.7        
[55] igraph_1.2.2           labeling_0.3           testthat_2.0.0         gtable_0.2.0           codetools_0.2-15       multtest_2.36.0       
[61] roxygen2_6.1.0         R6_2.2.2               gridExtra_2.3          dplyr_0.7.6            bindr_0.1.1            commonmark_1.5        
[67] stringi_1.2.4          tidyselect_0.2.4       faahKO_1.20.0 

@wmoldham

@lauzikaite, I have had much better stability utilizing the doParallel package described in the vignettes linked below. I don't think it completely eliminates the error, but the frequency is dramatically decreased and no longer interferes with the analysis. Hope it helps you too.

Metabolomics data pre-processing

LCMS data preprocessing and analysis with xcms

@jorainer
Collaborator Author

jorainer commented Nov 5, 2018

@lauzikaite yes, I am aware of this and it keeps happening to me too (macOS). Problem is that I have no idea how we could fix the error. To me it seems to be related to some garbage collection process (in R?) that kicks in randomly.

@wmoldham thanks for your input! I also had the impression that with doParallel it works better - but was not sure if it wasn't pure imagination.

@lauzikaite

@wmoldham, thank you for the suggestion. I can confirm that using a doParallel cluster reduces the frequency of this issue, as does using the BiocParallel::DoparParam() backend for the BiocParallel::bplapply function. Neither completely eradicates it, however.
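For readers who want to try the same setup, a sketch of registering a doParallel cluster as the BiocParallel backend follows. It assumes the BiocParallel and doParallel packages are installed; the helper name use_dopar_backend and the default of 2 workers are hypothetical choices.

```r
## Hypothetical helper: route BiocParallel::bplapply() through a doParallel
## cluster. The packages are only needed when the function is called.
use_dopar_backend <- function(workers = 2L) {
    cl <- parallel::makeCluster(workers)
    doParallel::registerDoParallel(cl)
    BiocParallel::register(BiocParallel::DoparParam())
    ## Return the cluster so the caller can parallel::stopCluster() it later.
    invisible(cl)
}
```

As noted above, this only reduced how often the error appeared; it is not a fix.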

@jotsetung, thank you. I just wanted to inquire whether I've missed something in my setup to avoid this. So far, the best "fix" for me has been the use of a while loop together with try for the findChromPeaks implementation on multiple files.
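The while loop with try mentioned here can be sketched generically in base R. The helper with_retry is hypothetical, and the failing function below is only a stand-in that imitates the intermittent error, not a real findChromPeaks call.

```r
## Hypothetical helper: call `fun` until it succeeds or `max_tries` is reached.
with_retry <- function(fun, max_tries = 5L) {
    tries <- 0L
    res <- NULL
    while (tries < max_tries) {
        tries <- tries + 1L
        res <- try(fun(), silent = TRUE)
        if (!inherits(res, "try-error"))
            return(res)
    }
    stop("still failing after ", max_tries, " attempts")
}

## Stand-in for an intermittently failing call: fails twice, then succeeds.
calls <- 0L
flaky <- function() {
    calls <<- calls + 1L
    if (calls < 3L)
        stop("Sizes do not match.")
    "ok"
}
result <- with_retry(flaky)
```

In practice `fun` would wrap the real call on the user's own objects, for example a function that runs findChromPeaks on one file. Capping the number of attempts avoids an endless loop if a file fails for a different, persistent reason.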

@trainorp

I have also noticed this problem on mac...
But not on linux (openSuse). So strange

@jorainer
Collaborator Author

For me (on mac) it seems to be OK now. Regarding linux and Windows, I never got this error on my linux and windows test environments. This seems indeed to happen (randomly) on mac, and I have absolutely no idea why (the error is thrown by the proteowizard C++ code that is used by mzR).

@YonghuiDong

I encountered the same problem on Mac.
