Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDF format with XCMS #332

Closed
rdallet opened this issue Apr 10, 2018 · 13 comments
Closed

CDF format with XCMS #332

rdallet opened this issue Apr 10, 2018 · 13 comments
Assignees

Comments

@rdallet
Copy link

rdallet commented Apr 10, 2018

Hi @jotsetung ,

@lecorguille, @yguitton, @0driveshaft0 and myself noticed an inconsistency in the construction of chromatograms when raw data come from CDF files.

Context :

We try to draw chromatograms from CDF or mzXML files.

We test the performance of :

  • tic() / bpi() of MSnBase
  • chromatogram() of XCMS (v1.46)

=> We assume that they are supposed to give the same result

Problem :

File extensions :

  • CDF :

tic() vs chromatogram(aggregationFun = 'sum') => TICs different

bpi() vs chromatogram(aggregationFun = 'max') => Same BPCs

  • mzXML :

tic() vs chromatogram(aggregationFun = 'sum') => Same TICs

bpi() vs chromatogram(aggregationFun = 'max') => Same BPCs

tic_vs_chromatogram

Functions :

  • tic() :

CDF vs mzXML => Same chromatograms

  • chromatogram()

CDF vs mzXML => Chromatograms different (+ Multiplying factor ?)

cdf_vs_mzxml

Conclusion :

Is there something wrong with the CDF format with XCMS (v1.46) ?

@jorainer jorainer self-assigned this Apr 10, 2018
@jorainer
Copy link
Collaborator

thanks for the detailed @RomainDallet comparison. I will investigate later today. Some of the differences might be explained by the fact that the tic methods reports the total ion current reported in the raw file and chromtogram calculates it based on the values.

@jorainer
Copy link
Collaborator

OK, for the differences between what tic reports and a TIC generated using the chromatogram method: I believe that can be explained as above described. The tic method simply returns the value that is reported in the raw file (CDF, mzXML or mzML file) and returned by mzR::header in column totIonCurrent. chromatogram calculates the TIC by summing all intensity values in each spectrum. To illustrate this the example below:

library(MSnbase)
od <- readMSData(system.file("cdf/ko15.CDF", package = "msdata"), mode = "onDisk")

## Get the tic
tic_od <- tic(od)

## Calculate the TIC
tic_calc <- unlist(spectrapply(od, function(z) sum(z@intensity)))

## Finally extract a TIC
tic_chr <- chromatogram(od, aggregationFun = "sum")

## Comparing the values:
par(mfrow = c(1, 2))
plot(tic_od, tic_calc, pch = 16, col = "#00000080")
plot(intensity(tic_chr[1, 1]), tic_calc, pch = 16, col = "#00000080")

tmp

As you see, the reported total ion current is completely different to the sum of intensities per spectrum. It might well be that this is specific for this file (no idea how this was generated or manipulated) - I haven't checked any other CDF files since I have none at hand.

Does this explain the differences that you see too?

Regarding the differences you see between CDF and mzXML, could it be explained by the same fact? I do not have data in both CDF and mzXML format to check, unfortunately. To test I exported the CDF above as mzXML file and compared the tic of that with the tic of the original. We do however update the totIonCurrent header value upon export with the sum of intensities, so the tic of the re-read mzXML file is identical to the tic calculated as the sum of intensities per spectrum.

In general, I try to avoid the tic because of this reason and use as much as possible chromatogram or similar, even if that means that I have to access the full raw data (tic is much faster because it doesn't require access to the original file).

@lgatto
Copy link
Owner

lgatto commented Apr 10, 2018

@jotsetung - isn't this something we should add to the documentation file with a link to this issue?

@jorainer
Copy link
Collaborator

Good point. I think there is already a note in the documentation - if not I will add it - and add a link to this issue.

jorainer added a commit that referenced this issue Apr 10, 2018
- Fix tic,OnDiskMSnExp with initial = FALSE to calculate the total ion count
  from the full data (related to issue #332).
@jorainer
Copy link
Collaborator

In the documentation of tic,OnDiskMSnExp it was mentioned that by default the totalIonCurrent from the header is returned, which can be different from a tic calculated on the data. Parameter initial = FALSE can be used to force the tic being calculated on the data. It wasn't doing that but I've fixed it now and added also a reference to this issue.

@rdallet
Copy link
Author

rdallet commented Apr 11, 2018

I tested your script to compare the 2 methods with my own dataset in CDF and mzXML format :

library(MSnbase)
od <- readMSData("20180322_002.CDF", mode = "onDisk")

## Get the tic
tic_od <- tic(od)

## Calculate the TIC
tic_calc <- unlist(spectrapply(od, function(z) sum(z@intensity)))

## Finally extract a TIC
tic_chr <- chromatogram(od, aggregationFun = "sum")

## Comparing the values:
par(mfrow = c(1, 2))
plot(tic_od, tic_calc, pch = 16, col = "#00000080")
plot(intensity(tic_chr[1, 1]), tic_calc, pch = 16, col = "#00000080")

But both, CDF and mzXML, give the same result with a reported total ion current identical to the sum of intensities per spectrum :
comparison

@jorainer
Copy link
Collaborator

OK, so the totalIonCurrent is identical to the calculated tic for your CDF. For the files from the faahko package there is definitely a difference that can be explained as above.

And between the mzXML and the CDF, is there still a difference? There should not be a difference for the calculated tic - if there is we have a bigger problem that is probably related to the different backends to read the data (proteowizard for mzXML and netcdf for CDF).

@rdallet
Copy link
Author

rdallet commented Apr 11, 2018

OK, I think the problem is here. As you can see on the next plot representing the calculated tic, the shape is similar but the difference is in the scale that is different between the 2 plots.

library(MSnbase)
mzxml <- readMSData("20180322_002.mzXML", mode = "onDisk")
cdf <- readMSData("20180322_002.cdf", mode = "onDisk")

## Calculate the TIC
tic_calc_mzxml <- unlist(spectrapply(mzxml, function(z) sum(z@intensity)))
tic_calc_cdf <- unlist(spectrapply(cdf, function(z) sum(z@intensity

par(mfrow = c(1, 2)))))
plot(tic_calc_mzxml, pch=16, col="#00000080")
plot(tic_calc_cdf, pch=16, col="#00000080")

tic_calc_vs

Edit :
tic(mzxml) and tic(cdf) are identical.

@jorainer
Copy link
Collaborator

OK, so, if I understand it correctly, the difference is in the calculated TIC from both files. Could this be something related to the precision in which the intensity values are stored in the file?
Can you have a quick look at the intensities from both files?

something like:

unlist(intensity(data_cdf))[1:40]

unlist(intensity(data_mzxml))[1:40]

I just checked the ko15.CDF file and for that the intensities seem to be integers, while for another test mzXML file they are double. I don't know if that applies to all CDF files, but if intensity data in CDF files is stored as integers, that could explain the differences you see.

@rdallet
Copy link
Author

rdallet commented Apr 23, 2018

Sorry for the delay. So, as you said, intensities in CDF files are well stored as integer and in mzXML as double.

unlist(intensity(mzxml))[1:40]
  F1.S00011   F1.S00012   F1.S00013   F1.S00014   F1.S00015   F1.S00016 
  76913.039   26206.951   12683.207   15941.148  295482.719   10936.617 
  F1.S00017   F1.S00018   F1.S00019  F1.S000110  F1.S000111  F1.S000112 
  66302.383  137487.109   14040.188   10798.498   27801.715    9913.201 
 F1.S000113  F1.S000114  F1.S000115  F1.S000116  F1.S000117  F1.S000118 
  14710.414  520396.812   25490.676  106768.203   10073.900   53364.871 
 F1.S000119  F1.S000120  F1.S000121  F1.S000122  F1.S000123  F1.S000124 
6840915.000   16774.482   54500.172   12954.951   26629.545   11592.397 
 F1.S000125  F1.S000126  F1.S000127  F1.S000128  F1.S000129  F1.S000130 
  10285.073   68888.305  116811.531   21737.303  150419.891   10987.931 
 F1.S000131  F1.S000132  F1.S000133  F1.S000134  F1.S000135  F1.S000136 
  15961.298   17425.691   22313.035 2653486.750  119605.781   14484.614 
 F1.S000137  F1.S000138  F1.S000139  F1.S000140 
  51200.082  116223.227   15091.529  262209.719

unlist(intensity(cdf))[1:40]
 F1.S00011  F1.S00012  F1.S00013  F1.S00014  F1.S00015  F1.S00016  F1.S00017 
         0          0          0          0          0          0          0 
 F1.S00018  F1.S00019 F1.S000110 F1.S000111 F1.S000112 F1.S000113 F1.S000114 
         0       3732      17367      41130      67273      75002      51556 
F1.S000115 F1.S000116 F1.S000117 F1.S000118 F1.S000119 F1.S000120 F1.S000121 
     26791       6916          0          0          0          0          0 
F1.S000122 F1.S000123 F1.S000124 F1.S000125 F1.S000126 F1.S000127 F1.S000128 
         0          0          0       2667       8286      17441      24756 
F1.S000129 F1.S000130 F1.S000131 F1.S000132 F1.S000133 F1.S000134 F1.S000135 
     24876      17727       8500       2474          0          0          0 
F1.S000136 F1.S000137 F1.S000138 F1.S000139 F1.S000140 
         0          0          0          0          0

I don't have enough knowledge in MSnBase so maybe it's a dumb question but why values are ​​so different ?

To solve this problem more easily, I propose you this link if you want to get the dataset : https://filesender.renater.fr/?s=download&token=2d862d79-d366-6d23-8e6b-2fdc16d0fedf

@jorainer
Copy link
Collaborator

Thanks for the files. I'll have a look at it.

@jorainer
Copy link
Collaborator

I looked at your files and the first thing that I realized was that the spectra in the mzXML contain much less data points than in the CDF (number of spectra is the same and their rtime is also identical up to 5 decimal positions):

library(xcms)
mzxml <- readMSData(file = "20180322_002.mzXML", mode = "onDisk")
cdf <- readMSData(file = "20180322_002.cdf", mode = "onDisk")

mean(unlist(spectrapply(mzxml, function(z) length(intensity(z)))))
[1] 1141.21
mean(unlist(spectrapply(cdf, function(z) length(intensity(z)))))
[1] 17012.48

In the mzXML is stated:

    <dataProcessing centroided="1">
      <software type="conversion" name="ProteoWizard software" version="3.0.9810"/>
      <processingOperation name="Conversion to mzML"/>
      <software type="processing" name="ProteoWizard software" version="3.0.9810"/>
      <comment>Thermo/Xcalibur peak picking</comment>
    </dataProcessing>

so, the mzXML file for sure is centroided and I guess the CDF file is in profile mode. That would explain the large difference in data points and also the difference in the calculated TIC (because each mass peak is reduced to a single m/z-intensity pair, while the CDF file contains all values). The total ion current listed in the header of the spectra is the total ion current measured by the MS instrument for the specific retention tima and AFAIK not changed/updated by proteowizard, even if you do centroiding (hence tic(cdf) is equal to tic(mzxml)).

Summarizing, the two files you provided contain a different amount of data (CDF file contains more data points than the mzXML).

@rdallet
Copy link
Author

rdallet commented Apr 27, 2018

Okay, it's much clearer now. So I will try to deprofile my CDF and make it centroided.

Thank you very much for your time.

@rdallet rdallet closed this as completed Apr 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants