adding mzidentml data to MSexp object fails #42

adder · 2015-01-17T12:06:25Z

Hi

I am trying to add a .mzid file to my MSnExp object, but when I look at fData I only see NAs. There are certainly identifications in the file (it is 50 MB, and when I open it in an editor I see lots of entries). The .mzid was generated by MSGF+ from a search performed on an MGF file. The MSnExp object was generated from the same MGF file. I also loaded the results into PeptideShaker and tried to create an .mzid file with the PeptideShaker export, with the same result.
When I add identification data in MSnbase using the files provided with the MSnbase package, it works.

Is it something to do with the MGF file?

Greetz

sgibb · 2015-01-17T12:18:01Z

Which version of MSnbase do you use? What is the output of sessionInfo()? Could you share the corresponding files?

lgatto · 2015-01-17T12:33:22Z

> Is it something to do with the MGF file?

Yes, indeed. The MGF file is very metadata poor, and the matching
between the spectra in the MSnExp object and the identification
results cannot be guaranteed. It would probably be possible to get some
code to work for specific pipelines (say Mascot), but without much
generality.

However, using files that store metadata in a well-defined standard will
work. I would suggest you convert your raw data into mzML to be used
as input to MSGF+ and to generate the MSnExp. The mzid produced by
MSGF+ will then have all the necessary metadata from the mzML file to
match the spectra.

In the future, please also provide the output of sessionInfo(), as
suggested by Sebastian, to enable us to verify the software versions you
use.

Hope this helps,

Laurent

adder · 2015-01-17T12:38:23Z

Thanks a lot for the swift response!

sessionInfo()

other attached packages:
[1] MSnbase_1.14.1      BiocParallel_1.0.0  mzR_2.0.0          
[4] Rcpp_0.11.3         Biobase_2.26.0      BiocGenerics_0.12.1

loaded via a namespace (and not attached):
 [1] affy_1.44.0           affyio_1.34.0         base64enc_0.1-2      
 [4] BatchJobs_1.5         BBmisc_1.8            BiocInstaller_1.16.1 
 [7] brew_1.0-6            checkmate_1.5.1       codetools_0.2-9      
[10] colorspace_1.2-4      compiler_3.1.2        DBI_0.3.1            
[13] digest_0.6.8          doParallel_1.0.8      fail_1.2             
[16] foreach_1.4.2         ggplot2_1.0.0         grid_3.1.2           
[19] gtable_0.1.2          impute_1.40.0         IRanges_2.0.1        
[22] iterators_1.0.7       lattice_0.20-29       limma_3.22.4         
[25] MALDIquant_1.11       MASS_7.3-35           munsell_0.4.2        
[28] mzID_1.4.1            pcaMethods_1.56.0     plyr_1.8.1           
[31] preprocessCore_1.28.0 proto_0.3-10          reshape2_1.4.1       
[34] RSQLite_1.0.0         S4Vectors_0.4.0       scales_0.2.4         
[37] sendmailR_1.2-1       stats4_3.1.2          stringr_0.6.2        
[40] tools_3.1.2           vsn_3.34.0            XML_3.98-1.1         
[43] zlibbioc_1.12.0

The MGF and both msgf and peptideshaker mzid files are in this dropbox folder:
https://www.dropbox.com/sh/ezjm78f16bidcty/AAB20qeWSguhfWT-tMkvV1xda?dl=0

greets

adder · 2015-01-17T13:10:46Z

Hi Laurent

I didn't see your comment when I posted my previous answer. I didn't make the MGF myself. I'll try to get hold of the original files. Thanks for the suggestion!
I figured there was enough metadata in the MGF files, since PeptideShaker seems to be able to make sense of the MGF and search engine output files. MGF files seem to be the standard files in the SearchGUI-PeptideShaker workflow. Some search engines (like X!Tandem) do not even accept mzML files as input (which is a pity).

Thanks anyway for your clarification!
greetz

lgatto · 2015-01-17T18:24:46Z

Our issue is different from using the mgf file to run a search. We need a way to match spectra defined in the mgf and the mzid. We decided to automate this for well-defined formats, such as mzML and mzid (see the PSI page for details).

It is perfectly possible to do the matching using the TITLE header in the mgf file

TITLE=Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.2.2.2
RTINSECONDS=900.7589
PEPMASS=634.32830810546875 2871.236328125
CHARGE=2

and the matching element in the mzid

<cvParam accession="MS:1000796" cvRef="PSI-MS" value="Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.2.2.2" name="spectrum title"/>

In the MSnExp, the title can be extracted from the featureData

>  xx <- readMgfData("~/Downloads/3_2.mgf")
> head(fData(xx)$TITLE)
                                                        X1 
        Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.2.2.2 
                                                       X10 
      Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.16.16.2 
                                                      X100 
    Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.311.311.2 
                                                     X1000 
  Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.1663.1663.2 
                                                    X10000 
Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.13140.13140.3 
                                                    X10001 
Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.13141.13141.3 
13998 Levels: Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.10001.10001.2 ...
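
A minimal sketch of that matching step in base R, for illustration only. The two data frames are toy stand-ins (hypothetical values) for `fData(xx)` and a flattened mzid table; the `pepseq` column name is an assumption about what the identification table contains.

```r
# Toy stand-in for fData(xx): one row per spectrum, TITLE taken from the mgf
fd <- data.frame(TITLE = c("run.1.1.2", "run.2.2.2", "run.3.3.3"),
                 stringsAsFactors = FALSE)
rownames(fd) <- c("X1", "X2", "X3")

# Toy stand-in for the flattened mzid: only two spectra were identified
id <- data.frame(`spectrum title` = c("run.3.3.3", "run.1.1.2"),
                 pepseq = c("PEPTIDEK", "SAMPLER"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# For each spectrum, find the row of the identification table with the
# same title; spectra without an identification get NA
idx <- match(fd$TITLE, id[["spectrum title"]])
fd$pepseq <- id$pepseq[idx]
fd
```

Spectra with no matching title keep `NA`, which is the expected outcome for unidentified spectra rather than a sign of a failed merge.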

It's not too much work, I think, but I will not have time to implement this now. Maybe @sgibb has some time.

lgatto · 2015-01-17T18:55:59Z

@sgibb Ideally, we should make it possible to define what is matched between the MSnExp/MSnSet and the mzid file. The default is acquisition numbers, but ideally one could specify a feature variable such as fData()$TITLE (the default being acquisitionNum()) and a column name in the flattened mzID data.frame ("spectrum title" or "acquisitionnum", in the given example file). What do you think?

adder · 2015-01-18T09:29:28Z

Such a feature would indeed be great and very useful.
I took a short look at the code base, but I'm afraid I can't be of much help here. :)

sgibb · 2015-01-18T19:35:31Z

I am going to look into it. Can we assume that all mgf files need the combination fData()$TITLE and mzID::flatten()[,"spectrum title"]? If that is the case we could provide an automatism to select the correct column names.

lgatto · 2015-01-18T19:58:45Z

> I am going to look into it. Can we assume that all mgf files need the combination
> fData()$TITLE and mzID::flatten()[,"spectrum title"]? If that is the case we could provide an
> automatism to select the correct column names.

It seems so:

> library("rols")
> term("MS:1000796", "MS")
[1] "spectrum title"
> termMetadata("MS:1000796", "MS")
definition: A free-form text title describing a spectrum. comment: This is the preferred storage place for the spectrum TITLE from an MGF peak list.

But it would still be great to be general and not only support the acquisition number and spectrum title.

sgibb · 2015-01-18T21:28:43Z

Seems to work by simply replacing the merging columns acquisition.number/acquisitionnum by TITLE/spectrum title: a0f5b62
Are you satisfied with fDataCol and iDataCol as argument names? Instead of using a default we could set them automatically, e.g.:

fDataCol <- ifelse(is.null(fDataCol) && grepl("mgf", fileNames(msexp)), "TITLE", "acquisition.number")
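
One caveat with a one-liner of this shape: when the caller does supply a value, the condition is FALSE and the value is replaced by "acquisition.number". A slightly longer sketch that keeps an explicit choice could look like the following. `defaultFcol` is a hypothetical helper name, not part of MSnbase, and the rule of requiring every file to end in `.mgf` is an assumption.

```r
# Hypothetical helper: pick a default feature column from the file names.
defaultFcol <- function(filenames, fcol = NULL) {
    # An explicit choice by the caller always wins
    if (!is.null(fcol)) return(fcol)
    # Otherwise use TITLE only if every input file is an mgf file
    isMgf <- grepl("\\.mgf$", filenames, ignore.case = TRUE)
    if (all(isMgf)) "TITLE" else "acquisition.number"
}

defaultFcol("3_2.mgf")
defaultFcol("3_2.mzML")
defaultFcol("3_2.mgf", fcol = "myColumn")
```

Using `all()` also avoids passing a vector of length greater than one to `&&` when the experiment was built from several files.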

adder · 2015-01-19T09:36:32Z

Thanks a lot for the effort. If I can test something on my files, just let me know :)

lgatto · 2015-01-19T13:33:55Z

> Seems to work by simply replacing the merging columns acquisition.number/acquisitionnum by TITLE/spectrum title

Basically, acquisition.number is a convention and means getting these data via acquisitionNum(.) and adding them temporarily to fData; anything else is taken to be a feature variable column name.

What about the following:

- If fcol and icol (see below) are characters of length 1, then the first one is matched against fData and the latter against the flattened id data.frame.
- If they are vectors or factors of length equal to nrow(fData(object)) (this works for MSnSet and MSnExp instances), then they are used as is for the matching.

The default would be fcol = acquisitionNum(object) and icol = "acquisitionnum" for MSnExp instances and fcol = acquisition.number for MSnSets.

I think it is very similar to your original suggestion, but keeps things a bit more general and does not rely on a convention. What do you think?

We could also have a helper function getIcols(mzid) that returns the colnames of the flattened mzid file.

> Are you satisfied with fDataCol and iDataCol as argument names?

Instead of fDataCol, we generally use fcol (less typing). Maybe also change iDataCol to icol?

sgibb · 2015-01-20T19:12:52Z

> if they are vectors or factors of length equal to nrow(fData(object)) (this works for MSnSet and MSnExp instances), then they are used as is for the matching.

I am not quite sure how to implement the fcol/icol vectors.

We could have multiple files in an MSnExp with the same number of, or fewer, corresponding identification files. In this case we have to accept a list of vectors, e.g. addIdentificationData(msexp, filenames=c("a.mzid", "b.mzid"), fcol=list(1:3, 4:6), icol=list(c(1, 1, 1, 1, 3), c(1, 1, 1, 4:6))). Did you mean something like this?

(BTW, the length of fcol has to equal nrow(fData(object)) and the length of icol has to equal nrow(flatten(mzID(file))).)

lgatto · 2015-01-20T21:58:05Z

> We could have multiple files in an MSnExp with the same number of, or fewer, corresponding identification files. In this case we have to accept a list of vectors, e.g. addIdentificationData(msexp, filenames=c("a.mzid", "b.mzid"), fcol=list(1:3, 4:6), icol=list(c(1, 1, 1, 1, 3), c(1, 1, 1, 4:6))). Did you mean something like this?

Indeed, we need to support multiple files, which makes my idea a bit too convoluted. I guess it is better to drop the vector example altogether.

Let's go for your original suggestion. The default would be to match fcol, a feature variable label against icol, a column name of an identification data.frame. The defaults are fcol = "acquisition.number" and icol = "acquisitionnum". The former implicitly means acquisitionNum() and the latter reads the mzIdentML and flattens it, unless a data.frame is passed (as suggested in issue #43).
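
As a rough sketch of what matching on a pair of column names amounts to, assuming the feature data and the flattened identification data are both plain data frames: `mergeId` below is a hypothetical function written for illustration, not the MSnbase implementation. It keeps only the first identification row per spectrum, which is a simplification.

```r
# Hypothetical illustration: attach identification columns to feature data
# by matching the feature column 'fcol' against the id column 'icol'.
mergeId <- function(fdata, idata,
                    fcol = "acquisition.number", icol = "acquisitionnum") {
    stopifnot(fcol %in% names(fdata), icol %in% names(idata))
    # Compare as character so integer and factor keys behave the same
    idx <- match(as.character(fdata[[fcol]]), as.character(idata[[icol]]))
    # Take every id column except the key; unmatched spectra become NA rows
    new <- idata[idx, setdiff(names(idata), icol), drop = FALSE]
    rownames(new) <- rownames(fdata)
    cbind(fdata, new)
}

# Toy data (hypothetical values)
fdata <- data.frame(acquisition.number = 1:3)
idata <- data.frame(acquisitionnum = c(3L, 1L),
                    pepseq = c("AAK", "CCR"),
                    stringsAsFactors = FALSE)
mergeId(fdata, idata)
```

For MGF-derived data the same call would be made with `fcol = "TITLE"` and `icol = "spectrum title"`.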

sgibb · 2015-01-22T18:26:31Z

Should I add a new argument to addIdentificationData, or do you prefer a new method for merging a data.frame?

I would suggest:

setMethod("addIdentificationData", "MSnExp",
          function(object, filenames, df, fcol, icol, verbose = TRUE) { ... })

lgatto · 2015-01-22T19:36:05Z

I think methods are appropriate here. The signature would be

c("MSnExp", "character")
c("MSnExp", "mzID")
c("MSnExp", "data.frame")

c("MSnSet", "character")
c("MSnSet", "mzID")
c("MSnSet", "data.frame")

where "character" would be an mzid file name, mzID would be the id object and data.frame the flattened id data frame.
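
To make the dispatch idea concrete, here is a small self-contained S4 sketch. `ToyExp` and `addId` are hypothetical names chosen so that nothing in MSnbase is masked, and a CSV file stands in for the mzid file; only the data.frame method does real work, and the character method reads the file and delegates to it.

```r
library(methods)

# Toy container standing in for MSnExp (hypothetical class)
setClass("ToyExp", slots = c(fdata = "data.frame"))

# Generic that dispatches on both the object and the identification input
setGeneric("addId", function(object, id, ...) standardGeneric("addId"))

# data.frame method: match fcol against icol and append the id columns
setMethod("addId", c("ToyExp", "data.frame"),
          function(object, id, fcol, icol, ...) {
              idx <- match(object@fdata[[fcol]], id[[icol]])
              extra <- id[idx, setdiff(names(id), icol), drop = FALSE]
              rownames(extra) <- NULL
              object@fdata <- cbind(object@fdata, extra)
              object
          })

# character method: treat the string as a file name, read it, delegate.
# A real implementation would parse mzIdentML here instead of CSV.
setMethod("addId", c("ToyExp", "character"),
          function(object, id, ...) {
              tab <- read.csv(id, stringsAsFactors = FALSE,
                              check.names = FALSE)
              addId(object, tab, ...)
          })
```

With this layout, a method for an mzID object would only need to flatten the object and call the data.frame method, so the matching logic lives in one place.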

(At some stage, it would be good to add mzR, which is much faster than mzID, but maybe not now.)

sgibb · 2015-01-22T20:28:20Z

You are right, that would be the cleanest way. But it may break the current API.
We need to replace the current generic addIdentificationData(object, ...) by addIdentificationData(object, idata, ...) or add a new method with the mentioned signature.

sgibb · 2015-02-01T16:59:52Z

This issue should be solved with the next MSnbase devel release (1.15.5). It was closed by #45.
@adder the following should work in the near future:

msexp <- addIdentificationData(msexp, "3_2.msgf.mzid",
                               fcol=c("TITLE"),
                               icol=c("spectrum title"))

sgibb added enhancement and removed enhancement labels Jan 18, 2015

sgibb self-assigned this Jan 18, 2015

sgibb added a commit that referenced this issue Jan 18, 2015

add {f,i}DataCol arguments; see #42

a0f5b62

This was referenced Jan 19, 2015

partly rewrite readMgfData #44

Merged

fix mzIdentML import for MGF based identification files; see issue #42 #45

Merged

sgibb added a commit that referenced this issue Feb 1, 2015

Merge pull request #45 from lgatto/issue42

68ca0f3

fix mzIdentML import for MGF based identification files; see issue #42

sgibb closed this as completed Feb 1, 2015
