adding mzidentml data to MSexp object fails #42

Closed
adder opened this issue Jan 17, 2015 · 18 comments

@adder
Contributor

adder commented Jan 17, 2015

Hi

I am trying to add an .mzid file to my MSnExp object, but when I look at fData I only see NAs. There are definitely identifications in the file (it's 50 MB, and when I open it in an editor I see lots of entries). The .mzid was generated by MSGF+ from a search performed on an MGF file. The MSnExp object was generated from the same MGF file. I also loaded the results in PeptideShaker and tried to create an .mzid file with the PeptideShaker export. Same result.
When I add identification files in MSnbase using the files provided with the MSnbase package, it works.

Is it something to do with the MGF file?

Greetz

@sgibb
Collaborator

sgibb commented Jan 17, 2015

Which version of MSnbase do you use? What is the output of sessionInfo()? Could you share the corresponding files?

@lgatto
Owner

lgatto commented Jan 17, 2015

> Is it something to do with the MGF file?

Yes, indeed. The MGF file is very metadata-poor, and the matching
between the spectra in the MSnExp object and the identification
results cannot be guaranteed. It would probably be possible to get some
code to work for specific pipelines (say Mascot), but without much
generality.

However, using files that store metadata in a well-defined standard will
work. I would suggest you convert your raw data into mzML to be used
as input to MSGF+ and to generate the MSnExp. The mzid produced by
MSGF+ will then have all the necessary metadata from the mzML file to
match the spectra.

In the future, please also provide the output of sessionInfo(), as
suggested by Sebastian, to enable us to verify the software versions you
use.

Hope this helps,

Laurent

@adder
Contributor Author

adder commented Jan 17, 2015

Thanks a lot for the swift response!

sessionInfo()

other attached packages:
[1] MSnbase_1.14.1      BiocParallel_1.0.0  mzR_2.0.0          
[4] Rcpp_0.11.3         Biobase_2.26.0      BiocGenerics_0.12.1

loaded via a namespace (and not attached):
 [1] affy_1.44.0           affyio_1.34.0         base64enc_0.1-2      
 [4] BatchJobs_1.5         BBmisc_1.8            BiocInstaller_1.16.1 
 [7] brew_1.0-6            checkmate_1.5.1       codetools_0.2-9      
[10] colorspace_1.2-4      compiler_3.1.2        DBI_0.3.1            
[13] digest_0.6.8          doParallel_1.0.8      fail_1.2             
[16] foreach_1.4.2         ggplot2_1.0.0         grid_3.1.2           
[19] gtable_0.1.2          impute_1.40.0         IRanges_2.0.1        
[22] iterators_1.0.7       lattice_0.20-29       limma_3.22.4         
[25] MALDIquant_1.11       MASS_7.3-35           munsell_0.4.2        
[28] mzID_1.4.1            pcaMethods_1.56.0     plyr_1.8.1           
[31] preprocessCore_1.28.0 proto_0.3-10          reshape2_1.4.1       
[34] RSQLite_1.0.0         S4Vectors_0.4.0       scales_0.2.4         
[37] sendmailR_1.2-1       stats4_3.1.2          stringr_0.6.2        
[40] tools_3.1.2           vsn_3.34.0            XML_3.98-1.1         
[43] zlibbioc_1.12.0      

The MGF and both msgf and peptideshaker mzid files are in this dropbox folder:
https://www.dropbox.com/sh/ezjm78f16bidcty/AAB20qeWSguhfWT-tMkvV1xda?dl=0

greets

@adder
Contributor Author

adder commented Jan 17, 2015

Hi Laurent

I didn't see your comment when I posted my previous answer. I didn't make the MGF myself. I'll try to get hold of the original files. Thanks for the suggestion!
I figured there was enough metadata in the MGF files, since PeptideShaker seems to be able to make sense of the MGF and search engine output files. MGF files seem to be the standard files in the SearchGUI-PeptideShaker workflow. Some search engines (like X!Tandem) do not even accept mzML files as input (which is a pity).

Thanks anyway for your clarification!
greetz

@lgatto
Owner

lgatto commented Jan 17, 2015

Our issue is different from using the MGF file to run a search. We need a way to match the spectra defined in the MGF and mzid files. We decided to automate this for well-defined formats, such as mzML and mzid (see the PSI page for details).

It is perfectly possible to do the matching using the TITLE header in the MGF file

TITLE=Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.2.2.2
RTINSECONDS=900.7589
PEPMASS=634.32830810546875 2871.236328125
CHARGE=2

and the matching element in the mzid

<cvParam accession="MS:1000796" cvRef="PSI-MS" value="Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.2.2.2" name="spectrum title"/>

In the MSnExp, the title can be extracted from the featureData

> xx <- readMgfData("~/Downloads/3_2.mgf")
> head(fData(xx)$TITLE)
                                                        X1 
        Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.2.2.2 
                                                       X10 
      Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.16.16.2 
                                                      X100 
    Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.311.311.2 
                                                     X1000 
  Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.1663.1663.2 
                                                    X10000 
Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.13140.13140.3 
                                                    X10001 
Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.13141.13141.3 
13998 Levels: Orbi2_study6b_W080321_6QC1_sigma48_ft8_pc_01.10001.10001.2 ...

It's not too much work, I think, but I will not have time to implement this now. Maybe @sgibb has some time.
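
A minimal sketch of the matching described above, using base R only. The two data frames are toy stand-ins (assumptions, not the real objects): `fd` plays the role of `fData(xx)` and `id` the role of the flattened mzid table. The shortened titles and the `pepseq` values are made up for illustration.

```r
# Toy stand-in for fData(xx): one row per spectrum, TITLE from the MGF header.
fd <- data.frame(
  TITLE = c("run01.2.2.2", "run01.16.16.2", "run01.311.311.2"),
  row.names = c("X1", "X10", "X100"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the flattened mzid table; note the space in the column name.
id <- data.frame(
  "spectrum title" = c("run01.311.311.2", "run01.2.2.2"),
  pepseq = c("PEPTIDEK", "SAMPLER"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Look up each spectrum title in the identification table.
# as.character() guards against TITLE being stored as a factor.
idx <- match(as.character(fd$TITLE), id[["spectrum title"]])

# Spectra without a matching identification get NA.
fd$pepseq <- id$pepseq[idx]
fd
```

Note that `match()` keeps only the first hit per spectrum, so a real implementation would need to decide how to handle several identifications for one spectrum.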

@lgatto
Owner

lgatto commented Jan 17, 2015

@sgibb Ideally, we should make it possible to define what is matched in the MSnExp/MSnSet and the mzid file. The default is the acquisition number, but ideally one could specify fData()$TITLE (the default being acquisitionNum()) and the column name in the flattened mzID data.frame ("spectrum title" or "acquisitionnum", in the given example file). What do you think?

@adder
Contributor Author

adder commented Jan 18, 2015

Such a feature would indeed be great and very useful.
I took a short look at the code base, but I'm afraid I can't be of much help here. :)

@sgibb sgibb self-assigned this Jan 18, 2015
@sgibb
Collaborator

sgibb commented Jan 18, 2015

I am going to look into it. Can we assume that all mgf files need the combination fData()$TITLE and mzID::flatten()[,"spectrum title"]? If that is the case, we could select the correct column names automatically.

@lgatto
Owner

lgatto commented Jan 18, 2015

> I am going to look into it. Can we assume that all mgf files need the combination
> fData()$TITLE and mzID::flatten()[,"spectrum title"]? If that is the case we could provide an
> automatism to select the correct column names.

It seems so:

> library("rols")
> term("MS:1000796", "MS")
[1] "spectrum title"
> termMetadata("MS:1000796", "MS")
definition: A free-form text title describing a spectrum. comment: This is the preferred storage place for the spectrum TITLE from an MGF peak list. 

But it would still be great to be general and not only support the acquisition number and spectrum title.

sgibb added a commit that referenced this issue Jan 18, 2015
@sgibb
Collaborator

sgibb commented Jan 18, 2015

Seems to work by simply replacing the merging columns acquisition.number/acquisitionnum by TITLE/spectrum title: a0f5b62
Are you satisfied with fDataCol and iDataCol as argument names? Instead of using a default we could set them automatically, e.g.:

fDataCol <- ifelse(is.null(fDataCol) && grepl("mgf", fileNames(msexp)), "TITLE", "acquisition.number")
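
The automatic default suggested above could be sketched as a small helper. `guessFcol` is a hypothetical name, not an MSnbase function, and the rule "only use TITLE if every file is an MGF" is an assumption added here so that several file names can be handled.

```r
# Hypothetical helper (not part of MSnbase): pick the feature column to merge on.
guessFcol <- function(files, fcol = NULL) {
  # An explicit choice by the user always wins.
  if (!is.null(fcol)) return(fcol)
  # Assumption: MGF-based data is recognised by its file extension.
  isMgf <- grepl("\\.mgf$", files, ignore.case = TRUE)
  if (all(isMgf)) "TITLE" else "acquisition.number"
}

mgfDefault  <- guessFcol("3_2.mgf")
mzmlDefault <- guessFcol(c("a.mzML", "b.mzML"))
userChoice  <- guessFcol("3_2.mgf", fcol = "my.column")
```

Unlike the one-line `ifelse()`, this keeps a user-supplied column untouched and does not depend on `grepl()` returning a single value.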

@adder
Contributor Author

adder commented Jan 19, 2015

Thanks a lot for the effort. If I can test something on my files, just let me know :)

@lgatto
Owner

lgatto commented Jan 19, 2015

> Seems to work by simply replacing the merging columns acquisition.number/acquisitionnum by TITLE/spectrum title

Basically, acquisition.number is a convention: it means getting these data via acquisitionNum(.) and adding them temporarily to fData. Anything else is taken to be a feature variable column name.

What about the following:

  • if fcol and icol (see below) are characters of length 1, then the first one is matched against fData and the latter against the flattened id data.frame.
  • if they are vectors or factors of length equal to nrow(fData(object)) (this works for MSnSet and MSnExp instances), then they are used as is for the matching.

The default would be fcol = acquisitionNum(object) and icol = "acquisitionnum" for MSnExp instances and fcol = "acquisition.number" for MSnSets.

I think it is very similar to your original suggestion but keeps things a bit more general and does not rely on a convention. What do you think?

We could also have a helper function getIcols(mzid) that returns the colnames of the flattened mzid file.

> Are you satisfied with fDataCol and iDataCol as argument names?

Instead of fDataCol, we generally use fcol (less typing). Maybe also change to icol?
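
The two cases proposed above (a single column name, or a ready-made vector of keys) could be told apart roughly as follows. `resolveKey` is a hypothetical helper written for illustration with plain data frames; it is not part of MSnbase.

```r
# Hypothetical helper: turn `col` into a character vector of matching keys.
resolveKey <- function(col, df) {
  if (is.character(col) && length(col) == 1L) {
    # Case 1: a single string is treated as a column name of `df`.
    stopifnot(col %in% colnames(df))
    return(as.character(df[[col]]))
  }
  # Case 2: a vector or factor with one key per row is used as is.
  stopifnot(length(col) == nrow(df))
  as.character(col)
}

fd <- data.frame(TITLE = c("a", "b", "c"), stringsAsFactors = FALSE)
byName   <- resolveKey("TITLE", fd)
byVector <- resolveKey(factor(c("x", "y", "z")), fd)
```

One caveat of this sketch: for a data frame with a single row, a length-1 character is always read as a column name, so the two cases are ambiguous there.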

@sgibb
Collaborator

sgibb commented Jan 20, 2015

>   • if they are vectors or factors of length equal to nrow(fData(object)) (this works for MSnSet and MSnExp instances), then they are used as is for the matching.

I am not quite sure how to implement the fcol/icol vectors.

We could have multiple files in an MSnExp with the same number or fewer corresponding identification files. In this case we have to accept a list of vectors, e.g. addIdentificationData(msexp, filenames=c("a.mzid", "b.mzid"), fcol=list(1:3, 4:6), icol=list(c(1, 1, 1, 1, 3), c(1, 1, 1, 4:6))). Did you mean something like this?

(BTW, the length of fcol has to equal nrow(fData(object)) and the length of icol has to equal nrow(flatten(mzID(file))).)

@lgatto
Owner

lgatto commented Jan 20, 2015

> We could have multiple files in an MSnExp with the same number or less corresponding identification files. In this case we have to accept a list of vectors, e.g. addIdentificationData(msexp, filenames=c("a.mzid", "b.mzid"), fcol=list(1:3, 4:6), icol=list(c(1, 1, 1, 1, 3), c(1, 1, 1, 4:6))). Did you mean something like this?

Indeed, we need to support multiple files, which makes my idea a bit too convoluted. I guess it is better to drop the vector example altogether.

Let's go for your original suggestion. The default would be to match fcol, a feature variable label against icol, a column name of an identification data.frame. The defaults are fcol = "acquisition.number" and icol = "acquisitionnum". The former implicitly means acquisitionNum() and the latter reads the mzIdentML and flattens it, unless a data.frame is passed (as suggested in issue #43).
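
The agreed behaviour can be sketched with plain data frames. `mergeIdSketch` is a hypothetical function, not the MSnbase implementation; here `acquisition.number` is an ordinary toy column, whereas in MSnbase it would be filled from acquisitionNum().

```r
# Hypothetical sketch: match feature column `fcol` against id column `icol`.
mergeIdSketch <- function(fd, id,
                          fcol = "acquisition.number",
                          icol = "acquisitionnum") {
  stopifnot(fcol %in% colnames(fd), icol %in% colnames(id))
  # Compare as character so integer, numeric and factor keys line up.
  idx <- match(as.character(fd[[fcol]]), as.character(id[[icol]]))
  # Take all id columns except the key; unmatched spectra become NA rows.
  extra <- id[idx, setdiff(colnames(id), icol), drop = FALSE]
  rownames(extra) <- rownames(fd)
  cbind(fd, extra)
}

fd <- data.frame(acquisition.number = 1:3,
                 TITLE = c("t1", "t2", "t3"),
                 stringsAsFactors = FALSE)
id <- data.frame(acquisitionnum = c(3, 1),
                 pepseq = c("PEPTIDEK", "SAMPLER"),
                 stringsAsFactors = FALSE)

res <- mergeIdSketch(fd, id)
res$pepseq
```

The sketch keeps only the first identification per spectrum and does not handle column names that occur in both tables.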

@sgibb
Collaborator

sgibb commented Jan 22, 2015

Should I add a new argument to addIdentificationData, or do you prefer a new method for merging a data.frame?

I would suggest:

setMethod("addIdentificationData", "MSnExp",
          function(object, filenames, df, fcol, icol, verbose = TRUE) { ... })

@lgatto
Owner

lgatto commented Jan 22, 2015

I think methods are appropriate here. The signatures would be

c("MSnExp", "character")
c("MSnExp", "mzID")
c("MSnExp", "data.frame")

c("MSnSet", "character")
c("MSnSet", "mzID")
c("MSnSet", "data.frame")

where "character" would be an mzid file name, mzID would be the id object and data.frame the flattened id data frame.
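
A minimal sketch of how dispatch on the second argument could look, using a toy S4 class instead of MSnExp/MSnSet. `ToyExp` and `addIdSketch` are hypothetical names, the mzID method is omitted, and `read.csv()` is only a placeholder for parsing and flattening a real mzid file.

```r
library(methods)

# Toy container standing in for MSnExp/MSnSet; it only holds feature data.
setClass("ToyExp", representation(fdata = "data.frame"))

setGeneric("addIdSketch",
           function(object, id, ...) standardGeneric("addIdSketch"))

# data.frame: the flattened identification table is matched directly.
setMethod("addIdSketch", c("ToyExp", "data.frame"),
          function(object, id, fcol, icol, ...) {
            idx <- match(as.character(object@fdata[[fcol]]),
                         as.character(id[[icol]]))
            object@fdata$matched <- !is.na(idx)
            object
          })

# character: a file name; read the file, then delegate to the method above.
setMethod("addIdSketch", c("ToyExp", "character"),
          function(object, id, ...) {
            addIdSketch(object, read.csv(id, stringsAsFactors = FALSE), ...)
          })

x <- new("ToyExp", fdata = data.frame(TITLE = c("t1", "t2")))
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(title = "t2"), tmp, row.names = FALSE)

res <- addIdSketch(x, tmp, fcol = "TITLE", icol = "title")@fdata$matched
```

Each input type only has to be converted to the next simpler one, so the matching logic lives in a single method.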

(At some stage, it would be good to add mzR, which is much faster than mzID, but maybe not now.)

@sgibb
Collaborator

sgibb commented Jan 22, 2015

You are right, that would be the cleanest way. But it may break the current API.
We need to replace the current generic addIdentificationData(object, ...) by addIdentificationData(object, idata, ...) or add a new method with the mentioned signature.

sgibb added a commit that referenced this issue Feb 1, 2015
fix mzIdentML import for MGF based identification files; see issue #42
@sgibb
Collaborator

sgibb commented Feb 1, 2015

This issue should be solved with the next MSnbase devel release (1.15.5). It was closed by #45.
@adder the following should work in the near future:

msexp <- addIdentificationData(msexp, "3_2.msgf.mzid",
                               fcol=c("TITLE"),
                               icol=c("spectrum title"))

@sgibb sgibb closed this as completed Feb 1, 2015