Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix mzIdentML import for MGF based identification files; see issue #42 #45

Merged
merged 13 commits into from
Feb 1, 2015

Conversation

sgibb
Copy link
Collaborator

@sgibb sgibb commented Jan 19, 2015

This PR is WIP. @lgatto's suggestion #42 (comment) and the documentation are still missing. I will leave a comment if this PR is finished.

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 25, 2015

This PR may breaks the API because the generic for addIdentificationData was changed:

# before PR
setGeneric("addIdentificationData", function(object, ...) standardGeneric("addIdentificationData"))
# PR
setGeneric("addIdentificationData", function(object, id, ...) standardGeneric("addIdentificationData"))

Additionally the name of the second argument was filenames before (is id now).

idSummary is currently not working (because mzIDCollection contains no hint about the corresponding id files; see thomasp85/mzID#27), some tests fail (all regarding the identification file problem) and the documentation is still missing.

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 27, 2015

@lgatto a few questions:

  1. Do we really want to add the filenames of the identification files to MSnExp@MSnProcess@files slot (via fileNames(msexp) <- basename(mzID::files(id)$raw$location))? This would not work for the general data.frame if it contains no "idFile" column (that's why I removed it from the current code in this PR, but it could be added easily).
  2. By adding the new overloaded methods addIdentificationData we break the backward compatibility. How we want to deal with this fact (the second argument is now id instead of filenames)?

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 27, 2015

A 3. questions same problem as 2.: backward compatibility:
The merge columns are now fcol = c("spectrum.file", "acquisition.number"), icol = c("spectrumFile", "acquisitionnum") (was fcol = c("file", "acquisition.number"), icol = c("spectrumFile", "acquisitionnum") before but it wasn't visible to the user).

Both fcol columns are generated from the MSnExp as follows:

fd$spectrum.file <- basename(fileNames(object)[fromFile(object)])
fd$acquisition.number <- acquisitionNum(object)

The icol columns are present in the mzID output. IMHO it is easier for the user to create his own spectrumFile and acquisitionNum columns (if he want to create his own data.frame and don't use mzID) than creating a file column (file id; position of the file in the MSnExp@MSnProcess@files slot). The latter could be done automatically for mzID but the user has to do it himself for his own data.frames.

This change breaks the backward compatibility. Should not be a real problem because nobody could add his own data.frame yet. (But if somebody works with the fData(msexp)$identFile column he will have trouble.)

The output of idSummary would change as well. Instead of file id numbers there would be filenames.

What do you think?

@lgatto
Copy link
Owner

lgatto commented Jan 27, 2015

Do we really want to add the filenames of the identification files to MSnExp@MSnProcess@files slot (via fileNames(msexp) <- basename(mzID::files(id)$raw$location))? This would not work for the general data.frame if it contains no "idFile" column (that's why I removed it from the current code in this PR, but it could be added easily).

Would it be possible to just ignore the identification filename if there is no idFile column? We don't want this to become overly complicated, so I would also consider dropping it altogether.

@lgatto
Copy link
Owner

lgatto commented Jan 27, 2015

By adding the new overloaded methods addIdentificationData we break the backward compatibility. How we want to deal with this fact (the second argument is now id instead of filenames)?

Yes, but I think we gain much from this. There have been 2 or 3 issues about adding id data, so some flexibility is welcome. In the future, I would also consider add a method for identification from mzR.

@lgatto
Copy link
Owner

lgatto commented Jan 27, 2015

A 3. questions same problem as 2.: backward compatibility:

Same as for 2 - go ahead, break it ;-)

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 28, 2015

IMHO this PR is ready to merge if mzID 1.5.2 is on bioc. Therefore @thomasp85 has to export his mzID and mzIDCollection classes or accept this PR: thomasp85/mzID#28

The identification filenames are ignored now (was issue 1.). The backward compatibility is broken (were 2. and 3.).

#42 is solved with this PR by the following clumsy call:

library("MSnbase")

msexp <- readMgfData("3_2.mgf")
msexp <- addIdentificationData(msexp, "3_2.msgf.mzid",
                               fcol=c("spectrum.file", "TITLE"),
                               icol=c("spectrumFile", "spectrum title"))

Theoretical #43 should be solved as well but I don't know the correct columns.

@sgibb sgibb mentioned this pull request Jan 28, 2015
@thomasp85
Copy link

Will do tomorrow - lot on my plate atm.

Den 28/01/2015 kl. 22.56 skrev Sebastian Gibb notifications@github.com:

IMHO this PR is ready to merge if mzID 1.5.2 is on bioc. Therefore @thomasp85 has to export his mzID and mzIDCollection classes or accept this PR: thomasp85/mzID#28

The identification filenames are ignored now (was issue 1.). The backward compatibility is broken (were 2. and 3.).

#42 is solved with this PR by the following clumsy call:

library("MSnbase")

msexp <- readMgfData("3_2.mgf")
msexp <- addIdentificationData(msexp, "3_2.msgf.mzid",
fcol=c("spectrum.file", "TITLE"),
icol=c("spectrumFile", "spectrum title"))
Theoretical #43 should be solved as well but I don't know the correct columns.


Reply to this email directly or view it on GitHub.

@lgatto
Copy link
Owner

lgatto commented Jan 28, 2015

Sorry, dump question here

msexp <- addIdentificationData(msexp, "3_2.msgf.mzid",
                               fcol=c("spectrum.file", "TITLE"),
                               icol=c("spectrumFile", "spectrum title"))

but why do you have spectrum.file/spectrumFile and TITLE/spectrum title? Is the matching not done on one only?

@sgibb
Copy link
Collaborator Author

sgibb commented Jan 30, 2015

In this particular case spectrum.file/spectrumFile is indeed superfluous. The following would be enough:

msexp <- addIdentificationData(msexp, "3_2.msgf.mzid",
                               fcol=c("TITLE"),
                               icol=c("spectrum title"))

I had multiple file support in mind. But I have seen that readMgfData doesn't support multiple files. That's why it should be safe to use just TITLE/spectrum title.

mzID 1.5.2 is on bioc; If you agree this PR could be merged.

@lgatto
Copy link
Owner

lgatto commented Jan 30, 2015

If you agree this PR could be merged.

Sure, go ahead!

sgibb added a commit that referenced this pull request Feb 1, 2015
fix mzIdentML import for MGF based identification files; see issue #42
@sgibb sgibb merged commit 68ca0f3 into master Feb 1, 2015
@sgibb sgibb deleted the issue42 branch February 1, 2015 17:00
@lgatto
Copy link
Owner

lgatto commented Feb 2, 2015

FYI: I have now pushed the latest version to the Bioc svn server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants