How to output multiple PSMs per spectrum? #360

Closed
glormph opened this issue Nov 10, 2023 · 7 comments

@glormph

glormph commented Nov 10, 2023

Hi, I'm using MSGFPlus in the "only output rank 1 PSMs" mode, and then msgf2pin with -m 4 to also extract all the solutions that have an identical score in MSGF:

  • isoleucine/leucine
  • oxidation on different nearby M
  • different short sequence somehow (I hardly believed it, but it happens!)

Anyway, the percolator input file contains all of these (with correctly different PSM ids for each), but when running percolator I only find one solution per scan in the output. I found some issues that mention doing multi-solution analysis but they weren't mentioning this behavior, making me think I'm doing something wrong. Maybe I just couldn't find the corresponding option for this for the percolator process?

Command lines are:

msgf2pin -m 4 -o percoin.tsv -e trypsin -P "decoy_" metafile
percolator -Y -j percoin.tsv -X perco.xml -N 500000 --decoy-xml-output

@glormph
Author

glormph commented Nov 13, 2023

I've poked around a bit more, and it only happens using -Y for target/decoy competition (I'm searching using a concatenated DB with multiple files so I use -Y to not use mixmax due to duplicate scan numbers), so I'm wondering if it is intentional. On the whole, of a dataset of 22 PSMs, 13 were retained in my tests. Indeed, there are only 13 scan numbers in the input, but multiple sequences and 22 different psm_ids, which differ in their SIR (SpectrumIdentificationResult id?). When not using -Y all of them are in the XML output.

I guess this kind of explains it: #152 (comment), but I'm still wondering if one should take into account the entire PSM ID or not.
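The mismatch described above (22 PSM ids but only 13 scan numbers) can be checked directly on the Percolator input file. The following is a minimal sketch, not part of Percolator or msgf2pin: the function name `count_psms_per_scan` is hypothetical, and it assumes the usual tab-delimited pin layout with a header row, `SpecId` in column 1 and `ScanNr` in column 3. When several files are searched together, scan numbers can collide across files, so the scan count here is only a rough indicator.

```shell
# Count PSM ids and distinct scan numbers in a Percolator tab-delimited input file.
# Assumption: header row first, SpecId in column 1, ScanNr in column 3.
count_psms_per_scan() {
  awk -F'\t' '
    NR == 1 { next }                      # skip the header row
    $1 == "DefaultDirection" { next }     # skip the optional initial-weights row
    { psms++; if (!($3 in seen)) { seen[$3] = 1; scans++ } }
    END { printf "%d PSMs over %d scan numbers\n", psms, scans }
  ' "$1"
}

# Usage (file name taken from the command lines in the first post):
#   count_psms_per_scan percoin.tsv
```

If the two counts differ, the input holds several candidate PSMs for at least one scan number, which is the situation where competition by scan number would drop the extra ones.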

@MatthewThe
Collaborator

I think you can get the behavior you want by setting --search-input concatenated. You don't need the -Y flag in that case (it will just be ignored).
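For reference, a sketch of the original percolator command with -Y swapped for --search-input concatenated. The file names and the remaining flags are copied from the first post; the wrapper function `run_percolator_concatenated` and its input check are hypothetical additions for illustration only.

```shell
# Run percolator on a concatenated target-decoy search, keeping all PSMs per scan.
# $1 = pin file produced by msgf2pin, $2 = XML output path.
run_percolator_concatenated() {
  [ -f "$1" ] || { echo "missing input: $1" >&2; return 1; }
  # -Y is left out: per the reply above it would be ignored with this setting.
  percolator --search-input concatenated -j "$1" -X "$2" -N 500000 --decoy-xml-output
}

# Usage:
#   run_percolator_concatenated percoin.tsv perco.xml
```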

@glormph
Author

glormph commented Nov 13, 2023

Aha, yes, that indeed works! It is a bit quieter in the stderr about the FDR method (it says "separate searches input detected, but overridden by -I flag..."). I don't know what the difference is between -Y and -I though. Does it still use the relevant FDR method for concatenated searches?

@MatthewThe
Collaborator

Yes, --search-input concatenated uses the FDR method for concatenated searches. We specifically introduced this flag to deal with non-standard scenarios such as yours.

@glormph
Author

glormph commented Nov 13, 2023

Too many options not to get confused :)
So, if I have understood correctly, I used to do -Y on my concatenated search, which avoids mixmax, but is actually meant for when one runs separate searches and does the competition step inside percolator? So when I instead run -I concatenated it will use the same FDR method (non-mixmax) but not do actual competition based on PSM scan numbers.
And when one runs msgf2pin without -m, the behaviour of -Y and -I concatenated would be more or less identical, since there are no multiple solutions for a given scan number.

Hope I have understood it now, thanks for the quick help! :)

@MatthewThe
Collaborator

Yes, that's correct :)

@glormph
Author

glormph commented Nov 13, 2023

Great, thanks very much!

glormph closed this as completed Nov 13, 2023