You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Aug 30, 2023. It is now read-only.
From a usability perspective, I would suggest that sequences with unsupported amino acids be skipped and warnings be issued. The current behavior simply has the program crash with a KeyError. The note on the main page about these amino acids is clear but having the user sanitize their input seems like much more effort than handling this through Prosit.
Here is my quick and dirty patch (just to get the tool running for me). Probably not the best way to handle this but it does rely on utils.peptide_parser() to not duplicate logic.
diff --git a/prosit/tensorize.py b/prosit/tensorize.py
index d4eefdb..d94127f 100644
--- a/prosit/tensorize.py+++ b/prosit/tensorize.py@@ -1,4 +1,5 @@
import collections
+import logging
import numpy as np
from . import constants
@@ -45,7 +46,7 @@ def get_precursor_charge_onehot(charges):
return array
-def get_sequence_integer(sequences):+def get_sequence_integer(sequences):
array = np.zeros([len(sequences), MAX_SEQUENCE], dtype=int)
for i, sequence in enumerate(sequences):
for j, s in enumerate(utils.peptide_parser(sequence)):
@@ -89,6 +90,14 @@ def csv(df):
assert "modified_sequence" in df.columns
assert "collision_energy" in df.columns
assert "precursor_charge" in df.columns
++ n_all = df.shape[0]+ df = df[df.modified_sequence.apply(is_supported_sequence)]+ n_supported = df.shape[0]+ if n_supported < n_all:+ logging.warning(' {0} / {1} sequences were skipped as they '+ 'contained unsupported amino acids'.format(n_all - n_supported, n_all))+
data = {
"collision_energy_aligned_normed": get_numbers(df.collision_energy) / 100.0,
"sequence_integer": get_sequence_integer(df.modified_sequence),
@@ -107,3 +116,12 @@ def csv(df):
data["masses_pred"] = masses_pred
return data
+++def is_supported_sequence(seq: str) -> bool:+ try:+ _ = [ALPHABET[s] for s in utils.peptide_parser(seq)]+ return True+ except KeyError as e:+ return False+
PS Good work on developing Prosit! 😄
The text was updated successfully, but these errors were encountered:
Sorry to chip in. Originally, we designed the Prosit pipeline, so that it never changes any user input. We on purpose didn't skip sequences, because input and ouput files would have a different length. I believe that providing malformed input data should throw an error and not a warning. But maybe this premise changed in the Prosit pipeline.
I see, that makes sense. If the premise hasn't changed, then I'd only suggest making the error more explicit. In my case, it took me quite some time to figure out what was wrong (and I'm familiar with Python). Then I just decided it was more efficient (and acceptable) for me to fix it like above :)
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
From a usability perspective, I would suggest that sequences with unsupported amino acids be skipped and warnings be issued. The current behavior simply has the program crash with a KeyError. The note on the main page about these amino acids is clear but having the user sanitize their input seems like much more effort than handling this through Prosit.
Here is my quick and dirty patch (just to get the tool running for me). Probably not the best way to handle this but it does rely on
utils.peptide_parser()
to not duplicate logic.PS Good work on developing Prosit! 😄
The text was updated successfully, but these errors were encountered: