New problem with parsing FASTA files #136

Closed
tristanbrown opened this issue Dec 19, 2023 · 3 comments

@tristanbrown

Since #120 was released in v4.6.2, I get the following error traceback when parsing FASTA files:

     99 with _get_filesystem(fasta_uri).open(fasta_uri, "r") as fastafile:
    100     results = []
--> 101     for description, sequence in MyUniProt(fastafile):
    102         description['sequence'] = sequence
    103         results.append(description)

File /opt/conda/lib/python3.8/site-packages/pyteomics/auxiliary/file_helpers.py:178, in IteratorContextManager.__next__(self)
    176 def __next__(self):
    177     # try:
--> 178     return next(self._reader)

File /opt/conda/lib/python3.8/site-packages/pyteomics/fasta.py:232, in FASTA._read(self)
    230     sequence = sequence[:-1]
    231 if self.parser is not None:
--> 232     description = self.parser(description)
    233 yield Protein(description, sequence)
    234 accumulated_strings = [stripped_string[1:]]

File /opt/conda/lib/python3.8/site-packages/pyteomics/fasta.py:144, in _add_raw_field.<locals>._new_parser(instance, descr)
    142     parsed[RAW_HEADER_KEY] = descr
    143 else:
--> 144     raise aux.PyteomicsError('Cannot save raw protein header, since the corresponsing'
    145                             'key ({}) already exists.'.format(RAW_HEADER_KEY))
    146 return parsed

PyteomicsError: Pyteomics error, message: 'Cannot save raw protein header, since the corresponsingkey (__raw__) already exists.'

MyUniProt is just a custom parser with a more robust regex pattern:

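import logging

from pyteomics import fasta

_logger = logging.getLogger(__name__)  # assumed; the logger setup is not shown in the report


class MyUniProt(fasta.UniProt):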
    """Redefine the header-parsing pattern to tolerate '-' in the entry field."""

    header_pattern = r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>[-\w]+)\s+(?P<name>.*?)(?:(\s+OS=(?P<OS>[^=]+))|(\s+OX=(?P<OX>\d+))|(\s+GN=(?P<GN>\S+))|(\s+PE=(?P<PE>\d))|(\s+SV=(?P<SV>\d+)))*\s*$'

    def parser(self, header):
        """
        Catch errors when parsing a header and return a simpler dict; this allows
        parsing FASTAs where not all entries are in a valid Uniprot format.
        """
        try:
            return fasta.UniProt.parser(self, header)
        except Exception:
            _logger.warning("Error parsing header: %s", header, exc_info=True)
            return {
                "id": header,
                "entry": header,
            }

This parsing works without a problem in v4.6.1.

@mobiusklein
Contributor

I see, the parent method is already wrapped, but the metaclass cannot tell, so it tries to wrap it again. The interim solution is to remove fasta.RAW_HEADER_KEY from the return value of fasta.UniProt.parser(self, header) before returning it.
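
To make the double wrapping concrete, here is a minimal, self-contained sketch that does not import Pyteomics. `RAW_KEY`, `add_raw_field`, `FlavorMeta`, `Base`, `Child` and `FixedChild` are hypothetical stand-ins that only mimic the behavior described above (a metaclass wrapping every `parser` defined in a class body); they are not the library's actual code. `FixedChild` applies the interim workaround by dropping the raw-header key from the parent's result before returning it.

```python
import functools

RAW_KEY = "__raw__"  # stand-in for fasta.RAW_HEADER_KEY


def add_raw_field(parser):
    """Simplified stand-in for the strict wrapper seen in the traceback."""
    @functools.wraps(parser)
    def new_parser(instance, descr):
        parsed = parser(instance, descr)
        if RAW_KEY in parsed:
            raise ValueError("key ({}) already exists".format(RAW_KEY))
        parsed[RAW_KEY] = descr
        return parsed
    return new_parser


class FlavorMeta(type):
    """Hypothetical metaclass: wraps any `parser` defined in a class body."""
    def __new__(mcs, name, bases, namespace):
        if "parser" in namespace:
            namespace["parser"] = add_raw_field(namespace["parser"])
        return super().__new__(mcs, name, bases, namespace)


class Base(metaclass=FlavorMeta):
    def parser(self, header):
        return {"id": header.split("|")[1]}


class Child(Base):
    # This override gets wrapped again, although Base.parser is already wrapped.
    def parser(self, header):
        return Base.parser(self, header)


class FixedChild(Base):
    # Interim workaround: remove the key that the parent's wrapper already added.
    def parser(self, header):
        parsed = Base.parser(self, header)
        parsed.pop(RAW_KEY, None)
        return parsed


header = "sp|P12345|ENTRY_HUMAN Example protein"

try:
    Child().parser(header)
    child_failed = False
except ValueError:
    child_failed = True

fixed = FixedChild().parser(header)
print(child_failed, fixed)
```

In the reported `MyUniProt` class, the equivalent change would be to call `pop(fasta.RAW_HEADER_KEY, None)` on the dict returned by `fasta.UniProt.parser(self, header)` inside the `try` block.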

The longer-term solution would be to modify the check in the _add_raw_field wrapper so that, when the fasta.RAW_HEADER_KEY key is already present and its value equals the string we would otherwise assign to it, no error is raised.
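
A relaxed check along these lines might look like the sketch below. It is a standalone illustration, not the code that was eventually committed: `RAW_KEY`, `add_raw_field_relaxed`, `plain_parser` and `conflicting_parser` are hypothetical names, and `ValueError` stands in for the library's own exception type.

```python
import functools

RAW_KEY = "__raw__"  # stand-in for fasta.RAW_HEADER_KEY


def add_raw_field_relaxed(parser):
    """Tolerate an existing raw-header key when it already holds `descr`."""
    @functools.wraps(parser)
    def new_parser(instance, descr):
        parsed = parser(instance, descr)
        if RAW_KEY not in parsed:
            parsed[RAW_KEY] = descr
        elif parsed[RAW_KEY] != descr:
            # A genuinely conflicting value is still treated as an error.
            raise ValueError("key ({}) already exists".format(RAW_KEY))
        return parsed
    return new_parser


def plain_parser(instance, descr):
    return {"id": descr.split("|")[1]}


def conflicting_parser(instance, descr):
    return {RAW_KEY: "something else"}


header = "sp|P12345|ENTRY_HUMAN Example protein"

# Wrapping twice mimics a subclass override that calls an already-wrapped parent.
twice = add_raw_field_relaxed(add_raw_field_relaxed(plain_parser))
result = twice(None, header)

try:
    add_raw_field_relaxed(conflicting_parser)(None, header)
    conflict_raised = False
except ValueError:
    conflict_raised = True

print(result, conflict_raised)
```

With this check, stacking the wrapper is harmless, while a parser that stores some other value under the reserved key still fails loudly.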

@levitsky
Owner

Thank you @tristanbrown for reporting and @mobiusklein for your suggestion. I tried implementing it in f9d7f7c.

@tristanbrown could you try the latest master and see if it works for you?

@tristanbrown
Author

@levitsky Yes, the latest master branch fixes my issue. Thanks!
