String/Bytes with PDB Files #81

rmcgibbo · 2015-02-26T22:51:29Z

One of the py3 errors is

    Traceback (most recent call last):
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/doctest.py", line 1324, in __run
        compileflags, 1), test.globs)
      File "<doctest pdbfixer.pdbfixer.PDBFixer.__init__[3]>", line 1, in <module>
        fixer = PDBFixer(pdbfile=file)
      File "/home/rmcgibbo/projects/pdbfixer/pdbfixer/pdbfixer.py", line 198, in __init__
        structure = PdbStructure(pdbfile)
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/site-packages/simtk/openmm/app/internal/pdbstructure.py", line 146, in __init__
        self._load(input_stream)
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/site-packages/simtk/openmm/app/internal/pdbstructure.py", line 154, in _load
        if (pdb_line.find("ATOM  ") == 0) or (pdb_line.find("HETATM") == 0):
    TypeError: Type str doesn't support the buffer API

The reason for this is that pdbfile, in this error case, is a urllib request which is a file-like object of bytes. the type of pdb_line is bytes.

the error is the same error as in

In [87]: b'bytes'.find('string')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-87-4e19ac47a0ed> in <module>()
----> 1 b'bytes'.find('string')

TypeError: Type str doesn't support the buffer API

I'm not sure if we should count this as a bug in simtk/openmm/app/internal/pdbstructure.py or not. Does the PDB format specify an encoding? Should we consider it as string-based, or bytes based?

The text was updated successfully, but these errors were encountered:

mpharrigan · 2015-02-26T22:54:24Z

You can fix with

In [1]: b'bytes'.find(b'string')
Out[1]: -1

mpharrigan · 2015-02-26T22:54:32Z

"fix"

peastman · 2015-02-26T22:55:13Z

Does the PDB format specify an encoding?

Not that I know of. Or rather, I don't think PDB files are ever supposed to contain non-ASCII characters.

rmcgibbo · 2015-02-26T23:07:28Z

Currently on py3, if you want to create a PdbStructure, you need to have input_stream opened to read strings, not bytes.
PDBFixer's doctest use an input_stream created from urllib.request.urlopen, which yields bytes.

One of these two things needs to change. I'm not sure which though.

rmcgibbo · 2015-02-26T23:09:14Z

(for file io, this is the difference between open(fn, 'r') and open(fn, 'rb'))

peastman · 2015-02-26T23:10:38Z

We always use 'r', not 'rb', so that's the problem.

rmcgibbo · 2015-02-26T23:12:14Z

No, that's why it works within OpenMM, because "r" yields strings, not bytes. It's only urllib that yields bytes, which is why you only see this issue in PDBFixer.

peastman · 2015-02-26T23:15:01Z

Yes, that's what I meant. The problem is that it expects a string, not bytes. How do we change the urlopen() call to get strings instead?

rmcgibbo · 2015-02-26T23:17:33Z

I don't think there's an easy one liner. You can do urllib.request.urlopen(url).read().decode('utf-8'), but then you need to stuff it back into a file-like object?

rmcgibbo · 2015-02-26T23:19:03Z

We could change OpenMM to respond appropriately to either strings or bytes too.

peastman · 2015-02-26T23:23:54Z

What would be the easiest way of doing that? PdbStructure reads lines from the PDB file with

for pdb_line in input_stream:

But if it's a stream of bytes, each iteration will get one byte instead of one line, right? Is there an easy way it can detect the type of stream and convert/wrap/whatever it?

rmcgibbo · 2015-02-26T23:29:36Z

The iterator still splits on bytes-newlines.

In [97]: [line for line in open('README.md', 'rb')]
Out[97]: 
[b'## [MSMBuilder: Statistical models for Biomolecular Dynamics](http://msmbuilder.org/)\n',
 b'\n',
 b'[![Build Status](https://travis-ci.org/msmbuilder/msmbuilder.png?branch=master)](https://travis-ci.org/msmbuilder/msmbuilder) [![PyPi version](https://pypip.in/v/msmbuilder/badge.png)](https://pypi.python.org/pypi/msmbuilder/) [![Supported Python versions](https://pypip.in/py_versions/msmbuilder/badge.svg)](https://pypi.python.org/pypi/msmbuilder/) [![License](https://img.shields.io/badge/license-LGPLv2.1+-red.svg?style=flat)](https://pypi.python.org/pypi/msmbuilder/)\n',
 b'[![Documentation Status](https://img.shields.io/badge/docs-latest-blue.svg?style=flat)](http://msmbuilder.org)\n',
 b'\n',
 b'MSMBuilder is a python package which implements a series of statistical models for high-dimensional time-series -- the particular focus of the library is on the  analysis of atomistic simulations of biomolecular dynamics such as protein folding and conformational change. MSMBuilder is available under the LGPL (v2.1 or later).\n',
 b'\n',
 b'Algorithms available include:\n',
 b'\n',
 b'- Geometric clustering\n',
 b'- Markov state models\n',
 b'- Time-structure independent components analysis\n',
 b'- L1-regularized reversible hidden Markov models\n',
 b'- Transition path theory\n',
 b'\n',
 b'Resources:\n',
 b'- [Documentation](http://msmbuilder.org)\n',
 b'- [Mailing List](https://mailman.stanford.edu/mailman/listinfo/msmbuilder-user)\n']

peastman · 2015-02-26T23:35:22Z

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

rmcgibbo · 2015-02-26T23:45:09Z

I think so.

swails · 2015-02-27T00:23:51Z

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings in Py3 don't have a decode attribute:

>>> 'this is a string'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> b'these are bytes'.decode('utf-8')
'these are bytes'

This is, by far, the worst part about the Py2->Py3 transition. It ruins duck typing between file-like objects unless you just open everything with 'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file, 'r') yields strings.

I have a helper class in ParmEd that wraps around an open file-like object. See https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103 for example.

rmcgibbo · 2015-02-27T00:27:03Z

just check the type before decoding it.

On Thu, Feb 26, 2015 at 4:23 PM, Jason Swails notifications@github.com
wrote:

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work
everywhere?

No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings
in Py3 don't have a decode attribute:

'this is a string'.decode('utf-8')
Traceback (most recent call last):
File "", line 1, in AttributeError: 'str' object has no attribute 'decode'>>> b'these are bytes'.decode('utf-8')'these are bytes'

This is, by far, the worst part about the Py2->Py3 transition. It ruins
duck typing between file-like objects unless you just open everything with
'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file,
'r') yields strings.

I have a helper class in ParmEd that wraps around an open file-like
object. See
https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103
for example.

—
Reply to this email directly or view it on GitHub
#81 (comment).

peastman · 2015-02-27T00:34:48Z

Ok, how about this:

if not isinstance(pdb_line, str):
    pdb_line = pdb_line.decode('utf-8')

That should work everywhere, right?

swails · 2015-02-27T00:53:56Z

How many places do you read a line from the file? You need to do this in each place. I have so many parsers in ParmEd that I needed to write a class to handle it more robustly

peastman · 2015-02-27T00:59:43Z

PdbStructure is the only one we're talking about here. There are lots of other file readers, but they aren't relevant to PDBFixer.

swails · 2015-02-27T01:00:15Z

I meant how many times to you read a line from a file inside PDBFixer? Or is it just one big for line in file type of thing?

peastman · 2015-02-27T01:01:42Z

You mean PdbStructure? It's just one big loop.

peastman · 2015-02-27T22:45:07Z

This is hopefully fixed by openmm/openmm#839.

peastman mentioned this issue Feb 27, 2015

PdbStructure handles byte streams correctly in Python 3 openmm/openmm#839

Merged

rmcgibbo closed this as completed Feb 27, 2015

ddcas9 mentioned this issue Nov 25, 2021

PDB file lines read in as Models, throws "AttributeError: 'Model' object has no attribute 'decode'" #237

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String/Bytes with PDB Files #81

String/Bytes with PDB Files #81

rmcgibbo commented Feb 26, 2015

mpharrigan commented Feb 26, 2015

mpharrigan commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

swails commented Feb 27, 2015

rmcgibbo commented Feb 27, 2015

peastman commented Feb 27, 2015

swails commented Feb 27, 2015

peastman commented Feb 27, 2015

swails commented Feb 27, 2015

peastman commented Feb 27, 2015

peastman commented Feb 27, 2015

String/Bytes with PDB Files #81

String/Bytes with PDB Files #81

Comments

rmcgibbo commented Feb 26, 2015

mpharrigan commented Feb 26, 2015

mpharrigan commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

peastman commented Feb 26, 2015

rmcgibbo commented Feb 26, 2015

swails commented Feb 27, 2015

rmcgibbo commented Feb 27, 2015

peastman commented Feb 27, 2015

swails commented Feb 27, 2015

peastman commented Feb 27, 2015

swails commented Feb 27, 2015

peastman commented Feb 27, 2015

peastman commented Feb 27, 2015