Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String/Bytes with PDB Files #81

Closed
rmcgibbo opened this issue Feb 26, 2015 · 22 comments
Closed

String/Bytes with PDB Files #81

rmcgibbo opened this issue Feb 26, 2015 · 22 comments

Comments

@rmcgibbo
Copy link
Contributor

One of the py3 errors is

    Traceback (most recent call last):
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/doctest.py", line 1324, in __run
        compileflags, 1), test.globs)
      File "<doctest pdbfixer.pdbfixer.PDBFixer.__init__[3]>", line 1, in <module>
        fixer = PDBFixer(pdbfile=file)
      File "/home/rmcgibbo/projects/pdbfixer/pdbfixer/pdbfixer.py", line 198, in __init__
        structure = PdbStructure(pdbfile)
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/site-packages/simtk/openmm/app/internal/pdbstructure.py", line 146, in __init__
        self._load(input_stream)
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/site-packages/simtk/openmm/app/internal/pdbstructure.py", line 154, in _load
        if (pdb_line.find("ATOM  ") == 0) or (pdb_line.find("HETATM") == 0):
    TypeError: Type str doesn't support the buffer API

The reason for this is that pdbfile, in this error case, is a urllib request which is a file-like object of bytes. the type of pdb_line is bytes.

the error is the same error as in

In [87]: b'bytes'.find('string')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-87-4e19ac47a0ed> in <module>()
----> 1 b'bytes'.find('string')

TypeError: Type str doesn't support the buffer API

I'm not sure if we should count this as a bug in simtk/openmm/app/internal/pdbstructure.py or not. Does the PDB format specify an encoding? Should we consider it as string-based, or bytes based?

@mpharrigan
Copy link

You can fix with

In [1]: b'bytes'.find(b'string')
Out[1]: -1

@mpharrigan
Copy link

"fix"

@peastman
Copy link
Member

Does the PDB format specify an encoding?

Not that I know of. Or rather, I don't think PDB files are ever supposed to contain non-ASCII characters.

@rmcgibbo
Copy link
Contributor Author

  • Currently on py3, if you want to create a PdbStructure, you need to have input_stream opened to read strings, not bytes.
  • PDBFixer's doctest use an input_stream created from urllib.request.urlopen, which yields bytes.

One of these two things needs to change. I'm not sure which though.

@rmcgibbo
Copy link
Contributor Author

(for file io, this is the difference between open(fn, 'r') and open(fn, 'rb'))

@peastman
Copy link
Member

We always use 'r', not 'rb', so that's the problem.

@rmcgibbo
Copy link
Contributor Author

No, that's why it works within OpenMM, because "r" yields strings, not bytes. It's only urllib that yields bytes, which is why you only see this issue in PDBFixer.

@peastman
Copy link
Member

Yes, that's what I meant. The problem is that it expects a string, not bytes. How do we change the urlopen() call to get strings instead?

@rmcgibbo
Copy link
Contributor Author

I don't think there's an easy one liner. You can do urllib.request.urlopen(url).read().decode('utf-8'), but then you need to stuff it back into a file-like object?

@rmcgibbo
Copy link
Contributor Author

We could change OpenMM to respond appropriately to either strings or bytes too.

@peastman
Copy link
Member

What would be the easiest way of doing that? PdbStructure reads lines from the PDB file with

for pdb_line in input_stream:

But if it's a stream of bytes, each iteration will get one byte instead of one line, right? Is there an easy way it can detect the type of stream and convert/wrap/whatever it?

@rmcgibbo
Copy link
Contributor Author

The iterator still splits on bytes-newlines.

In [97]: [line for line in open('README.md', 'rb')]
Out[97]: 
[b'## [MSMBuilder: Statistical models for Biomolecular Dynamics](http://msmbuilder.org/)\n',
 b'\n',
 b'[![Build Status](https://travis-ci.org/msmbuilder/msmbuilder.png?branch=master)](https://travis-ci.org/msmbuilder/msmbuilder) [![PyPi version](https://pypip.in/v/msmbuilder/badge.png)](https://pypi.python.org/pypi/msmbuilder/) [![Supported Python versions](https://pypip.in/py_versions/msmbuilder/badge.svg)](https://pypi.python.org/pypi/msmbuilder/) [![License](https://img.shields.io/badge/license-LGPLv2.1+-red.svg?style=flat)](https://pypi.python.org/pypi/msmbuilder/)\n',
 b'[![Documentation Status](https://img.shields.io/badge/docs-latest-blue.svg?style=flat)](http://msmbuilder.org)\n',
 b'\n',
 b'MSMBuilder is a python package which implements a series of statistical models for high-dimensional time-series -- the particular focus of the library is on the  analysis of atomistic simulations of biomolecular dynamics such as protein folding and conformational change. MSMBuilder is available under the LGPL (v2.1 or later).\n',
 b'\n',
 b'Algorithms available include:\n',
 b'\n',
 b'- Geometric clustering\n',
 b'- Markov state models\n',
 b'- Time-structure independent components analysis\n',
 b'- L1-regularized reversible hidden Markov models\n',
 b'- Transition path theory\n',
 b'\n',
 b'Resources:\n',
 b'- [Documentation](http://msmbuilder.org)\n',
 b'- [Mailing List](https://mailman.stanford.edu/mailman/listinfo/msmbuilder-user)\n']

@peastman
Copy link
Member

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

@rmcgibbo
Copy link
Contributor Author

I think so.

@swails
Copy link
Contributor

swails commented Feb 27, 2015

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings in Py3 don't have a decode attribute:

>>> 'this is a string'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> b'these are bytes'.decode('utf-8')
'these are bytes'

This is, by far, the worst part about the Py2->Py3 transition. It ruins duck typing between file-like objects unless you just open everything with 'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file, 'r') yields strings.

I have a helper class in ParmEd that wraps around an open file-like object. See https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103 for example.

@rmcgibbo
Copy link
Contributor Author

just check the type before decoding it.

On Thu, Feb 26, 2015 at 4:23 PM, Jason Swails notifications@github.com
wrote:

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work
everywhere?

No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings
in Py3 don't have a decode attribute:

'this is a string'.decode('utf-8')
Traceback (most recent call last):
File "", line 1, in AttributeError: 'str' object has no attribute 'decode'>>> b'these are bytes'.decode('utf-8')'these are bytes'

This is, by far, the worst part about the Py2->Py3 transition. It ruins
duck typing between file-like objects unless you just open everything with
'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file,
'r') yields strings.

I have a helper class in ParmEd that wraps around an open file-like
object. See
https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103
for example.


Reply to this email directly or view it on GitHub
#81 (comment).

@peastman
Copy link
Member

Ok, how about this:

if not isinstance(pdb_line, str):
    pdb_line = pdb_line.decode('utf-8')

That should work everywhere, right?

@swails
Copy link
Contributor

swails commented Feb 27, 2015

How many places do you read a line from the file? You need to do this in each place. I have so many parsers in ParmEd that I needed to write a class to handle it more robustly

@peastman
Copy link
Member

PdbStructure is the only one we're talking about here. There are lots of other file readers, but they aren't relevant to PDBFixer.

@swails
Copy link
Contributor

swails commented Feb 27, 2015

I meant how many times to you read a line from a file inside PDBFixer? Or is it just one big for line in file type of thing?

@peastman
Copy link
Member

You mean PdbStructure? It's just one big loop.

@peastman
Copy link
Member

This is hopefully fixed by openmm/openmm#839.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants