BUG: python 3 compression and read_fwf #3963

TomAugspurger opened this issue Jun 19, 2013 · 6 comments · Fixed by #4783 or #4784



commented Jun 19, 2013

I get TypeError: Type str doesn't support the buffer API when using read_fwf on a compressed file. This happens on Python 3 only; the same code works on Python 2.7.


With a file fwf_bug.txt like


import pandas as pd


widths = [5, 5]

!gzip -d fwf_bug.txt

df = pd.read_fwf('fwf_bug.txt', widths=widths, names=['one', 'two'])

!gzip fwf_bug.txt

# python 3 throws an error here.
df2 = pd.read_fwf('fwf_bug.txt.gz', widths=widths,
                  names=['one', 'two'], compression='gzip')


python3: should be most recent

I can paste the full traceback if you'd like.


Contributor Author

commented Jun 20, 2013

Doing a bit of digging around, looks like it's a unicode thing.

The error is coming here, line 1919 in pandas/build/lib.macosx-10.8-x86_64-3.3/pandas/io/

    def next(self):
        line = next(self.f)
        # Note: 'colspecs' is a sequence of half-open intervals.
        return [line[fromm:to].strip(self.filler or ' ')
                for (fromm, to) in self.colspecs]

Here, line is a bytes object rather than a str:

ipdb> line

I'm not sure what the preferred way of dealing with this is, but decoding the line first works:

ipdb> line.decode('utf-8')[fromm:to].strip(' ')




commented Jun 20, 2013

should probably be

import pandas.core.common as com
com.pprint_thing(line[fromm:to]).strip(self.filler or ' ')

@ghost ghost assigned jtratner Sep 5, 2013



commented Sep 9, 2013

basically, the problem is that you can't mix bytes and str in Python 3:

In [12]: b'abcd'.strip('')
TypeError                                 Traceback (most recent call last)
<ipython-input-12-e487f65c9b72> in <module>()
----> 1 b'abcd'.strip('')

TypeError: Type str doesn't support the buffer API

In [13]: # Whereas it works with bytes

In [14]: b'adbc'.strip(b'')
Out[14]: b'adbc'
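
To make the two ways out concrete, here is a minimal sketch (not the actual pandas fix). The names line, filler, field_bytes and field_text are made up for illustration, and utf-8 is an assumed encoding:

```python
# What next() yields from a file object opened in binary mode.
line = b"abc  def  \n"
filler = " "

# Option A: stay in bytes and give strip() a bytes argument.
field_bytes = line[0:5].strip(filler.encode("ascii"))

# Option B: decode once (utf-8 is an assumption), then use str throughout.
field_text = line.decode("utf-8")[0:5].strip(filler)

# Mixing the two types is what raises TypeError on Python 3.
try:
    line[0:5].strip(filler)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Either option avoids the error; option B only makes sense if the encoding is known before the line is sliced.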


commented Sep 9, 2013

I'm reopening, since after staring at #4784 for a bit I think it (and my +1 of it) is wrong.

The use of next(f) when f is a BytesIO object seems dodgy,
and since next(f) doesn't strip the newline (it can't do so reliably), the decoding
may fail. I'm also not sure that the use of strip() is well-defined here.

An example to illustrate some of this:

#!/usr/bin/env python3.3
from io import BytesIO
from encodings.aliases import aliases

for enc in set(aliases.values()):
    try:
        # Split on the b'\n' byte first, drop the last byte, then decode.
        res = next(BytesIO("1234\nabcd".encode(enc)))[:-1].decode(enc)
        if res != '1234':
            print(enc, res)
    except LookupError:
        # Codec not available, or not a text encoding.
        pass
    except UnicodeDecodeError:
        print("%s failed" % enc)

I think TextIOWrapper is the correct solution here; you can't get lines until you have text.
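
A rough sketch of that idea, assuming utf-8 and an in-memory gzip payload instead of a real file. The sample data, colspecs and rows are made up for illustration, and this is not the code from either PR:

```python
import gzip
import io

# Build a small gzip payload in memory so the sketch is self-contained.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    gz.write("abc  def  \nghi  jkl  \n".encode("utf-8"))
raw.seek(0)

# Half-open column intervals, as in the reader's colspecs.
colspecs = [(0, 5), (5, 10)]

with gzip.GzipFile(fileobj=raw, mode="rb") as binary:
    # Decode before splitting into lines, so line boundaries are found in text.
    text = io.TextIOWrapper(binary, encoding="utf-8")
    rows = [[line[a:b].strip(" ") for a, b in colspecs] for line in text]
```

Because the wrapper yields str lines, the slicing and strip() calls never see bytes, and line splitting happens after decoding rather than before.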

@ghost ghost reopened this Sep 9, 2013



commented Sep 9, 2013

@y-p TextIOWrapper is what I tried first (in #4783) and it works perfectly in 3.3 (though we'd need to edit it to use the specified encoding if and only if one is provided). However, bz2 doesn't play nice with it because it doesn't support a read1() method. (gzip is fine).

Two options:

  1. Make a subclass of io.TextIOWrapper that detects whether the passed buffer defines read1() and calls read() instead if it's not defined.
  2. Special case bz2 and create a wrapper class that proxies everything to the internal bz2 reader, but calls read() when read1() is asked for.

Either way would work - (1) is probably more explicit though. If bz2 is the only place where we'll need this, probably makes more sense to do (2)...
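
For reference, option (2) could look roughly like the sketch below. Read1Proxy is a hypothetical name, not an existing pandas or stdlib class, and utf-8 is assumed. On Python versions where BZ2File already provides read1(), the proxy is redundant and only shows the shape of the workaround:

```python
import bz2
import io


class Read1Proxy:
    """Hypothetical proxy: forwards everything to the wrapped reader,
    but answers read1() by calling read()."""

    def __init__(self, raw):
        self._raw = raw

    def read1(self, size=-1):
        # Fall back to a plain read() of the requested size.
        return self._raw.read(size)

    def __getattr__(self, name):
        # Only reached for attributes the proxy does not define itself.
        return getattr(self._raw, name)


payload = bz2.compress("one  two  \nabc  def  \n".encode("utf-8"))
proxy = Read1Proxy(bz2.BZ2File(io.BytesIO(payload)))
text = io.TextIOWrapper(proxy, encoding="utf-8")
lines = [line.rstrip("\n") for line in text]
```

The proxy leaves TextIOWrapper itself untouched, which is the main difference from option (1).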



commented Sep 9, 2013

So you did, didn't see it earlier.

I would be fine with correct code that only works on 3.3 and raising an error (or just blowing up) otherwise.
Building a compat layer for 3.2 seems like wasted effort.
