BUG: python 3 compression and read_fwf #3963

Closed
TomAugspurger opened this issue Jun 19, 2013 · 6 comments · Fixed by #4783 or #4784

@TomAugspurger
Contributor

commented Jun 19, 2013

Calling read_fwf on a compressed file raises TypeError: Type str doesn't support the buffer API. This happens under Python 3 only; the same call works in Python 2.7.

Example:

With a file fwf_bug.txt like

1111111111
2222222222
3333333333
4444444444

import pandas as pd

print(pd.__version__)

widths = [5, 5]

!gzip -d fwf_bug.txt

df = pd.read_fwf('fwf_bug.txt', widths=widths, names=['one', 'two'])
print(df)

!gzip fwf_bug.txt

# python 3 throws an error here.
df2 = pd.read_fwf('fwf_bug.txt.gz', widths=widths,
                  names=['one', 'two'], compression='gzip')

print(df2)

Versions:
python3: 0.11.1.dev-4d06037 should be most recent
python2: 0.11.1.dev-3ebfef9

I can paste the full traceback if you'd like.

@TomAugspurger

Contributor Author

commented Jun 20, 2013

After a bit of digging, this looks like a bytes-vs-unicode issue.

The error is raised here, at line 1919 in pandas/build/lib.macosx-10.8-x86_64-3.3/pandas/io/parsers.py:

    def next(self):
        line = next(self.f)
        # Note: 'colspecs' is a sequence of half-open intervals.
        return [line[fromm:to].strip(self.filler or ' ')
                for (fromm, to) in self.colspecs]

Here, line is a bytes object rather than a str:

ipdb> line
b'1111111111\n'

I'm not sure what the preferred way of dealing with this is, but

ipdb> line.decode('utf-8')[fromm:to].strip(' ')
'11111'

works.
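
A minimal sketch of that workaround, assuming the reader may hand back either bytes or str lines. The helper name split_fixed_width and its encoding parameter are hypothetical and not part of pandas. Decoding one line at a time like this is only safe for encodings such as UTF-8, where a newline is always a single byte.

```python
def split_fixed_width(line, colspecs, filler=' ', encoding='utf-8'):
    """Hypothetical helper: cut one line into fixed-width fields.

    `colspecs` is a sequence of half-open (start, stop) intervals.
    """
    # Under Python 3 a compressed file opened in binary mode yields bytes,
    # so decode before slicing and stripping with a str filler.
    if isinstance(line, bytes):
        line = line.decode(encoding)
    return [line[start:stop].strip(filler) for start, stop in colspecs]


fields = split_fixed_width(b'1111111111\n', [(0, 5), (5, 10)])
print(fields)
```

The same call with a str line takes the non-decoding path and returns the same fields.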

@cpcloud

Member

commented Jun 20, 2013

should probably be

import pandas.core.common as com
com.pprint_thing(line[fromm:to]).strip(self.filler or ' ')

@ghost assigned jtratner Sep 5, 2013

@jtratner

Contributor

commented Sep 9, 2013

basically, the problem is that you can't mix bytes and str in Python 3:

In [12]: b'abcd'.strip('')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-e487f65c9b72> in <module>()
----> 1 b'abcd'.strip('')

TypeError: Type str doesn't support the buffer API

In [13]: # Whereas it works with bytes

In [14]: b'adbc'.strip(b'')
Out[14]: b'adbc'

@ghost

commented Sep 9, 2013

I'm reopening: after staring at #4784 for a bit, I think it (and my +1 of it) is wrong.

The use of next(f) when f is a BytesIO object seems dodgy,
and since next(f) doesn't strip the newline (it can't reliably), the decoding
may fail. I'm also not sure that the use of strip() is well-defined here.

An example to illustrate some of this:

#!/usr/bin/env python3.3
from io import BytesIO
from encodings.aliases import aliases

for enc in set(aliases.values()):
    try:
        # print(enc, next(BytesIO("1234\nabcd".encode(enc)))[:-1].decode(enc)=='1234')
        bs=BytesIO("1234\nabcd".encode(enc))
        line=next(bs)
        res=line.strip().decode(enc)
        if res!='1234':
            print(enc, res)
    except LookupError:
        pass
    except UnicodeDecodeError:
        print("%s failed" % enc)

I think TextIOWrapper is the correct solution here: you can't get lines until you have text.
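
To make the difference concrete, here is a small sketch that is not taken from either PR. It compresses a UTF-16-LE payload in memory, then reads it back two ways: splitting on the raw newline byte first, and decoding through io.TextIOWrapper first. The helper gzip_bytes and the choice of utf-16-le are assumptions made for illustration.

```python
import gzip
import io

text = "1234\nabcd"
encoding = "utf-16-le"  # illustrative: every character takes two bytes


def gzip_bytes(data):
    """Hypothetical helper: gzip-compress `data` in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


payload = gzip_bytes(text.encode(encoding))

# Splitting on the raw b"\n" byte cuts the two-byte newline in half,
# so the resulting byte line can no longer be decoded on its own.
with gzip.GzipFile(fileobj=io.BytesIO(payload), mode="rb") as gz:
    first_raw = next(gz)
try:
    first_raw.decode(encoding)
    raw_ok = True
except UnicodeDecodeError:
    raw_ok = False

# Decoding first and splitting afterwards yields proper text lines.
binary = gzip.GzipFile(fileobj=io.BytesIO(payload), mode="rb")
with io.TextIOWrapper(binary, encoding=encoding) as wrapped:
    lines = [line.rstrip("\n") for line in wrapped]

print(raw_ok, lines)
```

The wrapper owns the decoding, so the fixed-width slicing only ever sees str.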

@ghost reopened this Sep 9, 2013

@jtratner

Contributor

commented Sep 9, 2013

@y-p TextIOWrapper is what I tried first (in #4783) and it works perfectly in 3.3 (though we'd need to edit it to use the specified encoding if and only if one is provided). However, bz2 doesn't play nice with it because it doesn't support a read1() method. (gzip is fine).

Two options:

  1. Make a subclass of io.TextIOWrapper that detects whether the passed buffer defines read1() and calls read() instead if it's not defined.
  2. Special case bz2 and create a wrapper class that proxies everything to the internal bz2 reader, but calls read() when read1() is asked for.

Either way would work; (1) is probably more explicit, though. If bz2 is the only place where we'll need this, it probably makes more sense to do (2).
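
A rough sketch of what option (2) could look like, under stated assumptions: the class name Read1Proxy is hypothetical, and this is not the code from #4783 or #4784. Recent Python versions already give BZ2File a read1() method, so the proxy is redundant there and is shown only to illustrate the idea.

```python
import bz2
import io


class Read1Proxy(io.BufferedIOBase):
    """Hypothetical wrapper: serve read1() by calling the inner read()."""

    def __init__(self, raw):
        self._raw = raw

    def readable(self):
        return True

    def read(self, size=-1):
        return self._raw.read(size)

    def read1(self, size=-1):
        # TextIOWrapper asks for read1(); fall back to a plain read().
        return self._raw.read(size)

    def close(self):
        self._raw.close()
        super().close()


# Build a small bz2 payload in memory and read it back as text.
data = bz2.compress("1234\nabcd\n".encode("utf-8"))
stream = bz2.BZ2File(io.BytesIO(data))
with io.TextIOWrapper(Read1Proxy(stream), encoding="utf-8") as text:
    lines = text.read().splitlines()

print(lines)
```

Keeping the workaround in a wrapper around the bz2 reader leaves the TextIOWrapper usage identical for gzip and bz2.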

@ghost


commented Sep 9, 2013

So you did; I didn't see it earlier.

I would be fine with correct code that only works on 3.3 and raising an error (or just blowing up) otherwise.
Building a compat layer for 3.2 seems like wasted effort.
