c-parser branch: Iteration over an open file handle makes the parser fail #2071

lbeltrame · 2012-10-15T09:23:44Z

An example is better than words:

In [16]: cat test.txt
AAA
BBB
CCC
DDD
EEE
FFF
GGG

In [17]: with open("test.txt") as handle:                      
    iterator = itertools.islice(handle, 3)
    print list(iterator)
    res = pandas.read_table(handle, squeeze=True, header=None)
   ....:     
['AAA\n', 'BBB\n', 'CCC\n']

In [18]: res
Out[18]: 
Empty DataFrame
Columns: array([], dtype=object)
Index: array([], dtype=object)

This works with the Python engine. Note that the C parser does not actually iterate over the handle: while debugging, I saw that after the iterator had been used, the handle stayed on the same line of the file.

lbeltrame · 2012-10-15T09:31:26Z

In general terms, it seems that any iteration on the open file handle breaks things:

In [26]: with open("test.txt") as handle:
    for line in handle:
        print line
        if "CCC" in line:
            break
    res = pandas.read_table(handle, squeeze=True, header=None)
   ....:     
AAA

BBB

CCC


In [27]: res
Out[27]: 
Empty DataFrame
Columns: array([], dtype=object)
Index: array([], dtype=object)

This is not the case for the Python parser:

In [28]: with open("test.txt") as handle:
    for line in handle:
        print line
        if "CCC" in line:
            break
    res = pandas.read_table(handle, squeeze=True, header=None, engine="python")
   ....:     
AAA

BBB

CCC


In [29]: res
Out[29]: 
0     DDD
1     EEE
2     FFF
3     GGG
4    None
Name: X0

lbeltrame · 2012-11-15T08:22:54Z

I gave it another go after the merge to master. This time the parser simply segfaults on this case (100% reproducible).

lbeltrame · 2012-11-15T08:29:35Z

And here's a backtrace:

Program received signal SIGSEGV, Segmentation fault.
buffer_rd_bytes (source=0xf068c0, nbytes=<optimized out>, bytes_read=0x7fffffffb5c8, status=0x7fffffffb5c4) at pandas/src/parser/io.c:128
128         if (!PyBytes_Check(result)) {
(gdb) bt
#0  buffer_rd_bytes (source=0xf068c0, nbytes=<optimized out>, bytes_read=0x7fffffffb5c8, status=0x7fffffffb5c4) at pandas/src/parser/io.c:128
#1  0x00007fffeeba2637 in parser_buffer_bytes (self=self@entry=0x6e6c10, nbytes=<optimized out>) at pandas/src/parser/parser.c:493
#2  0x00007fffeeba2d1f in _tokenize_helper (self=0x6e6c10, nrows=nrows@entry=1, all=all@entry=0) at pandas/src/parser/parser.c:1188
#3  0x00007fffeeba2dd7 in tokenize_nrows (self=<optimized out>, nrows=nrows@entry=1) at pandas/src/parser/parser.c:1218
#4  0x00007fffeeb7f9bd in __pyx_f_6pandas_7_parser_10TextReader__tokenize_rows (__pyx_v_self=0x6dcd20, __pyx_v_nrows=1) at pandas/src/parser.c:5893
#5  0x00007fffeeb81a21 in __pyx_f_6pandas_7_parser_10TextReader__get_header (__pyx_v_self=0x6dcd20) at pandas/src/parser.c:4946
#6  0x00007fffeeb8306b in __pyx_pf_6pandas_7_parser_10TextReader___cinit__ (__pyx_v_verbose=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_skip_footer=0x61e9b0, __pyx_v_skiprows=0x7fffef08fde8, __pyx_v_low_memory=0x7ffff7d90a60 <_Py_TrueStruct>, 
    __pyx_v_use_unsigned=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_compact_ints=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_na_values=0x7fffef08f878, __pyx_v_na_filter=0x7ffff7d90a60 <_Py_TrueStruct>, 
    __pyx_v_warn_bad_lines=0x7ffff7d90a60 <_Py_TrueStruct>, __pyx_v_error_bad_lines=0x7ffff7d90a60 <_Py_TrueStruct>, __pyx_v_usecols=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_dtype=0x7ffff7da4e20 <_Py_NoneStruct>, 
    __pyx_v_thousands=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_decimal=<optimized out>, __pyx_v_encoding=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_quoting=0x61e9b0, __pyx_v_quotechar=0x7ffff6b3ff08, 
    __pyx_v_doublequote=0x7ffff7d90a60 <_Py_TrueStruct>, __pyx_v_escapechar=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_skipinitialspace=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_as_recarray=0x7ffff7d90a80 <_Py_ZeroStruct>, 
    __pyx_v_factorize=0x0, __pyx_v_converters=0xf4bf10, __pyx_v_compression=<optimized out>, __pyx_v_delim_whitespace=<optimized out>, __pyx_v_tokenize_chunksize=<optimized out>, __pyx_v_memory_map=<optimized out>, 
    __pyx_v_names=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_header=<optimized out>, __pyx_v_delimiter=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_source=0x7ffff7f385d0, __pyx_v_self=<optimized out>, __pyx_v_comment=<optimized out>, 
    __pyx_v_buffer_lines=<optimized out>) at pandas/src/parser.c:3601
#7  __pyx_pw_6pandas_7_parser_10TextReader_1__cinit__ (__pyx_v_self=__pyx_v_self@entry=0x6dcd20, __pyx_args=__pyx_args@entry=0x7ffff7eec310, __pyx_kwds=__pyx_kwds@entry=0xf4ccc0) at pandas/src/parser.c:2481
#8  0x00007fffeeb8699e in __pyx_tp_new_6pandas_7_parser_TextReader (t=<optimized out>, a=0x7ffff7eec310, k=0xf4ccc0) at pandas/src/parser.c:19587

wesm · 2012-11-15T15:11:56Z

Thanks. I'll have a look

wesm · 2012-11-15T15:44:26Z

The underlying problem is that the new parser relies on being able to call read on the file handle you pass. However, after the handle has been iterated over, calling read raises:

ValueError: Mixing iteration and read methods would lose data

The case where calling read fails was not handled in the C code, so I added handling for it, and the new error message is shown here. That's the best I can do for this somewhat unusual case.

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3624)()

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4594)()

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:5967)()

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.raise_parser_error (pandas/src/parser.c:14702)()

CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
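
The handling described above can be sketched in Python. This is only an illustration of the pattern (catch a failing read and report it as a parser-level error), not the actual C code from the fix. `SourceReadError`, `read_bytes`, and `IteratedHandle` are hypothetical names invented for this sketch; the `ValueError` text is the one quoted in the thread.

```python
import io


class SourceReadError(Exception):
    """Hypothetical error type standing in for the parser's CParserError."""


def read_bytes(source, nbytes):
    """Call source.read(nbytes) and report any failure as a parser-level error."""
    try:
        return source.read(nbytes)
    except Exception as exc:
        # Keep the original exception attached as the cause.
        raise SourceReadError(
            "Calling read(nbytes) on source failed. Try engine='python'."
        ) from exc


class IteratedHandle:
    """Hypothetical stand-in for a handle whose read() refuses to run."""

    def read(self, nbytes):
        raise ValueError("Mixing iteration and read methods would lose data")


# A source with a working read() passes straight through.
ok = read_bytes(io.StringIO("AAA\nBBB\n"), 4)

# A source whose read() fails produces the clearer error instead of a crash.
try:
    read_bytes(IteratedHandle(), 1024)
except SourceReadError as err:
    message = str(err)
    cause = err.__cause__
```

Chaining with `from exc` keeps the underlying `ValueError` available on the raised error, so the original reason is not lost.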

lbeltrame · 2012-11-15T15:52:51Z

In short, should doing what the example does be considered broken? I assume the Python parser did not rely on read() directly, or did it work by pure chance?

wesm · 2012-11-15T16:03:46Z

It worked only because the Python code used the csv module, which uses the iteration protocol rather than read. I could go to some effort to make the code automatically "fall back" on the Python parser code, but not today.
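
A minimal sketch of that behaviour, using only the standard library: `csv.reader` pulls lines from its source by iterating over it, so it resumes after whatever lines have already been consumed. The `io.StringIO` object is an assumption standing in for the `test.txt` file from the report.

```python
import csv
import io
import itertools

# Stand-in for the test.txt file from the report.
handle = io.StringIO("AAA\nBBB\nCCC\nDDD\nEEE\nFFF\nGGG\n")

# Consume the first three lines through the iteration protocol.
head = list(itertools.islice(handle, 3))

# csv.reader also iterates over its source, so it picks up at the
# line following the last one consumed above.
rest = list(csv.reader(handle))

print(head)  # ['AAA\n', 'BBB\n', 'CCC\n']
print(rest)  # [['DDD'], ['EEE'], ['FFF'], ['GGG']]
```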

lbeltrame · 2012-11-15T16:07:15Z

On Thursday 15 November 2012 08:03:53, Wes McKinney wrote:

the code automatically "fall back" on the Python parser code, but not today

That's good enough for now. I simply switched all the bits of my code that
relied on this functionality to the Python engine and kept the rest on the C
parser.

Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79
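
Another possible workaround, not proposed in the thread, is to copy the lines that have not yet been consumed into a fresh in-memory buffer, which supports read from its start. The sketch below assumes the remaining data fits in memory; `remaining_as_buffer` is a hypothetical helper name, and `io.StringIO` again stands in for the real file.

```python
import io
import itertools


def remaining_as_buffer(handle):
    """Copy the not-yet-consumed lines of `handle` into a new in-memory buffer."""
    return io.StringIO("".join(handle))


# Stand-in for the test.txt file from the report.
handle = io.StringIO("AAA\nBBB\nCCC\nDDD\nEEE\nFFF\nGGG\n")

# Consume the first three lines by iteration, as in the report.
head = list(itertools.islice(handle, 3))

# The new buffer holds only the remaining lines and can be read normally.
buffer = remaining_as_buffer(handle)
remaining = buffer.read()

# Rewind so that a later consumer starts from the first remaining line.
buffer.seek(0)
```

The rewound buffer could then be passed to `pandas.read_table` in place of the original handle. That step is left out of the sketch and was not tested against the C parser discussed here.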

wesm closed this as completed in 25cc4e1 Nov 15, 2012

mcocdawc mentioned this issue Aug 2, 2017

read_csv with filehandler and nrows argument #17155

Open
