c-parser branch: Iteration over an open file handle makes the parser fail #2071

Closed
lbeltrame opened this issue Oct 15, 2012 · 8 comments
Labels: Bug, IO Data

@lbeltrame
Contributor

An example is better than words:

In [16]: cat test.txt
AAA
BBB
CCC
DDD
EEE
FFF
GGG

In [17]: with open("test.txt") as handle:                      
    iterator = itertools.islice(handle, 3)
    print list(iterator)
    res = pandas.read_table(handle, squeeze=True, header=None)
   ....:     
['AAA\n', 'BBB\n', 'CCC\n']

In [18]: res
Out[18]: 
Empty DataFrame
Columns: array([], dtype=object)
Index: array([], dtype=object)

This works with the Python engine. Note that the C parser does not actually consume the handle: while debugging, I saw that after the iterator has been used, the handle stays at the same line of the file (in other words, the parser is not iterating over it at all).

@lbeltrame
Contributor Author

In general terms, it seems that any iteration on the open file handle breaks things:

In [26]: with open("test.txt") as handle:
    for line in handle:
        print line
        if "CCC" in line:
            break
    res = pandas.read_table(handle, squeeze=True, header=None)
   ....:     
AAA

BBB

CCC


In [27]: res
Out[27]: 
Empty DataFrame
Columns: array([], dtype=object)
Index: array([], dtype=object)

While this is not the case for the Python parser:

In [28]: with open("test.txt") as handle:
    for line in handle:
        print line
        if "CCC" in line:
            break
    res = pandas.read_table(handle, squeeze=True, header=None, engine="python")
   ....:     
AAA

BBB

CCC


In [29]: res
Out[29]: 
0     DDD
1     EEE
2     FFF
3     GGG
4    None
Name: X0

@lbeltrame
Contributor Author

I tried again after the merge to master; this time the parser simply segfaults on this case (100% reproducible).

@lbeltrame
Contributor Author

And here's a backtrace:

Program received signal SIGSEGV, Segmentation fault.
buffer_rd_bytes (source=0xf068c0, nbytes=<optimized out>, bytes_read=0x7fffffffb5c8, status=0x7fffffffb5c4) at pandas/src/parser/io.c:128
128         if (!PyBytes_Check(result)) {
(gdb) bt
#0  buffer_rd_bytes (source=0xf068c0, nbytes=<optimized out>, bytes_read=0x7fffffffb5c8, status=0x7fffffffb5c4) at pandas/src/parser/io.c:128
#1  0x00007fffeeba2637 in parser_buffer_bytes (self=self@entry=0x6e6c10, nbytes=<optimized out>) at pandas/src/parser/parser.c:493
#2  0x00007fffeeba2d1f in _tokenize_helper (self=0x6e6c10, nrows=nrows@entry=1, all=all@entry=0) at pandas/src/parser/parser.c:1188
#3  0x00007fffeeba2dd7 in tokenize_nrows (self=<optimized out>, nrows=nrows@entry=1) at pandas/src/parser/parser.c:1218
#4  0x00007fffeeb7f9bd in __pyx_f_6pandas_7_parser_10TextReader__tokenize_rows (__pyx_v_self=0x6dcd20, __pyx_v_nrows=1) at pandas/src/parser.c:5893
#5  0x00007fffeeb81a21 in __pyx_f_6pandas_7_parser_10TextReader__get_header (__pyx_v_self=0x6dcd20) at pandas/src/parser.c:4946
#6  0x00007fffeeb8306b in __pyx_pf_6pandas_7_parser_10TextReader___cinit__ (__pyx_v_verbose=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_skip_footer=0x61e9b0, __pyx_v_skiprows=0x7fffef08fde8, __pyx_v_low_memory=0x7ffff7d90a60 <_Py_TrueStruct>, 
    __pyx_v_use_unsigned=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_compact_ints=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_na_values=0x7fffef08f878, __pyx_v_na_filter=0x7ffff7d90a60 <_Py_TrueStruct>, 
    __pyx_v_warn_bad_lines=0x7ffff7d90a60 <_Py_TrueStruct>, __pyx_v_error_bad_lines=0x7ffff7d90a60 <_Py_TrueStruct>, __pyx_v_usecols=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_dtype=0x7ffff7da4e20 <_Py_NoneStruct>, 
    __pyx_v_thousands=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_decimal=<optimized out>, __pyx_v_encoding=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_quoting=0x61e9b0, __pyx_v_quotechar=0x7ffff6b3ff08, 
    __pyx_v_doublequote=0x7ffff7d90a60 <_Py_TrueStruct>, __pyx_v_escapechar=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_skipinitialspace=0x7ffff7d90a80 <_Py_ZeroStruct>, __pyx_v_as_recarray=0x7ffff7d90a80 <_Py_ZeroStruct>, 
    __pyx_v_factorize=0x0, __pyx_v_converters=0xf4bf10, __pyx_v_compression=<optimized out>, __pyx_v_delim_whitespace=<optimized out>, __pyx_v_tokenize_chunksize=<optimized out>, __pyx_v_memory_map=<optimized out>, 
    __pyx_v_names=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_header=<optimized out>, __pyx_v_delimiter=0x7ffff7da4e20 <_Py_NoneStruct>, __pyx_v_source=0x7ffff7f385d0, __pyx_v_self=<optimized out>, __pyx_v_comment=<optimized out>, 
    __pyx_v_buffer_lines=<optimized out>) at pandas/src/parser.c:3601
#7  __pyx_pw_6pandas_7_parser_10TextReader_1__cinit__ (__pyx_v_self=__pyx_v_self@entry=0x6dcd20, __pyx_args=__pyx_args@entry=0x7ffff7eec310, __pyx_kwds=__pyx_kwds@entry=0xf4ccc0) at pandas/src/parser.c:2481
#8  0x00007fffeeb8699e in __pyx_tp_new_6pandas_7_parser_TextReader (t=<optimized out>, a=0x7ffff7eec310, k=0xf4ccc0) at pandas/src/parser.c:19587

@wesm
Member

wesm commented Nov 15, 2012

Thanks. I'll have a look

@wesm
Member

wesm commented Nov 15, 2012

The underlying problem is that the new parser relies on being able to call read on the file handle you pass. However, once you have iterated over the handle, calling read raises:

ValueError: Mixing iteration and read methods would lose data

The C code did not handle the case where calling read fails, so I added that handling; the new error message is below. That's the best I can do for this somewhat unusual case.

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3624)()

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4594)()

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:5967)()

/home/wesm/code/pandas/pandas/_parser.so in pandas._parser.raise_parser_error (pandas/src/parser.c:14702)()

CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
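
For anyone who wants to keep using the C parser after partially iterating a handle, one possible workaround is to drain the rest of the handle through the iteration protocol into an in-memory buffer, which supports read. This is only a sketch written for Python 3, not something taken from the pandas code base: the helper name `remaining_as_buffer` is hypothetical, and an `io.StringIO` stands in for the open `test.txt` from the report.

```python
import io


def remaining_as_buffer(lines):
    """Drain an iterator of text lines into an in-memory buffer.

    Only iteration is used on `lines`, never read(), so the original
    handle is never asked to mix the two access styles.
    """
    return io.StringIO("".join(lines))


# Stand-in for the open file from the report (hypothetical in-memory data).
handle = io.StringIO("AAA\nBBB\nCCC\nDDD\nEEE\nFFF\nGGG\n")

# Partially consume the handle by iterating, as in the failing example.
for line in handle:
    if "CCC" in line:
        break

# Everything after "CCC" ends up in a fresh buffer that supports read().
buffer = remaining_as_buffer(handle)
remaining_text = buffer.getvalue()
```

The resulting `buffer` could then be passed to the parser in place of the original handle, at the cost of holding the remaining file contents in memory.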

@wesm closed this as completed in 25cc4e1 on Nov 15, 2012

@lbeltrame
Contributor Author

In short, should usage like the example be considered broken? I assume the Python parser does not rely on read() directly, or did it work by pure chance?

@wesm
Member

wesm commented Nov 15, 2012

It worked only because the Python code uses the csv module, which uses the iteration protocol rather than read. I could go to some effort to make the code automatically "fall back" on the Python parser, but not today.
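
To illustrate the difference, here is a small standard-library sketch (Python 3, with an `io.StringIO` standing in for the open `test.txt`; it does not reproduce pandas internals). `csv.reader` accepts any iterable of strings and pulls lines from it one at a time, so it simply continues from wherever the earlier loop stopped.

```python
import csv
import io

# Stand-in for the open file from the report (hypothetical in-memory data).
source = io.StringIO("AAA\nBBB\nCCC\nDDD\nEEE\nFFF\nGGG\n")

# Partially consume the source by iterating, as in the failing example.
for line in source:
    if "CCC" in line:
        break

# csv.reader fetches lines through the iteration protocol and never
# calls read(), so it picks up at the first line not yet consumed.
rows = list(csv.reader(source, delimiter="\t"))
```

Here `rows` holds one single-field row for each of the lines after `CCC`, which matches what the Python engine returned in the example above.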

@lbeltrame
Contributor Author

On Thursday 15 November 2012 08:03:53, Wes McKinney wrote:

the code automatically "fall back" on the Python parser code, but not today

That's good enough for now; I simply switched the parts of my code that
relied on this functionality to the Python engine and kept the rest on the C
parser.

Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79
