Core dumped in read_csv (C engine) when reading multiple corrupted gzip files #12098
Comments
|
pls show an example with data as minimal as possible. why does looping matter here? |
jreback
added the
CSV
label
Jan 20, 2016
alessiodore
commented
Jan 20, 2016
|
Sorry I forgot to attach the file. Here it is: The loop is to reproduce the problem without having to attach multiple files. INSTALLED VERSIONS: commit: None; pandas: 0.17.1 |
|
@alessiodore can you see if you can narrow it down a bit more pls. e.g. keep chopping until you don't get the error, then back up |
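The chop-and-back-up approach described here is a binary search over input prefixes. A generic sketch of automating it (the predicate and names are illustrative, not from the issue, and it assumes failure is monotone in prefix length):

```python
def smallest_failing_prefix(data, fails):
    """Binary-search the shortest prefix data[:n] for which fails() is True.

    Assumes fails(data) is True for the full input and that failure is
    monotone: once a prefix fails, every longer prefix fails too.
    """
    lo, hi = 0, len(data)  # invariant: data[:hi] fails, data[:lo] does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(data[:mid]):
            hi = mid
        else:
            lo = mid
    return hi

# Toy predicate: "fails once the prefix contains three newlines".
text = "a\nb\nc\nd\n"
n = smallest_failing_prefix(text, lambda s: s.count("\n") >= 3)  # n == 6
```

In the real case the predicate would write `data[:mid]` to a gzip file and check whether read_csv crashes, which is why doing it by hand (9656 passes, 9657 fails) amounts to the same search.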
alessiodore
commented
Jan 20, 2016
|
I am not sure this is what you mean but I changed my script to:
When I slice the file to 9656 I don't get the segmentation fault. At 9657 it gets the segm fault. |
|
great. so extract the slice that is causing the error. we need a simple, copy-pastable example in order to pinpoint the problem, I don't want a file, rather a string of characters that repro. further try with and w/o the gzip to see if that is the problem. The more you can narrow it down the better. |
alessiodore
commented
Jan 20, 2016
|
I tried to slice the left part of the file (log[n:9657]) but I got the segm fault only for n=0. Also I tried for log[1:len(log)] and I didn't get the segm fault. |
|
yes, ideally what you can do is something like:
e.g. a complete copy-pastable example that repros. Then we can use this to debug and as a test. I know narrowing down is not so fun :< but in order to fix these issues it's much better to have a simple example. thanks! |
alessiodore
commented
Jan 20, 2016
|
I understand. I just wasn't entirely sure if it was okay to post a 10K-character string.
|
|
can you cut this down |
alessiodore
commented
Jan 20, 2016
|
I am not sure how I can give you a simpler example. I can reproduce the segmentation fault only with the file sliced from 0 to 9657. If I take the characters from the second one to the end of the file [1:len(log)] I don't get the segm fault. Also no segmentation fault if I consider the file from 0 to 9656. The parser seems to detect that the file is corrupted, but when I try to read a certain number of corrupted files, at some point I get a segmentation fault. Unfortunately, this is all the information I have and this is the only way I could recreate the problem. |
|
ok, this is reproducible. thanks for the example. |
jreback
added Bug Difficulty Intermediate Effort Medium
labels
Jan 20, 2016
jreback
added this to the
Next Major Release
milestone
Jan 20, 2016
|
if anyone is interested cc @mcwitt |
alessiodore
changed the title from
Core dumped in read_csv (C engine) when reading corrupted gzip file multiple times to Core dumped in read_csv (C engine) when reading multiple corrupted gzip files
Jan 20, 2016
selasley
commented
Jan 25, 2016
|
The segfault with python2 is caused by the Py_XDECREF(RDS(rds)->buffer); line in the del_rd_source function in the io.c source file. The reference count for rds->obj is explicitly incremented in new_rd_source() but I haven't found where the reference count for rds->buffer is incremented. Removing the Py_XDECREF(RDS(rds)->buffer); line in io.c allows the example code to run without a segfault. Does anyone know of a good reason to keep the call to Py_XDECREF(RDS(rds)->buffer) in the del_rd_source function? |
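The ownership rule behind this analysis — only Py_XDECREF a reference you actually own, i.e. one you incremented or were handed as a new reference — can be observed from Python through ctypes. This is an illustration of the pairing discipline, not the pandas code itself:

```python
import ctypes
import sys

obj = object()
base = sys.getrefcount(obj)

# Taking ownership: Py_IncRef bumps the count; we now owe one Py_DecRef.
ctypes.pythonapi.Py_IncRef(ctypes.py_object(obj))
assert sys.getrefcount(obj) == base + 1

# Releasing exactly what we took restores the balance.
ctypes.pythonapi.Py_DecRef(ctypes.py_object(obj))
assert sys.getrefcount(obj) == base

# A further Py_DecRef here would release a reference we never took --
# the same imbalance as Py_XDECREF(RDS(rds)->buffer) without a matching
# incref, which frees the buffer while another owner still uses it.
```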
|
@selasley I just looked this over. The line which creates the result,

    result = PyObject_CallObject(func, args);

returns a new reference, so the reference count of this object should be 1. The problematic thing I'm seeing is actually this block:

    if (result == NULL) {
        PyGILState_Release(state);
        *bytes_read = 0;
        *status = CALLING_READ_FAILED;
        return NULL;
    }

From first principles: If |
|
To be on the safe side it would be better to always set src->buffer = NULL after garbage collecting it |
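The defensive pattern being suggested — drop the reference, then null the slot so a repeated cleanup is a no-op — looks like this in miniature. A Python sketch of the discipline only; the real fix is the C change around del_rd_source:

```python
class RdSource:
    """Toy model of the rd_source cleanup discipline.

    Mirrors the C fix: after releasing self.buffer, null the slot so
    that a second close() -- like a second Py_XDECREF -- cannot touch
    a buffer that is already gone.
    """

    def __init__(self, buffer):
        self.buffer = buffer

    def close(self):
        buf = self.buffer
        self.buffer = None  # null the slot before releasing
        if buf is not None:
            del buf  # stand-in for Py_XDECREF(RDS(rds)->buffer)

src = RdSource(bytearray(b"partial gzip data"))
src.close()
src.close()  # safe: the second cleanup finds None and does nothing
```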
selasley
commented
Jan 25, 2016
|
I put the call to Py_XDECREF(RDS(rds)->buffer); back in del_rd_source() and added the lines you suggested. The problem code runs without segfaulting and all tests pass in |
|
Cool, I think just that one line |
selasley
commented
Jan 25, 2016
|
Will do. |
selasley
referenced
this issue
Jan 25, 2016
Closed
BUG: set src->buffer = NULL after garbage collecting it in buffer_rd_… #12135
jreback
modified the milestone: 0.18.0, Next Major Release
Jan 25, 2016
selasley
pushed a commit
to selasley/pandas
that referenced
this issue
Jan 26, 2016
fb204a2
selasley
pushed a commit
to selasley/pandas
that referenced
this issue
Jan 27, 2016
a1f0a79
alessiodore commented Jan 20, 2016
I am using read_csv to read some gzip-compressed log files. Some of these files are corrupted and cannot be decompressed.
At different iterations in the loop that reads these files my script crashes with a core dumped message:
*** Error in `/usr/bin/python': corrupted double-linked list: 0x0000000003836790 ***
or just:
Segmentation fault (core dumped)
This is a stripped-down version (just looping over one of the corrupted files) of the code where this error occurs:
The traceback of the caught exception is:
    File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
      return _read(filepath_or_buffer, kwds)
    File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 285, in _read
      return parser.read()
    File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 747, in read
      ret = self._engine.read(nrows)
    File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1197, in read
      data = self._reader.read(nrows)
    File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
    File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
    File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
    File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
    File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
    CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
If I remove the delim_whitespace argument the loop completes without segmentation fault. I tried adding low_memory=False but the program still crashes.
I am using pandas version 0.17.1 on Ubuntu 14.04 OS.
It looks like a similar issue to #5664, but that problem should have been resolved in v0.16.1.
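The CParserError message itself suggests engine='python', so a workaround while the bug stood would be a fallback wrapper. A generic sketch — read_with_fallback is not a pandas API; pass pandas.read_csv in as the read callable:

```python
def read_with_fallback(read, path, **kwargs):
    """Try the fast C engine first; fall back to the pure-Python engine.

    `read` is a read_csv-like callable that accepts an `engine` keyword.
    """
    try:
        return read(path, engine="c", **kwargs)
    except Exception:
        return read(path, engine="python", **kwargs)

# Usage (assuming pandas is available):
#   import pandas as pd
#   df = read_with_fallback(pd.read_csv, "logs.gz",
#                           compression="gzip", delim_whitespace=True)
```

This avoids the crashing C code path for the corrupted files while keeping the C engine's speed for the healthy ones.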