BUG: read_csv, engine='c' error #9735

Closed
sbtlaarzc opened this issue Mar 26, 2015 · 8 comments
Labels
IO CSV read_csv, to_csv
Comments

@sbtlaarzc

I am trying to read a 57 MB file with pandas.read_csv. The file contains a header (5 rows), followed by integer values and, at the end, float values:

info         
       2681087         53329       1287215       1287215         53328
RSA                    53328         53328       1287215             0
(I14)           (I14)           (d25.15)            (d25.15)            
F                          1         5332   
           1
          33
          61
          92
         128
         ...
         165
         205
         239
         272
    0.112474585277959E+09
    0.126110931411177E+09
    0.515995872032845E+09
    0.126110931411175E+09
   -0.194634413074014E+09
    0.112474585277950E+09
    ...

When I read the txt file:

import numpy as np
import pandas as pd

pd.read_csv(file, skiprows=5 + n_int_values, header=None, engine='c',
            dtype=np.float, low_memory=False)

The result is an error:

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-118-699921ac7a12> in <module>()
----> 1 a=pd.read_csv(loc, skiprows=5+n_coloums+n_rows, header=None, engine='c', 
low_memory=False, error_bad_lines=False)

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect,     
compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, 
header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, 
true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, 
as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, 
error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, 
dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, 
encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    468                     skip_blank_lines=skip_blank_lines)
    469 
--> 470         return _read(filepath_or_buffer, kwds)
    471 
    472     parser_f.__name__ = name

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    254         return parser
    255 
--> 256     return parser.read()
    257 
    258 _parser_defaults = {

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    713                 raise ValueError('skip_footer not supported for iteration')
    714 
--> 715         ret = self._engine.read(nrows)
    716 
    717         if self.options.get('as_recarray'):

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1162 
   1163         try:
-> 1164             data = self._reader.read(nrows)
   1165         except StopIteration:
   1166             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7426)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8377)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20728)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input     
file.

This happens on pandas 0.16.0 with Anaconda Python 2.7.8. On an older version, 0.14.1, it works correctly.

Note: When I use engine='python', the txt file is loaded normally.
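
For reference, a minimal sketch of the working call (using the same file and n_int_values variables as above, and dropping the options that only apply to the C engine):

import pandas as pd

# same read as above, but with the pure-Python parser, which tokenizes
# the file without the buffer overflow error
pd.read_csv(file, skiprows=5 + n_int_values, header=None, engine='python')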

@sbtlaarzc sbtlaarzc changed the title read_csv, engine='c' error BUG: read_csv, engine='c' error Mar 26, 2015
@jreback jreback added the IO CSV read_csv, to_csv label Mar 27, 2015
@evanpw
Contributor

evanpw commented Sep 15, 2015

I can reproduce this bug with 0.16.0, but it works in 0.16.1+. It looks like GH #10023 fixed it (based on bisection).

@jreback
Contributor

jreback commented Sep 15, 2015

Do you want to add a confirming test?
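
A minimal sketch of what such a confirming test might look like (the test name and exact shape are illustrative, not the test that was eventually merged):

from io import StringIO   # Python 3; on Python 2 use StringIO.StringIO
import pandas as pd


def test_read_csv_buffer_overflow_gh9735():
    # an integer block followed by wide, whitespace-padded floats,
    # mimicking the file layout from the report
    ints = "\n".join("%12d" % i for i in range(10))
    floats = "\n".join("    0.%015dE+09" % i for i in range(10))
    data = ints + "\n" + floats
    result = pd.read_csv(StringIO(data), skiprows=10, header=None,
                         engine='c', low_memory=False)
    assert result.shape == (10, 1)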

@chbrown

chbrown commented Sep 16, 2015

Same error here on pandas 0.16.2 (Mac OS X) -- it works when engine='python', but not with the C parser.

Input file is a 59 MB csv that I had previously written via pandas with df.to_csv(filepath, index=False, encoding='utf8').

It works just fine when I cut off the first 1000 lines of the input file and try to read that, so I figure the bug is somewhere in TextReader._read_low_memory, which shows up in the error traceback.
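
One way to narrow that down further, assuming the failure depends on how far into the file the parser gets (a rough sketch; filepath is a placeholder for the 59 MB csv):

import pandas as pd

# read the file in fixed-size chunks with the C parser and report roughly
# how far it gets before the tokenizer error is raised
reader = pd.read_csv(filepath, engine='c', chunksize=10000)
rows_read = 0
try:
    for chunk in reader:
        rows_read += len(chunk)
except Exception as exc:
    print("C parser failed after about %d rows: %s" % (rows_read, exc))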

@sbtlaarzc
Author

The 0.16.1 update resolved my problem, so it works now.

@chbrown

chbrown commented Sep 17, 2015

Perhaps 0.16.2 is a regression, then.

I can open a new ticket if you think it's a different bug, but I get the exact same error message and traceback.

@jreback
Contributor

jreback commented Sep 17, 2015

This works for me on 0.16.2, so if you see a difference, please post it.

In [1]: data = """info
   ...:        2681087         53329       1287215       1287215         53328
   ...: RSA                    53328         53328       1287215             0
   ...: (I14)           (I14)           (d25.15)            (d25.15)
   ...: F                          1         5332
   ...:            1
   ...:           33
   ...:           61
   ...:           92
   ...:          128
   ...:          ...
   ...:          165
   ...:          205
   ...:          239
   ...:          272
   ...:     0.112474585277959E+09
   ...:     0.126110931411177E+09
   ...:     0.515995872032845E+09
   ...:     0.126110931411175E+09
   ...:    -0.194634413074014E+09
   ...:     0.112474585277950E+09
   ...: """

In [2]: pd.read_csv(StringIO(data), skiprows=15, header=None, engine='c', low_memory=False)
Out[2]: 
              0
0  1.124746e+08
1  1.261109e+08
2  5.159959e+08
3  1.261109e+08
4 -1.946344e+08
5  1.124746e+08
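
To run that snippet outside IPython, the only imports it assumes are (Python 2 here, matching the environment in the original report):

import pandas as pd
from StringIO import StringIO   # on Python 3: from io import StringIO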

@evanpw
Contributor

evanpw commented Sep 17, 2015

I had to add a bunch of floats to the end to get it to fail on 0.16.0, but it works on everything I tried from 0.16.1 through current trunk. I'll try to get a minimal test case.
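
Roughly what that looks like against the snippet above (a sketch; data is the string from jreback's session, and the imports from the note above it are assumed):

# pad the sample with a large block of extra float rows so the input is
# big enough to trip the 0.16.0 buffer overflow
extra = "\n".join("    0.%015dE+09" % i for i in range(200000))
pd.read_csv(StringIO(data + extra), skiprows=15, header=None,
            engine='c', low_memory=False)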

@jreback
Contributor

jreback commented Sep 17, 2015

closed by #11138

@jreback jreback closed this as completed Sep 17, 2015
yarikoptic added a commit to neurodebian/pandas that referenced this issue Oct 11, 2015
* commit 'v0.17.0rc1-92-gc6bcc99': (29 commits)
  CI: tests latest versions of openpyxl
  COMPAT: openpyxl >= 2.2 support, pandas-dev#10125
  Tests demonstrating how to use sqlalchemy.text() objects in read_sql()
  TST: Capture warnings in _check_plot_works
  COMPAT/BUG: color handling in scatter
  COMPAT: Support for matplotlib 1.5
  ERR/API: Raise NotImplementedError when Panel operator function is not implemented, pandas-dev#7692
  DOC: minor doc formatting fixes
  PERF: nested dict DataFrame construction
  DEPR: deprecate SparsePanel
  BLD: dateutil->python-dateutil in conda recipe
  BUG/API: GH11086 where freq is not inferred if both freq is None
  ENH: add merge indicator to DataFrame.merge
  PERF: improves performance in groupby.size
  BUG: DatetimeTZBlock.fillna raises TypeError
  PERF: infer_datetime_format without padding pandas-dev#11142
  PERF: improves performance in SeriesGroupBy.transform
  TST: Verify fix for buffer overflow in read_csv with engine='c' (GH pandas-dev#9735)
  DEPR: Series.is_timeseries
  BUG: nested construction with timedelta pandas-dev#11129
  ...