BUG: read_csv, engine='c' error #9735

Closed
sbtlaarzc opened this issue Mar 26, 2015 · 8 comments
Labels
IO CSV read_csv, to_csv
Comments

@sbtlaarzc

I am trying to read a 57 MB file with pandas.read_csv. The file contains a header (5 rows), followed by integer values and, at the end, float values:

info         
       2681087         53329       1287215       1287215         53328
RSA                    53328         53328       1287215             0
(I14)           (I14)           (d25.15)            (d25.15)            
F                          1         5332   
           1
          33
          61
          92
         128
         ...
         165
         205
         239
         272
    0.112474585277959E+09
    0.126110931411177E+09
    0.515995872032845E+09
    0.126110931411175E+09
   -0.194634413074014E+09
    0.112474585277950E+09
    ...

When I read the txt file:

import numpy as np
import pandas as pd

pd.read_csv(file, skiprows=5 + n_int_values, header=None, engine='c',
            dtype=np.float, low_memory=False)

The result is an error:

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-118-699921ac7a12> in <module>()
----> 1 a=pd.read_csv(loc, skiprows=5+n_coloums+n_rows, header=None, engine='c', 
low_memory=False, error_bad_lines=False)

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect,     
compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, 
header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, 
true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, 
as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, 
error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, 
dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, 
encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    468                     skip_blank_lines=skip_blank_lines)
    469 
--> 470         return _read(filepath_or_buffer, kwds)
    471 
    472     parser_f.__name__ = name

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    254         return parser
    255 
--> 256     return parser.read()
    257 
    258 _parser_defaults = {

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    713                 raise ValueError('skip_footer not supported for iteration')
    714 
--> 715         ret = self._engine.read(nrows)
    716 
    717         if self.options.get('as_recarray'):

C:\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1162 
   1163         try:
-> 1164             data = self._reader.read(nrows)
   1165         except StopIteration:
   1166             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7426)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8377)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20728)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input     
file.

This happens on pandas 0.16.0 with Anaconda Python 2.7.8. On an older version, 0.14.1, it works correctly.

Note: When I use engine='python', the txt file is loaded normally.
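
For reference, a minimal sketch of the working call (using the same file and n_int_values variables as above, and dropping the options that only apply to the C engine):

import pandas as pd

# same read as above, but with the pure-Python parser, which tokenizes
# the file without the buffer overflow error
pd.read_csv(file, skiprows=5 + n_int_values, header=None, engine='python')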

@sbtlaarzc sbtlaarzc changed the title read_csv, engine='c' error BUG: read_csv, engine='c' error Mar 26, 2015
@jreback jreback added the IO CSV read_csv, to_csv label Mar 27, 2015
@evanpw
Contributor

evanpw commented Sep 15, 2015

I can reproduce this bug with 0.16.0, but it works in 0.16.1+. It looks like GH #10023 fixed it (based on bisection).

@jreback
Contributor

jreback commented Sep 15, 2015

Do you want to add a confirming test?
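
A minimal sketch of what such a confirming test might look like (the test name and exact shape are illustrative, not the test that was eventually merged):

from io import StringIO   # Python 3; on Python 2 use StringIO.StringIO
import pandas as pd


def test_read_csv_buffer_overflow_gh9735():
    # an integer block followed by wide, whitespace-padded floats,
    # mimicking the file layout from the report
    ints = "\n".join("%12d" % i for i in range(10))
    floats = "\n".join("    0.%015dE+09" % i for i in range(10))
    data = ints + "\n" + floats
    result = pd.read_csv(StringIO(data), skiprows=10, header=None,
                         engine='c', low_memory=False)
    assert result.shape == (10, 1)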

@chbrown

chbrown commented Sep 16, 2015

Same error here on pandas 0.16.2 (Mac OS X) -- it works when engine='python', but not with the C parser.

Input file is a 59 MB csv that I had previously written via pandas with df.to_csv(filepath, index=False, encoding='utf8').

It works just fine when I cut off the first 1000 lines of the input file and try to read that, so I figure the bug is somewhere in TextReader._read_low_memory, which shows up in the error traceback.
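
One way to narrow that down further, assuming the failure depends on how far into the file the parser gets (a rough sketch; filepath is a placeholder for the 59 MB csv):

import pandas as pd

# read the file in fixed-size chunks with the C parser and report roughly
# how far it gets before the tokenizer error is raised
reader = pd.read_csv(filepath, engine='c', chunksize=10000)
rows_read = 0
try:
    for chunk in reader:
        rows_read += len(chunk)
except Exception as exc:
    print("C parser failed after about %d rows: %s" % (rows_read, exc))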

@sbtlaarzc
Author

The 0.16.1 update resolved my problem, so it works now.

@chbrown

chbrown commented Sep 17, 2015

Perhaps 0.16.2 is a regression, then.

I can open a new ticket if you think it's a different bug, but I get the exact same error message and traceback.

@jreback
Contributor

jreback commented Sep 17, 2015

This works for me on 0.16.2, so if you see a difference, please post it.

In [1]: data = """info
   ...:        2681087         53329       1287215       1287215         53328
   ...: RSA                    53328         53328       1287215             0
   ...: (I14)           (I14)           (d25.15)            (d25.15)
   ...: F                          1         5332
   ...:            1
   ...:           33
   ...:           61
   ...:           92
   ...:          128
   ...:          ...
   ...:          165
   ...:          205
   ...:          239
   ...:          272
   ...:     0.112474585277959E+09
   ...:     0.126110931411177E+09
   ...:     0.515995872032845E+09
   ...:     0.126110931411175E+09
   ...:    -0.194634413074014E+09
   ...:     0.112474585277950E+09
   ...: """

In [2]: pd.read_csv(StringIO(data), skiprows=15, header=None, engine='c', low_memory=False)
Out[2]: 
              0
0  1.124746e+08
1  1.261109e+08
2  5.159959e+08
3  1.261109e+08
4 -1.946344e+08
5  1.124746e+08
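
To run that snippet outside IPython, the only imports it assumes are (Python 2 here, matching the environment in the original report):

import pandas as pd
from StringIO import StringIO   # on Python 3: from io import StringIO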

@evanpw
Contributor

evanpw commented Sep 17, 2015

I had to add a bunch of floats to the end to get it to fail on 0.16.0, but it works on everything I tried from 0.16.1 through current trunk. I'll try to get a minimal test case.
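
Roughly what that looks like against the snippet above (a sketch; data is the string from jreback's session, and the imports from the note above it are assumed):

# pad the sample with a large block of extra float rows so the input is
# big enough to trip the 0.16.0 buffer overflow
extra = "\n".join("    0.%015dE+09" % i for i in range(200000))
pd.read_csv(StringIO(data + extra), skiprows=15, header=None,
            engine='c', low_memory=False)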

@jreback
Contributor

jreback commented Sep 17, 2015

closed by #11138

@jreback jreback closed this as completed Sep 17, 2015
yarikoptic added a commit to neurodebian/pandas that referenced this issue Oct 11, 2015
* commit 'v0.17.0rc1-92-gc6bcc99': (29 commits)
  CI: tests latest versions of openpyxl
  COMPAT: openpyxl >= 2.2 support, pandas-dev#10125
  Tests demonstrating how to use sqlalchemy.text() objects in read_sql()
  TST: Capture warnings in _check_plot_works
  COMPAT/BUG: color handling in scatter
  COMPAT: Support for matplotlib 1.5
  ERR/API: Raise NotImplementedError when Panel operator function is not implemented, pandas-dev#7692
  DOC: minor doc formatting fixes
  PERF: nested dict DataFrame construction
  DEPR: deprecate SparsePanel
  BLD: dateutil->python-dateutil in conda recipe
  BUG/API: GH11086 where freq is not inferred if both freq is None
  ENH: add merge indicator to DataFrame.merge
  PERF: improves performance in groupby.size
  BUG: DatetimeTZBlock.fillna raises TypeError
  PERF: infer_datetime_format without padding pandas-dev#11142
  PERF: improves performance in SeriesGroupBy.transform
  TST: Verify fix for buffer overflow in read_csv with engine='c' (GH pandas-dev#9735)
  DEPR: Series.is_timeseries
  BUG: nested construction with timedelta pandas-dev#11129
  ...