Skip to content

read_csv() force numeric dtype on column and set unparseable entry as missing (NaN) #2570

@dragoljub

Description

@dragoljub

How can I force a dtype on a column and ensure that any not-parseable data entry are filled as NaN? This is important in cases where there are unpredictable data entry errors in CSVs or database streams that cannot be mapped to missing values a priori.

Eg: Below I want column 'a' to be parsed as np.float but the erroneous 'Dog' entry causes an exception. Is there a way to tell read_csv() to force parsing a column 'a' as np.float and fill all non-parseable entries with NaN?

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
df.dtypes

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-cd8b6f868aec> in <module>()
      1 data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
----> 2 df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
      3 df.dtypes

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    389                     buffer_lines=buffer_lines)
    390 
--> 391         return _read(filepath_or_buffer, kwds)
    392 
    393     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    205         return parser
    206 
--> 207     return parser.read()
    208 
    209 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    622             #     self._engine.set_error_bad_lines(False)
    623 
--> 624         ret = self._engine.read(nrows)
    625 
    626         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    943 
    944         try:
--> 945             data = self._reader.read(nrows)
    946         except StopIteration:
    947             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5785)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6002)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6870)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7919)()

AttributeError: 'NoneType' object has no attribute 'dtype'

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions