More consistent na_values handling in read_csv #1657

BrenBarn · 2012-07-21T21:04:46Z

The current handling of the na_values argument to read_csv is strangely different depending on what kind of value you pass to na_values. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, then those values will be used for those columns, totally overriding the default NA values, while for columns not in the dict, the default values will be used. If you pass some other kind of iterable, it uses the union of the passed values and the default values as the NA values.

This behavior is confusing because sometimes the passed values override the defaults, but other times they just add to the defaults. It's also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: "If you pass an empty list or an empty list for a particular column, no values (including empty strings) will be considered NA." But passing an empty list doesn't result in no values being considered NA. In fact, passing an empty list does nothing, since the empty list is unioned with the default NA values, so the default NA values are just used anyway.

Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you're not overriding the defaults, you're adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like "na" and "nan" are valid data and should be read as their literal string values.

There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be to have two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default (currently the module-level _NA_VALUES in parsers.py), and allow users to add to it it they want to add more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas).

The text was updated successfully, but these errors were encountered:

changhiskhan · 2012-07-23T22:01:16Z

I agree that it's too confusing/inconsistent as it stands. I added custom NA handling for a previous issue but I should have revisited the whole NA specifying API for read_csv. The fix will be part of 0.8.2 coming out in a few weeks.

Thanks for the bug report!

changhiskhan · 2012-08-19T19:31:50Z

I added a keep_default_na keyword

kayvonr · 2014-02-05T17:36:18Z

Hey there - I believe this issue is still not completely fixed. Reading in a csv with countries and using both keep_default_na=False and na_values='' still results in "Namibia", country code "NA", being read in as a float NaN

In [56]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values='')

In [57]: data.ix[160:165]
Out[57]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [61]: data.ix[163, 'geo_country']
Out[61]: nan

In [62]: type(data.ix[163, 'geo_country'])
Out[62]: float

jreback · 2014-02-05T18:14:22Z

@kayvonr what version of pandas?

jreback · 2014-02-05T18:15:18Z

http://pandas.pydata.org/pandas-docs/dev/io.html#na-values

na_values has to be a list, so try with: na_values=['']

kayvonr · 2014-02-05T18:20:13Z

pandas 0.12.0

I originally tried it with a list argument first and that presents two different problems:

In [7]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=[' '])

In [8]: data.ix[160:165]
Out[8]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [9]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=[''])

In [10]: data.ix[160:165]
Out[10]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

An argument with any non-empty string causes the "NA" to be read in as an empty string.

In [13]: data.ix[163, 'geo_country']
Out[13]: ''

Double checked it wasn't just the space with

In [11]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=['foo'])

In [12]: data.ix[160:165]
Out[12]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

and an empty string results in the "NA" string being again turned into a NaN

kayvonr · 2014-02-05T18:22:45Z

Issue also present with a dictionary argument to na_values

In [14]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values={"geo_country" : "foo"})

In [15]: data.ix[163, 'geo_country']
Out[15]: ''

In [16]: data.ix[160:165]
Out[16]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

jreback · 2014-02-05T18:30:44Z

this was fixed by #4374, it in 0.13

kayvonr · 2014-02-05T18:35:29Z

ah ok, thanks

makmanalp · 2016-03-24T19:08:36Z

Kinda funny, I just ran into a similar issue as @kayvonr with Namibia/NA, it just wasn't obvious to me that keep_default_na was a thing and so why setting na_values to some random value or None was not quite working. This is the convenience / magic behavior tradeoff I guess.

makmanalp · 2016-03-24T19:11:24Z

Probably changing the default behavior to be less magic is not an option anymore, but maybe there is a documentation improvement ticket in here somewhere.

jreback · 2016-03-24T19:32:26Z

@makmanalp I mean the docs are pretty clear. If you want to add an example would take that. Of course people just don't read the docs......

neilser · 2018-01-07T16:09:25Z

I can't get a dictionary of na_values to work properly for me, no matter what I try. Version is 0.22.0.
hack.csv contains:

113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008
729639,"qwer","",asdfkj,466.681,,252.373

Two variants of my code - the one with the list does what I expect, but the dict version doesn't:

df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008,'',"blah"])
df.head()

output:

0	1	2	3	4	5	6
113125	NaN	/blaha	kjsdkj	412.166	225.874	NaN
729639	qwer	NaN	asdfkj	466.681	NaN	252.373

but the dict version:

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})
df.head()

is paying attention to the columns I specify, and then simply refusing to create any NaNs in those columns:

0	1	2	3	4	5	6
113125	blah	/blaha	kjsdkj	412.166	225.874	214.008
729639	qwer		asdfkj	466.681	NaN	252.373

So... I'm stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict.
[Edit: oddly, the dict version does create NaNs in columns I didn't specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values.]
Btw: to be picky, the docs for keep_default_na are a bit misleading, in that they imply there should be no effect unless na_values is supplied (but in fact there is an effect even when na_values isn't supplied).

ghost assigned changhiskhan Jul 23, 2012

changhiskhan added a commit that referenced this issue Aug 19, 2012

BUG: more consistent na_values #1657

d9abf68

changhiskhan closed this as completed Aug 20, 2012

wesm unassigned changhiskhan Oct 12, 2016

nealmcb mentioned this issue May 7, 2017

Add option to disable empty string -> NA in file readers #1450

Closed

This was referenced Jan 8, 2018

read_csv's na_values dict format cannot parse float type #12224

Closed

read_csv issues with dict for na_values #19227

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More consistent na_values handling in read_csv #1657

More consistent na_values handling in read_csv #1657

BrenBarn commented Jul 21, 2012

changhiskhan commented Jul 23, 2012

changhiskhan commented Aug 19, 2012

kayvonr commented Feb 5, 2014

jreback commented Feb 5, 2014

jreback commented Feb 5, 2014

kayvonr commented Feb 5, 2014

kayvonr commented Feb 5, 2014

jreback commented Feb 5, 2014

kayvonr commented Feb 5, 2014

makmanalp commented Mar 24, 2016

makmanalp commented Mar 24, 2016

jreback commented Mar 24, 2016

neilser commented Jan 7, 2018 •

edited

Loading

More consistent na_values handling in read_csv #1657

More consistent na_values handling in read_csv #1657

Comments

BrenBarn commented Jul 21, 2012

changhiskhan commented Jul 23, 2012

changhiskhan commented Aug 19, 2012

kayvonr commented Feb 5, 2014

jreback commented Feb 5, 2014

jreback commented Feb 5, 2014

kayvonr commented Feb 5, 2014

kayvonr commented Feb 5, 2014

jreback commented Feb 5, 2014

kayvonr commented Feb 5, 2014

makmanalp commented Mar 24, 2016

makmanalp commented Mar 24, 2016

jreback commented Mar 24, 2016

neilser commented Jan 7, 2018 • edited Loading

neilser commented Jan 7, 2018 •

edited

Loading