
More consistent na_values handling in read_csv #1657

Closed
BrenBarn opened this issue Jul 21, 2012 · 13 comments
Labels
Bug, IO Data (IO issues that don't fit into a more specific label)
@BrenBarn

The current handling of the na_values argument to read_csv is strangely different depending on what kind of value you pass to na_values. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, then those values will be used for those columns, totally overriding the default NA values, while for columns not in the dict, the default values will be used. If you pass some other kind of iterable, it uses the union of the passed values and the default values as the NA values.

This behavior is confusing because sometimes the passed values override the defaults, but other times they just add to the defaults. It's also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: "If you pass an empty list or an empty list for a particular column, no values (including empty strings) will be considered NA." But passing an empty list doesn't result in no values being considered NA. In fact, passing an empty list does nothing, since the empty list is unioned with the default NA values, so the default NA values are just used anyway.

Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you're not overriding the defaults, you're adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like "na" and "nan" are valid data and should be read as their literal string values.

There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be to have two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default set (currently the module-level _NA_VALUES in parsers.py) and allow users to add to it if they want more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas)).
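[Editor's note: the union behavior described in this report can be reproduced with a short sketch. The CSV content and column names are invented for illustration; the behavior shown matches recent pandas, where a list passed to na_values is still unioned with the defaults unless keep_default_na is disabled.]

```python
import io

import pandas as pd

# A file where the string "NA" is real data (Namibia's country code).
csv = "code,country\nNA,Namibia\nMX,Mexico\n"

# A list passed to na_values is unioned with the built-in defaults,
# so "NA" is still parsed as missing even though we never asked for it.
df = pd.read_csv(io.StringIO(csv), na_values=["missing"])
print(pd.isna(df.loc[0, "code"]))  # True: "NA" became NaN
```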

@ghost ghost assigned changhiskhan Jul 23, 2012
@changhiskhan
Contributor

I agree that it's too confusing/inconsistent as it stands. I added custom NA handling for a previous issue, but I should have revisited the whole NA-specifying API for read_csv. The fix will be part of 0.8.2, coming out in a few weeks.

Thanks for the bug report!

@changhiskhan
Contributor

I added a keep_default_na keyword.
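[Editor's note: a minimal sketch of the keyword's semantics, using invented data. With keep_default_na=False, the built-in NA list is discarded and only the values passed via na_values are treated as missing.]

```python
import io

import pandas as pd

csv = "code,country\nNA,Namibia\nMX,Mexico\n"

# keep_default_na=False discards the built-in NA list, so only the
# values in na_values are treated as missing.
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values=["missing"])
print(df.loc[0, "code"])  # NA -- the literal string survives
```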

@kayvonr

kayvonr commented Feb 5, 2014

Hey there - I believe this issue is still not completely fixed. Reading in a CSV with countries and using both keep_default_na=False and na_values='' still results in "Namibia", country code "NA", being read in as a float NaN:

In [56]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values='')

In [57]: data.ix[160:165]
Out[57]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [61]: data.ix[163, 'geo_country']
Out[61]: nan

In [62]: type(data.ix[163, 'geo_country'])
Out[62]: float

@jreback
Contributor

jreback commented Feb 5, 2014

@kayvonr what version of pandas?

@jreback
Contributor

jreback commented Feb 5, 2014

http://pandas.pydata.org/pandas-docs/dev/io.html#na-values

na_values has to be a list, so try with: na_values=['']

@kayvonr

kayvonr commented Feb 5, 2014

pandas 0.12.0

I originally tried it with a list argument first and that presents two different problems:

In [7]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=[' '])

In [8]: data.ix[160:165]
Out[8]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [9]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=[''])

In [10]: data.ix[160:165]
Out[10]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

An argument with any non-empty string causes the "NA" to be read in as an empty string.

In [13]: data.ix[163, 'geo_country']
Out[13]: ''

Double-checked that it wasn't just the space with:

In [11]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=['foo'])

In [12]: data.ix[160:165]
Out[12]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

and an empty string results in the "NA" string again being turned into a NaN.

@kayvonr

kayvonr commented Feb 5, 2014

The issue is also present with a dictionary argument to na_values:

In [14]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values={"geo_country" : "foo"})

In [15]: data.ix[163, 'geo_country']
Out[15]: ''

In [16]: data.ix[160:165]
Out[16]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

@jreback
Contributor

jreback commented Feb 5, 2014

This was fixed by #4374; it's in 0.13.
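[Editor's note: for reference, the Namibia case behaves as expected in modern pandas. The inline data below stands in for iso_country_continent.csv; only the empty string is treated as NA, and the country code "NA" is kept.]

```python
import io

import pandas as pd

csv = "geo_country,country\nNA,Namibia\n,Unknown\n"

# Only the empty string is an NA marker; "NA" is preserved as data.
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values=[""])
print(df.loc[0, "geo_country"])           # NA (the literal string)
print(pd.isna(df.loc[1, "geo_country"]))  # True: "" became NaN
```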

@kayvonr

kayvonr commented Feb 5, 2014

ah ok, thanks

@makmanalp
Contributor

Kinda funny: I just ran into a similar issue as @kayvonr with Namibia/NA. It just wasn't obvious to me that keep_default_na was a thing, and so it wasn't obvious why setting na_values to some random value or None was not quite working. This is the convenience/magic behavior tradeoff, I guess.

@makmanalp
Contributor

Probably changing the default behavior to be less magical is not an option anymore, but maybe there is a documentation-improvement ticket in here somewhere.

@jreback
Contributor

jreback commented Mar 24, 2016

@makmanalp I mean, the docs are pretty clear. If you want to add an example, we would take that. Of course, people just don't read the docs...
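[Editor's note: one way to get the "defaults plus my own markers" behavior the original report asked for is to start from pandas' own default sentinel set. STR_NA_VALUES lives in a private module (pandas._libs.parsers), so its location may change between versions; this is a sketch, not a supported API.]

```python
import io

import pandas as pd
from pandas._libs.parsers import STR_NA_VALUES  # private: may move between versions

# Start from the default sentinel strings, then add project-specific markers.
custom_na = STR_NA_VALUES | {"missing", "not applicable"}

csv = "a,b\n1,missing\n2,NA\n"
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values=list(custom_na))
print(df["b"].isna().tolist())  # [True, True]
```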

@neilser

neilser commented Jan 7, 2018

I can't get a dictionary of na_values to work properly for me, no matter what I try. Version is 0.22.0.
hack.csv contains:

113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008
729639,"qwer","",asdfkj,466.681,,252.373

Two variants of my code - the one with the list does what I expect, but the dict version doesn't:

df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008,'',"blah"])
df.head()

output:

0 1 2 3 4 5 6
113125 NaN /blaha kjsdkj 412.166 225.874 NaN
729639 qwer NaN asdfkj 466.681 NaN 252.373

but the dict version:

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})
df.head()

is paying attention to the columns I specify, and then simply refusing to create any NaNs in those columns:

0 1 2 3 4 5 6
113125 blah /blaha kjsdkj 412.166 225.874 214.008
729639 qwer   asdfkj 466.681 NaN 252.373

So... I'm stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict.
[Edit: oddly, the dict version does create NaNs in columns I didn't specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values.]
Btw: to be picky, the docs for keep_default_na are a bit misleading, in that they imply there should be no effect unless na_values is supplied (but in fact there is an effect even when na_values isn't supplied).
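[Editor's note: in more recent pandas releases (not 0.22, where this comment was written), the per-column dict form does create NaNs as expected, especially when the dict values are lists. A sketch using data shaped like hack.csv above:]

```python
import io

import pandas as pd

csv = '113125,"blah","/blaha"\n729639,"qwer",""\n'

# With keep_default_na=False, the dict gives each listed column its own
# exhaustive set of NA markers; unlisted columns get no NA handling at all.
df = pd.read_csv(io.StringIO(csv), header=None, keep_default_na=False,
                 na_values={1: ["blah"], 2: [""]})
print(pd.isna(df.loc[0, 1]))  # True: "blah" is NA in column 1
print(pd.isna(df.loc[1, 2]))  # True: "" is NA in column 2
print(df.loc[0, 0])           # 113125: column 0 untouched
```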
