
BUG: read_csv converts NaN to 1.0 in certain circumstances #42808

Closed
2 of 3 tasks
tristanlmiller opened this issue Jul 29, 2021 · 6 comments · Fixed by #44901
Comments

@tristanlmiller

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.DataFrame([np.nan, True, False])
a.to_csv('test.csv', index=False)
b = pd.read_csv('test.csv', dtype='float')
print(b)

Output:

     0
0  1.0
1  1.0
2  0.0

Problem description

When reading a CSV file containing NaNs mixed with booleans, and when told to convert to float, read_csv converts NaN to 1.0.

Expected Output

I'd expect NaN to be read as NaN. It should behave the same as when I do this:

print(a.astype('float'))

Output:

     0
0  NaN
1  1.0
2  0.0
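One workaround consistent with this expectation (my own sketch, not from the report) is to read the column without a dtype and cast afterwards; the inferred object column converts correctly. The snippet mirrors the file round-trip above using an in-memory CSV:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Same shape of data as the round-tripped file: a quoted empty
# field (parsed as NaN) followed by two booleans.
data = '''0
""
True
False
'''

# Without dtype, the parser infers an object column [nan, True, False];
# casting that column afterwards preserves the NaN.
b = pd.read_csv(StringIO(data), skip_blank_lines=False).astype('float')
print(b)
```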

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.232-123.381.amzn1.x86_64
Version : #1 SMP Wed May 19 18:00:09 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.04.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : 2021.04.0
scipy : 1.6.3
sqlalchemy : 1.4.11
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@tristanlmiller tristanlmiller added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 29, 2021
@dicristina
Contributor

The problem is related to the C parser and it is present on master (01076c6).

from io import StringIO
import pandas as pd

data = '''0
""
True
False
'''
c = pd.read_csv(StringIO(data), dtype='float', engine='c', skip_blank_lines=False)
p = pd.read_csv(StringIO(data), dtype='float', engine='python', skip_blank_lines=False)

The expected result can be obtained from the Python parser:

In [2]: c
Out[2]:
     0
0  1.0
1  1.0
2  0.0

In [3]: p
Out[3]:
     0
0  NaN
1  1.0
2  0.0

@rhshadrach
Member

Thanks @tristanlmiller and @dicristina! Further investigations and a PR to fix are most welcome!

@rhshadrach rhshadrach added IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2021
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Aug 3, 2021
@dicristina
Contributor

The C parser has the same behavior when dtype is int. This should raise ValueError:

from io import StringIO
import pandas as pd

data = '''0
""
True
False
'''
c = pd.read_csv(StringIO(data), dtype='int', engine='c', skip_blank_lines=False)

Instead we get:

In [14]: c
Out[14]:
     0
0  1.0
1  1.0
2  0.0
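For reference (a sketch of mine, not from the thread), the ValueError that dtype='int' ought to surface here is exactly what astype raises when a column containing missing values is cast to int:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 0.0])
try:
    s.astype('int')
    outcome = 'no error'
except ValueError:
    # Newer pandas raises IntCastingNaNError, a ValueError subclass:
    # "Cannot convert non-finite values (NA or inf) to integer".
    outcome = 'ValueError'
print(outcome)
```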

@joelgibson
Contributor

Thanks for pointing this out @dicristina, I've added it to my fix linked in the PR above. This problem comes up in an older issue #34120.

The issue is caused by some funky handling of bools-with-nans in the C parser. I've added tests for converting to bool, int, and float. Can you think of any other dtypes that could behave poorly here?

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Aug 10, 2021
@dicristina
Contributor

@joelgibson I could not find any more examples where the NA values were misrepresented in the result, but I found that for dtype="Int64" we get a ValueError regardless of the parser.

Now I have some doubts as to whether this is something that should work at all. At first glance the answer seems to be affirmative because we think of boolean values as a subset of the integers so it is reasonable to be able to do this kind of upcasting. But is this something that should be triggered by our choice of the dtype parameter of pd.read_csv? How about we just raise? The user has the option of using the boolean dtype when reading the data and then upcasting if needed.
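A sketch of that suggestion (my code, assuming the nullable "boolean" and "Float64" dtypes available since pandas 1.0 and 1.2 respectively): read with dtype='boolean' so the empty field becomes <NA>, then upcast explicitly:

```python
from io import StringIO

import pandas as pd

data = '''0
""
True
False
'''

# The empty field parses to <NA> under the nullable boolean dtype.
b = pd.read_csv(StringIO(data), dtype='boolean', skip_blank_lines=False)

# Upcasting to the nullable float dtype keeps the missing value intact.
f = b.astype('Float64')
print(f)
```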

@joelgibson
Contributor

joelgibson commented Aug 11, 2021

@dicristina I don't know what the right answer is - this change brings the C parser in line with what the Python parser is already doing. As to whether the actual behaviour is sensible, I find it hard to tell because I have no real mental model of the casting rules when reading from a CSV. For example I find the following behaviour bizarre:

import pandas as pd
from io import StringIO

csv_data = """0
NaN
True
False
"""

df1 = pd.read_csv(StringIO(csv_data))
print(repr(df1["0"].values))

df2 = pd.read_csv(StringIO(csv_data), dtype=object)
print(repr(df2["0"].values))

Which prints out

array([nan, True, False], dtype=object)
array([nan, 'True', 'False'], dtype=object)

I would expect that explicitly passing the same type which is eventually inferred would always get the same result, but it doesn't.

When you say "is this something that should be triggered by our choice of the dtype parameter of pd.read_csv?" I don't know the answer, since I can't find the actual logic for how the dtype parameter is supposed to work written down anywhere (and examples like the above show it might be complicated, or perhaps simple with some exceptions).

Now that true nullable types exist in pandas, the CSV parsing code (especially the dtype conversions) could potentially be simplified a lot, but it's hard to do without some mental model of how the conversions are meant to work.
