
BUG: read_csv converts NaN to 1.0 in certain circumstances #42808

Closed
2 of 3 tasks
tristanlmiller opened this issue Jul 29, 2021 · 6 comments · Fixed by #44901
Comments

@tristanlmiller

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.DataFrame([np.nan, True, False])
a.to_csv('test.csv', index=False)
b = pd.read_csv('test.csv', dtype='float')
print(b)

Output:

     0
0  1.0
1  1.0
2  0.0

Problem description

When reading a CSV file containing NaNs mixed with booleans, and when told to convert to float, read_csv converts NaN to 1.0.

Expected Output

I'd expect NaN to be read as NaN. It should behave the same as when I do this:

print(a.astype('float'))

Output:

     0
0  NaN
1  1.0
2  0.0
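One workaround consistent with this expectation (my own sketch, not from the report) is to read the column without a dtype and cast afterwards; the inferred object column converts correctly. The snippet mirrors the file round-trip above using an in-memory CSV:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Same shape of data as the round-tripped file: a quoted empty
# field (parsed as NaN) followed by two booleans.
data = '''0
""
True
False
'''

# Without dtype, the parser infers an object column [nan, True, False];
# casting that column afterwards preserves the NaN.
b = pd.read_csv(StringIO(data), skip_blank_lines=False).astype('float')
print(b)
```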

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.232-123.381.amzn1.x86_64
Version : #1 SMP Wed May 19 18:00:09 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.04.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : 2021.04.0
scipy : 1.6.3
sqlalchemy : 1.4.11
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@tristanlmiller tristanlmiller added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 29, 2021
@dicristina
Contributor

The problem is related to the C parser and it is present on master (01076c6).

from io import StringIO
import pandas as pd

data = '''0
""
True
False
'''
c = pd.read_csv(StringIO(data), dtype='float', engine='c', skip_blank_lines=False)
p = pd.read_csv(StringIO(data), dtype='float', engine='python', skip_blank_lines=False)

The expected result can be obtained from the Python parser:

In [2]: c
Out[2]:
     0
0  1.0
1  1.0
2  0.0

In [3]: p
Out[3]:
     0
0  NaN
1  1.0
2  0.0

@rhshadrach
Member

Thanks @tristanlmiller and @dicristina! Further investigations and a PR to fix are most welcome!

@rhshadrach rhshadrach added IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2021
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Aug 3, 2021
@dicristina
Contributor

The C parser has the same behavior when dtype is int. This should raise ValueError:

from io import StringIO
import pandas as pd

data = '''0
""
True
False
'''
c = pd.read_csv(StringIO(data), dtype='int', engine='c', skip_blank_lines=False)

Instead we get:

In [14]: c
Out[14]:
     0
0  1.0
1  1.0
2  0.0
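For reference (a sketch of mine, not from the thread), the ValueError that dtype='int' ought to surface here is exactly what astype raises when a column containing missing values is cast to int:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 0.0])
try:
    s.astype('int')
    outcome = 'no error'
except ValueError:
    # Newer pandas raises IntCastingNaNError, a ValueError subclass:
    # "Cannot convert non-finite values (NA or inf) to integer".
    outcome = 'ValueError'
print(outcome)
```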

@joelgibson
Contributor

Thanks for pointing this out @dicristina, I've added it to my fix linked in the PR above. This problem comes up in an older issue #34120.

The issue is caused by some funky handling of bools-with-nans in the C parser. I've added tests for converting to bool, int, and float. Can you think of any other dtypes that could behave poorly here?

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Aug 10, 2021
@dicristina
Contributor

@joelgibson I could not find any more examples where the NA values were misrepresented in the result, but I found that for dtype="Int64" we get a ValueError regardless of the parser.

Now I have some doubts as to whether this is something that should work at all. At first glance the answer seems to be affirmative because we think of boolean values as a subset of the integers so it is reasonable to be able to do this kind of upcasting. But is this something that should be triggered by our choice of the dtype parameter of pd.read_csv? How about we just raise? The user has the option of using the boolean dtype when reading the data and then upcasting if needed.
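A sketch of that suggestion (my code, assuming the nullable "boolean" and "Float64" dtypes available since pandas 1.0 and 1.2 respectively): read with dtype='boolean' so the empty field becomes <NA>, then upcast explicitly:

```python
from io import StringIO

import pandas as pd

data = '''0
""
True
False
'''

# The empty field parses to <NA> under the nullable boolean dtype.
b = pd.read_csv(StringIO(data), dtype='boolean', skip_blank_lines=False)

# Upcasting to the nullable float dtype keeps the missing value intact.
f = b.astype('Float64')
print(f)
```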

@joelgibson
Contributor

joelgibson commented Aug 11, 2021

@dicristina I don't know what the right answer is - this change brings the C parser in line with what the Python parser is already doing. As to whether the actual behaviour is sensible, I find it hard to tell because I have no real mental model of the casting rules when reading from a CSV. For example I find the following behaviour bizarre:

import pandas as pd
from io import StringIO

csv_data = """0
NaN
True
False
"""

df1 = pd.read_csv(StringIO(csv_data))
print(repr(df1["0"].values))

df2 = pd.read_csv(StringIO(csv_data), dtype=object)
print(repr(df2["0"].values))

Which prints out

array([nan, True, False], dtype=object)
array([nan, 'True', 'False'], dtype=object)

I would expect that explicitly passing the same type which is eventually inferred would always get the same result, but it doesn't.

When you say "is this something that should be triggered by our choice of the dtype parameter of pd.read_csv?" I don't know the answer, since I can't find the actual logic for how the dtype parameter is supposed to work written down anywhere (and examples like the above show it might be complicated, or perhaps simple with some exceptions).

Now that true nullable types exist in pandas, the CSV parsing code (especially the dtype conversions) could potentially be simplified a lot, but it's hard to do without some mental model of how the conversions are meant to work.
