read_csv ignores dtype for bool columns with missing values #20591

Closed
finnhacks42 opened this issue Apr 3, 2018 · 2 comments

@finnhacks42

commented Apr 3, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
from io import StringIO
data = "false,1\n,1\ntrue,"
df = pd.read_csv(StringIO(data), header=None, names=['a','b'], dtype={'a': 'bool'}, engine='c')
print(df['a'].dtype)
print(df)
outputs:

object
       a    b
0  False  1.0
1    NaN  1.0
2   True  NaN

import pandas as pd
from io import StringIO
data = "false,1\n,1\ntrue,"
df = pd.read_csv(StringIO(data), header=None, names=['a','b'], dtype={'a': 'bool'}, engine='python')
print(df['a'].dtype)
print(df)
outputs:

bool
       a    b
0  False  1.0
1   True  1.0
2   True  NaN

Problem description

Missing values cannot be coerced into a column of dtype bool. The expected behaviour, to be consistent with the analogous case for integers, is to throw a ValueError. The user is then aware of the issue and can specify a converter or parse the boolean as a float (subject to #16698).

Instead, the C engine silently ignores the requested dtype and returns a column of dtype object. This behaviour is especially confusing on large datasets with low_memory=True, where you get the warning: "Column 1 has mixed types. Specify dtype option on import or set low_memory=False."

The Python engine silently converts all missing values to True and returns a column of dtype bool.
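
As a rough illustration of the "specify a converter" route mentioned above, here is a minimal sketch of a strict boolean converter that raises instead of guessing. The name strict_bool and the token sets are assumptions made for this example, not part of pandas. The standard-library csv module stands in for read_csv so the snippet runs on its own.

```python
import csv
from io import StringIO

# Tokens treated as booleans in this sketch (an assumption, not pandas' list).
TRUE_TOKENS = {"true", "True", "TRUE"}
FALSE_TOKENS = {"false", "False", "FALSE"}


def strict_bool(token):
    """Convert one CSV field to bool, raising instead of guessing."""
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    if token == "":
        # Mirrors the error message proposed in this report.
        raise ValueError("Boolean column has NA values")
    raise ValueError(f"cannot interpret {token!r} as bool")


# Same data as in the report: the second row has an empty first field.
data = "false,1\n,1\ntrue,"
rows = list(csv.reader(StringIO(data)))

try:
    column_a = [strict_bool(row[0]) for row in rows]
except ValueError as exc:
    print(f"rejected: {exc}")
```

With pandas, a function like this could presumably be passed as converters={'a': strict_bool} in place of dtype={'a': 'bool'}. Whether the converter receives an empty string for a missing field is an assumption worth checking against the installed version.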

Expected Output

ValueError('Boolean column has NA values in column 1')

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+725.gf67c6fa80
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.4
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@chris-b1

Contributor

commented Apr 4, 2018

Yes, I agree this should match the integer behavior. Thanks for the report; PR welcome!

df = pd.read_csv(StringIO(data), header=None, names=['a','b'], dtype={'a': 'bool', 'b': 'int64'}, engine='c')

ValueError                                Traceback (most recent call last)
<snip>
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
ValueError: Integer column has NA values in column 1

@chris-b1 added this to the Next Major Release milestone on Apr 4, 2018

@atulagrwl


commented Aug 21, 2018

Here is the current behavior:

Bool Value   Engine = C   Engine = Python
NA           NaN          True
0            False        False
1            True         True
-1           error        True
100          error        True
abc          error        True

What is the expected behavior for the above cases? @chris-b1
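
One possible reading, offered only as a sketch and not as a maintainer decision: apply a single strict rule to both engines, where recognised tokens map to a bool and everything else, including NA, is an error. That would match the C engine column above except for NA, which becomes an error as the original report proposes. STRICT_MAP and strict_outcome are hypothetical names used for illustration.

```python
# Hypothetical strict rule: only these tokens are accepted as booleans.
STRICT_MAP = {"0": False, "1": True, "false": False, "true": True}


def strict_outcome(token):
    """Return the bool for a recognised token, or the string 'error'."""
    return STRICT_MAP.get(token.lower(), "error")


# The six inputs from the table above.
cases = ["NA", "0", "1", "-1", "100", "abc"]
for token in cases:
    print(f"{token:>4} -> {strict_outcome(token)}")
```

Under this rule the two engines would no longer disagree on any of the six inputs. A real implementation would raise ValueError rather than return a marker string.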
