Force boolean column to category while reading a csv #20498

Closed
svgsponer opened this issue Mar 27, 2018 · 6 comments · Fixed by #20826
Labels
Categorical (Categorical Data Type), IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), Regression (functionality that used to work in a prior pandas version)

@svgsponer

svgsponer commented Mar 27, 2018

Problem description

Reading a CSV file with dtypes specified in a dictionary produces NaNs for boolean columns read as category. The output of df.dtypes appears to have changed between versions 0.20 and 0.21, so the code below now produces NaNs for the second column.

This might not really be a bug, but it seems quite unintuitive to me: I would expect the output of df.dtypes to be usable for reading the file back correctly, and I would expect interpreting booleans as a category to be valid.

Code to reproduce the issue

import pandas as pd

b = {'a': [5, 4, 3, 2], 'b': [True, False, True, True], 'c': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(b)

df['b'] = df['b'].astype('category')
df['c'] = df['c'].astype('category')
dtypes_dict = df.dtypes.to_dict()
df.to_csv("data.csv", index=False)

df2 = pd.read_csv("data.csv", dtype=dtypes_dict)
print(df2)

Produced Output

   a    b  c
0  5  NaN  A
1  4  NaN  B
2  3  NaN  A
3  2  NaN  C

Expected Output

   a      b  c
0  5   True  A
1  4  False  B
2  3   True  A
3  2   True  C

Output of df.dtypes.to_dict() in version 0.22

{'a': dtype('int64'), 'b': CategoricalDtype(categories=[False, True], ordered=False), 'c': CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)}

Output of df.dtypes.to_dict() in version 0.20

{'a': dtype('int64'), 'b': category, 'c': category}
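One possible workaround, sketched here as an assumption rather than a confirmed fix: let read_csv infer the column types first (the boolean column then comes back as bool) and apply the categorical dtypes afterwards with astype. The CSV text and the `wanted` dictionary below are illustrative stand-ins for data.csv and dtypes_dict.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Stand-in for the contents of data.csv from the report.
csv_text = "a,b,c\n5,True,A\n4,False,B\n3,True,A\n2,True,C\n"

# Same shape as the dictionary returned by df.dtypes.to_dict().
wanted = {
    "a": "int64",
    "b": CategoricalDtype(categories=[False, True]),
    "c": CategoricalDtype(categories=["A", "B", "C"]),
}

# Parse with default type inference, then cast each column afterwards.
# The cast happens on real booleans, not on the parser's raw strings.
df2 = pd.read_csv(StringIO(csv_text)).astype(wanted)
print(df2)
print(df2.dtypes)
```

This costs an extra pass over the data, so it trades the memory savings of parsing straight into categories for correctness.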

Version

Pandas version 0.22

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.6-040806-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: en_IE.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.2
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

TomAugspurger commented Mar 27, 2018

@svgsponer should the line df['c'] = df['c'].astype('category') be df['c'] = df['b'].astype('category')?
(changed a c to b)

Hmm, perhaps not, because that won't get your expected output. Regardless, something strange seems to be going on here.

@TomAugspurger added the IO Data, IO CSV, Regression, and Categorical labels Mar 27, 2018
@TomAugspurger
Contributor

Ahh, note the difference:

In [15]: pd.read_csv(StringIO("A\nTrue\nFalse\nTrue"), dtype={"A": pd.api.types.CategoricalDtype(['True', 'False'])})
Out[15]:
       A
0   True
1  False
2   True

In [16]: pd.read_csv(StringIO("A\nTrue\nFalse\nTrue"), dtype={"A": pd.api.types.CategoricalDtype([True, False])})
Out[16]:
     A
0  NaN
1  NaN
2  NaN

We do some checking for numeric dtypes, I'm not sure about booleans.

@TomAugspurger
Contributor

The issue is likely in

def _from_inferred_categories(cls, inferred_categories, inferred_codes,
                              dtype):
    """Construct a Categorical from inferred values

    For inferred categories (`dtype` is None) the categories are sorted.
    For explicit `dtype`, the `inferred_categories` are cast to the
    appropriate type.

    Parameters
    ----------
    inferred_categories : Index
    inferred_codes : Index
    dtype : CategoricalDtype or 'category'

    Returns
    -------
    Categorical
    """

Everything read by the CSV parser is a string. We do some checking for whether those strings should be converted to a specialized type (numeric, datetime-like). Pandas considers Index([True, False]) to be object dtype, which is also used for strings, so it's skipped.
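To illustrate why the string values end up as NaN, here is a small sketch that leaves read_csv out entirely and builds a Categorical directly from strings. It assumes only the public pd.Categorical constructor; the variable names are made up for the example.

```python
import pandas as pd

# What the CSV parser hands over: plain strings, not booleans.
parsed = ["True", "False", "True"]

# String categories match the parsed strings, so every value is kept.
as_strings = pd.Categorical(parsed, categories=["True", "False"])

# Boolean categories never equal the strings "True"/"False",
# so every value falls outside the categories and becomes NaN.
as_bools = pd.Categorical(parsed, categories=[True, False])

print(as_strings.isna().tolist())
print(as_bools.isna().tolist())
```

Any value that is not among the given categories is treated as missing, which is the same symptom seen in the read_csv output above.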

In

if known_categories:
we'll add an if dtype.categories.is_boolean() check.

I'll get to this next week unless someone beats me to it.
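A rough sketch of the idea, not the actual pandas patch: translate the parsed string categories to booleans, then remap the codes onto the known boolean categories. The helper name recode_bool_categories and the "True"/"False" lookup table are assumptions made for this example; the real parser may accept more spellings.

```python
import numpy as np
import pandas as pd


def recode_bool_categories(inferred_categories, inferred_codes, known_categories):
    """Hypothetical helper: map parsed string categories onto boolean ones."""
    # The parser yields strings, so translate them to booleans first.
    # Strings outside the lookup table become None (treated as missing).
    lookup = {"True": True, "False": False}
    converted = [lookup.get(value) for value in inferred_categories]

    # Position of each converted category in the known categories, -1 if absent.
    known = list(known_categories)
    positions = np.array(
        [known.index(v) if v in known else -1 for v in converted],
        dtype="int64",
    )

    # Code -1 already marks a missing value and must stay -1.
    codes = np.asarray(inferred_codes, dtype="int64")
    new_codes = np.where(codes == -1, -1, positions[codes])
    return pd.Categorical.from_codes(new_codes, categories=known)


# Parsed categories ["True", "False"] with codes pointing into them.
result = recode_bool_categories(["True", "False"], [0, 1, 0], [False, True])
print(result)
```

Remapping the codes rather than the values keeps the work proportional to the number of distinct categories, which is presumably why the parser works with inferred categories and codes in the first place.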

@TomAugspurger
Contributor

TomAugspurger commented Mar 27, 2018

Actually, I might just claim this issue to use for a talk I'm giving next week on how to contribute to pandas.

@svgsponer
Author

svgsponer commented Mar 27, 2018

@TomAugspurger Sorry, forgot to add the 'c' column here after I edited my test code. It is updated now. df['c'] = df['c'].astype('category') should make more sense now and the code produces the provided example output.

@jreback jreback added this to the Next Major Release milestone Mar 30, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018
@TomAugspurger
Contributor

Sorry, dropped the ball on this. Will try to resurrect my PR.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 25, 2018
@jreback jreback modified the milestones: 0.23.0, 0.23.1 Apr 27, 2018
@jreback jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018
@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018
@jreback jreback modified the milestones: 0.23.3, 0.24.0 Jul 5, 2018
gfyoung added a commit to TomAugspurger/pandas that referenced this issue Nov 23, 2018
Previously, was being parsed as object instead of boolean.

Closes pandas-devgh-20498.

Original Author: @TomAugspurger
Rebased by @gfyoung due to merge conflicts.
TomAugspurger added a commit that referenced this issue Nov 27, 2018
Previously, was being parsed as object instead of boolean.

Closes gh-20498.

Original Author: @TomAugspurger
Rebased by @gfyoung due to merge conflicts.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Previously, was being parsed as object instead of boolean.

Closes pandas-devgh-20498.

Original Author: @TomAugspurger
Rebased by @gfyoung due to merge conflicts.