Force boolean column to category while reading a csv #20498

svgsponer · 2018-03-27T11:29:31Z

Problem description

Reading a CSV file with dtypes specified in a dictionary produces NaNs for boolean columns read as category. The output of df.dtypes appears to have changed between versions 0.20 and 0.21, so the code below now produces NaNs for column 'b'.

This might not really be a bug, but it seems quite unintuitive to me: I would expect the output of df.dtypes to be usable for reading the file back correctly, and I would expect interpreting booleans as a category to be valid.

Code to reproduce the issue

import pandas as pd

b = {'a': [5, 4, 3, 2], 'b': [True, False, True, True], 'c': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(b)

df['b'] = df['b'].astype('category')
df['c'] = df['c'].astype('category')
dtypes_dict = df.dtypes.to_dict()
df.to_csv("data.csv", index=False)

df2 = pd.read_csv("data.csv", dtype=dtypes_dict)
print(df2)

Produced Output

   a    b  c
0  5  NaN  A
1  4  NaN  B
2  3  NaN  A
3  2  NaN  C

Expected Output

   a      b  c
0  5   True  A
1  4  False  B
2  3   True  A
3  2   True  C

Output of df.dtypes version 0.22

{'a': dtype('int64'), 'b': CategoricalDtype(categories=[False, True], ordered=False), 'c': CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)}

Output of df.dtypes version 0.20

{'a': dtype('int64'), 'b': category, 'c': category}

Version

Pandas version 0.22

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.6-040806-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: en_IE.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.2
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger · 2018-03-27T11:35:30Z

@svgsponer should the line df['c'] = df['c'].astype('category') be df['c'] = df['b'].astype('category')?
(changed a c to b)

Hmm perhaps not, because that won't get your expected output. Regardless, something seems to be going strange here.

TomAugspurger · 2018-03-27T11:39:51Z

Ahh, note the difference:

In [15]: pd.read_csv(StringIO("A\nTrue\nFalse\nTrue"), dtype={"A": pd.api.types.CategoricalDtype(['True', 'False'])})
Out[15]:
       A
0   True
1  False
2   True

In [16]: pd.read_csv(StringIO("A\nTrue\nFalse\nTrue"), dtype={"A": pd.api.types.CategoricalDtype([True, False])})
Out[16]:
     A
0  NaN
1  NaN
2  NaN

We do some checking for numeric dtypes, I'm not sure about booleans.
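Until the parser handles this case, one possible workaround is to let read_csv parse the column as plain booleans and convert to the categorical dtype afterwards. This is only a sketch: the column name and CSV text are taken from the session above, and the two-step approach is a suggestion, not something proposed in this thread.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

csv_text = "A\nTrue\nFalse\nTrue"

# Step 1: parse the column as ordinary booleans.
df = pd.read_csv(StringIO(csv_text), dtype={"A": bool})

# Step 2: cast to the boolean-valued categorical dtype after parsing,
# so the categories are compared against real booleans, not strings.
df["A"] = df["A"].astype(CategoricalDtype([True, False]))

print(df["A"].tolist())
print(df["A"].dtype)
```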

TomAugspurger · 2018-03-27T11:46:22Z

The issue is likely in

pandas/pandas/core/arrays/categorical.py

Lines 490 to 508 in 687cbe2

    
    def _from_inferred_categories(cls, inferred_categories, inferred_codes,
                                  dtype):
        """Construct a Categorical from inferred values

        For inferred categories (`dtype` is None) the categories are sorted.
        For explicit `dtype`, the `inferred_categories` are cast to the
        appropriate type.

        Parameters
        ----------
        inferred_categories : Index
        inferred_codes : Index
        dtype : CategoricalDtype or 'category'

        Returns
        -------
        Categorical
        """

Everything read by the CSV parser is a string. We do some checking for whether those strings should be converted to a specialized type (numeric, datetime-like). Pandas considers Index([True, False]) to be object dtype, which is also used for strings, so it's skipped.
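The mismatch can be reproduced without the CSV parser. The sketch below assumes, as described above, that the parser hands over the strings 'True' and 'False'. It builds a Categorical from those strings against boolean categories; because the string 'True' is not equal to the boolean True, nothing matches and every entry comes out missing.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# What the parser sees: plain strings, not booleans.
parsed = ["True", "False", "True"]

# Boolean categories, as in the dtype taken from df.dtypes.
bool_dtype = CategoricalDtype([True, False])
mismatched = pd.Categorical(parsed, dtype=bool_dtype)

# String categories do match the parsed strings.
str_dtype = CategoricalDtype(["True", "False"])
matched = pd.Categorical(parsed, dtype=str_dtype)

print(mismatched.isna().all())  # every value fell outside the categories
print(matched.isna().any())     # no value is missing here
```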

In

pandas/pandas/core/arrays/categorical.py

Line 516 in 687cbe2

if known_categories:

we'll add an if dtype.categories.is_boolean() check.

I'll get to this next week unless someone beats me to it.
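A rough sketch of what such a check could do is shown here. It is not the pandas implementation: categorical_from_parsed_strings is a hypothetical standalone helper, and the 'True'/'False' lookup table is an assumption about which spellings to accept. When the requested categories are all booleans, the inferred string categories are mapped to booleans before the Categorical is built.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def categorical_from_parsed_strings(inferred_categories, inferred_codes, dtype):
    """Hypothetical helper, not part of pandas.

    inferred_categories: unique strings found by the parser
    inferred_codes: integer position of each row's value in inferred_categories
    dtype: CategoricalDtype with known categories
    """
    cats = list(inferred_categories)
    wants_bool = all(isinstance(c, (bool, np.bool_)) for c in dtype.categories)
    if wants_bool:
        # Assumed spellings; anything else is left as a string and
        # will therefore end up missing in the result.
        lookup = {"True": True, "False": False}
        cats = [lookup.get(c, c) for c in cats]
    # Expand codes back to values; a code of -1 marks a missing value.
    values = [cats[i] if i >= 0 else None for i in inferred_codes]
    return pd.Categorical(values, dtype=dtype)


result = categorical_from_parsed_strings(
    ["True", "False"], [0, 1, 0], CategoricalDtype([True, False])
)
print(list(result))
```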

TomAugspurger · 2018-03-27T11:49:16Z

Actually, I might just claim this issue to use for a talk I'm giving next week on how to contribute to pandas.

svgsponer · 2018-03-27T12:28:32Z

@TomAugspurger Sorry, forgot to add the 'c' column here after I edited my test code. It is updated now. df['c'] = df['c'].astype('category') should make more sense now and the code produces the provided example output.

TomAugspurger · 2018-04-25T18:12:11Z

Sorry, dropped the ball on this. Will try to resurrect my PR.

Closes pandas-dev#20498

@TomAugspurger

Previously, was being parsed as object instead of boolean. Closes pandas-devgh-20498. Original Author: @TomAugspurger Rebased by @gfyoung due to merge conflicts.

TomAugspurger added the IO Data, IO CSV, Regression, and Categorical labels Mar 27, 2018

jreback added this to the Next Major Release milestone Mar 30, 2018

jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018

TomAugspurger mentioned this issue Apr 25, 2018

Fixed read_csv with CategoricalDtype with boolean categories (20498) #20826

Merged

4 tasks

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 25, 2018

BUG: Fixed read_csv with boolean CategoricalDtype

5495551

Closes pandas-dev#20498

jreback modified the milestones: 0.23.0, 0.23.1 Apr 27, 2018

jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018

jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018

jreback modified the milestones: 0.23.3, 0.24.0 Jul 5, 2018

TomAugspurger closed this as completed in #20826 Nov 27, 2018

TomAugspurger added a commit that referenced this issue Nov 27, 2018

BUG: Properly handle CSV boolean CategoricalDtype (#20826)

d53a4cc

Previously, was being parsed as object instead of boolean. Closes gh-20498. Original Author: @TomAugspurger Rebased by @gfyoung due to merge conflicts.

teto mentioned this issue Feb 26, 2019

Serialize/deserialize a Categorical whose values are taken from an enum #25448

Open
