Accept CategoricalDtype in read_csv #17643

Merged
merged 19 commits into pandas-dev:master from TomAugspurger:categorical-csv-2 on Oct 2, 2017

TomAugspurger commented Sep 23, 2017

import pandas as pd
from io import StringIO
from pandas.api.types import CategoricalDtype

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes

This is for after #16015

cc @chris-b1
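
For comparison, here is a minimal sketch of how the explicit dtype differs from the existing string alias dtype='category'. It reuses the data from the snippet above; the variable names inferred and explicit are illustrative only and are not part of the PR.

```python
import pandas as pd
from io import StringIO
from pandas.api.types import CategoricalDtype

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

# With the string alias, categories are inferred from the values that
# actually occur in the column, and the result is unordered.
inferred = pd.read_csv(StringIO(data), dtype={'col1': 'category'})['col1']

# With an explicit CategoricalDtype, the categories and their order come
# from the dtype, including categories that never occur in the data.
dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
explicit = pd.read_csv(StringIO(data), dtype={'col1': dtype})['col1']

print(list(inferred.cat.categories), inferred.cat.ordered)
print(list(explicit.cat.categories), explicit.cat.ordered)
```

The parsed values are the same in both cases; only the category set and the ordered flag differ.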

TomAugspurger commented Sep 23, 2017

I squashed everything from #16015 down to a single commit, so the changes here are just ccbaa04

TomAugspurger added some commits Sep 24, 2017

TomAugspurger commented Sep 25, 2017

This should be ready to go. My earlier implementation was buggy and only worked when the data were already sorted.

Casting is now implemented by

  1. checking if dtype.categories is {numeric,datetime,timedelta} type
  2. calling the appropriate to_* function to cast the values / inferred categories

One question I had is how to control options passed to that function. I've simply hardcoded errors='ignore'. I'm leery about trying to be clever here.
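
The two steps above can be sketched outside the parser roughly as follows. This is an illustration under stated assumptions, not the code from the PR: it only handles the numeric case, the names raw, inferred, cats and result are hypothetical, and pd.to_numeric is left at its default error handling rather than the hardcoded errors='ignore' mentioned above.

```python
import pandas as pd
from io import StringIO
from pandas.api.types import CategoricalDtype, is_numeric_dtype

dtype = CategoricalDtype([1, 2, 3])
raw = pd.Series(["1", "1", "2", "3"])  # values as text, as a CSV parser sees them

# Infer categories from the raw strings first.
inferred = pd.Categorical(raw)

# Step 1: check what kind of categories the requested dtype has.
# Step 2: cast the inferred categories with the matching to_* function.
if is_numeric_dtype(dtype.categories):
    cats = pd.to_numeric(inferred.categories)
else:
    cats = inferred.categories

# Rebuild the categorical on the cast categories, then align it to the
# categories requested by the user.
result = pd.Categorical.from_codes(inferred.codes, categories=cats)
result = result.set_categories(dtype.categories, ordered=dtype.ordered)

# The user-facing equivalent through read_csv.
via_csv = pd.read_csv(StringIO("b\n1\n1\n2\n3"), dtype={"b": dtype})["b"]
print(list(result), list(via_csv))
```

Casting only the (usually small) set of inferred categories, rather than every value, is what keeps this cheap for long columns.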

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Sep 25, 2017

@jorisvandenbossche

What happens (or should happen) when there are values in the CSV column that are not specified in the categories: an error, or coercion to NaN? (I would also mention this in the docs.)

TomAugspurger commented Sep 25, 2017

What happens (or should happen) when there are values in the CSV column that are not specified in the categories?

Ah I forgot about this case. Yes, I think we will insert NaNs then. In my mind this should behave like a .set_categories(dtype.categories) after the fact. I'll add tests and docs for this tomorrow.
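
A small sketch of that behaviour, assuming the semantics described in this comment (values outside dtype.categories become NaN, as with .set_categories applied afterwards). The sample data and the names direct and after are illustrative.

```python
import pandas as pd
from io import StringIO
from pandas.api.types import CategoricalDtype

data = "b\nd\na\nc\nd"  # 'c' is not one of the requested categories
dtype = CategoricalDtype(["a", "b", "d", "e"])

# Pass the dtype directly to the reader.
direct = pd.read_csv(StringIO(data), dtype={"b": dtype})["b"]

# Read with inferred categories, then restrict them after the fact.
after = pd.read_csv(StringIO(data), dtype={"b": "category"})["b"]
after = after.cat.set_categories(dtype.categories)

print(direct.isna().tolist())
print(after.isna().tolist())
```

In both cases the row holding 'c' ends up missing, while the unused categories 'b' and 'e' are kept on the dtype.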

codecov bot commented Sep 26, 2017

Codecov Report

❗️ No coverage uploaded for pull request base (master@7e87385).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #17643   +/-   ##
=========================================
  Coverage          ?   91.24%           
=========================================
  Files             ?      163           
  Lines             ?    49819           
  Branches          ?        0           
=========================================
  Hits              ?    45456           
  Misses            ?     4363           
  Partials          ?        0

Flag         Coverage Δ
#multiple    89.04% <100%> (?)
#single      40.31% <14.28%> (?)

Impacted Files          Coverage Δ
pandas/io/parsers.py    95.51% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e87385...6f175a7.

codecov bot commented Sep 26, 2017

Codecov Report

Merging #17643 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17643      +/-   ##
==========================================
- Coverage   91.27%   91.23%   -0.04%     
==========================================
  Files         163      163              
  Lines       49765    49848      +83     
==========================================
+ Hits        45421    45480      +59     
- Misses       4344     4368      +24

Flag         Coverage Δ
#multiple    89.03% <100%> (-0.02%) ⬇️
#single      40.32% <7.4%> (-0.09%) ⬇️

Impacted Files                     Coverage Δ
pandas/core/categorical.py         95.73% <100%> (+0.02%) ⬆️
pandas/io/parsers.py               95.49% <100%> (ø) ⬆️
pandas/io/gbq.py                   25% <0%> (-58.34%) ⬇️
pandas/core/tools/datetimes.py     82.97% <0%> (-0.83%) ⬇️
pandas/core/common.py              91.42% <0%> (-0.56%) ⬇️
pandas/core/indexes/multi.py       96.39% <0%> (-0.51%) ⬇️
pandas/core/config.py              87.7% <0%> (-0.39%) ⬇️
pandas/core/indexes/category.py    97.46% <0%> (-0.29%) ⬇️
pandas/core/groupby.py             92.04% <0%> (-0.2%) ⬇️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update db1206a...9325a93.

@jorisvandenbossche

some minor doc comments

doc/source/io.rst
@@ -129,8 +129,37 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)
One place that deserves special mention is in :meth:`read_csv`. Previously, with

jreback Sep 26, 2017

maybe a separate sub-section for this

pandas/_libs/parsers.pyx
result[name] = union_categoricals(arrs, sort_categories=True)
dtype = dtypes.pop()
if is_categorical_dtype(dtype):
sort_categories = isinstance(dtype, str)

jreback Sep 26, 2017

str -> string_types
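
The point of this suggestion is Python 2 compatibility: there, a dtype given as a unicode string would fail an isinstance(dtype, str) check. The sketch below defines a stand-in string_types tuple and a hypothetical helper should_sort_categories to show the intent of the quoted line; neither is the pandas implementation.

```python
import sys

# Stand-in for a compat alias such as string_types (an assumption for this
# sketch): every "text" type on the running interpreter.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (only reached on Python 2)


def should_sort_categories(dtype):
    """Hypothetical helper mirroring the quoted line.

    A dtype passed as a string (e.g. 'category') carries no categories of
    its own, so the inferred categories are sorted. Any other dtype object
    is assumed to define its own category order, which is left untouched.
    """
    return isinstance(dtype, string_types)


print(should_sort_categories("category"))
print(should_sort_categories(object()))
```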

pandas/_libs/parsers.pyx
dtype = CategoricalDtype(cats, ordered=False)
codes = inferred_codes
return cls(codes, dtype=dtype, fastpath=True)

jreback Sep 28, 2017

much nicer

TomAugspurger commented Sep 28, 2017

Hmm, seems like the compiler error is back on circle CI. Looking into it.

@jreback

minor comment lgtm otherwise

TomAugspurger commented Oct 2, 2017

All green. Merging.

I opened up #17743 for optimizing _categorical_convert in the C parser. I won't have time to get to it for the release though.

@TomAugspurger TomAugspurger merged commit def3bce into pandas-dev:master Oct 2, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

jorisvandenbossche commented Oct 2, 2017

Thanks!

@TomAugspurger TomAugspurger deleted the TomAugspurger:categorical-csv-2 branch Oct 2, 2017

jreback commented Oct 2, 2017

thanks @TomAugspurger this is great!

kchomski-reef added a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017

ENH: Accept CategoricalDtype in read_csv (#17643)
* ENH: Accept CategoricalDtype in CSV reader

* rework

* Fixed basic implementation

* Added casting

* Doc and cleanup

* Fixed assignment of categoricals

* Doc and test unexpected values

* DOC: fixups

* More coercion, use _recode_for_categories

* Refactor with maybe_convert_for_categorical

* PEP8

* Type for 32bit

* REF: refactor to new method

* py2 compat

* Refactored

* More in Categorical

* fixup! More in Categorical

alanbato added a commit to alanbato/pandas that referenced this pull request Nov 10, 2017

ENH: Accept CategoricalDtype in read_csv (#17643)

No-Stream added a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

ENH: Accept CategoricalDtype in read_csv (#17643)