New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: Index.append with mixed object/Categorical indices #14545

Merged
merged 3 commits into from Nov 3, 2016

Conversation

Projects
None yet
4 participants
@jorisvandenbossche
Member

jorisvandenbossche commented Oct 31, 2016

Closes #14298, related to #13660 and #13767

This closes the info() issue with CategoricalIndex, but the underlying issue is actually related to Index.append, when appending mixed dtype index objects:

In [11]: pd.Index(['a']).append(pd.CategoricalIndex(['a', 'b']))
...
AttributeError: 'Index' object has no attribute '_is_dtype_compat'

This error occurs when the calling index is not a Categorical (because of https://github.com/pandas-dev/pandas/blob/v0.19.0/pandas/indexes/base.py#L1439 the Categorical _is_dtype_compat method is used anyway).

But the question is, what should the result be of the above expression?

  • a CategoricalIndex ? -> IMO not, as the calling index is not a CategoricalIndex, the result should also not be one (regardless of what is appended) -> so the intention of the code is IMO also not correct
  • an object Index ? -> yes, this is also what 0.18.1 does

But, the more important discussion is what to do when the calling index is a CategoricalIndex ? In 0.18 (and in 0.19.0) this either returned a CategoricalIndex or raised an error:

In [13]: pd.__version__
Out[13]: u'0.18.1'

In [104]: pd.CategoricalIndex(['a', 'b']).append(pd.Index(['a']))
Out[104]: CategoricalIndex([u'a', u'b', u'a'], categories=[u'a', u'b'], ordered=False, dtype='category')

In [105]: pd.CategoricalIndex(['a', 'b']).append(pd.Index(['c']))
...
TypeError: cannot append a non-category item to a CategoricalIndex

But, for the concat case we decided to return an object dtyped Index in those cases instead of coercing or raising (see discussion in #13767 and the summary table with rules in #13767 (comment)). So the question is, do think that append should follow the same rules as concat. Or, do we want to be more flexible for append, and let it depend on the type of the calling Index?

cc @sinhrks

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Oct 31, 2016

Member

In the current PR, I just removed the check for category, which is the more intrusive change, as now the return result will always be object dtype if not self and all objects to append are CategoricalIndex.
I could also change the if 'category' in typs: check to if self.is_categorical():, which then keeps the current behaviour if the calling object is a CategoricalIndex.

Member

jorisvandenbossche commented Oct 31, 2016

In the current PR, I just removed the check for category, which is the more intrusive change, as now the return result will always be object dtype if not self and all objects to append are CategoricalIndex.
I could also change the if 'category' in typs: check to if self.is_categorical():, which then keeps the current behaviour if the calling object is a CategoricalIndex.

Show outdated Hide outdated pandas/indexes/base.py
typs = _concat.get_dtype_kinds(to_concat)
if 'category' in typs:
if self.is_categorical():
# if any of the to_concat is category

This comment has been minimized.

@jreback

jreback Nov 1, 2016

Contributor

then you should change this comment as this is no longer true (the Index must be a CI to return a CI) only.

@jreback

jreback Nov 1, 2016

Contributor

then you should change this comment as this is no longer true (the Index must be a CI to return a CI) only.

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Nov 1, 2016

Current coverage is 85.27% (diff: 100%)

Merging #14545 into master will not change coverage

@@             master     #14545   diff @@
==========================================
  Files           140        140          
  Lines         50693      50693          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43229      43229          
  Misses         7464       7464          
  Partials          0          0          

Powered by Codecov. Last update 093aa82...0d47495

codecov-io commented Nov 1, 2016

Current coverage is 85.27% (diff: 100%)

Merging #14545 into master will not change coverage

@@             master     #14545   diff @@
==========================================
  Files           140        140          
  Lines         50693      50693          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43229      43229          
  Misses         7464       7464          
  Partials          0          0          

Powered by Codecov. Last update 093aa82...0d47495

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 2, 2016

Member

OK, so I changed the PR to only fix the issue when the calling Index is not of Categorical dtype (which caused the bug), and left the behaviour of CategoricalIndex.append(..) alone.
As a result pd.Index(['a']).append(pd.CategoricalIndex(['a', 'b'])) works again (returning object Index as in 0.18.1).

That is probably the easiest for now, to have this fix in 0.19.1, but we should then discuss the behaviour of Index.append(..) a bit more in detail for 0.20 (whether to follow the same rules as concat or not).

Member

jorisvandenbossche commented Nov 2, 2016

OK, so I changed the PR to only fix the issue when the calling Index is not of Categorical dtype (which caused the bug), and left the behaviour of CategoricalIndex.append(..) alone.
As a result pd.Index(['a']).append(pd.CategoricalIndex(['a', 'b'])) works again (returning object Index as in 0.18.1).

That is probably the easiest for now, to have this fix in 0.19.1, but we should then discuss the behaviour of Index.append(..) a bit more in detail for 0.20 (whether to follow the same rules as concat or not).

@jorisvandenbossche jorisvandenbossche referenced this pull request Nov 2, 2016

Closed

RLS: v0.19.1 #14474

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Nov 2, 2016

Contributor

+1 on the behavior here (coercing Categorical to Index).

I'd feel slightly better if the monkey patching of sys.stdout was in a context manager

from contextlib import contextmanager

@contextmanager
def stdout():
    sys.stdout = StringIO()
    yield
    sys.stdout = sys.__stdout__

but not a huge deal.

Contributor

TomAugspurger commented Nov 2, 2016

+1 on the behavior here (coercing Categorical to Index).

I'd feel slightly better if the monkey patching of sys.stdout was in a context manager

from contextlib import contextmanager

@contextmanager
def stdout():
    sys.stdout = StringIO()
    yield
    sys.stdout = sys.__stdout__

but not a huge deal.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 2, 2016

Member

I'd feel slightly better if the monkey patching of sys.stdout was in a context manager

I just copied the approach from other tests :-), so would be a nice clean-up in general!

Member

jorisvandenbossche commented Nov 2, 2016

I'd feel slightly better if the monkey patching of sys.stdout was in a context manager

I just copied the approach from other tests :-), so would be a nice clean-up in general!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 2, 2016

Contributor

don't re-invent the wheel, the pattern is:

buf = StringIO()
df.info(buf=buf, null_counts=null_counts)
self.assertTrue(('non-null' in buf.getvalue()) is result)
Contributor

jreback commented Nov 2, 2016

don't re-invent the wheel, the pattern is:

buf = StringIO()
df.info(buf=buf, null_counts=null_counts)
self.assertTrue(('non-null' in buf.getvalue()) is result)
Show outdated Hide outdated pandas/tests/frame/test_repr_info.py
df = pd.DataFrame(np.zeros((2, 2)), index=idx, columns=idx)
import sys
sys.stdout = StringIO()

This comment has been minimized.

@jreback

jreback Nov 2, 2016

Contributor

here

@jreback

jreback Nov 2, 2016

Contributor

here

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Nov 3, 2016

Member

Updated the test

@jorisvandenbossche

jorisvandenbossche Nov 3, 2016

Member

Updated the test

@jorisvandenbossche jorisvandenbossche merged commit 252526c into pandas-dev:master Nov 3, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

jorisvandenbossche added a commit that referenced this pull request Nov 3, 2016

[Backport #14545] BUG/API: Index.append with mixed object/Categorical…
… indices (#14545)

* BUG/API: Index.append with mixed object/Categorical indices

* Only coerce to object if the calling index is not categorical

* Add test for the df.info() case (GH14298)

(cherry picked from commit 252526c)

yarikoptic added a commit to neurodebian/pandas that referenced this pull request Nov 18, 2016

Merge tag 'v0.19.1' into debian
Version 0.19.1

* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (#14573)
  [Backport #14545] BUG/API: Index.append with mixed object/Categorical indices (#14545)
  DOC: rst fixes
  [Backport #14567] DEPR: add deprecation warning for com.array_equivalent (#14567)
  [Backport #14551] PERF: casting loc to labels dtype before searchsorted (#14551)
  [Backport #14536] BUG: DataFrame.quantile with NaNs (GH14357) (#14536)
  [Backport #14520] BUG: don't close user-provided file handles in C parser (GH14418) (#14520)
  [Backport #14392] BUG: Dataframe constructor when given dict with None value (#14392)
  [Backport #14514] BUG: Don't parse inline quotes in skipped lines (#14514)
  [Bacport #14543] BUG: tseries ceil doc fix (#14543)
  [Backport #14541] DOC: Simplify the gbq integration testing procedure for contributors (#14541)
  [Backport #14527] BUG/ERR: raise correct error when sql driver is not installed (#14527)
  [Backport #14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (#14501)
  [Backport #14442] DOC: Expand on reference docs for read_json() (#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport #14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport #14496] BLD: Support Cython 0.25
  [Backport #14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport #14476] PERF: performance regression in Series.asof (#14476)
  ...

amolkahat added a commit to amolkahat/pandas that referenced this pull request Nov 26, 2016

BUG/API: Index.append with mixed object/Categorical indices (#14545)
* BUG/API: Index.append with mixed object/Categorical indices

* Only coerce to object if the calling index is not categorical

* Add test for the df.info() case (GH14298)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment