Support ExtensionArray types in where #24077

Closed
TomAugspurger opened this issue Dec 3, 2018 · 1 comment

TomAugspurger commented Dec 3, 2018

This is blocking DatetimeArray. It's also a slight regression from 0.24, since things like .where on a DataFrame with period objects would work (via object dtype).

I think the easiest place for this is by defining ExtensionBlock.where, and restricting it to cases where the dtype of self and other match (so that the result dtype is the same).

We can do this pretty easily for our EAs by performing the .where on _ndarray_values. But _ndarray_values isn't part of the EA interface yet. I'm not sure if we'll have time to properly design and implement a generic .where for any ExtensionArray, since there are a couple of subtleties.
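To make the storage-level idea concrete, here is a rough sketch for the period case. It is not the patch itself: it assumes `PeriodArray.asi8` as a stand-in for `_ndarray_values`, the public `pd.arrays.PeriodArray` constructor as a stand-in for the missing "from ndarray values" constructor, and the minimum int64 as the storage NA value (iNaT).

```python
import numpy as np
import pandas as pd

# Assumption: the minimum int64 is the storage-level NA sentinel (iNaT)
# for int64-backed datetime-like arrays.
INAT = np.int64(np.iinfo(np.int64).min)

periods = pd.period_range("2000", periods=4, freq="D").array  # PeriodArray
cond = np.array([False, True, False, False])

# Do the selection on the raw integer ordinals, not on Period objects.
ordinals = periods.asi8
masked = np.where(cond, ordinals, INAT)

# Rebuild an array of the same dtype from the raw values.
result = pd.arrays.PeriodArray(masked, dtype=periods.dtype)
print(result)
```

The result keeps the period dtype, and the masked positions come back as NaT without ever going through object dtype.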

Here's a start

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index 1b67c2053..ce5c01359 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -1955,6 +1955,37 @@ class ExtensionBlock(NonConsolidatableMixIn, Block):
                                            placement=self.mgr_locs,
                                            ndim=self.ndim)]
 
+    def where(self, other, cond, align=True, errors='raise',
+              try_cast=False, axis=0, transpose=False):
+        import pandas.core.computation.expressions as expressions
+
+        values = self.values._ndarray_values
+
+        if cond.ndim == 2:
+            assert cond.shape[-1] == 1
+            cond = cond._data.blocks[0].values.ravel()
+
+        if hasattr(other, 'ndim') and other.ndim == 2:
+            # TODO: this hasn't been normalized
+            assert other.shape[-1] == 1
+            other = other._data.blocks[0].values
+
+        elif (lib.is_scalar(other) and isna(other)) or other is None:
+            # TODO: we need the storage NA value (e.g. iNaT)
+            other = self.values.dtype.na_value
+            # other = tslibs.iNaT
+
+        # TODO: cond.ravel().all() short-circuit
+
+        if cond.ndim > 1:
+            cond = cond.ravel()
+
+        result = expressions.where(cond, values, other)
+        if not isinstance(result, self._holder):
+            # Need a kind of _from_ndarray_values()
+            # this is different from _from_sequence
+            result = self._holder.(result, dtype=self.dtype)
+        return self.make_block_same_class(result)
+
     @property
     def _ftype(self):
         return getattr(self.values, '_pandas_ftype', Block._ftype)

There are a couple TODOs there, plus tests, and I'm sure plenty of edge cases.

In [7]: df = pd.DataFrame({"A": pd.period_range("2000", periods=12)})

In [8]: df.where(df.A.dt.day == 2)
Out[8]:
             A
0          NaT
1   2000-01-02
2          NaT
3          NaT
4          NaT
5          NaT
6          NaT
7          NaT
8          NaT
9          NaT
10         NaT
11         NaT

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Dec 3, 2018

TomAugspurger commented Dec 5, 2018

This will also avoid converting to objects for categoricals.
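A quick way to check the categorical behaviour, assuming a pandas version that includes this change (the data here is made up for illustration):

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a", "c"]))

# Mask out the "b" entries; the masked positions become missing values.
out = s.where(s != "b")

# If the categorical path is used, the dtype should still be categorical
# rather than object.
print(out.dtype)
print(out.isna().tolist())
```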

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 5, 2018

API: Added ExtensionArray.where
We need some way to do `.where` on EA object for DatetimeArray. Adding it
to the interface is, I think, the easiest way.

Initially I started to write a version on ExtensionBlock, but it proved
to be unwieldy to write a version that performed well for all types.
It *may* be possible to do using `_ndarray_values` but we'd need a few more
things around that (missing values, converting an arbitrary array to the
"same" ndarray_values, error handling, re-constructing). It seemed easier
to push this down to the array.

The implementation on ExtensionArray is readable, but likely slow since
it'll involve a conversion to object-dtype.

Closes pandas-dev#24077
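For reference, a generic fallback that goes through object dtype might look roughly like the sketch below. This is an illustration of the idea described in the commit message, not the code from the commit; `where_via_object` is a hypothetical helper name, and it relies only on `_from_sequence` from the ExtensionArray interface.

```python
import numpy as np
import pandas as pd


def where_via_object(arr, cond, other):
    """Hypothetical helper (not pandas API): keep ``arr`` where ``cond`` is
    True and take ``other`` elsewhere, going through object dtype."""
    cond = np.asarray(cond, dtype=bool)
    # Boxing every element into a Python object is what makes this slow.
    values = np.asarray(arr, dtype=object)
    fill = np.empty(len(arr), dtype=object)
    if pd.api.types.is_scalar(other):
        fill[:] = other
    else:
        fill[:] = np.asarray(other, dtype=object)
    merged = np.where(cond, values, fill)
    # Rebuild an array of the original type from the boxed scalars.
    return type(arr)._from_sequence(merged, dtype=arr.dtype)


periods = pd.period_range("2000", periods=3, freq="D").array
result = where_via_object(periods, [True, False, True], pd.NaT)
swapped = where_via_object(periods, [False, True, True], periods[::-1])
print(result)
```

It is easy to read and works for any array that round-trips through `_from_sequence`, but every element is boxed and unboxed along the way.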

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 5, 2018

Squashed commit of the following:
commit 56470c3
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Dec 5 11:39:48 2018 -0600

    Fixups:

    * Ensure data generated OK.
    * Remove erroneous comments about alignment. That was user error.

commit c4604df
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Dec 3 14:23:25 2018 -0600

    API: Added ExtensionArray.where

    We need some way to do `.where` on EA object for DatetimeArray. Adding it
    to the interface is, I think, the easiest way.

    Initially I started to write a version on ExtensionBlock, but it proved
    to be unwieldy to write a version that performed well for all types.
    It *may* be possible to do using `_ndarray_values` but we'd need a few more
    things around that (missing values, converting an arbitrary array to the
    "same" ndarray_values, error handling, re-constructing). It seemed easier
    to push this down to the array.

    The implementation on ExtensionArray is readable, but likely slow since
    it'll involve a conversion to object-dtype.

    Closes pandas-dev#24077

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 7, 2018

Squashed commit of the following:
commit 9e0d87d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Dec 7 07:18:58 2018 -0600

    update docs, cleanup

commit 1271d3d
Merge: 033ac9c f74fc59
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Dec 7 07:12:49 2018 -0600

    Merge remote-tracking branch 'upstream/master' into ea-where

commit 033ac9c
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Dec 7 06:30:18 2018 -0600

    Setitem-based where

commit e9665b8
Merge: 5e14414 03134cb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 21:38:42 2018 -0600

    Merge remote-tracking branch 'upstream/master' into ea-where

commit 5e14414
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:18:54 2018 -0600

    where versionadded

commit d90f384
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:17:43 2018 -0600

    deprecation note for categorical

commit 4715ef6
Merge: edff47e b78aa8d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 08:15:26 2018 -0600

    Merge remote-tracking branch 'upstream/master' into ea-where

commit edff47e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 08:15:21 2018 -0600

    32-bit compat

commit badb5be
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 06:21:44 2018 -0600

    compat, revert

commit 911a2da
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Dec 5 15:55:24 2018 -0600

    debug 32-bit issue

commit a69dbb3
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Dec 5 15:49:17 2018 -0600

    warn for categorical

commit 6f79282
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Dec 5 12:45:54 2018 -0600

    32-bit compat

commit 56470c3
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Dec 5 11:39:48 2018 -0600

    Fixups:

    * Ensure data generated OK.
    * Remove erroneous comments about alignment. That was user error.

commit c4604df
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Dec 3 14:23:25 2018 -0600

    API: Added ExtensionArray.where

    We need some way to do `.where` on EA object for DatetimeArray. Adding it
    to the interface is, I think, the easiest way.

    Initially I started to write a version on ExtensionBlock, but it proved
    to be unwieldy to write a version that performed well for all types.
    It *may* be possible to do using `_ndarray_values` but we'd need a few more
    things around that (missing values, converting an arbitrary array to the
    "same" ndarray_values, error handling, re-constructing). It seemed easier
    to push this down to the array.

    The implementation on ExtensionArray is readable, but likely slow since
    it'll involve a conversion to object-dtype.

    Closes pandas-dev#24077
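The log above mentions a "Setitem-based where" but does not show it. One plausible reading, sketched here under that assumption, is to copy the array and assign the replacement values at the positions where the condition is False. `where_via_setitem` is a hypothetical helper, not the function from the commit.

```python
import numpy as np
import pandas as pd


def where_via_setitem(arr, cond, other):
    """Hypothetical helper (not pandas API): ``where`` built on ``__setitem__``."""
    cond = np.asarray(cond, dtype=bool)
    result = arr.copy()  # leave the input untouched
    if pd.api.types.is_scalar(other):
        result[~cond] = other
    else:
        # Array-like replacement: coerce to the same array type, then use
        # only the positions where cond is False.
        other = type(arr)._from_sequence(other, dtype=arr.dtype)
        result[~cond] = other[~cond]
    return result


cats = pd.Categorical(["a", "b", "a"])
masked = where_via_setitem(cats, [True, False, True], np.nan)
filled = where_via_setitem(cats, [False, True, True], ["b", "b", "b"])
print(masked)
print(filled)
```

This version never converts to object dtype, so the cost depends on how cheap the array's own `__setitem__` is, and dtype validation is delegated to the array.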

jreback added a commit that referenced this issue Dec 10, 2018

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
