ENH: ExtensionArray.fillna #19909

Merged

merged 26 commits into pandas-dev:master from TomAugspurger:fu1+fillna on Mar 16, 2018

Conversation

3 participants
@TomAugspurger
Contributor

TomAugspurger commented Feb 26, 2018

xref #19696

This adds ExtensionArray.fillna, and support for NDFrame.fillna with extension arrays.

Like Categorical.fillna, limit is not implemented. This shouldn't make it any harder to support that in the future.
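For illustration (not part of the PR), the end-user behavior this enables mirrors what Categorical already supported through Series.fillna; the values here are made up:

```python
import pandas as pd

# Categorical already supported fillna; extension arrays now behave the
# same way through Series.fillna / NDFrame.fillna.
s = pd.Series(pd.Categorical(['a', None, 'b', None], categories=['a', 'b']))
filled = s.fillna('a')  # fill missing slots with an existing category
print(filled.tolist())
```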

-------
filled : ExtensionArray with NA/NaN filled
"""
from pandas.api.types import is_scalar


@TomAugspurger

TomAugspurger Feb 26, 2018

Contributor

A putmask type method would be extremely useful here.

I'll see what I can do to simplify this.


@jorisvandenbossche

jorisvandenbossche Feb 26, 2018

Member

How would that be different with array[mask] = values ?
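For illustration (not part of the thread), masked assignment is effectively a putmask on the underlying values; a minimal numpy sketch with made-up data:

```python
import numpy as np

# Masked assignment fills only the slots where the mask is True,
# which is exactly what a putmask-style method would do.
values = np.array([1.0, np.nan, 3.0, np.nan])
mask = np.isnan(values)
values[mask] = 0.0  # fill the missing slots in place
print(values.tolist())
```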

]
else:
new_values = self
return type(self)(new_values)


@TomAugspurger

TomAugspurger Feb 26, 2018

Contributor

Wait on #19906 before merging this, so I can update this line.

@@ -1963,6 +1963,23 @@ def concat_same_type(self, to_concat, placement=None):
return self.make_block_same_class(values, ndim=self.ndim,
placement=placement)
def fillna(self, value, limit=None, inplace=False, downcast=None,


@TomAugspurger

TomAugspurger Feb 26, 2018

Contributor

This was moved from Categorical.fillna with changes

1.) Removed check for limit, since that's done in values.fillna
2.) Removed _try_coerce_result
For CategoricalBlock, this was unnecessary since
Categorical.fillna always returns a Categorical
3.) Used make_block_same_class
This limits ExtensionArray.fillna to not change the type
of the array / block, which I think is a good thing.

placement=self.mgr_locs,
ndim=self.ndim)]
def interpolate(self, method='pad', axis=0, inplace=False, limit=None,


@TomAugspurger

TomAugspurger Feb 26, 2018

Contributor

This was just a move from CategoricalBlock to ExtensionBlock, no changes.

@pytest.mark.skip(reason="Backwards compatability")
def test_fillna_limit_raises(self):
"""Has a different error message."""


@TomAugspurger

TomAugspurger Feb 26, 2018

Contributor

I can probably just update the error message for Categorical.fillna.

def tolist(self):
# type: () -> list
"""Convert the array to a list of scalars."""
return list(self)


@jorisvandenbossche

jorisvandenbossche Feb 26, 2018

Member

Is this needed for fillna?


@TomAugspurger

TomAugspurger Feb 26, 2018

Contributor

Indirectly, via the default fillna.

Series.__iter__ calls Series.tolist, which calls Series._values.tolist. We could modify that to check the type and just call list(self._values) for EAs. I don't have a strong preference vs. adding a default .tolist.


@jorisvandenbossche

jorisvandenbossche Feb 26, 2018

Member

Or directly call values.__iter__? Iterating through the values to make a list and then iterating again through that list sounds like a bit of overhead.

For .tolist() itself, it is of course consistent with Series and numpy arrays, but personally I am not sure what its value is compared to list(values). I don't think you typically can implement a more efficient conversion to lists than what list(values) will do by default? (i.e. iterating through the values)

@codecov


codecov bot commented Feb 26, 2018

Codecov Report

Merging #19909 into master will decrease coverage by 0.02%.
The diff coverage is 93.54%.


@@            Coverage Diff             @@
##           master   #19909      +/-   ##
==========================================
- Coverage   91.78%   91.75%   -0.03%     
==========================================
  Files         152      152              
  Lines       49196    49214      +18     
==========================================
+ Hits        45153    45155       +2     
- Misses       4043     4059      +16
Flag Coverage Δ
#multiple 90.13% <93.54%> (-0.03%) ⬇️
#single 41.81% <22.58%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 87.68% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.09% <100%> (-0.02%) ⬇️
pandas/core/internals.py 95.53% <100%> (-0.01%) ⬇️
pandas/core/base.py 96.78% <100%> (+0.01%) ⬆️
pandas/core/arrays/base.py 80.64% <89.47%> (+3.9%) ⬆️
pandas/plotting/_converter.py 65.07% <0%> (-1.74%) ⬇️
pandas/util/testing.py 83.74% <0%> (-0.21%) ⬇️

Last update 38afa93...05fced6.

new_values = []
for is_na, val in zip(mask, data):


@jreback

jreback Feb 27, 2018

Contributor

huh? what is all this.


@jorisvandenbossche

jorisvandenbossche Feb 27, 2018

Member

As you can clearly see from the diff, it is a generic implementation of fillna; the part above handles the method kwarg (forward or backward fill).
If you don't like the implementation (I also don't really like it, it feels very inefficient), it would be more helpful to give a more specific comment than that.

@TomAugspurger do you want this for cyberpandas? Otherwise, we could also start for now with only supporting filling with a value ? (that's what we supported for now in geopandas)


@jorisvandenbossche

jorisvandenbossche Feb 27, 2018

Member

On the other hand, we could maybe do this differently (that might be more efficient, without looping through all data): create an array of indices (arange), mask them with the missing values, then fill those indices with the existing fill-method machinery, and then use that to index into the original data and to fill the missing values.
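For illustration (not part of the thread), the index-based approach described here can be sketched in plain numpy, with a cumulative-maximum trick standing in for pandas' internal fill machinery (the helper name is invented, and it assumes the first element is valid):

```python
import numpy as np

def ffill_via_take(values, mask):
    # Forward-fill by position: keep each valid index, carry it forward
    # over masked slots, then take from the original values.
    # Assumes values[0] is valid (a leading NA would wrongly pick index 0).
    pos = np.where(~mask, np.arange(len(values)), 0)
    pos = np.maximum.accumulate(pos)
    return values.take(pos)

vals = np.array(['a', 'b', 'c', 'd'], dtype=object)
mask = np.array([False, True, False, True])  # 'b' and 'd' are "missing"
print(list(ffill_via_take(vals, mask)))
```

The appeal of this strategy is that the loop over the data happens on an integer array, regardless of how expensive the extension array's scalars are.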


@TomAugspurger

TomAugspurger Feb 27, 2018

Contributor

I don't really care about ffill / bfill, but Categorical supports it so best to be consistent.

Here's one alternative, based on astyping to object and setting.

+        from pandas.core.missing import pad_1d, backfill_1d
 
         value, method = validate_fillna_kwargs(value, method)
 
+        mask = self.isna()
+
         if not is_scalar(value):
             if len(value) != len(self):
                 raise ValueError("Length of 'value' does not match. Got ({}) "
                                  " expected {}".format(len(value), len(self)))
-        else:
-            value = itertools.cycle([value])
+            value = value[mask]
 
         if limit is not None:
             msg = ("Specifying 'limit' for 'fillna' has not been implemented "
                    "yet for {} typed data".format(self.dtype))
             raise NotImplementedError(msg)
 
-        mask = self.isna()
-
         if mask.any():
             # ffill / bfill
-            if method is not None:
-                if method == 'backfill':
-                    data = reversed(self)
-                    mask = reversed(mask)
-                    last_valid = self[len(self) - 1]
-                else:
-                    last_valid = self[0]
-                    data = self
-
-                new_values = []
-
-                for is_na, val in zip(mask, data):
-                    if is_na:
-                        new_values.append(last_valid)
-                    else:
-                        new_values.append(val)
-                        last_valid = val
-
-                if method in {'bfill', 'backfill'}:
-                    new_values = list(reversed(new_values))
+            if method == 'pad':
+                values = self.astype(object)
+                new_values = pad_1d(values, mask=mask)
+            elif method == 'backfill':
+                values = self.astype(object)
+                new_values = backfill_1d(values, mask=mask)
             else:
                 # fill with value
-                new_values = [
-                    val if is_na else original
-                    for is_na, original, val in zip(mask, self, value)
-                ]
+                new_values = self.copy()
+                new_values[mask] = value
         else:
             new_values = self
         return type(self)(new_values)

Timings (n_rows, value/method, seconds):

astype

1000, value, 0.01
1000, ffill, 0.00
1000, bfill, 0.00
1000000, value, 4.61
1000000, ffill, 2.34
1000000, bfill, 2.36

Loopy

1000, value, 0.00
1000, ffill, 0.00
1000, bfill, 0.00
1000000, value, 1.28
1000000, ffill, 1.37
1000000, bfill, 1.32

So the loopy version (current PR) is ~2-4x faster. There's lots of variability there, especially depending on the speed of EA.__setitem__, but it seems like a good default.


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

What code did you use for the timings here?


@TomAugspurger

TomAugspurger Feb 28, 2018

Contributor
from timeit import default_timer as tic
import pandas as pd
import numpy as np
from pandas.tests.extension.decimal.array import DecimalArray
from decimal import Decimal

Ns = [1000, 1_000_000]

for N in Ns:

    arr = np.random.uniform(size=(N,))
    arr[arr > .8] = np.nan
    arr = DecimalArray([Decimal(x) for x in arr])

    start = tic()
    arr.fillna(value=Decimal('1.0'))
    stop = tic()

    print(f'{N}, value, {stop - start:0.2f}')

    start = tic()
    arr.fillna(method='ffill')
    stop = tic()

    print(f'{N}, ffill, {stop - start:0.2f}')

    start = tic()
    arr.fillna(method='bfill')
    stop = tic()

    print(f'{N}, bfill, {stop - start:0.2f}')

I can re-run it / profile things a bit to see why the array[object] version was so much slower.


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

With a quick and dirty check:

In [1]: from geopandas.array import from_shapely
   ...: import shapely.geometry
   ...: a = np.array([shapely.geometry.Point(i, i) for i in np.random.randn(10000)], dtype=object)
   ...: a[np.random.randint(0, 10000, 10)] = None
   ...: ga = from_shapely(a)
   ...: s = pd.Series(ga)

In [2]: s.head()
Out[2]: 
0    POINT (-0.9684702137574125 -0.9684702137574125)
1        POINT (0.419808066032094 0.419808066032094)
2          POINT (0.57207855258413 0.57207855258413)
3      POINT (0.7773915414611537 0.7773915414611537)
4        POINT (2.073261603753559 2.073261603753559)
dtype: geometry

In [3]: %timeit s.fillna(method='pad')
115 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit s.fillna(method='pad_new')
217 µs ± 3.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: s.fillna(method='pad').equals(s.fillna(method='pad_new'))
Out[5]: True

And for this 'new' method I did:

+                if method == 'pad_new':
+                    idx = np.arange(len(self), dtype=float)
+                    idx[mask] = np.nan
+                    idx = pad_1d(idx)
+                    idx = idx.astype('int64')
+                    new_values = self.take(idx)
+                else: ... existing code in PR


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

OK, didn't see your post before posting mine. So doing the same with the decimal array arr (with 10000 values) doesn't show a speed-up, rather a slowdown:

In [18]: %timeit arr.fillna(method='pad')
17.3 ms ± 772 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %timeit arr.fillna(method='pad_new')
28.5 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I suppose this is because DecimalArray uses an object array under the hood anyhow? So something like take will not be that efficient (while in geopandas it is a take on an int array).


@TomAugspurger

TomAugspurger Feb 28, 2018

Contributor

I think we shouldn't care about performance for the default implementation. It'll be too dependent on which optimizations the subclass can implement.

I'll go with a simple implementation like your pad_new.


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

And also, if you care about performance, you probably don't have an object array under the hood but a typed array, and then this will be faster.

@TomAugspurger TomAugspurger referenced this pull request Feb 28, 2018

Open

ExtensionArray meta-issue #19696

12 of 15 tasks complete
@TomAugspurger


Contributor

TomAugspurger commented Feb 28, 2018

@jreback @jorisvandenbossche do you have a preference for tolist? Here's the diff removing it.

diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index 8dc4ddbff..bb58fefe5 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -218,11 +218,6 @@ class ExtensionArray(object):
         """
         raise AbstractMethodError(self)
 
-    def tolist(self):
-        # type: () -> list
-        """Convert the array to a list of scalars."""
-        return list(self)
-
     def fillna(self, value=None, method=None, limit=None):
         """ Fill NA/NaN values using the specified method.
 
diff --git a/pandas/core/base.py b/pandas/core/base.py
index 280b88497..5ced4ec25 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -829,6 +829,8 @@ class IndexOpsMixin(object):
 
         if is_datetimelike(self):
             return [com._maybe_box_datetimelike(x) for x in self._values]
+        elif is_extension_array_dtype(self):
+            return list(self._values)
         else:
             return self._values.tolist()

I don't have a strong preference, but slightly lean towards implementing it (the current state of the PR). It's ugly, but practical in this case.

func = pad_1d if method == 'pad' else backfill_1d
idx = np.arange(len(self), dtype=np.float64)
idx[mask] = np.nan
idx = func(idx, mask=mask).astype(np.int64)


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

probably needs to be np.intp (or an ensure_platform_int) for windows?
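For illustration (not part of the thread), np.intp is numpy's platform-sized integer, which is the safe dtype for take-style indexers across platforms (a fixed int64 can misbehave on 32-bit Windows builds); the values here are made up:

```python
import numpy as np

# Float positions as produced by the fill step, cast to the platform
# integer before being used as a take indexer.
values = np.array(['a', 'b', 'c'], dtype=object)
idx = np.array([0.0, 0.0, 2.0])
indexer = idx.astype(np.intp)  # platform int, safe for take everywhere
print(list(values.take(indexer)))
```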

new_values[mask] = value
else:
new_values = self.copy()
return type(self)(new_values)


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

I think this can now just be new_values ?

if mask.any():
if method is not None:
# ffill / bfill
func = pad_1d if method == 'pad' else backfill_1d


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

is method normalized by here? (I mean, can it also be 'ffill'?)


@TomAugspurger

TomAugspurger Feb 28, 2018

Contributor

Yeah, by validate_fillna_kwargs

Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
value : scalar, array-like


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

value before method ? (order of signature)

Alternatively, an array-like 'value' can be given. It's expected
that the array-like have the same length as 'self'.
limit : int, default None
(Not implemented yet for ExtensionArray!)


@jorisvandenbossche

jorisvandenbossche Feb 28, 2018

Member

this can now maybe be passed to pad_1d / backfill_1d ?


@TomAugspurger

TomAugspurger Feb 28, 2018

Contributor

Yeah, I had to change things up slightly to use integers and the datetime fillna methods if you want to take a look.


@TomAugspurger

TomAugspurger Feb 28, 2018

Contributor

Categorical.fillna can probably do something similar; I haven't looked.

TomAugspurger added some commits Feb 28, 2018

Refactor tolist
Moved to a new method in dtypes.cast.
@TomAugspurger


Contributor

TomAugspurger commented Mar 1, 2018

Updated to remove tolist for now, if you could take another look @jorisvandenbossche.

Added a new method tolist to cast.py that basically does that. I didn't see any other places we could use it, other than IndexOpsMixin.tolist.

@@ -53,3 +54,8 @@ def test_is_extension_array_dtype(self, data):
assert is_extension_array_dtype(data.dtype)
assert is_extension_array_dtype(pd.Series(data))
assert isinstance(data.dtype, ExtensionDtype)
def test_tolist(self, data):
result = tolist(data)


@jorisvandenbossche

jorisvandenbossche Mar 1, 2018

Member

Maybe rather test Series[extension].tolist() here, to check it is the same as list(data) (instead of testing this internal method)?

Test Series[ea].tolist
Moved to test_casting
@TomAugspurger


Contributor

TomAugspurger commented Mar 1, 2018

@jorisvandenbossche


Member

jorisvandenbossche commented Mar 1, 2018

This is fine for me. I agree it is a bit more complexity, so if somebody feels strong about it, I won't object having a tolist (but still, I think it is better to keep the interface as limited as possible for now, and tolist feels a bit superfluous from the extension array point of view).

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Mar 1, 2018

This turned up an issue with Categorical.__iter__ not unwrapping NumPy scalars to their Python types. I fixed that in the latest commit, to be consistent with the rest of pandas __iter__ methods.

@@ -1712,7 +1711,7 @@ def __len__(self):
     def __iter__(self):
         """Returns an Iterator over the values of this Categorical."""
-        return iter(self.get_values())
+        return iter(self.get_values().tolist())


@jorisvandenbossche

jorisvandenbossche Mar 1, 2018

Member

Is there a reason this does self.get_values().tolist() instead of directly self.tolist() ?


@TomAugspurger

TomAugspurger Mar 2, 2018

Contributor

I think to avoid an infinite loop.

Categorical.tolist calls tolist which calls Categorical.__iter__.

We need get_values to break that cycle / extract the Python scalar types.


@jorisvandenbossche

jorisvandenbossche Mar 2, 2018

Member

It just feels that we are doing too much work. To iterate, we first create an array, iterate through it to create a list, and then iterate through the list...


@jorisvandenbossche

jorisvandenbossche Mar 2, 2018

Member

So get_values either returns an array or a Datetime(like)Index. For the array we can just iterate over it, and I think the same holds for the Datetime(like)Index. The tolist implementation there is list(self.astype(object)).


@jorisvandenbossche

jorisvandenbossche Mar 2, 2018

Member

Sorry, that previous comment was of course wrong, as that is exactly checking the get_values().tolist()


@jorisvandenbossche

jorisvandenbossche Mar 2, 2018

Member

Ah, yes, therefore we need the tolist.
But then moving the logic is maybe not best, as then tolist would consume an iterator coming from a list.

It's all just complex with the different unboxing depending on the type :-)


@TomAugspurger

TomAugspurger Mar 2, 2018

Contributor

Simplified slightly if you want to take one last look. Basically, we didn't have to worry about the different unboxing for different types, since Categorical.get_values() with datetimes returns a DatetimeIndex, and when we .tolist() on that we get the right unboxing.


@jreback

jreback Mar 7, 2018

Contributor

this is the whole .get_values() fiasco (which I created a while back :). I think it should be addressed sooner rather than later.


@TomAugspurger

TomAugspurger Mar 12, 2018

Contributor

Any issues with the current implementation? Currently there's a tradeoff between how expensive it = iter(Categorical) is vs. next(it).

Somewhere, we have to go from numpy scalars to Python scalars. The fastest way to do that is with tolist(), but that has upfront memory overhead. We could avoid that by doing it in the next, but that has time overhead for

  1. converting to Python scalars elementwise
  2. an extra CategoricalIndex.categories.__getitem__ call per element

so that next(it) becomes much slower. I don't think there's much more to be done right now.
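For illustration (not part of the thread), this is the unboxing difference under discussion: indexing a numpy array yields numpy scalars, while .tolist() converts the whole array to Python scalars in one upfront pass:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)
elem = arr[0]           # numpy scalar (np.int64), cheap but not a Python int
unboxed = arr.tolist()  # one upfront pass producing plain Python ints

print(type(elem).__name__, type(unboxed[0]).__name__)
```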

# The basic idea is to create an array of integer positions.
# Internally, we use iNaT and the datetime filling routines
# to avoid floating-point NaN. Once filled, we take on `self`
# to get the actual values.


@jreback

jreback Mar 7, 2018

Contributor

I am still not clear why you are writing another implementation of fill here. We already have one for object dtypes. I would suggest using that one. Yes, I get that you want one for the extension classes, but if that is the case then I would replace the algos we have with THIS one. There is no reason to have 2 floating around. Then this will be fully tested and you can look at performance and such.

I don't really have a preference between these 2 options.

The extension classes implementation MUST be integrated if it's in the base class.

@jorisvandenbossche


Member

jorisvandenbossche commented Mar 12, 2018

I am still not clear why you are writing another implementation of fill here. We already have one for object dtypes. I would suggest using that one. Yes I get that you want one for the extension classes, but if that is the case then I would replace the algos we have with THIS one. There is no reason to have 2 floating around. Then this will be fully tested and you can look at performance and such.

Sorry, but I really disagree that having a good implementation of fillna for ExtensionArrays is out of scope.
The implementation we had before is really not that trivial for external authors to do themselves, and was using internal API. But it was a good generic implementation that would work in many cases.

@jorisvandenbossche


Member

jorisvandenbossche commented Mar 12, 2018

But OK, let's go ahead with this PR, we can always discuss later an improved implementation

@TomAugspurger


Contributor

TomAugspurger commented Mar 12, 2018

Ah I opened #20300

I don't really care whether we do it now or elsewhere. That type of filling on integer positions, followed by a take, seems like a useful strategy in general. I can do a followup implementing it in core/missing.py and then calling it here.

@jorisvandenbossche


Member

jorisvandenbossche commented Mar 12, 2018

OK, let's merge this one then!

@jreback


Contributor

jreback commented Mar 12, 2018

i will have a look later today

@TomAugspurger


Contributor

TomAugspurger commented Mar 14, 2018

Fixed a merge conflict.

This should be ready as well.

@TomAugspurger


Contributor

TomAugspurger commented Mar 15, 2018

Fixed the merge conflict.

@TomAugspurger


Contributor

TomAugspurger commented Mar 16, 2018

All green.

@jreback


Contributor

jreback commented Mar 16, 2018

you never answered my question of why you are adding a new implementation and not using the current one.

@TomAugspurger


Contributor

TomAugspurger commented Mar 16, 2018

I removed the new implementation. #20300

@TomAugspurger


Contributor

TomAugspurger commented Mar 16, 2018

Was that your last objection?

@jreback jreback merged commit 92524f5 into pandas-dev:master Mar 16, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@jreback


Contributor

jreback commented Mar 16, 2018

thanks @TomAugspurger, no didn't see the removal, but looking good now

@TomAugspurger TomAugspurger deleted the TomAugspurger:fu1+fillna branch May 2, 2018
