Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: ExtensionArray.fillna #19909

Merged
merged 26 commits into from
Mar 16, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
74614cb
ENH: ExtensionArray.fillna
TomAugspurger Feb 22, 2018
67a19ba
REF: cast to object
TomAugspurger Feb 27, 2018
280ac94
Revert "REF: cast to object"
TomAugspurger Feb 27, 2018
6705712
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Feb 28, 2018
69a0f9b
Simpler implementation
TomAugspurger Feb 28, 2018
4d6846b
Support limit
TomAugspurger Feb 28, 2018
f3b81dc
Test backfill with limit
TomAugspurger Feb 28, 2018
70efbb8
BUG: ensure array-like for indexer
TomAugspurger Feb 28, 2018
8fc3851
Refactor tolist
TomAugspurger Mar 1, 2018
39096e5
Test Series[ea].tolist
TomAugspurger Mar 1, 2018
9e44f88
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 1, 2018
e902f18
Fixed Categorical unwrapping
TomAugspurger Mar 1, 2018
17126e6
updated
TomAugspurger Mar 2, 2018
1106ef2
Back to getvalues list
TomAugspurger Mar 2, 2018
744c381
Simplified
TomAugspurger Mar 2, 2018
9a3fa55
As a method
TomAugspurger Mar 2, 2018
2bc43bc
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 3, 2018
f22fd7b
Removed tolist fully
TomAugspurger Mar 3, 2018
c26d261
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 5, 2018
b342efe
Linting
TomAugspurger Mar 5, 2018
a609f48
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 12, 2018
a35f93c
Just apply to objects
TomAugspurger Mar 12, 2018
3f78dec
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 12, 2018
1160e15
Linting
TomAugspurger Mar 13, 2018
c9c7a48
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 14, 2018
05fced6
Merge remote-tracking branch 'upstream/master' into fu1+fillna
TomAugspurger Mar 15, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -843,6 +843,7 @@ Categorical
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with scalar and ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19565`)
- Bug in ``Categorical.__iter__`` not converting to Python types (:issue:`19909`)
- Bug in :func:`pandas.factorize` returning the unique codes for the ``uniques``. This now returns a ``Categorical`` with the same dtype as the input (:issue:`19721`)
- Bug in :func:`pandas.factorize` including an item for missing values in the ``uniques`` return value (:issue:`19721`)

Expand Down
54 changes: 54 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -236,6 +236,59 @@ def isna(self):
"""
raise AbstractMethodError(self)

def fillna(self, value=None, method=None, limit=None):
""" Fill NA/NaN values using the specified method.

Parameters
----------
value : scalar, array-like
If a scalar value is passed it is used to fill all missing values.
Alternatively, an array-like 'value' can be given. It's expected
that the array-like have the same length as 'self'.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled.

Returns
-------
filled : ExtensionArray with NA/NaN filled
"""
from pandas.api.types import is_scalar
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A putmask type method would be extremely useful here.

I'll see what I can do to simplify this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would that be different with array[mask] = values ?

from pandas.util._validators import validate_fillna_kwargs
from pandas.core.missing import pad_1d, backfill_1d

value, method = validate_fillna_kwargs(value, method)

mask = self.isna()

if not is_scalar(value):
if len(value) != len(self):
raise ValueError("Length of 'value' does not match. Got ({}) "
" expected {}".format(len(value), len(self)))
value = value[mask]

if mask.any():
if method is not None:
func = pad_1d if method == 'pad' else backfill_1d
new_values = func(self.astype(object), limit=limit,
mask=mask)
new_values = self._constructor_from_sequence(new_values)
else:
# fill with value
new_values = self.copy()
new_values[mask] = value
else:
new_values = self.copy()
return new_values

def unique(self):
"""Compute the ExtensionArray of unique values.

Expand Down Expand Up @@ -285,6 +338,7 @@ def take(self, indexer, allow_fill=True, fill_value=None):
.. code-block:: python

def take(self, indexer, allow_fill=True, fill_value=None):
indexer = np.asarray(indexer)
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = np.nan # NA for this type
Expand Down
14 changes: 6 additions & 8 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -480,9 +480,7 @@ def tolist(self):
(for str, int, float) or a pandas scalar
(for Timestamp/Timedelta/Interval/Period)
"""
if is_datetimelike(self.categories):
return [com._maybe_box_datetimelike(x) for x in self]
return np.array(self).tolist()
return list(self)

@property
def base(self):
Expand Down Expand Up @@ -1592,16 +1590,16 @@ def fillna(self, value=None, method=None, limit=None):

Parameters
----------
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
value : scalar, dict, Series
If a scalar value is passed it is used to fill all missing values.
Alternatively, a Series or dict can be used to fill in different
values for each index. The value should not be a list. The
value(s) passed should either be in the categories or should be
NaN.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
limit : int, default None
(Not implemented yet for Categorical!)
If method is specified, this is the maximum number of consecutive
Expand Down Expand Up @@ -1717,7 +1715,7 @@ def __len__(self):

def __iter__(self):
"""Returns an Iterator over the values of this Categorical."""
return iter(self.get_values())
return iter(self.get_values().tolist())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this does self.get_values().tolist() instead of directly self.tolist() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to avoid an infinite loop.

Categorical.tolist calls tolist which calls Categorical.__iter__.

Need to do get_values break that cycle / extract the python scalar types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just feels that we are doing to much work. To iterate, we first create an array, iterate through it to create a list, and then iterate through the list ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So get_values either returns a array or a Datetime(like)Index. For the array we can just iterate over it, but for the Datetime(like)Index I think as well . The tolist implementation there is list(self.astype(object))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that previous comment was of course wrong, as that is exactly checking the get_values().tolist()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, therefore we need the tolist.
But then moving the logic is maybe not best, as then tolist would consume an iterator coming from a list ..

It's all just complex with the different unboxing depending on the type :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified slightly if you want to take one last look. Basically, we didn't have to worry about the different unboxing for different types, since Categorical.get_values() with datetimes returns a DatetimeIndex, and when we .tolist() on that we get the right unboxing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the whole .get_values() fiasco (which I created a while back :). I think should be addressed sooner rather than later.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any issues with the current implementation? Currently there's a tradeoff between how expensive it = iter(Categorical) is vs. next(it).

Somewhere, we have to go from numpy scalars to Python scalars. The fastest way to do that is with tolist(), but that has upfront memory overhead. We could avoid that by doing it in the next, but that has time overhead for

  1. converting to Python scalars elementwise
  2. An extra call to CategoricalIndex.categories.getitem per element

so that next(it) becomes much slower. I don't think there's much more to be done right now.


def _tidy_repr(self, max_vals=10, footer=True):
""" a short repr displaying only max_vals and an optional (but default
Expand Down
7 changes: 4 additions & 3 deletions pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,10 @@
from pandas.core.dtypes.missing import isna
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries, ABCIndexClass
from pandas.core.dtypes.common import (
is_datetimelike,
is_object_dtype,
is_list_like,
is_scalar,
is_datetimelike,
is_extension_type,
is_extension_array_dtype)

Expand Down Expand Up @@ -826,9 +826,10 @@ def tolist(self):
--------
numpy.ndarray.tolist
"""

if is_datetimelike(self):
if is_datetimelike(self._values):
return [com._maybe_box_datetimelike(x) for x in self._values]
elif is_extension_array_dtype(self._values):
return list(self._values)
else:
return self._values.tolist()

Expand Down
3 changes: 2 additions & 1 deletion pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@
from .common import (_ensure_object, is_bool, is_integer, is_float,
is_complex, is_datetimetz, is_categorical_dtype,
is_datetimelike,
is_extension_type, is_object_dtype,
is_extension_type,
is_object_dtype,
is_datetime64tz_dtype, is_datetime64_dtype,
is_datetime64_ns_dtype,
is_timedelta64_dtype, is_timedelta64_ns_dtype,
Expand Down
38 changes: 17 additions & 21 deletions pandas/core/internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -1963,6 +1963,23 @@ def concat_same_type(self, to_concat, placement=None):
return self.make_block_same_class(values, ndim=self.ndim,
placement=placement)

def fillna(self, value, limit=None, inplace=False, downcast=None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved from Categorical.fillna with changes

1.) Removed check for limit, since that's done in values.fillna
2.) Removed _try_coerce_result
For CategoricalBlock, this was unnecessary since
Categorical.fillna always returns a Categorical
3.) Used make_block_same_class
This limits ExtensionArray.fillna to not change the type
of the array / block, which I think is a good thing.

mgr=None):
values = self.values if inplace else self.values.copy()
values = values.fillna(value=value, limit=limit)
return [self.make_block_same_class(values=values,
placement=self.mgr_locs,
ndim=self.ndim)]

def interpolate(self, method='pad', axis=0, inplace=False, limit=None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just a move from CategoricalBlock to ExtensionBlock, no changes.

fill_value=None, **kwargs):

values = self.values if inplace else self.values.copy()
return self.make_block_same_class(
values=values.fillna(value=fill_value, method=method,
limit=limit),
placement=self.mgr_locs)


class NumericBlock(Block):
__slots__ = ()
Expand Down Expand Up @@ -2522,27 +2539,6 @@ def _try_coerce_result(self, result):

return result

def fillna(self, value, limit=None, inplace=False, downcast=None,
mgr=None):
# we may need to upcast our fill to match our dtype
if limit is not None:
raise NotImplementedError("specifying a limit for 'fillna' has "
"not been implemented yet")

values = self.values if inplace else self.values.copy()
values = self._try_coerce_result(values.fillna(value=value,
limit=limit))
return [self.make_block(values=values)]

def interpolate(self, method='pad', axis=0, inplace=False, limit=None,
fill_value=None, **kwargs):

values = self.values if inplace else self.values.copy()
return self.make_block_same_class(
values=values.fillna(fill_value=fill_value, method=method,
limit=limit),
placement=self.mgr_locs)

def shift(self, periods, axis=0, mgr=None):
return self.make_block_same_class(values=self.values.shift(periods),
placement=self.mgr_locs)
Expand Down
17 changes: 15 additions & 2 deletions pandas/tests/categorical/test_dtypes.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
# -*- coding: utf-8 -*-

import pytest

import numpy as np

import pandas.util.testing as tm
from pandas.core.dtypes.dtypes import CategoricalDtype
from pandas import Categorical, Index, CategoricalIndex, Series
from pandas.compat import long
from pandas import Categorical, Index, CategoricalIndex, Series, Timestamp


class TestCategoricalDtypes(object):
Expand Down Expand Up @@ -161,3 +161,16 @@ def test_astype_category(self, dtype_ordered, cat_ordered):
result = cat.astype('category')
expected = cat
tm.assert_categorical_equal(result, expected)

def test_iter_python_types(self):
# GH-19909
# TODO(Py2): Remove long
cat = Categorical([1, 2])
assert isinstance(list(cat)[0], (int, long))
assert isinstance(cat.tolist()[0], (int, long))

def test_iter_python_types_datetime(self):
cat = Categorical([Timestamp('2017-01-01'),
Timestamp('2017-01-02')])
assert isinstance(list(cat)[0], Timestamp)
assert isinstance(cat.tolist()[0], Timestamp)
5 changes: 5 additions & 0 deletions pandas/tests/extension/base/casting.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,8 @@ def test_astype_object_series(self, all_data):
ser = pd.Series({"A": all_data})
result = ser.astype(object)
assert isinstance(result._data.blocks[0], ObjectBlock)

def test_tolist(self, data):
result = pd.Series(data).tolist()
expected = list(data)
assert result == expected
69 changes: 69 additions & 0 deletions pandas/tests/extension/base/missing.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
import numpy as np
import pytest

import pandas as pd
import pandas.util.testing as tm
Expand Down Expand Up @@ -45,3 +46,71 @@ def test_dropna_frame(self, data_missing):
result = df.dropna()
expected = df.iloc[:0]
self.assert_frame_equal(result, expected)

def test_fillna_limit_pad(self, data_missing):
arr = data_missing.take([1, 0, 0, 0, 1])
result = pd.Series(arr).fillna(method='ffill', limit=2)
expected = pd.Series(data_missing.take([1, 1, 1, 0, 1]))
self.assert_series_equal(result, expected)

def test_fillna_limit_backfill(self, data_missing):
arr = data_missing.take([1, 0, 0, 0, 1])
result = pd.Series(arr).fillna(method='backfill', limit=2)
expected = pd.Series(data_missing.take([1, 0, 1, 1, 1]))
self.assert_series_equal(result, expected)

def test_fillna_series(self, data_missing):
fill_value = data_missing[1]
ser = pd.Series(data_missing)

result = ser.fillna(fill_value)
expected = pd.Series(type(data_missing)([fill_value, fill_value]))
self.assert_series_equal(result, expected)

# Fill with a series
result = ser.fillna(expected)
self.assert_series_equal(result, expected)

# Fill with a series not affecting the missing values
result = ser.fillna(ser)
self.assert_series_equal(result, ser)

@pytest.mark.parametrize('method', ['ffill', 'bfill'])
def test_fillna_series_method(self, data_missing, method):
fill_value = data_missing[1]

if method == 'ffill':
data_missing = type(data_missing)(data_missing[::-1])

result = pd.Series(data_missing).fillna(method=method)
expected = pd.Series(type(data_missing)([fill_value, fill_value]))

self.assert_series_equal(result, expected)

def test_fillna_frame(self, data_missing):
fill_value = data_missing[1]

result = pd.DataFrame({
"A": data_missing,
"B": [1, 2]
}).fillna(fill_value)

expected = pd.DataFrame({
"A": type(data_missing)([fill_value, fill_value]),
"B": [1, 2],
})

self.assert_frame_equal(result, expected)

def test_fillna_fill_other(self, data):
result = pd.DataFrame({
"A": data,
"B": [np.nan] * len(data)
}).fillna({"B": 0.0})

expected = pd.DataFrame({
"A": data,
"B": [0.0] * len(result),
})

self.assert_frame_equal(result, expected)
9 changes: 8 additions & 1 deletion pandas/tests/extension/category/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,14 @@ def test_getitem_scalar(self):


class TestMissing(base.BaseMissingTests):
pass

@pytest.mark.skip(reason="Not implemented")
def test_fillna_limit_pad(self):
pass

@pytest.mark.skip(reason="Not implemented")
def test_fillna_limit_backfill(self):
pass


class TestMethods(base.BaseMethodsTests):
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/extension/decimal/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ def isna(self):
return np.array([x.is_nan() for x in self.values])

def take(self, indexer, allow_fill=True, fill_value=None):
indexer = np.asarray(indexer)
mask = indexer == -1

indexer = _ensure_platform_int(indexer)
Expand Down
Loading