Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: non-numeric fillna #20300

Open
TomAugspurger opened this issue Mar 12, 2018 · 10 comments
Open

PERF: non-numeric fillna #20300

TomAugspurger opened this issue Mar 12, 2018 · 10 comments
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance

Comments

@TomAugspurger
Copy link
Contributor

Split from a35f93c

# ffill / bfill
# The basic idea is to create an array of integer positions.
# Internally, we use iNaT and the datetime filling routines
# to avoid floating-point NaN. Once filled, we take on `self`
# to get the actual values.
func = pad_1d if method == 'pad' else backfill_1d
idx = np.arange(len(self), dtype='int64')
idx[mask] = iNaT
idx = _ensure_platform_int(func(idx, mask=mask,
                                limit=limit,
                                dtype='datetime64[ns]'))
idx[idx == iNaT] = -1  # missing value marker for take.
new_values = self.take(idx)

self is a non-numeric array-like thing.

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Difficulty Intermediate labels Mar 12, 2018
@jorisvandenbossche
Copy link
Member

What also might be useful, thinking about the case of geopandas: somehow an advanced but officially public fillna function that works on numeric data. In geopandas I have integer data under the hood I could forward or backward fill or fill with a value. So inside a GeometryArray.fillna method I could pass those data to such a fillna function together with the specified method/value/limit keywords, and then wrap the result of that in again into GeometryArray before returning.
Similar pattern could be useful for PeriodArray.

@WillAyd
Copy link
Member

WillAyd commented Mar 27, 2018

Might be missing the larger context here but I recently implemented a Cythonized fill method for groupby objects that writes an array of indexes.

def group_fillna_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,

A subsequent call to algorithms.take_nd allows you to get backed "filled" values in a performant way agnostic of the object type - is that in the direction of what you are looking for here?

@TomAugspurger
Copy link
Contributor Author

Yes, I think that looks about right, though I think it there would need to be an option to disable sorting for it to be usable outside a groupby context.

@WillAyd
Copy link
Member

WillAyd commented Mar 27, 2018

The sorting in the GroupBy implementation is only required to ensure that group items are evaluated together. A non-groupby implementation would be simpler and to your point not even not the sort that's in that code

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 28, 2018 via email

@WillAyd
Copy link
Member

WillAyd commented Mar 28, 2018

I don't think they are that far off. In fact (though rather verbose) you could hack the groupby implementation as is to use with a Series

In [25]: from pandas._libs import groupby as libgroupby
In [25]: import pandas.core.algorithms as algorithms
In [25]: ser = pd.Series([1., np.nan, np.nan, 2, np.nan, 3, np.nan])
In [25]: idxes = np.zeros_like(ser, dtype=np.int64)  # positions to pass later to take
In [25]: dummy_labs = np.zeros_like(ser, dtype=np.int64)  # just need the same val for all items
In [25]: mask = np.isnan(ser.values).view(np.uint8)  # mask of NA vals
In [25]: libgroupby.group_fillna_indexer(idxes, dummy_labs, mask, 'ffill', -1)

In [25]: idxes
Out[25]: array([0, 0, 0, 3, 3, 5, 5])

In [27]: algorithms.take_nd(ser.values, idxes)
Out[27]: array([1., 1., 1., 2., 2., 3., 3.])

@WillAyd
Copy link
Member

WillAyd commented Mar 29, 2018

I don't want to open a pandora's box with this comment but this also has me thinking about our general approach to GroupBy and Series/DataFrame algorithms. The former tend to be copy/paste of the latter with a slight twist to account for Group labels, but I'm wondering if there's not a way for really all of the algorithms to be shared, with the Series/DataFrame ones providing either None or an empty array for Group labels.

It's certainly no small undertaking, but it would:

Food for thought

@jorisvandenbossche
Copy link
Member

Not about the bigger issue you raise, but about this specific algorithm. It seems that the existing pad_1d algorithm (what Tom posted in the top post) is somewhat faster as the groupby one:

In [36]: ser = pd.Series(np.arange(100000))

In [37]: ser[np.random.randint(0, 100000, 10000)] = np.nan

In [38]: %%timeit
    ...: idx = np.arange(len(ser), dtype='int64')
    ...: mask = np.isnan(ser.values)
    ...: pad_1d(idx, mask=mask, dtype='datetime64[ns]')
    ...: idx[idx == iNaT] = -1
    ...: 
435 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %%timeit
    ...: idxes = np.zeros_like(ser, dtype=np.int64)  # positions to pass later to take
    ...: dummy_labs = np.zeros_like(ser, dtype=np.int64)  # just need the same val for all items
    ...: mask = np.isnan(ser.values).view(np.uint8)  # mask of NA vals
    ...: libgroupby.group_fillna_indexer(idxes, dummy_labs, mask, 'ffill', -1)
    ...: 
1.81 ms ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(of course I don't know how much overhead there is by the fact it needs to do something with the grouping. But just saying that for the original raised issue, we actually already have all cython infrastructure available, it is more making a higher level function wrapping the above, to make the functionality more easily available)

@WillAyd
Copy link
Member

WillAyd commented Mar 29, 2018

@jorisvandenbossche that's interesting, though I'd wonder if just the algos implementation is coded more efficiently and some of that could be ported over to the groupby. I can't imagine the labels (since they are all missing anyway) should add that much overhead, but that would have to be profiled in more detail

@TomAugspurger
Copy link
Contributor Author

DatetimeLikeArrayMixin would benefit from this (or a _values_for_fillna) as well.

@mroeschke mroeschke removed the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

5 participants