Cythonized GroupBy Quantile #20405

Merged: 65 commits, merged Feb 28, 2019. Changes shown from 52 commits.

Commits (65):
618ec99
Reorganized existing describe test
WillAyd Mar 15, 2018
74871d8
Added quantile tests and impl
WillAyd Mar 15, 2018
7b6ca68
Broken impl and doc updates
WillAyd Mar 15, 2018
31aff03
Working impl with non-missing; more tests
WillAyd Mar 16, 2018
4a43815
DOC: update the Index.isin docstring (#20249)
noemielteto Mar 18, 2018
eb18823
Working impl with NA data
WillAyd Mar 18, 2018
813da81
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Mar 18, 2018
e152dd5
Added check_names arg to failing tests
WillAyd Mar 18, 2018
7a8fefb
Added tests for dt, object raises
WillAyd Mar 18, 2018
b4938ba
Added interpolation keyword support
WillAyd Mar 19, 2018
3f7d0a9
LINT fix
WillAyd Mar 19, 2018
d7aec3f
Updated benchmarks
WillAyd Mar 19, 2018
e712946
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 7, 2018
72cd30e
Removed errant git diff
WillAyd Aug 7, 2018
a3c4b11
Removed errant pd file
WillAyd Aug 7, 2018
ac96526
Fixed broken function tests
WillAyd Aug 7, 2018
7d439d8
Added check_names=False to tests
WillAyd Aug 7, 2018
3047eed
Py27 compat
WillAyd Aug 7, 2018
70bf89a
LINT fixup
WillAyd Aug 7, 2018
02eb336
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 7, 2018
7c3c349
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 13, 2018
3b9c7c4
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 13, 2018
ad8b184
Replaced double with float64
WillAyd Nov 13, 2018
b846bc2
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 15, 2018
93b122c
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 19, 2018
09308d4
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 24, 2018
1a718f2
Fixed segfault on all NA group
WillAyd Nov 24, 2018
ff062bd
Stylistic and idiomatic test updates
WillAyd Nov 24, 2018
bdb5089
LINT fixup
WillAyd Nov 24, 2018
9b55fb5
Added cast to remove build warning
WillAyd Nov 24, 2018
31e66fc
Used memoryview.shape instead of len
WillAyd Nov 24, 2018
41a734f
Use pytest.raises
WillAyd Nov 24, 2018
67e0f00
Better Cython types
WillAyd Nov 24, 2018
07b0c00
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 26, 2018
86aeb4a
Loosened test expectation on Windows
WillAyd Nov 26, 2018
86b9d8d
Used api types
WillAyd Nov 26, 2018
cfa1b45
Removed test hacks
WillAyd Nov 26, 2018
00085d0
Used is_object_dtype
WillAyd Nov 27, 2018
1f02532
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Dec 25, 2018
3c64c1f
Removed loosened check on agg_result
WillAyd Dec 25, 2018
09695f5
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 9, 2019
68cfed9
isort fixup
WillAyd Jan 9, 2019
4ce1448
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 11, 2019
5e840da
Removed nonlocal variable usage
WillAyd Jan 11, 2019
7969fb6
Updated documentation
WillAyd Jan 11, 2019
f9a8317
LINT fixup
WillAyd Jan 11, 2019
464a831
Reverted errant whatsnew
WillAyd Jan 11, 2019
4b3f9be
Refactor processor signatures
WillAyd Jan 11, 2019
b996e1d
Documentation updates
WillAyd Jan 11, 2019
cdd8985
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 22, 2019
64f46a3
Added empty assignment for variable
WillAyd Jan 22, 2019
4d88e8a
Docstring fixup
WillAyd Jan 22, 2019
1cd93dd
Updated README
WillAyd Jan 22, 2019
9ae23c1
Pytest arg deprecation fix
WillAyd Jan 26, 2019
eb99f07
Removed test_describe test
WillAyd Jan 31, 2019
94d4892
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 31, 2019
0512f37
Moved whatsnew
WillAyd Jan 31, 2019
2370129
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 2, 2019
a018570
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 12, 2019
f41cd05
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 20, 2019
21691bb
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 27, 2019
082aea3
LINT fixup
WillAyd Feb 27, 2019
dc5877a
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 27, 2019
7496a9b
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 28, 2019
ec013bf
LINT fixup
WillAyd Feb 28, 2019
@@ -14,7 +14,7 @@
method_blacklist = {
'object': {'median', 'prod', 'sem', 'cumsum', 'sum', 'cummin', 'mean',
'max', 'skew', 'cumprod', 'cummax', 'rank', 'pct_change', 'min',
'var', 'mad', 'describe', 'std'},
'var', 'mad', 'describe', 'std', 'quantile'},
'datetime': {'median', 'prod', 'sem', 'cumsum', 'sum', 'mean', 'skew',
'cumprod', 'cummax', 'pct_change', 'var', 'mad', 'describe',
'std'}
@@ -316,8 +316,9 @@ class GroupByMethods(object):
['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',
'std', 'sum', 'tail', 'unique', 'value_counts', 'var'],
'pct_change', 'prod', 'quantile', 'rank', 'sem', 'shift',
'size', 'skew', 'std', 'sum', 'tail', 'unique', 'value_counts',
'var'],
['direct', 'transformation']]

def setup(self, dtype, method, application):
@@ -424,6 +424,7 @@ Other Enhancements
- :func:`pandas.DataFrame.to_sql` has gained the ``method`` argument to control SQL insertion clause. See the :ref:`insertion method <io.sql.method>` section in the documentation. (:issue:`8953`)
- :meth:`DataFrame.corrwith` now supports Spearman's rank correlation, Kendall's tau as well as callable correlation methods. (:issue:`21925`)
- :meth:`DataFrame.to_json`, :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_XXX` etc. now support tilde (~) in the path argument. (:issue:`23473`)
- :class:`pandas.core.groupby.GroupBy` now has a built-in implementation of :meth:`pandas.core.groupby.GroupBy.quantile`. (:issue:`20405`)
This conversation was marked as resolved by WillAyd

jreback (Contributor) · Jan 22, 2019:
this is not necessary here; instead put in a performance note


.. _whatsnew_0240.api_breaking:

@@ -0,0 +1,6 @@
cdef enum InterpolationEnumType:
This conversation was marked as resolved by jreback

jreback (Contributor) · Jan 9, 2019:
IIRC we reverted using enum in other versions (because this would require an additional install in py2)?

WillAyd (Author, Member) · Jan 9, 2019:
Hmm, we have something similar in algos:

cdef enum TiebreakEnumType:

Do you know where that issue would have popped up?

chris-b1 (Contributor) · Jan 9, 2019:
It was #24170. This is different because it's a cdef enum (which compiles out to a C enum); there we were using the enum package from the Python stdlib.

jreback (Contributor) · Jan 10, 2019:
ok great.

INTERPOLATION_LINEAR,
INTERPOLATION_LOWER,
INTERPOLATION_HIGHER,
INTERPOLATION_NEAREST,
INTERPOLATION_MIDPOINT
@@ -382,5 +382,105 @@ def group_any_all(uint8_t[:] out,
out[lab] = flag_val


@cython.boundscheck(False)
@cython.wraparound(False)
def group_quantile(ndarray[float64_t] out,
ndarray[int64_t] labels,
numeric[:] values,
ndarray[uint8_t] mask,
float64_t q,
object interpolation):
"""
Calculate the quantile per group.
Parameters
----------
out : ndarray
Array of aggregated values that will be written to.
labels : ndarray
Array containing the unique group labels.
values : ndarray
Array containing the values to apply the function against.
q : float
The quantile value to search for.
Notes
-----
Rather than explicitly returning a value, this function modifies the
provided `out` parameter.
"""
cdef:
Py_ssize_t i, N=len(labels), ngroups, grp_sz, non_na_sz
Py_ssize_t grp_start=0, idx=0
int64_t lab
uint8_t interp
float64_t q_idx, frac, val, next_val
ndarray[int64_t] counts, non_na_counts, sort_arr

assert values.shape[0] == N
inter_methods = {
'linear': INTERPOLATION_LINEAR,
'lower': INTERPOLATION_LOWER,
'higher': INTERPOLATION_HIGHER,
'nearest': INTERPOLATION_NEAREST,
'midpoint': INTERPOLATION_MIDPOINT,
}
interp = inter_methods[interpolation]

counts = np.zeros_like(out, dtype=np.int64)
non_na_counts = np.zeros_like(out, dtype=np.int64)
ngroups = len(counts)

# First figure out the size of every group
with nogil:
for i in range(N):
lab = labels[i]
counts[lab] += 1
if not mask[i]:
non_na_counts[lab] += 1

# Get an index of values sorted by labels and then values
order = (values, labels)
sort_arr = np.lexsort(order).astype(np.int64, copy=False)

with nogil:
for i in range(ngroups):
# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]

if non_na_sz == 0:
out[i] = NaN
else:
# Calculate where to retrieve the desired value
# Casting to int will intentionally truncate result
idx = grp_start + <int64_t>(q * <float64_t>(non_na_sz - 1))

val = values[sort_arr[idx]]
# If requested quantile falls evenly on a particular index
# then write that index's value out. Otherwise interpolate
q_idx = q * (non_na_sz - 1)
frac = q_idx % 1

if frac == 0.0 or interp == INTERPOLATION_LOWER:
out[i] = val
else:
next_val = values[sort_arr[idx + 1]]
if interp == INTERPOLATION_LINEAR:
out[i] = val + (next_val - val) * frac
elif interp == INTERPOLATION_HIGHER:
out[i] = next_val
elif interp == INTERPOLATION_MIDPOINT:
out[i] = (val + next_val) / 2.0
elif interp == INTERPOLATION_NEAREST:
if frac > .5 or (frac == .5 and q > .5): # Always OK?
out[i] = next_val
else:
out[i] = val

# Increment the index reference in sorted_arr for the next group
grp_start += grp_sz


# generated from template
include "groupby_helper.pxi"
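The nogil loop above can be mirrored in plain NumPy to make the indexing logic easier to follow. This is a hedged sketch (`group_quantile_sketch` is an illustrative name, not a pandas function): one lexsort over (values, labels) makes each group's values contiguous and sorted, then the truncated index plus the fractional remainder drives the interpolation, exactly as in the Cython version. NaN handling relies on NumPy sorting NaNs last within each group.

```python
import numpy as np

def group_quantile_sketch(values, labels, q, interpolation="linear"):
    """NumPy sketch of the Cython group_quantile logic (illustrative only)."""
    values = np.asarray(values, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    ngroups = int(labels.max()) + 1

    # lexsort: the LAST key is primary, so rows group by label, then by value;
    # NaNs sort to the end of each group's slice
    order = np.lexsort((values, labels))

    counts = np.bincount(labels, minlength=ngroups)
    non_na_counts = np.bincount(
        labels, weights=~np.isnan(values), minlength=ngroups).astype(np.int64)

    out = np.empty(ngroups, dtype=np.float64)
    grp_start = 0
    for i in range(ngroups):
        non_na_sz = non_na_counts[i]
        if non_na_sz == 0:
            out[i] = np.nan
        else:
            q_idx = q * (non_na_sz - 1)
            idx = grp_start + int(q_idx)  # truncation is intentional
            frac = q_idx % 1
            val = values[order[idx]]
            if frac == 0.0 or interpolation == "lower":
                out[i] = val
            else:
                next_val = values[order[idx + 1]]
                if interpolation == "linear":
                    out[i] = val + (next_val - val) * frac
                elif interpolation == "higher":
                    out[i] = next_val
                elif interpolation == "midpoint":
                    out[i] = (val + next_val) / 2.0
                elif interpolation == "nearest":
                    out[i] = next_val if (frac > .5 or (frac == .5 and q > .5)) else val
        grp_start += counts[i]
    return out
```

Note the design choice this makes visible: sorting once up front costs O(N log N), but every per-group lookup afterwards is O(1), which is what lets the Cython loop run under nogil.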
@@ -29,6 +29,8 @@ class providing the base-class of operations.
ensure_float, is_extension_array_dtype, is_numeric_dtype, is_scalar)
from pandas.core.dtypes.missing import isna, notna

from pandas.api.types import (
is_datetime64_dtype, is_integer_dtype, is_object_dtype)
import pandas.core.algorithms as algorithms
from pandas.core.base import (
DataError, GroupByError, PandasObject, SelectionMixin, SpecificationError)
@@ -1024,15 +1026,17 @@ def _bool_agg(self, val_test, skipna):
"""

def objs_to_bool(vals):
try:
vals = vals.astype(np.bool)
except ValueError: # for objects
# type: np.ndarray -> (np.ndarray, typing.Type)
if is_object_dtype(vals):
vals = np.array([bool(x) for x in vals])
else:
vals = vals.astype(np.bool)

return vals.view(np.uint8)
return vals.view(np.uint8), np.bool

def result_to_bool(result):
return result.astype(np.bool, copy=False)
def result_to_bool(result, inference):
# type: (np.ndarray, typing.Type) -> np.ndarray
return result.astype(inference, copy=False)

return self._get_cythonized_result('group_any_all', self.grouper,
aggregate=True,
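The refactor above changes both processors to thread an inferred type from pre- to post-processing. A minimal standalone sketch of that pattern (pure NumPy, names mirror the diff but this is not the pandas internals; `np.bool_` is used in place of the now-deprecated `np.bool`):

```python
import numpy as np

def objs_to_bool(vals):
    # type: (np.ndarray) -> (np.ndarray, type)
    # Object arrays need element-wise truthiness; others can cast directly.
    if vals.dtype == object:
        vals = np.array([bool(x) for x in vals])
    else:
        vals = vals.astype(np.bool_)
    # Return the values plus the dtype the result should be cast back to.
    return vals.view(np.uint8), np.bool_

def result_to_bool(result, inference):
    # type: (np.ndarray, type) -> np.ndarray
    return result.astype(inference, copy=False)

# The caller threads `inference` from the pre- to the post-processor:
vals, inference = objs_to_bool(np.array([0, 1, 2]))
result = result_to_bool(vals, inference)
```

The tuple return is what lets a single generic `_get_cythonized_result` restore the caller-facing dtype without hard-coding it per operation.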
@@ -1688,6 +1692,75 @@ def nth(self, n, dropna=None):

return result

def quantile(self, q=0.5, interpolation='linear'):
"""
Return group values at the given quantile, a la numpy.percentile.
Parameters
----------
q : float or array-like, default 0.5 (50% quantile)
Value(s) between 0 and 1 providing the quantile(s) to compute.
interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
Method to use when the desired quantile falls between two points.
Returns
-------
Series or DataFrame
Return type determined by caller of GroupBy object.
See Also
--------
Series.quantile : Similar method for Series.
DataFrame.quantile : Similar method for DataFrame.
numpy.percentile : NumPy method to compute qth percentile.
Examples
--------
>>> df = pd.DataFrame([
... ['a', 1], ['a', 2], ['a', 3],
... ['b', 1], ['b', 3], ['b', 5]
... ], columns=['key', 'val'])
>>> df.groupby('key').quantile()
val
key
a 2.0
b 3.0
"""

def pre_processor(vals):
# type: np.ndarray -> (np.ndarray, Optional[typing.Type])
if is_object_dtype(vals):
raise TypeError("'quantile' cannot be performed against "
"'object' dtypes!")

inference = None
if is_integer_dtype(vals):
inference = np.int64
elif is_datetime64_dtype(vals):
inference = 'datetime64[ns]'
vals = vals.astype(np.float)

return vals, inference

def post_processor(vals, inference):
# type: (np.ndarray, Optional[typing.Type]) -> np.ndarray
if inference:
# Check for edge case
if not (is_integer_dtype(inference) and
interpolation in {'linear', 'midpoint'}):
vals = vals.astype(inference)

return vals

return self._get_cythonized_result('group_quantile', self.grouper,
aggregate=True,
needs_values=True,
needs_mask=True,
cython_dtype=np.float64,
pre_processing=pre_processor,
post_processing=post_processor,
q=q, interpolation=interpolation)
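As a usage illustration of the dtype round-trip that `pre_processor`/`post_processor` handle (a sketch, not part of the PR; assumes pandas 0.24+ where this landed): integer input stays float for 'linear' and 'midpoint', since the result can fall between two integers, while 'lower' always selects an existing value.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1, 2, 3, 4]})  # integer dtype

# 'linear' can land between two ints, so the result is float (1.5, 3.5)
linear = df.groupby('key').quantile(0.5, interpolation='linear')

# 'lower' returns existing values (1, 3), so the integer dtype can be restored
lower = df.groupby('key').quantile(0.5, interpolation='lower')
```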

@Substitution(name='groupby')
def ngroup(self, ascending=True):
"""
@@ -1924,10 +1997,16 @@ def _get_cythonized_result(self, how, grouper, aggregate=False,
Whether the result of the Cython operation is an index of
values to be retrieved, instead of the actual values themselves
pre_processing : function, default None
Function to be applied to `values` prior to passing to Cython
Raises if `needs_values` is False
Function to be applied to `values` prior to passing to Cython.
Function should return a tuple where the first element is the
values to be passed to Cython and the second element is an optional
type which the values should be converted to after being returned
by the Cython operation. Raises if `needs_values` is False.
post_processing : function, default None
Function to be applied to result of Cython function
Function to be applied to result of Cython function. Should accept
This conversation was marked as resolved by WillAyd

jreback (Contributor) · Jan 11, 2019:
maybe you want to show the signature as a comment of the Callable[object, dict]

WillAyd (Author, Member) · Jan 11, 2019:
Are you asking to do that in this docstring or as a type hint in the implementations themselves?

jreback (Contributor) · Jan 11, 2019:
both

an array of values as the first argument and type inferences as its
second argument, i.e. the signature should be
(ndarray, typing.Type).

jreback (Contributor) · Jan 13, 2019:
not really what I meant, this is a (ndarray, dict)

WillAyd (Author, Member) · Jan 22, 2019:
Well, on a second pass the dict was unnecessary, so I ended up just returning the type. I suppose it could be either an actual type object or a string, so this is an approximation of that for simplicity, though I could update it to include a string as well.

**kwargs : dict
Extra arguments to be passed back to Cython funcs
@@ -1963,10 +2042,12 @@ def _get_cythonized_result(self, how, grouper, aggregate=False,

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
inferences = None

if needs_values:
This conversation was marked as resolved by WillAyd

jreback (Contributor) · Jan 13, 2019:
just define inference = None at the top, then you won't ever have a NameError

vals = obj.values
if pre_processing:
vals = pre_processing(vals)
vals, inferences = pre_processing(vals)
func = partial(func, vals)

if needs_mask:
@@ -1982,7 +2063,7 @@ def _get_cythonized_result(self, how, grouper, aggregate=False,
result = algorithms.take_nd(obj.values, result)

if post_processing:
result = post_processing(result)
result = post_processing(result, inferences)
This conversation was marked as resolved by WillAyd

WillAyd (Author, Member) · Jan 11, 2019:
FYI this will fail if we try to pass post_processing without pre_processing, as inferences will be undefined in that case. Since it's all internal I suppose it's not a big deal for now, but we could also add an explicit check at the start of the function if it doesn't add too much bloat.


output[name] = result

@@ -696,6 +696,26 @@ def test_is_monotonic_decreasing(in_vals, out_vals):

# describe
# --------------------------------
def test_describe():
df = DataFrame([
[1, 2, 'foo'],
[1, np.nan, 'bar'],
[3, np.nan, 'baz']
], columns=['A', 'B', 'C'])
grp = df.groupby('A')

index = pd.Index([1, 3], name='A')
columns = pd.MultiIndex.from_product([
['B'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']])

expected = pd.DataFrame([
[1.0, 2.0, np.nan, 2.0, 2.0, 2.0, 2.0, 2.0],
[0.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
], index=index, columns=columns)

result = grp.describe()
tm.assert_frame_equal(result, expected)


def test_apply_describe_bug(mframe):
grouped = mframe.groupby(level='first')
@@ -1060,6 +1080,55 @@ def test_size(df):
tm.assert_series_equal(df.groupby('A').size(), out)


# quantile
# --------------------------------

jreback (Contributor) · Feb 2, 2019:
side note: this file is getting pretty big; maybe we should split it up a bit (later)

@pytest.mark.parametrize("interpolation", [
"linear", "lower", "higher", "nearest", "midpoint"])
@pytest.mark.parametrize("a_vals,b_vals", [
# Ints
([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]),
([1, 2, 3, 4], [4, 3, 2, 1]),
([1, 2, 3, 4, 5], [4, 3, 2, 1]),
# Floats
([1., 2., 3., 4., 5.], [5., 4., 3., 2., 1.]),
# Missing data
([1., np.nan, 3., np.nan, 5.], [5., np.nan, 3., np.nan, 1.]),
([np.nan, 4., np.nan, 2., np.nan], [np.nan, 4., np.nan, 2., np.nan]),
# Timestamps
([x for x in pd.date_range('1/1/18', freq='D', periods=5)],
[x for x in pd.date_range('1/1/18', freq='D', periods=5)][::-1]),
# All NA
([np.nan] * 5, [np.nan] * 5),
])
@pytest.mark.parametrize('q', [0, .25, .5, .75, 1])
def test_quantile(interpolation, a_vals, b_vals, q):
if interpolation == 'nearest' and q == 0.5 and b_vals == [4, 3, 2, 1]:
pytest.skip("Unclear numpy expectation for nearest result with "
"equidistant data")

a_expected = pd.Series(a_vals).quantile(q, interpolation=interpolation)
b_expected = pd.Series(b_vals).quantile(q, interpolation=interpolation)

df = DataFrame({
'key': ['a'] * len(a_vals) + ['b'] * len(b_vals),
'val': a_vals + b_vals})

expected = DataFrame([a_expected, b_expected], columns=['val'],
index=Index(['a', 'b'], name='key'))
result = df.groupby('key').quantile(q, interpolation=interpolation)

tm.assert_frame_equal(result, expected)


def test_quantile_raises():
df = pd.DataFrame([
['foo', 'a'], ['foo', 'b'], ['foo', 'c']], columns=['key', 'val'])

with pytest.raises(TypeError, match="cannot be performed against "
"'object' dtypes"):
df.groupby('key').quantile()


# pipe
# --------------------------------
