Cythonized GroupBy Quantile #20405

Merged (65 commits, Feb 28, 2019)

Changes shown below are from 30 of the 65 commits.

Commits (65)
618ec99
Reorganized existing describe test
WillAyd Mar 15, 2018
74871d8
Added quantile tests and impl
WillAyd Mar 15, 2018
7b6ca68
Broken impl and doc updates
WillAyd Mar 15, 2018
31aff03
Working impl with non-missing; more tests
WillAyd Mar 16, 2018
4a43815
DOC: update the Index.isin docstring (#20249)
noemielteto Mar 18, 2018
eb18823
Working impl with NA data
WillAyd Mar 18, 2018
813da81
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Mar 18, 2018
e152dd5
Added check_names arg to failing tests
WillAyd Mar 18, 2018
7a8fefb
Added tests for dt, object raises
WillAyd Mar 18, 2018
b4938ba
Added interpolation keyword support
WillAyd Mar 19, 2018
3f7d0a9
LINT fix
WillAyd Mar 19, 2018
d7aec3f
Updated benchmarks
WillAyd Mar 19, 2018
e712946
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 7, 2018
72cd30e
Removed errant git diff
WillAyd Aug 7, 2018
a3c4b11
Removed errant pd file
WillAyd Aug 7, 2018
ac96526
Fixed broken function tests
WillAyd Aug 7, 2018
7d439d8
Added check_names=False to tests
WillAyd Aug 7, 2018
3047eed
Py27 compat
WillAyd Aug 7, 2018
70bf89a
LINT fixup
WillAyd Aug 7, 2018
02eb336
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 7, 2018
7c3c349
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 13, 2018
3b9c7c4
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 13, 2018
ad8b184
Replaced double with float64
WillAyd Nov 13, 2018
b846bc2
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 15, 2018
93b122c
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 19, 2018
09308d4
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 24, 2018
1a718f2
Fixed segfault on all NA group
WillAyd Nov 24, 2018
ff062bd
Stylistic and idiomatic test updates
WillAyd Nov 24, 2018
bdb5089
LINT fixup
WillAyd Nov 24, 2018
9b55fb5
Added cast to remove build warning
WillAyd Nov 24, 2018
31e66fc
Used memoryview.shape instead of len
WillAyd Nov 24, 2018
41a734f
Use pytest.raises
WillAyd Nov 24, 2018
67e0f00
Better Cython types
WillAyd Nov 24, 2018
07b0c00
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 26, 2018
86aeb4a
Loosened test expectation on Windows
WillAyd Nov 26, 2018
86b9d8d
Used api types
WillAyd Nov 26, 2018
cfa1b45
Removed test hacks
WillAyd Nov 26, 2018
00085d0
Used is_object_dtype
WillAyd Nov 27, 2018
1f02532
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Dec 25, 2018
3c64c1f
Removed loosened check on agg_result
WillAyd Dec 25, 2018
09695f5
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 9, 2019
68cfed9
isort fixup
WillAyd Jan 9, 2019
4ce1448
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 11, 2019
5e840da
Removed nonlocal variable usage
WillAyd Jan 11, 2019
7969fb6
Updated documentation
WillAyd Jan 11, 2019
f9a8317
LINT fixup
WillAyd Jan 11, 2019
464a831
Reverted errant whatsnew
WillAyd Jan 11, 2019
4b3f9be
Refactor processor signatures
WillAyd Jan 11, 2019
b996e1d
Documentation updates
WillAyd Jan 11, 2019
cdd8985
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 22, 2019
64f46a3
Added empty assignment for variable
WillAyd Jan 22, 2019
4d88e8a
Docstring fixup
WillAyd Jan 22, 2019
1cd93dd
Updated README
WillAyd Jan 22, 2019
9ae23c1
Pytest arg deprecation fix
WillAyd Jan 26, 2019
eb99f07
Removed test_describe test
WillAyd Jan 31, 2019
94d4892
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 31, 2019
0512f37
Moved whatsnew
WillAyd Jan 31, 2019
2370129
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 2, 2019
a018570
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 12, 2019
f41cd05
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 20, 2019
21691bb
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 27, 2019
082aea3
LINT fixup
WillAyd Feb 27, 2019
dc5877a
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 27, 2019
7496a9b
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 28, 2019
ec013bf
LINT fixup
WillAyd Feb 28, 2019
7 changes: 4 additions & 3 deletions asv_bench/benchmarks/groupby.py
@@ -12,7 +12,7 @@
method_blacklist = {
'object': {'median', 'prod', 'sem', 'cumsum', 'sum', 'cummin', 'mean',
'max', 'skew', 'cumprod', 'cummax', 'rank', 'pct_change', 'min',
'var', 'mad', 'describe', 'std'},
'var', 'mad', 'describe', 'std', 'quantile'},
'datetime': {'median', 'prod', 'sem', 'cumsum', 'sum', 'mean', 'skew',
'cumprod', 'cummax', 'pct_change', 'var', 'mad', 'describe',
'std'}
@@ -314,8 +314,9 @@ class GroupByMethods(object):
['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',
'std', 'sum', 'tail', 'unique', 'value_counts', 'var'],
'pct_change', 'prod', 'quantile', 'rank', 'sem', 'shift',
'size', 'skew', 'std', 'sum', 'tail', 'unique', 'value_counts',
'var'],
['direct', 'transformation']]

def setup(self, dtype, method, application):
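Editor's note: as context for the benchmark entries added above, a rough ad-hoc timing comparison might look like the sketch below. This is illustrative only, not part of the PR or the asv suite; the group count and data size are arbitrary assumptions, and it presumes a pandas build that includes this change.

# Illustrative only: compare the cythonized GroupBy.quantile against a
# generic per-group apply, which is roughly the pre-existing fallback path.
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n = 10 ** 6
df = pd.DataFrame({'key': rng.randint(0, 100, n),
                   'val': rng.randn(n)})
g = df.groupby('key')

# In IPython, one could compare, e.g.:
# %timeit g.quantile(0.75)                              # cythonized path
# %timeit g['val'].apply(lambda s: s.quantile(0.75))    # per-group Python path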
6 changes: 6 additions & 0 deletions pandas/_libs/groupby.pxd
@@ -0,0 +1,6 @@
cdef enum InterpolationEnumType:
INTERPOLATION_LINEAR,
INTERPOLATION_LOWER,
INTERPOLATION_HIGHER,
INTERPOLATION_NEAREST,
INTERPOLATION_MIDPOINT
101 changes: 101 additions & 0 deletions pandas/_libs/groupby.pyx
@@ -380,5 +380,106 @@ def group_any_all(uint8_t[:] out,
out[lab] = flag_val


@cython.boundscheck(False)
@cython.wraparound(False)
def group_quantile(ndarray[float64_t] out,
ndarray[int64_t] labels,
numeric[:] values,
ndarray[uint8_t] mask,
float64_t q,
object interpolation):
"""
Calculate the quantile per group.

Parameters
----------
out : ndarray
Array of aggregated values that will be written to.
labels : ndarray
Array containing the unique group labels.
values : ndarray
Array containing the values to apply the function against.
q : float
The quantile value to search for.

Notes
-----
Rather than explicitly returning a value, this function modifies the
provided `out` parameter.
"""
cdef:
Py_ssize_t i, N=len(labels)
int64_t lab, ngroups, grp_sz, non_na_sz, grp_start=0, idx=0
uint8_t interp, offset
numeric val, next_val
float64_t q_idx, frac
ndarray[int64_t] counts, non_na_counts
ndarray[int64_t] sort_arr

assert <Py_ssize_t>len(values) == N
Comment (Member Author): This check was throwing a build warning, as the LHS was returning size_t whereas len(labels) was returning Py_ssize_t (eventually replaced with N since that was duplicative). Not sure why len would have returned two different types, though. It may be a difference between ndarray and numeric. @jbrockmendel

Comment (Member): A difference between ndarray and memoryview is plausible. IIRC, Cython suggests using arr.shape[0] instead of len(arr) for memoryviews because the former is done wholly in C-space while the latter makes a Python call. I could imagine that being related to this, but am really just guessing.

Comment (Member Author): Good call - using shape had the same effect as the cast, though it seems more idiomatic. Looks like I have some Windows failures to look at, but I will push that up along with it. Good to know for future reference as well.

inter_methods = {
'linear': INTERPOLATION_LINEAR,
'lower': INTERPOLATION_LOWER,
'higher': INTERPOLATION_HIGHER,
'nearest': INTERPOLATION_NEAREST,
'midpoint': INTERPOLATION_MIDPOINT,
}
interp = inter_methods[interpolation]

counts = np.zeros_like(out, dtype=np.int64)
non_na_counts = np.zeros_like(out, dtype=np.int64)
ngroups = len(counts)

# First figure out the size of every group
with nogil:
for i in range(N):
lab = labels[i]
counts[lab] += 1
if not mask[i]:
non_na_counts[lab] += 1

# Get an index of values sorted by labels and then values
order = (values, labels)
sort_arr = np.lexsort(order).astype(np.int64, copy=False)

with nogil:
for i in range(ngroups):
# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]

if non_na_sz == 0:
out[i] = NaN
else:
# Calculate where to retrieve the desired value
# Casting to int will intentionaly truncate result
idx = grp_start + <int64_t>(q * <float64_t>(non_na_sz - 1))

val = values[sort_arr[idx]]
# If requested quantile falls evenly on a particular index
# then write that index's value out. Otherwise interpolate
q_idx = q * (non_na_sz - 1)
frac = q_idx % 1

if frac == 0.0 or interp == INTERPOLATION_LOWER:
out[i] = val
else:
next_val = values[sort_arr[idx + 1]]
if interp == INTERPOLATION_LINEAR:
out[i] = val + (next_val - val) * frac
elif interp == INTERPOLATION_HIGHER:
out[i] = next_val
elif interp == INTERPOLATION_MIDPOINT:
out[i] = (val + next_val) / 2.0
elif interp == INTERPOLATION_NEAREST:
if frac > .5 or (frac == .5 and q > .5): # Always OK?
out[i] = next_val
else:
out[i] = val

# Increment the index reference in sorted_arr for the next group
grp_start += grp_sz


# generated from template
include "groupby_helper.pxi"
66 changes: 66 additions & 0 deletions pandas/core/groupby/groupby.py
@@ -1670,6 +1670,72 @@ def nth(self, n, dropna=None):

return result

def quantile(self, q=0.5, interpolation='linear'):
"""
Return group values at the given quantile, a la numpy.percentile.

Parameters
----------
q : float or array-like, default 0.5 (50% quantile)
0 <= q <= 1, the quantile(s) to compute
interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
Method to use when the desired quantile falls between two points.

Returns
-------
Series or DataFrame
Return type determined by caller of GroupBy object.

See Also
--------
Series.quantile : Similar method for Series
DataFrame.quantile : Similar method for DataFrame
numpy.percentile : NumPy method to compute qth percentile

Examples
--------
>>> df = pd.DataFrame(
... [['foo'] * 5 + ['bar'] * 5,
... [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]],
... columns=['key', 'val'])
>>> df
"""

inferences = { # TODO (py27): replace with nonlocal
Comment (Contributor): this is pretty janky; I would change it to do this:

def pre_processor(vals):
    ...
    return vals, inferences

def post_processor(vals, inferences):
    ...

basically returning the state (or an empty dict) in the pre and passing it to the post.

Comment (Contributor): you will have to adjust the other callers of this, but it shouldn't be a big deal.
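Editor's note: a minimal sketch of the state-passing pattern suggested above - the pre-processor returns the values together with a dict of inferences, and the post-processor receives that dict back, so no enclosing-scope mutation is needed. The function names mirror the suggestion, but the bodies are hypothetical and not pandas internals.

import numpy as np

def pre_processor(vals):
    # Record what was inferred about the input alongside the converted values
    inferences = {'is_int': np.issubdtype(vals.dtype, np.integer)}
    return vals.astype(np.float64), inferences

def post_processor(vals, inferences):
    # Use the returned state to undo the conversion where appropriate
    if inferences['is_int']:
        vals = vals.astype(np.int64)
    return vals

vals, state = pre_processor(np.array([1, 2, 3]))
result = post_processor(vals, state)  # state round-trips through the caller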

'is_dt': False,
'is_int': False
}

def pre_processor(vals):
if vals.dtype == np.object:
Comment (Contributor): use is_object_dtype

raise TypeError("'quantile' cannot be performed against "
"'object' dtypes!")
elif vals.dtype == np.int:
Comment (Contributor): I think your issue is this check - on Windows, np.int is 32-bit, and it is probably too strict anyway. Use is_integer_dtype instead.

inferences['is_int'] = True
elif vals.dtype == 'datetime64[ns]':
vals = vals.astype(np.float)
inferences['is_dt'] = True

return vals

def post_processor(vals):
if inferences['is_dt']:
vals = vals.astype('datetime64[ns]')
elif inferences['is_int'] and interpolation in [
'lower', 'higher', 'nearest']:
vals = vals.astype(np.int)

return vals

return self._get_cythonized_result('group_quantile', self.grouper,
aggregate=True,
needs_values=True,
needs_mask=True,
cython_dtype=np.float64,
pre_processing=pre_processor,
post_processing=post_processor,
q=q, interpolation=interpolation)

@Substitution(name='groupby')
def ngroup(self, ascending=True):
"""
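Editor's note: the Examples section of the quantile docstring is truncated in this diff, so here is an illustrative usage sketch (assuming a pandas version that includes this feature; the data is arbitrary).

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'val': [1, 2, 3, 1, 3, 5]})

# Median per group: 'a' -> 2.0, 'b' -> 3.0
print(df.groupby('key').quantile(0.5))

# With 'lower' interpolation the result for integer input stays integral,
# because the post_processor above casts back for lower/higher/nearest
print(df.groupby('key').quantile(0.25, interpolation='lower'))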
69 changes: 69 additions & 0 deletions pandas/tests/groupby/test_function.py
@@ -694,6 +694,26 @@ def test_is_monotonic_decreasing(in_vals, out_vals):

# describe
# --------------------------------
def test_describe():
df = DataFrame([
[1, 2, 'foo'],
[1, np.nan, 'bar'],
[3, np.nan, 'baz']
], columns=['A', 'B', 'C'])
grp = df.groupby('A')

index = pd.Index([1, 3], name='A')
columns = pd.MultiIndex.from_product([
['B'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']])

expected = pd.DataFrame([
[1.0, 2.0, np.nan, 2.0, 2.0, 2.0, 2.0, 2.0],
[0.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
], index=index, columns=columns)

result = grp.describe()
tm.assert_frame_equal(result, expected)


def test_apply_describe_bug(mframe):
grouped = mframe.groupby(level='first')
@@ -1055,6 +1075,55 @@ def test_size(df):
tm.assert_series_equal(df.groupby('A').size(), out)


# quantile
# --------------------------------
Comment (Contributor): side note: this file is getting pretty big; maybe we should split it up a bit (later).

@pytest.mark.parametrize("interpolation", [
"linear", "lower", "higher", "nearest", "midpoint"])
@pytest.mark.parametrize("a_vals,b_vals", [
# Ints
([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]),
([1, 2, 3, 4], [4, 3, 2, 1]),
([1, 2, 3, 4, 5], [4, 3, 2, 1]),
# Floats
([1., 2., 3., 4., 5.], [5., 4., 3., 2., 1.]),
# Missing data
([1., np.nan, 3., np.nan, 5.], [5., np.nan, 3., np.nan, 1.]),
([np.nan, 4., np.nan, 2., np.nan], [np.nan, 4., np.nan, 2., np.nan]),
# Timestamps
([x for x in pd.date_range('1/1/18', freq='D', periods=5)],
[x for x in pd.date_range('1/1/18', freq='D', periods=5)][::-1]),
# All NA
([np.nan] * 5, [np.nan] * 5),
])
@pytest.mark.parametrize('q', [0, .25, .5, .75, 1])
def test_quantile(interpolation, a_vals, b_vals, q):
if interpolation == 'nearest' and q == 0.5 and b_vals == [4, 3, 2, 1]:
pytest.skip("Unclear numpy expectation for nearest result with "
"equidistant data")

a_expected = pd.Series(a_vals).quantile(q, interpolation=interpolation)
b_expected = pd.Series(b_vals).quantile(q, interpolation=interpolation)

df = pd.DataFrame({
'key': ['a'] * len(a_vals) + ['b'] * len(b_vals),
'val': a_vals + b_vals})

expected = DataFrame([a_expected, b_expected], columns=['val'],
index=Index(['a', 'b'], name='key'))
result = df.groupby('key').quantile(q, interpolation=interpolation)

tm.assert_frame_equal(result, expected)


def test_quantile_raises():
df = pd.DataFrame([
['foo', 'a'], ['foo', 'b'], ['foo', 'c']], columns=['key', 'val'])

with tm.assert_raises_regex(TypeError, "cannot be performed against "
"'object' dtypes"):
df.groupby('key').quantile()


# pipe
# --------------------------------

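Editor's note: the pytest.skip above exists because, for [4, 3, 2, 1] and q=0.5, the quantile position falls exactly halfway between two values (sorted index 1.5, i.e. between 2 and 3), so 'nearest' has no unambiguous answer. A minimal illustration, without asserting which value the tie-breaking picks:

import pandas as pd

s = pd.Series([4, 3, 2, 1])
# Exactly equidistant between 2 and 3; the result depends on the
# tie-breaking rule of the underlying 'nearest' interpolation
print(s.quantile(0.5, interpolation='nearest'))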
4 changes: 2 additions & 2 deletions pandas/tests/groupby/test_groupby.py
@@ -222,13 +222,13 @@ def f(x, q=None, axis=0):
agg_result = df_grouped.agg(np.percentile, 80, axis=0)
apply_result = df_grouped.apply(DataFrame.quantile, .8)
expected = df_grouped.quantile(.8)
assert_frame_equal(apply_result, expected)
assert_frame_equal(apply_result, expected, check_names=False)
assert_frame_equal(agg_result, expected, check_names=False)

agg_result = df_grouped.agg(f, q=80)
apply_result = df_grouped.apply(DataFrame.quantile, q=.8)
assert_frame_equal(agg_result, expected, check_names=False)
assert_frame_equal(apply_result, expected)
assert_frame_equal(apply_result, expected, check_names=False)


def test_len():