Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -421,6 +421,7 @@ Other Enhancements
- :func:`pandas.DataFrame.to_sql` has gained the ``method`` argument to control SQL insertion clause. See the :ref:`insertion method <io.sql.method>` section in the documentation. (:issue:`8953`)
- :meth:`DataFrame.corrwith` now supports Spearman's rank correlation, Kendall's tau as well as callable correlation methods. (:issue:`21925`)
- :meth:`DataFrame.to_json`, :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_XXX` etc. now support tilde(~) in path argument. (:issue:`23473`)
- :func:`qcut` now accepts ``bounded`` as a keyword argument. Passing ``bounded=False`` produces unbounded quantile bins whose lower/upper bounds are -inf/inf. (:issue:`17282`)
Member
Move to 0.25 at this point
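
The entry above describes unbounded quantile bins. As a rough sketch only, similar bins can be built from existing pandas calls, without the `bounded` keyword this PR proposes. The helper name `qcut_unbounded` is hypothetical and not part of pandas, and the sketch assumes numeric input whose bin edges can safely be held as float64.

```python
import numpy as np
import pandas as pd


def qcut_unbounded(x, q):
    """Hypothetical helper (not part of pandas): quantile bins with open ends."""
    # Ask qcut only for the quantile edges; the binned result is discarded.
    _, edges = pd.qcut(x, q, retbins=True)
    # Work on a float copy so that the infinities are representable.
    edges = edges.astype(np.float64)
    edges[0], edges[-1] = -np.inf, np.inf
    # Re-bin with the widened outer edges.
    return pd.cut(x, edges)


result = qcut_unbounded(np.arange(5), 4)
left_edges = list(result.categories.left)
right_edges = list(result.categories.right)
```

The interior edges are still the sample quantiles; only the outermost edges change, so values outside the original min/max would fall into the first or last bin when the same edges are reused on new data.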


.. _whatsnew_0240.api_breaking:

16 changes: 14 additions & 2 deletions pandas/core/reshape/tile.py
@@ -10,7 +10,7 @@
from pandas.core.dtypes.common import (
_NS_DTYPE, ensure_int64, is_categorical_dtype, is_datetime64_dtype,
is_datetime64tz_dtype, is_datetime_or_timedelta_dtype, is_integer,
-    is_scalar, is_timedelta64_dtype)
+    is_integer_dtype, is_scalar, is_timedelta64_dtype)
from pandas.core.dtypes.missing import isna

from pandas import (
@@ -244,7 +244,8 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
series_index, name, dtype)


-def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
+def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise',
+         bounded=True):
"""
Quantile-based discretization function. Discretize variable into
equal-sized buckets based on rank or based on sample quantiles. For example
@@ -271,6 +272,12 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):

.. versionadded:: 0.20.0

bounded : bool, default True
Use the min/max of the distribution as the lower/upper bounds if True,
otherwise use -inf/inf. Ignored if dtype is datetime/timedelta.

.. versionadded:: 0.24.0

Returns
-------
out : Categorical or Series or array of integers if labels is False
@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
else:
quantiles = q
bins = algos.quantile(x, quantiles)
if not bounded and not dtype:
Contributor
What about bounded and dtype? I feel like bounded should not be ignored in that case (though I don't know what the correct behavior is).

if is_integer_dtype(bins):
bins = bins.astype(np.float64)
Contributor
We probably don't want to do this. It can cause precision issues for large integers, and I suspect it may be surprising for users.

Could you instead use the min/max integer for the dtype?

info = np.iinfo(bins.dtype)
bins[0] = info.min
bins[-1] = info.max

Author
Thanks for the comments. Not sure either approach is guaranteed to avoid unexpected results for users. I think either would work for my use cases, but any approach will be a compromise since there is no way to represent infinity for int types. Looking into your other comment about dtype, the same issues arise for datetime-like types. I'm leaning towards closing this PR since I think the unbounded concept can only be naturally represented for float types and isn't worth using hacks for all other types.
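
To make the trade-off in this thread concrete, here is a small sketch of both options using only NumPy. The array `big` is made-up illustrative data, not taken from the PR. It shows that casting large int64 values to float64 can merge distinct values, while `np.iinfo` gives finite sentinel edges that stay in the integer dtype.

```python
import numpy as np

# Illustrative data: two distinct integers just above 2**53, the point
# beyond which float64 can no longer represent every integer exactly.
big = np.array([2**53, 2**53 + 1], dtype=np.int64)

# Option 1 (the PR's approach): cast to float64 so that +/-inf fits.
as_float = big.astype(np.float64)
collapsed = bool(as_float[0] == as_float[1])  # distinct ints become equal

# Option 2 (the reviewer's suggestion): keep the integer dtype and use
# its extreme values as the outer edges instead of infinities.
info = np.iinfo(big.dtype)
edges = big.copy()
edges[0], edges[-1] = info.min, info.max
```

Neither option is a true infinity for integer data: the first loses precision on large values, and the second yields finite edges that merely cover the dtype's full range.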

bins[0] = -np.inf
bins[-1] = np.inf
fac, bins = _bins_to_cuts(x, bins, labels=labels,
precision=precision, include_lowest=True,
dtype=dtype, duplicates=duplicates)
27 changes: 27 additions & 0 deletions pandas/tests/reshape/test_qcut.py
@@ -197,3 +197,30 @@ def test_date_like_qcut_bins(arg, expected_bins):
ser = Series(arg)
result, result_bins = qcut(ser, 2, retbins=True)
tm.assert_index_equal(result_bins, expected_bins)


def test_qcut_unbounded():
Contributor
Can you parametrize over bounded?

# GH 17282
labels = qcut(range(5), 4, bounded=False)
left = labels.categories.left.values
right = labels.categories.right.values
expected = np.array([-np.inf, 1.0, 2.0, 3.0, np.inf])
Contributor
Rather than using numpy arrays, can you construct the expected Index and use tm.assert_index_equal?

tm.assert_numpy_array_equal(left, expected[:-1])
tm.assert_numpy_array_equal(right, expected[1:])
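
The two review suggestions on this test (parametrize over `bounded`, and compare against an expected Index) could be combined roughly as below. This is a sketch under assumptions: `qcut_maybe_unbounded` is a hypothetical stand-in for the proposed `qcut(..., bounded=...)`, built from existing pandas calls so the example runs without the PR's change. It uses the public `pandas.testing` module in place of pandas' internal `tm` helper.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm
import pytest


def qcut_maybe_unbounded(x, q, bounded=True):
    """Hypothetical stand-in for the proposed ``qcut(..., bounded=...)``."""
    if bounded:
        return pd.qcut(x, q)
    _, edges = pd.qcut(x, q, retbins=True)
    edges = edges.astype(np.float64)
    edges[0], edges[-1] = -np.inf, np.inf
    return pd.cut(x, edges)


@pytest.mark.parametrize("bounded, breaks", [
    # Bounded: qcut nudges the lowest edge down by 10**-precision.
    (True, [-0.001, 1.0, 2.0, 3.0, 4.0]),
    # Unbounded: the outer edges are replaced by -inf/inf.
    (False, [-np.inf, 1.0, 2.0, 3.0, np.inf]),
])
def test_qcut_bounded(bounded, breaks):
    # GH 17282
    result = qcut_maybe_unbounded(np.arange(5), 4, bounded=bounded)
    expected = pd.IntervalIndex.from_breaks(breaks)
    tm.assert_index_equal(result.categories, expected)
```

Comparing whole `IntervalIndex` objects checks the left edges, the right edges, and the closed side in one assertion, which is what the second review comment asks for.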


@pytest.mark.parametrize('bins', [3, np.linspace(0, 1, 4)])
def test_datetimetz_qcut_unbounded(bins):
# GH 19872
tz = 'US/Eastern'
s = Series(date_range('20130101', periods=3, tz=tz))
result = qcut(s, bins, bounded=False)
expected = Series(IntervalIndex([
Interval(Timestamp("2012-12-31 23:59:59.999999999", tz=tz),
Timestamp("2013-01-01 16:00:00", tz=tz)),
Interval(Timestamp("2013-01-01 16:00:00", tz=tz),
Timestamp("2013-01-02 08:00:00", tz=tz)),
Interval(Timestamp("2013-01-02 08:00:00", tz=tz),
Timestamp("2013-01-03 00:00:00", tz=tz))])).astype(
CDT(ordered=True))
tm.assert_series_equal(result, expected)