BUG: pd.cut(df, bins=N) current behavior is incorrect, returns float where int is the limit #47996

FlorinAndrei · 2022-08-06T22:37:07Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

https://github.com/FlorinAndrei/misc/blob/master/HeartDisease.csv

import pandas as pd
hd = pd.read_csv("HeartDisease.csv")
pd.cut(hd["Age"], bins=3, include_lowest=True)

Issue Description

The lowest of the three bins created is: (28.951, 45.0]. This is incorrect in several ways.

First off, I expect a left-inclusive bin there. That bin is not left-inclusive.

Secondly, the minimum value in that column is 29. It is not 28.951 - that float is an artifact of the library and does not exist in the data.

One workaround you can find online is this:

_, edges = pd.cut(hd["Age"], bins=3, include_lowest=True, retbins=True)
edges_r = [round(x) for x in edges]
pd.cut(hd["Age"], bins=edges_r)

But this is pointless and annoying. The library should simply return the true minimum value.

Expected Behavior

The bin I expect is [29.0, 45.0]. I would also settle for [29, 45].

Installed Versions

pd.show_versions()
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util_print_versions.py", line 109, in show_versions
deps = _get_dependency_info()
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util_print_versions.py", line 88, in get_dependency_info
mod = import_optional_dependency(modname, errors="ignore")
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\compat_optional.py", line 138, in import_optional_dependency
module = importlib.import_module(name)
File "C:\Program Files\Python310\lib\importlib_init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in gcd_import
File "", line 1027, in find_and_load
File "", line 1002, in find_and_load_unlocked
File "", line 945, in find_spec
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 79, in find_spec
return method()
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 100, in spec_for_pip
if self.pip_imported_during_build():
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 111, in pip_imported_during_build
return any(
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 112, in
frame.f_globals['file'].endswith('setup.py')
KeyError: 'file'

I do have the latest Pandas installed (1.4.3) but there's another bug now with show_versions() that prevents me from printing that info.

Python 3.10.6

Numpy 1.22.4

Windows 10

Jupyter Notebook


GYHHAHA · 2022-08-06T23:54:21Z

Every interval is expected to be closed on the same side. So, when the intervals are closed on the right, the lowest edge is currently set to min - 0.1% * (max - min) so that the first interval envelops the minimal value.
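
A minimal sketch of that arithmetic, written in plain NumPy rather than copied from the pandas source. The helper name `padded_edges` and the sample ages are made up for illustration; the maximum of 77 is only inferred from the reported bin width of 16, not read from the CSV.

```python
import numpy as np

def padded_edges(values, n_bins, pad_fraction=0.001):
    """Equal-width edges with the lowest edge moved down by 0.1% of the range.

    Hypothetical helper that mirrors the rule described in this comment;
    it is not the pandas implementation.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # Shift only the lowest edge so that a right-closed bin can contain the minimum.
    edges[0] -= pad_fraction * (hi - lo)
    return edges

# Assumed sample: only the minimum (29) and the inferred maximum (77) matter.
ages = [29, 40, 52, 63, 77]
edges = padded_edges(ages, 3)
print(edges)
```

With these assumed values the lowest edge lands just below 29, which is where the non-integer bound in the report comes from. The label in the report reads 28.951 rather than 28.952; that last step appears to come from how the label is displayed when include_lowest=True, which this sketch does not model.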

attack68 · 2022-08-07T06:52:13Z

I agree with you. Although the docs suggest the following for bins:

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

I think this is unnecessary and, as you put it, annoying.

I also think this function is quite useful (for categorising continuous data in an ML context) and likely to see more use. It could do with a revamp. It was also added to #40245 and has not yet been addressed.

For numpy.histogram I think the behaviour is much better, i.e. they get [1,2), [2,3), [3,4]

All but the last (righthand-most) bin is half-open.
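
A minimal sketch of that binning rule, assuming made-up ages and a hypothetical helper name `histogram_style_codes`. It uses only `np.histogram_bin_edges` and `np.searchsorted`, so the edges stay at the true minimum and maximum of the data.

```python
import numpy as np

def histogram_style_codes(values, n_bins):
    """Assign each value to a bin the way numpy.histogram counts it.

    Bins are [left, right) except the last one, which is [left, right].
    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    edges = np.histogram_bin_edges(values, bins=n_bins)
    # side="right" puts a value equal to an edge into the bin that starts there.
    codes = np.searchsorted(edges, values, side="right") - 1
    # The maximum would land one past the last bin; fold it back in.
    codes = np.clip(codes, 0, n_bins - 1)
    return codes, edges

# Assumed sample values spanning 29 to 77.
ages = np.array([29, 44, 45, 60, 61, 77])
codes, edges = histogram_style_codes(ages, 3)
print(edges)
print(codes)
```

No padding is applied here: the first edge is the minimum itself and both extremes are still assigned to a bin.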

Additionally, numpy.histogram has a very useful density argument, which allows for a different algorithm to calculate bins, including with weights.

For example

>>> import numpy as np
>>> import pandas as pd
>>> s = np.random.rand(100)
>>> s[0], s[99] = 0, 1
>>> df = pd.DataFrame({"samples": s})
>>> pd.cut(df["samples"], bins=10, retbins=True)
... 
Categories (10, interval[float64, right]): [(-0.001, 0.1] < (0.1, 0.2] < (0.2, 0.3] < (0.3, 0.4] ... (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1.0]]

>>> np.histogram(df["samples"])
(array([ 7,  4, 11, 14, 11,  9,  8, 14, 11, 11]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))

If doing: pd.cut(df["samples"], bins=np.histogram(df["samples"])[1], retbins=True), then the first datapoint at 0 is excluded. If include_lowest is set to True it just reduces the leftmost specified bin by an arbitrary 0.1% (instead of actually making it inclusive), which is horrible, IMO.
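
A small sketch of the exclusion described above, using three made-up values and explicit edges instead of the random sample. A code of -1 means the value was not assigned to any bin.

```python
import numpy as np
import pandas as pd

# Assumed toy data: the lowest value sits exactly on the lowest edge.
values = np.array([0.0, 0.5, 1.0])
edges = [0.0, 0.5, 1.0]

# Default: right-closed bins (0.0, 0.5] and (0.5, 1.0], so 0.0 falls outside.
default_codes = pd.cut(values, bins=edges).codes

# include_lowest=True makes the first bin accept the lowest edge value.
lowest_codes = pd.cut(values, bins=edges, include_lowest=True).codes

print(default_codes.tolist())
print(lowest_codes.tolist())
```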

FlorinAndrei · 2022-08-07T23:23:02Z

> For numpy.histogram I think the behaviour is much better, i.e. they get [1,2), [2,3), [3,4]

Indeed, numpy.histogram() does the right thing if you request that both ends are included. The R function discretize() has the same behavior as Numpy. This is what people expect when they use a function like this.

> If include_lowest is set to True it just reduces the leftmost specified bin by an arbitrary 0.1% (instead of actually making it inclusive), which is horrible, IMO.

That is precisely my main objection. The current behavior of this function makes stuff up. It's just a very, very poor hack.

andrey-orderby · 2024-02-20T14:58:09Z

I am facing the same issue right now: pd.cut() creates intervals with the left bound excluded and the right bound included. I totally support @FlorinAndrei. The default behavior should be that the left bound is included and the right one is excluded. I did not expect any other variant.
However, I figured out that it suffices to set right=False. So at least there is a way to set up the function to create intervals the way I prefer.

import numpy as np  # needed for np.inf below
import pandas as pd
s = pd.Series([1, 1, 2, 2, 3, 4, 5])
pd.cut(s, bins=[-np.inf, 2, np.inf], right=False)

returns

0    [-inf, 2.0)
1    [-inf, 2.0)
2     [2.0, inf)
3     [2.0, inf)
4     [2.0, inf)
5     [2.0, inf)
6     [2.0, inf)
Length: 7, dtype: category
Categories (2, interval[float64, left]): [[-inf, 2.0) < [2.0, inf)]
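
The same option can be combined with an integer bin count, as in the original report. This is a minimal sketch on the same Series; it checks only the closed side and the lowest edge, and makes no claim about the exact value of the uppermost edge, which the documentation quoted earlier says may be extended.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3, 4, 5])

# Three equal-width bins, closed on the left instead of the right.
binned = pd.cut(s, bins=3, right=False)
intervals = binned.cat.categories

print(intervals)
print(binned.cat.codes.tolist())
```

Here the first interval starts at the true minimum of the data, and every value, including the maximum, is assigned to a bin.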

attack68 · 2024-02-20T15:22:21Z

PRs are welcome. It is always encouraged to support software that is free and built by unpaid volunteers.

FlorinAndrei added the Bug and Needs Triage labels on Aug 6, 2022

attack68 added the cut and Enhancement labels and removed the Needs Triage label on Aug 7, 2022
