BUG: pd.cut(df, bins=N) current behavior is incorrect, returns float where int is the limit #47996

Open
3 tasks done
FlorinAndrei opened this issue Aug 6, 2022 · 5 comments

Comments

@FlorinAndrei

FlorinAndrei commented Aug 6, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

https://github.com/FlorinAndrei/misc/blob/master/HeartDisease.csv

import pandas as pd
hd = pd.read_csv("HeartDisease.csv")
pd.cut(hd["Age"], bins=3, include_lowest=True)

Issue Description

The lowest of the three bins created is: (28.951, 45.0]. This is incorrect in several ways.

First off, I expect a left-inclusive bin there. That bin is not left-inclusive.

Secondly, the minimum value in that column is 29. It is not 28.951 - that float is an artifact of the library and does not exist in the data.

One workaround you can find online is this:

_, edges = pd.cut(hd["Age"], bins=3, include_lowest=True, retbins=True)
edges_r = [round(x) for x in edges]
pd.cut(hd["Age"], bins=edges_r)

But this is pointless and annoying. The library should simply return the true minimum value.

Expected Behavior

The bin I expect is [29.0, 45.0]. I would also settle for [29, 45].
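For comparison, here is a minimal sketch of what seems to be available today: with right=False, pd.cut builds left-closed bins whose first edge is the true minimum, at the cost of nudging the last edge slightly upward so the maximum still fits. The ages below are made-up stand-ins for the Age column, not the real data.

```python
import pandas as pd

# Hypothetical ages standing in for hd["Age"]; min is 29, max is 77.
ages = pd.Series([29, 40, 45, 61, 77])

# right=False asks for left-closed bins of the form [a, b).
binned, edges = pd.cut(ages, bins=3, right=False, retbins=True)

print(binned.cat.categories)  # the first interval starts at the true minimum
print(edges)                  # the last edge sits slightly above the maximum
```

This does not give the fully closed [29, 45] bin requested above; it only moves the adjustment from the left end to the right end.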

Installed Versions

pd.show_versions()
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util\_print_versions.py", line 109, in show_versions
deps = _get_dependency_info()
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util\_print_versions.py", line 88, in _get_dependency_info
mod = import_optional_dependency(modname, errors="ignore")
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\compat\_optional.py", line 138, in import_optional_dependency
module = importlib.import_module(name)
File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1002, in _find_and_load_unlocked
File "", line 945, in _find_spec
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 79, in find_spec
return method()
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 100, in spec_for_pip
if self.pip_imported_during_build():
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 111, in pip_imported_during_build
return any(
File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 112, in
frame.f_globals['__file__'].endswith('setup.py')
KeyError: '__file__'

I do have the latest Pandas installed (1.4.3) but there's another bug now with show_versions() that prevents me from printing that info.

Python 3.10.6

Numpy 1.22.4

Windows 10

Jupyter Notebook

@FlorinAndrei FlorinAndrei added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 6, 2022
@GYHHAHA
Contributor

GYHHAHA commented Aug 6, 2022

Every bin is expected to have the same closed side. So currently, when the bins are closed on the right, we set the lowest edge to min - 0.1% * (max - min) so that it envelops the minimal value.
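The adjustment described above can be reproduced by hand. This is a sketch based only on the formula in this comment, not a copy of the pandas internals; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: min is 29, max is 77, so max - min is 48.
values = pd.Series([29, 45, 61, 77])
n_bins = 3

lo, hi = values.min(), values.max()

# Equal-width edges over [min, max] ...
manual_edges = np.linspace(lo, hi, n_bins + 1)
# ... then lower the first edge by 0.1% of the range (right-closed case).
manual_edges[0] -= 0.001 * (hi - lo)

# Edges reported by pd.cut itself, for comparison.
_, pandas_edges = pd.cut(values, bins=n_bins, retbins=True)

print(manual_edges)
print(pandas_edges)
```

The first edge ends up just below 29, which is where a label such as (28.951, 45.0] comes from once it is rounded for display.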

@attack68
Contributor

attack68 commented Aug 7, 2022

I agree with you. Although the docs suggest the following for bins:

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

I think this is unnecessary and, as you put it, annoying.

Also, I think this function is quite useful (for categorising continuous data in an ML context) and likely to see more use. It could do with a revamp. It was also added to #40245 but has not yet been addressed.

For numpy.histogram I think the behaviour is much better, i.e. they get [1,2), [2,3), [3,4]

All but the last (righthand-most) bin is half-open.
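As a rough illustration of that convention, the sketch below assigns bin indices so that every bin is closed on the left and only the last bin is also closed on the right. histogram_style_codes is a hypothetical helper written for this comment, not a pandas or NumPy function.

```python
import numpy as np


def histogram_style_codes(values, n_bins):
    """Hypothetical helper: bin indices following the np.histogram convention."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Search only the inner edges: a value equal to an inner edge moves to the
    # bin on its right, while the maximum stays in the last bin.
    codes = np.searchsorted(edges[1:-1], values, side="right")
    return codes, edges


codes, edges = histogram_style_codes([1, 2, 3, 4], n_bins=3)
counts, hist_edges = np.histogram([1, 2, 3, 4], bins=3)

print(codes)                            # bin index per value
print(np.bincount(codes, minlength=3))  # per-bin counts from the helper
print(counts)                           # per-bin counts from np.histogram
```

No edge is moved here: the minimum and the maximum are both bin edges, and both are included.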

Additionally, numpy.histogram has a very useful arg density, which allows for a different algo to calculate bins, including with weights.

For example

>>> s = np.random.rand(100)
>>> s[0], s[99] = 0, 1
>>> df = pd.DataFrame({"samples": s})
>>> pd.cut(df["samples"], bins=10, retbins=True)
... 
Categories (10, interval[float64, right]): [(-0.001, 0.1] < (0.1, 0.2] < (0.2, 0.3] < (0.3, 0.4] ... (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1.0]]

>>> np.histogram(df["samples"])
(array([ 7,  4, 11, 14, 11,  9,  8, 14, 11, 11]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))

If doing: pd.cut(df["samples"], bins=np.histogram(df["samples"])[1], retbins=True), then the first datapoint at 0 is excluded. If include_lowest is set to True it just reduces the leftmost specified bin by an arbitrary 0.1% (instead of actually making it inclusive), which is horrible, IMO.
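A small sketch of the two behaviours described above, using explicit edges and made-up samples rather than the random data from the example:

```python
import pandas as pd

# Made-up samples; the first one sits exactly on the lowest edge.
samples = pd.Series([0.0, 0.25, 0.75, 1.0])
edges = [0.0, 0.5, 1.0]

# Default: bins are (0.0, 0.5] and (0.5, 1.0], so 0.0 falls outside and becomes NaN.
default = pd.cut(samples, bins=edges)

# include_lowest=True: 0.0 is assigned to the first bin, and the label of that
# bin is expected to show a left edge slightly below 0.0.
with_lowest = pd.cut(samples, bins=edges, include_lowest=True)

print(default)
print(with_lowest.cat.categories)
```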

@attack68 attack68 added cut cut, qcut Enhancement and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 7, 2022
@FlorinAndrei
Author

> For numpy.histogram I think the behaviour is much better, i.e. they get [1,2), [2,3), [3,4]

Indeed, numpy.histogram() does the right thing if you request that both ends are included. The R function discretize() has the same behavior as Numpy. This is what people expect when they use a function like this.

> If include_lowest is set to True it just reduces the leftmost specified bin by an arbitrary 0.1% (instead of actually making it inclusive), which is horrible, IMO.

That is precisely my main objection. The current behavior of this function makes stuff up. It's just a very, very poor hack.

@andrey-orderby

andrey-orderby commented Feb 20, 2024

I am facing the same issue right now: pd.cut() creates intervals with the left bound excluded and the right bound included. I totally support @FlorinAndrei. The default behavior should be that the left bound is included and the right one is excluded. I did not expect any other variant.
However, I figured out that it suffices to set right=False. So at least there is a way to set up the function to create intervals the way I prefer.

import numpy as np
import pandas as pd
s = pd.Series([1, 1, 2, 2, 3, 4, 5])
# right=False makes every bin left-closed: [a, b)
pd.cut(s, bins=[-np.inf, 2, np.inf], right=False)

returns

0    [-inf, 2.0)
1    [-inf, 2.0)
2     [2.0, inf)
3     [2.0, inf)
4     [2.0, inf)
5     [2.0, inf)
6     [2.0, inf)
Length: 7, dtype: category
Categories (2, interval[float64, left]): [[-inf, 2.0) < [2.0, inf)]

@attack68
Contributor

PRs are welcome. Supporting software that is free and built by unpaid volunteers is always encouraged.
