qcut() should make sure the bin boundaries are unique before passing them to _bins_to_cuts #7751

Closed
yonil7 opened this Issue Jul 14, 2014 · 21 comments


yonil7 commented Jul 14, 2014

xref #8309

For example:

pd.qcut([1,1,1,1,1,1,1,1,1,1,1,1,1,5,5,5], [0.00001, 0.5])

raises a "ValueError: Bin edges must be unique: array([ 1., 1.])" exception.

Fix suggestion: add one new line:

def qcut(x, q, labels=None, retbins=False, precision=3):
    if com.is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    bins = algos.quantile(x, quantiles)
    bins = np.unique(bins)  # <-- proposed new line
    return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
                         precision=precision, include_lowest=True)
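[Editor's note] The effect of the proposed np.unique line can be sketched with plain NumPy, using np.quantile as an illustrative stand-in for pandas' internal algos.quantile:

```python
import numpy as np

# Data where most of the quartile edges collapse to the same value
x = [1] * 13 + [5] * 3
quantiles = np.linspace(0, 1, 5)   # quartile probabilities: 0, .25, .5, .75, 1
bins = np.quantile(x, quantiles)   # array([1., 1., 1., 1., 5.]) -- duplicate edges
bins = np.unique(bins)             # array([1., 5.]) -- the proposed de-duplication
print(bins)
```

With the duplicates removed, the remaining edges define fewer (and unevenly populated) bins than requested, which is exactly the behavior being debated below.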
Contributor

jreback commented Jul 14, 2014

Why would you not just catch the ValueError? Better to be informed of the issue, right?

yonil7 commented Jul 14, 2014

I think this should not throw an exception, as this is a legitimate use.
Maybe a better example:

pd.qcut([1,2,3,4,4], [0, .25, .5, .75, 1.])

also throws: ValueError: Bin edges must be unique: array([1, 2, 3, 4, 4], dtype=int64)

yonil7 commented Jul 14, 2014

INSTALLED VERSIONS

commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.0
nose: 1.3.0
Cython: 0.19.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 1.0.0
sphinx: 1.1.3
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.6
pymysql: None
psycopg2: None

Contributor

jreback commented Jul 14, 2014

maybe you misunderstand me. This is a user error (in that you generally pass unique bins, and you CAN make them unique before passing them in). Yes it could be done in the function, but wouldn't you as a user want to know that you are passing non-unique bins?

yonil7 commented Jul 14, 2014

The issue is regarding the pd.qcut() function (not pd.cut()). It takes an array of samples and an array of quantiles/ranks.
In the previous example, I passed the array of samples [1,2,3,4,4] and would like it to compute, for each sample in that array, which of the 4 quantiles it belongs to (quantile 1, 2, 3, or 4).

Contributor

jreback commented Jul 14, 2014

It's the same issue. If you uniquify in one, you would in the other. That's not the question, though: why should this happen silently (automatically)? Maybe I am missing something.

@jseabold ?

Contributor

jseabold commented Jul 15, 2014

I would expect this to raise an informative error to the user. "The quantiles you selected do not result in unique bins. Try selecting fewer quantiles." I.e., if your 25th quantile is the same as your median, then you should be told this and you shouldn't ask for both. Trying to get fancy under the hood is asking for confusion IMO.

Contributor

jreback commented Jul 15, 2014

@jseabold OK, I can buy that this could produce a better message.

@yonil7 want to take this on with a pull-request?

jreback added this to the 0.15.1 milestone Jul 15, 2014

jreback added the Reshaping label Jul 15, 2014

edjoesu added a commit to edjoesu/pandas that referenced this issue Dec 6, 2014:

"Calling qcut with too many duplicates now gives an informative error message. Closes #7751" (344fb86)

edjoesu commented Dec 7, 2014

In addition to just raising a more informative error, I think that there should be an option to automatically handle duplicates in the following way:

Suppose we have a list with too many duplicates, say we want to split [1,2,3,3,3,3,3,3,4,5,6,7] into quartiles. Right now qcut fails, because the second-lowest quartile consists entirely of '3's, duplicating the bin edges. But there is a natural way to assign the quartiles if we allow duplicate values to be split among different bins: [1,2,3], [3,3,3], [3,3,4], [5,6,7].

Now this is not completely ideal -- the assignment is arbitrary, and there is no way of knowing what to do by looking at the bin edges alone. But in the real world it can be convenient to have an option that 'just works', and I think this warrants a 'split_dup=True' option that is false by default.

What do people think? I'll add this if people support it.

Contributor

jreback commented Dec 7, 2014

@edjoesu start with the error message. If you want to propose a new issue for discussion, that is OK. I think if you show several test cases, some of which you CAN distinguish and some of which have to raise, then it's probably OK, but it needs discussion.

Is there any workaround for this until it's changed?
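[Editor's note] One workaround discussed in the Stack Overflow thread linked later in this discussion is to rank the values first with method='first', which breaks ties and guarantees unique quantile edges. This is a sketch of that idea, not part of qcut itself:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6, 7])

# rank(method='first') breaks ties by order of appearance, so the
# quantile edges computed on the ranks (1..12 here) are always unique.
labels = pd.qcut(s.rank(method='first'), 4, labels=False)
print(list(labels))   # four equally sized quartile groups
```

Note that this makes the bin assignment of tied values arbitrary (it depends on their position in the input), which is the same trade-off edjoesu describes above.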

Contributor

jreback commented Dec 11, 2014

this is not a bug, just a more informative error message.

Got it, I thought it was being changed to a warning instead of an error. I personally would like the option to allow non-unique bins, but I also understand the case against it.

I just commented out those lines in tile.py and it seems to have allowed me to use duplicates.

Contributor

springcoil commented Dec 30, 2014

Does anyone own this change? Or has it already been done?

@jreback jreback modified the milestone: 0.16.0, Next Major Release Mar 6, 2015

Unknown referenced this issue Apr 17, 2015

Closed

Qcut msg #9879

Contributor

dukebody commented May 13, 2015

See http://stackoverflow.com/questions/20158597/how-to-qcut-with-non-unique-bin-edges for a discussion about this.

@edjoesu I agree with your proposal, but I foresee that the implementation won't be trivial, since, given the way the current code assigns values to bins, I think some bins would become empty if duplicates are allowed.

Contributor

dukebody commented May 13, 2015

I think we could add a "duplicate_edges" parameter, with the following options:

  • 'raise' (default). Raise an error if duplicate bin edges are found.
  • 'drop'. Delete duplicate edges, adding the bins = np.unique(bins) line as in pydata#7751 (comment). This would result in fewer bins than specified, and some larger (with more elements) than others.
  • 'unique'. Perform the binning based on the unique values present in the input array. This is essentially the first solution in http://stackoverflow.com/questions/20158597/how-to-qcut-with-non-unique-bin-edges and would imply unevenly sized bins, without altering the total number of bins.

Does it look reasonable? What do you think?
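[Editor's note] A variant of this proposal is what eventually shipped: pandas 0.20.0 added a duplicates keyword to qcut, accepting 'raise' (the default) and 'drop'. A quick sketch of the 'drop' option on the kind of data from the original report:

```python
import pandas as pd

data = [1, 1, 1, 1, 1, 5, 5, 5]

# duplicates='drop' discards the repeated quantile edges instead of raising,
# yielding fewer (and unevenly populated) bins than the 4 requested.
result = pd.qcut(data, 4, duplicates='drop')
print(result.categories)
```

Here all interior quartile edges collapse, so only the edges 1 and 5 survive and a single interval remains.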

dukebody's suggestion looks good to me.

For my use case, I would prefer for qcut to just return the (non-unique) bin list, and let me handle it.

Hi guys, let me jump in and push for that to happen as well. Solutions already exist in many other languages; it's just a matter of choosing one of them. See also https://stackoverflow.com/questions/38594277/how-to-compute-quantile-bins-in-pandas-when-there-are-ties-in-the-data/38596578?noredirect=1#comment64582414_38596578

Contributor

jreback commented Jul 26, 2016

@randomgambit it's not about a solution; this part is quite straightforward. It's about someone actually implementing it. There are many, many issues. If someone wants to submit a PR, that moves things WAY faster.

randomgambit commented Jul 26, 2016 edited by jreback

I get it Jeff, no worries. Unfortunately I don't have the Python skills to help you get the job done. I can only give hints for the good Samaritan who wants to improve the current work in this direction.

@jreback jreback modified the milestone: 0.20.0, Next Major Release Dec 30, 2016

jreback closed this in 8051d61 Dec 30, 2016
