qcut() should make sure the bin boundaries are unique before passing them to _bins_to_cuts #7751
Comments
why would you not just catch the exception?
yonil7 commented Jul 14, 2014
I think this should not throw an exception, as this is a legitimate use. pd.qcut([1,2,3,4,4], [0, .25, .5, .75, 1.]) also throws: ValueError: Bin edges must be unique: array([1, 2, 3, 4, 4], dtype=int64)
yonil7 commented Jul 14, 2014

INSTALLED VERSIONS: commit: None, pandas: 0.14.0
maybe you misunderstand me. This is a user error (in that you generally pass unique bins, and you CAN make them unique before passing them in). Yes, it could be done in the function, but wouldn't you as a user want to know that you are passing non-unique bins?
yonil7 commented Jul 14, 2014
The issue is regarding the pd.qcut() function (not pd.cut()). It takes an array of samples and quantiles/ranks.
it's the same issue. If you uniquify in one, you would in the other. That's not the question, though: why should this happen silently (automatically)? Maybe I am missing something.
I would expect this to raise an informative error to the user: "The quantiles you selected do not result in unique bins. Try selecting fewer quantiles." I.e., if your 25th percentile is the same as your median, then you should be told this and you shouldn't ask for both. Trying to get fancy under the hood is asking for confusion, IMO.
jreback added the Error Reporting label Jul 15, 2014
jreback added this to the 0.15.1 milestone Jul 15, 2014
jreback added the Reshaping label Jul 15, 2014
jreback added the Good as first PR label Sep 21, 2014
edjoesu added a commit (344fb86) to edjoesu/pandas that referenced this issue Dec 6, 2014
edjoesu referenced this issue Dec 6, 2014
Closed: Calling qcut with too many duplicates now gives an informative error #9030
edjoesu commented Dec 7, 2014
In addition to just raising a more informative error, I think there should be an option to automatically handle duplicates in the following way. Suppose we have a list with too many duplicates, say we want to split [1,2,3,3,3,3,3,3,4,5,6,7] into quartiles. Right now qcut fails, because the second-lowest quartile consists entirely of '3's, duplicating the bin edges. But there is a natural way to assign the quartiles if we allow duplicate values to be split among different bins: [1,2,3], [3,3,3], [3,3,4], [5,6,7].

Now this is not completely ideal -- the assignment is arbitrary, and there is no way of knowing what to do by looking at the bin edges alone. But in the real world it can be convenient to have an option that 'just works', and I think this warrants a 'split_dup=True' option that is False by default. What do people think? I'll add this if people support it.
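One way to get the tie-splitting behavior described above without touching pandas internals is to rank the data first: ranking with method='first' assigns each duplicate a distinct position, so the quantile edges are always unique. This is a workaround sketch, not pandas' own implementation:

```python
# Sketch of the tie-splitting behavior described above, using a
# rank-based workaround: rank(method='first') breaks ties by position,
# so qcut can always form equal-sized bins even when one value
# dominates a quantile.
import pandas as pd

data = pd.Series([1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6, 7])

# Plain qcut fails here: the 25% and 50% quantile edges are both 3.
try:
    pd.qcut(data, 4)
except ValueError as err:
    print("qcut raised:", err)

# Ranking first gives each element a unique position, so the duplicate
# '3's are split across bins exactly as proposed above.
quartiles = pd.qcut(data.rank(method='first'), 4, labels=False)
print(quartiles.tolist())  # four bins of three elements each
```

Note that, as the comment above concedes, the assignment of equal values to different bins is arbitrary (it follows input order), so this only suits use cases where any consistent split is acceptable.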
@edjoesu start with the error msg. If you want to propose a new issue for discussion, that is ok. I think if you show several test cases, some of which you CAN distinguish and some of which have to raise, then it's prob ok, but it needs discussion.
mritchie712 commented Dec 11, 2014
Is there any workaround for this until it's changed?
this is not a bug; the planned change is just a more informative error message.
mritchie712 commented Dec 11, 2014
Got it, I thought it was being changed to a warning instead of an error. I personally would like the option to allow non-unique bins, but I also understand the case against it. I just commented out those lines in tile.py and it seems to have allowed me to use duplicates.
Does anyone own this change? Or has it already been done? |
jreback modified the milestone: 0.16.0, Next Major Release Mar 6, 2015
See http://stackoverflow.com/questions/20158597/how-to-qcut-with-non-unique-bin-edges for a discussion about this. @edjoesu I agree with your proposal, but I foresee that the implementation won't be trivial, since, given the way the current code assigns values to bins, I think some bins would become empty if duplicates are allowed.

I think we could add a "duplicate_edges" parameter, with the following options:
Does it look reasonable? What do you think?
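For readers on later pandas versions: a keyword along these lines did eventually land in pandas 0.20.0 as `duplicates=`, with 'raise' (the default) and 'drop' as options. A minimal sketch of its behavior on the example from the top of this thread:

```python
# pandas 0.20.0+ ships a `duplicates` keyword on qcut: 'raise' (the
# default) reproduces the ValueError discussed in this thread, while
# 'drop' collapses repeated quantile edges into fewer bins.
import pandas as pd

data = [1, 2, 3, 4, 4]
quantiles = [0, .25, .5, .75, 1.]

# The 75% quantile and the maximum are both 4, so the default raises.
try:
    pd.qcut(data, quantiles)
except ValueError as err:
    print("default:", err)

# With duplicates='drop', the repeated edge is removed: 3 bins, not 4.
binned = pd.qcut(data, quantiles, duplicates='drop')
print(binned.categories)
```

Note that 'drop' trades the explicit error for silently producing fewer bins than requested, which is exactly the trade-off debated above.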
rainjacket commented Nov 30, 2015
dukebody's suggestion looks good to me. For my use case, I would prefer for qcut to just return the (non-unique) bin list and let me handle it.
randomgambit commented Jul 26, 2016
hi guys, let me jump in and push for that to happen as well. Solutions already exist in many other languages; it's just a matter of choosing one of them. See also https://stackoverflow.com/questions/38594277/how-to-compute-quantile-bins-in-pandas-when-there-are-ties-in-the-data/38596578?noredirect=1#comment64582414_38596578
randomgambit commented Jul 26, 2016
see, for instance, a possible solution adopted in SAS: https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm
@randomgambit it's not about a solution. This is quite straightforward. It's about someone actually implementing it. There are many, many issues. If someone wants to submit a PR, that moves things WAY faster.
randomgambit commented Jul 26, 2016

I get it, Jeff, no worries. Unfortunately I don't have the Python skills to help you get the job done. I can only give hints for the good samaritan who wants to improve the current work in this direction.
yonil7 commented Jul 14, 2014

xref #8309

for example:
will raise a "ValueError: Bin edges must be unique: array([ 1., 1.])" exception.

Fix suggestion - add one new line:
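The one-line fix itself was not preserved in this thread, but based on the issue title it amounts to deduplicating the computed quantile edges before handing them to _bins_to_cuts. A hypothetical standalone sketch of that idea (quantile_bins is an illustrative name, not a pandas internal):

```python
# Hypothetical sketch of the fix the issue title suggests: deduplicate
# the computed quantile edges before binning. The actual one-line patch
# from the original post was not preserved in this thread.
import numpy as np

def quantile_bins(values, quantiles):
    """Compute quantile bin edges, dropping duplicate boundaries."""
    edges = np.quantile(values, quantiles)
    # The suggested extra line: keep only unique edges, so downstream
    # cutting code never sees "Bin edges must be unique".
    edges = np.unique(edges)
    return edges

# [1,2,3,4,4] at these quantiles yields edges [1,2,3,4,4] -> [1,2,3,4].
print(quantile_bins([1, 2, 3, 4, 4], [0, .25, .5, .75, 1.]))
```

As the discussion above notes, doing this silently means the caller gets fewer bins than requested, which is why the maintainers preferred an informative error by default.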