BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns #11403
df = pd.DataFrame([[-2, 0], [0, -4]])
assert_frame_equal(df.drop_duplicates(), df)
Can you add the OP's example here as well?
Do we have sufficient coverage for the various dtypes?
I see tests for strings, integers, and floats. This particular problem happens only for integers, though.
f436d94 to 34467fc
    labels, shape = vals, unique1d(vals)
else:
    labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
vals = vals.astype('i8', copy=False)
Why are you doing all this lower, upper thing? In any case, you need to factorize integers, because you may not have all the integers between the min and max values;
all you need is to undo #10917:
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 827373c..08d2857 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2992,13 +2992,7 @@ class DataFrame(NDFrame):
         from pandas.hashtable import duplicated_int64, _SIZE_HINT_LIMIT
 
         def f(vals):
-
-            # if we have integers we can directly index with these
-            if com.is_integer_dtype(vals):
-                from pandas.core.nanops import unique1d
-                labels, shape = vals, unique1d(vals)
-            else:
-                labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
+            labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
             return labels.astype('i8',copy=False), len(shape)
 
         if subset is None:
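To illustrate the point about gaps between the minimum and maximum, here is a minimal pure-Python sketch of what factorizing does. The name factorize_sketch is hypothetical and this is not pandas' implementation; it only shows that each distinct value is mapped to a dense label, so the number of labels depends on how many distinct values there are, not on how far apart they lie.

```python
def factorize_sketch(values):
    """Map each value to a dense integer label, in order of first appearance.

    Hypothetical stand-in for factorization; not pandas' implementation.
    """
    seen = {}    # value -> label
    labels = []
    for v in values:
        # A new value gets the next unused label; a repeat reuses its label.
        labels.append(seen.setdefault(v, len(seen)))
    return labels, list(seen)


labels, uniques = factorize_sketch([-4, 0, 1000000, 0])
print(labels)        # [0, 1, 2, 1]
print(len(uniques))  # 3
```

The raw values include a negative number and span a range of more than a million, yet there are only three labels. Using the raw integers as if they were labels, with the count of unique values as the size, does not describe the same thing.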
I'm trying to avoid undoing #10917, because factorizing the integers is slower than using them directly. If all of the integers are non-negative, then everything actually works perfectly just replacing shape with max + 1. If some of the numbers are negative, then we'd like to just shift them up to all be non-negative, but that might cause an overflow that we need to check for.
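The shift-and-check idea described above can be sketched as follows. This is an illustrative sketch under assumptions, not the code in this PR: shift_to_nonnegative and _INT64_MAX are hypothetical names, the input is assumed to be non-empty, and plain Python ints stand in for the int64 array so that the width can be computed without overflow.

```python
_INT64_MAX = 2**63 - 1  # same value as np.iinfo(np.int64).max


def shift_to_nonnegative(values):
    """Shift integers so the smallest becomes 0 (hypothetical helper).

    Returns (labels, size), or None when the range is too wide to be
    shifted safely within int64, in which case a caller would fall back
    to factorizing the values instead.
    """
    lower, upper = min(values), max(values)
    width = upper - lower  # Python ints are arbitrary precision: no overflow
    if width >= _INT64_MAX - 1:
        return None
    return [v - lower for v in values], width + 1


print(shift_to_nonnegative([-2, 0, -4]))          # ([2, 4, 0], 5)
print(shift_to_nonnegative([-2**63, 2**63 - 1]))  # None
```

When every value is already non-negative and the minimum is 0, the shift is a no-op and the size reduces to max + 1, which matches the simpler case mentioned in the comment.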
plz do not modify |
This PR has the perf benefit without the bug, so why revert #10917 instead?
@evanpw before trying to explain what the function does, can you please check the commit history of the function and see who has contributed to its code? You are modifying the function to fix it, and then you say that it does not need factorized labels! That function does a core calculation for many pandas features, and has nothing to do with the bug you are trying to close.
I see no problem with @evanpw trying a different solution. If we can ensure that it does the correct thing, but maybe in a different way to keep the perf benefits, where is the harm?
@behzadnouri of course I made this change, yet we didn't have enough test coverage or it wouldn't have been OK. I am all for rolling this back, but @evanpw's solution seems reasonable. Please don't be dismissive of others; it is simply not nice.
@behzadnouri see, that would be a MUCH better comment. Thank you. I agree we should be clear on whether things are factorized or not going into functions. So maybe the docstring needs to be updated to make that clear.
@evanpw please consider these two cases
else:
    width = long(upper) - long(lower)

if width < np.iinfo(np.int64).max - 1:
Perhaps make special constants like _INT64_MAX available in some module, and then groupby.py and here can just import it?
Yeah, these could go in pandas/core/common.py, I guess (do a separate PR for this). They are already imported directly from the C library in Cython, so this is just for Python space.
@behzadnouri Sorry if I came across as rude. I really like the function you've written! I just happened to notice that it also works perfectly for non-factorized inputs, except in a certain edge case involving integer overflow (and already has to handle non-factorized intermediate values in some cases), so that we could keep the performance improvement from jreback's change while fixing the bug. I personally have no pressing need for |
@kawochen Good catch. Should be fixed now.
Should I do it in a separate PR?
@evanpw no, you can do it here. Separate your tests into a single commit; then you'll need to fix up some stuff on the revert (e.g. the whatsnew note). So you'll end up with the revert commit with some conflicts fixed, and your tests in a commit. You will also need a new whatsnew note for 0.17.1 (which you can put in either).
2dea14a to 3f68d93
This wiped out the whatsnew message from 0.17.0. I'm not sure if that's the right thing in this case.
@evanpw no, leave the whatsnew from 0.17.0 (and add a new one for 0.17.1), leave the asv benchmark as well.
…ger columns (GH 11376)
… arrays" This reverts commit a00c7ea, but leaves new tests and benchmark
3f68d93 to b710728
Okay, this is at the very limit of my git ability, but it should be correct now.
can you add back the asv benchmark? (you can just add it in and combine with say the tests commit)
It should still be there. I only reverted the changes to frame.py.
ahh, right....hahah ok then. ping on green.
all green
BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns
thanks!
Fixes GH #11376