Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns #11403

Merged
merged 2 commits into from
Oct 24, 2015

Conversation

evanpw
Copy link
Contributor

@evanpw evanpw commented Oct 21, 2015

Fixes GH #11376


df = pd.DataFrame([[-2, 0], [0, -4]])
assert_frame_equal(df.drop_duplicates(), df)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add the op example here as well
do we have sufficient coverage for various dtypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see tests for strings, integers, and floats. This particular problem happens only for integers, though.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 21, 2015
@jreback jreback added this to the 0.17.1 milestone Oct 21, 2015
labels, shape = vals, unique1d(vals)
else:
labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
vals = vals.astype('i8', copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing all these lower, upper thing? in any case you need to factorize integers because you may not have all the integers between min and max value;

all you need is to undo #10917 :

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 827373c..08d2857 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2992,13 +2992,7 @@ class DataFrame(NDFrame):
         from pandas.hashtable import duplicated_int64, _SIZE_HINT_LIMIT

         def f(vals):
-
-            # if we have integers we can directly index with these
-            if com.is_integer_dtype(vals):
-                from pandas.core.nanops import unique1d
-                labels, shape = vals, unique1d(vals)
-            else:
-                labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
+            labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
             return labels.astype('i8',copy=False), len(shape)

         if subset is None:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to avoid undoing #10917, because factorizing the integers is slower than using them directly. If all of the integers are non-negative, then everything actually works perfectly just replacing shape with max + 1. If some of the numbers are negative, then we'd like to just shift them up to all be non-negative, but that might cause an overflow that we need to check for.

@behzadnouri
Copy link
Contributor

plz do not modify groupby.py:get_group_index function. that function has nothing to do with #11376 and does not need any fixing

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

yeh I think Ideally we'd like to get the perf benefit on not factororizing integers if possible, but seems like reverting #10917 (though with @evanpw ) tests would be the ideal

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

This PR has the perf benefit without the bug, so why revert #10917 instead?

@behzadnouri
Copy link
Contributor

@evanpw get_group_index should only get factorized values; you are giving it wrong values then trying to fix it by factorizing inside the function!

can you plz just revert #10917 and not change get_group_index?

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

get_group_index doesn't need factorized values, though. All it needs is non-negative integers in a known range. The situation where a column needs to be factorized to make progress is a rare edge case that only happens when the difference between the largest and smallest integers in the column is comparable in magnitude to INT64_MAX.

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

get_group_index actually has a beautiful mathematical description. It takes a subset U of the group Z_n1 x Z_n2 x ... Z_nm and maps it to the corresponding subset of Z_{n1 * n2 * ... * nm} (i.e., a list of numbers) via the usual isomorphism (the one you get by sorting the former group lexicographically). It doesn't matter whether or not U projects down to the entire space Z_ni for every i (i.e., it doesn't matter whether every column is factorized). It still works, as long as you're careful about integer overflow.

@behzadnouri
Copy link
Contributor

@evanpw before trying to explain what the function does, can you plz check commit history of the function code and see who had contributed to its code?

you are modifying the function code to fix it, then you say that it does not need factorized labels!

that function does a core calculation to many pandas functionality, and has nothing to do with the bug you are trying to close.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

@behzadnouri

I see no problem with @evanpw trying a different soln. If we can assure that it does the correct thing, but maybe in a different way to keep the perf benefits, where is the harm?

@behzadnouri
Copy link
Contributor

@jreback

you are the one who has done the harm, namely #10917

obviously you do not understand how these functions work. can u plz at least not be dismissive when i am trying to help you.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

@behzadnouri of course I did this change. yet we didn't have enough test coverage or it wouldn't have been ok.

I am all for rolling this back. but @evanpw soln seems reasonable.

pls don't be dismissive of others, it is simply not nice.

@behzadnouri
Copy link
Contributor

@jreback get_group_index has nothing to do with #11376 bug

can we plz not modify that function? that function should only work with factorized labels! you see that this PR is modifying it to fix it because it is calling the function with non-factorized labels.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

@behzadnouri see that would be a MUCH better comment. thank you.

I agree we should be clear on whether things are factorized or not going into functions.

So maybe the doc-string needs to be updated to make that clear.

@kawochen
Copy link
Contributor

@evanpw please consider these two cases

import pandas as pd
import numpy as np
_INT64_MAX = np.iinfo(np.int64).max
df_e = pd.DataFrame([[0]*63, [_INT64_MAX-1]*63]).drop_duplicates()
df_w = pd.DataFrame([[0]*63, [_INT64_MAX]*63]).drop_duplicates()

else:
width = long(upper) - long(lower)

if width < np.iinfo(np.int64).max - 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps make special constants like _INT64_MAX available in some module and then groupby.py and here can just import it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh these could be in pandas/core/common.py I guess (do separate PR for this). they are already imported directly from the c-lib in cython, so this is just for python space

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

@behzadnouri Sorry if I came across as rude. I really like the function you've written! I just happened to notice that it also works perfectly for non-factorized inputs, except in a certain edge case involving integer overflow (and already has to handle non-factorized intermediate values in some cases), so that we could keep the performance improvement from jreback's change while fixing the bug. I personally have no pressing need for drop_duplicates to be super fast, so if this change is going to be controversial, then let's just revert #10917.

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

@kawochen Good catch. Should be fixed now.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

ok, @evanpw pls revert #10917. but KEEP your tests, I guess that is how this slipped thru.

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

Should I do it in a separate PR?

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

@evanpw no you can do it here. separate your tests into a single commit, then you'll will need to fix up some stuff on the revert (e.g. the whatsnew note). so you'll end up with the revert commit with some conflicts fixed and your tests in a commit. you will also need a new whatsnew note for 0.17.1 (which you can put in either)

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

This wiped out the whatsnew message from 0.17.0. I'm not sure if that's the right thing in this case.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

@evanpw no, leave the whatsnew from 0.17.0 (and add a new one for 0.17.1), leave the asv benchmark as well.

… arrays"

This reverts commit a00c7ea, but leaves new tests and benchmark
@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

Okay, this is at the very limit of my git ability, but it should be correct now.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

can you add back the asv benchmark? (you can just add it in and combine with say the tests commit)

@evanpw
Copy link
Contributor Author

evanpw commented Oct 23, 2015

It should still be there. I only reverted the changes to frame.py.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

ahh, right....hahah ok then.

ping on green.

@evanpw
Copy link
Contributor Author

evanpw commented Oct 24, 2015

all green

jreback added a commit that referenced this pull request Oct 24, 2015
BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns
@jreback jreback merged commit 8a04d63 into pandas-dev:master Oct 24, 2015
@jreback
Copy link
Contributor

jreback commented Oct 24, 2015

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants