BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns #11403

evanpw · 2015-10-21T21:39:13Z

Fixes GH #11376

jreback · 2015-10-21T21:54:54Z

pandas/tests/test_frame.py

+
+        df = pd.DataFrame([[-2, 0], [0, -4]])
+        assert_frame_equal(df.drop_duplicates(), df)
+


can you add the OP's example here as well?
do we have sufficient coverage for various dtypes?

evanpw · 2015-10-22

I see tests for strings, integers, and floats. This particular problem happens only for integers, though.

behzadnouri · 2015-10-23T11:20:20Z

pandas/core/frame.py

-                labels, shape = vals, unique1d(vals)
-            else:
-                labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
+                vals = vals.astype('i8', copy=False)


why are you doing all this lower/upper handling? in any case you need to factorize the integers, because you may not have every integer between the min and max values;

all you need is to undo #10917 :

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 827373c..08d2857 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2992,13 +2992,7 @@ class DataFrame(NDFrame):
         from pandas.hashtable import duplicated_int64, _SIZE_HINT_LIMIT
 
         def f(vals):
-
-            # if we have integers we can directly index with these
-            if com.is_integer_dtype(vals):
-                from pandas.core.nanops import unique1d
-                labels, shape = vals, unique1d(vals)
-            else:
-                labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
+            labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
             return labels.astype('i8',copy=False), len(shape)
 
         if subset is None:
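For readers following along, the collision behind #11376 can be reproduced with a simplified pure-Python model. This is only a sketch, not the pandas code: `factorize` and `group_index` below are hypothetical stand-ins, and the model assumes that per-column labels are combined into one key per row as a row-major mixed-radix number.

```python
def factorize(vals):
    """Map each value to a dense code 0..k-1 in order of first appearance.

    Hypothetical stand-in for pandas' factorize; returns (codes, uniques).
    """
    codes, uniques, seen = [], [], {}
    for v in vals:
        if v not in seen:
            seen[v] = len(uniques)
            uniques.append(v)
        codes.append(seen[v])
    return codes, uniques


def group_index(columns, sizes):
    """Combine per-column labels into one key per row (row-major mixed radix)."""
    keys = [0] * len(columns[0])
    for labels, size in zip(columns, sizes):
        keys = [k * size + lab for k, lab in zip(keys, labels)]
    return keys


# The frame from the new test: rows (-2, 0) and (0, -4), stored column-wise.
cols = [[-2, 0], [0, -4]]

# Raw integers as labels, number of uniques as the size: the keys collide.
raw_sizes = [len(set(c)) for c in cols]
raw_keys = group_index(cols, raw_sizes)

# Factorized labels: each distinct row gets a distinct key.
fact = [factorize(c) for c in cols]
fact_keys = group_index([codes for codes, _ in fact],
                        [len(uniq) for _, uniq in fact])

print(raw_keys)   # [-4, -4]
print(fact_keys)  # [0, 3]
```

In this model the two distinct rows share the key -4 when raw values are used with a size equal to the number of uniques, which is consistent with one of them being reported as a duplicate.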

evanpw · 2015-10-23

I'm trying to avoid undoing #10917, because factorizing the integers is slower than using them directly. If all of the integers are non-negative, then everything actually works perfectly just by replacing shape with max + 1. If some of the numbers are negative, then we'd like to just shift them up so that they are all non-negative, but that might cause an overflow that we need to check for.
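The shifting idea described here can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this PR: the function name `shift_to_nonnegative`, the `None` fallback, and the exact overflow condition are all hypothetical.

```python
INT64_MAX = 2**63 - 1  # same value as np.iinfo(np.int64).max


def shift_to_nonnegative(vals):
    """Return (labels, size) with every label in [0, size), or None on overflow.

    Hypothetical helper: None signals that the caller should fall back
    to factorizing the column instead.
    """
    lower, upper = min(vals), max(vals)
    # Python ints are unbounded, so the subtraction itself cannot overflow.
    width = upper - lower
    if width >= INT64_MAX:
        # size = width + 1 would not fit in a signed 64-bit integer.
        return None
    return [v - lower for v in vals], width + 1


print(shift_to_nonnegative([-2, 0]))              # ([0, 2], 3)
print(shift_to_nonnegative([0, -4]))              # ([4, 0], 5)
print(shift_to_nonnegative([-2**63, 2**63 - 1]))  # None
```

For non-negative input one could skip the shift and use max + 1 as the size, as the comment says; the sketch always shifts to keep the example short.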

behzadnouri · 2015-10-23T11:29:08Z

plz do not modify groupby.py:get_group_index function. that function has nothing to do with #11376 and does not need any fixing

jreback · 2015-10-23T13:04:23Z

yeh I think ideally we'd like to keep the perf benefit of not factorizing integers if possible, but it seems like reverting #10917 (though keeping @evanpw's tests) would be best

evanpw · 2015-10-23T13:16:12Z

This PR has the perf benefit without the bug, so why revert #10917 instead?

behzadnouri · 2015-10-23T17:21:01Z

@evanpw get_group_index should only get factorized values; you are giving it wrong values then trying to fix it by factorizing inside the function!

can you plz just revert #10917 and not change get_group_index?

evanpw · 2015-10-23T17:26:26Z

get_group_index doesn't need factorized values, though. All it needs is non-negative integers in a known range. The situation where a column needs to be factorized to make progress is a rare edge case that only happens when the difference between the largest and smallest integers in the column is comparable in magnitude to INT64_MAX.

evanpw · 2015-10-23T17:39:02Z

get_group_index actually has a beautiful mathematical description. It takes a subset U of the group Z_n1 x Z_n2 x ... Z_nm and maps it to the corresponding subset of Z_{n1 * n2 * ... * nm} (i.e., a list of numbers) via the usual isomorphism (the one you get by sorting the former group lexicographically). It doesn't matter whether or not U projects down to the entire space Z_ni for every i (i.e., it doesn't matter whether every column is factorized). It still works, as long as you're careful about integer overflow.
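The mapping described above can be illustrated with a small mixed-radix encoder. This is a sketch of the mathematical idea only, assuming small sizes where overflow is not a concern; `encode` and `decode` are hypothetical names and are not part of pandas.

```python
from itertools import product


def encode(row, shape):
    """Map (a_1, ..., a_m) with 0 <= a_i < n_i to a single integer."""
    key = 0
    for a, n in zip(row, shape):
        key = key * n + a
    return key


def decode(key, shape):
    """Inverse of encode: recover the tuple from the integer."""
    row = []
    for n in reversed(shape):
        key, a = divmod(key, n)
        row.append(a)
    return tuple(reversed(row))


shape = (3, 4, 5)
all_rows = list(product(*(range(n) for n in shape)))
keys = [encode(r, shape) for r in all_rows]

# Tuples in lexicographic order map onto 0, 1, ..., n1*n2*n3 - 1.
assert keys == list(range(3 * 4 * 5))
assert all(decode(k, shape) == r for k, r in zip(keys, all_rows))

# A subset that skips values in some column (i.e. is not factorized)
# still receives distinct keys.
subset = [(0, 3, 0), (2, 0, 4), (2, 3, 4)]
subset_keys = [encode(r, shape) for r in subset]
assert len(set(subset_keys)) == len(subset)
```

The last check corresponds to the claim that the subset U does not have to project onto all of Z_ni for the encoding to stay injective.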

behzadnouri · 2015-10-23T18:20:37Z

@evanpw before trying to explain what the function does, can you plz check commit history of the function code and see who had contributed to its code?

you are modifying the function code to fix it, then you say that it does not need factorized labels!

that function does a core calculation to many pandas functionality, and has nothing to do with the bug you are trying to close.

jreback · 2015-10-23T18:22:55Z

@behzadnouri

I see no problem with @evanpw trying a different soln. If we can assure that it does the correct thing, but maybe in a different way to keep the perf benefits, where is the harm?

behzadnouri · 2015-10-23T18:31:08Z

@jreback

you are the one who has done the harm, namely #10917

obviously you do not understand how these functions work. can u plz at least not be dismissive when i am trying to help you.

jreback · 2015-10-23T18:32:36Z

@behzadnouri of course I did this change. yet we didn't have enough test coverage or it wouldn't have been ok.

I am all for rolling this back. but @evanpw soln seems reasonable.

pls don't be dismissive of others, it is simply not nice.

behzadnouri · 2015-10-23T18:37:41Z

@jreback get_group_index has nothing to do with #11376 bug

can we plz not modify that function? that function should only work with factorized labels! you see that this PR is modifying it to fix it because it is calling the function with non-factorized labels.

jreback · 2015-10-23T18:39:25Z

@behzadnouri see that would be a MUCH better comment. thank you.

I agree we should be clear on whether things are factorized or not going into functions.

So maybe the doc-string needs to be updated to make that clear.

kawochen · 2015-10-23T19:28:23Z

@evanpw please consider these two cases

import pandas as pd
import numpy as np
_INT64_MAX = np.iinfo(np.int64).max
df_e = pd.DataFrame([[0]*63, [_INT64_MAX-1]*63]).drop_duplicates()
df_w = pd.DataFrame([[0]*63, [_INT64_MAX]*63]).drop_duplicates()
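One plausible reading of why these two frames are edge cases, which the comment itself does not spell out, is that they push both the per-column size and the combined key space past the int64 range. The arithmetic can be checked with plain Python integers; `key_space` is a hypothetical helper that assumes raw values 0..max are used directly as labels.

```python
INT64_MAX = 2**63 - 1  # equals np.iinfo(np.int64).max


def key_space(col_maxes):
    """Per-column sizes and total key count when raw values 0..max are the labels."""
    sizes = [m + 1 for m in col_maxes]
    total = 1
    for s in sizes:
        total *= s
    return sizes, total


# 63 columns, matching the two frames above.
sizes_e, total_e = key_space([INT64_MAX - 1] * 63)
sizes_w, total_w = key_space([INT64_MAX] * 63)

# First frame: each column size still fits in int64, but the product does not.
print(all(s <= INT64_MAX for s in sizes_e), total_e > INT64_MAX)  # True True

# Second frame: even a single column size (max + 1) no longer fits in int64.
print(any(s > INT64_MAX for s in sizes_w))  # True
```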

kawochen · 2015-10-23T19:30:35Z

pandas/core/frame.py

+                else:
+                    width = long(upper) - long(lower)
+
+                    if width < np.iinfo(np.int64).max - 1:


perhaps make special constants like _INT64_MAX available in some module and then groupby.py and here can just import it?

jreback · 2015-10-23

yeh these could be in pandas/core/common.py I guess (do a separate PR for this). they are already imported directly from the C lib in cython, so this is just for python space

evanpw · 2015-10-23T19:50:12Z

@behzadnouri Sorry if I came across as rude. I really like the function you've written! I just happened to notice that it also works perfectly for non-factorized inputs, except in a certain edge case involving integer overflow (and already has to handle non-factorized intermediate values in some cases), so that we could keep the performance improvement from jreback's change while fixing the bug. I personally have no pressing need for drop_duplicates to be super fast, so if this change is going to be controversial, then let's just revert #10917.

evanpw · 2015-10-23T19:54:51Z

@kawochen Good catch. Should be fixed now.

jreback · 2015-10-23T20:02:40Z

ok, @evanpw pls revert #10917. but KEEP your tests, I guess that is how this slipped thru.

evanpw · 2015-10-23T20:03:09Z

Should I do it in a separate PR?

jreback · 2015-10-23T20:04:51Z

@evanpw no, you can do it here. separate your tests into a single commit, then you'll need to fix up some stuff on the revert (e.g. the whatsnew note). so you'll end up with the revert commit with some conflicts fixed and your tests in a commit. you will also need a new whatsnew note for 0.17.1 (which you can put in either)

evanpw · 2015-10-23T20:26:00Z

This wiped out the whatsnew message from 0.17.0. I'm not sure if that's the right thing in this case.

jreback · 2015-10-23T20:28:13Z

@evanpw no, leave the whatsnew from 0.17.0 (and add a new one for 0.17.1), leave the asv benchmark as well.

…ger columns (GH 11376)

… arrays" This reverts commit a00c7ea, but leaves new tests and benchmark

evanpw · 2015-10-23T20:55:16Z

Okay, this is at the very limit of my git ability, but it should be correct now.

jreback · 2015-10-23T20:56:44Z

can you add back the asv benchmark? (you can just add it in and combine with say the tests commit)

evanpw · 2015-10-23T21:08:09Z

It should still be there. I only reverted the changes to frame.py.

jreback · 2015-10-23T21:10:17Z

ahh, right....hahah ok then.

ping on green.

evanpw · 2015-10-24T00:13:40Z

all green

BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns

jreback · 2015-10-24T00:18:30Z

thanks!

jreback reviewed Oct 21, 2015

jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 21, 2015

jreback added this to the 0.17.1 milestone Oct 21, 2015

evanpw force-pushed the drop_dup_integers branch from f436d94 to 34467fc Compare October 22, 2015 17:03

behzadnouri reviewed Oct 23, 2015

kawochen reviewed Oct 23, 2015

evanpw force-pushed the drop_dup_integers branch from 2dea14a to 3f68d93 Compare October 23, 2015 20:24

BUG: drop_duplicates drops non-duplicate rows in the presence of inte…

d6ae52a

…ger columns (GH 11376)

Revert "PERF: perf improvements in drop_duplicates for integer dtyped…

b710728

… arrays" This reverts commit a00c7ea, but leaves new tests and benchmark

evanpw force-pushed the drop_dup_integers branch from 3f68d93 to b710728 Compare October 23, 2015 20:54

jreback added a commit that referenced this pull request Oct 24, 2015

Merge pull request #11403 from evanpw/drop_dup_integers

8a04d63

BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns

jreback merged commit 8a04d63 into pandas-dev:master Oct 24, 2015

evanpw deleted the drop_dup_integers branch October 24, 2015 00:47

This was referenced Oct 25, 2015

drop_duplicates destroys non-duplicated data under 0.17 #11376

Closed

DataFrame.duplicated detects duplicates when none exist #11436

Closed

0.17 drop_duplicates() incorrectly dropping non-unique values #11459

Closed

TomAugspurger mentioned this pull request Nov 3, 2015

drop_duplicates() is dropping more than just duplicates in 0.17.0 #11512

Closed

This was referenced Nov 7, 2015

Potential bug: drop_duplicates() and duplicated() fail for multiple integer columns #11543

Closed

unexpected behaviour of DataFrame.duplicated #11567

Closed


