
drop_duplicates destroys non-duplicated data under 0.17 #11376

Closed
RPGillespie6 opened this issue Oct 19, 2015 · 12 comments


@RPGillespie6 commented Oct 19, 2015

The drop_duplicates() function in Python 3 is broken. Take the following example snippet:

import pandas as pd

raw_data = {'x': [7,6,3,3,4,8,0],'y': [0,6,5,5,9,1,2]}
df = pd.DataFrame(raw_data, columns = ['x', 'y'])

print("Before:", df)
df = df.drop_duplicates()
print("After:", df)

When run under Python 2 the results are correct, but under Python 3 pandas also removes the row (6, 6), which is completely unique. When this function is used on large CSV files, it silently destroys thousands of rows of unique data.

See:
http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows
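
A minimal stopgap sketch, assuming (as dsm054's astype(str) experiment later in the thread suggests) that only the integer code path is affected: compute the duplicate mask on a string view of the frame, then index the original frame so the integer dtypes are preserved.

import pandas as pd

raw_data = {'x': [7, 6, 3, 3, 4, 8, 0], 'y': [0, 6, 5, 5, 9, 1, 2]}
df = pd.DataFrame(raw_data, columns=['x', 'y'])

# Workaround sketch (assumption: the string path avoids the buggy integer
# shortcut): build the mask on a string view, index the original frame.
deduped = df.loc[~df.astype(str).duplicated()]
print(deduped)  # keeps the unique (6, 6) row; drops only the repeated (3, 5) row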

@jreback (Contributor) commented Oct 19, 2015

pls pd.show_versions()

jreback added this to the 0.17.1 milestone Oct 19, 2015


@oscar6echo commented Oct 19, 2015

In the same vein, and maybe connected (?), I ran into a Python 2 case where the .duplicated() method applied to a DataFrame returned rows that were NOT duplicates. The example is too large to paste here, but I wanted to mention it in case it's the same bug under the hood.

@kawochen (Contributor) commented Oct 19, 2015

able to reproduce in master (PY3)

@dsm054 (Contributor) commented Oct 20, 2015

I see the loss of 6,6 under both 2.7.6 and 3.5.0 with 0.17.0; that is, I see no 2 vs. 3 difference in this example. (My notebook has a somewhat rare 32-bit Python Linux environment, so problems sometimes manifest differently there.)

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,0],[0,2]])
>>> df
   0  1
0  1  0
1  0  2
>>> df.duplicated()
0    False
1     True
dtype: bool
>>> df.astype(str).duplicated()
0    False
1    False
dtype: bool
>>> df.astype(np.float64).duplicated()
0    False
1    False
dtype: bool

@dsm054 (Contributor) commented Oct 20, 2015

# if we have integers we can directly index with these" before the int shortcut in f is correct. I'm not sure that get_group_index plays nicely with the integers if they're not ranked by column. For example:

In [202]: get_group_index([np.array([1,0]), np.array([0,2])], np.array([2,2]), False, False)
Out[202]: array([2, 2], dtype=int64)

Anyway, one route to a quick fix would be to push everything along the factorize branch, IIUC.
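
To spell out the collision (a re-derivation of the row-major arithmetic, not the actual get_group_index source): with shape (2, 2), the out-of-range label 2 in the second column flattens two distinct rows onto the same index.

import numpy as np

# Hypothetical re-derivation of what get_group_index computes here: a
# row-major position in a grid of the given shape, i.e.
#   index = labels[0] * shape[1] + labels[1]
labels = [np.array([1, 0]), np.array([0, 2])]
shape = np.array([2, 2])

flat = labels[0] * shape[1] + labels[1]
print(flat)  # [2 2] -- labels[1] contains 2 >= shape[1], so two
             # distinct rows collide on the same flat index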

@sinhrks (Member) commented Oct 20, 2015

I can't track it down fully yet, but the part below looks like it breaks the factorized labels.

@behzadnouri Any idea?

@pag commented Oct 21, 2015

Just a 'me too'. I'm running into this problem under Python 2.7.10 now that I've upgraded to pandas 0.17. I can also reproduce using the example given in the Stack Overflow post:

Before:
   x  y
0  7  0
1  6  6
2  3  5
3  3  5
4  4  9
5  8  1
6  0  2

After:
   x  y
0  7  0
2  3  5
4  4  9
5  8  1
6  0  2

i.e. it's dropped the 3,5 duplicate (correct) but also the 6,6 row (incorrect).

@evanpw (Contributor) commented Oct 21, 2015

If all of the integers are non-negative, we can index directly with integers, but the shape should be the largest integer present + 1, rather than the number of unique integers.
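
A quick sketch of that sizing rule on the earlier example (hypothetical, not the actual patch): with shape taken as the largest label + 1 per column, every label stays in range and the flat indices separate.

import numpy as np

labels = [np.array([1, 0]), np.array([0, 2])]

# Shape from the largest label + 1 per column, not the number of uniques:
shape = np.array([lab.max() + 1 for lab in labels])  # [2, 3]

flat = labels[0] * shape[1] + labels[1]
print(flat)  # [3 2] -- the two distinct rows no longer collide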

@behzadnouri (Contributor) commented Oct 21, 2015

@jreback @sinhrks

This is broken by #10917, at core/frame.py#L2997-L3002: factorize cannot be replaced by unique1d.

>>> a = np.array([ 5,  8, 11])
>>> unique1d(a)      # returns the unique values themselves
array([ 5,  8, 11])
>>> factorize(a)[0]  # returns dense codes in [0, n_uniques)
array([0, 1, 2])
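
For contrast with the public API (assuming pd.factorize behaves like the internal factorize here), the dense codes are what get_group_index's invariant labels[i] < shape[i] requires:

import numpy as np
import pandas as pd

a = np.array([5, 8, 11])

labels, uniques = pd.factorize(a)
print(labels)   # [0 1 2] -- dense codes in [0, n_uniques), safe for get_group_index
print(uniques)  # [ 5  8 11]

# unique1d only yields the raw values [5, 8, 11]; with shape == 3 they violate
# labels[i] < shape[i], so the flattened group index can collide or overflow.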

RPGillespie6 changed the title from "drop_duplicates destroys non-duplicated data under Python 3" to "drop_duplicates destroys non-duplicated data under 0.17" Oct 21, 2015

@fridiculous commented Oct 25, 2015

👍 on duplicated issues with 0.17.0

I'm seeing a similar problem, as the following should return an empty data frame:

df[df.duplicated(['charstring','number'], keep=False)]

index   charstring                              number
7989    E1FDF0E7-DFBD-428F-A6E7-5D48EAD3A559    435
7990    B3308C59-B9CF-42CB-A2C6-A406CA36EF2B    0

@jreback (Contributor) commented Oct 25, 2015

closed by #11403
