drop_duplicates destroys non-duplicated data under 0.17 #11376

Closed
RPGillespie6 opened this Issue Oct 19, 2015 · 12 comments


The drop_duplicates() function in Python 3 is broken. Take the following example snippet:

import pandas as pd

raw_data = {'x': [7,6,3,3,4,8,0],'y': [0,6,5,5,9,1,2]}
df = pd.DataFrame(raw_data, columns = ['x', 'y'])

print("Before:", df)
df = df.drop_duplicates()
print("After:", df)

When run under Python 2 the results are correct, but when run under Python 3 pandas also removes the row (6, 6) from the frame, which is completely unique. When using this function with large CSV files, it silently destroys thousands of lines of unique data.

See:
http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows
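Until a fixed release is available, one way to route around the broken path is suggested by dsm054's dtype experiments later in this thread: the bug only bites the integer shortcut, so deduplicating on a non-integer view of the frame gives the correct mask. A minimal sketch (my own illustration, not an officially recommended workaround):

```python
import pandas as pd

raw_data = {'x': [7, 6, 3, 3, 4, 8, 0], 'y': [0, 6, 5, 5, 9, 1, 2]}
df = pd.DataFrame(raw_data, columns=['x', 'y'])

# Compute the duplicate mask on a string view of the frame (which skips
# the integer fast path), then select the surviving rows from the
# original frame so the dtypes are preserved
deduped = df[~df.astype(str).duplicated()]
```

This keeps the unique (6, 6) row while still dropping the repeated (3, 5) row.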

Contributor

jreback commented Oct 19, 2015

pls pd.show_versions()

jreback added this to the 0.17.1 milestone Oct 19, 2015

Contributor

jreback commented Oct 19, 2015

In the same area - and maybe connected (?) - I ran into a Python 2 case where the .duplicated() method applied to a DataFrame returned rows that were NOT duplicates. The example is too large to paste here, but I wanted to mention it in case it's the same bug under the hood.

Contributor

kawochen commented Oct 19, 2015

able to reproduce in master (PY3)

Contributor

dsm054 commented Oct 20, 2015

I see the loss of 6,6 under both 2.7.6 and 3.5.0 with 0.17.0 -- that is, I see no Python 2 vs. 3 difference in this example. (My notebook has a somewhat rare 32-bit Python Linux environment, so problems sometimes manifest differently.)

>>> df = pd.DataFrame([[1,0],[0,2]])
>>> df
   0  1
0  1  0
1  0  2
>>> df.duplicated()
0    False
1     True
dtype: bool
>>> df.astype(str).duplicated()
0    False
1    False
dtype: bool
>>> df.astype(np.float64).duplicated()
0    False
1    False
dtype: bool

Contributor

dsm054 commented Oct 20, 2015

I'm not convinced that the comment # if we have integers we can directly index with these before the int shortcut in f is correct. I'm not sure get_group_index plays nicely with the integers if they're not ranked by column. For example:

In [202]: get_group_index([np.array([1,0]), np.array([0,2])], np.array([2,2]), False, False)
Out[202]: array([2, 2], dtype=int64)

Anyway, one route to a quick fix would be to push everything along the factorize branch, IIUC.
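A minimal sketch of that factorize route (my own illustration of the idea, not the eventual patch): factorize each column into dense codes, so the per-column shape is always the number of uniques, then combine the codes into one group id per row and mark repeats.

```python
import numpy as np
import pandas as pd

def duplicated_via_factorize(df, keep='first'):
    # Factorize each column: codes are dense in [0, n_uniques), so the
    # shape is valid no matter what the raw values are
    codes_list, shape = [], []
    for col in df.columns:
        codes, uniques = pd.factorize(df[col])
        codes_list.append(codes)
        shape.append(len(uniques))
    # Mixed-radix combine of the per-column codes into one id per row
    ids = np.ravel_multi_index(codes_list, shape)
    return pd.Series(ids).duplicated(keep=keep).values

df = pd.DataFrame({'x': [7, 6, 3, 3, 4, 8, 0], 'y': [0, 6, 5, 5, 9, 1, 2]})
mask = duplicated_via_factorize(df)
# only row 3 (the second occurrence of (3, 5)) is flagged
```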

Member

sinhrks commented Oct 20, 2015

I can't fully track it down yet, but the part below looks like it breaks the factorized labels.

@behzadnouri Any idea?

pag commented Oct 21, 2015

Just a 'me too'. I'm running into this problem under Python 2.7.10 now that I've upgraded to pandas 0.17. I can also reproduce it using the example given in the stackoverflow post:

Before:
   x  y
0  7  0
1  6  6
2  3  5
3  3  5
4  4  9
5  8  1
6  0  2

After:
   x  y
0  7  0
2  3  5
4  4  9
5  8  1
6  0  2

i.e. it's dropped the duplicated (3, 5) row (correct) but also the unique (6, 6) row (incorrect).

Contributor

evanpw commented Oct 21, 2015

If all of the integers are non-negative, we can index directly with integers, but the shape should be the largest integer present + 1, rather than the number of unique integers.
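That point in two lines, reusing dsm054's (1, 0) / (0, 2) example above (a toy illustration with the mixed-radix encoding written out by hand, mirroring what get_group_index computes):

```python
import numpy as np

# Rows (1, 0) and (0, 2), given as one label array per column
col_a = np.array([1, 0])
col_b = np.array([0, 2])

# Wrong: shape taken as the number of uniques per column -> (2, 2).
# id = a * 2 + b maps BOTH rows to 2, so two distinct rows collide.
wrong_ids = col_a * 2 + col_b

# Right: shape taken as (largest value + 1) per column -> (2, 3).
# id = a * 3 + b keeps the rows distinct.
right_ids = col_a * 3 + col_b
```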

Contributor

behzadnouri commented Oct 21, 2015

@jreback
@sinhrks

this is broken by pydata#10917
core/frame.py#L2997-L3002.

factorize cannot be replaced by unique1d.

>>> a
array([ 5,  8, 11])
>>> unique1d(a)
array([ 5,  8, 11])
>>> factorize(a)[0]
array([0, 1, 2])
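The same contrast with the public APIs, as a runnable snippet (pd.unique behaves like the internal unique1d here): unique returns the distinct values themselves, which can be large and sparse, while factorize returns dense codes that are safe to use as array indices.

```python
import numpy as np
import pandas as pd

a = np.array([5, 8, 11])

# unique returns the distinct *values* -- unusable as direct indices
# into an array of length 3
vals = pd.unique(a)                # array([ 5,  8, 11])

# factorize returns dense *codes* in [0, n_uniques), plus the uniques
codes, uniques = pd.factorize(a)   # codes: array([0, 1, 2])
```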

RPGillespie6 changed the title from drop_duplicates destroys non-duplicated data under Python 3 to drop_duplicates destroys non-duplicated data under 0.17 Oct 21, 2015

👍 on duplicated issues with 0.17.0

I'm seeing a similar problem; the following should return an empty data frame:

df[df.duplicated(['charstring','number'], keep=False)]

index   charstring                              number
7989    E1FDF0E7-DFBD-428F-A6E7-5D48EAD3A559    435
7990    B3308C59-B9CF-42CB-A2C6-A406CA36EF2B    0

Contributor

jreback commented Oct 25, 2015

closed by #11403

jreback closed this Oct 25, 2015
