
drop_duplicates destroys non-duplicated data under 0.17 #11376

Closed
RPGillespie6 opened this issue Oct 19, 2015 · 12 comments


@RPGillespie6 commented Oct 19, 2015

The drop_duplicates() function in Python 3 is broken. Take the following example snippet:

import pandas as pd

raw_data = {'x': [7,6,3,3,4,8,0],'y': [0,6,5,5,9,1,2]}
df = pd.DataFrame(raw_data, columns = ['x', 'y'])

print("Before:", df)
df = df.drop_duplicates()
print("After:", df)

When run under Python 2 the results are correct, but under Python 3 pandas also removes the row (6, 6), which is completely unique. When this function is used on large CSV files, it silently destroys thousands of rows of unique data.

See:
http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows
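
A minimal stopgap sketch, assuming (as dsm054's astype(str) experiment later in the thread suggests) that only the integer code path is affected: compute the duplicate mask on a string view of the frame, then index the original frame so the integer dtypes are preserved.

import pandas as pd

raw_data = {'x': [7, 6, 3, 3, 4, 8, 0], 'y': [0, 6, 5, 5, 9, 1, 2]}
df = pd.DataFrame(raw_data, columns=['x', 'y'])

# Workaround sketch (assumption: the string path avoids the buggy integer
# shortcut): build the mask on a string view, index the original frame.
deduped = df.loc[~df.astype(str).duplicated()]
print(deduped)  # keeps the unique (6, 6) row; drops only the repeated (3, 5) row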

@jreback (Contributor) commented Oct 19, 2015

pls pd.show_versions()

jreback added this to the 0.17.1 milestone Oct 19, 2015


@oscar6echo commented Oct 19, 2015

In the same vein, and maybe connected (?), I ran into a Python 2 case where the .duplicated() method applied to a DataFrame returned rows that were NOT duplicates. The example is too large to paste here, but I wanted to mention it in case it's the same bug under the hood.

@kawochen (Contributor) commented Oct 19, 2015

able to reproduce in master (PY3)

@dsm054 (Contributor) commented Oct 20, 2015

I see the loss of 6,6 under both 2.7.6 and 3.5.0 with 0.17.0; that is, I see no 2 vs. 3 difference in this example. (My notebook has a somewhat rare 32-bit Python Linux environment, so problems sometimes manifest differently there.)

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,0],[0,2]])
>>> df
   0  1
0  1  0
1  0  2
>>> df.duplicated()
0    False
1     True
dtype: bool
>>> df.astype(str).duplicated()
0    False
1    False
dtype: bool
>>> df.astype(np.float64).duplicated()
0    False
1    False
dtype: bool

@dsm054 (Contributor) commented Oct 20, 2015

# if we have integers we can directly index with these" before the int shortcut in f is correct. I'm not sure that get_group_index plays nicely with the integers if they're not ranked by column. For example:

In [202]: get_group_index([np.array([1,0]), np.array([0,2])], np.array([2,2]), False, False)
Out[202]: array([2, 2], dtype=int64)

Anyway, one route to a quick fix would be to push everything along the factorize branch, IIUC.
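
To spell out the collision (a re-derivation of the row-major arithmetic, not the actual get_group_index source): with shape (2, 2), the out-of-range label 2 in the second column flattens two distinct rows onto the same index.

import numpy as np

# Hypothetical re-derivation of what get_group_index computes here: a
# row-major position in a grid of the given shape, i.e.
#   index = labels[0] * shape[1] + labels[1]
labels = [np.array([1, 0]), np.array([0, 2])]
shape = np.array([2, 2])

flat = labels[0] * shape[1] + labels[1]
print(flat)  # [2 2] -- labels[1] contains 2 >= shape[1], so two
             # distinct rows collide on the same flat index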

@sinhrks (Member) commented Oct 20, 2015

I can't track it down fully yet, but the part below looks like it breaks the factorized labels.

@behzadnouri Any idea?

@pag commented Oct 21, 2015

Just a 'me too'. I'm running into this problem under Python 2.7.10 now that I've upgraded to pandas 0.17. I can also reproduce using the example given in the Stack Overflow post:

Before:
   x  y
0  7  0
1  6  6
2  3  5
3  3  5
4  4  9
5  8  1
6  0  2

After:
   x  y
0  7  0
2  3  5
4  4  9
5  8  1
6  0  2

i.e. it's dropped the 3,5 duplicate (correct) but also the 6,6 row (incorrect).

@evanpw (Contributor) commented Oct 21, 2015

If all of the integers are non-negative, we can index directly with integers, but the shape should be the largest integer present + 1, rather than the number of unique integers.
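
A quick sketch of that sizing rule on the earlier example (hypothetical, not the actual patch): with shape taken as the largest label + 1 per column, every label stays in range and the flat indices separate.

import numpy as np

labels = [np.array([1, 0]), np.array([0, 2])]

# Shape from the largest label + 1 per column, not the number of uniques:
shape = np.array([lab.max() + 1 for lab in labels])  # [2, 3]

flat = labels[0] * shape[1] + labels[1]
print(flat)  # [3 2] -- the two distinct rows no longer collide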

@behzadnouri (Contributor) commented Oct 21, 2015

@jreback @sinhrks

This is broken by #10917, at core/frame.py#L2997-L3002: factorize cannot be replaced by unique1d.

>>> a = np.array([ 5,  8, 11])
>>> unique1d(a)      # returns the unique values themselves
array([ 5,  8, 11])
>>> factorize(a)[0]  # returns dense codes in [0, n_uniques)
array([0, 1, 2])
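
For contrast with the public API (assuming pd.factorize behaves like the internal factorize here), the dense codes are what get_group_index's invariant labels[i] < shape[i] requires:

import numpy as np
import pandas as pd

a = np.array([5, 8, 11])

labels, uniques = pd.factorize(a)
print(labels)   # [0 1 2] -- dense codes in [0, n_uniques), safe for get_group_index
print(uniques)  # [ 5  8 11]

# unique1d only yields the raw values [5, 8, 11]; with shape == 3 they violate
# labels[i] < shape[i], so the flattened group index can collide or overflow.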

RPGillespie6 changed the title from "drop_duplicates destroys non-duplicated data under Python 3" to "drop_duplicates destroys non-duplicated data under 0.17" Oct 21, 2015

@fridiculous commented Oct 25, 2015

👍 on duplicated issues with 0.17.0

I'm seeing a similar problem, as the following should return an empty data frame:

df[df.duplicated(['charstring','number'], keep=False)]

index   charstring                              number
7989    E1FDF0E7-DFBD-428F-A6E7-5D48EAD3A559    435
7990    B3308C59-B9CF-42CB-A2C6-A406CA36EF2B    0

@jreback (Contributor) commented Oct 25, 2015

closed by #11403
