drop_duplicates destroys non-duplicated data under 0.17 #11376
jreback added the Bug and Reshaping labels on Oct 19, 2015, and added this issue to the 0.17.1 milestone.
cc @sinhrks
oscar6echo commented Oct 19, 2015

In the same area, and maybe connected, I ran into a Python 2 case where the .duplicated() method applied to a DataFrame returned rows that were NOT duplicates. The example is too large to paste here, but I wanted to mention it in case it is the same bug under the hood.
Able to reproduce in master (PY3).
I see the loss of 6,6 under both 2.7.6 and 3.5.0 with 0.17.0; that is, I see no Python 2 vs. 3 difference in this example. (My notebook has a somewhat rare 32-bit Python Linux environment, so problems sometimes manifest differently.)
I'm not convinced that the comment
Anyway, one route to a quick fix would be to push everything along the factorize branch, IIUC.
I can't track it down fully yet, but the part below looks to break the factorized labels. @behzadnouri Any idea?
pag commented Oct 21, 2015
Just a "me too". I'm running into this problem under Python 2.7.10 now that I've upgraded to pandas 0.17. I can also reproduce using the example given in the StackOverflow post:
i.e. it dropped the duplicated 3,5 row (correct) but also the 6,6 row (incorrect).
If all of the integers are non-negative, we can index directly with the integers, but the shape should be the largest integer present plus 1, rather than the number of unique integers.
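The sizing point above can be sketched with a small NumPy example (illustrative only, not the actual pandas internals): a direct-index membership table sized by the number of unique values would be too small, while one sized by the largest value plus 1 works.

```python
import numpy as np

values = np.array([5, 8, 11])

# Sizing the table by the number of unique values (3) would make
# table[5], table[8], table[11] out of bounds.
# Size it by the largest value + 1 instead.
table = np.zeros(values.max() + 1, dtype=bool)
table[values] = True  # direct integer indexing, no hashing needed

# Every value is marked exactly once
assert table.sum() == len(np.unique(values))
```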
evanpw referenced this issue on Oct 21, 2015 (Merged):
BUG: drop_duplicates drops non-duplicate rows in the presence of integer columns #11403
this is broken by pydata#10917

>>> a
array([ 5,  8, 11])
>>> unique1d(a)
array([ 5,  8, 11])
>>> factorize(a)[0]
array([0, 1, 2])
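For contrast with the session above: factorize maps values to dense labels 0..n-1 rather than returning the values themselves, which is why the dense labels are safe indices into an array sized by the unique count while the raw values 5, 8, 11 are not. A small sketch using the public pandas API:

```python
import numpy as np
import pandas as pd

a = np.array([5, 8, 11])
codes, uniques = pd.factorize(a)

# codes are dense labels 0..n-1; uniques holds the original values
assert list(codes) == [0, 1, 2]
assert list(uniques) == [5, 8, 11]

# Dense labels index safely into an array of size len(uniques);
# the raw values would run past the end.
counts = np.zeros(len(uniques), dtype=int)
np.add.at(counts, codes, 1)
assert list(counts) == [1, 1, 1]
```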
RPGillespie6 changed the title from "drop_duplicates destroys non-duplicated data under Python 3" to "drop_duplicates destroys non-duplicated data under 0.17" on Oct 21, 2015.
fridiculous commented Oct 25, 2015
I'm seeing a similar problem, as the following should return an empty DataFrame:
Closed by #11403.
RPGillespie6 commented Oct 19, 2015

The drop_duplicates() function in Python 3 is broken. Take the following example snippet: when run under Python 2, the results are correct, but when run under Python 3, pandas removes 6,6 from the frame, which is a completely unique row. When using this function with large CSV files, it causes thousands of lines of unique data loss.

See: http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows
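The original example snippet did not survive here; a minimal reproduction consistent with the description (the column names and values are my guesses, not the reporter's code) might look like:

```python
import pandas as pd

# One genuine duplicate pair (3, 5) and one unique row (6, 6).
# Under pandas 0.17.0 drop_duplicates() also dropped the unique (6, 6) row;
# on fixed versions (>= 0.17.1) it is kept.
df = pd.DataFrame({"a": [1, 3, 3, 6], "b": [1, 5, 5, 6]})
deduped = df.drop_duplicates()

# Only the second (3, 5) row should be removed
assert len(deduped) == 3
# The unique (6, 6) row must survive deduplication
assert ((deduped["a"] == 6) & (deduped["b"] == 6)).any()
```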