df.drop_duplicates() not working as expected #32993

Closed
kenjioman opened this issue Mar 25, 2020 · 9 comments
Labels: Bug; duplicated (duplicated, drop_duplicates); Strings (String extension data type and string data)


@kenjioman

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame([
    ['Microsoft'],
    ['Microsoft\x00quq2a<ScRiPt>alert(1)</ScRiPt>srfe1'],
    ['../../../../../../../../../../../../../../../../etc/passwd\x00Microsoft'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.ini\x00Microsoft'],
    ['Microsoft\x00vp30o<script>alert(1)</script>vz075'],
    ['../../../../../../../../../../../../../../../../etc/passwdX\x00Microsoft'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.ini'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.iniX\x00Microsoft'],
    ['../../../../../../../../../../../../../../../../etc/passwd'],
    ['../../../../../../../../../../../../../../../../etc/passwdX'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.iniX']])

df.drop_duplicates()

Problem description

If you run the code above, only rows 0, 2, 3, 5, and 7 are retained. However, the strings are all unique, so I expected drop_duplicates() to retain every row. It looks like drop_duplicates() only compares strings up to the first non-printing character and ignores everything after it.

Also, to note, I discovered this bug since I originally had these strings in a set and I didn't expect drop_duplicates() to do anything, but it did.
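For anyone who wants to check this on their own install, here is a minimal sketch that compares the number of distinct strings according to a plain Python set with the number of rows drop_duplicates() keeps. The three sample values and the variable names are my own, not taken from the report above, and the pandas count will depend on the installed version.

```python
import pandas as pd

# Three strings that are distinct at the Python level; two contain \x00.
values = ["Microsoft", "Microsoft\x00VP", "Microsoft\x00CEO"]

# A Python set hashes and compares the full string, including any \x00.
python_unique = len(set(values))

# Number of rows pandas keeps; on affected versions this can be smaller.
pandas_rows = len(pd.DataFrame({0: values}).drop_duplicates())

print("distinct per Python set:", python_unique)
print("rows kept by drop_duplicates():", pandas_rows)
```

If the two numbers differ, pandas is treating distinct strings as duplicates.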

I've verified this on pandas.__version__ == '1.0.3' (installed via Anaconda on Linux in a new test environment). I looked for existing issues but didn't find anything that seemed to match (although #11376 seems close, in the same vein).

Expected Output

(github markdown is highlighting some of the rows red, for some reason, but this is unrelated to what should be shown.)

>>> df
                                                    0
0                                           Microsoft
1       Microsoftquq2a<ScRiPt>alert(1)</ScRiPt>srfe1
2   ../../../../../../../../../../../../../../../....
3   ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....
4       Microsoftvp30o<script>alert(1)</script>vz075
5   ../../../../../../../../../../../../../../../....
6   ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....
7   ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....
8   ../../../../../../../../../../../../../../../....
9   ../../../../../../../../../../../../../../../....
10  ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....

Output of pd.show_versions() (ignoring rows with "None")

INSTALLED VERSIONS
------------------
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-88-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.1.post20200322
@TomAugspurger
Contributor

Can you investigate where things are going wrong? Likely somewhere in our factorization / hashtable code, though I'm not sure. (_libs/hashtable.pyx)

@AMR-KELEG

I am interested in checking the issue this weekend.
I will let you know whenever I have any new updates.

@kenjioman
Author

One thing I should note: it looks like this isn't triggered by all non-printing characters; I believe it is specific to \x00, the null character. And @AMR-KELEG, thanks for being willing to take a look!
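To illustrate the hypothesis, here is a pure-Python sketch that models "comparison stops at the first \x00" and checks which rows would survive under that model. This is only a model of the suspected behaviour, not pandas' actual implementation; truncate_at_nul is a hypothetical helper, and the paths are shortened from the original example for readability.

```python
def truncate_at_nul(s: str) -> str:
    """Return the part of s before the first \\x00 (hypothetical model)."""
    return s.split("\x00", 1)[0]

# Same pattern as the original example, with shortened paths.
rows = [
    "Microsoft",
    "Microsoft\x00quq2a<ScRiPt>alert(1)</ScRiPt>srfe1",
    "../etc/passwd\x00Microsoft",
    "..\\windows\\win.ini\x00Microsoft",
    "Microsoft\x00vp30o<script>alert(1)</script>vz075",
    "../etc/passwdX\x00Microsoft",
    "..\\windows\\win.ini",
    "..\\windows\\win.iniX\x00Microsoft",
    "../etc/passwd",
    "../etc/passwdX",
    "..\\windows\\win.iniX",
]

seen = set()
retained = []
for i, s in enumerate(rows):
    key = truncate_at_nul(s)  # only the prefix before \x00 is compared
    if key not in seen:
        seen.add(key)
        retained.append(i)

print(retained)  # [0, 2, 3, 5, 7]
```

The model reproduces exactly the rows reported as retained, which is consistent with the problem being specific to \x00.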

@JochenFromm

The problem also happens for the pandas.unique function in core/algorithms.py. Normal strings do not cause problems...

>>> import pandas as pd
>>> pd.unique(pd.Series(['Microsoft', 'Microsoft VP']))
array(['Microsoft', 'Microsoft VP'], dtype=object)

...but for strings that contain the special character \x00 the function returns only one entry (\x01 works, though):

>>> pd.unique(pd.Series(['Microsoft', 'Microsoft\x00VP']))
array(['Microsoft'], dtype=object)

>>> pd.unique(pd.Series(['Microsoft\x00VP', 'Microsoft']))
array(['Microsoft\x00VP'], dtype=object)

This supports the idea that the problem happens in the hashtable code (_libs/hashtable.pyx), as suggested by @TomAugspurger.
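Until this is fixed, one possible workaround is to flag duplicates with a plain Python set instead of going through pandas' hashtable. Below is a minimal sketch; python_duplicated is a hypothetical helper of my own (not a pandas API), and it assumes the values are hashable and that there are no missing values needing special handling.

```python
import pandas as pd

def python_duplicated(series: pd.Series) -> pd.Series:
    """Flag repeats using Python's own str hashing and equality.

    Hypothetical helper: it mimics the keep='first' behaviour but never
    goes through pandas' hashtable, so \\x00 is compared like any other
    character.
    """
    seen = set()
    flags = []
    for value in series:
        flags.append(value in seen)  # True only for an exact repeat
        seen.add(value)
    return pd.Series(flags, index=series.index)

s = pd.Series(["Microsoft", "Microsoft\x00VP", "Microsoft"])
deduped = s[~python_duplicated(s)]
print(deduped.tolist())  # ['Microsoft', 'Microsoft\x00VP']
```

This loops in Python, so it will be slower than drop_duplicates() on large data, but it keeps strings that differ only after a \x00.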

@aadishms

take

@sarwatfatimam

sarwatfatimam commented Apr 11, 2020

I am facing the same issue. With a DataFrame of 682 rows and 29 columns, where every column is either string or int64, drop_duplicates() drops two unique rows. The two rows share values in some columns and differ in others, so as whole rows they are unique and should not be dropped. Another odd thing: if I put just these two rows in a separate DataFrame, drop_duplicates() retains them.

This is really bad because it could drop many more rows on data with millions of rows. From what I have debugged so far, the problem lies in pandas.core.algorithms.factorize, which is used by the duplicated function in frame.py.

import datetime

import pandas as pd

# Note: the key 'I' appears twice below; in a dict literal the later
# value replaces the earlier one.
df = pd.DataFrame({
    'Id': ['5915', '5915'],
    'A': ['LAB GLUCOSE, BEDSIDE', 'LAB GLUCOSE, BEDSIDE'],
    'B': ['82962', '82962'],
    'C': ['Glucose Blood Test', 'Glucose Blood Test'],
    'D': ['300', '300'],
    'E': ['Laboratory-general', 'Laboratory-general'],
    'F': [None, None],
    'G': ['Lab POCT', 'Lab POCT'],
    'H': ['Lab and Path', 'Lab and Path'],
    'DS': [5, 12],
    'I': [datetime.date(2013, 4, 4), datetime.date(2013, 4, 11)],
    'J': [None, None],
    'K': ['4.21', '2.10'],
    'L': ['3.60', '1.80'],
    'M': ['$82.00', '$41.00'],
    'N': ['2.00', '1.00'],
    'I': ['2.105', '2.1'],
    'O': ['99.0', '99.0'],
    'P': ['7.81000000000005', '3.900000000004'],
    'Q': [3030, 3030],
    'R': ['Lab and Path', 'Lab and Path'],
    'S': ['Lab', 'Lab'],
    'T': ['General Lab', 'General Lab'],
    'U': ['Lab Glucose, Bedside', 'Lab Glucose, Bedside'],
    'V': [0, 0],
    'W': [None, None],
    'X': ['Other', 'Other'],
    'Y': [None, None],
    'Z': ['229', '229'],
})
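Since factorize is the suspect, a per-column check along these lines might help narrow down which column is being collapsed. This is a rough sketch: factorize_collapses is a hypothetical helper name, it uses the public pd.factorize rather than the internal module path, and it assumes the column has no missing values (pd.factorize leaves those out of its uniques by default).

```python
import numpy as np
import pandas as pd

def factorize_collapses(values) -> bool:
    """Return True if pd.factorize reports fewer distinct values than a
    plain Python set does (hypothetical diagnostic helper)."""
    arr = np.array(values, dtype=object)  # keep the values as Python objects
    codes, uniques = pd.factorize(arr)
    return len(uniques) < len(set(values))

# Ordinary strings: both counts agree, so nothing is collapsed.
print(factorize_collapses(["a", "b", "a"]))

# Strings with \x00: the result depends on the installed pandas version.
print(factorize_collapses(["Microsoft", "Microsoft\x00VP"]))
```

Running this over each object column of the affected DataFrame should point at the columns where distinct values get merged.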

@kenjioman
Author

@aadishms, just curious, any updates?

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Jun 4, 2020
@mroeschke mroeschke added duplicated duplicated, drop_duplicates and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Apr 18, 2021
@chaostheory

Thank you to everyone working on Pandas. It's a great library and tool.

Now that's out of the way, I just wanted to confirm that this method still drops rows that are NOT duplicates.

It was one of the hardest bugs to pinpoint. Even after looking at my data, I still don't understand why this method would think that the rows it dropped were dupes. If any project member is interested in looking at my data, ping me.

@mroeschke mroeschke added the Strings String extension data type and string data label Jul 30, 2021
@mroeschke
Member

Seems like a duplicate issue of #34551, and that thread is closer to pinpointing a potential fix so closing in favor of that issue.
