df.drop_duplicates() not working as expected #32993

kenjioman · 2020-03-25T00:36:38Z

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame([
    ['Microsoft'],
    ['Microsoft\x00quq2a<ScRiPt>alert(1)</ScRiPt>srfe1'],
    ['../../../../../../../../../../../../../../../../etc/passwd\x00Microsoft'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.ini\x00Microsoft'],
    ['Microsoft\x00vp30o<script>alert(1)</script>vz075'],
    ['../../../../../../../../../../../../../../../../etc/passwdX\x00Microsoft'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.ini'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.iniX\x00Microsoft'],
    ['../../../../../../../../../../../../../../../../etc/passwd'],
    ['../../../../../../../../../../../../../../../../etc/passwdX'],
    ['..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\..\\windows\\win.iniX']])

df.drop_duplicates()

Problem description

If you run the code above, only rows 0, 2, 3, 5, and 7 are retained. However, the strings are all unique, so I expected drop_duplicates() to retain every row. It looks like drop_duplicates() compares strings only up to the first non-printing character and ignores everything after it.

I discovered this bug because the strings originally came from a set, so I didn't expect drop_duplicates() to remove anything, but it did.

I've verified this on pandas.__version__ == '1.0.3' (installed via anaconda on Linux in a new test environment). I looked for existing issues but didn't find anything that seemed to match (although #11376 seems to be close, or at least in the same vein).
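A quick way to check whether a given pandas installation shows this behavior is to compare the number of distinct values Python sees with the number of rows drop_duplicates() keeps. This is only a minimal sketch: the three sample strings are made up for illustration, and the result depends on the installed pandas version.

```python
import pandas as pd

# Illustrative values that differ only after a NUL character.
values = ["Microsoft", "Microsoft\x00VP", "Microsoft\x00CEO"]

# Python compares full string contents, so all three values are distinct.
python_unique = len(set(values))

# Count the rows pandas keeps after de-duplication.
pandas_unique = len(pd.DataFrame(values).drop_duplicates())

if pandas_unique != python_unique:
    print(f"pandas kept {pandas_unique} of {python_unique} distinct rows")
else:
    print("pandas and Python agree on the number of distinct rows")
```

If the two counts differ, the installed version merges rows that Python considers different.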

Expected Output

(github markdown is highlighting some of the rows red, for some reason, but this is unrelated to what should be shown.)

>>> df
                                                    0
0                                           Microsoft
1       Microsoftquq2a<ScRiPt>alert(1)</ScRiPt>srfe1
2   ../../../../../../../../../../../../../../../....
3   ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....
4       Microsoftvp30o<script>alert(1)</script>vz075
5   ../../../../../../../../../../../../../../../....
6   ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....
7   ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....
8   ../../../../../../../../../../../../../../../....
9   ../../../../../../../../../../../../../../../....
10  ..\..\..\..\..\..\..\..\..\..\..\..\..\..\..\....

Output of `pd.show_versions()` (ignoring rows with "None")

INSTALLED VERSIONS
------------------
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-88-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.1.post20200322

TomAugspurger · 2020-03-26T21:12:28Z

Can you investigate where things are going wrong? Likely somewhere in our factorization / hashtable code, though I'm not sure. (_libs/hashtable.pyx)

AMR-KELEG · 2020-03-26T22:58:24Z

I am interested in checking the issue this weekend.
I will let you know whenever I have any new updates.

kenjioman · 2020-03-27T01:15:12Z

One thing I should note: it looks like this does not affect all non-printing characters. I believe it is specific to \x00, the null character. And @AMR-KELEG, thanks for being willing to take a look!
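One plausible reason \x00 would be special is that C-style strings end at the first NUL byte, so code that handles keys as C strings cannot see anything after it. The sketch below only illustrates that general mechanism with the standard-library ctypes module. It is an assumption about the kind of cause involved, not a confirmed diagnosis of the pandas internals.

```python
import ctypes

raw = b"Microsoft\x00VP"

# Python bytes objects carry an explicit length, so the NUL is ordinary data.
print(len(raw))  # 12

# Viewed as a NUL-terminated C string, everything after \x00 is invisible.
as_c_string = ctypes.c_char_p(raw).value
print(as_c_string)  # b'Microsoft'

# Two values that differ in Python become equal once truncated this way.
same_after_truncation = as_c_string == ctypes.c_char_p(b"Microsoft").value
print(same_after_truncation)  # True
```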

JochenFromm · 2020-03-27T14:24:05Z

The problem also happens with the pandas.unique function in core/algorithms.py. Normal strings do not cause problems...

>>> import pandas as pd
>>> pd.unique(pd.Series(['Microsoft', 'Microsoft VP']))
array(['Microsoft', 'Microsoft VP'], dtype=object)

...but for strings that contain the special character \x00, the function returns only one entry (\x01 works, though):

>>> pd.unique(pd.Series(['Microsoft', 'Microsoft\x00VP']))
array(['Microsoft'], dtype=object)

>>> pd.unique(pd.Series(['Microsoft\x00VP', 'Microsoft']))
array(['Microsoft\x00VP'], dtype=object)

This suggests that the problem is in the hashtable code (_libs/hashtable.pyx), as proposed by @TomAugspurger.
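Until the hashtable behavior is resolved, one possible workaround is to de-duplicate with Python-level comparisons instead of the pandas hashtable. The helper below is a hypothetical sketch (drop_duplicates_py is not a pandas function). It assumes every cell is hashable, and it may treat missing values differently from drop_duplicates().

```python
import pandas as pd


def drop_duplicates_py(df):
    """Keep the first occurrence of each row, comparing rows as Python tuples."""
    seen = set()
    keep = []
    for row in df.itertuples(index=False, name=None):
        # Python compares full string contents, including any \x00 characters.
        keep.append(row not in seen)
        seen.add(row)
    return df[keep]


df = pd.DataFrame(["Microsoft", "Microsoft\x00VP", "Microsoft"])
result = drop_duplicates_py(df)
print(list(result.index))  # [0, 1]
```

This runs a Python loop over every row, so it will be much slower than the built-in method on large frames.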

aadishms · 2020-03-28T23:26:31Z

take

sarwatfatimam · 2020-04-11T16:40:08Z

I am facing the same issue. With a dataframe of 682 rows and 29 columns, where every column is either string or int64, drop_duplicates drops two unique rows. These two rows do not differ in every column: some columns differ while others are the same. Taken as a whole, though, each row is unique and should not be dropped. Another odd thing is that if I put these two rows in a dataframe on their own, drop_duplicates does not drop them but retains both.

This is really bad because it could drop many more rows from a dataset with millions of rows. From what I have debugged so far, the problem lies in pandas.core.algorithms.factorize, which is used by the duplicated function in frame.py.

import datetime

import pandas as pd

df = pd.DataFrame({
    'Id': ['5915', '5915'],
    'A': ['LAB GLUCOSE, BEDSIDE', 'LAB GLUCOSE, BEDSIDE'],
    'B': ['82962', '82962'],
    'C': ['Glucose Blood Test', 'Glucose Blood Test'],
    'D': ['300', '300'],
    'E': ['Laboratory-general', 'Laboratory-general'],
    'F': [None, None],
    'G': ['Lab POCT', 'Lab POCT'],
    'H': ['Lab and Path', 'Lab and Path'],
    'DS': [5, 12],
    'I': [datetime.date(2013, 4, 4), datetime.date(2013, 4, 11)],
    'J': [None, None],
    'K': ['4.21', '2.10'],
    'L': ['3.60', '1.80'],
    'M': ['$82.00', '$41.00'],
    'N': ['2.00', '1.00'],
    # Note: the key 'I' appears twice in this sample; Python keeps this later value.
    'I': ['2.105', '2.1'],
    'O': ['99.0', '99.0'],
    'P': ['7.81000000000005', '3.900000000004'],
    'Q': [3030, 3030],
    'R': ['Lab and Path', 'Lab and Path'],
    'S': ['Lab', 'Lab'],
    'T': ['General Lab', 'General Lab'],
    'U': ['Lab Glucose, Bedside', 'Lab Glucose, Bedside'],
    'V': [0, 0],
    'W': [None, None],
    'X': ['Other', 'Other'],
    'Y': [None, None],
    'Z': ['229', '229'],
})
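To see whether factorize is where distinct values get merged, one option is to group the original values by the integer code that pd.factorize assigns them. The following is a rough sketch using made-up sample strings rather than the data above. Any code that ends up holding more than one distinct Python string marks values that were treated as the same key.

```python
import pandas as pd

# Illustrative values; the second one differs only after a NUL character.
values = pd.Series(["Microsoft", "Microsoft\x00VP", "Microsoft"], dtype=object)

# factorize maps each value to an integer code; values that share a code
# were treated as the same key.
codes, uniques = pd.factorize(values)

# Group the original values by the code they were assigned.
groups = {}
for code, value in zip(codes, values):
    groups.setdefault(int(code), set()).add(value)

# A code whose group holds more than one distinct string is a collision.
collisions = {code: vals for code, vals in groups.items() if len(vals) > 1}
print(collisions)
```

An empty dictionary means factorize distinguished every distinct string in the sample. A non-empty one lists the values that were merged.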

kenjioman · 2020-05-08T06:54:15Z

@aadishms, just curious, any updates?

chaostheory · 2021-06-08T06:07:49Z

Thank you to everyone working on Pandas. It's a great library and tool.

Now that that's out of the way, I just wanted to confirm that this method still drops rows that are NOT duplicates.

It was one of the hardest bugs to pinpoint. Even after looking at my data, I still don't understand why this method would think that the rows it dropped were dupes. If any project member is interested in looking at my data, ping me.

mroeschke · 2021-08-07T22:38:43Z

Seems like a duplicate issue of #34551, and that thread is closer to pinpointing a potential fix so closing in favor of that issue.

github-actions bot assigned aadishms Mar 28, 2020

jorisvandenbossche added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Bug labels Jun 4, 2020

jorisvandenbossche mentioned this issue Jun 4, 2020

BUG: DataFrame.drop_duplicates confuses NULL bytes #34551

Open

mroeschke added the duplicated (duplicated, drop_duplicates) label and removed the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) label Apr 18, 2021

mroeschke added the Strings (String extension data type and string data) label Jul 30, 2021

mroeschke closed this as completed Aug 7, 2021
