Potential bug: drop_duplicates() and duplicated() fail for multiple integer columns #11543

Closed
pekaalto opened this issue Nov 7, 2015 · 2 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

pekaalto commented Nov 7, 2015

It seems that the drop_duplicates() and duplicated() methods do not work properly on a data frame with several large-integer columns. Here is my example data frame: http://pastebin.com/KVHxUpgz

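import pandas as pd

# Copy the pastebin data to the clipboard first, then load it into x.
x = pd.read_clipboard(delimiter=',')
r = x.duplicated(keep=False)  # keep=False flags every row of a duplicate group
print(x[r])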

This gives me:
             x1       x2
8   16000010001  8470207
95  16000010009  8470039

Clearly these rows are not duplicates, but pandas seems to think they are!

Also drop_duplicates() seems to fail:

print(len(x),len(x.drop_duplicates()))

gives: 101 100

When I convert my columns to string they are not duplicates anymore:

r1 = x.apply(lambda row: '%d-%d' % tuple(row), axis=1).duplicated()
print(r1.sum())

This gives 0, as it should.
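The string-key check above can also be done without formatting strings, by comparing whole rows as plain Python tuples. The following is only a minimal sketch of such a cross-check, not pandas' own algorithm: `duplicated_mask` is a hypothetical helper name, and the two sample rows are the ones from the output shown earlier, not the full pastebin data (so it does not reproduce the bug itself).

```python
from collections import Counter


def duplicated_mask(rows):
    """Flag every row whose exact tuple of values occurs more than once.

    Hypothetical helper: mirrors the intent of duplicated(keep=False),
    but compares plain Python tuples, so integer size does not matter.
    """
    keys = [tuple(row) for row in rows]
    counts = Counter(keys)
    return [counts[key] > 1 for key in keys]


# The two rows that pandas 0.17.0 reported as duplicates in this issue.
rows = [
    (16000010001, 8470207),
    (16000010009, 8470039),
]
mask = duplicated_mask(rows)
print(mask)  # [False, False]

# With the data frame x from the report, the same check would be roughly:
#   mask = duplicated_mask(x.itertuples(index=False))
#   print(x[mask])
```

If this mask disagrees with `x.duplicated(keep=False)`, that points at the library rather than the data. It is a pure-Python loop, so it is only meant for verifying small frames like this one.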

Here are the versions:

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fi_FI

pandas: 0.17.0
nose: None
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None

@pekaalto pekaalto changed the title drop_duplicates() and duplicated() fail for multiple integer columns Potential bug: drop_duplicates() and duplicated() fail for multiple integer columns Nov 7, 2015
Contributor

jreback commented Nov 7, 2015

thanks for the report, a dupe of: #11376

this was already fixed here: #11403

and will be in forthcoming 0.17.1 (it's in master now)

@jreback jreback closed this as completed Nov 7, 2015
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 7, 2015
Author

pekaalto commented Nov 7, 2015

Thanks! My searching skills suck :(
Edit: ...for some reason I was searching only open issues.
