unexpected behaviour of DataFrame.duplicated #11567

Closed
evfro opened this issue Nov 10, 2015 · 1 comment
Labels: Bug, Duplicate Report

Comments

evfro commented Nov 10, 2015

At least for large datasets, DataFrame.duplicated returns incorrect results.

Consider the MovieLens 10M data (the code below downloads it automatically from the GroupLens website):

# Python 2 code (see the versions listed below)
import pandas as pd
from requests import get
from StringIO import StringIO
from zipfile import ZipFile  # this import was missing from the original snippet

zip_file_url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
zip_response = get(zip_file_url)
zip_contents = StringIO(zip_response.content)  # wrap the downloaded bytes in a file-like object

with ZipFile(zip_contents) as zfile:
    zdata = zfile.read('ml-10M100K/ratings.dat')
    delimiter = ';'
    zdata = zdata.replace('::', delimiter) # makes data compatible with pandas c-engine
    # the timestamp column is named but not loaded (see usecols)
    mldata = pd.read_csv(StringIO(zdata), sep=delimiter, header=None, engine='c',
                         names=['userid', 'movieid', 'rating', 'timestamp'],
                         usecols=['userid', 'movieid', 'rating'])

The data (the mldata variable) contains no duplicate (userid, movieid) pairs, which can be verified in two ways:

(mldata.groupby(['userid', 'movieid']).size()>1).any()
False

mldata.set_index(['userid', 'movieid']).index.is_unique
True

However, DataFrame.duplicated gives:

dups = mldata.duplicated(['userid', 'movieid'], keep=False)
print dups.any()
print dups.sum()

True
12127

Expected:

False
0
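
The three checks used in this report (groupby size, index uniqueness, and DataFrame.duplicated) answer the same question, so they should agree on any frame. As a minimal sketch, the helper below runs all three side by side. The function name duplicate_checks and the two toy frames are hypothetical stand-ins for mldata, and such small frames are not expected to trigger the mismatch reported here for large data.

```python
import pandas as pd


def duplicate_checks(df, keys):
    """Answer 'does df repeat any combination of keys?' in three ways."""
    # 1. Any group of identical keys with more than one row?
    by_groupby = bool((df.groupby(keys).size() > 1).any())
    # 2. Is the index built from the key columns non-unique?
    by_index = not df.set_index(keys).index.is_unique
    # 3. Does DataFrame.duplicated flag any row?
    by_duplicated = bool(df.duplicated(keys, keep=False).any())
    return by_groupby, by_index, by_duplicated


# Hypothetical toy data: four distinct (userid, movieid) pairs.
unique_df = pd.DataFrame({'userid': [1, 1, 2, 2],
                          'movieid': [10, 20, 10, 20],
                          'rating': [5.0, 3.0, 4.0, 1.0]})
# Same data with the first row appended again, so one pair repeats.
dup_df = pd.concat([unique_df, unique_df.iloc[[0]]], ignore_index=True)

unique_result = duplicate_checks(unique_df, ['userid', 'movieid'])
dup_result = duplicate_checks(dup_df, ['userid', 'movieid'])
print(unique_result)
print(dup_result)
```

Calling duplicate_checks(mldata, ['userid', 'movieid']) on the real data would show at a glance whether the third answer disagrees with the first two.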

pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.23.3
numpy: 1.10.0
scipy: 0.16.0
statsmodels: None
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None

jreback (Contributor) commented Nov 11, 2015

thanks for the report, a dupe of: #11376

this was already fixed here: #11403

and will be in forthcoming 0.17.1 (it's in master now)
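
Since the fix is said to ship in 0.17.1, one way to guard code that relies on DataFrame.duplicated is to compare the installed version against that release. The sketch below is only an illustration: the helper names release_tuple and includes_duplicated_fix are hypothetical, and development builds are judged only by their leading release numbers, so a pre-0.17.1 build from master that already contains the fix would still be reported as lacking it.

```python
import re


def release_tuple(version):
    """Turn a version string such as '0.17.0' into a tuple of ints, (0, 17, 0)."""
    numbers = []
    for part in version.split('.')[:3]:
        match = re.match(r'\d+', part)  # keep only the leading digits, e.g. '0rc1' -> 0
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def includes_duplicated_fix(version):
    # Assumption taken from the comment above: the fix is released in 0.17.1.
    return release_tuple(version) >= (0, 17, 1)


print(includes_duplicated_fix('0.17.0'))
print(includes_duplicated_fix('0.17.1'))
```

With pandas imported as pd, the check would be includes_duplicated_fix(pd.__version__).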

jreback closed this as completed Nov 11, 2015
jreback added the Bug and Duplicate Report labels Nov 11, 2015