unexpected behaviour of DataFrame.duplicated #11567

evfro · 2015-11-10T20:56:39Z

At least for large datasets, DataFrame.duplicated returns incorrect results.

Consider the MovieLens 10M data (this code automatically downloads the data from the GroupLens website):

# Python 2 (matches the environment listed under pd.show_versions() below)
import pandas as pd
from requests import get
from StringIO import StringIO
from zipfile import ZipFile

zip_file_url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
zip_response = get(zip_file_url)
zip_contents = StringIO(zip_response.content)

with ZipFile(zip_contents) as zfile:
    zdata = zfile.read('ml-10M100K/ratings.dat')
    delimiter = ';'
    zdata = zdata.replace('::', delimiter) # makes data compatible with pandas c-engine
    mldata = pd.read_csv(StringIO(zdata), sep=delimiter, header=None, engine='c',
                         names=['userid', 'movieid', 'rating', 'timestamp'],
                         usecols=['userid', 'movieid', 'rating'])

The data (mldata variable) contains no duplicates, which can be verified:

(mldata.groupby(['userid', 'movieid']).size()>1).any()
False

mldata.set_index(['userid', 'movieid']).index.is_unique
True

However, DataFrame.duplicated gives:

dups = mldata.duplicated(['userid', 'movieid'], keep=False)
print dups.any()
print dups.sum()

True
12127

Expected:

False
0
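
For anyone who wants to run the same cross-check without downloading the dataset, here is a minimal sketch on a small synthetic frame. The frame, its sizes and the helper name `duplicated_via_groupby` are assumptions made for illustration and are not part of the original report. A frame this small is not expected to trigger the bug; the sketch only shows how the three checks should agree on a correct pandas build.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings table (illustrative sizes):
# every (userid, movieid) pair occurs exactly once.
n_users, n_movies = 300, 200
users, movies = np.meshgrid(np.arange(n_users), np.arange(n_movies), indexing='ij')
frame = pd.DataFrame({'userid': users.ravel(), 'movieid': movies.ravel()})
frame['rating'] = 3.0

keys = ['userid', 'movieid']


def duplicated_via_groupby(df, subset):
    """Hypothetical helper: flag rows whose key combination occurs more
    than once, which is what duplicated(subset, keep=False) should return."""
    group_sizes = df.groupby(subset)[subset[0]].transform('size')
    return group_sizes > 1


# Three independent ways of asking "are there duplicate keys?"
via_duplicated = frame.duplicated(keys, keep=False)
via_groupby = duplicated_via_groupby(frame, keys)
index_is_unique = frame.set_index(keys).index.is_unique

# Positive control: append a copy of the first row, so exactly two rows share a key.
with_dup = pd.concat([frame, frame.iloc[[0]]], ignore_index=True)
dup_via_duplicated = with_dup.duplicated(keys, keep=False)
dup_via_groupby = duplicated_via_groupby(with_dup, keys)

print(via_duplicated.sum(), via_groupby.sum(), index_is_unique)
print(dup_via_duplicated.sum(), dup_via_groupby.sum())
```

If the groupby-based mask and `duplicated` disagree on real data, as they do for mldata above, the groupby-based mask can stand in as a slower workaround.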

pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.23.3
numpy: 1.10.0
scipy: 0.16.0
statsmodels: None
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None

jreback · 2015-11-11T01:39:17Z

thanks for the report, a dupe of: #11376

this was already fixed here: #11403

and will be in forthcoming 0.17.1 (it's in master now)

jreback closed this as completed Nov 11, 2015

jreback added the Bug and Duplicate Report labels Nov 11, 2015