set_index drops data in the presence of duplicates when inplace=True and verify_integrity=True #1831

@snth

When calling set_index on a column that contains duplicate values, the verify_integrity=True option correctly identifies the duplicates, but the check appears to run after the original columns have already been dropped when inplace=True is also passed. The exception is raised, yet the DataFrame has already lost the key column, so data is lost.

I believe it would be better if the original DataFrame object were modified only when the set_index operation succeeds.

Code to reproduce the problem:

In [189]: df = DataFrame({'one':[1, 1, 2], 'two':[1,2,3]})

In [190]: df
Out[190]: 
   one  two
0    1    1
1    1    2
2    2    3

In [191]: df.set_index(['one'], inplace=True, verify_integrity=True)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/mnt/hgfs/fastdata/<ipython-input-191-e1c0e8c92f6c> in <module>()
----> 1 df.set_index(['one'], inplace=True, verify_integrity=True)

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2328         if verify_integrity and not index.is_unique:
   2329             duplicates = index.get_duplicates()
-> 2330             raise Exception('Index has duplicate keys: %s' % duplicates)
   2331 
   2332         # clear up memory usage


Exception: Index has duplicate keys: [1]

In [192]: df
Out[192]: 
   two
0    1
1    2
2    3

In [202]: print sys.version
2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3]

In [203]: print pd.version.version
0.8.1
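
A possible workaround in the meantime, as a minimal sketch: call set_index without inplace=True and rebind the name only if the call succeeds. This assumes the non-inplace path works on a copy and leaves the original untouched. The helper name set_index_safely is hypothetical and not part of pandas.

```python
import pandas as pd


def set_index_safely(df, keys):
    """Hypothetical helper: return a re-indexed frame, or raise with df unchanged."""
    # Without inplace=True, set_index returns a new DataFrame, so a failed
    # integrity check should not have touched the caller's object.
    return df.set_index(keys, verify_integrity=True)


df = pd.DataFrame({'one': [1, 1, 2], 'two': [1, 2, 3]})

try:
    # Rebind df only when the operation succeeds.
    df = set_index_safely(df, ['one'])
except Exception as exc:  # 0.8.1 raises a bare Exception; later versions raise ValueError
    print('set_index failed: %s' % exc)

# Column 'one' is still present because the original frame was never modified.
print(list(df.columns))
```

This does not fix the inplace=True code path itself; it only sidesteps it by never mutating the original frame.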

Labels: Bug, Indexing (related to indexing on series/frames, not to indexes themselves)
