BUG: .loc with duplicated label may have incorrect index dtype #11497

Merged
merged 1 commit into from Nov 29, 2015

Conversation

Projects
None yet
2 participants
Member

sinhrks commented Nov 1, 2015

.loc result with duplicated keys may have incorred Index dtype.

import pandas as pd

ser = pd.Series([0.1, 0.2], index=pd.Index([1, 2], name='idx'))

# OK
ser.loc[[2, 2, 1]].index
# Int64Index([2, 2, 1], dtype='int64', name=u'idx')

# NG, Int64Index(dtype=object) 
ser.loc[[3, 2, 3]].index 
# Int64Index([3, 2, 3], dtype='object', name=u'idx')
ser.loc[[3, 2, 3, 'x']].index 
# Int64Index([3, 2, 3, u'x'], dtype='object', name=u'idx')

idx = pd.date_range('2011-01-01', '2011-01-02', freq='D', name='idx')
ser = pd.Series([0.1, 0.2], index=idx, name='s')

# OK
ser.loc[[pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-01')]].index
# DatetimeIndex(['2011-01-02', '2011-01-02', '2011-01-01'], dtype='datetime64[ns]', name=u'idx', freq=None)

# NG, ValueError
ser.loc[[pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03')]].index
# ValueError: Inferred frequency None from passed dates does not conform to passed frequency D

After the PR:

Above OK results are unchanged.

import pandas as pd
ser = pd.Series([0.1, 0.2], index=pd.Index([1, 2], name='idx'))

ser.loc[[3, 2, 3]].index 
# Int64Index([3, 2, 3], dtype='int64', name=u'idx')
ser.loc[[3, 2, 3, 'x']].index 
# Index([3, 2, 3, u'x'], dtype='object', name=u'idx')

idx = pd.date_range('2011-01-01', '2011-01-02', freq='D', name='idx')
ser = pd.Series([0.1, 0.2], index=idx, name='s')

ser.loc[[pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03')]].index
# DatetimeIndex(['2011-01-03', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', name=u'idx', freq=None)

sinhrks added this to the 0.17.1 milestone Nov 1, 2015

@jreback jreback commented on an outdated diff Nov 1, 2015

pandas/core/index.py
@@ -2160,7 +2160,13 @@ def _reindex_non_unique(self, target):
new_indexer = np.arange(len(self.take(indexer)))
new_indexer[~check] = -1
- return self._shallow_copy(new_labels), indexer, new_indexer
+ attrs = self._get_attributes_dict()
+ # non-unique slicing must reset freq
+ attrs.pop('freq', None)
+ try:
+ return self._constructor(new_labels, **attrs), indexer, new_indexer
@jreback

jreback Nov 1, 2015

Contributor

I think this logic should go in Index._shallow_copy itself.

sinhrks referenced this pull request Nov 7, 2015

Open

BUG/API: Clarify the behaviour of fillna downcasting #11537

0 of 1 task complete

@jreback jreback and 1 other commented on an outdated diff Nov 7, 2015

pandas/core/index.py
@@ -382,6 +390,8 @@ def _shallow_copy(self, values=None, infer=False, **kwargs):
values : the values to create the new Index, optional
infer : boolean, default False
if True, infer the new type of the passed values
+ reset_attributes : boolean, default False
+ if True, reset attributes specified in _reset_attributes
kwargs : updates the default attributes for this Index
@jreback

jreback Nov 7, 2015

Contributor

this is pretty convoluted

why is this needed?

@sinhrks

sinhrks Nov 7, 2015

Member

Tried to handle freq reset for DTI / keep for PeriodIndex operation in the function. Will consider better way and remove from this PR.

Member

sinhrks commented Nov 7, 2015

Updated and now green.

@jreback jreback and 1 other commented on an outdated diff Nov 7, 2015

pandas/core/index.py
@@ -391,6 +395,11 @@ def _shallow_copy(self, values=None, infer=False, **kwargs):
if infer:
attributes['copy'] = False
+ if self._infer_as_myclass:
@jreback

jreback Nov 7, 2015

Contributor

why is infer not good enough here? (or maybe just reprupose and make infer a class attribute? (lke what you are calling _infer_as_myclass). e.g. confusing that we need/have 2

@sinhrks

sinhrks Nov 7, 2015

Member

To achieve below result. .loc against DatetimeIndex internally converts the index to Int64Index. Thus, there should be a logic to revert to DatetimeIndex.

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')

idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

In current master, the below case results in:

idx._shallow_copy(idx.asi8, infer=True)
Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')
@jreback

jreback Nov 7, 2015

Contributor

I understand. The point is that maybe if we have a class attribute, we don't need the instance one (e.g. passing infer=True at all). OR (if you think we need both) I would name them the same, and only use the instance passed one if its not None

@jreback

jreback Nov 7, 2015

Contributor

where is the inference of the above example?

@sinhrks

sinhrks Nov 8, 2015

Member

Below is the behavior of current master.

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')

pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3])
# Float64Index([1, 2, 3], dtype='int64')
idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')

idx._shallow_copy(idx.asi8)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

.loc must infer because user may pass different dtype.

pd.Index([1., 2., 3.])._shallow_copy(['a', 'b', 'c'])
# Float64Index([u'a', u'b', u'c'], dtype='|S1')
# -> NG, incorrect dtype

Thus, I understand _shallow_copy should cover followings.

  • Return the same Index if values are guaranteed to be the same dtype as the original (without inference)
  • Return the correct type of Index if values can have different dtypes from the original (with inference)
    • Inference prioritizing original class (used in datetime-like Index accepting internal repr like asi8)
    • Normal inference (simply pass values to Index)

Class attribute _infer_as_myclass intends to split the logic as below (or needs to override _shallow_copy in tseries/base)

pd.Index([1., 2., 3.])._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')

idx._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
# -> want DatetimeIndex in this case. 
@jreback

jreback Nov 8, 2015

Contributor

@sinhrks not questing the need for this! just whether we need infer AND _infer_as_myclass or could we just use a single class variable for this (rather than the passed kw infer). Just confusing to have 2 sort-of-similar ways of doing this.

@sinhrks

sinhrks Nov 14, 2015

Member

Ah I see. One idea is splitting the method. Based on the current impl:

  • Use _simple_new when inference is not needed.
  • Use _shallow_copy when inference is needed.

I've once tried it, but it isn't work by the simple change because of some impl differences, for example MultiIndex doesn't have _simple_new.

Let me try it in a separate PR, or change the milestone to 0.18.0.

Contributor

jreback commented Nov 13, 2015

@sinhrks can you update

@jreback jreback modified the milestone: Next Major Release, 0.17.1 Nov 14, 2015

Contributor

jreback commented Nov 14, 2015

@sinhrks yeh, prob requires some playing around to get this to work nicely. Like your separation of concerns though. We should document this prob at top of core/index.py

Member

sinhrks commented Nov 24, 2015

How about adding _infer_new as below?

  • _infer_new: The same as current _shallow_copy(infer=True). Always tries to infer correct type.
  • _shallow_copy: Remove infer kw. Always use the same type as the caller.

We can replace _shallow_copy by _get_attribute_dict + _simple_new completely, but I didn't do because adding _get_attribute_dict is redundant / easy to be missed.

If OK, I'll squash.

@sinhrks sinhrks modified the milestone: 0.18.0, Next Major Release Nov 24, 2015

Contributor

jreback commented Nov 25, 2015

@sinhrks seems reasonable call it _shallow_copy_with_infer? to make the connection with _shallow_copy

Member

sinhrks commented Nov 28, 2015

@jreback OK, renamed. Also added general explanation about simple_new, shallow_copy and shallow_copy_with_infer. Could you check?

@jreback jreback added a commit that referenced this pull request Nov 29, 2015

@jreback jreback Merge pull request #11497 from sinhrks/loc_dtype
BUG: .loc with duplicated label may have incorrect index dtype
431e224

@jreback jreback merged commit 431e224 into pandas-dev:master Nov 29, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Contributor

jreback commented Nov 29, 2015

thanks @sinhrks and nice docs!

sinhrks deleted the sinhrks:loc_dtype branch Nov 29, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment