BUG: Series.map({}) should retain the original dtype #18509

Closed
jreback opened this Issue Nov 27, 2017 · 3 comments

jreback commented Nov 27, 2017

#18491 (comment)

In [1]: pd.Series(pd.date_range("2012-01-01", periods=3)).map({})
Out[1]: 
0   NaN
1   NaN
2   NaN
dtype: float64

In [2]: pd.date_range("2012-01-01", periods=3).map({})
Out[2]: DatetimeIndex(['NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

[1] should infer to a datetime64 as well

@jschendel comments

On master:

In [2]: pd.__version__
Out[2]: '0.22.0.dev0+241.gf745e52'

In [3]: pd.date_range('20170101', periods=4).map({})
Out[3]: DatetimeIndex(['NaT', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

IntervalIndex, CategoricalIndex, and Index with object dtype get coerced to Float64Index:

In [4]: pd.interval_range(0, 5).map({})
Out[4]: Float64Index([nan, nan, nan, nan, nan], dtype='float64')

In [5]: pd.CategoricalIndex(list('abca')).map({})
Out[5]: Float64Index([nan, nan, nan, nan], dtype='float64')

In [6]: pd.Index(list('abca')).map({})
Out[6]: Float64Index([nan, nan, nan, nan], dtype='float64')

PeriodIndex and TimedeltaIndex get coerced to DatetimeIndex:

In [7]: pd.period_range('2017Q1', periods=4, freq='Q').map({})
Out[7]: DatetimeIndex(['NaT', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

In [8]: pd.timedelta_range(1, periods=4).map({})
Out[8]: DatetimeIndex(['NaT', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

@jreback jreback added this to the 0.22.0 milestone Nov 27, 2017

@jreback jreback self-assigned this Nov 27, 2017

jreback commented Nov 27, 2017

@jschendel yeah, [7] and [8] 'work' but should return the correct type (a bug I introduced). The others (cat / interval) are not well tested, so it is not surprising they are wrong. [6] is also not well tested.

jorisvandenbossche commented Nov 27, 2017

Repeating my comment from #18491 here:

I am not sure I find this special casing a good idea. I would rather keep .map 'dummy' and simply create a new index/series from the newly created values, without trying to infer anything more than pd.Index(..) / pd.Series(..) already does.

Do you have a specific use case for this? (for wanting to be smart?)
I think the user can always change the dtype after doing the map if they want something specific or want to preserve the index type.
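
A minimal sketch of the "change the dtype after the map" route for the Series case from the top of the issue. It assumes a reasonably recent pandas; `pd.to_datetime` is only one of several ways to do the conversion, and the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(pd.date_range("2012-01-01", periods=3))

# An empty mapping matches nothing, so every element comes back missing.
mapped = s.map({})

# Convert back to datetime64 explicitly instead of relying on map() to guess.
restored = pd.to_datetime(mapped)
print(restored.dtype)
```

The conversion step is the same whichever dtype `map` infers for the all-missing result, so the caller's code does not depend on that inference.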

jorisvandenbossche commented Nov 27, 2017

To further clarify: I think we are just trying to be too smart here, introducing a lot of corner cases. I would say: just let the user handle this (if he/she wants to retain the index type, he/she can simply wrap the map call in the appropriate index type).
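
As a rough illustration of "wrap the map call in the appropriate index type", here is a sketch for the DatetimeIndex case. It assumes the `pd.DatetimeIndex` constructor accepts the all-missing result that `map` returns, which should hold on recent pandas releases but is worth checking on yours.

```python
import pandas as pd

idx = pd.date_range("2012-01-01", periods=3)

# Whatever index type map() infers for the all-missing result ...
mapped = idx.map({})

# ... wrapping it in the desired constructor pins the type explicitly.
wrapped = pd.DatetimeIndex(mapped)
print(type(wrapped).__name__, wrapped.isna().all())
```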

So what is the rationale for trying to keep the index type in the case of all-NaNs? Is it because you think a user typically uses map to create a new index of the same type but with different values? And that in such a case your DatetimeIndex becoming a Float64Index is surprising?
In my personal use cases of map, I am typically not interested in preserving the index type, as my functions typically return something completely different (in terms of dtype). In that case, this "trying to preserve the dtype" can also give unexpected results the other way around. Assume I have a function that returns floats, but for some reason for the index I have, only returns np.nans:

In [44]: pd.date_range("2010-01-01", periods=3, freq='A').map(lambda x: np.nan if x.year > 2010 else 1)
Out[44]: Float64Index([1.0, nan, nan], dtype='float64')

In [45]: pd.date_range("2011-01-01", periods=3, freq='A').map(lambda x: np.nan if x.year > 2010 else 1)
Out[45]: DatetimeIndex(['NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

In that case I, unexpectedly, get a DatetimeIndex instead of a FloatIndex.

To put it differently, we wouldn't distinguish a NaN and NaT return:

In [39]: pd.date_range("2012-01-01", periods=3).map(lambda x: pd.NaT)
Out[39]: DatetimeIndex(['NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

In [40]: pd.date_range("2012-01-01", periods=3).map(lambda x: np.nan)
Out[40]: DatetimeIndex(['NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
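
For comparison, a small sketch of what the plain constructor does with the same two kinds of all-missing values, with no `map` involved. This is what a "dummy" `map` would fall back on; the variable names are illustrative, and only the dtype kind is checked because the exact datetime unit may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# All-NaN input: the constructor infers a floating-point index.
from_nan = pd.Index([np.nan] * 3)

# All-NaT input: the constructor infers a datetime-like index.
from_nat = pd.Index([pd.NaT] * 3)

print(from_nan.dtype.kind)  # "f" for float
print(from_nat.dtype.kind)  # "M" for datetime64
```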

Also, you added a special case that tries to preserve the uint64 data type for UInt64Index. Shouldn't we then do this for Series with different dtypes as well?

And about [1] (the first example at the top) being a bug: this has always been the behaviour, so if we change this I would see it as an API-breaking change rather than a bug fix.

jreback added a commit to jreback/pandas that referenced this issue Nov 27, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 2, 2017

jreback added a commit that referenced this issue Dec 2, 2017
