CLN: Make ufunc works for Index #10638

Merged
merged 1 commit into from Sep 6, 2015

Conversation

Projects
None yet
3 participants
Member

sinhrks commented Jul 20, 2015

closes #9966, #9974 (PR)

I understand these are never used because Index is no longer the subclass of np.array. Correct?

np.sin(pd.Categorical([1, 2, 3]))
array([ 0.84147098,  0.90929743,  0.14112001])
  • MultiIndex: Raise TypeError or AttributeError, as ufuncs are performed to array of tuples.
Contributor

jreback commented Jul 20, 2015

yep I think you could take them out. as np.array(index) will coerce to a ndarray (and not an index sub-class). This is done by _shallow_copy in any event.

jreback added the Clean label Jul 20, 2015

Member

shoyer commented Jul 20, 2015

No, we need this. Otherwise np.sin(idx) returns an ndarray.

It's not documented, but __array_finalize__ works even on non-subclasses.

Contributor

jreback commented Jul 20, 2015

ok let's add some tests and clean up the code
I suspect this should be in core/base then

eg to support the example above

Contributor

jreback commented Jul 20, 2015

IIRC there is another issue where for example (#9966 with a PR #9974) :

In [22]: np.sin(Index(range(5)))
Out[22]: Int64Index([0.0, 0.841470984808, 0.909297426826, 0.14112000806, -0.756802495308], dtype='float64')

which is wrong, this needs to be created w/o slightly differently, e.g.

Index(func(...), **self._get_attributes_dict())

so that the attributes get propogates AND the correct type gets inferred (e.g. this would return a FloatI64Index in this case)

maybe you want to supersede #9974

Member

sinhrks commented Jul 21, 2015

Thanks. I'll fix Index.__array_wraps__ to make ufunc work.

https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L253

Am I correct that defining __array_finalize__ is not needed if __array_wraps__ is defined? Looks __array_finalize__ is not defined in normal Index.

http://docs.scipy.org/doc/numpy/reference/arrays.classes.html

Member

shoyer commented Jul 21, 2015

This is correct -- array_wraps is what we need.

On Tue, Jul 21, 2015 at 5:44 AM, Sinhrks notifications@github.com wrote:

Thanks. I'll fix Index.__array_wraps__ to make ufunc work.
https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L253
Am I correct that defining __array_finalize__ is not needed if __array_wraps__ is defined? Looks __array_finalize__ is not defined in normal Index.

http://docs.scipy.org/doc/numpy/reference/arrays.classes.html

Reply to this email directly or view it on GitHub:
pydata#10638 (comment)

sinhrks changed the title from CLN: Remote Datetime-like Index.__array_finalize__ to CLN: Remove Datetime-like Index.__array_finalize__ Jul 22, 2015

Member

sinhrks commented Jul 22, 2015

Let me confirm 2 points:

  1. Invalid ops for datetime-like indexes.

    I understand numeric ufuncs (like np.sin) for datetime-like is almost meaningless. PeriodIndex is likely to return unexpected result because of current __array__ definition. One idea is to raise error when PeriodIndex get float (invalid) result.

    import pandas as pd
    
    # OK
    np.sin(pd.DatetimeIndex(['2015-07', '2015-08']))
    # TypeError: ufunc 'sin' not supported for the input types, and the inputs could not be safely coerced to   any supported types according to the casting rule ''safe''
    
    # NG
    pidx = pd.PeriodIndex(['2015-07', '2015-08'], freq='M')
    np.sin(pidx)
    # Float64Index([-0.594884318359, 0.354966539136], dtype='float64')
    
    # ufunc is applied to i8 values
    pidx.__array__()
    # array([546, 547])
    
  2. Some ufunc returns bool dtype. In such cases, it should return bool array as it is (like Index.duplicated)

    np.isnan(pidx)
    # Index([False, False], dtype='object')
    
Contributor

jreback commented Jul 23, 2015

@sinhrks both of those points look right

jreback added this to the 0.17.0 milestone Jul 23, 2015

Member

shoyer commented Jul 23, 2015

Yes, agreed

On Thu, Jul 23, 2015 at 4:14 AM, jreback notifications@github.com wrote:

@sinhrks both of those points look right

Reply to this email directly or view it on GitHub:
pydata#10638 (comment)

@jreback jreback commented on the diff Jul 23, 2015

pandas/tseries/period.py
@@ -281,6 +281,16 @@ def __contains__(self, key):
return False
return key.ordinal in self._engine
+ def __array_wrap__(self, result, context=None):
@jreback

jreback Jul 23, 2015

Contributor

doesn't this need to go in tseries/base, e.g. ufuncs are not valid for any datetimelike?

@shoyer

shoyer Jul 23, 2015

Member

timedelta64 and datetime64 are real numpy dtypes, so ufuncs will already choke on them:

In [1]: import pandas as pd

In [2]: i = pd.date_range('2000-01-01', periods=5)

In [3]: import numpy as np

In [4]: np.square(i)
TypeError: ufunc 'square' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [5]: np.square(i.to_period())
Out[5]:
PeriodIndex(['330671-10-02', '330731-10-03', '330791-10-05', '330851-10-09',
             '330911-10-16'],
            dtype='int64', freq='D')
@shoyer

shoyer Jul 23, 2015

Member

I agree that tests for these other index types would be a good idea, though.

Also, I think we should probably prohibit all ufuncs on PeriodIndex, not just those that return float. For example, it's not valid to square periods, even though that returns integers.

@sinhrks

sinhrks Aug 17, 2015

Member

Raising TypeError in __array_wrap__ affects to arithmetic using np.array. Currently, it works:

pd.PeriodIndex(['2011', '2012'], freq='A') + np.array([1, 2])
# PeriodIndex(['2012', '2014'], dtype='int64', freq='A-DEC')

So it can't be prohibited?

@shoyer

shoyer Aug 17, 2015

Member

PeriodIndex + integer ndarray should not be expected to work -- the array
does not have the right type!

On Mon, Aug 17, 2015 at 3:35 PM, Sinhrks notifications@github.com wrote:

In pandas/tseries/period.py
pydata#10638 (comment):

@@ -281,6 +281,16 @@ def contains(self, key):
return False
return key.ordinal in self._engine

  • def array_wrap(self, result, context=None):

Raising TypeError in array_wrap affects to arithmetic using np.array.
Currently, it works:

pd.PeriodIndex(['2011', '2012'], freq='A') + np.array([1, 2])

PeriodIndex(['2012', '2014'], dtype='int64', freq='A-DEC')

So it can't be prohibited?


Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/10638/files#r37245206.

@jreback

jreback Aug 17, 2015

Contributor

actually I disagree
add integers to period works and makes sense as its a freq op

@sinhrks

sinhrks Aug 17, 2015

Member

Yes, because this is a valid freq shift, np.array should work as the same.

pd.PeriodIndex(['2011', '2012'], freq='A') + 1
# PeriodIndex(['2012', '2013'], dtype='int64', freq='A-DEC')

And #10744 has been done expecting arithmetic using np.array works.

@sinhrks

sinhrks Aug 18, 2015

Member

Reconsidered this, and defining _add_ndarray for PeriodIndex may be an alternative. I think options are:

  1. Always raise TypeError in __array_wraps__. Support ndarray ops in another methods (PeriodIndex must be lhs).
  2. Raise TypeError if __array_wraps__ gets non-integers. Some ufunc which return int outputs meaningless results (like square)

NOTE: Maybe not intentional, PeriodIndex.shift works for ndarray. Thus we can use shift for ndarray ops.

@jreback

jreback Aug 18, 2015

Contributor

@sinhrks remember that ATM PeriodIndex is an odd duck. Its a real int64 dtyped array, that happens to have boxing. So it allows any integer-like ops (as opposed to DatetimeIndex which prohibits a lot more because its M8[ns]).

So its needs handling to error on prohibited ops. So on on raising in __array_wraps__ for prohibited ops, but this might require some dispatching mechanism to the sub-class to determine what is allowed. E.g. we do it similarly like this in the _add_numeric_ops and like routines.

sinhrks changed the title from CLN: Remove Datetime-like Index.__array_finalize__ to CLN: Make ufunc works for Index Jul 23, 2015

Contributor

jreback commented Aug 5, 2015

@sinhrks you might want to steal tests from #9974

Contributor

jreback commented Aug 15, 2015

@sinhrks how's this coming along?

Contributor

jreback commented Aug 28, 2015

@sinhrks how's this coming?

Contributor

jreback commented Aug 31, 2015

@sinhrks if you can get to this in the next couple of days gr8, otherwise will defer this.

Member

sinhrks commented Aug 31, 2015

I've looked through _add_numeric_operations. You mean to define PeriodIndex.sin and so on used by np.sin and raise error?

Contributor

jreback commented Aug 31, 2015

@sinhrks I think PeriodIndex needs to override __array_wrap__. and raise if context is not None (I think). This is the return call from a ufunc type of operation, so it will 'succed' but we should raise an error. I don't know if their is a way to intercept it up front (their is if you are sub-class I think), but not sure about a duck-like

Member

sinhrks commented Aug 31, 2015

@jreback Thanks, will check this.

@sinhrks sinhrks and 2 others commented on an outdated diff Sep 1, 2015

pandas/tseries/period.py
@@ -306,6 +307,23 @@ def __contains__(self, key):
return False
return key.ordinal in self._engine
+ def __array_wrap__(self, result, context=None):
+ """
+ Gets called after a ufunc. Needs additional handling as
+ PeriodIndex stores internal data as int dtype
+ """
+ if isinstance(context, tuple) and len(context) > 0:
+ func = context[0]
+ if (func is np.add):
+ return self._add_delta(context[1][1])
+ elif (func is np.subtract):
+ return self._add_delta(-context[1][1])
+ elif isinstance(func, np.ufunc):
+ if 'M->M' not in func.types:
+ msg = "ufunc '{0}' not supported for the PeriodIndex"
+ raise ValueError(msg.format(func.__name__))
@sinhrks

sinhrks Sep 1, 2015

Member

Should be TypeError, but cannot be raised from here because numpy catches.

@jreback

jreback Sep 1, 2015

Contributor

@shoyer you know any better way to override the ufunc calling scheme? (e.g. as __numpy_ufunc__ doesn't exist yet). or is this best way?

@shoyer

shoyer Sep 1, 2015

Member

I think this sort of thing is the best we can do for now

@jreback

jreback Sep 1, 2015

Contributor

ok let's add a note that eventually we could replace this with numpy_ufunc

Member

sinhrks commented Sep 5, 2015

Updated and now green.

@jreback jreback commented on the diff Sep 5, 2015

pandas/tests/test_index.py
+ def test_numpy_ufuncs(self):
+ # test ufuncs of numpy 1.9.2. see:
+ # http://docs.scipy.org/doc/numpy/reference/ufuncs.html
+
+ # some functions are skipped because it may return different result
+ # for unicode input depending on numpy version
+
+ for name, idx in compat.iteritems(self.indices):
+ for func in [np.exp, np.exp2, np.expm1, np.log, np.log2, np.log10,
+ np.log1p, np.sqrt, np.sin, np.cos,
+ np.tan, np.arcsin, np.arccos, np.arctan,
+ np.sinh, np.cosh, np.tanh, np.arcsinh, np.arccosh,
+ np.arctanh, np.deg2rad, np.rad2deg]:
+ if isinstance(idx, pd.tseries.base.DatetimeIndexOpsMixin):
+ # raise TypeError or ValueError (PeriodIndex)
+ # PeriodIndex behavior should be changed in future version
@jreback

jreback Sep 5, 2015

Contributor

why the different exception behavior?

@sinhrks

sinhrks Sep 5, 2015

Member

TypeError raised from __array_wrap__ seems to be catchrd by numpy and return unintended result.

Contributor

jreback commented Sep 5, 2015

this doesn't seeem to close #9966 ?

In [9]: df = pd.DataFrame({'i': np.linspace(0, 5, 6).astype(np.int64), 
               'j': np.linspace(2, 3, 6)}).set_index('i')

In [10]: df
Out[10]: 
     j
i     
0  2.0
1  2.2
2  2.4
3  2.6
4  2.8
5  3.0

In [11]: df.index
Out[11]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64', name=u'i')

In [12]: df.index = df.index / 10.

In [13]: df.index
Out[13]: Int64Index([0.0, 0.1, 0.2, 0.3, 0.4, 0.5], dtype='float64', name=u'i')

Member

sinhrks commented Sep 5, 2015

@jreback Standard arithmetic is handled by _numeric_binop. Now fixed.

df.index / 2.
# Float64Index([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], dtype='float64', name=u'i')

np.divide(df.index, 2.)
# Float64Index([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], dtype='float64', name=u'i')
Member

sinhrks commented Sep 5, 2015

Failed in Timedelta ops. On current master, mult and div doesn't affect to existing freq, but it should?

import pandas as pd
idx = pd.timedelta_range('2 days', periods=3, freq='2D')

idx * 2
# TimedeltaIndex(['4 days', '8 days', '12 days'], dtype='timedelta64[ns]', freq='2D')
# -> freq should be "4D"?

idx / 2
# TimedeltaIndex(['1 days', '2 days', '3 days'], dtype='timedelta64[ns]', freq='2D')
# -> freq should be "D"?
Contributor

jreback commented Sep 5, 2015

@sinhrks hmm I think you need to set the freq=None so it get's reinferred (or maybe can et it to 'infer')

@jreback jreback added a commit that referenced this pull request Sep 6, 2015

@jreback jreback Merge pull request #10638 from sinhrks/array_finalize
CLN: Make ufunc works for Index
0cacd24

@jreback jreback merged commit 0cacd24 into pandas-dev:master Sep 6, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Contributor

jreback commented Sep 6, 2015

@sinhrks thanks!

sinhrks deleted the sinhrks:array_finalize branch Sep 6, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment