Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray Ops discussion #19577

Closed
TomAugspurger opened this issue Feb 7, 2018 · 6 comments

Comments

Projects
None yet
5 participants
@TomAugspurger
Copy link
Contributor

commented Feb 7, 2018

Split from #19520

How to handle things like __eq__, __add__, etc.

Neither the Python 2 or Python 3 object defaults are appropriate, so I think we should make it abstract or provide a default implementation for some / all of these.

We should be able to do a default implementation for the simple case of binop(extension_array, extension_array) by

  • masking nulls in either self | other
  • cast self / other to ndarrays
  • compare and apply null mask

@jorisvandenbossche raises the point that for some extension arrays, casting to an object ndarray can be expensive. It'd be unfortunate to do the casting if we're just going to raise a TypeError when the binop is done. In this case the subclass will need to override all those ops (or we could add class attribute like _comparable and check that before doing __eq__ or __lt__, etc).

How much coercion do we want to attempt? I would vote to stay simple and only attempt a comparison if

other is an instance of the same ExtensionArray type
other is a scalar of self.dtype.type.

Otherwise we return NotImplemented. Subclasses can of course choose more or less aggressive coercion rules.

When boxed in a Series or Index, I think the goal should be

  1. dispatch to the underlying .values comparsion
  2. Reconstruct the appropriate output (box in series / index with name).

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Feb 7, 2018

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Feb 7, 2018

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Feb 7, 2018

Migrated from #19520:

A thorny issue: What to do with comparison methods

I've been planning on moving the Index/Series arithmetic/comparison ops into Array subclasses, so that Series.__op__(other) could just wrap self._data.blocks[0].__op__(other) and we get consistency between Series/DataFrame ops for free.

Most of the comparison ops are ready to make that jump; I can prioritize it if getting that done quickly will help you out.

One caveat I'm concerned about is that the existing implementations in the Index subclasses have gotten tangled up with a bunch of unrelated Index machinery. Ideally I'd like these operations to go into self-contained mixin classes that rely on constructor methods, but are independent of slicing/concat/reindex/dropna/... mentioned above.

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Feb 15, 2018

@TomAugspurger two quick namespace-gameplan questions.

Assuming the arith/comparison methods currently in DatetimeIndexOpsMixin/DTI/TDI/PI get moved into analogous array classes, do you envision these getting a) mixed into the appropriate Index/Block subclasses or b) accessed via composition? If the latter, what name (DatetimeIndex._??_values) should these use to get at the pdarray implementation?

Because the datetimelike methods wrap some of the base Index methods, some of those will need to move up too. Where do you envision something like BaseArray living?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 15, 2018

It will be composition (that is what Block is already doing), and for Index it will be the "_values" attribute IIRC

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

@Dr-Irv

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

@TomAugspurger So now I've hit upon this issue of how to deal with the binary operators, and the question is what should these operators return (in the case of the arithmetic operators). Consider the decimal example in tests/extension/decimal and the following code:

In [1]: import decimal

In [2]: import pandas as pd

In [3]: from pandas.tests.extension.decimal.array import DecimalArray
   ...:
   ...: a = DecimalArray([decimal.Decimal(str(i)) for i in range(5)])
   ...: b = a.values
   ...: sa = pd.Series(a)
   ...: sb = pd.Series(b)
   ...: tb = sb + sb
   ...:

In [4]: tb
Out[4]:
0    0
1    2
2    4
3    6
4    8
dtype: object

In [5]: ta = sa + sa
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in na_op(x, y)
   1008         try:
-> 1009             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1010         except TypeError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    204     if use_numexpr:
--> 205         return _evaluate(op, op_str, a, b, **eval_kwargs)
    206     return _evaluate_standard(op, op_str, a, b)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    119     if result is None:
--> 120         result = _evaluate_standard(op, op_str, a, b)
    121

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     64     with np.errstate(all='ignore'):
---> 65         return op(a, b)
     66

TypeError: unsupported operand type(s) for +: 'DecimalArray' and 'DecimalArray'

During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)
<ipython-input-5-8612afbce1b2> in <module>()
----> 1 ta = sa + sa

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in wrapper(left, right)
   1064             rvalues = rvalues.values
   1065
-> 1066         result = safe_na_op(lvalues, rvalues)
   1067         return construct_result(left, result,
   1068                                 index=left.index, name=res_name, dtype=None)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in safe_na_op(lvalues, rvalues)
   1028         try:
   1029             with np.errstate(all='ignore'):
-> 1030                 return na_op(lvalues, rvalues)
   1031         except Exception:
   1032             if is_object_dtype(lvalues):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in na_op(x, y)
   1015                 result[mask] = op(x[mask], com._values_from_object(y[mask]))
   1016             else:
-> 1017                 assert isinstance(x, np.ndarray)
   1018                 result = np.empty(len(x), dtype=x.dtype)
   1019                 mask = notna(x)

AssertionError:

So if the Decimal values are in a Series with dtype object, I can add them just fine, and I get a Series with dtype object. In the case of adding the Series that is backed by a DecimalArray, then it fails. It's not clear to me if we want to result to be the type of the subclass of ExtensionArray or if the person implementing the subclass can specify which class is used for each operator. (One could imagine that the result of a binary operator could be a different ExtensionArray subclass, or even that the result of different binary operators could be different subclasses. For example, arithmetic operators produce objects, and relational operators produce booleans).

So there are 3 options as I see it:

  1. Make it the responsibility of the person subclassing ExtensionArray to implement the operators to determine the return type
  2. Make the result of all ExtensionArraySubclass.binop(ExtensionArraySubclass) always be ExtensionArraySubclass
  3. Set up something in the ExtensionArray implementation that allows the class containing results of operators to be defined by the person implementing the subclass.

I think I'd favor (3), but I could live with (2). I'd prefer to not do (1), as that is a lot of effort that it seems people would have to repeat.

Thoughts?

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.24.0 Apr 23, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Apr 23, 2018

I think initially we should go for option 1, and make sure that pandas actually dispatches to it. But I don't fully understand your option 3. Can you clarify a bit more?

I don't think option 2 is actually an option (IMO it is also independent of deciding where the actual operation is implemented), because, as you mention, the result of an operation does not necessarily need to be of the same type (additional example: substraction of datetimes gives timedelta. That is definitely a case we need to handle)

@Dr-Irv

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

@jorisvandenbossche Yes, I see now that option 2 isn't viable.

My idea on option 3 is something like this. The subclass implements a _binop_result_class class attribute that contains a dictionary. The keys to that dictionary are the different operators, and the values for each key are the array class that the operator returns. Any operators not in the dictionary use default pandas behaviors (e.g., return np.array([], dtype='object') for arithmetic operators, and booleans for logical ones).

The default behavior is to just do an element-by-element operation. If the class of the underlying dtype has implemented the operator, then it gets called automatically, and all is well.

@Dr-Irv Dr-Irv referenced this issue Apr 30, 2018

Closed

ENH: Support operators for ExtensionArray #20889

4 of 4 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.