
ENH: Integer NA Extension Array #21160

Merged: 23 commits into pandas-dev:master on Jul 20, 2018

Conversation

@jreback (Contributor) commented May 22, 2018

closes #20700
closes #20747

In [1]: df = pd.DataFrame({
'A': pd.Series([1, 2, np.nan], dtype='Int64'), 
'B': pd.Series([1, np.nan, 3], dtype='UInt8'), 
'C': [1, 2, 3]})

In [2]: df
Out[2]: 
     A    B  C
0    1    1  1
1    2  NaN  2
2  NaN    3  3

In [3]: df.dtypes
Out[3]: 
A    Int64
B    UInt8
C    int64
dtype: object

In [4]: df.A + df.B
Out[4]: 
0      2
1    NaN
2    NaN
dtype: Int64

In [5]: df.A + df.C
Out[5]: 
0      2
1      4
2    NaN
dtype: Int64

In [6]: (df.A + df.C) * 3
Out[6]: 
0      6
1     12
2    NaN
dtype: Int64

In [7]: (df.A + df.C) * 3 == 1
Out[7]: 
0    False
1    False
2    False
dtype: bool

In [8]: (df.A + df.C) * 3 == 12
Out[8]: 
0    False
1     True
2    False
dtype: bool
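The session above can be reproduced with a modern pandas (a minimal sketch, assuming a version with the nullable integer dtypes this PR introduced, i.e. 0.24 or later):

```python
import numpy as np
import pandas as pd

# Reconstruct the frame from the session above. The capitalized dtype
# names ("Int64", "UInt8") select the nullable extension dtypes, while
# plain "int64" is the ordinary numpy dtype.
df = pd.DataFrame({
    'A': pd.Series([1, 2, np.nan], dtype='Int64'),
    'B': pd.Series([1, np.nan, 3], dtype='UInt8'),
    'C': [1, 2, 3],
})

# Arithmetic propagates missing values and keeps the nullable dtype,
# instead of silently upcasting to float64.
result = df.A + df.C
```

Note that mixing a nullable column with a plain numpy int column (`df.A + df.C`) still yields the nullable `Int64` dtype.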

@jreback jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels May 22, 2018
@jreback jreback added this to the 0.24.0 milestone May 22, 2018
@jreback (Contributor, Author) commented May 22, 2018


Parameters
----------
other : ExtenionArray or list/tuple of ExtenionArrays
Review comment (Member):

typo Extenion-->Extension

@codecov bot commented May 22, 2018

Codecov Report

Merging #21160 into master will increase coverage by 0.01%.
The diff coverage is 95.48%.

@@            Coverage Diff             @@
##           master   #21160      +/-   ##
==========================================
+ Coverage   91.96%   91.98%   +0.01%     
==========================================
  Files         166      167       +1     
  Lines       50329    50606     +277     
==========================================
+ Hits        46287    46551     +264     
- Misses       4042     4055      +13
Flag Coverage Δ
#multiple 90.39% <95.48%> (+0.02%) ⬆️
#single 42.18% <34.51%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 99.18% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.95% <100%> (ø) ⬆️
pandas/core/arrays/base.py 87.85% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.37% <100%> (+0.01%) ⬆️
pandas/core/missing.py 91.66% <100%> (+0.02%) ⬆️
pandas/core/dtypes/cast.py 88.52% <100%> (+0.16%) ⬆️
pandas/core/series.py 94.1% <100%> (-0.02%) ⬇️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/internals.py 95.49% <100%> (+0.06%) ⬆️
pandas/core/dtypes/common.py 95.2% <100%> (+0.12%) ⬆️
... and 6 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 537b65c...4f04f90.

@jbrockmendel (Member) commented:
There's a lot here, so preliminary thoughts based on a quick pass:

  • Getting integer NA implemented will be a big win!

  • If there are bite-sized pieces that can be broken off, that will make life easier

  • For a bitmask is there a good heuristic for when to use sparse vs dense? "Too early to worry about that" is a reasonable answer.

  • For ops, this skips several of the previously-discussed steps in transitioning to One True Implementation.

    • IIRC there are inconsistencies in Series vs Index division by zero behavior for numeric types. I'd prefer to get these aligned before transitioning the dispatch logic.
    • Should IntegerArray also back Int64Index and UInt64Index?

@Dr-Irv (Contributor) commented May 22, 2018

@jreback Here is my take on the operators implementation, based on what I did in #20889. Looking at your implementation in pandas/core/arrays/integer.py, you've defined the operators on the arrays themselves. I did set things up in #20889 so that if someone did that with ExtensionArray, there is a dispatch to those methods. The important thing in pandas/core/ops.py is that you can't assume anything about the underlying implementation of ExtensionArray, and it's possible that your changes to ops.py might be assuming that ExtensionArray is using a numpy implementation under the hood. I think (but I'm not 100% sure) that if you take what I did in pandas/core/ops.py with what you've done here, things would work correctly.
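The dispatch concern can be illustrated with a toy sketch (not pandas internals; all class and function names here are hypothetical): the ops layer simply defers to the operator the array class itself defines, so no numpy assumptions leak into `ops.py`:

```python
import numpy as np

class ToyIntNAArray:
    """Toy stand-in for an extension array; not pandas code."""
    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype='int64')
        self._mask = np.asarray(mask, dtype=bool)  # True marks a missing slot

    def __add__(self, other):
        # the array class owns the op, including its NA semantics
        if isinstance(other, ToyIntNAArray):
            return ToyIntNAArray(self._data + other._data,
                                 self._mask | other._mask)
        return ToyIntNAArray(self._data + other, self._mask)

def dispatch_binary_op(left, right, op):
    # ops-layer sketch: defer to the operands' own dunder methods
    # instead of reaching into a presumed numpy ndarray.
    return op(left, right)

a = ToyIntNAArray([1, 2, 0], [False, False, True])
b = ToyIntNAArray([10, 0, 5], [False, True, False])
out = dispatch_binary_op(a, b, lambda x, y: x + y)
```

The key property is that the result's mask is computed by the array class itself (here, an elementwise OR of the input masks), not by generic ops code.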

@TomAugspurger (Contributor) left a comment:

Only partway through. Will try to take a closer look later today.

@@ -156,6 +156,12 @@ def name(self):
"""
raise AbstractMethodError(self)

@property
def array_type(self):
Review comment (Contributor):

Add this to the list of abstract methods / properties on line 110

Review comment (Contributor):

This change limits us to 1 array type per extension dtype. Is everyone OK with that? I don't see any downsides to that.

Reply (Contributor, Author):

this is not a problem, you simply have multiple arrays and multiple dtypes (as I did here).

from pandas import compat
from pandas.core.dtypes.generic import ABCIndexClass, ABCCategoricalIndex

from .base import ExtensionDtype, _DtypeOpsMixin


class Registry(object):
""" class to register our dtypes for inference
Review comment (Contributor):

just "Registry for dtype inference"

@jreback jreback changed the title WIP: Integer NA Extension Array ENH: Integer NA Extension Array May 24, 2018
@jreback (Contributor, Author) commented May 24, 2018

I split out #21185 with the changes to EA. To answer some of @jbrockmendel's questions:

Getting integer NA implemented will be a big win!
yes

If there are bite-sized pieces that can be broken off, that will make life easier
see above

For a bitmask is there a good heuristic for when to use sparse vs dense? "Too early to worry about that" is a reasonable answer.

I think this would overcomplicate things.

For ops, this skips several of the previously-discussed steps in transitioning to One True Implementation.

how so, the implementation is barely changed here

IIRC there are inconsistencies in Series vs Index division by zero behavior for numeric types. I'd prefer to get these aligned before transitioning the dispatch logic.

yes this is tricky, we are matching on the current impl. if we were to change this we should adjust.

Should IntegerArray also back Int64Index and UInt64Index?

sure, but we don't really support the concept of an ExtensionIndex yet.

@jreback jreback force-pushed the intna branch 2 times, most recently from 741edac to 97b01e4 on May 24, 2018 11:44
@jreback jreback mentioned this pull request May 24, 2018
if mask is None:
mask = isna(values)
else:
assert len(mask) == len(values)
Review comment (Member):

should this raise a ValueError instead of AssertionError?

Reply (Contributor, Author):

no, this is an internal construction error, need to satisfy the input guarantees

is_extension_array_dtype(y) and not
is_scalar(y)):
y = x.__class__._from_sequence(y)
return op(x, y)
Review comment (Member):

overlap in comment in #21191 since IntegerArray has an __init__ that I'm pretty sure can take left here (actually below in wrapper), I'm hopeful that in this case we can save a lot of trouble by using dispatch_to_index_op directly (passing the EA subclass rather than an Index subclass)

@jreback (Contributor, Author) commented May 29, 2018

latest push removes the type-specific arrays, IOW everything is now just an IntegerArray with a dtype; and added a repr that shares code with Index (so it looks similar).

@Dr-Irv (Contributor) left a comment:

In pandas/core/arrays/base.py, I think you are missing __rmod__ in the list of arithmetic operators

def _from_factorized(cls, values, original):
return cls(values, dtype=original.dtype)

def __getitem__(self, item):
Review comment (Member):

Without trying to overcomplicate things, have we considered moving some of these items to a MaskedEAMixin? I'm thinking of taking a stab at the Boolean EA next and can see this being generalizable, along with a few other methods (__iter__, __setitem__, perhaps take, etc...)

Reply (Contributor, Author):

certainly could, though I think it might be better to just directly subclass IntegerArray and the dtype, but that's for another PR :>
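The mixin idea floated above can be sketched like this (purely illustrative, not pandas code; the convention here is that True in the mask marks a missing slot): behavior that depends only on the (data, mask) pair lives in a mixin that a future Boolean array could reuse.

```python
import numpy as np

class MaskedArrayMixin:
    """Shared behavior for any (data, mask) backed array; hypothetical."""
    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        if np.isscalar(item) or isinstance(item, int):
            # scalar access: surface the missing-value sentinel
            return None if self._mask[item] else self._data[item]
        # array-like indexing: slice both data and mask together
        return type(self)(self._data[item], self._mask[item])

class ToyIntegerArray(MaskedArrayMixin):
    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype='int64')
        self._mask = np.asarray(mask, dtype=bool)

arr = ToyIntegerArray([1, 2, 3], [False, True, False])
```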

@jreback (Contributor, Author) commented Jul 16, 2018

anyhow, any final comments @pandas-dev/pandas-core?

certainly will be follow-ups in any event.

@jreback (Contributor, Author) commented Jul 20, 2018

@jorisvandenbossche

@chris-b1 (Contributor) commented:
Made a couple doc follow-up notes in #22003

@jorisvandenbossche (Member) commented:
I think we should consider not using np.nan for the missing value indicator (i.e. the value you get back on scalar access, or on conversion to a numpy array).

what is the reasoning here? users are accustomed to using np.nan and pd.NaT exclusively. (and not really None), and an already constructed float array that is using integers + nan is a natural conversion.

I am not necessarily saying that we should use None, but the fact is that we currently somewhat 'misuse' np.nan as missing value indicator (for lack of good alternative, for sure). But when we start adding the capability to more data types than float, I think we should consider using a separate value for this.
For example in arrow there is a distinction between NaN and NA (there was recently some discussion on that). I think in the end we should also do that. I am not sure when we should do that, but I would like to at least see some discussion about it.
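For context, this discussion eventually led to a dedicated missing-value scalar, pd.NA, in later pandas releases. A minimal sketch of the observable difference, assuming pandas >= 1.0:

```python
import numpy as np
import pandas as pd

# Nullable dtypes hand back pd.NA on scalar access...
s_nullable = pd.Series([1, None], dtype="Int64")
missing_nullable = s_nullable[1]

# ...while plain float64 columns still use np.nan.
s_float = pd.Series([1.0, np.nan])
missing_float = s_float[1]
```

This mirrors the Arrow distinction mentioned above: NA (a missing value) is kept separate from NaN (a floating-point value).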

@jreback (Contributor, Author) commented Jul 20, 2018

I am not necessarily saying that we should use None, but the fact is that we currently somewhat 'misuse' np.nan as missing value indicator (for lack of good alternative, for sure). But when we start adding the capability to more data types than float, I think we should consider using a separate value for this.
For example in arrow there is a distinction between NaN and NA (there was recently some discussion on that). I think in the end we should also do that. I am not sure when we should do that, but I would like to at least see some discussion about it.

you can certainly make an issue to discuss this, and I agree we should prob move to a singular Null type. But this is not the PR for it. I disagree that None is in any way superior here; it is just plain confusing.

@jorisvandenbossche (Member) commented:

I disagree that None is any way superior here, and is just plain confusing.

Again, I didn't say we should use None for this. I also think we should not use it; it is already used in many other contexts.

@jreback (Contributor, Author) commented Jul 20, 2018

ok, let's certainly discuss. In reality it is a pretty cosmetic detail (mostly for actual printing).

@jreback jreback merged commit 8fd8d0d into pandas-dev:master Jul 20, 2018
@jreback (Contributor, Author) commented Jul 20, 2018

bombs away

@jorisvandenbossche (Member) commented:

Ahum, I was actually reviewing this now

@jorisvandenbossche (Member) left a comment:

It would make it easier to review if you did not rebase your commits; once you do that, GitHub's feature to automatically show what has changed since the last review no longer works.

To come back to my previous comment: can you add a section to the actual documentation? (not only whatsnew)

expected = pd.Series([True, True, False, False],
index=list('ABCD'))
result = df.dtypes.apply(str) == str(dtype)
self.assert_series_equal(result, expected)
Review comment (Member):

also test with the dtype itself? (result = df.dtypes == dtype)

IntegerArray
"""
self._data, self._mask = coerce_to_array(
values, dtype=dtype, mask=mask, copy=copy)
Review comment (Member):

Can you respond here further?

# coerce when needed
s + 0.01

These dtypes can operate as part of of ``DataFrame``.
Review comment (Member):

"of of " -> "of a"

"""
We represent an IntegerArray with 2 numpy arrays
- data: contains a numpy integer array of the appropriate dtype
- mask: a boolean array holding a mask on the data, False is missing
Review comment (Member):

False is missing

I don't think this is true currently?

But given the discussion earlier, I think it would be good to actually implement what you stated there to follow the example of arrow?


# coerce
data = self._coerce_to_ndarray()
return data.astype(dtype=dtype, copy=False)
Review comment (Member):

Should we treat converting to float separately here? (that could be easily made more performant, as a probably common use case for astype)
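The float fast path being suggested can be sketched directly on the two-array representation (illustrative only, not the PR's code; here True in the mask marks a missing slot): copy the integer data into a float array and write NaN through the mask, avoiding a generic object-array round trip.

```python
import numpy as np

def masked_ints_to_float(data, mask):
    """Convert (int data, bool mask) to a float64 ndarray with NaN holes."""
    out = data.astype('float64')   # cheap vectorized copy-and-cast
    out[mask] = np.nan             # write NaN only where values are missing
    return out

data = np.array([1, 2, 3], dtype='int64')
mask = np.array([False, True, False])
converted = masked_ints_to_float(data, mask)
```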


def _maybe_mask_result(self, result, mask, other, op_name):
"""
Parameters
Review comment (Member):

This one

raise NotImplementedError(
"can only perform ops with 1-d structures")
elif is_list_like(other):
other = np.asarray(other)
Review comment (Member):

Should we try to convert to IntegerArray here if possible?
eg s + s.tolist() gives floats (in case s is a series with int-na dtype)
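One way to read this suggestion (a hypothetical sketch, not the PR's code): before falling back to a plain float ndarray, check whether a list-like operand is losslessly integers-plus-NA, in which case the op could keep the nullable dtype.

```python
import numpy as np

def coerce_listlike(other):
    """Split a list-like into (values, mask); prefer an integer view."""
    arr = np.asarray(other, dtype='float64')
    mask = np.isnan(arr)                      # True marks a missing slot
    ints = np.where(mask, 0, arr).astype('int64')
    # lossless check: every non-missing value survives the int round trip
    if np.all(np.where(mask, True, arr == ints)):
        return ints, mask                     # keep integer-NA semantics
    return arr, mask                          # genuinely float data
```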

# otherwise perform the op
if isinstance(right, compat.string_types):
raise TypeError("{typ} cannot perform the operation mod".format(
typ=type(left).__name__))
Review comment (Member):

This one

if isinstance(right, np.ndarray):

# handle numpy scalars, this is a PITA
# TODO(jreback)
Review comment (Member):

Can you clarify this to do comment?

# assert our expected result
self.assert_series_equal(result, expected)

def test_arith_integer_array(self, data, all_arithmetic_operators):
Review comment (Member):

We need to discuss how to handle this (it will pop-up in all the other internal extension dtypes as well), so now is a good time.
I would personally move all specific tests that you add here that are not in the parent base tests to tests/arrays/integer, and only subclass here to check that the base tests make sense

ExtensionScalarOpsMixin)
from .categorical import Categorical # noqa
from .datetimes import DatetimeArrayMixin # noqa
from .interval import IntervalArray # noqa
from .period import PeriodArrayMixin # noqa
from .timedeltas import TimedeltaArrayMixin # noqa
from .integer import ( # noqa
IntegerArray, to_integer_array)
Review comment (Member):

What was the goal of exposing to_integer_array here? Is it used somewhere else?

@jreback (Contributor, Author) commented Jul 20, 2018

well, I'll address these in a follow-up. and for the record I didn't actually rebase this.

@jreback (Contributor, Author) commented Jul 20, 2018

Can you respond here further?

your comment about the __init__: well, I already addressed this, and it is simply not possible to have the __init__ do nothing. I am -1 on adding from_* methods. This is how it is implemented. If you want to change it you are welcome to submit a PR.

@shoyer (Member) commented Jul 20, 2018

This is how it is implemented. If you want to change it you are welcome to submit a PR.

Sorry, but I don't think this should be an acceptable response. I think we should either roll back this PR or disable all public interfaces for this functionality until the (collective) development team is happy with the new features. As of now, I would consider this a blocker for new releases.

As a general policy, I think we should require explicit approval from at least one other core developer before merging pull requests. I am all for avoiding large PRs and doing collaborative development in master, but we should not merge PRs unless they pass code review (certainly not for large changes like this one).

@jreback (Contributor, Author) commented Jul 20, 2018

@shoyer that is a disingenuous response. If you want to do code review on the 100's of open PR's, be my guest. This PR has been in a ready state for weeks.

@jreback (Contributor, Author) commented Jul 20, 2018

As a general policy, I think we should require explicit approval from at least one other core developer before merging pull requests. I am all for avoiding large PRs and doing collaborative development in master, but we should not merge PRs unless they pass code review (certainly not for large changes like this one)

If folks respond in a timely manner, sure. If you haven't noticed, I approve every other PR in a pretty timely manner, after rounds and rounds of comments.

I believe I have earned, and deserve the same consideration.

@jorisvandenbossche (Member) commented:

If you want to do code review on the 100's of open PR's be my guest.

This PR is not just "one of the 100's of open PRs", but a major change that needs more attention than smaller PRs. I agree with Stephan that it would be good for such PRs to have the explicit +1 of other core devs.
I did my best to review again as fast as possible, but have been very busy the last days. Time of other core devs is scarce, that is a reality we have to live with.

But OK, although I would have preferred to iterate further in this PR, let's move forward in follow-up PRs!
It would be helpful for that if you could respond to some of my comments above (of which some were already there in my previous review).


.. warning::

The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
Review comment (Member):

captilized -> capitalized?

@jreback (Contributor, Author) commented Jul 20, 2018

This PR is not just "one of the 100's of open PRs"

This misses the point. I spend an inordinate amount of time reviewing practically everything. I don't expect everyone to do this; rather, for my PRs, I DO expect review time from other folks.

At the same time, endless comments, while certainly debatable, we don't force on anyone. Many many times, we have done follow up PR's just to cut down on the endless review cycle.

In any event I will address your comments.

Successfully merging this pull request may close these issues.

ExtensionArray construction with given dtype (sort of shallow_copy?)
API: integer Extension Array