
ENH: Integer NA Extension Array #21160

Merged: 23 commits into pandas-dev:master on Jul 20, 2018

Conversation

@jreback (Contributor) commented May 22, 2018

closes #20700
closes #20747

In [1]: df = pd.DataFrame({
'A': pd.Series([1, 2, np.nan], dtype='Int64'), 
'B': pd.Series([1, np.nan, 3], dtype='UInt8'), 
'C': [1, 2, 3]})

In [2]: df
Out[2]: 
     A    B  C
0    1    1  1
1    2  NaN  2
2  NaN    3  3

In [3]: df.dtypes
Out[3]: 
A    Int64
B    UInt8
C    int64
dtype: object

In [4]: df.A + df.B
Out[4]: 
0      2
1    NaN
2    NaN
dtype: Int64

In [5]: df.A + df.C
Out[5]: 
0      2
1      4
2    NaN
dtype: Int64

In [6]: (df.A + df.C) * 3
Out[6]: 
0      6
1     12
2    NaN
dtype: Int64

In [7]: (df.A + df.C) * 3 == 1
Out[7]: 
0    False
1    False
2    False
dtype: bool

In [8]: (df.A + df.C) * 3 == 12
Out[8]: 
0    False
1     True
2    False
dtype: bool
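The session above can be reproduced with a modern pandas (a minimal sketch, assuming a version with the nullable integer dtypes this PR introduced, i.e. 0.24 or later):

```python
import numpy as np
import pandas as pd

# Reconstruct the frame from the session above. The capitalized dtype
# names ("Int64", "UInt8") select the nullable extension dtypes, while
# plain "int64" is the ordinary numpy dtype.
df = pd.DataFrame({
    'A': pd.Series([1, 2, np.nan], dtype='Int64'),
    'B': pd.Series([1, np.nan, 3], dtype='UInt8'),
    'C': [1, 2, 3],
})

# Arithmetic propagates missing values and keeps the nullable dtype,
# instead of silently upcasting to float64.
result = df.A + df.C
```

Note that mixing a nullable column with a plain numpy int column (`df.A + df.C`) still yields the nullable `Int64` dtype.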

@jreback jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels May 22, 2018
@jreback jreback added this to the 0.24.0 milestone May 22, 2018
@jreback (Contributor, Author) commented May 22, 2018


Parameters
----------
other : ExtenionArray or list/tuple of ExtenionArrays
Review comment (Member):

typo Extenion-->Extension

@codecov bot commented May 22, 2018

Codecov Report

Merging #21160 into master will increase coverage by 0.01%.
The diff coverage is 95.48%.

@@            Coverage Diff             @@
##           master   #21160      +/-   ##
==========================================
+ Coverage   91.96%   91.98%   +0.01%     
==========================================
  Files         166      167       +1     
  Lines       50329    50606     +277     
==========================================
+ Hits        46287    46551     +264     
- Misses       4042     4055      +13
Flag Coverage Δ
#multiple 90.39% <95.48%> (+0.02%) ⬆️
#single 42.18% <34.51%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 99.18% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.95% <100%> (ø) ⬆️
pandas/core/arrays/base.py 87.85% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.37% <100%> (+0.01%) ⬆️
pandas/core/missing.py 91.66% <100%> (+0.02%) ⬆️
pandas/core/dtypes/cast.py 88.52% <100%> (+0.16%) ⬆️
pandas/core/series.py 94.1% <100%> (-0.02%) ⬇️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/internals.py 95.49% <100%> (+0.06%) ⬆️
pandas/core/dtypes/common.py 95.2% <100%> (+0.12%) ⬆️
... and 6 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 537b65c...4f04f90.

@jbrockmendel (Member) commented:
There's a lot here, so preliminary thoughts based on a quick pass:

  • Getting integer NA implemented will be a big win!

  • If there are bite-sized pieces that can be broken off, that will make life easier

  • For a bitmask is there a good heuristic for when to use sparse vs dense? "Too early to worry about that" is a reasonable answer.

  • For ops, this skips several of the previously-discussed steps in transitioning to One True Implementation.

    • IIRC there are inconsistencies in Series vs Index division by zero behavior for numeric types. I'd prefer to get these aligned before transitioning the dispatch logic.
    • Should IntegerArray also back Int64Index and UInt64Index?

@Dr-Irv (Contributor) commented May 22, 2018

@jreback Here is my take on the operators implementation, based on what I did in #20889. Looking at your implementation in pandas/core/arrays/integer.py, you've defined the operators on the arrays themselves. I did set things up in #20889 so that if someone did that with ExtensionArray, there is a dispatch to those methods. The important thing in pandas/core/ops.py is that you can't assume anything about the underlying implementation of ExtensionArray, and it's possible that your changes to ops.py might be assuming that ExtensionArray is using a numpy implementation under the hood. I think (but I'm not 100% sure) that if you take what I did in pandas/core/ops.py with what you've done here, things would work correctly.
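The dispatch concern can be illustrated with a toy sketch (not pandas internals; all class and function names here are hypothetical): the ops layer simply defers to the operator the array class itself defines, so no numpy assumptions leak into `ops.py`:

```python
import numpy as np

class ToyIntNAArray:
    """Toy stand-in for an extension array; not pandas code."""
    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype='int64')
        self._mask = np.asarray(mask, dtype=bool)  # True marks a missing slot

    def __add__(self, other):
        # the array class owns the op, including its NA semantics
        if isinstance(other, ToyIntNAArray):
            return ToyIntNAArray(self._data + other._data,
                                 self._mask | other._mask)
        return ToyIntNAArray(self._data + other, self._mask)

def dispatch_binary_op(left, right, op):
    # ops-layer sketch: defer to the operands' own dunder methods
    # instead of reaching into a presumed numpy ndarray.
    return op(left, right)

a = ToyIntNAArray([1, 2, 0], [False, False, True])
b = ToyIntNAArray([10, 0, 5], [False, True, False])
out = dispatch_binary_op(a, b, lambda x, y: x + y)
```

The key property is that the result's mask is computed by the array class itself (here, an elementwise OR of the input masks), not by generic ops code.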

@TomAugspurger (Contributor) left a comment:

Only partway through. Will try to take a closer look later today.

@@ -156,6 +156,12 @@ def name(self):
"""
raise AbstractMethodError(self)

@property
def array_type(self):
Review comment (Contributor):

Add this to the list of abstract methods / properties on line 110

Review comment (Contributor):

This change limits us to 1 array type per extension dtype. Is everyone OK with that? I don't see any downsides to that.

Reply (Contributor, Author):

this is not a problem, you simply have multiple arrays and multiple dtypes (as I did here).

from pandas import compat
from pandas.core.dtypes.generic import ABCIndexClass, ABCCategoricalIndex

from .base import ExtensionDtype, _DtypeOpsMixin


class Registry(object):
""" class to register our dtypes for inference
Review comment (Contributor):

just "Registry for dtype inference"

@jreback jreback changed the title WIP: Integer NA Extension Array ENH: Integer NA Extension Array May 24, 2018
@jreback (Contributor, Author) commented May 24, 2018

I split out #21185 with the changes to EA. To answer some of @jbrockmendel's questions:

Getting integer NA implemented will be a big win!
yes

If there are bite-sized pieces that can be broken off, that will make life easier
see above

For a bitmask is there a good heuristic for when to use sparse vs dense? "Too early to worry about that" is a reasonable answer.

I think this would overcomplicate things.

For ops, this skips several of the previously-discussed steps in transitioning to One True Implementation.

how so, the implementation is barely changed here

IIRC there are inconsistencies in Series vs Index division by zero behavior for numeric types. I'd prefer to get these aligned before transitioning the dispatch logic.

yes this is tricky, we are matching on the current impl. if we were to change this we should adjust.

Should IntegerArray also back Int64Index and UInt64Index?

sure, but we don't really support the concept of an ExtensionIndex yet.

@jreback jreback force-pushed the intna branch 2 times, most recently from 741edac to 97b01e4 on May 24, 2018 11:44
@jreback jreback mentioned this pull request May 24, 2018
if mask is None:
mask = isna(values)
else:
assert len(mask) == len(values)
Review comment (Member):

should this raise a ValueError instead of AssertionError?

Reply (Contributor, Author):

no, this is an internal construction error, need to satisfy the input guarantees

is_extension_array_dtype(y) and not
is_scalar(y)):
y = x.__class__._from_sequence(y)
return op(x, y)
Review comment (Member):

overlap in comment in #21191 since IntegerArray has an __init__ that I'm pretty sure can take left here (actually below in wrapper), I'm hopeful that in this case we can save a lot of trouble by using dispatch_to_index_op directly (passing the EA subclass rather than an Index subclass)

@jreback (Contributor, Author) commented May 29, 2018

latest push removes the type-specific arrays, IOW everything is now just an IntegerArray with a dtype; and added a repr that shares code with Index (so it looks similar).

@Dr-Irv (Contributor) left a comment:

In pandas/core/arrays/base.py, I think you are missing __rmod__ in the list of arithmetic operators

def _from_factorized(cls, values, original):
return cls(values, dtype=original.dtype)

def __getitem__(self, item):
Review comment (Member):

Without trying to overcomplicate things, have we considered moving some of these items to a MaskedEAMixin? I'm thinking of taking a stab at the Boolean EA next and can see this being generalizable, along with a few other methods (__iter__, __setitem__, perhaps take, etc...)

Reply (Contributor, Author):

certainly could, though I think it might be better to just directly subclass IntegerArray and the dtype, but that's for another PR :>
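The mixin idea floated above can be sketched like this (purely illustrative, not pandas code; the convention here is that True in the mask marks a missing slot): behavior that depends only on the (data, mask) pair lives in a mixin that a future Boolean array could reuse.

```python
import numpy as np

class MaskedArrayMixin:
    """Shared behavior for any (data, mask) backed array; hypothetical."""
    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        if np.isscalar(item) or isinstance(item, int):
            # scalar access: surface the missing-value sentinel
            return None if self._mask[item] else self._data[item]
        # array-like indexing: slice both data and mask together
        return type(self)(self._data[item], self._mask[item])

class ToyIntegerArray(MaskedArrayMixin):
    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype='int64')
        self._mask = np.asarray(mask, dtype=bool)

arr = ToyIntegerArray([1, 2, 3], [False, True, False])
```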

@jreback (Contributor, Author) commented Jul 16, 2018

anyhow, any final comments @pandas-dev/pandas-core?

certainly will be follow-ups in any event.

@jreback (Contributor, Author) commented Jul 20, 2018

@jorisvandenbossche

@chris-b1 (Contributor) commented:
Made a couple doc follow-up notes in #22003

@jorisvandenbossche (Member) commented:
I think we should consider not using np.nan for the missing value indicator (i.e. the value you get back on scalar access, or on conversion to a numpy array).

what is the reasoning here? users are accustomed to using np.nan and pd.NaT exclusively. (and not really None), and an already constructed float array that is using integers + nan is a natural conversion.

I am not necessarily saying that we should use None, but the fact is that we currently somewhat 'misuse' np.nan as missing value indicator (for lack of good alternative, for sure). But when we start adding the capability to more data types than float, I think we should consider using a separate value for this.
For example in arrow there is a distinction between NaN and NA (there was recently some discussion on that). I think in the end we should also do that. I am not sure when we should do that, but I would like to at least see some discussion about it.
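For context, this discussion eventually led to a dedicated missing-value scalar, pd.NA, in later pandas releases. A minimal sketch of the observable difference, assuming pandas >= 1.0:

```python
import numpy as np
import pandas as pd

# Nullable dtypes hand back pd.NA on scalar access...
s_nullable = pd.Series([1, None], dtype="Int64")
missing_nullable = s_nullable[1]

# ...while plain float64 columns still use np.nan.
s_float = pd.Series([1.0, np.nan])
missing_float = s_float[1]
```

This mirrors the Arrow distinction mentioned above: NA (a missing value) is kept separate from NaN (a floating-point value).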

@jreback (Contributor, Author) commented Jul 20, 2018

I am not necessarily saying that we should use None, but the fact is that we currently somewhat 'misuse' np.nan as missing value indicator (for lack of good alternative, for sure). But when we start adding the capability to more data types than float, I think we should consider using a separate value for this.
For example in arrow there is a distinction between NaN and NA (there was recently some discussion on that). I think in the end we should also do that. I am not sure when we should do that, but I would like to at least see some discussion about it.

you can certainly make an issue to discuss this, and I agree we should prob move to a singular Null type. But this is not the PR for it. I disagree that None is in any way superior here; it is just plain confusing.

@jorisvandenbossche (Member) commented:

I disagree that None is any way superior here, and is just plain confusing.

Again, I didn't say we should use None for this. I also think we should not use it; it is already used in many other contexts.

@jreback (Contributor, Author) commented Jul 20, 2018

ok, let's certainly discuss. In reality it is a pretty cosmetic detail (mostly for actual printing).

@jreback jreback merged commit 8fd8d0d into pandas-dev:master Jul 20, 2018
@jreback (Contributor, Author) commented Jul 20, 2018

bombs away

@jorisvandenbossche (Member) commented:

Ahum, I was actually reviewing this now

@jorisvandenbossche (Member) left a comment:

It would make it easier to review if you did not rebase your commits; once you do that, GitHub's feature to automatically show what has changed since the last review no longer works.

To come back to my previous comment: can you add a section to the actual documentation? (not only whatsnew)

expected = pd.Series([True, True, False, False],
index=list('ABCD'))
result = df.dtypes.apply(str) == str(dtype)
self.assert_series_equal(result, expected)
Review comment (Member):

also test with the dtype itself? (result = df.dtypes == dtype)

IntegerArray
"""
self._data, self._mask = coerce_to_array(
values, dtype=dtype, mask=mask, copy=copy)
Review comment (Member):

Can you respond here further?

# coerce when needed
s + 0.01

These dtypes can operate as part of of ``DataFrame``.
Review comment (Member):

"of of " -> "of a"

"""
We represent an IntegerArray with 2 numpy arrays
- data: contains a numpy integer array of the appropriate dtype
- mask: a boolean array holding a mask on the data, False is missing
Review comment (Member):

False is missing

I don't think this is true currently?

But given the discussion earlier, I think it would be good to actually implement what you stated there to follow the example of arrow?


# coerce
data = self._coerce_to_ndarray()
return data.astype(dtype=dtype, copy=False)
Review comment (Member):

Should we treat converting to float separately here? (that could be easily made more performant, as a probably common use case for astype)
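The float fast path being suggested can be sketched directly on the two-array representation (illustrative only, not the PR's code; here True in the mask marks a missing slot): copy the integer data into a float array and write NaN through the mask, avoiding a generic object-array round trip.

```python
import numpy as np

def masked_ints_to_float(data, mask):
    """Convert (int data, bool mask) to a float64 ndarray with NaN holes."""
    out = data.astype('float64')   # cheap vectorized copy-and-cast
    out[mask] = np.nan             # write NaN only where values are missing
    return out

data = np.array([1, 2, 3], dtype='int64')
mask = np.array([False, True, False])
converted = masked_ints_to_float(data, mask)
```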


def _maybe_mask_result(self, result, mask, other, op_name):
"""
Parameters
Review comment (Member):

This one

raise NotImplementedError(
"can only perform ops with 1-d structures")
elif is_list_like(other):
other = np.asarray(other)
Review comment (Member):

Should we try to convert to IntegerArray here if possible?
eg s + s.tolist() gives floats (in case s is a series with int-na dtype)
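One way to read this suggestion (a hypothetical sketch, not the PR's code): before falling back to a plain float ndarray, check whether a list-like operand is losslessly integers-plus-NA, in which case the op could keep the nullable dtype.

```python
import numpy as np

def coerce_listlike(other):
    """Split a list-like into (values, mask); prefer an integer view."""
    arr = np.asarray(other, dtype='float64')
    mask = np.isnan(arr)                      # True marks a missing slot
    ints = np.where(mask, 0, arr).astype('int64')
    # lossless check: every non-missing value survives the int round trip
    if np.all(np.where(mask, True, arr == ints)):
        return ints, mask                     # keep integer-NA semantics
    return arr, mask                          # genuinely float data
```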

# otherwise perform the op
if isinstance(right, compat.string_types):
raise TypeError("{typ} cannot perform the operation mod".format(
typ=type(left).__name__))
Review comment (Member):

This one

if isinstance(right, np.ndarray):

# handle numpy scalars, this is a PITA
# TODO(jreback)
Review comment (Member):

Can you clarify this to do comment?

# assert our expected result
self.assert_series_equal(result, expected)

def test_arith_integer_array(self, data, all_arithmetic_operators):
Review comment (Member):

We need to discuss how to handle this (it will pop-up in all the other internal extension dtypes as well), so now is a good time.
I would personally move all specific tests that you add here that are not in the parent base tests to tests/arrays/integer, and only subclass here to check that the base tests make sense

ExtensionScalarOpsMixin)
from .categorical import Categorical # noqa
from .datetimes import DatetimeArrayMixin # noqa
from .interval import IntervalArray # noqa
from .period import PeriodArrayMixin # noqa
from .timedeltas import TimedeltaArrayMixin # noqa
from .integer import ( # noqa
IntegerArray, to_integer_array)
Review comment (Member):

What was the goal of exposing to_integer_array here? Is it used somewhere else?

@jreback (Contributor, Author) commented Jul 20, 2018

well, I'll address these in a follow-up. and for the record I didn't actually rebase this.

@jreback (Contributor, Author) commented Jul 20, 2018

Can you respond here further?

your comment about the __init__: well, I already addressed this, and it is simply not possible to have the __init__ do nothing. I am -1 on adding from_* methods. This is how it is implemented. If you want to change it you are welcome to submit a PR.

@shoyer (Member) commented Jul 20, 2018

This is how it is implemented. If you want to change it you are welcome to submit a PR.

Sorry, but I don't think this should be an acceptable response. I think we should either roll back this PR or disable all public interfaces for this functionality until the (collective) development team is happy with the new features. As of now, I would consider this a blocker for new releases.

As a general policy, I think we should require explicit approval from at least one other core developer before merging pull requests. I am all for avoiding large PRs and doing collaborative development in master, but we should not merge PRs unless they pass code review (certainly not for large changes like this one).

@jreback (Contributor, Author) commented Jul 20, 2018

@shoyer that is a disingenuous response. If you want to do code review on the 100's of open PR's, be my guest. This PR has been in a ready state for weeks.

@jreback (Contributor, Author) commented Jul 20, 2018

As a general policy, I think we should require explicit approval from at least one other core developer before merging pull requests. I am all for avoiding large PRs and doing collaborative development in master, but we should not merge PRs unless they pass code review (certainly not for large changes like this one)

If folks respond in a timely manner, sure. If you haven't noticed, I approve every other PR in a pretty timely manner, after rounds and rounds of comments.

I believe I have earned, and deserve the same consideration.

@jorisvandenbossche (Member) commented:

If you want to do code review on the 100's of open PR's be my guest.

This PR is not just "one of the 100's of open PRs", but a major change that needs more attention than smaller PRs. I agree with Stephan that it would be good for such PRs to have the explicit +1 of other core devs.
I did my best to review again as fast as possible, but have been very busy the last days. Time of other core devs is scarce, that is a reality we have to live with.

But OK, although I would have preferred to iterate further in this PR, let's move forward in follow-up PRs!
It would be helpful for that if you could respond to some of my comments above (of which some were already there in my previous review).


.. warning::

The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
Review comment (Member):

captilized -> capitalized?

@jreback (Contributor, Author) commented Jul 20, 2018

This PR is not just "one of the 100's of open PRs"

This misses the point. I spend an inordinate amount of time reviewing practically everything. I don't expect everyone to do this; rather, for my PRs, I DO expect review time from other folks.

At the same time, endless comments, while certainly debatable, we don't force on anyone. Many many times, we have done follow up PR's just to cut down on the endless review cycle.

In any event I will address your comments.

Successfully merging this pull request may close these issues.

ExtensionArray construction with given dtype (sort of shallow_copy?)
API: integer Extension Array