Missing values proposal: concrete steps for 1.0 #29556

jorisvandenbossche · 2019-11-11T22:14:44Z

Updated with to do list:

Original issue:

Issue to discuss the implementation strategy for #28095.
Opening a new issue, as the other one already has a lot of discussion in several discussion, and would propose to keep this one focused on the practical aspects of how to implement this (regardless of certain aspects of the NA proposal such as single NA vs dtype-specific NAs -> for that will post a summary of the discussion on #28095 tomorrow).

I would like to propose the following way forward:

On the short term (ideally for 1.0):

Already implement and provide the pd.NA scalar, and recognize it in the appropriate places as missing value (e.g. pd.isna). This way, it can already be used in external ExtentionArrays
Implement a BooleanArray with support for missing values and appropriate NA behaviour. To start, we can just use a numpy masked array approach (similar to the existing IntegerArray), not involving any pyarrow memory optimizations.
Start using this BooleanArray as the boolean result of comparison operations for IntegerArray/StringArray (breaking change for nullable integers)
- Other arrays will keep using the numpy bool, this means we have two "boolean" dtypes side by side with different behaviour, and which one you get depends on the original data type (potentially confusing for users)
Start using pd.NA as the missing value indicator for Integer/String/BooleanArray (breaking change for nullable integers)

On the intermediate term (after 1.0)

Investigate if it can be implemented optionally for other data types and "activated" to have users opt-in for existing dtypes (to be further thought out).

I think the main discussion point is if we are OK with such a breaking change for IntegerArray.
I would personally do this: IntegerArray was only introduced recently, still regarded as experimental, and the perfect use case for those changes. But, it's certainly a clear backwards incompatible, breaking change.

cc @pandas-dev/pandas-core

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2019-11-11T22:36:10Z

My initial reaction is +1 on having StringArray and IntegerArray start using this new pd.NA value. Of all the "experimental" things in StringArray, the NA scalar is the one I know we'll want to break, since using np.nan really doesn't make sense.

If we're able to get BooleanArray done for 1.0, then that'd be great. I think time-wise, it's OK to settle for a simple ndarray + a private mask attribute. We optimize to use pyarrow or a bitmask later. (edit: just saw #29555 (review) Nice!)

So to be explicit, +1 on the proposed breaking changes to IntegerArray.

jreback · 2019-11-12T00:20:03Z

well we would need a hard dependency on pyarrow to allow BoolenArray

since we haven’t had this discussion then -0 on it

pd.NA is fine but i see this as too aggressive to use in IntegerArray for 1.0

this should be in master for a while and get usage

TomAugspurger · 2019-11-12T01:27:43Z

It seems to me like we could do integer array using a Boolean ndarray as a mask, and use pyarrow / a bitmask as an optimization later. It’s 2x the memory for now, but that’s maybe Ok. And I think we’ll eventually want IntegerArray ops to return nullable booleans, so better to do before 1.0?

…

On Nov 11, 2019, at 18:20, Jeff Reback ***@***.***> wrote: well we would need a hard dependency on pyarrow to allow BoolenArray since we haven’t had this discussion then -0 on it pd.NA is fine but i see this as too aggressive to use in IntegerArray for 1.0 this should be in master for a while and get usage — You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

jreback · 2019-11-12T01:50:44Z

It seems to me like we could do integer array using a Boolean ndarray as a mask, and use pyarrow / a bitmask as an optimization later. It’s 2x the memory for now, but that’s maybe Ok. And I think we’ll eventually want IntegerArray ops to return nullable booleans, so better to do before 1.0?
…

it’s not 2x it’s like 10x compared to a native impl

WillAyd · 2019-11-12T02:03:54Z

I can't find one issue for it but I still think we should just take pyarrow on as a hard dependency if it simplifies our development. If import time is the big blocker IMO that's not really that big of a deal and would make it easier to manage / communicate BoolArray and similar developments

+0 on pd.NA. Not sure I see understand the entire scope of it from what is described but maybe something that just needs to be developed out

jorisvandenbossche · 2019-11-12T07:29:34Z

Sorry, I should have been clearer about this: all what I said above is meant to be pure numpy-based masked arrays, without any dependency on pyarrow, for now. This at least enables us to experiment with the API design without needing a heavy dependency such as pyarrow for the storage. Although I am clearly biased towards pyarrow, I don't think any of this needs pyarrow on the short term, we can perfectly implement this using numpy (and doing so doesn't complicate it, on the contrary, there is still a lot more functionality available for numpy masks compared to pyarrow arrays).

I indeed started a PR for the BooleanArray part of this: #29555. The numpy mask-based approach will not be the most efficient (both in memory and speed), but IMO it is now more important to first experiment with the design (the same can be said about the already existing IntegerArray).

it’s not 2x it’s like 10x compared to a native impl

For IntegerArray it is 2x (which is what Tom was saying I think), for BooleanArray the optimization could indeed be a lot bigger.

TomAugspurger · 2019-11-12T17:46:08Z

IMO it is now more important to first experiment with the design (the same can be said about the already existing IntegerArray).

Agreed (can add StringArray here too). I want to have 1.0 be as object-dtype free as possible :)

TomAugspurger · 2019-11-12T17:51:19Z

For IntegerArray it is 2x (which is what Tom was saying I think), for BooleanArray the optimization could indeed be a lot bigger.

For BooleanArrays with no missing values, isn't it exactly 2x more memory (assuming we store the mask when there are no missing values)? Then it's just the boolean ndarry of values + a boolean mask, so 2x. And we can, like pyarrow, skip allocating the mask if there are no NAs (though that's an optimization, not a requirement).

For one with missing values, isn't it less memory than our current implementation?

In [13]: s = pd.Series([None] + [True] * 1000)

In [14]: s.memory_usage(deep=True, index=False)
Out[14]: 36024

versus

In [15]: mask = np.array([True] + [False] * 1000)

In [16]: 2 * mask.nbytes
Out[16]: 2002

Even if we're using NaN for a boolean with missing values, we still have an improvement, since we use float64 by default

In [19]: s.astype(float).nbytes
Out[19]: 8008

jorisvandenbossche · 2019-11-12T19:36:29Z

It's indeed already an optimization compared to object dtype array of True/False/None. But I think Jeff meant the potential optimization of using pyarrow's boolean array storage, which can give a lot of additional memory improvement (8x compared to the double numpy array):

In [3]: mask = np.array([True] + [False] * 1000) 

In [4]: 2 * mask.nbytes 
Out[4]: 2002

In [5]: import pyarrow

In [6]: pyarrow.array(mask, mask=mask).nbytes
Out[6]: 252

But anyway, as I argued above, let's leave this for the future, and focus now on the API using numpy arrays under the hood (that use some more memory).

TomAugspurger · 2019-11-20T22:39:05Z

From the call today, there were no objections to changing IntegerArray to start using NA instead of np.nan.

@jorisvandenbossche do you intend to work on that? Otherwise, I can push up a WIP tomorrow (which will depend on #29597).

jorisvandenbossche · 2019-11-20T22:45:52Z

Feel free to already do a WIP on that, there is plenty to do ;)
I would first like to get those 2 open PRs merged

TomAugspurger · 2019-11-25T17:00:59Z

BooleanArray is in master. Getting #29597 in will open up more followups. I think the order is roughly

Use pd.NA in BooleanArray
1.a Implement kleene-logic in logical ops on BooleanArray
1.b Update the behaviour of any/all reductions with skipna=False (API: any/all in context of boolean dtype with missing values #29686)
Enable boolean indexing with BooleanArray (we can start with handling the case of no NAs; the behaviour for with NAs still need to be decided: DISCUSS: boolean dtype with missing value support #28778)
Use BooleanArray as the return value for logical ops in StringArray & IntegerArray
Base class for IntegerArray & BooleanArray

@jorisvandenbossche feel free to update this list (or maybe the original post) as needed [done]

jorisvandenbossche · 2019-11-25T17:27:06Z

Thanks Tom, the follow-ups I had listed in #29555 (comment) are more or less the same (added boolean indexing + any/all)

jorisvandenbossche · 2019-11-25T17:32:36Z

And updated the top post with a to-do list

jreback · 2019-11-26T12:02:45Z

Terji can’t find this comment on github but i would say that if we had a DictionaryEncoded type (that backs Categorical and maybe StringArray) then you could certainly use it in the implementation here

…

On Nov 25, 2019, at 4:44 PM, Terji Petersen ***@***.***> wrote: I'm obviously late to this party, but to me it seems like a BooleanArray could be implemented more efficient in a Categorical-like structure, where the "categories" would be enforced to [False, True] and the "codes" would be a ndarray with possible values of -1, 0 and 1. If we'd do that, we'd avoid the mask and the ops would be faster and the memory usage lower? — You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

jorisvandenbossche · 2019-11-26T12:29:26Z

I think a big advantage of the mask approach is the consistent data model with other arrays such as IntegerArray now (and possibly others in the future).

Also, in operations you often need or produce a mask (eg in a IntegerArray comparison producing a boolean array, or in isna of an IntegerArray), which now is always available or can be set without further computation, while in the encoded type you need to convert to the encoded type or produce a mask from the encoded data.

jorisvandenbossche · 2019-12-02T12:29:51Z

@TomAugspurger are you planning to do a PR for using it in IntegerArray? (or at least you had a beginning branch for this, I think?)

I opened a PR to do the BooleanArray case: #29961

TomAugspurger · 2019-12-02T12:32:50Z

@jorisvandenbossche I'm not finding a brach for IntegerArray using pd.NA right now. Do you have time to work on it this week?

jorisvandenbossche · 2019-12-02T12:34:57Z

I think it is here: TomAugspurger/pandas@NA-scalar...TomAugspurger:NA-scalar+IntegerArray. Feel free to just open it as a PR, and then can see who has time first

TomAugspurger · 2019-12-02T12:45:22Z

Thanks, missed that somehow. Opened a draft in #29964

TomAugspurger · 2019-12-02T12:51:59Z

@jorisvandenbossche should the subitem Use BooleanArray in comparison ops of StringArray be expanded to Use BooleanArray in comparison ops of StringArray, IntegerArray, and BooleanArray? The plan is for those to propagate NA, right?

jorisvandenbossche · 2019-12-02T12:55:53Z

The plan is for those to propagate NA, right?

Yes.
I am already handling the BooleanArray case in #29961. If the IntegerArray case is not handled in #29964, then yes we can have a subitem / PR for doing that in IntegerArray / StringArray together.

TomAugspurger · 2019-12-03T13:43:38Z

I've added an item for returning BooleanArray from boolean .str methods. I can put together a PR for that fairly quickly I think.

TomAugspurger · 2019-12-06T21:48:50Z

@jorisvandenbossche I added a TODO for implementing NA.__array_ufunc__. That'll fix things like np.bool_(True) | NA

In [4]: import numpy as np; import pandas as pd

In [5]: np.bool_(True) | pd.NA
Out[5]: NA

I just hardcoded that to return NA as a test, but we'll want to do broadcasting, etc. We have to define some behaviors:

Logical operations use Kleene logic. Returns True / False / NA for scalars, or a BooleanArray for arrays.

>>> np.bool_(True) | NA
True

>>> np.array([True, False]) | NA
BooleanArray([True, NA])

Arithmetic operations treat pd.NA as numeric. With integers we return an IntegerArray

>>> 1 + pd.NA
NA

>>> np.array([1, 2], dtype='uint64') + NA
IntegerArray([NA, NA], dtype='uint64)

With floats, we return ..., what? An ndarray[float] using NaN (I vote no)? A PandasArray (maybe)?

>>> np.array([1.1, 2.2]) + NA
?

For other ufuncs like np.sin, etc. we just return NA?

xref pandas-dev#29556

TomAugspurger · 2019-12-12T17:06:43Z

@jorisvandenbossche one of the items is

Use BooleanArray as the return value for logical ops in StringArray & IntegerArray

That's not quite right for StringArray right? Those won't implement logical ops? I think I may have added StringArray, when I meant to edit the list for comparison ops.

And have you started on NA.__array_ufunc__?

jorisvandenbossche · 2019-12-13T06:25:57Z

Use BooleanArray as the return value for logical ops in StringArray & IntegerArray

That's not quite right for StringArray right? Those won't implement logical ops? I think I may have added StringArray, when I meant to edit the list for comparison ops.

Yes, but I think it is also not correct for Integer, as integer logical ops return integers (which is also the case for Series):

In [56]: np.array([1, 2, 3]) | np.array([3, 4, 5]) 
Out[56]: array([3, 6, 7])

(logical ops are not yet implemented for IntegerArray, but that's a separate issue)

Maybe we were confused with comparison ops for a moment, but I think the full line can be removed.

And have you started on NA.__array_ufunc__?

No, not yet

TomAugspurger · 2019-12-17T18:02:01Z

I think we have open PRs for everything except

Documentation
Refactoring to a base class between IntegerArray and BooleanArray

which are blocked by the outstanding PRs

xref #29556

xref pandas-dev#29556

anisotropi4 · 2019-12-22T13:49:03Z

My question is prosaic and based on str dtype conversion:
How would I now distinguish between a missing str value and the two-character string 'NA'?

I ask as NA is a common abbreviation for 'Not Applicable', 'North America' et al, in a way that in my experience that 'NaN' or 'Not a Number' isn't

Given this, if 'NA' were generated as the default missing str value, especially if introduced as change rather than as a opt-in type option, it risks becoming a UX developer issue as I (for one) would no longer know if 'NA' is a valid or a missing data value

jorisvandenbossche · 2019-12-22T14:07:06Z

@anisotropi4 Thanks for joining in! Would you like to move that comment to a new issue? I think the several interactions of NA with string representation and conversion warrants a dedicated issue to discuss.

anisotropi4 · 2019-12-22T22:59:50Z

As requested @jorisvandenbossche I have raised this as #30415 with some more detail about current behaviour. I hope that is what you had in mind.

TomAugspurger · 2020-01-21T22:11:18Z

@jorisvandenbossche I think we can close this, unless you have additional tasks in mind.

TomAugspurger · 2020-01-27T14:59:33Z

Closing, though will address any remaining tasks in dedicated issues.

jorisvandenbossche · 2020-01-27T19:59:43Z

Yes, sorry for the slow reply, this was fine to close. The combined docs of all PRs about this issues could maybe use a check / review, but no need to keep it open for that.

jorisvandenbossche added Needs Discussion Requires discussion from core team before further action Roadmap A proposal for the roadmap. labels Nov 11, 2019

jorisvandenbossche added this to the 1.0 milestone Nov 11, 2019

jorisvandenbossche mentioned this issue Nov 12, 2019

ROADMAP: Consistent missing value handling with new NA scalar #28095

Open

TomAugspurger mentioned this issue Nov 25, 2019

ENH: add BooleanArray extension array #29555

Merged

TomAugspurger mentioned this issue Nov 25, 2019

ENH: Implement Kleene logic for BooleanArray #29842

Merged

jorisvandenbossche mentioned this issue Dec 2, 2019

ENH: add NA scalar for missing value indicator, use in StringArray. #29597

Merged

jorisvandenbossche mentioned this issue Dec 4, 2019

Use new NA scalar in BooleanArray #29961

Merged

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 12, 2019

StringArray comparisions return BooleanArray

c548c67

xref pandas-dev#29556

This was referenced Dec 12, 2019

StringArray comparisions return BooleanArray #30231

Merged

API: Return BoolArray for string ops when backed by StringArray #30239

Merged

TomAugspurger mentioned this issue Dec 17, 2019

DOC/TST: Indexing with NA raises #30308

Merged

jorisvandenbossche pushed a commit that referenced this issue Dec 18, 2019

StringArray comparisions return BooleanArray (#30231)

0badf06

xref #29556

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

StringArray comparisions return BooleanArray (pandas-dev#30231)

ed1b5fe

xref pandas-dev#29556

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

StringArray comparisions return BooleanArray (pandas-dev#30231)

96b71d1

xref pandas-dev#29556

jorisvandenbossche mentioned this issue Jan 7, 2020

REF: Implement BaseMaskedArray class for integer/boolean ExtensionArrays #30789

Merged

TomAugspurger closed this as completed Jan 27, 2020

CormacDC mentioned this issue Mar 11, 2021

BUG: Inconsistent return types for PandasArray and Series reductions #35475

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing values proposal: concrete steps for 1.0 #29556

Missing values proposal: concrete steps for 1.0 #29556

jorisvandenbossche commented Nov 11, 2019 •

edited

Loading

TomAugspurger commented Nov 11, 2019 •

edited

Loading

jreback commented Nov 12, 2019

TomAugspurger commented Nov 12, 2019 via email

jreback commented Nov 12, 2019

WillAyd commented Nov 12, 2019

jorisvandenbossche commented Nov 12, 2019

TomAugspurger commented Nov 12, 2019

TomAugspurger commented Nov 12, 2019

jorisvandenbossche commented Nov 12, 2019

TomAugspurger commented Nov 20, 2019

jorisvandenbossche commented Nov 20, 2019

TomAugspurger commented Nov 25, 2019 •

edited by jorisvandenbossche

Loading

jorisvandenbossche commented Nov 25, 2019

jorisvandenbossche commented Nov 25, 2019

jreback commented Nov 26, 2019 via email

jorisvandenbossche commented Nov 26, 2019

jorisvandenbossche commented Dec 2, 2019 •

edited

Loading

TomAugspurger commented Dec 2, 2019

jorisvandenbossche commented Dec 2, 2019

TomAugspurger commented Dec 2, 2019

TomAugspurger commented Dec 2, 2019

jorisvandenbossche commented Dec 2, 2019 •

edited

Loading

TomAugspurger commented Dec 3, 2019 •

edited

Loading

TomAugspurger commented Dec 6, 2019 •

edited

Loading

TomAugspurger commented Dec 12, 2019 •

edited

Loading

jorisvandenbossche commented Dec 13, 2019

TomAugspurger commented Dec 17, 2019

anisotropi4 commented Dec 22, 2019

jorisvandenbossche commented Dec 22, 2019

anisotropi4 commented Dec 22, 2019

TomAugspurger commented Jan 21, 2020

TomAugspurger commented Jan 27, 2020

jorisvandenbossche commented Jan 27, 2020

Missing values proposal: concrete steps for 1.0 #29556

Missing values proposal: concrete steps for 1.0 #29556

Comments

jorisvandenbossche commented Nov 11, 2019 • edited Loading

TomAugspurger commented Nov 11, 2019 • edited Loading

jreback commented Nov 12, 2019

TomAugspurger commented Nov 12, 2019 via email

jreback commented Nov 12, 2019

WillAyd commented Nov 12, 2019

jorisvandenbossche commented Nov 12, 2019

TomAugspurger commented Nov 12, 2019

TomAugspurger commented Nov 12, 2019

jorisvandenbossche commented Nov 12, 2019

TomAugspurger commented Nov 20, 2019

jorisvandenbossche commented Nov 20, 2019

TomAugspurger commented Nov 25, 2019 • edited by jorisvandenbossche Loading

jorisvandenbossche commented Nov 25, 2019

jorisvandenbossche commented Nov 25, 2019

jreback commented Nov 26, 2019 via email

jorisvandenbossche commented Nov 26, 2019

jorisvandenbossche commented Dec 2, 2019 • edited Loading

TomAugspurger commented Dec 2, 2019

jorisvandenbossche commented Dec 2, 2019

TomAugspurger commented Dec 2, 2019

TomAugspurger commented Dec 2, 2019

jorisvandenbossche commented Dec 2, 2019 • edited Loading

TomAugspurger commented Dec 3, 2019 • edited Loading

TomAugspurger commented Dec 6, 2019 • edited Loading

TomAugspurger commented Dec 12, 2019 • edited Loading

jorisvandenbossche commented Dec 13, 2019

TomAugspurger commented Dec 17, 2019

anisotropi4 commented Dec 22, 2019

jorisvandenbossche commented Dec 22, 2019

anisotropi4 commented Dec 22, 2019

TomAugspurger commented Jan 21, 2020

TomAugspurger commented Jan 27, 2020

jorisvandenbossche commented Jan 27, 2020

jorisvandenbossche commented Nov 11, 2019 •

edited

Loading

TomAugspurger commented Nov 11, 2019 •

edited

Loading

TomAugspurger commented Nov 25, 2019 •

edited by jorisvandenbossche

Loading

jorisvandenbossche commented Dec 2, 2019 •

edited

Loading

jorisvandenbossche commented Dec 2, 2019 •

edited

Loading

TomAugspurger commented Dec 3, 2019 •

edited

Loading

TomAugspurger commented Dec 6, 2019 •

edited

Loading

TomAugspurger commented Dec 12, 2019 •

edited

Loading