Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing values proposal: concrete steps for 1.0 #29556

Closed
12 of 13 tasks
jorisvandenbossche opened this issue Nov 11, 2019 · 33 comments
Closed
12 of 13 tasks

Missing values proposal: concrete steps for 1.0 #29556

jorisvandenbossche opened this issue Nov 11, 2019 · 33 comments
Labels
Needs Discussion Requires discussion from core team before further action Roadmap A proposal for the roadmap.
Milestone

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 11, 2019

Updated with to do list:


Original issue:

Issue to discuss the implementation strategy for #28095.
Opening a new issue, as the other one already has a lot of discussion in several discussion, and would propose to keep this one focused on the practical aspects of how to implement this (regardless of certain aspects of the NA proposal such as single NA vs dtype-specific NAs -> for that will post a summary of the discussion on #28095 tomorrow).

I would like to propose the following way forward:

On the short term (ideally for 1.0):

  • Already implement and provide the pd.NA scalar, and recognize it in the appropriate places as missing value (e.g. pd.isna). This way, it can already be used in external ExtentionArrays
  • Implement a BooleanArray with support for missing values and appropriate NA behaviour. To start, we can just use a numpy masked array approach (similar to the existing IntegerArray), not involving any pyarrow memory optimizations.
  • Start using this BooleanArray as the boolean result of comparison operations for IntegerArray/StringArray (breaking change for nullable integers)
    • Other arrays will keep using the numpy bool, this means we have two "boolean" dtypes side by side with different behaviour, and which one you get depends on the original data type (potentially confusing for users)
  • Start using pd.NA as the missing value indicator for Integer/String/BooleanArray (breaking change for nullable integers)

On the intermediate term (after 1.0)

  • Investigate if it can be implemented optionally for other data types and "activated" to have users opt-in for existing dtypes (to be further thought out).

I think the main discussion point is if we are OK with such a breaking change for IntegerArray.
I would personally do this: IntegerArray was only introduced recently, still regarded as experimental, and the perfect use case for those changes. But, it's certainly a clear backwards incompatible, breaking change.

cc @pandas-dev/pandas-core

@jorisvandenbossche jorisvandenbossche added Needs Discussion Requires discussion from core team before further action Roadmap A proposal for the roadmap. labels Nov 11, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Nov 11, 2019
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 11, 2019

My initial reaction is +1 on having StringArray and IntegerArray start using this new pd.NA value. Of all the "experimental" things in StringArray, the NA scalar is the one I know we'll want to break, since using np.nan really doesn't make sense.

If we're able to get BooleanArray done for 1.0, then that'd be great. I think time-wise, it's OK to settle for a simple ndarray + a private mask attribute. We optimize to use pyarrow or a bitmask later. (edit: just saw #29555 (review) Nice!)

So to be explicit, +1 on the proposed breaking changes to IntegerArray.

@jreback
Copy link
Contributor

jreback commented Nov 12, 2019

well we would need a hard dependency on pyarrow to allow BoolenArray

since we haven’t had this discussion then -0 on it

pd.NA is fine but i see this as too aggressive to use in IntegerArray for 1.0

this should be in master for a while and get usage

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 12, 2019 via email

@jreback
Copy link
Contributor

jreback commented Nov 12, 2019

It seems to me like we could do integer array using a Boolean ndarray as a mask, and use pyarrow / a bitmask as an optimization later. It’s 2x the memory for now, but that’s maybe Ok. And I think we’ll eventually want IntegerArray ops to return nullable booleans, so better to do before 1.0?

it’s not 2x it’s like 10x compared to a native impl

@WillAyd
Copy link
Member

WillAyd commented Nov 12, 2019

I can't find one issue for it but I still think we should just take pyarrow on as a hard dependency if it simplifies our development. If import time is the big blocker IMO that's not really that big of a deal and would make it easier to manage / communicate BoolArray and similar developments

+0 on pd.NA. Not sure I see understand the entire scope of it from what is described but maybe something that just needs to be developed out

@jorisvandenbossche
Copy link
Member Author

Sorry, I should have been clearer about this: all what I said above is meant to be pure numpy-based masked arrays, without any dependency on pyarrow, for now. This at least enables us to experiment with the API design without needing a heavy dependency such as pyarrow for the storage. Although I am clearly biased towards pyarrow, I don't think any of this needs pyarrow on the short term, we can perfectly implement this using numpy (and doing so doesn't complicate it, on the contrary, there is still a lot more functionality available for numpy masks compared to pyarrow arrays).

I indeed started a PR for the BooleanArray part of this: #29555. The numpy mask-based approach will not be the most efficient (both in memory and speed), but IMO it is now more important to first experiment with the design (the same can be said about the already existing IntegerArray).

it’s not 2x it’s like 10x compared to a native impl

For IntegerArray it is 2x (which is what Tom was saying I think), for BooleanArray the optimization could indeed be a lot bigger.

@TomAugspurger
Copy link
Contributor

IMO it is now more important to first experiment with the design (the same can be said about the already existing IntegerArray).

Agreed (can add StringArray here too). I want to have 1.0 be as object-dtype free as possible :)

@TomAugspurger
Copy link
Contributor

For IntegerArray it is 2x (which is what Tom was saying I think), for BooleanArray the optimization could indeed be a lot bigger.

For BooleanArrays with no missing values, isn't it exactly 2x more memory (assuming we store the mask when there are no missing values)? Then it's just the boolean ndarry of values + a boolean mask, so 2x. And we can, like pyarrow, skip allocating the mask if there are no NAs (though that's an optimization, not a requirement).

For one with missing values, isn't it less memory than our current implementation?

In [13]: s = pd.Series([None] + [True] * 1000)

In [14]: s.memory_usage(deep=True, index=False)
Out[14]: 36024

versus

In [15]: mask = np.array([True] + [False] * 1000)

In [16]: 2 * mask.nbytes
Out[16]: 2002

Even if we're using NaN for a boolean with missing values, we still have an improvement, since we use float64 by default

In [19]: s.astype(float).nbytes
Out[19]: 8008

@jorisvandenbossche
Copy link
Member Author

It's indeed already an optimization compared to object dtype array of True/False/None. But I think Jeff meant the potential optimization of using pyarrow's boolean array storage, which can give a lot of additional memory improvement (8x compared to the double numpy array):

In [3]: mask = np.array([True] + [False] * 1000) 

In [4]: 2 * mask.nbytes 
Out[4]: 2002

In [5]: import pyarrow

In [6]: pyarrow.array(mask, mask=mask).nbytes
Out[6]: 252

But anyway, as I argued above, let's leave this for the future, and focus now on the API using numpy arrays under the hood (that use some more memory).

@TomAugspurger
Copy link
Contributor

From the call today, there were no objections to changing IntegerArray to start using NA instead of np.nan.

@jorisvandenbossche do you intend to work on that? Otherwise, I can push up a WIP tomorrow (which will depend on #29597).

@jorisvandenbossche
Copy link
Member Author

Feel free to already do a WIP on that, there is plenty to do ;)
I would first like to get those 2 open PRs merged

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 25, 2019

BooleanArray is in master. Getting #29597 in will open up more followups. I think the order is roughly

  1. Use pd.NA in BooleanArray
    1.a Implement kleene-logic in logical ops on BooleanArray
    1.b Update the behaviour of any/all reductions with skipna=False (API: any/all in context of boolean dtype with missing values #29686)
  2. Enable boolean indexing with BooleanArray (we can start with handling the case of no NAs; the behaviour for with NAs still need to be decided: DISCUSS: boolean dtype with missing value support #28778)
  3. Use BooleanArray as the return value for logical ops in StringArray & IntegerArray
  4. Base class for IntegerArray & BooleanArray

@jorisvandenbossche feel free to update this list (or maybe the original post) as needed [done]

@jorisvandenbossche
Copy link
Member Author

Thanks Tom, the follow-ups I had listed in #29555 (comment) are more or less the same (added boolean indexing + any/all)

@jorisvandenbossche
Copy link
Member Author

And updated the top post with a to-do list

@jreback
Copy link
Contributor

jreback commented Nov 26, 2019 via email

@jorisvandenbossche
Copy link
Member Author

I think a big advantage of the mask approach is the consistent data model with other arrays such as IntegerArray now (and possibly others in the future).

Also, in operations you often need or produce a mask (eg in a IntegerArray comparison producing a boolean array, or in isna of an IntegerArray), which now is always available or can be set without further computation, while in the encoded type you need to convert to the encoded type or produce a mask from the encoded data.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Dec 2, 2019

@TomAugspurger are you planning to do a PR for using it in IntegerArray? (or at least you had a beginning branch for this, I think?)

I opened a PR to do the BooleanArray case: #29961

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche I'm not finding a brach for IntegerArray using pd.NA right now. Do you have time to work on it this week?

@jorisvandenbossche
Copy link
Member Author

I think it is here: TomAugspurger/pandas@NA-scalar...TomAugspurger:NA-scalar+IntegerArray. Feel free to just open it as a PR, and then can see who has time first

@TomAugspurger
Copy link
Contributor

Thanks, missed that somehow. Opened a draft in #29964

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche should the subitem Use BooleanArray in comparison ops of StringArray be expanded to Use BooleanArray in comparison ops of StringArray, IntegerArray, and BooleanArray? The plan is for those to propagate NA, right?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Dec 2, 2019

The plan is for those to propagate NA, right?

Yes.
I am already handling the BooleanArray case in #29961. If the IntegerArray case is not handled in #29964, then yes we can have a subitem / PR for doing that in IntegerArray / StringArray together.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 3, 2019

I've added an item for returning BooleanArray from boolean .str methods. I can put together a PR for that fairly quickly I think.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 6, 2019

@jorisvandenbossche I added a TODO for implementing NA.__array_ufunc__. That'll fix things like np.bool_(True) | NA

In [4]: import numpy as np; import pandas as pd

In [5]: np.bool_(True) | pd.NA
Out[5]: NA

I just hardcoded that to return NA as a test, but we'll want to do broadcasting, etc. We have to define some behaviors:

Logical operations use Kleene logic. Returns True / False / NA for scalars, or a BooleanArray for arrays.

>>> np.bool_(True) | NA
True

>>> np.array([True, False]) | NA
BooleanArray([True, NA])

Arithmetic operations treat pd.NA as numeric. With integers we return an IntegerArray

>>> 1 + pd.NA
NA

>>> np.array([1, 2], dtype='uint64') + NA
IntegerArray([NA, NA], dtype='uint64)

With floats, we return ..., what? An ndarray[float] using NaN (I vote no)? A PandasArray (maybe)?

>>> np.array([1.1, 2.2]) + NA
?

For other ufuncs like np.sin, etc. we just return NA?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 12, 2019

@jorisvandenbossche one of the items is

Use BooleanArray as the return value for logical ops in StringArray & IntegerArray

That's not quite right for StringArray right? Those won't implement logical ops? I think I may have added StringArray, when I meant to edit the list for comparison ops.

And have you started on NA.__array_ufunc__?

@jorisvandenbossche
Copy link
Member Author

Use BooleanArray as the return value for logical ops in StringArray & IntegerArray

That's not quite right for StringArray right? Those won't implement logical ops? I think I may have added StringArray, when I meant to edit the list for comparison ops.

Yes, but I think it is also not correct for Integer, as integer logical ops return integers (which is also the case for Series):

In [56]: np.array([1, 2, 3]) | np.array([3, 4, 5]) 
Out[56]: array([3, 6, 7])

(logical ops are not yet implemented for IntegerArray, but that's a separate issue)

Maybe we were confused with comparison ops for a moment, but I think the full line can be removed.

And have you started on NA.__array_ufunc__?

No, not yet

@TomAugspurger
Copy link
Contributor

I think we have open PRs for everything except

  1. Documentation
  2. Refactoring to a base class between IntegerArray and BooleanArray

which are blocked by the outstanding PRs

@anisotropi4
Copy link

My question is prosaic and based on str dtype conversion:
How would I now distinguish between a missing str value and the two-character string 'NA'?

I ask as NA is a common abbreviation for 'Not Applicable', 'North America' et al, in a way that in my experience that 'NaN' or 'Not a Number' isn't

Given this, if 'NA' were generated as the default missing str value, especially if introduced as change rather than as a opt-in type option, it risks becoming a UX developer issue as I (for one) would no longer know if 'NA' is a valid or a missing data value

@jorisvandenbossche
Copy link
Member Author

@anisotropi4 Thanks for joining in! Would you like to move that comment to a new issue? I think the several interactions of NA with string representation and conversion warrants a dedicated issue to discuss.

@anisotropi4
Copy link

As requested @jorisvandenbossche I have raised this as #30415 with some more detail about current behaviour. I hope that is what you had in mind.

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche I think we can close this, unless you have additional tasks in mind.

@TomAugspurger
Copy link
Contributor

Closing, though will address any remaining tasks in dedicated issues.

@jorisvandenbossche
Copy link
Member Author

Yes, sorry for the slow reply, this was fine to close. The combined docs of all PRs about this issues could maybe use a check / review, but no need to keep it open for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action Roadmap A proposal for the roadmap.
Projects
None yet
Development

No branches or pull requests

5 participants