ROADMAP: Consistent missing value handling with new NA scalar #28095

Open
jorisvandenbossche opened this issue Aug 22, 2019 · 65 comments
Labels
API Design, Missing-data, Needs Discussion, Roadmap

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 22, 2019

I cleaned up my initial write-up on the consistent missing values proposal (#27825 (comment)) and incorporated the items brought up in the last video chat, so I think it is ready for some more detailed discussion.

The last version of the full proposal can be found here: https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB

TL;DR:

  • I propose to introduce a new scalar (singleton) pd.NA that can be used as the missing value indicator (when accessing a single value, not necessarily how it is stored under the hood).
  • This can be used instead of np.nan or pd.NaT in new data types (eg nullable integers, potential string dtype)
  • Long term, we can see whether a migration is possible to use this consistently for all data types.
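
For illustration, a minimal sketch of the user-facing behaviour this proposal aims for (hypothetical at the time of writing; it assumes the nullable Int64 extension dtype adopts pd.NA as its scalar):

import pandas as pd

# Proposed behaviour (sketch): accessing a missing value in a nullable dtype
# returns the pd.NA singleton, regardless of how missingness is stored.
s = pd.Series([1, 2, None], dtype="Int64")
s[2]           # would be pd.NA (today this returns np.nan)
s[2] is pd.NA  # would be True
pd.isna(s[2])  # True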

cc @pandas-dev/pandas-core

jorisvandenbossche added the Missing-data, API Design and Needs Discussion labels on Aug 22, 2019
@jorisvandenbossche
Member Author

jorisvandenbossche commented Aug 22, 2019

Technical note: following our roadmap, I posted an issue with the proposal, but we need to see a bit how to have those discussions. As I posted the full proposal externally and not in a PR, it's more difficult to have inline comments (although I could also post it on eg google docs, which is a bit friendlier for that). Alternatively, we could also do PRs instead of issues?

@shoyer
Member

shoyer commented Aug 22, 2019

This is an interesting proposal!

I'm mostly concerned about type stability / predictability, e.g.,

  • What is pd.Series([pd.NA]).dtype? (with np.nan this would be float64)
  • What is pd.Timestamp('2000-01-01') - pd.NA? (could be either Timestamp or Timedelta)

We might need an NA dtype in order to have well-defined semantics for all of these operations.

It would also be worth considering whether we would like to deviate from NaN semantics in some cases. For example, it might make sense to define NA == other as NA rather than False.
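
For contrast, a small sketch of the difference (the NaN lines reflect current behaviour; the NA lines are hypothetical under this proposal):

import numpy as np

# Current NaN semantics: any comparison involving NaN is simply False.
np.nan == 1.0      # False
np.nan == np.nan   # False

# Proposed NA semantics (hypothetical): comparisons would propagate the
# "unknown" value instead of returning False.
# pd.NA == 1.0   -> pd.NA
# pd.NA == pd.NA -> pd.NA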

Julia has a nice documentation page explaining how they support missing values. It might be a good model to emulate for pandas. Two notable differences about Julia:

  • They have robust union types and use them for missing values (e.g., Union{Missing, Int64}), which solves the isinstance problem.
  • They only have multi-methods/functions, not methods like Python, so they don't need to worry about defining type-specific methods/properties on missing values, e.g., NA.year

@jorisvandenbossche
Member Author

Thanks for the feedback!

I'm mostly concerned about type stability / predictability, e.g.,

Yes, that's an item I briefly mentioned in the proposal with similar examples. We can (and will need to) decide on certain rules for how to go about this, but the results can of course still be unpredictable for users.

I also think we might need an NA dtype. If we do, that would probably mean that we return this NA dtype instead of guessing between eg timedelta or timestamp? But then what about operations that are less ambiguous? Eg float + 'something' in principle either gives an error (if 'something' is not a number) or gives a float. So if this gives a result (and not an error), it is always float. But should float + NA then give float? Similarly for timestamp + 'something' (while int + 'something' can have multiple return types depending on 'something').

I will need to think a bit more on usage implications of this.

It would also be worth considering whether we would like to deviate from NaN semantics in some cases. For example, it might make sense to define NA == other as NA rather than False.

That's also something I have been thinking about. I now included some additional text about this at the bottom of the proposal.

Ideally, I think we would do something like that and follow what Julia (and also mostly R and SQL) are doing (meaning: propagating NA on comparisons, having three-valued logic for boolean operations). However, this also has some additional complications, certainly if we want to introduce this gradually.
For example, assume we start using pd.NA in the new integer (and upcoming string) dtype, but keep the old np.nan for object / float. That would mean that NA behaviour in comparisons etc. would work differently between the two, initially leading to inconsistencies.

@WillAyd
Member

WillAyd commented Aug 28, 2019

Read through but not sure about this. AFAIK for other languages (particularly Julia) there was an issue distinguishing between None and np.nan; there are some rough edges but I'd say that distinction is mostly clear in pandas.

For extension types we'd (I assume) mostly be masking the location of missing values anyway, so what particular advantage do you see to returning pd.NA from those instead of just np.nan?

@jorisvandenbossche
Member Author

(sorry for the slow reply, busy EuroScipy)

AFAIK for other languages (particularly Julia) there was an issue distinguishing between None and np.nan;

Can you elaborate a bit more on this sentence?

For extension types we'd (I assume) mostly be masking the location of missing values anyway, so what particular advantage do you see to returning pd.NA from those instead of just np.nan?

For sure, we could also return np.nan as the scalar value (you could basically replace pd.NA with np.nan in the proposal), and this is something our users are already familiar with.
Somewhat subjectively, though, here are some reasons why I don't really like that solution:

  • In general, np.nan is a float value (eg what you get from 0/0), and not a missing value indicator. This is maybe more a theoretical reason (as we have been using np.nan in practice for a long time as a missing value indicator), but to me it "feels wrong" to start using this more broadly for new data types as the missing value indicator (because up to now, it was mostly used in actual float dtype, and in object dtype). It feels strange to return "not a number" for a missing value of strings.
    A less theoretical reason is that R, Julia and SQL all distinguish the two and have an NA/NULL concept separate from float NaN.
  • Personally (and given the above), I think for consistency reasons we should rather choose between either having a "not a value" sentinel per dtype (NaT, NaS(tring), NaI(nteger), which can behave more as times, strings or integers, respectively) or having a single pd.NA scalar (and of those two I personally prefer the latter). This discussion also came up when Tom was experimenting with a NotAString value.
  • Do we want, at some point, to also support missing values in boolean columns? In that case, we need to decide the behaviour in certain operations, related to the questions that @shoyer brought up. If we want to follow the behaviour that boolean arrays with missing values have in other languages, this might deviate from np.nan behaviour, I think.

Originally I was planning to write a proposal combining the new sentinel value and consistently using a mask-based approach for all dtypes (which more easily enables being consistent with the scalar value as well). Choosing a specific implementation (eg mask-based) will have much more impact on our code base, while a possible choice for pd.NA over np.nan as the scalar value is much more a user-facing API question, where I find that pd.NA can give a more consistent, easier-to-teach pandas API.

@jbrockmendel
Member

In general, np.nan is a float value (eg what you get from 0/0), and not a missing value indicator.

So if I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

Misc thoughts:

  • Assuming pd.NA is implemented in cython, checks for it would be more performant than checks for the np.nan scalar (since that is not a singleton).
  • Does consistency with arrow play a factor in your thought process? Note pa.NA == pa.NA
  • Would it be worthwhile to implement a more limited pd.NA that was mostly just for object dtype? (maybe categorical. I think "non-arithmetic-supporting" would be the precise distinction)
    • maybe stop treating None as na, so that all recognized na scalars satisfy x != x (see the snippet after this list).
    • stop casting np.nan to NaT when inserting into dt64 or td64 Series/DTA/TDA, but allow pd.NA to be inserted.
    • DTA[i] continues to return NaT instead of pd.NA, so arithmetic doesn't get broken
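
To make the x != x point above concrete (current behaviour, easily checked):

import numpy as np
import pandas as pd

# Of the currently recognized NA scalars, only None fails the x != x test:
np.nan != np.nan   # True
pd.NaT != pd.NaT   # True
None != None       # False -- None compares equal to itself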

@jorisvandenbossche
Member Author

So if I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

In theory, yes (that is what R, Julia, SQL, .. do). In practice, though, we will certainly need some kind of option (and as default in the beginning) to still treat np.nan as a missing value.

Does consistency with arrow play a factor in your thought process?

Partly, yes, but not as the main reason (but you could see Arrow as "yet another example", next to R/Julia/SQL, that distinguishes both concepts, and a Python one at that). Note that Arrow nowadays uses "null" instead of "na" (but that's just the name, not a difference in concept).

I am using R regularly as an example, but I am certainly not an expert in R. Eg they distinguish both, but it seems that NaN behaves more like NA there compared to numpy's np.nan (so it's maybe not the best reference to compare to).

Would it be worthwhile to implement a more limited pd.NA that was mostly just for object dtype? (maybe categorical. I think "non-arithmetic-supporting" would be the precise distinction)

I suppose the idea is to avoid the dubious situations about what the result type should be? (eg NA + 1 -> float or int or ?)

I don't really like that in the long term (for consistency reasons), but in the short term this might be what we will do in practice anyway. But, eg in that idea, what sentinel would we use for nullable integers? Still np.nan as today, I suppose. But that already gives you the same dubious situation (what is the result type of IntegerArray([1, 2]) + np.nan, int or float?)

maybe stop treating None as na, so that all recognized na scalars satisfy x != x.

That's certainly something to consider as well, I think. Although (given eg #28124), I suppose there are people who like to use None since it can play better with other Python packages.

@jbrockmendel
Member

I suppose with the idea to avoid the dubious situations in what the result type should be? (eg NA + 1 -> float or int or ?)

That example works, but more problematic examples are Series[datetime64] - pd.NA and Series[timedelta64] / pd.NA. I know we discussed this on the last call, not sure if there is a thread on this.

@jorisvandenbossche
Member Author

Why is that more problematic? Because float and int, although different types, can still refer to the same (semantic) value, while timestamp and timedelta are different concepts. That's indeed a bigger difference.
The proposal suggests the idea of a "NA dtype", which could potentially help remediate this case (I won't say "solve" ;)). Or at least, it is worth thinking through a few use cases to see if it can help.

I think this thread is the place to discuss this.

@jorisvandenbossche
Member Author

I would like to revive this discussion (will try to update the proposal with some more concrete details, and send a note to the mailing list).

What are people's general thoughts about this?

@jorisvandenbossche
Member Author

cc @xhochy you might also be interested in this given your recent explorations of boolean extension arrays / missing values in any/all operations

@jbrockmendel
Member

@jorisvandenbossche reading through this I don't see a clear answer to the "how do we do this while avoiding breaking arithmetic?" problem. Is your position something to the effect of "just don't do arithmetic with these"?

@jorisvandenbossche
Member Author

I don't see a clear answer to the "how do we do this while avoiding breaking arithmetic?" problem. Is your position something to the effect of "just don't do arithmetic with these"?

@jbrockmendel What do you mean with the "breaking arithmetic problem" exactly?
Assuming it is what you raised before about what the resulting dtype would be of eg Series[datetime64] - pd.NA, see my response above as well: a possible idea is to add a "NA dtype" to somewhat remediate this case.
It of course won't "solve" it in the sense of keeping the current behaviour. But it can ensure we can write down clear rules about what to expect in which cases (i.e. to have well-defined semantics).

But in the end, this is a choice to make. Yes, when dealing with scalars, having a single NA scalar loses some information compared to having dtype-specific scalar values. But note that in the current situation, we already have this as well. Eg we don't yet have separate not-a-timedelta or not-a-period values. We could add those, and that would be a counter-proposal to this. But do we then also want to add a not-an-integer, not-a-string, not-a-... ? My personal opinion here is that having consistency across dtypes with a single NA value is more valuable than preserving some more information in the scalar NA value for arithmetic with scalars.
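
As a concrete illustration of the current situation described above, pd.NaT is already shared across the datetime-like dtypes, so some type information is already lost at the scalar level today:

import pandas as pd

# NaT is a single shared sentinel for datetime, timedelta and period dtypes:
pd.Series([None], dtype="datetime64[ns]")[0] is pd.NaT    # True
pd.Series([None], dtype="timedelta64[ns]")[0] is pd.NaT   # True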

@TomAugspurger
Contributor

Is the "NA dtype" proposal explained anywhere? IIUC, the idea would be that Series[T] + pd.NA ->Series[NA]. But this dtype would only arise in binary operations where one of the operands is an NA scalar?

What do we think about these cases?

>>> Series([0, 1]) + pd.NA
Series([NA, NA], dtype=NAType)

>>> Series([0, 1]) + pd.Series([NA, NA], dtype="int")
Series([NA, NA], dtype="int")
>>> Series([0, 1]) + [NA, NA]   # equivalent Series([0, 1]) + Series([NA, NA], dtype=NAType)
Series([NA, NA], dtype="NAType")

Is it strange that we don't have the property that

(array + array)[0].dtype <=> (array + array[0]).dtype

? I'm not sure what to make of it.

@jorisvandenbossche
Member Author

Is the "NA dtype" proposal explained anywhere?

Not yet much more than a similar mention as in the discussion above. I have now included a bit more commentary (mainly based on your comment and the rest of this comment).

What do we think about these cases?

The examples you give are what I had in mind.

In addition, I think the idea would be that this "NA dtype" could be upcast to any other dtype (eg when concatting, ..) so that in other operations it "doesn't get in the way" if you accidentally have this type. Of course some dtype-specific operations can be problematic (like .dt or .str methods, should those work?)

Is it strange that we don't have the property that

(array + array)[0].dtype <=> (array + array[0]).dtype

Yes, that is correct. But note that we also don't have that right now:

In [28]: arr_dt = pd.array([None, None], dtype='datetime64[ns]') 

In [29]: arr_td = pd.array([None, "1 days"], dtype='timedelta64[ns]') 

In [30]: arr_td + arr_dt[0] 
Out[30]: 
<TimedeltaArray>
[NaT, NaT]
Length: 2, dtype: timedelta64[ns]

In [31]: arr_td + arr_dt
Out[31]: 
<DatetimeArray>
['NaT', 'NaT']
Length: 2, dtype: datetime64[ns]

In [32]: (arr_td + arr_dt[0]).dtype                                                                                                                                                                                
Out[32]: dtype('<m8[ns]')

In [33]: (arr_td + arr_dt)[[0]].dtype                                                                                                                                                                              
Out[33]: dtype('<M8[ns]')

But, this is something that could in principle be solved by having separate NaT values for datetime/timedelta/period (cf. #24983), while when going for a single NA scalar, we cannot fully solve this.
But then (and now I am repeating myself), the question is: do we want to do this for all dtypes? Or only for the datetime-like dtypes? You can give a similar example with ints and floats (although this case is certainly more prevalent for datetime-like dtypes, as datetime vs timedelta is a bigger difference than eg float vs int).

It would be interesting to investigate how other libraries / languages are dealing with this (Julia has a more powerful Union(type, Missing) which handles this).

@TomAugspurger
Contributor

TomAugspurger commented Oct 2, 2019

But then (and now I am repeating myself), the question is: do we want to do this for all dtypes?

I think there's agreement that the current state of NaN for {int, float, bool, str, ...} and NaT for {datetime, timedelta, period} isn't good. If we're changing things, I think we'll move to either pd.NA or pd.NA[T] (i.e. one NA for all dtypes, or one NA per dtype).

It would be interesting to investigate how other libraries / languages are dealing with this (Julia has a more powerful Union(type, Missing) which handles this).

Agreed. I think Julia would be a good one to investigate. In their docs, I didn't see what the type of Array[Union[Int, Missing]] + missing was (array + scalar NA).

@jbrockmendel
Member

I see two arguments for a single pd.NA:

  1. Have consistency in that arraylike[na_idx] returns pd.NA regardless of arraylike.dtype
  2. Fix cases where we currently use np.nan and doing so conveys misleading dtype information.

Am I missing anything? Aiming for non-normative descriptions so far.

@jbrockmendel
Member

Collecting responses to a bunch of stuff:

a possible idea is to add a "NA dtype"

Is there precedent for NADtype in other libraries/languages? If we're now talking about adding pd.NA and pd.NADtype, and if we use the "how many things are there" heuristic to judge proposals, then I think we'd be better off with implementing pd.NA for non-arithmetic dtypes and keeping pd.NaT as is (i.e. still have "two things").

I think the idea would be that this "NA dtype" could be upcasted to any other dtype (eg when concatting, ..) so that in other operations it "doesn't get in the way" if you accidentally have this type.

I can imagine scenarios where "don't get in the way" is useful, but also scenarios where "please raise if I accidentally try to add a datetime to a float" is more important. My intuition is that the latter is more common, but I don't have any good ideas for how to measure that.

So if I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

In theory, yes

If we can retain np.nan because it has a different meaning from pd.NA, why not also have pd.NaT, as it also has a different meaning?

@TomAugspurger
Contributor

Is there precedent for NADtype in other libraries/languages?

It seems this is what Julia does.

julia> x = [1, 2, 3, missing]
4-element Array{Union{Missing, Int64},1}:
 1       
 2       
 3       
  missing

julia> x .== missing
4-element Array{Missing,1}:
 missing
 missing
 missing
 missing

IIUC, the Array{Missing, 1} is the "dtype" and shape.

From what I can tell, Date / DateTime arithmetic doesn't handle missing well. I don't know if that's by design or it's just not implemented.

julia> [Date(2014, 1, 1), missing] .+ Dates.Day(1)
ERROR: MethodError: no method matching +(::Missing, ::Day)
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:529
  +(::Missing, ::Missing) at missing.jl:93
  +(::Missing) at missing.jl:79
  ...
Stacktrace:
 [1] _broadcast_getindex_evalf at ./broadcast.jl:625 [inlined]
 [2] _broadcast_getindex at ./broadcast.jl:598 [inlined]
 [3] getindex at ./broadcast.jl:558 [inlined]
 [4] macro expansion at ./broadcast.jl:888 [inlined]
 [5] macro expansion at ./simdloop.jl:77 [inlined]
 [6] copyto! at ./broadcast.jl:887 [inlined]
 [7] copyto! at ./broadcast.jl:842 [inlined]
 [8] copy at ./broadcast.jl:818 [inlined]
 [9] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Union{Missing, Date},1},Base.RefValue{Day}}}) at ./broadcast.jl:798
 [10] top-level scope at REPL[47]:1

@jorisvandenbossche
Member Author

Some comparisons (but note I am not an expert on any of those):

Postgres

Postgres does not have a "NULL type". From testing, it seems to just define a certain resulting type for a scalar NULL operation. For example:

SELECT
interval '1 day' + interval '1 hour' AS result1,          -- interval
interval '1 day' + timestamp '2012-01-01' AS result2,     -- timestamp
interval '1 day' + NULL AS result3;                       -- interval

has interval type for result1, timestamp type for result2, and interval type for result3 (where this could in principle be either interval or timestamp). Similar results for timestamp - timestamp = interval, timestamp - interval = timestamp and timestamp - NULL = interval.
The same can be observed for ints: int + int = int, int + float = float, int + NULL = int.

So here it seems to assume that the NULL has the same type as the other operand when determining the type of the resulting NULL (just a guess from the results I see).

That's for scalar, NULL-involving arithmetic operations. But for the rest, Postgres is very similar to what is discussed in this proposal: they have a single NULL scalar usable for all types. For logical operators involving booleans and NULLs, they also have a "three-valued logic" (https://www.postgresql.org/docs/9.1/functions-logical.html) similar to Julia and what I would propose for our BooleanArray.
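
Spelled out, the three-valued (Kleene) logic being referenced works roughly as follows (a sketch of the expected results, the same rules Postgres and Julia use and what is proposed here for a BooleanArray):

# Kleene three-valued logic for boolean operations involving NA/NULL:
# True  & NA -> NA       True  | NA -> True
# False & NA -> False    False | NA -> NA
# NA    & NA -> NA       NA    | NA -> NA
# not NA     -> NA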

Julia

As Tom already noted above, you get something like

Array{Union{Missing, Int64},1} .+ missing -> Array{Missing, 1}

where the result involving a scalar missing value effectively is of "missing type". But, I am not sure this is really comparable to our situation, as they have this "Type Unions" system. As noted by @shoyer above, that eg solves the isinstance problem (as Missing is an instance of Union{Missing, Int64}).

For the rest, it also has similar behaviour in comparison and logical operators as SQL, and as what I would personally propose for our BooleanArray as well: https://docs.julialang.org/en/v1/manual/missing/

R

Trying some things out with the tidyverse:

> library(tidyverse)
> df <- tibble(x = c(1L, 2L, NA))
> df %>% mutate(x1 =  x + 1L, x2 = x + 1.5, x3 = x + NA)
# A tibble: 3 x 4
      x    x1    x2    x3
  <int> <int> <dbl> <int>
1     1     2   2.5    NA
2     2     3   3.5    NA
3    NA    NA  NA      NA
> library(lubridate)
> df <- tibble(x = c(today(), NA))
> df %>% mutate(x1 = x - ddays(10), x2 = x - ymd(20191001), x3 = x - NA)
# A tibble: 2 x 4
  x          x1         x2      x3        
  <date>     <date>     <drtn>  <date>    
1 2019-10-03 2019-09-23  2 days NA        
2 NA         NA         NA days NA        
 

(here drtn = duration, their "timedelta" type. So translating to our terminology: timestamp - NA = timestamp)

Here it seems to preserve the original type of the column for arithmetic operations involving scalar NAs (when the original type is possible; eg for int / NA (division) it gives a float column of all NAs and not an int column, as expected).

So R also has no "NA type". If you create a column of all NAs, it gives a logical typed column (boolean).

R also actually has multiple NA values (NA (logical), NA_integer_, NA_real_ and NA_character_). But those look the same to the user most of the time, and mostly the user does not need to care, as coercion will normally do the expected thing (from section 20.3.4, just above https://r4ds.had.co.nz/vectors.html#using-atomic-vectors). Those are also only for the basic types, so I am not sure how that is handled exactly in the tidyverse, where tibbles have more types.

Apache Arrow

Arrow consistently handles missing values with a mask (so not with sentinel values such as NaN or NaT). It also has a NullType for arrays with only nulls and no specified type (https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#null-layout).
However, there are not yet many arithmetic operations defined (in the C++ implementation) to compare the behaviour with scalar nulls (there is a sum, but only for numeric types, nothing specific for datetime/timedelta).

@jorisvandenbossche
Member Author

And now some specific answers (sorry for the long wall of text):

[Tom] If we're changing things, I think we'll move to either pd.NA or pd.NA[T] (i.e. one NA for all dtypes, or one NA per dtype).

A third option could maybe also be to have an NA for each type, but let them all look like "NA". That could help the "scalar arithmetic" issues by preserving the originating type information in the NA value. On the other hand, this seems a lot more work to implement (would eg this "NA_period" have all the methods of a Period?) and potentially more work to code against.
We would probably also still need a generic NA (eg for extension dtypes that don't want to implement their own "NA_my_extension" variant), and then still have the question of how that would behave in scalar operations.

[Brock] I see two arguments for a single pd.NA:

  1. Have consistency in that arraylike[na_idx] returns pd.NA regardless of arraylike.dtype
  2. Fix cases where we currently use np.nan and doing so conveys misleading dtype information.

Am I missing anything? Aiming for non-normative descriptions so far.

In the document, I give 3 main arguments: 1) inconsistent user interface, 2) proliferation of "not a value" types, 3) misuse of the float NaN (with longer descriptions for each in https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB). I think that my 1) maps to your 1., and my 3) maps more or less to your 2. My 2) is basically an inverted argument against the alternative (consistently using the "not a value" pattern for all types).
Are those arguments more or less OK / good descriptions? (feel free to give concrete feedback on the text, you can comment on the document)

If now we're talking about adding pd.NA and pd.NADtype, and if we use the "how many things are there" heuristic to judge proposals, then I think we'd be better off with implementing pd.NA for non-arithmetic dtypes and keeping pd.NaT as is (i.e. still have "two things").

Assuming that we then want a "not a value" for each arithmetic type (otherwise you have the same problems as now that timestamp - NaT has an unclear resulting type), you would have at least 3 "not a value" types for timestamp/timedelta/period, and in principle the different int and float types are also "arithmetic dtypes".
For something like this, the idea at the top of this comment (in reply to Tom) of having multiple NA realizations to preserve type information might be similar to what you propose under the hood, but it would give a more consistent user interface around "NA". But we would still need rules for how to treat a generic pd.NA (an NA that does not originate from accessing a typed column like s[0]).

I can imagine scenarios where "don't get in the way" is useful, but also scenarios where "please raise if I accidentally try to add a datetime to a float" is more important. My intuition is that the latter is more common, but I don't have any good ideas for how to measure that.

Those things are indeed difficult to guess or measure. For the "don't get in the way", I was mainly thinking about the example of combining data such as in concat or append or combining chunks in groupby (where now incompatible dtypes cause upcasting to object). But it is true that in arithmetic operations, you again get potentially unclear expectations.

If we can retain np.nan because it has a different meaning from pd.NA, why not also have pd.NaT, as it also has a different meaning?

Yes, given that the underlying numpy array has this "not a value" sentinel, we can keep that as well (if we combine it with a mask for actual missing values). However, in practice (in the far future where an NA system would be fully rolled out), I think most of the time people will have NA rather than NaT in their values, as all operations that introduce missing values in pandas would give NA (IO with missing values, reindexing / alignment, ..).
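
As a sketch of what that would look like (hypothetical behaviour under the proposal, using a nullable Int64 column):

import pandas as pd

# Operations that introduce missing values (reindexing, alignment, IO) would
# insert pd.NA rather than a dtype-specific sentinel:
s = pd.Series([1, 2], dtype="Int64", index=["a", "b"])
s.reindex(["a", "b", "c"])   # the value at "c" would be pd.NA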

@jreback
Contributor

jreback commented Oct 3, 2019

agree generally with having pd.NA; it's going to be much simpler from a user viewpoint and implementation POV, but some concerns:

  • having multiple implementations (meaning we present NA but different dtypes under the hood) is not scalable / overly complicated
  • we already have an Any dtype, namely object, to hold all-NA values (so for example an empty Series or an all-NA Series w/o dtype specified would be object dtype - a change we have been wanting to make)
  • we should use masks internally to implement this, but we don’t have a good bitmask holder unless we use pyarrow, which is ok, though 2D EAs need some consideration here
  • we might need to change NA propagation in + (and other ops) to ignore NA (kind of like .add and .mul do now)
  • i don’t mind losing attributes / functions on missing values (eg why we have NaT look like a datetime in the first place); it’s slightly confusing to get this back from an operation, but the alternative (many missing value types) I think is just too complicated

so i would just say implement this using pyarrow masks (but not pyarrow memory for the arrays themselves; we have differing enough semantics that we likely want to keep the values in numpy arrays at least until pyarrow grows stable operations)

@jbrockmendel
Member

TL;DR: can we break any non-controversial pieces of this off? The scope here is overwhelming.


Are those arguments more or less OK / good descriptions?

Yes, I think we are on the same page description-wise.

Assuming that we then want a "not a value" for each arithmetic type (otherwise you have the same problems as now that timestamp - NaT has an unclear resulting type)

I don't think that is a necessary assumption. The default alternative to this proposal is not a proliferation of newly implemented things; it is the status quo. The existing inconsistencies with NaT are annoying, but generally we have a handle on them. This is already a complicated discussion and I'd prefer to keep #24983 separate to whatever extent that is feasible.

"three-valued logic" (https://www.postgresql.org/docs/9.1/functions-logical.html) similar to Julia and what I would propose for our BooleanArray.

I think that would be great; it would make our logical ops (which I'm currently working on and are a PITA) much more internally consistent. More importantly, it can be implemented independently of the rest of this proposal, which is kind of overwhelming in scope.

it’s going to be much simpler from a user viewpoint and implementation POV,

I understand the user pain point w/r/t non-arithmetic-dtypes (my point 2, part of Joris's 3), but I don't at all see how __getitem__ returning NaT is a pain point.

so i would just say implement this using pyarrow masks

AFAICT the contentious part of this is just about what __getitem__ returns, which is orthogonal to this. Am I missing something important?

@dhirschfeld
Contributor

...all of which I think is in agreement with what @jorisvandenbossche said above.

@jorisvandenbossche
Member Author

Yes, that's indeed what I tried to explain in words, thanks for the examples.
(one small note: I don't think we will have a pd.NA[int64] dtype, it will still be the Int64Dtype that is nullable by using a mask; but maybe it was only for illustration purposes, to be able to show the difference with the current behaviour?)

Assuming we have this "generic" untyped pd.NA as well, you can still run into surprising cases:

In [45]: pd.Series(["1 days"], dtype='timedelta64[ns]') + pd.NA 
Out[45]: 
0   NA
dtype: timedelta64[ns]

In [45]: pd.Series(["1 days"], dtype='timedelta64[ns]') + pd.Series([pd.NA], dtype='datetime64[ns]')[0]
Out[45]: 
0   NA
dtype: datetime64[ns]

So an untyped NA versus an NA resulting from a computation or indexing operation will give a different result, although they look the same.

@seberg
Contributor

seberg commented Oct 16, 2019

(sorry posted this on the wrong thread previously)

I actually ran into this in numpy a few days back. The np.ma.masked constant behaves like 0. when it comes to type promotion in ufuncs (which triggers value-based promotion, which does what was discussed here: float32_arr + np.ma.masked gives float32).

However, the way this currently works, np.ma.masked is an array with a .dtype, and I want to deprecate value-based logic for those. Which in turn means I will need to add hooks so that np.ma.masked works like a python float here for now (I can do that by exposing things that I probably need to expose anyway; I can even do that from python using public API).

I guess, the main question that I am wondering right now:

  1. Do you/we really want float32_arr + NA to work (at least for some functions, maybe not all)?
    I could just as well see saying that you have to use float32_arr + NA[Float32].

  2. Do we want pd.Series([1, NA, 2, 3]) to work? Or force pd.Series([1, NA], dtype=Int64) or pd.Series([1, NA[Int64]]).

Note that this does not affect series[3] = NA, which works fine of course. series[:] = list_including_NAs might need some care to always get right. But I think that should not be an issue, since we can probably make this call pd.Series(list_including_NAs, dtype=series.dtype). (There is one larger issue with the last one, in that numpy array coercion currently does not know about safe/same_kind/unsafe casting...)

@Dr-Irv
Contributor

Dr-Irv commented Oct 17, 2019

(Posted in wrong issue before, so moving it here)

The discussion by @seberg above leads me to the following idea, which I think I may have posted elsewhere.

I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number". In the current pandas world, there is no differentiation. Consider this simple example:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: s=pd.Series([0,4,np.nan], dtype=float)

In [4]: s2=pd.Series([0,2,1], dtype=float)

In [5]: s
Out[5]:
0    0.0
1    4.0
2    NaN
dtype: float64

In [6]: s2
Out[6]:
0    0.0
1    2.0
2    1.0
dtype: float64

In [7]: s3=s/s2

In [8]: s3
Out[9]:
0    NaN
1    2.0
2    NaN
dtype: float64

In this example s has missing data, and the result of the division in s3 has NaN in two places. One (s3[0]) is due to an arithmetic error, and the other (s3[2]) is due to missing data. I have run into cases where I wanted to know the difference, to help hunt down a bug, which could be a bug in my arithmetic, or could be due to missing data.

IMHO, these should be represented differently. The arithmetic/boolean operations that occur with a floating point NaN are well defined in standard floating point specification documentation. But if pandas had a separate NA that meant "missing data", then we are free to define how arithmetic and boolean operations work with NA values, and those rules do not have to follow the rules for NaN. We should also determine how NaN and NA interact.
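
A sketch of how that distinction could look under this proposal, assuming a hypothetical mask-based nullable float dtype (spelled "Float64" here purely for illustration) where NaN stays in the values and NA is tracked in the mask:

import pandas as pd

s = pd.array([0.0, 4.0, None], dtype="Float64")   # None becomes NA (missing)
s2 = pd.array([0.0, 2.0, 1.0], dtype="Float64")
s3 = s / s2
# s3[0] would be NaN (0/0, an invalid arithmetic result)
# s3[2] would be pd.NA (a propagated missing value)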

@jorisvandenbossche
Member Author

However, the way this currently works np.ma.masked is an array with a .dtype and I want to deprecate value based logic for those. Which in turn means I will need to add hooks so that np.ma.masked works like a python float here for now

@seberg is there a numpy issue/thread with discussion about this? Is it in your dtype / ufunc refactor that you want to get rid of value-based dtype promotion?
How are python scalars handled in numpy right now? First coerced to a numpy (0d) array/scalar so it has a dtype to determine the dtype promotion?

Do you/we really want float32_arr + NA to work? I can see just as well to say that you have to use float32_arr + NA[Float32].

I am personally not fully convinced we need type-specialized NAs, but if we do, I think it is important for usability that the generic NA works in most cases out of the box.

[ Dr-Irv ] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number".

As mentioned above (#28095 (comment)), this proposal allows distinguishing the two for floating point data, in principle. In the example you give we can indeed preserve the NaN in case of an invalid arithmetic result. NaN values are part of the actual values, while NA values are kept track of in the mask.
If and how we want to do this is an open question: eg should the dropna/isna/... functions recognize both, or should this be an option? (I would maybe open a new issue about that if we want to discuss it in detail.)

@seberg
Contributor

seberg commented Oct 19, 2019

@jorisvandenbossche my thread would be the closest, but no. I do not want to get rid of value-based promotion for python types. I.e. array([1, 2], np.uint8) + 3 can be uint8, but array([1, 2], np.uint8) + np.int64(3) IMO should move to be int64 as output.
Right now, yes: python scalars are first coerced to an array; I wish to change that so that they are first resolved into a "value-based-casting" DType.

See it as creating a new function (a similar one exists in the C-api): dtype = np.dtype_of(input) and in the case of np.dtype_of(3) it would return (the class) PythonIntegerDType[3] (this is the first stage of converting to array in any case). PythonIntegerDType[3] is abstract, i.e. you cannot tag it on to an array. The NA scalar would get a similar dtype. I will likely have to do something similar for np.ma.masked (although that would be designed purely for backward compatibility).

In any case, none of this is settled as such. But it is how I see things right now, so I think it is the most likely thing to happen (I am going to be gone for the next few weeks, may be around for a few days only).

@TomAugspurger
Contributor

Anyone (@jorisvandenbossche?) care to summarize the state of this discussion, and maybe suggest paths to get it unstuck?

TomAugspurger added the Roadmap label on Nov 2, 2019
@jorisvandenbossche
Member Author

With some delay, trying to summarize the above discussion

I think in general there seems to be approval for the idea of a dedicated NA indicator specifically for this purpose.

The main discussion item here can be reduced to: a single, dtype-agnostic NA versus multiple dtype-specialized NA realizations. And this difference impacts several other aspects of the discussion (behaviour in operations, ..).

Single NA scalar:

  • Simpler implementation
  • For a majority of the use cases (where an ambiguity in resulting dtype does not come up or does not give a problem), this is also simpler for the user.
    IMO, in most cases the user should be able to just type pd.NA without needing to care about the dtype (as it will be clear from the context).
  • But, a single NA leads to ambiguous situations about the dtype of the result in certain operations involving a scalar NA (so not in an array/Series). For example: what is the dtype of the resulting Series in Series[timedelta64] + NA ?
    If we go this way, this means that we need to make and document a choice for all types for the above question.

Dtype-specialized NAs

  • How would this be implemented? Different subclasses? Or different instances of a single NAType class that has its 'parent' dtype as a property?
  • It will not solve the "isinstance problem" (isinstance(s[scalar], s.dtype.type) is False in case of NA) or provide a consistent scalar API, at least with the implementation as in the point above. Or do we actually want to make the different NAs a subclass of their dtype's scalar types? (which is probably not possible for the numpy scalars; it might be possible for scalars such as Timestamp and Interval that we implement ourselves)
  • How does this work for externally-defined ExtensionArrays?
  • The big advantage is that operations with dtype-specific NAs will always have a clear resulting dtype.
  • But do we still want a "generic NA" that is not specialized to a certain dtype? If we want to allow the user to write pd.NA (which I think we should), we will need this. But do we allow this generic NA in operations? (in that case we still need to make those choices about the resulting dtype) Or only in a context where the dtype is clear? (eg pd.Series([pd.NA], dtype=..) or s[0] = pd.NA where s is an already existing series)

The summary ended up more as a list of open questions ..
Personally, I would propose to start with trying to implement the "single NA scalar" option for 1.0, so we can gain experience with it (and we can re-evaluate later). But in the meantime, further discussion of the above open questions is certainly welcome.

Some other points worth noting:

  • There is also discussion of the behaviour of NA in comparison and logical operations -> that discussion is kept in DISCUSS: boolean dtype with missing value support #28778
  • We also need to decide on some practical first steps (eg are we OK with breaking changes in IntegerArray to start using pd.NA instead of np.nan ?) -> for that I opened Missing values proposal: concrete steps for 1.0 #29556
  • There were also questions around whether np.nan and pd.NA can then both live in the same array / should be distinguished. I think it is certainly possible to do that, but open questions are whether np.nan should still be considered "missing" (in things like isna/fillna/..) or whether that can be configurable.
    But, this will only be relevant if we start thinking about a float dtype using pd.NA; so maybe we can hold further discussion until then (or for a separate issue).

Further, I opened a PR with a basic BooleanArray extension array (but for now without any new NA behaviour): #29555
And I will try to open another PR starting a scalar NA (but help is certainly welcome, I only have limited time to work on this).

@dhirschfeld
Contributor

My preferred API would be dtype-specialized NAs along with a generic pd.NA which would always be cast to the appropriate specialized type, or raise if it was ambiguous.
e.g.

pd.Series([pd.NA])    # raises TypeError!
pd.Series([pd.NA], dtype=pd.NA['int64'])    # works (casts the pd.NA to a pd.NA['int64'] specialized dtype)

IMHO this gives the user the simplicity of a single pd.NA type whilst removing ambiguity about mixed NA types.

I guess ideally isinstance(pd.NA['int64'](1), pd.NA) and isinstance(pd.NA['int64'](1), np.int64) would both be True

if np.nan should still be considered "missing" (in things like isna/fillna/ ..)

For backward compatibility I'd say they would need an include_nan argument which defaulted to True.

There were also questions around if np.nan and pd.NA can both live in the same array

Ideally (for me), yes. As someone else mentioned above, missing is a different concept from invalid, and for certain use cases it would be very useful to be able to distinguish them.

@TomAugspurger
Contributor

Thanks for the summary Joris.

I agree that there's probably consensus for moving forward with some kind of pd.NA that's distinct from np.nan. There's not yet consensus on a single pd.NA vs. dtype-specialized pd.NA[T], and how to integrate pd.NA with np.nan for floating-point data.

I'm having a hard time judging which of a single pd.NA or a dtype-specialized pd.NA[T] is better ahead of time. There are pros and cons to each.

I would like to see something for 1.0, especially for StringArray, BooleanArray, and IntegerArray. I already know that StringArray shouldn't be using np.nan for its NA value. It'd be nice to get that fixed before we release.

If you expect a single pd.NA to be easier to implement, then I'm OK with doing that for 1.0 (keeping the experimental status, and noting that it may change in the future to NA[str]).
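
For example (prospective behaviour being discussed for 1.0, not what StringArray did at the time of this comment):

import pandas as pd

# StringArray handing back pd.NA instead of np.nan for missing entries:
s = pd.array(["a", None], dtype="string")
s[1]   # would be pd.NA rather than np.nan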

@jorisvandenbossche
Member Author

I just opened #29597 for a quick (pure python) experiment for a single NA scalar.

@jreback
Contributor

jreback commented Nov 13, 2019

would prefer that we actually merge something and let it sit in master for a while, so -1 on doing this for 1.0 unless it’s either not used anywhere or we significantly delay 1.0

@jorisvandenbossche
Member Author

@jreback let's keep that discussion in #29556

@jbrockmendel
Member

Reviewing this and #32265 in light of a little over 2 years of experience with pd.NA, the main question to which I think we need a definitive answer is "Does/Should pd.NA mean something semantically distinct from np.nan?"

If the answer is NO, then we can/need to:

If the answer is YES, then we can/need to:

We should also reconsider whether we want pd.NA to propagate through ops (and bool(pd.NA)) vs to behave like np.nan. Behaving like np.nan would:

To be clear I think the propagating behavior is "kind of neat," but there is a tradeoff with maintenance burden to consider.

@ngoldbaum
Contributor

I am working on a new variable-width UTF8 string dtype for numpy that supports arbitrary null sentinels, with an eye to explicitly supporting NA so we can replace object string arrays in pandas.

This week I discovered that it's very difficult to identify the NA singleton in C code that needs to be portable across python implementations.

As far as I'm aware (please correct me if this is incorrect), the canonical way to identify NA is something like:

if obj is pd.NA:
    # handle nulls

This is problematic for pypy, particularly if I'm writing code against pypy's cpyext CPython C API emulation, since there is currently no straightforward way to spell the equivalent of the python is operator in C that will work correctly with cpyext.

In CPython, you just do pointer equality, so it's tempting to do that. However, as I discovered this week, any C extension that uses pointer equality like this will be subtly broken on pypy under cpyext.

The closest I have to an is implementation is an awful hack that calls the id builtin function from C.

If instead NA behaved more like NaN (e.g. pd.NA != pd.NA and bool(pd.NA) == True), I could use the same duck typing code I have to handle NaN and not go out of my way to do an is check on pypy. I'd still do it on CPython as a performance optimization though.
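
A minimal sketch of that duck-typing check (the helper name is illustrative); note that pd.NA as currently defined does not pass it, since pd.NA != pd.NA returns pd.NA and bool(pd.NA) raises a TypeError:

import numpy as np

def is_nan_like(obj):
    # NaN-style duck test: a value that does not compare equal to itself.
    try:
        return bool(obj != obj)
    except TypeError:
        return False

is_nan_like(np.nan)   # True
is_nan_like(1.0)      # False
# is_nan_like(pd.NA)  # False today, because bool(pd.NA) raises TypeError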

Sorry to resurrect this old issue, but seeing how the documentation still refers to NA as experimental, I thought it would still be worth bringing up. Also, if this has come up already, please point me to the old discussion; I couldn't find any previous discussion about this perhaps niche issue.
