Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add NA scalar for missing value indicator, use in StringArray. #29597

Merged
merged 25 commits into from
Dec 1, 2019

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Nov 13, 2019

xref #28095, #28778

This PR adds a pd.NA singleton with the behaviour as discussed in above issues. For now, it's only used in StringArray in this PR.

@jorisvandenbossche jorisvandenbossche added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 13, 2019

def _create_binary_propagating_op(name):
def method(self, other):
if isinstance(other, numbers.Number) or other is NA or isinstance(other, str):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question is what type of objects we should recognize here.

I think we want to recognize scalar values that can be stored in dataframes. But of course we don't have control over what kind of scalars people put in a dataframe.
For now I did numbers + string (numbers also covers numpy scalars). Our own scalars (Timedelta, Timestamp, etc) could also be added, or they can also handle that on their side)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to look at how NaT handles these, in particular for arraylike others

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want it to be recognized by arrays? (I somewhat purposefully left those out, but didn't think it fully through)

For our own internal arrays (eg IntegerArray), that can be handled on the array level? (so returning NotImplemented here)

For numpy arrays the question is what to do with it, as you cannot represent NAs in a numpy array (except in an object array). So not handling it might be fine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For numpy arrays the question is what to do with it, as you cannot represent NAs in a numpy array (except in an object array). So not handling it might be fine?

Actually, by not handling it and deferring to numpy, this already happens (-> converting to object):

In [61]: np.array([1, 2, 3]) + pd.NA  
Out[61]: array([NA, NA, NA], dtype=object)

In [62]: pd.NA + np.array([1, 2, 3]) 
Out[62]: array([NA, NA, NA], dtype=object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats fine for now, but longer-term this will need to be optimized

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but longer-term this will need to be optimized

Can you explain what you mean with this?

pandas/core/na_scalar.py Outdated Show resolved Hide resolved
if isinstance(other, np.int64):
# for numpy scalars we get a deprecation warning and False as result
# for equality or error for larger/lesser than
continue
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So numpy scalars we don't have full control over, so this means that if they are the left operand, we get some other behaviour:

In [27]: np.int64(1) == pd.NA 
/home/joris/miniconda3/envs/dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[27]: False

In [28]: pd.NA == np.int64(1) 
Out[28]: NA

In [29]: np.int64(1) < pd.NA 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-87134fac2734> in <module>
----> 1 np.int64(1) < pd.NA

~/scipy/pandas/pandas/core/na_scalar.py in __bool__(self)
     37 
     38     def __bool__(self):
---> 39         raise TypeError("boolean value of NA is ambiguous")
     40 
     41     def __hash__(self):

TypeError: boolean value of NA is ambiguous

In [30]: pd.NA > np.int64(1) 
Out[30]: NA

(for the first case, not sure what the behaviour will be once the change in numpy is done)

@TomAugspurger
Copy link
Contributor

Any chance you could change StringDtype.na_value to use this, and see what breaks?

There are a few places where we use np.nan when we should be referring to StringDtype.na_value instead.

I suspect that _libs.lib.Validator.is_valid_null or util.is_nan will need to be updated to recognice pd.NA as NA. So not a big deal if it's a bunch of work to prototype.

elif other is False or other is NA:
return NA
else:
return NotImplemented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other NA objects? NaT? NaN? None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For logical ops, I am not sure it should handle those, as logical ops typically involve booleans (and NaT/NaN/None cannot be stored in a boolean array)

Eg for float is also raises right now:

In [69]: pd.NA | 1.5
...
TypeError: unsupported operand type(s) for |: 'NAType' and 'float'

In [70]: pd.NA | np.nan 
...
TypeError: unsupported operand type(s) for |: 'NAType' and 'float'

(for comparison ops, though, we should probably add those)

@jbrockmendel
Copy link
Member

core.missing and _libs.missing will need to recognize this

@TomAugspurger
Copy link
Contributor

Playing with using this in StringArray, will push something (here or a PR against this) in a bit.

Anecdotally, having __bool__ raise is kinda annoying since it breaks reductions like all.

(Pdb) pp left_value
<StringArray>
[NA, NA]
Length: 2, dtype: string
(Pdb) np.all(left_value == right_value)
*** TypeError: boolean value of NA is ambiguous

But maybe that's a good thing. I'd like to develop some more experience here.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Nov 13, 2019

@jbrockmendel I moved it to cython (well, still mostly copy paste of the python version into the cython file) and let missing functions recognize this (isna), so feedback on that part is very welcome

@jorisvandenbossche
Copy link
Member Author

Anecdotally, having bool raise is kinda annoying since it breaks reductions like all.

We will need to rethink any/all in general as well. In principle, those are reducing operations, and reductions skip NAs, so your example should be equivalent to all of an empty Series/array.
Although in the case of any/all, skipping NAs also doesn't fully feel as the best behaviour in many cases (eg for boolean array all([True, NA]) == True would be also be somewhat strange as the NA could be False, so you don't know the result).
This also recently came up for object dtypes IIRC, and was planning to open a new issue for it (or find an existing).

@TomAugspurger
Copy link
Contributor

If anyone wants a preview of what getting this working with StringArray looks like: https://github.com/jorisvandenbossche/pandas/pull/1/files.

It'll mean a larger diff, but I think we'll want to merge that PR against Joris' branch into here, so that we can get a feel for what actually using this is going to look like before merging it into master.

@jbrockmendel
Copy link
Member

I suspect that _libs.lib.Validator.is_valid_null or util.is_nan will need to be updated to recognice pd.NA as NA. So not a big deal if it's a bunch of work to prototype.

util.is_nan is specifically for float-nan. The scalar-recognizing functions that need to recognize this are in _libs.missing.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 14, 2019 13:54
@TomAugspurger TomAugspurger changed the title ENH: add NA scalar for missing value indicator ENH: add NA scalar for missing value indicator, use in StringArray. Nov 14, 2019
@TomAugspurger
Copy link
Contributor

Joris merged in my changes to make StringArray use pd.NA. It wasn't too large of a change so far.

I want to call out one behavior change we may want to make: Right now, Series[string].str methods that return numeric output (like .str.count) return either int or float dtype, depending on whether there are NAs. See https://github.com/pandas-dev/pandas/pull/29597/files#diff-683f8dacd08abda4913680fafe3a4ea7R3512.

I'm proposing that we change that behavior to always return an integer-na dtype in those cases for string dtype (no change to object dtype).

@jorisvandenbossche
Copy link
Member Author

I'm proposing that we change that behavior to always return an integer-na dtype in those cases for string dtype (no change to object dtype).

+1

It's also not a "breaking" change, as string dtype is still new. In general, with those new dtypes I think we should try to return as much as possible also new dtypes for the result (eg also for boolean results, once boolean array has landed)

@TomAugspurger
Copy link
Contributor

Just to scope this PR, I think that using pd.NA in BooleanArray should be done as a followup PR.

def _create_binary_propagating_op(name, divmod=False):

def method(self, other):
if isinstance(other, numbers.Number) or other is NA or isinstance(other, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put the numbers.Number check last, as it will be least performant

@jorisvandenbossche
Copy link
Member Author

I added some initial docs (whatsnew + section in missing_data

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Nov 27, 2019

@jorisvandenbossche I've looked at your doc changes. You may want to say something about that when you read data, the new pd.NA type won't be used, until it gets addressed by closing the issue I created #29752 .

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs look very nice, thanks.

This also means that ``pd.NA`` cannot be used in a context where it is
evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check
for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA``
for ``pd.NA`` or ``condition`` being ``pd.NA`` can be avoided, for example by filling missing values beforehand.

(can't apply this directly, since it wraps two lines)

@jorisvandenbossche
Copy link
Member Author

cc @pandas-dev/pandas-core more comments on this?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some very minor doc comments. can you run the asv's to see if isna is impacted in any way here.


Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is
available to represent scalar missing values. At this moment, it is used in
the nullable integer and boolean data types and the dedicated string data type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use commas rather than 3 ands, ideally if you can link to those sections?

pd.NA | True

On the other hand, if one of the operands is ``False``, the result depends
on the value of the other operand. Therefore, in thise case ``pd.NA``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thise -> this

``NA`` in a boolean context
---------------------------

Since the actual value of an NA is unknown, it is ambiguous to convert NA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a section gotchas.truth that is very similiar, could link.

@jreback jreback added this to the 1.0 milestone Nov 27, 2019
@jorisvandenbossche
Copy link
Member Author

Updated for the doc comments, thanks! @Dr-Irv I added a sentence on that. But we should probably add a more extensive section on that in a page dedicated to those new data types (for a follow-up).

@jreback I run the isnull benchmarks on this PR and on master, nothing seems off out of the error / noise margin.

@jreback jreback merged commit 7ea4e61 into pandas-dev:master Dec 1, 2019
@jreback
Copy link
Contributor

jreback commented Dec 1, 2019

thanks @jorisvandenbossche nice addition.

unless I missed this, you didn't update nullable integers to return that, assume that's a followon.

@jorisvandenbossche jorisvandenbossche deleted the NA-scalar branch December 2, 2019 07:08
@jorisvandenbossche
Copy link
Member Author

you didn't update nullable integers to return that, assume that's a followon.

Yes, as being discussed, see #29556 for all follow-ups

keechongtan added a commit to keechongtan/pandas that referenced this pull request Dec 2, 2019
…ndexing-1row-df

* upstream/master: (49 commits)
  repr() (pandas-dev#29959)
  DOC : Typo fix in userguide/Styling (pandas-dev#29956)
  CLN: small things in pytables (pandas-dev#29958)
  API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max (pandas-dev#27929)
  DEPR: DTI/TDI/PI constructor arguments (pandas-dev#29930)
  CLN: fix pytables passing too many kwargs (pandas-dev#29951)
  Typing (pandas-dev#29947)
  repr() (pandas-dev#29948)
  repr() (pandas-dev#29950)
  Added space at the end of the sentence (pandas-dev#29949)
  ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597)
  CLN: BlockManager.apply (pandas-dev#29825)
  TST: add test for rolling max/min/mean with DatetimeIndex over different frequencies (pandas-dev#29932)
  CLN: explicit signature for to_hdf (pandas-dev#29939)
  CLN: make kwargs explicit for pytables read_ methods (pandas-dev#29935)
  Convert core/indexes/base.py to f-strings (pandas-dev#29903)
  DEPR: dropna multiple axes, fillna int for td64, from_codes with floats, Series.nonzero (pandas-dev#29875)
  CLN: make kwargs explicit in pytables constructors (pandas-dev#29936)
  DEPR: tz_convert in the Timestamp constructor raises (pandas-dev#29929)
  STY: F-strings and repr (pandas-dev#29938)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants