
API: query / boolean selection with nullable dtypes with NAs #31503

Closed
tdpetrou opened this issue Jan 31, 2020 · 19 comments · Fixed by #31591
Labels
API Design · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · NA - MaskedArrays (related to pd.NA and nullable extension arrays)
Milestone

Comments

@tdpetrou
Contributor

tdpetrou commented Jan 31, 2020

Code Sample

Create a dataframe with nullable integer, string, and float data types.

>>> df = pd.DataFrame({'a': [1, 3, np.nan], 'b': ['rain', 'shine', None],
...                    'a_float': [1.1, 3.3, np.nan]})
>>> df = df.convert_dtypes()
>>> df
      a      b  a_float
0     1   rain      1.1
1     3  shine      3.3
2  <NA>   <NA>      NaN

Verify data types and attempt to use query

>>> df.dtypes
a            Int64
b           string
a_float    float64
dtype: object

>>> df.query('a > 2') # same as df[df['a'] > 2]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Documents/Code Practice/pandas-dev/pandas/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
   3227         try:
-> 3228             new_data = self.loc[res]
   3229         except ValueError:

~/Documents/Code Practice/pandas-dev/pandas/pandas/core/indexing.py in __getitem__(self, key)
   1683             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1684             return self._getitem_axis(maybe_callable, axis=axis)
   1685 

~/Documents/Code Practice/pandas-dev/pandas/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1798             return self._get_slice_axis(key, axis=axis)
-> 1799         elif com.is_bool_indexer(key):
   1800             return self._getbool_axis(key, axis=axis)

~/Documents/Code Practice/pandas-dev/pandas/pandas/core/common.py in is_bool_indexer(key)
    133                 if np.any(key.isna()):
--> 134                     raise ValueError(na_msg)
    135             return True

ValueError: cannot mask with array containing NA / NaN values

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-52-e5d239635d7b> in <module>
----> 1 df.query('a > 2')

~/Documents/Code Practice/pandas-dev/pandas/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
   3230             # when res is multi-dimensional loc raises, but this is sometimes a
   3231             # valid query
-> 3232             new_data = self[res]
   3233 
   3234         if inplace:

~/Documents/Code Practice/pandas-dev/pandas/pandas/core/frame.py in __getitem__(self, key)
   2784 
   2785         # Do we have a (boolean) 1d indexer?
-> 2786         if com.is_bool_indexer(key):
   2787             return self._getitem_bool_array(key)
   2788 

~/Documents/Code Practice/pandas-dev/pandas/pandas/core/common.py in is_bool_indexer(key)
    132             if is_extension_array_dtype(key.dtype):
    133                 if np.any(key.isna()):
--> 134                     raise ValueError(na_msg)
    135             return True
    136     elif isinstance(key, list):

ValueError: cannot mask with array containing NA / NaN values

>>> df.query('a_float > 2')
   a      b  a_float
1  3  shine      3.3

Using query with strings works...

>>> df.query('b == "rain"')
   a     b  a_float
0  1  rain      1.1

...but fails for boolean selection

>>> df[df['b'] == 'rain']
ValueError: cannot mask with array containing NA / NaN values

strings also fail for inequalities

>>> df.query('b >= "rain"') # also df.query('b > "rain"')
ValueError: cannot mask with array containing NA / NaN values

Problem description

The query method behaves differently for nullable integers, strings, and floats. Here's my summary of how I think they work with the query method assuming there are missing values in the columns.

  • nullable integers - fails
  • strings - works with equality, fails with inequality
  • float - works for all

I find query and boolean selection extremely difficult to use when the behavior differs across these types.

Expected Output

I think I would prefer to have both query and boolean selection work like they do with floats, where missing values evaluate as False in a condition. Even if there are missing values in the boolean mask itself, treat those as False. This would harmonize the behavior for all data types.

This would leave it up to the user to check for missing values. I believe SQL where clauses work in such a manner (missing values in conditions evaluate as False).
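The SQL analogy can be checked directly with the stdlib's sqlite3 (this snippet is illustrative and not part of the original report; the table `t` is hypothetical):

```python
import sqlite3

# NULL > 2 evaluates to NULL, and WHERE keeps only rows whose
# condition is true, so the NULL row is silently excluded.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (3,), (None,)])
rows = con.execute("SELECT a FROM t WHERE a > 2").fetchall()
print(rows)  # [(3,)]
```

This is the "missing values in conditions evaluate as False" behaviour the report asks for.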

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0+untagged.1.gce8af21.dirty
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@TomAugspurger
Contributor

Can you include the full tracebacks? That helps show where things are failing. I'm seeing different exceptions in an environment with numexpr.

In [17]: df.query('a > 2')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-e5d239635d7b> in <module>
----> 1 df.query('a > 2')

~/sandbox/pandas/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
   3223         kwargs["level"] = kwargs.pop("level", 0) + 1
   3224         kwargs["target"] = None
-> 3225         res = self.eval(expr, **kwargs)
   3226
   3227         try:

~/sandbox/pandas/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
   3338         kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
   3339
-> 3340         return _eval(expr, inplace=inplace, **kwargs)
   3341
   3342     def select_dtypes(self, include=None, exclude=None) -> "DataFrame":

~/sandbox/pandas/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
    328         eng = _engines[engine]
    329         eng_inst = eng(parsed_expr)
--> 330         ret = eng_inst.evaluate()
    331
    332         if parsed_expr.assigner is None:

~/sandbox/pandas/pandas/core/computation/engines.py in evaluate(self)
     71
     72         # make sure no names in resolvers and locals/globals clash
---> 73         res = self._evaluate()
     74         return reconstruct_object(
     75             self.result_type, res, self.aligned_axes, self.expr.terms.return_type

~/sandbox/pandas/pandas/core/computation/engines.py in _evaluate(self)
    112         scope = env.full_scope
    113         _check_ne_builtin_clash(self.expr)
--> 114         return ne.evaluate(s, local_dict=scope)
    115
    116

~/Envs/pandas-dev/lib/python3.7/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    820     # Create a signature
    821     signature = [(name, getType(arg)) for (name, arg) in
--> 822                  zip(names, arguments)]
    823
    824     # Look up numexpr if possible.

~/Envs/pandas-dev/lib/python3.7/site-packages/numexpr/necompiler.py in <listcomp>(.0)
    819
    820     # Create a signature
--> 821     signature = [(name, getType(arg)) for (name, arg) in
    822                  zip(names, arguments)]
    823

~/Envs/pandas-dev/lib/python3.7/site-packages/numexpr/necompiler.py in getType(a)
    701     if kind == 'S':
    702         return bytes
--> 703     raise ValueError("unknown type %s" % a.dtype.name)
    704
    705

ValueError: unknown type object

I think I would prefer to have both query and boolean selection working like they do with floats, where missing values evaluate as False in a condition.

We've defined NA such that it propagates in comparison operations. There was a sizable discussion on it if you go back to the original issue, and we reached consensus that propagating was the best.

In [10]: 1 == pd.NA
Out[10]: <NA>

Not saying that can't change, since NA is experimental, but it'll be a high hurdle.

@jorisvandenbossche
Member

We've defined NA such that it propagates in comparison operations.

And, in addition, the indexing behaviour with missing values.

So the behaviour you see for boolean masking with missing values is the consequence of two things:

  • The comparison operation step propagates missing values
  • The indexing step raises on missing values

And both are open for discussion (as Tom noted, there is a reason this is experimental). But I personally think it is rather the second that we should maybe discuss. The alternative under consideration for that one was to not error on missing values in a boolean mask, but to interpret the boolean value as False (skip it in masking). That would mostly give the same behaviour as we had before.

We decided to take the "raise on missing values" behaviour for now, as it is the most conservative: the user can always do fillna(True) or fillna(False) on the boolean mask to be explicit in which behaviour they want.
(well, when using query this is actually not that straightforward ..)

But I think in this area we explicitly said: let's get some more experience with it and re-evaluate later. So feedback on this is certainly welcome. See #28778 for some of the earlier discussion.
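The explicit-fillna workaround described above can be sketched as follows (an illustrative example against the nullable Int64 dtype, not code from the thread):

```python
import pandas as pd

df = pd.DataFrame({"a": pd.array([1, 3, None], dtype="Int64")})
mask = df["a"] > 2            # boolean mask containing <NA> for the missing row
out = df[mask.fillna(False)]  # be explicit: treat <NA> as False when masking
print(out)
```

With `fillna(True)` instead, the missing row would be kept, which is exactly the choice the raising behaviour forces the user to make.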

@jorisvandenbossche added the API Design, Missing-data, and NA - MaskedArrays labels Jan 31, 2020
@tdpetrou
Contributor Author

tdpetrou commented Jan 31, 2020

@TomAugspurger - Edited post to include traceback

This is going to be extraordinarily confusing for users and an absolute pain to have all these data types with all these different behaviors when filtering with missing values.

How is anybody going to reasonably use nullable integers to filter their data? Convert back to float first?

Users are going to be so confused that nullable integers filter differently than floats, and that strings only filter for equality, while object columns work (even with pd.NA in them).

>>> df1 = pd.DataFrame({'a': ['rain', 'shine', pd.NA]}) # dtype is object
>>> df1.query('a == "rain"') # works
>>> df1.query('a <= "rain"') # works

@jorisvandenbossche
Member

Using query with strings works... df.query('b == "rain"')

This seems a bug, it should be consistent with boolean selection df[df['b'] == 'rain']

How is anybody going to reasonably use nullable integers to filter their data? Convert back to float first?

There is no need to convert to float. You can still filter with missing values, but for now you need to be explicit about how to treat them (with fillna), as I explained above. To be clear, this is not set in stone. Feedback is very valuable, so let's discuss the specifics.

@TomAugspurger
Contributor

TomAugspurger commented Jan 31, 2020

And it's important to distinguish the two operations here. We have

mask = integer_array == 1  # comparison propagates NA
filtered = df[mask]        # indexing raises on NA

Changing the behavior of NA in comparisons just (or primarily) for indexing isn't warranted IMO. There are situations where I'm doing a comparison operation not followed by indexing, and I'd like the NA to propagate.

So IMO the focus should be on the behavior of NA in indexing. @tdpetrou would you be able to go through the NA indexing discussion and summarize the arguments for the various options? (IIRC, the three were: raise, which we implemented; propagate NA; and treat NA as False.)

edit: Whoops, I didn't see #31503 (comment) until now, and realized I basically repeated Joris 😄

@dsaxton
Member

dsaxton commented Jan 31, 2020

My two cents: I think a sensible default would be to treat NA as False when filtering (so we only select values that are definitely true). This would be the same as a typical where clause in SQL, and in my opinion is generally what people want when they filter data.

@tdpetrou
Contributor Author

tdpetrou commented Jan 31, 2020

@TomAugspurger Agreed - NA comparisons should remain NA. For indexing, NA's should be treated as False. Problems solved. It would dramatically simplify things for the user. There are many reasons for this.

• Matches behavior of old pandas types
• Matches SQL
• Prevents you from having to fill in missing values (which you often don't know) - people would fill these in with garbage just to make the filter work or just convert to float.

Right now, filtering depends on the data type, which to me, is just absurd. The users are going to get so confused and frustrated.

The workflow right now for filtering nullable ints is one of two things:

>>> df.fillna({'a': garbage_number}).query('a > 2') # or
>>> df.astype({'a': 'float'}).query('a > 2')

There are so many instances where you just want to filter on the known data without having to go through these huge unnecessary headaches.

@TomAugspurger
Contributor

Try to keep things respectful... We put a good amount of discussion into this. My recollection is that raising seemed the safest thing to do for now, in part because it matches the behavior of NaN, and in part because it left open the door to specifying a behavior later on.


• Matches behavior of old pandas types

Can you clarify that? When indexing with NaNs, NumPy (or pandas) will raise, right?

@TomAugspurger
Contributor

TomAugspurger commented Jan 31, 2020

My two cents: I think a sensible default would be to treat NA as False when filtering (so we only select values that are definitely true).

This does make a ton of sense for boolean masking. Keep in mind, though, that our __getitem__ also allows fancy indexing (list of integers). What is the expected behavior of the following?

In [3]: s = pd.Series([1, 2, 3])

In [4]: m = pd.array([1, 1, None])

In [5]: s[m]

It's not at all clear to me how the NA should be treated there. When you index with a list of integers, you'd expect the output shape to match the shape of the indexer, right? Should that value be NA? Should it be excluded?

Would it be strange to treat NA in an IntegerArray differently than NA in a boolean array? We don't have a typed NA (NA[bool], NA[int]), though it was discussed.

@jorisvandenbossche
Member

@dsaxton Thanks for the input!

Matches behavior of old pandas types

Can you clarify that? When indexing with NaNs, NumPy (or pandas) will raise, right?

@TomAugspurger but you typically didn't get booleans with missing values, as the comparison didn't result in them. So for the combined workflow (comparison + boolean indexing), it mostly matches old behaviour

Right now, filtering depends on the data type, which to me, is just absurd.

@tdpetrou this is not absurd. We are adding those dtypes because we want to improve things, and yes, that means they don't behave exactly the same as the old ones. There is a reason they are labeled as experimental, and are not yet the default.
This means we are going to have a period where we have multiple dtypes, yes. And I agree this will give a period where this can cause more confusion. But IMO this is unavoidable if we want to improve things. Better documentation might help? Feedback on this is also certainly welcome from someone looking at this with a fresh eye.

@jorisvandenbossche
Member

It's not at all clear to me how the NA should be treated there.

@TomAugspurger I think @dsaxton was speaking about boolean indexing. In integer positional indexing, I think we should continue raising an error. It gives some inconsistency, but I personally think that would be fine.

@tdpetrou
Contributor Author

Here's my final feedback on this: for filtering, NAs should be equivalent to False, and filtering would be very difficult if the data types did not all exhibit the same behavior.

@TomAugspurger
Contributor

It gives some inconsistency, but I personally think that would be fine.

If people are OK with this, then I'm on board too. I think the inconsistency is OK, since we're saying the dtype of the array is what determines the behavior, rather than the values.

cc @jbrockmendel @jreback for thoughts. I think the summary is that NA be treated as False in __getitem__ when doing a boolean mask. In all other cases (notably, positional indexing with an IntegerArray) we'd continue to raise.


If we have consensus, are you interested in working on this @tdpetrou?

@jbrockmendel
Member

I think the summary is that NA be treated as False in __getitem__ when doing a boolean mask

I'm fine with this, but would propose going a half-step further and have NA.__bool__ return False instead of raising. This would keep the scalar behavior consistent with the array-element behavior.

It would also prevent a ton of problems that will come up when users start getting pd.NA as the name of an Index/Series, xref #31227.
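For context, pd.NA currently has no truth value at all, so the proposed NA.__bool__ change would relax the behavior shown below (a quick check, not code from the thread):

```python
import pandas as pd

# Today, bool(pd.NA) deliberately raises rather than guessing a value;
# the proposal above would have it return False instead.
try:
    bool(pd.NA)
except TypeError as exc:
    print(exc)  # "boolean value of NA is ambiguous"
```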

@TomAugspurger
Contributor

TomAugspurger commented Jan 31, 2020 via email

@jbrockmendel
Member

What kind of operation gets you NA as the name?

If you can get NA in the index of a DataFrame, then df.iloc[n] gives a Series with NA as the name. Or is getting NA in an Index something we want to rule out (or deem unlikely)?

@TomAugspurger
Contributor

Or is getting NA in an Index something we want to rule out (or deem unlikely)?

It's allowed right now, but the semantics of it aren't well fleshed out. Regardless, I think we have a proposal for boolean masking. We can discuss NA.__bool__ in a dedicated issue.

@jorisvandenbossche
Member

For me, the query use case is also a good reason to go with the skipping-NA behaviour. The advice "you can be explicit by doing s[mask.fillna(False/True)]" doesn't work with query, since the boolean array gets created inside the function.
So for query to be workable with nullable dtypes, it would need to use a behaviour that doesn't raise (or it would need to add a keyword for it, which would also be ugly I think). And having inconsistent behaviour between getitem and query also sounds like a no-go.

But to be sure we are fully aware of this: this does not fully match the behaviour of old pandas dtypes.
In the specific case of "not equal", this does give something else as before:

With numpy object dtype:

In [8]: s = pd.Series(["a", "b", np.nan])

In [9]: s != "b"  
Out[9]: 
0     True
1    False
2     True
dtype: bool

In [10]: s[s != "b"]    
Out[10]: 
0      a
2    NaN
dtype: object

while with the nullable string dtype, it would become (the last one now raises):

In [11]: s = pd.Series(["a", "b", pd.NA], dtype="string")  

In [12]: s != "b"  
Out[12]: 
0     True
1    False
2     <NA>
dtype: boolean

In [13]: s[s != "b"] 
Out[13]: 
0      a
dtype: string

I personally think I am fine with this behaviour, but it is a change that is harder to detect when switching to the new nullable dtypes.

@jorisvandenbossche jorisvandenbossche changed the title query method (and boolean selection) do not work for nullable integer type and sometimes for string API: query / boolean selection with nullable dtypes with NAs Feb 10, 2020
@TomAugspurger
Contributor

I'm also OK with that difference.
