Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support min/max on ArrowStringArray #42597

Closed
mrocklin opened this issue Jul 18, 2021 · 19 comments
Closed

ENH: Support min/max on ArrowStringArray #42597

mrocklin opened this issue Jul 18, 2021 · 19 comments
Labels
Arrow pyarrow functionality Bug Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data

Comments

@mrocklin
Copy link
Contributor

Motivation

In order for Dask to perform large shuffles (set_index, join on a non-index column, ...) on a column it needs to be able to compute quantiles.

To do this it is useful to compute min/max values.

What actually breaks

When I try to do this on columns of type string[pyarrow] I get the following exception

import pandas as pd
s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
s.min()
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10825         )
  10826         def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827             return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
  10828 
  10829         setattr(cls, "min", min)

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10348 
  10349     def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350         return self._stat_function(
  10351             "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
  10352         )

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
  10343                 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
  10344             )
> 10345         return self._reduce(
  10346             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  10347         )

~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4380         if isinstance(delegate, ExtensionArray):
   4381             # dispatch to ExtensionArray interface
-> 4382             return delegate._reduce(name, skipna=skipna, **kwds)
   4383 
   4384         else:

~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
    377     def _reduce(self, name: str, skipna: bool = True, **kwargs):
    378         if name in ["min", "max"]:
--> 379             return getattr(self, name)(skipna=skipna)
    380 
    381         raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

AttributeError: 'ArrowStringArray' object has no attribute 'min'

Solution

I am hopeful that Arrow maybe already has an min/max implementation and they just haven't been hooked up yet.

@mrocklin mrocklin added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 18, 2021
@SusmithBarigidad
Copy link

Hi Mrocklin,

I am pretty new to open contributions, I can give a try on this!
Thanks

SusmithBarigidad added a commit to SusmithBarigidad/pandas that referenced this issue Jul 18, 2021
SusmithBarigidad added a commit to SusmithBarigidad/pandas that referenced this issue Jul 18, 2021
@TomAugspurger
Copy link
Contributor

Looks like pyarrow doesn't know how to do min / max on string data yet:

In [12]: import pyarrow.compute

In [13]: pyarrow.compute.min_max(s.array._data)
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-13-22066b22cdbd> in <module>
----> 1 pyarrow.compute.min_max(s.array._data)

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/compute.py in min_max(array, options, memory_pool, **kwargs)

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Function min_max has no kernel matching input types (array[string])

@simonjayhawkins remind me, do we have a general policy on whether Arrow StringDtype will cast and fall back to a slow mode, or do we require users to do it explicitly?

@lithomas1 lithomas1 added Strings String extension data type and string data and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 19, 2021
@mrocklin
Copy link
Contributor Author

Looks like pyarrow doesn't know how to do min / max on string data yet:

Is there someone on the Arrow side that we should ping?

@TomAugspurger
Copy link
Contributor

I was going to file a JIRA at https://issues.apache.org/jira/login.jsp but I don't have my login saved on this machine :)

@simonjayhawkins
Copy link
Member

@simonjayhawkins remind me, do we have a general policy on whether Arrow StringDtype will cast and fall back to a slow mode, or do we require users to do it explicitly?

yes, we basically make ArrowStringArray match StringArray behaviour, falling back where needed. Users do not need to do anything special.

There are a few xfailed tests for ArrowStringArray that did not get fixed in time for 1.3.0. min/max is one of those. IIRC I might have fixed in #40962.

will look soon.

@simonjayhawkins
Copy link
Member

I'll relabel as a bug and we will backport the fix.

@simonjayhawkins simonjayhawkins added this to the 1.3.1 milestone Jul 19, 2021
@jreback
Copy link
Contributor

jreback commented Jul 19, 2021

I'll relabel as a bug and we will backport the fix.

does pyarrow 4.0.1 supporot this?

@simonjayhawkins
Copy link
Member

IIRC I might have fixed in #40962.

mistaken, that PR only changed the error to be a NotImplementedError instead of either TypeError: '<=' not supported between instances of 'str' and 'NoneType' or AttributeError: 'ArrowStringArray' object has no attribute 'max' as seen here when called on a Series.

@TomAugspurger
Copy link
Contributor

yes, we basically make ArrowStringArray match StringArray behaviour, falling back where needed. Users do not need to do anything special.

Hmm, I might open an issue to revisit that. Those kinds of silent performance cliffs are hard to debug. IMO, we should wait / help PyArrow implement the efficient kernels and then wrap them here, without the expensive conersion to Python objects.

@mrocklin
Copy link
Contributor Author

IMO, we should wait / help PyArrow implement the efficient kernels and then wrap them here, without the expensive conersion to Python objects

+1 . For my very narrow use case I'm very much in favor of focusing on adding min/max than on fallback behavior. I'm not looking at the whole picture though.

@simonjayhawkins
Copy link
Member

I think we discussed somewhere about adding performance warnings when object fallback was in play. will try to find it.

@simonjayhawkins
Copy link
Member

IMO, we should wait / help PyArrow implement the efficient kernels and then wrap them here, without the expensive conersion to Python objects

+1 . For my very narrow use case I'm very much in favor of focusing on adding min/max than on fallback behavior. I'm not looking at the whole picture though.

our min supported version of PyArrow for ArrowStringArray is 1.0.0 and so the kernels guaranteed available are limited.

@simonjayhawkins
Copy link
Member

our min supported version of PyArrow for ArrowStringArray is 1.0.0 and so the kernels guaranteed available are limited.

at one point we had #39908 (comment) to make this apparent to the users but was removed from the release notes and not yet added elsewhere in the documentation.

@TomAugspurger
Copy link
Contributor

We discussed fallback behavior a bit in #35169. I opened #42613 dedicated to deciding on a general policy to apply.

@simonjayhawkins
Copy link
Member

does pyarrow 4.0.1 supporot this?

it does not support pc.min_max on a pyarrow string array but does support pc.sort_indices on a string type. so could use that for now to avoid conversion to object.

def min(arr):
    argmin = pc.sort_indices(arr)[0].as_py()
    return arr[argmin].as_py()

def max(arr):
    if arr.null_count:
        arr = arr.filter(arr.is_valid())
    argmax = pc.sort_indices(arr)[-1].as_py()
    return arr[argmax].as_py()

I've not done a comprehensive performance analysis, but min would be quicker on very small arrays than object fallback and max would be about the same. On larger arrays (> 10_000 items) fallback to object would be more performant.

@simonjayhawkins
Copy link
Member

i suspect we could improve the performance for larger arrays using pc.partition_nth_indices

@TomAugspurger
Copy link
Contributor

IMO, it's not worth rushing a short-term fix. I opened https://issues.apache.org/jira/browse/ARROW-13410. I have no sense for how difficult this would be to implement in Arrow, but if there's already support for sorting then perhaps min / max won't be too difficult.

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@mroeschke mroeschke added Numeric Operations Arithmetic, Comparison, and Logical operations Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Aug 21, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 10, 2021
@simonjayhawkins
Copy link
Member

I'll change the milestone to "no action" for now as there now does not seem to be an appetite to continue down the object fallback path #42613. (personally I think the object fallback would help adoption)

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, No action Oct 13, 2021
@mroeschke mroeschke added the Arrow pyarrow functionality label Oct 13, 2022
@mroeschke mroeschke removed this from the No action milestone Oct 13, 2022
@mroeschke
Copy link
Member

This works now on main and 1.5. I believe this was fixed by #47730 when ArrowStringArray inherited from ArrowExtensionArray which defined _reduce. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Bug Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data
Projects
None yet
Development

No branches or pull requests

7 participants