
ComputeError when applying a UDF in a group_by aggregation #13557

Closed
trendelkampschroer opened this issue Jan 9, 2024 · 3 comments · Fixed by #14135

Assignees: deanm0000
Labels: bug (Something isn't working), python (Related to Python Polars)

Comments

@trendelkampschroer

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

if __name__ == "__main__":
    frame = pl.from_dict({"id": ["a", "a", "b", "b"], "values": [0.1, 0.1, -0.1, -0.1]})
    frame.group_by(by="id").agg(pl.col("values").log1p().sum().pipe(np.expm1))
>>>
Traceback (most recent call last):
  File "bug_udf_in_aggregation.py", line 5, in <module>
    frame.group_by(by="id").agg(pl.col("values").log1p().sum().pipe(np.expm1))
  File "polars/dataframe/group_by.py", line 240, in agg
    self.df.lazy()
  File "polars/lazyframe/frame.py", line 1749, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: cannot aggregate, the column is already aggregated

Error originated in expression: 'col("values").log1p().sum().python_udf()'

Log output

No response

Issue description

The same expression was working in polars-0.19.19.

Expected behavior

The UDF should be applied to the per-group sum, and no exception should be raised.
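
For reference, with the example data above the mathematically expected per-group result follows from expm1(log1p(x) + log1p(x)) = (1 + x)**2 - 1:

    id "a": (1 + 0.1)**2 - 1 =  0.21
    id "b": (1 - 0.1)**2 - 1 = -0.19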

Installed versions

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:35:25) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.7.3
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
trendelkampschroer added the bug (Something isn't working) and python (Related to Python Polars) labels on Jan 9, 2024
@deanm0000
Collaborator

deanm0000 commented Jan 10, 2024

It's because of #13181.

If you do...

(
    frame
    .group_by(by="id")
    .agg(pl.col("values").log1p().sum().map_batches(np.expm1, is_elementwise=True))
)

then it works.

One quick note on background: when you use a ufunc, polars recognizes it and automatically runs it through map_batches, so even though you're not explicitly typing map_batches, it is being called.
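
To make that concrete, a rough sketch of the equivalence (this just builds expressions, nothing is evaluated; the names a/b/c are only for illustration):

import numpy as np
import polars as pl

expr = pl.col("values").log1p().sum()

# pipe simply calls np.expm1(expr); calling the ufunc on an Expr triggers
# Expr.__array_ufunc__, which wraps the ufunc in map_batches (see the
# snippet quoted below).
a = expr.pipe(np.expm1)
b = np.expm1(expr)
c = expr.map_batches(np.expm1)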

The issue is that map_batches effectively used to not work in a group_by or over: it would just ignore the groups and operate on the whole column. For something like expm1 that doesn't matter, because it only operates on one value at a time, but for something like a sort, which needs all the data, it would give wrong results in group contexts.

To fix that behavior @ritchie46 added a parameter to map_batches to tell it whether or not you're doing an elementwise operation, so you're getting the short end of that stick here.

Maybe there's a way to make this chunk a bit smarter:

def __array_ufunc__(
    self, ufunc: Callable[..., Any], method: str, *inputs: Any, **kwargs: Any
) -> Self:
    """Numpy universal functions."""
    num_expr = sum(isinstance(inp, Expr) for inp in inputs)
    if num_expr > 1:
        if num_expr < len(inputs):
            raise ValueError(
                "NumPy ufunc with more than one expression can only be used"
                " if all non-expression inputs are provided as keyword arguments only"
            )

        exprs = parse_as_list_of_expressions(inputs)
        return self._from_pyexpr(pyreduce(partial(ufunc, **kwargs), exprs))

    def function(s: Series) -> Series:  # pragma: no cover
        args = [inp if not isinstance(inp, Expr) else s for inp in inputs]
        return ufunc(*args, **kwargs)

    return self.map_batches(function)

One idea is to check ufunc.types. The doc is a little vague, but it seems like elementwise functions such as np.expm1 have signatures like 'e->e' in np.expm1.types, whereas scipy.special.airy.types has entries like 'f->ffff'. So elementwise operations have one character to the right of the arrow and functions like airy have 4 characters after the arrow. @alexander-beedie do you know if I have that right?
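
For reference, a quick sketch of that introspection in plain NumPy/SciPy (looks_elementwise is just an illustrative helper here, not anything Polars provides):

import numpy as np
import scipy.special

# A ufunc's .types attribute lists its loop signatures, and .nout gives the
# number of outputs the ufunc produces.
print(np.expm1.types[:3])            # e.g. ['e->e', 'f->f', 'd->d']
print(scipy.special.airy.types[:1])  # e.g. ['f->ffff']

def looks_elementwise(ufunc: np.ufunc) -> bool:
    # One output, i.e. a single character after the arrow in each signature.
    # Whether this is a sufficient test is exactly the open question above.
    return ufunc.nout == 1

print(looks_elementwise(np.expm1))            # True
print(looks_elementwise(scipy.special.airy))  # False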

@trendelkampschroer
Author

Thanks a lot for suggesting this fix. To me the current state seems a bit unintuitive, i.e.

import polars as pl

def expm1(values: pl.Expr) -> pl.Expr:
    return values.exp() - 1.0

if __name__ == "__main__":
    frame = pl.from_dict({"id": ["a", "a", "b", "b"], "values": [0.1, 0.1, -0.1, -0.1]})
    frame.group_by(by="id").agg(pl.col("values").log1p().sum().pipe(expm1))

works perfectly fine, but with a numerically unstable implementation of expm1 (exp(x) - 1 loses precision for small x).
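
For concreteness, the instability is the usual cancellation for small inputs:

import numpy as np

x = 1e-12
# exp(x) rounds to a value extremely close to 1.0, so subtracting 1 keeps
# only a few correct significant digits, while np.expm1 stays accurate.
print(np.exp(x) - 1.0)
print(np.expm1(x))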

This means I cannot simply use a numpy 'ufunc' to replace a missing compute expression. IMHO this diminishes the reusability of ufuncs.

Having the ability to substitute a 'ufunc' for an expression is a great thing, since it leaves polars free to implement natively only those specialised numerical expressions it deems important for a wide enough user base.

In that light I am thinking about reopening #12891, since having a stable expm1 implementation available would actually be beneficial now.

@deanm0000
Collaborator

Turns out all the stock prebuilt ufuncs are elementwise, so I made a PR that changes how they're dispatched so that they use that flag.
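
In terms of the __array_ufunc__ snippet quoted above, the dispatch change presumably boils down to something along these lines (a sketch only; the actual PR may differ in its details):

    return self.map_batches(function, is_elementwise=True)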

deanm0000 removed the needs triage (Awaiting prioritization by a maintainer) label on Jan 31, 2024
deanm0000 self-assigned this on Jan 31, 2024