
Suggestion to update the UDF documentation #12912

Open
Chuck321123 opened this issue Dec 6, 2023 · 23 comments
Labels
documentation Improvements or additions to documentation

Comments

@Chuck321123

Chuck321123 commented Dec 6, 2023

Description

After reading the documentation, I'm still struggling with how to implement longer UDFs in Polars. It would be nice if one or several examples could be added. Also, in pandas I used transform a lot compared to apply and map, as it let me do vectorized UDF operations like this: df.groupby("Group")["Relevant_Column"].transform(My_UDF_function). How would this work in Polars if the UDF becomes slightly more advanced? It would be nice to update the documentation and see more examples of this.

Link

https://pola-rs.github.io/polars/user-guide/expressions/user-defined-functions/


Chuck321123 added the documentation label Dec 6, 2023
@MarcoGorelli
Collaborator

could you show an example of a udf you'd like to use?

@Chuck321123
Author

Chuck321123 commented Dec 6, 2023

@MarcoGorelli Unfortunately, I don't have any right now. I do, however, have a simpler UDF that involves a third-party package. Let's take this fictitious example:

import Thirdpartypackage as tp

df["New_Col"] = df.groupby("Group")["Relevant_Column"].transform(
    lambda x: tp.tp_attribute(necessary_variable=x, optional_variable=2)
)

It would, for example, be nice to extend the documentation around how one would implement things like this.

@MarcoGorelli
Collaborator

thanks - have you looked at https://pola-rs.github.io/polars/user-guide/migration/pandas/#pandas-transform ? does this answer your question?

(btw, note that if you're running a third party package with a lambda, then pandas transform won't be vectorised either)
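For illustration, a minimal sketch (with a hypothetical udf standing in for the third-party call): the function passed to transform runs as plain Python once per group, so nothing is vectorised across groups.

import pandas as pd

df = pd.DataFrame({"Group": ["A", "A", "B", "C"], "x": [1.0, 2.0, 3.0, 4.0]})

def udf(s):
    # Stand-in for a third-party call; the print makes each Python-level call visible
    print(f"python-level call on a group of size {len(s)}")
    return s * 2

df["y"] = df.groupby("Group")["x"].transform(udf)
# Expect one printed line per group - pandas loops over the groups in Python here.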

@deanm0000
Collaborator

One thing to note is how powerful numba's guvectorize is for making UDFs, since it produces vectorized ufuncs.
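As a minimal sketch (assuming numba is installed; the rolling_mean kernel and window handling here are illustrative, not from this thread):

import numba
import numpy as np
import polars as pl

@numba.guvectorize([(numba.float64[:], numba.int64, numba.float64[:])], "(n),()->(n)")
def rolling_mean(values, window, out):
    # Trailing mean over `window` samples; NaN until the window is full
    acc = 0.0
    for i in range(len(values)):
        acc += values[i]
        if i >= window:
            acc -= values[i - window]
        out[i] = acc / window if i >= window - 1 else np.nan

df = pl.DataFrame({"Values": [1.0, 2.0, 3.0, 4.0]})
df.with_columns(
    MA2=pl.col("Values").map_batches(lambda s: pl.Series(rolling_mean(s.to_numpy(), 2)))
)
# The compiled ufunc runs at native speed on the whole Series at once.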

@Chuck321123
Author

@MarcoGorelli You are right. Yes, I looked at it and understand that Polars wants us to use native functions instead, although that's not always possible. I tried this:

df = df.with_columns(
    pl.col("Relevant_Column")
    .map_batches(lambda x: tp.tp_attribute(necessary_variable=x, optional_variable=2))
    .over("Group")
    .alias("New_Col")
)

but with no luck (map and apply have been renamed to map_batches and map_elements).

@MarcoGorelli
Collaborator

thanks - what do you mean by 'no luck'? could you make a reproducible example, please?

@deanm0000
Collaborator

You often need to wrap the custom function in pl.Series, like:

df = df.with_columns(
    pl.col("Relevant_Column")
    .map_batches(lambda x: pl.Series(tp.tp_attribute(necessary_variable=x, optional_variable=2)))
    .over("Group")
    .alias("New_Col")
)

@zaa730Sight

zaa730Sight commented Dec 6, 2023

One thing to note is how powerful using numba's guvectorize is for making UDFs since they are vectorized ufuncs.

Strongly agree with this. That being said, for heavy usage the syntax gets kind of clunky. For users who know what they're doing, however, I suspect the Polars team would rather steer them toward writing native Rust kernels with the pyo3-polars extension.

But all in all, I find it hard to beat the rapid prototyping and performance that Numba's guvectorize + Polars (including LazyFrame evaluation) can get you, all from a single Python environment. There are some rough edges with this, though.

Starting with the next Numba release, we should get this: numba/numba#9058
It would be ideal to get a multi-dimensional (MxN) output from a single call. Perhaps Polars would be able to wrap that in a Struct/List/Array of values as a single column (currently it can natively hand you back a pl.Series of compatible type).

Also, looking to the future, it would be interesting to see whether Polars will play nicely with Mojo kernels/code, as (assuming Mojo takes off) I wager there would be a decent overlap between the user bases.

@Chuck321123
Author

Chuck321123 commented Dec 6, 2023

@deanm0000 I tried, and unfortunately, it didn't work. @MarcoGorelli, here is a simple reproducible example:

import pandas as pd
import numpy as np
import pandas_ta as ta
import polars as pl
np.random.seed(42)
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50) 
})
df = df.sort_values(by='Group')
df["Moving_Average"] = (df.groupby('Group')["Values"].transform(lambda x: ta.sma(close=x, length=2))).bfill()
# Alternative:
def My_UDF_Function(x):
    return ta.sma(close=x, length=2)
df["MA2"] = df.groupby('Group')["Values"].transform(My_UDF_Function).bfill()

Keep in mind that you need to do pip install pandas-ta first. I already know this can be done in Polars using the rolling functions, but I thought I'd leave an example here.
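For reference, a minimal sketch of that native route (assuming the same df as above): the built-in rolling_mean plus over replaces the UDF entirely.

import polars as pl

dfpl = pl.from_pandas(df)
dfpl = dfpl.with_columns(
    Moving_Average_native=pl.col("Values")
    .rolling_mean(window_size=2)
    .over("Group")
    .fill_null(strategy="backward")  # mirrors the .bfill() in the pandas version
)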

@cmdlineluser
Contributor

.map_batches() is not "group aware", it needs to be .map_elements()

It is mentioned in the Notes section of the docs, but perhaps it could be made clearer.

If you are looking to map a function over a window function or group_by context, refer to func:map_elements instead.

(Looks like there is also some formatting issue with the rst file?)

@MarcoGorelli
Collaborator

looks to me like pandas-ta specifically wants a pandas series - not a polars series, and not a numpy array; it won't work with anything else:

In [42]: ta.sma(dfpl['Values'], length=2)

In [43]: ta.sma(dfpl['Values'].to_numpy(), length=2)

In [44]: ta.sma(pd.Series(dfpl['Values']), length=2)
Out[44]:
0          NaN
1     0.795927
2     0.643506
3     0.690988
4     0.525113
5     0.267241
6     0.262692
7     0.494509
8     0.680310
9     0.742702
10    0.762478
11    0.646630
12    0.389337
13    0.055688
14    0.333522
15    0.333357
16    0.090452
17    0.220633
18    0.362725
19    0.410134
20    0.665995
21    0.678027
22    0.531861
23    0.698057
24    0.826965
25    0.660905
26    0.422452
27    0.240534
28    0.102339
29    0.347099
30    0.375864
31    0.235960
32    0.545532
33    0.841617
34    0.879748
35    0.766130
36    0.498510
37    0.551201
38    0.502415
39    0.153784
40    0.384721
41    0.589381
42    0.424279
43    0.496428
44    0.431987
45    0.451654
46    0.786145
47    0.776003
48    0.384494
49    0.283826
Name: SMA_2, dtype: float64

so, barring converting to pandas (or requesting that they support polars) I'm not sure there's much that can be done here


@mkleinbort-ic

For reference, here is a good example of nb.guvectorize

https://stackoverflow.com/questions/77523657/how-do-you-condense-long-recursive-polars-expressions/77525171?noredirect=1#comment136683582_77525171

But going back to your question - have you tried just using .map_elements on the list of values itself?

Polars code

import numpy as np
import polars as pl

df = pl.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50) 
})

df1 = df.group_by('Group').agg(pl.col('Values'))
# This is what we have so far:

# shape: (3, 2)
# ┌───────┬──────────────────────────────────┐
# │ Group ┆ Values                           │
# │ ---   ┆ ---                              │
# │ str   ┆ list[f64]                        │
# ╞═══════╪══════════════════════════════════╡
# │ C     ┆ [0.035229, 0.925109, … 0.297392] │
# │ A     ┆ [0.789259, 0.004903, … 0.56967]  │
# │ B     ┆ [0.917179, 0.396942, … 0.962115] │
# └───────┴──────────────────────────────────┘

df2 = df1.with_columns(**{
    'Moving Average': pl.col('Values').map_elements(lambda x: ...)  # Use your UDF that takes a list of values here
})

# Then an explode to have it in the original format.
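A sketch of that explode step (hypothetical, assuming df2 above ends up with equal-length list columns):

df_out = df2.explode('Values', 'Moving Average')
# One row per original value again, with the group-wise result alongside it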

@MarcoGorelli
Collaborator

that won't work I'm afraid

In [14]: df2 = df1.with_columns(**{
    ...:     'Moving Average': pl.col('Values').map_elements(lambda x: ta.sma(x, length=2))  # Use your UDF that takes in list of values here
    ...: })

In [15]: df2
Out[15]:
shape: (3, 3)
┌───────┬──────────────────────────────────┬────────────────┐
│ Group ┆ Values                           ┆ Moving Average │
│ ---   ┆ ---                              ┆ ---            │
│ str   ┆ list[f64]                        ┆ list[null]     │
╞═══════╪══════════════════════════════════╪════════════════╡
│ B     ┆ [0.015636, 0.165267, … 0.520834] ┆ null           │
│ C     ┆ [0.230894, 0.241025, … 0.385417] ┆ null           │
│ A     ┆ [0.844534, 0.74732, … 0.651077]  ┆ null           │
└───────┴──────────────────────────────────┴────────────────┘

The issue, as far as I can tell, is that pandas-ta only accepts pandas series, and nothing else (not even numpy array)

I'd suggest opening a feature request to them to accept numpy arrays (or, even better, Polars)

@mkleinbort-ic

mkleinbort-ic commented Dec 7, 2023

The issue, as far as I can tell, is that pandas-ta only accepts pandas series, and nothing else (not even numpy array)

Can't that be fixed with code like:

import pandas as pd

def pandas_ta_wrapper(x: list) -> pd.Series:
    return pd.Series(x)

df2 = df1.with_columns(**{
    'Moving Average': pl.col('Values').map_elements(lambda x: ta.sma(pandas_ta_wrapper(x), length=2))
})

@MarcoGorelli
Collaborator

MarcoGorelli commented Dec 7, 2023

You'd need to then convert back to Polars

This works:

(
    dfpl.with_columns(
        MA2=pl.col("Values")
        .map_batches(lambda x: pl.Series(ta.sma(x.to_pandas(), length=2)))
        .fill_null(strategy="backward")
        .over("Group")
    )
)

EDIT

This comment is not correct, it requires map_elements, not map_batches

@Chuck321123
Author

Thanks for the input. Yeah, I didn't think the package explicitly needed a pandas Series to work, @MarcoGorelli. However, I tried your solution and it worked, surprisingly with ~30% better performance on larger datasets.

@cmdlineluser
Contributor

The last example uses .map_batches(), so I don't think it's equivalent.

(It's not generating per-group values.)

@MarcoGorelli
Collaborator

MarcoGorelli commented Dec 7, 2023

it's using map_batches, followed by over, so it should be equivalent - at least, if I try it, I get the same result. (I may have missed boundary details, just checking.)

Hmm, this doesn't behave as I would have expected it to - I've opened #12941 about this:

In [11]: df = pl.DataFrame({'a': [1,1, 2], 'b': [4,5,6]})

In [12]: df.with_columns(pl.col('b').map_batches(lambda x: x.shift()).over('a'))
Out[12]:
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 1   ┆ 4    │
│ 2   ┆ 5    │
└─────┴──────┘

@cmdlineluser
Contributor

Yeah, I mentioned it earlier: #12912 (comment)

map_batches always passes the full column, it doesn't do groups.

If you are looking to map a function over a window function or group_by context, refer to func:map_elements instead.

(The distinction was perhaps clearer with the previous names of map vs. apply because one could think of a "group" as a "batch")

import pandas_ta as ta
import polars as pl

df = pl.DataFrame({
   "Group":  ["A", "A", "B", "C", "C", "C"],
   "Values": [1, 2, 3, 4, 5, 6]
})
df.with_columns(
   MA2=pl.col("Values")
   .map_batches(lambda x: [print(x), pl.Series(ta.sma(x.to_pandas(), length=2))][1])
   .over("Group")
)
# shape: (6,)
# Series: 'Values' [i64]
# [
# 	1
# 	2
# 	3
# 	4
# 	5
# 	6
# ]
# shape: (6, 3)
# ┌───────┬────────┬──────┐
# │ Group ┆ Values ┆ MA2  │
# │ ---   ┆ ---    ┆ ---  │
# │ str   ┆ i64    ┆ f64  │
# ╞═══════╪════════╪══════╡
# │ A     ┆ 1      ┆ null │
# │ A     ┆ 2      ┆ 1.5  │
# │ B     ┆ 3      ┆ 2.5  │
# │ C     ┆ 4      ┆ 3.5  │
# │ C     ┆ 5      ┆ 4.5  │
# │ C     ┆ 6      ┆ 5.5  │
# └───────┴────────┴──────┘
df.with_columns(
   MA2=pl.col("Values")
   .map_elements(lambda x: [print(x), pl.Series(ta.sma(x.to_pandas(), length=2))][1])
   .over("Group")
)
# shape: (2,)
# Series: '' [i64]
# [
# 	1
# 	2
# ]
# shape: (1,)
# Series: '' [i64]
# [
# 	3
# ]
# shape: (3,)
# Series: '' [i64]
# [
# 	4
# 	5
# 	6
# ]
# shape: (6, 3)
# ┌───────┬────────┬──────┐
# │ Group ┆ Values ┆ MA2  │
# │ ---   ┆ ---    ┆ ---  │
# │ str   ┆ i64    ┆ f64  │
# ╞═══════╪════════╪══════╡
# │ A     ┆ 1      ┆ null │
# │ A     ┆ 2      ┆ 1.5  │
# │ B     ┆ 3      ┆ null │
# │ C     ┆ 4      ┆ null │
# │ C     ┆ 5      ┆ 4.5  │
# │ C     ┆ 6      ┆ 5.5  │
# └───────┴────────┴──────┘

@deanm0000
Collaborator

I petitioned for this one to be reopened. It really seems like map_batches ought to "turn into" map_elements when it's inside an over/agg.

@MarcoGorelli
Collaborator

Thanks all for your comments!

Right, so going back to the original example, the solution is indeed to use map_elements (not map_batches!) followed by over:

import pandas as pd
import numpy as np
import pandas_ta as ta
import polars as pl
from polars.testing import assert_series_equal
np.random.seed(42)
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50)
})
df = df.sort_values(by='Group')
df["Moving_Average"] = (df.groupby('Group')["Values"].transform(lambda x: ta.sma(close=x, length=2))).bfill()
# Alternative:
def My_UDF_Function(x):
    return ta.sma(close=x, length=2)

dfpl = pl.from_pandas(df)

dfpl = (
    dfpl.with_columns(
        MA2=pl.col("Values")
        .map_elements(lambda x: pl.Series(ta.sma(x.to_pandas(), length=2)))
        .fill_null(strategy="backward")
        .over("Group")
    )
)

df["MA2"] = df.groupby('Group')["Values"].transform(My_UDF_Function).bfill()

dfpl = dfpl.with_columns(
    pandas_MA2=pl.from_pandas(df['MA2'])
)
assert_series_equal(dfpl['MA2'], dfpl['pandas_MA2'], check_names=False)

I'll try to update the docs

@Mickychen00

Mickychen00 commented Dec 18, 2023

The user guide's UDF page still needs to be updated. The current version talks about map and apply, two outdated functions.
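For reference, a minimal sketch of the rename (the old/new names are as stated earlier in this thread; the example data is illustrative):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

df.with_columns(
    doubled_batch=pl.col("a").map_batches(lambda s: s * 2),  # whole Series at once (previously `map`)
    doubled_elem=pl.col("a").map_elements(lambda v: v * 2, return_dtype=pl.Int64),  # value by value (previously `apply`)
)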
