
Suggestion to update the UDF documentation #12912

Open
Chuck321123 opened this issue Dec 6, 2023 · 23 comments
Labels
documentation Improvements or additions to documentation

Comments

@Chuck321123

Chuck321123 commented Dec 6, 2023

Description

After reading the documentation, I'm still struggling with how to implement longer UDFs in Polars. It would be nice if one or several examples could be added. Also, in pandas I used transform a lot compared to apply and map, as it let me do vectorized UDF operations like this: df.groupby("Group")["Relevant_Column"].transform(My_UDF_function). How would this work in Polars if the UDF becomes slightly more advanced? It would be nice to update the documentation and see more examples of this.

Link

https://pola-rs.github.io/polars/user-guide/expressions/user-defined-functions/


Chuck321123 added the documentation label Dec 6, 2023
@MarcoGorelli
Collaborator

could you show an example of a udf you'd like to use?

@Chuck321123
Author

Chuck321123 commented Dec 6, 2023

@MarcoGorelli Unfortunately, I don't have any right now. I do, however, have a simpler UDF that involves a third-party package. Let's take this fictitious example:

import Thirdpartypackage as tp

df["New_Col"] = df.groupby("Group")["Relevant_Column"].transform(
    lambda x: tp.tp_attribute(necessary_variable=x, optional_variable=2)
)

It would, for example, be nice to extend the documentation around how one would implement things like this.

@MarcoGorelli
Collaborator

thanks - have you looked at https://pola-rs.github.io/polars/user-guide/migration/pandas/#pandas-transform ? does this answer your question?

(btw, note that if you're running a third party package with a lambda, then pandas transform won't be vectorised either)
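For illustration, a minimal sketch (with a hypothetical udf standing in for the third-party call): the function passed to transform runs as plain Python once per group, so nothing is vectorised across groups.

import pandas as pd

df = pd.DataFrame({"Group": ["A", "A", "B", "C"], "x": [1.0, 2.0, 3.0, 4.0]})

def udf(s):
    # Stand-in for a third-party call; the print makes each Python-level call visible
    print(f"python-level call on a group of size {len(s)}")
    return s * 2

df["y"] = df.groupby("Group")["x"].transform(udf)
# Expect one printed line per group - pandas loops over the groups in Python here.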

@deanm0000
Collaborator

One thing to note is how powerful numba's guvectorize is for making UDFs, since it produces vectorized ufuncs.
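As a minimal sketch (assuming numba is installed; the rolling_mean kernel and window handling here are illustrative, not from this thread):

import numba
import numpy as np
import polars as pl

@numba.guvectorize([(numba.float64[:], numba.int64, numba.float64[:])], "(n),()->(n)")
def rolling_mean(values, window, out):
    # Trailing mean over `window` samples; NaN until the window is full
    acc = 0.0
    for i in range(len(values)):
        acc += values[i]
        if i >= window:
            acc -= values[i - window]
        out[i] = acc / window if i >= window - 1 else np.nan

df = pl.DataFrame({"Values": [1.0, 2.0, 3.0, 4.0]})
df.with_columns(
    MA2=pl.col("Values").map_batches(lambda s: pl.Series(rolling_mean(s.to_numpy(), 2)))
)
# The compiled ufunc runs at native speed on the whole Series at once.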

@Chuck321123
Author

@MarcoGorelli You are right. Yes, I looked at it and understand that Polars wants us to use native functions instead, although that's not always possible. I tried this:

df = df.with_columns(
    pl.col("Relevant_Column")
    .map_batches(lambda x: tp.tp_attribute(necessary_variable=x, optional_variable=2))
    .over("Group")
    .alias("New_Col")
)

but with no luck (map and apply have been renamed to map_batches and map_elements).

@MarcoGorelli
Collaborator

thanks - what do you mean by 'no luck'? could you make a reproducible example, please?

@deanm0000
Collaborator

You often need to wrap the custom function in pl.Series, like:

df = df.with_columns(
    pl.col("Relevant_Column")
    .map_batches(lambda x: pl.Series(tp.tp_attribute(necessary_variable=x, optional_variable=2)))
    .over("Group")
    .alias("New_Col")
)

@zaa730Sight

zaa730Sight commented Dec 6, 2023

One thing to note is how powerful using numba's guvectorize is for making UDFs since they are vectorized ufuncs.

Strongly agree with this. That being said, for heavy usage the syntax gets kind of clunky. For users who know what they're doing, however, I suspect the Polars team would rather steer them toward writing native Rust kernels with the pyo3-polars extension.

But all in all, I find it hard to beat the rapid prototyping and performance that Numba's guvectorize + Polars (including LazyFrame evaluation) can get you, all from a single Python environment. There are some rough edges with this, though.

Starting with the next Numba release, we should get this: numba/numba#9058
It would be ideal to get a multi-dimensional (MxN) output from a single call. Perhaps Polars would be able to wrap that in a Struct/List/Array of values as a single column (currently it can natively hand you back a pl.Series of compatible type).

Also, looking to the future, it would be interesting to see whether Polars will play nicely with Mojo kernels/code, as (assuming Mojo takes off) I wager there would be a decent overlap between the user bases.

@Chuck321123
Author

Chuck321123 commented Dec 6, 2023

@deanm0000 I tried, and unfortunately, it didn't work. @MarcoGorelli, here is a simple reproducible example:

import pandas as pd
import numpy as np
import pandas_ta as ta
import polars as pl
np.random.seed(42)
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50) 
})
df = df.sort_values(by='Group')
df["Moving_Average"] = (df.groupby('Group')["Values"].transform(lambda x: ta.sma(close=x, length=2))).bfill()
# Alternative:
def My_UDF_Function(x):
    return ta.sma(close=x, length=2)
df["MA2"] = df.groupby('Group')["Values"].transform(My_UDF_Function).bfill()

Keep in mind that you need to do pip install pandas-ta first. I already know this can be done in Polars using the rolling functions, but I thought I'd leave an example here.
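For reference, a minimal sketch of that native route (assuming the same df as above): the built-in rolling_mean plus over replaces the UDF entirely.

import polars as pl

dfpl = pl.from_pandas(df)
dfpl = dfpl.with_columns(
    Moving_Average_native=pl.col("Values")
    .rolling_mean(window_size=2)
    .over("Group")
    .fill_null(strategy="backward")  # mirrors the .bfill() in the pandas version
)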

@cmdlineluser
Contributor

.map_batches() is not "group aware", it needs to be .map_elements()

It is mentioned in the Notes section of the docs, but perhaps it could be made clearer.

If you are looking to map a function over a window function or group_by context, refer to func:map_elements instead.

(Looks like there is also some formatting issue with the rst file?)

@MarcoGorelli
Collaborator

looks to me like pandas-ta specifically wants a pandas series - not a polars series, and not a numpy array; it won't work with anything else:

In [42]: ta.sma(dfpl['Values'], length=2)

In [43]: ta.sma(dfpl['Values'].to_numpy(), length=2)

In [44]: ta.sma(pd.Series(dfpl['Values']), length=2)
Out[44]:
0          NaN
1     0.795927
2     0.643506
3     0.690988
4     0.525113
5     0.267241
6     0.262692
7     0.494509
8     0.680310
9     0.742702
10    0.762478
11    0.646630
12    0.389337
13    0.055688
14    0.333522
15    0.333357
16    0.090452
17    0.220633
18    0.362725
19    0.410134
20    0.665995
21    0.678027
22    0.531861
23    0.698057
24    0.826965
25    0.660905
26    0.422452
27    0.240534
28    0.102339
29    0.347099
30    0.375864
31    0.235960
32    0.545532
33    0.841617
34    0.879748
35    0.766130
36    0.498510
37    0.551201
38    0.502415
39    0.153784
40    0.384721
41    0.589381
42    0.424279
43    0.496428
44    0.431987
45    0.451654
46    0.786145
47    0.776003
48    0.384494
49    0.283826
Name: SMA_2, dtype: float64

so, barring converting to pandas (or requesting that they support polars) I'm not sure there's much that can be done here


@mkleinbort-ic

For reference, here is a good example of nb.guvectorize

https://stackoverflow.com/questions/77523657/how-do-you-condense-long-recursive-polars-expressions/77525171?noredirect=1#comment136683582_77525171

But going back to your question - have you tried just using .map_elements on the list of values itself?

Polars code

import numpy as np
import polars as pl

df = pl.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50) 
})

df1 = df.group_by('Group').agg(pl.col('Values'))
# This is what we have so far:

# shape: (3, 2)
# ┌───────┬──────────────────────────────────┐
# │ Group ┆ Values                           │
# │ ---   ┆ ---                              │
# │ str   ┆ list[f64]                        │
# ╞═══════╪══════════════════════════════════╡
# │ C     ┆ [0.035229, 0.925109, … 0.297392] │
# │ A     ┆ [0.789259, 0.004903, … 0.56967]  │
# │ B     ┆ [0.917179, 0.396942, … 0.962115] │
# └───────┴──────────────────────────────────┘

df2 = df1.with_columns(**{
    'Moving Average': pl.col('Values').map_elements(lambda x: ...)  # Use your UDF that takes a list of values here
})

# Then an explode to have it in the original format.
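A sketch of that explode step (hypothetical, assuming df2 above ends up with equal-length list columns):

df_out = df2.explode('Values', 'Moving Average')
# One row per original value again, with the group-wise result alongside it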

@MarcoGorelli
Collaborator

that won't work I'm afraid

In [14]: df2 = df1.with_columns(**{
    ...:     'Moving Average': pl.col('Values').map_elements(lambda x: ta.sma(x, length=2))  # Use your UDF that takes in list of values here
    ...: })

In [15]: df2
Out[15]:
shape: (3, 3)
┌───────┬──────────────────────────────────┬────────────────┐
│ Group ┆ Values                           ┆ Moving Average │
│ ---   ┆ ---                              ┆ ---            │
│ str   ┆ list[f64]                        ┆ list[null]     │
╞═══════╪══════════════════════════════════╪════════════════╡
│ B     ┆ [0.015636, 0.165267, … 0.520834] ┆ null           │
│ C     ┆ [0.230894, 0.241025, … 0.385417] ┆ null           │
│ A     ┆ [0.844534, 0.74732, … 0.651077]  ┆ null           │
└───────┴──────────────────────────────────┴────────────────┘

The issue, as far as I can tell, is that pandas-ta only accepts pandas series, and nothing else (not even numpy array)

I'd suggest opening a feature request to them to accept numpy arrays (or, even better, Polars)

@mkleinbort-ic

mkleinbort-ic commented Dec 7, 2023

The issue, as far as I can tell, is that pandas-ta only accepts pandas series, and nothing else (not even numpy array)

Can't that be fixed with code like:

import pandas as pd

def pandas_ta_wrapper(x: list) -> pd.Series:
    return pd.Series(x)

df2 = df1.with_columns(**{
    'Moving Average': pl.col('Values').map_elements(lambda x: ta.sma(pandas_ta_wrapper(x), length=2))
})

@MarcoGorelli
Collaborator

MarcoGorelli commented Dec 7, 2023

You'd need to then convert back to Polars

This works:

(
    dfpl.with_columns(
        MA2=pl.col("Values")
        .map_batches(lambda x: pl.Series(ta.sma(x.to_pandas(), length=2)))
        .fill_null(strategy="backward")
        .over("Group")
    )
)

EDIT

This comment is not correct, it requires map_elements, not map_batches

@Chuck321123
Author

Thanks for the input. Yeah, I didn't think the package explicitly needed a pandas Series to work, @MarcoGorelli. However, I tried your solution and it worked, surprisingly with ~30% better performance on larger datasets.

@cmdlineluser
Contributor

The last example uses .map_batches(), so I don't think it's equivalent.

(It's not generating per-group values.)

@MarcoGorelli
Collaborator

MarcoGorelli commented Dec 7, 2023

it's using map_batches, followed by over, so it should be equivalent - at least, if I try it, I get the same result. (I may have missed boundary details, just checking.)

Hmm, this doesn't behave as I would have expected it to - I've opened #12941 about this:

In [11]: df = pl.DataFrame({'a': [1,1, 2], 'b': [4,5,6]})

In [12]: df.with_columns(pl.col('b').map_batches(lambda x: x.shift()).over('a'))
Out[12]:
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 1   ┆ 4    │
│ 2   ┆ 5    │
└─────┴──────┘

@cmdlineluser
Contributor

Yeah, I mentioned it earlier: #12912 (comment)

map_batches always passes the full column, it doesn't do groups.

If you are looking to map a function over a window function or group_by context, refer to func:map_elements instead.

(The distinction was perhaps clearer with the previous names of map vs. apply because one could think of a "group" as a "batch")

import pandas_ta as ta
import polars as pl

df = pl.DataFrame({
   "Group":  ["A", "A", "B", "C", "C", "C"],
   "Values": [1, 2, 3, 4, 5, 6]
})
df.with_columns(
   MA2=pl.col("Values")
   .map_batches(lambda x: [print(x), pl.Series(ta.sma(x.to_pandas(), length=2))][1])
   .over("Group")
)
# shape: (6,)
# Series: 'Values' [i64]
# [
# 	1
# 	2
# 	3
# 	4
# 	5
# 	6
# ]
# shape: (6, 3)
# ┌───────┬────────┬──────┐
# │ Group ┆ Values ┆ MA2  │
# │ ---   ┆ ---    ┆ ---  │
# │ str   ┆ i64    ┆ f64  │
# ╞═══════╪════════╪══════╡
# │ A     ┆ 1      ┆ null │
# │ A     ┆ 2      ┆ 1.5  │
# │ B     ┆ 3      ┆ 2.5  │
# │ C     ┆ 4      ┆ 3.5  │
# │ C     ┆ 5      ┆ 4.5  │
# │ C     ┆ 6      ┆ 5.5  │
# └───────┴────────┴──────┘
df.with_columns(
   MA2=pl.col("Values")
   .map_elements(lambda x: [print(x), pl.Series(ta.sma(x.to_pandas(), length=2))][1])
   .over("Group")
)
# shape: (2,)
# Series: '' [i64]
# [
# 	1
# 	2
# ]
# shape: (1,)
# Series: '' [i64]
# [
# 	3
# ]
# shape: (3,)
# Series: '' [i64]
# [
# 	4
# 	5
# 	6
# ]
# shape: (6, 3)
# ┌───────┬────────┬──────┐
# │ Group ┆ Values ┆ MA2  │
# │ ---   ┆ ---    ┆ ---  │
# │ str   ┆ i64    ┆ f64  │
# ╞═══════╪════════╪══════╡
# │ A     ┆ 1      ┆ null │
# │ A     ┆ 2      ┆ 1.5  │
# │ B     ┆ 3      ┆ null │
# │ C     ┆ 4      ┆ null │
# │ C     ┆ 5      ┆ 4.5  │
# │ C     ┆ 6      ┆ 5.5  │
# └───────┴────────┴──────┘

@deanm0000
Collaborator

I petitioned for this one to be reopened. It really seems like map_batches ought to "turn into" map_elements when it's inside an over/agg.

@MarcoGorelli
Collaborator

Thanks all for your comments!

Right, so going back to the original example, the solution is indeed to use map_elements (not map_batches!) followed by over:

import pandas as pd
import numpy as np
import pandas_ta as ta
import polars as pl
from polars.testing import assert_series_equal
np.random.seed(42)
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50)
})
df = df.sort_values(by='Group')
df["Moving_Average"] = (df.groupby('Group')["Values"].transform(lambda x: ta.sma(close=x, length=2))).bfill()
# Alternative:
def My_UDF_Function(x):
    return ta.sma(close=x, length=2)

dfpl = pl.from_pandas(df)

dfpl = (
    dfpl.with_columns(
        MA2=pl.col("Values")
        .map_elements(lambda x: pl.Series(ta.sma(x.to_pandas(), length=2)))
        .fill_null(strategy="backward")
        .over("Group")
    )
)

df["MA2"] = df.groupby('Group')["Values"].transform(My_UDF_Function).bfill()

dfpl = dfpl.with_columns(
    pandas_MA2=pl.from_pandas(df['MA2'])
)
assert_series_equal(dfpl['MA2'], dfpl['pandas_MA2'], check_names=False)

I'll try to update the docs

@Mickychen00

Mickychen00 commented Dec 18, 2023

The user guide's UDF page still needs to be updated. The current version talks about map and apply, two outdated functions.
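For reference, a minimal sketch of the rename (the old/new names are as stated earlier in this thread; the example data is illustrative):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

df.with_columns(
    doubled_batch=pl.col("a").map_batches(lambda s: s * 2),  # whole Series at once (previously `map`)
    doubled_elem=pl.col("a").map_elements(lambda v: v * 2, return_dtype=pl.Int64),  # value by value (previously `apply`)
)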
