rolling functions in lazy #1185
@ritchie46 Hey, I'm looking for rolling versions. I created a function called AutoRollingStats and it currently computes mean, sd, min, and max (that's all the datatable currently offers). I'd like to get the function to a similar state as my R version where I can compute rolling stats for everything you have listed above. I use them quite often for machine learning and inside the ML forecasting functions I use (R version but plans for the Python version as well). |
@AdrianAntico #1188 exposes What do you need, so I can prioritize work on that? |
@ritchie46 I'm looking to create rolling measures for mean, sd, skew, kurtosis, quantiles, correlation (Pearson, Spearman), and mode. This should work with one or many partition-by variables as well as with none (something like OVER(ORDER BY x)), for any number of rows preceding. Ideally I could also generate these for multiple variables and for multiple periods at the same time. The idea behind building them all in one shot is the speedup from pushing everything to Rust instead of going back and forth between calls for the other variables and other periods.
The current Python function I've put together uses datatable and can only generate rolling mean, rolling sd, rolling min, and rolling max, and it's kind of clunky. datatable doesn't have any rolling functions yet, so I have to generate a sequence of lag columns and then apply its rowmean, rowsd, rowmin, and rowmax functions to those lag columns.
The AutoRollingStats version I have in R uses data.table: frollmean() for the rolling mean, and frollapply(), which accepts other functions such as sd and quantile but is much slower than frollmean(). What's great about the R version of data.table is that shift(), frollmean(), and frollapply() can run over any number of variables and build all the lags or all the rolling means in one shot (pushing all the operations to C / C++ in a single call, which makes it extremely fast). For example, the syntax would be:
moving averages using R data.table::frollmean()
moving sd using frollapply()
|
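The lag-column workaround described above (build shifted copies of a column, then take a row-wise statistic across them) can be sketched in plain Python. This is only an illustration of the idea, not datatable's, data.table's, or polars' implementation; the helper names `lag`, `rolling_via_lags`, and `rolling_by_group` are hypothetical, and rows without a full window are assumed to yield `None`.

```python
from statistics import mean, stdev


def lag(values, k):
    """Shift values down by k rows, padding the start with None."""
    k = min(k, len(values))
    return [None] * k + list(values[: len(values) - k])


def rolling_via_lags(values, window, stat):
    """Rolling statistic built from lag columns 0..window-1 and a row-wise stat."""
    lags = [lag(values, k) for k in range(window)]
    out = []
    for row in zip(*lags):
        # A row containing a padded None does not have a full window yet.
        out.append(None if any(v is None for v in row) else stat(row))
    return out


def rolling_by_group(keys, values, window, stat):
    """Same statistic, restarted per partition key, keeping the original row order."""
    positions = {}
    for i, key in enumerate(keys):
        positions.setdefault(key, []).append(i)
    out = [None] * len(values)
    for idx in positions.values():
        result = rolling_via_lags([values[i] for i in idx], window, stat)
        for i, r in zip(idx, result):
            out[i] = r
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
rolling_mean = rolling_via_lags(x, 3, mean)   # [None, None, 2.0, 3.0, 4.0]
rolling_sd = rolling_via_lags(x, 3, stdev)    # [None, None, 1.0, 1.0, 1.0]
rolling_min = rolling_via_lags(x, 3, min)     # [None, None, 1.0, 2.0, 3.0]
rolling_max = rolling_via_lags(x, 3, max)     # [None, None, 3.0, 4.0, 5.0]

keys = ["a", "b", "a", "b", "a"]
vals = [1.0, 10.0, 3.0, 30.0, 5.0]
grouped_mean = rolling_by_group(keys, vals, 2, mean)  # [None, None, 2.0, 20.0, 4.0]
```

Every window size and every statistic needs its own pass over the lag columns here, which is the back-and-forth cost that a single native call would avoid.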
I think for polars you also have to do some lagging kung-fu to achieve this. There are rolling window functions, but no rolling Maybe I can also examine the possibility of a rolling apply. So that you can apply a |
@AdrianAntico #1196 adds a |
For the more exotic statistics you can use You have to decide up front what to do with missing values. Or fill them with
import numpy as np
import polars as pl
from scipy import stats

df = pl.DataFrame({
    "a": np.random.rand(10),
    "b": np.random.rand(10),
})
# Fill missing values with 0, then compute a rolling skew over 3 rows for every column.
df.select([
    pl.all().fill_none(0).rolling_apply(window_size=3, function=lambda s: stats.skew(s)).suffix("_skew")
])
|
This type of functionality will do the trick. However, is there an update required? I'm getting a panic exception
|
Still need to release. :) I plan to release next Friday. You could already compile from source. |
I've released a beta version: https://pypi.org/project/polars/0.8.27_beta.1/ |
That did the trick! |
skew added in #1280 |
kurtosis added in #1282 |
@AdrianAntico Continuing discussion from #1124
I have rolling aggregations in eager, I will expose them in lazy as well. 👍 In which context do you mean? Rolling aggregations or normal?
Normal aggregations we have