rolling functions in lazy #1185

ritchie46 · 2021-08-21T06:42:02Z

@AdrianAntico Continuing discussion from #1124

@ritchie46 Hey there,

Is there a way to specify window size for rolling calculations? And is there a way to compute them when there are no grouping variables? And lastly, what types of aggregation functions are possible, such as mean, quantile, sd, skew, kurt, corr?

# From the windowing example in the docs:
q = dataset.lazy().with_columns(
    [
        pl.sum("A").over().alias("fruit_sum_A"),
        pl.first("B").over().alias("fruit_first_B"),
        pl.max("B").over().alias("cars_max_B"),
    ]
)

# Is there a way  to run an N-period moving average without grouping vars? Something like:
q = dataset.lazy().with_columns(
    [
        pl.sum("A", periods = 5).over().alias("ma5_bla"),
    ]
)

# Or an N-period moving average with grouping variables? Something like:
q = dataset.lazy().with_columns(
    [
        pl.sum("A", periods = 5).over(CharacterVectorOfColNames).alias("ma5_bla"),
    ]
)

I have rolling aggregations in eager, I will expose them in lazy as well. 👍

And lastly, what types of aggregation functions are possible, such as mean, quantile, sd, skew, kurt, corr?

In wich context do you mean? rolling aggregations or normal?

Normal aggregations we have

The text was updated successfully, but these errors were encountered:

AdrianAntico · 2021-08-21T07:21:58Z

@ritchie46 Hey, I'm looking for rolling versions. I created a function called AutoRollingStats and it currently computes mean, sd, min, and max (that's all the datatable currently offers). I'd like to get the function to a similar state as my R version where I can compute rolling stats for everything you have listed above. I use them quite often for machine learning and inside the ML forecasting functions I use (R version but plans for the Python version as well).

ritchie46 · 2021-08-21T10:58:51Z

@AdrianAntico #1188 exposes min, max, sum and mean as expressions. Correlation.

Wat do you need, so I can prioritize work on that?

AdrianAntico · 2021-08-21T19:14:40Z

@ritchie46 I'm looking to be able to create rolling measures for mean, sd, skew, kurtosis, quantiles, correlation (pearson, spearman), and mode, for cases where there are one to many partition-by variables and when there are no partition-by variables (something like OVER(ORDER BY x), for any number of rows preceding. Ideally I could also have these generated for multiple variables at the same time and for multiple periods too. The idea behind having them all build in one shot is for the speedup by pushing everything to rust vs going back and forth between calls to the other variables and other periods.

The current Python function I've put together uses datatable and it can only generate rolling mean, rolling sd, rolling min, and rolling max, but it's kind of clunky. They don't have any of the rolling functions available yet so what I have to do is generate a sequence of lag columns and then use their rowmean, rowsd, rowmin, and rowmax functions with the lag columns to create them.

The AutoRollingStats version I have in R uses data.table and it uses the data.table function called frollmean() which I use for the rolling mean and frollapply() that allows for other functions to be used aside from mean, such as sd, quantile, etc., but it's much slower than the frollmean().

What's great about the R version of data.table is that for their shift(), frollmean(), and frollapply() functions, I can run it for any number of variable and have all the lags or all the rolling means built in one shot (and it will push all the operations to C / C++ to build in a single call, which makes it extremely fast). For example, the syntax would be:

moving averages using R data.table::frollmean()

# Multiple variables with partition variables (panel or transactional data)
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollmean(x = .SD, n = c(5, 10, 15)), .SDcols = c(BaseCols), by = c("Group1", "Group2")]

# Without grouping variables (single time series or transactional data / just 'over')
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollmean(x = .SD, n = c(5, 10, 15)), .SDcols = c(BaseCols)]

moving sd using frollapply()

# Multiple variables with partition variables (panel or transactional data)
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollapply(x = .SD, n = c(5, 10, 15)), FUN = sd), SDcols = c(BaseCols), by = c("Group1", "Group2")]

# Multiple variables without partition variables (single time series or transactional data)
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollapply(x = .SD, n = c(5, 10, 15)), FUN = sd)]

ritchie46 · 2021-08-22T07:01:39Z

I think for polars you also have to do some lagging kung-fu to achieve this. There are rolling window functions, but no rolling std, or quantile.

Maybe I can also examine the possibility of a rolling apply. So that you can apply a Series aggregator on a rolling window.

ritchie46 · 2021-08-22T16:08:12Z

@AdrianAntico #1196 adds a rolling_apply. Its best to only use this for the rolling aggregation not directly supported as this will be slower.

ritchie46 · 2021-08-22T17:07:49Z

For the more exotic statistics you can use scipy.stats.

You have to decide up front what to do with missing values. Or fill them with 0, or replace them with np.nan.

from scipy import stats

df = pl.DataFrame({
    "a": np.random.rand(10),
    "b": np.random.rand(10),
})
df.select([
    pl.all().fill_none(0).rolling_apply(window_size=3, function=lambda s: stats.skew(s)).suffix("_skew")
])

shape: (10, 2)
┌───────────────────────┬────────────────────────┐
│ a_skew                ┆ b_skew                 │
│ ---                   ┆ ---                    │
│ f64                   ┆ f64                    │
╞═══════════════════════╪════════════════════════╡
│ null                  ┆ null                   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null                  ┆ null                   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.06691419711567603   ┆ 0.1371982425693427     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -6.793356910894194e-1 ┆ -2.3831535083524058e-1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                   ┆ ...                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.16999895544056237   ┆ 0.3831494551186044     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.14317983998770842   ┆ 0.1807861706971071     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.6325442595571087    ┆ 0.17865733248119894    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -6.822861889457311e-1 ┆ -5.641642413783482e-1  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.5548853378613413    ┆ 0.44394165335402014    │
└───────────────────────┴────────────────────────┘

AdrianAntico · 2021-08-23T02:22:43Z

This type of functionality will do the trick. However, is there an update required? I'm getting a panic exception

PanicException: called `Result::unwrap()` on an `Err` value: Other("No root column name could be found for expr * in output name utillity")

ritchie46 · 2021-08-23T05:30:06Z

This type of functionality will do the trick. However, is there an update required? I'm getting a panic exception
PanicException: called `Result::unwrap()` on an `Err` value: Other("No root column name could be found for expr * in output name utillity")
``

Still need to release. :) I plan to release next friday. You could already compile from source.

ritchie46 · 2021-08-23T13:38:26Z

I've released a beta version: https://pypi.org/project/polars/0.8.27_beta.1/

AdrianAntico · 2021-08-23T17:07:19Z

That did the trick!

ritchie46 · 2021-09-02T12:16:29Z

skew added in #1280

ritchie46 · 2021-09-02T13:28:47Z

kurtosis added in #1282

ritchie46 mentioned this issue Aug 23, 2021

Add note about scipy.stats integration pola-rs/polars-book#77

Closed

ritchie46 closed this as completed Sep 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rolling functions in lazy #1185

rolling functions in lazy #1185

ritchie46 commented Aug 21, 2021 •

edited

Loading

AdrianAntico commented Aug 21, 2021

ritchie46 commented Aug 21, 2021

AdrianAntico commented Aug 21, 2021

ritchie46 commented Aug 22, 2021

ritchie46 commented Aug 22, 2021

ritchie46 commented Aug 22, 2021

AdrianAntico commented Aug 23, 2021

ritchie46 commented Aug 23, 2021

ritchie46 commented Aug 23, 2021

AdrianAntico commented Aug 23, 2021

ritchie46 commented Sep 2, 2021

ritchie46 commented Sep 2, 2021

rolling functions in lazy #1185

rolling functions in lazy #1185

Comments

ritchie46 commented Aug 21, 2021 • edited Loading

AdrianAntico commented Aug 21, 2021

ritchie46 commented Aug 21, 2021

AdrianAntico commented Aug 21, 2021

ritchie46 commented Aug 22, 2021

ritchie46 commented Aug 22, 2021

ritchie46 commented Aug 22, 2021

AdrianAntico commented Aug 23, 2021

ritchie46 commented Aug 23, 2021

ritchie46 commented Aug 23, 2021

AdrianAntico commented Aug 23, 2021

ritchie46 commented Sep 2, 2021

ritchie46 commented Sep 2, 2021

ritchie46 commented Aug 21, 2021 •

edited

Loading