Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rolling functions in lazy #1185

Closed
6 tasks done
ritchie46 opened this issue Aug 21, 2021 · 12 comments
Closed
6 tasks done

rolling functions in lazy #1185

ritchie46 opened this issue Aug 21, 2021 · 12 comments

Comments

@ritchie46
Copy link
Member

ritchie46 commented Aug 21, 2021

@AdrianAntico Continuing discussion from #1124

@ritchie46 Hey there,

Is there a way to specify window size for rolling calculations? And is there a way to compute them when there are no grouping variables? And lastly, what types of aggregation functions are possible, such as mean, quantile, sd, skew, kurt, corr?

# From the windowing example in the docs:
q = dataset.lazy().with_columns(
    [
        pl.sum("A").over().alias("fruit_sum_A"),
        pl.first("B").over().alias("fruit_first_B"),
        pl.max("B").over().alias("cars_max_B"),
    ]
)

# Is there a way  to run an N-period moving average without grouping vars? Something like:
q = dataset.lazy().with_columns(
    [
        pl.sum("A", periods = 5).over().alias("ma5_bla"),
    ]
)

# Or an N-period moving average with grouping variables? Something like:
q = dataset.lazy().with_columns(
    [
        pl.sum("A", periods = 5).over(CharacterVectorOfColNames).alias("ma5_bla"),
    ]
)

I have rolling aggregations in eager, I will expose them in lazy as well. 👍

And lastly, what types of aggregation functions are possible, such as mean, quantile, sd, skew, kurt, corr?

In wich context do you mean? rolling aggregations or normal?

Normal aggregations we have

  • mean,
  • quantile
  • std
  • corr (I believe, let me check)
  • skew
  • kurt
@AdrianAntico
Copy link

@ritchie46 Hey, I'm looking for rolling versions. I created a function called AutoRollingStats and it currently computes mean, sd, min, and max (that's all the datatable currently offers). I'd like to get the function to a similar state as my R version where I can compute rolling stats for everything you have listed above. I use them quite often for machine learning and inside the ML forecasting functions I use (R version but plans for the Python version as well).

@ritchie46
Copy link
Member Author

@AdrianAntico #1188 exposes min, max, sum and mean as expressions. Correlation.

Wat do you need, so I can prioritize work on that?

@AdrianAntico
Copy link

@ritchie46 I'm looking to be able to create rolling measures for mean, sd, skew, kurtosis, quantiles, correlation (pearson, spearman), and mode, for cases where there are one to many partition-by variables and when there are no partition-by variables (something like OVER(ORDER BY x), for any number of rows preceding. Ideally I could also have these generated for multiple variables at the same time and for multiple periods too. The idea behind having them all build in one shot is for the speedup by pushing everything to rust vs going back and forth between calls to the other variables and other periods.

The current Python function I've put together uses datatable and it can only generate rolling mean, rolling sd, rolling min, and rolling max, but it's kind of clunky. They don't have any of the rolling functions available yet so what I have to do is generate a sequence of lag columns and then use their rowmean, rowsd, rowmin, and rowmax functions with the lag columns to create them.

The AutoRollingStats version I have in R uses data.table and it uses the data.table function called frollmean() which I use for the rolling mean and frollapply() that allows for other functions to be used aside from mean, such as sd, quantile, etc., but it's much slower than the frollmean().

What's great about the R version of data.table is that for their shift(), frollmean(), and frollapply() functions, I can run it for any number of variable and have all the lags or all the rolling means built in one shot (and it will push all the operations to C / C++ to build in a single call, which makes it extremely fast). For example, the syntax would be:

moving averages using R data.table::frollmean()

# Multiple variables with partition variables (panel or transactional data)
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollmean(x = .SD, n = c(5, 10, 15)), .SDcols = c(BaseCols), by = c("Group1", "Group2")]

# Without grouping variables (single time series or transactional data / just 'over')
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollmean(x = .SD, n = c(5, 10, 15)), .SDcols = c(BaseCols)]

moving sd using frollapply()

# Multiple variables with partition variables (panel or transactional data)
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollapply(x = .SD, n = c(5, 10, 15)), FUN = sd), SDcols = c(BaseCols), by = c("Group1", "Group2")]

# Multiple variables without partition variables (single time series or transactional data)
# Creates 9 new columns
BaseCols = c("Col1", "Col2", "Col3")
data[, paste(BaseCols, c(5,10,15)) := frollapply(x = .SD, n = c(5, 10, 15)), FUN = sd)]

@ritchie46
Copy link
Member Author

I think for polars you also have to do some lagging kung-fu to achieve this. There are rolling window functions, but no rolling std, or quantile.

Maybe I can also examine the possibility of a rolling apply. So that you can apply a Series aggregator on a rolling window.

@ritchie46
Copy link
Member Author

@AdrianAntico #1196 adds a rolling_apply. Its best to only use this for the rolling aggregation not directly supported as this will be slower.

@ritchie46
Copy link
Member Author

For the more exotic statistics you can use scipy.stats.

You have to decide up front what to do with missing values. Or fill them with 0, or replace them with np.nan.

from scipy import stats

df = pl.DataFrame({
    "a": np.random.rand(10),
    "b": np.random.rand(10),
})
df.select([
    pl.all().fill_none(0).rolling_apply(window_size=3, function=lambda s: stats.skew(s)).suffix("_skew")
])
shape: (10, 2)
┌───────────────────────┬────────────────────────┐
│ a_skew                ┆ b_skew                 │
│ ---                   ┆ ---                    │
│ f64                   ┆ f64                    │
╞═══════════════════════╪════════════════════════╡
│ null                  ┆ null                   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null                  ┆ null                   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.06691419711567603   ┆ 0.1371982425693427     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -6.793356910894194e-1 ┆ -2.3831535083524058e-1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                   ┆ ...                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.16999895544056237   ┆ 0.3831494551186044     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.14317983998770842   ┆ 0.1807861706971071     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.6325442595571087    ┆ 0.17865733248119894    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -6.822861889457311e-1 ┆ -5.641642413783482e-1  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.5548853378613413    ┆ 0.44394165335402014    │
└───────────────────────┴────────────────────────┘

@AdrianAntico
Copy link

This type of functionality will do the trick. However, is there an update required? I'm getting a panic exception

PanicException: called `Result::unwrap()` on an `Err` value: Other("No root column name could be found for expr * in output name utillity")

@ritchie46
Copy link
Member Author

This type of functionality will do the trick. However, is there an update required? I'm getting a panic exception

PanicException: called `Result::unwrap()` on an `Err` value: Other("No root column name could be found for expr * in output name utillity")
``

Still need to release. :) I plan to release next friday. You could already compile from source.

@ritchie46
Copy link
Member Author

I've released a beta version: https://pypi.org/project/polars/0.8.27_beta.1/

@AdrianAntico
Copy link

That did the trick!

@ritchie46
Copy link
Member Author

skew added in #1280

@ritchie46
Copy link
Member Author

kurtosis added in #1282

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants