# Statistics
By the end of this lecture you will be able to:
- calculate statistics on a `DataFrame` or in an expression
- calculate cumulative, rolling and exponentially-weighted statistics
- do row-wise calculations

In [3]:
import polars as pl

In [6]:

df = pl.read_csv("../../../Files/Sample_Superstore-1.csv")

## Statistics on a `DataFrame`

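We can call statistical methods such as `mean`, `min` and `max` on a `DataFrame` to apply them to every column. Here we first select the floating-point columns and then take the minimum of each.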

In [9]:
df.select(pl.col(pl.Float64)).min()

Sales,Discount,Profit
f64,f64,f64
0.444,0.0,-6599.978


In [10]:
df.select(pl.col(pl.Float64)).min().describe(percentiles=(0.1,0.3,0.5,0.7,0.9))

statistic,Sales,Discount,Profit
str,f64,f64,f64
"""count""",1.0,1.0,1.0
"""null_count""",0.0,0.0,0.0
"""mean""",0.444,0.0,-6599.978
"""std""",,,
"""min""",0.444,0.0,-6599.978
…,…,…,…
"""30%""",0.444,0.0,-6599.978
"""50%""",0.444,0.0,-6599.978
"""70%""",0.444,0.0,-6599.978
"""90%""",0.444,0.0,-6599.978


## Statistics in an expression
We can also calculate a statistic on a single column inside an expression.

In [21]:
(
    df
    .select(
        pl.col('Profit').mean()
    )
)

Profit
f64
28.656896


The statistics available include:
- count
- sum
- product
- min
- median
- mean
- max

## Rolling statistics
We can calculate rolling statistics in an expression.


We first create a simple `DataFrame` with sequential values.

In [22]:
df_rolling = (
    pl.DataFrame(
        {
            "value":range(12),
        }
    )
)
df_rolling.head()

value
i64
0
1
2
3
4


We take the rolling mean over 4 values by setting `window_size` to 4.

In [23]:
(
    df_rolling
    .with_columns(
        rolling_mean_value = pl.col("value").rolling_mean(window_size=4)
    )
    .head(5)
)

value,rolling_mean_value
i64,f64
0,
1,
2,
3,1.5
4,2.5


Note that by default the first non-`null` value is on the 4th row.

We can calculate the statistic from fewer values than the `window_size` by setting the `min_periods` argument.

In [24]:
(
    df_rolling
    .with_columns(
        rolling_mean_value = pl.col("value").rolling_mean(window_size=4),
        rolling_mean_value_min_periods = pl.col("value").rolling_mean(window_size=4, min_periods=1),
    )
).head()

value,rolling_mean_value,rolling_mean_value_min_periods
i64,f64,f64
0,,0.0
1,,0.5
2,,1.0
3,1.5,1.5
4,2.5,2.5


## Exponentially-weighted statistics
Polars has exponentially-weighted statistics available as expressions.

The `span` parameter determines the "alpha" value used in the exponential weighting formula, which is given by:

alpha = 2 / (L + 1)

where L is the span value. The alpha value determines the rate of decay of the weights applied to each data point in the calculation. A higher alpha (or lower span) means that more weight is given to recent data points, while a lower alpha (or higher span) value means that more weight is given to older data points.

In [25]:
(
    df_rolling
    .with_columns(
        ewm_mean_value = pl.col("value").ewm_mean(span=4)
    ).head(5)
)

value,ewm_mean_value
i64,f64
0,0.0
1,0.625
2,1.326531
3,2.095588
4,2.921582


## Multiple statistics
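When we calculate several statistics on the same column or columns, each output would otherwise keep the input column name. We can use `name.prefix` or `name.suffix` to give each output a distinct name and avoid collisions.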

In [26]:
(
    df_rolling
    .select(
        pl.col(pl.Int64).min().name.suffix("_min"),
        pl.col(pl.Int64).max().name.suffix("_max"),
    )
)

value_min,value_max
i64,i64
0,11


We can also do arithmetic with statistics. 

In this example we apply min-max scaling, which maps each column onto the range 0 to 1.

In [31]:
(
    df
    .select("Sales", "Profit", "Discount")
    .with_columns(
        (
            (pl.col(pl.Float64) - pl.col(pl.Float64).min())
            / (pl.col(pl.Float64).max() - pl.col(pl.Float64).min())
        ).name.suffix("_scaled")
    )
)

Sales,Profit,Discount,Sales_scaled,Profit_scaled,Discount_scaled
f64,f64,f64,f64,f64,f64
261.96,41.9136,0.0,0.011552,0.442794,0.0
731.94,219.582,0.0,0.032313,0.454639,0.0
14.62,6.8714,0.0,0.000626,0.440458,0.0
957.5775,-383.031,0.45,0.04228,0.414464,0.5625
22.368,2.5164,0.2,0.000968,0.440168,0.25
…,…,…,…,…,…
25.248,4.1028,0.2,0.001096,0.440273,0.25
91.96,15.6332,0.0,0.004043,0.441042,0.0
258.576,19.3932,0.2,0.011403,0.441293,0.25
29.6,13.32,0.0,0.001288,0.440888,0.0


## Horizontal computations
To illustrate horizontal (row-wise) computations we define a simple `DataFrame` with two columns.

In [32]:
df_hor = pl.DataFrame(
    {
        "vals1":[0,1,2],
        "val2":[3,4,5]
    }
)
df_hor

vals1,val2
i64,i64
0,3
1,4
2,5


Polars has a few dedicated horizontal aggregation functions (with hopefully more to come in the future). The output column of these expressions takes the name of the first column that goes into them, so we need an `alias` to avoid overwriting an existing column with the output statistic.

In [33]:
(
    df_hor
    .with_columns(
        pl.max_horizontal(pl.all()).alias("max"),
        pl.min_horizontal(pl.all()).alias("min"),
        pl.sum_horizontal(pl.all()).alias("sum"),
        
    )
)

vals1,val2,max,min,sum
i64,i64,i64,i64,i64
0,3,3,0,3
1,4,4,1,5
2,5,5,2,7


There is also a horizontal `cum_sum`. A `cum_sum` is not an aggregation: its output is not a single value but has one value per input. The output of `cum_sum_horizontal` is therefore a `pl.Struct` column with one field per input column.

In [34]:
(
    df_hor
    .with_columns(
        pl.cum_sum_horizontal(pl.all()),
        
    )
)

vals1,val2,cum_sum
i64,i64,struct[2]
0,3,"{0,3}"
1,4,"{1,5}"
2,5,"{2,7}"
