## Group (window) operations
By the end of this lecture you will be able to:
- do group operations by a single column
- do group operations by multiple columns
- fill nulls by group
- filter by group

By default an expression is evaluated over all the rows of a column at once. In some cases we want it to be evaluated separately within groups of rows. For this we have `over`.

> I prefer to call these group operations. However, in SQL and some of the Polars docs you may find them referred to as *window* operations.

In [None]:
import polars as pl
import polars.selectors as cs
pl.Config.set_tbl_rows(8)

We create a simple `DataFrame` with an `id` column that defines groups of rows and a `value` column.

In [None]:
df = pl.DataFrame(
    {
        "id":["a","b","a","b"],
        "value":[0,1,2,3]
    }
)

We want to add a column with the maximum `value` in each group where the groups are defined by the `id` column

In [None]:
(
    df
    .with_columns(
        group_max = pl.col("value").max().over("id")
    )
)

Let's break down the syntax here. We've got:
- `pl.col("value")` which gives us the **input column**
- `.max()` which **aggregates** the values in the input column
- `.over("id")` which **groups** the rows by `id` so that `max` is applied within each group rather than to the whole column

> The equivalent operation in Pandas is `groupby` followed by `transform`, for example `df.groupby("id")["value"].transform("max")`

Using `over` is conceptually shorthand for:
- doing a `group_by` on the `id` column
- doing an `agg` with `pl.col("value").max()` to get a grouped `DataFrame` and
- left joining the grouped `DataFrame` back to the original `DataFrame`

So Polars is (more or less) doing the following internally when we call `over`

In [None]:
(
    df
    .join(
        df
        .group_by("id")
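        .agg(
            # Name the aggregated column so it does not clash with `value`
            pl.col("value").max().alias("group_max")
        ),
        on="id",
        # Left join, as described above, so every original row is kept
        how="left"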
    )
)

We have seen that the row order of a `group_by` output can vary between runs. The output of `over` is not affected by this because the grouped values are mapped back onto the rows of the `DataFrame` in their original order.

Typically we precede `over` with an aggregation, such as `sum`, that produces a single value for each group. That value is then repeated on every row of the group.

But we can also use `over` with expressions that return one value per row. For example, `cum_sum` on a column returns a column of the same length rather than a single value.

When we use `cum_sum` with `over` we get the cumulative sum within each group. Here we contrast the `sum` by group with the `cum_sum` by group

In [None]:
(
    df
    .with_columns(
        sum = pl.col("value").sum().over("id"),
        cum_sum = pl.col("value").cum_sum().over("id"),
    )
)

### Creating a `pl.List` from grouped values
When we do the `group_by` with `over` we may want to gather the grouped values into a `pl.List` column instead of aggregating them. We do this by passing `mapping_strategy="join"` to `over`

In [None]:
(
    df
    .with_columns(
        value_list = pl.col("value").over("id",mapping_strategy="join")
    )
)

Be aware that this can be memory intensive if we are repeating long lists on many rows

### Multiple columns
We can also do `over` with multiple columns - just like doing a `group_by` with multiple columns. 

We define a new `DataFrame` with two grouping columns, `id1` and `id2`. Only the first and third rows have the same combination of `id1` and `id2`, so they are the only rows that share a group

In [None]:
df_mult = pl.DataFrame(
    {
        "id1":["a","b","a","b"],
        "id2":["x","x","x","y"],
        "value":[0,1,2,3]
    }
)

We now get the maximum value by group

In [None]:
(
    df_mult
    .with_columns(
        group_max = pl.col("value").max().over("id1","id2")
    )
)

### Filling missing values by group

We can use `over` to fill missing values by group.

Here we have a `DataFrame` where the second value in group `a` is missing

In [None]:
df_missing = pl.DataFrame(
    {
        "id":["a","b","a","b"],
        "value":[0,1,None,3]
    }
)

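We can fill forward from the previous value in group `a` by calling `fill_null` followed by `over`. Without `over` the forward fill would take the value from the row directly above, which belongs to group `b`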

In [None]:
(
    df_missing
    .with_columns(
        filled_value = pl.col("value").fill_null(strategy="forward").over("id")
    )
)

### Filtering by group
We can use `over` to apply a filter groupwise. 

Here, for example, we return the rows with the largest `value` grouped by `id`

In [None]:
(
    df_missing
    .filter(
        pl.col("value") == pl.col("value").max().over("id")
    )
)

### Group operations in lazy mode
When we use `over` with the same grouping columns several times in lazy mode, Polars caches the result of the first internal `group_by` for re-use. However, this caching is not visible in the optimised query plan

In [None]:
print(
    df_missing
    .lazy()
    .with_columns(
        group_max = pl.col("value").max().over("id"),
        group_min = pl.col("value").min().over("id"),
    )
    .explain()
)

## Exercises
In the exercises you will develop your understanding of:
- doing arithmetic by group
- filling nulls by group
- doing multiple window expressions in a single `with_columns` statement

### Exercise 1
We want to calculate the *z-score* of the `Age` column normalised by passenger class.

Add a new column `Age_mean` with the mean of the `Age` column for passengers by class

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .select(
        'Pclass',cs.starts_with("Age")
    )
    # Use head(6) to see the null on the sixth row
    .head(6)
)

Continue by replacing the `null` values in the `Age` column with the `mean` age for passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Replace `Age_mean` with a new column called `Age_delta` that is the difference between the age and the average age of all passengers in the same class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        <blank>
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(10)
)

Continue by adding another column called `Age_z` that has the z-score for the `Age` where the z-score is the (age - average age of the passengers in that class) divided by the standard deviation of the age column for passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        <blank>
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(10)
)

### Exercise 2

Count the number of passengers in each group of: passenger class and survival. Name the column of counts `counts`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Continue by calculating the percentage breakdown of passenger survival within each passenger class group. Call this column `percent`.

Sort the output by passenger class and survival

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Filter the `DataFrame` to find the passengers with the longest name in each class and sort by class

In [None]:
pl.Config.set_fmt_str_lengths(100)
(
    pl.read_csv(csv_file)
    <blank>
)

Calculate the median age of passengers by `Pclass` and `Sex` using `group_by.agg`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Fill `nulls` in the `Age` column by the median by `Pclass` and `Sex` in a new column called `Age_filled`. Filter by `nulls` in the `Age` column to confirm that the filled values correspond to the `group_by` above

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

## Solutions

### Solution to exercise 1
We want to calculate the *z-score* of the `Age` column for each passenger normalised by their passenger class.

Add a new column `Age_mean` with the mean of the `Age` column for passengers by class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age_mean = pl.col('Age').mean().over('Pclass')
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Continue by replacing the `null` values in the `Age` column with the `mean` age for passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age_mean = pl.col('Age').mean().over('Pclass')
    )
    .with_columns(
        Age = pl.col('Age').fill_null(pl.col('Age').mean().over('Pclass'))
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Replace `Age_mean` with a new column called `Age_delta` that is the difference between the age and the average age of all passengers in the same class. Keep the `fill_null` step from above

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age = pl.col('Age').fill_null(pl.col('Age').mean().over('Pclass'))
    )
    .with_columns(
        Age_delta = pl.col('Age') - pl.col('Age').mean().over('Pclass')
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Continue by adding another column called `Age_z` that has the z-score for the `Age` where the z-score is the (age - average age of the passengers in that class) divided by the standard deviation of the age column for passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age = pl.col('Age').fill_null(pl.col('Age').mean().over('Pclass'))
    )
    .with_columns(
        Age_delta = pl.col('Age') - pl.col('Age').mean().over('Pclass')
    )
    .with_columns(
        Age_z = ((pl.col('Age') - pl.col('Age').mean().over('Pclass'))/pl.col('Age').std().over('Pclass'))
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

### Solution to exercise 2

Count the number of passengers in each group of passenger class and survival

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass","Survived")
    .agg(
        pl.col("Name").count().alias("counts")
    )
)

Calculate the percentage breakdown of passenger survival within each passenger class group. Calculate the percentage as 0-100.

Sort the output by passenger class and survival

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass","Survived")
    .agg(
        pl.col("Name").count().alias("counts")
    )
    .with_columns(
        # Scale to 0-100 first, then round and name the whole expression
        percent = (100*pl.col("counts")/pl.col("counts").sum().over("Pclass")).round(1)
    )
    .sort("Pclass","Survived")
)

Filter the `DataFrame` to find the passengers with the longest name in each class

In [None]:
pl.Config.set_fmt_str_lengths(100)
(
    pl.read_csv(csv_file)
    .filter(
        pl.col("Name").str.len_chars() == pl.col("Name").str.len_chars().max().over("Pclass")
    )
    .sort("Pclass")
)

Calculate the median age of passengers by `Pclass` and `Sex` using `group_by.agg`

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass","Sex",maintain_order=True)
    .agg(
        pl.col("Age").median()
    )
)

Fill `nulls` in the `Age` column by the median by `Pclass` and `Sex` in a new column called `Age_filled`. Filter by `nulls` in the `Age` column to confirm that the filled values correspond to the `group_by` above

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Age").fill_null(
            pl.col("Age").median().over("Pclass","Sex")
        ).alias("Age_filled")
    )
    .filter(
        pl.col("Age").is_null()
    )
)