# Aggregation in Pandas

In this section we are going to learn all about aggregating with `df.groupby()`. We will be covering:

* [Grouped Aggregation in Pandas](#grouped-agg)
* [Level 1: Single aggregations](#groupby)
* [Level 2: Multiple aggregations with `.agg()`](#mult-aggs)
* [Level 3: Named aggregation](#named-agg)
* [<mark>Exercise: Using aggregations</mark>](#ex-aggs)

But before we do anything, let's import pandas and read our data in again, since we're working in a new notebook.

In [None]:
import pandas as pd

chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

We can get statistics for our data quite easily using methods like:

* `df.describe()`: gives a selection of statistics for each numeric field
* `df.mean()`: calculates the mean of each column or row
* `df.max()`: calculates the max of each column or row
* `df.min()`: calculates the min of each column or row

In [None]:
chickweight.describe()

In [None]:
print(
    chickweight.mean(),
    chickweight.max(),
    chickweight.min(),
    sep = '\n\n'
)
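The "each column or row" part is controlled by the `axis=` argument. Here is a minimal sketch on a tiny made-up dataframe (`toy` and its numbers are invented for illustration and are not the chickweight data):

```python
import pandas as pd

# Made-up numbers, purely for illustration (not the chickweight data)
toy = pd.DataFrame({
    'weight': [40, 50, 60],
    'time': [0, 2, 4],
})

col_means = toy.mean()          # axis=0 (the default): one value per column
row_means = toy.mean(axis=1)    # axis=1: one value per row

print(col_means)
print(row_means)
```

The same `axis=` argument works for `.max()` and `.min()`.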

<a id = 'grouped-agg'></a>
## Grouped Aggregation in Pandas

However, what if we want data about specific groups of the dataset, e.g. the mean weight for diet 2? In that case we need to ***aggregate*** our data.

Aggregation is the act of splitting up your original dataset to calculate statistics on sub-dataframes.

<img src="images/split-groupby-combine.png" width="440" height="440" align="center"/>
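To make the split / apply / combine idea concrete, here is a rough sketch that does the three steps by hand on a made-up dataframe and compares the result with a `.groupby()` chain. The `toy` data and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    'diet': [1, 1, 2, 2],
    'weight': [40, 60, 50, 70],
})

# Split: take one sub-dataframe per diet
# Apply: compute the mean weight of each piece
per_diet = {}
for diet in toy['diet'].unique():
    piece = toy[toy['diet'] == diet]
    per_diet[diet] = piece['weight'].mean()

# Combine: gather the per-group results into one Series
manual = pd.Series(per_diet, name='weight')

# The same calculation written as a single groupby chain
grouped = toy.groupby('diet')['weight'].mean()

print(manual)
print(grouped)
```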

There are a few ways to do this in pandas:

<a id = 'groupby'></a>
## Level 1: Single aggregations

We can easily see what the mean weight is overall in our dataframe:

In [None]:
(
    chickweight
    ['weight']
    .mean()
)

But what if we want to see what the mean weight is depending on the diet the chicken was on?

Then we first need to perform a `.groupby()` in order to split the data into those different diets.

In [None]:
(
    chickweight
    .groupby('diet')
    ['weight']
    .mean()
)

### So what is `.groupby()`?

Calling `.groupby()` on its own does not return a dataframe. It returns another type of object, a `DataFrameGroupBy`:

In [None]:
chickweight.groupby('diet')

A groupby object is essentially a collection of dataframes, one per group. The idea is that later we can calculate something per dataframe.

In [None]:
list(
    chickweight
    .groupby('diet')
)

In [None]:
for diet_name, diet_df in chickweight.groupby('diet'):
    display(diet_df.head(3))
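Besides looping, a groupby object can be inspected directly. The sketch below uses a made-up dataframe (an assumption for illustration) to show how many groups there are, how big each one is, and how to pull out a single group.

```python
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    'diet': [1, 1, 2, 2, 2],
    'weight': [40, 60, 50, 70, 90],
})

grouped = toy.groupby('diet')

n_groups = grouped.ngroups        # number of distinct diets
sizes = grouped.size()            # number of rows in each group
diet_2 = grouped.get_group(2)     # the sub-dataframe for diet 2 only

print(n_groups)
print(sizes)
print(diet_2)
```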

From here we can calculate statistics or other information on the remaining columns, using methods like:

* `.mean()`
* `.sum()`
* `.min()` and `.max()`
* `.first()`
* `.var()`

**Try them out. Do you understand each one?**

In [None]:
# .first() retrieves the first non-null value of each column in each group
(
    chickweight
    .sort_values(['chick', 'time'])
    .groupby('diet')
    .first()
)

In [None]:
# Look at how this output differs if the dataframe is sorted in a different way prior to the groupby!
(
    chickweight
    .sort_values('weight', ascending=False)
    .groupby('diet')
    .first()
)
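One subtlety worth knowing: by default `.first()` skips missing values column by column, so the values it returns for a group may come from different rows. A small sketch on made-up data (the `toy` frame is an assumption for illustration) compares it with `.head(1)`, which keeps the first row of each group as it is:

```python
import numpy as np
import pandas as pd

# Made-up data with a missing weight in the first row of group 'a'
toy = pd.DataFrame({
    'diet': ['a', 'a', 'b'],
    'weight': [np.nan, 5.0, 7.0],
    'time': [0, 2, 0],
})

first_values = toy.groupby('diet').first()   # first non-null value per column
first_rows = toy.groupby('diet').head(1)     # first row per group, NaN included

print(first_values)
print(first_rows)
```

For diet `'a'`, `first_values` pairs the weight from the second row with the time from the first row, while `first_rows` keeps the original first row and its missing weight.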

We can even perform aggregations when our data is grouped by **multiple** columns:

In [None]:
(
    chickweight
    .groupby(['diet', 'time'])
    .mean()
)
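When we group by several columns, each combination of values becomes one entry in a multi-level index. A minimal sketch of looking values up in such a result, using a made-up dataframe (an assumption for illustration):

```python
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    'diet': [1, 1, 1, 2, 2, 2],
    'time': [0, 0, 2, 0, 2, 2],
    'weight': [40, 42, 50, 41, 55, 57],
})

means = toy.groupby(['diet', 'time'])['weight'].mean()

index_names = list(means.index.names)   # the grouping columns, in order
diet1_time0 = means.loc[(1, 0)]         # one (diet, time) combination
diet2_all = means.loc[2]                # every time point for diet 2

print(means)
print(diet1_time0)
print(diet2_all)
```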

<a id = 'mult-aggs'></a>
## Level 2: Multiple aggregations with `.agg()`

We can get the results for more than one aggregation by passing a `list` of the aggregations we want to the `.agg()` method. A list keeps the output columns in the order we wrote them, which an unordered `set` does not guarantee.

**Example**: Grouping by time, find the total number of chickens and the mean weight of the chickens

In [None]:
(
    chickweight
    .groupby('time')
    .agg(['count', 'mean'])
)

This has its limitations:

We have to return each aggregation for **every** column!

To select the columns and perform aggregations on those columns only, we should pass a `dict` to the `.agg()` method.

**Example**: Find the count of rownum and the minimum weight when grouped by time

In [None]:
(
    chickweight
    .groupby('time')
    .agg({'rownum': 'count', 
          'weight': 'min'})
).head()

Currently, the column names are a little misleading, i.e. how would someone looking at this table know that `weight` holds the minimum weight?

Also, look what happens if we try to get the min & max for the weight column.

In [None]:
(
    chickweight
    .groupby('time')
    .agg({'rownum': 'count',
          'weight': 'min',
          'weight': 'max'})
).head()
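Only the max comes back. This is plain Python behaviour rather than anything pandas does: a `dict` cannot hold the same key twice, so the second `'weight'` entry silently replaces the first before `.agg()` ever sees it. A quick check without pandas:

```python
# The same dictionary literal as above, inspected on its own
spec = {'rownum': 'count',
        'weight': 'min',
        'weight': 'max'}   # same key again: this value replaces 'min'

print(spec)
print(len(spec))
```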

To address this issue we should pass the aggregations we want to use as `lists`, rather than `strings`.

In [None]:
(
    chickweight
    .groupby('time')
    .agg({'rownum': ['count'], 
          'weight': ['min', 'max']})
).head()

This is just what we wanted, but the column names now have two levels (a `MultiIndex`), which could cause issues when working with this new table later...
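One common workaround is to flatten the two levels into single names by hand. The sketch below is one possible approach on a made-up dataframe (the `toy` data and the `_` separator are assumptions for illustration):

```python
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    'time': [0, 0, 2, 2],
    'rownum': [1, 2, 3, 4],
    'weight': [40, 42, 50, 58],
})

summary = toy.groupby('time').agg({'rownum': ['count'],
                                   'weight': ['min', 'max']})

# Each column label is a tuple such as ('weight', 'min'); join the two parts
flat = summary.copy()
flat.columns = ['_'.join(label) for label in summary.columns]

print(summary.columns.nlevels)
print(flat)
```

This works, but it is an extra step to remember after every aggregation.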


<a id = 'named-agg'></a>
## Level 3: Named aggregation

To support column-specific aggregation with *control over the output column names*, we can use `pd.NamedAgg`.

* The keywords are the *output column names*

* `pd.NamedAgg` takes the fields `column=` and `aggfunc=`, which make it clearer what the arguments are.

In [None]:
(
    chickweight
    .groupby('time')
    .agg(num_chickens = pd.NamedAgg(column='rownum',aggfunc='count'),
         weight_mean = pd.NamedAgg(column='weight',aggfunc='mean')
    )
).head()
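`pd.NamedAgg` is a named tuple, and pandas also accepts a plain `(column, aggfunc)` tuple in its place. A short sketch on made-up data (the `toy` frame is an assumption for illustration) showing that both spellings give the same table:

```python
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    'time': [0, 0, 2, 2],
    'rownum': [1, 2, 3, 4],
    'weight': [40, 42, 50, 58],
})

with_namedagg = toy.groupby('time').agg(
    num_chickens=pd.NamedAgg(column='rownum', aggfunc='count'),
    weight_mean=pd.NamedAgg(column='weight', aggfunc='mean'),
)

# (column, aggfunc) tuples are accepted in place of pd.NamedAgg
with_tuples = toy.groupby('time').agg(
    num_chickens=('rownum', 'count'),
    weight_mean=('weight', 'mean'),
)

same_result = with_namedagg.equals(with_tuples)
print(same_result)
```

The tuple form is shorter; the `pd.NamedAgg` form is more explicit about which argument is which.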

We can even use custom functions! Hooray for lambda functions!

Let's adapt the previous code, this time grouping by both time and diet, and add `weight_spread = pd.NamedAgg(column = 'weight', aggfunc = lambda x: x.max() - x.min())` as a new aggregate!

In [None]:
(
    chickweight
    .groupby(['time', 'diet'])
    .agg(number_rows = pd.NamedAgg(column = 'rownum', aggfunc = 'count'),
         weight_mean = pd.NamedAgg(column = 'weight', aggfunc = 'mean'),
         weight_spread = pd.NamedAgg(column = 'weight', aggfunc = lambda x: x.max() - x.min())
    )
).head()

A dataframe in pandas has an index. When you aggregate and get multiple columns as a result, pandas will **automatically put the grouped columns in the index**.

Sometimes this is fine, but sometimes you might want to undo this operation by resetting the index. You can use `.reset_index()` at the end, but it is often neater to pass the built-in parameter `as_index=False` to the `.groupby()` method!

Note that the index of a dataframe typically has different behavior than a column.

In [None]:
# Get the range using a defined function
def get_range(col):
    return col.max() - col.min()

(
    chickweight
    .groupby(['time', 'diet'], as_index=False)
    .agg(number_rows=pd.NamedAgg(column='rownum', aggfunc=len),
         weight_mean=pd.NamedAgg(column='weight', aggfunc='mean'),
         weight_spread=pd.NamedAgg(column='weight', aggfunc=get_range)
    )
).head()
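To see that the two routes end up in the same place, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration) comparing `.reset_index()` with `as_index=False`:

```python
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    'time': [0, 0, 2, 2],
    'diet': [1, 1, 1, 2],
    'weight': [40, 42, 50, 58],
})

# Default: the grouping columns end up in the index
in_index = (
    toy
    .groupby(['time', 'diet'])
    .agg(weight_mean=pd.NamedAgg(column='weight', aggfunc='mean'))
)

# as_index=False: the grouping columns stay as ordinary columns
as_columns = (
    toy
    .groupby(['time', 'diet'], as_index=False)
    .agg(weight_mean=pd.NamedAgg(column='weight', aggfunc='mean'))
)

same_result = in_index.reset_index().equals(as_columns)
print(same_result)
```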

Note that an index can also be useful in its own right. For example, it makes it really easy to control axes if we decide to plot the data.

<a id = 'ex-aggs'></a>
## <mark>Exercise: Using aggregations</mark>

Determine the following aggregate information per diet (*optional*: per diet and time):

- maximum chick id
- median weight
- std (standard deviation) of weight

- any extras of your choice

**bonus points:** use a custom function

In [None]:
# %load answers/ex-aggs.py