# Lecture 3: Pandas [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) 5

* [Data Aggregation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation)
* Data Transformation

## Imports

In [1]:
import pandas as pd

In addition, I set Pandas' default display format for floating-point numbers to two decimal places with a thousands separator:

In [2]:
pd.options.display.float_format = '{:,.2f}'.format
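The display option changes only how floats are printed, not the stored values. The following is a minimal sketch of that point. The `Series` `s` is a made-up example, and `pd.option_context()` is used here (instead of the global assignment above) only so that the setting stays local to the `with` block:

```python
import pandas as pd

# Hypothetical example values, not part of the lecture data
s = pd.Series([1234.5678, 0.5])

# Apply the same format string temporarily, only inside this block
with pd.option_context('display.float_format', '{:,.2f}'.format):
    shown = repr(s)

print(shown)      # formatted text representation
print(s.iloc[0])  # the stored value is unchanged
```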

## [Data Aggregation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation)

We often want to aggregate data, that is, to compute a summary value such as a sum or a mean for each group of rows. In Pandas, we can do that with [`DataFrame.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

Let's re-create the `DataFrame` from the slides:

In [3]:
df = pd.DataFrame(data=[[1, 2010, 50], [1, 2011, 100],
                        [2, 2010, 60], [2, 2011, 30], [2, 2012, 10],
                        [3, 2012, 500]],
                  columns=['ID', 'Year', 'Value'])
df

|  | ID | Year | Value |
|---|---|---|---|
| 0 | 1 | 2010 | 50 |
| 1 | 1 | 2011 | 100 |
| 2 | 2 | 2010 | 60 |
| 3 | 2 | 2011 | 30 |
| 4 | 2 | 2012 | 10 |
| 5 | 3 | 2012 | 500 |


We can use different aggregations:
* Sum of `Value` over `ID`s

In [4]:
df.groupby('ID')['Value'].sum()

ID
1    150
2    100
3    500
Name: Value, dtype: int64

* Mean of `Value` over `ID`s

In [5]:
df.groupby('ID')['Value'].mean()

ID
1    75.00
2    33.33
3   500.00
Name: Value, dtype: float64

* Count of non-missing `Value` entries over `ID`s

In [6]:
df.groupby('ID')['Value'].count()

ID
1    2
2    3
3    1
Name: Value, dtype: int64
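The lecture data has no missing values, so the "non-missing" part of `count()` is not visible above. As a rough illustration, the sketch below uses a hypothetical variant of the `DataFrame` (`df_missing`, with one `Value` replaced by `NaN`) and compares `count()` with `size()`, which counts all rows in each group:

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the lecture's DataFrame with one missing Value
df_missing = pd.DataFrame(data=[[1, 2010, 50], [1, 2011, np.nan],
                                [2, 2010, 60], [2, 2011, 30], [2, 2012, 10],
                                [3, 2012, 500]],
                          columns=['ID', 'Year', 'Value'])

grouped = df_missing.groupby('ID')['Value']
non_missing = grouped.count()  # counts only non-NaN entries per ID
n_rows = grouped.size()        # counts all rows per ID, including NaN

print(non_missing)
print(n_rows)
```

For `ID` 1, `count()` reports one entry while `size()` reports two rows, because one of the two values is missing.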

Likewise, we can aggregate over the years:

In [7]:
df.groupby('Year')['Value'].sum()

Year
2010    110
2011    130
2012    510
Name: Value, dtype: int64

Each of the returned aggregations is a `Series`, because we selected a single column (`Value`) and applied a single aggregation function. If we aggregate more than one column, or apply several functions at once, the result is a `DataFrame`. Note that `groupby()` by default sorts the results by the group keys (here: `ID` or `Year`).
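To make the return types and the default sorting concrete, here is a small sketch using the same data. The reversed copy `df_rev` and the variable names are introduced only for this illustration. `sort=False` is a `groupby()` parameter that keeps the groups in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2010, 50], [1, 2011, 100],
                        [2, 2010, 60], [2, 2011, 30], [2, 2012, 10],
                        [3, 2012, 500]],
                  columns=['ID', 'Year', 'Value'])

# Selecting one column gives a Series, a list of columns gives a DataFrame
one_col = df.groupby('ID')['Value'].sum()
two_cols = df.groupby('ID')[['Year', 'Value']].max()
print(type(one_col).__name__)
print(type(two_cols).__name__)

# Reverse the row order to show the effect of the default sorting
df_rev = df.iloc[::-1]
sorted_keys = df_rev.groupby('ID')['Value'].sum().index.tolist()
unsorted_keys = df_rev.groupby('ID', sort=False)['Value'].sum().index.tolist()
print(sorted_keys)    # group keys in ascending order
print(unsorted_keys)  # group keys in order of first appearance
```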

By using [`agg()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#applying-multiple-functions-at-once) we can also run multiple aggregations at once:

In [8]:
df.groupby('Year')['Value'].agg(['sum',
                                 'mean',
                                 'count'])

| Year | sum | mean | count |
|---|---|---|---|
| 2010 | 110 | 55.00 | 2 |
| 2011 | 130 | 65.00 | 2 |
| 2012 | 510 | 255.00 | 2 |


Note that the result is now a `DataFrame` with one column per aggregation function. Aggregated `DataFrame`s like the one above are by default sorted by the grouping variable (here: `Year`). If we want the output sorted by another column, for example by the sum, we need to use [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html):

In [9]:
df.groupby('Year')['Value'].agg(['sum',
                                 'mean',
                                 'count']).sort_values('sum')

| Year | sum | mean | count |
|---|---|---|---|
| 2010 | 110 | 55.00 | 2 |
| 2011 | 130 | 65.00 | 2 |
| 2012 | 510 | 255.00 | 2 |


The default is to sort in *ascending order*, which in this example happens to coincide with the order by `Year`. If we want the results in descending order, we specify `ascending=False`:

In [10]:
df.groupby('Year')['Value'].agg(['sum',
                                 'mean',
                                 'count']).sort_values('sum', ascending=False)

| Year | sum | mean | count |
|---|---|---|---|
| 2012 | 510 | 255.00 | 2 |
| 2011 | 130 | 65.00 | 2 |
| 2010 | 110 | 55.00 | 2 |


Now 2012, with a sum of 510, comes first.
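As a possible variation, `agg()` also accepts keyword arguments ("named aggregation"), which lets us choose the output column names ourselves. The names `total`, `average`, and `n_obs` below are arbitrary choices for this sketch, not part of the lecture:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2010, 50], [1, 2011, 100],
                        [2, 2010, 60], [2, 2011, 30], [2, 2012, 10],
                        [3, 2012, 500]],
                  columns=['ID', 'Year', 'Value'])

# Keyword = output column name, value = aggregation function
summary = (df.groupby('Year')['Value']
             .agg(total='sum', average='mean', n_obs='count')
             .sort_values('total', ascending=False))
print(summary)
```

The result holds the same numbers as the table above, only with the column labels `total`, `average`, and `n_obs`.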

## Data Transformation

Sometimes we want to compute an aggregate for each group of rows, and then add the aggregate values back to the original `DataFrame`, one value per row. We can do this with the [`transform()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#transformation) function.

In [11]:
df

|  | ID | Year | Value |
|---|---|---|---|
| 0 | 1 | 2010 | 50 |
| 1 | 1 | 2011 | 100 |
| 2 | 2 | 2010 | 60 |
| 3 | 2 | 2011 | 30 |
| 4 | 2 | 2012 | 10 |
| 5 | 3 | 2012 | 500 |


In [12]:
df.groupby('ID')['Value'].transform('mean')

0    75.00
1    75.00
2    33.33
3    33.33
4    33.33
5   500.00
Name: Value, dtype: float64

You see that we get in return a **like-indexed** `Series` (or a `DataFrame` if we transform multiple columns). *Like-indexed* means that the index of the returned `Series` is the same as that of the original `DataFrame`: each row receives the aggregate value of the group it belongs to.
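The difference between a plain aggregation and `transform()` can be checked directly. This is a small sketch on the same data, where the variable names `group_mean` and `row_mean` are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2010, 50], [1, 2011, 100],
                        [2, 2010, 60], [2, 2011, 30], [2, 2012, 10],
                        [3, 2012, 500]],
                  columns=['ID', 'Year', 'Value'])

group_mean = df.groupby('ID')['Value'].mean()           # one entry per ID
row_mean = df.groupby('ID')['Value'].transform('mean')  # one entry per original row

print(len(group_mean), len(row_mean))
print(row_mean.index.equals(df.index))  # same index as the original DataFrame
```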

Why is that good? Because now we can easily add the returned `Series` as a new column to the original `DataFrame`:

In [13]:
df['ID_avg'] = df.groupby('ID')['Value'].transform('mean')
df

|  | ID | Year | Value | ID_avg |
|---|---|---|---|---|
| 0 | 1 | 2010 | 50 | 75.00 |
| 1 | 1 | 2011 | 100 | 75.00 |
| 2 | 2 | 2010 | 60 | 33.33 |
| 3 | 2 | 2011 | 30 | 33.33 |
| 4 | 2 | 2012 | 10 | 33.33 |
| 5 | 3 | 2012 | 500 | 500.00 |
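Once the group aggregate sits next to the original values, row-level comparisons become simple column arithmetic. The sketch below shows two typical uses. The column names `dev_from_avg` and `share_of_ID` are hypothetical and only serve this example:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2010, 50], [1, 2011, 100],
                        [2, 2010, 60], [2, 2011, 30], [2, 2012, 10],
                        [3, 2012, 500]],
                  columns=['ID', 'Year', 'Value'])

# Group mean broadcast back to every row, as in the cell above
df['ID_avg'] = df.groupby('ID')['Value'].transform('mean')

# Deviation of each observation from the mean of its ID
df['dev_from_avg'] = df['Value'] - df['ID_avg']

# Share of each observation in the total of its ID
df['share_of_ID'] = df['Value'] / df.groupby('ID')['Value'].transform('sum')

print(df)
```

Within each `ID`, the deviations sum to zero and the shares sum to one, which is a quick sanity check for this kind of transformation.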


© 2023 Philipp Cornelius