## 03 Pandas Aggregations

Aggregations summarize data with operations such as `mean`, `sum`, `min` and `max`. A single number can give insight into a larger data set.

Pandas supports aggregations similar to numpy.
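
As a minimal sketch of that similarity, the snippet below runs the same reductions on a NumPy array and a pandas Series. The values are made up for illustration. It also shows one difference worth knowing: pandas skips missing values by default, while the plain NumPy reductions propagate them.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration.
values = [3.0, 1.0, 4.0, 1.0, 5.0]
arr = np.array(values)
ser = pd.Series(values)

# The same reductions exist on both objects.
print(arr.sum(), ser.sum())    # 14.0 14.0
print(arr.min(), ser.min())    # 1.0 1.0
print(arr.max(), ser.max())    # 5.0 5.0

# Difference: pandas ignores NaN by default (skipna=True), numpy does not.
with_gap = pd.Series([1.0, np.nan, 3.0])
pandas_total = with_gap.sum()               # 4.0
numpy_total = np.sum(with_gap.to_numpy())   # nan
print(pandas_total, numpy_total)
```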

We use the "data/population.csv" data set again.

In [1]:
import numpy as np
import pandas as pd

import random

# Optionally adjust the display format.
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
np.random.seed(0)

## Loading data

In [3]:
df = pd.read_csv("data/population.csv")

In [4]:
df.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,92490932.0
1,Arab World,ARB,1961,95044497.0
2,Arab World,ARB,1962,97682294.0
3,Arab World,ARB,1963,100411076.0
4,Arab World,ARB,1964,103239902.0


## Basic aggregations

A basic sum over a fixed selection of rows: the 1960 populations of two regions.

In [5]:
df[(df["Country Name"].isin(["Arab World", "European Union"])) & (df.Year == 1960)]

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,92490932.0
627,European Union,EUU,1960,409498463.0


In [6]:
df[(df["Country Name"].isin(["Arab World", "European Union"])) & (df.Year == 1960)].Value.sum()

501989395.0
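
The filter-then-sum pattern can be easier to read when the conditions are named. The following is a sketch on a toy frame with made-up numbers; only the column names mirror the population data set.

```python
import pandas as pd

# Toy data with made-up numbers; column names follow the population data set.
toy = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B", "C", "C"],
    "Year": [1960, 1961, 1960, 1961, 1960, 1961],
    "Value": [10.0, 11.0, 20.0, 22.0, 30.0, 33.0],
})

# Build the two conditions separately, then combine them with & (element-wise AND).
in_selection = toy["Country Name"].isin(["A", "B"])
in_1960 = toy["Year"] == 1960
mask = in_selection & in_1960

# Select the matching rows and the Value column in one step, then aggregate.
total = toy.loc[mask, "Value"].sum()
print(total)  # 30.0
```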

Min, max, mean for a given country.

In [7]:
df[df["Country Name"] == "Germany"].Value.min()

72814900.0

In [8]:
df[df["Country Name"] == "Germany"].Value.max()

82667685.0

In [9]:
df[df["Country Name"] == "Germany"].Value.mean()

79381016.42105263

In [10]:
df[df["Country Name"] == "Germany"].Value.describe()

count         57.000
mean    79381016.421
std      2543088.141
min     72814900.000
25%     78091820.000
50%     78936666.000
75%     81902307.000
max     82667685.000
Name: Value, dtype: float64
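
If only a few of these statistics are needed, `agg` accepts a list of aggregation names and returns them in one Series. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values, only for illustration.
values = pd.Series([4.0, 8.0, 15.0, 16.0, 23.0, 42.0], name="Value")

# Compute several aggregations in a single call.
summary = values.agg(["min", "max", "mean"])
print(summary)
```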

Finding the index label where the minimum or maximum occurred, with `idxmin` and `idxmax`.

In [11]:
df[df["Country Name"] == "Germany"].Value.idxmin()

6721

In [12]:
df.loc[6721]  # idxmin returns an index label, so use the label-based indexer

Country Name        Germany
Country Code            DEU
Year                   1960
Value          72814900.000
Name: 6721, dtype: object

In [13]:
df[df["Country Name"] == "Germany"].Value.idxmax()

6777

In [14]:
df.loc[6777]  # idxmax returns an index label, so use the label-based indexer

Country Name        Germany
Country Code            DEU
Year                   2016
Value          82667685.000
Name: 6777, dtype: object
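
`idxmin` and `idxmax` return index labels, not positions. With the default integer index the two coincide, but they differ as soon as the index is something else. A small sketch with a made-up Series and a non-default index:

```python
import pandas as pd

# Made-up values with a non-default index, so labels and positions differ.
s = pd.Series([30.0, 10.0, 20.0], index=[100, 200, 300])

label = s.idxmin()      # index label of the minimum
position = s.argmin()   # integer position of the minimum
print(label, position)  # 200 1

# Labels go with .loc, positions go with .iloc.
print(s.loc[label], s.iloc[position])  # 10.0 10.0
```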

## Aggregations on groups

The dataset contains data on both regions and countries.

We slice off the first part of the file, which contains regional data, so that only countries remain.

```
$ cat -n data/population.csv
...

   2621 World,WLD,2014,7268986175.73916
   2622 World,WLD,2015,7355220411.68203
   2623 World,WLD,2016,7442135578
   2624 Afghanistan,AFG,1960,8996351
   2625 Afghanistan,AFG,1961,9166764
   2626 Afghanistan,AFG,1962,9345868
...
```

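Create a "countries" data frame. File line 2624 (Afghanistan, 1960) corresponds to row label 2622, because the file starts with a header line and pandas counts rows from zero.

We create a copy, so that this data is separate from the original (we want to modify the new data frame later).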

In [15]:
cdf = df.iloc[2622:].copy()

In [16]:
cdf.iloc[[0, -1]]

Unnamed: 0,Country Name,Country Code,Year,Value
2622,Afghanistan,AFG,1960,8996351.0
14884,Zimbabwe,ZWE,2016,16150362.0
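
A small sketch of what the slice and the copy do, on a made-up frame: the slice keeps the original index labels, and changes to the copy leave the original untouched.

```python
import pandas as pd

# Made-up frame, only for illustration.
toy = pd.DataFrame({"Value": [1.0, 2.0, 3.0, 4.0]})

# Positional slice from row 2 onwards, detached from the original with copy().
tail = toy.iloc[2:].copy()
tail["Value"] = tail["Value"] * 10

print(tail.index.tolist())     # [2, 3], the original labels are kept
print(tail["Value"].tolist())  # [30.0, 40.0]
print(toy["Value"].tolist())   # [1.0, 2.0, 3.0, 4.0], unchanged
```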


Group by year. This is a lazy operation: nothing is aggregated until we ask for a result.

In [17]:
gb = cdf.groupby("Year")

In [18]:
type(gb)

pandas.core.groupby.generic.DataFrameGroupBy
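
To illustrate what "lazy" means here, the sketch below uses a made-up frame: creating the group-by object only records how the rows are split, and values are computed once an aggregation is called on it.

```python
import pandas as pd

# Made-up frame, only for illustration.
toy = pd.DataFrame({
    "Year": [1960, 1960, 1961, 1961],
    "Value": [1.0, 2.0, 3.0, 5.0],
})

gb = toy.groupby("Year")  # describes the grouping, no sums or means yet
print(gb.ngroups)         # 2

sums = gb["Value"].sum()  # the aggregation is computed here
print(sums)
```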

In [19]:
dir(gb)[-10:]

['shift',
 'size',
 'skew',
 'std',
 'sum',
 'tail',
 'take',
 'transform',
 'value_counts',
 'var']

In [20]:
len(dir(gb))  # many attributes, among them aggregations like sum, mean, min, max

166

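`size` returns the number of rows in each group, as a regular Series object. Every year from 1960 to 2016 has a group, so no year is missing. The number of countries is not constant, though: it is 214 per year until 1989, rises to 216 in 1990 and varies between 215 and 217 afterwards.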

In [21]:
gb.size()

Year
1960    214
1961    214
1962    214
1963    214
1964    214
1965    214
1966    214
1967    214
1968    214
1969    214
1970    214
1971    214
1972    214
1973    214
1974    214
1975    214
1976    214
1977    214
1978    214
1979    214
1980    214
1981    214
1982    214
1983    214
1984    214
1985    214
1986    214
1987    214
1988    214
1989    214
1990    216
1991    216
1992    215
1993    215
1994    215
1995    216
1996    216
1997    216
1998    217
1999    217
2000    217
2001    217
2002    217
2003    217
2004    217
2005    217
2006    217
2007    217
2008    217
2009    217
2010    217
2011    217
2012    216
2013    216
2014    216
2015    216
2016    216
dtype: int64
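
Note that `size` counts rows, whether or not they contain missing values, while `count` only counts non-missing entries. A sketch on a made-up frame with one gap:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in 1960.
toy = pd.DataFrame({
    "Year": [1960, 1960, 1961, 1961],
    "Value": [1.0, np.nan, 3.0, 4.0],
})

gb = toy.groupby("Year")
sizes = gb.size()             # rows per group
counts = gb["Value"].count()  # non-missing values per group

print(sizes.tolist())   # [2, 2]
print(counts.tolist())  # [1, 2]
```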

What was the maximum population in each year?

We call `groupby` on the data frame directly each time, because the invocations are quite short.

In [22]:
cdf.groupby("Year").size().head()

Year
1960    214
1961    214
1962    214
1963    214
1964    214
dtype: int64

In [23]:
cdf.groupby("Year").Value.max()

Year
1960    667070000.000
1961    660330000.000
1962    665770000.000
1963    682335000.000
1964    698355000.000
1965    715185000.000
1966    735400000.000
1967    754550000.000
1968    774510000.000
1969    796025000.000
1970    818315000.000
1971    841105000.000
1972    862030000.000
1973    881940000.000
1974    900350000.000
1975    916395000.000
1976    930685000.000
1977    943455000.000
1978    956165000.000
1979    969005000.000
1980    981235000.000
1981    993885000.000
1982   1008630000.000
1983   1023310000.000
1984   1036825000.000
1985   1051040000.000
1986   1066790000.000
1987   1084035000.000
1988   1101630000.000
1989   1118650000.000
1990   1135185000.000
1991   1150780000.000
1992   1164970000.000
1993   1178440000.000
1994   1191835000.000
1995   1204855000.000
1996   1217550000.000
1997   1230075000.000
1998   1241935000.000
1999   1252735000.000
2000   1262645000.000
2001   1271850000.000
2002   1280400000.000
2003   1288400000.000
2004   1296075000.000
2005 

To find the largest country per year, we can:

* group by Year
* access the Value column
* ask for the (first) index of the maximum value
* pass these index labels to the `df.loc` indexer to display the full rows

> Return index of first occurrence of maximum over requested axis.

In [24]:
cdf.loc[cdf.groupby("Year").Value.idxmax()]

Unnamed: 0,Country Name,Country Code,Year,Value
4959,China,CHN,1960,667070000.0
4960,China,CHN,1961,660330000.0
4961,China,CHN,1962,665770000.0
4962,China,CHN,1963,682335000.0
4963,China,CHN,1964,698355000.0
4964,China,CHN,1965,715185000.0
4965,China,CHN,1966,735400000.0
4966,China,CHN,1967,754550000.0
4967,China,CHN,1968,774510000.0
4968,China,CHN,1969,796025000.0
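
The same pattern on a made-up frame, small enough to check by hand: `idxmax` yields one index label per year, and `loc` turns those labels back into full rows.

```python
import pandas as pd

# Made-up frame, only for illustration.
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [1960, 1960, 1961, 1961],
    "Value": [5.0, 7.0, 9.0, 6.0],
})

# One index label per year, pointing at the row with the largest Value.
labels = toy.groupby("Year")["Value"].idxmax()
print(labels.tolist())  # [1, 2]

# Look up the full rows by label.
largest = toy.loc[labels]
print(largest["Country"].tolist())  # ['B', 'A']
```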


We could temporarily exclude China and other large countries to see which country comes next.

* Define a criterion: `cdf["Country Code"].isin(["CHN", "IND", "USA", "RUS", "IDN"])`
* Negate it with `~`
* Group the remaining rows by year
* Take the Value column
* Find the index labels of the maxima with `idxmax`
* Select the rows with the label-based `df.loc` indexer

In [25]:
cdf.loc[cdf[~cdf["Country Code"].isin(["CHN", "IND", "USA", "RUS", "IDN"])].groupby("Year").Value.idxmax()]

Unnamed: 0,Country Name,Country Code,Year,Value
8146,Japan,JPN,1960,92500572.0
8147,Japan,JPN,1961,94943000.0
8148,Japan,JPN,1962,95832000.0
8149,Japan,JPN,1963,96812000.0
8150,Japan,JPN,1964,97826000.0
8151,Japan,JPN,1965,98883000.0
8152,Japan,JPN,1966,99790000.0
8153,Japan,JPN,1967,100725000.0
8154,Japan,JPN,1968,101061000.0
8155,Japan,JPN,1969,103172000.0
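
The long expression can be split into named steps. The sketch below does this on a made-up frame with three hypothetical countries. The lookup with `loc` on the unfiltered frame works because filtering keeps the original index labels.

```python
import pandas as pd

# Made-up frame with three hypothetical countries.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Year": [1960, 1960, 1960, 1961, 1961, 1961],
    "Value": [9.0, 5.0, 7.0, 10.0, 8.0, 6.0],
})

excluded = toy["Country"].isin(["A"])  # criterion
remaining = toy[~excluded]             # negate it with ~

# Index labels of the yearly maxima among the remaining rows.
labels = remaining.groupby("Year")["Value"].idxmax()

runner_up = toy.loc[labels]
print(runner_up["Country"].tolist())  # ['C', 'B']
```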


### Custom aggregations with apply

It is possible to define custom functions to execute on groups.

* `groupby(...).apply` takes a function of one argument (the sub-dataframe)

In [26]:
cdf.groupby("Year", group_keys=True).apply(lambda sf: sf).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Country Name,Country Code,Year,Value
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1960,2622,Afghanistan,AFG,1960,8996351.0
1960,2679,Albania,ALB,1960,1608800.0
1960,2736,Algeria,DZA,1960,11124888.0
1960,2793,American Samoa,ASM,1960,20013.0
1960,2850,Andorra,AND,1960,13411.0


For example, to select one random row from each group, we could sample from the subframe.

In [27]:
cdf.groupby("Year").apply(lambda sf: sf.sample(n=1))

Unnamed: 0_level_0,Unnamed: 1_level_0,Country Name,Country Code,Year,Value
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1960,13889,Turkmenistan,TKM,1960,1603258.0
1961,13149,Swaziland,SWZ,1961,357453.0
1962,5702,Djibouti,DJI,1962,94204.0
1963,7522,Hungary,HUN,1963,10087947.0
1964,10940,Norway,NOR,1964,3694339.0
1965,3653,Belgium,BEL,1965,9463667.0
1966,3654,Belgium,BEL,1966,9527807.0
1967,8663,Kyrgyz Republic,KGZ,1967,2736500.0
1968,6444,France,FRA,1968,51276054.0
1969,3372,"Bahamas, The",BHS,1969,164248.0


## Accessing groups

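It is possible to iterate over the groups directly; each iteration yields the group name and its sub-dataframe. The `.groups` attribute gives access to the groups as well.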

In [28]:
for name, subframe in cdf.groupby("Year"):
    # print(">>> Name of group: %s\n\n>>> Corresponding DataFrame:\n%s\n\n" % (name, subframe))
    print(name, type(subframe), subframe.shape)

1960 <class 'pandas.core.frame.DataFrame'> (214, 4)
1961 <class 'pandas.core.frame.DataFrame'> (214, 4)
1962 <class 'pandas.core.frame.DataFrame'> (214, 4)
1963 <class 'pandas.core.frame.DataFrame'> (214, 4)
1964 <class 'pandas.core.frame.DataFrame'> (214, 4)
1965 <class 'pandas.core.frame.DataFrame'> (214, 4)
1966 <class 'pandas.core.frame.DataFrame'> (214, 4)
1967 <class 'pandas.core.frame.DataFrame'> (214, 4)
1968 <class 'pandas.core.frame.DataFrame'> (214, 4)
1969 <class 'pandas.core.frame.DataFrame'> (214, 4)
1970 <class 'pandas.core.frame.DataFrame'> (214, 4)
1971 <class 'pandas.core.frame.DataFrame'> (214, 4)
1972 <class 'pandas.core.frame.DataFrame'> (214, 4)
1973 <class 'pandas.core.frame.DataFrame'> (214, 4)
1974 <class 'pandas.core.frame.DataFrame'> (214, 4)
1975 <class 'pandas.core.frame.DataFrame'> (214, 4)
1976 <class 'pandas.core.frame.DataFrame'> (214, 4)
1977 <class 'pandas.core.frame.DataFrame'> (214, 4)
1978 <class 'pandas.core.frame.DataFrame'> (214, 4)
1979 <class 
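
Besides iterating, a single group can be looked up by name. A minimal sketch on a made-up frame: `.groups` maps each group name to the index labels of its rows, and `get_group` returns the sub-dataframe for one name.

```python
import pandas as pd

# Made-up frame, only for illustration.
toy = pd.DataFrame({
    "Year": [1960, 1960, 1961, 1961],
    "Value": [1.0, 2.0, 3.0, 5.0],
})

gb = toy.groupby("Year")

# Mapping from group name to the index labels of its rows.
names = list(gb.groups.keys())
labels_1960 = gb.groups[1960].tolist()
print(names, labels_1960)

# The sub-dataframe of a single group.
frame_1961 = gb.get_group(1961)
print(frame_1961["Value"].sum())  # 8.0
```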