# Manipulating DataFrames with Pandas - 2 (Grouping Data)

We can use the `groupby()` method to:
* Split data into groups
* Apply a function to each group
  - Aggregate (e.g. compute counts, means)
  - Transform (e.g. fill NAs within groups with group specific values)
  - Filter (filter out data based on a group statistic)
* Combine the results
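The three kinds of "apply" step listed above can be sketched on a small, made-up frame. The `toy` data and the threshold of 60 are assumptions for illustration only, not part of the Gapminder dataset.

```python
import pandas as pd

# Hypothetical toy data (not the Gapminder file)
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "life": [70.0, None, 50.0, 60.0, 55.0],
})

# Split: one group per region, looking only at the 'life' column
groups = toy.groupby("region")["life"]

# Aggregate: one value per group (NaN is skipped by default)
means = groups.mean()

# Transform: the result has the same length as the input, so it can be
# used to fill each missing value with the mean of its own group
filled = toy["life"].fillna(groups.transform("mean"))

# Filter: keep only the rows of groups whose mean life exceeds 60
long_lived = toy.groupby("region").filter(lambda g: g["life"].mean() > 60)
```

Here `means` has one row per region, `filled` has one row per original row, and `long_lived` keeps only the rows of region `A`. In each case pandas combines the per-group results back into a single object.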

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

We'll use the _Gapminder_ dataset in this notebook.

In [2]:
df = pd.read_csv('data/gapminder.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10111 entries, 0 to 10110
Data columns (total 8 columns):
Country            10111 non-null object
Year               10111 non-null int64
fertility          10100 non-null float64
life               10111 non-null float64
population         10108 non-null float64
child_mortality    9210 non-null float64
gdp                9000 non-null float64
region             10111 non-null object
dtypes: float64(5), int64(1), object(2)
memory usage: 632.0+ KB


>Aggregating functions reduce the dimension of the returned object. Some common aggregating functions are tabulated below:

| Function | Description |
|---|---|
| `mean()` | Compute mean of groups |
| `sum()` | Compute sum of group values |
| `size()` | Compute group sizes |
| `count()` | Compute count of group |
| `std()` | Standard deviation of groups |
| `var()` | Compute variance of groups |
| `sem()` | Standard error of the mean of groups |
| `describe()` | Generates descriptive statistics |
| `first()` | Compute first of group values |
| `last()` | Compute last of group values |
| `nth()` | Take nth value, or a subset if n is a list |
| `min()` | Compute min of group values |
| `max()` | Compute max of group values |

[source](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
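Two entries in this list are easy to confuse: `size()` counts rows per group, while `count()` counts only non-null values. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing gdp value
toy = pd.DataFrame({
    "region": ["A", "A", "B"],
    "gdp": [1.0, None, 3.0],
})
grouped = toy.groupby("region")["gdp"]

sizes = grouped.size()    # rows per group, NaN included: A -> 2, B -> 1
counts = grouped.count()  # non-null values per group:    A -> 1, B -> 1
```

In the Gapminder frame loaded earlier, `gdp` has 9000 non-null values out of 10111 rows, so the two functions would give different totals for that column.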

In [3]:
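# numeric_only=True skips the non-numeric Country column (needed in pandas >= 2.0)
df.groupby('region').mean(numeric_only=True)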

| region | Year | fertility | life | population | child_mortality | gdp |
|---|---|---|---|---|---|---|
| America | 1988.5 | 3.486061 | 68.722251 | 17745720.0 | 50.513292 | 11599.921875 |
| East Asia & Pacific | 1988.510931 | 3.725836 | 66.108632 | 54686190.0 | 59.337826 | 13336.156923 |
| Europe & Central Asia | 1988.550781 | 2.214177 | 71.931303 | 16003580.0 | 30.180168 | 18442.045417 |
| Middle East & North Africa | 1988.5 | 4.970019 | 65.194301 | 11713030.0 | 69.884533 | 27510.731579 |
| South Asia | 1988.5 | 5.004162 | 57.13771 | 140678200.0 | 137.76715 | 2552.65 |
| Sub-Saharan Africa | 1988.5 | 5.956105 | 51.664426 | 10509980.0 | 158.917473 | 3152.428511 |


In [4]:
# Group by two keys; numeric_only=True skips the non-numeric Country column
df.groupby(['Year', 'region']).mean(numeric_only=True).head()

| Year | region | fertility | life | population | child_mortality | gdp |
|---|---|---|---|---|---|---|
| 1964 | America | 5.57465 | 60.462775 | 11554890.0 | 113.950667 | 6813.875 |
| 1964 | East Asia & Pacific | 5.708032 | 56.798429 | 34708390.0 | 129.10913 | 6431.5 |
| 1964 | Europe & Central Asia | 3.270488 | 67.84011 | 13703030.0 | 61.585319 | 9760.0625 |
| 1964 | Middle East & North Africa | 6.965571 | 52.11981 | 5623595.0 | 179.605263 | 10962.157895 |
| 1964 | South Asia | 6.4805 | 43.877125 | 78134240.0 | 256.9225 | 1233.875 |
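Grouping by a list of keys returns a result indexed by a `MultiIndex` with one level per key. The sketch below, on a hypothetical `toy` frame, shows three common ways of working with such a result: selecting one key pair, pivoting a level into columns with `unstack()`, and flattening back to ordinary columns with `reset_index()`.

```python
import pandas as pd

# Hypothetical toy data (not the Gapminder file)
toy = pd.DataFrame({
    "Year": [1964, 1964, 1964, 1965],
    "region": ["A", "A", "B", "A"],
    "life": [60.0, 62.0, 50.0, 63.0],
})

# Series indexed by (Year, region)
by_year_region = toy.groupby(["Year", "region"])["life"].mean()

# Select a single (Year, region) pair with a tuple
value = by_year_region.loc[(1964, "A")]

# Move the 'region' level into columns: one row per Year, one column per region.
# Combinations with no data (here 1965 / B) become NaN.
wide = by_year_region.unstack("region")

# Turn both index levels back into ordinary columns
flat = by_year_region.reset_index()
```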


Multiple aggregations can be applied at once with the `.agg()` method.

In [5]:
df.groupby('region')['population'].agg(['max', 'mean'])

| region | max | mean |
|---|---|---|
| America | 318497600.0 | 17745720.0 |
| East Asia & Pacific | 1359368000.0 | 54686190.0 |
| Europe & Central Asia | 148945600.0 | 16003580.0 |
| Middle East & North Africa | 85378440.0 | 11713030.0 |
| South Asia | 1275138000.0 | 140678200.0 |
| Sub-Saharan Africa | 170901100.0 | 10509980.0 |
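Besides a list of function names, `.agg()` also accepts a dict mapping columns to functions. Since pandas 0.25 it also accepts keyword arguments of the form `new_name=(column, function)`, known as named aggregation. A minimal sketch on made-up data; the `toy` frame and the output column names are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data (not the Gapminder file)
toy = pd.DataFrame({
    "region": ["A", "A", "B"],
    "population": [100.0, 300.0, 50.0],
    "life": [70.0, 72.0, 55.0],
})

# Dict form: a different aggregation for each column
per_column = toy.groupby("region").agg({"population": "sum", "life": "mean"})

# Named aggregation: choose the output column names explicitly
named = toy.groupby("region").agg(
    max_population=("population", "max"),
    mean_life=("life", "mean"),
)
```

The named form keeps the result's columns flat, which is usually easier to work with than the two-level column header produced when a list of functions is applied to several columns.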


In [7]:
# Reload the data with region and Year as a two-level index
df = pd.read_csv('data/gapminder.csv', index_col=['region', 'Year'])
df.head()

| region | Year | Country | fertility | life | population | child_mortality | gdp |
|---|---|---|---|---|---|---|---|
| South Asia | 1964 | Afghanistan | 7.671 | 33.639 | 10474903.0 | 339.7 | 1182.0 |
| South Asia | 1965 | Afghanistan | 7.671 | 34.152 | 10697983.0 | 334.1 | 1182.0 |
| South Asia | 1966 | Afghanistan | 7.671 | 34.662 | 10927724.0 | 328.7 | 1168.0 |
| South Asia | 1967 | Afghanistan | 7.671 | 35.17 | 11163656.0 | 323.3 | 1173.0 |
| South Asia | 1968 | Afghanistan | 7.671 | 35.674 | 11411022.0 | 318.1 | 1187.0 |
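When the grouping keys live in the index rather than in columns, as they do here, one option is to group by index level with the `level` argument of `groupby()`. The sketch below uses a hypothetical `toy` frame built with `set_index()` to mimic the two-level index. It illustrates the technique only and is not necessarily how the notebook continues.

```python
import pandas as pd

# Hypothetical toy data with the same index layout (region, Year)
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "Year": [1964, 1965, 1964, 1965],
    "life": [60.0, 62.0, 50.0, 52.0],
}).set_index(["region", "Year"])

# Group on one index level at a time, referring to the level by name
by_region = toy.groupby(level="region")["life"].mean()  # A -> 61.0, B -> 51.0
by_year = toy.groupby(level="Year")["life"].mean()      # 1964 -> 55.0, 1965 -> 57.0
```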
