# Aggregation and Grouping
An essential piece of analysis of large data is efficient summarization: computing
aggregations like sum(), mean(), median(), min(), and max(), in which a single number
gives insight into the nature of a potentially large dataset.

### Planets Data
Here we will use the Planets dataset, available via the <mark><b>Seaborn</b></mark> package. It gives information on planets that astronomers
have discovered around other stars (known as extrasolar planets or exoplanets for
short). It can be downloaded with a simple Seaborn command:

In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns

In [2]:
# list of datasets in the Seaborn Library
sns.get_dataset_names()

['anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'iris',
 'mpg',
 'planets',
 'tips',
 'titanic']

In [3]:
# This has some details on the 1,000+ exoplanets discovered up to 2014.
planets = sns.load_dataset('planets')
print(planets.shape); planets.head()

(1035, 6)


,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


## Simple Aggregation in Pandas
As with a one-dimensional
NumPy array, for a Pandas Series the aggregates return a single value:

In [4]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [5]:
# aggregates computed with Series methods
print("Sum:", ser.sum()); print('Mean:',ser.mean()); print('Max:',ser.max())

Sum: 2.811925491708157
Mean: 0.5623850983416314
Max: 0.9507143064099162


For a DataFrame, by default the aggregates return results within each column:

In [6]:
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
df

,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


In [7]:
# by default, aggregates are computed within each column
print(df.mean())

A    0.477888
B    0.443420
dtype: float64


By specifying the axis argument, you can instead aggregate within each row:

In [8]:
# aggregation within each row
df.mean(axis='columns')

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64
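
To make the direction of the axis argument easier to check by hand, here is a minimal sketch on a tiny DataFrame. The name `toy` and its values are made up for illustration and are not part of the Planets data.

```python
import pandas as pd

# Hypothetical toy data, chosen so the results are easy to verify by hand
toy = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [3.0, 4.0, 5.0]})

col_means = toy.mean()                # default (axis=0): one value per column
row_means = toy.mean(axis='columns')  # same as axis=1: one value per row

print(col_means)
print(row_means)
```

The axis keyword names the dimension that is collapsed, so `axis='columns'` averages across the columns and leaves one value for each row.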

Pandas Series and DataFrames include all of the common
aggregations (min, max, mean, median, mode); in addition, there
is a convenience method describe() that computes several common aggregates for
each column and returns the result. Let’s use this on the Planets data, for now dropping
rows with missing values:


In [9]:
planets.dropna().describe()

,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


This can be a useful way to begin understanding the overall properties of a dataset.
For example, we see in the year column that although <i>exoplanets</i> were discovered as
far back as 1989, half of the planets in this table were not discovered until 2009 or after.
This is largely thanks to the <i>Kepler</i> mission, a space-based telescope specifically
designed for finding eclipsing planets around other stars.

In [10]:
# Count - Total number of items
planets.count()

method            1035
number            1035
orbital_period     992
mass               513
distance           808
year              1035
dtype: int64

In [11]:
# first and last rows (DataFrame.first and DataFrame.last are date-offset selectors, so use head() and tail())
print(planets.head(1)); print(planets.tail(1))

            method  number  orbital_period  mass  distance  year
0  Radial Velocity       1           269.3   7.1      77.4  2006
       method  number  orbital_period  mass  distance  year
1034  Transit       1        4.187757   NaN     260.0  2008

In [12]:
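# sum, mean and median (numeric_only=True skips the string column, which recent pandas versions cannot average)
print(planets.sum()); print(planets.mean(numeric_only=True)); print(planets.median(numeric_only=True))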

method            Radial VelocityRadial VelocityRadial VelocityR...
number                                                         1848
orbital_period                                          1.98689e+06
mass                                                        1353.38
distance                                                     213368
year                                                        2079388
dtype: object
number               1.785507
orbital_period    2002.917596
mass                 2.638161
distance           264.069282
year              2009.070531
dtype: float64
number               1.0000
orbital_period      39.9795
mass                 1.2600
distance            55.2500
year              2010.0000
dtype: float64


In [13]:
# Max & Min
print(planets.max()); print(planets.min())

method            Transit Timing Variations
number                                    7
orbital_period                       730000
mass                                     25
distance                               8500
year                                   2014
dtype: object
method            Astrometry
number                     1
orbital_period     0.0907063
mass                  0.0036
distance                1.35
year                    1989
dtype: object


In [14]:
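# standard deviation, variance, and mean absolute deviation (MAD)
# DataFrame.mad() was removed in pandas 2.0, so the MAD is computed explicitly
num = planets.select_dtypes('number')
print(num.std()); print(num.var()); print((num - num.mean()).abs().mean())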

number                1.240976
orbital_period    26014.728304
mass                  3.818617
distance            733.116493
year                  3.972567
dtype: float64
number            1.540022e+00
orbital_period    6.767661e+08
mass              1.458183e+01
distance          5.374598e+05
year              1.578129e+01
dtype: float64
number               0.903144
orbital_period    3181.747643
mass                 2.580056
distance           318.100885
year                 3.056798
dtype: float64


In [15]:
# prod() - Product of all items (numeric columns only)
planets.prod(numeric_only=True)

number            6.022481e+184
orbital_period              inf
mass               9.626219e-60
distance                    inf
year                        inf
dtype: float64

These are all methods of DataFrame and Series objects.

To go deeper into the data, however, simple aggregates are often not enough. The
next level of data summarization is the groupby operation, which allows you to
quickly and efficiently compute aggregates on subsets of data.

## GroupBy: Split, Apply, Combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer
to aggregate conditionally on some label or index: this is implemented in the so-called
groupby operation. The name “group by” comes from a command in the SQL
database language.
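
The split, apply, and combine steps named in the heading can be spelled out by hand. The sketch below is only an illustration on a made-up DataFrame (`toy` is a hypothetical name), and it is not how Pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': [1, 2, 3, 4, 5, 6]})

# Split: select the rows belonging to each distinct key.
# Apply: sum the 'data' values of each piece.
pieces = {k: toy.loc[toy['key'] == k, 'data'].sum()
          for k in toy['key'].unique()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(pieces, name='data').sort_index()
manual.index.name = 'key'

# The same computation expressed as one groupby call.
grouped = toy.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

Both routes give the same per-key totals; groupby simply avoids writing the intermediate splits explicitly.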

### The GroupBy object
The GroupBy object is a very flexible abstraction. In many ways, you can simply treat
it as if it’s a collection of DataFrames, and it does the difficult things under the hood.
Let’s see some examples using the Planets data.

<mark><b>Column indexing</b></mark>. The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object. For example:

In [16]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E4955C05C8>

In [17]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001E495643948>

Here we’ve selected a particular Series group from the original DataFrame group by
reference to its column name. As with the GroupBy object, no computation is done
until we call some aggregate on the object:

In [18]:
# This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

<mark><b>Iteration over groups</b></mark>. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:

In [19]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


This can be useful for doing certain things manually, though it is often much faster to
use the built-in apply functionality, which we will discuss momentarily.
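
As a rough comparison of the two routes, the sketch below computes a per-group mean once with an explicit loop and once with a built-in aggregate. The DataFrame `toy` is a made-up example.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'value': [1.0, 2.0, 3.0, 6.0]})

# Manual route: loop over (label, sub-frame) pairs and aggregate each one
manual = {}
for label, group in toy.groupby('key'):
    manual[label] = group['value'].mean()

# Built-in route: a single call on the GroupBy object
builtin = toy.groupby('key')['value'].mean()

print(manual)
print(builtin)
```

The loop is handy when each group needs custom handling such as plotting or writing to a file; for plain aggregates the built-in call is shorter and usually faster.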

<mark><b>Dispatch methods</b></mark>. Through some Python class magic, any method not explicitly
implemented by the GroupBy object will be passed through and called on the groups,
whether they are DataFrame or Series objects. For example, you can use the
describe() method of DataFrames to perform a set of aggregations that describe each
group in the data:

In [20]:
planets.groupby('method')['year'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


Looking at this table helps us to better understand the data: for example, the vast
majority of planets have been discovered by the Radial Velocity and Transit methods,
though the latter only became common (due to new, more accurate telescopes) in the
last decade. The newest methods seem to be Transit Timing Variations and Orbital
Brightness Modulation, which were not used to discover a new planet until 2011.

This is just one example of the utility of dispatch methods. Notice that they are
applied to each individual group, and the results are then combined within GroupBy
and returned. Again, any valid DataFrame/Series method can be used on the corresponding
GroupBy object, which allows for some very flexible and powerful
operations!

### Aggregate, filter, transform, apply
The preceding discussion focused on aggregation for the combine operation, but
there are more options available. In particular, GroupBy objects have aggregate(),
filter(), transform(), and apply() methods that efficiently implement a variety of
useful operations before combining the grouped data.

In [21]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data1': range(6),'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


<mark><b>Aggregation</b></mark>. We’re now familiar with GroupBy aggregations with sum(), median(),
and the like, but the aggregate() method allows for even more flexibility. It can take
a string, a function, or a list thereof, and compute all the aggregates at once. Here is a
quick example combining all these:

In [22]:
df.groupby('key').aggregate(['min', np.median, max])

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


Another useful pattern is to pass a dictionary mapping column names to operations
to be applied on that column:

In [23]:
df.groupby('key').aggregate({'data1': 'min','data2': 'max'})

key,data1,data2
A,0,5
B,1,7
C,2,9
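
The dictionary form keeps the original column names, which gets awkward when one column needs several aggregates. One alternative is the keyword form of agg(), often called named aggregation. The sketch below re-creates the small DataFrame under the assumed name `df_demo` so that it runs on its own; the output column names are arbitrary choices.

```python
import pandas as pd

# Same values as the small example DataFrame, re-created so this block is self-contained
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword becomes an output column: new_name=(input column, aggregation)
summary = df_demo.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_mean=('data2', 'mean'),
)
print(summary)
```

This gives flat, explicitly named columns rather than a hierarchical column index.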


<mark><b>Filtering</b></mark>. A filtering operation allows you to drop data based on the group properties.
For example, we might want to keep all groups in which the standard deviation is
larger than some critical value:

In [24]:
def filter_func(x):
    return x['data2'].std() > 4
print(df); print(df.groupby('key').std());print(df.groupby('key').filter(filter_func))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9


The function passed to filter() should return a Boolean value specifying whether the group
passes the filtering. Here because group A does not have a standard deviation greater
than 4, it is dropped from the result.
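
One way to see what filter() does is to compare it with an explicit Boolean mask. In the sketch below, `df_demo` is an assumed name for a re-created copy of the example DataFrame, and the mask is built by broadcasting each group's standard deviation back to its rows with transform().

```python
import pandas as pd

# Same values as the small example DataFrame, re-created so this block is self-contained
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# filter() keeps every row of each group for which the function returns True
kept = df_demo.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Equivalent mask: the per-group std, repeated for every row of that group
mask = df_demo.groupby('key')['data2'].transform('std') > 4
kept_via_mask = df_demo[mask]

print(kept)
print(kept_via_mask)
```

Both versions keep the rows of groups B and C in their original order. The mask form can be convenient when the same condition is reused elsewhere.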

<mark><b>Transformation</b></mark>. While aggregation must return a reduced version of the data, transformation
can return some transformed version of the full data to recombine. For
such a transformation, the output is the same shape as the input. A common example
is to center the data by subtracting the group-wise mean:

In [25]:
df.groupby('key').transform(lambda x: x - x.mean())

,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
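
Because a transformation returns a result aligned with the original rows, it combines naturally with other row-wise operations. As a sketch on made-up data (`toy` is a hypothetical name), the following fills each missing value with the mean of its own group:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each group
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B', 'A', 'B'],
                    'value': [1.0, 10.0, np.nan, 30.0, 3.0, np.nan]})

# transform('mean') returns the group mean for every original row
group_mean = toy.groupby('key')['value'].transform('mean')

# Fill each gap with the mean of the group that row belongs to
toy['filled'] = toy['value'].fillna(group_mean)
print(toy)
```

Passing the string 'mean' instead of a lambda lets Pandas use its built-in implementation of the aggregate.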


<mark><b>The apply()</b></mark> method. The apply() method lets you apply an arbitrary function to the
group results. The function should take a DataFrame, and return either a Pandas
object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to
the type of output returned.

For example, here is an apply() that normalizes the first column by the sum of the
second:

In [26]:
def norm_by_data2(x):
    # x is a DataFrame of group values; work on a copy so the input group is not modified
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x
print(df); print(df.groupby('key', group_keys=False).apply(norm_by_data2))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
  key     data1  data2
0   A  0.000000      5
1   B  0.142857      0
2   C  0.166667      3
3   A  0.375000      3
4   B  0.571429      7
5   C  0.416667      9


<i>apply()</i> within a GroupBy is quite flexible: the only criterion is that the function takes
a DataFrame and returns a Pandas object or scalar; what you do in the middle is up to
you!
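
To illustrate how the combine step depends on the return type, the sketch below applies one function that returns a scalar and one that returns a Series. The names `df_demo`, `data2_range`, and `min_max` are assumptions made for this example; the data columns are selected explicitly so that each function receives only `data1` and `data2`.

```python
import pandas as pd

# Same values as the small example DataFrame, re-created so this block is self-contained
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

def data2_range(g):
    # Scalar per group: results are combined into a Series indexed by key
    return g['data2'].max() - g['data2'].min()

def min_max(g):
    # Series per group: results are combined into a DataFrame, one row per key
    return pd.Series({'lo': g['data2'].min(), 'hi': g['data2'].max()})

grouped = df_demo.groupby('key')[['data1', 'data2']]
ranges = grouped.apply(data2_range)
bounds = grouped.apply(min_max)

print(ranges)
print(bounds)
```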

### Specifying the split key
In the simple examples presented before, we split the DataFrame on a single column
name. This is just one of many options by which the groups can be defined, and we’ll
go through some other options for group specification here.

<mark><b>A list, array, series, or index providing the grouping keys</b></mark>. The key can be any series or list
with a length matching that of the DataFrame. For example:

In [27]:
L = [0, 1, 0, 1, 2, 0]
print(df); print(df.groupby(L).sum(numeric_only=True))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
   data1  data2
0      7     17
1      4      3
2      4      7


Of course, this means there’s another, more verbose way of accomplishing the
df.groupby('key') from before:

In [28]:
print(df); print(df.groupby(df['key']).sum())

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
     data1  data2
key              
A        3      8
B        5      7
C        7     12


<mark><b>A dictionary or series mapping index to group</b></mark>. Another method is to provide a dictionary
that maps index values to the group keys:

In [29]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
print(df2); print(df2.groupby(mapping).sum())

     data1  data2
key              
A        0      5
B        1      0
C        2      3
A        3      3
B        4      7
C        5      9
           data1  data2
consonant     12     19
vowel          3      8
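
The heading also mentions a Series as the mapping. A minimal sketch of that variant follows, assuming the Series index holds the DataFrame's index labels and its values name the groups. Here `df2_demo` and `mapping_ser` are hypothetical names, and the data are re-created so the block runs on its own.

```python
import pandas as pd

# Same values as the example DataFrame indexed by key, re-created for a self-contained block
df2_demo = pd.DataFrame(
    {'data1': range(6), 'data2': [5, 0, 3, 3, 7, 9]},
    index=pd.Index(['A', 'B', 'C', 'A', 'B', 'C'], name='key'),
)

# Index: labels found in the DataFrame index; values: the group each label maps to
mapping_ser = pd.Series({'A': 'vowel', 'B': 'consonant', 'C': 'consonant'})

# The Series is aligned to the DataFrame index before grouping
result = df2_demo.groupby(mapping_ser).sum()
print(result)
```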


<mark><b>Any Python function</b></mark>. Similar to mapping, you can pass any Python function that will
input the index value and output the group:

In [30]:
print(df2); print(df2.groupby(str.lower).mean())

     data1  data2
key              
A        0      5
B        1      0
C        2      3
A        3      3
B        4      7
C        5      9
   data1  data2
a    1.5    4.0
b    2.5    3.5
c    3.5    6.0


<mark><b>A list of valid keys</b></mark>. Further, any of the preceding key choices can be combined to
group on a multi-index:

In [31]:
df2.groupby([str.lower, mapping]).mean()

,,data1,data2
a,vowel,1.5,4.0
b,consonant,2.5,3.5
c,consonant,3.5,6.0


### Grouping example
As an example of this, in a couple of lines of Python code we can put all these together
and count discovered planets by method and by decade:

In [32]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,,,,
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0


This shows the power of combining many of the operations we’ve discussed up to this
point when looking at realistic datasets. We immediately gain a coarse understanding
of when and how planets have been discovered over the past several decades!
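
To follow the individual steps of that chain, it can help to run them on a table small enough to check by hand. The sketch below uses a made-up miniature dataset (`toy` is hypothetical and is not the real Planets data) and keeps the intermediate result before unstacking.

```python
import pandas as pd

# Hypothetical miniature table, not the real Planets data
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging', 'Transit', 'Astrometry'],
    'number': [1, 2, 1, 3, 1],
    'year':   [2008, 2011, 2009, 2012, 2013],
})

# Floor each year to its decade and turn it into a label, e.g. 2008 -> '2000s'
decade = (10 * (toy['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# Group by a column name and an external Series at the same time
long_form = toy.groupby(['method', decade])['number'].sum()

# Move the decade level into columns; combinations that never occur become 0
wide_form = long_form.unstack().fillna(0)

print(long_form)
print(wide_form)
```

The intermediate `long_form` is a Series with a two-level index; unstack() pivots the inner level into columns, and fillna(0) replaces the method and decade pairs that have no rows.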