In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme()

Aggregation is part of EDA in the narrow sense: computing descriptive statistics (mean, median, min, max, correlations, etc.):

> A fundamental piece of many data analysis tasks is efficient summarization: computing aggregations like sum, mean, median, min, and max, in which a single number summarizes aspects of a potentially large dataset. 

### Planets Data

In [3]:
planets = sns.load_dataset('planets')

In [4]:
planets.shape

(1035, 6)

In [6]:
planets.round(2).head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.77,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


### Simple Aggregation in Pandas

In [7]:
planets.columns

Index(['method', 'number', 'orbital_period', 'mass', 'distance', 'year'], dtype='object')

In [8]:
planets_short = planets[['orbital_period', 'mass', 'distance']]

> For a DataFrame, by default the aggregates return results within each column. By specifying the axis argument, you can instead aggregate within each row. Again, somewhat counterintuitively, we aggregate within each row but specify `axis=1` (or `axis='columns'`).

In [13]:
planets_short.sum()

orbital_period    1.986894e+06
mass              1.353376e+03
distance          2.133680e+05
dtype: float64

In [15]:
planets_short.mean()

orbital_period    2002.917596
mass                 2.638161
distance           264.069282
dtype: float64


In [17]:
planets_short.mean(axis=1).shape

(1035,)
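To make the `axis` convention concrete, here is a minimal sketch on a small, made-up DataFrame (the name `toy` and its values are illustrative assumptions, not part of the planets data). It contrasts the default column-wise mean with the row-wise mean obtained via `axis=1`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the planets dataset
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                    'b': [10.0, 20.0, 30.0]})

# Default (axis=0): collapse the rows, leaving one value per column
col_means = toy.mean()

# axis=1: collapse the columns, leaving one value per row
row_means = toy.mean(axis=1)

print(col_means.shape)  # one entry per column
print(row_means.shape)  # one entry per row
```

The result of `axis=1` has as many entries as the frame has rows, which is why `planets_short.mean(axis=1).shape` above is `(1035,)`.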

`describe` is itself an aggregate method that computes the most common statistics:
> Pandas Series and DataFrame objects include all of the common aggregates mentioned in Chapter 7; in addition, there is a convenience method, describe, that computes several common aggregates for each column and returns the result.

In [20]:
planets.dropna().describe().round(2)

,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73,835.78,2.51,52.07,2007.38
std,1.18,1469.13,3.64,46.6,4.17
min,1.0,1.33,0.0,1.35,1989.0
25%,1.0,38.27,0.21,24.5,2005.0
50%,1.0,357.0,1.25,39.94,2009.0
75%,2.0,999.6,2.87,59.33,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [21]:
planets.dropna().describe().round(2).T

,count,mean,std,min,25%,50%,75%,max
number,498.0,1.73,1.18,1.0,1.0,1.0,2.0,6.0
orbital_period,498.0,835.78,1469.13,1.33,38.27,357.0,999.6,17337.5
mass,498.0,2.51,3.64,0.0,0.21,1.25,2.87,25.0
distance,498.0,52.07,46.6,1.35,24.5,39.94,59.33,354.0
year,498.0,2007.38,4.17,1989.0,2005.0,2009.0,2011.0,2014.0
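The cells above call `dropna()` before `describe()`, which removes every row that has a missing value in any column. The sketch below, on a small invented frame (the name `toy` and its values are assumptions for illustration), shows how that differs from calling `describe()` directly, which skips missing values column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({'mass': [1.0, np.nan, 3.0, 5.0],
                    'distance': [10.0, 20.0, np.nan, 40.0]})

# describe() alone: each column ignores only its own missing values
per_column = toy.describe()

# dropna() first: only rows that are complete in every column survive
complete_rows = toy.dropna().describe()

print(per_column.loc[['count', 'mean']])
print(complete_rows.loc[['count', 'mean']])
```

Which variant is appropriate depends on the question: dropping incomplete rows gives statistics over the same set of observations in every column, at the cost of discarding data.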


### `groupby`

In [22]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

In [23]:
df

,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [24]:
df.groupby('key').sum()

key,data
A,3
B,5
C,7
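One way to read the `groupby('key').sum()` result is as a split-apply-combine operation. The sketch below spells the three steps out by hand on the same small frame and compares the result with the built-in call; the variable names `manual` and `grouped` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

# Split: iterating over the GroupBy object yields (key, sub-frame) pairs.
# Apply: sum the 'data' column of each sub-frame.
# Combine: collect the per-key results into a single Series.
manual = pd.Series(
    {key: group['data'].sum() for key, group in df.groupby('key')},
    name='data',
)

# The built-in aggregation performs the same steps in one call
grouped = df.groupby('key')['data'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

The explicit loop is only meant to show what happens conceptually; the built-in aggregation is the idiomatic form.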


In [25]:
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [26]:
planets.method.unique()

array(['Radial Velocity', 'Imaging', 'Eclipse Timing Variations',
       'Transit', 'Astrometry', 'Transit Timing Variations',
       'Orbital Brightness Modulation', 'Microlensing', 'Pulsar Timing',
       'Pulsation Timing Variations'], dtype=object)

In [29]:
# Compute mean orbital period for each method
planets.groupby('method')['orbital_period'].mean().round(2)

method
Astrometry                          631.18
Eclipse Timing Variations          4751.64
Imaging                          118247.74
Microlensing                       3153.57
Orbital Brightness Modulation         0.71
Pulsar Timing                      7343.02
Pulsation Timing Variations        1170.00
Radial Velocity                     823.35
Transit                              21.10
Transit Timing Variations            79.78
Name: orbital_period, dtype: float64

In [33]:
# Compute mean distance for each method
planets.groupby('method')['distance'].mean().round(2)

method
Astrometry                         17.88
Eclipse Timing Variations         315.36
Imaging                            67.72
Microlensing                     4144.00
Orbital Brightness Modulation    1180.00
Pulsar Timing                    1200.00
Pulsation Timing Variations          NaN
Radial Velocity                    51.60
Transit                           599.30
Transit Timing Variations        1104.33
Name: distance, dtype: float64
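The `NaN` for one method in the output above is what a group mean looks like when every value in the group is missing. A minimal sketch with invented data (the frame `toy`, its method labels, and its distances are assumptions for illustration) shows the same effect, with a non-missing count alongside the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the 'Pulsation' group has no recorded distance
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Pulsation', 'Imaging'],
    'distance': [100.0, 300.0, np.nan, 50.0],
})

# count = number of non-missing values per group; mean skips missing values
stats = toy.groupby('method')['distance'].agg(['count', 'mean'])

print(stats)
```

Reporting `count` next to `mean` makes it easier to tell a group with no usable values from one with a genuinely small or large average.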

Beyond the standard aggregate methods, GroupBy objects also provide the more general `filter`, `transform`, and `apply` methods. Let's look at a couple of examples with `filter`.

In [31]:
# Filter the methods with more than 10 planets using the filter method
methods_10 = planets.groupby('method').filter(lambda x: len(x) > 10)

In [32]:
methods_10.method.unique()

array(['Radial Velocity', 'Imaging', 'Transit', 'Microlensing'],
      dtype=object)

In [34]:
# Keep only the methods whose mean distance exceeds 1000 using the filter method
distance_1000 = planets.groupby('method').filter(lambda x: x['distance'].mean() > 1000)

In [35]:
distance_1000.method.unique()

array(['Transit Timing Variations', 'Orbital Brightness Modulation',
       'Microlensing', 'Pulsar Timing'], dtype=object)
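The `filter` calls above keep or drop whole groups. For comparison, here is a brief sketch of `transform` and `apply` on the small key/data frame used earlier; the particular lambdas (centering on the group mean, computing the group range) are illustrative choices, not something the notebook prescribes.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

# transform: the result has the same length and index as the input;
# here each value is centered on the mean of its own group
centered = df.groupby('key')['data'].transform(lambda x: x - x.mean())

# apply: run an arbitrary function on each group; here it returns
# one scalar per group (max minus min), so the result has one row per key
spread = df.groupby('key')['data'].apply(lambda x: x.max() - x.min())

print(centered.tolist())
print(spread.to_dict())
```

Roughly, `filter` decides which groups survive, `transform` returns something aligned with the original rows, and `apply` is the most general of the three.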

We can group on much more complex objects than just a single column name. Here's just one example.

In [43]:
# Count discovered planets by method and by decade: 1980s, 1990s, 2000s, 2010s
decade = planets['year'] // 10 * 10
decade = decade.astype(str) + 's'
decade.name = 'decade'

In [45]:
decade.head()

0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object

Here we group on two keys at once: the `method` column and the `decade` Series computed above.

In [42]:
planets.groupby(['method', decade])['number'].sum()

method                         decade
Astrometry                     2010s       2
Eclipse Timing Variations      2000s       5
                               2010s      10
Imaging                        2000s      29
                               2010s      21
Microlensing                   2000s      12
                               2010s      15
Orbital Brightness Modulation  2010s       5
Pulsar Timing                  1990s       9
                               2000s       1
                               2010s       1
Pulsation Timing Variations    2000s       1
Radial Velocity                1980s       1
                               1990s      52
                               2000s     475
                               2010s     424
Transit                        2000s      64
                               2010s     712
Transit Timing Variations      2010s       9
Name: number, dtype: int64

In [46]:
planets.groupby(['method', decade])['number'].sum().unstack()

method,1980s,1990s,2000s,2010s
Astrometry,,,,2.0
Eclipse Timing Variations,,,5.0,10.0
Imaging,,,29.0,21.0
Microlensing,,,12.0,15.0
Orbital Brightness Modulation,,,,5.0
Pulsar Timing,,9.0,1.0,1.0
Pulsation Timing Variations,,,1.0,
Radial Velocity,1.0,52.0,475.0,424.0
Transit,,,64.0,712.0
Transit Timing Variations,,,,9.0


In [47]:
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

method,1980s,1990s,2000s,2010s
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0
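In the table above the counts come out as floats because `unstack()` first introduces `NaN` and `fillna(0)` is applied afterwards. As a possible alternative, `unstack` accepts a `fill_value` argument, which should keep the integer dtype. The sketch below uses a small invented frame (the name `toy` and its rows are assumptions, not the real planets data).

```python
import pandas as pd

# Hypothetical toy data standing in for the planets table
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging', 'Radial Velocity'],
    'year': [2009, 2012, 2008, 1995],
    'number': [1, 2, 1, 1],
})

# Same decade construction as in the cells above
decade = (toy['year'] // 10 * 10).astype(str) + 's'
decade.name = 'decade'

# Fill missing method/decade combinations while unstacking,
# instead of calling fillna(0) on the result
counts = (toy.groupby(['method', decade])['number']
             .sum()
             .unstack(fill_value=0))

print(counts)
print(counts.dtypes)
```

Whether integer output matters is a judgment call; it mainly makes the table easier to read as counts.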
