# Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: Computing aggregations like: **sum(), mean(), median(), min(), max()** in which a single number gives insight into the nature of potentially large dataset. In this section, we'll explore aggregations in Pandas, from simple operationsakin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of groupby.

For convenience, we'll use the same display magic function

In [52]:
import numpy as np
import pandas as pd

from IPython.display import Image
from IPython.core.display import HTML 

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style='float: left; padding: 10px;">
    <p style= 'font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

## Planets Data

Here we will use the Planets Dataset, available via [Seaborn package](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb). It gives infor on planets that astronomers have discovered around other stars (known as extrasolar planets or exoplanets for short). It can be downloaded with a simple Seaborn command

In [7]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [8]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


This has some details on the 1,000+ extrasolar planets discovered up to 2014

## Simple Aggregation in Pandas

Earlier, we explored some of the data aggregations abailable for NumPy arrays. As with one-dimensional NumPy array, for a Pandas Series the aggregates return a single value:

In [10]:
rang = np.random.RandomState(42)
series = pd.Series(rang.rand(5))
series

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [11]:
series.sum()

2.811925491708157

In [12]:
series.mean()

0.5623850983416314

For a DataFrame, by default the aggregates return results within each column

In [18]:
df = pd.DataFrame({'A': range.rand(5),
                   'B': range.rand(5)})

df

Unnamed: 0,A,B
0,0.5979,0.32533
1,0.921874,0.388677
2,0.088493,0.271349
3,0.195983,0.828738
4,0.045227,0.356753


In [19]:
df.mean()

A    0.369895
B    0.434169
dtype: float64

By specifying the *axis* argument, you can instead aggregate within each row:

In [20]:
df.mean(axis='columns')

0    0.461615
1    0.655276
2    0.179921
3    0.512360
4    0.200990
dtype: float64

Pandas *Series* and *DatFrame* include all of the common aggregates like min(), max(), and mean; in addition, there is a convenience method describe() that computes several common aggregates for each column and returns the result. Let's use this on the Planets data, for now dropping rows with missing values:

In [50]:
planets.dropna().describe()

# This is for the next portion

aggs = ['count()', 'first(),last()', 'mean(),median()', 'min(),max()', 'std(),var()',
        'mad()', 'prod()', 'sum()']

desc = ['Total number of items', 'First and last item', 'Mean and median', 'Minimum & maximum',
       'Standard deviation & Variance', 'Mean absolute deviation', 'Product of all items', 'Sum of all items']

data = pd.DataFrame({'Aggregation':aggs, 'Description':desc})


This can be a useful way to begin understadning the overall properties of a dataset. For example, we see in the year column that although exoplanets were discovered as far back as 1989, half of all known exoplanents were not discovered until 2010 or after. This is largely thanks to the *Kepler* mission, which is a space-based telescope specifcally designed for finding eclisping planets around other stars.

The following table summarizes some other built-in Pandas aggregations

In [51]:

data

Unnamed: 0,Aggregation,Description
0,count(),Total number of items
1,"first(),last()",First and last item
2,"mean(),median()",Mean and median
3,"min(),max()",Minimum & maximum
4,"std(),var()",Standard deviation & Variance
5,mad(),Mean absolute deviation
6,prod(),Product of all items
7,sum(),Sum of all items


These are all methods of DataFrame and Series objects

To go deeper into the data, however, simple aggregates are often not enough. The next level of data summarization is the groupby operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

## Groupby: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would preger to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wicham of Rstats game: *spliy, apply, combine.


In [53]:
Image(url= "http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/03.08-split-apply-combine.png")

This makies clear what the groupby accomplished:

- The *split* step involves breaking up and grouping a DataFrame depending on the value of the specified key.
- The *Apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups
- The *combine* step merges the results of these operations into an output array.


While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that the *intermediate splits* do not need to be explicitly instantiated. Rather, the GrouBy can (often) do this in a sinlg e pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. The power of the *gropby* is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.



## The GroupBy object

The GroupBy object is a very flexible abstraction. In many ways, you can simply treat is as if it's a collection of DataFrames, and it does the difficult thinkgs under the hood. Let's see some examples using the Planets data.

Perhaps the most important operations made available by a GroupBy are *aggregate, filter, transform * & *apply*. 

### Column indexing

The groupby object supports column indexing in the same way as the DataGram, and returns a modified GroupBy object for example:


In [57]:
planets.groupby('method')

<pandas.core.groupby.DataFrameGroupBy object at 0x078E7D10>

In [58]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.SeriesGroupBy object at 0x078E7C90>

Here we've selected a particular *series*  group from the original DataFrame group by reference to its column name. As with the GroupBy object, no computation is done until we call some aggregate on the object:

In [59]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [61]:
planets.describe()



Unnamed: 0,number,orbital_period,mass,distance,year
count,1035.0,992.0,513.0,808.0,1035.0
mean,1.785507,2002.917596,2.638161,264.069282,2009.070531
std,1.240976,26014.728304,3.818617,733.116493,3.972567
min,1.0,0.090706,0.0036,1.35,1989.0
25%,1.0,,,,2007.0
50%,1.0,,,,2010.0
75%,2.0,,,,2012.0
max,7.0,730000.0,25.0,8500.0,2014.0


This gives an idea of the general scale of orbiltal periods (in days) that each method is sensitive to.

### Iteration over groups

The GroupBy object supports direct iteration over the groups, returning each group as a *Series* or *DataFrame*

In [63]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


This can be useful for doing certain things manually, though it is often much faster to use the built-ion apply functionality, which we will discuss momentarily

### Dispath methods

Through some Python class magic, any method not explicity implemented by GroupBy object will be passed thorugh and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data:

In [None]:
planets.group