# 3.3 Grouping and Aggregating Data

The ability to group and aggregate data is one of the most powerful features of Pandas. Aggregation lets analysts quickly compute summary statistics over a data set at whatever level of detail they choose. Similar functionality exists in SQL. Common calculations include `count`, `nunique` (distinct count), `sum`, `mean`, `median`, `mode`, `max`, `min` and `std` (standard deviation), among others.

In [None]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")

In [None]:
df.head()

### Creating a group by object
Before applying an aggregate function to a dataframe, Pandas first requires the creation of a `groupby` object. This is done by calling the `.groupby()` method with the column or list of columns to group by. A single column can be passed either inside a list or by itself.

In [None]:
df.groupby(['Pclass'])

Notice that running the line above (the `.groupby()` method) didn't return a dataframe; it returned a `DataFrameGroupBy` object. This object has partitioned the rows into distinct groups based on their values in the grouping column, but it doesn't yet know how those groups should be aggregated.
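
To make the idea of a partitioned-but-not-yet-aggregated object more concrete, here is a minimal sketch that inspects a `groupby` object. The `toy` dataframe is hypothetical (made-up values standing in for the Titanic columns), so the snippet runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.5, 9.0],
})

grouped = toy.groupby("Pclass")

# Number of distinct groups found in the grouping column
n_groups = grouped.ngroups

# Number of rows that fell into each group
group_sizes = grouped.size()

# Pull out the rows of a single group as an ordinary dataframe
third_class = grouped.get_group(3)

print(n_groups)
print(group_sizes)
print(third_class)
```

Nothing has been summarized at this point; `.size()` and `.get_group()` only report how the rows were split.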

### Aggregating the groups
Aggregate functions can be run directly on the `DataFrameGroupBy` object.

#### Method 1

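The first way to use an aggregate function is to call a built-in method that computes a single calculation for every group.

Note that the `.mode()` method can only be applied to a Series or dataframe, not to a `DataFrameGroupBy` object. Additionally, in recent versions of Pandas (2.0 and later), `.mean()`, `.median()` and `.std()` raise an error when the groups contain non-numeric columns, so either select the numeric columns first or pass `numeric_only=True`. The `.max()` and `.min()` methods work on any column whose values can be ordered (numbers and strings, for example), but they can fail in some Pandas versions on columns that mix types, such as strings and missing values.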

##### Count

In [None]:
df.groupby(['Pclass']).count()

##### Count distinct

In [None]:
df.groupby(['Pclass']).nunique()

##### Mean (Average)

In [None]:
df.groupby(['Pclass']).mean(numeric_only=True)  # skip text columns such as Name and Sex

##### Median

In [None]:
df.groupby(['Pclass']).median(numeric_only=True)  # skip text columns such as Name and Sex

##### Standard Deviation

In [None]:
df.groupby(['Pclass']).std(numeric_only=True)  # skip text columns such as Name and Sex

##### Max
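Here the `Pclass` and `Fare` columns are selected first so that the result contains only the maximum fare. The `.max()` method also works on other orderable data such as strings, but it can fail on columns that mix types (for example, strings and missing values).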

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).max()

##### Min
The same column selection is used here so that the result contains only the minimum fare. The `.min()` method has the same requirement as `.max()`: the values in each column must be comparable with one another.

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).min()
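
As a hedged illustration of how these single-calculation methods treat non-numeric columns, the sketch below uses a hypothetical `toy` dataframe (made-up values, not the Titanic file). It shows `.max()` on a string column and `numeric_only=True` on `.mean()`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3],
    "Name": ["Brown", "Astor", "Hart", "Dean"],
    "Fare": [80.0, 60.0, 20.0, 8.0],
})

# max() on a string column returns the value that sorts last alphabetically
name_max = toy.groupby("Pclass")["Name"].max()

# numeric_only=True restricts the aggregation to the numeric columns
numeric_mean = toy.groupby("Pclass").mean(numeric_only=True)

print(name_max)
print(numeric_mean)
```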

#### Method 2

`DataFrameGroupBy` objects can also have many aggregate functions applied to them at once. This can be done with the `.agg()` method, which is also available on dataframes and Series.

The `.agg()` method accepts a dictionary where each key is the field to be aggregated and the value attached to that key is the aggregation to be applied.

You can apply the following aggregate functions to the groupby:

| Keyword   | Description                                  |
| --------- | -------------------------------------------- |
| `count`   | Count of non-null values                     |
| `sum`     | Sum                                          |
| `mean`    | Average (mean)                               |
| `median`  | Median                                       |
| `nunique` | Count Distinct                               |
| `min`     | Minimum (values must be mutually comparable) |
| `max`     | Maximum (values must be mutually comparable) |
| `std`     | Standard Deviation                           |
| `var`     | Variance                                     |

In [None]:
df.groupby(['Pclass']).agg({'Fare': 'median', 'Survived': 'mean'})

You can also switch out each value for a list of aggregations to compute for each key.

In [None]:
df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']})

### The `MultiIndex`
Observe the two levels of column names generated by the above code. Because two of the resulting columns would otherwise both be called "mean", Pandas labels the columns with a `MultiIndex`: the top level holds the original column name and the second level holds the aggregation. You can see how the columns are organized by looking at the `.columns` property of the dataframe.

In [None]:
grouped_df = df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']})
grouped_df.columns # The column labels are no longer plain strings; each label is a tuple

The `MultiIndex` *can* be useful, but sometimes it's just a hassle to deal with. As long as only one aggregation is performed per column in the `.agg()` method (each dictionary value is a single string rather than a list), the column labels stay flat and keep their original names. In that case we can also pass the parameter `as_index=False` to `.groupby()`, which returns the grouping column as a regular column instead of using it as the row index.

In [None]:
df.groupby(['Pclass'], as_index=False).agg({'Fare': 'median', 'Survived': 'mean'}) # Each column is only aggregated once
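
The effect of `as_index=False` is easiest to see side by side. The following is a small sketch on a hypothetical `toy` dataframe (made-up values); the names `spec`, `indexed` and `flat` are only illustrative.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0],
    "Survived": [1, 0, 1, 0],
})

# One aggregation per column, so the column labels stay flat
spec = {"Fare": "median", "Survived": "mean"}

# Default behaviour: the grouping column becomes the row index
indexed = toy.groupby("Pclass").agg(spec)

# as_index=False: the grouping column stays a regular column
flat = toy.groupby("Pclass", as_index=False).agg(spec)

print(indexed)
print(flat)
```

Calling `.reset_index()` on the default result is another common way to move the grouping column back into the columns.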

When more than one aggregation is performed on a single column, however, the column `MultiIndex` is needed to tell apart results that share an aggregation name, such as the mean of `Fare` and the mean of `Survived`. You can still pass in `as_index=False`, but it is usually best avoided here because the grouping column then ends up inside the two-level column header, which makes the dataframe more awkward to work with.

In [None]:
df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']}) # More than one aggregation applied to one or more columns
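
If the two-level header is unwanted even with several aggregations per column, there are a couple of common workarounds. The sketch below shows two of them on a hypothetical `toy` dataframe (made-up values): named aggregation, available in Pandas 0.25 and later, and joining the header levels after the fact. The output column names are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0],
    "Survived": [1, 0, 1, 0],
})

# Option 1: named aggregation, where each keyword names one output column
summary = toy.groupby("Pclass").agg(
    fare_median=("Fare", "median"),
    fare_mean=("Fare", "mean"),
    survival_rate=("Survived", "mean"),
)

# Option 2: aggregate with lists, then join the two header levels into one string
multi = toy.groupby("Pclass").agg({"Fare": ["median", "mean"], "Survived": ["mean"]})
multi.columns = ["_".join(pair) for pair in multi.columns]

print(summary)
print(multi)
```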

#### Accessing values in an aggregated table with a `MultiIndex`

It may be intimidating to know how to work with data in a dataframe with a `MultiIndex`. However, it's fairly simple. You can access the highest level column by passing in the column name as normal.

In [None]:
grouped_df

In [None]:
grouped_df['Fare']

You can then add a second-level name inside the brackets to dive deeper into the levels. The result is a single column and is therefore a Series object.

In [None]:
grouped_df['Fare', 'median']
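
Beyond plain bracket access, a few other selection patterns can be handy with two-level columns. This is a minimal sketch on a hypothetical `toy` dataframe (made-up values); `grouped_toy` mirrors the structure of `grouped_df` but is not the same object.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0],
    "Survived": [1, 0, 1, 0],
})

grouped_toy = toy.groupby("Pclass").agg(
    {"Fare": ["median", "mean"], "Survived": ["mean", "nunique"]}
)

# A full (top level, second level) tuple selects one column as a Series
fare_median = grouped_toy[("Fare", "median")]

# The same selection written with .loc
same_column = grouped_toy.loc[:, ("Fare", "median")]

# .xs() takes a cross-section: every column whose second level is "mean"
all_means = grouped_toy.xs("mean", axis=1, level=1)

print(fare_median)
print(all_means)
```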

### The `.value_counts()` method

We previously looked at the `.value_counts()` method for Series and dataframes. This method counts how often each unique value occurs, and it can also be used with grouped data. This is especially useful for comparing how values are distributed across groups and for normalizing the results. Here we apply `.value_counts()` to a single column selected from the `DataFrameGroupBy` object.

In [None]:
df.groupby("Pclass")["Embarked"].value_counts()

#### Looking at MultiIndexed dataframes

Note that a Series was returned. Notice also the reappearance of the `MultiIndex`, this time on the rows. This allows us to look more closely at the number of people who embarked at each location between passenger classes.

We can access the outer row index by using the `.loc` property. This could be useful for drilling down into the number of people embarked in a specific passenger class.

In [None]:
# .loc[1] selects by index label (all rows with Pclass == 1); .iloc[1] would select by position and return only the second value in the Series
df.groupby("Pclass")["Embarked"].value_counts().loc[1]
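
The difference between label-based and position-based access on a row `MultiIndex` can be sketched as follows. The `toy` dataframe is hypothetical (made-up values), and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 3, 3],
    "Embarked": ["S", "S", "C", "S", "Q", "S"],
})

counts = toy.groupby("Pclass")["Embarked"].value_counts()

# Label on the outer level: a Series of counts for first class, indexed by Embarked
first_class = counts.loc[1]

# Labels for both levels: a single count
first_class_s = counts.loc[(1, "S")]

# Position: the second entry of the whole Series, whatever its labels are
second_entry = counts.iloc[1]

print(first_class)
print(first_class_s)
print(second_entry)
```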

#### Normalization

Pass in `normalize=True` to the `.value_counts()` method to normalize the values (i.e. express them as a proportion of a total, between 0 and 1). Each total is calculated within a group, so the proportions for each `Pclass` sum to 1.

In [None]:
df.groupby("Pclass")["Embarked"].value_counts(normalize=True)
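
One way to check which total the normalization uses is to add the proportions back up per group. This is a small sketch on a hypothetical `toy` dataframe (made-up values).

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 3, 3],
    "Embarked": ["S", "S", "C", "S", "Q", "S"],
})

shares = toy.groupby("Pclass")["Embarked"].value_counts(normalize=True)

# Sum the proportions over the outer index level (Pclass)
totals = shares.groupby(level=0).sum()

print(shares)
print(totals)
```

Each entry of `totals` should come out as 1 (up to floating-point rounding), which confirms that the proportions are relative to the group and not to the whole data set.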

If you wanted to normalize the value counts against the whole data set rather than within each `Pclass`, you could calculate it manually. *I haven't yet found a way to specify the level of detail in the `.value_counts()` method, but you could use something like the following to achieve normalization across two dimensions.*

In [None]:
number_of_observations = df['Pclass'].count() # Total number of non-null values in the 'Pclass' column (.count() already returns a single number)
df.groupby("Pclass")["Embarked"].value_counts() / number_of_observations # Divide each count by number_of_observations; rows with a missing 'Embarked' value are not counted in the numerator
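
One possible alternative, assuming Pandas 1.1 or later where dataframes have their own `.value_counts()` method, is to count the combinations of both columns directly and normalize against the overall total. The `toy` dataframe below is hypothetical (made-up values).

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 3, 3],
    "Embarked": ["S", "S", "C", "S", "Q", "S"],
})

# Proportion of all rows that fall into each (Pclass, Embarked) combination
overall = toy[["Pclass", "Embarked"]].value_counts(normalize=True)

# Proportion of all rows that are first class and embarked at "S"
first_class_s_share = overall.loc[(1, "S")]

print(overall)
print(first_class_s_share)
```

Unlike the manual division, the denominator here only includes the rows that were actually counted, so the proportions sum to 1.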