# 3.3 Grouping and Aggregating Data

Grouping and aggregating data is one of the most powerful features of Pandas. Aggregation lets analysts quickly compute summary statistics over a dataset at whatever level of detail they choose. Aggregation in Pandas includes calculations such as `count`, `nunique` (distinct count), `sum`, `mean`, `median`, `mode`, `max`, `min` and `std` (standard deviation), among others.

Grouping and aggregating in Pandas is based on the same principles as grouping and aggregation in SQL. In SQL, you would SELECT an aggregate function and then GROUP BY all of the non-aggregate columns. In Pandas, we just have to worry about which columns to group by and which columns to aggregate.
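As a rough illustration of the SQL parallel, the sketch below pairs a SQL query with a possible Pandas equivalent. The small `toy` dataframe and the `passengers` table name are made up for this example; they are not the Titanic data loaded later in this notebook.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# SQL:    SELECT Pclass, AVG(Fare) FROM passengers GROUP BY Pclass;
# Pandas: choose the grouping column, then the column to aggregate.
avg_fare = toy.groupby("Pclass")["Fare"].mean()
print(avg_fare)
```

The grouping column ends up as the index of the result, much as it would appear as the first column of the SQL output.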

Groups and aggregations are almost always used together. They allow the data analyst to "drill down" into sub-categories of the data and view trends between and within groups.

### About the data

The data used in this notebook describes passengers on the *Titanic*, an ocean liner that set out from Southampton, U.K. to cross the Atlantic Ocean and tragically sank after colliding with an iceberg. The dataset records each passenger's class, name, sex, age, siblings, parents/children, ticket number, ticket fare, cabin number, and port of embarkation, along with their survival status. This dataset is extremely popular among data scientists and works well for demonstrating Pandas concepts.

In [None]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")
df.head()

### Groups
In order to perform aggregation, Pandas needs to know how to group the data. In SQL, we used the GROUP BY clause to explicitly tell the query how to group the aggregations. In this *Titanic* data set, for example, we could group by passenger class (Pclass), where the person embarked (Embarked), or whether or not the person survived (Survived). In Pandas, we can actually perform an aggregation on the entire dataframe at once without specifying a group, and Pandas will just assume that the whole dataset is one big group.

Most analyses, however, need the rows split into groups before they are aggregated. Below you will see how to aggregate a dataframe without grouping and how to create groups that can then be aggregated.

### Using aggregate functions on an entire dataframe
Previously, we applied aggregate functions to an entire dataframe or Series object by using methods such as `.max()`, `.mean()`, and `.sum()`. As stated above, every aggregation is computed over a group of rows. So, what are the groups when we use an aggregate function on an entire dataframe?

Simply enough, the aggregate function treats the entire dataframe as one big group. Thus, there is no need to create groups.

In [None]:
df.sum()

The code above, `df.sum()`, returned a Series in which each column's values have been added (or, for text columns, concatenated) together. Some of the information is useful (like knowing that 342 passengers in this dataset survived), but other information is not as useful (like the total age being 21,205... what does that even mean?). Other information is completely useless (such as the total "Sex" being "malefemalefemalefemalemale...").

In this way, we see that aggregating across an entire dataframe *can* be useful when you don't know much about your data and need quick information, but it's usually better to choose which columns to aggregate and which groups to create. This will allow you to obtain detailed information about specific groups. Grouping also allows you to control how aggregations are applied to multiple groups at once.
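One simple way to keep a whole-dataframe aggregation meaningful is to select the columns of interest first. The sketch below uses a small made-up `toy` dataframe (not the Titanic file) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Aggregate only the columns where a total or an average makes sense.
totals = toy[["Survived", "Fare"]].sum()
averages = toy[["Survived", "Fare"]].mean()
print(totals)
print(averages)
```

Here the mean of `Survived` can be read as a survival rate, because the column only holds 0s and 1s.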

### Creating a `groupby` object
Before applying an aggregate function to groups in a dataframe, Pandas first requires the creation of a `groupby` object. This can be done by using the `.groupby()` method and specifying a list of columns to group by. If there is only one column to be grouped, it can either be passed inside a list or by itself.
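The sketch below shows the different ways of passing columns to `.groupby()`, using a small made-up `toy` dataframe rather than the Titanic file. The `.ngroups` attribute reports how many distinct groups were found.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex":    ["female", "male", "female", "male", "male", "female"],
    "Fare":   [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

by_name = toy.groupby("Pclass")          # single column passed by itself
by_list = toy.groupby(["Pclass"])        # single column inside a list
by_two = toy.groupby(["Pclass", "Sex"])  # one group per (Pclass, Sex) pair

print(by_name.ngroups, by_list.ngroups, by_two.ngroups)
```

Grouping by two columns creates one group for every combination of values that actually occurs in the data.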

In [None]:
df.groupby(['Pclass'])

Notice that running the line above (the `.groupby()` method) didn't return a dataframe; it returned a `DataFrameGroupBy` object. This object has partitioned the rows of the dataframe into the distinct groups we defined, but it doesn't yet know how they should be aggregated.
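Although the grouped object does not display its contents, it can be inspected. The sketch below, again on a small made-up `toy` dataframe, looks at the group sizes, the row labels in each group, and the rows of one particular group.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

grouped = toy.groupby("Pclass")

print(grouped.size())    # number of rows in each group
print(grouped.groups)    # mapping of group label -> row index labels

first_class = grouped.get_group(1)  # the rows where Pclass == 1
print(first_class)
```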

Next, we will learn how to use aggregate functions on the `DataFrameGroupBy` object.

### Aggregating the groups
Aggregate functions can be run directly on the `DataFrameGroupBy` object. There are two ways to do so. The first is to call a dataframe-style method on the `DataFrameGroupBy` object (e.g., `.sum()`). The second is to use the `.agg()` method, passing in a dictionary of aggregations to perform.

Personally, I prefer the `.agg()` method because it lets me run multiple aggregations at a time while still allowing just a single aggregation.

#### Method 1: Dataframe methods

The first way to use an aggregate function with `.groupby()` is to call a built-in dataframe method that computes a single calculation for each group. These built-in aggregation methods include `.count()`, `.nunique()`, `.mean()`, `.median()`, and `.std()`.

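Note that the `.mode()` method can only be applied to a Series or dataframe, not to a `DataFrameGroupBy` object. Additionally, the `.max()` and `.min()` methods are not limited to numerical data: they work on any column whose values can be ordered, including strings (compared alphabetically). Columns that mix types, such as text alongside missing values, may raise errors or give unexpected results, so the examples below restrict the dataframe to the numeric `Fare` column first.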
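Although there is no `.mode()` method on the grouped object, a per-group mode can still be computed by passing a function to `.agg()`. The sketch below is one possible approach on a small made-up `toy` dataframe. Taking the first value with `.iloc[0]` is an arbitrary tie-breaking choice made for this example.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex":    ["female", "male", "female", "male", "male", "female"],
})

# Series.mode() returns every most-frequent value in sorted order,
# so .iloc[0] keeps the first one when there is a tie.
most_common_sex = toy.groupby("Pclass")["Sex"].agg(lambda s: s.mode().iloc[0])
print(most_common_sex)
```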

##### Count

In [None]:
df.groupby(['Pclass']).count()

##### Count distinct

In [None]:
df.groupby(['Pclass']).nunique()

##### Mean (Average)

In [None]:
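# numeric_only=True skips text columns such as Name and Sex, which cannot be averaged
df.groupby(['Pclass']).mean(numeric_only=True)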

##### Median

In [None]:
# numeric_only=True skips text columns, which have no median
df.groupby(['Pclass']).median(numeric_only=True)

##### Standard Deviation

In [None]:
# numeric_only=True skips text columns, which have no standard deviation
df.groupby(['Pclass']).std(numeric_only=True)

##### Max
The example below keeps only the grouping column and the numeric `Fare` column before taking the maximum, which avoids problems with columns that mix text and missing values.

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).max()

##### Min
As with the maximum, the example below keeps only the grouping column and the numeric `Fare` column before taking the minimum.

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).min()

#### Method 2: The `.agg()` method

`DataFrameGroupBy` objects can also have many aggregate functions applied to them at once. This can be done with the `.agg()` method (also available under the longer name `.aggregate()`), which exists on dataframes and Series as well as on `DataFrameGroupBy` objects.

The `.agg()` method accepts a dictionary where each key is the field to be aggregated and the value attached to that key is the aggregation to be applied.

The following keywords, among others, can be used as the aggregation for each key:

| Keyword   | Description        |
| --------- | ------------------ |
| `count`   | Count              |
| `sum`     | Sum                |
| `mean`    | Average (mean)     |
| `median`  | Median             |
| `nunique` | Count distinct     |
| `min`     | Minimum            |
| `max`     | Maximum            |
| `std`     | Standard deviation |
| `var`     | Variance           |

In [None]:
df.groupby(['Pclass']).agg({'Fare': 'median', 'Survived': 'mean'})

You can also switch out each value for a list of aggregations to compute for each key.

In [None]:
df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']})
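Passing lists of aggregations produces two-level column labels such as `('Fare', 'median')`. If flat, descriptive column names are preferred, one alternative is to pass keyword arguments to `.agg()`, where each keyword names an output column and supplies a `(column, aggregation)` pair. The sketch below uses a small made-up `toy` dataframe, and the output names `median_fare` and `survival_rate` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Each keyword becomes a column name in the result.
summary = toy.groupby("Pclass").agg(
    median_fare=("Fare", "median"),
    survival_rate=("Survived", "mean"),
)
print(summary)
```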