# 3.3 Grouping and Aggregating Data

The ability to group and aggregate data is one of the most powerful features of Pandas. Aggregation allows analysts to quickly compute summary statistics over their dataset at whatever level of detail they choose. Aggregation in Pandas includes calculations such as `count`, `nunique` (distinct count), `sum`, `mean`, `median`, `mode`, `max`, `min` and `std` (standard deviation), among others.

Grouping and aggregating in Pandas is based on the same principles as grouping and aggregation in SQL. In SQL, you would SELECT an aggregate function and then GROUP BY all of the non-aggregate columns. In Pandas, we just have to worry about which columns to group by and which columns to aggregate.
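To make the SQL comparison concrete, here is a minimal sketch using a small made-up dataframe (`toy` is a hypothetical stand-in, not the Titanic data). The SQL query in the comment and the Pandas line beneath it are intended to produce the same per-class averages.

```python
import pandas as pd

# Toy stand-in for a passenger table (values are made up for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# SQL:  SELECT Pclass, AVG(Fare) FROM toy GROUP BY Pclass;
# Pandas: choose the column to group by, then the column to aggregate
avg_fare = toy.groupby("Pclass")["Fare"].mean()
print(avg_fare)
```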

Groups and aggregations are almost always used together. They allow the data analyst to "drill down" into sub-categories of the data and view trends between and within groups.

### About the data

The data used in this notebook shows information about passengers on the *Titanic*, an ocean liner which set out from Southampton, U.K. to sail across the Atlantic Ocean and which tragically sank after colliding with an iceberg. The dataset contains information about each passenger's passenger class, name, sex, age, siblings, parents/children, ticket number, ticket fare, cabin number, and port of embarkation. It also contains information about each passenger's survival status. This dataset is extremely popular among data scientists and will facilitate demonstrations of Pandas concepts.

In [None]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")
df.head()

### Groups
In order to perform aggregation, Pandas needs to know how to group the data. In SQL, we used the GROUP BY clause to explicitly tell the query how to group the aggregations. In this *Titanic* data set, for example, we could group by passenger class (Pclass), where the person embarked (Embarked), or whether or not the person survived (Survived). In Pandas, we can actually perform an aggregation on the entire dataframe at once without specifying a group, and Pandas will just assume that the whole dataset is one big group.

However, it is also important to specify how the rows should be grouped when they are aggregated. Below you will see how to aggregate on a dataframe without grouping AND will also see how to create groups that can then be aggregated.

### Using aggregate functions on an entire dataframe
Previously, we applied aggregate functions to an entire dataframe or Series object by using methods such as `.max()`, `.mean()`, and `.sum()`. As stated before, every aggregation operates on groups of rows. So, what are the groups when we use an aggregate function on an entire dataframe?

Simply enough, the aggregate function treats the entire dataframe as one big group. Thus, there is no need to create groups.

In [None]:
df.sum()

The code above `df.sum()` returned a Series object with all of the rows added/concatenated together. Some of the information is useful (like knowing that 342 people survived the Titanic), but other information is not as useful (like the total age being 21,205... what does that even mean?). Other information is completely useless (such as the total "Sex" being "malefemalefemalefemalemale...").

In this way, we see that aggregating across an entire dataframe *can* be useful when you don't know much about your data and need quick information, but it's probably better to choose which columns to aggregate and which groups to create. This will allow you to obtain detailed information about specific groups. Grouping also allows you to control how aggregations are applied to multiple groups at once.
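As a rough illustration of choosing which columns to aggregate, the sketch below sums only the columns whose totals make sense. The dataframe `toy` and its values are made up; `numeric_only=True` is an existing option of `.sum()` that skips non-numeric columns.

```python
import pandas as pd

# Made-up sample with one text column and two numeric columns
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Sex": ["female", "male", "female", "male"],
    "Fare": [71.28, 7.25, 53.10, 8.05],
})

# Option 1: select the columns whose totals are meaningful
totals = toy[["Survived", "Fare"]].sum()

# Option 2: let Pandas skip every non-numeric column
numeric_totals = toy.sum(numeric_only=True)

print(totals)
print(numeric_totals)
```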

### Creating a group by object
Before applying an aggregate function to groups in a dataframe, Pandas first requires the creation of a `groupby` object. This can be done by using the `.groupby()` method and specifying a list of columns to group by. If there is only one column to be grouped, it can either be passed inside a list or by itself.

In [None]:
df.groupby(['Pclass'])

Notice that running the line above (the `.groupby()` method) didn't return a dataframe; it returned a `DataFrameGroupBy` object. This object has partitioned the rows of the dataframe into distinct groups (which we defined), but doesn't yet know how they should be aggregated.
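If you are curious about what the groupby object holds, a few of its attributes and methods let you peek inside before aggregating. This is a minimal sketch on a made-up dataframe (`toy` is hypothetical).

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 3, 1, 2, 3],
    "Fare": [80.0, 8.0, 60.0, 20.0, 7.0],
})

groups = toy.groupby("Pclass")

print(groups.ngroups)               # number of distinct groups
print(groups.size())                # number of rows in each group
first_class = groups.get_group(1)   # the rows that belong to Pclass 1
print(first_class)
```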

Next, we will learn how to use aggregate functions on the `DataFrameGroupBy` object.

### Aggregating the groups
Aggregate functions can be run directly on the `DataFrameGroupBy` object. There are two ways to do so. The first way is to call an aggregation method on the `DataFrameGroupBy` object (e.g. `.sum()`). The second way is to use the `.agg()` method, passing in a dictionary of aggregations to perform.

Personally, I prefer to use the `.agg()` method because it allows me to use multiple aggregations at a time, but also lets me perform just a single aggregation.

#### Method 1: Dataframe methods

The first way to use an aggregate function with `.groupby()` is by using a built-in dataframe method to compute a single calculation across the dataframe. These built-in aggregation methods include `.count()`, `.nunique()`, `.mean()`, `.median()`, and `.std()`.

Note that the `.mode()` method can only be applied to a Series or dataframe, not to a `DataFrameGroupBy` object. Additionally, the `.max()` and `.min()` methods are most meaningful on numerical data: on text columns they return the alphabetically last or first value, which is rarely what you want.
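If you do need the most common value per group, one possible workaround is to call the Series `.mode()` method inside `.agg()`. The sketch below uses a made-up dataframe (`toy`) and assumes every group has at least one non-null value; ties are resolved by simply taking the first mode.

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Embarked": ["S", "C", "C", "S", "S", "Q"],
})

# Series.mode() returns a Series (there can be ties), so keep the first entry
most_common_port = toy.groupby("Pclass")["Embarked"].agg(lambda s: s.mode().iloc[0])
print(most_common_port)
```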

##### Count

In [None]:
df.groupby(['Pclass']).count()

##### Count distinct

In [None]:
df.groupby(['Pclass']).nunique()

##### Mean (Average)

In [None]:
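df.groupby(['Pclass']).mean(numeric_only=True) # numeric_only=True skips text columns, which recent Pandas versions refuse to average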

##### Median

In [None]:
df.groupby(['Pclass']).median(numeric_only=True) # numeric_only=True skips text columns, which have no median

##### Standard Deviation

In [None]:
df.groupby(['Pclass']).std(numeric_only=True) # numeric_only=True skips text columns, which have no standard deviation

##### Max
Here `.max()` is applied to a numerical column only. On a text column it would return the alphabetically last value.

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).max()

##### Min
Here `.min()` is applied to a numerical column only. On a text column it would return the alphabetically first value.

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).min()

#### Method 2: The `.agg()` method

`DataFrameGroupBy` objects can also have many aggregate functions applied to them at once. This can be done by applying the `.agg()` method. (Dataframes and Series have an `.agg()` method too, but here we use it on the `DataFrameGroupBy` object.)

The `.agg()` method accepts a dictionary where each key is the field to be aggregated and the value attached to that key is the aggregation to be applied.

You can apply the following aggregate functions to the groupby:

| Keyword   | Description                                  |
| --------- | -------------------------------------------- |
| `count`   | Count                                        |
| `sum`     | Sum                                          |
| `mean`    | Average (mean)                               |
| `median`  | Median                                       |
| `nunique` | Count Distinct                               |
| `min`     | Minimum (most meaningful on numerical data)  |
| `max`     | Maximum (most meaningful on numerical data)  |
| `std`     | Standard Deviation                           |
| `var`     | Variance                                     |

In [None]:
df.groupby(['Pclass']).agg({'Fare': 'median', 'Survived': 'mean'})

You can also switch out each value for a list of aggregations to compute for each key.

In [None]:
df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']})

### The `MultiIndex`
Observe the levels of column names generated by the above code. In order to organize the aggregated table (because there are two columns called "mean"), Pandas automatically created a `MultiIndex` to "better" organize things. You can see how the columns are organized by looking at the `.columns` property of the dataframe.

Pandas creates a MultiIndex to clarify which aggregations were performed on each column.

Below, we create a new dataframe `grouped_df` that contains data aggregated by Pclass.

In [None]:
grouped_df = df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']})

Now, let's look at the columns of this aggregated dataframe. Notice that the `MultiIndex` is composed of a list of tuples.

In [None]:
grouped_df.columns # The columns are not a list of strings anymore-- they are a list of tuples

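The `MultiIndex` *can* be useful, but sometimes it's just a hassle to deal with. Pandas only creates it on the columns when a list of aggregations is passed for a column. As long as only one aggregation is performed per column in the `.agg()` method, the result keeps the original, single-level column names instead of the aggregation terms. In that case we can also pass the parameter `as_index=False` to `.groupby()`, which keeps the grouping column (Pclass) as an ordinary column instead of turning it into the row index. The result is a plain, flat dataframe.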

I personally prefer to use `as_index=False` whenever possible. It makes things easier for me to understand and makes it easier to work with the results of the aggregation. The downside is that I can't tell what aggregations were performed without looking at the code.

In [None]:
df.groupby(['Pclass'], as_index=False).agg({'Fare': 'median', 'Survived': 'mean'}) # Each column is only aggregated once

When more than one aggregation is performed on a single column, however, the `MultiIndex` is necessary to tell apart columns that share an aggregation name (such as the two "mean" columns above). You can still pass in `as_index=False` but **shouldn't**, because the grouping column is then folded into the column `MultiIndex`, which disrupts the formatting of the dataframe.

In [None]:
# as_index was not passed in. Pclass is the named row index, meaning data is easily extracted by Pclass
df.groupby(['Pclass']).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']}) # More than one aggregation applied to one or more columns

In [None]:
# as_index=False was passed in and the row index was reset. Now, a MultiIndex must be used to extract data
df.groupby(['Pclass'], as_index=False).agg({'Fare': ['median', 'mean'], 'Survived': ['mean', 'nunique']}) # More than one aggregation applied to one or more columns
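If the goal is a flat table with several aggregations per column, named aggregation may be a convenient alternative to the dictionary form. The sketch below uses a made-up dataframe (`toy`), and the output column names (`fare_median`, `fare_mean`, `survival_rate`) are hypothetical names chosen for illustration. Each keyword argument maps a new column name to a `(column, aggregation)` pair.

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# new_column_name=(source_column, aggregation)
flat = toy.groupby("Pclass").agg(
    fare_median=("Fare", "median"),
    fare_mean=("Fare", "mean"),
    survival_rate=("Survived", "mean"),
)
print(flat.columns)  # a single level of column names, no MultiIndex
```

The trade-off is that you have to name every output column yourself, but the names then record which aggregation was performed.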

#### Accessing values in an aggregated table with a `MultiIndex`

It may be intimidating to know how to work with data in a dataframe with a `MultiIndex`. However, it's fairly simple. You can access the highest level column by passing in the column name as normal.

In [None]:
# See the whole dataframe
grouped_df

In [None]:
# See the column 'Fare', which is itself a dataframe composed of two columns
grouped_df['Fare']

You can then simply add another column inside the brackets to dive deeper into the levels. This is a single column and thus, a Series object.

In [None]:
# See the column 'median' from inside the column 'Fare'
grouped_df['Fare', 'median']
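The same idea extends to single cells and to slices across the inner level. The following is a sketch on a made-up dataframe (`toy` and `grouped` are hypothetical names): `.loc` takes a row label plus a column tuple, and `.xs()` pulls out every column that shares an inner-level name.

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    "Survived": [1, 1, 0, 0, 1, 0],
})

grouped = toy.groupby("Pclass").agg({"Fare": ["median", "mean"],
                                     "Survived": ["mean", "nunique"]})

# One cell: the row labelled 1 and the column ('Fare', 'mean')
fare_mean_first = grouped.loc[1, ("Fare", "mean")]

# Every 'mean' column, selected from the inner (second) column level
means = grouped.xs("mean", axis=1, level=1)

print(fare_mean_first)
print(means)
```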

If you are still confused by the `MultiIndex`, don't worry! This is a difficult concept to grasp and is something that you will mostly see when aggregating data. On this page, I just want you to get your first exposure to this concept so that you will be able to recognize it later on. You don't need to be an expert on using the `MultiIndex` just yet, but should at least be able to say *why* it is used in aggregations.

### The `.value_counts()` method

We previously looked at the `.value_counts()` method for Series and dataframes. This method counts the number of occurrences of each value and can also be used after a `.groupby()`. This can be especially useful for filtering data and normalizing the results. To do so, we select a single column from the `DataFrameGroupBy` object and call `.value_counts()` on it.

In the code below, we group by "Pclass" and then look at just the "Embarked" column, counting how many times each value of "Embarked" occurs within each "Pclass". Thus, the results below show, for each Pclass, how many people embarked in Southampton, Cherbourg, and Queenstown. For example, we can see below that the majority of people who embarked in Queenstown ("Q") were third class passengers (Pclass was 3). In other words, 2 first class passengers (Pclass=1), 3 second class passengers (Pclass=2), and 72 third class passengers (Pclass=3) embarked in Queenstown.

In [None]:
df.groupby("Pclass")["Embarked"].value_counts()

#### Looking at MultiIndexed rows

Note that above, a Series was returned. Notice also the reappearance of the `MultiIndex`, this time on the rows. This allows us to look more closely at the number of people who embarked at each location between passenger classes. The `MultiIndex` is necessary for separating out each level of aggregation, meaning that it lets us count up each embarked location for each passenger class.

We can access the outer row index (in this case, "Pclass") by using the `.loc` property. This is exactly how we would access any row in a normal dataframe. Using `.loc` could be useful for drilling down into the number of people embarked in a specific passenger class.

In the code below, we group by "Pclass" and then count up the number of times each "Embarked" value took place across each passenger class. We then use `.loc` to get just the row where Pclass (the outermost named row index) is 1.

In [None]:
# .loc[1] selects by index label: every row whose outer index (Pclass) equals 1.
# .iloc[1] would instead select by position and return only the second value in the Series.
df.groupby("Pclass")["Embarked"].value_counts().loc[1]
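When the row `MultiIndex` gets in the way, one option is to pivot the inner level into columns with `.unstack()`. This is only a sketch on a made-up dataframe (`toy`); `fill_value=0` replaces combinations that never occur with a zero count.

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Embarked": ["S", "C", "C", "S", "S", "Q", "S"],
})

counts = toy.groupby("Pclass")["Embarked"].value_counts()

# Move the inner index level (Embarked) into the columns
table = counts.unstack(fill_value=0)
print(table)
```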

Again, you do not need to be an expert at using the `MultiIndex` to succeed in this class. For now, it is important that you recognize it and know why it would be useful.

#### Normalization

Pass in `normalize=True` to the `.value_counts()` method to normalize the values (i.e. express them as a proportion of the total, between 0 and 1). The proportions are calculated within each group, so the values for each Pclass add up to 1.

In [None]:
df.groupby("Pclass")["Embarked"].value_counts(normalize=True)
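A quick way to convince yourself that the normalization happens within each group is to add the proportions back up per Pclass. The sketch below does this on a made-up dataframe (`toy`).

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

shares = toy.groupby("Pclass")["Embarked"].value_counts(normalize=True)

# Sum the proportions over the outer index level; each group should total 1
per_class_total = shares.groupby(level="Pclass").sum()
print(per_class_total)
```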

If you wanted to normalize the value counts against the whole dataset rather than within each Pclass, you could calculate it manually. *I haven't yet found a way to specify the level of detail in the `.value_counts()` method, but you could use something like the following to achieve normalization across two dimensions.*

In [None]:
number_of_observations = df['Pclass'].count() # Total number of non-null values in the 'Pclass' column
df.groupby("Pclass")["Embarked"].value_counts() / number_of_observations # Divide each count by the overall total
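Another possibility, offered as a sketch rather than the definitive approach, is to call `.value_counts(normalize=True)` on a two-column dataframe instead of on a groupby. The proportions are then taken over all counted rows. The dataframe `toy` is made up, and note that rows with a missing value in either column are left out of the count by default, so the result can differ slightly from the manual division above.

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# Proportion of each (Pclass, Embarked) combination among all counted rows
overall = toy[["Pclass", "Embarked"]].value_counts(normalize=True)
print(overall)
```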