# Grouping and Aggregating Data

Grouping and aggregating data is one of the most powerful features of Pandas. Aggregation lets analysts quickly compute summary statistics over a data set at whatever level of detail they choose. Similar functionality exists in SQL. The available calculations include `count`, `nunique` (distinct count), `sum`, `mean`, `median`, `mode`, `max`, `min` and `std` (standard deviation), among others.

In [1]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")

In [3]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Creating a groupby object
Before applying an aggregate function to a dataframe, Pandas first requires the creation of a `groupby` object. This is done by calling the `.groupby()` method with a list of columns to group by. If there is only one column to group by, it can be passed either inside a list or by itself.
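As a minimal sketch of the two calling styles, the example below builds a small made-up dataframe (`toy` is hypothetical and only borrows a few Titanic column names) and checks how many groups each call produces.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Fare": [71.28, 53.10, 13.00, 7.25, 8.05, 7.93],
})

# A single grouping column can be passed by itself or inside a list
g_bare = toy.groupby("Pclass")
g_list = toy.groupby(["Pclass"])

# Several grouping columns are passed together as one list
g_two = toy.groupby(["Pclass", "Sex"])

# .ngroups reports how many distinct groups were formed
print(g_bare.ngroups, g_list.ngroups, g_two.ngroups)  # 3 3 5
```

Grouping by two columns creates one group per combination of values that actually occurs in the data, which is why the last count is 5 rather than 6.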

In [7]:
df.groupby(['Pclass'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000213AFB468F0>

Notice that running the line above (the `.groupby()` method) didn't return a dataframe; it returned a `DataFrameGroupBy` object. This object has partitioned the rows into distinct groups based on their `Pclass` values, but it doesn't yet know how those groups should be aggregated.
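To make the partitioning more concrete, here is a small sketch on made-up data (the `toy` dataframe is an assumption for illustration) that looks inside a groupby object without aggregating anything.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [71.28, 53.10, 13.00, 7.25, 8.05, 7.93],
})

grouped = toy.groupby("Pclass")

# Iterating yields (group key, sub-dataframe) pairs; nothing is aggregated yet
group_sizes = {key: len(rows) for key, rows in grouped}
print(group_sizes)

# .get_group() returns the rows of one group as an ordinary dataframe
third_class = grouped.get_group(3)
print(third_class)
```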

### Aggregating the groups
Aggregate functions can be run directly on the `DataFrameGroupBy` object.

The first way to use an aggregate function is to call a built-in method that computes a single calculation for every group.

Note that a `DataFrameGroupBy` object has no `.mode()` method; `.mode()` is available only on a Series or dataframe. Additionally, the `.max()` and `.min()` methods are most dependable on numerical data: columns that mix strings with missing values, as `Cabin` does here, can raise errors or give unhelpful results, so this section applies these methods to numeric columns only.
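One possible workaround for the missing `.mode()` method, sketched here on made-up data, is to pass a function that calls `Series.mode()` on each group through `.agg()`. The `toy` dataframe and the choice to keep only the first mode when values tie are assumptions of this example.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Embarked": ["C", "S", "S", "S", "S", "Q", "S", "Q"],
})

# Series.mode() returns a Series because ties are possible;
# .iloc[0] keeps only the first (smallest) of the tied values
most_common = toy.groupby("Pclass")["Embarked"].agg(lambda s: s.mode().iloc[0])
print(most_common)
```

A group whose values are all missing would produce an empty mode, so `.iloc[0]` would fail there; real data may need a guard for that case.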

##### Count

In [8]:
df.groupby(['Pclass']).count()

Pclass,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,216,216,216,216,186,216,216,216,216,176,214
2,184,184,184,184,173,184,184,184,184,16,184
3,491,491,491,491,355,491,491,491,491,12,491
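The counts differ from column to column (for example, 216 first-class passengers but only 186 first-class ages) because `.count()` skips missing values. The sketch below, on made-up data, contrasts it with `.size()`, which counts every row in each group.

```python
import pandas as pd

# Hypothetical toy data with some missing ages
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Age": [38.0, float("nan"), 29.0, float("nan"), 22.0],
})

grouped = toy.groupby("Pclass")

# .count() counts only the non-missing values in each column
non_null_ages = grouped["Age"].count()

# .size() counts rows per group, whether or not values are missing
rows_per_class = grouped.size()

print(non_null_ages)
print(rows_per_class)
```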


##### Count distinct

In [10]:
df.groupby(['Pclass']).nunique()

Pclass,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,216,2,216,2,57,4,4,147,94,133,3
2,184,2,184,2,57,4,4,140,42,7,3
3,491,2,491,2,68,7,7,394,119,7,3


##### Mean (Average)

In [9]:
df.groupby(['Pclass']).mean(numeric_only=True)  # skip text columns such as Name (pandas 2.0+ raises a TypeError otherwise)

Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,461.597222,0.62963,38.233441,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.87763,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.14062,0.615071,0.393075,13.67555
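Because `Survived` is coded as 0 or 1, its group mean can be read as a survival rate. A minimal sketch on made-up data (the `toy` dataframe is hypothetical) shows this, along with `numeric_only=True` leaving the text column out of the result.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
    "Name": ["A", "B", "C", "D", "E", "F", "G", "H"],
})

# numeric_only=True drops the text column Name before averaging
means = toy.groupby("Pclass").mean(numeric_only=True)

# The mean of a 0/1 column is the share of ones in each group
survival_rate = means["Survived"]
print(survival_rate)
```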


##### Median

In [11]:
df.groupby(['Pclass']).median(numeric_only=True)  # skip text columns such as Name (pandas 2.0+ raises a TypeError otherwise)

Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,472.0,1.0,37.0,0.0,0.0,60.2875
2,435.5,0.0,29.0,0.0,0.0,14.25
3,432.0,0.0,24.0,0.0,0.0,8.05
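The median is less sensitive to extreme values than the mean, which is one plausible reason the first-class median fare sits well below the mean fare shown earlier. The sketch below uses made-up fares, including one deliberately large value, to illustrate the effect.

```python
import pandas as pd

# Hypothetical toy fares; the 500.0 value is an intentional outlier
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [50.0, 60.0, 500.0, 7.0, 8.0, 9.0],
})

fares = toy.groupby("Pclass")["Fare"]
mean_fare = fares.mean()
median_fare = fares.median()

# The outlier pulls the first-class mean up but leaves the median unchanged
print(mean_fare)
print(median_fare)
```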


##### Standard Deviation

In [13]:
df.groupby(['Pclass']).std(numeric_only=True)  # skip text columns such as Name (pandas 2.0+ raises a TypeError otherwise)

Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,246.737616,0.484026,14.802856,0.611898,0.693997,78.380373
2,250.852161,0.500623,14.001077,0.601633,0.690963,13.417399
3,264.441453,0.428949,12.495398,1.374883,0.888861,11.778142
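By default, `.std()` on a groupby object computes the sample standard deviation (`ddof=1`). If the population standard deviation is wanted instead, `ddof=0` can be passed, as in this sketch on made-up data.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Fare": [10.0, 30.0, 5.0, 9.0],
})

fares = toy.groupby("Pclass")["Fare"]

# Default: divide by (n - 1), the sample standard deviation
sample_std = fares.std()

# ddof=0: divide by n, the population standard deviation
population_std = fares.std(ddof=0)

print(sample_std)
print(population_std)
```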


##### Max

In [22]:
df[['Pclass', 'Fare']].groupby(['Pclass']).max()

Pclass,Fare
1,512.3292
2,73.5
3,69.55
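The cell above narrows the dataframe to two columns before grouping. An alternative pattern, sketched here on made-up data, groups first and then selects the column to aggregate; both routes should give the same numbers.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Name": ["A", "B", "C", "D", "E"],
    "Fare": [71.28, 53.10, 13.00, 7.25, 8.05],
})

# Select columns first, then group (returns a one-column dataframe)
select_then_group = toy[["Pclass", "Fare"]].groupby(["Pclass"]).max()

# Group first, then select the column (returns a Series)
group_then_select = toy.groupby("Pclass")["Fare"].max()

print(select_then_group)
print(group_then_select)
```

Selecting after grouping avoids repeating the grouping column in the selection list; which form reads better is largely a matter of taste.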


##### Min

In [None]:
df[['Pclass', 'Fare']].groupby(['Pclass']).min()