# Grouping Aggregation Basics

In previous chapters, calling DataFrame methods such as `sum` applied the operation to every value in each column as a **whole**. In this chapter, we apply operations to distinct **groups** within our data rather than to the whole.

## Group into independent DataFrames, then aggregate

Take a look at the image below. We group our original data into independent DataFrames based on the unique values of one or more columns. In the image, the groups are based on the values of the Dept column. Once the data is split into these independent DataFrames, an aggregation is performed on each one: the Salary column is aggregated with the sum function and the Experience column with the mean function.

![1]
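
The split-then-aggregate idea pictured above can be sketched by hand before introducing any dedicated tooling. The snippet below is a minimal illustration on made-up data: the department names and numbers are hypothetical and are not taken from the image. It filters the data into one independent DataFrame per department, aggregates each piece, and combines the results.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the image
df = pd.DataFrame({
    'Dept': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
    'Salary': [50, 80, 60, 90, 55],
    'Experience': [2, 5, 4, 7, 3],
})

rows = []
for dept in sorted(df['Dept'].unique()):
    # Split: an independent DataFrame holding only this department's rows
    piece = df[df['Dept'] == dept]
    # Aggregate: reduce each column of the piece to a single value
    rows.append({'Dept': dept,
                 'Salary': piece['Salary'].sum(),
                 'Experience': piece['Experience'].mean()})

# Combine: one row per group
manual = pd.DataFrame(rows).set_index('Dept')
print(manual)
```

Writing the loop by hand works, but it is verbose and slow for many groups, which is the motivation for a dedicated grouping method.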

### Examples of questions we can answer

Grouping data is an extremely common technique used in data analysis and can help us answer a variety of questions. Some examples follow:

* What is the maximum salary for every department at a company?
* What is the average temperature and precipitation for every month for different cities?
* What are the top five best-selling shirts at each store?

[1]: images/raw_group_agg.png

## Grouping with the `groupby` method

The `groupby` method handles most of the tasks in pandas involving grouping data. This one method is responsible for grouping the data into independent DataFrames and performing the aggregation. It usually does so in a single line of code.

### Aggregation

The most common type of action to perform on each group is an aggregation, though it's possible to manipulate the data in each group any way you want. This chapter only covers how to perform aggregations on each group.

### Grouping Column, Aggregating Column, Aggregating Function

Every `groupby` aggregation has three separate components - the grouping column, the aggregating column, and the aggregating function.

* **Grouping column** - Every distinct value in this column forms its own group
* **Aggregating column** - The column the function is applied to, producing a single value for each group. This column is usually numeric.
* **Aggregating function** - The function that is applied to the aggregating column.

## Syntax for using the `groupby` method

The `groupby` method is not as straightforward to use as most other methods and will take more effort to learn. Making it even more difficult, several different forms of syntax are valid and do the same thing. Only one version of the syntax will be covered at first, with the others deferred to a separate chapter.

### Must use method chaining with `groupby`

Nearly all of the calls to `groupby` must have another method chained to it in order to return a result. The `groupby` method is a **two-step process**. First, we inform pandas how we would like to group, and then we chain the `agg` method to inform pandas how to aggregate. The general syntax takes the following form.

```python
df.groupby('grouping column').agg(new_column=('aggregating column', 'aggregating function'))
```

The grouping column, aggregating column, and aggregating function may be provided as strings. The new column name, `new_column`, is provided as a parameter name in the `agg` method and is set to the two-item tuple of the aggregating column and aggregating function.

### First groupby aggregation

Let's begin our usage of the `groupby` method by finding the average salary of the employees in each department from the City of Houston dataset.

In [1]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


Using the syntax from above, we produce the following solution:

In [2]:
emp.groupby('dept').agg(avg_salary=('salary', 'mean')).round(-3)

dept,avg_salary
Fire,61000.0
Health & Human Services,55000.0
Houston Airport System,55000.0
Houston Public Works,51000.0
Library,42000.0
Other,61000.0
Parks & Recreation,37000.0
Police,67000.0
Solid Waste Management,44000.0


### Identify each piece

When completing a `groupby` aggregation, it is important to identify each of the pieces. This will help you insert them in the right place using the syntax above. For the above example, we have:

* **Grouping column** - `dept`
* **Aggregating column** - `salary`
* **Aggregating function** - `mean`

### New column name

You can use any valid Python variable name for the new column. Since Python does not allow spaces in variable names (among other limitations), you won't have full flexibility to name the column however you want with this syntax.
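
If a column name that is not a valid Python identifier is needed, two workarounds are commonly used. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical). The first option unpacks a dictionary into `agg` so that the key can be any string, and the second renames the column after aggregating.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset
toy = pd.DataFrame({'dept': ['Fire', 'Police', 'Fire', 'Police'],
                    'salary': [50000, 60000, 70000, 80000]})

# Option 1: unpack a dictionary so the new column name can contain a space
spaced = toy.groupby('dept').agg(**{'avg salary': ('salary', 'mean')})

# Option 2: aggregate with a valid name, then rename the column
renamed = (toy.groupby('dept')
              .agg(avg_salary=('salary', 'mean'))
              .rename(columns={'avg_salary': 'avg salary'}))

print(spaced)
```

Both options produce the same one-column DataFrame. Which one reads better is largely a matter of taste.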

### Tuple of aggregating column and aggregating function

The parameter used as the new column name must be set equal to a two-item tuple of the aggregating column and the aggregating function.

### Use string names for aggregation functions

Notice that a string was used to identify the aggregation in the syntax above. pandas understands many string aggregation functions. Below are most of the available string names you can use. Later on we will see where these names come from and how to discover them on your own.

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std`
* `var`
* `count` - count of non-missing values
* `size` - count of all elements
* `first` - first value in group
* `last` - last value in group
* `idxmax` - index of maximum value in group
* `idxmin` - index of minimum value in group
* `nunique` - number of unique values in group
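
The difference between `count`, `size`, and `nunique` is easiest to see when a group contains missing or repeated values. The following is a small sketch on made-up data (the `toy` DataFrame is hypothetical), using the same `agg` syntax as above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Fire has one missing salary, Police has a repeated salary
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Fire', 'Police', 'Police'],
    'salary': [50000, np.nan, 70000, 60000, 60000],
})

# count ignores missing values
counts = toy.groupby('dept').agg(n_non_missing=('salary', 'count'))

# size counts every row in the group, missing or not
sizes = toy.groupby('dept').agg(n_rows=('salary', 'size'))

# nunique counts distinct non-missing values
uniques = toy.groupby('dept').agg(n_unique=('salary', 'nunique'))

print(counts, sizes, uniques, sep='\n\n')
```

For the Fire group, `count` gives 2 while `size` gives 3, because one salary is missing. For the Police group, `nunique` gives 1 because both salaries are identical.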

### More examples

Let's complete a couple more examples with different grouping columns and different aggregating functions. The salary column is the only numeric column, so we will continue to use it as the aggregating column. Let's find the maximum salary for each title. The grouping column is `title`, the aggregating column is `salary`, and the aggregating function is `max`. We use `max_salary` as the new column name.

In [3]:
emp.groupby('title').agg(max_salary=('salary', 'max')).head().round(-3)

title,max_salary
3-1-1 TELECOMMUNICATOR,44000.0
3-1-1 TELECOMMUNICATOR SUPERVISOR,54000.0
9-1-1 CUSTODIAN OF RECORDS,58000.0
9-1-1 PSAP SUPERVISOR,69000.0
9-1-1 PSAP SUPERVISOR-FIRE/EMS,73000.0


Let's find the sum of all salaries by sex. The grouping column is `sex`, the aggregating column is `salary`, the aggregating function is `sum` and the new column is `sum_salary`.

In [4]:
emp.groupby('sex').agg(sum_salary=('salary', 'sum'))

sex,sum_salary
Female,398010900.0
Male,961815500.0


## Aligning the dots when method chaining

When using `groupby`, you'll often chain together multiple methods making for long lines of code. To help with readability, place each method on a separate line directly below the method before it. You'll also need to wrap the entire command in parentheses to alert Python that this is a single logical line of code. When done properly, you'll have each method aligned underneath the first dot. Here is an example that you will encounter often.

```python
(df.groupby('grouping column')
   .agg(new_column=('aggregating column', 'aggregating function'))
   .round()
   .reset_index()
   .head())
```
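
As a runnable illustration of this layout, the sketch below applies the same chain to a small made-up DataFrame (`toy` and its values are hypothetical) and confirms that the multi-line version returns the same result as the equivalent single line.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset
toy = pd.DataFrame({'dept': ['Fire', 'Police', 'Fire', 'Police'],
                    'salary': [50400, 61000, 70200, 80400]})

# Wrapped in parentheses, one method per line, dots aligned
chained = (toy.groupby('dept')
              .agg(avg_salary=('salary', 'mean'))
              .round(-3)
              .reset_index()
              .head())

# The same command written as one long line
one_line = toy.groupby('dept').agg(avg_salary=('salary', 'mean')).round(-3).reset_index().head()

print(chained)
print(chained.equals(one_line))
```

The outer parentheses are what allow the line breaks. Without them, Python would treat the first line as a complete statement and raise a syntax error on the next line.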

## The index when grouping

If you were paying close attention, you noticed that the grouping column gets placed in the index after a call to the `groupby` method. In the example below, notice that `sex` is the new index and is not a column. Also note that the returned object is a **one-column DataFrame** and NOT a Series.

In [5]:
sex_avg_salary = (emp.groupby('sex')
                     .agg(avg_salary=('salary', 'mean'))
                     .round(-3))
sex_avg_salary

sex,avg_salary
Female,55000.0
Male,60000.0


### The index `name`

You might be confused as to why there is the word 'sex' directly above the index. It looks like it is a column name, but technically it is not. It is the `name` of the index and can be accessed as an index attribute.

In [6]:
sex_avg_salary.index.name

'sex'

### The `reset_index` method

All DataFrames come equipped with a `reset_index` method, which converts the index into the first column of a DataFrame. The new index will become a simple `RangeIndex`, the sequence of integers beginning at 0.

In [7]:
(emp.groupby('sex')
    .agg(avg_salary=('salary', 'mean'))
    .round(-3)
    .reset_index())

Unnamed: 0,sex,avg_salary
0,Female,55000.0
1,Male,60000.0


You can also set the `as_index` parameter to `False` to keep the grouping column as a regular column instead of placing it in the index.

In [8]:
(emp.groupby('sex', as_index=False)
    .agg(avg_salary=('salary', 'mean')))

Unnamed: 0,sex,avg_salary
0,Female,54754.560586
1,Male,59766.076115


### Not sorting the groups

You may have also noticed that the returned DataFrame is sorted by the grouping column by default. This is a nice feature, but you can turn it off by setting the `sort` parameter to `False`, which can improve performance because pandas skips the sorting step. The groups will then appear in the order in which each value first appears in the grouping column.

In [9]:
(emp.groupby('title', sort=False)
    .agg(max_salary=('salary', 'max'))
    .head()
    .round(-3))

title,max_salary
POLICE SERGEANT,88000.0
ASSISTANT CITY ATTORNEY II,90000.0
SENIOR SLUDGE PROCESSOR,49000.0
SENIOR POLICE OFFICER,76000.0
SENIOR ACCOUNT CLERK,48000.0
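
The effect of `sort` on group order can be checked on a small made-up DataFrame (`toy` and its values are hypothetical). This sketch compares the default behavior with `sort=False`.

```python
import pandas as pd

# Hypothetical toy data: titles deliberately not in alphabetical order
toy = pd.DataFrame({
    'title': ['Sergeant', 'Attorney', 'Clerk', 'Attorney', 'Sergeant'],
    'salary': [88, 90, 48, 82, 75],
})

# Default: groups are sorted by the grouping column
sorted_groups = toy.groupby('title').agg(max_salary=('salary', 'max'))

# sort=False: groups keep the order of first appearance
unsorted_groups = toy.groupby('title', sort=False).agg(max_salary=('salary', 'max'))

print(list(sorted_groups.index))
print(list(unsorted_groups.index))
```

The aggregated values are identical in both results. Only the order of the rows differs.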


## More on method chaining with `groupby`

The `groupby` syntax is a bit strange in that it requires method chaining to deliver results. Let's examine the results of making a call just to the `groupby` method.

In [10]:
emp.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022F6EB07290>

### What is that?

Every method call in Python returns something, even if that object is `None`. Calling the `groupby` method is no different: it returns a `DataFrameGroupBy` object. Just like all pandas objects, you can see a list of all its [attributes and methods in the API][2]. It is not crucial to dive into this type of object at this point. Calling `groupby` by itself does not do much. You are simply alerting pandas that you would like to create distinct groups using the unique values in a particular column.

### Assign the `groupby` object to a variable

Let's assign the result of the call to `groupby` as a variable and verify its type.

[2]: http://pandas.pydata.org/pandas-docs/stable/reference/groupby.html

In [11]:
g = emp.groupby('sex')
type(g)

pandas.core.groupby.generic.DataFrameGroupBy

## `GroupBy` objects

The documentation refers to the object returned from a call to the `groupby` method as a **GroupBy** object. Technically, there are two specific objects - `DataFrameGroupBy` (as we saw above) and `SeriesGroupBy`. It's not necessary to know much about these objects. Just be aware that a call to `groupby` returns some other object that is not a DataFrame or a Series. It is a GroupBy object with its own attributes and methods.
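
To see both kinds of object, the sketch below groups a small made-up DataFrame and one of its columns (`toy` and its values are hypothetical). Calling `groupby` on a DataFrame gives a `DataFrameGroupBy`, while calling it on a Series gives a `SeriesGroupBy`.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Female'],
                    'salary': [60000, 55000, 62000, 58000]})

# Grouping a DataFrame
df_grouped = toy.groupby('sex')

# Grouping a single Series by the values of another Series
series_grouped = toy['salary'].groupby(toy['sex'])

print(type(df_grouped).__name__)
print(type(series_grouped).__name__)
```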

### Some exploration of the `GroupBy` object

This GroupBy object can be explored just like any other object in Python. It has attributes and methods that can be accessed through dot notation. Below, we take a look at the `groups` and `ngroups` attributes. The `groups` attribute is a dictionary that maps each group value to the index labels of the rows in that group, while `ngroups` returns the number of distinct groups.

In [12]:
g.groups

{'Female': [5, 6, 10, 12, 14, 19, 20, 24, 26, 28, 36, 37, 41, 52, 55, 56, 59, 62, 69, 71, 76, 78, 79, 80, 83, 86, 87, 97, 98, 101, 103, 105, 118, 119, 123, 127, 130, 139, 144, 164, 169, 176, 177, 178, 185, 188, 195, 199, 203, 206, 208, 209, 213, 215, 216, 218, 221, 223, 224, 226, 228, 231, 232, 233, 236, 238, 239, 240, 249, 253, 256, 261, 262, 264, 266, 268, 269, 275, 287, 289, 292, 293, 295, 300, 308, 316, 317, 326, 327, 338, 339, 343, 344, 345, 347, 351, 355, 365, 366, 374, ...], 'Male': [0, 1, 2, 3, 4, 7, 8, 9, 11, 13, 15, 16, 17, 18, 21, 22, 23, 25, 27, 29, 30, 31, 32, 33, 34, 35, 38, 39, 40, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 53, 54, 57, 58, 60, 61, 63, 64, 65, 66, 67, 68, 70, 72, 73, 74, 75, 77, 81, 82, 84, 85, 88, 89, 90, 91, 92, 93, 94, 95, 96, 99, 100, 102, 104, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 120, 121, 122, 124, 125, 126, 128, 129, 131, 132, 133, 134, 135, 136, ...]}

In [13]:
g.ngroups

2
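
The values stored in `groups` are index labels, which coincide with integer positions only when the DataFrame has the default index. The sketch below makes this visible by giving a small made-up DataFrame a string index (`toy`, its values, and the labels are all hypothetical).

```python
import pandas as pd

# Hypothetical toy data with string index labels instead of 0, 1, 2, ...
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Female'],
                    'salary': [60000, 55000, 62000, 58000]},
                   index=['a', 'b', 'c', 'd'])

g_toy = toy.groupby('sex')

# Each group value maps to the index labels of its rows
for value, labels in g_toy.groups.items():
    print(value, list(labels))

print(g_toy.ngroups)
```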

### Calling the `agg` method from the GroupBy object

We can call the `agg` method from this GroupBy object to complete another aggregation similar to how we did before.

In [14]:
g.agg(std_salary=('salary', 'std'))

sex,std_salary
Female,25087.145161
Male,22306.107822


### Atypical usage 

Even though it is syntactically correct to assign the result of a call to the `groupby` method to a variable name and then call the `agg` method, it is rarely written like this and is usually completed in a single line as demonstrated below. The primary message of the last section was to show how an intermediate object was created (either a `DataFrameGroupBy` or a `SeriesGroupBy`) and it is this object that is being used to do the aggregation.

In [15]:
emp.groupby('sex').agg(std_salary=('salary', 'std'))

sex,std_salary
Female,25087.145161
Male,22306.107822


## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Find the maximum salary for each sex.</span>

In [19]:
emp.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.1,Male,Hispanic
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White


In [17]:
emp.groupby('sex').agg(max_salary=('salary','max'))

sex,max_salary
Female,342784.0
Male,342784.0


### Exercise 2

<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [27]:
(emp.groupby('dept')
    .agg(median_salary=('salary','median'))
    .round(-3)
    .sort_values(by='median_salary',ascending=False)
)

dept,median_salary
Police,68000.0
Fire,62000.0
Other,53000.0
Health & Human Services,51000.0
Houston Public Works,47000.0
Houston Airport System,44000.0
Solid Waste Management,39000.0
Library,35000.0
Parks & Recreation,32000.0


### Exercise 3

<span style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [31]:
emp.groupby('race', as_index=False).agg(avg_salary=('salary','mean'))

Unnamed: 0,race,avg_salary
0,Asian,65316.197885
1,Black,52264.180833
2,Hispanic,54811.349584
3,Native American,58153.109371
4,White,66611.692973


### Exercise 4

<span style="color:green; font-size:16px">Find the number of employees in each department.</span>

In [45]:
emp.groupby('dept').agg(employee_count=('hire_date','size'))

dept,employee_count
Fire,4376
Health & Human Services,1353
Houston Airport System,1216
Houston Public Works,4190
Library,563
Other,3373
Parks & Recreation,1152
Police,7573
Solid Waste Management,512


### Exercise 5

<span style="color:green; font-size:16px">Find the number of unique titles there are for each department.</span>

In [34]:
emp.groupby('dept').agg(unique_titles=('title','nunique'))

dept,unique_titles
Fire,77
Health & Human Services,161
Houston Airport System,137
Houston Public Works,215
Library,66
Other,358
Parks & Recreation,109
Police,145
Solid Waste Management,44


### Exercise 6

<span style="color:green; font-size:16px">Find the index of the employee with the maximum salary for each department and then use those index values to select their entire rows from the original DataFrame.</span>

In [35]:
emp.groupby('dept').agg(emp_max_idx=('salary','idxmax'))

dept,emp_max_idx
Fire,1732
Health & Human Services,8405
Houston Airport System,3897
Houston Public Works,10704
Library,7564
Other,13338
Parks & Recreation,11679
Police,4413
Solid Waste Management,20244


In [36]:
emp.loc[[1732]]

Unnamed: 0,dept,title,hire_date,salary,sex,race
1732,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,White


In [37]:
emp.groupby('dept').agg(max_salary=('salary','max'))

dept,max_salary
Fire,342784.0
Health & Human Services,186685.0
Houston Airport System,275000.0
Houston Public Works,275000.0
Library,170000.0
Other,275000.0
Parks & Recreation,150000.0
Police,280000.0
Solid Waste Management,195000.0
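
The cells above look up only one of the returned index values. To select every top earner at once, the whole column of index values can be passed to `.loc`. The sketch below shows the pattern on a small made-up DataFrame (`toy` and its values are hypothetical); the same two steps should apply to `emp`.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset
toy = pd.DataFrame({
    'dept': ['Fire', 'Police', 'Fire', 'Police', 'Library'],
    'salary': [60000, 75000, 90000, 70000, 40000],
})

# Step 1: index label of the maximum salary within each department
max_idx = toy.groupby('dept').agg(emp_max_idx=('salary', 'idxmax'))

# Step 2: pass all of those labels to .loc to select the complete rows
top_rows = toy.loc[max_idx['emp_max_idx'].tolist()]

print(top_rows)
```

The selected rows come back in the order of the groups (alphabetical by department here), not in the order of the original DataFrame.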


### Use the NYC deaths dataset for the remaining exercises

Execute the cell below to read in the NYC deaths dataset and use it to answer the following exercises.

In [38]:
deaths = pd.read_csv('../data/nyc_deaths.csv')
deaths.head(3)

Unnamed: 0,year,cause,sex,race,deaths
0,2007,Accidents,F,Asian,32
1,2007,Accidents,F,Black,87
2,2007,Accidents,F,Hispanic,71


### Exercise 7

<span style="color:green; font-size:16px">What year had the most deaths?</span>

In [48]:
deaths.groupby('year').agg(death_count=('deaths','sum')).nlargest(1,'death_count')

year,death_count
2008,54138


### Exercise 8

<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

In [None]:
(deaths.groupby('race')
       .agg(death_count=('deaths','sum'))
       .sort_values(by='death_count',ascending=False)
)

race,death_count
White,206487
Black,111116
Hispanic,74802
Asian,26355
Unknown,6238


### Exercise 9

<span  style="color:green; font-size:16px">Find the total number of deaths by cause and then select the five highest causes.</span>

In [49]:
deaths.groupby('cause').agg(cause_count=('deaths','sum')).nlargest(5,'cause_count')

cause,cause_count
Heart Disease,147551
Cancer,106367
Other,77999
Flu and Pneumonia,18678
Diabetes,13794
