# Alternative Groupby Syntax

This chapter covers a few alternative syntaxes for performing aggregations with the `groupby` method. They can confuse beginning pandas users because they add no extra power for data analysis beyond the syntax you have already seen. However, many pandas users rely on them, so it is important to recognize them. Let's begin by reading in the San Francisco employee compensation dataset.

In [1]:
import pandas as pd
import numpy as np
sf_emp = pd.read_csv('../data/sf_employee_compensation.csv')
sf_emp.head(3)

,year,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,Personnel Technician,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,2013,General Administration & Finance,Planner 2,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,2013,Public Protection,Firefighter,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2


## Aggregating a single column

In the syntax used so far, we set the new column name equal to a two-item tuple of the aggregating column and the aggregating function within the `agg` method.

In [2]:
sf_emp.groupby('organization group').agg(mean_salary=('salaries', 'mean'))

organization group,mean_salary
Community Health,59002.519797
Culture & Recreation,29496.244217
General Administration & Finance,59452.791117
General City Responsibilities,33787.559971
Human Welfare & Neighborhood Development,44410.937355
Public Protection,77298.638409
"Public Works, Transportation & Commerce",57852.310785


### Alternative - use a dictionary

Instead of a tuple, you can use a dictionary to map the aggregating column to the aggregating function. The generic syntax takes on the following form:

```python
df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})
```


Although this syntax uses less code, it does not allow you to rename columns during the aggregation.

In [3]:
sf_emp.groupby('organization group').agg({'salaries': 'mean'})

organization group,salaries
Community Health,59002.519797
Culture & Recreation,29496.244217
General Administration & Finance,59452.791117
General City Responsibilities,33787.559971
Human Welfare & Neighborhood Development,44410.937355
Public Protection,77298.638409
"Public Works, Transportation & Commerce",57852.310785
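
If you like the dictionary syntax but still want a custom column name, one option is to rename the column after aggregating. The sketch below uses a small made-up DataFrame (`toy` is hypothetical and not part of the dataset) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for sf_emp (values are made up)
toy = pd.DataFrame({
    'organization group': ['Public Protection', 'Public Protection',
                           'Community Health', 'Community Health'],
    'salaries': [100.0, 200.0, 50.0, 70.0],
})

# The dictionary syntax keeps the original column name ...
by_dict = toy.groupby('organization group').agg({'salaries': 'mean'})

# ... so renaming takes a separate step after the aggregation
renamed = by_dict.rename(columns={'salaries': 'mean_salary'})

# Named aggregation (the tuple syntax) does both in a single call
named = toy.groupby('organization group').agg(mean_salary=('salaries', 'mean'))

print(renamed)
print(renamed.equals(named))
```

Both routes should produce the same DataFrame; the difference is only whether the naming happens inside or after `agg`.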


### Alternative - select the column with the brackets

Instead of using a dictionary, place the aggregating column in brackets immediately after the `groupby` call and then pass the aggregating function as a string to the `agg` method. Selecting a single column this way returns a Series rather than a DataFrame. The generic syntax takes on the following form:

```python
df.groupby('grouping column')['aggregating column'].agg('aggregating function')
```

In [4]:
sf_emp.groupby('organization group')['salaries'].agg('mean')

organization group
Community Health                            59002.519797
Culture & Recreation                        29496.244217
General Administration & Finance            59452.791117
General City Responsibilities               33787.559971
Human Welfare & Neighborhood Development    44410.937355
Public Protection                           77298.638409
Public Works, Transportation & Commerce     57852.310785
Name: salaries, dtype: float64

You can even bypass the `agg` method and use the name of the aggregation as a method directly after the brackets.

In [8]:
sf_emp.groupby('organization group')['salaries'].mean()

organization group
Community Health                            59002.519797
Culture & Recreation                        29496.244217
General Administration & Finance            59452.791117
General City Responsibilities               33787.559971
Human Welfare & Neighborhood Development    44410.937355
Public Protection                           77298.638409
Public Works, Transportation & Commerce     57852.310785
Name: salaries, dtype: float64

Use double brackets (a one-item list of column names) to return a DataFrame instead of a Series.

In [7]:
sf_emp.groupby('organization group')[['salaries']].mean()

organization group,salaries
Community Health,59002.519797
Culture & Recreation,29496.244217
General Administration & Finance,59452.791117
General City Responsibilities,33787.559971
Human Welfare & Neighborhood Development,44410.937355
Public Protection,77298.638409
"Public Works, Transportation & Commerce",57852.310785
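
The difference between single and double brackets is easy to miss in the output above, so here is a minimal sketch that checks the returned types directly. The `toy` DataFrame is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for sf_emp (values are made up)
toy = pd.DataFrame({
    'organization group': ['Public Protection', 'Public Protection',
                           'Community Health', 'Community Health'],
    'salaries': [100.0, 200.0, 50.0, 70.0],
    'overtime': [10.0, 30.0, 0.0, 4.0],
})
grouped = toy.groupby('organization group')

as_series = grouped['salaries'].mean()    # single brackets: one column label
as_frame = grouped[['salaries']].mean()   # double brackets: a list of labels

print(type(as_series).__name__)
print(type(as_frame).__name__)

# The numbers are identical; only the container differs
same_values = as_series.equals(as_frame['salaries'])
print(same_values)

# A list can also hold more than one aggregating column
both = grouped[['salaries', 'overtime']].mean()
print(both)
```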


### Possible advantage - allows for multiple aggregating columns

The dictionary and bracket syntaxes both accept a list of aggregating functions, which takes less code than naming each result separately. Here, we use a dictionary to map the aggregating column to a list of all the aggregating functions we desire.

In [9]:
sf_emp.groupby('organization group').agg({'salaries': ['mean', 'min', 'max']})

,salaries,salaries,salaries
organization group,mean,min,max
Community Health,59002.519797,-2716.5,284024.31
Culture & Recreation,29496.244217,0.0,237551.67
General Administration & Finance,59452.791117,-1683.6,301104.04
General City Responsibilities,33787.559971,0.0,365749.75
Human Welfare & Neighborhood Development,44410.937355,-1119.86,221076.01
Public Protection,77298.638409,-2984.52,645739.46
"Public Works, Transportation & Commerce",57852.310785,-1461.5,253688.17


The original syntax requires more code, but I prefer it: it returns a DataFrame with a single-level index for the columns, and each column can be named exactly as we desire.

In [None]:
sf_emp.groupby('organization group').agg(mean_salary=('salaries', 'mean'),
                                         min_salary=('salaries', 'min'),
                                         max_salary=('salaries', 'max'))
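
To make the structural difference concrete, the following sketch runs both syntaxes on a small made-up DataFrame (`toy` is hypothetical, not the real dataset) and inspects the resulting column labels.

```python
import pandas as pd

# Hypothetical toy data standing in for sf_emp (values are made up)
toy = pd.DataFrame({
    'organization group': ['Public Protection', 'Public Protection',
                           'Community Health', 'Community Health'],
    'salaries': [100.0, 200.0, 50.0, 70.0],
})
grouped = toy.groupby('organization group')

# Named aggregation: one keyword per output column
named = grouped.agg(mean_salary=('salaries', 'mean'),
                    min_salary=('salaries', 'min'),
                    max_salary=('salaries', 'max'))

# Dictionary syntax: one list of functions per aggregating column
by_dict = grouped.agg({'salaries': ['mean', 'min', 'max']})

print(named.columns.tolist())    # plain strings
print(by_dict.columns.tolist())  # (column, function) tuples
print(named.columns.nlevels, by_dict.columns.nlevels)
```

The named version has one level of column labels, while the dictionary version has two, which is why selecting a column from the latter requires a tuple such as `('salaries', 'mean')`.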

The dictionary can also map several aggregating columns, each to its own list of functions.

In [12]:
sf_emp.groupby('organization group').agg({'salaries': ['mean', 'min', 'max'],
                                          'overtime': ['mean', 'min', 'max']})

,salaries,salaries,salaries,overtime,overtime,overtime
organization group,mean,min,max,mean,min,max
Community Health,59002.519797,-2716.5,284024.31,2019.964453,-18458.15,78206.57
Culture & Recreation,29496.244217,0.0,237551.67,833.305678,-45.56,100443.69
General Administration & Finance,59452.791117,-1683.6,301104.04,760.437942,-188.52,42005.2
General City Responsibilities,33787.559971,0.0,365749.75,3071.33064,0.0,214556.62
Human Welfare & Neighborhood Development,44410.937355,-1119.86,221076.01,655.851562,0.0,41581.89
Public Protection,77298.638409,-2984.52,645739.46,11517.841398,0.0,258124.17
"Public Works, Transportation & Commerce",57852.310785,-1461.5,253688.17,5069.196369,-48.0,152088.8
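
If you end up with two levels of column labels like this, one common way to get back to a single level is to join each pair of labels into one string. This is only a sketch on made-up data (`toy` is hypothetical), and the underscore separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data standing in for sf_emp (values are made up)
toy = pd.DataFrame({
    'organization group': ['Public Protection', 'Public Protection',
                           'Community Health', 'Community Health'],
    'salaries': [100.0, 200.0, 50.0, 70.0],
    'overtime': [10.0, 30.0, 0.0, 4.0],
})

result = toy.groupby('organization group').agg(
    {'salaries': ['mean', 'max'], 'overtime': ['mean', 'max']})

# With two levels, a single column is selected with a tuple ...
mean_salary = result[('salaries', 'mean')]
# ... and the outer label alone returns every statistic for that column
salary_block = result['salaries']

# Flatten on a copy: each column label is a (column, function) tuple
flat = result.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat.columns.tolist())
```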


## No Aggregating Columns

You do not need to specify the aggregating columns when you call the aggregation as a method directly on the grouped object. In that case, every column not used for grouping is aggregated. Here, we set `numeric_only` to `True` so that only the numeric columns are averaged and the string column `job` is excluded.

In [10]:
(sf_emp.groupby(['year', 'organization group'])
       .mean(numeric_only=True)
       .head()
       .round(-3))

year,organization group,salaries,overtime,other salaries,retirement,health and dental,other benefits
2013,Community Health,59000.0,2000.0,4000.0,11000.0,8000.0,5000.0
2013,Culture & Recreation,31000.0,1000.0,1000.0,5000.0,6000.0,3000.0
2013,General Administration & Finance,65000.0,1000.0,1000.0,12000.0,9000.0,5000.0
2013,General City Responsibilities,13000.0,2000.0,2000.0,3000.0,3000.0,1000.0
2013,Human Welfare & Neighborhood Development,50000.0,0.0,1000.0,9000.0,9000.0,4000.0
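
The effect of `numeric_only=True` can be checked on a small example. The sketch below uses a made-up DataFrame (`toy` and its job titles are hypothetical) that mixes a string column with numeric ones.

```python
import pandas as pd

# Hypothetical toy data with one string column ('job') besides the grouping columns
toy = pd.DataFrame({
    'year': [2013, 2013, 2014, 2014],
    'organization group': ['Public Protection', 'Community Health',
                           'Public Protection', 'Community Health'],
    'job': ['Firefighter', 'Nurse', 'Firefighter', 'Nurse'],
    'salaries': [100.0, 50.0, 200.0, 70.0],
    'overtime': [10.0, 0.0, 30.0, 4.0],
})

# No aggregating columns selected: all remaining numeric columns are averaged
means = toy.groupby(['year', 'organization group']).mean(numeric_only=True)

print(means.columns.tolist())    # 'job' is not among the results
print(list(means.index.names))   # the grouping columns become the index
```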


## Exercises

Execute the cell below to read in the flights dataset and then use it for the following exercises.

In [11]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.head(3)

,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0


### Exercise 1

<span style="color:green; font-size:16px">Use a dictionary in the `groupby` `agg` method to calculate the mean, median, min, and max of the air time for every airline.</span>

In [13]:
flights.groupby('airline').agg({'air_time': ['mean', 'median', 'min', 'max']})

,air_time,air_time,air_time,air_time
airline,mean,median,min,max
9E,89.901705,84.0,24.0,224.0
AA,147.940078,126.0,22.0,421.0
AS,185.414344,149.0,38.0,395.0
B6,170.061845,132.0,32.0,428.0
DL,146.034775,125.0,22.0,405.0
EV,57.440252,45.0,35.0,178.0
F9,139.187223,124.0,55.0,327.0
MQ,80.21813,83.0,20.0,164.0
NK,147.173162,133.0,40.0,388.0
OH,66.418699,70.5,28.0,142.0


### Exercise 2

<span style="color:green; font-size:16px">Without using the `agg` method, calculate the number of unique destinations for each airline.</span>

In [18]:
flights.groupby('airline')['dest'].nunique()

airline
9E    13
AA    20
AS    18
B6    19
DL    20
EV     8
F9    17
MQ    12
NK    16
OH     9
OO    19
UA    19
VX    12
WN    15
YV    10
YX    14
Name: dest, dtype: int64

### Exercise 3

<span style="color:green; font-size:16px">Calculate the mean of every numeric column for each airline and origin without using the `agg` method.</span>

In [19]:
flights.groupby(['airline', 'origin']).mean(numeric_only=True)

airline,origin,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
9E,ATL,729.714286,840.142857,0.000000,105.142857,689.000000,0.000000,0.000000,3.857143,0.0,0.000000
9E,BOS,1320.029412,1453.970588,0.049020,47.103093,191.558824,2.284314,0.029412,8.607843,0.0,4.176471
9E,CLT,1254.613636,1461.238636,0.068182,82.926829,554.397727,5.261364,0.000000,4.022727,0.0,4.295455
9E,DCA,1168.218182,1309.563636,0.018182,44.629630,216.490909,10.945455,0.000000,1.309091,0.0,2.163636
9E,DFW,1346.068182,1682.136364,0.011364,137.080460,1054.022727,2.840909,0.738636,3.772727,0.0,11.875000
...,...,...,...,...,...,...,...,...,...,...,...
YX,JFK,1519.901961,1677.705882,0.078431,57.191489,278.725490,0.000000,0.000000,3.078431,0.0,2.980392
YX,LGA,1272.184569,1456.931921,0.046899,91.452229,546.311649,3.028744,0.422088,3.836611,0.0,4.751891
YX,MSP,1167.055556,1494.888889,0.037037,114.737179,843.049383,1.265432,0.037037,7.962963,0.0,5.067901
YX,ORD,1374.639535,1515.186047,0.027132,86.163347,589.135659,2.093023,0.352713,7.015504,0.0,7.569767
