# More Groupby Methods

Groupby objects have many more methods than `agg`, `filter`, and `transform`. In this chapter, you'll learn how to discover and use them.

## Kinds of groupby attributes and methods

All groupby methods act on either a Series or a DataFrame. If a single column name appears in the brackets following the call to the `groupby` method, the methods act on a Series. If there are no brackets, or a list of column names in the brackets, they act on a DataFrame. Let's see examples of both kinds, beginning by reading in the San Francisco employee compensation dataset.

In [3]:
import pandas as pd
import numpy as np
sf_emp = (pd.read_csv('../data/sf_employee_compensation.csv')
            .drop(columns='job'))
sf_emp.head(3)

Unnamed: 0,year,organization group,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,2013,General Administration & Finance,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,2013,Public Protection,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2


All grouping begins with a call to the `groupby` method by providing it the grouping column(s). Let's assign the object returned when grouping by organization group and output its type.

In [4]:
g_df = sf_emp.groupby('organization group')
type(g_df)

pandas.core.groupby.generic.DataFrameGroupBy

The technical name for this object is `DataFrameGroupBy` and its methods can act on the sub-DataFrame of that group. Let's call the same `groupby` method, but this time put the `salaries` column in brackets following it. A `SeriesGroupBy` is produced and its methods can only act on the salaries Series.

In [5]:
g_series = sf_emp.groupby('organization group')['salaries']
type(g_series)

pandas.core.groupby.generic.SeriesGroupBy
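
The rule about brackets can be checked on a small frame. The sketch below uses a hypothetical toy DataFrame (the names `toy`, `dept`, `pay`, and `bonus` are illustrative and not part of the San Francisco data) and tests which kind of object each spelling produces.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy, SeriesGroupBy

# Hypothetical toy data, used only to illustrate the two kinds of objects
toy = pd.DataFrame({'dept': ['a', 'b', 'a', 'b'],
                    'pay': [10.0, 20.0, 30.0, 40.0],
                    'bonus': [1.0, 2.0, 3.0, 4.0]})

no_brackets = toy.groupby('dept')                     # no brackets
one_column = toy.groupby('dept')['pay']               # a single column name
column_list = toy.groupby('dept')[['pay', 'bonus']]   # a list of column names
single_item_list = toy.groupby('dept')[['pay']]       # a list holding one name

for obj in (no_brackets, one_column, column_list, single_item_list):
    print(type(obj).__name__)
```

Note that a list containing a single column name still yields a `DataFrameGroupBy`; only a bare column name yields a `SeriesGroupBy`.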

### `GroupBy` API

Take a look at the [`GroupBy` API in the official documentation][1] for a list of all the possible methods. Most of them will overlap with the normal DataFrame methods that were previously covered.

[1]: https://pandas.pydata.org/docs/reference/groupby.html

## Finding all available attributes and methods

The vast majority of DataFrameGroupBy and SeriesGroupBy attributes and methods overlap. Here we retrieve nearly all of the public attributes and methods for each (names of 15 or more characters are skipped) and print the ones from the SeriesGroupBy to the screen.

In [6]:
public_gbs_methods = [m for m in dir(g_series) 
                      if not m.startswith('_') and len(m) < 15]
public_gbdf_methods = [m for m in dir(g_df) if not m.startswith('_')  
                       and m not in sf_emp.columns and len(m) < 15]
for i, method in enumerate(public_gbs_methods):
    end = '\n' if i % 4 == 3 else ''
    print(f'{method:16}', end=end)

agg             aggregate       all             any             
apply           bfill           corr            count           
cov             cumcount        cummax          cummin          
cumprod         cumsum          describe        diff            
dtype           ewm             expanding       ffill           
fillna          filter          first           get_group       
groups          head            hist            idxmax          
idxmin          indices         last            max             
mean            median          min             ndim            
ngroup          ngroups         nlargest        nsmallest       
nth             nunique         ohlc            pct_change      
pipe            plot            prod            quantile        
rank            resample        rolling         sample          
sem             shift           size            skew            
std             sum             tail            take            
transform       unique   

You should be familiar with many of these attributes and methods as they overlap with the ones available directly from a normal Series. We can take the set difference to determine the attributes and methods unique to each one. These are unique to SeriesGroupBy.

In [7]:
set(public_gbs_methods) - set(public_gbdf_methods)

{'dtype', 'nlargest', 'nsmallest', 'unique'}

These are unique to DataFrameGroupBy objects.

In [8]:
set(public_gbdf_methods) - set(public_gbs_methods)

{'boxplot', 'corrwith', 'dtypes'}

## Calling single aggregation methods

You can bypass the `agg` method by calling the aggregation method directly from one of the groupby objects. The disadvantage is that you can't name the resulting column in the same call. Here, we take the maximum of the salaries column for each organization group.

In [9]:
g_series.max()

organization group
Community Health                            284024.31
Culture & Recreation                        237551.67
General Administration & Finance            301104.04
General City Responsibilities               365749.75
Human Welfare & Neighborhood Development    221076.01
Public Protection                           645739.46
Public Works, Transportation & Commerce     253688.17
Name: salaries, dtype: float64
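
To make the trade-off concrete, here is a minimal sketch on hypothetical toy data (`toy`, `dept`, `pay`, and `max_pay` are illustrative names). It compares the direct call with the string form of `agg`, and shows the keyword form of `agg` that names the output column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'dept': ['a', 'b', 'a', 'b'],
                    'pay': [10.0, 20.0, 30.0, 40.0]})
g = toy.groupby('dept')['pay']

direct = g.max()                 # Series that keeps the name 'pay'
via_agg = g.agg('max')           # same values through agg
renamed = g.agg(max_pay='max')   # DataFrame whose single column is 'max_pay'

print(direct.equals(via_agg))
print(renamed)
```

A name can still be attached to the direct result afterwards by chaining the Series `rename` method, at the cost of an extra step.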

With `g_df`, the maximum of all non-grouping columns is returned.

In [10]:
g_df.max()

organization group,year,salaries,overtime,other salaries,retirement,health and dental,other benefits
Community Health,2019,284024.31,78206.57,184400.35,64849.97,29536.06,31810.05
Culture & Recreation,2019,237551.67,100443.69,25774.91,42640.9,29620.66,30915.83
General Administration & Finance,2019,301104.04,42005.2,123838.62,60099.27,29521.98,34138.52
General City Responsibilities,2019,365749.75,214556.62,106461.3,68260.85,36369.96,37243.28
Human Welfare & Neighborhood Development,2019,221076.01,41581.89,26899.6,40959.96,29522.02,33314.41
Public Protection,2019,645739.46,258124.17,239294.57,120791.4,36369.92,37563.46
"Public Works, Transportation & Commerce",2019,253688.17,152088.8,48361.88,47617.3,33495.46,33602.3


### The entire syntax

It's rare that the intermediate call to the `groupby` method will be assigned to a variable as we've done in this chapter. We are only doing this to avoid the repetitive nature of calling the same method over again. When completing these operations in practice, you'll likely begin with the original DataFrame, call the `groupby` method, and then chain the grouping method you desire. Below, the full syntax is given for the last two operations.

```python
sf_emp.groupby('organization group')['salaries'].max()
sf_emp.groupby('organization group').max()
```

### More aggregating methods

Most of the aggregating methods available to normal Series and DataFrames are available to their groupby counterparts. Nearly all of them return a single value for each group. However, the `describe` method returns many aggregations. You can provide it a list of percentiles to return as well. Here, we get many summary statistics for the salaries column on all the groups.

In [11]:
(g_series.describe(percentiles=[0.01, 0.2, 0.5, 0.8, 0.99])
         .round(0)
         .style.format('{:,.0f}'))

organization group,count,mean,std,min,1%,20%,50%,80%,99%,max
Community Health,9044,59003,47983,-2716,0,10551,54653,97752,203195,284024
Culture & Recreation,3697,29496,33063,0,0,1535,14336,61125,129433,237552
General Administration & Finance,3707,59453,51372,-1684,0,5906,55100,100346,201274,301104
General City Responsibilities,9176,33788,47922,0,0,0,411,77756,175333,365750
Human Welfare & Neighborhood Development,3758,44411,39557,-1120,0,3799,40310,81490,154825,221076
Public Protection,7867,77299,49381,-2985,0,22251,82421,119000,193350,645739
"Public Works, Transportation & Commerce",12751,57852,41369,-1462,0,12168,60108,90459,169203,253688


Calling the `describe` method on `g_df` would return a very wide DataFrame with all of these statistics calculated on each numeric column.
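
As a rough illustration of how wide that result gets, the sketch below calls `describe` on a `DataFrameGroupBy` built from hypothetical toy data (`toy`, `dept`, `pay`, and `bonus` are illustrative names). The columns come back with two levels: the original column name on the outside and the statistic on the inside.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({'dept': ['a', 'b', 'a', 'b'],
                    'pay': [10.0, 20.0, 30.0, 40.0],
                    'bonus': [1.0, 2.0, 3.0, 4.0]})

# Requesting only the median keeps the output small:
# count, mean, std, min, 50%, max for each numeric column
wide = toy.groupby('dept').describe(percentiles=[0.5])

print(wide.shape)          # rows = groups, columns = numeric columns x statistics
print(wide['pay'])         # statistics for a single original column
```

Selecting one outer label, as in `wide['pay']`, returns an ordinary DataFrame of statistics for that column.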

### The `size` method

The `size` method returns the number of rows in each group, which gives the same counts as calling `value_counts` on the grouping column. Because `size` offers fewer options (no sorting by count or normalization), I prefer `value_counts`.

In [13]:
g_series.size().head(10)

organization group
Community Health                             9044
Culture & Recreation                         3697
General Administration & Finance             3707
General City Responsibilities                9176
Human Welfare & Neighborhood Development     3758
Public Protection                            7867
Public Works, Transportation & Commerce     12751
Name: salaries, dtype: int64
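
The comparison can be sketched on hypothetical toy data (`toy`, `dept`, and `pay` are illustrative names). The example also includes `count`, which differs from `size` by excluding missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({'dept': ['b', 'a', 'b', 'b', 'a'],
                    'pay': [10.0, np.nan, 30.0, 40.0, 50.0]})

sizes = toy.groupby('dept').size()                 # rows per group, ordered by group label
counts = toy['dept'].value_counts()                # same numbers, ordered by frequency
non_missing = toy.groupby('dept')['pay'].count()   # non-missing values per group

print(sizes)
print(counts)
print(non_missing)
```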

## `head`, `tail`, and `nth` groupby methods

The `head` and `tail` methods return the first and last five rows, respectively, of each group. Set the parameter `n` to an integer to control the number of rows returned per group. Here we return the first two rows of the entire DataFrame for each organization group. Notice that the order of the rows is preserved and they are not sorted by the grouping column.

In [14]:
g_df.head(2)

Unnamed: 0,year,organization group,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,2013,General Administration & Finance,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,2013,Public Protection,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2
3,2013,Community Health,31856.0,0.0,0.0,6791.73,5262.99,2574.91
4,2013,Community Health,29590.58,0.0,5898.73,960.81,0.0,9230.03
5,2013,"Public Works, Transportation & Commerce",13063.05,0.0,0.0,0.0,0.0,1031.98
6,2013,"Public Works, Transportation & Commerce",51293.81,0.0,5086.76,12227.17,12918.24,4552.84
9,2013,Culture & Recreation,42175.61,0.0,463.19,9629.21,11124.05,3212.26
14,2013,General Administration & Finance,94416.74,0.0,0.0,16721.26,11063.09,7597.74
17,2013,Culture & Recreation,1821.16,0.0,27.0,0.0,664.72,143.09


The same operation on `g_series` is harder to read because only the salaries are returned, without the context of the grouping column. The index is preserved and can be used to verify correctness.

In [15]:
g_series.head(2)

0       71414.01
1       67941.06
2      116956.72
3       31856.00
4       29590.58
5       13063.05
6       51293.81
9       42175.61
14      94416.74
17       1821.16
36      10172.48
98      76596.01
276     37274.01
723     23270.40
Name: salaries, dtype: float64

The `nth` groupby method allows you to select exactly which rows from the group are returned using integer location. Pass it a single integer or a list of integers. For instance, the following returns rows with integer location 5 and 10 from each group.

In [16]:
g_df.nth([5, 10])

Unnamed: 0,year,organization group,salaries,overtime,other salaries,retirement,health and dental,other benefits
16,2013,Public Protection,0.0,0.0,167.03,0.0,3.04,0.0
21,2013,"Public Works, Transportation & Commerce",80538.2,3026.08,6682.25,15215.01,11708.04,7517.02
25,2013,Community Health,145865.95,0.0,0.0,25832.82,12801.79,17356.29
27,2013,Public Protection,66123.02,1234.17,9522.49,14443.75,12918.24,6310.89
43,2013,General Administration & Finance,73899.03,1454.65,1336.47,14637.5,12918.24,5710.47
46,2013,Culture & Recreation,64781.63,0.0,581.87,11902.67,11937.49,5394.02
53,2013,"Public Works, Transportation & Commerce",11252.03,302.71,86.37,3097.66,3476.43,899.79
57,2013,Community Health,77948.03,1056.84,3323.0,15988.76,12918.24,6785.64
70,2013,Culture & Recreation,16950.09,0.0,0.0,0.0,4460.62,1312.27
74,2013,General Administration & Finance,117287.47,0.0,9064.08,22402.14,8430.06,9359.44
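
A small sketch of `head` and `nth` on hypothetical toy data (`toy`, `dept`, and `pay` are illustrative names) may help when checking results by hand. Negative integers passed to `nth` count from the end of each group. The index of the `nth` result has differed between pandas versions, so the sketch only inspects the values.

```python
import pandas as pd

# Hypothetical toy data: group 'b' owns rows 0, 2, 4 and group 'a' owns rows 1, 3, 5
toy = pd.DataFrame({'dept': ['b', 'a', 'b', 'a', 'b', 'a'],
                    'pay': [1, 2, 3, 4, 5, 6]})
g = toy.groupby('dept')

first_two = g.head(2)                    # first two rows of each group, original order
first_and_last = g['pay'].nth([0, -1])   # first and last value of each group

print(first_two)
print(sorted(first_and_last.tolist()))
```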


### Groupby methods unique to Series

A few methods such as `nlargest`, `nsmallest`, and `unique` are unique to SeriesGroupBy objects. Here, we get the two largest salaries in each group.

In [17]:
s = g_series.nlargest(2)
s

organization group                             
Community Health                          28998    284024.31
                                          28956    281144.72
Culture & Recreation                      15261    237551.67
                                          16394    200650.56
General Administration & Finance          45386    301104.04
                                          46052    298427.00
General City Responsibilities             44789    365749.75
                                          48421    302472.02
Human Welfare & Neighborhood Development  29317    221076.01
                                          30003    221076.01
Public Protection                         42437    645739.46
                                          49041    281745.01
Public Works, Transportation & Commerce   39003    253688.17
                                          29109    247760.86
Name: salaries, dtype: float64

### Drop a level from the index with `droplevel`

Notice that a multilevel index was created. The inner level contains the index labels for the row with that salary. It isn't very meaningful here and can be dropped with the `droplevel` method. Pass it the integer location or name of the level to drop and it will return a Series without that level. Index levels are numbered beginning at 0 from the outside. 

In [18]:
s.droplevel(1)

organization group
Community Health                            284024.31
Community Health                            281144.72
Culture & Recreation                        237551.67
Culture & Recreation                        200650.56
General Administration & Finance            301104.04
General Administration & Finance            298427.00
General City Responsibilities               365749.75
General City Responsibilities               302472.02
Human Welfare & Neighborhood Development    221076.01
Human Welfare & Neighborhood Development    221076.01
Public Protection                           645739.46
Public Protection                           281745.01
Public Works, Transportation & Commerce     253688.17
Public Works, Transportation & Commerce     247760.86
Name: salaries, dtype: float64
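
The same two steps can be traced on hypothetical toy data (`toy`, `dept`, and `pay` are illustrative names). The sketch drops the inner level by position and, as an alternative, the outer level by name.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'dept': ['b', 'a', 'b', 'a', 'b'],
                    'pay': [10, 20, 30, 40, 50]})

top2 = toy.groupby('dept')['pay'].nlargest(2)   # two-level index: dept, original row label
by_group = top2.droplevel(1)                    # drop the inner level by integer location
by_row = top2.droplevel('dept')                 # drop the outer level by name

print(top2)
print(by_group)
print(by_row)
```

Keeping the inner level instead (as in `by_row`) is useful when the original row labels are needed to look up the full rows.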

## Non-aggregating methods

Many other methods do not aggregate and instead return a Series or DataFrame with the same length as the group. For the most part, they work exactly the same as they do on regular Series or DataFrames. To help teach these methods, a small example DataFrame will be created.

In [19]:
df = pd.DataFrame({'item': ['A', 'B', 'A', 'A', 'B', 'A', 'B', 'B'],
                   'quantity': [5, 3, 8, np.nan, 2, 15, np.nan, 6]})
df

Unnamed: 0,item,quantity
0,A,5.0
1,B,3.0
2,A,8.0
3,A,
4,B,2.0
5,A,15.0
6,B,
7,B,6.0


We'll use a SeriesGroupBy object to showcase these methods. 

In [20]:
g_series = df.groupby('item')['quantity']

All of the methods in this section preserve the order of the original values. They do NOT sort by the group. Take, for instance, the `cumsum` method, which accumulates the sum from the top within each group.

In [21]:
g_series.cumsum()

0     5.0
1     3.0
2    13.0
3     NaN
4     5.0
5    28.0
6     NaN
7    11.0
Name: quantity, dtype: float64

A Series is returned, but is difficult to decipher without it being attached to the original DataFrame. Let's add it as a column and then re-examine the output.

In [22]:
df['quantity_cumsum'] = g_series.cumsum()
df

Unnamed: 0,item,quantity,quantity_cumsum
0,A,5.0,5.0
1,B,3.0,3.0
2,A,8.0,13.0
3,A,,
4,B,2.0,5.0
5,A,15.0,28.0
6,B,,
7,B,6.0,11.0


The quantity column is accumulated independently within each group. The `cumcount` method is unique to groupby objects and provides the integer location of each row within its group, beginning at 0.

In [24]:
df['group_iloc'] = g_series.cumcount()
df

Unnamed: 0,item,quantity,quantity_cumsum,group_iloc
0,A,5.0,5.0,0
1,B,3.0,3.0,0
2,A,8.0,13.0,1
3,A,,,2
4,B,2.0,5.0,1
5,A,15.0,28.0,3
6,B,,,2
7,B,6.0,11.0,3
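
One use of `cumcount` is selecting rows by their position within a group using an ordinary boolean mask. The sketch below rebuilds the small example DataFrame and is only one possible approach; the variable names `position`, `from_end`, `second_rows`, and `last_rows` are illustrative.

```python
import numpy as np
import pandas as pd

# Same small example DataFrame as in this section
df = pd.DataFrame({'item': ['A', 'B', 'A', 'A', 'B', 'A', 'B', 'B'],
                   'quantity': [5, 3, 8, np.nan, 2, 15, np.nan, 6]})

position = df.groupby('item').cumcount()                  # 0, 1, 2, ... within each group
from_end = df.groupby('item').cumcount(ascending=False)   # counts down to 0 within each group

second_rows = df[position == 1]   # second row of each group
last_rows = df[from_end == 0]     # last row of each group

print(second_rows)
print(last_rows)
```

Missing quantities do not affect the numbering, since `cumcount` counts rows rather than values.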


Each quantity can be ranked using the `rank` method. Below, the largest quantity of each group gets ranked 1.

In [None]:
df['group_rank'] = g_series.rank(ascending=False)
df

We fill in each missing value with the previous known non-missing value of that group using the `ffill` method. Older code often writes this as `fillna(method='ffill')`, which recent pandas releases deprecate.

In [None]:
df['group_ffill'] = g_series.ffill()
df
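
For readers who want to check ranking and forward filling by hand, here is a minimal sketch that rebuilds the small example DataFrame and applies both operations per group. The variable names `ranks` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same small example DataFrame as in this section
df = pd.DataFrame({'item': ['A', 'B', 'A', 'A', 'B', 'A', 'B', 'B'],
                   'quantity': [5, 3, 8, np.nan, 2, 15, np.nan, 6]})
g = df.groupby('item')['quantity']

ranks = g.rank(ascending=False)   # largest quantity in each group gets rank 1; NaN stays NaN
filled = g.ffill()                # carry the last non-missing value forward within each group

print(df.assign(group_rank=ranks, group_ffill=filled))
```

Row 3 belongs to item A, so it is filled with 8 (the previous A value), not with 3 or 2 from the neighboring B rows.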

### Finding the highest scoring movie for each year

Let's read in the movie dataset and then find the highest scoring movie for each year.

In [None]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head(2)

Because the title is in the index, calling the `idxmax` method on the `imdb_score` column returns the movie with the highest score for each year.

In [None]:
movie.groupby('year')['imdb_score'].idxmax().tail()

Use the `agg` method to return both the score and movie title.

In [None]:
movie.groupby('year')['imdb_score'].agg(['max', 'idxmax']).tail()
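
When the full row for each winner is wanted, one option is to pass the labels returned by `idxmax` to `loc`. The sketch below uses a hypothetical four-row stand-in for the movie data; the titles and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset, with the title in the index
movie = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                      'imdb_score': [7.1, 8.3, 6.5, 6.0]},
                     index=pd.Index(['Alpha', 'Beta', 'Gamma', 'Delta'], name='title'))

best_titles = movie.groupby('year')['imdb_score'].idxmax()   # index label of the max per year
best_rows = movie.loc[best_titles]                           # full rows for those labels

print(best_titles)
print(best_rows)
```

This relies on the index labels being unique; with duplicate titles, `loc` would return every row sharing the label.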

## Summary of other groupby methods

The other groupby methods operate similarly to their DataFrame/Series counterparts, but do so on each independent grouping.

## Exercises

Execute the next cell to read in some of the columns from the flights dataset and use it to answer the following exercises.

In [23]:
import pandas as pd
cols = ['date', 'airline', 'origin', 'dest', 'dep_time', 'arr_time',
       'cancelled', 'air_time', 'distance', 'carrier_delay']
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'], usecols=cols)
flights.head(3)

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0


In [42]:
import pandas as pd

# ------------------------------------------------------------------------------
# INGESTION & TYPE OPTIMIZATION
# ------------------------------------------------------------------------------
# Architect's Note: 
# - We optimize strings to 'category' for group-heavy operations.
# - 'cancelled' is cast to bool for memory efficiency.
cols = ['date', 'airline', 'origin', 'dest', 'dep_time', 'arr_time',
        'cancelled', 'air_time', 'distance', 'carrier_delay']

flights = (pd.read_csv('../data/flights.csv', usecols=cols)
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        airline=lambda x: x['airline'].astype('category'),
        origin=lambda x: x['origin'].astype('category'),
        dest=lambda x: x['dest'].astype('category'),
        cancelled=lambda x: x['cancelled'].astype(bool)
    )
)

# ------------------------------------------------------------------------------
# EXERCISE 1: First and Last per Airline
# ------------------------------------------------------------------------------
first_last_airline = (flights
    .groupby('airline', observed=True)
    .nth([0, -1])
    .sort_values('airline')
)

# ------------------------------------------------------------------------------
# EXERCISE 2: 500th Flight per Route
# ------------------------------------------------------------------------------
# Architect's Note: .nth is 0-indexed; 500th flight is index 499.
five_hundredth_flight = (flights
    .groupby(['origin', 'dest'], observed=True)
    .nth(499)
)

# ------------------------------------------------------------------------------
# EXERCISE 3: Date of 10th Cancelled Flight per Airline
# ------------------------------------------------------------------------------
tenth_cancelled_date = (flights
    .query('cancelled')
    .groupby('airline', observed=True)
    .nth(9)
)

# ------------------------------------------------------------------------------
# EXERCISE 4: Avg Delay for Routes with > 300 Flights
# ------------------------------------------------------------------------------
avg_delay_busy_routes = (
    flights
    # 1. Filter out low-volume routes (The Bouncer)
    .groupby(['origin', 'dest'], observed=True)
    .filter(lambda x: len(x) > 300)
    
    # 2. Collapse remaining high-volume routes into a summary report
    .groupby(['origin', 'dest'], observed=True)
    .agg(
        avg_carrier_delay=('carrier_delay', 'mean'),
        flight_count=('carrier_delay', 'size')
    )
)

### Exercise 1

<span style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use the `nth` groupby method.</span>

In [43]:
first_last_airline

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
65851,2018-12-31,9E,CLT,JFK,1400,1603,False,80.0,541.0,0
84,2018-01-01,9E,IAH,ATL,1346,1651,False,86.0,689.0,0
10,2018-01-01,AA,DFW,DCA,610,959,False,131.0,1192.0,0
65914,2018-12-31,AA,DFW,SFO,2047,2245,False,194.0,1464.0,0
65922,2018-12-31,AS,SEA,DFW,2315,502,False,210.0,1660.0,3
8,2018-01-01,AS,SEA,SFO,605,816,False,97.0,679.0,0
65920,2018-12-31,B6,PHX,JFK,2234,509,False,233.0,2153.0,0
2,2018-01-01,B6,JFK,BOS,550,657,False,39.0,187.0,0
16,2018-01-01,DL,LGA,MCO,700,953,False,155.0,950.0,0
65910,2018-12-31,DL,ATL,SEA,1959,2237,False,281.0,2182.0,0


### Exercise 2

<span style="color:green; font-size:16px">For every origin and destination combination, select the 500th flight.</span>

In [44]:
five_hundredth_flight.sort_values(['origin','dest'],ascending=[True,True])

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
60067,2018-11-27,DL,JFK,LAX,1925,2300,False,325.0,2475.0,0
64087,2018-12-21,WN,LAS,LAX,545,655,False,48.0,236.0,0
60353,2018-11-29,DL,LAX,JFK,1145,2007,False,269.0,2475.0,0
64842,2018-12-25,AA,LAX,LAS,1955,2107,False,54.0,236.0,0
52420,2018-10-15,WN,LAX,SFO,955,1115,False,56.0,337.0,0
52854,2018-10-17,UA,LGA,ORD,1700,1836,False,129.0,733.0,0
52258,2018-10-14,UA,ORD,LGA,1300,1615,False,95.0,733.0,0
53212,2018-10-19,UA,SFO,LAX,1300,1435,False,58.0,337.0,0


### Exercise 3

<span style="color:green; font-size:16px">Find the date of the 10th cancelled flight for each airline.</span>

In [45]:
tenth_cancelled_date

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
535,2018-01-04,UA,BOS,EWR,800,924,True,,200.0,0
627,2018-01-04,AA,EWR,PHX,1620,2009,True,,2133.0,0
641,2018-01-04,DL,DTW,PHL,1745,1941,True,,453.0,0
702,2018-01-05,B6,BOS,DFW,731,1105,True,,1562.0,0
794,2018-01-05,YX,JFK,BOS,1700,1822,True,,187.0,0
6988,2018-02-12,NK,IAH,EWR,630,1042,True,,1400.0,0
9812,2018-02-27,WN,PHX,LAS,2115,2120,True,,255.0,0
10128,2018-03-01,OO,SFO,LAX,1650,1825,True,,337.0,0
12161,2018-03-13,9E,JFK,BOS,905,1030,True,,187.0,0
13472,2018-03-20,VX,SFO,LAX,2140,2323,True,,337.0,0


### Exercise 4

<span style="color:green; font-size:16px">Find the average carrier delay for each origin and destination combination with more than 300 flights.</span>

In [46]:
avg_delay_busy_routes

origin,dest,avg_carrier_delay,flight_count
ATL,BOS,2.480263,304
ATL,LGA,1.547445,411
ATL,MCO,4.573727,373
ATL,ORD,2.45768,319
BOS,DCA,3.415144,383
BOS,LGA,3.341346,416
BOS,ORD,1.805732,314
DCA,BOS,1.724138,348
DCA,ORD,1.212389,339
DEN,LAX,6.689855,345


In [49]:
avg_delay_busy_routes.sort_values('avg_carrier_delay',ascending=False)

origin,dest,avg_carrier_delay,flight_count
JFK,SFO,8.891641,323
ORD,ATL,7.480645,310
DEN,LAX,6.689855,345
ORD,BOS,5.652866,314
ATL,MCO,4.573727,373
LAX,JFK,4.229927,548
DFW,ORD,4.081081,333
DEN,PHX,3.980645,310
SEA,SFO,3.883333,360
LAS,LAX,3.840237,507
