# Miscellaneous Grouping Functionality

Believe it or not, there is even more grouping functionality in pandas that remains to be covered. This chapter presents a few lesser-known grouping features.

## Grouping by columns not in the DataFrame

Thus far, we've only passed strings (or a list of strings) to the `groupby` method. Each of these strings refers to a specific column in the DataFrame. Let's review this simple concept by reading in the bikes dataset and finding the median trip duration by gender.

In [1]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv', na_values=-9999)
bikes.head(3)

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy


We use the syntax that returns the result as a Series.

In [2]:
bikes.groupby('gender')['tripduration'].median()

gender
Female    660.0
Male      547.0
Name: tripduration, dtype: float64

Instead of passing the string name of the column, you can select the column as a Series and pass that to the `groupby` method.

In [3]:
s = bikes['gender']
bikes.groupby(s)['tripduration'].median()

gender
Female    660.0
Male      547.0
Name: tripduration, dtype: float64

The same result is produced, and since the syntax is a bit more involved, it's best to just use the string name for simplicity. However, the example does show that it is possible to group by a Series that is not part of the DataFrame. Take a look at the following Series, which has nothing to do with the bikes DataFrame. It's just a random sample of strings with the same length as the DataFrame.

In [4]:
n = len(bikes)
s_fruits = pd.Series(['Apple', 'Banana', 'Cantaloupe', 'Durian', 'Elderberry'])
s_fruits = s_fruits.sample(n=n, replace=True, random_state=1, ignore_index=True)
s_fruits.head()

0        Durian
1    Elderberry
2         Apple
3        Banana
4        Durian
dtype: object

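As long as the Series has the same length and index as the DataFrame, it may be passed to the `groupby` method where its unique values form distinct groups. pandas matches the Series to the DataFrame rows by index label, which is why `ignore_index=True` was used above to give `s_fruits` the default index. As usual, these unique values are placed in the index.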

In [5]:
bikes.groupby(s_fruits)['tripduration'].agg(['size', 'mean'])

,size,mean
Apple,10025,735.975761
Banana,9976,702.31265
Cantaloupe,10041,716.635793
Durian,10065,714.443318
Elderberry,9982,714.901723
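
To see how an external Series is matched to the rows, here is a minimal sketch on made-up data. The names `df` and `labels` are hypothetical and have nothing to do with the bikes dataset. It assumes the documented behavior that a Series passed to `groupby` is aligned on its index labels rather than used positionally.

```python
import pandas as pd

# Toy data (an assumption for illustration, not the bikes dataset)
df = pd.DataFrame({'value': [1, 2, 3, 4]})  # default index 0, 1, 2, 3

# Same length as df, but the index labels are in reverse order
labels = pd.Series(['a', 'a', 'b', 'b'], index=[3, 2, 1, 0])

# Rows are matched by index label: rows 3 and 2 get 'a', rows 1 and 0 get 'b'
result = df.groupby(labels)['value'].sum()
print(result)
```

If the grouping Series were used by position instead, group `a` would contain the first two rows. Because of the alignment, it contains the last two.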


### Mixing other Series and strings

This other Series may be used together with the normal strings that refer to column names to group by multiple columns.

In [6]:
bikes.groupby([s_fruits, 'gender'])['tripduration'].agg(['size', 'mean'])

,gender,size,mean
Apple,Female,2522,838.521808
Apple,Male,7503,701.506731
Banana,Female,2470,816.1583
Banana,Male,7506,664.849454
Cantaloupe,Female,2471,797.606232
Cantaloupe,Male,7570,690.205416
Durian,Female,2521,802.136454
Durian,Male,7544,685.138653
Elderberry,Female,2451,794.106487
Elderberry,Male,7531,689.124153
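
The same mix of an external Series and a column name can be tried on a tiny made-up DataFrame. This is only a sketch, and the names `df` and `fruit` are hypothetical. Note that the unnamed Series produces an index level without a name, which is why the first level above has no header.

```python
import pandas as pd

# Toy data (an assumption for illustration)
df = pd.DataFrame({'gender': ['F', 'M', 'F', 'M'],
                   'duration': [10, 20, 30, 40]})

# An external, unnamed Series sharing df's default index
fruit = pd.Series(['Apple', 'Apple', 'Banana', 'Banana'])

# Group by the external Series first, then by the 'gender' column
result = df.groupby([fruit, 'gender'])['duration'].sum()
print(result)

# The level built from the unnamed Series has no name
print(list(result.index.names))
```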


One common use case is binning a numeric column. Here, we use `qcut` to bin temperature into six quantile-based bins, each holding roughly the same number of rows, creating a Series, and then count the values in each bin.

In [7]:
temp_bins = pd.qcut(bikes['temperature'], 6)
temp_bins.value_counts()

temperature
(-8.001, 45.0]    8706
(45.0, 57.9]      8526
(57.9, 66.9]      8505
(66.9, 73.0]      8361
(73.0, 79.0]      8274
(79.0, 96.1]      7716
Name: count, dtype: int64
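
The bin counts above are only roughly equal because `qcut` places its edges at quantiles of the data. As a small sketch on made-up numbers (the variable names are hypothetical), compare it with `cut`, which makes bins of equal width instead.

```python
import pandas as pd

# Toy values with one large outlier (an assumption for illustration)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# qcut: edges at quantiles, so each bin holds about the same number of values
quantile_bins = pd.qcut(values, 2)

# cut: edges evenly spaced between the minimum and maximum
width_bins = pd.cut(values, 2)

# sort=False keeps the bins in their natural order
quantile_counts = quantile_bins.value_counts(sort=False)
width_counts = width_bins.value_counts(sort=False)
print(quantile_counts)
print(width_counts)
```

With the outlier present, the equal-width bins are very unbalanced, while the quantile-based bins split the values evenly.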

This new Series may be used by itself or in combination with other column names to group. Take note that the result is assigned to the variable name `s_gt`, which will be used in an upcoming section.

In [8]:
s_gt = bikes.groupby(['gender', temp_bins])['tripduration'].median()
s_gt

gender  temperature   
Female  (-8.001, 45.0]    544.0
        (45.0, 57.9]      617.5
        (57.9, 66.9]      676.0
        (66.9, 73.0]      690.0
        (73.0, 79.0]      711.0
        (79.0, 96.1]      699.0
Male    (-8.001, 45.0]    474.0
        (45.0, 57.9]      514.0
        (57.9, 66.9]      557.0
        (66.9, 73.0]      582.0
        (73.0, 79.0]      594.0
        (79.0, 96.1]      602.0
Name: tripduration, dtype: float64

Similarly, the `pivot_table` method accepts other Series as well. Here, we reproduce the results from above, but pivot the temperature bins so that they become the new column values.

In [9]:
bikes.pivot_table(index='gender', columns=temp_bins, 
                  values='tripduration', aggfunc='median')

temperature,"(-8.001, 45.0]","(45.0, 57.9]","(57.9, 66.9]","(66.9, 73.0]","(73.0, 79.0]","(79.0, 96.1]"
gender,,,,,,
Female,544.0,617.5,676.0,690.0,711.0,699.0
Male,474.0,514.0,557.0,582.0,594.0,602.0
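
The relationship between the two approaches can be checked on toy data. The sketch below uses made-up values and hypothetical names (`df`, `size_label`). It suggests that a pivot table with an external Series in `columns` holds the same numbers as grouping by both keys and then unstacking the inner level.

```python
import pandas as pd

# Toy data (an assumption for illustration)
df = pd.DataFrame({'group': ['x', 'x', 'y', 'y', 'x', 'y'],
                   'value': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]})

# External Series that is not a column of df
size_label = pd.Series(['small', 'large', 'small', 'large', 'small', 'large'],
                       name='size_label')

# Pivot table with the external Series supplying the column labels
wide = df.pivot_table(index='group', columns=size_label,
                      values='value', aggfunc='median')

# Equivalent groupby followed by unstacking the inner index level
stacked = df.groupby(['group', size_label])['value'].median().unstack()

print(wide)
print(stacked)
```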


## Grouping Series and aggregating other columns

The object calling the `groupby` method has been a DataFrame in all of our previous examples. The Series also has a `groupby` method and, as we saw above, it's not necessary for the grouping column to be part of the calling object. Here, we select the trip duration column as a Series, and group using the temperature bins created above. The aggregations are automatically applied to the Series values.

In [10]:
td = bikes['tripduration']
td.groupby(temp_bins).agg(['size', 'mean', 'median', 'min', 'max'])

temperature,size,mean,median,min,max
"(-8.001, 45.0]",8706,624.515621,487.0,60,86188
"(45.0, 57.9]",8526,673.67042,534.0,62,84353
"(57.9, 66.9]",8505,719.023986,583.0,62,73591
"(66.9, 73.0]",8361,736.286808,613.0,61,51684
"(73.0, 79.0]",8274,781.498429,627.0,60,63155
"(79.0, 96.1]",7716,776.139321,625.0,60,85442


## Grouping by index levels

You might be wondering how to use the Series `groupby` method without passing it another Series to act as the grouping column. Series, like DataFrames, can have multiple index levels that act like columns. The `s_gt` Series created above has two index levels. Each of their names may be retrieved with the `names` Index attribute.

In [11]:
s_gt.index.names

FrozenList(['gender', 'temperature'])

These index levels may be used just as if they were DataFrame columns with their names passed to the `groupby` method as strings. The values of the Series are aggregated.

In [None]:
s_gt.groupby('gender').max()

It's also possible to use the integer location of the index level (numbering begins from 0 with the left-most level). Here, we group by the second level, the temperature bins.

In [None]:
s_gt.groupby(level=1).max()

Note that DataFrames may also be grouped by their index levels in exactly the same manner.
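
Here is a minimal, self-contained sketch of grouping by index levels. It uses a small made-up Series shaped like `s_gt`. The level name `temp` and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy Series with a two-level index, similar in shape to s_gt
index = pd.MultiIndex.from_product([['Female', 'Male'], ['cold', 'warm']],
                                   names=['gender', 'temp'])
s = pd.Series([544.0, 699.0, 474.0, 602.0], index=index)

# Group by the name of an index level
by_name = s.groupby('gender').max()

# Group by the integer position of an index level (1 is the second level)
by_position = s.groupby(level=1).max()

# A DataFrame with the same index is grouped the same way
df = s.to_frame('median_duration')
df_max = df.groupby('gender').max()

print(by_name)
print(by_position)
print(df_max)
```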

## Changing the direction of grouping

As we've seen, many DataFrame methods have an `axis` parameter available to change the default direction of the operation. For most methods, we set `axis=1` to change the operation from vertical to horizontal. The `groupby` method is no different in this regard. Let's read in the `sweden_age` dataset containing the population of Sweden by age for each year from 1980 to 2020. The year is placed in the index and the remaining columns represent each age from 0 to 100, where 100 represents all those aged 100 and above.

In [12]:
sweden_age = pd.read_csv('../data/covid/sweden_age.csv', index_col='year')
sweden_age.tail()

year,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
2016,119023,118568,120165,119153,120132,119333,123976,120669,119168,117707,...,18259,14989,11864,8960,7369,5544,3170,2065,1426,1981
2017,116614,121975,120381,122058,121016,122021,121070,125729,122383,120977,...,18054,15005,12021,9247,6749,5400,3968,2132,1331,2084
2018,116839,118762,123525,121822,123550,122475,123497,122487,127148,123866,...,18013,14932,12034,9480,7006,4983,3873,2724,1403,2067
2019,115383,118776,120030,124681,122848,124569,123335,124541,123471,128168,...,18531,15006,12119,9538,7317,5241,3655,2709,1854,2207
2020,113589,116591,119425,120470,125001,123143,124767,123691,124917,123837,...,17922,15086,11928,9315,7135,5361,3710,2515,1770,2449


Let's say we are interested in finding the population of particular age bins per year. We use the `cut` function to bin the age columns, which are read in as strings and must be converted to integers first.

In [13]:
age_bins = pd.cut(sweden_age.columns.astype('int64'), 
                  bins=[0, 5, 15, 25, 35, 50, 65, 80, 101], 
                  right=False)
age_bins.categories

IntervalIndex([[0, 5), [5, 15), [15, 25), [25, 35), [35, 50), [50, 65),
               [65, 80), [80, 101)],
              dtype='interval[int64, left]')

We created eight unique bins, each spanning a different range of ages. The variable `age_bins` contains a total of 101 values, one for each column.

In [14]:
len(age_bins)

101
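
To make the effect of `right=False` concrete, the following sketch bins a handful of boundary ages. The ages and the shorter list of bin edges are made up for illustration. Each interval includes its left edge and excludes its right edge, so age 5 falls in `[5, 15)` rather than `[0, 5)`.

```python
import pandas as pd

# A few ages sitting on or next to the bin edges (toy values)
ages = pd.Index([0, 4, 5, 14, 15, 100])

# Left-closed bins: [0, 5), [5, 15), [15, 101)
bins = pd.cut(ages, bins=[0, 5, 15, 101], right=False)

# codes holds the integer position of the bin assigned to each age
for age, code in zip(ages, bins.codes):
    print(age, '->', bins.categories[code])
```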

We can now use these bins to group the columns together by setting `axis=1`. The first five columns form a group, with the next 10 columns forming their own independent group, and so on. We now have the population by year within specific age groups.

In [15]:
sweden_age.groupby(age_bins, axis=1).sum().tail()

year,"[0, 5)","[5, 15)","[15, 25)","[25, 35)","[35, 50)","[50, 65)","[65, 80)","[80, 101)"
2016,597041,1163953,1169791,1352863,1924124,1810524,1469963,506894
2017,602044,1192633,1161757,1396496,1926545,1834621,1493476,512670
2018,604498,1215231,1157106,1431463,1934294,1851882,1513578,522133
2019,601718,1233103,1156608,1455587,1948953,1866253,1529061,536306
2020,595076,1242722,1156040,1457708,1960192,1879471,1544366,543720
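
Be aware that the `axis` parameter of `groupby` is deprecated as of pandas 2.1, so the call above may emit a warning or stop working in newer versions. One possible alternative is to transpose, group the rows, and transpose back. The sketch below shows this on a small made-up DataFrame. The name `wide` and its values are assumptions, and `observed=False` is passed explicitly to keep every bin.

```python
import pandas as pd

# Toy wide table: one row per year, one string-named column per age
wide = pd.DataFrame({'0': [10, 11], '1': [20, 21],
                     '2': [30, 31], '3': [40, 41]},
                    index=pd.Index([2019, 2020], name='year'))

# Bin the column labels after converting them to integers
age_bins = pd.cut(wide.columns.astype('int64'), bins=[0, 2, 4], right=False)

# Transpose so ages become rows, group the rows, then transpose back
totals = wide.T.groupby(age_bins, observed=False).sum().T
print(totals)
```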


## Exercises

Read in the flights dataset and use it for the following exercises.

In [None]:
flights = pd.read_csv('../data/flights.csv')
flights.head(3)

### Exercise 1

<span style="color:green; font-size:16px">Create a Series of booleans determining if there is a carrier delay of 15 minutes or more. The values should be `False` if under 15 minutes and `True` if 15 minutes or over. Find the average distance flown by each group.</span>

### Exercise 2

<span style="color:green; font-size:16px">Create a Series of booleans determining if there is a weather delay of 15 minutes or more. Compute a cross tabulation of this Series with the similar one created above on carrier delay.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the total carrier delay by airline and origin as a Series with a multi-level index.</span>

### Exercise 4

<span style="color:green; font-size:16px">Using the Series from Exercise 3, calculate the total carrier delay by airline. Verify the result by calculating it directly from the original DataFrame.</span>

### Exercise 5

<span style="color:green; font-size:16px">Read in the Sweden deaths dataset found in the covid folder. Place the year column in the index and then calculate the total number of deaths by 10-year age interval per year. Then take this DataFrame and calculate the average deaths per age group, grouped by 5-year time spans.</span>