# Grouping and Aggregating with Multiple Columns

In this chapter, we'll form groups using more than one column, aggregate more than one column, and learn how to apply more than one aggregation function to each group. Let's begin by reading in the San Francisco employee compensation dataset.

In [1]:
import pandas as pd
sf_emp = pd.read_csv('../data/sf_employee_compensation.csv')
sf_emp.head(3)

Unnamed: 0,year,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,Personnel Technician,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,2013,General Administration & Finance,Planner 2,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,2013,Public Protection,Firefighter,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2


## Review grouping and aggregating with a single column

In the previous chapter, we had a single grouping column, aggregating column, and aggregating function. The following syntax was used as a guide:

```python
df.groupby('grouping column').agg(new_column=('aggregating column', 'aggregating function'))
```

Let's see this again by calculating the average salary for each organization group.

In [3]:
(sf_emp.groupby('organization group')
       .agg(avg_salary=('salaries', 'mean'))
       .round(-3))

Unnamed: 0_level_0,avg_salary
organization group,Unnamed: 1_level_1
Community Health,59000.0
Culture & Recreation,29000.0
General Administration & Finance,59000.0
General City Responsibilities,34000.0
Human Welfare & Neighborhood Development,44000.0
Public Protection,77000.0
"Public Works, Transportation & Commerce",58000.0


## Grouping with multiple columns

To create groups based on distinct values from multiple columns, we need to pass a list of these columns to the `groupby` method. Let's find the average salary for every unique combination of year and organization group.

In [4]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(avg_salary=('salaries', 'mean'))
       .round(-3)
       .head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,avg_salary
year,organization group,Unnamed: 2_level_1
2013,Community Health,59000.0
2013,Culture & Recreation,31000.0
2013,General Administration & Finance,65000.0
2013,General City Responsibilities,13000.0
2013,Human Welfare & Neighborhood Development,50000.0
2013,Public Protection,87000.0
2013,"Public Works, Transportation & Commerce",64000.0
2014,Community Health,61000.0
2014,Culture & Recreation,28000.0
2014,General Administration & Finance,57000.0


### What happened to our index?

Both year and organization group are no longer columns and have been pushed into the index. This is called a **multi-level index**. The year and organization group are considered **levels** of the index and are NOT columns. You'll notice that duplicated values in the outer level are not visible in an index when they immediately follow one another such as with the year level above.
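
To see the levels directly, here is a minimal sketch on a small made-up DataFrame. The name `toy` and all of its values are invented for illustration and are not part of the San Francisco dataset. It groups by two columns and then inspects the index of the result.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014],
    'group': ['A', 'A', 'B', 'A'],
    'salaries': [10.0, 20.0, 30.0, 40.0],
})

result = toy.groupby(['year', 'group']).agg(avg_salary=('salaries', 'mean'))

# The grouping columns are now index levels, not columns
print(result.index.nlevels)       # number of levels in the index
print(list(result.index.names))   # names of the levels
print(list(result.columns))       # only the aggregated column remains
```

The index should report two levels named after the grouping columns, while `columns` holds only `avg_salary`.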

### The MultiIndex is confusing and not necessary for beginners

In my opinion, the multi-level index does not add much value to pandas and can interfere with learning. I advise those new to pandas to avoid using it until they have mastered the basics. Personally, I rarely use it myself and prefer the levels of the index to be DataFrame columns.

By default, all grouping columns are moved into the index. From this point on, we will chain the `reset_index` method to return these levels to columns. Alternatively, set the `as_index` parameter to `False` in the `groupby` method to get the same result.

In [None]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(avg_salary=('salaries', 'mean'))
       .round(-3)
       .reset_index()
       .head(10))

Unnamed: 0,year,organization group,avg_salary
0,2013,Community Health,59000.0
1,2013,Culture & Recreation,31000.0
2,2013,General Administration & Finance,65000.0
3,2013,General City Responsibilities,13000.0
4,2013,Human Welfare & Neighborhood Development,50000.0
5,2013,Public Protection,87000.0
6,2013,"Public Works, Transportation & Commerce",64000.0
7,2014,Community Health,61000.0
8,2014,Culture & Recreation,28000.0
9,2014,General Administration & Finance,57000.0
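
The `as_index=False` alternative can be checked on a small example. The sketch below uses an invented DataFrame (`toy` is hypothetical and not from the dataset) and builds the same summary both ways so the two results can be compared.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'year': [2013, 2013, 2014, 2014],
    'group': ['A', 'B', 'A', 'A'],
    'salaries': [10.0, 30.0, 40.0, 60.0],
})

# Approach 1: group normally, then move the index levels back to columns
with_reset = (toy.groupby(['year', 'group'])
                 .agg(avg_salary=('salaries', 'mean'))
                 .reset_index())

# Approach 2: ask groupby not to build the index in the first place
with_as_index = (toy.groupby(['year', 'group'], as_index=False)
                    .agg(avg_salary=('salaries', 'mean')))

print(with_reset)
print(with_as_index)
```

Both DataFrames should have the columns `year`, `group`, and `avg_salary` with identical values.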


### Isn't the result easier to read with a MultiIndex?

The MultiIndex can make the results easier to read, but it makes further data analysis more difficult as you need to become familiar with special syntax just for the MultiIndex. In my opinion, this added complexity for beginners is not worth the benefit.

## Aggregating multiple columns

To aggregate multiple columns, pass an additional keyword argument to the `agg` method, again set to a two-item tuple containing the aggregating column and aggregating function. Here, we find the average salary and overtime for each organization group.

In [6]:
(sf_emp.groupby('organization group')
       .agg(avg_salary=('salaries', 'mean'), avg_overtime=('overtime', 'mean'))
       .round(-3)
       .reset_index())

Unnamed: 0,organization group,avg_salary,avg_overtime
0,Community Health,59000.0,2000.0
1,Culture & Recreation,29000.0,1000.0
2,General Administration & Finance,59000.0,1000.0
3,General City Responsibilities,34000.0,3000.0
4,Human Welfare & Neighborhood Development,44000.0,1000.0
5,Public Protection,77000.0,12000.0
6,"Public Works, Transportation & Commerce",58000.0,5000.0
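
One practical limit of this syntax is that each new column name is a Python keyword argument, so it must be a valid identifier. If you want a name containing a space, one option is to unpack a dictionary with `**`. The sketch below uses a small invented DataFrame; `toy`, `aggs`, and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'group': ['A', 'A', 'B'],
    'salaries': [10.0, 20.0, 30.0],
    'overtime': [1.0, 3.0, 5.0],
})

# Keys are the new column names; values are (column, function) tuples
aggs = {
    'avg salary': ('salaries', 'mean'),
    'avg overtime': ('overtime', 'mean'),
}

# ** unpacks the dictionary into keyword arguments for agg
result = toy.groupby('group').agg(**aggs).reset_index()
print(result)
```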


## Multiple grouping columns, aggregating columns, and aggregating functions

We can combine the last two approaches to simultaneously have multiple grouping and aggregating columns along with multiple aggregating functions. The following finds the mean, min, and max salaries along with the average overtime for every unique combination of year and organization group. 

In [7]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(avg_salary=('salaries', 'mean'),
            min_salary=('salaries', 'min'),
            max_salary=('salaries', 'max'),
            avg_overtime=('overtime', 'mean'))
       .round(-3)
       .reset_index()
       .head(10))

Unnamed: 0,year,organization group,avg_salary,min_salary,max_salary,avg_overtime
0,2013,Community Health,59000.0,0.0,231000.0,2000.0
1,2013,Culture & Recreation,31000.0,0.0,144000.0,1000.0
2,2013,General Administration & Finance,65000.0,0.0,285000.0,1000.0
3,2013,General City Responsibilities,13000.0,0.0,37000.0,2000.0
4,2013,Human Welfare & Neighborhood Development,50000.0,0.0,180000.0,0.0
5,2013,Public Protection,87000.0,-3000.0,238000.0,11000.0
6,2013,"Public Works, Transportation & Commerce",64000.0,-1000.0,217000.0,5000.0
7,2014,Community Health,61000.0,0.0,228000.0,1000.0
8,2014,Culture & Recreation,28000.0,0.0,169000.0,1000.0
9,2014,General Administration & Finance,57000.0,0.0,233000.0,1000.0


## Getting the size of each group

Let's say we are interested in the number of rows in each group. We can use the `size` aggregating function like this.

In [8]:
sf_emp.groupby('organization group').agg(size_salaries=('salaries', 'size'))

Unnamed: 0_level_0,size_salaries
organization group,Unnamed: 1_level_1
Community Health,9044
Culture & Recreation,3697
General Administration & Finance,3707
General City Responsibilities,9176
Human Welfare & Neighborhood Development,3758
Public Protection,7867
"Public Works, Transportation & Commerce",12751


The `size` aggregating function is independent of the aggregating column, so regardless of which one you use, the same value is returned. Here we use three different aggregating columns to prove that the size of the group is the same.

In [9]:
(sf_emp.groupby('organization group')
       .agg(size_salary=('salaries', 'size'),
            size_overtime=('overtime', 'size'),
            size_retirement=('retirement', 'size'))
       .reset_index()
       .head(10))

Unnamed: 0,organization group,size_salary,size_overtime,size_retirement
0,Community Health,9044,9044,9044
1,Culture & Recreation,3697,3697,3697
2,General Administration & Finance,3707,3707,3707
3,General City Responsibilities,9176,9176,9176
4,Human Welfare & Neighborhood Development,3758,3758,3758
5,Public Protection,7867,7867,7867
6,"Public Works, Transportation & Commerce",12751,12751,12751
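
The reason is that `size` counts rows, including rows where the aggregating column is missing. This differs from the `count` aggregating function, which skips missing values. A minimal sketch on invented data (the `toy` DataFrame is hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing salaries, invented for illustration
toy = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'salaries': [10.0, np.nan, 30.0, 40.0, np.nan],
    'overtime': [1.0, 2.0, 3.0, 4.0, 5.0],
})

result = (toy.groupby('group')
             .agg(size_salary=('salaries', 'size'),    # rows per group
                  size_overtime=('overtime', 'size'),  # same rows per group
                  count_salary=('salaries', 'count'))  # non-missing values only
             .reset_index())
print(result)
```

The two `size` columns agree for every group, while `count_salary` is smaller wherever a salary is missing.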


### Just use `value_counts`

There isn't a need to call the `groupby` method with the `size` aggregating function when grouping by a single column. This is exactly what the Series method `value_counts` was designed for. It has the added benefit of sorting the counts from largest to smallest.

In [10]:
sf_emp['organization group'].value_counts()

organization group
Public Works, Transportation & Commerce     12751
General City Responsibilities                9176
Community Health                             9044
Public Protection                            7867
Human Welfare & Neighborhood Development     3758
General Administration & Finance             3707
Culture & Recreation                         3697
Name: count, dtype: int64
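
To confirm that the two approaches agree, the sketch below runs both on a small invented DataFrame (`toy` and its values are hypothetical). The `groupby` version needs an explicit `sort_values` call to match the ordering that `value_counts` provides by default.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'group': ['A', 'B', 'B', 'C', 'C', 'C'],
    'salaries': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One call: counts per group, sorted from largest to smallest
counts = toy['group'].value_counts()

# Equivalent groupby version, sorted explicitly
by_groupby = (toy.groupby('group')
                 .agg(n=('salaries', 'size'))
                 .sort_values('n', ascending=False))

print(counts.to_dict())
print(by_groupby['n'].to_dict())
```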

### Multiple group size

It's possible to find the size of groups consisting of more than one column with the `groupby` method by passing it a list. The choice for aggregating columns again does not matter as the size is the same regardless.

In [11]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(size_salary=('salaries', 'size'))
       .head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,size_salary
year,organization group,Unnamed: 2_level_1
2013,Community Health,1092
2013,Culture & Recreation,408
2013,General Administration & Finance,437
2013,General City Responsibilities,22
2013,Human Welfare & Neighborhood Development,385
2013,Public Protection,940
2013,"Public Works, Transportation & Commerce",1402
2014,Community Health,1110
2014,Culture & Recreation,451
2014,General Administration & Finance,433


### DataFrame `value_counts` method

Again, the `value_counts` method produces the same counts, but returns them as a Series sorted from largest to smallest.

In [12]:
sf_emp.value_counts(['year', 'organization group']).head()

year  organization group                     
2018  General City Responsibilities              3517
2017  General City Responsibilities              3100
      Public Works, Transportation & Commerce    2881
2019  General City Responsibilities              2526
2017  Community Health                           1960
Name: count, dtype: int64

### Rename the column when using `reset_index`

When calling `reset_index` on a Series, pandas uses the `name` attribute of the Series as the new column name. In the example above, `value_counts` named its result `count`, so that becomes the column name. If the Series has no name, pandas uses the integer 0 instead.

In [13]:
(sf_emp.value_counts(['year', 'organization group'])
       .reset_index()
       .head(10))

Unnamed: 0,year,organization group,count
0,2018,General City Responsibilities,3517
1,2017,General City Responsibilities,3100
2,2017,"Public Works, Transportation & Commerce",2881
3,2019,General City Responsibilities,2526
4,2017,Community Health,1960
5,2018,"Public Works, Transportation & Commerce",1905
6,2017,Public Protection,1780
7,2019,"Public Works, Transportation & Commerce",1771
8,2015,"Public Works, Transportation & Commerce",1665
9,2016,"Public Works, Transportation & Commerce",1616


Set the `name` parameter within `reset_index` to set the new column name in the resulting DataFrame.

In [14]:
(sf_emp.value_counts(['year', 'organization group'])
       .reset_index(name='size')
       .head(10))

Unnamed: 0,year,organization group,size
0,2018,General City Responsibilities,3517
1,2017,General City Responsibilities,3100
2,2017,"Public Works, Transportation & Commerce",2881
3,2019,General City Responsibilities,2526
4,2017,Community Health,1960
5,2018,"Public Works, Transportation & Commerce",1905
6,2017,Public Protection,1780
7,2019,"Public Works, Transportation & Commerce",1771
8,2015,"Public Works, Transportation & Commerce",1665
9,2016,"Public Works, Transportation & Commerce",1616
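
To see the naming behavior in isolation, the sketch below builds a small unnamed Series by hand and resets its index with and without the `name` parameter. The variable names and values are invented for illustration.

```python
import pandas as pd

# A Series with a two-level index and no name (invented values)
idx = pd.MultiIndex.from_tuples(
    [(2013, 'A'), (2013, 'B'), (2014, 'A')],
    names=['year', 'group'],
)
unnamed = pd.Series([5, 3, 7], index=idx)

default_name = unnamed.reset_index()             # values column is named 0
custom_name = unnamed.reset_index(name='size')   # values column is named 'size'

print(list(default_name.columns))
print(list(custom_name.columns))
```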


## Exercises

Execute the following cell to read in the City of Houston employee data and use it for the first few exercises.

In [15]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


### Exercise 1

<span  style="color:green; font-size:16px">For each department and sex, find the number of unique position titles, the total number of employees, and the average salary. Make sure there is no multi-level index.</span>

In [21]:
(emp.groupby(['dept','sex'], as_index=False)
    .agg(unique_positions=('title','nunique'),
         employee_count=('hire_date','size'),
         avg_salary=('salary','mean')
         )
    .round({'avg_salary':-3})
)

Unnamed: 0,dept,sex,unique_positions,employee_count,avg_salary
0,Fire,Female,51,240,62000.0
1,Fire,Male,54,4136,60000.0
2,Health & Human Services,Female,136,987,54000.0
3,Health & Human Services,Male,110,366,59000.0
4,Houston Airport System,Female,85,443,51000.0
5,Houston Airport System,Male,113,773,57000.0
6,Houston Public Works,Female,151,1195,51000.0
7,Houston Public Works,Male,180,2995,51000.0
8,Library,Female,55,404,41000.0
9,Library,Male,44,159,44000.0


### Exercise 2

<span  style="color:green; font-size:16px">For each department, race, and sex, find the min and max salaries.</span>

In [22]:
(emp.groupby(['dept','race','sex'])
    .agg(min_sal = ('salary','min'),
         max_sal = ('salary','max')
         )
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min_sal,max_sal
dept,race,sex,Unnamed: 3_level_1,Unnamed: 4_level_1
Fire,Asian,Female,39104.0,342784.00
Fire,Asian,Male,28024.0,342784.00
Fire,Black,Female,16411.0,342784.00
Fire,Black,Male,28024.0,342784.00
Fire,Hispanic,Female,28024.0,89590.02
...,...,...,...,...
Solid Waste Management,Hispanic,Female,32053.0,100119.00
Solid Waste Management,Hispanic,Male,27851.0,60840.00
Solid Waste Management,Native American,Female,31325.0,31325.00
Solid Waste Management,White,Female,36962.0,103275.00


Execute the following cell to read in the college dataset and use it for the remaining exercises.

In [23]:
pd.set_option('display.max_columns', 100)
college = pd.read_csv('../data/college.csv')
college.head(3)

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


### Exercise 3

<span  style="color:green; font-size:16px">Which city name appears most frequently? Do this in two different ways: once with and once without the `groupby` method.</span>

In [24]:
college.value_counts('city')

city
New York          87
Chicago           78
Houston           72
Los Angeles       56
Miami             51
                  ..
Woodridge          1
Woodland Park      1
Woodland Hills     1
Holbrook           1
Allen Park         1
Name: count, Length: 2514, dtype: int64

In [26]:
college.groupby('city').agg(city_count=('instnm','size')).sort_values('city_count',ascending=False)

Unnamed: 0_level_0,city_count
city,Unnamed: 1_level_1
New York,87
Chicago,78
Houston,72
Los Angeles,56
Miami,51
...,...
Woodridge,1
Woodland Park,1
Woodland Hills,1
Holbrook,1


### Exercise 4

<span style="color:green; font-size:16px">Does the city 'Houston' only appear in the state of Texas (abbreviated 'TX')?</span>

In [33]:
college.query("city == 'Houston'")['stabbr'].value_counts()

stabbr
TX    71
MO     1
Name: count, dtype: int64
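
Another way to approach this kind of question is to count the distinct states for each city name with the `nunique` aggregating function. The sketch below uses a small invented DataFrame in place of `college`; `toy` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'city': ['Houston', 'Houston', 'Houston', 'Austin'],
    'stabbr': ['TX', 'TX', 'MO', 'TX'],
})

# Number of distinct states in which each city name appears
n_states = (toy.groupby('city')
               .agg(n_states=('stabbr', 'nunique'))
               .reset_index())
print(n_states)
```

A city name with `n_states` greater than 1 appears in more than one state.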

### Exercise 5

<span style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

In [34]:
college.groupby('stabbr').agg(max_pop=('ugds','max'))

Unnamed: 0_level_0,max_pop
stabbr,Unnamed: 1_level_1
AK,12865.0
AL,29851.0
AR,21405.0
AS,1276.0
AZ,151558.0
CA,44744.0
CO,25873.0
CT,18016.0
DC,10433.0
DE,18222.0


### Exercise 6

<span style="color:green; font-size:16px">Find the largest college from each state. From those colleges, find the difference between the largest and smallest.</span>

In [84]:
(college.groupby('stabbr')
        .agg(max_pop =('ugds','max')
             )
        .agg(['max','min'])
        .diff(-1)
)

Unnamed: 0,max_pop
max,150956.0
min,
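
The same difference can also be computed by subtracting the minimum from the maximum directly, which avoids the leftover missing value in the second row. This is a sketch on invented data, where the hypothetical `toy` DataFrame stands in for `college`:

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'stabbr': ['TX', 'TX', 'CA', 'CA', 'AK'],
    'ugds': [500.0, 900.0, 1200.0, 300.0, 100.0],
})

# Largest undergraduate population in each state, as a Series
max_pop = toy.groupby('stabbr').agg(max_pop=('ugds', 'max'))['max_pop']

# Difference between the largest and smallest of those state maximums
difference = max_pop.max() - max_pop.min()
print(difference)
```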


### Exercise 7

<span style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

In [85]:
c2 = college.set_index('instnm')

c2.groupby('stabbr').agg(max_college=('ugds','idxmax'),max_pop = ('ugds','max') )

Unnamed: 0_level_0,max_college,max_pop
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,University of Alaska Anchorage,12865.0
AL,The University of Alabama,29851.0
AR,University of Arkansas,21405.0
AS,American Samoa Community College,1276.0
AZ,University of Phoenix-Arizona,151558.0
CA,Ashford University,44744.0
CO,University of Colorado Boulder,25873.0
CT,University of Connecticut,18016.0
DC,George Washington University,10433.0
DE,University of Delaware,18222.0
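
An alternative that avoids changing the index is to use `idxmax` to get the row label of the largest college in each state and then select those rows with `loc`. The sketch below uses invented data; `toy` and the school names are hypothetical stand-ins for `college`.

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'instnm': ['School A', 'School B', 'School C', 'School D'],
    'stabbr': ['TX', 'TX', 'CA', 'CA'],
    'ugds': [500.0, 900.0, 1200.0, 300.0],
})

# idxmax returns the index label of the largest value in each group
largest_rows = toy.groupby('stabbr').agg(row=('ugds', 'idxmax'))['row']

# Use those labels to pull the full rows from the original DataFrame
largest = (toy.loc[largest_rows, ['stabbr', 'instnm', 'ugds']]
              .reset_index(drop=True))
print(largest)
```

Because the full rows are selected, any other column of interest can be included in the result as well.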


### Exercise 8

<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

In [49]:
college['distanceonly'].value_counts()

distanceonly
0.0    7124
1.0      40
Name: count, dtype: int64

In [51]:
college.groupby('distanceonly').agg(mean_pop=('ugds','mean'))

Unnamed: 0_level_0,mean_pop
distanceonly,Unnamed: 1_level_1
0.0,2334.648135
1.0,6245.74359


### Exercise 9

<span style="color:green; font-size:16px">Do distance only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

In [53]:
college['relaffil'].value_counts()

relaffil
0    6096
1    1439
Name: count, dtype: int64

In [62]:
(college.groupby('distanceonly')
        .agg(sum_rel_aff=('relaffil','sum'),
             size_rel_aff=('relaffil','size'),
             mean_rel_aff=('relaffil','mean'),
             mean_rel_aff2=('relaffil',lambda x: (x.mean() * 100).round(2).astype('str') + '%' )
             )
        .round({'mean_rel_aff':2})
 )

Unnamed: 0_level_0,sum_rel_aff,size_rel_aff,mean_rel_aff,mean_rel_aff2
distanceonly,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,1066,7124,0.15,14.96%
1.0,2,40,0.05,5.0%


### Exercise 10

<span  style="color:green; font-size:16px">Among schools that have a religious affiliation, which state has the lowest percentage that are currently operating?</span>

In [78]:
c_relafill = college.query('relaffil == 1')

c_relafill.groupby('stabbr').agg(mean_rel_afil =('curroper','mean')).nsmallest(5,'mean_rel_afil')

Unnamed: 0_level_0,mean_rel_afil
stabbr,Unnamed: 1_level_1
UT,0.4
AZ,0.444444
NV,0.5
CA,0.585366
CT,0.647059


### Exercise 11

<span  style="color:green; font-size:16px">Find the top 5 historically black colleges that have the highest undergraduate white percentage (ugds_white).</span>

In [82]:
college.query('hbcu == 1').nlargest(5,'ugds_white')



Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
4021,Bluefield State College,Bluefield,WV,1.0,0.0,0.0,0,445.0,460.0,0.0,1529.0,0.8437,0.102,0.0111,0.0013,0.0026,0.0,0.0111,0.0203,0.0078,0.1844,1,0.5633,0.5987,0.399,28300,19500
17,Gadsden State Community College,Gadsden,AL,1.0,0.0,0.0,0,,,0.0,4917.0,0.6921,0.2076,0.0305,0.0047,0.0128,0.0018,0.0185,0.015,0.0169,0.4523,1,0.5734,0.0,0.3733,25700,PrivacySuppressed
4050,West Virginia State University,Institute,WV,1.0,0.0,0.0,0,430.0,450.0,0.0,2237.0,0.5816,0.1198,0.0058,0.0045,0.0063,0.0,0.0,0.0063,0.2758,0.1404,1,0.455,0.5362,0.3139,29300,23250
48,Shelton State Community College,Tuscaloosa,AL,1.0,0.0,0.0,0,,,0.0,4755.0,0.5613,0.3466,0.0036,0.0084,0.004,0.0002,0.0145,0.0015,0.0599,0.5184,1,0.4852,0.0,0.2297,24700,PrivacySuppressed
55,H Councill Trenholm State Community College,Montgomery,AL,1.0,0.0,0.0,0,,,0.0,1230.0,0.3951,0.5756,0.0106,0.013,0.0,0.0,0.0024,0.0008,0.0024,0.448,1,0.5337,0.0,0.456,23100,PrivacySuppressed
