# DataFrame Sorting, Ranking, and Uniqueness

In this chapter, we cover methods that sort, rank, and find unique values across an entire DataFrame. Every method presented here also appeared in the corresponding chapter for Series. The one exception is `unique`, which is, ironically, unique to the Series.

## Sorting

pandas allows you to sort the rows of a DataFrame either by the values or by the index with the `sort_values` and `sort_index` methods.

### The `sort_values` method

The `sort_values` DataFrame method sorts the DataFrame by the values of one or more columns. Pass the `by` parameter a column name or a list of column names to sort by. By default, rows are sorted in ascending order (from least to greatest). The college dataset is used for the remainder of the sorting examples in this chapter.

In [1]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head(3)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


Let's sort the rows by city.

In [2]:
college.sort_values(by='city').head(3)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
Northern State University,Aberdeen,SD,0.0,0.0,0.0,0,480.0,475.0,0.0,1693.0,...,0.0219,0.0425,0.0024,0.1872,1,0.2272,0.4303,0.1766,33600,24847
Grays Harbor College,Aberdeen,WA,0.0,0.0,0.0,0,,,0.0,1121.0,...,0.0937,0.0009,0.025,0.182,1,0.453,0.1502,0.5087,27000,11490
Presentation College,Aberdeen,SD,0.0,0.0,0.0,1,440.0,480.0,0.0,705.0,...,0.0284,0.0142,0.0823,0.2865,1,0.4829,0.756,0.3097,35900,25000


### Sort from greatest to least

Set the `ascending` parameter to `False` to sort in the opposite direction. Since `by` is the first parameter, its name is usually omitted and the column name is passed positionally.

In [3]:
college.sort_values('city', ascending=False).head(3)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
Ohio University-Zanesville Campus,Zanesville,OH,0.0,0.0,0.0,0,,,0.0,1944.0,...,0.037,0.0036,0.0206,0.5561,1,0.3947,0.5245,0.356,39600,21250.0
Mid-EastCTC-Adult Education,Zanesville,OH,0.0,0.0,0.0,0,,,0.0,305.0,...,0.0262,0.0,0.0066,0.3902,1,0.3712,0.4991,0.4961,29800,6943.0
Zane State College,Zanesville,OH,0.0,0.0,0.0,0,,,0.0,2063.0,...,0.0218,0.0,0.2399,0.573,1,0.3645,0.3434,0.3185,23800,13960.5


In [4]:
college.sort_values('ugds', ascending=False).dropna(subset='ugds')

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
University of Phoenix-Arizona,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,151558.0,...,0.1131,0.0131,0.3152,0.0000,1,0.6009,0.5920,,,33000
Ivy Tech Community College,Indianapolis,IN,0.0,0.0,0.0,0,,,0.0,77657.0,...,0.0209,0.0003,0.0354,0.6350,1,0.5153,0.3384,0.4780,29400,13000
Miami Dade College,Miami,FL,0.0,0.0,0.0,0,,,0.0,61470.0,...,0.0035,0.0521,0.0280,0.5824,1,0.5399,0.0921,0.3503,30100,8500
Lone Star College System,The Woodlands,TX,0.0,0.0,0.0,0,,,0.0,59920.0,...,0.0281,0.0190,0.0292,0.6863,1,0.3405,0.1984,0.3201,32900,11000
Houston Community College,Houston,TX,0.0,0.0,0.0,0,,,0.0,58084.0,...,0.0151,0.0911,0.0198,0.7027,1,0.6680,0.3348,0.4751,32500,10750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Prince Institute-Rocky Mountains,Westminster,CO,0.0,0.0,0.0,1,,,0.0,0.0,...,0.0000,0.0000,0.0000,,0,0.6923,0.9487,0.8824,33400,20992
Spanish-American Institute,New York,NY,0.0,0.0,0.0,0,,,0.0,0.0,...,0.0000,0.0000,0.0000,,1,,,0.8667,19700,
Taft University System,Denver,CO,0.0,0.0,0.0,0,,,1.0,0.0,...,0.0000,0.0000,0.0000,,1,0.0000,0.0000,1.0000,,PrivacySuppressed
Education and Technology Institute,Greensburg,PA,0.0,0.0,0.0,0,,,0.0,0.0,...,0.0000,0.0000,0.0000,,0,0.5333,0.0000,0.9333,,
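
The previous cell chains `dropna` after the sort to remove colleges with a missing `ugds` value. As a side note, here is a minimal sketch of where `sort_values` places missing values. The tiny `toy` DataFrame and its college names are made up for illustration and are not part of the college dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame(
    {'ugds': [500.0, np.nan, 12000.0, 80.0]},
    index=['College A', 'College B', 'College C', 'College D'],
)

# Missing values are placed last by default, whichever direction is used
asc = toy.sort_values('ugds')
desc = toy.sort_values('ugds', ascending=False)

# na_position='first' moves the missing values to the top instead
na_first = toy.sort_values('ugds', ascending=False, na_position='first')

print(asc.index.tolist())
print(desc.index.tolist())
print(na_first.index.tolist())
```

Because missing values always land at one end, dropping them before or after the sort leaves the remaining rows in the same order.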


### Simultaneously sort two or more columns

Sort by any number of columns by passing a list of their names to the `by` parameter. The sort begins with the first column. For instance, the following sorts all the colleges by state, and then within each state, sorts by undergraduate population.

In [5]:
cols = ['stabbr', 'ugds']
state_ugds_sort = college.sort_values(cols)
state_ugds_sort[cols].head(3)

instnm,stabbr,ugds
Alaska Bible College,AK,27.0
Alaska Christian College,AK,68.0
Ilisagvik College,AK,109.0


Let's select just the state of Oklahoma to verify that it too is sorted by undergraduate population.

In [6]:
state_ugds_sort.query('stabbr == "OK"')[cols].head(3)

instnm,stabbr,ugds
Hollywood Cosmetology Center,OK,6.0
Claremore Beauty College,OK,15.0
Northwest Technology Center-Fairview,OK,15.0


In [7]:
state_ugds_sort.query('stabbr == "AL" and relaffil == 1')[cols].head(3)

instnm,stabbr,ugds
Nunation School of Cosmetology,AL,13.0
Regency Beauty Institute-Hoover,AL,46.0
Heritage Christian University,AL,62.0


### Sort multiple columns in different directions

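The `ascending` parameter may be passed a list of booleans corresponding to the list of column names in the `by` parameter. A single boolean applies to every column, so the first cell below sorts both state and undergraduate population from greatest to least. The second cell passes a list to sort by state from least to greatest, and then by undergraduate population from greatest to least.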

In [11]:
college[cols].sort_values(cols, ascending=False).head(3)

instnm,stabbr,ugds
University of Wyoming,WY,9910.0
Laramie County Community College,WY,3170.0
Western Wyoming Community College,WY,2768.0


In [8]:
college[cols].sort_values(cols, ascending=[True, False]).head(3)

instnm,stabbr,ugds
University of Alaska Anchorage,AK,12865.0
University of Alaska Fairbanks,AK,5536.0
Charter College-Anchorage,AK,3256.0
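
To make the difference between a single boolean and a list of booleans easy to check by eye, here is a small sketch on a made-up four-row DataFrame. The column names mirror the college dataset, but the values and the index labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two states and two colleges in each
toy = pd.DataFrame(
    {'stabbr': ['TX', 'AK', 'TX', 'AK'],
     'ugds': [300.0, 50.0, 900.0, 700.0]},
    index=['T1', 'A1', 'T2', 'A2'],
)
cols = ['stabbr', 'ugds']

# A single boolean applies to every column: both sorted greatest to least
both_desc = toy.sort_values(cols, ascending=False)

# A list sets the direction per column: state ascending, population descending
mixed = toy.sort_values(cols, ascending=[True, False])

print(both_desc)
print(mixed)
```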


### Sort by the index

The DataFrame may be sorted by its index with the `sort_index` method.

In [9]:
college.sort_index(ascending=False).head(3)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
eClips School of Cosmetology and Barbering,Cape Girardeau,MO,0.0,0.0,0.0,0,,,0.0,83.0,...,0.0,0.0,0.0,0.0,1,0.9496,0.9832,0.3772,PrivacySuppressed,9500
duCret School of Arts,Plainfield,NJ,0.0,0.0,0.0,0,,,0.0,41.0,...,0.0976,0.0,0.0244,0.4146,1,0.4375,0.5,0.125,PrivacySuppressed,PrivacySuppressed
Zane State College,Zanesville,OH,0.0,0.0,0.0,0,,,0.0,2063.0,...,0.0218,0.0,0.2399,0.573,1,0.3645,0.3434,0.3185,23800,13960.5


### Sort the columns

Interestingly, you can use the same `sort_index` method to sort the columns of the DataFrame. Remember that pandas stores the column names in an Index object, the same type of object that holds the row index. To sort the columns, set the `axis` parameter to `'columns'` or `1`. This is identical to how we changed the direction of the statistical methods in previous chapters. Perhaps this method would have been more appropriately named `sort_axis`, since it sorts either axis.

In [10]:
college.sort_index(axis=1).head(2)

instnm,city,curroper,distanceonly,grad_debt_mdn_supp,hbcu,md_earn_wne_p10,menonly,pctfloan,pctpell,pptug_ef,...,ugds_2mor,ugds_aian,ugds_asian,ugds_black,ugds_hisp,ugds_nhpi,ugds_nra,ugds_unkn,ugds_white,womenonly
Alabama A & M University,Normal,1,0.0,33888.0,1.0,30300,0.0,0.8284,0.7356,0.0656,...,0.0,0.0024,0.0019,0.9353,0.0055,0.0019,0.0059,0.0138,0.0333,0.0
University of Alabama at Birmingham,Birmingham,1,0.0,21941.5,0.0,39700,0.0,0.5214,0.346,0.2607,...,0.0368,0.0022,0.0518,0.26,0.0283,0.0007,0.0179,0.01,0.5922,0.0
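
The following sketch shows `sort_index` acting on each axis of a small made-up DataFrame. The school labels and values are hypothetical and chosen only so that both the rows and the columns start out unsorted.

```python
import pandas as pd

# Hypothetical toy data with unsorted row labels and column names
toy = pd.DataFrame(
    {'stabbr': ['AL', 'AK'],
     'city': ['Normal', 'Anchorage'],
     'ugds': [4206.0, 12865.0]},
    index=['School B', 'School A'],
)

# Default axis: sort the rows by their index labels
by_rows = toy.sort_index()

# axis='columns' (or 1): sort the columns by their names
by_cols = toy.sort_index(axis='columns')

print(by_rows.index.tolist())
print(by_cols.columns.tolist())
```

Only the order of the labels changes. The values travel with their row and column labels.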


## Ranking

The `rank` method ranks the values in each column, independently, the same way it does for a Series. Let's explore ranking on a subset of the columns from the movie dataset. To simplify matters, we will work with just the first 5 rows of data.

In [12]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
cols = ['year', 'director_name', 'duration', 'director_fb', 'actor1_fb', 
        'actor2_fb', 'actor3_fb', 'num_reviews', 'imdb_score']
movie_5 = movie[cols].head()
movie_5

title,year,director_name,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,num_reviews,imdb_score
Avatar,2009.0,James Cameron,178.0,0.0,1000.0,936.0,855.0,723.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Gore Verbinski,169.0,563.0,40000.0,5000.0,1000.0,302.0,7.1
Spectre,2015.0,Sam Mendes,148.0,0.0,11000.0,393.0,161.0,602.0,6.8
The Dark Knight Rises,2012.0,Christopher Nolan,164.0,22000.0,27000.0,23000.0,23000.0,813.0,8.5
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,131.0,131.0,12.0,,,7.1


Calling the `rank` method with the defaults ranks each value in ascending order beginning with 1. For instance, take a look at the `actor1_fb` column. The smallest value is the fifth value and is given rank 1 below. The `rank` method also ranks string columns.

In [13]:
movie_5.rank()

title,year,director_name,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,num_reviews,imdb_score
Avatar,2.0,4.0,4.0,1.5,2.0,3.0,2.0,3.0,4.0
Pirates of the Caribbean: At World's End,1.0,3.0,3.0,4.0,5.0,4.0,3.0,1.0,2.5
Spectre,4.0,5.0,1.0,1.5,3.0,2.0,1.0,2.0,1.0
The Dark Knight Rises,3.0,1.0,2.0,5.0,4.0,5.0,4.0,4.0,5.0
Star Wars: Episode VII - The Force Awakens,,2.0,,3.0,1.0,1.0,,,2.5


Here, we rank from greatest to least and assign tied values the minimum rank of their group. Ties occur in the `director_fb` and `imdb_score` columns.

In [17]:
movie_5.rank(ascending=False, method='min')

title,year,director_name,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,num_reviews,imdb_score
Avatar,3.0,2.0,1.0,4.0,4.0,3.0,3.0,2.0,2.0
Pirates of the Caribbean: At World's End,4.0,3.0,2.0,2.0,1.0,2.0,2.0,4.0,3.0
Spectre,1.0,1.0,4.0,4.0,3.0,4.0,4.0,3.0,5.0
The Dark Knight Rises,2.0,5.0,3.0,1.0,2.0,1.0,1.0,1.0,1.0
Star Wars: Episode VII - The Force Awakens,,4.0,,3.0,5.0,5.0,,,3.0
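
To see how the `method` parameter treats ties, the sketch below ranks the five `imdb_score` values shown above with each available method. The values are typed in by hand so the block runs without the movie dataset. The `ranks` DataFrame is just a convenient way to place the results side by side.

```python
import pandas as pd

# The five imdb_score values from movie_5, entered by hand
scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1], name='imdb_score')

# Rank in ascending order with every tie-breaking method
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: scores.rank(method=m) for m in methods})

print(ranks)
```

The two scores of 7.1 occupy positions 2 and 3. The default `'average'` gives both 2.5, `'min'` gives both 2, `'max'` gives both 3, `'first'` assigns 2 and 3 in order of appearance, and `'dense'` gives both 2 without leaving a gap before the next rank.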


## Uniqueness

In this section, we cover the `nunique` and `drop_duplicates` DataFrame methods, both of which also exist for Series. We'll use the City of Houston employee dataset for the next set of examples.

In [18]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


### The `nunique` method

The `nunique` method returns the number of unique values in each column as a Series. The old column names are now the new index values.

In [19]:
emp.nunique()

dept            9
title         693
hire_date    3955
salary       4102
sex             2
race            5
dtype: int64

To count missing values as one additional unique value, set the `dropna` parameter to `False`.

In [20]:
emp.nunique(dropna=False)

dept            9
title         693
hire_date    3955
salary       4103
sex             2
race            6
dtype: int64

In [27]:
emp.nunique(dropna=False) - emp.nunique()

dept         0
title        0
hire_date    0
salary       1
sex          0
race         1
dtype: int64
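
The subtraction above reveals which columns contain at least one missing value. Here is the same idea on a small made-up DataFrame, assuming nothing about the employee data beyond the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'race' has two missing values, 'sex' has none
toy = pd.DataFrame({
    'race': ['White', 'Black', np.nan, np.nan, 'White'],
    'sex': ['Male', 'Female', 'Male', 'Male', 'Female'],
})

without_na = toy.nunique()            # missing values are ignored
with_na = toy.nunique(dropna=False)   # all missing values count as one value

# A difference of 1 flags a column with at least one missing value
has_missing = with_na - without_na

print(has_missing)
```

However many missing values a column holds, they add at most one to the count.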

### The `drop_duplicates` method

The default call to the `drop_duplicates` method returns only unique rows of the DataFrame. It does not use the index value in its search for duplicates. If two or more rows are duplicated, the first row is kept. Let's see if there are any duplicate rows in the employee dataset.

In [28]:
emp.shape

(24308, 6)

In [29]:
emp

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.00,Male,Black
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White
...,...,...,...,...,...,...
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.10,Male,Black
24304,Other,SENIOR PROCUREMENT SPECIALIST,2016-03-28,76175.00,Female,Black
24305,Houston Public Works,WATER SERVICE INSPECTOR I,2015-09-14,35173.00,Male,Black
24306,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,2008-05-19,67198.00,Female,Black


In [30]:
emp.drop_duplicates().shape

(17034, 6)

Interestingly, 7,274 of the 24,308 rows were dropped because they repeat the exact same information in all six columns as an earlier row.

### Drop duplicates based on a subset of columns

Instead of dropping rows where the entire row is duplicated, you can restrict the search for duplication to a subset of the columns. Pass the `subset` parameter a single column name or a list of column names. The following example returns a single row for each unique department.

In [31]:
emp.drop_duplicates(subset='dept')  # the keep parameter chooses which duplicate to retain

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black
8,Fire,FIRE FIGHTER,2015-08-01,48189.7,Male,White
10,Health & Human Services,ADMINISTRATIVE SPECIALIST,2016-04-25,52451.0,Female,Black
12,Library,CUSTOMER SERVICE CLERK,2017-11-13,28787.0,Female,Hispanic
20,Houston Airport System,SEMI-SKILLED LABORER,2009-07-06,35859.0,Female,Hispanic
23,Solid Waste Management,SENIOR REFUSE TRUCK DRIVER,2016-04-25,35422.0,Male,Black
35,Parks & Recreation,RECREATION ASSISTANT,2015-06-17,29058.0,Male,Black


The following returns the first row for each unique combination of race and sex.

In [32]:
emp.drop_duplicates(subset=['race', 'sex']).head(4)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black
5,Other,SENIOR ACCOUNT CLERK,2017-10-09,44616.0,Female,Black


In [34]:
emp.drop_duplicates(subset=['dept','race', 'sex'], keep='last')

Unnamed: 0,dept,title,hire_date,salary,sex,race
5050,Houston Public Works,CUSTOMER SERVICE REPRESENTATIVE II,2018-02-05,36462.0,Female,Native American
7213,Police,ADMINISTRATIVE COORDINATOR,2009-04-07,65275.0,Female,
9934,Police,SENIOR POLICE OFFICER,1972-11-13,,Male,
11183,Library,LIBRARY ASSISTANT,2006-10-16,29058.0,Male,Native American
11791,Solid Waste Management,SENIOR REFUSE TRUCK DRIVER,2015-04-27,36962.0,Female,White
...,...,...,...,...,...,...
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.1,Male,Black
24304,Other,SENIOR PROCUREMENT SPECIALIST,2016-03-28,76175.0,Female,Black
24305,Houston Public Works,WATER SERVICE INSPECTOR I,2015-09-14,35173.0,Male,Black
24306,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,2008-05-19,67198.0,Female,Black


In [41]:
emp[emp.duplicated(keep=False)]

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White
8,Fire,FIRE FIGHTER,2015-08-01,48189.70,Male,White
9,Police,SENIOR POLICE OFFICER,1993-08-30,75942.10,Male,Black
...,...,...,...,...,...,...
24299,Fire,FIRE FIGHTER,2017-03-27,43528.16,Female,White
24300,Other,STUDENT INTERN I,2018-07-02,28766.00,Female,Black
24301,Other,CUSTOMER SERVICE REPRESENTATIVE II,2016-04-11,36338.00,Male,Black
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.10,Male,Black
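
The last two cells use the `keep` parameter, which accepts `'first'` (the default), `'last'`, or `False`. The sketch below runs each option on a made-up five-row DataFrame so the kept rows can be checked by their index labels. The data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical, as are rows 1 and 3
toy = pd.DataFrame({
    'dept': ['Police', 'Fire', 'Police', 'Fire', 'Library'],
    'salary': [100, 200, 100, 200, 300],
})

first = toy.drop_duplicates()              # keep the first of each duplicate group
last = toy.drop_duplicates(keep='last')    # keep the last of each duplicate group
none = toy.drop_duplicates(keep=False)     # drop every row that has a duplicate

# duplicated returns a boolean Series; keep=False marks all members of a group
mask = toy.duplicated(keep=False)

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
print(toy[mask])
```

Filtering with `duplicated(keep=False)` is a handy way to inspect every row involved in a duplication rather than discarding them.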


## Finding the maximum/minimum of a group

In this section, we will explore a practical example of the `drop_duplicates` method. Let's say we are interested in finding the employee with the maximum salary per department. This results in a DataFrame with a single row for each department. Let's begin by sorting by department first and then salary, making sure salary is sorted from greatest to least.

In [42]:
emp_sorted = emp.sort_values(['dept', 'salary'], ascending=[True, False])
emp_sorted.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
1732,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,White
1975,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,Asian
4680,Fire,"PHYSICIAN,MD",2015-11-23,342784.0,Male,White


The data is correctly sorted, but the information we want is not easily accessible. We desire a single row for each department. We can now turn to the `drop_duplicates` method and use the `subset` parameter to keep the first row of every department.

In [43]:
emp_sorted.drop_duplicates(subset='dept')

Unnamed: 0,dept,title,hire_date,salary,sex,race
1732,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,White
8405,Health & Human Services,"CHIEF PHYSICIAN,MD",2017-07-31,186685.0,Female,White
3897,Houston Airport System,AVIATION DIRECTOR,2010-06-01,275000.0,Male,Hispanic
10704,Houston Public Works,PUBLIC WORKS DIRECTOR,2005-08-10,275000.0,Female,White
7564,Library,LIBRARY DIRECTOR,2005-11-07,170000.0,Female,Black
13338,Other,CITY ATTORNEY,2016-05-02,275000.0,Male,Black
11679,Parks & Recreation,PARKS & RECREATION DIRECTOR,2017-07-05,150000.0,Male,White
4413,Police,POLICE CHIEF,2016-11-30,280000.0,Male,Hispanic
20244,Solid Waste Management,SOLID WASTE DIRECTOR,2001-05-14,195000.0,Male,Black


We can rewrite our solution without assigning the result of `sort_values` to a variable by chaining the `drop_duplicates` method directly after it. Here, we find the employee with the lowest salary per department.

In [44]:
(emp.sort_values(['dept', 'salary'])
    .drop_duplicates(subset='dept'))

Unnamed: 0,dept,title,hire_date,salary,sex,race
7000,Fire,ASSISTANT EMS PHYSICIAN DIRECTOR,2017-07-10,16411.0,Female,Black
16485,Health & Human Services,CUSTOMER SERVICE CLERK,2017-11-20,25064.0,Male,Black
668,Houston Airport System,LABORER,2017-09-11,26125.0,Female,Hispanic
10547,Houston Public Works,LABORER,2014-03-17,28205.0,Male,Black
1183,Library,CUSTOMER SERVICE CLERK,2016-01-19,9912.0,Female,Hispanic
4040,Other,COUNCIL INTERN (EXECUTIVE LEVEL),2018-08-14,24960.0,Female,Hispanic
64,Parks & Recreation,LABORER,2018-04-23,24960.0,Male,Hispanic
2935,Police,CUSTOMER SERVICE CLERK,2016-10-02,26915.0,Female,Hispanic
3135,Solid Waste Management,LABORER,2018-08-27,27040.0,Female,Black


It's actually sufficient to sort just by salary, from greatest to least, as the first row encountered for each department will then be the employee with the highest salary. Note that the final DataFrame is sorted by salary and not by department.

In [49]:
(emp.sort_values('salary', ascending=False)
    .drop_duplicates(subset='dept'))

Unnamed: 0,dept,title,hire_date,salary,sex,race
19987,Fire,"PHYSICIAN,MD",2016-05-09,342784.0,Male,Hispanic
4413,Police,POLICE CHIEF,2016-11-30,280000.0,Male,Hispanic
3897,Houston Airport System,AVIATION DIRECTOR,2010-06-01,275000.0,Male,Hispanic
10704,Houston Public Works,PUBLIC WORKS DIRECTOR,2005-08-10,275000.0,Female,White
13338,Other,CITY ATTORNEY,2016-05-02,275000.0,Male,Black
20244,Solid Waste Management,SOLID WASTE DIRECTOR,2001-05-14,195000.0,Male,Black
8405,Health & Human Services,"CHIEF PHYSICIAN,MD",2017-07-31,186685.0,Female,White
7564,Library,LIBRARY DIRECTOR,2005-11-07,170000.0,Female,Black
11679,Parks & Recreation,PARKS & RECREATION DIRECTOR,2017-07-05,150000.0,Male,White


This short chain of steps combining `sort_values` with `drop_duplicates` is a generic and common pattern for finding the maximum or minimum of some column within groups formed by other columns. Below, we find the minimum salary for every unique combination of department, race, and sex.

In [46]:
(emp.sort_values('salary')
    .drop_duplicates(subset=['dept', 'race', 'sex']).head())

Unnamed: 0,dept,title,hire_date,salary,sex,race
1183,Library,CUSTOMER SERVICE CLERK,2016-01-19,9912.0,Female,Hispanic
9838,Library,CUSTOMER SERVICE CLERK,2016-02-29,9912.0,Female,Black
13705,Library,CUSTOMER SERVICE CLERK,2016-01-19,9912.0,Female,White
23952,Library,SENIOR CUSTOMER SERVICE CLERK,2017-08-28,10661.0,Male,White
4164,Library,INVENTORY MANAGEMENT CLERK,2017-02-27,10952.0,Male,Black
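
Here is the sort-then-drop pattern on a made-up DataFrame that is small enough to verify by hand. The departments and salaries are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two Fire and two Police employees
toy = pd.DataFrame({
    'dept': ['Fire', 'Police', 'Fire', 'Police', 'Library'],
    'salary': [50, 90, 70, 60, 40],
})

# Highest salary per department: sort greatest to least, keep the first row per dept
top = (toy.sort_values('salary', ascending=False)
          .drop_duplicates(subset='dept'))

# Lowest salary per department: sort least to greatest, keep the first row per dept
bottom = (toy.sort_values('salary')
             .drop_duplicates(subset='dept'))

print(top)
print(bottom)
```

When several rows tie for the extreme value, as the Fire department physicians do above, only one of them is kept.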


## The `value_counts` method

The DataFrame `value_counts` method counts the occurrences of each unique value, just like its Series counterpart. Pass in the name of the column to count as the first argument.

In [53]:
emp.value_counts('race')

race
Black              8661
White              7779
Hispanic           6148
Asian              1547
Native American     146
Name: count, dtype: int64

Pass a list of column names to count each unique combination of their values. By default, the groups are sorted by count in descending order. A Series containing a **multilevel** index is returned, which is a more advanced index that contains multiple labels for each value.

In [54]:
emp.value_counts(['race', 'sex'])

race             sex   
White            Male      6488
Black            Male      5074
Hispanic         Male      4208
Black            Female    3587
Hispanic         Female    1940
White            Female    1291
Asian            Male      1059
                 Female     488
Native American  Male       107
                 Female      39
Name: count, dtype: int64

Set the `normalize` parameter to `True` to return relative frequencies instead of counts, and set `ascending` to `True` to sort from least to greatest.

In [58]:
emp.value_counts(['dept','race'], normalize=True).round(3).head() * 100

dept                  race    
Police                White       13.1
Fire                  White        9.9
Houston Public Works  Black        9.2
Police                Hispanic     8.3
                      Black        7.6
Name: proportion, dtype: float64

In [56]:
emp.value_counts(['dept', 'sex'], normalize=True, ascending=True).round(3).head() * 100

dept                     sex   
Solid Waste Management   Female    0.5
Library                  Male      0.7
Fire                     Female    1.0
Parks & Recreation       Female    1.5
Health & Human Services  Male      1.5
Name: proportion, dtype: float64
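
The sketch below repeats these calls on a made-up DataFrame with six rows, where the proportions are easy to work out by hand. The data is hypothetical. Multiplying by 100 before rounding, as done here, is one way to get tidy percentages.

```python
import pandas as pd

# Hypothetical toy data: 3 Police/Male, 2 Fire/Male, 1 Library/Female
toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Police', 'Fire', 'Fire', 'Library'],
    'sex': ['Male', 'Male', 'Male', 'Male', 'Male', 'Female'],
})
cols = ['dept', 'sex']

counts = toy.value_counts(cols)                 # raw counts, largest first
props = toy.value_counts(cols, normalize=True)  # proportions that sum to 1
asc = toy.value_counts(cols, ascending=True)    # raw counts, smallest first

# Convert to percentages first, then round
pct = (props * 100).round(1)

print(pct)
```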

### Brief explanation of the multilevel index

pandas allows the index (and the columns) to have multiple levels. Each level is analogous to an extra column (or row) but is technically still a label for the value. The Series above has two index levels and one sequence of values. A Series always has exactly one sequence of values but can have multiple index levels. This multilevel index has a name for each of its levels (`dept` and `sex`), which appears directly above that level. A separate part of the book will be dedicated to multilevel indexes.
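
As a brief preview, the following sketch inspects the multilevel index of a `value_counts` result and selects values from it. The toy data is hypothetical, and this is only a first look at selection, not a full treatment.

```python
import pandas as pd

# Hypothetical toy data: 3 Police/Male, 1 Police/Female, 2 Fire/Male
toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Police', 'Police', 'Fire', 'Fire'],
    'sex': ['Male', 'Male', 'Male', 'Female', 'Male', 'Male'],
})
counts = toy.value_counts(['dept', 'sex'])

# The index has one level per column that was counted
n_levels = counts.index.nlevels
level_names = list(counts.index.names)

# A tuple with one label per level selects a single value
police_male = counts[('Police', 'Male')]

# A label from the first level alone selects every row beneath it
police = counts.loc['Police']

print(n_levels, level_names, police_male)
print(police)
```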

## Exercises

Use the employee dataset for the first few exercises.

### Exercise 1

<span  style="color:green; font-size:16px">Sort department, race, sex ascending along with salary descending.</span>

In [59]:
emp

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.00,Male,Black
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White
...,...,...,...,...,...,...
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.10,Male,Black
24304,Other,SENIOR PROCUREMENT SPECIALIST,2016-03-28,76175.00,Female,Black
24305,Houston Public Works,WATER SERVICE INSPECTOR I,2015-09-14,35173.00,Male,Black
24306,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,2008-05-19,67198.00,Female,Black


In [61]:
emp.sort_values(['dept','race','sex','salary'],ascending=[True,True,True,False])

Unnamed: 0,dept,title,hire_date,salary,sex,race
19946,Fire,"PHYSICIAN,MD",2015-06-22,342784.0,Female,Asian
514,Fire,ASSISTANT EMS PHYSICIAN DIRECTOR,2017-08-28,141669.0,Female,Asian
8642,Fire,STAFF PSYCHOLOGIST,2014-12-22,103805.0,Female,Asian
7199,Fire,SENIOR STAFF ANALYST,2003-01-14,97850.0,Female,Asian
430,Fire,ADMINISTRATIVE ASSISTANT,1985-08-19,55432.0,Female,Asian
...,...,...,...,...,...,...
13866,Solid Waste Management,SENIOR REFUSE TRUCK DRIVER,2014-06-23,36109.0,Male,White
7543,Solid Waste Management,SENIOR SIDELOADER OPERATOR,2018-02-26,35506.0,Male,White
21376,Solid Waste Management,SENIOR SIDELOADER OPERATOR,2018-11-05,35506.0,Male,White
6289,Solid Waste Management,SENIOR REFUSE TRUCK DRIVER,2018-09-10,32760.0,Male,White


### Exercise 2

<span  style="color:green; font-size:16px">How many unique combinations of department and title exist?</span>

In [63]:
emp.value_counts(['dept','title'])

dept                    title                               
Police                  POLICE OFFICER                          2435
                        SENIOR POLICE OFFICER                   2077
Fire                    FIRE FIGHTER                            1840
Police                  POLICE SERGEANT                         1243
Fire                    ENGINEER/OPERATOR                       1132
                                                                ... 
Solid Waste Management  ASSISTANT SUPERINTENDENT                   1
                        CUSTOMER SERVICE SECTION CHIEF             1
Fire                    COMMUNICATIONS SPECIALIST SUPERVISOR       1
                        CUSTOMER SERVICE REPRESENTATIVE I          1
                        CUSTOMER SERVICE REPRESENTATIVE III        1
Name: count, Length: 1312, dtype: int64

In [64]:
emp.drop_duplicates(subset=['dept','title'])

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.00,Male,Black
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic
5,Other,SENIOR ACCOUNT CLERK,2017-10-09,44616.00,Female,Black
...,...,...,...,...,...,...
23888,Other,ASSISTANT ELECTRICAL SUPERVISOR,2005-06-13,66248.00,Male,Hispanic
24016,Library,DEPUTY DIRECTOR (EXECUTIVE LEVEL),2007-12-17,139565.00,Female,Black
24132,Library,IT ASSOCIATE - CLIENT SUPPORT,2005-10-17,31075.00,Male,White
24165,Solid Waste Management,PURCHASING MANAGER,1996-12-17,87989.00,Female,Hispanic
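
Both cells above display the combinations, and the count has to be read from the length of the output. A minimal sketch of getting the number directly, shown on made-up data so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy data with three distinct dept/title combinations
toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Fire', 'Fire', 'Police'],
    'title': ['OFFICER', 'SERGEANT', 'FIRE FIGHTER', 'FIRE FIGHTER', 'OFFICER'],
})
cols = ['dept', 'title']

# One row survives per unique combination, so the length is the answer
n_from_drop = len(toy.drop_duplicates(subset=cols))

# value_counts has one entry per unique combination
n_from_counts = len(toy.value_counts(cols))

print(n_from_drop, n_from_counts)
```

The two approaches can disagree when the columns contain missing values, since `value_counts` excludes rows with missing values by default while `drop_duplicates` keeps them.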


### Exercise 3

<span  style="color:green; font-size:16px">Since only Series have a `unique` method, can you think of a creative way of getting the same result as exercise 2 with the `unique` method?</span>

In [66]:
(emp['dept'] + emp['title']).unique()

array(['PolicePOLICE SERGEANT', 'OtherASSISTANT CITY ATTORNEY II',
       'Houston Public WorksSENIOR SLUDGE PROCESSOR', ...,
       'LibraryIT ASSOCIATE - CLIENT SUPPORT',
       'Solid Waste ManagementPURCHASING MANAGER',
       'OtherDEPUTY CIO - IT INFRASTRUCTURE (EXE LVL)'],
      shape=(1312,), dtype=object)
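
One caveat with concatenating strings directly: two different pairs can produce the same combined string. The made-up example below shows the problem and one possible fix, which is to place a separator between the columns. The `'|'` separator is an arbitrary choice and assumes that character does not appear in the data.

```python
import pandas as pd

# Hypothetical toy data where two different pairs concatenate to 'abc'
toy = pd.DataFrame({'a': ['ab', 'a'], 'b': ['c', 'bc']})

# Without a separator, both rows collapse to the same string
no_sep = (toy['a'] + toy['b']).unique()

# With a separator, the two pairs stay distinct
with_sep = (toy['a'] + '|' + toy['b']).unique()

print(len(no_sep), len(with_sep))
```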

### Exercise 4

<span style="color:green; font-size:16px">Find the frequency of occurrence of all race and sex combinations using the trick from exercise 3. For instance, you would return an object that contains the number of 'Hispanic Males', 'Black Females', etc...</span>

In [76]:
emp.value_counts(subset=['race','sex'], normalize=True)

race             sex   
White            Male      0.267205
Black            Male      0.208970
Hispanic         Male      0.173304
Black            Female    0.147729
Hispanic         Female    0.079898
White            Female    0.053169
Asian            Male      0.043614
                 Female    0.020098
Native American  Male      0.004407
                 Female    0.001606
Name: proportion, dtype: float64

In [77]:
s = emp['race'] + ' ' + emp['sex']

s.value_counts(normalize=True)

White Male                0.267205
Black Male                0.208970
Hispanic Male             0.173304
Black Female              0.147729
Hispanic Female           0.079898
White Female              0.053169
Asian Male                0.043614
Asian Female              0.020098
Native American Male      0.004407
Native American Female    0.001606
Name: proportion, dtype: float64

### Use the college dataset for the remaining exercises

Execute the following cell to read in the college dataset which sets the institution name (`instnm`) as the index.

In [78]:
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head(3)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [85]:
pd.read_csv('../data/dictionaries/college_data_dictionary.csv')

,column_name,description
0,instnm,Institution Name
1,city,City Location
2,stabbr,State Abbreviation
3,hbcu,Historically Black College or University
4,menonly,0/1 Men Only
5,womenonly,0/1 Women only
6,relaffil,0/1 Religious Affiliation
7,satvrmid,SAT Verbal Median
8,satmtmid,SAT Math Median
9,distanceonly,Distance Education Only


### Exercise 5

<span style="color:green; font-size:16px">Select the `stabbr`, `satvrmid`, `satmtmid`, and `ugds` columns for schools in the state of Texas ('TX') that have an undergraduate student population of more than 20,000. Drop any rows with missing values and assign the result to the variable name `college_tx`.</span>

In [88]:
college_tx = (college[['stabbr', 'satvrmid', 'satmtmid', 'ugds']]
                .query('stabbr == "TX" and ugds > 20_000')
                .dropna()
             )

college_tx

instnm,stabbr,satvrmid,satmtmid,ugds
University of Houston,TX,555.0,590.0,31643.0
University of North Texas,TX,545.0,555.0,29758.0
Texas State University,TX,510.0,515.0,32177.0
Texas A & M University-College Station,TX,580.0,615.0,46941.0
The University of Texas at Arlington,TX,494.0,550.0,29616.0
The University of Texas at Austin,TX,630.0,660.0,38914.0
The University of Texas at San Antonio,TX,505.0,535.0,23815.0
Texas Tech University,TX,540.0,560.0,28278.0
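
The same filter can be written with boolean selection instead of `query`. Here is a small sketch on hypothetical data (the `toy` DataFrame and its school names are made up for illustration) showing that both styles select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the college dataset
toy = pd.DataFrame(
    {
        'stabbr': ['TX', 'TX', 'CA', 'TX'],
        'satvrmid': [555.0, np.nan, 600.0, 500.0],
        'ugds': [31643.0, 25000.0, 40000.0, 1200.0],
    },
    index=['School A', 'School B', 'School C', 'School D'],
)

# Style 1: a query string, then drop rows with missing values
via_query = toy.query('stabbr == "TX" and ugds > 20000').dropna()

# Style 2: build a boolean mask and pass it to .loc
mask = (toy['stabbr'] == 'TX') & (toy['ugds'] > 20000)
via_mask = toy.loc[mask].dropna()

# Only School A is in Texas, is large enough, and has no missing values
print(via_query)
print(via_query.equals(via_mask))
```

Which style to use is largely a matter of taste. The mask version needs parentheses around each condition because `&` binds more tightly than the comparison operators.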


### Exercise 6

<span style="color:green; font-size:16px">Rank each column from the `college_tx` DataFrame from greatest to least.</span>

In [90]:
college_tx.rank(ascending=False, method='dense')

instnm,stabbr,satvrmid,satmtmid,ugds
University of Houston,1.0,3.0,3.0,4.0
University of North Texas,1.0,4.0,5.0,5.0
Texas State University,1.0,6.0,8.0,3.0
Texas A & M University-College Station,1.0,2.0,2.0,1.0
The University of Texas at Arlington,1.0,8.0,6.0,6.0
The University of Texas at Austin,1.0,1.0,1.0,2.0
The University of Texas at San Antonio,1.0,7.0,7.0,8.0
Texas Tech University,1.0,5.0,4.0,7.0
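
Every value in the `stabbr` column above receives rank 1.0 because all rows are tied on 'TX' and `method='dense'` gives tied values the same rank without leaving gaps. The sketch below, on a made-up `scores` DataFrame, compares `'dense'` with the default `'average'` and with `'min'` to show how each handles ties.

```python
import pandas as pd

# Hypothetical scores with a tie for the top value
scores = pd.DataFrame({'sat': [600, 550, 600, 500]},
                      index=['A', 'B', 'C', 'D'])

# dense: tied values share a rank and the next rank is not skipped
dense = scores.rank(ascending=False, method='dense')

# average (the default): tied values get the mean of the ranks they occupy
average = scores.rank(ascending=False)

# min: tied values get the lowest rank they occupy, and later ranks are skipped
minimum = scores.rank(ascending=False, method='min')

print(dense['sat'].tolist())    # [1.0, 2.0, 1.0, 3.0]
print(average['sat'].tolist())  # [1.5, 3.0, 1.5, 4.0]
print(minimum['sat'].tolist())  # [1.0, 3.0, 1.0, 4.0]
```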


### Exercise 7

<span style="color:green; font-size:16px">Using the full college dataset, find the largest school by population for each state. Return only the `stabbr` and `ugds` columns, sorted by `ugds` from greatest to least.</span>

In [99]:
college.sort_values('ugds', ascending=False).drop_duplicates(subset='stabbr')[['stabbr', 'ugds']]

instnm,stabbr,ugds
University of Phoenix-Arizona,AZ,151558.0
Ivy Tech Community College,IN,77657.0
Miami Dade College,FL,61470.0
Lone Star College System,TX,59920.0
Liberty University,VA,49340.0
American Public University System,WV,44924.0
Ashford University,CA,44744.0
Western Governors University,UT,44499.0
Ohio State University-Main Campus,OH,43733.0
Kaplan University-Davenport Campus,IA,40335.0
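
The solution works because `drop_duplicates` keeps the first occurrence of each state by default, and after a descending sort the first occurrence is the largest school. As a cross-check, the sketch below compares it with an alternative based on `groupby` and `idxmax`, which is not covered in this chapter. The `toy` data is hypothetical, and the alternative assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical toy data with unique index labels
toy = pd.DataFrame(
    {'stabbr': ['TX', 'TX', 'CA', 'CA', 'NY'],
     'ugds': [100.0, 300.0, 250.0, 50.0, 80.0]},
    index=['School A', 'School B', 'School C', 'School D', 'School E'],
)

# Approach from the solution: sort descending, keep the first row per state
via_sort = (toy.sort_values('ugds', ascending=False)
               .drop_duplicates(subset='stabbr')[['stabbr', 'ugds']])

# Alternative: find the index label of the largest school in each state
largest_idx = toy.groupby('stabbr')['ugds'].idxmax()
via_groupby = (toy.loc[largest_idx, ['stabbr', 'ugds']]
                  .sort_values('ugds', ascending=False))

print(via_sort)
print(via_groupby)
```

Both approaches pick School B for TX, School C for CA, and School E for NY.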


### Exercise 8

<span style="color:green; font-size:16px">Several of the columns from the college dataset contain binary data (every value is either 0 or 1). Can you identify the names of these columns?</span>

In [102]:
college.nunique() == 2

city                  False
stabbr                False
hbcu                   True
menonly                True
womenonly              True
relaffil               True
satvrmid              False
satmtmid              False
distanceonly           True
ugds                  False
ugds_white            False
ugds_black            False
ugds_hisp             False
ugds_asian            False
ugds_aian             False
ugds_nhpi             False
ugds_2mor             False
ugds_nra              False
ugds_unkn             False
pptug_ef              False
curroper               True
pctpell               False
pctfloan              False
ug25abv               False
md_earn_wne_p10       False
grad_debt_mdn_supp    False
dtype: bool

In [107]:
college.loc[:,college.nunique() == 2]

instnm,hbcu,menonly,womenonly,relaffil,distanceonly,curroper
Alabama A & M University,1.0,0.0,0.0,0,0.0,1
University of Alabama at Birmingham,0.0,0.0,0.0,0,0.0,1
Amridge University,0.0,0.0,0.0,1,1.0,1
University of Alabama in Huntsville,0.0,0.0,0.0,0,0.0,1
Alabama State University,1.0,0.0,0.0,0,0.0,1
...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,1,,1
Rasmussen College - Overland Park,,,,1,,1
National Personal Training Institute of Cleveland,,,,1,,1
Bay Area Medical Academy - San Jose Satellite Location,,,,1,,1


In [106]:
college.loc[:,college.nunique() == 2].describe().T

,count,mean,std,min,25%,50%,75%,max
hbcu,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
menonly,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
womenonly,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
relaffil,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
distanceonly,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
curroper,7535.0,0.923291,0.266146,0.0,1.0,1.0,1.0,1.0
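
Testing `nunique() == 2` finds columns with exactly two distinct values, but those two values need not be 0 and 1. The `describe` output above confirms it for the college data, since each column has a minimum of 0 and a maximum of 1. A more direct check is sketched below on a hypothetical `toy` DataFrame, combining `nunique` with `isin`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'flag': [0.0, 1.0, np.nan, 0.0],  # binary, with one missing value
    'two_vals': [5, 7, 5, 7],         # two unique values, but not 0/1
    'many': [1, 2, 3, 4],             # more than two unique values
})

# Columns with exactly two distinct non-missing values
two_unique = toy.nunique() == 2

# Columns where every non-missing value is 0 or 1
is_zero_one = (toy.isin([0, 1]) | toy.isna()).all()

# Keep only the columns that pass both tests
binary_cols = toy.columns[two_unique & is_zero_one].tolist()
print(binary_cols)  # ['flag']
```

Note that `nunique` ignores missing values by default, which is why `flag` still counts as having two unique values.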
