In [1]:
import pandas as pd

The data in the CSV file was extracted from a [spreadsheet](https://collegecost.ed.gov/wwwroot/documents/CATClists2014.xlsx) made available by the [U.S. Department of Education College Affordability and Transparency Center](https://collegecost.ed.gov/).
It contains 2014-2015 tuition and fees for universities and colleges in the USA. The data is loaded into the data frame `data`, indexed by the name of the institution.

In [9]:
data = pd.read_csv("data/college_tuition.csv").set_index("Name of institution")
data.head()

Name of institution,Sector,Sector name,UnitID,OPEID,State,2014-15 Tuition and fees,List A: High tuition and fee indicator,List E: Low tuition and fee indicator
University of Pittsburgh-Pittsburgh Campus,1,"4-year, public",215293,337900,PA,17772.0,1,0
College of William and Mary,1,"4-year, public",231624,370500,VA,17656.0,1,0
Pennsylvania State University-Main Campus,1,"4-year, public",214777,332900,PA,17502.0,1,0
Colorado School of Mines,1,"4-year, public",126775,134800,CO,16918.0,1,0
University of New Hampshire-Main Campus,1,"4-year, public",183044,258900,NH,16552.0,1,0
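
As a side note on the loading step, here is a minimal sketch of two equivalent ways to get a name-indexed data frame. It assumes a tiny in-memory CSV (`csv_text`, with made-up institution names) in place of `data/college_tuition.csv`, so it runs without the file.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for data/college_tuition.csv
csv_text = """Name of institution,State,2014-15 Tuition and fees
College A,PA,17772.0
College B,VA,17656.0
College C,GA,10836.0
"""

# Option 1: let read_csv build the index while parsing
sample = pd.read_csv(io.StringIO(csv_text), index_col="Name of institution")

# Option 2: read first, then move the column into the index (as done above)
same = pd.read_csv(io.StringIO(csv_text)).set_index("Name of institution")

print(sample.equals(same))
print(sample.index.name)
```

Either form leaves the institution name as the row label, which is what the label-based selections in the exercises rely on.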


In [8]:
data.dtypes

Sector                                      int64
Sector name                                object
UnitID                                      int64
OPEID                                       int64
Name of institution                        object
State                                      object
2014-15 Tuition and fees                  float64
List A: High tuition and fee indicator      int64
List E: Low tuition and fee indicator       int64
dtype: object

# Indexing and selection with Series

First, we focus on one column of the data frame and extract it into the series `tuition`.

In [24]:
tuition = data["2014-15 Tuition and fees"]
tuition

Name of institution
University of Pittsburgh-Pittsburgh Campus    17772.0
College of William and Mary                   17656.0
Pennsylvania State University-Main Campus     17502.0
Colorado School of Mines                      16918.0
University of New Hampshire-Main Campus       16552.0
                                               ...   
Industrial Technical College                   6480.0
American Educational College                   6422.0
Future-Tech Institute                          6279.0
Rosslyn Training Academy of Cosmetology        6139.0
InterAmerican Technical Institute              2550.0
Name: 2014-15 Tuition and fees, Length: 4140, dtype: float64

- Select the first five entries in the series
- Select the last five entries in the series
- Select every tenth entry (10% sampling)
- What is the tuition of the `'University of Georgia'`?
- Select the subseries containing the tuition fees of `'Harvard University'` and `'Massachusetts Institute of Technology'`
- Select the subseries containing tuitions greater than `50000`
- Select the subseries containing tuitions less than or equal to `50000` but greater than `40000`
- (Challenge) Select the tuition of institutes of technology, that is, colleges whose names end with `Institute of Technology`
- (Challenge) What is the cheapest college in `Georgia`? **Hint:** you can use the method `sort_values` to sort the series

In [21]:
test_1 = tuition[:5] # tuition.head(5)
test_1

Name of institution
University of Pittsburgh-Pittsburgh Campus    17772.0
College of William and Mary                   17656.0
Pennsylvania State University-Main Campus     17502.0
Colorado School of Mines                      16918.0
University of New Hampshire-Main Campus       16552.0
Name: 2014-15 Tuition and fees, dtype: float64

In [22]:
test_2 = tuition[-5:] # tuition.tail(5)
test_2

Name of institution
Industrial Technical College               6480.0
American Educational College               6422.0
Future-Tech Institute                      6279.0
Rosslyn Training Academy of Cosmetology    6139.0
InterAmerican Technical Institute          2550.0
Name: 2014-15 Tuition and fees, dtype: float64

In [23]:
test_3 = tuition[::10]
test_3

Name of institution
University of Pittsburgh-Pittsburgh Campus             17772.0
University of Illinois at Urbana-Champaign             15020.0
Pennsylvania State University-Penn State Brandywine    13942.0
University of New Hampshire at Manchester              13768.0
Pennsylvania State University-Penn State DuBois        13528.0
                                                        ...   
The English Center                                      9705.0
Dorsey Business Schools-Roseville                      19968.0
Ohio Center for Broadcasting-Columbus                  15557.0
Cortiva Institute-New Jersey                            9884.0
Automeca Technical College-Ponce                        7256.0
Name: 2014-15 Tuition and fees, Length: 414, dtype: float64
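
The three slices above are positional. A small sketch, assuming a made-up series `s` of 20 hypothetical institutions, shows the same selections written with the explicit positional indexer `iloc`:

```python
import pandas as pd

# Hypothetical series: 20 made-up institutions with placeholder values
s = pd.Series(range(20), index=[f"College {i}" for i in range(20)], name="fee")

first = s.iloc[:5]         # first five entries, same as s.head(5)
last = s.iloc[-5:]         # last five entries, same as s.tail(5)
every_tenth = s.iloc[::10] # positions 0 and 10

print(list(every_tenth.index))
```

Using `iloc` makes it explicit that the numbers are positions and not index labels, which matters once the index itself holds integers.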

In [27]:
test_4 = tuition['University of Georgia']
test_4

10836.0

In [28]:
test_5 = tuition[['Harvard University', 'Massachusetts Institute of Technology']]
test_5

Name of institution
Harvard University                       43938.0
Massachusetts Institute of Technology    45016.0
Name: 2014-15 Tuition and fees, dtype: float64

In [31]:
test_6 = tuition[tuition > 50000]
test_6

Name of institution
Columbia University in the City of New York               51008.0
Sarah Lawrence College                                    50780.0
Landmark College                                          50080.0
Aviator College of Aeronautical Science and Technology    74787.0
Name: 2014-15 Tuition and fees, dtype: float64

In [32]:
test_7 = tuition[(tuition <= 50000) & (tuition > 40000)]
test_7

Name of institution
Vassar College                                   49570.0
University of Chicago                            49380.0
Trinity College                                  49056.0
Carnegie Mellon University                       49022.0
George Washington University                     48760.0
                                                  ...   
Southern California Institute of Architecture    40262.0
San Francisco Art Institute                      40096.0
Hartwick College                                 40070.0
Ringling College of Art and Design               40040.0
Stetson University                               40040.0
Name: 2014-15 Tuition and fees, Length: 172, dtype: float64
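
The range selection above combines two boolean masks. The following sketch, on a made-up series `fees` with hypothetical names, spells out the intermediate mask. The parentheses are required because `&` binds more tightly than the comparison operators in Python.

```python
import pandas as pd

# Hypothetical fees chosen to sit on and around the two boundaries
fees = pd.Series(
    [51008.0, 49570.0, 40040.0, 40000.0, 17772.0],
    index=["College A", "College B", "College C", "College D", "College E"],
)

# Each comparison yields a boolean Series; & combines them element-wise
mask = (fees <= 50000) & (fees > 40000)
selected = fees[mask]

print(selected)
```

The entry at exactly `40000` is excluded because the lower bound is strict, while an entry at exactly `50000` would be kept.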

test_8 = tuition[[x for x in tuition.index if x.endswith('Institute of Technology')]]
test_8
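
As an alternative to the list comprehension, the index exposes vectorized string methods through `.str`. This is a sketch on a three-row made-up series (the fee values are only placeholders), not a replacement for the solution above:

```python
import pandas as pd

# Small hypothetical series; values are placeholders
fees = pd.Series(
    [45016.0, 43938.0, 11394.0],
    index=[
        "Massachusetts Institute of Technology",
        "Harvard University",
        "Georgia Institute of Technology-Main Campus",
    ],
)

# Boolean array: True where the name ends with the suffix
mask = fees.index.str.endswith("Institute of Technology")
tech = fees[mask]

# Same selection via the list comprehension used above
by_list = fees[[x for x in fees.index if x.endswith("Institute of Technology")]]

print(tech)
```

Note that a name such as `Georgia Institute of Technology-Main Campus` does not match, because the campus suffix comes after `Institute of Technology`.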

In [40]:
test_9 = tuition[[x for x in tuition.index if 'Georgia' in x]].sort_values().index[0]
test_9

'Central Georgia Technical College'
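
Sorting and taking the first label works, but `idxmin` returns the label of the smallest value directly. A minimal sketch, assuming a made-up series with hypothetical names and fees:

```python
import pandas as pd

# Hypothetical names and fees, for illustration only
fees = pd.Series(
    [11000.0, 9000.0, 3000.0, 17000.0],
    index=[
        "Georgia College X",
        "University of Georgia Y",
        "Central Georgia Z",
        "Pittsburgh College W",
    ],
)

# Keep only names containing 'Georgia', then ask for the label of the minimum
georgia = fees[fees.index.str.contains("Georgia")]
cheapest = georgia.idxmin()

print(cheapest)
```

Matching on the name is only a proxy for location: it misses Georgia colleges without `Georgia` in their name and could pick up colleges elsewhere. The `State` column used in the data frame section is the more reliable filter.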

# Indexing and selection with DataFrame

In [20]:
data.head()

Name of institution,Sector,Sector name,UnitID,OPEID,State,2014-15 Tuition and fees,List A: High tuition and fee indicator,List E: Low tuition and fee indicator
University of Pittsburgh-Pittsburgh Campus,1,"4-year, public",215293,337900,PA,17772.0,1,0
College of William and Mary,1,"4-year, public",231624,370500,VA,17656.0,1,0
Pennsylvania State University-Main Campus,1,"4-year, public",214777,332900,PA,17502.0,1,0
Colorado School of Mines,1,"4-year, public",126775,134800,CO,16918.0,1,0
University of New Hampshire-Main Campus,1,"4-year, public",183044,258900,NH,16552.0,1,0


- Extract `State` and `Sector name` into a new data frame
- Extract the row pertaining to the `'University of Georgia'`
- What is the data type of the extracted row?
- Extract the two rows pertaining to `'Harvard University'` and `'Massachusetts Institute of Technology'`
- What is the data type of the extracted rows?
- Extract the `State` of `'Harvard University'` and `'Massachusetts Institute of Technology'`
- What is the data type of what you extracted?
- What are the tuition fees for institutions in `GA`?
- What is the average tuition fee for institutions in `GA`?
- Extract the last two columns
- Extract the first five rows and the last two columns
- Sample 10% of the data using slicing
- (Challenge) In which state is the institution with the highest tuition fee located?
- (Challenge) Which state has the highest average tuition fee?

In [48]:
test_10 = data.loc[:, ['State', 'Sector name']]  # or: data[['State', 'Sector name']]
test_10

Name of institution,State,Sector name
University of Pittsburgh-Pittsburgh Campus,PA,"4-year, public"
College of William and Mary,VA,"4-year, public"
Pennsylvania State University-Main Campus,PA,"4-year, public"
Colorado School of Mines,CO,"4-year, public"
University of New Hampshire-Main Campus,NH,"4-year, public"
...,...,...
Industrial Technical College,PR,"Less than 2-year, private for-profit"
American Educational College,PR,"Less than 2-year, private for-profit"
Future-Tech Institute,FL,"Less than 2-year, private for-profit"
Rosslyn Training Academy of Cosmetology,PR,"Less than 2-year, private for-profit"


In [50]:
test_11 = data.loc['University of Georgia']
test_11

Sector                                                 1
Sector name                               4-year, public
UnitID                                            139959
OPEID                                             159800
State                                                 GA
2014-15 Tuition and fees                           10836
List A: High tuition and fee indicator                 0
List E: Low tuition and fee indicator                  0
Name: University of Georgia, dtype: object

In [51]:
test_12 = type(test_11)
test_12

pandas.core.series.Series

In [54]:
test_12 = data.loc[['Harvard University', 'Massachusetts Institute of Technology']]
test_12

Name of institution,Sector,Sector name,UnitID,OPEID,State,2014-15 Tuition and fees,List A: High tuition and fee indicator,List E: Low tuition and fee indicator
Harvard University,2,"4-year, private not-for-profit",166027,215500,MA,43938.0,0,0
Massachusetts Institute of Technology,2,"4-year, private not-for-profit",166683,217800,MA,45016.0,0,0


In [55]:
test_13 = type(test_12)
test_13

pandas.core.frame.DataFrame

In [56]:
test_14 = data.loc[['Harvard University', 'Massachusetts Institute of Technology'], 'State']
test_14

Name of institution
Harvard University                       MA
Massachusetts Institute of Technology    MA
Name: State, dtype: object

In [57]:
test_15 = type(test_14)
test_15

pandas.core.series.Series
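
The type checks above follow a pattern: what `loc` returns depends on whether each selector is a single label or a list of labels. A small sketch on a made-up frame `df` (hypothetical names and a shortened `fee` column) summarizes the cases:

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame(
    {"State": ["MA", "MA", "GA"], "fee": [43938.0, 45016.0, 10836.0]},
    index=["College A", "College B", "College C"],
)

row = df.loc["College A"]                          # one label          -> Series
rows = df.loc[["College A", "College B"]]          # list of labels     -> DataFrame
col = df.loc[["College A", "College B"], "State"]  # list + one column  -> Series
cell = df.loc["College A", "State"]                # label + column     -> scalar
one_row = df.loc[["College A"]]                    # one-element list   -> DataFrame

print(type(row).__name__, type(rows).__name__, type(col).__name__, type(one_row).__name__)
```

A single label drops that dimension, while a list keeps it, even when the list has only one element.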

In [63]:
test_16 = data.loc[data['State'] == 'GA', '2014-15 Tuition and fees']
test_16

Name of institution
Georgia Institute of Technology-Main Campus      11394.0
University of Georgia                            10836.0
Georgia College and State University              8960.0
Georgia State University                          8618.0
Georgia Regents University                        7326.0
                                                  ...   
Miller-Motte Technical College-Macon              9225.0
Interactive College of Technology-Chamblee        8930.0
Interactive College of Technology-Gainesville     8450.0
Interactive College of Technology-Morrow          8450.0
Helms College                                    15064.0
Name: 2014-15 Tuition and fees, Length: 125, dtype: float64

In [64]:
import numpy as np
test_17 = np.mean(test_16)
test_17

11543.72
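
The same average can be computed with the series method `mean`, without importing NumPy. The sketch below assumes a made-up frame with a shortened `fee` column and includes a missing value to show that `Series.mean` skips `NaN` by default:

```python
import pandas as pd

# Hypothetical frame; one GA fee is deliberately missing
df = pd.DataFrame(
    {
        "State": ["GA", "GA", "GA", "PA"],
        "fee": [10.0, 20.0, float("nan"), 99.0],
    }
)

# Boolean row mask plus column label, then the built-in mean
ga_fees = df.loc[df["State"] == "GA", "fee"]
ga_mean = ga_fees.mean()  # skipna=True is the default

print(ga_mean)
```

Passing `skipna=False` would instead propagate the missing value into the result.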

In [66]:
test_18 = data.iloc[:, -2:]
test_18

Name of institution,List A: High tuition and fee indicator,List E: Low tuition and fee indicator
University of Pittsburgh-Pittsburgh Campus,1,0
College of William and Mary,1,0
Pennsylvania State University-Main Campus,1,0
Colorado School of Mines,1,0
University of New Hampshire-Main Campus,1,0
...,...,...
Industrial Technical College,0,1
American Educational College,0,1
Future-Tech Institute,0,1
Rosslyn Training Academy of Cosmetology,0,1


In [67]:
test_19 = data.iloc[:5, -2:]
test_19

Name of institution,List A: High tuition and fee indicator,List E: Low tuition and fee indicator
University of Pittsburgh-Pittsburgh Campus,1,0
College of William and Mary,1,0
Pennsylvania State University-Main Campus,1,0
Colorado School of Mines,1,0
University of New Hampshire-Main Campus,1,0


In [68]:
test_20 = data.iloc[::10]
test_20

Name of institution,Sector,Sector name,UnitID,OPEID,State,2014-15 Tuition and fees,List A: High tuition and fee indicator,List E: Low tuition and fee indicator
University of Pittsburgh-Pittsburgh Campus,1,"4-year, public",215293,337900,PA,17772.0,1,0
University of Illinois at Urbana-Champaign,1,"4-year, public",145637,177500,IL,15020.0,1,0
Pennsylvania State University-Penn State Brandywine,1,"4-year, public",214731,332921,PA,13942.0,1,0
University of New Hampshire at Manchester,1,"4-year, public",183071,258901,NH,13768.0,1,0
Pennsylvania State University-Penn State DuBois,1,"4-year, public",214740,332907,PA,13528.0,0,0
...,...,...,...,...,...,...,...,...
The English Center,8,"Less than 2-year, private not-for-profit",437653,3413400,CA,9705.0,0,0
Dorsey Business Schools-Roseville,9,"Less than 2-year, private for-profit",250744,469204,MI,19968.0,1,0
Ohio Center for Broadcasting-Columbus,9,"Less than 2-year, private for-profit",453756,3068202,OH,15557.0,0,0
Cortiva Institute-New Jersey,9,"Less than 2-year, private for-profit",460640,3766301,NJ,9884.0,0,0


In [72]:
test_21 = data.sort_values('2014-15 Tuition and fees', ascending=False).iloc[0]['State']
test_21

'FL'
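
Sorting the whole frame is not strictly needed to find one maximum. A possible alternative, sketched on a made-up frame with hypothetical names and unique index labels, uses `idxmax` to get the row label and then looks up the state:

```python
import pandas as pd

# Hypothetical frame with unique row labels
df = pd.DataFrame(
    {"State": ["FL", "NY", "GA"], "fee": [74787.0, 51008.0, 10836.0]},
    index=["College A", "College B", "College C"],
)

top_name = df["fee"].idxmax()          # label of the row with the largest fee
top_state = df.loc[top_name, "State"]  # scalar lookup by row and column label

# The sort-based approach from above, for comparison
via_sort = df.sort_values("fee", ascending=False).iloc[0]["State"]

print(top_name, top_state)
```

If the index contained duplicate names, `df.loc[top_name, "State"]` would return a series instead of a single value, so the sort-based version is the safer choice in that case.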

In [85]:
states = set(data['State'])
state_tuitions = {
    state: np.mean(data.loc[data['State'] == state, '2014-15 Tuition and fees'])
    for state in states
}

In [84]:
test_22 = max(state_tuitions, key=lambda key: state_tuitions[key])
test_22

'RI'
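
The per-state dictionary can also be expressed with `groupby`, which computes all group means in one pass. This is a sketch on a made-up six-row frame (shortened `fee` column, invented values), shown next to the dictionary approach for comparison:

```python
import pandas as pd

# Hypothetical fees for three states
df = pd.DataFrame(
    {
        "State": ["RI", "RI", "GA", "GA", "FL", "FL"],
        "fee": [30000.0, 20000.0, 10000.0, 12000.0, 40000.0, 5000.0],
    }
)

# Mean fee per state, then the label of the largest mean
state_means = df.groupby("State")["fee"].mean()
top_state = state_means.idxmax()

# Dictionary-based version, mirroring the cells above
by_dict = {s: df.loc[df["State"] == s, "fee"].mean() for s in set(df["State"])}

print(state_means)
print(top_state)
```

The result `state_means` is a series indexed by state, so it can be sorted or sliced with the same tools used throughout this notebook.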