In [3]:
import numpy as np
import pandas as pd

## Comparison UFuncs as Array Masks

Here is a brief review of what we've covered in the last lecture:

1. We saw how the comparison ufuncs (`np.equal`, `np.less`, `np.greater`, etc.) generate boolean arrays that indicate whether each element of an array meets (or doesn't meet) the condition of the function.

1. We then showed how you could pass these boolean arrays to `np.sum`, `np.all`, and `np.any` to derive additional information about your data set.

1. Finally, we demonstrated how you could logically compare two boolean arrays with the **bitwise** operators to perform multistep data comparisons.

For the last segment of this tutorial, we are going to demonstrate using comparison functions to return the original items of the array that is being evaluated instead of a boolean array.

#### Array Masking
In the last lecture, we showed how you could select data from an array using index or slice notation. Here we will introduce another data selection technique called **masking**.

Basically, it looks a lot like slice notation. In case you've forgotten what that looks like, here is a reminder.

In [3]:
simple_int_array = np.array([5, 3, 4, 9, 8, 2, 1, 7, 6, 0])
simple_int_array

array([5, 3, 4, 9, 8, 2, 1, 7, 6, 0])

In [4]:
# Slice elements indexed 5, 6, 7 of our simple_int_array
simple_int_array[5:8]

array([2, 1, 7])

The difference with a mask is that instead of putting `[start:stop:step]` inside the brackets, you actually invoke a comparison function.

In [5]:
simple_int_array < 7

array([ True,  True,  True, False, False,  True,  True, False,  True,
        True])

In [6]:
# Let's return all the values of simple_int_array that are less than 7
# We will create a mask using comparison ufuncs and pass that back into the array

print(simple_int_array)
mask = simple_int_array < 7
print(mask)

# pass the array of True/False values into the source array
# Think of the True/False to be the answer to "Do you want to include this element?"
simple_int_array[mask]

[5 3 4 9 8 2 1 7 6 0]
[ True  True  True False False  True  True False  True  True]


array([5, 3, 4, 2, 1, 6, 0])

In [7]:
# More commonly, this task is accomplished inline
subset_array = simple_int_array[simple_int_array < 7]
subset_array

array([5, 3, 4, 2, 1, 6, 0])

**This way of masking is very important and used a lot.** The above statement works in the following way:
1. The comparison UFunc inside the brackets is evaluated first.
1. It returns a boolean array with `True` at every position where the element is less than 7 (seven positions in this example) and `False` everywhere else.
1. For each index of the boolean array with a `True` value, the element at the corresponding index of the original array is returned.
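
The three steps above can be traced explicitly. This is a minimal sketch that reuses the values of `simple_int_array` from the earlier cells; `np.flatnonzero` is used here only to list the positions where the mask is `True`.

```python
import numpy as np

simple_int_array = np.array([5, 3, 4, 9, 8, 2, 1, 7, 6, 0])

# Step 1: the comparison is evaluated first and yields a boolean array
mask = simple_int_array < 7

# Step 2: list the positions where the mask is True
true_positions = np.flatnonzero(mask)

# Step 3: the elements at those positions are returned
selected = simple_int_array[mask]

print(true_positions)   # positions kept by the mask
print(selected)         # same elements as simple_int_array[true_positions]
print(mask.sum())       # number of True values in the mask
```

Note that the `True` values are scattered through the mask; the selected elements keep their original relative order.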

In [8]:
# At this point you don't have to know the details of the following data loading.
# However, understand that it loads the weights, names, and heights of all the athletes.
roster = pd.read_csv('./data/nd-football-2018-roster.csv')  # read the file once
nd_player_weights = np.array(roster['Weight'])
nd_player_names = np.array(roster['Name'])
nd_player_heights = np.array(roster['Height'])

## Activity:

* Names of all players above 75 inches? 
* Names of all players above 220 lbs and below 250 lbs? 

**Hint**: You can use a boolean array created from one array as a mask on another array of the same length.

In [9]:
mask = nd_player_heights > 75
nd_player_names[mask]

# or 

nd_player_names[nd_player_heights > 75]



array(['Daelin Hayes', 'Phil Jurkovec', 'Jack Lamb', 'Cole Capen*',
       'Julian Okwara', 'Jack Henige*', 'Khalid Kareem',
       'Jonathan Bonner', 'Jarrett Patterson', 'John Dirksen',
       'Cole Mabry', 'Aaron Banks', 'Luke Jones', 'Alex Bars',
       'Robert Hainsey', 'Liam Eichenberg', 'Josh Lugg', 'Dillan Gibbons',
       'Tommy Kraemer', 'Micah Jones', 'Miles Boykin', 'Nic Weishar',
       'Chase Claypool', 'Cole Kmet', 'George Takacs', 'Alizé Mack',
       'Brock Wright', 'Adetokunbo Ogundeji', 'Micah Dew-Treadway',
       'Jerry Tillery'], dtype=object)

In [10]:
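# Names of all players above 220 lbs and below 250 lbs?
# Note: the lower bound used below is 200, not 220, so the output also lists
# players between 200 and 220 lbs. Use `nd_player_weights > 220` to match the prompt.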

nd_player_names[ (nd_player_weights > 200) & (nd_player_weights < 250) ]

array(['Dexter Williams', 'Jordan Genmark Heath', 'Avery Davis',
       'Houston Griffith', "Te'von Coney", 'Tony Jones Jr.',
       'Brandon Wimbush', 'Derrik Allen', 'Jafar Armstrong',
       'Donte Vaughn', 'Alohi Gilman', 'Ian Book', 'Paul Moala',
       'Phil Jurkovec', 'D.J. Morgan', 'Isaiah Robertson', 'Nolan Henry*',
       'Justin Ademilola', 'Jalen Elliott', 'Asmar Bilal',
       'Drue Tranquill', 'Tommy Tremble', 'John Mahoney*', 'Leo Albano*',
       'Temitope Agoro*', 'Ovie Oghoufo', 'Jeremiah Owusu-Koramoah',
       'Jack Lamb', 'Cole Capen*', 'Mick Assaf*', 'Shayne Simon',
       'Keenan Sweeney', 'Jahmir Smith', 'Robert Regan*',
       'Christopher Schilling*', 'Drew White', 'Julian Okwara',
       'Jamir Jones', 'Jonathan Jones', 'Matt Bushland*',
       'Jimmy Thompson*', 'Kofi Wardlow', 'Jack Henige*',
       'Brandon Hutson*', 'Devyn Spruell*', 'Bo Bauer', 'John Shannon',
       'Cody Benjamin*', 'Michael Vinson*', 'Micah Jones', 'Miles Boykin',
       'Nic Weishar'

## np.unique

Returns the sorted unique elements of an array. [More info](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html)

In [11]:
sample_array = np.array([1,2,2,1,2,3,2,23,2,1,3,2])
print(np.unique(sample_array))

[ 1  2  3 23]
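
A common follow-up question is how often each unique value occurs. As a small sketch, `np.unique` accepts a `return_counts=True` argument that returns the counts alongside the sorted unique values; the array is the same `sample_array` as above.

```python
import numpy as np

sample_array = np.array([1, 2, 2, 1, 2, 3, 2, 23, 2, 1, 3, 2])

# values holds the sorted unique elements, counts how often each one appears
values, counts = np.unique(sample_array, return_counts=True)

for value, count in zip(values, counts):
    print(value, "appears", count, "time(s)")
```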


# NumPy: Broadcasting

We looked at how, with UFuncs, you can do arithmetic on a NumPy array. We also looked at how you can do arithmetic on two NumPy arrays. When it didn't work, the error message mentioned `broadcasting`.

Broadcasting is how arrays of different shapes talk to each other. We will visit this idea again later, but here is one explanation:

## Case 1

In [12]:
x = np.arange(3)
y = 5
print(x)
print(y)
print(x.shape)
print(x+y)

[0 1 2]
5
(3,)
[5 6 7]


## Case 2

In [17]:
x = np.random.randint(10, size =((3,3)))
y = np.random.randint(10, size = 3)
print("x array")
print(x)
print("y array")
print(y)

print("Their shapes are respectively")
print(x.shape)
print(y.shape)

x array
[[0 7 9]
 [5 2 6]
 [4 6 3]]
y array
[3 9 6]
Their shapes are respectively
(3, 3)
(3,)


In [18]:
x - y

array([[-3, -2,  3],
       [ 2, -7,  0],
       [ 1, -3, -3]])

## Case 3

In [19]:
x = np.random.randint(10, size=(3,1))
y = np.random.randint(10, size = 3)
print("x  array")
print(x)
print("y array")
print(y)

print("Their shapes are respectively")
print(x.shape)
print(y.shape)

x  array
[[1]
 [1]
 [0]]
y array
[8 3 7]
Their shapes are respectively
(3, 1)
(3,)


In [20]:
x - y

array([[-7, -2, -6],
       [-7, -2, -6],
       [-8, -3, -7]])

## Case 4

In [21]:
x = np.random.randint(10, size =((3,3)))
y = np.random.randint(10, size = (3,1))
print("x array")
print(x)
print("y array")
print(y)

print("Their shapes are respectively")
print(x.shape)
print(y.shape)

x array
[[0 5 3]
 [8 7 6]
 [2 4 6]]
y array
[[3]
 [9]
 [6]]
Their shapes are respectively
(3, 3)
(3, 1)


In [22]:
x + y

array([[ 3,  8,  6],
       [17, 16, 15],
       [ 8, 10, 12]])
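
One way to picture what happened in the cases above is to stretch both arrays to a common shape by hand and then do ordinary element-wise arithmetic. The sketch below does this for the Case 3 values using `np.broadcast` (to get the resulting shape) and `np.broadcast_to` (to stretch each array). It is an illustration of the idea, not a description of how NumPy implements it internally.

```python
import numpy as np

x = np.array([[1], [1], [0]])   # shape (3, 1), values from Case 3
y = np.array([8, 3, 7])         # shape (3,)

# Shape that NumPy would produce when combining x and y
result_shape = np.broadcast(x, y).shape

# Stretch each array to that shape (these are read-only views)
x_stretched = np.broadcast_to(x, result_shape)
y_stretched = np.broadcast_to(y, result_shape)

manual = x_stretched - y_stretched
automatic = x - y
print(result_shape)
print(np.array_equal(manual, automatic))

# Shapes that cannot be stretched to a common shape raise a ValueError
try:
    np.arange(3) + np.arange(4)
    shapes_compatible = True
except ValueError as err:
    shapes_compatible = False
    print("ValueError:", err)
```

A dimension can only be stretched when its size is 1 (or when it is missing), which is why shapes `(3,)` and `(4,)` fail.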

# NumPy: Fancy Indexing

In [23]:
simple_array = np.array([10,20,30,40,50,60])
simple_array

array([10, 20, 30, 40, 50, 60])

In [24]:
simple_array[3]

40

In [25]:
simple_array[[5,0]]
# This is equivalent to np.array([simple_array[5], simple_array[0]])

array([60, 10])

In [26]:
np.array([simple_array[5], simple_array[0]])

array([60, 10])

## Application: np.argsort()

Returns the indices that would sort an array. [More Info](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html)



In [27]:
x = np.array(['d', 'a', 'b', 'x'])
np.argsort(x)

array([1, 2, 0, 3])

In [28]:
x[np.argsort(x)]

array(['a', 'b', 'd', 'x'], dtype='<U1')
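
Because `np.argsort` returns indices rather than values, the same indices can reorder a second, parallel array through fancy indexing. The sketch below uses made-up names and weights (they are not taken from the roster file) to sort names by weight.

```python
import numpy as np

# Hypothetical parallel arrays: names[i] belongs to weights[i]
names = np.array(['Ann', 'Bob', 'Cid', 'Dee'])
weights = np.array([210, 185, 240, 198])

# Indices that would sort the weights from lightest to heaviest
order = np.argsort(weights)

# Fancy indexing with those indices reorders the names to match
names_by_weight = names[order]
heaviest_first = names[order[::-1]]

print(order)
print(names_by_weight)
print(heaviest_first)
```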

# Detour to Dictionary

In [37]:
# Two equivalent ways to create an empty dictionary
my_dictionary = dict()
print(my_dictionary)
my_dictionary = {}

{}


In [30]:
my_other_dictionary = {'key': 'value', 'key2': 'value 2'}
print(my_other_dictionary)

{'key': 'value', 'key2': 'value 2'}


In [38]:
my_dictionary['Jan'] = 1
my_dictionary['Feb'] = 2
my_dictionary['Mar'] = 3

my_dictionary

{'Jan': 1, 'Feb': 2, 'Mar': 3}

In [32]:
my_dictionary['Jan']

1

In [39]:
# You can update the dictionary through the key
my_dictionary['Jan'] = my_dictionary['Jan'] + 4

In [41]:
# A value can be replaced with one of a different type
my_dictionary['Feb'] = '1234'

In [42]:
my_dictionary

{'Jan': 5, 'Feb': '1234', 'Mar': 3}

In [43]:
# Accessing elements not in the dict
my_dictionary['Dec']

KeyError: 'Dec'

In [44]:
# To check if a key is in the dictionary
'Dec' in my_dictionary

False

In [45]:
'Mar' in my_dictionary

True
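
If you would rather not trigger a `KeyError` for a missing key, the built-in `dict.get` method returns `None` (or a default you supply) instead. A short sketch, using the same contents `my_dictionary` has at this point:

```python
my_dictionary = {'Jan': 5, 'Feb': '1234', 'Mar': 3}

# .get returns None when the key is missing instead of raising KeyError
dec_value = my_dictionary.get('Dec')

# A second argument supplies the fallback value to return
dec_or_zero = my_dictionary.get('Dec', 0)

# Combining a membership test with normal key access also works
if 'Mar' in my_dictionary:
    mar_value = my_dictionary['Mar']

print(dec_value, dec_or_zero, mar_value)
```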

## You can access the keys and values separately

In [47]:
print(my_dictionary.keys())
print(my_dictionary.values())

list(my_dictionary.keys())

dict_keys(['Jan', 'Feb', 'Mar'])
dict_values([5, '1234', 3])


['Jan', 'Feb', 'Mar']

In [48]:
np.array(list(my_dictionary.keys()))

array(['Jan', 'Feb', 'Mar'], dtype='<U3')

# Introduction to Pandas

<div class="alert alert-block alert-info">
<p> Source: Example datasets and discussion in this Jupyter Notebook were partly sourced from `Mike Dunn`, University of Notre Dame.  </p>
</div>

In [49]:
import pandas as pd
pd.__version__

'0.25.1'

## `DataFrame` & `Series` Basics

### Basics on data loading

<div class="alert alert-block alert-info">
<h5>Know your current working directory</h5>

<p>`import os`</p>
<p>`os.getcwd()`</p>

</div>

In [50]:
import os
print(os.getcwd())

/Volumes/GoogleDrive/My Drive/ITAO/Data Analytics/_current/Notebooks


<div class="alert alert-block alert-danger">
<h5>Make sure the data is in the right place. </h5>
<p> </p>
<li> Using Windows File Explorer (or Finder on Mac), open the folder location printed above by the `os.getcwd()` command</li>
<li> Make sure there is a folder named 'data' in the location you opened. If not, create one</li>
<li> Copy the downloaded data into the 'data' folder </li>

</div>

In [54]:
print(os.listdir('./data/'))

['nd-football-2018-roster.csv', 'college-scorecard-data-scrubbed.csv', 'pokedex.json']


When you specify './data/' in the commands above and below, '.' means the current directory.

The following statement therefore means: starting from the current directory (the one containing this Jupyter file), open the 'data' folder and look for the 'nd-football-2018-roster.csv' file.

In [21]:
athletes_data = pd.read_csv('./data/nd-football-2018-roster.csv')
type(athletes_data)

pandas.core.frame.DataFrame

<div class="alert alert-block alert-info">
The `type` function returns the type of the object passed to it. Very handy.
</div> 

### `DataFrames` are made up of an `index` and one or more `Series`
Inside of every frame is an **`index`** and one or more **`Series`** objects.
Let's demonstrate this by looking at the first few elements of our `athletes_data` object.

In [22]:
#DataFrame athletes_data.head() provides the first few rows of the dataset

athletes_data.head()

Unnamed: 0,Number,Name,Position,Height,Weight,Class,Hometown
0,2,Dexter Williams,RB,71,215,Sr.,"Orlando, FL/West Orange"
1,2,Jordan Genmark Heath,LB,73,225,Soph.,"San Diego, CA/Cathedral Catholic"
2,3,Avery Davis,QB,71,204,Soph.,"Cedar Hill, TX/HS"
3,3,Houston Griffith,S,72,205,Fr.,"Chicago, IL/IMG Academy (FL)"
4,4,Te'von Coney,LB,73,240,Sr.,"Palm Beach Gardens, FL/HS"


In [60]:
athletes_data.tail()

Unnamed: 0,Number,Name,Position,Height,Weight,Class,Hometown
111,94,Darnell Ewell,DT,75,330,Soph.,"Norfolk, VA/Lake Taylor"
112,95,Myron Tagovailoa-Amosa,DT,74,285,Soph.,"Ewa Beach, HI/Kapolei"
113,97,Micah Dew-Treadway,DT,76,300,Sr.,"Bolingbrook, IL/HS"
114,98,Ja'Mion Franklin,DT,73,306,Fr.,"Ridgely, MD/North Caroline"
115,99,Jerry Tillery,DT,79,305,Sr.,"Shreveport, LA/Evangel Christian"


The bold numbers running down the left-hand side are the **`index`** of the **`DataFrame`**.  The bold strings running across the top are the names of the nested **`Series`** objects.

In [61]:
name_series = athletes_data['Name']
print(type(name_series))
name_series

<class 'pandas.core.series.Series'>


0             Dexter Williams
1        Jordan Genmark Heath
2                 Avery Davis
3            Houston Griffith
4                Te'von Coney
                ...          
111             Darnell Ewell
112    Myron Tagovailoa-Amosa
113        Micah Dew-Treadway
114          Ja'Mion Franklin
115             Jerry Tillery
Name: Name, Length: 116, dtype: object

<div class="alert alert-block alert-info">
<h5>Dictionary-Like Retrieval</h5>
<p>Did you see how I passed the name of the **`Series`** object that I wanted to the `athletes_data` frame? It is the same syntax you'd use to retrieve a data element from a **`dict`**.</p>
<p>
As we continue to move along, we'll discover that **`DataFrame`** and **`dict`** types share many traits.
</p>
</div> 

### Every `Series` is made up of an index and a NumPy array
Now that we know every **`DataFrame`** is filled with **`Series`** objects, let's inspect `name_series` and dig deeper into the data structures.

In [62]:
# Let's ask for the string representation of the object.
# You can ignore the slice notation at the end,
# I just don't want to display all the names.
name_series[0:10]

0         Dexter Williams
1    Jordan Genmark Heath
2             Avery Davis
3        Houston Griffith
4            Te'von Coney
5            Kevin Austin
6          Troy Pride Jr.
7          Tony Jones Jr.
8         Brandon Wimbush
9            Derrik Allen
Name: Name, dtype: object

So, as you can see, we've got two columns here.  
* The first column is the **`index`**.
* The second column, which holds the values of the series, is nothing more than our good friend, the NumPy array.

You can retrieve the index and NumPy array separately from a series as follows:

In [63]:
# Get the Series index
name_series.index

RangeIndex(start=0, stop=116, step=1)

In [64]:
name_series.values

array(['Dexter Williams', 'Jordan Genmark Heath', 'Avery Davis',
       'Houston Griffith', "Te'von Coney", 'Kevin Austin',
       'Troy Pride Jr.', 'Tony Jones Jr.', 'Brandon Wimbush',
       'Derrik Allen', 'Jafar Armstrong', 'Donte Vaughn', 'Daelin Hayes',
       'Patrick Pelini*', 'Chris Finke', 'Alohi Gilman', 'Ian Book',
       'D.J. Brown', 'Lawrence Keys', 'Paul Moala', 'Devin Studstill',
       'J.D. Carney*', 'Phil Jurkovec', 'D.J. Morgan', 'Noah Boykin',
       'Isaiah Robertson', 'Nolan Henry*', 'Joe Wilkins Jr.',
       'Cameron Ekanayake*', 'Justin Yoon', 'Justin Ademilola',
       "C'Borius Flemister", 'Shaun Crawford', 'Jalen Elliott',
       'Asmar Bilal', 'Drue Tranquill', 'Tommy Tremble', 'Nick Coleman',
       'Braden Lenzy', 'John Mahoney*', 'Leo Albano*', 'Temitope Agoro*',
       'Julian Love', 'Arion Shinaver*', 'Nicco Fertitta', 'Ovie Oghoufo',
       'Matt Salerno', 'Jeremiah Owusu-Koramoah', 'Jake Rittman*',
       'Jack Lamb', 'Cole Capen*', 'Mick Assaf*', '

## Going a Bit Deeper
The essential difference between a NumPy **`ndarray`** and a Pandas **`Series`** object is the index.

**NumPy arrays have indexes as well, but they are implicit and always integers**. You can't access an array's **`index`** property directly like you can on a Pandas series object as we did above.

Furthermore, series objects are not limited to having integer based indexes. You could have indexes of strings, floats, booleans, dates, etc. 

In [65]:
# Create a `Series` object from a dictionary
# This results in a string based index.

sample_dict = {
    'R':'Not as cool. ;-(',
    'Python': 'Best Language Ever!',
    'C': 'Fundamental language',
    'Julia': 'A New language for Data Science'
    }

simple_series = pd.Series(sample_dict)
print(simple_series.index)
print(simple_series.values)

Index(['R', 'Python', 'C', 'Julia'], dtype='object')
['Not as cool. ;-(' 'Best Language Ever!' 'Fundamental language'
 'A New language for Data Science']


In [66]:
simple_series

R                        Not as cool. ;-(
Python                Best Language Ever!
C                    Fundamental language
Julia     A New language for Data Science
dtype: object

In [67]:
simple_series['Python']

'Best Language Ever!'

In [71]:
simple_series['Python':'Julia']

Python                Best Language Ever!
C                    Fundamental language
Julia     A New language for Data Science
dtype: object

When we loaded our `athletes_data` frame from the CSV file, it generated an integer based index, which is the default behavior.

But we could change that.  For instance, we could make the player names the index values:

In [72]:
athletes_data.head()

Unnamed: 0,Number,Name,Position,Height,Weight,Class,Hometown
0,2,Dexter Williams,RB,71,215,Sr.,"Orlando, FL/West Orange"
1,2,Jordan Genmark Heath,LB,73,225,Soph.,"San Diego, CA/Cathedral Catholic"
2,3,Avery Davis,QB,71,204,Soph.,"Cedar Hill, TX/HS"
3,3,Houston Griffith,S,72,205,Fr.,"Chicago, IL/IMG Academy (FL)"
4,4,Te'von Coney,LB,73,240,Sr.,"Palm Beach Gardens, FL/HS"


In [73]:
athletes_data.index = athletes_data['Name']

athletes_data.head()

Name,Number,Name,Position,Height,Weight,Class,Hometown
Dexter Williams,2,Dexter Williams,RB,71,215,Sr.,"Orlando, FL/West Orange"
Jordan Genmark Heath,2,Jordan Genmark Heath,LB,73,225,Soph.,"San Diego, CA/Cathedral Catholic"
Avery Davis,3,Avery Davis,QB,71,204,Soph.,"Cedar Hill, TX/HS"
Houston Griffith,3,Houston Griffith,S,72,205,Fr.,"Chicago, IL/IMG Academy (FL)"
Te'von Coney,4,Te'von Coney,LB,73,240,Sr.,"Palm Beach Gardens, FL/HS"


In [74]:
athletes_data2 = pd.read_csv('./data/nd-football-2018-roster.csv')
athletes_data2.set_index('Name', inplace=True)
athletes_data2.head()

Name,Number,Position,Height,Weight,Class,Hometown
Dexter Williams,2,RB,71,215,Sr.,"Orlando, FL/West Orange"
Jordan Genmark Heath,2,LB,73,225,Soph.,"San Diego, CA/Cathedral Catholic"
Avery Davis,3,QB,71,204,Soph.,"Cedar Hill, TX/HS"
Houston Griffith,3,S,72,205,Fr.,"Chicago, IL/IMG Academy (FL)"
Te'von Coney,4,LB,73,240,Sr.,"Palm Beach Gardens, FL/HS"
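
The two approaches above differ in one detail worth noticing: assigning to `.index` leaves the `Name` column in the frame, while `set_index` moves it into the index. The sketch below shows this on a tiny made-up frame (the player names are placeholders, not roster data).

```python
import pandas as pd

# Hypothetical two-row roster, used only for illustration
roster = pd.DataFrame({'Name': ['Player A', 'Player B'],
                       'Weight': [215, 225]})

# Approach 1: assign a column to .index; the column stays in the frame
kept = roster.copy()
kept.index = kept['Name']

# Approach 2: set_index moves the column into the index (drop=True by default)
moved = roster.set_index('Name')

print(list(kept.columns))
print(list(moved.columns))
print(moved.loc['Player B', 'Weight'])
```

Without `inplace=True`, `set_index` returns a new `DataFrame` and leaves the original unchanged, which is why the result is assigned to `moved` here.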


# Data Indexing and Selection

You'll find that many of the techniques we used with NumPy arrays are also available for `Series` and `DataFrame` objects. In addition, these objects offer functionality that will be very familiar to anyone who has experience with Python dictionaries.

In [76]:
# college_scorecard = pd.read_csv(
#     './data/college-scorecard-data-scrubbed.csv')
# college_scorecard.head()

### Encoding

Text files are encoded in different formats when they are written. To read them, you must decode them with the same standard or you'll have a problem.

For example, our `college-scorecard-data-scrubbed.csv` file was encoded using `latin-1`, but the default setting for Pandas in Python 3 is `utf-8`, so we will get an error if we try to read the file without specifying the correct encoding. We specify it like so:

In [77]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')
college_scorecard.head()

Unnamed: 0,UNITID,OPEID,OPEID6,institution_name,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
0,102580,884300,8843,Alaska Bible College,Palmer,AK,www.akbible.edu/,3,Bachelors,2,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
1,103501,2541000,25410,Alaska Career College,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
2,442523,4138600,41386,Alaska Christian College,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
3,102669,106100,1061,Alaska Pacific University,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
4,102711,3160300,31603,AVTEC-Alaska's Institute of Technology,Seward,AK,www.avtec.edu/,1,Certificate,1,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,
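
To see why the encoding matters, here is a minimal, self-contained sketch that does not need the data file. The CSV text is invented for illustration: it is encoded as `latin-1`, shown to fail when decoded as `utf-8`, and then read correctly by passing `encoding='latin-1'` to `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up CSV content containing accented characters, stored as latin-1 bytes
raw_bytes = "institution_name,city\nUniversité Exemple,Montréal\n".encode("latin-1")

# Decoding those bytes with the wrong standard fails
try:
    raw_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as err:
    decoded_as_utf8 = False
    print("Could not decode as utf-8:", err.reason)

# Telling pandas the correct encoding reads the text as intended
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df.loc[0, "city"])
```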


In [4]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1', 
    index_col='institution_name')
college_scorecard.head()

institution_name,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,


**NOTE**: In the last `read_csv` call, we use the `index_col` keyword argument to name the column we want to use as the index. This is a common way of loading a CSV file when we want a specific column to be the index.

## Selecting Data from `Series` Objects

Let's start by grabbing the `url` series object out of our data frame:

In [79]:
url_series = college_scorecard['url']
url_series.head()

institution_name
Alaska Bible College                                 www.akbible.edu/
Alaska Career College                     www.alaskacareercollege.edu
Alaska Christian College                             www.alaskacc.edu
Alaska Pacific University                       www.alaskapacific.edu
AVTEC-Alaska's Institute of Technology                 www.avtec.edu/
Name: url, dtype: object

As a reminder, a `Series` object consists of an explicit index and the values. **Notice here that our `Series` object inherits the 'institution_name' column values as its index from the `DataFrame`.**

### Dictionary Like Features

Several of the methods available on Python **`dict`** objects are also available on `Series` objects. This is possible because Pandas maintains a mapping between the explicit index elements and the `Series` values, just as standard Python does between the keys and values of a dictionary.

#### Membership Testing with `in`
You can determine if a given **index** exists in a `Series` using the `in` keyword:


In [82]:
'University of Notre Dame' in url_series

True

#### Value Retrieval via Index "Key"
You can retrieve a value from the `Series` by passing it the index "key" you are interested in.

In [83]:
url_series['University of Notre Dame']

'www.nd.edu'

In [84]:
url_series['Carnegie Mellon University']

'www.cmu.edu/'

### Array Like Features
Now we will explore some of the array like features of `Series` objects. Most of this will be familiar given what you already know about NumPy arrays, so we will move quickly.

#### Slicing with Explicit Indexes & Implicit Indexes
Slicing is pretty straightforward with NumPy arrays because of their implicit integer-based indexes. It gets a little more complicated with `Series` objects because the explicit index isn't necessarily integer based.

Just like a normal slice, you specify two elements that mark the start and end of what is returned. The difference here is that you can specify the actual index element names/keys instead of numbers.

Here we will ask for all the listings from Stanford to Notre Dame.

In [85]:
url_series['Stanford University':'University of Notre Dame']

institution_name
Stanford University                                         www.stanford.edu/
Starr King School for the Ministry                               www.sksm.edu
SUM Bible College and Theological Seminary                        www.sum.edu
Summit College                                          www.summitcollege.edu
Sutter Beauty College                                 sutterbeautycollege.com
                                                               ...           
Trine University                                                www.trine.edu
Trine University-Regional/Non-Traditional Campuses              www.trine.edu
University of Evansville                                  www.evansville.edu/
University of Indianapolis                                          uindy.edu
University of Notre Dame                                           www.nd.edu
Name: url, Length: 1530, dtype: object

<div class="alert alert-block alert-info">
<p>
It is important to note that the reverse request, `url_series['University of Notre Dame': 'Stanford University']` would have yielded no results.
</p>
<p>
This is because 'University of Notre Dame' appears after 'Stanford' in the CSV file. Remember that technically, the first item in a slice notation is the 'start' and the second is the 'end'. It is important that you have them in the right order.
</p>
</div> 

<div class="alert alert-block alert-danger">
<h5>Warning: Important Distinction</h5>
<p>
In a NumPy array slice (or when using an implicit index), the 'end' value of the slice notation is not included in the return slice.
</p>
<p>
Strangely, when using a slice with an explicit index - the end value is included. Be careful about this as you could end up with an extra record in your slices that you don't want.
</p>
</div> 
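
The two notes above (order matters, and the end label is included) can be checked on a small example. This is a sketch with an invented four-element `Series`; `.iloc` is used for the positional slice so that the contrast with the label slice is unambiguous.

```python
import pandas as pd

# Hypothetical Series with a string index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'd' IS included
by_label = s['b':'d']

# Position-based slice: the end position 3 is NOT included
by_position = s.iloc[1:3]

# Labels given in the reverse of their order in the index return nothing
reversed_slice = s['d':'b']

print(list(by_label.index))
print(list(by_position.index))
print(len(reversed_slice))
```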

##### The Implicit Index Lurking in the Shadows

While every `Series` object has an explicit index, an implicit positional index is also always available. Because of this, you can continue to use "normal" slice notation on `Series` objects with non-integer-based explicit indexes.

Here are a couple of examples.

In [86]:
# Using "normal" slice notations on our `url_series`
# First ten elements
url_series[0:10]

institution_name
Alaska Bible College                                 www.akbible.edu/
Alaska Career College                     www.alaskacareercollege.edu
Alaska Christian College                             www.alaskacc.edu
Alaska Pacific University                       www.alaskapacific.edu
AVTEC-Alaska's Institute of Technology                 www.avtec.edu/
Charter College-Anchorage                      www.chartercollege.edu
Ilisagvik College                                   www.ilisagvik.edu
University of Alaska Anchorage                     www.uaa.alaska.edu
University of Alaska Fairbanks                            www.uaf.edu
University of Alaska Southeast                     www.uas.alaska.edu
Name: url, dtype: object

<div class="alert alert-block alert-danger">
<h5>Important Warning! Implicit vs. Explicit indexing</h5>
<p>
A confusing situation arises when you have a `Series` with an explicit integer index that doesn't start at 0 and increment by 1 for each element.
</p>

<p>
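Slice notation gets convoluted in this case, and you have to use the **special indexer attributes (`.loc`, `.iloc`, `.ix`) that are discussed in your textbook on pages 109-110** to keep things straight. Note that `.ix` is deprecated and was removed in pandas 1.0, so prefer `.loc` (labels) and `.iloc` (positions).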
</p>
</div> 
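
As a small illustration of the warning above, consider an invented `Series` whose explicit integer index is 1, 3, 5. With `.loc` the numbers are treated as labels, and with `.iloc` they are treated as positions, so the same number can return different elements.

```python
import pandas as pd

# Hypothetical Series whose explicit integer index does not start at 0
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]        # label 1  -> first element
by_position = s.iloc[1]    # position 1 -> second element

label_slice = s.loc[1:3]      # labels 1 through 3, end label included
position_slice = s.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label, by_position)
print(list(label_slice))
print(list(position_slice))
```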

#### Series Masking
You can do masking on `Series` objects in the same way you did so with NumPy Arrays. Review the Jupyter Notebook for Sept 19th for more information on masking using NumPy

Here are a couple of examples:

In [87]:
# Let's get a new Series object with numeric data on SAT average scores.
sat_average_series = college_scorecard['sat_average']

In [88]:
sat_average_series.head()

institution_name
Alaska Bible College                         NaN
Alaska Career College                        NaN
Alaska Christian College                     NaN
Alaska Pacific University                 1054.0
AVTEC-Alaska's Institute of Technology       NaN
Name: sat_average, dtype: float64

In [89]:
sat_average_series > 1300

institution_name
Alaska Bible College                      False
Alaska Career College                     False
Alaska Christian College                  False
Alaska Pacific University                 False
AVTEC-Alaska's Institute of Technology    False
                                          ...  
Northwest College                         False
Sheridan College                          False
University of Wyoming                     False
Western Wyoming Community College         False
Wyotech-Laramie                           False
Name: sat_average, Length: 7282, dtype: bool

In [90]:
# Return schools with SAT averages over 1300
sat_average_series[sat_average_series > 1300]

institution_name
California Institute of Technology    1545.0
Claremont McKenna College             1419.0
Harvey Mudd College                   1500.0
Pomona College                        1454.0
Santa Clara University                1309.0
                                       ...  
University of Richmond                1337.0
University of Virginia-Main Campus    1357.0
Washington and Lee University         1395.0
Middlebury College                    1379.0
Whitman College                       1323.0
Name: sat_average, Length: 82, dtype: float64
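
Notice that schools with a missing (`NaN`) SAT average show up as `False` in the mask, so they are silently left out of the result. The sketch below demonstrates this on a tiny invented `Series` (the school names are placeholders).

```python
import numpy as np
import pandas as pd

# Hypothetical SAT averages, one of them missing
sample_sat = pd.Series([np.nan, 1054.0, 1545.0],
                       index=['School A', 'School B', 'School C'])

# Any comparison against NaN evaluates to False
mask = sample_sat > 1300
print(mask.tolist())

# Only rows where the mask is True are returned
print(sample_sat[mask])

# isna() finds the rows that no comparison will ever select
print(sample_sat[sample_sat.isna()].index.tolist())
```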

## Activity:

1. Which schools have SAT averages between 1400 and 1500?
1. Is 'University of Notre Dame' one of the schools?
1. How about 'Harvard University'?
1. If not, what is Harvard's SAT average?


In [100]:
sats_between = sat_average_series[ (sat_average_series > 1400) & (sat_average_series < 1500) ]
sats_between[:10]

"University of Notre Dame" in sats_between
# "Harvard University" in sats_between

# sat_average_series['Harvard University']

# "University of Michigan-Ann Arbor" in sats_between

True

## Selecting Data from `DataFrame` Objects

Similar to what we found with `Series` objects, you can interact with `DataFrame` objects in ways that sometimes resemble a dictionary and other times a NumPy array.

### Dictionary Like Features


In [5]:
college_scorecard.head()

institution_name,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,


In [6]:
# You can retrieve an individual Series from a DataFrame
# by passing the Series name/key to the DataFrame
college_scorecard['religious_affiliation_desc'][:10]

institution_name
Alaska Bible College                                            Undenominational
Alaska Career College                                             Not applicable
Alaska Christian College                  Evangelical Covenant Church of America
Alaska Pacific University                                       United Methodist
AVTEC-Alaska's Institute of Technology                            Not applicable
Charter College-Anchorage                                         Not applicable
Ilisagvik College                                                 Not applicable
University of Alaska Anchorage                                    Not applicable
University of Alaska Fairbanks                                    Not applicable
University of Alaska Southeast                                    Not applicable
Name: religious_affiliation_desc, dtype: object

In [7]:
# Test for the existence of a given column/Series in a DataFrame
'city' in college_scorecard

True

In [8]:
college_scorecard['city']


institution_name
Alaska Bible College                            Palmer
Alaska Career College                        Anchorage
Alaska Christian College                      Soldotna
Alaska Pacific University                    Anchorage
AVTEC-Alaska's Institute of Technology          Seward
                                              ...     
Northwest College                               Powell
Sheridan College                              Sheridan
University of Wyoming                          Laramie
Western Wyoming Community College         Rock Springs
Wyotech-Laramie                                Laramie
Name: city, Length: 7282, dtype: object

<div class="alert alert-block alert-warning">
<p> Note the distinction between using the `in` operator on a `Series` and on a `DataFrame`. On a `Series`, it checks whether the value is present in the index, whereas on a `DataFrame` it checks whether the value is present in the columns.</p>
</div>
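
Here is a minimal sketch of that difference, using a small made-up `DataFrame` (the names and values are placeholders, not taken from the College Scorecard data):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {'city': ['Springfield', 'Shelbyville'], 'state': ['AA', 'BB']},
    index=['School A', 'School B'],
)

# On a DataFrame, `in` tests the column labels
city_is_column = 'city' in toy
school_is_column = 'School A' in toy

# On a Series, `in` tests the index labels, not the stored values
city_series = toy['city']
school_in_series = 'School A' in city_series
value_in_series = 'Springfield' in city_series

# To test the stored values, compare and reduce with any()
value_found = (city_series == 'Springfield').any()

print(city_is_column, school_is_column)
print(school_in_series, value_in_series, value_found)
```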

### Array-Like Features

#### Slicing (Explicit Index)
Slicing affects rows, not columns, in a `DataFrame`. In other words, you can slice based on the index values, but not the column values. Let's get a slice of all rows from 'Alaska Bible College' to 'Alabama State University':

In [9]:
college_scorecard['Alaska Bible College':'Alabama State University']

Unnamed: 0_level_0,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
institution_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500,PrivacySuppressed,0.846055789,
Charter College-Anchorage,102845,2576900,25769,Anchorage,AK,www.chartercollege.edu,1,Certificate,3,PrivateForProfit,...,0.8307,,,,,0.7503,39200,13875,,0.400148336
Ilisagvik College,434584,3461300,34613,Barrow,AK,www.ilisagvik.edu,1,Certificate,1,Public,...,0.1323,,0.8095,,0.3333,0.0,24900,PrivacySuppressed,0.340906818,
University of Alaska Anchorage,102553,1146200,11462,Anchorage,AK,www.uaa.alaska.edu,3,Bachelors,1,Public,...,0.2385,0.7164,,0.4549,,0.2647,42500,19449.5,,0.252541205
University of Alaska Fairbanks,102614,106300,1063,Fairbanks,AK,www.uaf.edu,3,Bachelors,1,Public,...,0.2263,0.7756,,0.4857,,0.255,36200,19355,,0.315570823
University of Alaska Southeast,102632,106500,1065,Juneau,AK,www.uas.alaska.edu,1,Certificate,1,Public,...,0.1769,0.7167,,0.6364,,0.1996,37400,16875,,0.156750746


<div class="alert alert-block alert-info">
<p>
You can, however, use the `loc` and `iloc` indexers to slice based on columns.  **You can look into this on pages 113-114 of your textbook if you are interested.**</p>
</div> 
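
In case you don't have the textbook handy, here is a short sketch of column slicing with `loc` and `iloc`. The `DataFrame` is a made-up placeholder; only the indexing syntax matters.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        'city': ['Springfield', 'Shelbyville', 'Ogdenville'],
        'state': ['AA', 'BB', 'CC'],
        'degree': ['Bachelors', 'Certificate', 'Bachelors'],
    },
    index=['School A', 'School B', 'School C'],
)

# loc slices by label; both endpoints of the slice are included.
# The bare ":" before the comma means "all rows".
by_label = toy.loc[:, 'city':'state']

# iloc slices by integer position; the end position is excluded
by_position = toy.iloc[:, 0:2]

print(by_label)
print(by_position)
```

Both expressions select the `city` and `state` columns; they differ only in whether you name the columns or count them.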

#### Slicing (implicit index)
You can also rely on the implicit integer index of the `DataFrame` (yes, it has one too) to retrieve rows by the numeric index.

**Just remember, the 'end' value of the slice is not included when using the implicit index.**

In [10]:
college_scorecard[0:5]

Unnamed: 0_level_0,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
institution_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,
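
To make the end-value rule concrete, here is a small sketch comparing the two kinds of row slices on a made-up `DataFrame` (labels and values are placeholders). A slice on the explicit labels includes its end label, while a slice on the implicit integer positions stops just before the end.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Explicit (label) slice: the end label 'c' is included
label_slice = toy['a':'c']

# Implicit (position) slice: position 2 is excluded
position_slice = toy[0:2]

print(label_slice)
print(position_slice)
```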


#### Masking

Masking operations likewise return rows from a `DataFrame`, but the **criteria of the masks will be a comparison on one of the columns/Series**. This sounds more confusing than it is, so let's just demonstrate:

In [11]:
# Return all rows where the 'state' Series has a value of 'AK'
college_scorecard[ college_scorecard['state'] == 'AK' ]

Unnamed: 0_level_0,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
institution_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,
Charter College-Anchorage,102845,2576900,25769,Anchorage,AK,www.chartercollege.edu,1,Certificate,3,PrivateForProfit,...,0.8307,,,,,0.7503,39200.0,13875,,0.400148336
Ilisagvik College,434584,3461300,34613,Barrow,AK,www.ilisagvik.edu,1,Certificate,1,Public,...,0.1323,,0.8095,,0.3333,0.0,24900.0,PrivacySuppressed,0.340906818,
University of Alaska Anchorage,102553,1146200,11462,Anchorage,AK,www.uaa.alaska.edu,3,Bachelors,1,Public,...,0.2385,0.7164,,0.4549,,0.2647,42500.0,19449.5,,0.252541205
University of Alaska Fairbanks,102614,106300,1063,Fairbanks,AK,www.uaf.edu,3,Bachelors,1,Public,...,0.2263,0.7756,,0.4857,,0.255,36200.0,19355,,0.315570823
University of Alaska Southeast,102632,106500,1065,Juneau,AK,www.uas.alaska.edu,1,Certificate,1,Public,...,0.1769,0.7167,,0.6364,,0.1996,37400.0,16875,,0.156750746


In [12]:
# Which colleges in IN offer Bachelors degrees?
# Again, notice the parentheses here
college_scorecard[ (college_scorecard['state'] == 'IN') & (college_scorecard['predominant_degree_desc'] == 'Bachelors')]



Unnamed: 0_level_0,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
institution_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anderson University,150066,178500,1785,Anderson,IN,www.anderson.edu,3,Bachelors,2,PrivateNonProfit,...,0.2118,0.7465,,0.5,,0.4688,35600.0,27000,,0.599595878
Ball State University,150136,178600,1786,Muncie,IN,www.bsu.edu,3,Bachelors,1,Public,...,0.3399,0.8141,,0.5,,0.5917,38800.0,25000,,0.594160686
Bethel College-Indiana,150145,178700,1787,Mishawaka,IN,www.bethelcollege.edu,3,Bachelors,2,PrivateNonProfit,...,0.5106,0.8122,,0.0,,0.7631,34900.0,PrivacySuppressed,,0.672081466
Butler University,150163,178800,1788,Indianapolis,IN,www.butler.edu,3,Bachelors,2,PrivateNonProfit,...,0.1649,0.9014,,0.0,,0.5742,55000.0,27000,,0.757950963
Calumet College of Saint Joseph,150172,183400,1834,Whiting,IN,www.ccsj.edu,3,Bachelors,2,PrivateNonProfit,...,0.4351,0.5772,,0.3684,,0.568,38900.0,20293.5,,0.274751351
Chamberlain College of Nursing-Indiana,475741,638510,6385,Indianapolis,IN,www.chamberlain.edu,3,Bachelors,3,PrivateForProfit,...,0.4843,0.75,,,,0.8931,52600.0,24581,,
DePauw University,150400,179200,1792,Greencastle,IN,www.depauw.edu,3,Bachelors,2,PrivateNonProfit,...,0.1944,0.9267,,,,0.5551,47700.0,25000,,0.796709494
DeVry University-Indiana,482486,1072747,10727,Merrillville,IN,www.devry.edu,3,Bachelors,3,PrivateForProfit,...,0.5817,0.7778,,0.4167,,0.8213,,40150,,PrivacySuppressed
Earlham College,150455,179300,1793,Richmond,IN,www.earlham.edu,3,Bachelors,2,PrivateNonProfit,...,0.2829,0.8419,,,,0.5414,33400.0,26840,,0.712695987
Franklin College,150604,179800,1798,Franklin,IN,www.franklincollege.edu,3,Bachelors,2,PrivateNonProfit,...,0.3895,0.7941,,,,0.7722,40800.0,27000,,0.590159016
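
If it helps to see what is happening inside the brackets, here is a sketch that builds the mask step by step on a made-up `DataFrame` (names and values are placeholders). Each comparison produces a boolean `Series`, and `&` combines them row by row. The parentheses matter when the comparisons are written inline because Python evaluates `&` before `==`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        'state': ['IN', 'IN', 'AK'],
        'degree': ['Bachelors', 'Certificate', 'Bachelors'],
    },
    index=['School A', 'School B', 'School C'],
)

# Each comparison yields a boolean Series aligned to the row index
in_indiana = toy['state'] == 'IN'
bachelors = toy['degree'] == 'Bachelors'

# Combine the two masks element by element
combined = in_indiana & bachelors

# Passing the mask back into the DataFrame keeps only the True rows
selected = toy[combined]

print(combined)
print(selected)
```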


### Selecting Multiple Columns of DataFrame

In [16]:
two_columns = college_scorecard[ ['state', 'predominant_degree_desc'] ][:3]

# two_columns.head()

**NOTE**: There are two sets of square brackets `[[ ]]` here. The outer set is the indexing operator that selects from the `DataFrame`; the inner set creates a Python list of the column names you want to select.
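
A quick sketch of what the bracket count changes, on a made-up `DataFrame` (placeholder values): a single column name returns a `Series`, while a list of names returns a `DataFrame`, even if the list holds only one name.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        'city': ['Springfield', 'Shelbyville'],
        'state': ['AA', 'BB'],
        'degree': ['Bachelors', 'Certificate'],
    },
    index=['School A', 'School B'],
)

# A single label returns a Series
one_column = toy['state']

# A list with one label returns a one-column DataFrame
one_column_frame = toy[['state']]

# The inner list can also be built separately and passed in
columns_wanted = ['state', 'degree']
two_columns = toy[columns_wanted]

print(type(one_column).__name__, type(one_column_frame).__name__)
print(two_columns)
```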

## Activity On Football Athletes Data

1. What are the details of the players who are in the freshman class?
1. What are the details of the players whose position is offensive lineman (OL) and who are in their last year?
1. What is the average height of players whose position is Linebacker (LB) and who are Sophomores?

### Here is a tip to remember: use `unique()` to get a list of options from a dataset

In [24]:
print(athletes_data['Class'].unique())
print(athletes_data['Position'].unique())

['Sr.' 'Soph.' 'Fr.' 'Jr.' '5th']
['RB' 'LB' 'QB' 'S' 'WR' 'CB' 'RB/WR' 'DE' 'WR/DB' 'K' 'Rov' 'TE' 'P' 'FB'
 'P/K' 'CB/WR' 'DT' 'LS' 'OL']
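
As a quick sketch of what `unique()` gives you, here it is on a made-up column (the values are placeholders). I've also included `value_counts()`, a related call that is not part of the activity, in case you want to know how often each option occurs.

```python
import pandas as pd

# Hypothetical column of class labels, for illustration only
toy_classes = pd.Series(['Sr.', 'Fr.', 'Sr.', 'Jr.', 'Fr.', 'Sr.'])

# unique() returns each distinct value once, in order of first appearance
distinct = list(toy_classes.unique())

# value_counts() returns how many times each value occurs
counts = toy_classes.value_counts()

print(distinct)
print(counts)
```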


In [27]:
athletes_data[athletes_data['Class'] == 'Fr.'][:5]

Unnamed: 0,Number,Name,Position,Height,Weight,Class,Hometown
3,3,Houston Griffith,S,72,205,Fr.,"Chicago, IL/IMG Academy (FL)"
5,4,Kevin Austin,WR,74,197,Fr.,"Ft. Lauderdale, FL/North Broward Prep"
9,7,Derrik Allen,S,73,213,Fr.,"Marietta, GA/Lassiter"
17,12,D.J. Brown,CB,72,191,Fr.,"Crownsville, MD/St. John's College"
18,13,Lawrence Keys,WR,70,170,Fr.,"New Orleans, LA/McDonogh 35"


In [32]:
athletes_data[(athletes_data['Position'] == 'OL') & ((athletes_data['Class'] == 'Sr.') | (athletes_data['Class'] == '5th') )]

Unnamed: 0,Number,Name,Position,Height,Weight,Class,Hometown
78,53,Sam Mustipher,OL,74,306,5th,"Olney, MD/Good Counsel"
84,57,Trevor Ruhland,OL,75,295,Sr.,"Cary, IL/Cary-Grove"
88,62,Logan Plantz*,OL,74,284,Sr.,"Frankfort, IL/Providence Catholic"
94,71,Alex Bars,OL,78,315,5th,"Nashville, TN/Montgomery Bell"


In [34]:
athletes_data[
    (athletes_data['Position'] == 'OL') & 
    (
    (athletes_data['Class'] == 'Sr.') | (athletes_data['Class'] == '5th') 
    )
]

Unnamed: 0,Number,Name,Position,Height,Weight,Class,Hometown
78,53,Sam Mustipher,OL,74,306,5th,"Olney, MD/Good Counsel"
84,57,Trevor Ruhland,OL,75,295,Sr.,"Cary, IL/Cary-Grove"
88,62,Logan Plantz*,OL,74,284,Sr.,"Frankfort, IL/Providence Catholic"
94,71,Alex Bars,OL,78,315,5th,"Nashville, TN/Montgomery Bell"


In [39]:
athletes_data[ 
    (athletes_data['Position'] == 'LB') & 
    (athletes_data['Class'] == 'Soph.')]['Height'].mean()

72.5
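
The same kind of selections can be written a little more compactly. The sketch below is one possible alternative, not the only way to do it: `isin` replaces the two comparisons joined by `|`, and `loc` picks the rows and the column in one step. The roster is made up for illustration and is not the real athletes data.

```python
import pandas as pd

# Hypothetical roster, for illustration only
roster = pd.DataFrame({
    'Position': ['OL', 'OL', 'LB', 'LB', 'OL'],
    'Class': ['Sr.', '5th', 'Soph.', 'Soph.', 'Fr.'],
    'Height': [75, 78, 72, 74, 76],
})

# isin() is True wherever the value matches any item in the list
final_year = roster['Class'].isin(['Sr.', '5th'])
senior_linemen = roster[(roster['Position'] == 'OL') & final_year]

# loc[row_mask, column] selects rows and a column together
soph_lb = (roster['Position'] == 'LB') & (roster['Class'] == 'Soph.')
mean_height = roster.loc[soph_lb, 'Height'].mean()

print(senior_linemen)
print(mean_height)
```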

## UFunc Arithmetic with Index Preservation

Let us convert the height of the players into meters. The math to convert from inches to meters is to multiply by 0.0254. 

In [None]:
athletes_data.head(5)

In [None]:
athletes_data['Height'] = athletes_data['Height'] * 0.0254

In [None]:
athletes_data.head(5)


Do you see how my index was still preserved? This is referred to as **index preservation**, and we will see it come into play for both `Series` and `DataFrame` objects when we use arithmetic functions on them.
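
Here is a minimal sketch of index preservation on a made-up `Series` (player labels and heights are placeholders). Both the `*` operator and the equivalent NumPy ufunc return a new `Series` that carries the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in inches, for illustration only
heights_in = pd.Series([72, 74, 70], index=['Player A', 'Player B', 'Player C'])

# Arithmetic with a scalar is applied to every element
heights_m = heights_in * 0.0254

# The equivalent NumPy ufunc also returns a Series with the same index
heights_m_ufunc = np.multiply(heights_in, 0.0254)

print(heights_m)
print(heights_m.index.equals(heights_in.index))
```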

## Binary Functions and `DataFrame` Objects
Now let's try performing binary UFunc operations on DataFrames.

#### Operations between 2 DataFrames

To demonstrate how arithmetic operations work between two different `DataFrame` objects, I'll need to construct a couple of simple objects.

I'll go ahead and create two imaginary objects that hold sales data over two different years for the burger joint: **In-N-Out**

In [None]:
import pandas as pd

# 2015 Sales DataFrame
sales_2015 = pd.DataFrame([
        {'Burgers': 9574265, 'Fries': 7124736, 'Drinks': 11563762},
        {'Burgers': 6574265, 'Fries': 5124736, 'Drinks': 13563762},
    ], 
    index=['California', 'Texas'])

# 2016 Sales DataFrame
# They open their first Indiana store at Notre Dame!!!
# And they sell Irish Shakes nationwide to celebrate.
sales_2016 = pd.DataFrame([
        {'Burgers': 9742652, 'Fries': 7354736, 'Drinks': 11133762, 'Irish Shakes': 75812},
        {'Burgers': 7774222, 'Fries': 6214736, 'Drinks': 14563762, 'Irish Shakes': 15525},
        {'Burgers': 74265, 'Fries': 54736, 'Drinks': 43762, 'Irish Shakes': 23612},
    ], 
    index=['California', 'Texas', 'Indiana'])


Here's what those `DataFrames` look like separately:

In [None]:
print(sales_2015, sales_2016, sep='\n\n\n')

In [None]:
sales_2015 + sales_2016

To have a value in the results of an operation between two `DataFrame` objects, there must be a value in both of the objects for a given Index/Column combination.

This is why the Indiana row and the Irish Shakes column in our results contain only `NaN` (missing values): that index only existed in 2016, and that column only existed in 2016.
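
If you would rather treat a missing entry as zero, one option is the `add` method with its `fill_value` parameter. The sketch below uses two tiny frames with invented numbers (not the sales figures above) to compare the plain `+` operator with `add(..., fill_value=0)`.

```python
import pandas as pd

# Hypothetical figures, for illustration only
year_one = pd.DataFrame({'Burgers': [10, 20]}, index=['California', 'Texas'])
year_two = pd.DataFrame(
    {'Burgers': [11, 21, 5], 'Shakes': [1, 2, 3]},
    index=['California', 'Texas', 'Indiana'],
)

# Plain addition: any index/column pair missing from either frame becomes NaN
plain_sum = year_one + year_two

# add() with fill_value=0 substitutes 0 when only one side has a value
filled_sum = year_one.add(year_two, fill_value=0)

print(plain_sum)
print(filled_sum)
```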

## Loading JSON Files
In terms of web APIs, JSON is the dominant data transmission format on the internet right now, so you'll need to be familiar with how to load it into **`DataFrame`** objects as well.

There are a wide variety of ways that JSON documents can be structured. Unless you want to really get down into the details, there are only a few formats that Pandas will read without problems.

For our purposes, we'll use a pretty basic file that conforms to one of the standard formats just to get our feet wet.

I've uploaded a JSON formatted file `pokedex.json` for us to use.  Hopefully, you are a Pokemon fan.

In [40]:
# We use the `orient` parameter to tell Pandas what the basic 
# structure of the JSON is.  Other accepted values include:
# split, index, columns, values, and table
# More Info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pokedex = pd.read_json('./data/pokedex.json', orient='records')
pokedex.head()

Unnamed: 0,id,num,name,img,type,height,weight,candy,candy_count,egg,spawn_chance,avg_spawns,spawn_time,multipliers,weaknesses,next_evolution,prev_evolution
0,1,1,Bulbasaur,http://www.serebii.net/pokemongo/pokemon/001.png,"[Grass, Poison]",0.71 m,6.9 kg,Bulbasaur Candy,25.0,2 km,0.69,69.0,20:00,[1.58],"[Fire, Ice, Flying, Psychic]","[{'num': '002', 'name': 'Ivysaur'}, {'num': '0...",
1,2,2,Ivysaur,http://www.serebii.net/pokemongo/pokemon/002.png,"[Grass, Poison]",0.99 m,13.0 kg,Bulbasaur Candy,100.0,Not in Eggs,0.042,4.2,07:00,"[1.2, 1.6]","[Fire, Ice, Flying, Psychic]","[{'num': '003', 'name': 'Venusaur'}]","[{'num': '001', 'name': 'Bulbasaur'}]"
2,3,3,Venusaur,http://www.serebii.net/pokemongo/pokemon/003.png,"[Grass, Poison]",2.01 m,100.0 kg,Bulbasaur Candy,,Not in Eggs,0.017,1.7,11:30,,"[Fire, Ice, Flying, Psychic]",,"[{'num': '001', 'name': 'Bulbasaur'}, {'num': ..."
3,4,4,Charmander,http://www.serebii.net/pokemongo/pokemon/004.png,[Fire],0.61 m,8.5 kg,Charmander Candy,25.0,2 km,0.253,25.3,08:45,[1.65],"[Water, Ground, Rock]","[{'num': '005', 'name': 'Charmeleon'}, {'num':...",
4,5,5,Charmeleon,http://www.serebii.net/pokemongo/pokemon/005.png,[Fire],1.09 m,19.0 kg,Charmander Candy,100.0,Not in Eggs,0.012,1.2,19:00,[1.79],"[Water, Ground, Rock]","[{'num': '006', 'name': 'Charizard'}]","[{'num': '004', 'name': 'Charmander'}]"
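
To see what the `orient` values describe, here is a small sketch that reads the same two rows from two differently structured JSON strings. The JSON text is made up for illustration (it is not taken from `pokedex.json`), and wrapping it in `StringIO` lets `read_json` treat the string like a file.

```python
from io import StringIO

import pandas as pd

# 'records' layout: a list of objects, one object per row
records_json = '[{"id": 1, "name": "Alpha"}, {"id": 2, "name": "Beta"}]'
records_df = pd.read_json(StringIO(records_json), orient='records')

# 'split' layout: columns, index, and data stored as separate keys
split_json = (
    '{"columns": ["id", "name"], "index": [0, 1], '
    '"data": [[1, "Alpha"], [2, "Beta"]]}'
)
split_df = pd.read_json(StringIO(split_json), orient='split')

print(records_df)
print(split_df)
```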


## Practice Dictionary Activity

1. Accept a string as an input from the user
2. Create a dictionary that contains the frequency of each word in the string.
3. Print the dictionary

Below is the sample interaction

In [113]:
# Accept a sentence
sentence = input("Enter a sentence")

# Convert the sentence to a list of words using split() function
words_list = sentence.split()

# Create an empty dictionary that contains words and frequencies
word_count = {}

# Iterate through every word in the list
for word in words_list:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1
        
# Finally print the dictionary
print(word_count)

Enter a sentenceMy name is My name is My name is Jonathan
{'My': 3, 'name': 3, 'is': 3, 'Jonathan': 1}
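
Once you are comfortable with the loop, you may want to know that the standard library can do the counting for you. This sketch uses `collections.Counter` as an alternative to the manual dictionary; the sentence is hard-coded to the sample input instead of calling `input()` so that it runs without typing anything.

```python
from collections import Counter

# Same sentence as the sample interaction, hard-coded for the sketch
sentence = "My name is My name is My name is Jonathan"

# Counter tallies how many times each item appears in the list
word_count = Counter(sentence.split())

# A Counter behaves like a dictionary; convert it if you need a plain dict
as_plain_dict = dict(word_count)

print(as_plain_dict)
```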
