# MSDS 631 - Lecture 5 (February 20, 2019)

## Key Scientific Libraries - Numpy and Pandas

--------
##### Sidenote: List Comprehensions
Sometimes you want to create a simple list, and writing a multi-line loop to do it can feel like overkill. List comprehensions are a clean, concise way to build a list in a single expression.

In [14]:
list_of_numbers = list(range(1,10))
remainders = [i % 2 for i in list_of_numbers]
remainders

[1, 0, 1, 0, 1, 0, 1, 0, 1]

While list comprehensions are cool and clean looking, please be careful to not go overboard with these. They are easy to screw up, so the more complex your logic, the harder it is to write them.
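
As a rough illustration of where a comprehension is still readable, here is a loop next to its one-line equivalent. The variable names are illustrative and not part of the lecture's data.

```python
# Illustrative names only; not part of the lecture's dataset.
numbers = list(range(1, 10))

# Loop version: collect the squares of the even numbers.
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n ** 2)

# Equivalent list comprehension with a filter clause.
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

print(even_squares)  # [4, 16, 36, 64]
```

Once a comprehension needs nested loops or several conditions, the plain loop is usually easier to read and debug.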

---

### Scientific Libraries

While Python is arguably the most widely used language for Data Scientists, the language by itself does not necessarily lend itself to performing data analysis or modeling. The most influential evolution of the language, and the one that led to its widespread adoption in the field, was the creation of two libraries: Numpy and Pandas.

Numpy (pronounced "num-pie") was originally written in 1995, but it took its current form in 2006. Since then it has been a foundational library for performing data analysis in Python. Two years later, Wes McKinney wrote Pandas, which took the concepts started by Numpy and melded them with powerful data structures from R.

Let's start by talking about Numpy.

Numpy is a library that has many built-in Objects that utilize code written in the C language. This makes code run MUCH much faster than the equivalent pure-Python code (and often with a lot less code). Let's take a look at a few examples.

### Intro to Numpy

In [1]:
import random
import numpy as np #We use an alias to shorten our code

##### Arrays

The foundational data structure for Numpy is the array. Like lists, values within arrays are accessible by index, but the array has many more powerful features that make analysis much easier.

In [2]:
#Create some random numbers in Python
list_of_numbers1 = [random.randint(1,100) for _ in range(10)]
list_of_numbers2 = [random.randint(1,100) for _ in range(10)]
list_of_numbers3 = [random.randint(1,100) for _ in range(10)]

list_of_numbers1

[53, 26, 47, 43, 35, 91, 38, 38, 67, 29]

In [3]:
#Create some random numbers in Numpy
np_array_of_numbers1 = np.array(list_of_numbers1) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers2 = np.array(list_of_numbers2) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers3 = np.array(list_of_numbers3) #Casting a list as a new Numpy-specific data structure

np_array_of_numbers1

array([53, 26, 47, 43, 35, 91, 38, 38, 67, 29])

In [4]:
#Values can be accessed just like lists by index
np_array_of_numbers1[0]

53

In [5]:
#... or by range
np_array_of_numbers1[:5]

array([53, 26, 47, 43, 35])

In [6]:
#What you CAN do that's unique to arrays is access data with a list of arbitrary indices
np_array_of_numbers1[[0,3,4,7]]

array([53, 43, 35, 38])

In [7]:
#This doesn't work for lists
list_of_numbers1[[0,3,4,7]]

TypeError: list indices must be integers or slices, not list

##### Array Math

Let's add 5 to every number in the first list.

In [8]:
#Base Python



In [None]:
#Using Numpy



What just happened here???

Numpy uses something called "broadcasting": the single value is stretched to match the shape of the array, so the operation is applied to every element. We can do a lot more than this too! Applying an operation to a whole array at once, rather than looping over its elements, is more commonly called **"vectorization."**
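
A minimal sketch of the comparison, using the first five values from the output shown earlier as fixed inputs (the lecture's actual values are random, so the names `values` and `arr` here are stand-ins):

```python
import numpy as np

# Fixed values copied from the earlier output so the results are reproducible.
values = [53, 26, 47, 43, 35]
arr = np.array(values)

# Base Python: build a new list element by element.
plus_five_list = [v + 5 for v in values]

# Numpy: the scalar 5 is broadcast across every element.
plus_five_arr = arr + 5
times_five_arr = arr * 5

print(plus_five_list)           # [58, 31, 52, 48, 40]
print(plus_five_arr.tolist())   # [58, 31, 52, 48, 40]
print(times_five_arr.tolist())  # [265, 130, 235, 215, 175]
```

Note that the same operators mean something different on a plain list: `values * 5` repeats the list five times, and `values + 5` raises a `TypeError`.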

Let's multiply each number in list 1 by 5

In [None]:
#Base Python
python_nums = []



python_nums

In [None]:
#Using Numpy


numpy_nums

Now let's add each element of list 1 and list 2 and then divide the sum by each value of list 3

In [None]:
#Example of first index
(list_of_numbers1[0] + list_of_numbers2[0]) / list_of_numbers3[0]

In [None]:
#Let's do it for all of the values now
#Base Python


python_nums

In [None]:
#Using Numpy


numpy_nums

When doing **element-wise** math using arrays, it is critical that the arrays have the same shape (or shapes that Numpy can broadcast together, such as an array and a single number).
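
The sketch below uses small made-up arrays (not the lecture's random ones) to show element-wise math working for matching shapes and failing for mismatched lengths:

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to check by hand.
a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])
c = np.array([2, 2, 2, 2])

# Same shape: the operation is applied position by position.
combined = (a + b) / c
print(combined.tolist())  # [5.5, 11.0, 16.5, 22.0]

# Different lengths (4 vs. 3): Numpy cannot line the elements up.
shorter = np.array([1, 2, 3])
try:
    a / shorter
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print('ValueError:', err)
```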

In [None]:
longer_array = np.array([random.randint(1,100) for _ in range(10)])
shorter_array = np.array([random.randint(1,100) for _ in range(5)])
longer_array / shorter_array #This doesn't work

Element-wise math can happen with almost any standard mathematical operator (or combination of operators). In fact, you can even apply functions to arrays, as long as the operations being performed are mathematical in nature (and not logical).

In [9]:
#Basic arithmetic operations and numpy math functions
def math_output(x):
    a = x + 5
    b = a / 2
    c = b ** 3
    d = np.sqrt(c) #Numpy has several built-in math functions
    return d

In [10]:
math_output(7)

14.696938456699069

In [15]:
#In base Python
answers = []
for i in list_of_numbers1:
    answer = math_output(i)
    answers.append(answer)
answers

[156.16977940690063,
 61.023561023591533,
 132.5745073534124,
 117.57550765359255,
 89.442719099991592,
 332.55375505322445,
 99.691398826578819,
 99.691398826578819,
 216.0,
 70.092795635500224]

In [16]:
math_output(np_array_of_numbers1)

array([ 156.16977941,   61.02356102,  132.57450735,  117.57550765,
         89.4427191 ,  332.55375505,   99.69139883,   99.69139883,
        216.        ,   70.09279564])

That worked great! Even cleaner than list comprehensions!

Unfortunately, there are limitations. As mentioned above, control flow (such as `if`/`else`) cannot be applied to a whole array at once.

In [21]:
def logical_output(x):
    if x % 2 == 0:
        return 'even'
    elif x % 2 == 1:
        return 'odd'
    else:
        return 'something else'

In [22]:
logical_output(4)

'even'

In [23]:
answers = []
for i in list_of_numbers1:
    answer = logical_output(i)
    answers.append(answer)
answers

['odd', 'even', 'odd', 'odd', 'odd', 'odd', 'even', 'even', 'odd', 'odd']

In [24]:
logical_output(np_array_of_numbers1) #You cannot vectorize control flow

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
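
One common workaround, offered here as a sketch rather than as part of the original lecture, is `np.where`, which picks between two values for each element based on a boolean condition. The array values are copied from the output shown earlier.

```python
import numpy as np

# Values copied from the earlier output so the result is reproducible.
nums = np.array([53, 26, 47, 43, 35, 91, 38, 38, 67, 29])

# np.where(condition, value_if_true, value_if_false) chooses per element,
# so no Python-level if/else ever sees the whole array.
labels = np.where(nums % 2 == 0, 'even', 'odd')

print(labels[:3].tolist())  # ['odd', 'even', 'odd']
```

This only covers the two-branch case; logic with more branches needs nested `np.where` calls or a plain loop.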

#### Numpy Filtering

Filters are arrays consisting of True and False values. You can obtain these by applying a comparison (such as `==` or `>`) directly to an array.

In [25]:
np_array_of_numbers1

array([53, 26, 47, 43, 35, 91, 38, 38, 67, 29])

In [28]:
#Let's find the even numbers in our array
is_even = np_array_of_numbers1 % 2 == 0
is_even

array([False,  True, False, False, False, False,  True,  True, False, False], dtype=bool)

Now, much like how indexing in an array returns a subset of values, we can use this filter to only return the values that are True.

In [29]:
np_array_of_numbers1[is_even]

array([26, 38, 38])

In [30]:
#Let's now find the numbers divisible by 3 in our array
divis_by_3_filter = np_array_of_numbers1 % 3 == 0
nums_divisible_by_3 = np_array_of_numbers1[divis_by_3_filter]
nums_divisible_by_3

array([], dtype=int64)

In [31]:
#We can also do this without creating a filter
np_array_of_numbers1[np_array_of_numbers1 % 3 == 0]

array([], dtype=int64)

Filtering will be an extremely important concept as we move forward.

We can also use these filters on other arrays of the same shape, assuming the indices of the array are all associated with the same entity (e.g. the same student for the example below).

Imagine we're on the school board and want to identify all of the honors students. We want to start by finding everyone with a GPA 3.7 or higher. When we did this previously, it required a complicated loop. Now, with Numpy, we can use simple filtering to identify those students.

In [None]:
import json #Needed to read the JSON file
with open('students.json', 'r') as f:
    students = json.load(f)
ids = np.array([i['student_id'] for i in students])
gpas = np.array([i['gpa'] for i in students])

In [None]:
#Let's find the students with GPAs of 3.7 or higher
high_gpa_filter = gpas >= 3.7
honor_ids = ids[high_gpa_filter]

print('There are {} students eligible for honor roll. They have the following IDs'.format(len(honor_ids)))
honor_ids

In [None]:
#If we wanted to know all of the students that were economics majors, we could create another filter
majors = np.array([i['major'] for i in students])
is_economics = majors == 'Economics'
is_economics

In [None]:
economics_ids = ids[is_economics]
print('The following students are economics majors')
print(economics_ids)
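
Filters can also be combined. The sketch below uses small hypothetical stand-ins for the `ids`, `gpas`, and `majors` arrays, since `students.json` is not reproduced here:

```python
import numpy as np

# Hypothetical stand-ins for the ids, gpas, and majors arrays.
ids = np.array(['s1', 's2', 's3', 's4'])
gpas = np.array([3.9, 3.2, 3.7, 4.0])
majors = np.array(['Economics', 'Economics', 'Math', 'Physics'])

high_gpa = gpas >= 3.7
is_economics = majors == 'Economics'

# Use & (and), | (or), and ~ (not) element-wise.
# Python's `and` / `or` keywords raise a ValueError on arrays.
honors_econ_ids = ids[high_gpa & is_economics]
honors_or_econ_ids = ids[high_gpa | is_economics]

print(honors_econ_ids.tolist())     # ['s1']
print(honors_or_econ_ids.tolist())  # ['s1', 's2', 's3', 's4']
```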

##### Limits on usage of arrays
There is one significant way in which arrays are less useful than lists: their ability to hold values of different types.

In [None]:
mixed_list = [1, 2.0, '3']
mixed_list

In [None]:
mixed_array = np.array(mixed_list) #No error, but be wary of what you wish for

In [None]:
mixed_array

In this case, Numpy is "helping" you by auto-casting all of the values into a common data type. Unfortunately, this is not always something you want, and you have to be mindful of when it happens.
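
A quick way to see what happened is to inspect the array's `dtype`. The sketch below also shows `dtype=object`, which keeps the original Python types at the cost of Numpy's fast math:

```python
import numpy as np

mixed_array = np.array([1, 2.0, '3'])

# Everything was converted to a (unicode) string.
print(mixed_array.dtype.kind)  # U
print(mixed_array.tolist())    # ['1', '2.0', '3']

# Integers mixed with floats are promoted to floats instead.
numeric_array = np.array([1, 2.0])
print(numeric_array.dtype)     # float64

# dtype=object stores the original Python objects unchanged.
object_array = np.array([1, 2.0, '3'], dtype=object)
object_types = [type(v).__name__ for v in object_array]
print(object_types)            # ['int', 'float', 'str']
```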

##### Numpy built-in array methods

In [None]:
gpas.mean() #Compute the arithmetic mean

In [None]:
gpas.max() #Find the maximum value

In [None]:
gpas.min() #Find the minimum value

In [None]:
gpas.std() #Find the standard deviation

In [None]:
gpas.argmax() #Find the FIRST index where the max value occurs

In [None]:
#Alternatively, you can do this to find all of the IDs with a max value
max_gpa = gpas.max()
max_gpa_ids = ids[gpas == max_gpa]
max_gpa_ids

What percent of students have a 4.0?

There are two ways to compute it.

In [None]:
len(max_gpa_ids) / len(students)

In [None]:
(gpas == max_gpa).mean() #True is treated as 1 and False as 0, so the mean is the fraction of True values

##### Numpy built-in scientific helper Objects

Numpy has several built-in Objects that help with your analyses. There are WAY too many to cover, but here are a few that you may find yourself using.

##### Native Math functions

In [None]:
#Square root
np.sqrt(25)

In [None]:
#Base e logarithm
np.log(20)

In [None]:
#Base 10 logarithm
np.log10(20)

In [None]:
#Base n logarithm
n = 10
np.log(20) / np.log(n)

In [None]:
squares = np.array([i**2 for i in range(1,10)])
squares

In [None]:
#As we showed before, we can vectorize mathematical operations
np.sqrt(squares)

In [None]:
np.log(squares)

##### Random Number Generator and Sampler

In [None]:
np.random.randint(1,100,size=10)

In [None]:
np.random.random(size=10)

In [None]:
np.random.random(size=10) * 10

In [None]:
np.random.normal(1,.5, size=10)

In [None]:
lots_of_numbers = np.arange(1,10000) #Like the native range() function, but returns an array

In [None]:
random_sample = np.random.choice(lots_of_numbers, size=10)
random_sample

In [None]:
majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2,.5,.3], size=1000)
lots_of_majors

In [None]:
#How many Physics majors did we get?
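
One possible way to answer this, sketched with a fixed seed (an assumption added here so the run is repeatable; the lecture does not set one):

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so the sketch is reproducible
majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2, .5, .3], size=1000)

# Comparing to a string gives a boolean array; summing it counts the Trues.
physics_count = (lots_of_majors == 'Physics').sum()
physics_share = (lots_of_majors == 'Physics').mean()

# np.unique with return_counts=True tallies every major at once.
names, counts = np.unique(lots_of_majors, return_counts=True)
tally = dict(zip(names.tolist(), counts.tolist()))

print(physics_count, physics_share)
print(tally)
```

With `p=[.2, .5, .3]`, roughly half of the 1,000 draws should come back as Physics.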


### Intro to Pandas
Pandas is going to be your best friend for the rest of your time analyzing data in Python. It offers much of the functionality of Numpy, but with a MUCH more user-friendly way of displaying and accessing data.

The two fundamental data structures of Pandas are the Series and the DataFrame. A Series can be viewed as equivalent to a Numpy array, while a DataFrame can be viewed as a combination of Series, with some additional functionality.
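
To make the relationship concrete, here is a small sketch built from a hypothetical three-row table. The column names mirror `students.csv`, but the DataFrame is constructed by hand rather than read from the file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table; the column names mirror students.csv.
df = pd.DataFrame({
    'first': ['Janis', 'Timothy', 'Elizabeth'],
    'major': ['Economics', 'Economics', 'Finance'],
    'gpa': [3.12, 3.48, 3.4],
})

gpa_series = df['gpa']             # a single column is a Series
gpa_array = gpa_series.to_numpy()  # its values as a Numpy array

print(type(gpa_series).__name__)   # Series
print(type(gpa_array).__name__)    # ndarray
print(df.shape)                    # (3, 3)

# Series support the same vectorized math and methods as arrays.
print((gpa_series >= 3.4).sum())   # 2
```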

Let's start by opening a file using Pandas' built-in `read_csv` function.

In [2]:
import pandas as pd
students_df = pd.read_csv('students.csv') #This is way easier than our method from before!

In [3]:
#Let's look at the first few records of the students DataFrame
students_df.head() #The "head" method will give you as many rows as you ask for; 5 is default

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |


#### Intro to the DataFrame

DataFrames consist of rows and columns. These are the fundamental "units" of data for a Data Scientist. This is what allows us to organize our data to analyze quickly and consistently. Rows are considered individual "observations" whereas columns are considered "attributes." The "shape" of a DataFrame is defined as the number of rows by the number of columns.

In the DataFrame above, you can see that the attributes identified are:
- student_id
- first
- last
- gender
- class
- major
- gpa

Each student's information can be found in a row. Thus, the first row (index 0) is Janis Brown, who is a Female and a Junior Economics student with a GPA of 3.12.

This is MUCH easier to read and track than lists of dictionaries or dictionaries of dictionaries or whatever other combination of data structures that base Python might be able to display for you. You're probably asking why ALL data isn't shown this way, and we'll get to that later on, but in the meantime, let's focus on the simple case here.

Let's start by looking at the shape of the data.

In [5]:
students_df.shape

(10000, 7)

This means that the data has 10,000 rows and 7 columns. The first things you should always do when using DataFrames are to look at the first few rows (the `.head()` method) and to check the shape of the data.

### Accessing chunks of data in a DataFrame

You will almost always want to focus your attention or compute values on certain subsections of the data. For this there are several ways to slice and dice the data.

##### Accessing columns

Much like dictionaries, columns are accessed via brackets and some column name (just like keys).

In [8]:
first_names = students_df['first'] #Returns a Series object
first_names

0           Janis
1         Timothy
2       Elizabeth
3          Edward
4            Lisa
5        Alphonse
6         Linwood
7          Arthur
8          Daniel
9           Barry
10      Maryellen
11          Jacob
12           Lisa
13            Ida
14           Lola
15           Joan
16          Larry
17      Bridgette
18          James
19          David
20         Ronald
21        Heather
22          Nancy
23          David
24          Jesus
25        Richard
26          Henry
27          Pedro
28         Israel
29           Mary
          ...    
9970     Georgina
9971      Lynette
9972       Vickie
9973       Willie
9974       Sylvia
9975        Derek
9976      Charles
9977     Jonathan
9978         Dave
9979        Ralph
9980         Leta
9981        Megan
9982       Olivia
9983         Anne
9984         John
9985         John
9986       Shelly
9987       Roland
9988     Margaret
9989        Alice
9990        Mable
9991       Joseph
9992       Gladys
9993      Douglas
9994      

Unlike dictionaries, you can access multiple columns at the same time.

In [9]:
full_names = students_df[['first', 'last']]
full_names

| | first | last |
|---|---|---|
| 0 | Janis | Brown |
| 1 | Timothy | Bishop |
| 2 | Elizabeth | Owens |
| 3 | Edward | Pearson |
| 4 | Lisa | Gonzalez |
| 5 | Alphonse | Allen |
| 6 | Linwood | Coleman |
| 7 | Arthur | Mccolpin |
| 8 | Daniel | Carter |
| 9 | Barry | Reyes |


##### Accessing rows

Much like numpy arrays, rows can be accessed by integer position, using the `.iloc` accessor.

In [10]:
first_5_students = students_df.iloc[:5] #Rows at positions 0 through 4
first_5_students

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |


In [11]:
next_5_students = students_df.iloc[5:10] #Rows at positions 5 through 9
next_5_students

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 5 | 6c849c3e-e640-4bba-a86a-4323fd513b90 | Alphonse | Allen | Male | Freshman | Engineering | 2.99 |
| 6 | a5c87c39-447c-4c29-92af-fa702a8d5595 | Linwood | Coleman | Male | Freshman | Engineering | 3.58 |
| 7 | f6b177e8-e00a-480e-b62e-906c2ad80f85 | Arthur | Mccolpin | Male | Junior | Math | 3.04 |
| 8 | 8387594f-c9b2-4daa-ae93-c3e40f58cb26 | Daniel | Carter | Male | Junior | Chemistry | 2.41 |
| 9 | 156aefe7-73b4-4777-929c-7aa9c0cd35c5 | Barry | Reyes | Male | Junior | Finance | 3.0 |


Also like numpy arrays, you can filter! This is going to be life changing!

In [14]:
female_students_df = students_df[students_df['gender']=='Female']
female_students_df.head(20)

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |
| 10 | 8dd7650e-55e0-416e-9f7e-44ead1dc08a2 | Maryellen | Pack | Female | Junior | Finance | 3.83 |
| 12 | 1846c044-9a87-49e4-ad4f-d1bfadb4e41b | Lisa | Walden | Female | Senior | Economics | 3.4 |
| 13 | 21cdf664-d905-44ae-a480-81b687aefa67 | Ida | Garcia | Female | Freshman | Chemistry | 3.54 |
| 14 | 15714049-2a88-4dbc-8d1a-dde0095a3acf | Lola | Bailey | Female | Senior | Physics | 3.28 |
| 15 | 35de9214-505e-4d55-80e6-0302098a44b6 | Joan | Decoteau | Female | Senior | Economics | 3.81 |
| 17 | 4cb6a2fb-7f67-45ca-abc6-7252aa99d9f0 | Bridgette | Vitucci | Female | Sophomore | Chemistry | 3.1 |
| 21 | 8eaad117-5777-4ea5-bb6d-663865eeb196 | Heather | Manning | Female | Sophomore | Engineering | 2.87 |


Let's take a step-by-step approach to figuring out that filter.
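
As a rough guide to the steps involved, here is a sketch on a hypothetical four-row DataFrame (the real `students.csv` is not reproduced here). The probation cutoff of a GPA below 2.0 is an assumption made for illustration; the lecture does not define it.

```python
import pandas as pd

# Hypothetical rows; the lecture's students.csv is not reproduced here.
df = pd.DataFrame({
    'first': ['Janis', 'Timothy', 'Lisa', 'Joan'],
    'gender': ['Female', 'Male', 'Female', 'Female'],
    'major': ['Economics', 'Economics', 'Finance', 'Economics'],
    'gpa': [3.12, 3.48, 4.0, 1.9],
})

# Step 1: a boolean Series, one True/False per row.
is_female = df['gender'] == 'Female'

# Step 2: pass the boolean Series in brackets to keep only the True rows.
female_df = df[is_female]

# Step 3: combine conditions with &, wrapping each comparison in parentheses.
female_econ_df = df[(df['gender'] == 'Female') & (df['major'] == 'Economics')]

# Assumption: "probation" means a GPA below 2.0; the source does not define it.
on_probation_df = female_econ_df[female_econ_df['gpa'] < 2.0]

print(len(female_df), len(female_econ_df), len(on_probation_df))  # 3 2 1
```

The parentheses around each comparison matter: `&` binds more tightly than `==`, so leaving them out changes what gets compared.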

In [None]:
#Find out which rows are females


In [None]:
#Now apply that filter to the DataFrame


In [None]:
#How many students are there?


In [None]:
#Now let's find out which rows are females and are studying Economics


In [None]:
#Now apply that double filter to the DataFrame


In [None]:
#How many students are there?


In [None]:
#Now let's find out which rows are females and are studying Engineering and are on probation


In [None]:
#Now apply that triple filter to the DataFrame


In [None]:
#How many students are there?
