# MSDS 631 - Lecture 5 (February 27, 2019)

## Key Scientific Libraries - Numpy and Pandas

In [1]:
even_nums = []
for i in range(101): 
    if i % 2 == 0: 
        even_nums.append(i)

The loop above can be reworked to collect every remainder:

In [3]:
remainders = []
for i in range(101): 
    remainder =  i % 2
    remainders.append(remainder)

--------
##### Sidenote: List Comprehensions
Sometimes you want to create a simple list and writing so many lines of code might seem silly. List comprehensions are a nice, clean, efficient way to create lists without having to write out so much code.

In [4]:
list_of_numbers = list(range(1,10))
remainders = [i % 2 for i in list_of_numbers]
remainders

[1, 0, 1, 0, 1, 0, 1, 0, 1]

In other words, a list comprehension is a for loop written inside a list.

While list comprehensions are cool and clean looking, please be careful to not go overboard with these. They are easy to screw up, so the more complex your logic, the harder it is to write them.
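
As a minimal sketch (the names `even_nums_loop` and `even_nums_comp` are made up for illustration), the even-number loop from the top of the lecture can also be written as a comprehension with an `if` clause:

```python
# Loop version: collect the even numbers from 0 through 100
even_nums_loop = []
for i in range(101):
    if i % 2 == 0:
        even_nums_loop.append(i)

# Comprehension version: the same logic on one line
even_nums_comp = [i for i in range(101) if i % 2 == 0]

print(even_nums_comp[:5])                # [0, 2, 4, 6, 8]
print(even_nums_loop == even_nums_comp)  # True
```

One condition like this still reads cleanly; once you need several nested conditions, the regular loop is usually easier to follow.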

---

### Scientific Libraries

While Python is arguably the most widely used language among Data Scientists, the language by itself does not necessarily lend itself to performing data analysis or modeling. The most influential evolution of the language, and the one that led to its widespread adoption in the field, was the creation of two libraries: Numpy and Pandas.

Numpy (pronounced "num-pie") was originally written in 1995 (as its predecessor, Numeric), but it took its current form in 2006. Since then it has been a foundational library for performing data analysis in Python. Two years later, Wes McKinney wrote Pandas, which took the concepts started by Numpy and melded them with powerful data structures from R.

Let's start by talking about Numpy.

Numpy is a library whose built-in Objects rely on code written in the C language. This makes operations run MUCH faster than the equivalent pure-Python loops (and often with a lot less code). Let's take a look at a few examples.
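
To get a rough feel for the speed difference, here is a small timing sketch. The function names and the array size are arbitrary choices for illustration, and the actual timings will vary from machine to machine:

```python
import timeit
import numpy as np

values = list(range(100_000))
# int64 is requested explicitly so the squares cannot overflow on any platform
values_array = np.array(values, dtype=np.int64)

def python_sum_of_squares():
    # Pure Python: every multiplication runs in the interpreter
    return sum(v * v for v in values)

def numpy_sum_of_squares():
    # Numpy: the multiplication and the sum both run in compiled code
    return int((values_array * values_array).sum())

python_seconds = timeit.timeit(python_sum_of_squares, number=20)
numpy_seconds = timeit.timeit(numpy_sum_of_squares, number=20)

print('Same answer:', python_sum_of_squares() == numpy_sum_of_squares())
print('Python: {:.4f}s, Numpy: {:.4f}s'.format(python_seconds, numpy_seconds))
```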

### Intro to Numpy

In [5]:
import random
import numpy as np #We use an alias to shorten our code

##### Arrays

The foundational data structure for Numpy is the array. Like lists, values within arrays are accessible by index, but the array has many more powerful features that make analysis much easier.

In [6]:
#Create some random numbers in Python
list_of_numbers1 = [random.randint(1,100) for _ in range(10)]
list_of_numbers2 = [random.randint(1,100) for _ in range(10)]
list_of_numbers3 = [random.randint(1,100) for _ in range(10)]

list_of_numbers1

[61, 42, 36, 19, 60, 63, 70, 74, 76, 79]

In [7]:
#Create some random numbers in Numpy
np_array_of_numbers1 = np.array(list_of_numbers1) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers2 = np.array(list_of_numbers2) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers3 = np.array(list_of_numbers3) #Casting a list as a new Numpy-specific data structure

np_array_of_numbers1

array([61, 42, 36, 19, 60, 63, 70, 74, 76, 79])

In [8]:
#Values can be accessed just like lists by index
np_array_of_numbers1[0]

61

In [9]:
#... or by range
np_array_of_numbers1[:5]

array([61, 42, 36, 19, 60])

In [10]:
#What you CAN do that's unique to arrays is access data by an arbitrary list of indices
np_array_of_numbers1[[0,3,4,7]]

array([61, 19, 60, 74])

In [11]:
#This doesn't work for lists
list_of_numbers1[[0,3,4,7]]

TypeError: list indices must be integers or slices, not list

##### Array Math

Let's add 5 to every number in the first list.

In [12]:
#Base Python
new_numbers = []
for i in list_of_numbers1:
    new_number = i + 5
    new_numbers.append(new_number)
print(list_of_numbers1)
print(new_numbers)

[61, 42, 36, 19, 60, 63, 70, 74, 76, 79]
[66, 47, 41, 24, 65, 68, 75, 79, 81, 84]


In [13]:
#Using Numpy
np_array_of_numbers1 + 5


array([66, 47, 41, 24, 65, 68, 75, 79, 81, 84])

What just happened here???

Numpy uses something called "broadcasting": the single value 5 is stretched to match the shape of the array so that the addition can be applied to every element. We can do a lot more than this too! Applying one operation to a whole array at once, instead of looping, is more commonly called **"vectorization."**
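
Broadcasting is not limited to single numbers. The sketch below uses a made-up two-dimensional array (`scores`) and a made-up one-dimensional array (`bonus_per_column`) to show both cases:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])              # shape (2, 3)
bonus_per_column = np.array([10, 20, 30])   # shape (3,)

# A single number is stretched to every element
print(scores + 5)
# [[ 6  7  8]
#  [ 9 10 11]]

# A length-3 array is stretched across both rows
print(scores + bonus_per_column)
# [[11 22 33]
#  [14 25 36]]
```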

Let's multiply each number in list 1 by 5

In [14]:
#Base Python
python_nums = []
for i in list_of_numbers1:
    new_num = i * 5
    python_nums.append(new_num)  
python_nums

[305, 210, 180, 95, 300, 315, 350, 370, 380, 395]

In [15]:
#Using Numpy
numpy_nums = np_array_of_numbers1 * 5

numpy_nums

array([305, 210, 180,  95, 300, 315, 350, 370, 380, 395])

Now let's add each element of list 1 and list 2 and then divide the sum by each value of list 3

In [16]:
#Example of first index
(list_of_numbers1[0] + list_of_numbers2[0]) / list_of_numbers3[0]

1.9661016949152543

In [17]:
#Let's do it for all of the values now
#Base Python
python_nums = [] #Start from an empty list so earlier results are not mixed in
for i in range(len(list_of_numbers1)):
    new_num = (list_of_numbers1[i] + list_of_numbers2[i]) / list_of_numbers3[i] #Parentheses make the sum happen before the division
    python_nums.append(new_num)
python_nums

[1.9661016949152543, ...]

In [18]:
#Using Numpy
numpy_nums = (np_array_of_numbers1 + np_array_of_numbers2) / np_array_of_numbers3

numpy_nums

array([1.96610169, ...])

When doing **element-wise** math using arrays, it is critical that the arrays have the same shape (a single number, as in the examples above, is the main exception).

In [19]:
longer_array = np.array([random.randint(1,100) for _ in range(10)])
shorter_array = np.array([random.randint(1,100) for _ in range(5)])
longer_array / shorter_array #This doesn't work

ValueError: operands could not be broadcast together with shapes (10,) (5,) 

Element-wise math can happen with almost any standard mathematical operator (or combination of operators). In fact, you can even apply functions to arrays, as long as the operations being performed are mathematical in nature (and not logical).

In [20]:
#Basic arithmetic operations and numpy math functions
def math_output(x):
    a = x + 5
    b = a / 2
    c = b ** 3
    d = np.sqrt(c) #Numpy has several built-in math functions
    return d

In [21]:
math_output(7)

14.696938456699069

In [22]:
#In base Python
answers = []
for i in list_of_numbers1:
    answer = math_output(i)
    answers.append(answer)
answers

[189.57056733575496,
 113.92047664928373,
 92.81769766590853,
 41.569219381653056,
 185.27850657860992,
 198.25236442474022,
 229.63966338592294,
 248.25365052703657,
 257.74042174249655,
 272.19110933313016]

In [23]:
math_output(np_array_of_numbers1)

array([189.57056734, 113.92047665,  92.81769767,  41.56921938,
       185.27850658, 198.25236442, 229.63966339, 248.25365053,
       257.74042174, 272.19110933])

That worked great! Even cleaner than list comprehensions!

Unfortunately, there are limitations. As mentioned above, functions that rely on control flow (such as `if` statements) cannot be vectorized this way.

In [24]:
def logical_output(x):
    if x % 2 == 0:
        return 'even'
    elif x % 2 == 1:
        return 'odd'
    else:
        return 'something else'

In [25]:
logical_output(4)

'even'

In [26]:
answers = []
for i in list_of_numbers1:
    answer = logical_output(i)
    answers.append(answer)
answers

['odd', 'even', 'even', 'odd', 'even', 'odd', 'even', 'even', 'even', 'odd']

In [27]:
logical_output(np_array_of_numbers1) #You cannot vectorize control flow

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
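
One common workaround, which is not part of the lecture code above, is `np.where`. It takes a condition and two options and picks between them element by element. A minimal sketch using the same ten numbers:

```python
import numpy as np

numbers = np.array([61, 42, 36, 19, 60, 63, 70, 74, 76, 79])

# np.where(condition, value_if_true, value_if_false) evaluates the
# condition for every element and picks from the two options
labels = np.where(numbers % 2 == 0, 'even', 'odd')
print(labels)
# ['odd' 'even' 'even' 'odd' 'even' 'odd' 'even' 'even' 'even' 'odd']
```

This only covers simple two-way choices; more involved logic usually needs either nested `np.where` calls or a regular loop.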

#### Numpy Filtering

Filters are arrays consisting of True and False values. You can obtain these by applying in-line logic comparisons.

In [28]:
np_array_of_numbers1

array([61, 42, 36, 19, 60, 63, 70, 74, 76, 79])

In [29]:
#Let's find the even numbers in our array
is_even = np_array_of_numbers1 % 2 == 0
is_even

array([False,  True,  True, False,  True, False,  True,  True,  True,
       False])

Now, much like how indexing in an array returns a subset of values, we can use this filter to only return the values that are True.

In [30]:
np_array_of_numbers1[is_even]

array([42, 36, 60, 70, 74, 76])

In [31]:
#Let's now find the numbers divisible by 3 in our array
divis_by_3_filter = np_array_of_numbers1 % 3 == 0
nums_divisible_by_3 = np_array_of_numbers1[divis_by_3_filter]
nums_divisible_by_3

array([42, 36, 60, 63])

In [32]:
#We can also do this without creating a filter
np_array_of_numbers1[np_array_of_numbers1 % 3 == 0]

array([42, 36, 60, 63])

Filtering will be an extremely important concept as we move forward.
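
Filters can also be combined. The sketch below reuses the ten numbers from above (the variable names are chosen for illustration) and relies on Numpy's element-wise `&` (and), `|` (or), and `~` (not) operators:

```python
import numpy as np

numbers = np.array([61, 42, 36, 19, 60, 63, 70, 74, 76, 79])

is_even = numbers % 2 == 0
is_divisible_by_3 = numbers % 3 == 0

# & means "and", | means "or", ~ means "not" (element by element)
print(numbers[is_even & is_divisible_by_3])   # [42 36 60]
print(numbers[is_even | is_divisible_by_3])   # [42 36 60 63 70 74 76]
print(numbers[~is_even])                      # [61 19 63 79]

# Parentheses are required when the comparisons are written inline
print(numbers[(numbers > 50) & (numbers < 75)])  # [61 60 63 70 74]
```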

We can also use these filters on other arrays of the same shape, assuming the indices of the array are all associated with the same entity (e.g. the same student for the example below).

Imagine we're on the school board and want to identify all of the honors students. We want to start by finding everyone with a GPA 3.7 or higher. When we did this previously, it required a complicated loop. Now, with Numpy, we can use simple filtering to identify those students.

In [33]:
import json
with open('students.json', 'r') as f: #The with block closes the file automatically
    students = json.load(f)
ids = np.array([i['student_id'] for i in students])
gpas = np.array([i['gpa'] for i in students])

In [34]:
#Let's find the students with GPAs of 3.7 or higher
high_gpa_filter = gpas >= 3.7
honor_ids = ids[high_gpa_filter]

print('There are {} students eligible for honor roll. They have the following IDs'.format(len(honor_ids)))
honor_ids

There are 2166 students eligible for honor roll. They have the following IDs


array(['6081f91d-365c-46ce-ad1b-38af120781d9',
       '84cec8f4-0b64-44ce-a628-c0eb73f6ca6f',
       '8dd7650e-55e0-416e-9f7e-44ead1dc08a2', ...,
       '32606dc2-862b-45cc-b0ac-f2b24253abdf',
       '3f1f6525-3ec0-4184-b435-c829419bf582',
       'bc551659-ba48-447e-aa6a-0c2f49aaa9c1'], dtype='<U36')

In [35]:
#If we wanted to know all of the students that were economics majors, we could create another filter
majors = np.array([i['major'] for i in students])
is_economics = majors == 'Economics'
is_economics

array([ True,  True, False, ..., False, False, False])

In [36]:
economics_ids = ids[is_economics]
print('The following students are economics majors')
print(economics_ids)

The following students are economics majors
['5a397209-3782-4764-a285-10fae807ee71'
 'e26c3d69-3c74-49b6-81d7-47232787fad9'
 '1846c044-9a87-49e4-ad4f-d1bfadb4e41b' ...
 '32606dc2-862b-45cc-b0ac-f2b24253abdf'
 '8dc612f4-8150-4045-9e2d-cf160fb71da4'
 'bc551659-ba48-447e-aa6a-0c2f49aaa9c1']


##### Limits on usage of arrays
There is one significant way in which arrays are less flexible than lists: their ability to hold different variable types.

In [37]:
mixed_list = [1, 2.0, '3']
mixed_list

[1, 2.0, '3']

In [38]:
mixed_array = np.array(mixed_list) #No error, but be wary of what you wish for

In [39]:
mixed_array

array(['1', '2.0', '3'], dtype='<U32')

In this case, Numpy is "helping" you by auto-casting all of the values into a common data type. Unfortunately, this is not always what you want, so you have to be mindful of when it happens.
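
A quick way to notice auto-casting is to check the array's `dtype`. The sketch below also shows one possible fix, `astype`, which only works here because every string happens to look like a number:

```python
import numpy as np

mixed_array = np.array([1, 2.0, '3'])
print(mixed_array.dtype.kind)   # U (unicode strings)

# Integers mixed with floats are quietly promoted to floats
int_and_float = np.array([1, 2.0])
print(int_and_float.dtype)      # float64

# If the strings really are numbers, convert them explicitly
as_floats = mixed_array.astype(float)
print(as_floats)                # [1. 2. 3.]
```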

##### Numpy built-in array methods

In [40]:
gpas.mean() #Compute the arithmetic mean

3.376526

In [41]:
gpas.max() #Find the maximum value

4.0

In [42]:
gpas.min() #Find the minimum value

1.72

In [43]:
gpas.std() #Find the standard deviation

0.37002301458693077

In [44]:
gpas.argmax() #Find the FIRST index where the max value occurs

4

In [45]:
filt = gpas == gpas.max() #Comparison (==), not assignment (=): True wherever the GPA equals the max

In [46]:
#Alternatively, you can do this to find all of the IDs with a max value
max_gpa = gpas.max()
max_gpa_ids = ids[gpas == max_gpa]
max_gpa_ids

What percent of students have a 4.0?

There are two ways.

In [47]:
len(max_gpa_ids) / len(students)

In [48]:
(gpas == max_gpa).mean() #True values are cast to 1.0 and False values to 0.0

In [49]:
for major in set(majors):
    print(major, (majors == major).mean()) #Share of students in each major

##### Numpy built-in scientific helper Objects

Numpy has several built-in Objects that help with your analyses. There are WAY too many to cover, but here are a few that you may find yourself using.

##### Native Math functions

In [50]:
#Square root
np.sqrt(25)

5.0

In [51]:
#Base e logarithm
np.log(20)

2.995732273553991

In [52]:
#Base 10 logarithm
np.log10(20)

1.3010299956639813

In [53]:
#Base n logarithm
n = 10
np.log(20) / np.log(n)

1.301029995663981

In [54]:
squares = np.array([i**2 for i in range(1,10)])
squares

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

In [55]:
#As we showed before, we can vectorize mathematical operations
np.sqrt(squares)

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [56]:
np.log(squares)

array([0.        , 1.38629436, 2.19722458, 2.77258872, 3.21887582,
       3.58351894, 3.8918203 , 4.15888308, 4.39444915])

##### Random Number Generator and Sampler

In [57]:
np.random.randint(1,100,size=10)

array([35, 64, 46, 67,  2, 77, 32, 94, 63, 28])

In [58]:
np.random.random(size=10)

array([0.94902169, 0.13231593, 0.05168182, 0.66064409, 0.71586207,
       0.39218724, 0.85536755, 0.56734437, 0.18764922, 0.86941338])

In [59]:
np.random.random(size=10) * 10

array([1.25984898, 6.42800169, 2.73848888, 3.92833669, 5.15709465,
       7.13639831, 3.24261335, 0.83659915, 0.40138315, 3.79607391])

In [60]:
np.random.normal(1,.5, size=10)

array([0.18840837, 0.99065694, 1.14577965, 0.77542386, 0.97467738,
       0.48655405, 0.78830985, 0.89234144, 1.39525894, 1.00747233])
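
The random draws above change on every run. If you need repeatable results, you can seed the generator first. This is a minimal sketch with an arbitrary seed value (631); note also that `np.random.randint(1, 100)` excludes the upper bound, unlike base Python's `random.randint(1, 100)`, which includes it:

```python
import numpy as np

# Seeding the global generator makes the np.random.* functions repeatable
np.random.seed(631)
first_draw = np.random.randint(1, 100, size=5)
np.random.seed(631)
second_draw = np.random.randint(1, 100, size=5)
print((first_draw == second_draw).all())   # True

# Newer code often uses an explicit Generator object instead
rng = np.random.default_rng(631)
sample = rng.integers(1, 100, size=5)      # upper bound excluded, like randint
print(sample.shape)                        # (5,)
```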

In [61]:
lots_of_numbers = np.arange(1,10000) #Equivalent of native range() function

In [62]:
random_sample = np.random.choice(lots_of_numbers, size=10)
random_sample

array([7131, 9613, 8378, 9904, 5593, 4563, 7291, 6126,  104, 8827])

In [63]:
majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2,.5,.3], size=1000)
lots_of_majors

array(['Physics', 'Math', 'Math', 'Physics', 'Math', 'Economics', 'Math',
       'Economics', 'Math', 'Math', 'Math', 'Math', 'Physics',
       'Economics', 'Physics', 'Physics', 'Economics', 'Math', 'Math',
       'Physics', 'Physics', 'Math', 'Economics', 'Math', 'Physics',
       'Physics', 'Math', 'Math', 'Physics', 'Physics', 'Physics',
       'Economics', 'Physics', 'Economics', 'Economics', 'Physics',
       'Math', 'Economics', 'Physics', 'Math', 'Economics', 'Math',
       'Economics', 'Physics', 'Math', 'Physics', 'Economics', 'Math',
       'Math', 'Math', 'Math', 'Math', 'Economics', 'Physics', 'Math',
       'Physics', 'Physics', 'Physics', 'Physics', 'Physics', 'Math',
       'Physics', 'Math', 'Math', 'Physics', 'Physics', 'Physics',
       'Physics', 'Physics', 'Physics', 'Physics', 'Math', 'Physics',
       'Economics', 'Math', 'Math', 'Physics', 'Economics', 'Physics',
       'Physics', 'Economics', 'Physics', 'Physics', 'Physics',
       'Economics', 'Physics', ...

In [64]:
#How many Physics majors did we get?
lots_of_majors == 'Physics'

array([ True, False, False,  True, False, False, False, False, False,
       False, False, False,  True, False,  True,  True, False, False,
       False,  True,  True, False, False, False,  True,  True, False,
       False,  True,  True,  True, False,  True, False, False,  True,
       False, False,  True, False, False, False, False,  True, False,
        True, False, False, False, False, False, False, False,  True,
       False,  True,  True,  True,  True,  True, False,  True, False,
       False,  True,  True,  True,  True,  True,  True,  True, False,
        True, False, False, False,  True, False,  True,  True, False,
        True,  True,  True, False,  True,  True, False,  True, False,
        True,  True,  True, False,  True,  True,  True, False, False,
       False,  True, False, False, False, False, False,  True,  True,
       False, False,  True, False, False, False, False,  True, False,
        True, False, False, False,  True,  True,  True,  True, False,
        True, False, ...
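
The filter by itself does not answer the question; counting the True values does. Here is a sketch that regenerates a sample (with an arbitrary seed, so the exact counts will differ from the run above) and tallies it:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable
majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2, .5, .3], size=1000)

is_physics = lots_of_majors == 'Physics'
physics_count = is_physics.sum()    # each True counts as 1
physics_share = is_physics.mean()   # fraction of True values

print(physics_count, physics_share)

# np.unique can tally every category in one call
names, counts = np.unique(lots_of_majors, return_counts=True)
print(dict(zip(names.tolist(), counts.tolist())))
```

With `p=[.2,.5,.3]`, roughly half of the 1,000 draws should come back as Physics.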

### Intro to Pandas
Pandas is going to be your best friend for the rest of your time analyzing data in Python. It offers much of the functionality of Numpy, but with a MUCH more user-friendly way of displaying and accessing data.

The two fundamental data structures of Pandas are the Series and the DataFrame. A Series can be viewed as a one-dimensional Numpy array with row labels (an index), while a DataFrame can be viewed as a collection of Series that share the same index, with some additional functionality.
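
To make the two structures concrete, here is a minimal sketch that builds each one by hand. The three rows are a small illustrative sample, not the full dataset:

```python
import pandas as pd

# A Series is a column of values plus an index (the row labels)
gpas = pd.Series([3.12, 3.48, 3.40], name='gpa')
print(type(gpas.values).__name__)      # ndarray

# A DataFrame can be built from a dictionary of equal-length columns
tiny_df = pd.DataFrame({
    'first': ['Janis', 'Timothy', 'Elizabeth'],
    'major': ['Economics', 'Economics', 'Finance'],
    'gpa': [3.12, 3.48, 3.40],
})
print(tiny_df.shape)                   # (3, 3)

# Pulling one column back out returns a Series
print(type(tiny_df['gpa']).__name__)   # Series
print(tiny_df['gpa'].mean())
```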

Let's start by opening a file using Pandas' built-in `read_csv` function.

In [65]:
import pandas as pd
students_df = pd.read_csv('students.csv') #This is way easier than our method from before!

In [66]:
#Let's look at the first few records of the students DataFrame
students_df.head(8) #The "head" method will give you as many rows as you ask for; 5 is default

|   | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |
| 5 | 6c849c3e-e640-4bba-a86a-4323fd513b90 | Alphonse | Allen | Male | Freshman | Engineering | 2.99 |
| 6 | a5c87c39-447c-4c29-92af-fa702a8d5595 | Linwood | Coleman | Male | Freshman | Engineering | 3.58 |
| 7 | f6b177e8-e00a-480e-b62e-906c2ad80f85 | Arthur | Mccolpin | Male | Junior | Math | 3.04 |


#### Intro to the DataFrame

DataFrames consist of rows and columns. These are the fundamental "units" of data for a Data Scientist. This is what allows us to organize our data to analyze quickly and consistently. Rows are considered individual "observations" whereas columns are considered "attributes." The "shape" of a dataset is defined as the number of rows by the number of columns.

In the DataFrame above, you can see that the attributes identified are:
- student_id
- first
- last
- gender
- class
- major
- gpa

Each student's information can be found in a row. Thus, the first row (index 0) is Janis Brown, who is a Female and a Junior Economics student with a GPA of 3.12.

This is MUCH easier to read and track than lists of dictionaries or dictionaries of dictionaries or whatever other combination of data structures that base Python might be able to display for you. You're probably asking why ALL data isn't shown this way, and we'll get to that later on, but in the meantime, let's focus on the simple case here.

Let's start by looking at the shape of the data.

In [67]:
students_df.shape

(10000, 7)

This means that the data has 10,000 rows and 7 columns. The first things you should always do when using DataFrames are to look at the first few rows (the `.head()` method) and the shape of the data.

### Accessing chunks of data in a DataFrame

You will almost always want to focus your attention or compute values on certain subsections of the data. For this, there are several ways to slice and dice the data.

##### Accessing columns

Much like dictionaries, columns are accessed via brackets and some column name (just like keys).

In [68]:
first_names = students_df['first'] #Returns a Series object
first_names

0           Janis
1         Timothy
2       Elizabeth
3          Edward
4            Lisa
5        Alphonse
6         Linwood
7          Arthur
8          Daniel
9           Barry
10      Maryellen
11          Jacob
12           Lisa
13            Ida
14           Lola
15           Joan
16          Larry
17      Bridgette
18          James
19          David
20         Ronald
21        Heather
22          Nancy
23          David
24          Jesus
25        Richard
26          Henry
27          Pedro
28         Israel
29           Mary
          ...    
9970     Georgina
9971      Lynette
9972       Vickie
9973       Willie
9974       Sylvia
9975        Derek
9976      Charles
9977     Jonathan
9978         Dave
9979        Ralph
9980         Leta
9981        Megan
9982       Olivia
9983         Anne
9984         John
9985         John
9986       Shelly
9987       Roland
9988     Margaret
9989        Alice
9990        Mable
9991       Joseph
9992       Gladys
9993      Douglas
...

Unlike dictionaries, you can access multiple columns at the same time.

In [69]:
full_names = students_df[['first', 'last']]
full_names

|   | first | last |
|---|---|---|
| 0 | Janis | Brown |
| 1 | Timothy | Bishop |
| 2 | Elizabeth | Owens |
| 3 | Edward | Pearson |
| 4 | Lisa | Gonzalez |
| 5 | Alphonse | Allen |
| 6 | Linwood | Coleman |
| 7 | Arthur | Mccolpin |
| 8 | Daniel | Carter |
| 9 | Barry | Reyes |


##### Accessing rows

Much like numpy arrays, rows can be accessed via indices.

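In [70]:
first_5_students = students_df.iloc[:5] #A slice of row positions, just like slicing an array
first_5_students

|   | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |

In [None]:
second_5_students = students_df.iloc[5:10] #Rows at positions 5 through 9
second_5_students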
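
The different forms of `.iloc` are easy to mix up, so here is a short sketch on a made-up five-row DataFrame (`tiny_df` is a stand-in for illustration, not the lecture's data file):

```python
import pandas as pd

tiny_df = pd.DataFrame({
    'first': ['Janis', 'Timothy', 'Elizabeth', 'Edward', 'Lisa'],
    'gpa': [3.12, 3.48, 3.40, 3.84, 4.00],
})

print(tiny_df.iloc[0])          # one row, returned as a Series
print(tiny_df.iloc[:2])         # a slice of rows, returned as a DataFrame
print(tiny_df.iloc[[0, 2, 4]])  # arbitrary rows: pass a LIST of positions

# Two arguments mean "row position, column position"
print(tiny_df.iloc[3, 1])       # 3.84

# Three bare positions, e.g. tiny_df.iloc[0, 2, 4], raise
# "IndexingError: Too many indexers" because a DataFrame has only two axes
```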

Also like numpy arrays, you can filter! This is going to be life changing!

In [71]:
gender = students_df['gender']
female_filt = gender == 'Female' #Boolean Series: True where the student is female
female_students_df = students_df[female_filt] #Same as students_df[students_df['gender'] == 'Female']
#The same pattern works for other columns, e.g. students_df[students_df['major'] == 'Economics']
female_students_df.head(20)

Let's take a step-by-step approach to figuring out that filter.

In [72]:
#Find out which rows are females
female_filt = students_df['gender'] == 'Female' #The comparison is case-sensitive
female_filt

In [73]:
#Now apply that filter to the DataFrame
females_df = students_df[female_filt]

In [74]:
#How many female students are there?
females_df.shape[0]

In [79]:
#Now let's find out which rows are females and are studying Economics
econ_filt = students_df['major'] == 'Economics'
combined_filt = econ_filt & female_filt
both = students_df[combined_filt]
both.shape

In [76]:
#Flag students on probation (GPA below 3) by storing the filter as a new column
on_probation_filt = students_df['gpa'] < 3
students_df['on_probation'] = on_probation_filt

In [77]:
#Which major has the highest share of students on probation?
students_df.groupby('major')[['on_probation']].mean().idxmax()
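
To see what the `groupby` step computes, here is a sketch on a made-up five-row DataFrame (`toy_df` and its values are invented for illustration). Because True counts as 1 and False as 0, the mean of a boolean column within each group is the share of True values:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'major': ['Math', 'Math', 'Economics', 'Economics', 'Economics'],
    'gpa': [2.5, 3.5, 3.2, 3.8, 2.9],
})

# Store the filter as a new boolean column
toy_df['on_probation'] = toy_df['gpa'] < 3

# Mean of a boolean column per group = share of students on probation
probation_share = toy_df.groupby('major')['on_probation'].mean()
print(probation_share)

# idxmax returns the index label (here, the major) with the largest value
print(probation_share.idxmax())   # Math
```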

In [78]:
#Now let's find out which rows are females and are studying Engineering and are on probation


In [None]:
#Now apply that triple filter to the DataFrame


In [None]:
#How many students are there?