# MSDS 631 - Lecture 5 (February 20, 2019)

## Debugging and Key Scientific Libraries

### Debugging

Coding never goes as one plans. We can come up with the best plan, yet we will inevitably mess up our code somewhere. With all of the specificity that computers require, the smallest detail can cause an error. The following are the most common issues you'll run into:
- Incorrect indentations
- Missing or incorrectly placed parentheses
- Missing colons
- Having a return in the wrong level of indentation in a for-loop or function
- Accidentally using a variable inside a function that was never passed in (i.e. assuming the existence of global variables)
- Changing the name of a variable but forgetting a few instances
- Thinking a data structure is a list instead of a dictionary (or vice versa)

There are too many possibilities to list, but addressing any issues from this list will be a good start.

If running through this checklist doesn't help, then I always recommend starting with "The Squirrel" technique.

#### Rubber Duck Debugging
The Squirrel is *actually* called the "Rubber Duck Technique"... I just adopted the squirrel version that one of my past professors taught me.

*"In software engineering, rubber duck debugging is a method of debugging code. The name is a reference to a story in the book The Pragmatic Programmer in which a programmer would carry around a rubber duck and debug their code by forcing themselves to explain it, line-by-line, to the duck. Many other terms exist for this technique, often involving different inanimate objects."* (https://en.wikipedia.org/wiki/Rubber_duck_debugging)

#### Break apart your code
One of the most common issues is packing too many expressions into a single line of code. This could include multiple method calls, function calls, or data structure lookups.

In [2]:
import json
students = json.load(open('students.json', 'r'))

If I wanted to print a sentence regarding the first student, the code could look something like this:

In [3]:
print("The first student's GPA was {}, which was high for her major".format(round(students[0]['gpa']),1)).capitalize()

The first student's GPA was 3, which was high for her major


AttributeError: 'NoneType' object has no attribute 'capitalize'

That's a lot going on in a single line. When you run into errors in complex code like this, start by breaking the code apart.

In [4]:
#First thing I want to do is access the first student's records and get their GPA.
student = students[0]
first_gpa = student['gpa']

In [5]:
#Now we're down to the following code:
print("The first student's GPA was {}, which was high for her major".format(round(first_gpa),1)).capitalize()

The first student's GPA was 3, which was high for her major


AttributeError: 'NoneType' object has no attribute 'capitalize'

In [6]:
#Let's steal the code from within parentheses to see how each part works
#The first thing that will get computed is what is inside the format parentheses
round(first_gpa),1) #This is clearly wrong


SyntaxError: invalid syntax (<ipython-input-6-2ebd98efb32c>, line 3)

In [7]:
rounded_first_gpa = round(first_gpa,1)

In [8]:
#Now the code reads
print("The first student's GPA was {}, which was high for her major".format(rounded_first_gpa)

SyntaxError: unexpected EOF while parsing (<ipython-input-8-b29f0195e41a>, line 2)

In [9]:
#Let's go back to checking things within parentheses
#Now we're down to the contents within the print statement
"The first student's GPA was {}, which was high for her major".format(rounded_first_gpa).capitalize(

SyntaxError: unexpected EOF while parsing (<ipython-input-9-1376795cac3f>, line 3)

In [10]:
#Now we've found the next culprit... a missing parenthesis!
#Let's assign the phrase to another variable
phrase = "The first student's GPA was {}, which was high for her major".format(rounded_first_gpa).capitalize()

In [11]:
#Now all we have to do is print the phrase
print(phrase)

The first student's gpa was 3.1, which was high for her major


More lines of code may not seem as efficient or elegant, but I promise it will result in less buggy code. The key is to improve readability.

##### Example

In [12]:
my_points = 53
possible_points = 70
score = (my_points / possible_points) * 100
grade = letter_grade(score)

rounded_score = round(score, 0)

if score >= 60 and grade <69:
        print ('your grade is {}, so your letter grade is D'.format(rounded_score)
    elif score >=69 and grade <79:
        print ('your grade is {}, so your letter grade is C'.format(rounded_score)
    elif score >=79 and grade <89:
        print ('your grade is {}, so your letter grade is B'.format(rounded_score)
    elif: score >= 89
    print ('your grade is {}, so your letter grade is A'.format(round(grade,0))

SyntaxError: invalid syntax (<ipython-input-12-ff1f6293153e>, line 10)
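
One possible cleaned-up version of the cell above is sketched here. It assumes the cutoffs in the original conditions (60, 69, 79, 89) are the intended ones and that anything below 60 is an F, which the cell above does not state. `letter_grade` is not defined anywhere in these notes, so the version below is a hypothetical helper written only for illustration.

```python
def letter_grade(score):
    """Return a letter for a percentage score.

    Cutoffs are copied from the cell above; the 'F' branch is an assumption.
    """
    if score >= 89:
        return 'A'
    elif score >= 79:
        return 'B'
    elif score >= 69:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'


my_points = 53
possible_points = 70
score = (my_points / possible_points) * 100
rounded_score = round(score)  # round to the nearest whole number
grade = letter_grade(score)

# Build the message first, then print it (one step per line)
message = 'your grade is {}, so your letter grade is {}'.format(rounded_score, grade)
print(message)
```

Moving the comparisons into one function means each cutoff is written once, and the `if`/`elif` branches all sit at the same indentation level with a colon at the end of each condition.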

## Scientific Libraries

While Python is arguably the most widely used language among Data Scientists, the language by itself does not necessarily lend itself to data analysis or modeling. The development that most influenced its widespread adoption in the field was the creation of two libraries: Numpy and Pandas.

Numpy (pronounced "num-pie") traces its roots to a library first written in 1995, but it took its current form in 2006. Since then it has been a foundational library for performing data analysis in Python. In 2008, Wes McKinney began writing Pandas, which took the concepts started by Numpy and melded them with powerful data structures from R.

Let's start by talking about Numpy.

Numpy is a library with many built-in objects and functions that rely on code written in the C language. This makes code run MUCH faster than the equivalent pure-Python code (and often with a lot less code). Let's take a look at a few examples.

### Intro to Numpy

In [13]:
import random
import numpy as np #We use an alias to shorten our code

##### Arrays

The foundational data structure for Numpy is the array. Like lists, values within arrays are accessible by index, but the array has many more powerful features that make analysis much easier.

In [14]:
#Create some random numbers in Python
list_of_numbers1 = [random.randint(1,100) for _ in range(10)]
list_of_numbers2 = [random.randint(1,100) for _ in range(10)]
list_of_numbers3 = [random.randint(1,100) for _ in range(10)]

list_of_numbers1

[3, 68, 40, 7, 30, 83, 68, 71, 2, 88]

In [15]:
#Convert the lists of random numbers into Numpy arrays
np_array_of_numbers1 = np.array(list_of_numbers1) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers2 = np.array(list_of_numbers2) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers3 = np.array(list_of_numbers3) #Casting a list as a new Numpy-specific data structure

np_array_of_numbers1

array([ 3, 68, 40,  7, 30, 83, 68, 71,  2, 88])

In [16]:
#Values can be accessed just like lists by index
np_array_of_numbers1[0]

3

In [17]:
#... or by range
np_array_of_numbers1[:5]

array([ 3, 68, 40,  7, 30])

In [18]:
#What you CAN do that's unique is access data with a list of arbitrary indices
np_array_of_numbers1[[0,3,4,7]]

array([ 3,  7, 30, 71])

In [19]:
#This doesn't work for lists
list_of_numbers1[[0,3,4,7]]

TypeError: list indices must be integers or slices, not list

##### Array Math

Let's add 5 to every number in the first list.

In [20]:
#Base Python



In [21]:
#Using Numpy



What just happened here???

Numpy uses something called "broadcasting." This stretches a single value (like the 5 above) across the whole array so that a mathematical operation is applied to each element. We can do a lot more than this too! Applying an operation to an entire array at once, without writing an explicit loop, is more commonly called **"vectorization."**
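
As a minimal sketch of the difference, the cell below adds 5 to a short list in base Python and then to the equivalent Numpy array. The values are the first five numbers of `list_of_numbers1` shown earlier; the variable names are made up for this illustration.

```python
import numpy as np

numbers = [3, 68, 40, 7, 30]  # first five values of list_of_numbers1 above

# Base Python: loop over the list and build a new one
python_plus_5 = []
for n in numbers:
    python_plus_5.append(n + 5)

# Numpy: the single value 5 is broadcast across every element of the array
np_numbers = np.array(numbers)
numpy_plus_5 = np_numbers + 5

print(python_plus_5)
print(numpy_plus_5)
```

The same pattern should work with the other arithmetic operators (`-`, `*`, `/`, `**`).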

Let's multiply each number in list 1 by 5

In [22]:
#Base Python
python_nums = []



python_nums

[]

In [23]:
#Using Numpy


numpy_nums

NameError: name 'numpy_nums' is not defined

Now let's add each element of list 1 and list 2 and then divide the sum by each value of list 3

In [24]:
#Example of first index
(list_of_numbers1[0] + list_of_numbers2[0]) / list_of_numbers3[0]

2.48

In [25]:
#Let's do it for all of the values now
#Base Python


python_nums

[]

In [26]:
#Using Numpy


numpy_nums

NameError: name 'numpy_nums' is not defined

When doing **element-wise** math between arrays, it is critical that the arrays have the same shape (or that one side is a single value, as in the examples above).

In [27]:
longer_array = np.array([random.randint(1,100) for _ in range(10)])
shorter_array = np.array([random.randint(1,100) for _ in range(5)])
longer_array / shorter_array #This doesn't work

ValueError: operands could not be broadcast together with shapes (10,) (5,) 
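
Here is a small sketch of element-wise math across three arrays of the same shape, followed by a guarded version of the mismatched-shape case. The arrays are made-up values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])
c = np.array([2, 4, 5, 8])

# Same shape: the operation is applied position by position
combined = (a + b) / c
print(combined)

# Different lengths: Numpy raises a ValueError, which we can catch
short = np.array([1, 2])
try:
    a / short
except ValueError as err:
    print('Shapes', a.shape, 'and', short.shape, 'do not match:', err)
```

Checking `.shape` on each array before doing the math is a quick way to track down this kind of error.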

Element-wise math can happen with almost any standard mathematical operator (or combination of operators). In fact, you can even apply functions to arrays, as long as the operations being performed are mathematical in nature (and not logical).

In [28]:
#Basic arithmetic operations and numpy math functions
def math_output(x):
    a = x + 5
    b = a / 2
    c = b ** 3
    d = np.sqrt(c) #Numpy has several built-in math functions (d is shown for illustration only)
    return c #Note that we return c, not d

In [29]:
math_output(7)

216.0

In [30]:
#In base Python
answers = []
for i in list_of_numbers1:
    answer = math_output(i)
    answers.append(answer)
answers

[64.0,
 48627.125,
 11390.625,
 216.0,
 5359.375,
 85184.0,
 48627.125,
 54872.0,
 42.875,
 100544.625]

--------
##### Sidenote: List Comprehensions
Sometimes you want to create a simple list and writing so many lines of code might seem silly. List comprehensions are a nice, clean, efficient way to create lists without having to write out so much code.

In [31]:
#SIDENOTE: You can also use something called list comprehensions to create the above code in one line
answers = [math_output(i) for i in list_of_numbers1] #Create in-line for-loops
answers

[64.0,
 48627.125,
 11390.625,
 216.0,
 5359.375,
 85184.0,
 48627.125,
 54872.0,
 42.875,
 100544.625]

In [32]:
remainders = [i % 2 for i in list_of_numbers1]
remainders

[1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

While list comprehensions are cool and clean looking, please be careful not to go overboard with them. The more complex your logic, the harder they are to write and read, and the easier they are to get wrong.

---

In [33]:
math_output(np_array_of_numbers1)

array([6.40000000e+01, 4.86271250e+04, 1.13906250e+04, 2.16000000e+02,
       5.35937500e+03, 8.51840000e+04, 4.86271250e+04, 5.48720000e+04,
       4.28750000e+01, 1.00544625e+05])

In [34]:
def logical_output(x):
    if x % 2 == 0:
        return 'even'
    elif x % 2 == 1:
        return 'odd'
    else:
        return 'something else'

In [35]:
logical_output(4)

'even'

In [36]:
answers = []
for i in list_of_numbers1:
    answer = logical_output(i)
    answers.append(answer)
answers

['odd', 'even', 'even', 'odd', 'even', 'odd', 'even', 'odd', 'even', 'even']

In [37]:
logical_output(np_array_of_numbers1) #You cannot vectorize control flow

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

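The error occurs because `if` expects a single True or False, but `x % 2 == 0` applied to an array produces a whole array of True and False values. Arrays like that are the basis of what Numpy calls filtering. Let's look at the pieces of filtering that make it so powerful.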
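
As a hedged aside, if you do want an "even"/"odd" label for every element, one option is `np.where`, which is not covered elsewhere in these notes. The sketch below uses the first five values of the array above.

```python
import numpy as np

numbers = np.array([3, 68, 40, 7, 30])

# The comparison runs element-wise and returns an array of booleans
is_even = numbers % 2 == 0

# np.where picks from the second argument where the condition is True,
# and from the third argument where it is False
labels = np.where(is_even, 'even', 'odd')
print(labels)
```

This replaces the `if`/`else` with a single array-level choice, so no Python loop is needed.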

#### Numpy Filtering

Filters are arrays consisting of True and False values. You can obtain them by applying a comparison directly to an array.

In [38]:
np_array_of_numbers1

array([ 3, 68, 40,  7, 30, 83, 68, 71,  2, 88])

In [39]:
#Let's find the even numbers in our array
is_even = np_array_of_numbers1 % 2 == 0
is_even

array([False,  True,  True, False,  True, False,  True, False,  True,
        True])

Now, much like indexing in an array, we can use this filter to return only the values where the filter is True.

In [40]:
np_array_of_numbers1[is_even]

array([68, 40, 30, 68,  2, 88])

In [41]:
#Let's now find the numbers divisible by 3 in our array
divis_by_3_filter = np_array_of_numbers1 % 3 == 0
nums_divisible_by_3 = np_array_of_numbers1[divis_by_3_filter]
nums_divisible_by_3

array([ 3, 30])

In [42]:
#We can also do this without creating a filter
np_array_of_numbers1[np_array_of_numbers1 % 3 == 0]

array([ 3, 30])

Filtering will be an extremely important concept as we move forward.

We can also use these filters on other arrays of the same shape, assuming the indices of the array are all associated with the same entity (e.g. the same student for the example below).

Imagine we're on the school board and want to identify all of the honors students. We want to start by finding everyone with a GPA 3.7 or higher. When we did this previously, it required a complicated loop. Now, with Numpy, we can use simple filtering to identify those students.

In [43]:
students = json.load(open('students.json', 'r'))
ids = np.array([i['student_id'] for i in students])
gpas = np.array([i['gpa'] for i in students])

In [44]:
#Let's find the students with GPAs over 3.7
high_gpa_filter = <insert code here>

print('There are {} students eligible for honor roll. They have the following IDs'.format(len(honor_ids)))
honor_ids

SyntaxError: invalid syntax (<ipython-input-44-7e7ebbeba1e4>, line 2)
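
One way the placeholder could be filled in is sketched below. The `ids` and `gpas` arrays here are small hypothetical stand-ins for the ones built from `students.json`, and the sketch assumes "3.7 or higher" means `>=`.

```python
import numpy as np

# Hypothetical stand-ins for the ids and gpas arrays built from students.json
ids = np.array(['s1', 's2', 's3', 's4'])
gpas = np.array([3.12, 3.84, 4.0, 3.7])

# Build the filter from the GPA array...
high_gpa_filter = gpas >= 3.7

# ...and apply it to the ID array, which has the same shape and ordering
honor_ids = ids[high_gpa_filter]

print('There are {} students eligible for honor roll.'.format(len(honor_ids)))
print(honor_ids)
```

This only works because position `i` in `ids` and position `i` in `gpas` refer to the same student.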

In [45]:
#If we wanted to know all of the students that were economics majors, we could create another filter
majors = np.array([i['major'] for i in students])
is_economics = majors == 'Economics'
is_economics

array([ True,  True, False, ..., False, False, False])

In [46]:
economics_ids = ids[is_economics]
print('The following students are economics majors')
print(economics_ids)

The following students are economics majors
['5a397209-3782-4764-a285-10fae807ee71'
 'e26c3d69-3c74-49b6-81d7-47232787fad9'
 '1846c044-9a87-49e4-ad4f-d1bfadb4e41b' ...
 '32606dc2-862b-45cc-b0ac-f2b24253abdf'
 '8dc612f4-8150-4045-9e2d-cf160fb71da4'
 'bc551659-ba48-447e-aa6a-0c2f49aaa9c1']


##### Limits on usage of arrays
There is one significant way in which arrays are less flexible than lists: an array cannot hold values of different types.

In [47]:
mixed_list = [1, 2.0, '3']
mixed_list

[1, 2.0, '3']

In [48]:
mixed_array = np.array(mixed_list) #No error, but be wary of what you wish for

In [49]:
mixed_array

array(['1', '2.0', '3'], dtype='<U32')

In this case, Numpy is "helping" you by auto-casting all of the values into a common data type. Unfortunately, this is not always something you want, and you have to be mindful of when it happens.
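
A quick way to notice auto-casting is to check the array's `dtype`. The sketch below also shows `dtype=object`, an option not discussed in these notes, which keeps the original Python values at the cost of the fast vectorized math.

```python
import numpy as np

# Integers mixed with floats are promoted to floats
ints_and_floats = np.array([1, 2.0])
print(ints_and_floats.dtype)

# Adding a string turns everything into strings ('U' = unicode string dtype)
mixed = np.array([1, 2.0, '3'])
print(mixed.dtype.kind)

# dtype=object stores the original Python objects unchanged
kept = np.array([1, 2.0, '3'], dtype=object)
print([type(v).__name__ for v in kept])
```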

##### Numpy built-in array methods

In [50]:
gpas.mean() #Compute the arithmetic mean

3.376526

In [51]:
gpas.max() #Find the maximum value

4.0

In [52]:
gpas.min() #Find the minimum value

1.72

In [53]:
gpas.std() #Find the standard deviation

0.37002301458693077

In [54]:
gpas.argmax() #Find the FIRST index where the max value occurs

4

In [55]:
#Alternatively, you can do this to find all of the IDs with a max value
max_gpa = gpas.max()
max_gpa_ids = ids[gpas == max_gpa]
max_gpa_ids

array(['84cec8f4-0b64-44ce-a628-c0eb73f6ca6f',
       '1418830a-ca98-4b96-96c4-2b06a51ea306',
       'e0d1dcc2-d48d-481e-ac3d-2a3cacf11787',
       'd3170950-f20b-4121-bd44-2d0595dd46c1',
       '2900cea4-e471-4cb1-9c81-0248b3d7c6e7',
       '634c6900-4a15-495e-93f2-a46296706be8',
       '5aa1f9cd-1313-4443-abce-e74df66ae16a',
       '05cebb19-39bc-47d1-9ce8-95b5b3a82f9a',
       '3c468120-f16a-4aa0-87a7-62c7678d1016',
       '13a5fabf-787b-4f40-8609-08366ebae350',
       'cd424190-fc3b-45e0-a439-c0171bc12634',
       'c7bdcbce-737c-4a10-b040-380feddd9bb6',
       '2cea7bd0-5730-4046-b8bd-5b867a9a81ca',
       '177d8fcb-363b-42bb-9823-0aa657f4e201',
       'e0184616-f3be-4028-b2cf-a7a47e8e175e',
       '59ddb994-4270-4a6a-9a59-05157e95ab9c',
       'dc5987a2-ab7a-4030-9180-83058c2a01ce',
       '0908951a-47c9-428c-92f5-5bb78ab8dfe5',
       '75fb1c66-2d88-4b3a-885c-f00925271e03',
       '78008297-6fdf-40a5-93eb-f7783d14e41d',
       'f404b1e1-0e62-47a6-88b2-eed3478badac',
       '1d89f

What percent of students have a 4.0?

There are two ways.

In [56]:
len(max_gpa_ids) / len(students)

0.0654

In [57]:
(gpas == max_gpa).mean() #True is treated as 1 and False as 0, so the mean is the fraction of True values

0.0654

##### Numpy built-in scientific helper Objects

Numpy has many built-in functions and objects that help with your analyses. There are WAY too many to cover, but here are a few that you may find yourself using.

##### Native Math functions

In [58]:
#Square root
np.sqrt(25)

5.0

In [59]:
#Base e logarithm
np.log(20)

2.995732273553991

In [60]:
#Base 10 logarithm
np.log10(20)

1.3010299956639813

In [61]:
#Base n logarithm
n = 10
np.log(20) / np.log(n)

1.301029995663981

In [62]:
squares = np.array([i**2 for i in range(1,10)])
squares

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

In [63]:
#As we showed before, we can vectorize mathematical operations
np.sqrt(squares)

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [64]:
np.log(squares)

array([0.        , 1.38629436, 2.19722458, 2.77258872, 3.21887582,
       3.58351894, 3.8918203 , 4.15888308, 4.39444915])

##### Random Number Generator and Sampler

In [65]:
np.random.randint(1,100,size=10)

array([60, 36, 57, 50, 87, 90, 63, 71, 41, 15])

In [66]:
np.random.random(size=10)

array([0.57618648, 0.61020781, 0.25649117, 0.06070109, 0.5438586 ,
       0.96589935, 0.30048617, 0.95570261, 0.39798286, 0.05638088])

In [67]:
np.random.random(size=10) * 10

array([2.90285679, 1.40545779, 9.4832202 , 1.97975768, 2.34419616,
       5.65221295, 5.61556918, 6.52002614, 3.77583904, 5.16654763])

In [68]:
np.random.normal(1,.5, size=10)

array([ 1.38565065,  0.69318036,  1.54080356, -0.58210496,  1.50552311,
        0.72335224,  1.7283337 ,  1.65015739,  1.29118943,  2.0077785 ])

In [69]:
lots_of_numbers = np.arange(1,10000) #Equivalent of native range() function

In [70]:
random_sample = np.random.choice(lots_of_numbers, size=10)
random_sample

array([2309, 9126, 1836, 3912, 3645,  850,  801, 7876, 4218, 9546])

In [71]:
majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2,.5,.3], size=1000)
lots_of_majors

array(['Economics', 'Economics', 'Physics', 'Math', 'Physics',
       'Economics', 'Physics', 'Physics', 'Math', 'Math', 'Physics',
       'Physics', 'Math', 'Economics', 'Physics', 'Math', 'Economics',
       'Economics', 'Economics', 'Math', 'Physics', 'Physics', 'Physics',
       'Economics', 'Physics', 'Physics', 'Physics', 'Math', 'Physics',
       'Physics', 'Physics', 'Physics', 'Math', 'Physics', 'Math',
       'Physics', 'Math', 'Physics', 'Math', 'Physics', 'Math', 'Math',
       'Physics', 'Math', 'Physics', 'Economics', 'Math', 'Math',
       'Physics', 'Economics', 'Math', 'Physics', 'Economics',
       'Economics', 'Physics', 'Math', 'Economics', 'Math', 'Physics',
       'Economics', 'Physics', 'Physics', 'Economics', 'Math',
       'Economics', 'Math', 'Physics', 'Physics', 'Physics', 'Physics',
       'Economics', 'Physics', 'Physics', 'Physics', 'Physics', 'Math',
       'Physics', 'Math', 'Physics', 'Physics', 'Math', 'Physics', 'Math',
       'Math', 'Math', 'Econom

In [72]:
#How many Physics majors did we get?
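
Two possible ways to answer this are sketched below: summing a boolean filter, and tallying every category with `np.unique`. The sample is random, so the counts will differ from run to run.

```python
import numpy as np

majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2, .5, .3], size=1000)

# Summing a boolean filter counts the True values
physics_count = (lots_of_majors == 'Physics').sum()

# np.unique with return_counts=True tallies every category at once
names, counts = np.unique(lots_of_majors, return_counts=True)

print(physics_count)
print(dict(zip(names, counts)))
```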


### Intro to Pandas
Pandas is going to be your best friend for the rest of your time analyzing data in Python. It offers much of the functionality of Numpy, but with a MUCH more user-friendly way of displaying and accessing data.

The two fundamental data structures of Pandas are the Series and the DataFrame. A Series can be viewed as roughly equivalent to a one-dimensional Numpy array with labels attached, while a DataFrame can be viewed as a collection of Series (one per column), with some additional functionality.
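
To make the relationship concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The two rows are illustrative and `small_df` is a made-up name; the column names mirror the students file.

```python
import pandas as pd

# Hypothetical two-student table built from a dictionary of columns
small_df = pd.DataFrame({
    'first': ['Janis', 'Timothy'],
    'gpa': [3.12, 3.48],
})

gpa_series = small_df['gpa']      # selecting one column returns a Series
print(type(gpa_series).__name__)
print(gpa_series.mean())          # Series support Numpy-style methods
print(gpa_series.to_numpy())      # the underlying values as a Numpy array
```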

Let's start by opening a file using Pandas' built-in `read_csv` function.

In [73]:
import pandas as pd
students_df = pd.read_csv('students.csv')

In [74]:
#Let's look at the first few records of the students DataFrame
students_df.head()

| Unnamed: 0 | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |


##### Accessing data in DataFrame

In [75]:
students_first_names = students_df['first'] #Returns a Series object

In [76]:
students_first_names.head(10)

0        Janis
1      Timothy
2    Elizabeth
3       Edward
4         Lisa
5     Alphonse
6      Linwood
7       Arthur
8       Daniel
9        Barry
Name: first, dtype: object

In [77]:
students_first_names.mode()

0    Robert
dtype: object