# MSDS 631 - Lecture 6 (February 27, 2019)

## Key Scientific Libraries - Numpy and Pandas

--------
##### Sidenote: List Comprehensions
Sometimes you want to create a simple list and writing so many lines of code might seem silly. List comprehensions are a nice, clean, efficient way to create lists without having to write out so much code.

In [1]:
remainders = []
for i in range(101):
    remainder = i % 2
    remainders.append(remainder)

In [2]:
def gimme_even_or_odd(num):
    if num % 2 == 0:
        return 'even'
    else:
        return 'odd'

In [3]:
remainders = [i % 2 for i in range(10)]
remainders

[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

In [4]:
evens_and_odds = [gimme_even_or_odd(i) for i in range(10)]

In [5]:
evens_and_odds

['even', 'odd', 'even', 'odd', 'even', 'odd', 'even', 'odd', 'even', 'odd']

While list comprehensions are cool and clean looking, please be careful to not go overboard with these. They are easy to screw up, so the more complex your logic, the harder it is to write them.
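As a rough guide to where the line sits, here is a minimal sketch of a comprehension with one `if` clause next to the equivalent loop. The variable names `evens` and `evens_loop` are made up for this illustration. One condition is usually still readable; beyond that, the loop tends to be the safer choice.

```python
# Keep only the even numbers from 0-9 by adding an `if` clause to the comprehension
evens = [i for i in range(10) if i % 2 == 0]
print(evens)  # [0, 2, 4, 6, 8]

# The equivalent explicit loop, which is often easier to read once the logic grows
evens_loop = []
for i in range(10):
    if i % 2 == 0:
        evens_loop.append(i)

print(evens == evens_loop)  # True
```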

---

### Scientific Libraries

While Python is arguably the most widely used language among Data Scientists, the core language by itself does not lend itself to data analysis or modeling. The development that most drove its widespread adoption in the field was the creation of two libraries: Numpy and Pandas.

Numpy (pronounced "num-pie") traces back to 1995, when its predecessor Numeric was written, but it took its current form in 2006. Since then it has been a foundational library for performing data analysis in Python. Two years later, Wes McKinney wrote Pandas, which took the concepts started by Numpy and melded them with powerful data structures from R.

Let's start by talking about Numpy.

Numpy is a library with many built-in objects and functions that run on code written in the C language. This makes code run MUCH faster than the equivalent loop written in pure Python (and often with a lot less code). Let's take a look at a few examples.
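To get a feel for the speed claim, here is a small timing sketch (the `np` alias it uses is introduced in the next section). The function names and the array size of 100,000 are arbitrary choices for this illustration, and the actual timings will vary from machine to machine, so no specific numbers are shown.

```python
import timeit

import numpy as np

values = list(range(100_000))                 # plain Python list
array = np.arange(100_000, dtype=np.int64)    # same numbers as a Numpy array

def python_sum_of_squares():
    # Pure Python: square and add one element at a time
    return sum(v * v for v in values)

def numpy_sum_of_squares():
    # Numpy: square the whole array at once, then sum it
    return int((array * array).sum())

# Both approaches must agree before a timing comparison means anything
assert python_sum_of_squares() == numpy_sum_of_squares()

python_seconds = timeit.timeit(python_sum_of_squares, number=20)
numpy_seconds = timeit.timeit(numpy_sum_of_squares, number=20)
print('Python loop: {:.4f}s'.format(python_seconds))
print('Numpy:       {:.4f}s'.format(numpy_seconds))
```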

### Intro to Numpy

In [6]:
import random
import numpy as np #We use an alias to shorten our code

##### Arrays

The foundational data structure for Numpy is the array. Like lists, values within arrays are accessible by index, but the array has many more powerful features that make analysis much easier.

In [7]:
#Create some random numbers in Python
list_of_numbers1 = [random.randint(1,100) for _ in range(10)]
list_of_numbers2 = [random.randint(1,100) for _ in range(10)]
list_of_numbers3 = [random.randint(1,100) for _ in range(10)]

list_of_numbers1

[97, 47, 20, 66, 41, 10, 11, 25, 31, 70]

In [8]:
#Create some random numbers in Numpy
np_array_of_numbers1 = np.array(list_of_numbers1) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers2 = np.array(list_of_numbers2) #Casting a list as a new Numpy-specific data structure
np_array_of_numbers3 = np.array(list_of_numbers3) #Casting a list as a new Numpy-specific data structure

np_array_of_numbers1

array([97, 47, 20, 66, 41, 10, 11, 25, 31, 70])

In [9]:
#Values can be accessed just like lists by index
np_array_of_numbers1[0]

97

In [10]:
#... or by range
np_array_of_numbers1[:5]

array([97, 47, 20, 66, 41])

In [12]:
#What you CAN do that's unique is access data by an arbitrary list of indices
i_want_these_indices = [0,3,4,7]
np_array_of_numbers1[i_want_these_indices]

array([97, 66, 41, 25])

In [13]:
#This doesn't work for lists
list_of_numbers1[i_want_these_indices]

TypeError: list indices must be integers or slices, not list

##### Array Math

Let's add 5 to every number in the first list.

In [14]:
#Base Python
new_numbers = []
for i in list_of_numbers1:
    new_number = i + 5
    new_numbers.append(new_number)
print(list_of_numbers1)
print(new_numbers)

[97, 47, 20, 66, 41, 10, 11, 25, 31, 70]
[102, 52, 25, 71, 46, 15, 16, 30, 36, 75]


In [15]:
#Using Numpy
np_array_of_numbers1 + 5

array([102,  52,  25,  71,  46,  15,  16,  30,  36,  75])

What just happened here???

Numpy applied the operation to every element of the array at once, without an explicit loop. The commonly used terminology for this is **"vectorization."** The related term "broadcasting" describes how Numpy stretches the smaller operand (here, the single number 5) so that it lines up with every element of the array. We can do a lot more than this too!
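Broadcasting is not limited to a single number. As a small sketch (the arrays `row` and `table` are made up for this illustration), a 1-D array can also be stretched across every row of a 2-D array, as long as the lengths line up:

```python
import numpy as np

row = np.array([1, 2, 3])            # shape (3,)
table = np.array([[10, 20, 30],
                  [40, 50, 60]])     # shape (2, 3)

# A single number is stretched to match every element
print(row + 5)        # [6 7 8]

# A 1-D array is stretched across each row of a 2-D array
print(table + row)
# [[11 22 33]
#  [41 52 63]]
```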

Let's multiply each number in list 1 by 5

In [16]:
#Base Python
python_nums = []
for i in list_of_numbers1:
    new_num = i * 5
    python_nums.append(new_num)
python_nums

[485, 235, 100, 330, 205, 50, 55, 125, 155, 350]

In [17]:
#Using Numpy
numpy_nums = np_array_of_numbers1 * 5
numpy_nums

array([485, 235, 100, 330, 205,  50,  55, 125, 155, 350])

Now let's add each element of list 1 and list 2 and then divide the sum by each value of list 3

In [18]:
#Example of first index
(list_of_numbers1[0] + list_of_numbers2[0]) / list_of_numbers3[0]

187.0

In [20]:
#Let's do it for all of the values now
#Base Python
python_nums = []
for i in range(len(list_of_numbers1)):
    new_num = (list_of_numbers1[i] + list_of_numbers2[i]) / list_of_numbers3[i]
    python_nums.append(new_num)
python_nums

[187.0,
 4.354838709677419,
 1.0689655172413792,
 20.666666666666668,
 3.608695652173913,
 0.28205128205128205,
 1.9245283018867925,
 3.3076923076923075,
 1.0317460317460319,
 3.75]

In [21]:
#Using Numpy
numpy_nums = (np_array_of_numbers1 + np_array_of_numbers2) / np_array_of_numbers3
numpy_nums

array([ 187.        ,    4.35483871,    1.06896552,   20.66666667,
          3.60869565,    0.28205128,    1.9245283 ,    3.30769231,
          1.03174603,    3.75      ])

When doing **element-wise** math using arrays, it is critical that the arrays have the same shape (or shapes that Numpy can broadcast together, such as an array and a single number).

In [22]:
longer_array = np.array([random.randint(1,100) for _ in range(10)])
shorter_array = np.array([random.randint(1,100) for _ in range(5)])
longer_array / shorter_array #This doesn't work

ValueError: operands could not be broadcast together with shapes (10,) (5,) 
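A quick way to avoid this error is to check `.shape` before combining arrays. The sketch below uses fixed values instead of random ones so the result is predictable, and trimming the longer array is just one possible fix; whether it is the right one depends on what the data represents.

```python
import numpy as np

longer_array = np.arange(1, 11)     # shape (10,)
shorter_array = np.arange(1, 6)     # shape (5,)

print(longer_array.shape, shorter_array.shape)  # (10,) (5,)

# One option: only use the first 5 values of the longer array so the shapes match
ratio = longer_array[:len(shorter_array)] / shorter_array
print(ratio)  # [1. 1. 1. 1. 1.]
```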

Element-wise math can happen with almost any standard mathematical operator (or combination of operators). In fact, you can even apply functions to arrays, as long as the operations being performed are mathematical in nature (and not logical).

In [23]:
#Basic arithmetic operations and numpy math functions
def math_output(x):
    a = x + 5
    b = a / 2
    c = b ** 3
    d = np.sqrt(c) #Numpy has several built-in math functions
    return d

In [24]:
math_output(7)

14.696938456699069

In [25]:
#In base Python
answers = []
for i in list_of_numbers1:
    answer = math_output(i)
    answers.append(answer)
answers

[364.21284985568536,
 132.5745073534124,
 44.194173824159222,
 211.51566135868049,
 110.30412503619254,
 20.539595906443729,
 22.627416997969522,
 58.094750193111253,
 76.367532368147138,
 229.63966338592294]

In [26]:
math_output(np_array_of_numbers1)

array([ 364.21284986,  132.57450735,   44.19417382,  211.51566136,
        110.30412504,   20.53959591,   22.627417  ,   58.09475019,
         76.36753237,  229.63966339])

That worked great! Even cleaner than list comprehensions!

Unfortunately, there are limitations. As mentioned above, functions that rely on control flow (such as `if`/`else`) cannot be applied to a whole array this way.

In [27]:
def logical_output(x):
    if x % 2 == 0:
        return 'even'
    elif x % 2 == 1:
        return 'odd'
    else:
        return 'something else'

In [28]:
logical_output(4)

'even'

In [29]:
answers = []
for i in list_of_numbers1:
    answer = logical_output(i)
    answers.append(answer)
answers

['odd', 'odd', 'even', 'even', 'odd', 'even', 'odd', 'odd', 'odd', 'even']

In [30]:
logical_output(np_array_of_numbers1) #You cannot vectorize control flow

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
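One common workaround, not covered in the cells above, is `np.where`, which picks between two values element by element based on a condition. This is a minimal sketch using the first five numbers from our array; it only handles the two-way even/odd case, not the full `logical_output` function.

```python
import numpy as np

numbers = np.array([97, 47, 20, 66, 41])

# np.where(condition, value_if_true, value_if_false) makes the choice element by element
labels = np.where(numbers % 2 == 0, 'even', 'odd')
print(labels)  # ['odd' 'odd' 'even' 'even' 'odd']
```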

#### Numpy Filtering

Filters are arrays consisting of True and False values. You can obtain these by applying in-line logic comparisons.

In [31]:
np_array_of_numbers1

array([97, 47, 20, 66, 41, 10, 11, 25, 31, 70])

In [32]:
#Let's find the even numbers in our array
is_even = np_array_of_numbers1 % 2 == 0
is_even

array([False, False,  True,  True, False,  True, False, False, False,  True], dtype=bool)

Now, much like how indexing in an array returns a subset of values, we can use this filter to only return the values that are True.

In [33]:
np_array_of_evens1 = np_array_of_numbers1[is_even]

In [34]:
np_array_of_evens1

array([20, 66, 10, 70])

In [35]:
#Let's now find the numbers divisible by 3 in our array
divis_by_3_filter = np_array_of_numbers1 % 3 == 0
nums_divisible_by_3 = np_array_of_numbers1[divis_by_3_filter]
nums_divisible_by_3

array([66])

In [36]:
#We can also do this without creating a filter, but rather writing the logic within the brackets
np_array_of_numbers1[np_array_of_numbers1 % 3 == 0]

array([66])

Filtering will be an extremely important concept as we move forward.
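Filters can also be combined. Here is a short sketch using the same ten numbers as above (the names `numbers` and `is_large` are made up for this illustration): `&` means "and", `|` means "or", and `~` means "not".

```python
import numpy as np

numbers = np.array([97, 47, 20, 66, 41, 10, 11, 25, 31, 70])

is_even = numbers % 2 == 0
is_large = numbers > 50

print(numbers[is_even & is_large])   # even AND greater than 50 -> [66 70]
print(numbers[is_even | is_large])   # even OR greater than 50  -> [97 20 66 10 70]
print(numbers[~is_even])             # NOT even                 -> [97 47 41 11 25 31]
```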

We can also use these filters on other arrays of the same shape, assuming the indices of the array are all associated with the same entity (e.g. the same student for the example below).

Imagine we're on the school board and want to identify all of the honors students. We want to start by finding everyone with a GPA 3.7 or higher. When we did this previously, it required a complicated loop. Now, with Numpy, we can use simple filtering to identify those students.

In [37]:
import json
with open('students.json', 'r') as f: #The "with" block closes the file for us
    students = json.load(f)
ids = np.array([i['student_id'] for i in students])
gpas = np.array([i['gpa'] for i in students])

In [38]:
#Let's find the students with GPAs of 3.7 or higher
high_gpa_filter = gpas >= 3.7
honor_ids = ids[high_gpa_filter]

print('There are {} students eligible for honor roll. They have the following IDs'.format(len(honor_ids)))
honor_ids

There are 2166 students eligible for honor roll. They have the following IDs


array(['6081f91d-365c-46ce-ad1b-38af120781d9',
       '84cec8f4-0b64-44ce-a628-c0eb73f6ca6f',
       '8dd7650e-55e0-416e-9f7e-44ead1dc08a2', ...,
       '32606dc2-862b-45cc-b0ac-f2b24253abdf',
       '3f1f6525-3ec0-4184-b435-c829419bf582',
       'bc551659-ba48-447e-aa6a-0c2f49aaa9c1'],
      dtype='<U36')

In [39]:
#If we wanted to know all of the students that were economics majors, we could create another filter
majors = np.array([i['major'] for i in students])
is_economics = majors == 'Economics'
is_economics

array([ True,  True, False, ..., False, False, False], dtype=bool)

In [40]:
economics_ids = ids[is_economics]
print('The following students are economics majors')
print(economics_ids)

The following students are economics majors
['5a397209-3782-4764-a285-10fae807ee71'
 'e26c3d69-3c74-49b6-81d7-47232787fad9'
 '1846c044-9a87-49e4-ad4f-d1bfadb4e41b' ...,
 '32606dc2-862b-45cc-b0ac-f2b24253abdf'
 '8dc612f4-8150-4045-9e2d-cf160fb71da4'
 'bc551659-ba48-447e-aa6a-0c2f49aaa9c1']


##### Limits on usage of arrays
There is one significant way in which arrays are less useful than lists, and that is in their ability to hold different variable types.

In [41]:
mixed_list = ['3', 1, 2.0]
mixed_list

['3', 1, 2.0]

In [42]:
mixed_array = np.array(mixed_list) #No error, but be wary of what you wish for

In [43]:
mixed_array

array(['3', '1', '2.0'],
      dtype='<U3')

In this case, Numpy is "helping" you by auto-casting all of the values into a common data type (here, strings). Unfortunately, this is not always something you want and you have to be mindful of when it happens.
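A simple habit that catches this is to check the array's `dtype` and convert explicitly when needed. The sketch below assumes every value really is numeric; `astype(float)` raises an error if one of the strings cannot be parsed as a number.

```python
import numpy as np

mixed_array = np.array(['3', 1, 2.0])
print(mixed_array.dtype.kind)       # 'U' means Unicode strings, not numbers

# Convert explicitly when you know every value is numeric
as_floats = mixed_array.astype(float)
print(as_floats)                    # [3. 1. 2.]
print(as_floats.sum())              # 6.0
```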

##### Numpy built-in array methods

In [44]:
gpas.mean() #Compute the arithmetic mean

3.3765260000000001

In [45]:
gpas.max() #Find the maximum value

4.0

In [46]:
gpas.min() #Find the minimum value

1.72

In [47]:
gpas.std() #Find the standard deviation

0.37002301458693077

In [48]:
gpas.argmax() #Find the FIRST index where the max value occurs

4

In [49]:
#Alternatively, you can do this to find all of the IDs with a max value
max_gpa = gpas.max()
max_gpa_ids = ids[gpas == max_gpa]
max_gpa_ids

array(['84cec8f4-0b64-44ce-a628-c0eb73f6ca6f',
       '1418830a-ca98-4b96-96c4-2b06a51ea306',
       'e0d1dcc2-d48d-481e-ac3d-2a3cacf11787',
       'd3170950-f20b-4121-bd44-2d0595dd46c1',
       '2900cea4-e471-4cb1-9c81-0248b3d7c6e7',
       '634c6900-4a15-495e-93f2-a46296706be8',
       '5aa1f9cd-1313-4443-abce-e74df66ae16a',
       '05cebb19-39bc-47d1-9ce8-95b5b3a82f9a',
       '3c468120-f16a-4aa0-87a7-62c7678d1016',
       '13a5fabf-787b-4f40-8609-08366ebae350',
       'cd424190-fc3b-45e0-a439-c0171bc12634',
       'c7bdcbce-737c-4a10-b040-380feddd9bb6',
       '2cea7bd0-5730-4046-b8bd-5b867a9a81ca',
       '177d8fcb-363b-42bb-9823-0aa657f4e201',
       'e0184616-f3be-4028-b2cf-a7a47e8e175e',
       '59ddb994-4270-4a6a-9a59-05157e95ab9c',
       'dc5987a2-ab7a-4030-9180-83058c2a01ce',
       '0908951a-47c9-428c-92f5-5bb78ab8dfe5',
       '75fb1c66-2d88-4b3a-885c-f00925271e03',
       '78008297-6fdf-40a5-93eb-f7783d14e41d',
       'f404b1e1-0e62-47a6-88b2-eed3478badac',
       '1d89f

What percent of students have a 4.0?

There are two ways.

In [50]:
len(max_gpa_ids) / len(students)

0.0654

In [51]:
(gpas == max_gpa).mean() #True's can be autocast as 1.0 and False's can be autocast as 0.0

0.0654

In [53]:
#Let's take the majors numpy array from earlier to create a filter, then compute the percent that are True
#Here we are trying to compute the percentage of all students that are in a particular major
for major in set(majors):
    print(major, (majors == major).mean())

Economics 0.1973
Physics 0.1515
Math 0.073
Engineering 0.2034
Chemistry 0.1507
Finance 0.2241
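The same shares can be computed without a loop over filters. This is a sketch using `np.unique` with `return_counts=True` on a tiny made-up array (`majors_sample` is not the real student data):

```python
import numpy as np

majors_sample = np.array(['Math', 'Physics', 'Math', 'Economics', 'Math'])  # made-up data

# np.unique returns the sorted distinct values and, optionally, how often each occurs
names, counts = np.unique(majors_sample, return_counts=True)
shares = counts / len(majors_sample)

for name, share in zip(names, shares):
    print(name, share)
# Economics 0.2
# Math 0.6
# Physics 0.2
```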


##### Numpy built-in scientific helper Objects

Numpy has several built-in Objects that help with your analyses. There are WAY too many to cover, but here are a few that you may find yourself using.

##### Native Math functions

In [54]:
#Square root
np.sqrt(25)

5.0

In [55]:
#Base e logarithm
np.log(20)

2.9957322735539909

In [56]:
#Base 10 logarithm
np.log10(20)

1.3010299956639813

In [57]:
#Base n logarithm
n = 10
np.log(20) / np.log(n)

1.301029995663981

In [60]:
#Let's create a list of the squares for the first 9 positive integers
#Below we are creating a list comprehension of the squares
squares_list = [i**2 for i in range(1,10)]
squares = np.array(squares_list)
squares

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

In [63]:
#Numpy has a function similar to Base Python for creating a range of numbers - it's called "arange"
first_nine = np.arange(1,10)
#Now we can apply the square without a for-loop by using vectorization
squares = first_nine ** 2

In [64]:
#As we showed before, we can vectorize mathematical operations
np.sqrt(squares)

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [65]:
np.log(squares)

array([ 0.        ,  1.38629436,  2.19722458,  2.77258872,  3.21887582,
        3.58351894,  3.8918203 ,  4.15888308,  4.39444915])

##### Random Number Generator and Sampler

Numpy also has tremendous random number generator capabilities.

In [66]:
#Create an array of 10 random integers between 1 and 99 (unlike random.randint, the upper bound of 100 is NOT included)
np.random.randint(1,100,size=10)

array([39,  8, 23, 15, 70, 64, 65, 51, 79, 51])

In [67]:
#Create an array of 10 random float values between 0 and 1 (0 is included, 1 is not)
np.random.random(size=10)

array([ 0.71241263,  0.15653471,  0.77155511,  0.77545275,  0.28476708,
        0.8201949 ,  0.97477733,  0.8480717 ,  0.51428949,  0.12137832])

In [68]:
#np.random.random only draws values between 0 and 1, but we can easily scale them to 0-10 using vectorization
#(np.random.uniform(0, 10, size=10) would also do this directly)
np.random.random(size=10) * 10

array([ 6.44957505,  8.22368032,  2.168686  ,  6.44282975,  4.01168933,
        7.98599846,  2.10576607,  0.71877618,  2.10998938,  9.86781957])

In [69]:
#Create an array of 10 random float values pulled from a normal distribution of mean 1 and stdev of 0.5
np.random.normal(1,.5, size=10)

array([ 1.38937493,  1.53491008,  0.47200742,  0.83805383,  1.79577735,
        0.93510326,  1.69620974,  1.61419592,  1.00023177,  0.54385163])

In [71]:
#As stated above, we can create an array of numbers between some min value and some max value
lots_of_numbers = np.arange(1,10000) #Equivalent of native range() function
lots_of_numbers[:20]

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])

In [72]:
#We can pull a random sample from an array (or list)
#Here we are taking 10 random numbers from our array of numbers between 1 and 9999
random_sample = np.random.choice(lots_of_numbers, size=10)
random_sample

array([8495,  416, 6657,  690, 1702, 4639, 6129, 2780, 3773, 3095])
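One detail worth knowing: by default `np.random.choice` samples with replacement, so the same value can be drawn more than once. A minimal sketch of the `replace=False` option, which guarantees distinct values (the names `with_repeats` and `no_repeats` are made up for this illustration):

```python
import numpy as np

lots_of_numbers = np.arange(1, 10000)

# Default behavior samples WITH replacement, so repeats are possible
with_repeats = np.random.choice(lots_of_numbers, size=10)

# replace=False guarantees 10 distinct values
no_repeats = np.random.choice(lots_of_numbers, size=10, replace=False)
print(len(set(no_repeats)))  # 10
```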

In [86]:
#Here we are randomly choosing 1000 majors amongst Economics, Physics, and Math with a defined distribution
#Given the probabilities defined, the expected count for the majors is 200 Economics, 500 Physics, and 300 Math
majors = ['Economics', 'Physics', 'Math']
lots_of_majors = np.random.choice(majors, p=[.2,.5,.3], size=1000)
lots_of_majors

array(['Physics', 'Math', 'Physics', 'Physics', 'Math', 'Physics', 'Math',
       'Physics', 'Physics', 'Physics', 'Physics', 'Physics', 'Math',
       'Physics', 'Physics', 'Physics', 'Physics', 'Physics', 'Physics',
       'Physics', 'Physics', 'Physics', 'Physics', 'Physics', 'Economics',
       'Economics', 'Physics', 'Physics', 'Physics', 'Physics', 'Physics',
       'Physics', 'Physics', 'Math', 'Physics', 'Economics', 'Physics',
       'Math', 'Physics', 'Physics', 'Physics', 'Math', 'Math',
       'Economics', 'Physics', 'Physics', 'Economics', 'Physics',
       'Physics', 'Economics', 'Physics', 'Math', 'Physics', 'Math',
       'Economics', 'Economics', 'Math', 'Physics', 'Math', 'Physics',
       'Physics', 'Physics', 'Economics', 'Math', 'Economics', 'Economics',
       'Physics', 'Physics', 'Physics', 'Economics', 'Physics', 'Physics',
       'Economics', 'Physics', 'Physics', 'Physics', 'Physics', 'Physics',
       'Physics', 'Physics', 'Math', 'Economics', 'Physics', 'Ma

In [87]:
#How many of each major did we actually get?
for major in majors:
    print(major, (lots_of_majors == major).sum())

Economics 194
Physics 504
Math 302
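Because these draws are random, your counts will differ from the ones shown each time the cell runs. If you need repeatable results, one option is to set a seed first. In this sketch the seed value 631 is an arbitrary choice:

```python
import numpy as np

np.random.seed(631)                       # 631 is an arbitrary seed chosen for this sketch
first_draw = np.random.randint(1, 100, size=5)

np.random.seed(631)                       # resetting the seed replays the same sequence
second_draw = np.random.randint(1, 100, size=5)

print((first_draw == second_draw).all())  # True
```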


### Intro to Pandas
Pandas is going to be your best friend for the rest of your time analyzing data in Python. It offers much of the functionality of Numpy, but with a MUCH more user-friendly way of displaying and accessing data.

The two fundamental data structures of Pandas are the Series and the DataFrame. A Series can be viewed as a Numpy array with a labeled index, while a DataFrame can be viewed as a collection of Series that share the same index, with some additional functionality.
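To make that relationship concrete, here is a small sketch that builds both structures by hand. The values are the first three students from the data used in this lecture, and the names `gpa_series` and `toy_df` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A Series wraps a Numpy array and adds an index and a name
gpa_series = pd.Series([3.12, 3.48, 3.4], name='gpa')
print(isinstance(gpa_series.values, np.ndarray))   # True

# A DataFrame can be built from a dictionary of columns
toy_df = pd.DataFrame({
    'first': ['Janis', 'Timothy', 'Elizabeth'],
    'gpa': [3.12, 3.48, 3.4],
})
print(isinstance(toy_df['gpa'], pd.Series))        # True: each column is a Series
print(toy_df.shape)                                # (3, 2)
```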

Let's start by opening a file using Pandas' built-in `read_csv` function.

In [88]:
import pandas as pd
students_df = pd.read_csv('students.csv') #This is way easier than our method from before!

In [89]:
#Let's look at the first few records of the students DataFrame
students_df.head(10) #The "head" method will give you as many rows as you ask for; 5 is default

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |
| 5 | 6c849c3e-e640-4bba-a86a-4323fd513b90 | Alphonse | Allen | Male | Freshman | Engineering | 2.99 |
| 6 | a5c87c39-447c-4c29-92af-fa702a8d5595 | Linwood | Coleman | Male | Freshman | Engineering | 3.58 |
| 7 | f6b177e8-e00a-480e-b62e-906c2ad80f85 | Arthur | Mccolpin | Male | Junior | Math | 3.04 |
| 8 | 8387594f-c9b2-4daa-ae93-c3e40f58cb26 | Daniel | Carter | Male | Junior | Chemistry | 2.41 |
| 9 | 156aefe7-73b4-4777-929c-7aa9c0cd35c5 | Barry | Reyes | Male | Junior | Finance | 3.0 |


#### Intro to the DataFrame

DataFrames consist of rows and columns. These are the fundamental "units" of data for a Data Scientist. This is what allows us to organize our data to analyze quickly and consistently. Rows are considered individual "observations" whereas columns are considered "attributes." The "shape" of a DataFrame is defined as the number of rows by the number of columns.

In the DataFrame above, you can see that the attributes identified are:
- student_id
- first
- last
- gender
- class
- major
- gpa

Each student's information can be found in a row. Thus, the first row (index 0) is Janis Brown, who is a Female and a Junior Economics student with a GPA of 3.12.

This is MUCH easier to read and track than lists of dictionaries or dictionaries of dictionaries or whatever other combination of data structures that base Python might be able to display for you. You're probably asking why ALL data isn't shown this way, and we'll get to that later on, but in the meantime, let's focus on the simple case here.

Let's start by looking at the shape of the data.

In [90]:
students_df.shape

(10000, 7)

This means that the data has 10,000 rows and 7 columns. The first things you should always do when using DataFrames are to look at the first few rows (the `.head()` method) and the shape of the data.

### Accessing chunks of data in a DataFrame

You will almost always want to focus your attention or compute values on certain subsections of the data. For this there are several ways to slice and dice the data.

##### Accessing columns

Much like dictionaries, columns are accessed via brackets and some column name (just like keys).

In [91]:
first_names = students_df['first'] #Returns a Pandas "Series" object
first_names

0           Janis
1         Timothy
2       Elizabeth
3          Edward
4            Lisa
5        Alphonse
6         Linwood
7          Arthur
8          Daniel
9           Barry
10      Maryellen
11          Jacob
12           Lisa
13            Ida
14           Lola
15           Joan
16          Larry
17      Bridgette
18          James
19          David
20         Ronald
21        Heather
22          Nancy
23          David
24          Jesus
25        Richard
26          Henry
27          Pedro
28         Israel
29           Mary
          ...    
9970     Georgina
9971      Lynette
9972       Vickie
9973       Willie
9974       Sylvia
9975        Derek
9976      Charles
9977     Jonathan
9978         Dave
9979        Ralph
9980         Leta
9981        Megan
9982       Olivia
9983         Anne
9984         John
9985         John
9986       Shelly
9987       Roland
9988     Margaret
9989        Alice
9990        Mable
9991       Joseph
9992       Gladys
9993      Douglas
9994      

Unlike dictionaries, you can access multiple columns at the same time.

In [92]:
full_names = students_df[['first', 'last']]
full_names

| | first | last |
|---|---|---|
| 0 | Janis | Brown |
| 1 | Timothy | Bishop |
| 2 | Elizabeth | Owens |
| 3 | Edward | Pearson |
| 4 | Lisa | Gonzalez |
| 5 | Alphonse | Allen |
| 6 | Linwood | Coleman |
| 7 | Arthur | Mccolpin |
| 8 | Daniel | Carter |
| 9 | Barry | Reyes |


##### Accessing rows

Much like numpy arrays, rows can be accessed by position, using `.iloc` with an index or a slice.

In [93]:
first_5_students = students_df.iloc[:5]
first_5_students

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |


In [94]:
second_5_students = students_df.iloc[5:10] #Rows at positions 5 through 9 (the end of the slice is not included)
second_5_students

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 5 | 6c849c3e-e640-4bba-a86a-4323fd513b90 | Alphonse | Allen | Male | Freshman | Engineering | 2.99 |
| 6 | a5c87c39-447c-4c29-92af-fa702a8d5595 | Linwood | Coleman | Male | Freshman | Engineering | 3.58 |
| 7 | f6b177e8-e00a-480e-b62e-906c2ad80f85 | Arthur | Mccolpin | Male | Junior | Math | 3.04 |
| 8 | 8387594f-c9b2-4daa-ae93-c3e40f58cb26 | Daniel | Carter | Male | Junior | Chemistry | 2.41 |
| 9 | 156aefe7-73b4-4777-929c-7aa9c0cd35c5 | Barry | Reyes | Male | Junior | Finance | 3.0 |
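Pandas also has `.loc`, which selects rows by index label instead of position. With the default 0, 1, 2, ... index the two look alike, so this sketch uses a made-up DataFrame whose labels are deliberately different from the positions:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'first': ['Janis', 'Timothy', 'Elizabeth', 'Edward']},
    index=[10, 20, 30, 40],   # deliberately not 0, 1, 2, 3
)

print(toy_df.iloc[0]['first'])    # position 0 -> Janis
print(toy_df.loc[10]['first'])    # label 10   -> Janis
print(len(toy_df.iloc[1:3]))      # positions 1 and 2, end NOT included -> 2
print(len(toy_df.loc[20:40]))     # labels 20 through 40, end INCLUDED  -> 3
```

Note the difference in slicing: `.iloc` follows the usual Python rule of excluding the end of the slice, while `.loc` includes it.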


Also like numpy arrays, you can filter! This is going to be life changing!

In [97]:
gender = students_df['gender']
female_filt = gender =='Female'
female_students_df = students_df[female_filt]
female_students_df.head(10)

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 |
| 10 | 8dd7650e-55e0-416e-9f7e-44ead1dc08a2 | Maryellen | Pack | Female | Junior | Finance | 3.83 |
| 12 | 1846c044-9a87-49e4-ad4f-d1bfadb4e41b | Lisa | Walden | Female | Senior | Economics | 3.4 |
| 13 | 21cdf664-d905-44ae-a480-81b687aefa67 | Ida | Garcia | Female | Freshman | Chemistry | 3.54 |
| 14 | 15714049-2a88-4dbc-8d1a-dde0095a3acf | Lola | Bailey | Female | Senior | Physics | 3.28 |
| 15 | 35de9214-505e-4d55-80e6-0302098a44b6 | Joan | Decoteau | Female | Senior | Economics | 3.81 |
| 17 | 4cb6a2fb-7f67-45ca-abc6-7252aa99d9f0 | Bridgette | Vitucci | Female | Sophomore | Chemistry | 3.1 |
| 21 | 8eaad117-5777-4ea5-bb6d-663865eeb196 | Heather | Manning | Female | Sophomore | Engineering | 2.87 |


Let's take a step-by-step approach to figuring out that filter.

In [100]:
#Find out which rows are females
female_filt = students_df['gender']=='Female' #female_filt will be a Series object of True and False values
female_filt.head()

0     True
1    False
2     True
3    False
4     True
Name: gender, dtype: bool

In [101]:
#Now apply that filter to the DataFrame
females_df = students_df[female_filt]

In [102]:
#How many students are there?
females_df.shape[0]

4921

In [106]:
#Now let's find out which rows are females AND are studying Economics
econ_filt = students_df['major'] == 'Economics'
combined_filt = econ_filt & female_filt
combined_filt.head()

0     True
1    False
2    False
3    False
4    False
dtype: bool

In [107]:
#Now apply that double filter to the DataFrame
female_econ_students = students_df[combined_filt]
female_econ_students.head(10)

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 |
| 12 | 1846c044-9a87-49e4-ad4f-d1bfadb4e41b | Lisa | Walden | Female | Senior | Economics | 3.4 |
| 15 | 35de9214-505e-4d55-80e6-0302098a44b6 | Joan | Decoteau | Female | Senior | Economics | 3.81 |
| 43 | 667a8999-3765-4a83-bc64-612f64a2011e | Tosha | Flanagan | Female | Junior | Economics | 3.92 |
| 49 | ee1ce005-5671-431b-b832-61452d15ad76 | Yvonne | Delo | Female | Sophomore | Economics | 2.87 |
| 54 | ed1f11d4-cf19-4792-83dd-0f9fea5a4692 | Karla | Cota | Female | Junior | Economics | 3.12 |
| 60 | 52856cb8-4ee2-438b-b645-1e67eee3d3c0 | Elizabeth | Ali | Female | Junior | Economics | 3.32 |
| 65 | 0413e874-519a-4a8a-a7c8-4350923519fa | Dolores | Henderson | Female | Sophomore | Economics | 3.31 |
| 67 | b7fd1a05-80b7-4fa4-8eef-db1851ad2a4f | Alma | Hines | Female | Junior | Economics | 3.77 |
| 79 | 9af34895-3ba0-407f-97ca-50d9ece1e3ed | Buffy | Wirth | Female | Senior | Economics | 3.43 |


In [109]:
#How many students are there?
female_econ_students.shape[0] #Shape returns a tuple with row and column counts

1500

In [110]:
#Now let's find out which rows are females and are studying Engineering and are on probation
eng_filt = students_df['major'] == 'Engineering'
probation_filt = students_df['gpa'] < 3
female_eng_probation_filt = female_filt & eng_filt & probation_filt

In [112]:
#Now apply that triple filter to the DataFrame
female_eng_probation_df = students_df[female_eng_probation_filt]
female_eng_probation_df.head(10)

| | student_id | first | last | gender | class | major | gpa |
|---|---|---|---|---|---|---|---|
| 21 | 8eaad117-5777-4ea5-bb6d-663865eeb196 | Heather | Manning | Female | Sophomore | Engineering | 2.87 |
| 87 | 2b99d648-6619-405f-91f1-4c0e4dec489d | Meghan | Gilman | Female | Junior | Engineering | 2.5 |
| 185 | 4bb0282b-d52b-4e8f-9cd5-4dbd0b4fe65f | Jani | Turman | Female | Sophomore | Engineering | 2.93 |
| 254 | 31d08966-445e-4ea1-aa95-4d31e345b77e | Frances | Mcdaniel | Female | Freshman | Engineering | 2.84 |
| 441 | 712d90a2-e0ac-466c-ab8f-e300da831920 | Lauren | Williams | Female | Freshman | Engineering | 2.94 |
| 576 | e27dc2f9-bc8f-40a7-b072-063062aad2da | Denise | Kehoe | Female | Sophomore | Engineering | 2.74 |
| 580 | acacbd17-edea-4eeb-8e86-f83a6ebc6f5b | Candace | Ritter | Female | Freshman | Engineering | 2.75 |
| 583 | 9e99c961-c090-433b-b31f-66f378988ee0 | Amanda | Medlock | Female | Sophomore | Engineering | 2.73 |
| 585 | b79fc028-b637-4354-b093-8b89d65e6211 | Denise | Peoples | Female | Sophomore | Engineering | 2.57 |
| 665 | 9b2af973-90f0-4ab1-97f3-7cda0df18f4f | Natasha | Rojas | Female | Sophomore | Engineering | 2.93 |


In [113]:
#How many students are there?
female_eng_probation_df.shape[0]

247
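The filters above were built one variable at a time, but they can also be written inline. This sketch uses a tiny made-up DataFrame (not the real student data) and shows two things that are easy to trip over: each comparison needs its own parentheses, and `isin()` is a handy shortcut for "matches any of these values".

```python
import pandas as pd

toy_df = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Female'],
    'major': ['Economics', 'Economics', 'Finance', 'Math'],
    'gpa': [3.12, 3.48, 2.90, 3.84],
})  # made-up data for illustration

# Each comparison needs its own parentheses because & binds tighter than == and <
female_econ = toy_df[(toy_df['gender'] == 'Female') & (toy_df['major'] == 'Economics')]
print(len(female_econ))   # 1

# isin() checks membership in a list of values; ~ negates a filter
econ_or_finance = toy_df[toy_df['major'].isin(['Economics', 'Finance'])]
not_on_probation = toy_df[~(toy_df['gpa'] < 3)]
print(len(econ_or_finance), len(not_on_probation))   # 3 3
```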

Now let's compute the percentage of students on probation by major. First we need to add a new column of booleans that indicates whether a student is on probation or not.

In [117]:
#Create boolean column for probation status
on_probation_filt = students_df['gpa'] < 3
students_df['on_probation'] = on_probation_filt
students_df.head()

| | student_id | first | last | gender | class | major | gpa | on_probation |
|---|---|---|---|---|---|---|---|---|
| 0 | 5a397209-3782-4764-a285-10fae807ee71 | Janis | Brown | Female | Junior | Economics | 3.12 | False |
| 1 | e26c3d69-3c74-49b6-81d7-47232787fad9 | Timothy | Bishop | Male | Sophomore | Economics | 3.48 | False |
| 2 | 975c1581-5ba2-430c-a3d1-01ce03bd83f9 | Elizabeth | Owens | Female | Freshman | Finance | 3.4 | False |
| 3 | 6081f91d-365c-46ce-ad1b-38af120781d9 | Edward | Pearson | Male | Freshman | Math | 3.84 | False |
| 4 | 84cec8f4-0b64-44ce-a628-c0eb73f6ca6f | Lisa | Gonzalez | Female | Junior | Finance | 4.0 | False |


In [116]:
#As we did earlier with the percentage of students with 4.0 GPAs, we can use the mean of True values to compute the share of students on probation in each major
students_df.groupby('major')[['on_probation']].mean()

| major | on_probation |
|---|---|
| Chemistry | 0.132714 |
| Economics | 0.075013 |
| Engineering | 0.35939 |
| Finance | 0.048193 |
| Math | 0.180822 |
| Physics | 0.182178 |


That's WAY easier than how we used to do things with dictionaries!!! We will go more into grouping data next week :)
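If you want to experiment with the probation calculation before then, here is a self-contained sketch on a tiny made-up DataFrame (the majors and GPAs below are invented so the result is easy to check by hand):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'major': ['Math', 'Math', 'Math', 'Math',
              'Finance', 'Finance', 'Finance', 'Finance'],
    'gpa': [2.5, 3.5, 2.9, 3.1, 3.9, 3.2, 2.8, 3.6],
})  # made-up data for illustration
toy_df['on_probation'] = toy_df['gpa'] < 3

# The mean of a boolean column is the fraction of True values in each group
probation_rate = toy_df.groupby('major')['on_probation'].mean()
print(probation_rate['Math'])      # 2 of 4 Math students    -> 0.5
print(probation_rate['Finance'])   # 1 of 4 Finance students -> 0.25
```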