Introduction to NumPy: Numerical Python

Sarah records her second-grade class's grades in an online spreadsheet. Her web browser records that she visited that spreadsheet, in addition to every other site she's visited. Those sites record her location, the time she spent on them, and where she visited next. The world is chock-full of all sorts of different datasets, and learning how to create, analyze, and manipulate these datasets can give us some insight and control over our digital surroundings.

In this lesson, we'll be constructing and manipulating single-variable datasets. One way to think of a single-variable dataset is that it contains answers to a question. For instance, we might ask 100 people, “How tall are you?” Their heights in inches would form our dataset.

To work with our datasets, we'll be using a powerful Python module known as NumPy, which stands for Numerical Python.

NumPy has many uses including:

    Efficiently working with many numbers at once
    Generating random numbers
    Performing many different numerical functions (e.g., calculating sin, cos, tan, mean, and median)
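
As a rough sketch of the three uses listed above, the snippet below works on a small made-up dataset. The heights, the variable names, and the seed value are assumptions chosen for illustration; they do not come from the lesson.

```python
import numpy as np

# Hypothetical sample: heights in inches for five people.
heights_in = np.array([64.0, 70.5, 68.0, 72.0, 61.5])

# 1. Work with many numbers at once: convert every height to centimeters.
heights_cm = heights_in * 2.54

# 2. Generate random numbers (seeded so repeated runs give the same values).
rng = np.random.default_rng(seed=0)
random_values = rng.random(3)  # three floats in the interval [0, 1)

# 3. Apply numerical functions.
mean_height = np.mean(heights_in)
median_height = np.median(heights_in)
sine_values = np.sin(np.array([0.0, np.pi / 2]))

print(heights_cm)
print(random_values)
print(mean_height, median_height)
print(sine_values)
```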

In the following exercises, we'll learn how to construct one- and two-dimensional arrays and perform basic array operations.



Working with NumPy

NumPy is great at storing and manipulating numerical data in arrays.

Let's take a look at an example. Twice Charred is a (mostly) fictional movie review site where four good friends and movie reviewers, Lorie, Marty, Tori, and Kurtz, watch movies and give them ratings on a scale of 0 to 100.


In [1]:
# Before we do anything, we need to import NumPy
import numpy as np

# When the gang rates a movie, we can store their ratings in a NumPy array movie_ratings:
movie_ratings = np.array([63.0, 54.0, 70.0, 50.0])

# But they see more than one movie, so we have to create a 2-dimensional array 
# where each row is their ratings for a specific movie.
movie_ratings = np.array([[63.0, 54.0, 70.0, 50.0],
                          [94.0, 85.0, 89.0, 95.0],
                          [64.0, 90.0, 73.0, 85.0]])


# Some fans prefer to have the movies rated on a five star scale, 
# so we can use NumPy to easily divide each element by 20.
movie_ratings_stars = movie_ratings / 20


# Now let's say the ratings are always in the same order (Lorie, Marty, Tori, Kurtz).
# If we wanted to create an array that only had Tori's ratings,
# we could select that column from our movie_ratings array.
tori_ratings = movie_ratings[:, 2]
tori_ratings

# Now, say we find that we have very similar taste to Marty,
# so we only want to see movies that he rates highly.
# We can use logic to select those movies.
# Let's select all of Marty's ratings that are over 80:
marty_ratings = movie_ratings[:, 1]
marty_ratings[marty_ratings > 80]



array([85., 90.])

In [3]:
# Create a NumPy array from a list
test_1 = np.array([92, 94, 88, 91, 87])

In [4]:
# Create a NumPy array from a CSV file
test_2 = np.genfromtxt('test_2.csv', delimiter=',')

OSError: test_2.csv not found.
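
The OSError above means that test_2.csv was not in the working directory when the cell ran. One way to try np.genfromtxt without a file on disk is to pass it a file-like object instead. In the sketch below, csv_text is a hypothetical stand-in for the file's contents; it reuses the test_2 scores that appear later in this lesson.

```python
import io

import numpy as np

# Hypothetical CSV contents standing in for test_2.csv.
csv_text = "79,100,86,93,91\n"

# np.genfromtxt accepts a file-like object as well as a filename.
test_2 = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(test_2)
print(test_2.dtype)  # genfromtxt parses numbers as floats by default
```

With a real file, the call stays the same as in the cell above; only the first argument changes back to the file's path.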

In [5]:
import numpy as np

test_1 = np.array([92, 94, 88, 91, 87])
test_2 = np.array([79, 100, 86, 93, 91])
test_3 = np.array([87, 85, 72, 90, 92])

test_3_fixed = test_3 + 2
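
The cell above adds 2 to every score in test_3 with a single expression. The sketch below prints that result and then shows that arrays of the same length can also be combined with each other element by element. The names total_grade and final_grade are assumptions introduced here for illustration.

```python
import numpy as np

test_1 = np.array([92, 94, 88, 91, 87])
test_2 = np.array([79, 100, 86, 93, 91])
test_3 = np.array([87, 85, 72, 90, 92])

# Adding a scalar applies the addition to every element.
test_3_fixed = test_3 + 2
print(test_3_fixed)  # [89 87 74 92 94]

# Arrays of the same shape are added position by position.
total_grade = test_1 + test_2 + test_3_fixed
print(total_grade)  # [260 281 248 276 272]

# Dividing by a scalar gives each student's average across the three tests.
final_grade = total_grade / 3
print(final_grade)
```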

Selecting Elements from a 2-D Array

Selecting elements from a 2-d array is very similar to selecting them from a 1-d array; we just have two indices to select from. The syntax for selecting from a 2-d array is a[row,column], where a is the array.

It's important to note that when we work with arrays that have more than one dimension, the relationship between the interior arrays is defined in terms of axes. A two-dimensional array has two axes: axis 0 runs vertically, through values that share the same index position within their interior arrays (that is, values in the same column), and axis 1 runs horizontally, through values that share an interior array (that is, values in the same row). This is illustrated below.

Diagram showing the axes in an array
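
One way to see the two axes in action is to sum along each of them. This is a minimal sketch using a small made-up array; the name grid and its values are assumptions, not part of the lesson.

```python
import numpy as np

# Hypothetical 2-d array with 2 rows and 3 columns.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)  # (2, 3): 2 rows, 3 columns

# Summing along axis 0 moves down the rows, giving one total per column.
column_sums = grid.sum(axis=0)
print(column_sums)  # [5 7 9]

# Summing along axis 1 moves across the columns, giving one total per row.
row_sums = grid.sum(axis=1)
print(row_sums)  # [ 6 15]
```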

Consider the array

a = np.array([[32, 15, 6, 9, 14], 
              [12, 10, 5, 23, 1],
              [2, 16, 13, 40, 37]])

We can select specific elements using their indices:

>>> a[2,1]
16

To select an entire column, we can use : as the row index:

# selects the first column
>>> a[:,0]
array([32, 12,  2])

The same works if we want to select an entire row:

# selects the second row
>>> a[1,:]
array([12, 10,  5, 23,  1])

We can further narrow it down and select a range from a specific row:

# selects the first three elements of the first row
>>> a[0,0:3]
array([32, 15,  6])
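
The same slicing syntax can be applied to both indices at once. The sketch below, which reuses the array a from above, selects a rectangular block and also uses a negative index to reach the last column. The variable names are assumptions introduced for illustration.

```python
import numpy as np

a = np.array([[32, 15, 6, 9, 14],
              [12, 10, 5, 23, 1],
              [2, 16, 13, 40, 37]])

# Rows 0 and 1, columns 1 and 2 (the end of each range is excluded).
block = a[0:2, 1:3]
print(block)
# [[15  6]
#  [10  5]]

# A negative index counts from the end: -1 is the last column.
last_column = a[:, -1]
print(last_column)  # [14  1 37]
```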



In [6]:
import numpy as np

student_scores = np.array([[92, 94, 88, 91, 87],
                           [79, 100, 86, 93, 91],
                           [87, 85, 72, 90, 92]])

tanya_test_3 = student_scores[2,0]
cody_test_scores = student_scores[:,4]


Logical Operations with Arrays

Another useful thing that arrays can do is perform element-wise logical operations. For instance, suppose we want to know how many elements in an array are greater than 5. We can easily write some code that checks whether this statement evaluates to True for each item in the array, without having to use a for loop:

>>> a = np.array([10, 2, 2, 4, 5, 3, 9, 8, 9, 7])
>>> a > 5
array([ True, False, False, False, False, False,  True,  True,  True,  True])
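
To answer the original question of how many elements are greater than 5, the True values in that boolean array can be counted. Here is a minimal sketch; the names mask, count_over_5, and count_via_sum are assumptions introduced for illustration.

```python
import numpy as np

a = np.array([10, 2, 2, 4, 5, 3, 9, 8, 9, 7])

# Boolean array: True wherever the element is greater than 5.
mask = a > 5

# Count the True values directly.
count_over_5 = np.count_nonzero(mask)
print(count_over_5)  # 5

# Summing also works, because True counts as 1 and False as 0.
count_via_sum = mask.sum()
print(count_via_sum)  # 5
```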

We can then use logical operators to evaluate and select items based on certain criteria. To select all elements from the previous array that are greater than 5, we'd write the following:

>>> a[a > 5]
array([10,  9,  8,  9,  7])

We can also combine logical statements to further specify our criteria. To do so, we place each statement in parentheses and join them with the element-wise operators & (and) and | (or).

In our example, we can use combined statements to find the elements that are greater than five or less than two:

>>> a[(a > 5) | (a < 2)]
array([10,  9,  8,  9,  7])
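
The parentheses matter because & and | bind more tightly than the comparison operators in Python. The sketch below first combines two conditions with &, then shows what happens when the parentheses are left out. The names between and error_message are assumptions introduced for illustration.

```python
import numpy as np

a = np.array([10, 2, 2, 4, 5, 3, 9, 8, 9, 7])

# With parentheses, each comparison produces a boolean array first,
# and & then combines the two arrays element by element.
between = a[(a > 2) & (a < 9)]
print(between)  # [4 5 3 8 7]

# Without parentheses, Python evaluates 5 | a first and then attempts a
# chained comparison on whole arrays, which raises a ValueError.
try:
    a[a > 5 | a < 2]
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```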



In [7]:
import numpy as np

porridge = np.array([79, 65, 50, 63, 56, 90, 85, 98, 79, 51])

cold = porridge[porridge<60]
hot = porridge[porridge>80]
just_right = porridge[(porridge < 80) & (porridge > 60)]
print(cold)
print(hot)
print(just_right)

[50 56 51]
[90 85 98]
[79 65 63 79]


Review

Let's take a second and review. In this lesson, you learned the basics of the NumPy package. Here are some key points:

    Arrays are list-like containers that store values of a single type in an organized, grid-like manner.
    An array can be created by either defining it directly using np.array() or by importing a CSV using np.genfromtxt('file.csv', delimiter=',').
    An operation (such as addition) can be performed on every element in an array by simply performing it on the array itself.
    Elements can be selected from arrays by index; in a 2-d array, both the row index and the column index start at 0.
    Logical operations can be used to create new, more focused arrays out of larger arrays.
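
As a brief recap of how arrays differ from ordinary Python lists, the sketch below applies the same operators to both. The variable names and the three sample scores are assumptions chosen for illustration.

```python
import numpy as np

scores_list = [92, 94, 88]
scores_array = np.array(scores_list)

# On a list, + concatenates and * repeats.
list_plus = scores_list + [2]
list_times = scores_list * 2
print(list_plus)   # [92, 94, 88, 2]
print(list_times)  # [92, 94, 88, 92, 94, 88]

# On an array, the same operators act on every element.
array_plus = scores_array + 2
array_times = scores_array * 2
print(array_plus)   # [94 96 90]
print(array_times)  # [184 188 176]
```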

The next lesson will explore how to analyze these arrays and use means, medians, and standard deviations to tell a story. But first, practice what you've learned by working through the following checkpoints.


In [8]:
import numpy as np

temperatures = np.genfromtxt('temperature_data.csv', delimiter=',')

print(temperatures)

temperatures_fixed = temperatures + 3

print(temperatures_fixed)

monday_temperatures = temperatures_fixed[0,:]

thursday_friday_morning = temperatures_fixed[3:5,1]

temperature_extremes = temperatures_fixed[(temperatures_fixed < 50) | (temperatures_fixed > 60)]

OSError: temperature_data.csv not found.