# CSE 101: Computer Science Principles
#### Stony Brook University
#### Kevin McDonnell (ktm@cs.stonybrook.edu)
## Module 17: Number-crunching with NumPy and pandas



### Arrays

NumPy (Numerical Python) is a popular Python library for doing numerical and scientific calculations. The basic data type in NumPy is the **ndarray**, which is a grid of numbers that can have 1, 2 or more *dimensions*.

Unlike lists, an array contains values of the same type. NumPy can perform calculations with arrays much faster than "generic" Python can process lists of numbers.
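To see the same-type rule in action, here is a small sketch (the variable names are made up for this illustration). When a list mixes integers and floats, NumPy converts every element to one common type, which it reports in the array's `dtype` attribute.

```python
import numpy as np

# A list can mix types freely; an array stores one type for all elements.
mixed_list = [1, 2.5, 3]
mixed_array = np.array(mixed_list)

print(mixed_array)        # the integers have been converted to floats
print(mixed_array.dtype)  # float64
```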

There are different ways of creating arrays in NumPy. Let's create a 1D array.

In [None]:
import numpy as np

a = np.array([10, 20, 30, 40, 50, 60])
a

array([10, 20, 30, 40, 50, 60])

We can access elements using `[]` notation, same as for lists.

In [None]:
a[3]

40

A 2D array is created in a similar fashion, using a list of lists of integers.

In [None]:
a = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120]])
a

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

Two indices are needed to access a given value.

In [None]:
a[1, 3]

80

In NumPy, `ndarray` is the underlying data type (a class) used to store the numbers. A **vector** is a 1D array, whereas a **matrix** is a 2D array. NumPy uses the word **axes** to mean "dimensions". So, a 2D array has 2 "axes" in NumPy parlance.

To get the number of dimensions of an array, we can use the `ndim` attribute of an `ndarray` object.

In [None]:
a.ndim

2

To get the size of each dimension, we can use the `shape` attribute of an `ndarray` object, which is a *tuple* with one entry per dimension.

In [None]:
a.shape

(3, 4)

`(3, 4)` means that the array has 3 rows and 4 columns.

`size` tells you the number of elements in an array.

In [None]:
a.size

12
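The three attributes are related, and a quick check makes the relationship concrete. This is only a sketch using the same 3 x 4 array as above; `rows` and `cols` are names chosen for the example.

```python
import numpy as np

a = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120]])

rows, cols = a.shape             # unpack the shape tuple
print(a.ndim == len(a.shape))    # True: one shape entry per axis
print(a.size == rows * cols)     # True: 12 == 3 * 4
```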

### Basic NumPy Functions

NumPy has several functions for creating arrays. `zeros` creates an array filled with zeroes. You have to tell the function the dimension (length).

In [None]:
a = np.zeros(5)
a

array([0., 0., 0., 0., 0.])

`ones` creates an array filled with ones.

In [None]:
a = np.ones(4)
a

array([1., 1., 1., 1.])

To create 2D arrays filled with all zeroes or all ones, we give a tuple with the dimensions.

In [None]:
np.zeros((4, 3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [None]:
np.ones((2, 5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

`arange` works like the regular `range` function in Python to create an array.

In [None]:
a = np.arange(4)
a

array([0, 1, 2, 3])

In [None]:
a = np.arange(2, 10, 3)
a

array([2, 5, 8])

`linspace` lets you create an array of equally-spaced values over an interval. Its third argument is the number of values to generate, and both endpoints are included. Let's create values from `0.0` to `4.0` separated by 8 equal intervals, which takes 9 points.

In [None]:
a = np.linspace(0, 4, 9)
a

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
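If you want to confirm the spacing, `linspace` can report it. The sketch below passes `retstep=True`, which makes the function return the step size along with the array; the names `values` and `step` are just for this example.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive values.
values, step = np.linspace(0, 4, 9, retstep=True)

print(len(values))  # 9 points ...
print(step)         # ... separated by 8 intervals of width 0.5
```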

### Indexing and Slicing

Slicing uses the same syntax for `ndarray` objects as for lists.

In [None]:
a = np.array([10, 20, 30, 40, 50])

In [None]:
a[0:2]

array([10, 20])

In [None]:
a[:3]

array([10, 20, 30])

In [None]:
a[-2:]

array([40, 50])
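One difference from lists is worth knowing: slicing a list makes a new list, but slicing an array gives a *view* that shares data with the original array. The sketch below illustrates this; the variable names are chosen for the example.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
lst = [10, 20, 30, 40, 50]

array_slice = a[0:2]    # a view onto the first two elements of a
list_slice = lst[0:2]   # a brand-new list

array_slice[0] = -1     # also changes a[0]
list_slice[0] = -1      # leaves lst untouched

print(a)                # [-1 20 30 40 50]
print(lst)              # [10, 20, 30, 40, 50]

# Call copy() when you want a slice that is independent of the original.
independent = a[0:2].copy()
independent[0] = 99
print(a[0])             # -1
```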

NumPy supports a special syntax for accessing and slicing individual elements or groups of elements.

As we saw earlier, to access a single element, use `[]` notation with a comma separating the row and column indices.

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
a[1,3]

8

We can slice out rows.

In [None]:
print(a)
print()
print(a[1:3, :])

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]

[[ 5  6  7  8]
 [ 9 10 11 12]]


We can slice out columns too.

In [None]:
print(a)
print()
print(a[:, 2:4])

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]

[[ 3  4]
 [ 7  8]
 [11 12]
 [15 16]]


And we can even slice out rows and columns at the same time.

In [None]:
print(a)
print()
print(a[1:3, 2:4])

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]

[[ 7  8]
 [11 12]]


NumPy can easily extract values that meet a certain condition.

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a[a < 7]

array([1, 2, 3, 4, 5, 6])

In [None]:
a[a >= 8]

array([ 8,  9, 10, 11, 12])

In [None]:
a[a % 2 == 0]

array([ 2,  4,  6,  8, 10, 12])

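To write expressions involving Boolean operators, use `&` in place of `and`, and `|` in place of `or`. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `<=` and `==`.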

In [None]:
a[(a % 2 == 0) | (a <= 5)]

array([ 1,  2,  3,  4,  5,  6,  8, 10, 12])
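For completeness, here is a sketch of the "and" case, along with `~`, which plays the role of `not`. Each condition produces an array of `True`/`False` values (often called a *mask*) with the same shape as `a`; the name `mask` is chosen for the example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Keep only the values that are even AND at most 5.
mask = (a % 2 == 0) & (a <= 5)
print(mask.shape)          # (3, 4), the same shape as a
print(a[mask])             # [2 4]

# ~ inverts a mask: keep the values that are NOT even.
print(a[~(a % 2 == 0)])    # [ 1  3  5  7  9 11]
```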

Conditions can also be assigned to variables.

In [None]:
even = (a % 2 == 0)
at_most_5 = (a <= 5)  # True where the value is less than or equal to 5
a[even | at_most_5]

array([ 1,  2,  3,  4,  5,  6,  8, 10, 12])

Conditions can also be used to change the contents of an array. 

In [None]:
# Change all multiples of 3 to -1.
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a[a % 3 == 0] = -1
a

array([[ 1,  2, -1,  4],
       [ 5, -1,  7,  8],
       [-1, 10, 11, -1]])

### Example: Create a "Border" of Ones

Let's create a 2D array of 5 rows and 6 columns in which every element is zero except for the elements around the border (top/bottom rows, leftmost/rightmost columns), which are ones. One way is to start with all ones and then zero out the interior.

In [None]:
a = np.ones((5, 6))
a[1:4, 1:5] = 0
a

array([[1., 1., 1., 1., 1., 1.],
       [1., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.],
       [1., 1., 1., 1., 1., 1.]])
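The same array can be built the other way around: start from zeros and set the four edges to one. This is just an alternative sketch (the names `b` and `expected` are made up here), and it uses negative indices to reach the last row and column.

```python
import numpy as np

b = np.zeros((5, 6))
b[0, :] = 1    # top row
b[-1, :] = 1   # bottom row
b[:, 0] = 1    # leftmost column
b[:, -1] = 1   # rightmost column

# Compare against the "ones with a zeroed interior" approach.
expected = np.ones((5, 6))
expected[1:4, 1:5] = 0
print(np.array_equal(b, expected))   # True
```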

### Basic Arithmetical Operations

Arrays can be added, subtracted, multiplied and divided using the regular 4 arithmetical operators.

In [None]:
a = np.array([3, 4, 5, 6, 7])
b = np.array([2, 2, 2, 2, 2])
c = a + b
c

array([5, 6, 7, 8, 9])

In [None]:
c = a - b
c

array([1, 2, 3, 4, 5])

In [None]:
c = a * b
c

array([ 6,  8, 10, 12, 14])

In [None]:
c = a / b  # use // for integer division
c

array([1.5, 2. , 2.5, 3. , 3.5])
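The other arithmetic operators work element by element as well. As a quick sketch with the same two arrays, here are integer division, remainder and exponentiation.

```python
import numpy as np

a = np.array([3, 4, 5, 6, 7])
b = np.array([2, 2, 2, 2, 2])

print(a // b)   # [1 2 2 3 3]
print(a % b)    # [1 0 1 0 1]
print(a ** b)   # [ 9 16 25 36 49]
```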

You can also use arithmetical operators with an array and a single value (a *scalar*).

In [None]:
a = np.array([1, 2, 3, 4])
a + 2

array([3, 4, 5, 6])

In [None]:
a - 1

array([0, 1, 2, 3])

In [None]:
1 - a

array([ 0, -1, -2, -3])

In [None]:
a * 2

array([2, 4, 6, 8])

In [None]:
a / 3

array([0.33333333, 0.66666667, 1.        , 1.33333333])

In [None]:
3 / a

array([3.  , 1.5 , 1.  , 0.75])

The `sum` method returns the sum of the values in an array.

In [None]:
a.sum()

10

You can also sum the rows or columns of a 2D array.

In [None]:
a = np.array([[1, 2, 3, 4], 
              [2, 4, 6, 8],
              [1, 3, 5, 7]])

Sum the columns:

In [None]:
a.sum(axis=0)

array([ 4,  9, 14, 19])

Sum the rows:


In [None]:
a.sum(axis=1)

array([10, 20, 16])
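One way to remember which axis is which: the axis you name is the one that gets collapsed. The sketch below checks the shapes of the results; `col_sums` and `row_sums` are names chosen for the example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [2, 4, 6, 8],
              [1, 3, 5, 7]])

col_sums = a.sum(axis=0)   # collapses axis 0 (the rows): one total per column
row_sums = a.sum(axis=1)   # collapses axis 1 (the columns): one total per row

print(a.shape, col_sums.shape, row_sums.shape)   # (3, 4) (4,) (3,)
```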

You can use indexing and slicing to sum up a particular row or column.

Let's sum the values in row 2.

In [None]:
a[2, :].sum()

16

Now sum the values in column 1.

In [None]:
a[:, 1].sum()

9

Methods `min` and `max` perform the same functions as their regular Python analogs.

In [None]:
print(a.min())
print(a.max())

1
8


`mean` lets you compute the average value overall:

In [None]:
a.mean()

3.8333333333333335

or down the columns,

In [None]:
a.mean(axis=0)

array([1.33333333, 3.        , 4.66666667, 6.33333333])

or across the rows.

In [None]:
a.mean(axis=1)

array([2.5, 5. , 4. ])

### Example: Implement a Mathematical Formula

Suppose we have two sequences of values, $x$ and $y$, and we want to compute the following value:

$z = \frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2$

In [None]:
x = [8, 6, 7, 5, 3, 0, 9]
y = [2, 1, 4, 7, 6, 2, 3]

In regular Python, we might write something like this:

In [None]:
z = 1/len(x) * sum([(xi-yi)**2 for xi, yi in zip(x, y)])
z

17.57142857142857

In NumPy, we would instead write:

In [None]:
x = np.array(x)
y = np.array(y)
z = 1/len(x) * np.sum(np.square(x - y))
z

17.57142857142857

`square` is one of the many [built-in mathematical functions](https://numpy.org/devdocs/reference/routines.math.html) in NumPy.
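Because the formula is an average of squared differences, it can also be written with `mean`. This is an equivalent sketch, not a replacement for the version above.

```python
import numpy as np

x = np.array([8, 6, 7, 5, 3, 0, 9])
y = np.array([2, 1, 4, 7, 6, 2, 3])

# Averaging the squared differences is the same as summing them and dividing by n.
z = np.mean((x - y) ** 2)
print(z)
```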

### Application: Estimate the Value of $\pi$

The [Gregory–Leibniz series](https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Gregory%E2%80%93Leibniz_series) provides a formula for approximating $\pi$. Here it is with a fixed maximum value, $n$ (instead of infinity, as in the real formula):

$\pi \approx 4 \sum_{i=0}^n \frac{(-1)^i}{2i+1} = 4\left(\frac{1}{1} - \frac{1}{3}  + \frac{1}{5}  - \frac{1}{7}  + \cdots + \frac{(-1)^n}{2n+1} \right)$

The `vectorize` function below calls the `term` function on every value produced by the `arange` function.

In [None]:
def term(j):
    return (-1) ** j / (2*j + 1)

approx = 4*np.sum(np.vectorize(term)(np.arange(0,1000)))
approx

3.140592653839792
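The NumPy documentation notes that `vectorize` is provided mainly for convenience and is essentially a loop. As an alternative sketch, the same sum can be written with array arithmetic alone. The names `i` and `signs` are chosen for this example, and `np.where` picks `1.0` or `-1.0` depending on whether each index is even.

```python
import numpy as np

i = np.arange(1000)                        # 0, 1, ..., 999
signs = np.where(i % 2 == 0, 1.0, -1.0)    # +1 for even i, -1 for odd i
approx = 4 * np.sum(signs / (2 * i + 1))
print(approx)
```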

### Example: Compute the Standard Deviation

In statistics, the **standard deviation** (denoted $\sigma$) of a set of numbers is a measure of the *spread* of the values. For the data values $x_1, x_2, \ldots, x_n$, the standard deviation can be computed using $\sigma = \sqrt{\dfrac{\sum_{i=1}^n(x_i - \mu)^2}{n}}$ where $\mu$ is the mean (average) of the values.

NumPy has a built-in function for computing the standard deviation, `std`, but we will also see how we can use the other built-in arithmetical NumPy functions for calculating it.

In [None]:
scores = np.array([100, 88, 91, 65, 87, 70, 63, 93])
scores.std()

13.13808871183324

Now let's compute it manually, to practice with NumPy arithmetical functions.

In [None]:
sigma = np.sqrt(np.sum(np.square(scores - np.mean(scores))) / len(scores))
sigma

13.13808871183324

That formula is not too bad, but simply calling `std` was a lot easier.
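The formula above divides by $n$, which is what `std` does by default. Many statistics tools, including pandas, instead divide by $n - 1$ by default (the *sample* standard deviation). In NumPy that is selected with the `ddof` argument, as this sketch shows; the variable names are chosen for the example.

```python
import numpy as np

scores = np.array([100, 88, 91, 65, 87, 70, 63, 93])
n = len(scores)

population_std = scores.std()        # divides by n (ddof=0, the default)
sample_std = scores.std(ddof=1)      # divides by n - 1

print(population_std)
print(sample_std)                    # slightly larger than population_std
```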

Now let's compute $\sigma$ using regular Python.

In [None]:
import math
scores = scores.tolist()  # convert the numpy array to a regular list
mu = sum(scores) / len(scores)
sigma = math.sqrt(sum([(x - mu)**2 for x in scores]) / len(scores))
sigma

13.13808871183324

With a little refactoring to eliminate a variable, we can make the code a little shorter, but the NumPy approach is still a lot easier.

In [None]:
import math
sigma = math.sqrt(sum([(x - sum(scores) / len(scores))**2 / len(scores) for x in scores]))
sigma

13.13808871183324

### Example: Computing Exam Averages

Certain data processing steps are easier in the `pandas` library than they are in NumPy. pandas is a library built on top of NumPy for doing **data science**. Its basic data structure for data management is called a **dataframe**.

Reading in a CSV file is a simple matter of calling the `read_csv` function, which returns a dataframe. The `head` method used below lets you quickly view the first 5 rows of data.

In [None]:
import pandas as pd 

df = pd.read_csv('exams1.csv') 
df.head()

|   | Exam1 | Exam2 | Exam3 |
|---|-------|-------|-------|
| 0 | 96 | 92 | 50 |
| 1 | 50 | 90 | 80 |
| 2 | 70 | 89 | 87 |
| 3 | 68 | 81 | 62 |
| 4 | 52 | 64 | 74 |
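If you don't have the CSV file handy, you can experiment with `read_csv` by feeding it text held in memory. The sketch below assumes the file has a header row followed by comma-separated scores; `csv_text` is a made-up stand-in for this example, not the actual contents of `exams1.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first few lines of a CSV file.
csv_text = """Exam1,Exam2,Exam3
96,92,50
50,90,80
70,89,87
"""

# read_csv accepts file-like objects as well as file names.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)      # (3, 3)
print(df.head(2))    # head takes an optional number of rows
```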


A dataframe's `columns` and `dtypes` attributes will give you the column names and their data types, respectively.

In [None]:
df.columns

Index(['Exam1', 'Exam2', 'Exam3'], dtype='object')

In [None]:
df.dtypes

Exam1    int64
Exam2    int64
Exam3    int64
dtype: object

To get a NumPy array out of a dataframe, just access the `values` attribute. (The pandas documentation recommends the equivalent `to_numpy()` method for new code.)

In [None]:
df.values

array([[96, 92, 50],
       [50, 90, 80],
       [70, 89, 87],
       [68, 81, 62],
       [52, 64, 74],
       [99, 68, 77],
       [54, 44, 71],
       [85, 68, 70],
       [52, 57, 88],
       [81, 92, 59],
       [93, 80, 76],
       [96, 43, 70],
       [49, 88, 61],
       [63, 69, 71],
       [88, 78, 69],
       [63, 87, 85],
       [53, 87, 87],
       [48, 68, 85],
       [41, 52, 53],
       [71, 79, 84],
       [98, 54, 98],
       [99, 66, 92],
       [43, 41, 83],
       [42, 79, 89]])

Now we can compute some statistics quickly and easily. Let's compute the average score on each exam.

In [None]:
df.mean(axis=0)

Exam1    68.916667
Exam2    71.500000
Exam3    75.875000
dtype: float64

Now compute each student's average.

In [None]:
df.mean(axis=1)

0     79.333333
1     73.333333
2     82.000000
3     70.333333
4     63.333333
5     81.333333
6     56.333333
7     74.333333
8     65.666667
9     77.333333
10    83.000000
11    69.666667
12    66.000000
13    67.666667
14    78.333333
15    78.333333
16    75.666667
17    67.000000
18    48.666667
19    78.000000
20    83.333333
21    85.666667
22    55.666667
23    70.000000
dtype: float64

This is not so useful without the names. Let's try a better data-set.

In [None]:
df = pd.read_csv('exams2.csv') 
df.head()

|   | Name | Exam1 | Exam2 | Exam3 |
|---|------|-------|-------|-------|
| 0 | Shea | 96 | 92 | 50 |
| 1 | Nabila | 50 | 90 | 80 |
| 2 | Hania | 70 | 89 | 87 |
| 3 | Gia | 68 | 81 | 62 |
| 4 | Dione | 52 | 64 | 74 |


Let's compute some averages.

In [None]:
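# numeric_only=True skips the Name column; recent pandas versions
# raise an error if asked to average a column of strings.
df.mean(axis=0, numeric_only=True)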

Exam1    68.916667
Exam2    71.500000
Exam3    75.875000
dtype: float64

In [None]:
# numeric_only=True leaves the Name column out of each row's average.
df.mean(axis=1, numeric_only=True).head()

0    79.333333
1    73.333333
2    82.000000
3    70.333333
4    63.333333
dtype: float64

pandas lets you select a column by its name.

In [None]:
df['Name'].head()

0      Shea
1    Nabila
2     Hania
3       Gia
4     Dione
Name: Name, dtype: object

We can pass a list of column names inside the brackets to extract multiple columns at the same time.

In [None]:
df[['Name', 'Exam1']].head()

|   | Name | Exam1 |
|---|------|-------|
| 0 | Shea | 96 |
| 1 | Nabila | 50 |
| 2 | Hania | 70 |
| 3 | Gia | 68 |
| 4 | Dione | 52 |


pandas makes it easy to sort the data-set by individual columns.

In [None]:
df.sort_values(by='Exam1').head()

|    | Name | Exam1 | Exam2 | Exam3 |
|----|------|-------|-------|-------|
| 18 | Zaki | 41 | 52 | 53 |
| 23 | Sullivan | 42 | 79 | 89 |
| 22 | Renee | 43 | 41 | 83 |
| 17 | Noel | 48 | 68 | 85 |
| 12 | Becky | 49 | 88 | 61 |


In [None]:
df.sort_values(by='Exam3', ascending=False).head(3)

|    | Name | Exam1 | Exam2 | Exam3 |
|----|------|-------|-------|-------|
| 20 | Garrett | 98 | 54 | 98 |
| 21 | Heidi | 99 | 66 | 92 |
| 23 | Sullivan | 42 | 79 | 89 |


Now let's compute weighted averages. Let's set the weight of Exam 1 at 50%, and the other two exams at 25% each.

In [None]:
weights = pd.Series([0.5, 0.25, 0.25], index=['Exam1', 'Exam2', 'Exam3'])
averages = (df[['Exam1', 'Exam2', 'Exam3']] * weights).sum(axis=1)
df['Average'] = averages  # add a new column to the dataframe
df.sort_values(by='Average', ascending=False).head()

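|    | Name | Exam1 | Exam2 | Exam3 | Average |
|----|------|-------|-------|-------|---------|
| 21 | Heidi | 99 | 66 | 92 | 89.0 |
| 20 | Garrett | 98 | 54 | 98 | 87.0 |
| 5 | Amara | 99 | 68 | 77 | 85.75 |
| 10 | Ralph | 93 | 80 | 76 | 85.5 |
| 0 | Shea | 96 | 92 | 50 | 83.5 |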
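Multiplying the dataframe by the `weights` series works because pandas lines up the series labels with the column names. The same weighted average can also be computed in NumPy as a matrix-vector product. The sketch below uses only the first three students so that it runs without the CSV file; `exam_cols` is a name chosen for the example.

```python
import numpy as np
import pandas as pd

# First three rows of the exam data, typed in by hand for this sketch.
df = pd.DataFrame({'Name': ['Shea', 'Nabila', 'Hania'],
                   'Exam1': [96, 50, 70],
                   'Exam2': [92, 90, 89],
                   'Exam3': [50, 80, 87]})

exam_cols = ['Exam1', 'Exam2', 'Exam3']
weights = np.array([0.5, 0.25, 0.25])   # same order as exam_cols

# Matrix-vector product: each row's scores times the weights, then summed.
averages = df[exam_cols].to_numpy() @ weights
print(averages)   # [83.5 67.5 79. ]
```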


pandas has a great function called `describe` that gives you *summary statistics* about each column of data in a data-set.

In [None]:
df.describe()

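|       | Exam1 | Exam2 | Exam3 | Average |
|-------|-------|-------|-------|---------|
| count | 24.0 | 24.0 | 24.0 | 24.0 |
| mean | 68.916667 | 71.5 | 75.875 | 71.302083 |
| std | 20.813701 | 16.264124 | 12.643309 | 11.586625 |
| min | 41.0 | 41.0 | 50.0 | 46.75 |
| 25% | 51.5 | 62.25 | 69.75 | 62.25 |
| 50% | 65.5 | 73.5 | 76.5 | 72.25 |
| 75% | 89.25 | 87.0 | 85.5 | 79.4375 |
| max | 99.0 | 92.0 | 98.0 | 89.0 |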


The row marked **50%** gives the median of each column.

Let's assign everyone a letter grade. The NumPy `digitize` function lets you assign values within ranges to categories, so let's set up those ranges.

In [None]:
bins = [0, 60, 70, 80, 90]

Now let's do the mapping with `digitize`. For each numerical average, we get a bin number in the range 1 - 5 (not 0 - 4).

In [None]:
np.digitize(df['Average'], bins)

array([4, 2, 3, 2, 2, 4, 1, 3, 2, 3, 4, 3, 2, 2, 4, 3, 3, 2, 1, 3, 4, 4,
       1, 2])

Now we need a dictionary that maps bin numbers to grades (e.g., map 1 to `'F'`, 2 to `'D'`, etc.). Passing a start value of 1 to `enumerate` makes this a snap.

In [None]:
letters = ['F', 'D', 'C', 'B', 'A']
letter_grades = dict(enumerate(letters, 1))
letter_grades

{1: 'F', 2: 'D', 3: 'C', 4: 'B', 5: 'A'}

Finally, we need to look up each bin number in the `letter_grades` dictionary to get the letter grade. The `vectorize` function below calls the dictionary's `get` method on every bin number to look up the letter grade.

In [None]:
df['Grade'] = np.vectorize(letter_grades.get)(np.digitize(df['Average'], bins))
df.head()

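|   | Name | Exam1 | Exam2 | Exam3 | Average | Grade |
|---|------|-------|-------|-------|---------|-------|
| 0 | Shea | 96 | 92 | 50 | 83.5 | B |
| 1 | Nabila | 50 | 90 | 80 | 67.5 | D |
| 2 | Hania | 70 | 89 | 87 | 79.0 | C |
| 3 | Gia | 68 | 81 | 62 | 69.75 | D |
| 4 | Dione | 52 | 64 | 74 | 60.5 | D |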
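The dictionary lookup can also be replaced by plain array indexing: put the letters in a NumPy array and index it with the bin numbers, shifted down by one. This is an alternative sketch; the `averages` array is typed in by hand, with two extra values (95.0 and 60.0) added to show the top bin and a bin boundary. It assumes no average is negative, since a negative value would land in bin 0.

```python
import numpy as np

bins = [0, 60, 70, 80, 90]
letters = np.array(['F', 'D', 'C', 'B', 'A'])
averages = np.array([83.5, 67.5, 79.0, 69.75, 60.5, 95.0, 60.0])

# digitize returns 1-5 for non-negative averages; subtract 1 for 0-based positions.
grades = letters[np.digitize(averages, bins) - 1]
print(grades)   # ['B' 'D' 'C' 'D' 'D' 'A' 'D']
```

Note that an average of exactly 60.0 falls in the `'D'` bin, because each bin includes its lower edge.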
