# Numpy tutorial

Oliver W. Layton

CS251/2: Data Analysis and Visualization

Spring 2021

In [None]:
import numpy as np
import time

## Numpy ndarray basics

### Creation from Python lists

In [None]:
# Make a numpy array from a 2D python list

In [None]:
# print it
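
A minimal sketch of these two steps. The nested list values and the names `nested` and `arr_from_list` are illustrative assumptions, not part of the original notebook:

```python
import numpy as np

# A 2D Python list: 2 rows, 3 columns (illustrative values)
nested = [[1, 2, 3], [4, 5, 6]]

# np.array converts the nested list into a 2D ndarray
arr_from_list = np.array(nested)

print(arr_from_list)
print(type(arr_from_list))  # <class 'numpy.ndarray'>
```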


### Data type of ndarray

In [None]:
# determine data type


Type can be changed in a few ways. 

1. when creating array — (a) implicitly or (b) explicitly
2. by casting types.

In [None]:
# 1a implicitly


In [None]:
# 1b explicitly


In [None]:
# 2. NOTE: This is a METHOD of the array, not a FUNCTION


In [None]:
# The dtype can also be a string type. Be careful in your CSV parser that your "numbers"
# aren't actually strings!
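
The ways of setting and changing the data type listed above might look like the following sketch. All variable names and values are illustrative assumptions:

```python
import numpy as np

# Check the data type with the .dtype attribute
ints = np.array([1, 2, 3])
print(ints.dtype)  # an integer dtype; the exact width depends on the platform

# 1a. Implicitly: one float in the list makes the whole array float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# 1b. Explicitly: pass the dtype keyword when creating the array
explicit = np.array([1, 2, 3], dtype=np.float64)
print(explicit.dtype)  # float64

# 2. Casting: astype is a METHOD of the array and returns a new array
as_int = mixed.astype(int)  # the fractional part is dropped: 2.5 becomes 2
print(as_int)

# Strings: these look like numbers but are stored as text
strs = np.array(['1', '2', '3'])
print(strs.dtype.kind)  # U (Unicode string)
parsed = strs.astype(float)  # convert the text to real numbers
print(parsed.dtype)  # float64
```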


### Convert back to Python list

In [None]:
# Convert back from ndarray to Python list

print('Back as a Python list:\n', arrAsList)
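
One way to do the conversion is the `tolist()` method. This is a sketch with an assumed example array; the names `arr` and `arr_as_list` are illustrative:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# tolist() returns nested Python lists holding plain Python numbers
arr_as_list = arr.tolist()

print('Back as a Python list:\n', arr_as_list)
print(type(arr_as_list))  # <class 'list'>
```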

### Other ways to create ndarrays quickly

#### 1. zeros

- We can plug in a list to get a multi-dimensional array
- We can plug in one int to get a vector of values

#### 2. ones

In [None]:
# can easily make any constant array


#### 3. Random values

In [None]:
# Uniform random values


#### 4. Equally spaced floats in an interval

#### 5. Equally spaced ints in an interval

#### 6. Identity matrix
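
The six creation routines listed above can be sketched as follows. The shapes, intervals, and variable names are illustrative assumptions:

```python
import numpy as np

# 1. zeros: a list gives a multi-dimensional array, one int gives a vector
zeros_2d = np.zeros([2, 3])
zeros_1d = np.zeros(4)

# 2. ones: scale it to make any constant array
sevens = 7 * np.ones([2, 2])

# 3. Uniform random values in the interval [0, 1)
np.random.seed(0)
uniform = np.random.random((2, 3))

# 4. Equally spaced floats: 5 values from 0 to 1, endpoints included
floats = np.linspace(0, 1, 5)
print(floats)  # [0.   0.25 0.5  0.75 1.  ]

# 5. Equally spaced ints: start 2, stop before 10, step 2
ints = np.arange(2, 10, 2)
print(ints)  # [2 4 6 8]

# 6. Identity matrix: ones on the diagonal, zeros elsewhere
ident = np.eye(3)
print(ident)
```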

### Check dimensions — `shape`

In [None]:
# check shape of 3D array


In [None]:
# check number of dimensions (M)


In [None]:
# Access 1st dim (#rows), 2nd dim (#cols) (Use f-string)


In [None]:
# Check number of elements total
print('Num elements in arr_1:', )
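
A short sketch of the shape-related attributes used in the cells above. The arrays `arr_3d` and `arr_2d` and their sizes are assumptions made for illustration:

```python
import numpy as np

arr_3d = np.zeros([2, 3, 4])

# shape is a tuple with one entry per dimension
print(arr_3d.shape)  # (2, 3, 4)

# ndim is the number of dimensions (the length of the shape tuple)
print(arr_3d.ndim)  # 3

# Index into shape to get the size of one dimension
arr_2d = np.zeros([5, 6])
print(f'Rows: {arr_2d.shape[0]}, cols: {arr_2d.shape[1]}')  # Rows: 5, cols: 6

# size is the total number of elements (the product of the shape entries)
print('Num elements in arr_3d:', arr_3d.size)  # 24
```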

## Brief detour: Rapidly build python lists (list comprehension)

In [None]:
# Brief detour: In python you can replace the workflow of 
# list-building by creating an empty list and looping to append...
myList = []
for i in range(5):
    myList.append(i)
print('myList built the usual way', myList)

In [None]:
# ...with Python list comprehensions
myListComp = [i for i in range(5)]
print('myListComp', myListComp)

In [None]:
# you can build lists using any function of i. How about i squared?
# (In Python that is i**2; the ^ operator means XOR.)
myListSqr = [i**2 for i in range(5)]
print('myListSqr', myListSqr)

## ndarray indexing

Basic access and modification of ndarrays.

### Access and modify single elements

In [None]:
# To access elements in a multidimensional ndarray use ONE set of square brackets []
# Make a new random array
np.random.seed(0)  # ensures random numbers come up the same each time. Useful for debugging.
arr = np.random.random((3, 4))  # shape (3, 4) is an assumed example size

In [None]:
# Get the 1st element


In [None]:
# Modifying single values is similar

print('arr is now:\n', arr)
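
A minimal sketch of single-element access and modification. The array shape and the assigned values are assumptions for illustration:

```python
import numpy as np

np.random.seed(0)
arr = np.random.random((3, 4))  # assumed shape: 3 rows, 4 columns

# ONE set of square brackets, with indices separated by a comma: arr[row, col]
first = arr[0, 0]
print('1st element:', first)

# Assignment uses the same syntax
arr[0, 0] = -1

# Negative indices count from the end: this is the last row, last column
arr[-1, -1] = 100

print('arr is now:\n', arr)
```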

### Slicing: real power of numpy

Use **colon** notation for all values in a dimension

Access and modify different ranges of data along different dimensions 

Make a 3x5 random array. Access 2nd column

In [None]:
np.random.seed(0)


Access 1st row

Access last 2 columns

Access columns at indices 1-2 in the 1st row. Careful about off-by-one errors.

- Low end of the range (before :) INCLUDES that index
- High end of the range (after :) EXCLUDES that index, so `i:j` stops at index `j-1`

Use slicing to assign values efficiently in batch without loops

In [None]:
# Assign 1st row to -1s


In [None]:
# Assign 1st row to increasing ints


In [None]:
# Multiply the 3rd row by 5 and update the row
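
The slicing operations in this section can be sketched as below. To keep the printed values easy to check, this example assumes a 3x5 array of consecutive ints rather than random values:

```python
import numpy as np

# 3x5 array holding 0..14 (assumed stand-in for the random array)
arr = np.arange(15).reshape(3, 5)

# A lone colon means "everything along this dimension"
print(arr[:, 1])    # 2nd column: [ 1  6 11]
print(arr[0, :])    # 1st row: [0 1 2 3 4]
print(arr[:, -2:])  # last 2 columns, shape (3, 2)

# Columns at indices 1-2 in the 1st row: the stop index 3 is excluded
print(arr[0, 1:3])  # [1 2]

# Batch assignment without loops
arr[0, :] = -1             # assign 1st row to -1s
arr[0, :] = np.arange(5)   # assign 1st row to increasing ints
arr[2, :] = 5 * arr[2, :]  # multiply the 3rd row by 5 and update the row

print(arr)
```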


### What if we want to access a set of rows or columns that are not adjacent?

Can't use colon notation. Instead use `np.ix_`

In [None]:
arr

Example: Say we want column indices 0, 2, 4 and all rows.

**Syntax for `np.ix_`:**
- `np.ix_` goes inside the square brackets: `arr[np.ix_(blah)]`
- Give it `M` arguments (e.g. 2 for a 2D matrix).
- Each argument is a Python list (or ndarray) of indices to take along that dimension.
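
The syntax above can be sketched as follows. Note that the function is spelled `np.ix_` (trailing underscore). The example array is an assumption chosen so the selected values are easy to read:

```python
import numpy as np

arr = np.arange(15).reshape(3, 5)  # assumed 3x5 example array

# All rows, column indices 0, 2, 4: one index list per dimension
all_rows = np.arange(arr.shape[0])
subset = arr[np.ix_(all_rows, [0, 2, 4])]
print(subset)
# [[ 0  2  4]
#  [ 5  7  9]
#  [10 12 14]]

# Non-adjacent rows AND columns: the four corner values
corners = arr[np.ix_([0, 2], [0, 4])]
print(corners)
# [[ 0  4]
#  [10 14]]
```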

## Memory

- Numpy tries to be efficient with arrays, so assignment (`b = a`) does not copy any data: both names refer to the same array. To get an independent (deep) copy, use the `.copy()` method.

In [None]:
a = np.linspace(-1, 1, 5)
a

In [None]:
b = a
b[0] = 99  # modify b in place (illustrative value)
print(b)

In [None]:
# changed a!
a

In [None]:
# fixed with .copy()
b = a.copy()  # b gets its own data
b[1] = -50  # illustrative value; a is unaffected
print(a)
print(b)
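
To confirm whether two names share data, one option is `np.shares_memory`. This is a sketch; the names `alias` and `independent` and the value 99 are illustrative assumptions:

```python
import numpy as np

a = np.linspace(-1, 1, 5)
alias = a               # plain assignment: a second name for the same array
independent = a.copy()  # a separate array with its own data

alias[0] = 99           # writes through to a

print(a[0])            # 99.0
print(independent[0])  # -1.0
print(alias is a)      # True
print(np.shares_memory(a, independent))  # False
```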

## Apply functions over dimensions (`axes`)

- Axes are the numpy term for the different ndarray dimensions.
- *Idea*: Do we want to apply an operation (e.g. sum) on the rows OR columns of an ndarray?
- *Example*: axis 0 indexes the rows, axis 1 indexes the columns, etc.
- We can apply functions over one or more axes efficiently in one line of code! This is called **Vectorization**, and it is MUCH MUCH faster than loops (stay tuned).

In [None]:
one = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
one

Sum along rows -> "collapse" across rows to get sum within each column — 3 numbers

Sum along columns -> "collapse" across columns to get sum within each row — 4 numbers

**Careful:** Applying a function without specifying the axis may compute across the ENTIRE ndarray.

**Mnemonic trick:** Applying a function along an axis eliminates that dimension from the shape. Left with remaining dimensions.
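
The sums described above, applied to the `one` array from this section, can be sketched as:

```python
import numpy as np

one = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])  # shape (4, 3)

# axis=0: collapse across rows, leaving one sum per column
col_sums = one.sum(axis=0)
print(col_sums, col_sums.shape)  # [10 10 10] (3,)

# axis=1: collapse across columns, leaving one sum per row
row_sums = one.sum(axis=1)
print(row_sums, row_sums.shape)  # [ 3  6  9 12] (4,)

# No axis: the sum runs over the ENTIRE ndarray
total = one.sum()
print(total)  # 30
```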

In [None]:
print(one.shape)
the_mean = one.mean(axis=0)
print(f'Mean across axis 0: {the_mean.shape}')

print(one.shape)
the_mean = one.mean(axis=1)
print(f'Mean across axis 1: {the_mean.shape}')

## Broadcasting

**This is the most useful numpy feature thus far! This will become your bread-and-butter!**

### Simple example: Scalars

As we saw, we can create an array of any size with any constant value WITHOUT ANY LOOPS. This is the simplest example of numpy **broadcasting** the scalar across the ndarray.
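
A minimal sketch of scalar broadcasting with basic arithmetic. The arrays and names are assumptions for illustration:

```python
import numpy as np

ones_2x3 = np.ones([2, 3])

# The scalar is applied to every element, no loops needed
fives = 5 * ones_2x3
shifted = ones_2x3 + 10

counts = np.arange(4)
squares = counts ** 2
halves = counts / 2

print(fives)
print(shifted)
print(squares)  # [0 1 4 9]
print(halves)   # [0.  0.5 1.  1.5]
```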

In [None]:
# Example with basic arithmetic


### Example: Subtract the minimum value of each column from a 2D array

In [None]:
np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))
rand_inds

In [None]:
# Take the min across each column and subtract it from rand_inds. Print shape of result


#### What's going on??

Let's look at the shapes:

Numpy is **broadcasting** the vector to operate on the 2d ndarray (**draw this out on board**):

- Numpy looks for axis shape compatibility among the different arrays.
- Numpy sees the column dimension (5, **6**) matches the min vector (**6,**)
- Numpy adds a **singleton dimension** (a "fake" leading dimension for rows). Now the min vector is treated with shape: **(1, 6)**.
- Numpy dynamically "grows" the singleton dimension to the needed shape (5). So now the min vector is treated like a (5, 6) array.
- Numpy subtracts the two arrays element-wise: a (5, 6) array can be subtracted from a (5, 6) array.
- Numpy returns the result, which is a (5, 6) array!

Process is **very memory efficient**: the smaller array is never actually copied out to the larger shape during broadcasting. Only the result array is allocated.
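
The column-minimum subtraction walked through above can be sketched as below. The use of `np.broadcast_to` to peek at the "grown" vector is an addition for illustration, not something the original notebook uses:

```python
import numpy as np

np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))

# Min across each column: collapsing axis 0 leaves one value per column
col_min = np.min(rand_inds, axis=0)
print(col_min.shape)  # (6,)

# (5, 6) minus (6,): numpy treats the vector as (1, 6), then as (5, 6)
shifted = rand_inds - col_min
print(shifted.shape)  # (5, 6)

# The grown vector is a view that repeats the same 6 values.
# A stride of 0 along the rows means no extra copies are stored.
virtual = np.broadcast_to(col_min, rand_inds.shape)
print(virtual.shape)       # (5, 6)
print(virtual.strides[0])  # 0
```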

#### Broadcasting only adds singleton dimensions LEFTWARD to the ndarray with smaller number of dimensions.

What if we did the same thing as above, but now wanted to **subtract the minimum value in each row**?

In [None]:
print(f'rand_inds shape: {rand_inds.shape}')
print(f'min shape: {np.min(rand_inds, axis=1).shape}')

**Problem:** adding singleton dimensions to the LEFT could never make the shapes compatible!! For example, numpy tries: 

    rand_inds shape: (5, 6)
    min shape: (1, 5)

but that won't work! Crash...

#### Adding singleton dimensions by ourselves

We can help numpy out and add a "new axis" ourselves to make the shapes compatible with `np.newaxis`!

Can also make this more readable by defining a temp variable...
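
A sketch of the row-minimum case: the direct subtraction fails, and adding a trailing axis with `np.newaxis` fixes it. The temp variable name `row_min_col` is an illustrative assumption:

```python
import numpy as np

np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))

row_min = np.min(rand_inds, axis=1)
print(row_min.shape)  # (5,)

# (5, 6) minus (5,) is not compatible: numpy would try (1, 5)
try:
    rand_inds - row_min
except ValueError as err:
    print('Broadcast failed:', err)

# Add a singleton dimension on the RIGHT ourselves: (5,) becomes (5, 1)
row_min_col = row_min[:, np.newaxis]
print(row_min_col.shape)  # (5, 1)

# (5, 6) minus (5, 1) broadcasts across the columns
shifted = rand_inds - row_min_col
print(shifted.shape)  # (5, 6)
```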

#### Squeeze if you need to get rid of all singleton dimensions
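
A minimal sketch of `np.squeeze`, using assumed example arrays:

```python
import numpy as np

# A column vector with a singleton dimension: shape (5, 1)
col_vec = np.arange(5)[:, np.newaxis]
print(col_vec.shape)  # (5, 1)

# squeeze drops every dimension of size 1
flat = np.squeeze(col_vec)
print(flat.shape)  # (5,)

# With several singleton dimensions, all of them are removed...
stacked = np.zeros([1, 3, 1])
print(np.squeeze(stacked).shape)  # (3,)

# ...unless an axis is given, in which case only that one is removed
print(np.squeeze(stacked, axis=0).shape)  # (3, 1)
```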

"Undo" a new axis / singleton dimension

#### Not automatically squeezing computations

As we saw above, functions like `min` over an axis eliminate that axis from the result: 

In [None]:
print(f'Min over axis 0 shape: {np.min(rand_inds, axis=0).shape}')
print(f'Min over axis 1 shape: {np.min(rand_inds, axis=1).shape}')

If we want `min` (and some other functions) to keep the singleton dimension, we can use the optional argument `keepdims=True`:

In [None]:
print(f'Min over axis 0 shape: {np.min(rand_inds, axis=0, keepdims=True).shape}')
print(f'Min over axis 1 shape: {np.min(rand_inds, axis=1, keepdims=True).shape}')

This can **help with broadcasting compatibility when performing an operation on an axis then modifying the original ndarray.** Then we don't need to manually add a new axis.

Example with subtracting the mean:

In [None]:
rand_inds_centered = rand_inds.copy()
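
One possible way to finish the mean-subtraction example is sketched below. Which axis to center along is an assumption here (each row, axis 1). Note that `rand_inds` holds ints while the mean is a float, so this sketch builds a new float array instead of updating in place with `-=`:

```python
import numpy as np

np.random.seed(0)
rand_inds = np.random.randint(low=0, high=5, size=(5, 6))

# keepdims=True keeps the collapsed axis as a singleton: shape (5, 1)
row_means = rand_inds.mean(axis=1, keepdims=True)
print(row_means.shape)  # (5, 1)

# (5, 6) minus (5, 1) broadcasts without needing np.newaxis
rand_inds_centered = rand_inds - row_means
print(rand_inds_centered.shape)  # (5, 6)

# Each row now has a mean of (approximately) zero
print(np.allclose(rand_inds_centered.mean(axis=1), 0))  # True
```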


## Vectorization speed vs loops

Time computation of summing an ndarray with a loop vs. vectorized.

In [None]:
def timeit(fun):
    '''Decorator that prints how long a call to `fun` takes'''
    def timer():
        start = time.perf_counter()  # higher-resolution clock than time.time()
        fun()
        end = time.perf_counter()
        print(f'Took {end - start:.3} secs to run.')
    return timer


@timeit
def sumLoop():
    '''Use for loop to sum a row vector'''
    # Build the array with numpy so setup cost does not dominate the timing
    longRow = np.arange(1, 1000000)
    theSum = 0
    for i in range(len(longRow)):
        theSum += longRow[i]


@timeit
def sumVectorized():
    '''Vectorized version of summing a row vector'''
    longRow = np.arange(1, 1000000)
    theSum = np.sum(longRow)

In [None]:
# Dynamic typing in python makes for loops with lots of small
# operations slow
print('sumLoop:')
sumLoop()

# Vectorization lets Numpy skip per-element type checks at runtime
# and use efficient pre-compiled functions to batch-process
# the computation over the matrix
print('sumVectorized:')
sumVectorized()

## Reshaping

**Problem:**
- You want to preserve the number of elements in an ndarray but "regroup" the elements

Example: Have a `64 x 64` image and want to make one big `64*64` 1D vector by "gluing the rows together":

e.g: (3, 3) -> (9,)

Turn:

    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

Into:

     [1, 2, 3, 4, 5, 6, 7, 8, 9]

How can we do this without hard coding?

**Key:** Total number of elements in ndarray doesn't change.
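
A sketch of the flattening described above using `reshape`. The names `img`, `flat`, and `back` are illustrative assumptions:

```python
import numpy as np

img = np.arange(1, 10).reshape(3, 3)  # the 3x3 example above
print(img.shape)  # (3, 3)

# Avoid hard coding: ask the array for its own element count
flat = img.reshape(img.size)
print(flat)  # [1 2 3 4 5 6 7 8 9]

# Shortcut: -1 tells numpy to infer that dimension from the element count
flat_inferred = img.reshape(-1)
print(flat_inferred.shape)  # (9,)

# Reshaping back works as long as the total number of elements matches
back = flat.reshape(3, 3)
print(back.shape)  # (3, 3)

# The 64 x 64 image case from the text
print(np.zeros([64, 64]).reshape(-1).shape)  # (4096,)
```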

## Combining multiple ndarrays

**Problem:**
- You have two ndarrays and want to concatenate them
- You have an ndarray and **want to append a column or row vector**

### Add/append a new column — "stack horizontally"

**Mnemonic**: Columns go horizontally.

Have `a`:

    [[1, 2]
     [3, 4]]
and `b`

    [[9]
     [9]]
    
want to make:

    [[1, 2, 9]
     [3, 4, 9]]
    
i.e. stack horizontally. Could be two matrices (not just a matrix and a vector).

**Caveat:** We need to make sure shapes are compatible for stacking (every dimension except the one being stacked along must match):

- Result shape = `(2, 3)`
- We are starting with `a` shape: `(2, 2)`

The shape of `b` needs to be `(2, 1)` (why wouldn't `(1, 2)` work?)

In [None]:
# create a with reshaping!
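
The horizontal stacking described above might be sketched with `np.hstack` as follows. The row-append with `np.vstack` at the end is an added illustration of the other case named in the problem statement:

```python
import numpy as np

# Create a with reshaping: [1, 2, 3, 4] regrouped into 2 rows, 2 columns
a = np.arange(1, 5).reshape(2, 2)

# b must be a column, shape (2, 1), so its row count matches a
b = np.array([9, 9]).reshape(2, 1)

# Stack horizontally: (2, 2) and (2, 1) give (2, 3)
ab = np.hstack([a, b])
print(ab)
# [[1 2 9]
#  [3 4 9]]

# Appending a ROW instead: stack vertically with a (1, 2) row vector
a_with_row = np.vstack([a, np.array([[9, 9]])])
print(a_with_row.shape)  # (3, 2)
```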

## Switching around the axes of an ndarray and matrix multiplication in numpy

We can't matrix multiply the following ndarrays due to shape issues:

In [None]:
a = 3*np.ones([3, 4])
b = 2*np.ones([3, 4])

Need to pair up as

    (3, 4) x (4, 3)

OR

    (4, 3) x (3, 4)
    
Use the transpose to help out!
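
A sketch of both pairings using the `.T` attribute and the `@` matrix-multiplication operator:

```python
import numpy as np

a = 3 * np.ones([3, 4])
b = 2 * np.ones([3, 4])

# (3, 4) @ (3, 4) fails: the inner dimensions 4 and 3 do not match
try:
    a @ b
except ValueError as err:
    print('Shape mismatch:', err)

# (3, 4) @ (4, 3) gives (3, 3); each entry sums 4 products of 3*2
ab_t = a @ b.T
print(ab_t.shape, ab_t[0, 0])  # (3, 3) 24.0

# (4, 3) @ (3, 4) gives (4, 4); each entry sums 3 products of 3*2
a_t_b = a.T @ b
print(a_t_b.shape, a_t_b[0, 0])  # (4, 4) 18.0
```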

Note: Transposing a 1D ndarray (a vector) has no effect if it doesn't have a singleton dimension

In [None]:
a = np.ones(10)
a.shape
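
A short sketch of why the singleton dimension matters for the transpose. The names `vec` and `row` are illustrative assumptions:

```python
import numpy as np

vec = np.ones(10)
print(vec.shape)    # (10,)
print(vec.T.shape)  # (10,)  transposing a 1D array changes nothing

# Add a singleton dimension first, then the transpose swaps the two axes
row = vec[np.newaxis, :]
print(row.shape)    # (1, 10)
print(row.T.shape)  # (10, 1)
```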

### Matrix vs. element-wise multiplication

- Star (*) operator means element-wise multiplication
- Like other basic math operators, can use broadcasting (e.g. can multiply a (3,) and a (5, 3) array)
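
The difference between the two kinds of multiplication can be sketched as below, assuming small example arrays:

```python
import numpy as np

sq = np.array([[1, 2], [3, 4]])

# Star: element-wise, each entry multiplied by the matching entry
print(sq * sq)
# [[ 1  4]
#  [ 9 16]]

# At sign: matrix multiplication (rows times columns)
print(sq @ sq)
# [[ 7 10]
#  [15 22]]

# Star with broadcasting: (3,) times (5, 3) scales each column
v = np.array([1, 2, 3])
m = np.ones([5, 3])
scaled = v * m
print(scaled.shape)  # (5, 3)

# Matrix-vector product: (5, 3) @ (3,) gives (5,)
product = m @ v
print(product)  # [6. 6. 6. 6. 6.]
```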