## Array Operations with Numpy

**Nick Kern**
<br>
**Astro 9: Python Programming in Astronomy**
<br>
**UC Berkeley**

---

<img src="imgs/numpy.jpg" width=300px/>

Now that we've covered the basic functionalities of "pure Python" we will move on to explore the Python packages that really make Python a powerful tool for scientific analyses. Two such packages are the Numerical Python (NumPy) package and the Scientific Computing Tools for Python (SciPy) package: if these packages didn't exist, it is unlikely that Python would be used within the scientific community to the extent it is today.

In brief, Numpy is a package that provides an "array" data structure, and further provides modules that enable us to perform all kinds of mathematical operations (e.g., linear algebra) on those arrays. You can kind of think of a numpy array (called an `ndarray`) as a "multidimensional container of items of the same type and size." Arrays can be sliced and indexed like lists, but can also be rotated, matrix multiplied, "fancy indexed," transposed, etc. There are lots of things we can do with arrays.

1. [Basic Array Manipulation](#Basic-Array-Manipulation)
2. [Matrix Algebra with Numpy](#Matrix-Algebra-with-Numpy)
3. [Speed Considerations](#Speed-Considerations)

### Basic Array Manipulation
---

In [2]:
# load numpy like this
import numpy as np

**Data Types**

One prominent difference between Numpy arrays and Python lists is the fact that all elements within a Numpy array need to have the same data type. 

In [None]:
# define our first array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(arr.dtype)

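Numpy arrays, like lists, are mutable. However, because every element must share one data type, numpy converts ("upcasts") the elements to a common type when the array is created: mixing integers and a float, for example, gives an array of floats. This conversion can happen implicitly, or you can request a data type explicitly.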

In [25]:
# numpy implicitly knows to convert the integers to a float
arr = np.array([1.23, 4, 5, 6])
print(arr)
print(arr.dtype)

[ 1.23  4.    5.    6.  ]
float64


In [26]:
# explicitly setting the data type
# (the np.complex alias was removed in NumPy 1.24; the built-in complex works everywhere)
arr = np.array([1.23, 4, 5, 6], dtype=complex)
print(arr)
print(arr.dtype)

[ 1.23+0.j  4.00+0.j  5.00+0.j  6.00+0.j]
complex128


In [None]:
# mutability
arr = np.array([0, 0.5, 1, 1.5, 2])
print(arr, hex(id(arr)))
arr[0] = -0.5
print(arr, hex(id(arr)))
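
One subtlety worth a quick sketch: an array's data type is fixed when it is created, so assigning a float into an existing integer array casts the value to the array's type rather than upcasting the whole array. The variable names below are only for illustration.

```python
import numpy as np

# an integer array keeps its dtype when a float is assigned into it
int_arr = np.array([1, 2, 3, 4, 5])
int_arr[0] = 9.87          # the fractional part is dropped, not rounded
print(int_arr)             # first element is now 9
print(int_arr.dtype)       # still an integer dtype

# to actually hold floats, make a converted copy with .astype
float_arr = int_arr.astype(float)
float_arr[0] = 9.87
print(float_arr)
```

If you want the array itself to change type, you need a new array (for example from `.astype`), since the original cannot change its dtype in place.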

In [148]:
# putting the N in ndarray
arr = np.array([ [1,2,3], [4,5,6], [7,8,9]])
print(arr)
print(arr.ndim)
print(arr.shape)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
2
(3, 3)


Numpy arrays are objects, and as such have attributes and methods. Along with the attributes `.ndim` and `.shape`, useful methods include `.astype` to convert to a different data type, `.max` and `.min` for maximum and minimum values, `.ravel` for "unraveling" a multi-dimensional array into a 1D array, and `.resize` and `.reshape` for reshaping an array.
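
Here is a short sketch of the attributes and methods just listed, on a small made-up 2x3 array. Attributes are read without parentheses, while methods are called.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# attributes are accessed without parentheses
print(arr.ndim, arr.shape, arr.dtype)

# these methods return a result and leave arr's shape unchanged
# (unlike .resize, which works in place)
as_float = arr.astype(float)   # converted copy
flat = arr.ravel()             # 1D version of the data
tall = arr.reshape(3, 2)       # same 6 elements, new shape

print(arr.max(), arr.min())    # 6 1
print(arr.max(axis=0))         # column-wise maxima: [4 5 6]
print(flat)                    # [1 2 3 4 5 6]
print(tall.shape)              # (3, 2)
```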

**Easily Creating Arrays**

If we want to create an array, we don't need to fill it with elements by hand. Like the `range()` function for lists, the numpy array has a *range* of functions for building arrays easily.

In [111]:
# np.arange similar to range capability
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [127]:
# np.linspace for specifying the number of elements, rather than the interval (it is inclusive on the upper bound!)
np.linspace(0, 20, 41)

array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ])

In [135]:
# array of ones
np.ones(10)

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [136]:
# array of zeros
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [150]:
# 2D array of ones
np.ones( (2, 5) )

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [152]:
# array of random numbers from [0, 1)
np.random.random( 5 )

array([ 0.89943441,  0.16338967,  0.79964314,  0.5034744 ,  0.20744789])
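
The random numbers above change on every run. If you need a reproducible sequence, one option (an addition here, not something the cell above uses) is NumPy's `Generator` interface, created with `np.random.default_rng` in NumPy 1.17 and later. The seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator; 42 is an arbitrary choice
samples = rng.random(5)           # 5 floats drawn uniformly from [0, 1)
print(samples)

# re-creating the generator with the same seed reproduces the same numbers
repeat = np.random.default_rng(42).random(5)
print(np.array_equal(samples, repeat))   # True
```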

**Array Slicing and Fancy Indexing**

Like lists, numpy arrays can be sliced; unlike lists, the numpy `ndarray` also supports multi-dimensional slicing. Remember that when slicing, the *start is inclusive*, while the *end is exclusive*. Note that the **zeroth axis** of a 2D array is along the vertical, and the **first axis** is along the horizontal.

A quick way to generate an array is the `np.arange` function, similar to the built-in `range` function.

In [None]:
# Create a 4x4 array
arr = np.arange(16)
arr.resize(4,4)
print(arr)

In [None]:
# take row 2: index 2 on the 0th axis, everything along the 1st axis
print(arr[2, :])

In [None]:
# take column 2: everything along the 0th axis, index 2 on the 1st axis
print(arr[:, 2])

In [None]:
# take a subsection
print(arr[1:3, 1:3])

How might we slice an array to get these results?

<img src="imgs/2D_slicing.jpg" width=300px/>

In [3]:
# create array
arr = np.arange(11,36)
arr.resize(5,5)

# Green
print(arr[::2, ::2])
print("-"*15)

# Blue
print(arr[1:4, 0])
print("-"*15)

# Red
print(arr[0, 1:4])
print("-"*15)

# Purple
print(arr[:, 1])
print("-"*15)

[[11 13 15]
 [21 23 25]
 [31 33 35]]
---------------
[16 21 26]
---------------
[12 13 14]
---------------
[12 17 22 27 32]
---------------


Fancy indexing is like slicing, but allows us to control exactly which elements we want to pull out of the array: we aren't limited to a row (red), column (blue), or elements that are evenly spaced apart from each other (green). Fancy indexing essentially creates a new `ndarray` composed of any combination of the elements in the initial array. Let's say we wanted the first, second and fourth rows of our previous 2D array. We index this the same way as before, but now we feed a list instead of an integer.

In [None]:
# Take rows
print(arr[ [0, 1, 3], : ])

#print(arr[ [0, 1, 3, 3, 3, 2, 1, 0], :])

In [None]:
# pick out just 16, 24 and 32
print( arr[ [1, 2, 4], [0, 3, 1] ] )
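
Note that passing two index lists, as in the last cell, pairs them up element by element: it picks positions `(1, 0)`, `(2, 3)` and `(4, 1)`, not the full block where those rows and columns cross. If the block is what you want, `np.ix_` is one way to build it. A small sketch, using the same 5x5 array as above:

```python
import numpy as np

arr = np.arange(11, 36).reshape(5, 5)
rows = [1, 2, 4]
cols = [0, 3, 1]

# paired indexing: one element per (row, col) pair
paired = arr[rows, cols]
print(paired)        # [16 24 32]

# np.ix_ builds an open mesh, selecting every row/column combination
block = arr[np.ix_(rows, cols)]
print(block)
# [[16 19 17]
#  [21 24 22]
#  [31 34 32]]
```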

Recall that when we performed straight assignment of lists we were copying pointers, and when we sliced lists we were creating new copies of the object. This allowed us to get around the pitfalls of mutable objects and accidental element assignment. However, for an `ndarray`, a slice creates a *view* of the original array, which is an object that has a different memory address but still *shares data* with the original array. This means that, similar to our mutable list example, we can get unexpected behavior if we are loose with how we assign our arrays and data.

In [None]:
# data manipulation example
arr1 = np.arange(16).reshape(4,4)
arr2 = arr1[1:3, 1:3]
print(arr1)

In [None]:
# assign to arr2
arr2[:, :] = np.arange(100,104).reshape(2, 2)
print(arr1)

In [None]:
# create a copy by performing arithmetic, or by using .copy()
arr1 = np.arange(16).reshape(4,4)
arr2 = arr1[1:3, 1:3].copy()
arr2[:, :] = np.arange(100, 104).reshape(2,2)
print(arr1)
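
If you are unsure whether two arrays share data, you can ask NumPy directly. The sketch below uses `np.shares_memory` to compare a slice, an explicit copy, and a fancy-indexed selection; the variable names are just for illustration.

```python
import numpy as np

arr1 = np.arange(16).reshape(4, 4)

view = arr1[1:3, 1:3]            # basic slice: a view
copy = arr1[1:3, 1:3].copy()     # explicit copy
fancy = arr1[[1, 2], :]          # fancy indexing: always a copy

print(np.shares_memory(arr1, view))    # True
print(np.shares_memory(arr1, copy))    # False
print(np.shares_memory(arr1, fancy))   # False

# writing through the view changes the original; writing to the copy does not
view[0, 0] = -1
copy[0, 1] = -2
print(arr1[1, 1], arr1[1, 2])          # -1 6
```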

As well as having the same data type throughout the array, the standard numpy `ndarray` must also be rectangular. A 1D array fits this bill automatically, and the 2D arrays we created above also satisfy this requirement. This is unlike lists, which can have nested lists of variable lengths. If we build a non-rectangular array, numpy falls back on the `object` data type, such that the array's functionality mimics a normal built-in `list` structure and can hold data of various types. (Older versions of numpy did this automatically; since NumPy 1.24 a ragged input raises a `ValueError` unless you pass `dtype=object` yourself.) In doing so, however, we force the array to have a single dimension and lose the multi-dimensional slicing and fancy indexing properties of standard `ndarrays`.

In [None]:
# non-rectangular array (recent numpy versions require dtype=object here)
arr = np.array( [ [1,2,3,4], ['5',6.0], [(7 + 1j),8,9] ], dtype=object)
print(arr)
print(arr.dtype)
print(arr.shape)
print(arr.ndim)

In [None]:
# try to multi-dim slice (raises an IndexError, since arr is 1D)
arr[:, :1]

**Fancy Indexing with Boolean Arrays and the `np.where` function**

Comparison operators on an `ndarray` will yield another `ndarray` with a boolean data type. The boolean array is the same shape as the original array, and is `True` where the condition is met and `False` where it isn't. We can then fancy index with these arrays to eliminate elements that don't conform. This is a fast and easy way to pick through an array for desired elements. Take the following example.

In [15]:
# Create an array
arr = np.arange(25).reshape(5, 5)
print(arr)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]


In [16]:
# Create mask
mask = arr > 10
print(mask)

[[False False False False False]
 [False False False False False]
 [False  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]]


In [17]:
# select out elements
arr[mask]

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])
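
Masks can also be combined and used for assignment. Because Python's `and`/`or` do not work element-wise on arrays, the operators `&` (and), `|` (or) and `~` (not) are used instead, with each comparison wrapped in parentheses. A short sketch on the same 5x5 array:

```python
import numpy as np

arr = np.arange(25).reshape(5, 5)

# elements greater than 10 AND even
both = (arr > 10) & (arr % 2 == 0)
print(arr[both])      # [12 14 16 18 20 22 24]

# elements below 3 OR above 21
either = (arr < 3) | (arr > 21)
print(arr[either])    # [ 0  1  2 22 23 24]

# a mask on the left-hand side assigns in place and keeps the 2D shape
clipped = arr.copy()
clipped[clipped > 10] = 0
print(clipped)
```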

Why do the results come back in an "unraveled" form? For the reason given above: an `ndarray` must be rectangular, and we have explicitly broken that by dropping the first eleven elements of our original 2D array. The only remaining geometry that still satisfies "rectangularity" is a 1D array. What if we wanted to maintain the 2D structure and, instead of deleting the `False` elements, replace them with some other number, say zero? This is where the `np.where` function comes in.

The `np.where` function can be used in two different ways. The first accomplishes the same thing as the boolean mask: you feed `np.where` a condition, and it returns a tuple of index arrays (one per axis), which you can then feed back into the array to fancy index only the elements you wanted. The difference is that instead of a boolean array with the same shape as your original, you get arrays of indices, which perform fancy indexing the same way our index lists did earlier. It looks something like this:

In [18]:
# fancy indexing with np.where
fancy = np.where(arr > 10)
print(fancy)
print("")
print(arr[fancy])

(array([2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]), array([1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))

[11 12 13 14 15 16 17 18 19 20 21 22 23 24]


What's nice about this way of fancy indexing is that you also get the indices of the desired elements from the original array, which may be useful or necessary depending on what you are doing. The other way of using `np.where` is to perform a true masking of the original array, keeping its structure intact and assigning the `False` elements some pre-defined value. It looks something like this:

In [19]:
# true masking
np.where(arr < 10, arr, np.nan)

array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])

Here the syntax is
```
np.where(<condition>, if True value, if False value)
```

You can see that the original array is neatly kept intact and we can clearly see which elements are undesired. We introduced a new type in doing so, the numpy NaN, or Not a Number. 

**The Numpy NaN**

NaNs can be both helpful and extremely annoying to deal with at the same time. What's nice about them is that they are definitely not a number, meaning that if you are uncomfortable about masking with a 0, `np.inf`, or some other number, you can mask with a `np.nan` and your array will stay numeric. (NaN is a floating-point value, so an integer array is converted to floats, as happened in the example above.) You couldn't, for example, mask with a `None` type and retain a numeric data type (try it!). However, as we will see when we get to array arithmetic, performing arithmetic over arrays that have even a single `np.nan` can mess up your whole code. For example:

In [99]:
# sum the elements of this array!
arr = np.arange(10.0)
print(np.sum(arr))

45.0


In [100]:
# sum when one element is a nan
arr[0] = np.nan

print(arr)

print("")

print(np.sum(arr))

[ nan   1.   2.   3.   4.   5.   6.   7.   8.   9.]

nan


What's doubly annoying is that NaNs don't obey comparison operators: by the IEEE floating-point rules, a NaN is not equal to anything, including itself. This means that if you have a large array with one or two `np.nan`s in it, you can't search for them with a `np.where(arr == np.nan)` statement. However, the good people at NumPy anticipated this and provide us with the `np.isnan` function, which tests each element for NaN. Instead of returning indices, as the `np.where` statement returns, it returns a boolean array.

In [9]:
# make a np.nan mask
mask = np.isnan(arr)
print(mask)

[ True False False False False False False False False False]


In [10]:
# perform sum w/ np.nan masked!
np.sum( arr[~mask] )

45.0

Notice we reversed the booleans (with the `~` operator) in order to turn it into a proper mask before summing.

It turns out there are also `np.nan`-safe functions within Numpy that can also do this without us having to think too hard, like `np.nansum`.
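
A quick illustration, using the same array as above: a NaN never compares equal to anything, so `==` finds nothing, while `np.isnan` and the NaN-aware reductions behave as hoped. `np.nansum`, `np.nanmean` and `np.nanmax` are standard NumPy functions.

```python
import numpy as np

arr = np.arange(10.0)
arr[0] = np.nan

print(np.nan == np.nan)          # False
print((arr == np.nan).any())     # False: == cannot find the NaN
print(np.isnan(arr).any())       # True

# NaN-aware reductions skip the NaN entries
print(np.nansum(arr))            # 45.0
print(np.nanmean(arr))           # 5.0 (mean of 1..9)
print(np.nanmax(arr))            # 9.0
```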

In addition, there is an entire sub-module within numpy, called `ma` for Masked Arrays, that automates this and makes handling masked numpy `ndarrays` much easier. Check it out sometime! We may come back to it later in the course.
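
As a small taste of the `ma` sub-module, here is a minimal sketch that masks the NaN in the array from above with `np.ma.masked_invalid`. It only scratches the surface of what masked arrays can do.

```python
import numpy as np

arr = np.arange(10.0)
arr[0] = np.nan

# mask every NaN (and inf) entry; the data underneath is kept
masked = np.ma.masked_invalid(arr)
print(masked)            # the NaN shows up as --
print(masked.sum())      # 45.0
print(masked.count())    # 9 unmasked elements
```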

**Array Concatenating and Splitting**

Let's now look at ways in which we can combine (or concatenate) arrays and split them up. Recall that with lists and tuples, we did this with a simple addition operator `+`. With numpy arrays, as we will see, we will want to reserve that for actual arithmetic. There are multiple ways to concatenate arrays, depending on how you'd like to combine them.

1.) The most straightforward way is to just create a new `np.array` object wrapped around the constituent arrays.

In [63]:
# this will stack along the zeroth axis
arr1 = np.arange(10)
arr2 = np.arange(10, 20)
arr3 = np.arange(20,30)
big_arr = np.array([arr1, arr2, arr3])
print(big_arr)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]]


In [64]:
# what is the shape of big_arr?
big_arr.shape

(3, 10)

2.) We can also use the `np.vstack` function, which stacks along the zeroth axis, similar to before.

In [56]:
np.vstack([arr1, arr2, arr3])

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])

3.) Use `np.hstack`, which stacks horizontally: 1D arrays are joined end to end (similar to `list.extend`), while 2D arrays are joined along the first axis.

In [65]:
np.hstack([arr1, arr2, arr3])

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

4.) Use `np.dstack`, which stacks along the second axis

In [97]:
np.dstack([arr1, arr2])

array([[[ 0, 10],
        [ 1, 11],
        [ 2, 12],
        [ 3, 13],
        [ 4, 14],
        [ 5, 15],
        [ 6, 16],
        [ 7, 17],
        [ 8, 18],
        [ 9, 19]]])

The `np.append` function works similarly to the above functions.
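
A more general function is `np.concatenate`, which joins arrays along an existing axis that you choose. The sketch below compares it with `np.append` on two small made-up arrays; note that `np.append` without an `axis` argument flattens its inputs first.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

rows = np.concatenate([a, b], axis=0)   # same result as np.vstack
cols = np.concatenate([a, b], axis=1)   # same result as np.hstack
print(rows.shape, cols.shape)           # (4, 3) (2, 6)

# np.append returns a new array; with no axis it flattens both inputs
flat = np.append(a, b)
print(flat.shape)                       # (12,)
stacked = np.append(a, b, axis=0)
print(stacked.shape)                    # (4, 3)
```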

We can also split arrays, and there are the `np.split`, `np.vsplit`, `np.hsplit` and `np.dsplit` functions to do just this. Let's look at the `np.split` function as an example.

In [87]:
# create a 4x4 array
arr = np.arange(16).reshape(4,4)
print(arr)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [97]:
# split it!
np.split(arr, 2, axis=1)

[array([[ 0,  1],
        [ 4,  5],
        [ 8,  9],
        [12, 13]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11],
        [14, 15]])]

**Array Arithmetic and Broadcasting**

One of the great things about arrays is our ability to perform math on them, element-by-element. This occurs exactly as you might suspect: a scalar times an array is just another array with its elements multiplied by that scalar. Let's briefly review. Note that in the case of array-array arithmetic, the arrays need to have the same size! Otherwise, Numpy doesn't know how to match up the excess indices.

In [14]:
# array arithmetic
arr1 = np.arange(10)
arr2 = np.arange(10,20)
print(arr1)
print(arr2)

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]


In [15]:
# arr * arr multiplication
arr1 * arr2

array([  0,  11,  24,  39,  56,  75,  96, 119, 144, 171])

In [7]:
# arr + arr addition
arr1 + arr2

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

In [18]:
# arr / scalar division
arr2 / 2.0

array([ 5. ,  5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

There are sometimes exceptions to the rule I stated above, that for array arithmetic to work the arrays need to have the same shape. The exceptions are when Numpy can identify what it thinks you are trying to do: it performs the array operations efficiently without making extra copies of data, and extends the resultant array to match. One example of this is what we just did: scalar-array arithmetic! In this case, numpy knows that what we mean by `arr2 / 2.0` is actually `arr2 / array([2.0, 2.0, ..., 2.0])`. Here, numpy "broadcasts" the 2.0 into an array and performs the operations element-by-element as we would expect. Broadcasting can also work between arrays of different shapes, as illustrated by the figure below.

<img src="imgs/broadcast.png" width=600px/>

Do we expect the following broadcasting operations to work?

In [144]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( (2, 1) )

arr1 + arr2

array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])

In [145]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( 2 )

arr1 + arr2

ValueError: operands could not be broadcast together with shapes (2,3) (2,) 

In [147]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( 3 )

arr1 + arr2

array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
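
The rule behind these three results: NumPy lines up the shapes starting from the last axis, and two axes are compatible if they are equal or if one of them is 1. Shapes `(2, 3)` and `(2,)` fail because the trailing axes are 3 and 2. One way to make that case work, sketched below, is to add an explicit length-1 axis with `np.newaxis` so the shapes become `(2, 3)` and `(2, 1)`. The values 10 and 20 are arbitrary.

```python
import numpy as np

arr1 = np.ones((2, 3))
arr2 = np.array([10.0, 20.0])        # shape (2,)

# trailing axes are 3 and 2, so this fails
try:
    arr1 + arr2
except ValueError as err:
    print("broadcast failed:", err)

# add a length-1 trailing axis: (2,) -> (2, 1), which broadcasts against (2, 3)
col = arr2[:, np.newaxis]
result = arr1 + col
print(col.shape)      # (2, 1)
print(result)
# [[11. 11. 11.]
#  [21. 21. 21.]]
```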

### Breakout
---



### Matrix Algebra with Numpy
---

### Breakout
---

### Speed Considerations
---


In [None]:
# generating a blank million-element array

%timeit np.empty(1000000)

%timeit np.zeros(1000000)

In [None]:
# view / copy difference
N = 1000000
l1 = list(range(N))
a1 = np.arange(N)

%timeit l2 = l1[:]
%timeit a2 = a1[:]

While traditional slicing of an `ndarray` creates a view, fancy indexing creates a copy, meaning it is significantly slower when operating over large arrays.

In [None]:
# slicing / fancy indexing difference
arr = np.arange(1000000)

# slicing
%timeit arr[1000:100000]

# fancy indexing w/ a list
%timeit arr[range(1000, 100000)]

In [24]:
# ufuncs for numpy arrays: array arithmetic is much faster than a simple FOR loop
def reciprocal(values):
    # initialize empty array
    recips = np.empty( len(values) )
    # use enumerate for a fancy FOR loop
    for i, val in enumerate(values):
        recips[i] = 1.0 / val
    # return output
    return recips

arr = np.arange(1., 10000.)
%timeit reciprocal( arr )
%timeit 1 / arr



100 loops, best of 3: 3.91 ms per loop
10000 loops, best of 3: 48.9 µs per loop
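
`%timeit` is an IPython magic and only works inside a notebook or IPython shell. In a plain Python script, the standard-library `timeit` module can make the same comparison. This is a sketch: the function name `reciprocal_loop` and the repeat count of 100 are arbitrary choices, and the absolute timings will vary by machine.

```python
import timeit
import numpy as np

def reciprocal_loop(values):
    # element-by-element loop in pure Python
    recips = np.empty(len(values))
    for i, val in enumerate(values):
        recips[i] = 1.0 / val
    return recips

arr = np.arange(1.0, 10000.0)

# both approaches give the same answer
same = np.allclose(reciprocal_loop(arr), 1.0 / arr)
print(same)

# time 100 calls of each
t_loop = timeit.timeit(lambda: reciprocal_loop(arr), number=100)
t_vec = timeit.timeit(lambda: 1.0 / arr, number=100)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```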


In [61]:
# outer product via broadcasting, timed on a small 10-element array
def outer_bc(a):
    return a.reshape(-1, 1) * a.reshape(1, -1)

a = np.arange(10)
%timeit outer_bc(a)

The slowest run took 4.78 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.93 µs per loop


In [70]:
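# the column layout that broadcasting implies, built explicitly with np.tile
a = np.arange(10)
np.tile(a, (a.size, 1)).T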

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
       [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])

In [72]:
# broadcasting speed for outer product
def outer_bc(a):
    return a.reshape(-1, 1) * a.reshape(1, -1)
    
def outer_tile(a):
    return np.tile(a, (N, 1)) * np.tile(a, (N, 1)).T
    
N = 100
arr = np.arange(N)

%timeit outer_bc(arr)
%timeit outer_tile(arr)

The slowest run took 8.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 44.2 µs per loop
10000 loops, best of 3: 89.2 µs per loop


### Breakout
---

