## Array Operations with Numpy

**Nick Kern**
<br>
**Astro 9: Python Programming in Astronomy**
<br>
**UC Berkeley**

---

<img src="imgs/numpy.jpg" width=300px/>

Now that we've covered the basic functionality of "pure Python," we will move on to explore the Python packages that really make Python a powerful tool for scientific analyses. One such package is the Numerical Python (NumPy) package: if this package didn't exist, it is unlikely that Python would be used within the scientific community to the extent it is today.

In brief, Numpy is a package that provides an "array" data structure, along with modules that let us perform all kinds of mathematical operations (e.g., linear algebra) on those arrays. You can think of a numpy array (called an `ndarray`) as a "multidimensional container of items of the same data type and size." Arrays can be sliced and indexed like lists, but can also be rotated, matrix multiplied, "fancy indexed," transposed, etc. What's more, array operations with numpy are generally considerably faster than using a `for` loop and lists. There are a few reasons for this: one is that numpy's core routines are implemented in compiled C code, and another is its efficient storage and sharing of data within the array object.

In [1]:
# load numpy like this
import numpy as np

### The `numpy ndarray`

Numpy allows us to work with N-dimensional arrays, or `ndarrays`. One prominent difference between Numpy arrays and Python lists is the fact that all elements within a Numpy array **need to have the same data type**. 

In [None]:
# define our first array
arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(arr.dtype)

If you assign an element to have a different data type, all other elements must conform. You can do this implicitly, or explicitly.

In [None]:
# numpy implicitly knows to convert the integers to a float
arr = np.array([1.23, 4, 5, 6])

print(arr)

print(arr.dtype)

In [None]:
# explicitly setting the data type (the old `np.complex` alias was removed in NumPy 1.24; use the built-in `complex`)
arr = np.array([1.23, 4, 5, 6], dtype=complex)

print(arr)

print(arr.dtype)

Numpy arrays, like lists, **are mutable**.

In [None]:
# arrays, like lists, are mutable
arr = np.array([0, 0.5, 1, 1.5, 2])

print(arr, hex(id(arr)))

arr[0] = -0.5

print(arr, hex(id(arr)))

In [None]:
# putting the N in ndarray
arr = np.array([ [1,2,3], [4,5,6], [7,8,9]])

print(arr)

In [None]:
# arrays are objects, and thus have attributes and methods!
print(arr.ndim)

print(arr.shape)

Numpy arrays are objects, and as such have attributes and methods. Along with the attributes `.ndim` and `.shape`, useful methods include `.astype` to convert to a different data type, `.max` and `.min` for the maximum and minimum values, `.ravel` for "unraveling" a multi-dimensional array into a 1D array, and `.resize` and `.reshape` for reshaping an array.
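As a quick illustration of the methods named above, the short sketch below applies a few of them to a small 2D array. The array `demo` and the other variable names are made up for this example.

```python
import numpy as np

# a small 2x3 integer array (hypothetical example data)
demo = np.array([[1, 2, 3], [4, 5, 6]])

as_float = demo.astype(float)   # new array with float64 elements
largest = demo.max()            # largest element
smallest = demo.min()           # smallest element
flat = demo.ravel()             # flattened to 1D
tall = demo.reshape(3, 2)       # same data, rearranged into shape (3, 2)

print(as_float.dtype, largest, smallest)
print(flat)
print(tall.shape)
```

Note that `.reshape` returns a reshaped array and leaves the shape of `demo` alone, whereas `.resize` changes the array in place.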

### Easily Creating Arrays

If we want to create an array, we don't need to fill it with elements by hand. Like the `range()` function for lists, the numpy array has a *range* of functions for building arrays easily.

In [None]:
# np.arange similar to range capability, but returns an array, not an iterator
np.arange(0, 22, 3)

In [None]:
# np.linspace for specifying number of elements, rather than step-size with arange (it has inclusive upper bound!)
np.linspace(0, 20, 40)

In [None]:
# array of ones
np.ones(3)

In [None]:
# array of zeros
np.zeros(3)

In [None]:
# 2D array of ones
np.ones( (2, 5) )

In [None]:
# 2d array from 0 - 15
np.arange(0, 16).reshape(2, 8)

In [None]:
# array of random numbers from [0, 1)
np.random.random( 5 )

### Array Slicing and Fancy Indexing

[draw a picture]

Like lists, numpy arrays can be sliced, but with the numpy `ndarray` we can also perform multi-dimensional slicing. Remember that when slicing, the *start is inclusive*, while the *end is exclusive*. Note that the **zeroth axis** of a 2D array runs along the vertical (down the rows), and the **first axis** runs along the horizontal (across the columns).

A quick way to generate an array is the `np.arange` function, similar to the built-in `range` function.

In [None]:
# Create a 4x4 array
arr = np.arange(16)

arr.resize(4,4)

print(arr)

In [None]:
# slice along 0th axis
print(arr[2, :])

In [None]:
# slice along 1st axis
print(arr[:, 2])

In [None]:
# take a subsection
print(arr[1:3, 1:3])

### Breakout 

How might we slice an array to get these results?

<img src="imgs/2D_slicing.jpg" width=300px/>

In [None]:
# breakout
arr = np.arange(11, 36).reshape(5,5)

In [None]:
arr[ 1:4:1, 0]

In [None]:
arr[0, 1:4]

In [None]:
arr[:, 1]

In [None]:
arr[::2, ::2]

**Fancy indexing** is like slicing, but allows us to control exactly which elements we want to pull out of the array: we aren't limited to a row (red), column (blue), or elements that are evenly spaced apart from each other (green). Fancy indexing essentially creates a new `ndarray` composed of any combination of the elements in the initial array. Let's say we wanted the first, second and fourth rows of our previous 2d array. We index this the same way as before, but now we **feed a list** instead of an integer.

In [None]:
# Take rows
print(arr[ [0, 1, 3], : ])

In [None]:
arr[range(5), range(5)]

In [None]:
# Repeatedly take rows
print(arr[ [0, 1, 3, 3, 3, 2, 1, 0], :])

In [None]:
# pick out just 16, 24 and 32
print( arr[ [1, 2, 4], [0, 3, 1] ] )

Recall that when we performed straight assignment of **lists** we were **copying references**, and when we sliced lists we were creating **new copies** of the object. This allowed us to get around the pitfalls of mutable objects and accidental element assignment. However, for an `ndarray`, a slice creates a **view** of the original array, which is an object that has a different memory address but still **shares data** with the original array. This means that, similar to our mutable list example, we can get unexpected behavior if we are loose with how we assign our arrays and data.

In [None]:
# data manipulation example
arr1 = np.arange(16).reshape(4,4)

arr2 = arr1[1:3, 1:3]

print(arr2)

In [None]:
# assign to arr2
arr2[:, :] = np.arange(100,104).reshape(2, 2)

print(arr1)

In [None]:
# create a copy by performing arithmetic, or by using .copy()
arr1 = np.arange(16).reshape(4,4)

arr2 = arr1[1:3, 1:3] * 1

arr2[:, :] = np.arange(100, 104).reshape(2,2)

print(arr1)
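The comment in the cell above mentions `.copy()` as the explicit alternative to the `* 1` trick. A minimal sketch of that approach follows, using `np.shares_memory` to check whether two arrays still share data. The names `base`, `view` and `independent` are just for this illustration.

```python
import numpy as np

base = np.arange(16).reshape(4, 4)

view = base[1:3, 1:3]                 # a slice: shares data with base
independent = base[1:3, 1:3].copy()   # an explicit copy: owns its own data

print(np.shares_memory(base, view))         # True
print(np.shares_memory(base, independent))  # False

# writing into the copy leaves the original untouched
independent[:, :] = -1
print(base[1:3, 1:3])
```

Using `.copy()` states the intent more plainly than multiplying by one, which is why it is usually the easier version to read later.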

### Fancy Indexing with Boolean Arrays and the `np.where` function

Comparison operators on an `ndarray` will yield another `ndarray` with a boolean data type. The boolean array is the same shape as the original array, and is `True` where the condition is met and `False` where it isn't. We can then fancy index with these arrays to eliminate elements that don't conform. This is a fast and easy way to pick through an array for desired elements. Take the following example.

In [None]:
# Create an array
arr = np.arange(25).reshape(5, 5)
print(arr)

In [None]:
# Create mask
mask = arr > 10
print(mask)

In [None]:
# select out elements with fancy indexing
arr[mask]

The **`np.where`** function can be used in two different ways. The first accomplishes the same thing as the boolean mask: you feed it a condition, and it returns a tuple of index arrays (one per dimension) which you can then feed back into the array to fancy index only the elements you wanted. The difference is that instead of returning a boolean array of the same shape as your original, it returns the indices themselves, which perform fancy indexing the same way our index lists did earlier. It looks something like this:

In [None]:
# fancy indexing with np.where
fancy = np.where(arr > 10)

fancy

In [None]:
arr[fancy]

Note that you can put more than one conditional in the `np.where` function.

In [None]:
fancy = np.where( (arr > 10) & (arr < 15) )
print(arr[fancy])

What's nice about this way of fancy indexing is that you also get the indices of the desired elements from the original array, which may be useful or necessary depending on what you are doing.

The other way of using `np.where` is to perform a true masking of the original array, while keeping its structure intact and assigning the `False` elements some pre-defined value. It looks something like this:

In [None]:
# true masking
np.where(arr > 10, arr, -10000)

Here the syntax is
```
np.where(<condition>, value if True, value if False)
```

You can see that the original array's shape is kept intact and we can clearly see which elements are undesired. Here we filled the rejected elements with the placeholder value `-10000`; another common choice of fill value is the numpy NaN, or Not a Number.
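The same three-argument form of `np.where` also accepts `np.nan` as the fill value. One detail worth checking for yourself is that `np.nan` is a floating-point value, so the result is a float array even when the input holds integers. A small sketch with a made-up array `ints`:

```python
import numpy as np

ints = np.arange(6)                         # integer array 0..5
masked = np.where(ints > 2, ints, np.nan)   # replace elements <= 2 with NaN

print(ints.dtype)    # an integer type
print(masked.dtype)  # float64, because NaN is a float
print(masked)
```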

### The Numpy NaN

NaNs can be both helpful and extremely annoying to deal with at the same time. What's nice about them is that they are definitely not a number, meaning that if you are uncomfortable about masking with a 0, `np.inf`, or some other number, you can mask with an `np.nan` and your array will keep a numeric (floating-point) data type. Note that `np.nan` is itself a float, so an integer array is converted to float when you do this. You couldn't, for example, mask with a `None` type and retain a numeric data type (try it!). However, as we will see when we get to array arithmetic, performing arithmetic over arrays that have even a single `np.nan` can mess up your whole code. For example:

In [None]:
# sum the elements of this array!
arr = np.arange(10.0)
print(np.sum(arr))

In [None]:
# sum when one element is a nan
arr[0] = np.nan

print(arr)

print("")

print(np.sum(arr))

What's doubly annoying is that they don't obey comparison operators: under the IEEE 754 floating-point standard, a NaN compares as not equal to everything, including itself, so `np.nan == np.nan` is `False`. This means that if you have a large array with one or two `np.nan`s in it, you can't search for them with a `np.where(arr == np.nan)` statement. However, the good people at NumPy anticipated this and provide us with the `np.isnan` function, which tests each element for `np.nan`. Instead of returning indices, as the `np.where` statement returns, it returns a boolean array.

In [None]:
np.isnan(arr)

In [None]:
# make a np.nan mask
mask = np.isnan(arr)
print(mask)

In [None]:
~mask

In [None]:
# perform sum w/ np.nan masked!
np.sum( arr[~mask] )

Notice we reversed the booleans in order to turn it into a proper mask before summing.

It turns out there are also `np.nan`-safe functions within Numpy that can also do this without us having to think too hard, like `np.nansum`.
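A short sketch of the NaN-safe functions, rebuilding the same ten-element array used above under the hypothetical name `vals`:

```python
import numpy as np

vals = np.arange(10.0)
vals[0] = np.nan

print(np.sum(vals))      # nan
print(np.nansum(vals))   # 45.0, the NaN is treated as zero
print(np.nanmean(vals))  # 5.0, the mean of the nine non-NaN values
```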

In addition, there is an entire sub-module within numpy, called `ma` for Masked Arrays, that automates this and makes handling masked numpy `ndarrays` much easier. Check it out sometime! We may come back to it later in the course.
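As a small taste of the `ma` sub-module, the sketch below uses `np.ma.masked_invalid` to mask the NaN entry automatically. The array `vals` is again a stand-in for the example array above, and this shows only one of several ways to build a masked array.

```python
import numpy as np

vals = np.arange(10.0)
vals[0] = np.nan

# mask every NaN (and inf) entry; masked entries are skipped by reductions
masked_vals = np.ma.masked_invalid(vals)

print(masked_vals)
print(masked_vals.sum())    # 45.0
print(masked_vals.count())  # 9 unmasked elements
```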

### Array Concatenating

Let's now look at ways in which we can combine (or concatenate) arrays. Recall that with lists and tuples, we did this with a simple addition operator `+`. With numpy arrays, as we will see, we will want to reserve that operator for actual arithmetic.

There are multiple ways to **concatenate arrays**, depending on how you'd like to combine them.

1.) The most straightforward way is to just create a new `np.array` object wrapped around the constituent arrays.

In [None]:
# this will stack along the zeroth axis
arr1 = np.arange(10)
arr2 = np.arange(10, 20)
arr3 = np.arange(20,30)

big_arr = np.array([arr1, arr2, arr3])

In [None]:
# what is the shape of big_arr?
big_arr.shape

2.) We can also use the `np.vstack` function, which stacks along the zeroth axis, similar to before.

In [None]:
np.vstack([arr1, arr2, arr3])

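3.) Use `np.hstack`, which stacks horizontally: 1D arrays like ours are joined end-to-end (similar to `list.extend`), while arrays with two or more dimensions are joined along the first axis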

In [None]:
np.hstack([arr1, arr2, arr3])

4.) Use `np.dstack`, which stacks along the second axis

In [None]:
np.dstack([arr1, arr2, arr3]).shape

The `np.append` works similarly to the above functions.

The `np.concatenate` function works too.
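Since the last two functions are only mentioned in passing, here is a minimal sketch of both. The arrays `a`, `b` and `m` are made up for this example.

```python
import numpy as np

a = np.arange(3)          # [0 1 2]
b = np.arange(3, 6)       # [3 4 5]

joined = np.concatenate([a, b])   # 1D arrays joined end-to-end
appended = np.append(a, b)        # same result for 1D inputs

# for 2D arrays, the axis argument chooses the direction of the join
m = np.ones((2, 3))
rows = np.concatenate([m, m], axis=0)   # shape (4, 3)
cols = np.concatenate([m, m], axis=1)   # shape (2, 6)

print(joined, appended)
print(rows.shape, cols.shape)
```

Unlike `list.append`, `np.append` returns a new array each time rather than growing the original in place, so calling it repeatedly inside a loop tends to be slow.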

### `numpy.loadtxt` for easy text file I/O

Use the `numpy.loadtxt()` function to easily load in text files. You can feed it a filename and a delimiter string to tell it how to break up the data in the text file. By default, that delimiter is any whitespace. For a `.csv` (comma-separated values) file, we will want to choose a comma, `','`, as the delimiter. Let's load in the `.csv` file we worked with before.

In [None]:
%%bash
# first let's inspect the file
less ../02_IntroPython/random.csv

In [None]:
# load in all information and assign it to variable data
data = np.loadtxt('../02_IntroPython/random.csv', delimiter=',')
print(data)
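`np.loadtxt` also takes `skiprows` and `usecols` arguments for skipping header lines and picking out particular columns. The sketch below shows them on a small in-memory stand-in for a file, so it runs without anything on disk. The column names and values in `fake_file` are invented for illustration and do not come from any course data file.

```python
import io
import numpy as np

# stand-in for a file on disk: one header line, then comma-separated rows
fake_file = io.StringIO(
    "id,radius,mass\n"
    "1,0.5,10.0\n"
    "2,1.5,20.0\n"
    "3,2.5,30.0\n"
)

# skip the header line and keep only columns 1 and 2 (zero-indexed)
cols = np.loadtxt(fake_file, delimiter=',', skiprows=1, usecols=(1, 2))

print(cols)
print(cols.shape)   # (3, 2)
```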

### Breakout

1.
Load in the file `clusters.csv` with the `np.loadtxt` function, keeping only the columns labeled as `haloid`, `r200crit`, and `m200crit`. You will want to inspect the header of the file to see which columns these labels correspond to. Find the HaloID corresponding to the cluster with an `r200crit` > 1.0 **and** an `m200crit` < $2\times10^{4}$. Note that HaloID should have data as integers, meaning you will want to break off that component of the data into a separate array with `int` data type, and keep the other two (which are floats) in a single `ndarray` with `float` data type.

2.
Now load in the column `vdispmean` and concatenate this into your data `ndarray`.

### Array Arithmetic and Broadcasting

One of the great things about arrays is our ability to perform math on them, element-by-element. This occurs exactly as you might suspect: a scalar times an array is just another array with its elements multiplied by that scalar. Note this is not matrix algebra, even though we sometimes refer to `ndarrays` as matrices. Let's briefly review. Note that in the case of array-array arithmetic, the **arrays need to have the same shape**! Otherwise, Numpy doesn't know how to match up the excess indices.

In [15]:
# array arithmetic
arr1 = np.arange(10)
arr2 = np.arange(10,20)
print(arr1)
print(arr2)

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]


In [16]:
# arr * arr multiplication
arr1 * arr2

array([  0,  11,  24,  39,  56,  75,  96, 119, 144, 171])

In [17]:
# arr + arr addition
arr1 + arr2

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

There are sometimes exceptions to the rule I stated above, that for array arithmetic to work the arrays need to have the same shape. The exceptions are when Numpy can identify what it thinks you are trying to do, performs the array operations efficiently without making extra copies of data, and extends the resultant array to match what it thinks you are doing.

In [18]:
# arr / scalar division
arr2 / 2.0

array([ 5. ,  5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

One example of this is what we just did: scalar -- array arithmetic! In this case, numpy knows that what we mean by `arr2 / 2.0` is actually `arr2 / array([2.0, 2.0, ..., 2.0])`. Here, numpy "broadcasts" the 2.0 into an array and performs the operations element-by-element as we would expect. Broadcasting can also work between arrays of different shapes, exemplified by the figure below.

<img src="imgs/broadcast.png" width=600px/>
<center>IC: astroML </center>

Do we expect the following broadcasting operations to work?

In [19]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( (2, 1) )

arr1 + arr2

array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])

In [26]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( (1, 2) )

arr1 + arr2

ValueError: operands could not be broadcast together with shapes (2,3) (1,2) 

In [25]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( (3) )

arr1 + arr2

array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])

Why did one of the last two broadcasting operations fail, while the other worked? When the two arrays have a different number of dimensions, the shape of the lower-dimensional array is first padded with ones in front, i.e., $(3) => (1, 3)$. Numpy then compares the two shapes axis by axis: each pair of lengths must either be equal, or one of them must be 1 (in which case that axis is stretched to match). For $(2, 3)$ and $(1, 3)$ every axis passes this test, but for $(2, 3)$ and $(1, 2)$ the last axis has lengths 3 and 2, so the broadcast fails.
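The shape comparison can be checked without building any arrays. The sketch below wraps `np.broadcast_shapes` (available in NumPy 1.20 and later) in a small helper; the helper name `can_broadcast` is our own invention and not part of NumPy.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return the broadcast shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return None

# the three cases from the cells above
print(can_broadcast((2, 3), (2, 1)))  # (2, 3)
print(can_broadcast((2, 3), (1, 2)))  # None
print(can_broadcast((2, 3), (3,)))    # (2, 3)
```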

**Array Arithmetic Example: Gravitational Force Calculation**

We can also use mathematical functions directly on arrays, which will be applied onto the array element-by-element. Take the following example. The gravitational force between two masses, $M$ and $m$, separated by a distance $r$, can be expressed as

\begin{align}
F = \frac{GMm}{r^{2}}
\end{align}
where $G$ is the gravitational constant of $6.67\times10^{-11}$ m$^{3}$ kg$^{-1}$ s$^{-2}$. Say we have $M=10$kg and $m=5$kg, and we want to evaluate the force when the distance $r$ ranges from $1 < r < 10$ m in steps of $1$ m.


We can write this into a function for $F$ given $r$:

In [27]:
G = 6.67e-11
M = 10
m = 5
def Fgrav(r):
    return G * M * m / r**2

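One way to solve for $F$ over a range of $r$ is with a **`for` loop and a list**. To make the timing difference easier to see, the timed cells use a much larger range, $1 \leq r < 10000$ m: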

In [34]:
%%timeit
r_range = range(1, 10000, 1)
force = []
for r in r_range:
    force.append( Fgrav(r) )

100 loops, best of 3: 8.25 ms per loop


Another way, which is faster, more elegant, and easier to read, is to **feed a whole `np.ndarray` into the function**:

In [35]:
%%timeit
r_range = np.arange(1, 10000, 1)
force = Fgrav(r_range)

The slowest run took 6.39 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 79.6 µs per loop
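It is worth confirming that the two approaches give the same numbers, not just different timings. The sketch below repeats the definitions from above so it runs on its own, and compares the results over the smaller range $r = 1, 2, \ldots, 9$ m with `np.allclose`. The variable names `r_values`, `force_loop` and `force_array` are chosen just for this check.

```python
import numpy as np

G = 6.67e-11   # gravitational constant in m^3 kg^-1 s^-2
M = 10         # kg
m = 5          # kg

def Fgrav(r):
    return G * M * m / r**2

r_values = np.arange(1, 10)                 # r = 1, 2, ..., 9 m
force_loop = [Fgrav(r) for r in r_values]   # one element at a time
force_array = Fgrav(r_values)               # whole array at once

print(np.allclose(force_loop, force_array))  # True
print(force_array[0])                        # force in newtons at r = 1 m
```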


### Matrix Operations

`numpy` is also particularly useful for performing matrix operations and linear algebra. Let's look at the dot product function `np.dot()`. If the shapes of the incoming arrays don't match, try using `.T` for transpose. There are lots of other matrix operations, like `np.cross`, `np.inner`, `np.outer` and `np.trace`, etc.

In [36]:
vec1 = np.arange(4)
mat1 = np.arange(8).reshape(2, 4)
mat2 = np.arange(16).reshape(4, 4)

print(vec1)
print("")
print(mat1)
print("")
print(mat2)

[0 1 2 3]

[[0 1 2 3]
 [4 5 6 7]]

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [37]:
# dot product of two vectors
np.dot(vec1, vec1)

14

In [38]:
# dot product of mat1 and mat1 gives an error...
np.dot(mat1, mat1)

ValueError: shapes (2,4) and (2,4) not aligned: 4 (dim 1) != 2 (dim 0)

In [39]:
# dot product of mat1 and mat1.T
np.dot(mat1, mat1.T)

array([[ 14,  38],
       [ 38, 126]])

In [40]:
# dot product of mat1 and mat2
np.dot(mat1, mat2)

array([[ 56,  62,  68,  74],
       [152, 174, 196, 218]])

Let's confirm for ourselves that the Cartesian unit vectors in ($x, y, z$) are orthogonal (i.e. their projection onto each other yields zero). Recall

\begin{align}
\vec{a}\cdot\vec{b} = |a|\cdot|b|\cos\theta
\end{align}

In [41]:
unit_x = np.array([1, 0, 0])
unit_y = np.array([0, 1, 0])
unit_z = np.array([0, 0, 1])
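With the unit vectors defined, one way to carry out the check is to take the dot product of each pair. The sketch below repeats the definitions so it is self-contained, and also uses `np.cross` (mentioned earlier) as an extra sanity check.

```python
import numpy as np

unit_x = np.array([1, 0, 0])
unit_y = np.array([0, 1, 0])
unit_z = np.array([0, 0, 1])

# dot products between different unit vectors should all be zero
print(np.dot(unit_x, unit_y), np.dot(unit_y, unit_z), np.dot(unit_x, unit_z))

# each unit vector dotted with itself gives its squared length, 1
print(np.dot(unit_x, unit_x))

# the cross product of x and y gives z
print(np.cross(unit_x, unit_y))
```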

### Speed Considerations

It is useful to consider the speed of the calculations we are performing, particularly when working with extremely large arrays (think millions of elements or larger). Here we will run through some examples which highlight some efficiencies to keep in mind while using `numpy`.

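1.) Making large place-holder arrays? Use `np.empty`, which skips filling in values. Its contents are arbitrary until you overwrite them, so only use it when every element will be assigned later.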


In [50]:
# generating a blank ten-million-element array

%timeit np.zeros(10000000)

%timeit np.empty(10000000)

10 loops, best of 3: 36 ms per loop
100 loops, best of 3: 16 ms per loop


2.) `Numpy` views are fast for data inspection and manipulation. Take-away here is to use `numpy ndarrays` over built-in data structures when possible and when it makes sense.

In [51]:
# view / copy difference
N = 1000000
l_one = list(range(N))
a_one = np.arange(N)

%timeit l_two = l_one[:]
%timeit a_two = a_one[:]

100 loops, best of 3: 19.2 ms per loop
The slowest run took 29.27 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 295 ns per loop


3.) While traditional slicing of an `ndarray` creates a view, fancy indexing creates a copy, meaning it is significantly slower when operating over large arrays.

In [52]:
# slicing / fancy indexing difference
arr = np.arange(1000000)

# slicing
%timeit arr[1000:100000]

# fancy indexing w/ a list
%timeit arr[range(1000, 100000)]

The slowest run took 32.64 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 381 ns per loop
100 loops, best of 3: 16.2 ms per loop
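The view-versus-copy difference behind these timings can be checked directly with `np.shares_memory`. This is a minimal sketch with made-up names (`big`, `sliced`, `picked`), and it uses an index array rather than a `range` for the fancy indexing.

```python
import numpy as np

big = np.arange(1000000)

sliced = big[1000:100000]                 # basic slice
picked = big[np.arange(1000, 100000)]     # fancy indexing with an index array

print(np.shares_memory(big, sliced))   # True: a view
print(np.shares_memory(big, picked))   # False: a copy
print(np.array_equal(sliced, picked))  # True: same values either way
```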


4.) Array operations are faster at element-by-element arithmetic than a `for` loop.

In [53]:
# array arithmetic is much faster than a simple FOR loop
def reciprocal(values):
    # initialize empty array
    recips = np.empty( len(values) )
    # use enumerate for a FOR loop
    for i, val in enumerate(values):
        recips[i] = 1.0 / val
    # return output
    return recips

arr = np.arange(1., 10000.)

%timeit reciprocal( arr )
%timeit 1 / arr

100 loops, best of 3: 3.61 ms per loop
10000 loops, best of 3: 39.5 µs per loop
