## Array Operations with Numpy

**Nick Kern**
<br>
**Astro 9: Python Programming in Astronomy**
<br>
**UC Berkeley**

---

<img src="imgs/numpy.jpg" width=300px/>

Now that we've covered the basic functionality of "pure Python," we will move on to the Python packages that really make Python a powerful tool for scientific analysis. One such package is Numerical Python (NumPy): if it didn't exist, it is unlikely that Python would be used within the scientific community to the extent it is today.

In brief, Numpy is a package that provides an "array" data structure, along with modules that let us perform all kinds of mathematical operations (e.g., linear algebra) on those arrays. You can think of a numpy array (called an `ndarray`) as a "multidimensional container of items of the same type and size." Arrays can be sliced and indexed like lists, but can also be rotated, matrix multiplied, "fancy indexed," transposed, and more.

In [1]:
# load numpy like this
import numpy as np

### The `numpy ndarray`

Numpy allows us to work with N-dimensional arrays, or `ndarrays`. One prominent difference between Numpy arrays and Python lists is the fact that all elements within a Numpy array **need to have the same data type**. 

In [3]:
# define our first array
arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(arr.dtype)

[1 2 3 4 5]
int64


Numpy arrays, like lists, **are mutable**. However, because every element must share one data type, Numpy converts mixed inputs to a common type when the array is created. This conversion can happen implicitly, or you can request a data type explicitly.

In [25]:
# numpy implicitly knows to convert the integers to a float
arr = np.array([1.23, 4, 5, 6])

print(arr)

print(arr.dtype)

[ 1.23  4.    5.    6.  ]
float64


In [26]:
# explicitly setting the data type
arr = np.array([1.23, 4, 5, 6], dtype=complex)

print(arr)

print(arr.dtype)

[ 1.23+0.j  4.00+0.j  5.00+0.j  6.00+0.j]
complex128


In [4]:
# arrays, like lists, are mutable
arr = np.array([0, 0.5, 1, 1.5, 2])

print(arr, hex(id(arr)))

arr[0] = -0.5

print(arr, hex(id(arr)))

[ 0.   0.5  1.   1.5  2. ] 0x108313760
[-0.5  0.5  1.   1.5  2. ] 0x108313760
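One caveat is worth a quick sketch: assigning to an element never changes the data type of an existing array. The example below is illustrative only (the variable names are our own) and shows what happens when a float is assigned into an integer array.

```python
import numpy as np

# an integer array keeps its integer dtype on assignment
int_arr = np.array([1, 2, 3])
int_arr[0] = 7.9              # the fractional part is dropped to fit the int dtype
print(int_arr, int_arr.dtype)

# to store floats, convert first; astype returns a new array
float_arr = int_arr.astype(float)
float_arr[0] = 7.9
print(float_arr, float_arr.dtype)
```

If the fractional part matters, convert with `.astype(float)` before assigning.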


In [6]:
# putting the N in ndarray
arr = np.array([ [1,2,3], [4,5,6], [7,8,9]])

print(arr)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [7]:
# arrays are objects, and thus have attributes and methods!
print(arr.ndim)

print(arr.shape)

2
(3, 3)


Numpy arrays are objects, and as such have attributes and methods. Along with the attributes `.ndim` and `.shape`, useful methods include `.astype` to convert to a different data type, `.max` and `.min` for the maximum and minimum values, `.ravel` for "unraveling" a multi-dimensional array into a 1D array, and `.resize` and `.reshape` for reshaping an array.
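As a minimal sketch of a few of these on a small example array (the name `grid` is just for illustration):

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)        # reshape returns a 2x3 array

print(grid.shape, grid.ndim, grid.size)  # attributes: no parentheses
print(grid.max(), grid.min())            # methods: called with parentheses
print(grid.max(axis=0))                  # maximum of each column
print(grid.ravel())                      # flattened 1D version
print(grid.astype(float).dtype)          # converted copy with a float dtype
```

Many methods, like `.max`, accept an `axis` argument to operate along one dimension instead of over the whole array.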

### Easily Creating Arrays

If we want to create an array, we don't need to fill it with elements by hand. Like the `range()` function for lists, the numpy array has a *range* of functions for building arrays easily.

In [111]:
# np.arange similar to range capability, but returns an array, not an iterator
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [4]:
# np.linspace for specifying number of elements, rather than step-size with arange (it has inclusive upper bound!)
np.linspace(0, 20, 41)

array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ])
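To make the endpoint difference concrete, here is a small sketch comparing the two functions on the same interval. The `endpoint=False` keyword is a standard `np.linspace` option; the variable names are illustrative.

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)                    # stop value 1 is excluded
spaced = np.linspace(0, 1, 5)                      # stop value 1 is included
half_open = np.linspace(0, 1, 4, endpoint=False)   # mimics arange's open end

print(stepped)
print(spaced)
print(half_open)
```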

In [135]:
# array of ones
np.ones(10)

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [136]:
# array of zeros
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [150]:
# 2D array of ones
np.ones( (2, 5) )

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [152]:
# array of random numbers from [0, 1)
np.random.random( 5 )

array([ 0.89943441,  0.16338967,  0.79964314,  0.5034744 ,  0.20744789])

### Array Slicing and Fancy Indexing

[draw a picture]

Like lists, numpy arrays can be sliced; unlike lists, the numpy `ndarray` supports multi-dimensional slicing. Remember that when slicing, the *start is inclusive*, while the *end is exclusive*. Note that the **zeroth axis** of a 2D array runs along the vertical (down the rows), and the **first axis** runs along the horizontal (across the columns).

As before, a quick way to generate an array to practice on is the `np.arange` function.

In [16]:
# Create a 4x4 array
arr = np.arange(16)

arr.resize(4,4)

print(arr)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [17]:
# slice along 0th axis
print(arr[2, :])

[ 8  9 10 11]


In [18]:
# slice along 1st axis
print(arr[:, 2])

[ 2  6 10 14]


In [19]:
# take a subsection
print(arr[1:3, 1:3])

[[ 5  6]
 [ 9 10]]


### Breakout 

How might we slice an array to get these results?

<img src="imgs/2D_slicing.jpg" width=300px/>

In [3]:
# breakout
arr = np.arange(11, 36).reshape(5,5)

**Fancy indexing** is like slicing, but allows us to control exactly which elements we want to pull out of the array: we aren't limited to a row (red), column (blue), or elements that are evenly spaced apart from each other (green). Fancy indexing essentially creates a new `ndarray` composed of any combination of the elements in the initial array. Let's say we wanted the first, second and fourth rows of our previous 2D array. We index this the same way as before, but now we **feed a list** instead of an integer.

In [28]:
# Take rows
print(arr[ [0, 1, 3], : ])

[[11 12 13 14 15]
 [16 17 18 19 20]
 [26 27 28 29 30]]


In [29]:
# Repeatedly take rows
print(arr[ [0, 1, 3, 3, 3, 2, 1, 0], :])

[[11 12 13 14 15]
 [16 17 18 19 20]
 [26 27 28 29 30]
 [26 27 28 29 30]
 [26 27 28 29 30]
 [21 22 23 24 25]
 [16 17 18 19 20]
 [11 12 13 14 15]]


In [30]:
# pick out just 16, 24 and 32
print( arr[ [1, 2, 4], [0, 3, 1] ] )

[16 24 32]


Recall that when we performed straight assignment of **lists** we were **copying references**, and when we sliced lists we were creating **new copies** of the object. This allowed us to get around the pitfalls of mutable objects and accidental element assignment. However, for an `ndarray`, a slice creates a **view** of the original array, which is an object that has a different memory address but still **shares data** with the original array. This means that, similar to our mutable list example, we can get unexpected behavior if we are loose with how we assign our arrays and data.

In [31]:
# data manipulation example
arr1 = np.arange(16).reshape(4,4)

arr2 = arr1[1:3, 1:3]

print(arr1)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [32]:
# assign to arr2
arr2[:, :] = np.arange(100,104).reshape(2, 2)

print(arr1)

[[  0   1   2   3]
 [  4 100 101   7]
 [  8 102 103  11]
 [ 12  13  14  15]]


In [36]:
# create a copy by performing arithmetic, or by using .copy()
arr1 = np.arange(16).reshape(4,4)

arr2 = arr1[1:3, 1:3].copy()

arr2[:, :] = np.arange(100, 104).reshape(2,2)

print(arr1)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
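If we are ever unsure whether two arrays share data, `np.shares_memory` will tell us. The sketch below (variable names are illustrative) checks a slice, an explicit copy, and a fancy-indexed selection.

```python
import numpy as np

base = np.arange(16).reshape(4, 4)

view = base[1:3, 1:3]            # basic slice: a view onto base
copied = base[1:3, 1:3].copy()   # explicit copy: independent data
fancy = base[[1, 2], :]          # fancy indexing: returns a copy

print(np.shares_memory(base, view))
print(np.shares_memory(base, copied))
print(np.shares_memory(base, fancy))
```

Only the plain slice shares memory with `base`, so only assignments through `view` would show up in the original array.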


### Fancy Indexing with Boolean Arrays and the `np.where` function

Comparison operators on an `ndarray` will yield another `ndarray` with a boolean data type. The boolean array is the same shape as the original array, and is `True` where the condition is met and `False` where it isn't. We can then fancy index with these arrays to eliminate elements that don't conform. This is a fast and easy way to pick through an array for desired elements. Take the following example.

In [9]:
# Create an array
arr = np.arange(25).reshape(5, 5)
print(arr)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]


In [44]:
# Create mask
mask = arr > 10
print(mask)

[[False False False False False]
 [False False False False False]
 [False  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]]


In [45]:
# select out elements with fancy indexing
arr[mask]

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])
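Boolean masks can also be combined and used for assignment. The following is a small sketch with made-up names; note that the element-wise operators are `&`, `|` and `~` (not `and`, `or`, `not`), and that each comparison needs its own parentheses.

```python
import numpy as np

values = np.arange(25).reshape(5, 5)

# combine conditions with & (and), | (or), ~ (not)
edges = (values < 3) | (values > 21)
print(values[edges])

# assigning through a mask modifies the original array in place
values[~edges] = 0
print(values)
```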

The **`np.where`** function can be used in two different ways. The first way accomplishes the same thing as the boolean mask: you feed `np.where` a condition, and it returns a tuple of index arrays (one per axis) which you can then feed into the array to fancy index only the elements you wanted. The difference is that instead of returning a boolean array with the same shape as your original, it returns the indices of the matching elements, which performs fancy indexing similar to how we originally did it. It looks something like this:

In [46]:
# fancy indexing with np.where
fancy = np.where(arr > 10)

fancy

(array([2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]), array([1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))


In [48]:
arr[fancy]

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])

Note that you can put more than one conditional in the `np.where` function.

In [25]:
fancy = np.where( (arr > 10) & (arr < 15) )
print(arr[fancy])

[11 12 13 14]


What's nice about this way of fancy indexing is that you also get the indices of the desired elements from the original array, which may be useful or necessary depending on what you are doing.

The other way of using `np.where` is to perform a true masking of the original array, while keeping its structure intact and assigning the `False` elements some pre-defined value. It looks something like this:

In [10]:
# true masking
np.where(arr > 10, arr, np.nan)

array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])

Here the syntax is
```
np.where(<condition>, value if True, value if False)
```

You can see that the original array's structure is kept intact, and we can clearly see which elements are undesired. Notice that the integers were converted to floats: we introduced a new value in doing so, the numpy NaN, or Not a Number, which is itself a float. We discuss NaNs in the next section.
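The two replacement values do not have to be the original array and a NaN: they can be any scalars or arrays that broadcast against the condition. A short sketch, using example data of our own:

```python
import numpy as np

values = np.arange(6)

# scalar replacements: label each element by whether it passes the test
labels = np.where(values % 2 == 0, "even", "odd")
print(labels)

# array and scalar: keep small values, replace the rest with 3
clipped = np.where(values < 3, values, 3)
print(clipped)
```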

### The Numpy NaN

NaNs can be both helpful and extremely annoying to deal with at the same time. What's nice about them is that they are definitely not a number, meaning that if you are uncomfortable about masking with a 0, `np.inf`, or some other number, you can mask with a `np.nan` and your array will keep a numeric (float) data type. You couldn't, for example, mask with `None` without falling back to the generic `object` data type (try it!). Keep in mind that `np.nan` is itself a float, so an integer array is converted to float when masked this way. However, as we will see when we get to array arithmetic, performing arithmetic over arrays that have even a single `np.nan` can mess up your whole code. For example:

In [99]:
# sum the elements of this array!
arr = np.arange(10.0)
print(np.sum(arr))

45.0


In [100]:
# sum when one element is a nan
arr[0] = np.nan

print(arr)

print("")

print(np.sum(arr))

[ nan   1.   2.   3.   4.   5.   6.   7.   8.   9.]

nan


What's doubly annoying is that NaNs don't obey comparison operators: under the IEEE 754 floating-point standard, a NaN compares unequal to everything, including itself. This means that if you have a large array with one or two `np.nan`s in it, you can't search for them with a `np.where(arr == np.nan)` statement. However, the good people at NumPy anticipated this and provide us with the `np.isnan` function, which tests each element for NaN. Instead of returning indices, as the single-argument `np.where` does, it returns a boolean array.

In [9]:
# make a np.nan mask
mask = np.isnan(arr)
print(mask)

[ True False False False False False False False False False]


In [10]:
# perform sum w/ np.nan masked!
np.sum( arr[~mask] )

45.0

Notice we inverted the booleans with the `~` operator, so that the mask selects the elements that are *not* NaN before summing.
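The comparison behavior is easy to check directly. A NaN compares unequal to every value, including itself, which is why an `==` test can never find one. A quick sketch:

```python
import numpy as np

print(np.nan == np.nan)     # NaN is not equal to anything, even itself

data = np.array([np.nan, 1.0, 2.0])
print(data == np.nan)       # every comparison fails, so this cannot locate the NaN
print(np.isnan(data))       # flags only the NaN entry
print(data != data)         # the same mask, built from NaN's self-inequality
```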

It turns out there are also `np.nan`-safe functions within Numpy that can also do this without us having to think too hard, like `np.nansum`.
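A brief sketch of a few of these NaN-safe functions on a small example array:

```python
import numpy as np

data = np.array([np.nan, 1.0, 2.0, 3.0])

print(np.sum(data))       # the NaN propagates into the result
print(np.nansum(data))    # NaN entries are treated as zero
print(np.nanmean(data))   # NaN entries are left out of the average
print(np.nanmax(data))    # NaN entries are ignored
```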

In addition, there is an entire sub-module within numpy, called `ma` for Masked Arrays, that automates this and makes handling masked numpy `ndarrays` much easier. Check it out sometime! We may come back to it later in the course.
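As a small taste, and only a sketch of two of the many `np.ma` helpers, `np.ma.masked_invalid` masks NaN and infinite entries, while `np.ma.masked_where` masks wherever a condition holds. The array names are illustrative.

```python
import numpy as np

data = np.array([np.nan, 1.0, 2.0, 3.0])

# mask every NaN (or inf) entry
masked = np.ma.masked_invalid(data)
print(masked)
print(masked.sum(), masked.mean())   # masked entries are skipped

# mask wherever a condition holds
counts = np.arange(5)
hidden = np.ma.masked_where(counts > 2, counts)
print(hidden.sum())                  # only 0, 1 and 2 contribute
```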

### Array Concatenating

Let's now look at ways in which we can combine (or concatenate) arrays. Recall that with lists and tuples, we did this with a simple addition operator `+`. With numpy arrays, as we will see, we will want to reserve that operator for actual arithmetic.

There are multiple ways to **concatenate arrays**, depending on how you'd like to combine them.

1.) The most straightforward way is to just create a new `np.array` object wrapped around the constituent arrays.

In [51]:
# this will stack along the zeroth axis
arr1 = np.arange(10)
arr2 = np.arange(10, 20)
arr3 = np.arange(20,30)

big_arr = np.array([arr1, arr2, arr3])

print(big_arr)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]]


In [52]:
# what is the shape of big_arr?
big_arr.shape

(3, 10)

2.) We can also use the `np.vstack` function, which stacks along the zeroth axis, similar to before.

In [53]:
np.vstack([arr1, arr2, arr3])

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])

3.) Use `np.hstack`, which joins 1D arrays end-to-end (similar to `list.extend`); for 2D and higher arrays it stacks along the first axis.

In [54]:
np.hstack([arr1, arr2, arr3])

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

4.) Use `np.dstack`, which stacks along the second axis

In [58]:
np.dstack([arr1, arr2, arr3])

array([[[ 0, 10, 20],
        [ 1, 11, 21],
        [ 2, 12, 22],
        [ 3, 13, 23],
        [ 4, 14, 24],
        [ 5, 15, 25],
        [ 6, 16, 26],
        [ 7, 17, 27],
        [ 8, 18, 28],
        [ 9, 19, 29]]])

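The `np.append` function works similarly to the above functions: it returns a new array with values joined onto the end of the input, flattening both unless an `axis` is given.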
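All of these are built on the general-purpose `np.concatenate`, which joins arrays along an existing axis of our choosing. A minimal sketch with small example arrays:

```python
import numpy as np

a = np.arange(3)
b = np.arange(3, 6)

joined = np.concatenate([a, b])                   # 1D arrays: end-to-end
print(joined)

top = np.zeros((2, 3))
bottom = np.ones((1, 3))
stacked = np.concatenate([top, bottom], axis=0)   # add rows; column counts must match
print(stacked.shape)

appended = np.append(a, b)                        # new array; a itself is unchanged
flat = np.append(top, bottom)                     # no axis given: inputs are flattened
print(appended, flat.shape)
```

Unlike `list.append`, none of these modify an array in place, so the result always needs to be assigned to a name.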

### Breakout

1.
Load in the file `clusters.csv` with the `np.loadtxt` function, keeping only the columns labeled as `haloid`, `r200crit`, and `m200crit`. You will want to inspect the header of the file to see which columns these labels correspond to. Find the HaloID corresponding to the cluster with an `r200crit` > 1.0 **and** an `m200crit` < $2\times10^{4}$. Note that HaloID should have data as integers, meaning you will want to break off that component of the data into a separate array with `int` data type, and keep the other two (which are floats) in a single `ndarray` with `float` data type.

2.
Now load in the column `vdispmean` and concatenate this into your data `ndarray`.

### Array Arithmetic and Broadcasting

One of the great things about arrays is our ability to perform math on them, element-by-element. This occurs exactly as you might suspect: a scalar times an array is just another array with its elements multiplied by that scalar. Let's briefly review. Note that in the case of array-array arithmetic, the arrays need to have the same shape! Otherwise, Numpy doesn't know how to match up the excess indices.

In [14]:
# array arithmetic
arr1 = np.arange(10)
arr2 = np.arange(10,20)
print(arr1)
print(arr2)

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]


In [15]:
# arr * arr multiplication
arr1 * arr2

array([  0,  11,  24,  39,  56,  75,  96, 119, 144, 171])

In [7]:
# arr + arr addition
arr1 + arr2

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

In [18]:
# arr / scalar division
arr2 / 2.0

array([ 5. ,  5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

There are sometimes exceptions to the rule I stated above, that for array arithmetic to work the arrays need to have the same shape. The exceptions are when Numpy can identify what it thinks you are trying to do, performs the array operations efficiently without making extra copies of data, and extends the resultant array to match what it thinks you are doing.

One example of this is what we just did: scalar-array arithmetic! In this case, numpy knows that what we mean by `arr2 / 2.0` is actually `arr2 / array([2.0, 2.0, ..., 2.0])`. Here, numpy "broadcasts" the 2.0 into an array and performs the operations element-by-element as we would expect. Broadcasting can also work between arrays of different shapes, exemplified by the figure below.

<img src="imgs/broadcast.png" width=600px/>
<center>IC: astroML </center>

Do we expect the following broadcasting operations to work?

In [144]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( (2, 1) )

arr1 + arr2

array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])

In [145]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( 2 )

arr1 + arr2

ValueError: operands could not be broadcast together with shapes (2,3) (2,) 

In [147]:
arr1 = np.ones( (2, 3) )
arr2 = np.ones( 3 )

arr1 + arr2

array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
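The pattern in these three cells is that shapes are compared starting from the last axis, and each pair of lengths must either match or include a 1. The sketch below (names are illustrative) shows one way to repair the failing case by giving the length-2 array an explicit trailing axis with `np.newaxis`.

```python
import numpy as np

grid = np.ones((2, 3))
per_row = np.array([10.0, 20.0])     # shape (2,): one value per row

# (2, 3) vs (2,) fails because the trailing lengths 3 and 2 differ;
# adding a trailing length-1 axis gives (2, 1), which stretches across columns
result = grid + per_row[:, np.newaxis]
print(result)

# ask numpy what shape a broadcast would produce
print(np.broadcast_shapes((2, 3), (2, 1)))
```

`np.broadcast_shapes` is available in NumPy 1.20 and later.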

### Speed Considerations

It is useful to consider the speed of the calculations we are performing, particularly when working with extremely large arrays (think millions of elements or larger). Here we will run through some examples which highlight some efficiencies to keep in mind while using `numpy`.

1.) Making large place-holder arrays? Use `np.empty`


In [22]:
# generating a blank million-element array

%timeit np.empty(1000000)

%timeit np.zeros(1000000)

The slowest run took 23.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.02 µs per loop
100 loops, best of 3: 5.31 ms per loop
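The speed comes with a catch: `np.empty` does not initialize its contents, so the values are arbitrary until we assign them. A minimal sketch of the safe pattern, in which every slot is written before it is read:

```python
import numpy as np

n = 5
squares = np.empty(n)              # contents are arbitrary at this point
squares[:] = np.arange(n) ** 2     # fill every element before using the array
print(squares)
```

If any element might be read before it is assigned, `np.zeros` is the safer choice.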


2.) `Numpy` views are fast for data inspection and manipulation: slicing a list copies every element, while slicing an array only creates a view. The take-away here is to use `numpy ndarrays` over built-in data structures when possible and when it makes sense.

In [23]:
# view / copy difference
N = 1000000
l1 = list(range(N))
a1 = np.arange(N)

%timeit l2 = l1[:]
%timeit a2 = a1[:]

10 loops, best of 3: 39.1 ms per loop
The slowest run took 23.39 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 313 ns per loop


3.) While traditional slicing of an `ndarray` creates a view, fancy indexing creates a copy, meaning it is significantly slower when operating over large arrays.

In [None]:
# slicing / fancy indexing difference
arr = np.arange(1000000)

# slicing
%timeit arr[1000:100000]

# fancy indexing w/ a list
%timeit arr[range(1000, 100000)]
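The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard-library `timeit` module; the repeat count of 100 is an arbitrary choice, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

arr = np.arange(1_000_000)
index_list = list(range(1000, 100000))

# time 100 repetitions of each way of selecting the same elements
t_slice = timeit.timeit(lambda: arr[1000:100000], number=100)
t_fancy = timeit.timeit(lambda: arr[index_list], number=100)
print(f"slice: {t_slice:.6f} s, fancy: {t_fancy:.6f} s for 100 runs")

# both approaches select the same values
print(np.array_equal(arr[1000:100000], arr[index_list]))
```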

4.) Array operations are faster at element-by-element arithmetic than a `for` loop.

In [25]:
# array arithmetic is much faster than a simple FOR loop
def reciprocal(values):
    # initialize empty array
    recips = np.empty( len(values) )
    # use enumerate for a FOR loop
    for i, val in enumerate(values):
        recips[i] = 1.0 / val
    # return output
    return recips

arr = np.arange(1., 10000.)
%timeit reciprocal( arr )
%timeit 1 / arr

100 loops, best of 3: 3.99 ms per loop
The slowest run took 4.87 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 47.9 µs per loop


### Breakout
