[![Binder](https://mybinder.org/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2F03_ndarray.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/03_ndarray.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/03_ndarray.ipynb)

# 03 : Indexing, Reshaping, and Computing with `ndarray`

## Learning outcomes

- Use slicing, boolean masks, and advanced indexing with `ndarray`.
- Reshape arrays and reason about `shape`/axes.
- Compute summaries efficiently using vectorized operations.

From last time:
- NumPy arrays have type `ndarray`
- `ndarray`s are *homogeneous multi-dimensional* collections of data. 
- Some attributes of an `ndarray`:
  - `dtype` : the data type of the entries,
  - `ndim` : the number of dimensions, 
  - `shape` : the size of each dimension, 
  - `size` : the total number of entries in the array

Additional attributes include:
- `itemsize` : the size (in bytes) of each array element, 
- `nbytes` : the total bytes used by the array.
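
To make these attributes concrete, here is a minimal sketch that prints all six of them for one small array. The array `demo` and its explicit `dtype=np.int64` are choices made for this illustration.

```python
import numpy as np

# Hypothetical example array: 2 x 3 x 4 entries, stored as 64-bit integers
demo = np.zeros((2, 3, 4), dtype=np.int64)

print(demo.dtype)     # data type of the entries
print(demo.ndim)      # number of dimensions
print(demo.shape)     # size of each dimension
print(demo.size)      # total number of entries
print(demo.itemsize)  # bytes per entry
print(demo.nbytes)    # total bytes used by the entries
```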

## Some advantages of the `ndarray`

In [3]:
import numpy as np

### Memory efficiency

Let's look at the size (in bytes) of an instance of `ndarray`.

In [2]:
x3 = np.random.randint(10, size=(3, 2, 5)) 
print(x3)

[[[4 2 9 2 3]
  [5 1 2 7 5]]

 [[7 3 6 2 5]
  [6 8 7 2 4]]

 [[4 8 0 6 3]
  [7 8 0 0 2]]]


In [3]:
print(f"itemsize: {x3.itemsize} bytes")
print(f"nbytes: {x3.nbytes} bytes")

itemsize: 8 bytes
nbytes: 240 bytes


Each entry takes 8 bytes (a 64-bit integer), so the 30 entries take 240 bytes in total.

In general, `nbytes` is equal to `itemsize` times `size`.

In [4]:
x3.itemsize * x3.size == x3.nbytes

True
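
Because `nbytes` is `itemsize` times `size`, choosing a smaller `dtype` shrinks the array proportionally. A small sketch, assuming the values fit into the smaller type (the names `wide` and `narrow` are made up for this example):

```python
import numpy as np

# 30 small integers stored as 64-bit integers (8 bytes each)
wide = np.arange(30, dtype=np.int64)

# The same values stored as 8-bit integers (1 byte each);
# this is only safe because every value lies between -128 and 127
narrow = wide.astype(np.int8)

print(wide.itemsize, wide.nbytes)
print(narrow.itemsize, narrow.nbytes)
print(np.array_equal(wide, narrow))
```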

Let's compare this with a Python list, using larger collections to make the difference easier to see.

In [5]:
from sys import getsizeof

# Size of our lists
N = 10000

# Create N Python integers (note: `range` is lazy; a real list also stores one pointer per element)
S = range(N)

# Add up the size of every element plus the size of the container
S_size = sum(getsizeof(x) for x in S) + getsizeof(S)

# Create a Numpy array of N elements 
D = np.arange(N)

print(f"Size of the Python list + container:       {S_size} bytes")
print(f"Size of one element in the NumPy array:    {D.itemsize} bytes")
print(f"Size of the entire NumPy array:            {D.nbytes} bytes")

Size of the Python list + container:       280048 bytes
Size of one element in the NumPy array:    8 bytes
Size of the entire NumPy array:            80000 bytes
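
The following sketch repeats the comparison with an actual `list` in place of `range(N)`, so that the container's per-element pointers are counted too. The exact numbers depend on the Python version and platform, so none are quoted here; the names `py_list`, `list_total`, and `arr` are chosen for this example.

```python
import numpy as np
from sys import getsizeof

N = 10000

# An actual list: the container stores one pointer per element,
# and every element is a separate Python int object
py_list = list(range(N))
list_total = getsizeof(py_list) + sum(getsizeof(x) for x in py_list)

# The NumPy array stores the raw integers back to back
arr = np.arange(N)

print(f"list container + elements: {list_total} bytes")
print(f"ndarray data:              {arr.nbytes} bytes")
print(f"ratio:                     {list_total / arr.nbytes:.1f}")
```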


### Iterating through lists

Let's time a simple operation, elementwise addition, with Python sequences and with `ndarray`.

In [6]:
# Create Python ranges and NumPy arrays of size N
N = 10000
Xpy = range(N)
Ypy = range(N)
Xnp = np.arange(N)
Ynp = np.arange(N)

We will use the magic command `%timeit` to time how long it takes to execute.

In [7]:
%timeit _ = [Xpy[i] + Ypy[i] for i in range(N)]

943 μs ± 685 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [8]:
%timeit _ = Xnp + Ynp

2.49 μs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


| prefix | symbol | value | 
| ------ | ------ | ----- | 
| deci   | d      | $10^{-1}$ |
| centi  | c      | $10^{-2}$ |
| milli  | m      | $10^{-3}$ |
| micro  | μ      | $10^{-6}$ |
| nano   | n      | $10^{-9}$ | 
| pico   | p      | $10^{-12}$ |
| atto   | a      | $10^{-18}$ | 

[Attosecond physics 🤯](https://en.wikipedia.org/wiki/Attosecond_physics)

The NumPy version is roughly 380 times faster here (943 μs versus 2.49 μs), a gap of more than two orders of magnitude.
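
`%timeit` only works inside IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard library's `timeit` module. The repetition count below is arbitrary, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

N = 10000
x_list = list(range(N))
y_list = list(range(N))
x_arr = np.arange(N)
y_arr = np.arange(N)

def add_lists():
    # Elementwise addition with an explicit Python-level loop
    return [x_list[i] + y_list[i] for i in range(N)]

def add_arrays():
    # Elementwise addition done inside NumPy's compiled code
    return x_arr + y_arr

repeats = 100  # arbitrary choice for this sketch
t_list = timeit.timeit(add_lists, number=repeats) / repeats
t_arr = timeit.timeit(add_arrays, number=repeats) / repeats

print(f"list comprehension: {t_list * 1e6:.1f} microseconds per call")
print(f"ndarray addition:   {t_arr * 1e6:.1f} microseconds per call")
```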

## Indexing with `ndarray`

In Python, counting starts at $0$, which can be confusing.

Sometimes I refer to the initial entry of a list as the 'first' entry, and sometimes as the 'zeroth' entry.

I try to correct myself and consistently use 'zeroth'.

For the other entries, I generally match the numbering Python uses.

The simplest example is the $1$-dimensional array, so let's work with that. 

In [9]:
a1 = np.random.randint(100, size=6)
print(a1)

[55 63 77 11 87 38]


We access the entries of `a1` (and any $1$-dimensional array) with a single integer. 

In our example the integers $\{0, 1, 2, 3, 4, 5\}$ are suitable.

In [11]:
print(a1[0])

55


In [12]:
a1[2]

np.int64(77)

In [13]:
a1[5]

np.int64(38)

In [15]:
# a1[6]       # naughty naughty: raises IndexError (index 6 is out of bounds)

Nonnegative integers are used to access entries from left to right.

Negative integers are used to access entries from right to left.

For our example, we can also use the integers from $\{-1,-2,-3,-4,-5,-6\}$.

In [16]:
a1[-1]

np.int64(38)

In [17]:
a1[-3]

np.int64(11)

In [18]:
a1[-6]

np.int64(55)

In [20]:
# a1[-10]       # naughty naughty: raises IndexError

Every $1$-dimensional `ndarray` of length $N$ can be indexed with the integers 
$$
    \{-N,\ -N+1,\ \dots,\ -1,\ 0,\ 1,\ \dots,\ N-2,\ N-1\} . 
$$

**Quick Note.** You can determine the length of an array `a` by `len(a)`.
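
The two halves of this index set address the same entries: for $k$ from $1$ to $N$, index $-k$ refers to the same entry as index $N-k$. A quick sketch to check this, using a made-up array `a`:

```python
import numpy as np

a = np.array([55, 63, 77, 11, 87, 38])  # example values
N = len(a)

# Index -k and index N - k pick out the same entry
for k in range(1, N + 1):
    assert a[-k] == a[N - k]

# Anything outside {-N, ..., N-1} raises an IndexError
for bad in (N, -N - 1):
    try:
        a[bad]
    except IndexError as err:
        print(f"a[{bad}] failed: {err}")
```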

We can surgically change one entry of the array. Because `a1` has an integer `dtype`, the assigned value `137.8` is truncated to `137`.

In [24]:
a1[3] = 137.8
print(a1)

[ 55  63  77 137  87  38]
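
If the fractional part matters, one option is to convert the array to a floating-point `dtype` before assigning. A small sketch with made-up names `ints` and `floats`:

```python
import numpy as np

ints = np.array([55, 63, 77, 11, 87, 38])
ints[3] = 137.8              # converted to the integer dtype: fractional part dropped
print(ints.dtype, ints)

floats = ints.astype(float)  # astype returns a new array with the new dtype
floats[3] = 137.8            # now the value is stored as given
print(floats.dtype, floats)
```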


We can take these ideas and generalize them to higher-dimensional arrays.

Let's see the leap from $1$ to $2$ dimensions.

In [25]:
a2 = np.random.randint(100, size=(3, 9))
print(a2)

[[ 7 25 49 36 82 86 27 10 39]
 [80 86 43 65 79 93 79 50 55]
 [44  6 44 16 36 24 44 45 25]]


Entries are indexed the same way we index matrices. 

For example, the $(i,j)$ entry of a matrix lies in the $i$th row and $j$th column. 

We access entries by pairs of integers.

In [29]:
a2[0, 0]
# a2[0][0] gives the same entry

np.int64(7)

In [30]:
a2[0, 4]

np.int64(82)

In [31]:
a2[2, 6]

np.int64(44)

We can think of the first index as taking an integer from 
$$
    \{-3, -2, -1, 0, 1, 2\}
$$

and the second index from
$$
    \{-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8\}.
$$

In [32]:
a2[-2, 8]

np.int64(55)
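
A short sketch tying these together, using a deterministic array so the entries are easy to predict (entry `[i, j]` of `grid` equals `9 * i + j`; the name `grid` is chosen for this example):

```python
import numpy as np

grid = np.arange(27).reshape(3, 9)  # 3 rows, 9 columns, entries 0..26

print(grid[1, 4])     # row 1, column 4
print(grid[1][4])     # same entry: take row 1 first, then index into it
print(grid[-2, 8])    # row -2 is row 1; column 8 is the last column
print(grid[-1, -1])   # last row, last column
```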

Moving to higher dimensions, the same ideas apply.

In [33]:
a5 = np.random.randint(100, size=(4, 3, 5, 7, 2))
print(a5)

[[[[[77 33]
    [ 6 54]
    [58 42]
    [91 12]
    [34 79]
    [61 10]
    [28 18]]

   [[45 29]
    [18 18]
    [11 31]
    [82 30]
    [41 43]
    [55 81]
    [28 17]]

   [[91 44]
    [42 68]
    [48 64]
    [ 8 34]
    [27 32]
    [36 55]
    [53 19]]

   [[22 55]
    [32 34]
    [75 25]
    [34 56]
    [41 59]
    [85 83]
    [93 60]]

   [[53  0]
    [35 32]
    [18 18]
    [10 11]
    [52 29]
    [35 12]
    [48  5]]]


  [[[26 99]
    [96  8]
    [52 97]
    [61 36]
    [88 29]
    [83 35]
    [19 28]]

   [[45 93]
    [64 26]
    [72 79]
    [87 35]
    [17 69]
    [89 21]
    [ 0 65]]

   [[85 96]
    [51 90]
    [19 60]
    [79 22]
    [10 77]
    [98 40]
    [79 99]]

   [[42 95]
    [94 36]
    [96  5]
    [53  3]
    [57 36]
    [23 65]
    [33 12]]

   [[ 7 72]
    [16 51]
    [ 1 90]
    [39 42]
    [37 14]
    [47 28]
    [76 60]]]


  [[[15 15]
    [87  4]
    [22 11]
    [46 32]
    [33 10]
    [65 82]
    [43 49]]

   [[ 0  6]
    [75 46]
    [11 85]
    [67 91]
  

In [35]:
# a5[0, 0, 0, 0, 0]
a5.shape

(4, 3, 5, 7, 2)

There are a few more indexing tricks, but this covers most of what one would do.

If you want to learn more, check out the [documentation](https://numpy.org/doc/stable/user/basics.indexing.html).
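
One such trick, sketched below for an array with the same shape as `a5`: supplying fewer indices than there are dimensions returns a subarray rather than a single entry. The array `t` is a stand-in created for this example.

```python
import numpy as np

# Stand-in for a5: same shape, but filled with 0, 1, 2, ... for predictability
t = np.arange(4 * 3 * 5 * 7 * 2).reshape(4, 3, 5, 7, 2)

print(t[0, 0, 0, 0, 0])      # five indices: a single entry
print(t[3, 2, 4, 6, 1])      # the very last entry
print(t[0].shape)            # one index: a 4-dimensional subarray
print(t[0, 1, 2].shape)      # three indices: a 2-dimensional subarray
```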

## Slicing arrays

A slice of an array is a subarray, which can be lower-dimensional than the original.

### One-dimensional slices

In some sense, this is the most boring, but it's also the easiest to understand.

Accessing entries was done by `a1[k]` for some $k$.

We will take a range of entries from `a1`.

In [36]:
print(a1)

[ 55  63  77 137  87  38]


In [37]:
print(a1[1:5])

[ 63  77 137  87]


The syntax `a1[i:j]` takes all entries from index $i$ up to and including index $j-1$, so index $j$ itself is excluded.

In [38]:
print(a1[0:6])

[ 55  63  77 137  87  38]


In [39]:
print(a1[-6:-1])

[ 55  63  77 137  87]


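There's a *third* argument you can use: the step. The syntax `a1[i:j:s]` starts at index $i$, moves in steps of $s$, and stops before index $j$. A negative step walks through the array backwards.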

In [40]:
a1 = np.arange(20)
print(a1)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


In [41]:
print(a1[0:20:2])

[ 0  2  4  6  8 10 12 14 16 18]


In [42]:
print(a1[:5])
print(a1[5:])
print(a1[:])

[0 1 2 3 4]
[ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


In [43]:
print(a1[::2])
print(a1[::])
print(a1[::-1])

[ 0  2  4  6  8 10 12 14 16 18]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0]
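
The general pattern is `a[start:stop:step]`, where any of the three parts may be omitted. A few more combinations, sketched on a fresh array `a` (a name chosen for this example):

```python
import numpy as np

a = np.arange(20)

print(a[2:11:3])    # start at 2, stop before 11, step 3
print(a[-5:])       # the last five entries
print(a[15:3:-4])   # negative step: start at 15 and walk backwards, stop before 3
print(a[18:100])    # a stop beyond the end is clipped rather than raising an error
```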


### Jumping to $3$ dimensions

It might be helpful to visualize a $3$-dimensional array as a rectangular prism of data.

The following is an illustration of a $(5\times 6\times 4)$-array.

![](imgs/multiway_array.png)

In [47]:
a3 = np.random.randint(10, size=(5, 6, 4))
print(a3)

[[[1 0 2 5]
  [1 4 1 4]
  [4 6 4 9]
  [0 5 0 7]
  [6 0 6 6]
  [2 9 1 4]]

 [[6 9 7 3]
  [7 5 5 6]
  [6 1 8 1]
  [8 7 3 3]
  [9 5 1 6]
  [8 0 7 5]]

 [[1 1 9 9]
  [8 5 7 6]
  [3 8 2 8]
  [7 5 3 9]
  [1 6 9 6]
  [5 8 7 3]]

 [[8 8 1 2]
  [6 7 8 4]
  [3 5 4 7]
  [3 4 5 2]
  [1 4 8 6]
  [0 0 4 9]]

 [[6 5 1 4]
  [2 1 9 1]
  [9 2 0 1]
  [4 3 4 2]
  [4 5 4 3]
  [4 4 2 2]]]


'Slicing' is an operation on arrays that yields 'subarrays'. 

For example, here are a few slices of the above array:

![](imgs/sliced.png)

In [45]:
for k in range(4):
    print(a3[:, :, k])
    print()

[[1 1 4 0 6 2]
 [6 7 6 8 9 8]
 [1 8 3 7 1 5]
 [8 6 3 3 1 0]
 [6 2 9 4 4 4]]

[[0 4 6 5 0 9]
 [9 5 1 7 5 0]
 [1 5 8 5 6 8]
 [8 7 5 4 4 0]
 [5 1 2 3 5 4]]

[[2 1 4 0 6 1]
 [7 5 8 3 1 7]
 [9 7 2 3 9 7]
 [1 8 4 5 8 4]
 [1 9 0 4 4 2]]

[[5 4 9 7 6 4]
 [3 6 1 3 6 5]
 [9 6 8 9 6 3]
 [2 4 7 2 6 9]
 [4 1 1 2 3 2]]



In [48]:
print(a3[0])
print()
print(a3[0, :, :])

[[1 0 2 5]
 [1 4 1 4]
 [4 6 4 9]
 [0 5 0 7]
 [6 0 6 6]
 [2 9 1 4]]

[[1 0 2 5]
 [1 4 1 4]
 [4 6 4 9]
 [0 5 0 7]
 [6 0 6 6]
 [2 9 1 4]]
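
Mixing integers and slices determines the shape of the result: each integer index removes an axis, while each slice keeps one. A sketch using a stand-in array `c` with the same shape as `a3`:

```python
import numpy as np

c = np.arange(5 * 6 * 4).reshape(5, 6, 4)  # stand-in with shape (5, 6, 4)

print(c[:, :, 0].shape)      # integer on axis 2: shape (5, 6)
print(c[0].shape)            # integer on axis 0: shape (6, 4)
print(c[:, 2, :].shape)      # integer on axis 1: shape (5, 4)
print(c[1:3, :, ::2].shape)  # slices only: still 3-dimensional
print(c[0, :, 1].shape)      # two integers: a 1-dimensional array
```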


## Creating copies

This might seem silly, but it is important. 

Here's a problem that raises no error. 

In [49]:
a2 = np.random.randint(10, size=(3, 4))
print(a2)

[[5 9 0 5]
 [5 2 9 1]
 [4 4 7 2]]


In [50]:
b2 = a2[1:, 1:]
print(b2)

[[2 9 1]
 [4 7 2]]


In [51]:
b2[0, 0] = -1
print(b2)

[[-1  9  1]
 [ 4  7  2]]


In [52]:
print(a2)

[[ 5  9  0  5]
 [ 5 -1  9  1]
 [ 4  4  7  2]]


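This might not be intended. A slice such as `a2[1:, 1:]` is a *view*: it shares its data with `a2`, so changing one changes the other. If you want to edit `b2` independently of `a2`, it needs its own copy of the data.

We can do this with the `copy` method.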

In [53]:
b2 = a2[1:, 1:].copy()
print(b2)

[[-1  9  1]
 [ 4  7  2]]


In [1]:
b2[0, 0] = 42
print(b2)
print()
print(a2)

[[42  9  1]
 [ 4  7  2]]

[[ 5  9  0  5]
 [ 5 -1  9  1]
 [ 4  4  7  2]]

Be careful out there.
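
When in doubt whether two arrays share data, NumPy can check directly with `np.shares_memory`. A minimal sketch with example arrays `a`, `view`, and `independent` (names chosen here):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a[1:, 1:]                 # slicing gives a view onto the same data
independent = a[1:, 1:].copy()   # .copy() allocates separate data

print(np.shares_memory(a, view))         # the view overlaps with a
print(np.shares_memory(a, independent))  # the copy does not

view[0, 0] = -1          # visible through a
independent[0, 0] = 42   # not visible through a
print(a)
```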

## Reshaping

We can reshape arrays into other appropriate sizes.

In [5]:
a1 = np.arange(10)
a2 = a1.reshape(2, 5)
print(a1)
print()
print(a2)

[0 1 2 3 4 5 6 7 8 9]

[[0 1 2 3 4]
 [5 6 7 8 9]]


In [6]:
a2[0, 0] = 10
print(a2)

[[10  1  2  3  4]
 [ 5  6  7  8  9]]


In [7]:
print(a1)

[10  1  2  3  4  5  6  7  8  9]


Therefore `reshape` does not, in general, make a copy: whenever possible it returns a view of the same data. Keep that in mind. 
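
A sketch of when `reshape` can and cannot share memory with the original, plus the `-1` placeholder that lets NumPy infer one dimension (all names here are chosen for the example):

```python
import numpy as np

base = np.arange(10)

# Contiguous data: reshape can reuse the same memory
grid = base.reshape(2, 5)
print(np.shares_memory(base, grid))

# -1 asks NumPy to work out that dimension from the total size
tall = base.reshape(-1, 2)
print(tall.shape)

# The transpose is not contiguous, so flattening it needs a copy
flat = grid.T.reshape(10)
print(flat)
print(np.shares_memory(base, flat))
```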

The following shapes all describe arrays with $n$ entries, yet they are all distinct:
$$
    (n,),\; (1, n),\; (n, 1),\; (n, 1, 1),\; (1, 1, n, 1, 1, 1),\; \text{etc}.
$$

In [8]:
a1 = np.arange(5)
a2_r = np.arange(5).reshape(1, 5)
a2_c = np.arange(5).reshape(5, 1)
a4 = np.arange(5).reshape(1, 5, 1, 1)

In [9]:
print(a1)
print(a2_r)
print(a2_c)
print(a4)

[0 1 2 3 4]
[[0 1 2 3 4]]
[[0]
 [1]
 [2]
 [3]
 [4]]
[[[[0]]

  [[1]]

  [[2]]

  [[3]]

  [[4]]]]


### `newaxis`

A common reshape takes a $1$-dimensional array and converts it to either a row or a column vector. 

`reshape` works here, but so does `newaxis`.

In [10]:
print(a1)

[0 1 2 3 4]


In [11]:
print(a1[np.newaxis, :])

[[0 1 2 3 4]]


In [12]:
print(a1[:, np.newaxis])

[[0]
 [1]
 [2]
 [3]
 [4]]


In [13]:
print(a1[np.newaxis, :, np.newaxis, np.newaxis])

[[[[0]]

  [[1]]

  [[2]]

  [[3]]

  [[4]]]]
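
`np.newaxis` is an alias for `None`, and each occurrence inserts an axis of length $1$ at that position. The sketch below checks the resulting shapes and compares with the equivalent `reshape` and `np.expand_dims` calls (the name `v` is chosen for this example):

```python
import numpy as np

v = np.arange(5)

print(np.newaxis is None)
print(v[np.newaxis, :].shape)                          # row vector
print(v[:, np.newaxis].shape)                          # column vector
print(v[np.newaxis, :, np.newaxis, np.newaxis].shape)  # three new axes

# Same results via reshape and expand_dims
print(np.array_equal(v[:, np.newaxis], v.reshape(5, 1)))
print(np.array_equal(v[np.newaxis, :], np.expand_dims(v, axis=0)))
```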


## Concatenating

As usual, with $1$-dimensional arrays the notion of concatenation is simple.

In [14]:
a1 = np.arange(5)
b1 = np.arange(20, 25)
c1 = np.arange(9, 2, -2)

In [15]:
np.concatenate([a1, b1, c1])

array([ 0,  1,  2,  3,  4, 20, 21, 22, 23, 24,  9,  7,  5,  3])

For higher dimensions, concatenation gets confusing. 

![](imgs/confused_thinking.png)

**We concatenate *along* an axis.**

For $1$-dimensional arrays, there is only one axis, so it is unambiguous. 

For $2$-dimensional arrays, there are two: vertical and horizontal. 

For an $n$-dimensional array, there are $n$ axes labeled $0$, $1$, up to $n-1$.

When we indexed an entry, we gave specific coordinates to the *axes*.

So `a3[i, j, k]` takes the entry in the $i^{th}$ position on axis 0, the $j^{th}$ position on axis 1, and the $k^{th}$ position on axis 2. 

#### Matrices

Since `a2[i, j]` takes the entry in row $i$ and column $j$, we know 
- axis 0 : rows
- axis 1 : columns

Say it again:

**We concatenate *along* an axis.**

If we concatenate *along* axis 0, we concatenate along the rows. This is a *vertical* concatenation.

In [16]:
a2 = np.arange(12).reshape(3, 4)
b2 = np.arange(42, 50).reshape(2, 4)
print(a2)
print()
print(b2)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

[[42 43 44 45]
 [46 47 48 49]]


In [17]:
print(np.concatenate([a2, b2], axis=0))

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [42 43 44 45]
 [46 47 48 49]]


If we concatenate *along* axis 1, we concatenate along the columns. This is a *horizontal* concatenation.

In [18]:
a2 = np.arange(12).reshape(3, 4)
b2 = np.arange(42, 48).reshape(3, 2)
print(a2)
print()
print(b2)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

[[42 43]
 [44 45]
 [46 47]]


In [21]:
print(np.concatenate([a2, b2], axis=1))

[[ 0  1  2  3 42 43]
 [ 4  5  6  7 44 45]
 [ 8  9 10 11 46 47]]
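
The rule behind both examples: the arrays must agree in every dimension except the one being concatenated along, and only that dimension grows. A sketch, including what happens when the shapes do not fit (array names are chosen for this example):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
tall = np.arange(42, 50).reshape(2, 4)   # same number of columns as a
wide = np.arange(42, 48).reshape(3, 2)   # same number of rows as a

print(np.concatenate([a, tall], axis=0).shape)  # rows add up: 3 + 2
print(np.concatenate([a, wide], axis=1).shape)  # columns add up: 4 + 2

# np.vstack and np.hstack are shorthands for these two cases
print(np.array_equal(np.vstack([a, tall]), np.concatenate([a, tall], axis=0)))
print(np.array_equal(np.hstack([a, wide]), np.concatenate([a, wide], axis=1)))

# Mismatched shapes along the other axis raise a ValueError
try:
    np.concatenate([a, wide], axis=0)
except ValueError as err:
    print("cannot concatenate:", err)
```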


I don't really want to go higher. 

## Splitting

The function `split` is, in some sense, the inverse to `concatenate`, so we'll go fast.

In [None]:
a1 = np.arange(8)
print(a1)

In [None]:
print(np.split(a1, 4))

In [None]:
print(np.split(a1, [3, 4]))

Running `np.split(a1, k)` for an integer $k$ returns `a1` split into $k$ *equal* sized arrays, each of length `len(a1) / k`.

If $k$ is not a divisor of `len(a1)`, an error is raised.

Running `np.split(a1, [i, j, k])`, where $i < j < k$ are integers, returns a list containing
```python
a1[:i],  a1[i:j],  a1[j:k],  a1[k:]
```

in that order.

The idea generalizes to higher dimensions using the keyword argument `axis`. 

As with concatenation, splitting happens *along* a given axis. 
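
A sketch of these calls on small arrays, together with a split along an axis of a $2$-dimensional array. The names `pieces`, `parts`, and `m` are chosen for this example.

```python
import numpy as np

a = np.arange(8)

pieces = np.split(a, 4)        # four equal pieces of length 2
print(pieces)

parts = np.split(a, [3, 4])    # a[:3], a[3:4], a[4:]
print(parts)

# An uneven split raises a ValueError
try:
    np.split(a, 3)
except ValueError as err:
    print("cannot split:", err)

# Splitting a 3 x 4 array along axis 1 gives two 3 x 2 arrays
m = np.arange(12).reshape(3, 4)
left, right = np.split(m, 2, axis=1)
print(left.shape, right.shape)
```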

## Exercises
1. Starting with a $1$-dimensional array of length $60$,
   reshape it into a $3$-dimensional array with dimensions
   of sizes $5$, $4$ and $3$, respectively.
2. Then split the array along the second dimension,
   the one of size $4$, into two halves.
3. What does `np.newaxis` mean, and what is it used for?