# Numpy Review

In this review, we are going to refresh our memories about the Numpy package. Numpy (numerical Python) is the basic engine that turns Python into a tool for data analysis. Anything "data sciency" that you do in Python will rely on Numpy at some point, either explicitly (when calling Numpy functions directly), or implicitly (when, e.g., using Pandas).

The point of Numpy is to make working with data in Python easier and faster. Here, we are going to remind ourselves of basic Numpy functionality.

## Python lists

First, let's look at a basic Python list:

In [None]:
a_list = [2, 4, 6, 8]

---

Just to warm up, let's get some items from our list via *indexing*:

Get the first number from the list (the "zeroth" number in Pythonese):

Get the last two numbers:

---

Now let's make a *nested* Python list:

In [None]:
a_nested_list = [[2, 4], [3, -1], [-2, 1]]

Remember that a Python list can hold data of different types, lengths, etc., but this list is special; it is a *list of lists* all of the same length.

Let's have a look:

In [None]:
print(a_nested_list)

How would we get the first entry in the second list? 

We could do it in two steps... first, get the second list:

In [None]:
sec_list = a_nested_list[1]

And then get the first entry:

In [None]:
my_num = sec_list[0]
print(my_num)

Conveniently, we can just do this in one go:

In [None]:
my_num = a_nested_list[1][0]
print(my_num)

This does the same thing without having to invoke the intermediate variable `sec_list`. 

While this list-of-lists construct might seem a little abstract, there is actually a nice way to wrap our heads around it, which is to think of it as a *matrix*.

> Unfortunately, no one can be told what the Matrix is. You have to see it for yourself. *- Morpheus*

Here is an example of a matrix:

![A matrix](./images/mailboxes.png)

A ***matrix*** is a 2 dimensional (2D) arrangement of things and, for our purposes, the things are data of one form or another (numbers, strings, timestamps, etc.).

These mailboxes are numbered sequentially, because there are other mailboxes in other matrices and that's just how the USPS rolls. But, notice that, for *this* matrix of mailboxes, there is another way in which we could uniquely refer to each mailbox. Specifically, we could uniquely specify each mailbox by the *row* it is in and the *column* it is in.

For example, the open mailbox with the key in the door is in the 2nd row and the 4th column or, in terms of Python indexes, the mailbox is at location [1, 3]. 

So, we can think of a matrix as an arrangement of data that has a built-in ***spatial coordinate system*** used to refer to the items of data.

Here's another Python list of lists:

In [None]:
another_nested_list = [[3.3, 2.3, 2.2], [1.2, 7.8, 8.7], [4.8, 2.2, 6.5],
                       [1.5, 7.5, 9.5], [5.9, 1.6, 7.7]]

As far as Python is concerned, this is just a list that happens to contain 5 lists, each of length 3:

In [None]:
print(f'Another nested Python list: {another_nested_list}')

But it makes sense for our human brains to think about it as a 2D arrangement of data, like this: 

| row # | col 0 | col 1 | col 2 | col 3 | col 4 |
| ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 3.3 | 1.2 | 4.8 | 1.5 | 5.9 |
| 1 | 2.3 | 7.8 | 2.2 | 7.5 | 1.6 |
| 2 | 2.2 | 8.7 | 6.5 | 9.5 | 7.7 |

Now we can think of the Python indexes used to access the data as spatial ***row*** and ***column*** coordinates. For example:

In [None]:
another_nested_list[2][1]

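fetches the data value in the second row (row index 1) and the third column (column index 2) of the table. Note that, with the sub-lists laid out as columns, the first index picks the column (the sub-list) and the second index picks the row (the position within that sub-list).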

Even though, in Python terms, `another_nested_list` is just a list of lists that all happen to be of the same length, it's very helpful for us to map data like this onto a matrix and think of the indexes as coordinates.
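
To make the index-to-coordinate mapping concrete, here is a minimal sketch using a small made-up list. The names `columns` and `rows` are illustrative and not part of the notebook, and it assumes, as in the table above, that each sub-list is treated as a column.

```python
# Illustrative data (not from the notebook): 2 sub-lists, each of length 3.
columns = [[10, 20, 30], [40, 50, 60]]

# The first index picks the sub-list (a column in the table view);
# the second index picks the position within it (the row).
value = columns[1][2]   # second sub-list, third entry
print(value)

# zip(*...) regroups the entries by position, giving the table's rows.
rows = [list(r) for r in zip(*columns)]
print(rows)
```

Plain Python has no notion of rows or columns here; the regrouping with `zip` is just one way to see the same numbers in the other orientation.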

## Numpy

Numpy is a big and powerful package, but you can think of its most basic function as making this matrix-like way of thinking about data explicit, as opposed to just a cute way of thinking about lists of lists.

To use numpy, we first import it. Traditionally, it is imported under the name "`np`".

In [None]:
import numpy as np

Now we can convert our latest nested list into a numpy matrix using numpy's `array()` function:

In [None]:
our_numpy_matrix = np.array(another_nested_list)

Let's look at our new matrix!

In [None]:
print(our_numpy_matrix)

And compare it to the Python list from which we created it.

In [None]:
print(another_nested_list)

We can see that the numpy version has made the spatial row x column arrangement explicit.

Now, getting data values is easy peasy: we just index into our new matrix with the row and column coordinates of the desired value:

In [None]:
our_numpy_matrix[2,0] # get the data at the third row and first column

Let's see what data type our new matrix is, according to Python:

In [None]:
type(our_numpy_matrix)

So our new matrix is a Python object made by numpy of type "ndarray", which is short for "N-dimensional array" – we'll unpack this in a bit.
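
As a quick, hedged illustration of what an `ndarray` knows about itself, the sketch below builds a small made-up array (the name `demo` is hypothetical) and inspects a few of its standard attributes.

```python
import numpy as np

# Illustrative 2 x 3 array (made-up values).
demo = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(type(demo).__name__)  # class name of the object
print(demo.ndim)            # number of dimensions (the "N" in ndarray)
print(demo.shape)           # size along each dimension: (rows, columns)
print(demo.size)            # total number of elements
```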

But our object contains other objects (in that Pythonic way), so let's see what they are:

In [None]:
type(our_numpy_matrix[2,0])

So they are floating point numbers, also defined in numpy, that presumably have a few more bells and whistles than regular Python floats (the 64 at the end means that 64 bits are used to store each number; this is the number's *precision*).
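
A small sketch of how a numpy float relates to a regular Python float, using a made-up two-element array (the name `x` is illustrative):

```python
import numpy as np

x = np.array([1.5, 2.5])[0]   # pull one element out of a float array

print(type(x))                 # numpy's 64-bit float type
print(isinstance(x, float))    # np.float64 is also a Python float
print(x.itemsize)              # bytes per number: 8 bytes = 64 bits
print(x.is_integer())          # ordinary float methods still work
```

So anything that accepts a Python float should generally accept one of these as well.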

You may have noticed that our new matrix is a little different than the way we laid out our numbers in the table above. There, we made each sub-list into a column, whereas the `array` function seems to have made each list into a row.

Fear not! Numpy objects, like all Python objects, "know" how to do things; they have *methods* and *attributes*. The need to turn the rows into columns and vice versa is very common – it is called *transposing* a matrix – so numpy arrays have a *transpose* attribute, `T` (used without parentheses).

In [None]:
transposed_matrix = our_numpy_matrix.T
print(transposed_matrix)

Notice that the value 4.8 used to be in the third row of the first column, but now its coordinates have been flipped:

In [None]:
transposed_matrix[0,2] # new location of 4.8

In [None]:
transposed_matrix[2,0] # value at the old location
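
The same idea on a smaller, made-up array may make the index swap easier to see. The names `m` and `t` are illustrative; the sketch only assumes the standard `.T` attribute and `transpose()` method.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
t = m.T                         # shape (3, 2)

print(m.shape, t.shape)
print(m[0, 2], t[2, 0])         # same value, indices swapped

# .T is an attribute (no parentheses); m.transpose() is the method form.
print(np.array_equal(t, m.transpose()))
```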

---

In the code cell below, make a Python list of lists, create a numpy matrix from it, transpose it, and access 3 of its values.

### Numpy arrays 

So far, we've been talking about our data above as a "matrix", yet we used the `array()` function to make it, and Python tells us that our matrix is an `ndarray` – what's going on?

#### Types of arrays

"Array" is a general term for a structured collection of data, and can have any number of dimensions (hence "ndarray" for "N-dimensional array"). Here's an (empty) 3 dimensional array:

![A 3D array](images/array.png)

If this array were named "phred", we would index it just like above, but with an extra index – the coordinate specifying the location along the third dimension. So `phred[5, 5, 3]` would specify the bottom right location just peeking out on the fourth – what? – "page" of the array.

Though arrays can have any number of dimensions, lower-dimensional arrays are common and get their own special names.

A "matrix" is an array of 2 dimensions. As you already know, a matrix is a universal format for data and is preferably in "tidy" format, where each row is an observation and each column is a variable.

A "vector" is a list of numbers, so named because a simple list of numbers is used in math (linear algebra) and physics to specify vectors (such as force). Vectors can be 

* "row vectors" - a matrix with a single row
* "column vectors" - a matrix with a single column
* a list of numbers with only a single dimension, like a Python list

Finally, in this lingo, a single number is referred to as a "scalar" (because multiplying a vector by a number scales the length of the vector without changing its direction).
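
In numpy terms, these flavors differ only in their number of dimensions and shape. Here is a minimal sketch with made-up numbers (all variable names are illustrative):

```python
import numpy as np

flat = np.array([2, 5, 7])          # 1D: a single dimension
row = np.array([[2, 5, 7]])         # 2D: one row, three columns
col = np.array([[2], [5], [7]])     # 2D: three rows, one column
scalar = np.array(2)                # 0D: a single number

for name, arr in [("flat", flat), ("row", row), ("col", col), ("scalar", scalar)]:
    print(name, arr.ndim, arr.shape)
```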

---

In the cell below, make a Python list or tuple containing the x and y coordinates of a point (any point you like – I'm a big fan of x=3, y=1 personally). On a piece of paper or a drawing program or whatever, plot the point in an x,y coordinate system, and draw an arrow – a vector! – from the origin to your point.

In [None]:
v = (3, 1)

Now convert your Python object into a numpy array.

In [None]:
va = np.array(v)

Get the shape of your new vector using the `shape` attribute (used just like `T` above, without parentheses).

In [None]:
va.shape

Multiply your vector by 2 (if your vector is named "Velma", then you would literally do `Velma * 2`).

Plot your new vector and confirm that the multiplication *scaled* the original vector up by a factor of 2 without changing its direction!

---

We've just illustrated an awesome thing about numpy ndarrays: if we want to do simple operations on an entire array, we don't need to do it element-by-element, we just do it on the entire array in one go! Despite operating on arrays in general, this property is referred to as "*vectorization*" of operations.
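
For comparison, here is a hedged sketch of the same operation done element-by-element and done in one go, on a made-up array (names are illustrative):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Element-by-element, the plain-Python way.
squared_loop = [v ** 2 for v in values]

# Vectorized: one expression applied to every element.
squared_vec = values ** 2

print(squared_loop)
print(squared_vec)

# A Python list treats * differently: it repeats the list instead.
print([1, 2] * 2)
```

The last line is a reminder that this behavior belongs to numpy arrays, not to Python lists.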

#### Making numpy arrays 

##### arrays from lists or tuples

We've already seen that we can make numpy arrays from Python lists. Like this:

In [None]:
print(f'A python list: {a_list}')

a_numpy_thing = np.array(a_list)

print(f'A numpy thing: {a_numpy_thing}')

Or this:

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6]])
a

---

In the cell below, confirm that you can make a numpy array from a Python tuple.

##### ones and zeros

If we're creating or reading in data from a source other than a Python list or tuple, we need a place to put it. To make new arrays to hold stuff, we first create an array filled with something. Most of the time, it doesn't matter what we fill it with. We commonly fill new arrays with ones or zeros.

In [None]:
a_vec = np.ones((3,2))
print(f'an array of ones:\n {a_vec}')

---

In the cell below, make an array of zeros.

---

##### Any number 

We can also initialize an array to any value we want.

In [None]:
the_answer = np.full((3,3), 42)
print(f'Every cell has the answer!:\n {the_answer}')

---

In the cell below, get the same result as above (a 3x3 array of 42s) using 1) `np.ones` and 2) `np.zeros`. 
(Hint: take advantage of vectorization using `*` and `+`)

##### random numbers

In data science, we often add random noise to simulations in order to capture the random variability present in the universe and the data we get from it. We can do this using any number of functions in `np.random`. For example

In [None]:
my_noise = np.random.randn(4,4)
print(f'C\'mon feel the noise! \n {my_noise}')

Which made normally distributed (Gaussian) noise. 

We can also make noise that is uniformly distributed:

In [None]:
unif_noise = np.random.rand(4,4)
print(f'Moar noise! \n {unif_noise}')

Or we can make random integers:

In [None]:
int_noise = np.random.randint(1, 11, (4,4))
print(f'Random integers! \n {int_noise}')
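
The functions above draw different numbers on every run. When reproducible noise is wanted, one option is numpy's `Generator` interface, created with `np.random.default_rng`. The sketch below is an alternative to the calls above rather than what the notebook uses, and the seed value is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed value is arbitrary

gauss = rng.standard_normal((4, 4))       # like np.random.randn(4, 4)
unif = rng.random((4, 4))                 # like np.random.rand(4, 4); values in [0, 1)
ints = rng.integers(1, 11, size=(4, 4))   # like np.random.randint(1, 11, (4, 4))

# Re-creating the generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=0)
print(np.array_equal(gauss, rng_again.standard_normal((4, 4))))
```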

##### the identity matrix

Finally, we can make an "identity matrix", a matrix with 1s running down the diagonal. It's useful for linear algebra applications, and is included here only for completeness.

In [None]:
aye = np.eye(4,4)
print(f'Aye aye Cap\'n! \n {aye}')
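
As a brief aside on why it is called the "identity": under matrix multiplication (numpy's `@` operator) it leaves another matrix unchanged. A minimal sketch with made-up values (the names `m` and `eye` are illustrative):

```python
import numpy as np

m = np.array([[2.0, 4.0],
              [3.0, -1.0]])
eye = np.eye(2)

# Matrix multiplication (the @ operator) by the identity returns the original.
print(np.array_equal(m @ eye, m))

# Element-wise multiplication (*) is different: it keeps only the diagonal.
print(m * eye)
```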

---

In the cell below, make a 6x6 array containing random integers from -10 to 10.

#### "vectorized" operations 

As mentioned above, most of the common Python operators become *vectorized* in numpy, which means that we can do the same thing to every element of our arrays in one go.

Let's add 11 to all our Gaussian random numbers from above.

In [None]:
amp_noise = my_noise + 11
amp_noise

Let's see which ones go to 11!

In [None]:
amp_noise >= 11

If arrays are the same size, we can do element-by-element things easily.
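
A tiny sketch with two made-up arrays of the same shape, where the results are easy to check by eye (the names `a` and `b` are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 2], [0, 40]])

print(a + b)      # element-by-element sum
print(a > b)      # element-by-element comparison, gives booleans
print(a == b)     # True only where the two entries match
```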

Which elements of the Gaussian noise are greater than the corresponding elements of the uniform noise?

In [None]:
my_noise > unif_noise

Add our Gaussian and integer noise element-by-element:

In [None]:
my_noise + int_noise

Divide our identity matrix by our integer noise – everything off the main diagonal should be zero...

In [None]:
aye / int_noise

---

In the cell below, make a 10x5 matrix containing normally distributed random numbers with a mean of about 100 and a standard deviation of about 15 using vectorized operations.

#### Making numpy sequences 

When computing things like functions (in the math sense), we need to start by laying down an x-axis (a working domain of the function). There are two numpy functions, `arange()` and `linspace()`, that make this easy for us.

The function `arange()` allows us to specify the start and stop of our domain, and a step size (which defaults to one). The stop value itself is not included, so we can make a sequence of the numbers 1 to 10 like this:

In [None]:
my_domain = np.arange(1,11)
print(f'my x axis is: \n {my_domain}')

We can also specify a step size with a third argument. Like this:

In [None]:
my_domain = np.arange(1,11,2)
print(f'my x axis is: \n {my_domain}')

The function `linspace()` is similar, but allows us to specify the number of numbers we need, and it figures out the step size for us. Like this:

In [None]:
my_domain = np.linspace(1,11,10)
print(f'my x axis is: \n {my_domain}')
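
The main practical difference between the two functions is how they treat the stop value. A minimal side-by-side sketch (the names `by_step` and `by_count` are illustrative):

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)     # start, stop, step: stop is excluded
by_count = np.linspace(0, 1, 5)     # start, stop, count: stop is included

print(by_step)
print(by_count)
```

Both cover the same range with the same spacing, but only `linspace()` lands on the stop value by default.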

If we look at the shape of the object created by either `arange()` or `linspace()`, we see that they are 1D:

In [None]:
my_domain.shape

Note that there is only one dimension, so we only need a single index to access a value:

In [None]:
my_domain[3]

We can turn this into either a row vector or a column vector by adding a second dimension. Adding a new dimension will make it, technically, a matrix – a matrix with only one row or one column, respectively.

In [None]:
my_domain_row = my_domain[np.newaxis,:]
print(f'my x axis row vector is: \n {my_domain_row}')

This looks the same, but let's check its shape:

In [None]:
my_domain_row.shape

So now it has 1 row and 10 columns. We therefore use a row *and* a column index to get a value.

In [None]:
my_domain_row[0, 3]
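
`np.newaxis` is not the only way to add a dimension. As a hedged alternative, the `reshape` method gives the same row vector; the sketch below uses a short made-up sequence (names are illustrative).

```python
import numpy as np

x = np.arange(5)                 # shape (5,)
row_a = x[np.newaxis, :]         # shape (1, 5)
row_b = x.reshape(1, -1)         # same result; -1 means "infer this size"

print(row_a.shape, row_b.shape)
print(np.array_equal(row_a, row_b))
```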

---

In the cell below, make a column vector out of `my_domain`, check the shape, and get the third entry.

#### Indexing cells 

Cells and subsets of numpy arrays are accessed – "indexed" – much like Python lists and tuples are.

We've already done a fair amount of indexing, but let's get a bit more flexible. 

#### Indexing rows and columns

We can fetch entire rows or columns using the colon, `:`. Let's try this on our `my_noise` array. First, let's look at it again:

In [None]:
print(my_noise)

Now let's get the first column:

In [None]:
my_noise[:,0]

The colon means "everything on this dimension", so the above command means "get all the rows in the first column of `my_noise`".

---

In the cell below, get the second (index = 1) row of data.

Now check the shape in the cell below.

Now get the first row and check the shape:

Our output in both cases above is a 1D vector. So what do we do if we want to grab, say, a column, and preserve it as a column? Easy! We just specify a starting and stopping index. Like this:

In [None]:
my_noise[:,0:1]

You can see from the output that this is a column, but check its shape in the cell below to be sure:

#### Indexing subsets ("slicing")

What we have started doing above is called "slicing", which is carving out ("slicing") subsets of data from an array.

The key to slicing is the colon, `:`, operator. 

Let's play with a 1D vector, `my_domain` first. Let's remind ourselves of it:

In [None]:
print(my_domain)

If we put a number on either side, we can read it as "from the first index up to, but not including, the second index":

In [None]:
my_domain[1:4]

If we just put an index on the left, we can read it as "from the index to the end". Like this:

In [None]:
my_domain[2:]

If we just put an index on the right, we can read it as "from the beginning to the index". Like this:

In [None]:
my_domain[:2]
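
Here are the same slicing patterns on a made-up sequence whose values differ from their indexes, which makes it easier to see what is included. The name `x` is illustrative, and the negative-index and step forms are extras not used above.

```python
import numpy as np

x = np.arange(10, 20)    # 10, 11, ..., 19

print(x[1:4])    # indexes 1, 2, 3 (the stop index 4 is excluded)
print(x[:3])     # first three values
print(x[-3:])    # last three values
print(x[::2])    # every second value
```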

---

In the cell below, slice out the 3rd through 5th values of `my_domain`.

---

The extension of slicing to a matrix is straightforward. You just do your slicing on each dimension separately.
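
A minimal sketch of slicing each dimension separately, on a small made-up grid (the name `grid` is illustrative). It also contrasts an integer index, which drops a dimension, with a one-element slice, which keeps it.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # rows 0-2, columns 0-3

print(grid[:2, 1:])      # first two rows, columns 1 onward
print(grid[1, :])        # integer index: row 1 comes back as 1D
print(grid[1:2, :])      # slice: row 1 kept as a 1 x 4 matrix
```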

Here are some examples of array indexing from Python for Data Analysis by Wes McKinney:

![Array Indexing](images/arrayIndexing.png)

---

In the cell below, try some of these slices on `my_noise`.

#### Summaries of a matrix

A numpy matrix has many methods to compute things about itself, like the sum or mean of its values.

Here's the sum of all the elements of `my_noise`:

In [None]:
my_noise.sum()

Here's the arithmetic mean:

In [None]:
my_noise.mean()
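
A few other summary methods work the same way. The sketch below uses a small made-up array (the name `scores` is illustrative) so the results can be checked by hand.

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

print(scores.min(), scores.max())   # smallest and largest values
print(scores.std())                  # standard deviation of all values
print(scores.sum() / scores.size)    # the mean, computed by hand
```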

In the cell below, try `mean(0)` and `mean(1)` – what do these do?

Describe what these do here (edit this cell).

In the cell below, do a `dir(my_noise)`:

Now, in the cell below, see if you can use a method of `my_noise` to round all the numbers to the nearest integer: