# `NumPy`: Numerical operations in Python

Background about NumPy: check out the SciPy paper.

Data scientists spend a lot of time – wait for it – working with ***data***! To work with **data**, it is critical to organize it in a way that facilitates the analyses we might need to do. So organizing data means guessing what type of work we will want to do with the dataset. And, odd as it may seem, good guessing requires some practice. A good way of organizing data should:

* store the data in a clear and systematic way
* provide methods to access the data that are simple and straightforward
* be flexible enough to allow us to modify the format of the data for various needs

We have already met a data container that is commonly used and that we will ultimately be using a lot: the Pandas DataFrame. So far, we have imported data directly into a DataFrame, have passed that DataFrame to plotting functions, and then accessed columns of data using the column names. We have even seen that DataFrames "know" how to plot the data inside them. In more technical terms, this means that the DataFrame is an object that has basic methods to plot and explore the data it contains.

Next, to really understand and use DataFrames, it is of great help to understand and use what they are built upon, which are the `numpy arrays`. Numpy arrays are grids or tables for holding, accessing, and manipulating data. DataFrames are literally built on top of numpy arrays, providing many bells and whistles, such as column names, neat formatting, etc. 

Numpy arrays are simpler than Pandas DataFrames. They are created and accessed in ways that are very similar to the ways Python `lists` are created and accessed. Lists are among the simplest built-in data containers. This is no accident of course; if you can work with one, then it's easy to work with the other.

So what we are going to do is to learn how to work with `lists` (lists are handy!), then we will graduate to `numpy arrays` and see what they can do. Finally, we'll step back up to `DataFrames` – with our newfound understanding of what's under the hood – in this and the next tutorial.

After all this, you will be `Masters of data` or at least know more about Python.

### Python lists

We have covered Python `lists` and other datatypes in previous tutorials. A Python `list` (a list of things) is made using `[square brackets]`.

For example:

In [None]:
from math import pi   # pi is not built in, so we import it
mylist = ['this', 3, 'list', 4+2j, pi]

We can address elements in a list by using indices and the `:` (colon) operator.

In [None]:
mylist[0:3]

We can read this as "Give me all the elements in the interval from 0 **inclusive** to 3 **exclusive**."

I know this is weird. But at least for any two indexes `a` and `b` that fall within the list, the number of elements you get back from `mylist[a:b]` is always equal to `b` minus `a`, so I guess that's good!
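
To see the `b` minus `a` rule in action, here is a small sketch. The list `letters` is a made-up example (it is not used elsewhere in this tutorial), and the rule holds as long as both indexes fall within the list.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']   # hypothetical example list

a, b = 1, 4
chunk = letters[a:b]        # elements at indexes 1, 2 and 3
print(chunk)                # ['b', 'c', 'd']
print(len(chunk) == b - a)  # True
```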

We can get any consecutive hunk of elements using `:`.

In [None]:
mylist[2:5]

If you omit the indexes, Python will assume you want everything.

In [None]:
mylist[:]

### Numpy Arrays

Numpy arrays were designed to be lists with superpowers, so everything we learned in the previous section will apply to numpy arrays as well!

Not to state the obvious, but to use numpy arrays, we'll need to `import numpy`.

In [None]:
import numpy as np

Then we can make an array in much the same way we make a list.

In [None]:
myarr = np.array([2, 4, 6, 8, 9, 10])

In [None]:
myarr

From then on, all the indexing we've learned so far applies directly!

In [None]:
myarr[4]

In [None]:
myarr[-3:]

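We can make an array directly out of a list. Let's first redefine `mylist` so that it holds only numbers (the first six primes) and has the same length as `myarr`.

In [None]:
mylist = [2, 3, 5, 7, 11, 13]   # a list of numbers: the first six primes
myarr2 = np.array(mylist)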

In [None]:
myarr2

And then of course we can index it exactly the same way, so... *Wait, why are we making arrays now? What's the difference?*

One **huge** difference shows up when we want to do some math. Basic Python lists can hold any mix of data types, so Python does not assume that arithmetic makes sense for them; adding a number to a list simply fails.

In [None]:
mylist + 5

`numpy` arrays, by contrast, store elements of a single type. When that type is numeric, arithmetic is applied to every element in one go. So, whereas the addition above did not work when using the `list`, it does work when using a `numpy` array built from a list of numbers!

In [None]:
myarr2 + 5

Now **that** seems like it might be useful!
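
Here is a minimal side-by-side sketch of that difference, using a small made-up list called `nums` (an assumption for illustration, not a variable from above).

```python
import numpy as np

nums = [1, 2, 3]            # hypothetical list of numbers

try:
    nums + 5                # lists do not know how to add a number
except TypeError as err:
    print('list + 5 failed:', err)

arr = np.array(nums)
result = arr + 5            # arrays add 5 to every element
print(result)               # [6 7 8]
```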

We can even add two arrays, or subtract them, or whatever!

In [None]:
myarr + myarr2

We can combine our 2 arrays into a single ***two dimensional (2D) array***.

In [None]:
twoDarr = np.array([myarr, myarr2])

In [None]:
twoDarr

Simple though this may seem, *2D arrays just like this are the bedrock of data analysis!* Arrays of real data are usually larger – sometimes much much larger! – but all the principles are the same, and all you as a Data Scientist need to remember is the dimensionality of the data arrays. Python will then compute what you ask for.

So one important thing to know about arrays, besides the ability to do maths, is the shape of arrays. Indeed, numpy arrays have built-in functionality to tell us their shape.

In [None]:
twoDarr.shape

Unlike lists, which are always just lists, arrays can come in any shape. So it's *really* convenient that they can tell us what shape they are straight away.

Indexing into 2D arrays is a straightforward extension of indexing into 1D arrays or lists. We just provide a second index after a `,` (comma). Like this.

In [None]:
twoDarr[1,3]

The first index refers to the **row index**, and the second to the **column index**. In this case, we're asking for the value in the second row and the fourth column, which is indeed 7 (remember *the first row and column are index=0!*).

We can play all the same games indexing with 2D arrays as we can with 1D arrays, we just have to remember that everything before the comma `,` refers to the *rows* in that it specifies locations along the *vertical dimension*, and everything after the comma `,` refers to the *columns* in that it specifies locations along the *horizontal dimension*.

So this:

In [None]:
twoDarr[:,0:3]

means "Give me all the rows (the colon `:`) in the first 3 columns (the `0:3`)."

I told you that the colon all alone by itself would end up being useful!!! In this case, for example, by using the `:` you do not need to type many indices (one per row), and you do not even need to remember how many rows there are. Just use `:` and Python will return all the elements.

A few more examples:

In [None]:
# the last row (regardless of the number of rows;
# again, you do not need to know how many rows exist)
twoDarr[-1,:] 

In [None]:
twoDarr[:,-2:] # last two columns

In [None]:
twoDarr[0,::2] # first row, every other column

To get good at this, you don't need natural born talent or anything like that. Like so much in life, the key is *practice, practice, practice*!!! So play around! You can't break your computer or anything!

Another neat trick that arrays can do is *transpose* themselves, flipping the rows for columns.

(Hold your right hand in front of your face so that you're looking at your palm with your fingers pointing towards the left. Now flip your hand so that you're looking at the back of your hand with your fingers pointing up. You just *transposed* your hand such that the first row (your pointer finger) became the first column!)

In [None]:
colarr = twoDarr.T

In [None]:
colarr

Why would we want to do that? By convention, *variables* in datasets should correspond to the columns, and *observations* should correspond to the rows. So we have taken data in which this was not so and turned it into an array in which the columns are the first few non-prime numbers and the prime numbers, respectively, and the rows correspond to the instances in order (1st, 2nd, 3rd, ...).

We have just done a little of what is known as **data wrangling**. While not as fun as data visualization, data wrangling is often a big part of any analysis project!

Now that we have the data into shape, we can unleash all the powers of numpy arrays, powers which pandas DataFrames will inherit and build upon!

For example, who's bigger overall, the primes or the non-primes?

In [None]:
colarr.sum(0)

The primes win! 
In `colarr.sum(0)`, the 0 means "the first (vertical) dimension", i.e., sum the values *across the rows* within each column. To sum along the second dimension, we do:

In [None]:
colarr.sum(1)
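
To make the two directions concrete, here is a small sketch with a made-up 3x2 array called `small` (a hypothetical example, chosen so the sums are easy to check by hand).

```python
import numpy as np

small = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])   # hypothetical array: 3 rows, 2 columns

col_sums = small.sum(0)   # add down the rows: one total per column
row_sums = small.sum(1)   # add across the columns: one total per row

print(col_sums)   # [ 9 12]
print(row_sums)   # [ 3  7 11]
```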

So any numpy array knows how to add up the numbers in it by row or by column (see what happens if you leave off the dimension, like this: `colarr.sum()`). The list of things that numpy arrays can do themselves is pretty impressive.

Check it out [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

(or paste this into your browser: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)

Often, we want to create an array that we know we're going to put values in later. For example, we might be planning on doing a computation that will result in 3 sets of 7 values, and we want to be able to store them directly into an array. We can pre-make an array filled with zeros with `np.zeros((r, c))`, where `r` and `c` are the numbers of rows and columns (note that the shape is passed as a single tuple).

In [None]:
myzeros = np.zeros((7, 3))

In [None]:
myzeros
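
One detail worth a quick sketch: `np.zeros` takes the shape as a single tuple, which is why there are double parentheses above. The variable name `results` is made up for illustration.

```python
import numpy as np

results = np.zeros((7, 3))   # 7 rows, 3 columns, all zeros
print(results.shape)         # (7, 3)

results[:, 0] = 1.5          # later on, fill in the first column
print(results[:, 0].sum())   # 10.5

try:
    np.zeros(7, 3)           # without the tuple, the 3 is read as a data type
except TypeError as err:
    print('oops:', err)
```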

So those are the basics of numpy arrays. They:

* store values in rows and columns
* are indexed from zero along each dimension (like lists)
* can be accessed using
    - square brackets `[]` with row and column indexes separated by a comma
    - integer indexes (including negative "start from the end" indexes)
    - a colon `:` (or two if you want a step value other than 1)
* can have maths done to every element in one go
* can be added, subtracted, etc. from one another
* have superpowers! they can compute stuff along their rows and columns!

### `for` loops and numpy arrays

In previous tutorials we have learned about `for` loops using simple Python datatypes. From here on, we will use the newly discovered `NumPy` arrays and practice with them by going over some of their functionality and their relation to `for` loops.

One great thing about `for` loops is that we can use them to go through the rows or columns of an array (or both!) in turn, repeating some operation on each one. Let's say we need to put the numbers of the binary sequence (2, 4, 8, 16...) in the columns of a 10x5 array for some future simulation.

We could do that this way:

In [None]:
nRows, nCols = 10, 5   # Python lets us do this!
myArraySize = (nRows, nCols)  # we'll make a 10x5 array. Rows always come first!
anArray = np.zeros(myArraySize)

anArray[:,0] = 2
anArray[:,1] = 4
anArray[:,2] = 8
anArray[:,3] = 16
anArray[:,4] = 32

anArray

That works, no doubt. But 

1. there's a lot of "hand coding", which is prone to mistakes
2. it would be a pain to scale up to huge arrays (as we already know)
3. it's ugly 

Now let's do this a cleaner and much more scalable way using a `for` loop.

In [None]:
nRows, nCols = 10, 5   # make variables for length and width of our array
myArraySize = (nRows, nCols)    # we'll make a 10x5 array. Rows always come first!
ourNumbers = [2, 4, 8, 16, 32]  # numbers that we'll set each column to
anArray = np.zeros(myArraySize) # make an array to hold our numbers

for i in range(nCols) :
    anArray[:,i] = ourNumbers[i]
    
anArray

And we get the same result.

So we've swapped this:

```
anArray[:,0] = 2
anArray[:,1] = 4
anArray[:,2] = 8
anArray[:,3] = 16
anArray[:,4] = 32
```

(Yuk.)

for this:

```
for i in range(nCols) :
    anArray[:,i] = ourNumbers[i]
```
    
(Nice.)

which is already a huge improvement. But imagine if we were working with a 1,000 or 10,000 element array! Doing it the first way – well – you can imagine. But doing it the second way, all we would have to do is change `nCols` and be a bit clever and compute `ourNumbers` automatically.

Wait, what? How would we compute the binary sequence – the powers of 2 – automatically?

With a `for` loop of course! Let's do that!

In [None]:
ourNumbers = list()     # Make an empty Python list
for i in range(nCols) :
    thisNumber = 2**(i+1)          # compute 2 to the right power
    ourNumbers.append(thisNumber)  # and append it to our list 
ourNumbers

---

Puzzle time! Rewrite the above code without using the `thisNumber` variable (so there should only be one line inside the `for` loop).

---

Okay, now we can write our code to populate the numpy array in a way that is completely scalable using a single `for` loop:

In [None]:
nRows, nCols = 10, 5   # Python lets us do this!
myArraySize = (nRows, nCols)  # we'll make a 10x5 array. Rows always come first!
anArray = np.zeros(myArraySize)

for i in range(nCols) :
    anArray[:,i] = 2**(i+1)
    
anArray

Notice that, now, the ***only*** thing we need to change to compute and add more or fewer powers of 2 to our array is a single value – nCols in this case – *everything else is done automatically!*
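
As an aside, because arrays can have maths done to every element in one go, the same array can also be built without a loop. This is only a sketch of an alternative, not a replacement for practicing `for` loops; it relies on `np.arange` and on NumPy copying a single row of values into every row of the array.

```python
import numpy as np

nRows, nCols = 10, 5
powers = 2 ** np.arange(1, nCols + 1)   # the powers of 2: 2, 4, 8, 16, 32

anArray = np.zeros((nRows, nCols))
anArray[:, :] = powers                  # copy the row of powers into every row

print(anArray[0])    # [ 2.  4.  8. 16. 32.]
```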

---

###### Coding challenges!

Write code (using a `for` loop of course) to compute the cube of the odd numbers from 1 to 9. (Remember that `range()` can take a step argument.)

Write scalable code to compute the first "n" numbers of the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number). The Fibonacci sequence (named for the famous 13th century mathematician) starts with the numbers 0 and 1, and each number after that is the sum of the previous two numbers. (Galileo, da Vinci, and Franco aren't the only famous Italian scientists/mathematicians!). 

#### Nested for loops

A great thing about `for` loops is that they can be *nested* inside one another. This is best illustrated by example, so let's look at one and dissect it.

In [None]:
nRows, nCols = 4, 3            # (easily changeable) array height and width
myArraySize = (nRows, nCols)    # handy tuple of the size 
anArray = np.zeros(myArraySize) # make the array

for i in range(nRows) :
    for j in range(nCols) :
        anArray[i,j] = i + j*nRows
        print('Hi! I\'m in row ', i, ' and column', j, '!')  # i is the row, j is the column
        
anArray

So what's happening? In the first or "outer" loop, `for i in range(nRows) :`, we're going to step through the numbers 0 to 3, corresponding to the row indexes. 

At each value of `i`, the entire second or "inner" loop, `for j in range(nCols) :` is going to run, stepping through each value of `j`, corresponding to the column indexes. 

At each value of `j`, we stick a number in the `[i, j]` cell (`anArray[i,j] = i + j*nRows`), print a little message, and move on to the next value of `j`.

Once the inner loop is complete, we jump out into the outer loop, increment `i` by 1, and then jump back into the inner loop and do the whole thing again! After `i` has run its course from 0 to `nRows - 1`, we say farewell to that loop and go on our way!
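
To watch the order in which the two loops visit the cells, here is a small sketch that records each `(i, j)` pair instead of filling an array. The list name `visited` and the tiny 2x3 size are made up for illustration.

```python
visited = []                 # we'll record each (row, column) pair in order

for i in range(2):           # outer loop: rows 0 and 1
    for j in range(3):       # inner loop: columns 0, 1 and 2
        visited.append((i, j))

print(visited)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Notice that `j` runs through all of its values before `i` moves on by one.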

---

###### Coding challenge!

Change the above loop so that it numbers the cells from left-to-right, top-to-bottom. As before, resist the temptation to cut and paste and write your code from scratch! 

---

The loop you just wrote numbers the cells of your array in "row-major" order, or "row wise", while the original loop numbered the cells in "column-major" order, or "column wise". 
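
For the curious, NumPy can produce both numberings directly. This sketch uses `np.arange` and `reshape` with its `order` argument (`'C'` for row-major, `'F'` for column-major); it is an aside for comparison, not a description of how the loops above work.

```python
import numpy as np

nRows, nCols = 4, 3
rowMajor = np.arange(nRows * nCols).reshape((nRows, nCols), order='C')
colMajor = np.arange(nRows * nCols).reshape((nRows, nCols), order='F')

print(rowMajor[0])   # [0 1 2]   numbers run along the first row
print(colMajor[0])   # [0 4 8]   numbers run down each column instead
```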

Nested loops give you tremendous power! You can go through any array element-by-element and get or set individual values. You can even do things like loading a series of data files in turn (in an outer loop), and then chewing through each data file in an inner loop.

As a final example, let's say we want to simulate a diurnal rhythm, like the cortisol level in the body, for several people. Since different people have different schedules, we want to add a bit of randomness to when each person's cortisol level waxes and wanes. 

In [None]:
import numpy as np

In [None]:
hours, person = 24, 10            # (easily changeable) array height and width
myArraySize = (hours, person)     # handy tuple of the size 
cortLevel = np.zeros(myArraySize) # make the array

myFreq = 2*np.pi/hours            # make the frequency once per 24 hrs

for j in range(person) :            # we'll go person by person
    myPhase = np.random.rand()      # get a random phase (a single number) for this person
    for i in range(hours) :         # go down current column (person) row-by-row
        cortLevel[i,j] = np.sin((myFreq*i + myPhase)) # set val. for this [time, person]
        

In [None]:
import matplotlib.pyplot as plt
plt.plot(cortLevel);

Cool!

### `while` loops

Sometimes we wish to repeat a calculation (or something), not for a predetermined number of times like in a `for` loop, but until some criterion is reached. This is accomplished using a `while` loop, which just keeps running and running until the criterion is reached. One dangerous thing about a `while` loop is that if the criterion can't be reached because we made a mistake in our code, then the loop runs forever – an infinite loop!

As a simple example, let's see how many tries it takes to get a number from the standard normal distribution that is above 2, which is roughly the upper 2.5% tail of the distribution!
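
One common way to protect yourself while experimenting is to add a safety limit on the number of passes. The sketch below contains a deliberate mistake, and the names `tries` and `maxTries` are made up for illustration.

```python
x = 10
tries, maxTries = 0, 1000    # maxTries is a hypothetical safety limit

# mistake: x only grows, so `x > 0` never becomes False on its own
while x > 0 and tries < maxTries:
    x += 1
    tries += 1

print(tries)                 # 1000: the safety limit stopped the loop
```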

In [None]:
x, cutOff, myCounter = 0, 2, 0

while x < cutOff :
    x = np.random.randn()
    myCounter += 1
    
myCounter

The dissection of the code is as follows.

* the first line sets some useful variables
    - a "test" variable `x` that will contain our candidate random numbers
    - our "cut off" variable that we will test x against
    - a "counter" variable that we'll use to count the number of tries
* the `while x < cutOff :` says "keep trying *while* `x` is less than `cutOff`"
* `x = np.random.randn()` gets a random number and assigns it to `x`
* `myCounter += 1` increments our counter

Once we get a random number above 2, the test `x < cutOff` evaluates to `False` and the loop ends. Whatever value is then in `myCounter` is our answer!

Run the above code cell several times! Does it always take the same number of times? Based on what you know about the standard normal distribution, how many times should it take?

Now here's an interesting puzzle... How many times does it take to get a big random number on average? What does the distribution look like?

How would we answer those questions?

Let's use... 

a ***for loop!***

In [None]:
import seaborn as sns  # for making a histogram/kde

In [None]:
nExperiments = 100  # how many times we'll do our little experiment
nSamplesNeeded = np.zeros((nExperiments, 1))
x, cutOff = 0, 2

for i in range(nExperiments) :
    myCounter = 0
    x = 0
    while x < cutOff :
        x = np.random.randn()
        myCounter += 1
    nSamplesNeeded[i, 0] = myCounter    
    

That looks like a lot of code, but go through it carefully. All we have done is nest our `while` loop inside a `for` loop, so that we can do our "How many times?" experiment as often as we wish. On each pass through the `for` loop, we store the answer from a single experiment in the `i`th row of a numpy array!

Let's look at the number of tries it took on each experiment:

In [None]:
plt.plot(nSamplesNeeded, '.')

Okay, cool! So it looks like we usually get a "big" number in under 50 tries, but it occasionally takes a lot longer. Let's look at the distribution of these numbers!

In [None]:
sns.displot(nSamplesNeeded, kind='kde')

Okay, I think that, while pretty, this plot is misleading. Can you see why?

Let's do a plain old histogram.

In [None]:
sns.displot(nSamplesNeeded, kind='hist')

Now this makes more sense, because we can't have a negative number of tries!

So it looks like, on average, it took us about – what? – 40 tries to get a number in the upper 2 1/2% tail of the distribution. Let's do a quick calculation.

In [None]:
100 / 2.5
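
We can tighten that estimate a little. A cut-off of exactly 2 leaves slightly less than 2.5% in the tail, and when each try succeeds with probability `p`, the average number of tries is `1/p`. The sketch below computes the tail probability with `math.erfc` from the standard library; the variable names `p` and `expectedTries` are made up for illustration.

```python
import math

cutOff = 2
# P(Z > cutOff) for a standard normal Z
p = 0.5 * math.erfc(cutOff / math.sqrt(2))
expectedTries = 1 / p          # mean number of tries until the first success

print(round(p, 4))             # 0.0228
print(round(expectedTries))    # 44
```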

### Logic operators on NumPy arrays

Comparison operators such as `==` and `>` also work on NumPy arrays, but they behave differently than they do on individual numbers: the comparison is applied to every element, and the result is an array of `True`/`False` values. To combine such arrays, NumPy provides its own logical functions, which become available once the package is imported.

For example, if we wanted to perform a logical operation between two sets of numbers, e.g., two arrays, operators (`==`, `>`, etc.) will work when the shapes of the arrays are compatible, but not otherwise. 

Let's take a look at how we would perform comparisons and logical operations with NumPy arrays.

In [3]:
import numpy as np # We import NumPy as we are working on arrays

In [None]:
myRnds = np.random.randn(1, 5) # we create an array of random numbers
myRnds

Now, imagine we wanted to know whether each number stored in the array `myRnds` is positive. 

In [None]:
myRnds > 0

If we wanted to find out whether any of the numbers in an array are positive, we would use the numpy function `np.any`:

In [None]:
logical_array = (myRnds > 0)
np.any(logical_array)

If we wanted to test whether all the values in an array are positive, we would use the function `np.all`. 

In [None]:
np.all(logical_array)

Because both `all` and `any` apply to numpy arrays, they can also be called as methods of a NumPy array. For example:

In [None]:
logical_array.any()

In [None]:
logical_array.all()
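
Since `myRnds` is random, the answers above change from run to run. Here is the same idea as a sketch with fixed, made-up values (`values` and `isPositive` are hypothetical names), so the results can be checked by eye.

```python
import numpy as np

values = np.array([-1.5, 0.3, 2.0])   # hypothetical fixed values
isPositive = values > 0               # False, True, True

print(isPositive.any())   # True: at least one value is positive
print(isPositive.all())   # False: not every value is positive
```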

Numpy arrays also allow comparing values element-wise. This means that we can compare each element of one array with the corresponding element of another array, provided the two arrays have the same size. For example,

`np.array([1, 2, 3]) == np.array([1, 4, 3])`

would compare 1 to 1, 2 to 4 and 3 to 3. Logical functions such as `np.logical_and` pair up the elements in the same way.
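
As a runnable sketch of that comparison (the names `first`, `second` and `same` are made up for illustration):

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([1, 4, 3])

same = (first == second)   # compares 1 to 1, 2 to 4 and 3 to 3
print(same)                # [ True False  True]
```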

In [None]:
array_one = np.random.randn(1,5) > 0
array_two = np.random.randn(1,5) > 0
np.logical_and(array_one, array_two)

What happens if the two arrays have different size, though?

In [None]:
vector_one = np.random.randn(1,6) > 0
vector_two = np.random.randn(1,5) > 0
np.logical_and(vector_one, vector_two)
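
The cell above should stop with an error, because shapes `(1, 6)` and `(1, 5)` cannot be matched up element by element. Here is a small sketch that catches the error so we can look at it (NumPy reports it as a `ValueError`); the arrays `six` and `five` are made up for illustration.

```python
import numpy as np

six = np.array([[True, False, True, True, False, True]])   # shape (1, 6)
five = np.array([[True, True, False, False, True]])        # shape (1, 5)

try:
    np.logical_and(six, five)
    worked = True
except ValueError as err:
    worked = False
    print('shapes do not match:', err)

print(worked)   # False
```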

The *or* and *not* operations also exist for numpy arrays:

In [None]:
vector_one = np.random.randn(1,5) > 0
vector_two = np.random.randn(1,5) > 0
np.logical_or(vector_one, vector_two)

In [None]:
vector_one = np.random.randn(1,5) > 0
np.logical_not(vector_one)   # logical_not takes a single array and flips each value

### Summary

In this tutorial, we have worked through the very important element of code called the ***loop***, which allows us to repeat calculations over and over and over again.

The most frequently used loop is the `for` loop, which allows us to do a computation a number of times. It can be used to do things like crawl through the rows and columns of a numpy array. With a pair of nested `for` loops, we can even crawl through each cell of an array in either row-major or column-major order. We could even use a `for` loop to chew through a series of data files, etc.

A `while` loop is used when we don't know ahead of time how many times we'll need to do the calculation. The `while` loop allows us to compute or look at things as many times as necessary until some condition is met. We just have to be careful that we don't make a dreaded *infinite* loop (what sort of cut off would make the `while` loop above essentially infinite?).

**Loops are strong!**

So in this tutorial we have shown how to organize and manipulate data using Python `lists` and `numpy` `arrays`.

The operations that are available for these two data types will be the base for many things that you might need to do as a Data Scientist.  