# Introduction to NumPy

_This notebook was originally written by [Matthew R. Carbone](https://www.bnl.gov/staff/mcarbone) for the [AIMS Tutorial Series](https://github.com/AIMLWG/AIML-tutorials) at Brookhaven National Lab._

In this notebook, we introduce the premier scientific computing package in Python: NumPy, which stands for "numerical python." NumPy's website can be found here: https://numpy.org/. Its code is open source, which means you can actually _read_ all of the code that makes up the library. You can find NumPy's maintained open source distribution on GitHub, here: https://github.com/numpy/numpy.

# A note on loading packages in Python

You may have noticed from previous notebooks that certain things "just work." For example, if I want to "cast" the float `8.2` to an integer, I can just use `int(8.2)`, which gives me the integer `8`. What is `int()`? It is one of Python's **built-ins**: a small set of functions and types that are pre-loaded "out of the box" and can be used without importing anything. Beyond the built-ins, Python ships with a **standard library**, whose modules you have to load yourself. For example, if you want to access some simple mathematical functions, you'll need the `math` library, which you can import like so:

In [None]:
import math

Now we have access to all of the functions, classes, etc. that are defined in the `math` library. For example:

In [None]:
math.sin(math.pi)  # mathematically zero; numerically a tiny value near 1.2e-16

Now, the NumPy library is not part of the standard library. It must be downloaded and installed separately. Depending on where you're running this notebook, it is possible NumPy is already installed. For instance:

* If you're running on a Google Colab instance, NumPy is already installed and can be accessed.
* If you're running on a local machine, and you have not installed NumPy using something like `pip install numpy`, it will not be available.

You can ultimately always check and see if the current environment you're working in has access to a library by trying to import it. It is assumed that we will be working in Google Colab for now, but just in case, feel free to run the command using the ! "magic," which actually calls out to a new shell in the notebook, allowing you a command line-like interface without having to leave the notebook. Give it a try:

In [None]:
!pip install numpy
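If you would rather check for a package without installing anything, one option is the standard library's `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found. The sketch below wraps it in a helper called `is_installed`; that name is ours, chosen for illustration, and is not part of any library. The second module name is deliberately made up.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment.

    `is_installed` is a hypothetical helper name used for illustration.
    """
    return importlib.util.find_spec(module_name) is not None


# `math` ships with Python, so it is always available.
print(is_installed("math"))

# A made-up module name that should not exist anywhere.
print(is_installed("surely_not_a_real_module_name_123"))

# The same check for NumPy; the answer depends on your environment.
print(is_installed("numpy"))
```

Simply running `import numpy` and watching for an `ImportError` works just as well; the helper only makes the check explicit.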

Once you are confident NumPy is installed, you can import it. You can do `import numpy` but traditionally, we create an alias for numpy by doing:

In [None]:
import numpy as np

This just makes our lives a bit easier, since we don't have to constantly type "numpy". Now, instead of, for example, `numpy.array`, we can just use `np.array`.

# Introduction to NumPy and its core objects

So, what is NumPy, and why do we care? There are two primary reasons (and you can find a more detailed breakdown on NumPy's [homepage](https://numpy.org/)):

* **Performance**: It is worth noting that many operations in Python are slow. At a very conceptual level, a lot of work that happens while _compiling_ your code in other languages, like C, happens at runtime in Python. NumPy is essentially a wrapper for compiled C code, allowing for both the flexibility and readability of Python _and_ the speed of compiled code.
* **Numerical computing tools**: NumPy offers multiple convenience objects, like the array, that have many useful operations baked into them. These include simple operations (such as element-wise adding and multiplication) and complex ones from linear algebra.

The primary purpose of this module is to explore the basics of these computing tools.

## The array

Note that details on the NumPy array can be found in the NumPy documentation, [here](https://numpy.org/doc/stable/user/absolute_beginners.html#more-information-about-arrays).

The `np.array` is the bread and butter of everything in NumPy. Its interface also serves as the model for more complicated objects, such as the PyTorch tensor, which adds machine learning features such as automatic differentiation. Overall, if you understand the syntax of NumPy, you will find the syntax of most other array-like objects in Python familiar.

The easiest way to create an array is from a list. Let's give that a try:

In [None]:
my_list = [1, 2, 3]
my_array = np.array(my_list)
print(f"The content of my_array: {my_array}, the type of my_array: {type(my_array)}")

You can see immediately that the list and the array largely look the same. There's one interesting subtlety to note even at this point. What types of objects is NumPy storing? You can access that information with the `dtype` property:

In [None]:
my_array.dtype

These are integers, typically `int64` (a 64-bit signed integer), although the default integer size can depend on your platform and NumPy version. You might recall that in Python, a list can have multiple different types in it. What happens if you try to initialize a NumPy array from a list that contains objects of different types? Let's start with an int and a float:

In [None]:
my_list = [1.0, 2, 3]
my_array = np.array(my_list)
print(f"The content of my_array: {my_array}, the type of my_array: {type(my_array)} ({my_array.dtype})")

NumPy is smart. It noticed that all of the elements in the array were compatible, and that the "most common" type of object it can use to represent the data is a 64-bit floating point number. Hence, it cast every element of the original list to that type.
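To see this promotion rule in isolation, here is a small sketch using `np.result_type`, which reports the dtype NumPy would choose to hold several types at once. The variable names are ours and purely illustrative.

```python
import numpy as np

# Ask NumPy which dtype it would use to hold both integers and floats.
promoted = np.result_type(np.int64, np.float64)
print(promoted)  # float64

# The same rule is applied when an array is built from a mixed list.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Booleans are promoted as well: True becomes 1.0 alongside a float.
with_bool = np.array([True, 2.5])
print(with_bool, with_bool.dtype)
```

The takeaway is that an array has exactly one dtype, so NumPy picks one that can represent every element.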

In [None]:
my_list = [1.0, 2, 3, "4"]
my_array = np.array(my_list)
print(f"The content of my_array: {my_array}, the type of my_array: {type(my_array)} ({my_array.dtype})")

Something weird has happened. `U32` denotes a Unicode string type (at most 32 characters per element). Once again, NumPy has made an opinionated choice about what type to cast all elements in the array to. Beware, this might not be the behavior you intend! You should also note that you can attempt to force NumPy to pick a specific type for its array. For example:

In [None]:
my_list = [1.0, 2, 3, "4"]
my_array = np.array(my_list, dtype=np.int64)
print(f"The content of my_array: {my_array}, the type of my_array: {type(my_array)} ({my_array.dtype})")

But of course, you might not always succeed if you try to cast, for example, the letter "a" to an int!

In [None]:
np.array([1.0, 2, 3, "a"], dtype=np.int64)
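If you want your program to keep running when a cast like this fails, you can catch the error. The sketch below assumes the failure surfaces as a `ValueError`, which is what Python's own `int("a")` raises. It also shows `astype`, which converts an existing array to a new dtype.

```python
import numpy as np

cast_error = None
try:
    np.array([1.0, 2, 3, "a"], dtype=np.int64)
except ValueError as err:
    # The string "a" has no integer interpretation, so the cast fails.
    cast_error = err
    print(f"Casting failed: {err}")

# Strings that do look like integers can be converted with astype.
digits = np.array(["1", "2", "3"]).astype(np.int64)
print(digits, digits.dtype)
```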

## Accessing elements of arrays

To access the elements of a NumPy array, we can use the same syntax as for lists. As with lists, you should remember that Python is a 0-indexed language. In other words, the first element of an array is accessed using index `0`.

In [None]:
my_array = np.array([-1, 3, 5])  # build a NumPy array rather than a plain list
print(f"The first element of my_array is {my_array[0]}")

Trying to access an element out of bounds will, as you might expect, lead to an error (an `IndexError`).

In [None]:
my_array[4]

And note that, similar to lists, you can access arrays in "reverse order" by using negative indexes.

In [None]:
my_array = np.array([-1, 3, 5, 7, 9])  # build a NumPy array rather than a plain list
print(f"The second-to-last element of my_array is {my_array[-2]}")

## Initializing the array in different ways

There are far more ways to initialize a NumPy array than from a list. Of particular note are the various methods NumPy itself has defined for us. Let's take a look at a few: `np.arange`, `np.linspace`, and `np.random.random`.

Note that it is highly recommended that when you're using a new function or method, you look at that object's docstring. The docstring is essentially documentation that the developer has put with the code itself, and should provide information on the object's parameters, use cases and what it might return if it's a function or method. You can access the docstring by putting a "?" after the object in a Jupyter Notebook. For example, you might want to give this: `np.arange?` a try.

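The `np.arange` function produces an evenly spaced grid that starts at the first value passed and stops just before the last value passed. With integer inputs and the default step of 1, this is an integer grid from the first value to one less than the last value. For example: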

In [None]:
my_array = np.arange(3, 10)
print(my_array)

The `np.linspace` function produces an evenly spaced grid from the first to the second values passed, with the third value indicating how many points to take on the grid. For example:

In [None]:
my_array = np.linspace(5, 8, 6)
print(my_array)
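Both functions take a few more options that are worth knowing about. The sketch below shows the optional step argument of `np.arange` and the `retstep` flag of `np.linspace`; the variable names are ours.

```python
import numpy as np

# A third argument to np.arange sets the step between consecutive values.
odds = np.arange(3, 10, 2)
print(odds)  # [3 5 7 9]

# retstep=True makes np.linspace also return the spacing it used.
grid, step = np.linspace(5, 8, 6, retstep=True)
print(grid)
print(step)

# Unlike np.arange, np.linspace includes the end point by default.
print(grid[0], grid[-1])  # 5.0 8.0
```

A rough rule of thumb: reach for `np.arange` when you know the step, and `np.linspace` when you know how many points you want.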

Finally, `np.random.random` produces random values on the half-open interval $[0, 1)$.

In [None]:
np.random.seed(123)  # what's this for?
my_array = np.random.random(5)
print(my_array)
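As for the seed: it fixes the starting state of the random number generator, so the "random" numbers come out the same every time the code is run. A minimal sketch of that behavior follows, along with the newer `np.random.default_rng` interface, which keeps the state in an object instead of a global seed.

```python
import numpy as np

# Seeding resets the random number generator to a known state ...
np.random.seed(123)
first_draw = np.random.random(5)

# ... so seeding again with the same value reproduces the same numbers.
np.random.seed(123)
second_draw = np.random.random(5)

print(np.array_equal(first_draw, second_draw))  # True

# Newer NumPy code often uses a Generator object instead of the global seed.
rng = np.random.default_rng(123)
print(rng.random(5))
```

Reproducible random numbers matter whenever you want someone else (or future you) to get exactly the same results.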

You might imagine that each of these methods for creating an array has its advantages and disadvantages, and each might be applicable in different use cases.

## Arrays are multidimensional

So far, we've only dealt with one-dimensional arrays, but NumPy is designed to handle arrays of arbitrary dimension. For example, an array of dimension 2 is a matrix and can be initialized using a list of lists. Here are two examples.

In [None]:
arr = np.array([
    [1, 2],
    [3, 4]
])
print("I am a 2d array")
arr

In [None]:
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
print("I am also a 2d array, with shape (3, 3)")
arr

We can also utilize the `shape` property to get valuable information on the... shape of the array.

In [None]:
print(arr.shape)

Arrays need not have the same length along every axis, and they can have more than two axes. Here is a 3-dimensional example:

In [None]:
np.random.seed(123)
arr = np.random.random(size=(3, 2, 5))
print(arr.shape)
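It is easy to confuse the number of dimensions with the length of an axis: a 3-by-3 matrix is still 2-dimensional. The sketch below uses the `ndim`, `shape` and `size` attributes to tell these apart; the names `matrix` and `cube` are ours.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
cube = np.zeros((3, 2, 5))

# ndim is the number of axes; shape gives the length of each axis.
print(matrix.ndim, matrix.shape)  # 2 (3, 3)
print(cube.ndim, cube.shape)      # 3 (3, 2, 5)

# size is the total number of elements, i.e. the product of the shape.
print(matrix.size, cube.size)     # 9 30
```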

## Accessing elements of multi-dimensional arrays

Accessing the elements of a multi-dimensional array in NumPy is a simple extension of accessing the elements of a list in Python: supply one index per axis, separated by commas. For example, in the array above, to access index `0` along the first axis, index `1` along the second axis, and the last element along the last axis, we can do:

In [None]:
arr[0, 1, -1]
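To connect this back to nested lists, the sketch below rebuilds an array of the same shape (under the name `values`, which is ours) and compares comma-separated indexing with chained brackets. It also shows what happens when you supply fewer indices than there are axes.

```python
import numpy as np

np.random.seed(123)
values = np.random.random(size=(3, 2, 5))

# One index per axis, separated by commas, picks out a single element.
single = values[0, 1, -1]

# Chaining brackets reaches the same element, one axis at a time.
chained = values[0][1][-1]
print(single == chained)  # True

# Supplying fewer indices than axes returns a smaller array instead.
print(values[0].shape)     # (2, 5)
print(values[0, 1].shape)  # (5,)
```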

# Basic operations in NumPy

## Element-wise operations

So now that we have arrays, what can we do with them? Let's take a look at a few examples. To start, we consider the simple array below.

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr

Imagine you were tasked with adding 1 to each element of the array. You might consider running a for loop over each axis of the array, accessing each element, and adding 1 to it. Such a solution might look like this:

In [None]:
for ii in range(arr.shape[0]):
    for jj in range(arr.shape[1]):
        arr[ii, jj] += 1  # New idea, what's the += doing?
arr

This solution works fine, but there are two problems lurking. One is obvious, the other not so much.

1. The obvious problem is that this is not very scalable. What happens if you have a 3-dimensional array? You'd have to write yet another for loop. This can get tedious very quickly.
2. The not-so-obvious problem is that this is actually incredibly _slow_. You don't notice because the for loops are tiny, but if you have a very large array, this will be very slow because it's written in Python. The NumPy version of this, while it looks simpler, is actually accessing compiled C code, which is very fast.

It might surprise you then that the NumPy version, which again is very fast, is a single simple line of code.

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr + 1

This is called an **element-wise operation**. An element-wise operation is one that, as its name suggests, operates on every element in the same way. NumPy has quite a few of these. Here's a non-exhaustive list.

In [None]:
# Addition (just did this)
arr + 1

In [None]:
# Subtraction
arr - 1

In [None]:
# Multiplication
arr * 2

In [None]:
# Division
arr / 2

In [None]:
# Integer division
arr // 2  # What's going on here?

In [None]:
# Raise-to-power
arr**2

In [None]:
# Sine operation (applies sine function to all elements)
np.sin(arr)

In [None]:
# Inverse tangent operation
np.arctan(arr)

In [None]:
# Log operation (applies natural log function to all elements)
np.log(arr)

In [None]:
# Boolean comparison
arr == 1
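To make the earlier speed claim concrete, here is a rough timing sketch using the standard library's `timeit` module. The helper names `add_one_loop` and `add_one_numpy` and the array size are our own choices, and the exact timings will vary from machine to machine.

```python
import timeit

import numpy as np

big = np.arange(100_000)


def add_one_loop(values):
    """Add 1 to every element with an explicit Python loop."""
    result = values.copy()
    for ii in range(result.shape[0]):
        result[ii] += 1
    return result


def add_one_numpy(values):
    """Add 1 to every element with a single element-wise operation."""
    return values + 1


# Both approaches produce the same answer ...
print(np.array_equal(add_one_loop(big), add_one_numpy(big)))

# ... but they usually take very different amounts of time.
loop_time = timeit.timeit(lambda: add_one_loop(big), number=5)
numpy_time = timeit.timeit(lambda: add_one_numpy(big), number=5)
print(f"loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
```

On most machines the element-wise version should come out far ahead, and the gap tends to grow with the size of the array.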

## Slicing

[Array slicing](https://numpy.org/doc/stable/user/basics.indexing.html#slicing-and-striding) is an incredibly important operation for returning a part of an array. It is worth working through the linked documentation, but here we will provide a basic introduction.

Consider the same array as before. How do we access the first two columns only? NumPy slicing logic works in the same way as Python's except it can be applied to every dimension:

In [None]:
arr[:, :2]

Above, the `:` says "take this entire dimension," and the `:2` says "take everything up to but _excluding_ index 2," i.e., the columns at indexes 0 and 1 (remember that everything in Python is zero-indexed!).

This is essentially all there is to it at a basic level. We encourage you to check out the linked documentation above, and read the parts on indexing, slicing, advanced indexing, etc. There are a few other neat parts to this, such as the ellipsis `...` and `np.newaxis`, which can be useful in some cases.
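A few of these extras are sketched below on the same small array: a step inside a slice, a negative step, the ellipsis, and `np.newaxis`.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# start:stop:step works along any axis; here, every other column.
print(arr[:, ::2])   # columns 0 and 2

# A negative step walks backwards, reversing the rows.
print(arr[::-1, :])

# The ellipsis stands in for "all remaining axes".
print(arr[..., 0])   # first column, same as arr[:, 0]

# np.newaxis inserts a new axis of length 1.
print(arr[:, np.newaxis, :].shape)  # (2, 1, 3)
```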

## Reshaping

It is often necessary to "reshape" NumPy arrays into other shapes. The reasons for doing this will be apparent later on, but for now, remember that `arr.shape` can access the shape of the `arr` NumPy array. The total number of elements of the array is the product of all elements of `arr.shape`, and any array can be reshaped into a new shape as long as the product of that new shape's elements equals the product of the previous shape's elements. This is best shown by example:

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr.shape

Note the initial shape of `arr` is `(2, 3)`, or a "two by three" matrix/array/tensor/etc.

In [None]:
arr2 = arr.reshape(3, 2)
arr2

In [None]:
arr2.shape

Now the new shape is `(3, 2)`. We have, through some procedure, modified the structure of the array itself. Is this procedure "reversible"? Let's see:

In [None]:
arr3 = arr.reshape(3, 2).reshape(2, 3)
arr == arr3

Indeed it is. NumPy takes a standard convention with reshaping, which you can read more about [here](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html). 

There is also another nice way to reshape, that is by using `-1` as an argument to `reshape`. Let's see what happens:

In [None]:
arr4 = arr.reshape(-1, 1)
arr4

The `-1` argument can be read as "everything else". It can only be used once. So, in the above case, `reshape` is transforming the array into an "everything else by one" array, where everything else is actually equal to 6.
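The sketch below collects the reshaping rules in one place: the default row-by-row fill order, `-1` in a different position, and what happens when the requested shape does not match the number of elements. The variable names are ours.

```python
import numpy as np

flat = np.arange(6)

# Elements are filled in row by row (NumPy's default "C" order).
print(flat.reshape(2, 3))

# -1 asks NumPy to infer that axis: 6 elements / 2 rows = 3 columns.
inferred = flat.reshape(2, -1)
print(inferred.shape)  # (2, 3)

# A shape whose product does not match the element count is rejected.
reshape_error = None
try:
    flat.reshape(4, -1)
except ValueError as err:
    reshape_error = err
    print(f"Reshape failed: {err}")
```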

# Advanced operations

We now move on to more advanced operations which can be performed using NumPy. These operations will require a bit of background in statistics and linear algebra. As such, we will try to provide a brief, crash course-style review of the important concepts.

## Statistics

Of course, where would we be without statistics? As you might expect, a NumPy array can at its most basic level be interpreted as a collection of numbers, and whenever we have collections of numbers, we are tempted to apply the tools of statistics. Here, we will briefly go over the mean, median, sum, standard deviation, and a few others. An exhaustive list of the NumPy statistics API can be found [here](https://numpy.org/doc/stable/reference/routines.statistics.html).

Let's start by creating a test array. This array will simply be the sequence $\{1^2, 2^2, ..., N^2\},$ i.e., the list of squared integers.

Our first few lines of code will be a combination of multiple concepts from earlier in this notebook. Make sure you understand everything that's going on before proceeding!

In [None]:
N = 10
arr = np.arange(1, N + 1)**2
arr

Now for some statistics. What if we want to take the **mean** of this array? We can do this in Python for sure, but remember, as the array gets larger, this will be very slow.

In [None]:
def compute_mean(arr):
    
    # s accumulates the running sum of the elements
    s = 0
    
    # arr.flatten reshapes the array into a single dimension
    for element in arr.flatten():
        s += element
    
    # Note that arr.size gets the total number of elements in the array
    return s / arr.size

In [None]:
mean = compute_mean(arr)
mean

How do we do this in NumPy? It's extremely simple:

In [None]:
arr.mean() == mean

That's it! Similarly, we can compute the **sum** over all elements via

In [None]:
arr.sum()

As a quick reminder, the **standard deviation** is defined as

$$ \sigma = \sqrt{\frac{\sum_{i=1}^N (x_i - \mu)^2 }{N}}$$

where $\mu$ is the mean as calculated above. Let's combine a few pieces of knowledge from this notebook once more and efficiently compute the standard deviation of the array.

In [None]:
std = np.sqrt(((arr - arr.mean())**2).sum() / arr.size)

It's really important to note what every piece of the line of code above is doing. Take it one step at a time.
* `arr.mean()` computes the mean of the `arr` object.
* The mean is then subtracted, element-wise, from `arr`.
* This new array, `arr - arr.mean()`, is then squared element-wise.
* That is then summed.
* That result is then divided by `N`, or the total number of elements in `arr`.
* Finally, the square root of that result is taken.

Of course, there is a much easier way to do this in NumPy:

In [None]:
np.isclose(arr.std(), std)  # easy! (isclose avoids an exact floating-point comparison)

The **median** is also simple, except there is no `median` method defined on the array object. Instead, we call the function `np.median` and pass it the array.

In [None]:
np.median(arr)
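Most of these statistics also accept an `axis` argument, which becomes useful once your arrays are multi-dimensional. A small sketch on a 2-by-3 array (named `table` here for illustration):

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])

# With no axis argument, the statistic is taken over every element.
print(table.mean())          # 3.5

# axis=0 collapses the rows, leaving one value per column.
print(table.mean(axis=0))    # [2.5 3.5 4.5]

# axis=1 collapses the columns, leaving one value per row.
print(table.sum(axis=1))         # one sum per row: 6 and 15
print(np.median(table, axis=1))  # one median per row: 2 and 5
```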

## Linear algebra

This is an advanced section which isn't necessarily required on a first read. However, the reader should note that to fully understand some more advanced concepts in computer science, including machine learning, one must understand some linear algebra. We'll go over a few simple examples for now, while pointing the interested reader to NumPy's [linear algebra documentation](https://numpy.org/doc/stable/reference/routines.linalg.html).
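As one possible starting point, the sketch below shows a handful of common linear algebra operations: matrix multiplication with `@`, the transpose, the determinant, and solving a small linear system with `np.linalg.solve`. The matrix `A` and vector `b` are made-up values chosen only for illustration.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# The @ operator performs matrix multiplication (not element-wise!).
print(A @ A)

# Transpose and determinant.
print(A.T)
print(np.linalg.det(A))  # approximately 5

# Solve the linear system A x = b for x.
x = np.linalg.solve(A, b)
print(x)  # approximately [0.8 1.4]

# Check the solution by substituting it back in.
print(np.allclose(A @ x, b))  # True
```

Note the contrast with `A * A`, which multiplies element by element.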

TK

## Broadcasting

Broadcasting is one of NumPy's most powerful built-in tools. It essentially deals with how NumPy treats arrays of arbitrarily different shapes. There's really no better way to explain broadcasting for the first time than to show a few examples. For more details, check out NumPy's [broadcasting documentation](https://numpy.org/doc/stable/user/basics.broadcasting.html).

A quick note: in the "element-wise operations" section, you've already done the most basic form of broadcasting! NumPy has to figure out, for example, how to "multiply" an array and a number. It is not obvious at first what the behavior of this will be. For example, consider what happens when you multiply a standard Python list by an integer:

In [None]:
[1, 2, 3] * 3

Not what you expected, right? That's because Python has defined list-integer multiplication differently than NumPy: multiplying a list by an integer repeats the list. As someone who is software savvy, you'll have to keep track of these subtleties!
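For comparison, here is a sketch of how NumPy handles the same multiplication, followed by a few slightly larger broadcasting cases. The array names are ours; the behavior follows the broadcasting rules in the documentation linked above.

```python
import numpy as np

# Array times scalar: the scalar is "stretched" to every element.
print(np.array([1, 2, 3]) * 3)  # [3 6 9]

# A (2, 3) array plus a (3,) array: the row is added to each row.
grid = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
print(grid + row)

# A (3, 1) column plus a (1, 3) row broadcasts to a (3, 3) table.
col = np.array([[0], [1], [2]])
print((col + row[np.newaxis, :]).shape)  # (3, 3)

# Shapes that cannot be lined up raise an error.
broadcast_error = None
try:
    grid + np.array([1, 2])
except ValueError as err:
    broadcast_error = err
    print(f"Broadcasting failed: {err}")
```

Roughly speaking, shapes are compared from the last axis backwards, and two axes are compatible when they are equal or one of them is 1.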

# Conclusion

In this notebook, we worked through a variety of examples which demonstrate the power of NumPy. With that said, you may not realize how powerful some of these functions and methods are until you see some examples. As such, we encourage you to work through some of the provided notebooks/problems in this repository!