# Chapter 0: Python Libraries: NumPy, Pandas and Matplotlib

Welcome to our first programming assignment! Before we get deep into the statistical libraries in the Python universe, we need some practice with the three general-purpose Python libraries that every aspiring [Pythonist](https://en.wiktionary.org/wiki/Pythonist) should know cold.

## Directions

1. The programming assignments are organized into sequences of short problems. You can see the structure of the programming assignment by opening the "Table of Contents" along the left side of the notebook (if you are using Google Colab or Jupyter Lab).

2. The problems are preceded by "Description" sections that explain relevant background. Before attempting any of the problems, make sure to read these "Description" sections and run all code cells.

3. After reading the "Description" sections, proceed to the problems.

4. Each problem contains a blank cell containing the following comment: `# ENTER YOUR CODE IN THIS CELL`. Enter your code in these cells below the comment, being sure to not erase the comment. There are usually directions on the precise syntax that you will use to enter your solution properly. Please pay very careful attention to these directions.

5. Below most of the solution cells are "autograder" cells. Do not alter the autograder cells in any way.

6. Do not add any cells of your own to the notebook, or delete any existing cells (either code or markdown).

7. **If you alter the notebook in any fashion so that the autograder fails to execute, your grade will be reduced accordingly.**

## Submission instructions

1. Once you have finished entering all your solutions, you will want to rerun all cells from scratch to ensure that everything works OK. To do this in Google Colab, click "Runtime -> Restart and run all" along the top of the notebook.

2. Now scroll back through your notebook and make sure that all code cells ran properly.

3. If everything looks OK, save your assignment and upload the `.ipynb` file at the provided link on the course [GitHub repo](https://github.com/jmyers7/stats-book-materials). Late submissions are not accepted.

4. You may submit multiple times, but I will only grade your last submission.

## NumPy

### Description

[NumPy](https://numpy.org/doc/stable/index.html), which stands for **Num**erical **Py**thon, is an extensive library for numerical computations in Python. Not only does it provide a huge number of convenient mathematical functions, but it also provides one of the most fundamental data structures in scientific Python, the NumPy _array_.

The first thing we must do is import the NumPy module by calling `import numpy`. Because five letters is apparently more typing than a person could be expected to do, the Python community has universally agreed that the NumPy library should have the two-letter alias `np`. So, any time that we want to access the NumPy library in our code, we can type `np` instead of `numpy`.

In [None]:
import numpy as np

At the most basic level, one-dimensional NumPy arrays are simply lists of numbers. Here's an example:

In [None]:
v = np.array([1, 2, 3])
v

In this code block, I created a $1$-dimensional NumPy array called `v` containing the numbers $(1,2,3)$. To create it, I called `array` from the NumPy module via the function call `np.array`, and I passed the list `[1, 2, 3]` as the parameter. We can check the "shape" or size of the array `v` as follows:

In [None]:
v.shape

While we might have expected that the shape of `v` is $(3,1)$, in fact NumPy prints the shape as $(3,)$. This is one of the funny quirks of how NumPy handles $1$-dimensional arrays. It will come back as a minor annoyance a little later.
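
To make the difference between the two shapes concrete, here is a small sketch. The names `w` and `w_col` are ones I am introducing just for this illustration, so they won't interfere with `v`. The `ndim` attribute counts the number of axes of an array, and `reshape` converts between the one- and two-dimensional versions:

```python
import numpy as np

w = np.array([1, 2, 3])          # 1-dimensional: a single axis of length 3
w_col = w.reshape(3, 1)          # 2-dimensional: 3 rows and 1 column

print(w.shape, w.ndim)           # (3,) 1
print(w_col.shape, w_col.ndim)   # (3, 1) 2
```

Both arrays hold the same three numbers; only the way NumPy organizes them differs.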

In the next code block, I will define a $2$-dimensional array `A` that contains the following numbers:

$$
\begin{bmatrix}
3 & 1 \\
-1 & 0 \\
10 & -5
\end{bmatrix}.
$$

In [None]:
A = np.array([[3, 1], [-1, 0], [10, -5]])
A

Study this code carefully, to make sure you understand how I defined `A`!

This time, the shape of `A` is exactly what we expect:

In [None]:
A.shape

The shapes of $2$-dimensional arrays are always in this format:

$$
(\text{number of rows}, \text{number of columns}).
$$

We can access specific entries of arrays by "indexing into them" as follows:

In [None]:
v[2]

Here, I printed out the third element in the $1$-dimensional array `v` from above. Remember, indexing in Python is "0-based," which means that the indices in `v` run as $0,1,2$ instead of $1,2,3$.

Let's now print out the first entry in `v`, along with the entry in `A` in the third row and second column:

In [None]:
print(f'The first entry in v is: {v[0]}')
print(f'The entry in A in the third row and second column is: {A[2, 1]}')

Notice that I printed _entire_ sentences by calling `print` with sentences enclosed in single quotes, just like all strings in the Python language. However, these are special types of strings called [f-strings](https://peps.python.org/pep-0498/), which are identified by the single `f` that sits in front of the strings. F-strings allow us to embed and print variables _inside_ the strings enclosed in curly braces `{}`.
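
Here is one more small sketch of an f-string, using made-up variables `radius` and `area` that have nothing to do with our arrays. It shows that a format specification can follow a colon inside the braces to control how a number is displayed:

```python
radius = 2.5
area = 3.14159 * radius ** 2

# The :.2f after the variable name rounds the displayed value
# to two digits after the decimal point.
message = f'A circle of radius {radius} has area {area:.2f}'
print(message)   # A circle of radius 2.5 has area 19.63
```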

But we can do more than just access single elements at a time. We can also "slice" arrays to obtain multiple elements. For example, suppose that we want to grab the first column of `A` --- we would write:

In [None]:
A[:, 0]

You can think of the colon `:` as standing for "all." So, I would interpret the code `A[:, 0]` as the first column of `A` over _all_ rows.

Similarly, if we wanted the second row of `A`, we would write:

In [None]:
A[1, :]

But the colon `:` has even more functionality. Consider this:

In [None]:
A[0:2, 0]

The `0` in the second position of `[0:2, 0]` means that we are accessing the $0$-th (i.e., first) column of `A`. The `0:2` should be read as a _range_ from `0` up to `2`, excluding `2`. So, all together, `A[0:2,0]` means that we are "slicing out" the first two entries of `A` from the first column.

How about the second and third elements in `A` from the second column? That would be:

In [None]:
A[1:3, 1]

Not so bad, right?
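
If you want to experiment a bit more, here is a sketch that uses a separate array `M` (my own name, with the same entries as `A`) so that nothing above gets overwritten. It shows that a range can be combined with a colon, and that slicing out part of a single column gives back a $1$-dimensional array:

```python
import numpy as np

M = np.array([[3, 1], [-1, 0], [10, -5]])

top_rows = M[0:2, :]      # first two rows, all columns
part_col = M[0:2, 0]      # first two entries of the first column

print(top_rows.shape)     # (2, 2)
print(part_col.shape)     # (2,)
```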

We can add rows and columns to NumPy arrays by calling the convenient [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function. For example, suppose that we wanted to add

$$
\begin{bmatrix}
2 \\ 3 \\ 4
\end{bmatrix}
$$

as a third column in `A`. Then, we would write:

In [None]:
new_column = np.array([2, 3, 4]).reshape(3, 1)
A = np.concatenate((A, new_column), axis=1)
A

Notice that I had to call `.reshape(3, 1)` on the new column to change its shape from the annoying `(3,)` to `(3, 1)`. Then, I passed the existing array `A` along with `new_column` as a tuple to the `np.concatenate` function, along with `axis=1` to specify that I wanted the new column to be added as a column and not a row. I then re-defined `A` to be this new array of shape $(3,3)$.

Now, what if we wanted to add a new row to `A`?

In [None]:
new_row = np.array([9, 10, -3]).reshape(1, 3)
A = np.concatenate([A, new_row], axis=0)
A

Notice that instead of `axis=1`, which adds a new _column_, I wrote `axis=0`, which adds a new _row_. However, as you can see in the [documentation](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html), technically I didn't need to set `axis=0` because `0` is the default value of `axis`.
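
The following sketch illustrates both values of `axis` on two small throwaway arrays, `P` and `Q`. These names are my own and appear nowhere else in the notebook.

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])    # shape (2, 2)
Q = np.array([[5, 6]])            # shape (1, 2)

# No axis given, so axis=0 is used and Q is appended as a new row.
as_row = np.concatenate((P, Q))
print(as_row.shape)               # (3, 2)

# With axis=1 the arrays must have the same number of rows,
# so Q is first reshaped from (1, 2) to (2, 1).
as_col = np.concatenate((P, Q.reshape(2, 1)), axis=1)
print(as_col.shape)               # (2, 3)
```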

Let's talk about _boolean masking_ (or filtering). Let's suppose, for example, that I wanted to pull out all entries in `A` that are greater than or equal to $9$. In order to do this, we first define a so-called "mask" based on this criterion:

In [None]:
mask = (A >= 9)
mask

Notice that a mask is nothing but an array that contains `True` and `False` values. Then, to pull out the entries in `A` that are $\geq 9$, we simply index into `A` using the mask like this:

In [None]:
A[mask]

For another example, suppose that we wanted to pull out all entries $x$ in `A` such that $0 \leq x \leq 3$. Then, we create another mask and index into `A`:

In [None]:
another_mask = (0 <= A) & (A <= 3)
A[another_mask]
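
Two details about masks are worth seeing in isolation. The sketch below uses a hypothetical array `N` of my own choosing. It shows that the parentheses around each condition matter, that `True` values can be counted with `sum`, and that the selected entries come back as a flat $1$-dimensional array:

```python
import numpy as np

N = np.array([[3, 1, 2], [-1, 0, 3], [10, -5, 4]])

# The parentheses are required because & binds more tightly than <=.
in_range = (0 <= N) & (N <= 3)

print(in_range.sum())    # 5, since each True counts as 1
print(N[in_range])       # [3 1 2 0 3]
```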

Finally, we will need to know how to create $1$-dimensional arrays of evenly spaced values over a specified range. For this, we use the [`np.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) function. For example, suppose that we wanted to create an array of $100$ evenly spaced $x$-values on the interval $2\leq x \leq 7$. Then, we would write:

In [None]:
x_vals = np.linspace(2, 7, num=100)
x_vals[0:15]

Notice that I sliced out the first 15 entries of the array, rather than print out all 100 entries. Let's check the shape of our new array:

In [None]:
x_vals.shape

That's exactly what we expected!
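
As a quick sanity check, here is a sketch using a separate array `grid` (my own name). It confirms that both endpoints of the interval are included, which means that the spacing between consecutive values is $(7-2)/(100-1)$:

```python
import numpy as np

grid = np.linspace(2, 7, num=100)

print(grid[0], grid[-1])          # 2.0 7.0 (an index of -1 means "last entry")

step = grid[1] - grid[0]
print(np.isclose(step, 5 / 99))   # True
```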

### Problem 1 --- Creating arrays

In the next code cell, create a NumPy array called `B` of shape $(2,3)$ that contains the following data:

$$
\begin{bmatrix}
1 & 2 & 0 \\ -3 & 2.5 & 4
\end{bmatrix}.
$$

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 2 --- Array indexing and slicing

Index into your array `B` and retrieve the entry in the second row and third column. Save your answer into the variable `entry`, by writing `entry = ` and putting your code to the right of `=`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now slice into your array and retrieve the second and third entries in the second row. Save your answer into the variable `slice`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 3 --- Adding rows and columns

Add

$$
\begin{bmatrix}
3 \\ 5
\end{bmatrix}
$$

as a fourth column to your array `B`. Make sure to _update_ your existing array `B` by calling `B = np.concatenate()` with the appropriate parameters.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now add

$$
\begin{bmatrix}
0 & 6 & 4 & -1
\end{bmatrix}
$$

as a third row to $B$.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 4 --- Boolean masking

Extract all elements in the array `B` that are greater than or equal to $2$. Save your answer into the variable `elements`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 5 --- Arrays of evenly spaced values

Create an array `x_vals` consisting of $250$ evenly spaced $x$-values over the range $3 \leq x \leq 10$. Slice into the array and save the first ten values into the variable `slice`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


## Pandas

### Description

NumPy arrays are basic **numerical** arrays. But when we deal with real-world data, we need much more flexible data structures. Such data structures are provided by the _data frames_ in the [Pandas](https://pandas.pydata.org/) library.

First, we import the Pandas module under its conventional alias `pd`:

In [None]:
import pandas as pd

Now, I want to import external data into the notebook from the online textbook's GitHub repo. To do this, I call the following code:

In [None]:
url = 'https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/data/data-1-1.csv?raw=true'
df = pd.read_csv(url)

We call the `read_csv` function from the Pandas module, passing it the URL location of our data as a Python string in single quotes. The `read_csv` function creates a Pandas data frame, which we will call `df`.

Let's print out the data frame:

In [None]:
df

We see that the data frame contains five columns of data, plus an initial unnamed index column (ranging from $0$ to $4$). Three of the columns are numerical, while two contain strings (i.e., words).

When you import a fresh data frame into the notebook, Pandas automatically assigns consecutive numerical values for row indices. But sometimes you want more descriptive indices; for example, if your data frame contained sales data with order numbers, then you might want the indices to be the order numbers. So, suppose that we wanted the `ID` column in our data frame to serve as the row indices; then we would write:

In [None]:
df.set_index('ID', inplace=True)
df

Notice that the generic indices $0$-$4$ have disappeared, and that the `ID` column now serves as the index column. Our data frame now consists of _four_ columns of data, and a single index column.

We can get the "shape" or size of a data frame just like we do with NumPy arrays:

In [None]:
df.shape

The first number, $5$, is the number of rows, while the second number, $4$, is the number of columns. Notice that the index column is not counted.

The data frames that you work with in the real world might be _huge_, perhaps containing many thousands or even millions of rows. Clearly, such large data frames cannot be printed to the screen so easily. For this reason, we can use the `head` method of a Pandas data frame to print out only the first few rows:

In [None]:
df.head(3) # print the first three rows

We can slice out specific columns of `df` by indexing with the column names passed as strings. For example:

In [None]:
df['C']

In the printout, we see two columns: The first is the index column `ID`, while the second is the column `C` that we sliced out.

If `df` were not a data frame but instead a NumPy array, then in order to slice out a column we would have to use the colon `:` as a first index, as we saw above. We can do the same thing with data frames, but we would have to use the `loc` indexer of the data frame (note the square brackets rather than parentheses):

In [None]:
df.loc[:, 'C']

It's up to you if you prefer using `df['C']` or `df.loc[:, 'C']`. They both do the same thing.
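
You can convince yourself of this with a tiny data frame. The one below, called `toy`, contains hypothetical data that I invented for this illustration only; it is not the data in `df`.

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
toy = pd.DataFrame({'A': [1, 2, 3], 'C': ['x', 'y', 'z']}, index=[12, 7, 5])

by_name = toy['C']
by_loc = toy.loc[:, 'C']

print(by_name.equals(by_loc))   # True
print(type(by_name))            # a single column is a pandas Series
```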

For another example, let's grab the two entries in column `C` with `ID` values `12` and `7`:

In [None]:
df.loc[[12, 7], 'C']

The `loc` indexer allows us to access the data inside a data frame using the column names and the index values (as specified by the index column). But if we wanted to access the data in a data frame using _numerical_ indices as if it were a NumPy array, then we must use `iloc` in place of `loc`. For example, we can grab the first two elements in the third column like this:

In [None]:
df.iloc[:2, 2]

Notice that I wrote `:2` instead of `0:2`, which you might have been expecting. When you leave off the `0` in front of `:`, Python is smart enough to fill it in automatically. This is a convenient shortcut.
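
The difference between labels and positions is easiest to see side by side. Here is a sketch on a small data frame `toy` with hypothetical data (again, my own invention and not the contents of `df`):

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
toy = pd.DataFrame({'A': [1, 2, 3], 'C': ['x', 'y', 'z']}, index=[12, 7, 5])

label_based = toy.loc[7, 'C']      # the row whose index *label* is 7
position_based = toy.iloc[0, 1]    # the first row and second column, by position

print(label_based)      # y
print(position_based)   # x
```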

How about adding columns and rows to a data frame? Let's add a fifth column to our data frame:

In [None]:
df['E'] = [1, -3, 5, 4, 0]
df

To add a row, we first create a new one-row data frame from a Python dictionary with the column names as keys, setting the index value of the new row through the `index` parameter. Then, we pass the existing data frame and the new row to `pd.concat` as a list:

In [None]:
new_data = {
    'A' : [3],
    'B' : [10],
    'C' : ['chair'],
    'D' : ['zebra'],
    'E' : [2]
}

new_row = pd.DataFrame(new_data, index=[8])
df = pd.concat([df, new_row])
df
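
One thing to keep in mind is that `pd.concat` returns a _new_ data frame and leaves its inputs alone, which is why I re-assigned the result to `df` above. The sketch below shows this on two small hypothetical data frames, `toy` and `extra`, which I made up for this purpose:

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
toy = pd.DataFrame({'A': [1, 2], 'D': ['cat', 'dog']}, index=[12, 7])
extra = pd.DataFrame({'A': [3], 'D': ['zebra']}, index=[8])

combined = pd.concat([toy, extra])

print(toy.shape)               # (2, 2), so the original is unchanged
print(combined.shape)          # (3, 2)
print(list(combined.index))    # [12, 7, 8]
```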

Finally, let's talk boolean masking in the context of Pandas data frames. Conveniently, it is almost identical to masking for NumPy arrays. For example, suppose we wanted to grab all rows with `cat` in column `D`. Then, we create a mask based on this criterion and use it to index into the data frame.

In [None]:
mask = (df['D'] == 'cat')
df[mask]

What if we wanted to use a mask to pull out the single row with `cat` in column `D` and `0` in column `A`? We would do it like this:

In [None]:
another_mask = (df['D'] == 'cat') & (df['A'] == 0)
df[another_mask]
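
It can help to look at a mask on its own before using it. In this sketch, on a hypothetical data frame `toy` that I invented, the mask is a column of `True` and `False` values labeled by the same index as the data frame:

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
toy = pd.DataFrame({'A': [0, 4, 0], 'D': ['cat', 'cat', 'dog']}, index=[12, 7, 5])

is_cat = (toy['D'] == 'cat')
print(is_cat.tolist())                  # [True, True, False]

both = toy[is_cat & (toy['A'] == 0)]    # rows satisfying both conditions
print(list(both.index))                 # [12]
```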

### Problem 6 --- Importing data

We will import `data-1-2.csv` into the notebook from the following URL location:

In [None]:
# Be sure to run this cell! Do NOT alter or delete it!
url = 'https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/data/data-1-2.csv?raw=true'

Now, in the next code cell and using the above URL, make a call to `pd.read_csv` to import the data into the data frame `df`. (This will overwrite the version of `df` that we used above, but that's ok.) Be sure to print out `df` to make sure it worked.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now set the `num` column as the index column of `df`. Afterward, print out `df` yet again to make sure it worked.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 7 --- Data frame slicing and boolean masking

Extract column `D` from the data frame and save it into the variable `col`. Use either the column-name-only method I showed you above, or `loc` and a colon `:`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Using boolean masking, pull out the rows in `df` with either `lamp` in column `A` or `eagle` in column `D`. Save your answer into the variable `rows`. (Notice the "**or**" instead of "**and**"!)

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 8 --- Adding rows and columns

Add the following row to the data frame:

$$
[\text{door}, 17, 5, \text{dove}].
$$

Set the index value of the new row to $22$. Make sure to print out `df` to verify that it worked.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


## Matplotlib

### Description

The [Matplotlib](https://matplotlib.org/) library provides most of the basic plotting functionality in Python. The fastest way to learn Matplotlib is to reverse engineer an example.

But first, we need to do our imports. Most of the time, you will be interacting with the `pyplot` module in Matplotlib---its conventional alias is `plt`:

In [None]:
import matplotlib.pyplot as plt

Here's our basic example. Suppose we wanted to plot a full period of the function $f(x) = \sin{x}$ over the interval $0\leq x \leq 2\pi$. Here's how we do it:

In [None]:
x_vals = np.linspace(0, 2 * np.pi)  # np.pi is NumPy's built-in value of pi
y_vals = np.sin(x_vals)

plt.plot(x_vals, y_vals)
plt.show()

As you can see, the code is quite simple and almost explains itself.

1. First, we make a call to the function `np.linspace` that we studied earlier, generating a NumPy array of equally spaced values in the interval $[0,2\pi]$. This array is saved as `x_vals`. Note that I did not explicitly set the `num` parameter and instead used the default value of `50` (see the [doc](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html)).

2. Then, the _entire_ array `x_vals` is plugged into the `np.sin` function, which is NumPy's implementation of $\sin{x}$. When you plug an array into a NumPy function like this, NumPy knows to apply the function to each entry in the array. (This is called _vectorization_.) Therefore, the array `y_vals` contains 50 numbers.

3. We then call the `plot` method of `plt`, passing in the two arrays `x_vals` and `y_vals`.

4. Finally, we show the plot by calling `plt.show()`.

So, what Matplotlib is _actually_ doing is plotting $50$ points in the $xy$-plane, and then connecting the dots to draw the sine graph. But because these points are so numerous and close together, we _see_ essentially a smooth, curved graph.
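
To see what vectorization buys us, here is a sketch that computes the same values in two ways: once with a single call to `np.sin`, and once with an explicit loop over the entries using `math.sin` from the standard library. The names `xs`, `vectorized` and `looped` are my own.

```python
import math
import numpy as np

xs = np.linspace(0, 2 * np.pi)                   # num defaults to 50

vectorized = np.sin(xs)                          # one call handles every entry
looped = np.array([math.sin(x) for x in xs])     # the same thing, entry by entry

print(vectorized.shape)                          # (50,)
print(np.allclose(vectorized, looped))           # True
```

The two results agree, but the vectorized version is shorter to write and is typically much faster on large arrays.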

We can add labels to the $x$- and $y$-axes like this:

In [None]:
plt.plot(x_vals, y_vals)
plt.xlabel('$x$ values')
plt.ylabel('$y$ values')
plt.show()

We could have also simply called `plt.xlabel('x values')` to generate the label along the $x$-axis. The string that I wrote, `'$x$ values'`, puts the "$x$" in LaTeX math mode, which is essentially an italic font. (I am a snobbish mathematician, after all.)

We can add a second curve to our graph like this:

In [None]:
# The r prefix makes these raw strings, so the backslashes reach LaTeX untouched.
plt.plot(x_vals, np.sin(x_vals), label=r'$\sin{x}$')
plt.plot(x_vals, np.cos(x_vals), label=r'$\cos{x}$')
plt.xlabel('$x$ values')
plt.ylabel('$y$ values')
plt.legend()
plt.show()

I'm going to let you dissect the code on your own, since I'm pretty sure you're getting the hang of this.

If we wanted to plot something like

$$
f(x) = x(x^2-1)
$$

over the interval $[-2,2]$, then it would be convenient to define our own function:

In [None]:
def f(x):
    return x * (x ** 2 - 1)

x_vals = np.linspace(-2, 2)
plt.plot(x_vals, f(x_vals))
plt.show()

It's also possible to make a single plot consisting of multiple subplots. Here's how:

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(8, 5))
axes[0].plot(x_vals, np.sin(x_vals))
axes[1].plot(x_vals, np.cos(x_vals))
plt.tight_layout()

Let's step through the code:

1. We make a call to the `subplots` method in `plt`. We pass the parameter `ncols=2`, which means that we want to generate a plot with two columns. The parameter `figsize` is self-explanatory, and sets the size of the figure. It's often difficult to know a good size ahead of time, and you just need to tinker with these values.

2. The call to `plt.subplots` returns _two_ items which are saved as `fig` and `axes`. We won't use `fig`, so don't worry about it. The `axes` object is actually a NumPy array that contains the two separate "axes" on which we will draw our plots.

3. We call the `plot` method on the first axes object, `axes[0]`. We plot our sine curve.

4. We then call the `plot` method on the second set of axes, `axes[1]`. We plot our cosine curve.

5. Instead of calling `plt.show()`, we call `plt.tight_layout()`, which adjusts the spacing between the subplots so that they don't overlap. The notebook then displays the figure automatically at the end of the cell.
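
If you are curious about what `plt.subplots` hands back, the following sketch inspects the second returned item. The names `demo_fig` and `demo_axes` are my own, and I close the figure at the end so that the empty axes are not displayed.

```python
import matplotlib.pyplot as plt
import numpy as np

demo_fig, demo_axes = plt.subplots(ncols=2, figsize=(8, 5))

print(type(demo_axes))    # <class 'numpy.ndarray'>
print(demo_axes.shape)    # (2,)

plt.close(demo_fig)       # discard the empty figure
```

Since `demo_axes` is an ordinary $1$-dimensional NumPy array, indexing with `[0]` and `[1]` works exactly as it did for `v` earlier.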

There is so, _so_ much more that we can do with Matplotlib. We will see fancier plots as we go through the course, and we will also learn a bit about the powerful statistical graphing library Seaborn which is built on top of Matplotlib. But this should be enough for now.

### Problem 9 --- One big plot

Your only assignment in this section is to plot [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-1-1.png).

You'll be using the `subplots` method from `plt`, as I showed you above. The two functions on the left are

$$
f(x) = x(x-1)(x-2)
$$

and

$$
g(x) = x^2(x-1).
$$

You should plot these over the interval $[-1,3]$. The two functions on the right are $\sin{x}$ and $\cos{x}$; you should plot these over the interval $[0,4\pi]$.

Make sure that you label the $x$- and $y$-axes as shown, and also include the legends. I did **not** show you how to change the labels on `axes` objects, nor did I show you how to add legends to subplots. So this problem will be a bit of a challenge, as you'll need to figure that stuff out on your own.

(Hint: I used 18 lines of code to produce the plot.)

In [None]:
# ENTER YOUR CODE IN THIS CELL

