[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek05.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week05.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week05.ipynb)

# Week 5: Our last foray into NumPy

This is the last week in which NumPy is our primary focus.

Of course, we will keep using NumPy throughout the course.

## Example: Rainfall in Galway

We will use the data `data/Galway_rainfall.csv` which is also publicly available at 

[data.gov.ie](https://data.gov.ie/dataset/galway-univcoll-climate-data)

We will foreshadow a bit of what we will learn about in the coming weeks with this example.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We'll use `Pandas` to extract the relevant data. 

We want the daily rainfall from the years 2009, 2010, and 2011.

You don't have to worry about *what* is happening. 

In [None]:
pre_df = pd.read_csv("data/Galway_rainfall.csv", comment='#')
df = pd.DataFrame({
	'date': pd.to_datetime(pre_df['date'], format="%d-%b-%Y"),
	'rain': pre_df['rain']
})
df09 = df.query('20090101 <= date < 20100101')
df10 = df.query('20100101 <= date < 20110101')
df11 = df.query('20110101 <= date < 20120101')

In [None]:
ax09 = df09.plot(x="date", y="rain")
ax10 = df10.plot(x="date", y="rain")
ax11 = df11.plot(x="date", y="rain")

They all kind of look the same. 

It's not entirely simple to answer the following questions by looking at the graphs. 
1. How many rainy days were there in the year? 
2. How many days with less than 3 mm of rain? 
3. What was the average amount on days with at least 3 mm of rain? 

Naturally there are countless other questions like this too.

In the above code, we have taken advantage of three key conveniences offered by NumPy. 
- (Basic) comparison
- [Masking](https://numpy.org/doc/stable/reference/maskedarray.generic.html#what-is-a-masked-array)
- [Advanced Indexing](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing)

## Comparisons

Comparisons are also implemented as ufuncs, so they act elementwise and return arrays of boolean values.

In [None]:
a1 = np.random.randint(10, size=5)
print(a1)

All of the standard boolean comparisons are available to us.

In [None]:
print(a1 > 4)

In [None]:
print(a1 != 5)

In [None]:
print(a1 == 3)

We know that *really* the right-hand side (RHS) is being **broadcast**.

If you don't know what I mean, check out [Week04.ipynb](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week04.ipynb).

Thus, we can actually compare any pair of compatible arrays (under broadcasting rules).

In [None]:
print(2**a1 <= a1**2 - 2*a1 + 6)
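
As a small sketch of comparing arrays of different shapes, the cell below checks every row of a $3 \times 4$ array against one threshold per column. The names `grid` and `thresholds` and their values are made up for illustration.

```python
import numpy as np

# Hypothetical data: a 3x4 grid and one threshold per column.
grid = np.arange(12).reshape(3, 4)
thresholds = np.array([0, 5, 5, 10])

# thresholds has shape (4,), so it is broadcast across the 3 rows of grid.
above = grid > thresholds
print(above.shape)   # (3, 4)
print(above)
```

The result has the broadcast shape `(3, 4)`: one boolean for each entry of `grid`.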

This next example trips me up every so often...

In [None]:
a1 = np.random.randint(10, size=5)
b1 = np.random.randint(10, size=5)
print(f"      a1 = {a1}")
print(f"      b1 = {b1}")
print(f"a1 == b1 = {a1 == b1}")
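
One way this can bite: `==` on arrays answers entry by entry, so it cannot be used directly in an `if` statement. The sketch below, with made-up arrays `a` and `b`, shows two ways to get a single answer when that is what you want.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 0, 3])

elementwise = (a == b)             # one boolean per entry
same = np.array_equal(a, b)        # a single boolean for the whole array
same_via_all = np.all(a == b)      # equivalent here, since the shapes match

# Using the elementwise result as a condition is ambiguous and raises an error.
try:
    if a == b:
        pass
    raised = False
except ValueError:
    raised = True

print(elementwise, same, same_via_all, raised)
```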

Recall that `False` is often interpreted as $0$ and `True` as $1$.

This isn't just a convenience; there is good mathematical reason for this. (Logical operations have an algebraic analog.)
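
One possible reading of that analog, for values restricted to $0$ and $1$: *and* is multiplication, *or* is $a + b - ab$, *xor* is addition modulo $2$, and *not* is $1 - a$. The check below is only a sketch that runs through all four combinations of two truth values; the variable names are made up.

```python
import numpy as np

# All four combinations of two truth values.
p = np.array([False, False, True, True])
q = np.array([False, True, False, True])

# The same values as the integers 0 and 1.
a = p.astype(int)
b = q.astype(int)

# Compare each logical operation with its arithmetic counterpart.
and_ok = np.array_equal(np.logical_and(p, q), (a * b).astype(bool))
or_ok = np.array_equal(np.logical_or(p, q), (a + b - a * b).astype(bool))
xor_ok = np.array_equal(np.logical_xor(p, q), ((a + b) % 2).astype(bool))
not_ok = np.array_equal(np.logical_not(p), (1 - a).astype(bool))

print(and_ok, or_ok, xor_ok, not_ok)
```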

We can take advantage of this interpretation to count the number of `True` values with NumPy's `sum`.

In [None]:
print(a1)
count = np.sum(a1 > 3)
print(f"The number of entries greater than 3 is {count}")

In [None]:
a2 = np.random.randint(10, size=(4,5))
print(a2)
col = np.sum(a2 % 2 == 0, axis=0)
row = np.sum(a2 % 2 == 0, axis=1)
print(f"The number of even entries in each column is {col}")
print(f"The number of even entries in each row is {row}")
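
Two close relatives of this counting trick, sketched on a made-up array `values`: `np.count_nonzero` counts the `True` entries directly, and `np.mean` of a boolean array gives the *fraction* of entries that are `True`.

```python
import numpy as np

values = np.array([1, 4, 6, 8, 3, 9, 0, 7])   # made-up data
big = values > 3

count = np.count_nonzero(big)   # same count as np.sum(big)
fraction = np.mean(big)         # True counts as 1, False as 0

print(f"{count} of {values.size} entries are greater than 3")
print(f"That is a fraction of {fraction}")
```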

You can check if *any* or *all* values are `True`.

In [None]:
a2 = np.random.randint(10, size=(3, 5))
print(a2)
print(f"There exists a 4 in a2: {np.any(a2 == 4)}")
print(f"Every value in a2 is greater than 0: {np.all(a2 > 0)}")

You can pass axis values with `any` and `all`.

In [None]:
print(np.any(a2 == 0, axis=0))

![](imgs/clippy.png)

It looks like you are using functions already defined in Python: `any`, `all`, and `sum`.
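
The built-in versions are not drop-in replacements for the NumPy ones, though. The sketch below, on a made-up 2-dimensional array `flags`, shows how they can differ: the built-ins iterate over the *rows* of the array rather than over its entries.

```python
import numpy as np

flags = np.array([[True, False, True],
                  [False, False, True]])

np_total = np.sum(flags)   # counts every True entry
py_total = sum(flags)      # adds the rows together, giving an array

# The built-in any asks for the truth value of a whole row, which is ambiguous.
try:
    any(flags)
    builtin_any_failed = False
except ValueError:
    builtin_any_failed = True

print(np_total)
print(py_total)
print(builtin_any_failed)
```

On large arrays the NumPy versions are also typically much faster, since they avoid looping in Python.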


### Bitwise operators

You can also use the operators 
- `&` for `and`
- `|` for `or`
- `^` for `xor`
- `~` for `not`

I avoided using an `&` for the 2009 data by chaining the two comparisons:

In [45]:
df09 = df.query('20090101 <= date < 20100101')

x = 5
(2 < x < 7) == ((2 < x) & (x < 7))

True
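
With NumPy arrays, the chained form does not work, so the operators above are needed. The cell below is a minimal sketch using a made-up array `rain`. Note the parentheses: `&` and `|` bind more tightly than comparisons such as `<`, so each comparison must be wrapped.

```python
import numpy as np

rain = np.array([0.0, 1.2, 3.5, 0.0, 7.1, 2.9])   # made-up values in mm

light = (rain > 0) & (rain < 3)            # some rain, but less than 3 mm
dry_or_heavy = (rain == 0) | (rain >= 3)   # everything else
print(light)
print(np.array_equal(dry_or_heavy, ~light))

# A chained comparison on an array is ambiguous and raises an error.
try:
    0 < rain < 3
    chained_failed = False
except ValueError:
    chained_failed = True
print(chained_failed)
```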

## Masking

A *masked array* is a pair of arrays of identical shape: the data and the mask.

Depending on the masking, it might be more efficient to encode this with less data.

Conceptually, though, it is easier to think of a pair of arrays.

The *mask* is an array of boolean values. 

A `True` in the mask means that the corresponding value in the array is masked, or omitted. 

We can create masked arrays in NumPy via `np.ma.array`.

In [46]:
np.ma.array([1,2,3], mask=[0,1,0])

masked_array(data=[1, --, 3],
             mask=[False,  True, False],
       fill_value=999999)
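
Computations on a masked array skip the masked entries. A short sketch using the same array as above:

```python
import numpy as np

m = np.ma.array([1, 2, 3], mask=[0, 1, 0])

total = m.sum()         # the masked 2 is left out
mean = m.mean()         # average of the two unmasked values
kept = m.compressed()   # plain array of the unmasked values

print(total, mean, kept)
```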

But really, we can use the idea of masking as a means of selecting data.

In [49]:
a1 = np.random.randint(10, size=5)
print(a1)
print(a1[a1 > 5])

[2 0 1 6 7]
[6 7]


The `a1 > 5` plays the role of the **negated** mask. That is, the mask is actually `~(a1 > 5)`.
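
To make the connection concrete, here is a small sketch with a fixed array `a` (chosen to match the output above). Selecting with `a > 5` gives the same values as building a masked array with mask `~(a > 5)` and keeping its unmasked entries.

```python
import numpy as np

a = np.array([2, 0, 1, 6, 7])
keep = a > 5

selected = a[keep]                    # boolean selection
masked = np.ma.array(a, mask=~keep)   # mask out everything we do not keep

print(selected)
print(masked)
print(np.array_equal(selected, masked.compressed()))
```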

I used masking to extract the desired data from our data set.

In [50]:
df09 = df.query('20090101 <= date < 20100101')

We can answer some of the questions we posed earlier. 

In [65]:
rf09 = df09["rain"]     # Just extract the rain data, ignore the dates.

In [63]:
rainy09 = np.sum(rf09 > 0)
print(f"In 2009, there were {rainy09} days with rain.")

In 2009, there were 241 days with rain.


In [66]:
avg09 = np.average(rf09[rf09 > 0])
print(f"In 2009, rainy days averaged {avg09} mm of rain.")

In 2009, rainy days averaged 6.288796680497925 mm of rain.
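
The other two questions can be approached the same way. The sketch below uses a short made-up array `rain` as a stand-in for `rf09`, so the numbers are illustrative only. It also assumes that dry days count as days with less than 3 mm of rain.

```python
import numpy as np

# Made-up daily rainfall in mm, standing in for rf09.
rain = np.array([0.0, 0.4, 5.2, 12.0, 0.0, 2.9, 3.0, 8.8])

# Question 2: days with less than 3 mm of rain (dry days included).
light_days = np.sum(rain < 3)

# Question 3: average rainfall on days with at least 3 mm of rain.
heavy_avg = np.mean(rain[rain >= 3])

print(f"There were {light_days} days with less than 3 mm of rain.")
print(f"Days with at least 3 mm averaged {heavy_avg:.2f} mm of rain.")
```

To exclude dry days from the first count, one could use `(rain > 0) & (rain < 3)` as the condition instead.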


## Advanced Indexing