# Introduction to NumPy

There are a few extremely important libraries for data science in Python. **NumPy** is one of them.

NumPy is great for efficiently loading, storing and manipulating in-memory data.

Data comes in a wide variety of formats, but almost everything we do in data science boils down to working with arrays of numbers.

For example, the words in documents can be represented as the numbers that encode letters in computers or even the frequency of particular words in a collection of documents. Digital images can be thought of as two-dimensional arrays of numbers representing pixel brightness or color. Sound files can be represented as one-dimensional arrays of amplitude versus time. NumPy excels at this.

NumPy is short for *Numerical Python*, and it provides an efficient means of storing and operating on dense data buffers in Python.

**Library Alias**

In [None]:
import numpy as np

## Built-In Help

There's a lot to remember when we start pulling libraries into Python. Jupyter notebooks have tried to make it easier:

### Exercise

In [None]:
# Place your cursor after the period and press <TAB>:
np.

### Exercise

In [None]:
# Replace 'add' below with a few different NumPy function names and look over the documentation:
np.add?

Pick out some functions and try reading the documentation - get comfortable doing it!

## NumPy arrays: a specialized data structure for analysis

> **Learning goal:** By the end of this subsection, you should have a basic understanding of what NumPy arrays are and how they differ from the other Python data structures you have studied thus far.

So, why bother with NumPy? We can just use Python lists, right?

### Lists in Python


Remember that Python lists can hold just one kind of object **or** we can store multiple object types:

In [None]:
myList = list(range(10))
myList

In [None]:
[type(item) for item in myList]

In [None]:
myList2 = [True, "2", 3.0, 4]
[type(item) for item in myList2]

This is great and flexible, but the flexibility comes at a price. 

Each item in a list is really a separate Python object. That means that each item in a list must contain its own type info, reference count, and other information. All of this information can become expensive in terms of memory and performance if we are dealing with hundreds of thousands or millions of items in a list. 

Usually, in data science, our arrays just store a single type of data anyway (such as integers or floats), so there is room for improved efficiency. 

Enter the fixed-type, NumPy-style array.

### NumPy Fixed-type arrays

At the level of implementation by the computer, the `ndarray` that is part of the NumPy package contains a single pointer to one contiguous block of data. This is efficient memory-wise and computationally. Better still, NumPy provides efficient *operations* on data stored in `ndarray` objects.
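
To make the memory argument concrete, here is a rough sketch that compares the footprint of a Python list with that of an `ndarray` holding the same integers. The names `py_list`, `np_array`, `list_bytes`, and `array_bytes` are made up for this illustration, and the list figure is only an approximation that depends on your Python build.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, and every one of those
# objects carries its own type information and reference count.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An ndarray stores the raw values back to back: 8 bytes per int64 element.
array_bytes = np_array.nbytes

print("list: ", list_bytes, "bytes (approximate)")
print("array:", array_bytes, "bytes")
```

The exact list size will vary, but it should come out several times larger than the array's data buffer.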

(Note that we will pretty much use “array,” “NumPy array,” and “ndarray” interchangeably throughout this section to refer to the ndarray object.)

#### Creating NumPy arrays method 1: using Python lists

In [None]:
# Create an integer array:
np.array([1, 4, 2, 5, 3])

**Think, Pair, Share**

In [None]:
np.array([3.14, 4, 2, 3])

This is called *upcasting*!
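
One way to see upcasting at work is to inspect the `dtype` attribute of each array. This is a small sketch; the variable names `ints` and `mixed` are just for illustration.

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])

print(ints.dtype)   # an integer dtype (the exact width depends on your platform)
print(mixed.dtype)  # float64: the integers were upcast to floats
print(mixed)
```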

### Exercise - try it and learn!

In [None]:
# What happens if you construct an array using a list that contains a combination of integers, floats, and strings?
np.array([3.14, '4', 2, 3])

**Explicit Typing**

**Share**

In [None]:
np.array([1, 2, 3, 4], dtype='float32')

### Exercise

In [None]:
# Try this using a different dtype.
# Remember that you can always refer to the documentation with the command np.array?
np.array([1, 2, 3, 4], dtype='int16')

**Multi-Dimensional Array**

Multidimensional arrays are useful and common in data science. 

**Think, Pair, Share**

In [None]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

#### Creating NumPy arrays method 2: building from scratch

In practice, it is often more efficient to create arrays from scratch using functions built into NumPy, particularly for larger arrays.

In [None]:
np.zeros(10, dtype=int)

In [None]:
np.ones((3, 5), dtype=float)

In [None]:
np.full((3, 5), 3.14)

#### Exercise: try referencing the documentation

In [None]:
np.arange(0, 20, 3)

In [None]:
np.arange?

#### Exercise - reference the docs

In [None]:
# What does np.arange(0, 20, 2) do?
np.arange?

In [None]:
np.arange(0, 20, 2) 

In [None]:
np.linspace(0, 1, 5)

In [None]:
np.random.random((3, 3))

In [None]:
np.random.normal(0, 1, (3, 3))

In [None]:
np.random.randint(0, 10, (3, 3))

In [None]:
# What does np.eye(3) do?
np.eye?

In [None]:
np.eye(3)

In [None]:
np.empty(3)

#### Exercise

Take a couple of minutes to go back and play with these code snippets, changing the parameters. These functions are the bread-and-butter of creating NumPy arrays and you will want to become comfortable with them.

Below is a table listing out several of the array-creation functions in NumPy.

| Function      | Description |
|:--------------|:------------|
| `array`       | Converts input data (list, tuple, array, or other sequence type) to an ndarray either |
|               | by inferring a dtype or explicitly specifying a dtype. Copies the input data by default. |
| `asarray`     | Converts input to ndarray, but does not copy if the input is already an ndarray. |
| `arange`      | Similar to the built-in `range()` function but returns an ndarray instead of a list. |
| `ones`, `ones_like` | Produces an array of all 1s with the given shape and dtype. |
|               | `ones_like` takes another array and produces a ones-array of the same shape and dtype. |
| `zeros`, `zeros_like` | Similar to `ones` and `ones_like` but producing arrays of 0s instead. |
| `empty`, `empty_like` | Creates new arrays by allocating new memory, but does not populate with any values |
|               | like `ones` and `zeros`. |
| `full`, `full_like` | Produces an array of the given shape and dtype with all values set to the indicated “fill value.” |
|               | `full_like` takes another array and produces a filled array of the same shape and dtype. |
| `eye`, `identity` | Create a square $N \times N$ identity matrix (1s on the diagonal and 0s elsewhere) |
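
The difference between `array` and `asarray` in the table is easiest to see by modifying the input afterwards. The following is a minimal sketch with illustrative variable names (`original`, `copied`, `same`).

```python
import numpy as np

original = np.arange(5)

copied = np.array(original)    # copies the data by default
same = np.asarray(original)    # no copy: the input is already an ndarray

original[0] = 100

print(copied[0])         # the copy still holds the old value
print(same[0])           # asarray returned the original, so it sees the change
print(same is original)
```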

### NumPy data types

The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string:

```python
np.zeros(8, dtype='int16')
```

Or they can be specified directly using the NumPy object:

```python
np.zeros(8, dtype=np.int16)
```

| Data type	    | Description |
|:--------------|:------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (this alias was removed in NumPy 2.0).| 
| ``float16``   | Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single-precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double-precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (this alias was removed in NumPy 2.0).| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 
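
If you want to check the ranges in the table yourself, NumPy can report them. The sketch below uses `np.iinfo` for integer types; the variable names are only for illustration.

```python
import numpy as np

# The string form and the NumPy object name the same dtype
same_dtype = np.dtype('int16') == np.dtype(np.int16)
print(same_dtype)

int16_info = np.iinfo(np.int16)
uint8_info = np.iinfo(np.uint8)
print(int16_info.min, int16_info.max)
print(uint8_info.min, uint8_info.max)

# itemsize reports how many bytes each element occupies
small = np.zeros(8, dtype=np.int16)
print(small.itemsize)
```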

If these data types seem a lot like those in C, that's because NumPy is built in C.

> **Takeaway:** NumPy arrays are a data structure similar to Python lists that provide high performance when storing and working on large amounts of homogeneous data—precisely the kind of data that you will encounter frequently in doing data science. NumPy arrays support many data types beyond those discussed in this course. With all of that said, however, don’t worry about memorizing all of the NumPy dtypes. **It’s often just necessary to care about the general kind of data you’re dealing with: floating point, integer, Boolean, string, or general Python object.**

## Working with NumPy arrays: the basics

> **Learning goal:** By the end of this subsection, you should be comfortable working with NumPy arrays in basic ways.

Now that you know how to create arrays in NumPy, you need to get comfortable manipulating them for two reasons. 
1. You will work with NumPy arrays as part of your exploration of data science. 
1. Our other important Python data-science tool, pandas, is actually built around NumPy. 

Getting good at working with NumPy arrays will pay dividends in the next section (Section 4) and beyond: NumPy arrays are the building blocks for the `Series` and `DataFrame` data structures in the Python pandas library and you will use them *a lot* in data science. 

To get comfortable with array manipulation, we will cover five specifics:

- **Array attributes**: Assessing the size, shape, and data types of arrays
- **Indexing arrays**: Getting and setting the value of individual array elements
- **Slicing arrays**: Getting and setting smaller subarrays within a larger array
- **Reshaping arrays**: Changing the shape of a given array
- **Joining and splitting arrays**: Combining multiple arrays into one and splitting one array into multiple arrays

### Array attributes

First, let's look at some array attributes. We'll start by defining three arrays filled with random numbers: one one-dimensional, another two-dimensional, and the last three-dimensional. Because we will be using NumPy's random number generator, we will set a *seed* value in order to ensure that you get the same random arrays each time you run this code:

In [None]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

a1 = np.random.randint(10, size=6)  # One-dimensional array
a2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
a3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

**Array Types**

In [None]:
print("dtype:", a3.dtype)



### Exercise:

In [None]:
a3

In [None]:
# Change the values in this code snippet to look at the 
# attributes for a1, a2, and a3:
print("a3 ndim: ", a3.ndim)
print("a3 shape:", a3.shape)
print("a3 size: ", a3.size)

### Indexing arrays

**Quick Review**

In [None]:
a1

**Share**

In [None]:
a1[0]

In [None]:
a1[4]

In [None]:
a1[-1]

In [None]:
a1[-2]

**Multi-Dimensional Arrays**

In [None]:
a2

**Explore on your own**

Do multidimensional NumPy arrays work like Python lists of lists? Try a few combinations like a2[1][1] or a3[0][2][1] and see what is returned:

In [None]:
a2[0, 0]

In [None]:
a2[2, 0]

In [None]:
a2[2, -1]

In [None]:
a2[0, 0] = 12
a2

**Think, pair, share**

In [None]:
a1

In [None]:
a1[0] = 3.14159
a1

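Watch out! This doesn't behave like you might expect: because `a1` has an integer dtype, the floating-point value is silently truncated to `3` instead of the array being upcast.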
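
Here is a small sketch of that fixed-type behavior on a fresh array, along with one possible way around it (converting to a float array first). The arrays `ints` and `floats` are illustrative and separate from `a1`.

```python
import numpy as np

ints = np.array([5, 0, 3])
ints[0] = 3.14159            # the fractional part is dropped to fit the integer dtype
print(ints)

floats = ints.astype(float)  # astype returns a new array with the requested dtype
floats[0] = 3.14159
print(floats)
```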

### Exercise:

In [None]:
# What happens if you try to insert a string into a1?
# Hint: try both a string like '3' and one like 'three'

int('3')
a1[2] = '3'
a1
a1[3] = "three"

### Slicing arrays

Similar to how you can use square brackets to access individual array elements, you can also use them to access subarrays. You do this with the *slice* notation, marked by the colon (`:`) character. 

NumPy slicing syntax follows that of the standard Python list; so, to access a slice of an array `a`, use this notation:
``` python
a[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
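
As a quick check of those defaults, the sketch below compares slices with omitted values against their fully written-out equivalents. The names `same_all`, `same_head`, `same_tail`, and `stepped` are just for this example.

```python
import numpy as np

a = np.arange(10)

# Omitted values fall back to start=0, stop=size of the dimension, step=1
same_all = np.array_equal(a[:], a[0:10:1])
same_head = np.array_equal(a[:5], a[0:5:1])
same_tail = np.array_equal(a[5:], a[5:10:1])
print(same_all, same_head, same_tail)

# All three values given: start at 2, stop before 9, take every third element
stepped = a[2:9:3]
print(stepped)
```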

Let's take a look at accessing sub-arrays in one dimension and in multiple dimensions.

#### One-dimensional slices

In [None]:
a = np.arange(10)
a

In [None]:
a[:5]

In [None]:
a[5:]

In [None]:
a[4:7]

In [None]:
a[4:7:3]

**Slicing With Index**

In [None]:
a[::2]

In [None]:
a[1::2]

Be careful when using negative values for ``step``. When ``step`` has a negative value, the defaults for ``start`` and ``stop`` are swapped and you can use this functionality to reverse an array:

In [None]:
a[::-1]

In [None]:
a[5::-2]

### Exercise:

In [None]:
a

In [None]:
# How can you create a slice that contains every third element of a
# descending from the second-to-last element to the second element of a?


#### Multidimensional slices

Multidimensional slices use the same slice notation as one-dimensional subarrays, mixed with the comma-separated notation of multidimensional arrays. Some examples will help illustrate this.

In [None]:
a2

In [None]:
a2[:2, :3] # two rows, three columns

In [None]:
a2[:3, ::2] # all rows, every other column

Finally, subarray dimensions can even be reversed together:

In [None]:
a2[::-1, ::-1]

#### Accessing array rows and columns

You will often need to access a single row or column in an array. You can do this through a combination of indexing and slicing; specifically by using an empty slice marked by a single colon (``:``).

In [None]:
a2

In [None]:
print(a2[:, 0]) # first column of a2

In [None]:
print(a2[0, :]) # first row of a2

In the case of row access, the empty slice can be omitted for a more compact syntax:

In [None]:
print(a2[0]) # equivalent to a2[0, :]

### Exercise:

In [None]:
# How would you access the third column of a3?
# How about the third row of a3?
print(a3[:, :, 2])

#### Slices are no-copy views

It's important to know that slicing produces *views* of array data, not *copies*. This is a **huge** difference between NumPy array slicing and Python list slicing. 

With Python lists, slices are only shallow copies of lists; if you modify a copy, it doesn't affect the parent list. 

When you modify a NumPy subarray, **you modify the original array.** 
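
A side-by-side sketch of the two behaviors may help. The names `py_list`, `list_slice`, `arr`, and `arr_slice` are made up for this comparison.

```python
import numpy as np

py_list = [0, 1, 2, 3, 4]
list_slice = py_list[:2]     # slicing a list builds a new list
list_slice[0] = 99

arr = np.arange(5)
arr_slice = arr[:2]          # slicing an array gives a view onto the same data
arr_slice[0] = 99

print(py_list)   # the parent list is untouched
print(arr)       # the parent array has changed
print(np.shares_memory(arr, arr_slice))
```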

In [None]:
print(a2)

In [None]:
a2_sub = a2[:2, :2] # Get a 2x2 subarray
print(a2_sub)

In [None]:
a2_sub[0, 0] = 99 # Modify
print(a2_sub) 

In [None]:
print(a2)

### Exercise:

In [None]:
# Now try reversing the column and row order of a2_sub
# Does a2 look the way you expected it would
# after that manipulation?


This behavior is actually really useful as we access and modify small parts of a large dataset in-memory.

#### Copying arrays


In [None]:
a2_sub_copy = a2[:2, :2].copy()
print(a2_sub_copy)

In [None]:
a2_sub_copy[0, 0] = 42
print(a2_sub_copy)

In [None]:
print(a2)

### Joining and splitting arrays

Another common data-manipulation need in data science is combining multiple datasets; learning first how to do this with NumPy arrays will help you in the next section when we do this with more complex data structures. You will also often need to split a single array into multiple arrays.

#### Joining arrays

To join arrays in NumPy, you will most often use `np.concatenate`, which is the method we will cover here. If you find yourself in the future needing to specifically join arrays in mixed dimensions (a rarer case), read the documentation on `np.vstack`, `np.hstack`, and `np.dstack`.
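
As a brief sketch of how the join direction is controlled, the example below uses the `axis` argument of `np.concatenate` on a two-dimensional array, and shows `np.vstack` and `np.hstack` for the mixed-dimension case. The arrays `row` and `col` are hypothetical.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

stacked_rows = np.concatenate([grid, grid])          # axis=0 is the default
stacked_cols = np.concatenate([grid, grid], axis=1)  # join side by side instead
print(stacked_rows.shape, stacked_cols.shape)

row = np.array([7, 8, 9])          # one-dimensional
with_row = np.vstack([grid, row])  # stacked as an extra row
col = np.array([[10], [20]])       # a single column
with_col = np.hstack([grid, col])  # stacked as an extra column
print(with_row.shape, with_col.shape)
```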

In [None]:
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
np.concatenate([a, b])

In [None]:
c = [99, 99, 99]
print(np.concatenate([a, b, c]))

In [None]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
grid

**Think, Pair, Share**

In [None]:
np.concatenate([grid, grid])


#### Splitting arrays

In order to split arrays into multiple smaller arrays, you can use the functions ``np.split``, ``np.hsplit``, ``np.vsplit``, and ``np.dsplit``.  As above, we will only cover the most commonly used function (`np.split`) in this course.

**Think, Pair, Share**

In [None]:
a = [1, 2, 3, 99, 99, 3, 2, 1]
a1, a2, a3 = np.split(a, [3, 5])
print(a1, a2, a3)

Notice that *N* split points produce *N + 1* subarrays. In this case, `np.split` has formed the subarray `a2` from `a[3]` and `a[4]`: the slice starts at position 3 (the first split point) and stops just before position 5 (the second split point; remember how Python indexing works). `a1` and `a3` pick up the leftover portions of the original array `a`.
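
To check this, the sketch below keeps the pieces in a single list (called `parts` here, purely for illustration) and compares each piece with the equivalent slice.

```python
import numpy as np

a = np.array([1, 2, 3, 99, 99, 3, 2, 1])
split_points = [3, 5]

parts = np.split(a, split_points)
print(len(parts))   # one more piece than there are split points

# Each piece matches an ordinary slice between neighboring split points
print(np.array_equal(parts[0], a[:3]))
print(np.array_equal(parts[1], a[3:5]))
print(np.array_equal(parts[2], a[5:]))
```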

> **Takeaway:** Manipulating datasets is a fundamental part of preparing data for analysis. The skills you learned and practiced here will form building blocks for the more sophisticated data manipulation you will learn in later sections in this course.

## Sorting arrays

So far we have just worried about accessing and modifying NumPy arrays. Another huge thing you will need to do as a data scientist is sort array data. Sorting is often an important means of teasing out the structure in data (such as outlying data points).

Although you could use Python's built-in `sorted` function or the list `sort` method, they will not work nearly as efficiently as NumPy's `np.sort` function.

`np.sort` returns a sorted copy of an array without modifying the input, whereas the array's own `sort` method sorts the array in place:

In [None]:
a = np.array([2, 1, 4, 3, 5])
np.sort(a)

In [None]:
print(a)

In [None]:
a.sort()
print(a)

### Sorting along rows or columns

A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the `axis` argument. For example:

In [None]:
rand = np.random.RandomState(42)
table = rand.randint(0, 10, (4, 6))
print(table)

In [None]:
np.sort(table, axis=0)

In [None]:
np.sort(table, axis=1)

Bear in mind that this treats each row or column as an independent array; any relationships between the row or column values will be lost doing this kind of sorting.
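
The small sketch below shows what "lost relationships" means in practice. The table of (id, score) rows is hypothetical, and `np.argsort` is shown only as one possible way to reorder whole rows when you need to keep them intact.

```python
import numpy as np

# Hypothetical table: each row is (id, score)
table = np.array([[3, 10],
                  [1, 30],
                  [2, 20]])

# Each column is sorted on its own, so ids and scores no longer line up
independent = np.sort(table, axis=0)
print(independent)

# Reordering whole rows by the first column keeps each (id, score) pair together
order = np.argsort(table[:, 0])
rows_kept = table[order]
print(rows_kept)
```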

## NumPy vs Python built-in functions

> **Learning goal:** By the end of this subsection, you should have a basic understanding of what NumPy universal functions are and how (and why) to use them.


Some of the properties that make Python great to work with for data science (its dynamic, interpreted nature, for example) can also make it slow. This is particularly true with looping. These small performance hits can add up to minutes (or longer) when dealing with truly huge datasets.

The performance bottleneck is not the operations themselves, but the type-checking and function dispatches that Python performs on each cycle of a loop. 

Each time Python does a calculation, it first examines the object's type and does a dynamic lookup of the correct function to use for that type. Such is life with interpreted code. 

However, you may remember that NumPy is actually implemented in C - a compiled language! We can get the best of both worlds by using NumPy universal functions.
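
As a rough illustration of the two styles, the sketch below computes reciprocals first with an explicit Python loop and then with a single array expression. The helper `reciprocals_loop` is a made-up name for this example; you can time both versions on a large array with `%timeit` to see the difference on your own machine.

```python
import numpy as np

def reciprocals_loop(values):
    """Compute 1/x one element at a time in interpreted Python."""
    out = np.empty(len(values))
    for i in range(len(values)):
        out[i] = 1.0 / values[i]
    return out

values = np.arange(1, 6)

looped = reciprocals_loop(values)
vectorized = 1.0 / values   # the loop runs inside NumPy's compiled code

print(looped)
print(vectorized)
print(np.allclose(looped, vectorized))
```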

**Array arithmetic**

Many NumPy ufuncs use Python's native arithmetic operators, so you can use the standard addition, subtraction, multiplication, and division operators that we covered in Section 1:

In [None]:
a = np.arange(4)

print("a     =", a)
print("a + 5 =", a + 5)
print("a - 5 =", a - 5)
print("a * 2 =", a * 2)
print("a / 2 =", a / 2)
print("a // 2 =", a // 2)  # floor division

There are also ufuncs for negation, exponentiation, and the modulo operation:

In [None]:
print("-a     = ", -a)
print("a ** 2 = ", a ** 2)
print("a % 2  = ", a % 2)

Even though it looks like we're just using normal Python operators, we are actually using ufuncs: NumPy overloads the operators so that they act as *wrappers* around the corresponding NumPy functions. 

For example, the `+` operator is actually a wrapper for the `add` function:

In [None]:
np.add(a, 2)

Here's a cheat sheet for the equivalencies between Python operators and NumPy ufuncs:

| Operator	    | Equivalent ufunc    | Description                           |
|:--------------|:--------------------|:--------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|
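
You can confirm the equivalences in the table yourself. The sketch below collects one comparison per operator in a dictionary called `checks` (an illustrative name).

```python
import numpy as np

a = np.arange(4)

checks = {
    "+": np.array_equal(a + 5, np.add(a, 5)),
    "-": np.array_equal(a - 5, np.subtract(a, 5)),
    "unary -": np.array_equal(-a, np.negative(a)),
    "*": np.array_equal(a * 2, np.multiply(a, 2)),
    "/": np.array_equal(a / 2, np.divide(a, 2)),
    "//": np.array_equal(a // 2, np.floor_divide(a, 2)),
    "**": np.array_equal(a ** 2, np.power(a, 2)),
    "%": np.array_equal(a % 2, np.mod(a, 2)),
}

for operator, matches in checks.items():
    print(operator, matches)
```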

#### Exponents and logarithms

In [None]:
a = [1, 2, 3]
print("a     =", a)
print("e^a   =", np.exp(a))
print("2^a   =", np.exp2(a))
print("3^a   =", np.power(3, a))

In [None]:
a = [1, 2, 4, 10]
print("a        =", a)
print("ln(a)    =", np.log(a))
print("log2(a)  =", np.log2(a))
print("log10(a) =", np.log10(a))

There are also specialized versions of these ufuncs written to help maintain precision. 
These functions give more precise values than if you were to use the raw `np.log` or `np.exp` on very small values of `a`.
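
To see the precision difference, the sketch below compares both approaches for one very small input against the leading terms of the Taylor series, which serve as an approximate reference here. The value `x = 1e-10` and all variable names are chosen just for this illustration.

```python
import numpy as np

x = 1e-10

# exp(x) - 1 is approximately x + x**2 / 2 for tiny x
exp_reference = x + x**2 / 2
exp_naive_error = abs((np.exp(x) - 1) - exp_reference)
exp_special_error = abs(np.expm1(x) - exp_reference)

# log(1 + x) is approximately x - x**2 / 2 for tiny x
log_reference = x - x**2 / 2
log_naive_error = abs(np.log(1 + x) - log_reference)
log_special_error = abs(np.log1p(x) - log_reference)

print("exp(x) - 1 error:", exp_naive_error, " expm1 error:", exp_special_error)
print("log(1 + x) error:", log_naive_error, " log1p error:", log_special_error)
```

The naive versions lose digits because `1 + x` and `exp(x)` are rounded to a value near 1 before the subtraction or logarithm happens.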

In [None]:
a = [0, 0.001, 0.01, 0.1]
print("exp(a) - 1 =", np.expm1(a))
print("log(1 + a) =", np.log1p(a))

#### Specialized ufuncs

NumPy has many other ufuncs. Another source for specialized and obscure ufuncs is the submodule `scipy.special`. If you need to compute some specialized mathematical or statistical function on your data, chances are it is implemented in `scipy.special`.

In [None]:
from scipy import special

In [None]:
# Gamma functions (generalized factorials) and related functions
a = [1, 5, 10]
print("gamma(a)     =", special.gamma(a))
print("ln|gamma(a)| =", special.gammaln(a))
print("beta(a, 2)   =", special.beta(a, 2))

> **Takeaway:** Universal functions in NumPy provide you with computational functions that are faster than regular Python functions, particularly when working on large datasets that are common in data science. This speed is important because it can make you more efficient as a data scientist and it makes a broader range of inquiries into your data tractable in terms of time and computational resources.

## Aggregations

> **Learning goal:** By the end of this subsection, you should be comfortable aggregating data in NumPy.

One of the first things you will find yourself doing with most datasets is computing the summary statistics for the data in order to get a general overview of your data before exploring it further. These summary statistics include the mean and standard deviation, in addition to other aggregates, such as the sum, product, median, minimum and maximum, or quantiles of the data.

NumPy has fast built-in aggregation functions for working on arrays; they are the subject of this subsection.

### Summing the values of an array

In [None]:
myList = np.random.random(100)
np.sum(myList)

**NumPy vs Python Functions**

In [None]:
large_array = np.random.rand(1000000)
%timeit sum(large_array)
%timeit np.sum(large_array)

Be aware that even though built-in Python functions and their NumPy counterparts may seem similar, they frequently have different optional arguments, in a different order and with different meanings.

In most situations, the NumPy functions will perform better.
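
Here is a small sketch of how the optional arguments differ. The second positional argument of the built-in `sum` is a start value, while for `np.sum` it is the axis to sum along. The array `x` and the result names are illustrative.

```python
import numpy as np

x = np.ones((2, 3))

# Built-in sum: start from 1 and add each row in turn
builtin_result = sum(x, 1)

# np.sum: sum along axis 1, giving one total per row
numpy_result = np.sum(x, 1)

print(builtin_result)
print(numpy_result)
```

The same call shape gives results with different shapes and meanings, so it is worth checking which function you are actually calling.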

### Minimum and maximum

In [None]:
np.min(large_array), np.max(large_array)

In [None]:
print(large_array.min(), large_array.max(), large_array.sum())

### Other aggregation functions

The table below lists other aggregation functions in NumPy. Most NumPy aggregates have a '`NaN`-safe' version, which computes the result while ignoring missing values marked by the `NaN` value.

|Function Name      |   NaN-safe Version  | Description                                   |
|:------------------|:--------------------|:----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
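
As a minimal sketch of the difference between a regular aggregate and its `NaN`-safe version, consider a hypothetical array of readings with one missing value:

```python
import numpy as np

readings = np.array([1.0, np.nan, 3.0])   # hypothetical data with a missing value

plain_mean = np.mean(readings)     # the NaN propagates into the result
safe_mean = np.nanmean(readings)   # the NaN is ignored
safe_sum = np.nansum(readings)

print(plain_mean, safe_mean, safe_sum)
```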

We will see these aggregates often throughout the rest of the course.

> **Takeaway:** Aggregation is the primary means you will use to explore your data, not just when using NumPy, but particularly in conjunction with pandas, the Python library you will learn about in the next section, which builds off of NumPy and thus off of everything you have learned thus far.

## Computation on arrays with broadcasting

> **Learning goal:** By the end of this subsection, you should have a basic understanding of how broadcasting works in NumPy (and why NumPy uses it).

Another means of speeding up operations is to use NumPy's *broadcasting* functionality: a set of rules for applying binary ufuncs like addition, subtraction, or multiplication on arrays of different sizes.

Before, when we performed binary operations on arrays of the same size, those operations were performed on an element-by-element basis.

In [None]:
first_array = np.array([3, 6, 8, 1])
second_array = np.array([4, 5, 7, 2])
first_array + second_array

Broadcasting enables you to perform these types of binary operations on arrays of different sizes. Thus, you could just as easily add a scalar (which is really just a zero-dimensional array) to an array:

In [None]:
first_array + 5

Similarly, you can add a one-dimensional array to a two-dimensional array:

In [None]:
one_dim_array = np.ones((1))
one_dim_array

In [None]:
two_dim_array = np.ones((2, 2))
two_dim_array

In [None]:
one_dim_array + two_dim_array
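
Arrays of ones make it hard to see which values end up where, so here is a further sketch with distinguishable values. The arrays `grid`, `row`, and `col` are made up for this illustration.

```python
import numpy as np

grid = np.zeros((2, 3))
row = np.array([1, 2, 3])       # shape (3,)
col = np.array([[10], [20]])    # shape (2, 1)

row_added = grid + row   # the row is stretched across both rows of grid
col_added = grid + col   # the column is stretched across all three columns
both = row + col         # both operands are stretched to shape (2, 3)

print(row_added)
print(col_added)
print(both)
```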

> **Takeaway:** The data you will work with in data science invariably comes in different shapes and sizes (at least in terms of the arrays in which you work with that data). The broadcasting functionality in NumPy enables you to use binary functions on irregularly fitting data in a predictable way.

## Comparisons, masks, and Boolean logic in NumPy

> **Learning goal:** By the end of this subsection, you should be comfortable with and understand how to use Boolean masking in NumPy in order to answer basic questions about your data.

*Masking* is when you want to manipulate or count or extract values in an array based on a criterion. For example, counting all the values in an array greater than a certain value is an example of masking. 

Boolean masking is often the most efficient way to accomplish these types of tasks in NumPy and it plays a large part in cleaning and otherwise preparing data for analysis (see Section 5).
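
As a first sketch of the idea, the example below counts, extracts, and modifies values using a single Boolean mask. The data in `values` and the threshold of 5 are hypothetical.

```python
import numpy as np

values = np.array([3, 8, 1, 9, 4, 7])   # hypothetical data

mask = values > 5          # Boolean array: True where the criterion holds
count = np.sum(mask)       # True counts as 1, so the sum is a count
selected = values[mask]    # extract only the matching values

capped = values.copy()
capped[mask] = 5           # modify only the matching values

print(mask)
print(count)
print(selected)
print(capped)
```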

### Example: Counting Rainy Days

Let's see masking in practice by examining the monthly rainfall statistics for Seattle. The data is in a CSV file from data.gov. To load the data, we will use pandas, which we will formally introduce in Section 4.

In [None]:
import numpy as np
import pandas as pd

# Use pandas to extract rainfall as a NumPy array
rainfall_2003 = pd.read_csv(
    'Data/Observed_Monthly_Rain_Gauge_Accumulations_-_Oct_2002_to_May_2017.csv')\
    ['RG01'][2:14].values
rainfall_2003

That was a little much for what we've seen so far. Let's break it down.

The rainfall data contains monthly rainfall totals from several rain gauges around the city of Seattle; we selected the first one.

From that gauge, we then selected the relevant months for the first full calendar year in the dataset, 2003. That range of months started at the third row of the data (index 2; remember, Python zero-indexes!) and ran through the fourteenth row (index 13), hence `[2:14]`. The stop value of a slice is not included in the result.
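
If the slice bounds feel unintuitive, this sketch uses a stand-in array of row positions (not the rainfall data itself) to show which rows `[2:14]` picks out:

```python
import numpy as np

rows = np.arange(20)      # stand-in for the row positions of a longer column
selected = rows[2:14]     # starts at index 2, stops before index 14

print(selected)
print(len(selected))      # twelve rows, one per month
```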

You now have an array containing 12 values, each of which records the monthly rainfall in inches from January to December 2003.

Bar charts are an easy way to get an early feel for data.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.bar(np.arange(1, len(rainfall_2003) + 1), rainfall_2003)

Again, breaking down that dense line, we passed two parameters to the `bar` function in pyplot: 
1. the first defines the index for the x-axis 
1. the second defines the data to use for the bars (the y-axis). 

To create the index, we use the NumPy function `arange` to create a sequence of numbers. We know that the length of our array is 12, but it is a good habit to pass the length of an array programmatically in case it changes or you don’t know it exactly. 

We also added 1 to both the start and the end of the `arange` to accommodate for Python zero-indexing (because there is no “month-zero” in the calendar).

What do you think about the weather in Seattle? How does this compare to Sydney?

There are still several questions we would like to answer, such as in how many months did it rain, or what was the average precipitation in those months?

### Boolean operators

How do we find out all months with rain less than 1 inch and greater than .5 inch? 

This is accomplished through Python's *bitwise logic operators*, `&`, `|`, `^`, and `~`. As with the standard arithmetic operators, NumPy overloads these as ufuncs which work element-wise on (usually Boolean) arrays.

For example, we can address this sort of compound question as follows:

In [None]:
np.sum((rainfall_2003 > 0.5) & (rainfall_2003 < 1))

Note that the parentheses here are very important!
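
The reason is operator precedence: `&` binds more tightly than `>` and `<`, so without parentheses Python tries to evaluate `0.5 & array` first. The sketch below uses a small hypothetical array named `rain` to show both the working and the failing form.

```python
import numpy as np

rain = np.array([0.2, 0.7, 0.9, 1.5])   # hypothetical values in inches

# With parentheses: each comparison yields a Boolean array, then & combines them
count = np.sum((rain > 0.5) & (rain < 1))
print(count)

# Without parentheses: parsed as rain > (0.5 & rain) < 1, which fails
try:
    rain > 0.5 & rain < 1
    failed = False
except TypeError as error:
    failed = True
    print("TypeError:", error)
```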

#### Exercise

In [None]:
# Fill in the expressions that correctly calculate the below:
print(
    "Number of months with less than 4 inches of rain and more than 1 inch: ",
    np.sum((rainfall_2003 > 1) & (rainfall_2003 < 4)))
#print("Number of months without rain: ", np.sum(...))
#print("Number of months with rain:    ", np.sum(...))
#print("Months with more than 1 inch:  ", np.sum(...))
#print("Rainy months with < 1 inch:    ", np.sum(...))

### Boolean arrays as masks

In the prior section, we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.

Suppose we want an array of all values in a 2D array that are less than 5:

In [None]:
rand = np.random.RandomState(0)
two_dim_array = rand.randint(10, size=(3, 4))
two_dim_array

In [None]:
two_dim_array < 5

This creates a boolean "mask" that we can then apply to our array:

**Masking**

In [None]:
two_dim_array[two_dim_array < 5]

In [None]:
# Construct a mask of all rainy months
rainy = (rainfall_2003 > 0)

# Construct a mask of all summer months (June through September)
months = np.arange(1, 13)
summer = (months > 5) & (months < 10)

print("Median precip in rainy months in 2003 (inches):   ", 
      np.median(rainfall_2003[rainy]))
print("Median precip in summer months in 2003 (inches):  ", 
      np.median(rainfall_2003[summer]))
print("Maximum precip in summer months in 2003 (inches): ", 
      np.max(rainfall_2003[summer]))
print("Median precip in non-summer rainy months (inches):", 
      np.median(rainfall_2003[rainy & ~summer]))

> **Takeaway:** By combining Boolean operations, masking operations, and aggregates, you can quickly answer questions similar to those we posed about the Seattle rainfall data about any dataset. Operations like these will form the basis for the data exploration and preparation for analysis that will be our primary concerns in Sections 4 and 5.