# Introduction to NumPy for Working with Numerical Data Arrays

<div class="alert alert-success">
    
## This notebook covers
- NumPy multidimensional array data structure
- NumPy data types
- Selecting, slicing, and querying arrays
- Simple calculations with summary functions
- Sorting and grouping data
- Copying and renaming dataframe columns
- Handling missing values
- Merging dataframes and writing to file
</div>

<div class="alert alert-warning">

## Reminders

Remember, you can use Jupyter's built-in table of contents (hamburger on the far left) to jump from heading to heading.

---

This notebook will run in the MSUpy conda environment, which you created in the previous lesson. To select the MSUpy kernel, go to the Kernel tab, select Change Kernel, then select the MSUpy kernel in the pop-up window.

---

To turn on line numbers for code cells, go to the View menu and click Show Line Numbers.

</div>

# I. Importing Necessary Packages

In [1]:
import numpy as np

# II. Introduction to the NumPy Data Structure

The NumPy (**Num**erical **Py**thon) package is widely used in science and engineering for working with data in homogeneous (same data type) multidimensional arrays. Using NumPy multidimensional arrays, as opposed to nested Python lists, improves speed, reduces memory consumption, and offers easier syntax for performing a variety of common data processing tasks.

## Data Structure - ndarray

The NumPy data structure is called an *ndarray*, for n-dimensional array, which we may refer to throughout this notebook simply as an "array". Below is a schematic of a 1-d, 2-d, and 3-d array. A 1-d array looks similar to a list or a single row of data values. A 2-d array looks similar to a grid or table with rows and columns. A 3-d array looks similar to a cube or cuboid. Each dimension in an ndarray is called an *axis* and the first axis is always axis zero, keeping consistent with Python's zero-based indexing. Also, NumPy arrays must be rectangular and not jagged (i.e., each row of a two-dimensional array must have the same number of columns) and all of the data values in each array must be of the same data type. You'll notice below that similar to Pandas DataFrames, NumPy arrays have a shape attribute where each number in the shape tuple is the length of an array axis. 

<img src="images/numpy_arrays.png" alt="schematic of ndarrays" width="700"/>

It could be helpful to think of array dimensionality in terms of the data your arrays could hold (see image below). A geoscientist, for example, may use a 1-d array to hold solar radiation observations over time at a single location (a timeseries). In this case the only axis (axis 0) would represent data values at different times. A 2-d array could hold spatially gridded data like a gridded map of annual mean temperature or alternatively it could hold land surface imagery with 1-meter pixel resolution, for example. In this case axis 0 (rows) could represent data values at different latitudes and axis 1 (columns) could represent data values at different longitudes. A 3-d array could hold a timeseries of spatially gridded data like daily gridded maximum temperature. In this case axis 0 (the stack in the image below) could represent different times and axes 1 and 2 could represent different latitudes and longitudes.  

<img src="images/numpy_geoarrays.png" alt="ndarray data examples" width="700"/>
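
As a rough sketch of the three layouts described above, the snippet below builds zero-filled placeholder arrays with ```np.zeros```. The sizes (365 times, 4 latitudes, 5 longitudes) and the variable names are made up for illustration only.

```python
import numpy as np

# hypothetical sizes chosen only for illustration
n_times, n_lats, n_lons = 365, 4, 5

timeseries = np.zeros(n_times)                     # 1-d: axis 0 = time
grid = np.zeros((n_lats, n_lons))                  # 2-d: axis 0 = latitude, axis 1 = longitude
grid_series = np.zeros((n_times, n_lats, n_lons))  # 3-d: axis 0 = time, axis 1 = latitude, axis 2 = longitude

print(timeseries.shape)
print(grid.shape)
print(grid_series.shape)
```

Each shape tuple has one entry per axis, so the number of entries tells you the dimensionality of the array.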

The above are examples of array data you may encounter frequently, but actually NumPy arrays can have as many axes as you can dream up and each axis can represent whatever you want it to. In this notebook though, we'll think of 1-d arrays mostly as timeseries, 2-d arrays as latitude-longitude grids, and 3-d arrays as timeseries of latitude-longitude grids. 

We won't get into higher dimensional data arrays and we'll only work with numerical data arrays (integer, float, boolean). Technically, NumPy arrays can contain numerical or non-numerical data (such as strings and bytes) but non-numerical data must be defined as fixed-width data types, which requires knowing or calculating the size of the longest text or byte sequence in advance. We will not cover non-numerical NumPy arrays in this course.

## NumPy Data Types

We'll cover a few, but not all, of the NumPy numerical data types here. NumPy supports a much greater variety of numerical data types than core Python does. NumPy numerical data types are more specific than the core Python numerical data types, allowing for more efficient memory usage and faster computation. More detail can be found in the [NumPy Documentation for Data Types](https://numpy.org/doc/stable/user/basics.types.html#data-types).

The NumPy data types below are *concrete* types, meaning that a consistent number of bits are reserved for each individual data value in memory. A bit is the smallest unit of data in a computer (represented by a 0 or a 1) and you need 8 bits (1 byte) to represent one character like a letter or number in memory. The concrete data types below are named by combining the basic numerical type name (e.g., integer, float) with the number of bits that are needed to represent a single value in memory (bitsize).

NumPy Data Type | Description | Value Range
---|---|---
np.bool_ | 8-bit Boolean | True/False, not equivalent to 1,0
np.int8 | 8-bit integer value | -128 to 127
np.int16 | 16-bit integer value | -32768 to 32767
np.int32 | 32-bit integer value | -2147483648 to 2147483647
np.int64 | 64-bit integer value | -9223372036854775808 to 9223372036854775807
np.uint8 | unsigned 8-bit integer value | 0 to 255
np.uint16 | unsigned 16-bit integer value | 0 to 65535
np.uint32 | unsigned 32-bit integer value | 0 to 4294967295
np.uint64 | unsigned 64-bit integer value | 0 to 18446744073709551615
np.float16 | half-precision float values | Precision: 3 decimal digits<br /> Range: ±6.55040e4
np.float32 | single-precision float values | Precision: 6 decimal digits<br /> Range: ±3.4028235e+38
np.float64 | double-precision float values | Precision: 15 decimal digits<br /> Range: ±1.7976931348623157e+308
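
If you'd rather look up these limits than memorize them, one option is to ask NumPy directly with ```np.iinfo``` (integer types) and ```np.finfo``` (float types). The loop below is only a sketch; the choice of which types to print is arbitrary.

```python
import numpy as np

# integer types: iinfo reports the smallest and largest representable values
for int_type in (np.int8, np.int16, np.uint8):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

# float types: finfo reports the approximate decimal precision and the largest value
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(float_type.__name__, info.precision, info.max)
```

The printed values should match the corresponding rows of the table above.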

<div class="alert alert-danger">

**Sidebar: Pay attention to your data types**

When you start working with larger array objects, you will need to be more aware of how much memory your objects are consuming to avoid running into the RAM limitations of your computer. 

For example, daily gridded temperature or precipitation data are often provided as float64. This is a wildly unnecessary amount of precision for these data. You can certainly convert to float32, which uses only half the memory of float64 and typically speeds up calculations. For temperature, you can convert to float16 and use only 1/4 of the memory of float64 (float16 arithmetic is not necessarily faster, because many CPUs lack native half-precision support). For precipitation or other data that may require more than 3 significant digits or contain values greater than ~65500, the smallest suitable data type may be float32.

If you try converting your data to a smaller data type and receive an "overflow" warning, that means 1 or more of your data values cannot be properly represented with fewer bits. We'll cover an example of this later.
</div>
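
A small sketch of both points in the sidebar, using made-up temperature-like values: the same data stored at three precisions, followed by a value that is too large for float16. The variable names are hypothetical, and ```np.errstate``` is used here only to keep the overflow warning from cluttering the output.

```python
import numpy as np

# made-up temperature-like values stored at three precisions
temps64 = np.array([12.5, 18.25, 21.0, 9.75], dtype='float64')
temps32 = temps64.astype('float32')
temps16 = temps64.astype('float16')

# total bytes consumed by each array
print(temps64.nbytes, temps32.nbytes, temps16.nbytes)

# 70000 is larger than the biggest float16 value (about 65500)
too_big = np.array([70000.0], dtype='float64')
with np.errstate(over='ignore'):   # silence the overflow warning for this demo
    as_half = too_big.astype('float16')
print(as_half)
```

The float16 copy uses a quarter of the bytes of the float64 original, and the out-of-range value becomes ```inf``` rather than a usable number.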

Although not necessarily recommended, you can also specify a data type using Python's built-in types. NumPy maps each built-in type to one of its own data types. (The old aliases ```np.int``` and ```np.float``` were removed in NumPy 1.24, so pass the built-in type itself.)

Python Built-in Type | NumPy Data Type Used | Description
---|---|---
bool | np.bool_ | boolean True/False
int | np.int_ | default integer type, int64 on most platforms
float | np.float64 | double-precision float values
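
To see what happens when a built-in Python type is passed as the ```dtype``` parameter, here is a minimal sketch (assuming a reasonably recent NumPy; the array contents are arbitrary):

```python
import numpy as np

as_bool = np.array([0, 1, 2], dtype=bool)    # zero becomes False, anything else True
as_int = np.array([1.7, 2.2], dtype=int)     # decimals are truncated
as_float = np.array([1, 2], dtype=float)

print(as_bool, as_bool.dtype)
print(as_int, as_int.dtype)
print(as_float, as_float.dtype)
```

In each case the resulting array reports a NumPy data type, not the Python type you passed in.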


## Creating a NumPy Array from Scratch

We'll first create 1-D, 2-D, and 3-D NumPy arrays and later we'll learn how to read data from file into NumPy arrays. 

First, a 1-D array. Create it using the ```numpy.array()``` function and give the function a Python list of numbers as the input parameter.

In [2]:
arr_1d = np.array([1,2,3,4,5,6,7,8,9,10])

print(arr_1d.shape)
arr_1d

(10,)


array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

This should be fairly straightforward. Notice that, just as Pandas offers the ```.shape``` attribute for its data structure (the dataframe), NumPy offers a ```.shape``` attribute for its data structure (the ndarray).

Why the trailing comma in the shape tuple?

It's just part of the Python language. The ```.shape``` attribute always returns a tuple and the trailing comma indicates that it is a single-element tuple (you can't have a single-element tuple without the trailing comma). There is only 1 dimension (i.e., axis) in our array so our shape tuple will only have 1 element.
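
A quick way to convince yourself of the trailing-comma rule is to check the types directly. This is plain Python plus one small, arbitrary array:

```python
import numpy as np

not_a_tuple = (10)    # parentheses alone just group the number
one_tuple = (10,)     # the trailing comma makes a single-element tuple

print(type(not_a_tuple), type(one_tuple))

arr = np.array([1, 2, 3])
print(arr.shape, len(arr.shape))  # one axis, so one element in the shape tuple
```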

Now, we'll create a 2-D array by giving the ```numpy.array()``` function a nested Python list of numbers as the input parameter.

In [3]:
arr_2d = np.array([[1,2,3,4,5],
                   [6,7,8,9,10]])

print(arr_2d.shape)
arr_2d

(2, 5)


array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

The result is an array with 2 rows and 5 columns of data values. 

The length (5) of the innermost nested list that you provide to ```numpy.array()``` (e.g., ```[1, 2, 3, 4, 5]``` and ```[6, 7, 8, 9, 10]```) is the last (rightmost) axis in the array shape. You can think of the last axis of your arrays with 2 or more axes as your columns.

The number of innermost lists (2) is the next axis to the left (second to last axis). You can think of the second to last axis of your arrays with 2 or more axes as your rows.
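
The relationship between the nested list and the array shape can be checked with ```len()```. The sketch below rebuilds the same 2 x 5 data under hypothetical names so it does not overwrite ```arr_2d```:

```python
import numpy as np

nested = [[1, 2, 3, 4, 5],
          [6, 7, 8, 9, 10]]
arr = np.array(nested)

n_rows = len(nested)        # number of inner lists -> second to last axis
n_cols = len(nested[0])     # length of each inner list -> last axis

print((n_rows, n_cols), arr.shape)
```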

Let's try a 3-D array.

In [15]:
arr_3d = np.array([[[1, 2], 
                    [4,3], 
                    [7,4]],
                   [[2, 8], 
                    [9, 10], 
                    [7, 5]],
                   [[1, 6], 
                    [3, 11], 
                    [0, 2]]])

print(arr_3d.shape)
arr_3d

(3, 3, 2)


array([[[ 1,  2],
        [ 4,  3],
        [ 7,  4]],

       [[ 2,  8],
        [ 9, 10],
        [ 7,  5]],

       [[ 1,  6],
        [ 3, 11],
        [ 0,  2]]])

Again, the length (2) of the innermost nested list that you provide to ```numpy.array()``` is the last axis in the array shape (axis 2). The number of innermost lists inside the next nest level (3) is the second to last axis (axis 1). The number of 3x2 nested lists inside the next nest level (3) is the next dimension to the left (axis 0).

If we think about this array as having dimensions (time, latitude, longitude) then this array would contain a 3x2 gridded map of data (3 latitudes by 2 longitudes) at 3 different times. Here, longitudes are your columns (last axis), latitudes are your rows (next to last axis), and time is the stack (leftmost axis) of your 3-D data "cube".

The ```[1, 2]``` in our array represents the data values at the first time, the first latitude, and all (2) longitudes. 

Let's look at where this ```[1, 2]``` appears in the NumPy array if we visualize the 3 dimensions:

<img src="images/numpy3D_selection.png" alt="ndarray data examples" width="700"/>

## Array Attributes

Each ndarray has a number of *attributes*. We've already seen one of these above with ```.shape```. The others we will cover are ```.ndim```, ```.size```, ```.dtype```, ```.itemsize```, and ```.nbytes```. The full list of attributes can be found in the [NumPy API reference for numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

In [19]:
print(arr_3d.shape)
print(arr_3d.ndim)
print(arr_3d.size)

(3, 3, 2)
3
18


```.shape``` returns a tuple with the number of elements along each dimension.

```.ndim``` returns the number of dimensions.

```.size``` returns the total number of elements in the array. In this case, 3 * 3 * 2 = 18.

In [20]:
print(arr_3d.dtype)
print(arr_3d.itemsize)
print(arr_3d.nbytes)

int32
4
72


```.dtype``` returns the data type.

```.itemsize``` returns the bytes consumed by one single element in the array. Remember, NumPy arrays are homogeneous, so each element of an array will consume the same number of bytes.

```.nbytes``` returns the total bytes consumed by the array. This will be the number of elements in the array (```.size```) times the size in bytes of one element (```.itemsize```). (You can also infer the size of one element in bytes from your data type, e.g., int64 means one element takes up 64 bits of memory, which is equivalent to 64 / 8 = 8 bytes.)
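
The arithmetic above can be verified on any array. Here is a small sketch with an arbitrary 2 x 3 array whose data type is set explicitly to int64 so the numbers are the same on every computer:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype='int64')

print(arr.size, arr.itemsize, arr.nbytes)

# total bytes = number of elements * bytes per element
print(arr.nbytes == arr.size * arr.itemsize)

# bytes per element can also be inferred from the bit size in the type name
bits_per_value = 64
print(bits_per_value // 8 == arr.itemsize)
```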

**What is the data type of your ```arr_3d```?** Since we didn't specify a data type when we created ```arr_3d```, NumPy recognizes that the input is integer and assigns its default integer data type, which depends on the computer (and NumPy version) where you are running this notebook. On a 64-bit operating system the default integer data type is usually int64; on a 32-bit operating system it is int32. One exception is Windows, where NumPy versions before 2.0 default to int32 even on 64-bit systems. Float data behave differently: the default float data type is float64 on every platform. Today, most operating systems are 64-bit, so your ```.dtype``` output above probably says int64 and your ```.nbytes``` output probably says 144. If it says int32 and 72 instead, that's ok too.

**Using less memory:** If you are always working with small data, then having all your data stored with 64 bits per data value (likely the default on your computer) is fine, but consuming that much memory per array element is likely overkill unless your data contains super large or super small numbers. It's a good idea to get into the habit of including a data type when you create your NumPy arrays. This way you can reduce how much memory your data consumes and your code will run faster. 

Let's recreate our 1-D and 2-D arrays using data types that consume less memory.

In [17]:
# include the dtype parameter to store data using less memory
arr_1d = np.array([1,2,3,4,5,6,7,8,9,10], dtype='int8')

arr_2d = np.array([[1,2,3,4,5],
                   [6,7,8,9,10]], dtype='int16')

print(f'arr_1d dtype is {arr_1d.dtype}, arr_2d dtype is {arr_2d.dtype}')

arr_1d dtype is int8, arr_2d dtype is int16


You can also convert the data type after you've created an array with the ndarray method ```.astype()```.

In [18]:
# convert dtype of existing array
arr_3d = arr_3d.astype('int32')
arr_3d.dtype

dtype('int32')

The choice of which data type to convert each array to in this example is arbitrary. All our example arrays contain only small integers that will fit into the smallest integer data types (int8 or uint8).
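
To illustrate why the value range matters when choosing a smaller integer type, the sketch below converts two made-up arrays to int8. As far as I'm aware, integer-to-integer conversion with ```.astype()``` does not warn when values are out of range; they simply wrap around, so it is worth checking your minimum and maximum values first.

```python
import numpy as np

small = np.array([1, 2, 3, 11], dtype='int64')
small8 = small.astype('int8')        # every value fits in -128 to 127
print(small8, small.nbytes, small8.nbytes)

big = np.array([100, 200, 300], dtype='int64')
big8 = big.astype('int8')            # 200 and 300 do not fit in int8
print(big8)
```

The first conversion keeps the values intact while using one eighth of the memory. In the second, 200 and 300 come back as different numbers, which is exactly the kind of silent change you want to avoid.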

# III. Indexing and Slicing Data

## Basic Indexing and Slicing

You can index and slice NumPy arrays in the same way you can slice Python lists. You will notice the syntax looks very similar and that again, the end index of a slice is exclusive.
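
As a side-by-side comparison, here is the same slice applied to a Python list and to a 1-d array built from it (the values are arbitrary):

```python
import numpy as np

values = [10, 20, 30, 40, 50]
arr = np.array(values)

print(values[1:4])           # list slice: indexes 1, 2, 3
print(arr[1:4])              # array slice follows the same start:end rule
print(values[-1], arr[-1])   # negative indexes count from the end in both
```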

### Single Cell

To select a single cell of an array you need to provide an index value for each array axis separated by a comma.

In [24]:
print(arr_1d[-1]) # last element
print(arr_2d[1,0])
print(arr_3d[1,2,1])

10
6
5


You may not need this very often, but you can also use a tuple for indexing.

In [94]:
# select a single value with a tuple
arr_2d[(1,0)] # second row, first column

np.int16(6)

### Slice of 1 Row

A single colon is used as the index value if you want all of the values along a particular dimension.

In [61]:
# select the second row (all columns)
print(arr_2d)
print('--------------------')
print(arr_2d[1,:])

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
--------------------
[ 6  7  8  9 10]


In [62]:
# select the first time, first latitude, all longitudes
print(arr_3d)
print('--------------------')
print(arr_3d[0,0,:])

[[[ 1  2]
  [ 4  3]
  [ 7  4]]

 [[ 2  8]
  [ 9 10]
  [ 7  5]]

 [[ 1  6]
  [ 3 11]
  [ 0  2]]]
--------------------
[1 2]


### Slice of 1 Column

In [63]:
print(arr_2d)
print('--------------------')
arr_2d[:,0]

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
--------------------


array([1, 6], dtype=int16)

### Slice Along a Single Dimension

Providing a start and end index on each side of a colon works to select multiple indexes, exclusive of the end index.

In [64]:
# all rows, slice of columns
print(arr_2d)
print('--------------------')
arr_2d[:,1:3]

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
--------------------


array([[2, 3],
       [7, 8]], dtype=int16)

In [105]:
# all latitudes and longitudes at time index 0
arr_3d[0,...]

array([[1, 2],
       [4, 3],
       [7, 4]], dtype=int32)

The ellipsis above is shorthand for all values in one or more dimensions. So ```arr_3d[0,...]``` above is the same as ```arr_3d[0,:,:]```.
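
The ellipsis can stand in for leading dimensions as well as trailing ones. The sketch below uses a hypothetical 2 x 3 x 4 array built with ```np.arange``` and ```.reshape()``` so it does not depend on ```arr_3d```:

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)  # hypothetical (time, latitude, longitude) array

first_time = arr[0, ...]     # same as arr[0, :, :]
last_lon = arr[..., -1]      # same as arr[:, :, -1]

print(np.array_equal(first_time, arr[0, :, :]))
print(np.array_equal(last_lon, arr[:, :, -1]))
print(first_time.shape, last_lon.shape)
```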

### Slice Along Multiple Dimensions

Here we select the second and third time, first and second latitude, and first longitude of ```arr_3d```.


<img src="images/numpy3D_multipledims.png" alt="schematic of what to select" width="300"/>

In [66]:
# second and third time, first and second latitude, first longitude
arr_3d[1:,0:2,0]

array([[2, 9],
       [1, 3]], dtype=int32)

<div class="alert alert-info"> 

### Exercise 1: Indexing and Slicing

In your ```arr_3d``` make the following selections using indexing and slicing. Note that slicing is never jagged so your selection for part (d) will include two data values that you cannot see in the image.

<table><tr>
<td> <img src="images/numpy3D_selection1.png" alt="schematic of what to select" width="300"/> </td>
<td> <img src="images/numpy3D_selection2.png" alt="schematic of what to select" width="300"/> </td>
<td> <img src="images/numpy3D_selection3.png" alt="schematic of what to select" width="300"/> </td>
<td> <img src="images/numpy3D_selection4.png" alt="schematic of what to select" width="300"/> </td>
</tr></table>

</div>



<div class="alert alert-info"> 

#### part (a)
</div>


In [67]:
# add your code here


<div class="alert alert-info"> 

#### part (b)
</div>

In [68]:
# add your code here


<div class="alert alert-info"> 

#### part (c)
</div>

In [69]:
# add your code here


<div class="alert alert-info"> 

#### part (d)
</div>

In [70]:
# add your code here


## Advanced Indexing

NumPy arrays offer indexing capability beyond what is possible with nested Python lists. We can index an array with an array of integers or booleans.

### Advanced Indexing with Integer Arrays

First, an example with our 1-dimensional array ```arr_1d```

In [39]:
# reminder of what is in arr_1d
print(arr_1d)

# create an array of indexes
indexes = np.array([0,2,4,6])

# use the array to select data from arr_1d
arr_1d[indexes]

[ 1  2  3  4  5  6  7  8  9 10]


array([1, 3, 5, 7], dtype=int8)

You can do the same thing using a list of index values instead of an array with the following syntax, but note that it's recommended to use arrays as opposed to lists for indexing bigger data. Your code will run faster and be more memory efficient with arrays.

In [51]:
# indexing with a list
arr_1d[[0,2,4,6]]

array([1, 3, 5, 7], dtype=int8)

An example with our 2-dimensional array ```arr_2d```

In [49]:
# reminder of what is in arr_2d
print(arr_2d)

# selecting the data values 2,4,6,8
# create an array of indexes for each dimension
row_index = np.array([0,0,1,1])
col_index = np.array([1,3,0,2])

# use the arrays to select data from arr_2d
arr_2d[row_index,col_index]

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


array([2, 4, 6, 8], dtype=int16)

And the equivalent using lists:

In [52]:
# indexing with lists
arr_2d[[0,0,1,1],[1,3,0,2]]

array([2, 4, 6, 8], dtype=int16)

You can also do advanced indexing with a tuple. Here's how to save your index arrays as a tuple and use it for multi-dimensional indexing.

In [88]:
# multi-dimensional indexing with a tuple of arrays
tup_indexes = (row_index,col_index)  # 2,4,6,8 as we did above
arr_2d[tup_indexes]

array([2, 4, 6, 8], dtype=int16)

What happens if you provide array indexing for fewer dimensions than are present in your array? NumPy will assume that your array index corresponds to axis zero and that you want all of the dimensions you haven't indexed.

For example, let's take our 2-dimensional array and give it only the row_index values (```row_index = np.array([0,0,1,1])```).

In [112]:
# our 2-dimensional array
print(arr_2d)

# providing array indexing for fewer dimensions than are present in arr_2d
arr_2d[row_index]

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


array([[ 1,  2,  3,  4,  5],
       [ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [ 6,  7,  8,  9, 10]], dtype=int16)

You can see that NumPy returns two of the first row (all columns) and two of the second row (all columns) of data values.

The line of code above ```arr_2d[row_index]``` is the same as ```arr_2d[row_index,:]```.

You might not expect this to happen, so just be careful and double check yourself as you write your code.


### Advanced Indexing with Boolean Arrays

## Query

### Selecting Cells With Conditionals

The examples below select the data values in an array that fulfill certain conditions.

In [37]:
# values greater than 5
arr_3d[arr_3d > 5]

array([ 7,  8,  9, 10,  7,  6, 11], dtype=int32)
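
It can help to look at the boolean array that the condition produces before it is used as an index. The sketch below uses a small made-up array and also shows one way to combine two conditions:

```python
import numpy as np

arr = np.array([[1, 2], [7, 4], [9, 10]])

mask = arr > 5            # boolean array with the same shape as arr
print(mask)
print(arr[mask])          # 1-d array of the values where mask is True

# combine conditions with & (and) or | (or); each condition needs its own parentheses
print(arr[(arr > 5) & (arr < 10)])
```

Note that the selection is always returned as a 1-d array, regardless of the shape of the original array.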

In [58]:
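x = np.array([[1, 2], [3, 4], [5, 6]])
# a tuple index supplies one index per axis, so x[(1,2,3)] is read as x[1,2,3];
# x has only 2 axes, so this line raises the IndexError shown below
print(x[(1,2,3)])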

IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed

In [35]:
# .nonzero() returns a tuple of index arrays (one per axis) locating the nonzero values
test = arr_3d.nonzero()
test

(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
 array([0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2]),
 array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]))

In [36]:
arr_3d[test]

array([ 1,  2,  4,  3,  7,  4,  2,  8,  9, 10,  7,  5,  1,  6,  3, 11,  2],
      dtype=int32)
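
On a smaller, made-up array the output of ```.nonzero()``` is easier to read. This sketch also compares it with the equivalent boolean selection:

```python
import numpy as np

arr = np.array([[0, 3], [5, 0]])

idx = arr.nonzero()       # tuple with one index array per axis
print(idx)
print(arr[idx])           # values at those (row, column) positions

# a boolean mask selects the same values
print(np.array_equal(arr[idx], arr[arr != 0]))
```

The first array in the tuple holds the row indexes and the second holds the column indexes, so the nonzero values here sit at positions (0, 1) and (1, 0).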

# by values with conditionals

# nonzero?

# argmin, argmax

# selecting with less indexes than dims and use of ...?

# advanced indexing, indexing arrays with arrays

# where?

In [None]:
#1 Elementwise Array Math Operations
###addition, subtraction, multiplication, division,exp, sqrt
### when operating with arrays of different types the type of the resulting array corresponds to the larger-memory type (upcasting)

### broadcasting

## aggregation functions
###sum,mean,median,min,max,prod,std etc.


#2 copies and views, deep copy with .copy

#3 query
# np.unique, all, any, where, nonzero, fill

#4 np.zeros, ones, empty, arange, linspace, random.default_rng, rng.integers

#5 manipulations
##reshape, newaxis, expand_dims, vstack, hstack, hsplit, transpose, flatten,  concatenate, np.sort

#6 I/O
# formats numpy can handle
# npy format, save and load
    # use temperature data for the example, show all different files with metadata 
# loadtxt, savetxt
# using pandas to read into numpy array and write to csv


In [None]:
np.finfo(np.float64)

In [None]:
np.finfo(np.float64).precision

In [None]:
np.sctypeDict