## Introduction
An overview of data processing and the NumPy library.

When asked about Google's model for success, Peter Norvig, the director of research at Google, famously stated,

`"We don't have better algorithms than anyone else; we just have more data."`

Though probably an understatement (given the amount of talent employed at Google), the quote does provide a sense of just how vital data is to having successful outcomes.
People normally discuss the importance of data in the context of machine learning. No matter how sophisticated a machine learning model is, it will not perform well unless it has a reasonable amount of data to train on. On the other hand, given a large and diverse set of training data, a good deep learning model can often significantly outperform non-deep learning algorithms.

However, data is not just limited to machine learning. Companies use data to identify customer trends, political parties use data to determine which demographics they should target, sports teams use data to analyze players, etc.

The universal usage of data makes data processing, the act of converting raw data into a meaningful form, an essential skill to have.

## NumPy

Many scenarios involve mostly numeric datasets. For example, medical data contains many numeric metrics, such as height, weight, and blood pressure. Furthermore, the majority of neural networks use input data that is either numeric or has been converted to a numeric form.

When we deal with numeric data, the best Python library to use is NumPy. The NumPy library allows us to perform many operations on numeric data, and convert the data to more usable forms.

In [2]:
import numpy as np  # import the NumPy library

# Initializing a NumPy array
arr = np.array([-1, 2, 5], dtype=np.float32)

# Print the representation of the array
print(repr(arr))

array([-1.,  2.,  5.], dtype=float32)


### Arrays

NumPy arrays are basically just Python lists with added features. In fact, you can easily convert a Python list to a NumPy array using the `np.array` function, which takes in a Python list as its required argument. The function also has quite a few keyword arguments, but the main one to know is `dtype`. The `dtype` keyword argument takes in a **NumPy type** and manually casts the array to the specified type.

In [2]:
# The code below is an example usage of np.array to create a 2D matrix. Note that the array is manually cast to np.float32.

arr = np.array([[0,1,2],[3,4,5]], dtype=np.float32)

print(repr(arr))

array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)


When the elements of a NumPy array are of mixed types, the array's type is upcast to the highest-level type. This means that if an array input has mixed `int` and `float` elements, all the integers will be cast to their floating-point equivalents. If an array has mixed `int`, `float` and `string` elements, everything is cast to string.

In [3]:
# The code below is an example of np.array upcasting. Both integers are cast to their floating point equivalents.

arr = np.array([0, 0.1, 2])
print(repr(arr))

array([0. , 0.1, 2. ])
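
The upcasting rule can also be checked through the `dtype` property. The snippet below is a minimal sketch; the variable names `int_float` and `mixed` are illustrative and not part of the original notebook.

```python
import numpy as np

# int + float -> the integers are upcast to floating point
int_float = np.array([0, 0.1, 2])

# int + float + str -> every element is upcast to a string
mixed = np.array([1, 2.5, 'abc'])

print(int_float.dtype)   # float64
print(mixed.dtype.kind)  # U (the code NumPy uses for Unicode string types)
print(repr(mixed))
```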


### Copying
Similar to Python lists, when we make a reference to a NumPy array it doesn't create a different array.
Therefore, if we change a value using the reference variable, it changes the original array as well. We get around this by using an array's inherent `copy` function. The function has no required arguments, and it returns the copied array.

In the code example below, `c` is a reference to `a` while `d` is a copy of `b`. Therefore, changing `c` leads to the same change in `a`, while changing `d` does not change the value of `b`.

In [4]:
a = np.array([0, 1])
b = np.array([9, 8])
c = a
print('Array a: {}'.format(repr(a)))
c[0] = 5
print('Array a: {}'.format(repr(a)))

d = b.copy()
d[0] = 6
print('Array b: {}'.format(repr(b)))

Array a: array([0, 1])
Array a: array([5, 1])
Array b: array([9, 8])
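
One way to tell a reference from a copy, sketched here under the assumption that `np.shares_memory` is an acceptable check for this purpose, is to test whether two names point at the same object or the same underlying data. The names `alias` and `copied` are illustrative.

```python
import numpy as np

a = np.array([0, 1])
alias = a          # a second name for the same array object
copied = a.copy()  # an independent array with its own data

alias[0] = 5       # this change is visible through `a`
copied[1] = 7      # this change is not visible through `a`

print(alias is a)                   # True
print(np.shares_memory(a, copied))  # False
print(a.tolist())                   # [5, 1]
print(copied.tolist())              # [0, 7]
```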


### Casting
We cast NumPy arrays through their inherent `astype` function. The function's required argument is the new type for the array. It returns the array cast to the new type.

The code below shows an example of casting using the `astype` function. The `dtype` property returns the type of an array.

In [5]:
arr = np.array([0,1,2])
print(arr.dtype)
arr = arr.astype(np.float32)
print(arr.dtype)

int32
float32
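
Casting in the other direction, from floating point to integer, discards the fractional part. The sketch below illustrates this with made-up values; note that `astype` returns a new array and leaves the original untouched.

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.0])
ints = floats.astype(np.int32)  # fractional parts are dropped (truncation toward zero)

print(ints.tolist())  # [1, -1, 2]
print(ints.dtype)     # int32
print(floats.dtype)   # float64, the original array keeps its type
```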


### NaN
When we don't want a NumPy array to contain a value at a particular index, we can use `np.nan` to act as a placeholder. A common usage for `np.nan` is as a filler value for incomplete data.

The code below shows an example usage of `np.nan`. Note that `np.nan` cannot take on an integer type.

In [6]:
arr = np.array([np.nan, 1, 2])
print(repr(arr))

arr = np.array([np.nan, 'abc'])
print(repr(arr))

#Will result in a ValueError
np.array([np.nan, 1, 2], dtype=np.int32)

array([nan,  1.,  2.])
array(['nan', 'abc'], dtype='<U32')


ValueError: cannot convert float NaN to integer
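
Because `np.nan` never compares equal to anything, including itself, placeholders are usually located with `np.isnan` rather than `==`. The snippet below is a small sketch of that check; `mask` is an illustrative name.

```python
import numpy as np

arr = np.array([np.nan, 1, 2])

print(np.nan == np.nan)  # False: NaN is not equal to itself

mask = np.isnan(arr)     # element-wise test for the placeholder
print(mask.tolist())     # [True, False, False]
print(int(mask.sum()))   # 1 (number of missing entries)
```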

### Infinity
To represent infinity in NumPy, we use the `np.inf` special value. We can also represent negative infinity with `-np.inf`.

The code below shows an example usage of `np.inf`. Note that `np.inf` cannot take on an integer type.

In [7]:
print(np.inf > 1000000)

arr = np.array([np.inf, 5])
print(repr(arr))

arr = np.array([-np.inf, 1])
print(repr(arr))

#Will result in an OverflowError
np.array([np.inf, 3], dtype=np.int32)

True
array([inf,  5.])
array([-inf,   1.])


OverflowError: cannot convert float infinity to integer

### Time to code!

In [8]:
arr = np.array([np.nan,1,2,3,4,5])
arr2 = arr.copy()
arr2

array([nan,  1.,  2.,  3.,  4.,  5.])

In [9]:
arr2[0] = 10
arr2

array([10.,  1.,  2.,  3.,  4.,  5.])

In [10]:
float_arr = np.array([1, 5.4, 3])
float_arr2 = arr2.astype(np.float32)
print(repr(float_arr))
print(repr(float_arr2))

array([1. , 5.4, 3. ])
array([10.,  1.,  2.,  3.,  4.,  5.], dtype=float32)


In [11]:
#multi-dimensional array 2-D
matrix = np.array([[1,2,3],[4,5,6]], dtype=np.float32)
print(repr(matrix))

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)


### Ranged data
While `np.array` can be used to create any array, it is equivalent to hardcoding an array. This won't work when the array has hundreds of values. Instead, NumPy provides an option to create ranged data arrays using `np.arange`. The function acts very similarly to the `range` function in Python, and will always return a 1-D array.

The code below contains example usage of `np.arange`.

In [12]:
arr = np.arange(5)
print(repr(arr))

arr = np.arange(5.1)
print(repr(arr))

arr = np.arange(-1, 4)
print(repr(arr))

arr = np.arange(-1.5, 4, 2)
print(repr(arr))

array([0, 1, 2, 3, 4])
array([0., 1., 2., 3., 4., 5.])
array([-1,  0,  1,  2,  3])
array([-1.5,  0.5,  2.5])


The output of `np.arange` is specified as follows:
* If only a single number, *n*, is passed in as an argument, `np.arange` will return an array with all the integers in the range \[0,n). **Note**: the lower end is inclusive while the upper end is exclusive.
* For two arguments, m and n, `np.arange` will return an array with all the integers in the range \[m,n)
* For three arguments, m, n and s, `np.arange` will return an array with the integers in the range \[m,n) using a step size of s.
* Like `np.array`, `np.arange` performs upcasting. It also has the `dtype` keyword argument to manually cast the array.
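
The rules above can be compared directly against Python's built-in `range`. This is a minimal sketch with illustrative values and names (`from_range`, `from_arange`, `quarters`, `as_float`).

```python
import numpy as np

# Same (start, stop, step) semantics as range for integer arguments
from_range = list(range(-1, 10, 3))
from_arange = np.arange(-1, 10, 3)
print(from_range)            # [-1, 2, 5, 8]
print(from_arange.tolist())  # [-1, 2, 5, 8]

# Unlike range, np.arange accepts floats; a float argument upcasts the result
quarters = np.arange(0, 1, 0.25)
print(quarters.tolist())     # [0.0, 0.25, 0.5, 0.75]

# The dtype keyword argument casts the result manually
as_float = np.arange(3, dtype=np.float32)
print(as_float.dtype)        # float32
```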

To specify the number of elements in the returned array, rather than the step size, we can use the `np.linspace` function.

This function takes in two required arguments, for the start and end of the range, respectively. The end of the range is inclusive for `np.linspace`, unless the keyword argument `endpoint` is set to `False`. To specify the number of elements, we set the `num` keyword argument (its default value is `50`).

The code below shows example usage of `np.linspace`. It also takes in the `dtype` keyword argument for manual casting.

In [13]:
arr = np.linspace(5, 11, num=4)
print(repr(arr))

arr = np.linspace(5, 11, num=4, endpoint=False)
print(repr(arr))

arr = np.linspace(5, 11, num=4, dtype=np.int32)
print(repr(arr))

array([ 5.,  7.,  9., 11.])
array([5. , 6.5, 8. , 9.5])
array([ 5,  7,  9, 11])
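
The spacing between elements follows from the range and `num`: with the endpoint included it is `(stop - start) / (num - 1)`, and with `endpoint=False` it is `(stop - start) / num`. The sketch below checks this using the `retstep` keyword argument of `np.linspace`, which is not covered in the text above and makes the function return the step alongside the array.

```python
import numpy as np

start, stop, num = 5, 11, 4

# retstep=True returns a (samples, step) pair
closed, step_closed = np.linspace(start, stop, num=num, retstep=True)
half_open, step_open = np.linspace(start, stop, num=num,
                                   endpoint=False, retstep=True)

print(step_closed)  # 2.0, i.e. (11 - 5) / (4 - 1)
print(step_open)    # 1.5, i.e. (11 - 5) / 4
```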


### Reshaping data
The function we use to reshape data in NumPy is `np.reshape`. It takes in an array and a new shape as required arguments. The new shape must exactly contain all the elements from the input array. For example, we could reshape an array with 12 elements to `(4,3)`, but we can't reshape it to `(4,4)`.

We are allowed to use the special value of -1 in at most one dimension of the new shape. The dimension with -1 will take on the value necessary to allow the new shape to contain all the elements of the array.

The code below shows example usages of `np.reshape`.

In [14]:
arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [15]:
reshaped_arr = np.reshape(arr, (2, 4))
print(repr(reshaped_arr))
print('New shape: {}'.format(reshaped_arr.shape))
print()

reshaped_arr = np.reshape(arr, (-1, 2, 2))
print(repr(reshaped_arr))
print('New shape: {}'.format(reshaped_arr.shape))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
New shape: (2, 4)

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
New shape: (2, 2, 2)
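
The 12-element example from the text can be sketched as follows. The `try`/`except` wrapper and the names `inferred` and `raised` are illustrative additions.

```python
import numpy as np

arr = np.arange(12)

print(np.reshape(arr, (4, 3)).shape)  # (4, 3)

inferred = np.reshape(arr, (-1, 6))   # -1 is inferred as 12 / 6 = 2
print(inferred.shape)                 # (2, 6)

raised = False
try:
    np.reshape(arr, (4, 4))           # 16 slots cannot hold exactly 12 elements
except ValueError as err:
    raised = True
    print('ValueError:', err)
```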


While the `np.reshape` function can perform any reshaping utilities we need, NumPy provides an inherent function for flattening an array. Flattening an array reshapes it into a 1D array. Since we need to flatten data quite often, it is a useful function.

The code below flattens an array using the inherent `flatten` function.

In [16]:
arr = np.arange(8)
arr = np.reshape(arr, (2, 4))
flattened = arr.flatten()
print(repr(arr))
print('arr shape: {}'.format(arr.shape))
print(repr(flattened))
print('flattened shape: {}'.format(flattened.shape))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
arr shape: (2, 4)
array([0, 1, 2, 3, 4, 5, 6, 7])
flattened shape: (8,)
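
The `flatten` function returns a copy, so changes to the flattened result do not propagate back to the original array. A minimal sketch of that behavior:

```python
import numpy as np

arr = np.reshape(np.arange(8), (2, 4))
flattened = arr.flatten()

flattened[0] = 100   # modify the flattened result only

print(arr[0][0])     # 0, the original array is untouched
print(flattened[0])  # 100
```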


### Transposing
Similar to how it is common to reshape data, it is also common to transpose data. Perhaps we have data that's supposed to be in a particular format, but some new data we get is rearranged. We can just transpose the data, using the `np.transpose` function, to convert it to the proper format.

The code below shows an example usage of the `np.transpose` function. The matrix rows become columns after the transpose.

In [17]:
arr = np.arange(8)
arr = np.reshape(arr, (4,2))
transposed = np.transpose(arr)
print(repr(arr))
print('arr shape: {}'.format(arr.shape))
print(repr(transposed))
print('transposed shape: {}'.format(transposed.shape))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
arr shape: (4, 2)
array([[0, 2, 4, 6],
       [1, 3, 5, 7]])
transposed shape: (2, 4)


The function takes in a required first argument, which will be the array we want to transpose. It also has a single keyword argument called `axes`, which represents the new *permutation* of the dimensions.

The permutation is a tuple/list of integers, with the same length as the number of dimensions in the array. It tells us how to rearrange the dimensions. Because dimensions are numbered from 0, a permutation with 2 at index 1 means that the old third dimension of the data becomes the new second dimension (index 1 represents the second dimension).

The code below shows an example usage of the `np.transpose` function with the `axes` keyword argument.
The `shape` property gives us the shape of an array.

In [18]:
arr = np.arange(24)
arr = np.reshape(arr, (3, 4, 2))
transposed = np.transpose(arr, axes=(1, 2, 0))
print('arr shape: {}'.format(arr.shape))
print('transposed shape: {}'.format(transposed.shape))

arr shape: (3, 4, 2)
transposed shape: (4, 2, 3)


In this example, the old first dimension became the new third dimension, the old second dimension became the first dimension, and the old third dimension became the new second dimension. The default value for `axes` is a dimension reversal.
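
The mapping between old and new dimensions can be verified directly. In the sketch below, `expected_shape` is an illustrative name; the element check relies on the fact that position `[i, j, k]` in the transposed array corresponds to position `[k, i, j]` in the original when `axes=(1, 2, 0)`.

```python
import numpy as np

arr = np.reshape(np.arange(24), (3, 4, 2))
axes = (1, 2, 0)
transposed = np.transpose(arr, axes=axes)

# New dimension i takes its size from old dimension axes[i]
expected_shape = tuple(arr.shape[old] for old in axes)
print(expected_shape)                        # (4, 2, 3)

# The element at transposed[i, j, k] comes from arr[k, i, j]
print(transposed[3, 1, 2] == arr[2, 3, 1])   # True

# With no axes argument, the dimensions are reversed
print(np.transpose(arr).shape)               # (2, 4, 3)
```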

### Zeros and ones
Sometimes, we need to create arrays filled solely with 0 or 1. For example, since binary data is labeled with 0 and 1, we may need to create dummy datasets of strictly one label. For creating these arrays, NumPy provides the functions `np.zeros` and `np.ones`. They both take in the same arguments, which includes just one required argument, the array shape. The functions also allow for manual casting using the `dtype` keyword argument.

The code below shows example usages of `np.zeros` and `np.ones`.

In [19]:
arr = np.zeros(4)
print(repr(arr))

arr = np.ones((2,3))
print(repr(arr))

arr = np.ones((2,3), dtype=np.int32)
print(repr(arr))

array([0., 0., 0., 0.])
array([[1., 1., 1.],
       [1., 1., 1.]])
array([[1, 1, 1],
       [1, 1, 1]])


If we want to create an array of 0's or 1's with the same shape as another array, we can use `np.zeros_like` and `np.ones_like`.

The code below shows example usage of `np.zeros_like` and `np.ones_like`.

In [20]:
arr = np.array([[1,2],[3,4]])
print(repr(np.zeros_like(arr)))

arr = np.array([[0.,1.],[1.2,4.]])
print(repr(np.ones_like(arr)))
print(repr(np.ones_like(arr, dtype=np.int32)))

array([[0, 0],
       [0, 0]])
array([[1., 1.],
       [1., 1.]])
array([[1, 1],
       [1, 1]])


### Time to code!
Our initial array will just be all the integers from 0 to 11, inclusive. We'll also reshape it so it has three dimensions.

First, set arr equal to np.arange with 12 as the only argument.

Then, set reshaped equal to np.reshape with arr as the first argument and (2, 3, 2) as the second argument.

In [21]:
arr = np.arange(12)
reshaped = np.reshape(arr, (2,3,2))
print(repr(arr))
print(repr(reshaped))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])


Next we want to get a flattened version of the reshaped array (the flattened version is equivalent to arr), as well as a transposed version. For the transposed version of reshaped, we use a permutation of (1, 2, 0).

Set flattened equal to reshaped.flatten().

Then set transposed equal to np.transpose with reshaped as the first argument and the specified permutation for the axes keyword argument.

In [22]:
flattened = reshaped.flatten()
print(repr(flattened))

transposed = np.transpose(reshaped, axes = (1,2,0))
print(repr(transposed))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
array([[[ 0,  6],
        [ 1,  7]],

       [[ 2,  8],
        [ 3,  9]],

       [[ 4, 10],
        [ 5, 11]]])


We'll create an array of 5 elements, all of which are 0. We'll also create an array with the same shape as transposed, but containing only 1 as the elements.

Set zeros_arr equal to np.zeros with 5 as the lone argument.

Then set ones_arr equal to np.ones_like with transposed as the lone argument.

In [23]:
zeros_arr = np.zeros(5)
print(repr(zeros_arr))

ones_arr = np.ones_like(transposed)
print(repr(ones_arr))

array([0., 0., 0., 0., 0.])
array([[[1, 1],
        [1, 1]],

       [[1, 1],
        [1, 1]],

       [[1, 1],
        [1, 1]]])


The final array will contain 101 evenly spaced numbers between -3.5 and 1.5, inclusive. Since they are evenly spaced, the difference between adjacent numbers is 0.05.

Set points equal to np.linspace with -3.5 and 1.5 as the first two arguments, respectively, as well as 101 for the num keyword argument.

In [24]:
points = np.linspace(-3.5, 1.5, num=101)
print(repr(points))

array([-3.5 , -3.45, -3.4 , -3.35, -3.3 , -3.25, -3.2 , -3.15, -3.1 ,
       -3.05, -3.  , -2.95, -2.9 , -2.85, -2.8 , -2.75, -2.7 , -2.65,
       -2.6 , -2.55, -2.5 , -2.45, -2.4 , -2.35, -2.3 , -2.25, -2.2 ,
       -2.15, -2.1 , -2.05, -2.  , -1.95, -1.9 , -1.85, -1.8 , -1.75,
       -1.7 , -1.65, -1.6 , -1.55, -1.5 , -1.45, -1.4 , -1.35, -1.3 ,
       -1.25, -1.2 , -1.15, -1.1 , -1.05, -1.  , -0.95, -0.9 , -0.85,
       -0.8 , -0.75, -0.7 , -0.65, -0.6 , -0.55, -0.5 , -0.45, -0.4 ,
       -0.35, -0.3 , -0.25, -0.2 , -0.15, -0.1 , -0.05,  0.  ,  0.05,
        0.1 ,  0.15,  0.2 ,  0.25,  0.3 ,  0.35,  0.4 ,  0.45,  0.5 ,
        0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,  0.95,
        1.  ,  1.05,  1.1 ,  1.15,  1.2 ,  1.25,  1.3 ,  1.35,  1.4 ,
        1.45,  1.5 ])


## Math
Understand how arithmetic and linear algebra work in NumPy

### Arithmetic
One of the main purposes of NumPy is to perform multi-dimensional arithmetic. Using NumPy arrays, we can apply arithmetic to each element with a single operation.

The code below shows multi-dimensional arithmetic with NumPy

In [25]:
arr = np.array([[1,2],[3,4]])

#Add 1 to element values
print(repr(arr+1))

array([[2, 3],
       [4, 5]])


In [26]:
# Subtract 1.2 from element values
print(repr(arr-1.2))

array([[-0.2,  0.8],
       [ 1.8,  2.8]])


In [27]:
# Double element values
print(repr(arr*2))

array([[2, 4],
       [6, 8]])


In [28]:
# Halve element values
print(repr(arr/2))

array([[0.5, 1. ],
       [1.5, 2. ]])


In [29]:
# Integer division (half)
print(repr(arr//2))

array([[0, 1],
       [1, 2]], dtype=int32)


In [30]:
# Square element values
print(repr(arr**2))

array([[ 1,  4],
       [ 9, 16]], dtype=int32)


In [31]:
# Square root element values
print(repr(arr**0.5))

array([[1.        , 1.41421356],
       [1.73205081, 2.        ]])


Using NumPy arithmetic, we can easily modify large amounts of numeric data with only a few operations. For example, we could convert a dataset of Fahrenheit temperatures to their equivalent Celsius form.

The code below converts Fahrenheit to Celsius in NumPy.

In [32]:
def f2c(temps):
    return (5/9)*(temps-32)

fahrenheits = np.array([32, -4, 14, -40])
celsius = f2c(fahrenheits)
print('Celsius {}'.format(repr(celsius)))

Celsius array([  0., -20., -10., -40.])


It is important to note that performing arithmetic on NumPy arrays **does not change the original array** and instead produces a new array that is the result of the arithmetic operation.
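
A minimal sketch of this behavior, with `result` as an illustrative name: the original array is unchanged after the operation, and the result must be assigned to a variable if we want to keep it.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
result = arr + 1        # a new array holding the result

print(arr.tolist())     # [[1, 2], [3, 4]], unchanged
print(result.tolist())  # [[2, 3], [4, 5]]
print(result is arr)    # False
```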

### Non-linear functions
Apart from basic arithmetic operations, NumPy also allows you to use non-linear functions such as exponentials and logarithms.

The function `np.exp` performs a base *e* exponential on an array, while the function `np.exp2` performs a base 2 exponential. Likewise, `np.log`, `np.log2` and `np.log10` all perform logarithms on an input array, using base *e*, base 2, and base 10, respectively.

The code below shows various exponentials and logarithms with NumPy. Note that `np.e` and `np.pi` represent the mathematical constants *e* and *π*, respectively.

In [33]:
arr = np.array([[1,2],[3,4]])
# e raised to the power of each element
print(repr(np.exp(arr)))

array([[ 2.71828183,  7.3890561 ],
       [20.08553692, 54.59815003]])


In [34]:
# 2 raised to the power of each element
print(repr(np.exp2(arr)))

array([[ 2.,  4.],
       [ 8., 16.]])


In [35]:
arr2 = np.array([[1,10],[np.e, np.pi]])
print(repr(arr2))

array([[ 1.        , 10.        ],
       [ 2.71828183,  3.14159265]])


In [36]:
# Natural logarithm
print(repr(np.log(arr2)))

array([[0.        , 2.30258509],
       [1.        , 1.14472989]])


In [37]:
# Base 10 logarithm
print(repr(np.log10(arr2)))

array([[0.        , 1.        ],
       [0.43429448, 0.49714987]])
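
Since each logarithm is the inverse of the exponential with the same base, applying one after the other should recover the input up to floating-point rounding. The sketch below checks this with `np.allclose`, a helper not introduced in the text above that compares arrays within a small tolerance.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

natural_ok = np.allclose(np.log(np.exp(arr)), arr)
base2_ok = np.allclose(np.log2(np.exp2(arr)), arr)

# Change of base: log10(x) equals ln(x) / ln(10)
change_of_base_ok = np.allclose(np.log10(arr), np.log(arr) / np.log(10))

print(natural_ok, base2_ok, change_of_base_ok)  # True True True
```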


To do a regular power operation with any base, we use `np.power`. The first argument to the function is the base, while the second is the power. If the base or power is an array rather than a single number, the operation is applied to every element in the array.

The code below shows examples of using `np.power`.

In [38]:
arr = np.array([[1,2],[3,4]])

# Raise 3 to power of each number in arr
print(repr(np.power(3, arr)))

array([[ 3,  9],
       [27, 81]], dtype=int32)


In [39]:
arr2 = np.array([[10.2, 4],[3,5]])
# Raise arr2 to power of each number in arr
print(repr(np.power(arr2, arr)))

array([[ 10.2,  16. ],
       [ 27. , 625. ]])


### Matrix multiplication
Since NumPy arrays are basically vectors and matrices, it makes sense that there are functions for dot products and matrix multiplication. Specifically, the main function to use is `np.matmul`, which takes two vector/matrix arrays as input and produces a dot product or matrix multiplication.

The code below shows various examples of matrix multiplication. When both inputs are 1-D, the output is the dot product.

Note that the dimensions of the two input matrices must be valid for a matrix multiplication. Specifically, the second dimension of the first matrix must equal the first dimension of the second matrix; otherwise `np.matmul` will result in a `ValueError`.

In [40]:
arr1 = np.array([1,2,3])
arr2 = np.array([-3,0,10])
print(np.matmul(arr1, arr2))

27


In [41]:
arr3 = np.array([[1,2],[3,4],[5,6]])
arr4 = np.array([[-1,0,1],[3,2,-4]])
print(repr(np.matmul(arr3,arr4)))

array([[  5,   4,  -7],
       [  9,   8, -13],
       [ 13,  12, -19]])


In [42]:
print(repr(np.matmul(arr4, arr3)))

array([[  4,   4],
       [-11, -10]])


In [43]:
# This will result in ValueError
print(repr(np.matmul(arr3,arr3)))

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 2)
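
The shape rule can be sketched with arrays of ones, where only the shapes matter. The arrays `a` and `b` are illustrative; the `@` operator shown here is Python's matrix multiplication operator, which performs the same operation as `np.matmul` on NumPy arrays.

```python
import numpy as np

a = np.ones((3, 2))
b = np.ones((2, 4))

# Inner dimensions match (2 == 2), so the result has shape (3, 4)
product = np.matmul(a, b)
print(product.shape)                # (3, 4)

# The @ operator gives the same result
print(np.array_equal(a @ b, product))  # True

mismatch = False
try:
    np.matmul(a, a)                 # (3, 2) x (3, 2): inner dimensions 2 and 3 differ
except ValueError:
    mismatch = True
    print('shapes (3, 2) and (3, 2) are not compatible')
```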

### Time to code!
We will create a couple of matrix arrays to perform our math operations on.

In [44]:
arr1 = np.array([[-0.5,0.8,-0.1],[0.0,-1.2,1.3]])
arr2 = np.array([[1.2,3.1],[1.2,0.3],[1.5,2.2]])

print(arr1.shape)
print(arr2.shape)

(2, 3)
(3, 2)


Next we'll apply some arithmetic to arr1. Specifically, we'll do multiplication, addition, and squaring.

Set multiplied equal to arr1 multiplied by np.pi.

Then set added equal to the result of adding arr1 and multiplied.

Finally, set squared equal to added with each of its elements squared.

In [45]:
multiplied = arr1*np.pi
print(repr(multiplied))

array([[-1.57079633,  2.51327412, -0.31415927],
       [ 0.        , -3.76991118,  4.08407045]])


In [46]:
added = arr1 + multiplied
print(repr(added))

array([[-2.07079633,  3.31327412, -0.41415927],
       [ 0.        , -4.96991118,  5.38407045]])


In [47]:
squared = added**2
print(repr(squared))

array([[ 4.28819743, 10.97778541,  0.1715279 ],
       [ 0.        , 24.70001718, 28.98821461]])


After the arithmetic operations, we'll apply the base e exponential and logarithm to our array matrices.

Set exponential equal to np.exp applied to squared.

Then set logged equal to np.log applied to arr2.

In [48]:
exponential = np.exp(squared)
print(repr(exponential))

array([[7.28350596e+01, 5.85587272e+04, 1.18711726e+00],
       [1.00000000e+00, 5.33434578e+10, 3.88527393e+12]])


In [49]:
logged = np.log(arr2)
print(repr(logged))

array([[ 0.18232156,  1.13140211],
       [ 0.18232156, -1.2039728 ],
       [ 0.40546511,  0.78845736]])


Note that exponential has shape (2, 3) and logged has shape (3, 2). So we can perform matrix multiplication both ways.

Set matmul1 equal to np.matmul with first argument logged and second argument exponential. Note that matmul1 will have shape (3, 3).

Then set matmul2 equal to np.matmul with first argument exponential and second argument logged. Note that matmul2 will have shape (2, 2).

In [50]:
matmul1 = np.matmul(logged,exponential)
print(repr(matmul1))

array([[ 1.44108036e+01,  6.03529115e+10,  4.39580713e+12],
       [ 1.20754286e+01, -6.42240618e+10, -4.67776415e+12],
       [ 3.03205327e+01,  4.20590657e+10,  3.06337283e+12]])


In [51]:
matmul2 = np.matmul(exponential, logged)
print(repr(matmul2))

array([[ 1.06902790e+04, -7.04197733e+04],
       [ 1.58506868e+12,  2.99914875e+12]])


## Random
Generate numbers and arrays from different random distributions.

### Random integers
Similar to the Python `random` module, NumPy has its own submodule for pseudo-random number generation called `np.random`. It provides all the necessary randomized operations and extends it to multi-dimensional arrays. To generate pseudo-random integers, we use the `np.random.randint` function.

The code below shows example usages of `np.random.randint`

In [52]:
print(np.random.randint(5))
print(np.random.randint(5))
print(np.random.randint(5, high=6))

random_arr = np.random.randint(-3, high=14, size=(2,2))

print(repr(random_arr))

2
1
5
array([[ 5,  2],
       [ 2, -2]])


The `np.random.randint` function takes in a single required argument, whose meaning depends on the `high` keyword argument. If `high=None` (which is the default value), then the required argument represents the upper (exclusive) end of the range, with the lower end being 0. Specifically, if the required argument is *n*, then the random integer is chosen uniformly from the range \[0,n).

If `high` is not `None`, then the required argument will represent the lower (inclusive) end of the range, with `high` representing the upper (exclusive) end.

The `size` keyword argument specifies the size of the output array, where each integer in the array is randomly drawn from the specified range. As a default, `np.random.randint` returns a single integer.
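
The inclusive lower end and exclusive upper end can be checked empirically by drawing many samples. This is a sketch; the sample count of 1000 is arbitrary, and the individual values drawn will differ from run to run.

```python
import numpy as np

samples = np.random.randint(-3, high=14, size=1000)

print(samples.shape)         # (1000,)
print(samples.min() >= -3)   # True: the lower end is inclusive
print(samples.max() < 14)    # True: the upper end is exclusive

# With the range [5, 6) there is only one possible value
only_five = np.random.randint(5, high=6)
print(only_five)             # 5
```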

### Utility functions
Some fundamental utility functions from the `np.random` module are `np.random.seed` and `np.random.shuffle`. We use the `np.random.seed` function to set the *random seed*, which allows us to control the output of the pseudo-random functions. The function takes in a single integer as an argument, representing the random seed.

The code below uses `np.random.seed` with the same random seed. Note how the output of the random functions in each subsequent run is identical when we set the same random seed.

In [53]:
np.random.seed(1)
print(np.random.randint(10))

5


In [54]:
random_arr = np.random.randint(3, high=100, size=(2,2))
print(repr(random_arr))

array([[15, 75],
       [12, 78]])


In [55]:
# New seed
np.random.seed(2)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100, size=(2,2))
print(repr(random_arr))

8
array([[18, 75],
       [25, 46]])


In [56]:
# Original seed
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100, size=(2,2))
print(repr(random_arr))

5
array([[15, 75],
       [12, 78]])
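
The same reproducibility check can be written as a comparison instead of an inspection by eye. In this sketch, `draw` is a hypothetical helper that reseeds the generator before sampling.

```python
import numpy as np

def draw(seed):
    """Hypothetical helper: reseed, then draw a 2x2 array of integers."""
    np.random.seed(seed)
    return np.random.randint(3, high=100, size=(2, 2))

first = draw(1)
second = draw(1)

# The same seed reproduces the same pseudo-random output
print(np.array_equal(first, second))  # True
```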


The `np.random.shuffle` function allows us to randomly shuffle an array. Note that the shuffle happens in place (i.e. no return value), and shuffling multi-dimensional arrays only shuffles the first dimension.

The code below shows example usages of `np.random.shuffle`. Note that only the rows of `matrix` are shuffled (i.e. shuffling along first dimension only).

In [57]:
vec = np.array([1,2,3,4,5])
np.random.shuffle(vec)
print(repr(vec))

array([3, 4, 2, 5, 1])


In [58]:
np.random.shuffle(vec)
print(repr(vec))

array([5, 3, 4, 2, 1])


In [59]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
np.random.shuffle(matrix)
print(repr(matrix))

array([[4, 5, 6],
       [7, 8, 9],
       [1, 2, 3]])
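
Both properties of `np.random.shuffle` mentioned above can be checked in a short sketch: the function returns `None`, and each row of a 2-D array stays intact while only the row order changes. The names `result` and `rows` are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
result = np.random.shuffle(matrix)  # shuffles in place

print(result)                       # None

# Sorting the rows recovers the original order, so no row was altered
rows = sorted(matrix.tolist())
print(rows)                         # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```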


### Distributions
Using `np.random` we can also draw samples from probability distributions. For example, we can use `np.random.uniform` to draw pseudo-random real numbers from a uniform distribution.

The code below shows usages of `np.random.uniform`.

In [60]:
print(np.random.uniform())

0.3132735169322751


In [61]:
print(np.random.uniform(low=-1.5, high=2.2))

0.4408281904196243


In [62]:
print(repr(np.random.uniform(size=3)))

array([0.44345289, 0.22957721, 0.53441391])


In [63]:
print(repr(np.random.uniform(low=-3.4, high=5.9, size=(2,2))))

array([[5.09984683, 0.85200471],
       [0.60549667, 5.33388844]])


The function `np.random.uniform` actually has no required arguments. The keyword arguments, `low` and `high`, represent the inclusive lower end and exclusive upper end from which to draw random samples. Since they have default values of 0.0 and 1.0, respectively, the default outputs of `np.random.uniform` come from the range \[0.0,1.0).

The `size` keyword argument is the same as the one for `np.random.randint`, i.e. it represents the output size of the array.
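
As a quick sanity check of the range, the sketch below draws a batch of samples and confirms that they all fall within \[low, high). The sample count is arbitrary and the values themselves vary between runs.

```python
import numpy as np

low, high = -1.5, 2.2
samples = np.random.uniform(low=low, high=high, size=1000)

print(samples.shape)          # (1000,)
print(samples.min() >= low)   # True
print(samples.max() < high)   # True
```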

Another popular distribution we can sample from is the normal (Gaussian) distribution. The function we use is `np.random.normal`.

The code below shows usages of `np.random.normal`.

In [64]:
print(np.random.normal())
print(np.random.normal(loc=1.5, scale=3.5))
print(repr(np.random.normal(loc=-2.4, scale=4.0, size=(2,2))))

0.7252740646272712
4.772112039383628
array([[ 2.07318791, -2.17754724],
       [-0.89337346, -0.89545991]])


Like `np.random.uniform`, `np.random.normal` has no required arguments. The `loc` and `scale` keyword arguments represent the mean and standard deviation, respectively, of the normal distribution we sample from.
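
With a large enough sample, the sample mean and standard deviation should land close to `loc` and `scale`. The sketch below uses the array methods `mean` and `std`, which are not introduced in the text above, and an arbitrary seed and sample size; the printed values are approximate, not exact.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable
samples = np.random.normal(loc=1.5, scale=3.5, size=100000)

sample_mean = samples.mean()
sample_std = samples.std()

print(sample_mean)  # close to 1.5
print(sample_std)   # close to 3.5
```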

### Custom sampling
While NumPy provides built-in distributions to sample from, we can also sample from a custom distribution with the `np.random.choice` function.

The code below shows example usages of `np.random.choice`.

In [65]:
colors = ['red', 'blue', 'green']
print(np.random.choice(colors))
print(repr(np.random.choice(colors, size=2)))
print(repr(np.random.choice(colors, size=2, p=[0.8, 0.19, 0.01])))

green
array(['blue', 'red'], dtype='<U5')
array(['red', 'red'], dtype='<U5')


The required argument for `np.random.choice` is the custom distribution we sample from. The `p` keyword argument denotes the probabilities given to each element in the input distribution. Note that the list of probabilities for `p` must sum to 1.

In the example, we set `p` such that `red` has a probability of 0.8 of being chosen, `blue` has a probability of 0.19, and `green` has a probability of 0.01. When `p` is not set, the probabilities are equal for each element in the distribution (and sum to 1).
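
Drawing many samples shows the effect of `p`: the observed frequency of each color should approach its assigned probability. This is a sketch with an arbitrary seed and sample size, so the frequencies are approximate.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable
colors = ['red', 'blue', 'green']
draws = np.random.choice(colors, size=10000, p=[0.8, 0.19, 0.01])

# Fraction of draws equal to each color
red_freq = (draws == 'red').mean()
blue_freq = (draws == 'blue').mean()
green_freq = (draws == 'green').mean()

print(red_freq, blue_freq, green_freq)  # roughly 0.8, 0.19, 0.01
```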

### Time to code!
Note: it is important you do all the instructions in the order listed. We test your code by setting a fixed np.random.seed, so in order for your code to match the reference output, all the functions must be run in the correct order.

We'll start off by obtaining some random integers. The first integer we get will be randomly chosen from the range \[0, 5). The remaining integers will be part of a 3x5 NumPy array, each randomly chosen from the range \[3, 10).

Set random1 equal to np.random.randint with 5 as the only argument.

Then set random_arr equal to np.random.randint with 3 as the first argument, 10 as the high keyword argument, and (3, 5) as the size keyword argument.

In [66]:
random1 = np.random.randint(5)

In [67]:
random_arr = np.random.randint(3, high=10, size=(3,5))
print(repr(random_arr))

array([[4, 6, 7, 3, 4],
       [6, 7, 5, 7, 3],
       [8, 6, 4, 5, 3]])


The next two arrays will be drawn randomly from distributions. The first will contain 5 numbers drawn uniformly from the range \[-2.5, 1.5).

Set random_uniform equal to np.random.uniform with the low and high keyword arguments set to -2.5 and 1.5, respectively. The size keyword argument should be set to 5.

In [68]:
random_uniform = np.random.uniform(low=-2.5, high=1.5, size=5)
print(repr(random_uniform))

array([ 0.49266262, -1.37822403,  0.65711731, -2.08709597, -0.7084259 ])


The second array will contain 50 numbers drawn from a normal distribution with mean 2.0 and standard deviation 3.5.

Set random_norm equal to np.random.normal with the loc and scale keyword arguments set to 2.0 and 3.5, respectively. The size keyword argument should be set to (10, 5).

In [69]:
random_norm = np.random.normal(loc=2.0, scale=3.5, size=(10,5))
print(repr(random_norm))

array([[ 1.06239172,  3.85624413, -0.42081263,  0.61136266, -0.40510445],
       [-0.95821975, -0.34936146,  1.9556739 , -1.91058622,  2.82045494],
       [ 7.80930762,  4.59715456,  1.32857557, -1.10670137, -0.61505403],
       [ 7.9235911 ,  2.17782714, -0.22948476,  2.6682042 ,  9.35089298],
       [ 2.42055633,  4.16021088,  3.05059612,  0.76712554, -1.99881369],
       [ 0.77730047,  1.26887018,  4.05318117,  4.93644195,  5.25885728],
       [ 2.99955564,  5.09799407, -0.64039279,  6.38503854,  3.79525437],
       [ 0.95667508,  3.70981351,  1.735499  ,  5.96070286,  7.31935886],
       [ 9.64951392, -2.88773717, -3.05439832,  0.23436948,  2.56012974],
       [ 5.06659122,  3.10472232, -5.07770426,  0.92828596,  4.89791125]])


We'll now create our own distribution of strings and randomly select from it. The values for our distribution will be 'a', 'b', 'c', 'd'.

To choose a value, we'll use a probability distribution of [0.5, 0.1, 0.2, 0.2], i.e. 'a' will have probability 0.5 of being chosen, 'b' will have a probability of 0.1, etc.

Set choices equal to a list of the specified values, in the order given.

Set choice equal to np.random.choice with choices as the first argument and the specified probability distribution as the p keyword argument.

In [70]:
choices = ['a','b','c','d']
print(repr(choices))
choice = np.random.choice(choices, p=[0.5, 0.1, 0.2, 0.2])
print(repr(choice))

['a', 'b', 'c', 'd']
'd'
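
A single draw says little about the probabilities we passed in. As a rough illustration, the sketch below uses the `size` keyword argument of `np.random.choice` to draw many values at once and compares the observed frequencies with the distribution. The seed, the sample size, and the names `probs`, `draws`, and `freqs` are assumptions made for this example.

```python
import numpy as np

np.random.seed(2)  # arbitrary seed so the run is repeatable

choices = ['a', 'b', 'c', 'd']
probs = [0.5, 0.1, 0.2, 0.2]

# Draw 10,000 values at once instead of a single one
draws = np.random.choice(choices, size=10000, p=probs)

# Count how often each value was drawn
values, counts = np.unique(draws, return_counts=True)
freqs = counts / draws.size
for value, freq in zip(values, freqs):
    print(value, round(freq, 2))
```

The printed frequencies should land close to 0.5, 0.1, 0.2, and 0.2, though the exact values depend on the seed.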


The last random operation we perform will be an in-place shuffle of a NumPy array.

Set arr equal to a NumPy array containing the integers from 1 to 5, inclusive.

Then apply np.random.shuffle to arr.

In [71]:
arr = np.array([1,2,3,4,5])
np.random.shuffle(arr)
print(repr(arr))

array([4, 2, 1, 3, 5])
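
For multi-dimensional arrays, `np.random.shuffle` only reorders along the first dimension. The sketch below, using a hypothetical 3x2 `matrix`, shows that the rows change order while the contents of each row stay together, and that the function returns `None` because it works in place.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed so the run is repeatable

matrix = np.array([[1, 2], [3, 4], [5, 6]])
original_rows = {tuple(row) for row in matrix}

# Shuffles in place; the return value is None
result = np.random.shuffle(matrix)
shuffled_rows = {tuple(row) for row in matrix}

print(result)
print(shuffled_rows == original_rows)  # same rows, possibly new order
```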


## Indexing
Index into NumPy arrays to extract data and array slices.

### Array accessing
Accessing NumPy arrays is identical to accessing Python lists. For multi-dimensional arrays, it is equivalent to accessing Python lists of lists.

The code below shows examples of accessing NumPy arrays.

In [72]:
arr = np.array([1,2,3,4,5])
print(arr[0])
print(arr[4])

arr = np.array([[6,3],[0,2]])
# Subarray
print(repr(arr[0]))

1
5
array([6, 3])
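
To reach a single element of a multi-dimensional array, we can chain brackets as with lists of lists, or put all the indices in one pair of brackets separated by commas. The short sketch below compares the two; the names `chained`, `combined`, and `last` are just illustrative.

```python
import numpy as np

arr = np.array([[6, 3], [0, 2]])

chained = arr[1][0]   # list-of-lists style: select row 1, then element 0
combined = arr[1, 0]  # NumPy style: row and column in one index
print(chained, combined)

# Negative indices count from the end, as with Python lists
last = arr[-1, -1]
print(last)
```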


### Slicing
NumPy arrays also support slicing. As with Python lists, we use the colon operator for slicing. We can also use negative indices, which count backwards from the end of the array.

The code below shows example slices of a 1-D NumPy array.

In [73]:
arr = np.array([1,2,3,4,5])
print(repr(arr[:]))
print(repr(arr[1:]))
print(repr(arr[2:4]))
print(repr(arr[:-1]))
print(repr(arr[-2:]))

array([1, 2, 3, 4, 5])
array([2, 3, 4, 5])
array([3, 4])
array([1, 2, 3, 4])
array([4, 5])


For multi-dimensional arrays, we can use a comma to separate slices across each dimension.

The code below shows example slices of a 2-D NumPy array.

In [74]:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])

print(repr(arr[:]))
print(repr(arr[1:]))
print(repr(arr[:, -1]))
print(repr(arr[0:1, 1:]))
print(repr(arr[0, 1:]))

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
array([[4, 5, 6],
       [7, 8, 9]])
array([3, 6, 9])
array([[2, 3]])
array([2, 3])


### Argmin and argmax
In addition to accessing and slicing arrays, it is useful to figure out the actual indexes of the minimum and maximum elements. To do this, we use the `np.argmin` and `np.argmax` functions.

The code below shows example usages of `np.argmin` and `np.argmax`. Note that the index of element -6 is index 5 in the flattened version of `arr`.

In [75]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])
print(np.argmin(arr[0]))
print(np.argmax(arr[2]))
print(np.argmin(arr))

2
1
5


The `np.argmin` and `np.argmax` functions take the same arguments. The required argument is the input array and the `axis` keyword argument specifies which dimension to apply the operation on.

The code below shows how the `axis` keyword argument is used for these functions.

In [76]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])
print(repr(np.argmin(arr, axis=0)))
print(repr(np.argmin(arr, axis=1)))
print(repr(np.argmax(arr, axis=-1)))

array([2, 0, 1], dtype=int64)
array([2, 2, 0], dtype=int64)
array([1, 1, 1], dtype=int64)


In our example, using `axis=0` means the function finds the row index of the minimum element in each column. Using `axis=1` means the function finds the column index of the minimum element in each row.

Setting `axis` to -1 just means we apply the function across the last dimension. In this case, `axis=-1` is equivalent to `axis=1`.
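
When `np.argmin` is called without `axis`, the result is an index into the flattened array. One way to turn it back into a row and column is `np.unravel_index`, which is not covered above and is shown here only as a possible follow-up step. The sketch also checks that `axis=-1` and `axis=1` agree for a 2-D array.

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])

# Index of the minimum in the flattened array
flat_index = np.argmin(arr)

# Convert the flat index into (row, column) coordinates
row, col = np.unravel_index(flat_index, arr.shape)
print(row, col)
print(arr[row, col])

# For a 2-D array the last axis is axis 1
same = np.array_equal(np.argmax(arr, axis=-1), np.argmax(arr, axis=1))
print(same)
```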

### Time to code!
Each coding exercise in this chapter will be to complete a small function that takes in a 2-D NumPy matrix (data) as input. The first function to complete is direct_index.

Set elem equal to the third element of the second row in data (remember that the first row is index 0). Then return elem.

In [77]:
def direct_index(data):
    elem = data[1][2]
    return elem

The next function, slice_data, will return two slices from the input data.

The first slice will contain all the rows, but will skip the first element in each row. The second slice will contain all the elements of the first three rows except the last two elements.

Set slice1 equal to the specified first slice. Remember that NumPy uses a comma to separate slices along different dimensions.

Set slice2 equal to the specified second slice.

Return a tuple containing slice1 and slice2, in that order.

In [78]:
def slice_data(data):
    slice1 = data[:, 1:]
    slice2 = data[0:3, :-2]
    return slice1, slice2

The next function, argmin_data, will find minimum indexes in the input data.

We can use np.argmin to find minimum points in the data array. First, we'll find the index of the overall minimum element.

We can also return the indexes of each row's minimum element. This is equivalent to finding the minimum column for each row, which means our operation is done along axis 1.

Set argmin_all equal to np.argmin with data as the only argument.

Set argmin1 equal to np.argmin with data as the first argument and the specified axis keyword argument.

Return a tuple containing argmin_all and argmin1, in that order.

In [79]:
def argmin_data(data):
    argmin_all = np.argmin(data)
    argmin1 = np.argmin(data, axis=1)
    return argmin_all, argmin1


The final function, argmax_data, will find the index of each row's maximum element in data. Since there are only 2 dimensions in data, we can apply the operation along either axis 1 or -1.

Set argmax_neg1 equal to np.argmax with data as the first argument and -1 as the axis keyword argument. Then return argmax_neg1.

In [80]:
def argmax_data(data):
    argmax_neg1 = np.argmax(data, axis=-1)
    return argmax_neg1
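
To see the four functions in action, the sketch below repeats their definitions in compact form (so the block runs on its own) and applies them to a small sample matrix. The values in `data` are made up for illustration.

```python
import numpy as np

def direct_index(data):
    return data[1][2]

def slice_data(data):
    return data[:, 1:], data[0:3, :-2]

def argmin_data(data):
    return np.argmin(data), np.argmin(data, axis=1)

def argmax_data(data):
    return np.argmax(data, axis=-1)

# Hypothetical 4x4 input matrix
data = np.array([[2, 3, 1, 9],
                 [8, 0, 7, 4],
                 [5, 6, 5, 3],
                 [1, 4, 2, 6]])

elem = direct_index(data)
slice1, slice2 = slice_data(data)
argmin_all, argmin1 = argmin_data(data)
argmax_neg1 = argmax_data(data)

print(elem)
print(repr(slice2))
print(argmin_all, repr(argmin1))
print(repr(argmax_neg1))
```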

## Filtering
Filter NumPy data for specific values.

### Filtering data
Sometimes we have data that contains values we don't want to use. For example, when tracking the best hitters in baseball, we may want to only use the batting average data above .300. In this case, we should *filter* the overall data for only the values that we want.

The key to filtering data is through basic relational operations, e.g. `==`, `>`, etc. In NumPy, we can apply basic relational operations element-wise on arrays.

The code below shows relational operations on NumPy arrays. The `~` operation represents a boolean negation, i.e. it flips each truth value in the array.

In [81]:
arr = np.array([[0, 2, 3],
                [1, 3, -6],
                [-3, -2, 1]])
print(repr(arr == 3))
print(repr(arr > 0))
print(repr(arr != 1))
# Negated from the previous step
print(repr(~(arr != 1)))

array([[False, False,  True],
       [False,  True, False],
       [False, False, False]])
array([[False,  True,  True],
       [ True,  True, False],
       [False, False,  True]])
array([[ True,  True,  True],
       [False,  True,  True],
       [ True,  True, False]])
array([[False, False, False],
       [ True, False, False],
       [False, False,  True]])


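Something to note is that `np.nan` can't be located with a relational operation: comparisons against `np.nan` don't identify it, since even `np.nan == np.nan` evaluates to `False`. Instead, we use `np.isnan` to filter for the locations of `np.nan`.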

The code below uses `np.isnan` to determine which locations of the array contain `np.nan` values.

In [82]:
arr = np.array([[0, 2, np.nan],
                [1, np.nan, -6],
                [np.nan, -2, 1]])
print(repr(np.isnan(arr)))

array([[False, False,  True],
       [False,  True, False],
       [ True, False, False]])
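
To make the pitfall concrete, the sketch below compares an equality test against `np.nan` with `np.isnan` on the same array. The names `eq_mask` and `nan_mask` are illustrative.

```python
import numpy as np

arr = np.array([0, 2, np.nan, 1, np.nan])

# Equality against np.nan never matches, even at the NaN positions
eq_mask = arr == np.nan
print(repr(eq_mask))

# np.isnan marks the NaN positions correctly
nan_mask = np.isnan(arr)
print(repr(nan_mask))

# NaN is not even equal to itself
print(np.nan == np.nan)
```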


Each boolean array in our examples represents the location of elements we want to filter for. The way we perform the filtering itself is through the `np.where` function.

### Filtering in NumPy
The `np.where` function takes in a required first argument, which is a boolean array where `True` represents the locations of the elements we want to filter for. When the function is applied with only the first argument, it returns a tuple of 1-D arrays.

The tuple will have size equal to the number of dimensions in the data, and each array represents the `True` indices for the corresponding dimension. Note that the arrays in the tuple will all have the same length, equal to the number of `True` elements in the input argument.

The code below shows how to use `np.where` with a single argument.

In [83]:
print(repr(np.where([True, False, True])))

arr = np.array([0, 3, 5, 3, 1])
print(repr(np.where(arr == 3)))

arr = np.array([[0, 2, 3],
                [1, 0, 0],
                [-3, 0, 0]])
x_ind, y_ind = np.where(arr != 0)
print(repr(x_ind)) # x indices of non-zero elements
print(repr(y_ind)) # y indices of non-zero elements
print(repr(arr[x_ind, y_ind]))

(array([0, 2], dtype=int64),)
(array([1, 3], dtype=int64),)
array([0, 0, 1, 2], dtype=int64)
array([1, 2, 0, 0], dtype=int64)
array([ 2,  3,  1, -3])
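
As a small cross-check, the sketch below confirms that the tuple returned by `np.where` has one index array per dimension, and that indexing with the tuple selects the same elements as indexing with the boolean array directly. Direct boolean indexing is not covered above; it appears here only for comparison.

```python
import numpy as np

arr = np.array([[0, 2, 3],
                [1, 0, 0],
                [-3, 0, 0]])
mask = arr != 0

indices = np.where(mask)
print(len(indices) == arr.ndim)  # one index array per dimension

via_where = arr[indices]  # index with the tuple from np.where
via_mask = arr[mask]      # index with the boolean array itself
print(repr(via_where))
print(np.array_equal(via_where, via_mask))
```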


The interesting thing about `np.where` is that it must be applied with exactly 1 or 3 arguments. When we use 3 arguments, the first argument is still the boolean array. However, the next two arguments represent the `True` replacement values and the `False` replacement values, respectively. The output of the function now becomes an array with the same shape as the first argument.

The code below shows how to use `np.where` with 3 arguments.

In [84]:
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
negatives = np.array([[-2, -5], [-1, -8]])
print(repr(np.where(np_filter, positives, negatives)))

np_filter = positives > 2
print(repr(np.where(np_filter, positives, negatives)))

np_filter = negatives > 0
print(repr(np.where(np_filter, positives, negatives)))

array([[ 1, -5],
       [-1,  4]])
array([[-2, -5],
       [ 3,  4]])
array([[-2, -5],
       [-1, -8]])


Note that our second and third arguments had the same shape as the first argument. However, if we want to use a constant replacement value, e.g. `-1`, we can rely on broadcasting. Rather than using an entire array of the same value, we can just use the value itself as an argument.

The code below showcases broadcasting with `np.where`.

In [85]:
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
print(repr(np.where(np_filter, positives, -1)))

array([[ 1, -1],
       [-1,  4]])
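
One common use of the 3-argument form with a broadcast constant is replacing unwanted values. The sketch below, an illustration rather than a prescribed recipe, combines it with `np.isnan` to replace missing values with `0.0`, and then replaces negative values in the result.

```python
import numpy as np

arr = np.array([[0, 2, np.nan],
                [1, np.nan, -6],
                [np.nan, -2, 1]])

# Where the value is NaN use 0.0, otherwise keep the original value
filled = np.where(np.isnan(arr), 0.0, arr)
print(repr(filled))

# Where the value is negative use 0.0, otherwise keep it
clipped = np.where(filled < 0, 0.0, filled)
print(repr(clipped))
```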


### Axis-wise filtering
If we wanted to filter based on rows or columns of data, we could use the `np.any` and `np.all` functions. Both functions take in the same arguments, and return a single boolean or a boolean array. The required argument for both functions is a boolean array.

The code below shows usage of `np.any` and `np.all` with a single argument.

In [86]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
print(repr(arr > 0))
print(np.any(arr > 0))
print(np.all(arr > 0))

array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True]])
True
False


The `np.any` function is equivalent to performing a logical OR (`||`) across the elements of the first argument, while the `np.all` function is equivalent to performing a logical AND (`&&`). In other words, `np.any` returns `True` if at least one element meets the condition, and `np.all` returns `True` only if every element meets it. When only a single argument is passed in, the function is applied across the entire input array, so the returned value is a single boolean.

However, if we use a multi-dimensional input and specify the `axis` keyword argument, the returned value will be an array. The `axis` argument has the same meaning as it did for `np.argmin` and `np.argmax`. For a 2-D array, using `axis=0` applies the function down each column, returning one boolean per column. Using `axis=1` applies the function across each row, returning one boolean per row.

Setting `axis` to -1 just means we apply the function across the last dimension.

The code below shows examples of using `np.any` and `np.all` with the `axis` keyword argument.

In [87]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
print(repr(arr > 0))
print(repr(np.any(arr > 0, axis=0)))
print(repr(np.any(arr > 0, axis=1)))
print(repr(np.all(arr > 0, axis=1)))

array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True]])
array([ True,  True,  True])
array([False,  True,  True])
array([False, False,  True])


We can use `np.any` and `np.all` in tandem with `np.where` to filter for entire rows or columns of data.

In the code example below, we use `np.any` to obtain a boolean array representing the rows that have at least one positive number. We then use the boolean array as the input to `np.where`, which gives us the actual indices of the rows with at least one positive number.

In [88]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
has_positive = np.any(arr > 0, axis=1)
print(has_positive)
print(repr(arr[np.where(has_positive)]))

[False  True  True]
array([[ 4,  5, -6],
       [ 3,  9,  1]])
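
The same pattern works for columns. The sketch below uses a slightly different, made-up array in which one column is entirely positive, applies `np.all` along `axis=0`, and then uses the indices from `np.where` in the column position of the index.

```python
import numpy as np

arr = np.array([[-2, 1, -3],
                [4, 5, -6],
                [3, 9, 1]])

# One boolean per column: True if every value in the column is positive
all_positive_cols = np.all(arr > 0, axis=0)
print(all_positive_cols)

# np.where returns a tuple; its first entry holds the column indices
col_idx = np.where(all_positive_cols)[0]
selected = arr[:, col_idx]
print(repr(selected))
```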


# Statistics
## Analysis

We can obtain the minimum and maximum values of a NumPy array using its built-in `min` and `max` methods. This gives us an initial sense of the data's range, and can alert us to extreme outliers in the data.

The code below shows example usages of the `min` and `max` methods.

In [3]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
print(arr.min())
print(arr.max())

print(repr(arr.min(axis=0)))
print(repr(arr.max(axis=-1)))

-60
72
array([ -3,  -2, -60])
array([72,  3,  4])


The `axis` keyword argument works the same way as it did for `np.argmin` and `np.argmax` in the Indexing chapter. In our example, we use `axis=0` to find an array of the minimum values in each column of `arr` and `axis=-1` (equivalent to `axis=1` for a 2-D array) to find an array of the maximum values in each row of `arr`.

## Statistical metrics
NumPy also provides basic statistical functions such as `np.mean`, `np.var`, and `np.median`, to calculate the mean, variance, and median of the data, respectively.

The code below shows how to obtain basic statistics with NumPy. Note that `np.median` applied without `axis` takes the median of the flattened array.

In [4]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
print(np.mean(arr))
print(np.var(arr))
print(np.median(arr))
print(repr(np.median(arr, axis=-1)))

2.0
977.3333333333334
1.0
array([ 3.,  1., -2.])


Each of these functions takes in the data array as a required argument and `axis` as a keyword argument.
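
As an illustration of the `axis` keyword argument with these functions, the sketch below computes per-column means and per-row variances. It also uses `np.std`, which is not mentioned above, to show that the standard deviation is the square root of the variance.

```python
import numpy as np

arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])

col_means = np.mean(arr, axis=0)  # one mean per column
row_vars = np.var(arr, axis=1)    # one variance per row
print(repr(col_means))
print(repr(row_vars))

# Standard deviation is the square root of the variance
std_all = np.std(arr)
print(np.isclose(std_all, np.sqrt(np.var(arr))))
```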

# Aggregation
## Summation

To sum the values within a single array, we use the `np.sum` function.
The function takes in a NumPy array as its required argument, and uses the `axis` keyword argument in the same way as before. If the `axis` keyword argument is not specified, `np.sum` returns the overall sum of the array.

In [5]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
print(np.sum(arr))
print(repr(np.sum(arr, axis=0)))
print(repr(np.sum(arr, axis=1)))

18
array([ -2,  73, -53])
array([ 75, -56,  -1])


In addition to regular sums, NumPy can perform cumulative sums using `np.cumsum`. Like `np.sum`, `np.cumsum` also takes in a NumPy array as its required argument and uses the `axis` argument. If the `axis` keyword argument is not specified, `np.cumsum` will return the cumulative sums for the flattened array.

The code below shows how to use `np.cumsum`. For a 2-D NumPy array, setting `axis=0` returns an array with cumulative sums down each column, while `axis=1` returns the array with cumulative sums across each row. Not setting `axis` returns a cumulative sum across all the values of the flattened array.

In [6]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
print(repr(np.cumsum(arr)))
print(repr(np.cumsum(arr, axis=0)))
print(repr(np.cumsum(arr, axis=1)))

array([ 0, 72, 75, 76, 79, 19, 16, 14, 18], dtype=int32)
array([[  0,  72,   3],
       [  1,  75, -57],
       [ -2,  73, -53]], dtype=int32)
array([[  0,  72,  75],
       [  1,   4, -56],
       [ -3,  -5,  -1]], dtype=int32)
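
A useful way to check a cumulative sum is that its final entry along the chosen axis equals the regular sum along that axis. The sketch below verifies this for the flattened case and for `axis=1`; the names `running` and `row_totals` are illustrative.

```python
import numpy as np

arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])

# Flattened case: the last cumulative value is the overall sum
flat_running = np.cumsum(arr)
print(flat_running[-1] == np.sum(arr))

# Row-wise case: the last column of the cumulative sums holds the row sums
running = np.cumsum(arr, axis=1)
row_totals = running[:, -1]
print(np.array_equal(row_totals, np.sum(arr, axis=1)))
```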


## Concatenation

An important part of aggregation is combining multiple datasets. In NumPy, this equates to combining multiple arrays into one. The function we use to do this is `np.concatenate`.

Like the summation functions, `np.concatenate` uses the `axis` keyword argument. However, the default value of `axis` is `0`. Furthermore, the required argument for `np.concatenate` is a list of arrays, which the function combines into a single array.

The code below shows how to use `np.concatenate`, which aggregates arrays by joining them along a specific dimension. For 2-D arrays, not setting the `axis` argument (which defaults to `axis=0`) concatenates the arrays vertically. When we set `axis=1`, the arrays are concatenated horizontally.

In [7]:
arr1 = np.array([[0, 72, 3],
                 [1, 3, -60],
                 [-3, -2, 4]])
arr2 = np.array([[-15, 6, 1],
                 [8, 9, -4],
                 [5, -21, 18]])
print(repr(np.concatenate([arr1, arr2])))
print(repr(np.concatenate([arr1, arr2], axis=1)))
print(repr(np.concatenate([arr2, arr1], axis=1)))

array([[  0,  72,   3],
       [  1,   3, -60],
       [ -3,  -2,   4],
       [-15,   6,   1],
       [  8,   9,  -4],
       [  5, -21,  18]])
array([[  0,  72,   3, -15,   6,   1],
       [  1,   3, -60,   8,   9,  -4],
       [ -3,  -2,   4,   5, -21,  18]])
array([[-15,   6,   1,   0,  72,   3],
       [  8,   9,  -4,   1,   3, -60],
       [  5, -21,  18,  -3,  -2,   4]])
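
The arrays being joined need matching sizes along every dimension except the one they are concatenated on. The sketch below appends a hypothetical extra row (`new_row`) to a 3x3 array, then shows that joining the same two arrays horizontally raises a `ValueError` because their row counts differ.

```python
import numpy as np

arr1 = np.array([[0, 72, 3],
                 [1, 3, -60],
                 [-3, -2, 4]])
new_row = np.array([[7, 8, 9]])  # shape (1, 3)

# Vertical join works: both arrays have 3 columns
stacked = np.concatenate([arr1, new_row])
print(stacked.shape)

# Horizontal join fails: 3 rows versus 1 row
try:
    np.concatenate([arr1, new_row], axis=1)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```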


# Saving Data
## Saving

After performing data manipulation with NumPy, it is a good idea to save the data in a file for future use. To do this, we use the `np.save` function.

The first argument for the function is the name/path of the file we want to save our data to. The file name/path should have a ".npy" extension. If it doesn't, then `np.save` will append the ".npy" extension to it.

The second argument for `np.save` is the NumPy data we want to save. The function has no return value. Also, ".npy" files use a binary format, so their contents look largely like gibberish when viewed with a text editor.

If `np.save` is called with the name of a file that already exists, it will overwrite the previous file.

The code below shows examples of saving NumPy data.

In [8]:
arr = np.array([1, 2, 3])
# Saves to 'arr.npy'
np.save('arr.npy', arr)
# Also saves to 'arr.npy'
np.save('arr', arr)
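
To confirm the extension and overwrite behavior without leaving files behind, the sketch below saves into a temporary directory and lists its contents. The use of `tempfile` and the name `saved_files` are choices made for this illustration.

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 2, 3])

with tempfile.TemporaryDirectory() as tmp_dir:
    # No extension given, so '.npy' is appended
    np.save(os.path.join(tmp_dir, 'arr'), arr)
    # Same target file, so the earlier contents are overwritten
    np.save(os.path.join(tmp_dir, 'arr.npy'), arr * 2)
    saved_files = sorted(os.listdir(tmp_dir))

print(saved_files)  # only one file exists
```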

## Loading
After saving our data, we can load it again using `np.load`. The function's required argument is the file name/path that contains the saved data. It returns the NumPy data exactly as it was saved.

Note that `np.load` will not append the ".npy" extension to the file name/path if it is not there.

The code below shows how to use `np.load` to load NumPy data.

In [9]:
arr = np.array([1, 2, 3])
np.save('arr.npy', arr)
load_arr = np.load('arr.npy')
print(repr(load_arr))

# Will result in FileNotFoundError
load_arr = np.load('arr')

array([1, 2, 3])


FileNotFoundError: [Errno 2] No such file or directory: 'arr'
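
The round trip can be checked end to end. The sketch below saves a small `float32` array to a temporary directory, loads it back, and compares values, shape, and type. It also catches the `FileNotFoundError` raised when the extension is left off, so the script keeps running. The temporary directory and variable names are assumptions for this example.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.5, 2.0], [3.0, 4.25]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'arr.npy')
    np.save(path, arr)
    loaded = np.load(path)

    # np.load does not append '.npy', so this path does not exist
    try:
        np.load(os.path.join(tmp_dir, 'arr'))
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(np.array_equal(loaded, arr))
print(loaded.dtype, loaded.shape)
print(missing_raised)
```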