<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/notebooks/numpy_data_structures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NumPy Vectors, Matrices, and Arrays

- From [Machine Learning with Python Cookbook](https://www.oreilly.com/library/view/machine-learning-with/9781491989371/ch01.html) by [Chris Albon](https://chrisalbon.com/) [[Github/notes](https://github.com/chrisalbon/notes)], published by [O'Reilly Safari](https://www.oreilly.com).

Updated by [John Fogarty](https://github.com/jfogarty) for Python 3.6 and [Base2 MLI](https://github.com/base2solutions/mli).


## 1.0 Introduction

NumPy is the foundation of the Python machine learning stack. NumPy allows for efficient operations on the data structures often used in machine learning: vectors, matrices, and tensors. While NumPy is not the focus of this book, it will show up frequently throughout the following chapters. This chapter covers the most common NumPy operations we are likely to run into while working on machine learning workflows.

## 1.1 Creating a Vector
### Problem
You need to create a vector.

### Solution
Use NumPy to create a one-dimensional array:

In [0]:
# Load library
import numpy as np

# Create a vector as a row
vector_row = np.array([1, 2, 3])

# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])

### Discussion

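NumPy’s main data structure is the multidimensional array. To create a vector, we simply create a one-dimensional array. Just like vectors, these arrays can be represented horizontally (i.e., rows) or vertically (i.e., columns). Note that the column form in the solution is, strictly speaking, a two-dimensional array with three rows and one column.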
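
To make the row/column distinction concrete, the sketch below inspects the shapes of the two arrays from the solution. The names `row_shape`, `column_shape`, and `explicit_row` are introduced here for illustration only.

```python
import numpy as np

vector_row = np.array([1, 2, 3])
vector_column = np.array([[1],
                          [2],
                          [3]])

# The row vector is a 1D array with a single axis of length 3
row_shape = vector_row.shape

# The column vector is a 2D array: three rows, one column
column_shape = vector_column.shape

# If an explicit 1x3 row is needed, reshape adds the extra axis
explicit_row = vector_row.reshape(1, -1)

print(row_shape, column_shape, explicit_row.shape)
```

Checking `shape` this way is a quick way to tell whether an array is truly one-dimensional or a two-dimensional array that happens to have a single row or column.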

### See Also
- [Vectors, Math Is Fun](http://bit.ly/2FB5q1v)

- [Euclidean vector, Wikipedia](http://bit.ly/2FtnRoL)


## 1.2 Creating a Matrix
### Problem
You need to create a matrix.

### Solution
Use NumPy to create a two-dimensional array:

In [0]:
# Create a matrix
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])

### Discussion
To create a matrix we can use a NumPy two-dimensional array. In our solution, the matrix contains three rows and two columns (a column of 1s and a column of 2s).

NumPy actually has a dedicated matrix data structure:

In [4]:
# np.asmatrix is used here because the np.mat alias was removed in NumPy 2.0
matrix_object = np.asmatrix([[1, 2],
                             [1, 2],
                             [1, 2]])
matrix_object

matrix([[1, 2],
        [1, 2],
        [1, 2]])

However, the matrix data structure is **not recommended** for two reasons. First, arrays are the de facto standard data structure of NumPy. Second, the vast majority of NumPy operations return arrays, not matrix objects.
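
As a small illustration of the second point, the sketch below multiplies a plain two-dimensional array by its transpose and checks the type of the result. The `@` operator is used here as one way to write a matrix product; the name `product` is introduced for this example.

```python
import numpy as np

matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])

# Matrix product of the 2x3 transpose with the 3x2 matrix
product = matrix.T @ matrix

# The result is an ordinary ndarray, not a matrix object
print(type(product).__name__)
print(product)
```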

### See Also
- [Matrix, Wikipedia](http://bit.ly/2Ftnevp)

- [Matrix, Wolfram MathWorld](http://bit.ly/2Fut7IJ)

## 1.3 Creating a Sparse Matrix
### Problem
Given data with very few nonzero values, you want to efficiently represent it.

### Solution
Create a sparse matrix:

In [0]:
# Load libraries
from scipy import sparse

# Create a matrix
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Create compressed sparse row (CSR) matrix
matrix_sparse = sparse.csr_matrix(matrix)

### Discussion
A frequent situation in machine learning is having a huge amount of data; however, most of the elements in the data are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows! However, since most users do not watch most movies, the vast majority of elements would be zero.

Sparse matrices only store nonzero elements and assume all other values will be zero, leading to significant computational savings. In our solution, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:

In [6]:
# View sparse matrix
print(matrix_sparse)

  (1, 1)	1
  (2, 0)	3


There are several types of sparse matrices. In the *compressed sparse row* (CSR) output above, (1, 1) and (2, 0) are the (zero-indexed) row and column indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column. We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare this larger matrix with our original sparse matrix:

In [7]:
# Create larger matrix
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Create compressed sparse row (CSR) matrix
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# View original sparse matrix
print(matrix_sparse)

  (1, 1)	1
  (2, 0)	3


In [8]:
# View larger sparse matrix
print(matrix_large_sparse)

  (1, 1)	1
  (2, 0)	3


As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. That is, the added zero elements did not change the number of values the sparse matrix stores.
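
The same comparison can be made numerically. The sketch below rebuilds both matrices (under the hypothetical names `small` and `large`) and compares the number of stored values with the number of elements in the dense arrays. It assumes SciPy is installed.

```python
import numpy as np
from scipy import sparse

small = np.array([[0, 0],
                  [0, 1],
                  [3, 0]])

# Same nonzero values, but ten columns instead of two
large = np.zeros((3, 10), dtype=small.dtype)
large[1, 1] = 1
large[2, 0] = 3

small_sparse = sparse.csr_matrix(small)
large_sparse = sparse.csr_matrix(large)

# The dense arrays grow with the number of elements
print(small.size, large.size)

# The number of explicitly stored values is the same for both
print(small_sparse.nnz, large_sparse.nnz)
print(small_sparse.data, large_sparse.data)
```

Note that a CSR matrix also keeps index arrays alongside the values, so adding rows (rather than columns) would increase its storage slightly.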

As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list of lists, and dictionary of keys. While an explanation of the different types and their implications is outside the scope of this book, it is worth noting that while there is no “best” sparse matrix type, there are meaningful differences between them and we should be conscious about why we are choosing one type over another.
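
For reference, a minimal sketch of converting between the formats named above, using SciPy's `tocsc`, `tolil`, and `todok` conversion methods. The variable names are chosen for this example only.

```python
import numpy as np
from scipy import sparse

matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

csr = sparse.csr_matrix(matrix)

# Convert the CSR matrix to other sparse formats
csc = csr.tocsc()   # compressed sparse column
lil = csr.tolil()   # list of lists
dok = csr.todok()   # dictionary of keys

# Each object reports its own storage format
formats = [m.format for m in (csr, csc, lil, dok)]
print(formats)

# Every format converts back to the same dense array
same = all((m.toarray() == matrix).all() for m in (csc, lil, dok))
print(same)
```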

### See Also
- [Sparse matrices, SciPy documentation](http://bit.ly/2HReBZR)

- [101 Ways to Store a Sparse Matrix](http://bit.ly/2HS43cI)

## 1.4 Selecting Elements
### Problem
You need to select one or more elements in a vector or matrix.

### Solution
NumPy’s arrays make that easy:

In [9]:
# Create row vector
vector = np.array([1, 2, 3, 4, 5, 6])

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Select third element of vector
vector[2]

3

In [10]:
# Select second row, second column
matrix[1,1]

5

### Discussion
Like most things in Python, NumPy arrays are zero-indexed, meaning that the index of the first element is 0, not 1. With that caveat, NumPy offers a wide variety of methods for selecting (i.e., indexing and slicing) elements or groups of elements in arrays:

In [11]:
# Select all elements of a vector
vector[:]

array([1, 2, 3, 4, 5, 6])

In [12]:
# Select everything up to and including the third element
vector[:3]

array([1, 2, 3])

In [13]:
# Select everything after the third element
vector[3:]

array([4, 5, 6])

In [14]:
# Select the last element
vector[-1]

6

In [15]:
# Select the first two rows and all columns of a matrix
matrix[:2,:]

array([[1, 2, 3],
       [4, 5, 6]])

In [16]:
# Select all rows and the second column
matrix[:,1:2]

array([[2],
       [5],
       [8]])
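
One detail worth a quick sketch: selecting a column with an integer index and selecting it with a slice return arrays of different dimensions. The names `column_1d`, `column_2d`, and `rows` are introduced here for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# An integer index drops the axis: the second column as a 1D array
column_1d = matrix[:, 1]

# A slice keeps the axis: the same column as a 3x1 array
column_2d = matrix[:, 1:2]

print(column_1d.shape, column_2d.shape)

# A list of indices selects specific rows, in the order given
rows = matrix[[2, 0], :]
print(rows)
```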

## 1.5 Describing a Matrix
### Problem
You want to describe the shape, size, and dimensions of the matrix.

### Solution
Use shape, size, and ndim:

In [17]:
# Create matrix
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# View number of rows and columns
matrix.shape

(3, 4)

In [18]:
# View number of elements (rows * columns)
matrix.size

12

In [19]:
# View number of dimensions
matrix.ndim

2

### Discussion
This might seem basic (and it is); however, time and again it will be valuable to check the shape and size of an array both for further calculations and simply as a gut check after some operation.
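
As an example of such a gut check, the sketch below transposes the matrix and confirms that the shape changed while the number of elements and dimensions did not. Transposing is just one convenient operation to illustrate the idea.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# Transposing swaps rows and columns
transposed = matrix.T

# The shape flips, but size and ndim are unchanged
print(matrix.shape, transposed.shape)
print(matrix.size, transposed.size)
print(matrix.ndim, transposed.ndim)
```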

## 1.6 Applying Operations to Elements
### Problem
You want to apply some function to multiple elements in an array.

### Solution
Use NumPy’s vectorize:

In [20]:
# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Create function that adds 100 to something
add_100 = lambda i: i + 100

# Create vectorized function
vectorized_add_100 = np.vectorize(add_100)

# Apply function to all elements in matrix
vectorized_add_100(matrix)

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

### Discussion
NumPy’s vectorize class converts a function into a function that can apply to all elements in an array or slice of an array. It’s worth noting that vectorize is essentially a for loop over the elements and does not increase performance. Furthermore, NumPy arrays allow us to perform operations between arrays even if their dimensions are not the same (a process called broadcasting). For example, we can create a much simpler version of our solution using broadcasting:

In [21]:
# Add 100 to all elements
matrix + 100

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])
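
Broadcasting also works between arrays of different shapes, not only between an array and a scalar. A minimal sketch, with the offset arrays made up for this example:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Shape (3,) is stretched across every row: one offset per column
row_offsets = np.array([10, 20, 30])
by_column = matrix + row_offsets

# Shape (3, 1) is stretched across every column: one offset per row
column_offsets = np.array([[100],
                           [200],
                           [300]])
by_row = matrix + column_offsets

print(by_column)
print(by_row)
```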

## 1.7 Finding the Maximum and Minimum Values
### Problem
You need to find the maximum or minimum value in an array.

### Solution
Use NumPy’s max and min:

In [22]:
# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return maximum element
np.max(matrix)

9

In [23]:
# Return minimum element
np.min(matrix)

1

### Discussion
Often we want to know the maximum and minimum value in an array or subset of an array. This can be accomplished with the max and min methods. Using the axis parameter we can also apply the operation along a certain axis:

In [24]:
# Find maximum element in each column
np.max(matrix, axis=0)

array([7, 8, 9])

In [25]:
# Find maximum element in each row
np.max(matrix, axis=1)

array([3, 6, 9])
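
The same pattern applies to `min` and to subsets of an array. The sketch below uses both the `np.min` function and the equivalent array method; the variable names are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Minimum element in each column
column_min = np.min(matrix, axis=0)

# Minimum element in each row, using the array method
row_min = matrix.min(axis=1)

# Maximum of a subset: the first two rows only
subset_max = np.max(matrix[:2, :])

print(column_min, row_min, subset_max)
```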

## 1.8 Calculating the Average, Variance, and Standard Deviation
### Problem
You want to calculate some descriptive statistics about an array.

### Solution
Use NumPy’s mean, var, and std:

In [26]:
# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return mean
np.mean(matrix)

5.0

In [27]:
# Return variance
np.var(matrix)

6.666666666666667

In [28]:
# Return standard deviation
np.std(matrix)

2.581988897471611

### Discussion
Just like with max and min, we can easily get descriptive statistics about the whole matrix or do calculations along a single axis:

In [29]:
# Find the mean value in each column
np.mean(matrix, axis=0)

array([4., 5., 6.])
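
Variance and standard deviation accept the same `axis` argument. The sketch below also shows the `ddof` parameter: by default NumPy divides by the number of elements (population variance), and `ddof=1` divides by one less (sample variance). The variable names are chosen for this example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Variance of each column
column_var = np.var(matrix, axis=0)

# Standard deviation of each row
row_std = np.std(matrix, axis=1)

# Sample variance of the whole matrix (divides by n - 1 instead of n)
sample_var = np.var(matrix, ddof=1)

print(column_var)
print(row_std)
print(sample_var)
```

Whether to use the population or the sample form depends on the analysis; the default output shown in the solution is the population form.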

## 1.9 Reshaping Arrays
### Problem
You want to change the shape (number of rows and columns) of an array without changing the element values.

### Solution
Use NumPy’s reshape:

In [30]:
# Create 4x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# Reshape matrix into 2x6 matrix
matrix.reshape(2, 6)

array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12]])

### Discussion
reshape allows us to restructure an array so that we maintain the same data but it is organized as a different number of rows and columns. The only requirement is that the shape of the original and new matrix contain the same number of elements (i.e., the same size). We can see the size of a matrix using size:

In [31]:
matrix.size

12

One useful argument in reshape is -1, which effectively means “as many as needed,” so reshape(1, -1) means one row and as many columns as needed:

In [32]:
matrix.reshape(1, -1)

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])

Finally, if we provide one integer, reshape will return a 1D array of that length:

In [33]:
matrix.reshape(12)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
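
A short sketch of two related cases: using -1 for the row count instead of the column count, and what happens when the requested shape cannot hold the same number of elements. The matrix is rebuilt with `np.arange` for brevity, and the names `column`, `three_rows`, and `raised` are illustrative.

```python
import numpy as np

# Same 4x3 matrix as in the solution, built from the values 1 through 12
matrix = np.arange(1, 13).reshape(4, 3)

# As many rows as needed, one column
column = matrix.reshape(-1, 1)

# Three rows, as many columns as needed
three_rows = matrix.reshape(3, -1)

print(column.shape, three_rows.shape)

# 12 elements cannot be split evenly into 5 rows, so reshape raises an error
raised = False
try:
    matrix.reshape(5, -1)
except ValueError as error:
    raised = True
    print(error)
```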