# Chapter 2: Linear Algebra

In [1]:
import tensorflow as tf
import numpy as np

## 2.1 Scalars, Vectors, Matrices, and Tensors

### Scalars
A scalar is a single number. Scalars are typically written in italics with lowercase variable names, and we state what kind of number they are when we introduce them (e.g. $x \in \mathbb{R}$, $x \in \mathbb{N}$, or $x \in \mathbb{Z}$).


In [2]:
x = tf.constant(35, name='x')  # create a constant called x, which has a numerical value of 35
y = tf.Variable(x + 5, name='y')  # create a variable called y, initialized from the expression "x + 5"

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("The value of y is:", session.run(y))  # run just the variable y and print its current value

The value of y is: 40


### Vectors
An array of numbers, denoted by lowercase variable names written in bold. If a vector $\bf{v}$ contains $\it{n}$ elements, each element in $\mathbb{R}$, then the vector lies in the set formed by taking the Cartesian product of $\mathbb{R}$ $\it{n}$ times, denoted as $\mathbb{R}^n$. When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets.

For example:   $\begin{equation}
     v=\begin{bmatrix}
         v_{1} \\
         v_{2} \\
         \vdots\\
         v_{n}
        \end{bmatrix}
  \end{equation} \in \mathbb{R}^n $ is the n-dimensional vector $\bf{v}$ in Cartesian space.

In [3]:
v = tf.Variable([1, 2, 3, 4], dtype=tf.int32, name="v")  # create a rank 1 tensor (i.e. a vector) by passing a list; dtype is a keyword argument because the second positional argument of tf.Variable is `trainable`

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A vector, v:", session.run(v))

A vector, v: [1 2 3 4]


### Matrices

A matrix is simply a 2-D array of numbers, denoted by uppercase letters with a bold typeface. Matrix shapes are written rows first; that is, a matrix $\bf{A} \in \mathbb{R}^{m \times n}$ is a real-valued matrix with $\it{m}$ rows and $\it{n}$ columns. A single element of the matrix situated in the $i$-th row and $j$-th column is denoted $A_{i, j}$.

For example: $\begin{equation}
A=\begin{bmatrix}
    a_{11} & a_{12} & a_{13} & \dots  & a_{1n} \\
    a_{21} & a_{22} & a_{23} & \dots  & a_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    a_{m1} & a_{m2} & a_{m3} & \dots  & a_{mn}
\end{bmatrix}
\end{equation} \in \mathbb{R}^{m \times n}$ is the real-valued $\it{m \times n}$ matrix $\bf{A}$.

In [4]:
m = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=tf.int32, name="m")  # pass in a 2-dimensional array; dtype must be given by keyword

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A matrix, M:\n",session.run(m))


A matrix, M:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]


### Tensors

In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We identify the element of a tensor $\bf{T}$ at coordinates ($\it{i}$, $\it{j}$, $\it{k}$) by writing $T_{i, j, k}$.

In [5]:
t = tf.ones([3, 4, 5])  # 3x4x5 tensor populated with ones
t_mat = tf.reshape(t, [6, 10])  # reshape t by passing t into tf.reshape with the desired dimensions

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A tensor, T:\n",session.run(t), "\n")  # print t
    print("T reshaped into a 6x10 matrix:\n", session.run(t_mat))  # print t_mat

A tensor, T:
 [[[ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]]

 [[ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]]

 [[ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]
  [ 1.  1.  1.  1.  1.]]] 

T reshaped into a 6x10 matrix:
 [[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]]


### Transposes

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line (called the "main diagonal") which begins in the upper left corner and runs down and across to the bottom right. We denote the transpose of a matrix $\bf{A}$ as $\bf{A}^\top$, and it is defined such that $(\boldsymbol{A}^\top)_{i, j} = A_{j, i}$.



In [6]:
# recall m = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]], tf.int32, name="m")
m_t = tf.transpose(m)

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A matrix, M:\n",session.run(m), "\n")
    print("The transpose of M:\n", session.run(m_t))

A matrix, M:
 [[1 2 3]
 [4 5 6]
 [7 8 9]] 

The transpose of M:
 [[1 4 7]
 [2 5 8]
 [3 6 9]]


### Other Operations

**Matrix Addition**: We can add two matrices of the same shape by adding their corresponding elements, i.e. $C_{i, j} = A_{i, j} + B_{i, j}$:

In [7]:
# recall m = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]], tf.int32, name="m")
n = tf.ones([3, 3], tf.int32, name="n")
p = tf.add(m, n, name="p")

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A matrix, m:\n", session.run(m), "\n")
    print("A matrix, n:\n", session.run(n) ,"\n")
    print("A matrix p, i.e. m + n:\n", session.run(p), "\n")

A matrix, m:
 [[1 2 3]
 [4 5 6]
 [7 8 9]] 

A matrix, n:
 [[1 1 1]
 [1 1 1]
 [1 1 1]] 

A matrix p, i.e. m + n:
 [[ 2  3  4]
 [ 5  6  7]
 [ 8  9 10]] 



**Matrix and Scalar Addition/Multiplication**: We can add a scalar to a matrix, or multiply a matrix by a scalar. Here, each element of the matrix is operated on by the given scalar:

In [8]:
# recall m = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]], tf.int32, name="m")
x = tf.constant(2, name='x')
y = tf.add(m, x, name="y")
z = tf.multiply(m, x, name="z")

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A matrix, m:\n", session.run(m), "\n")
    print("A scalar, x:", session.run(x), "\n")
    print("A matrix y, i.e. m + x:\n", session.run(y), "\n")
    print("A matrix z, i.e. m * x:\n", session.run(z))

A matrix, m:
 [[1 2 3]
 [4 5 6]
 [7 8 9]] 

A scalar, x: 2 

A matrix y, i.e. m + x:
 [[ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]] 

A matrix z, i.e. m * x:
 [[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]


**Matrix and Vector Addition**: In the context of deep learning, we also allow the addition of a matrix and a vector, yielding another matrix, $C_{i, j} = A_{i, j} + b_{j}$. The vector is added to each row of the matrix, so its length must equal the number of columns. This implicit copying of the vector to every row is called broadcasting.

In [9]:
# recall m = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]], tf.int32, name="m")
v = tf.Variable([1, 2, 1], dtype=tf.int32, name="v")  # dtype must be given by keyword
n = tf.add(m, v, name="n")

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A matrix, m:\n", session.run(m), "\n")
    print("A vector, v:", session.run(v), "\n")
    print("A matrix n, i.e. m + v:\n", session.run(n), "\n")


A matrix, m:
 [[1 2 3]
 [4 5 6]
 [7 8 9]] 

A vector, v: [1 2 1] 

A matrix n, i.e. m + v:
 [[ 2  4  4]
 [ 5  7  7]
 [ 8 10 10]] 
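
The row-wise rule can also be spelled out without TensorFlow. The following is a minimal sketch in plain Python that computes $C_{i, j} = A_{i, j} + v_{j}$ for the same `m` and `v`. The helper name `add_row_vector` is made up for this illustration and is not part of any library.

```python
def add_row_vector(matrix, vector):
    """Add `vector` to every row of `matrix` (row-wise broadcasting)."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("vector length must match the number of columns")
    return [[a + b for a, b in zip(row, vector)] for row in matrix]


m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
v = [1, 2, 1]
n = add_row_vector(m, v)
print(n)  # [[2, 4, 4], [5, 7, 7], [8, 10, 10]]
```

The length check mirrors the requirement that the vector has one entry per matrix column.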



## 2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two matrices. The **matrix product** of matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ is a third matrix $\boldsymbol{C}$. In order for this product to be defined, $\boldsymbol{A}$ must have the same number of columns as $\boldsymbol{B}$ has rows. If $\boldsymbol{A}$ is of shape $m \times n$ and $\boldsymbol{B}$ is of shape $n \times p$, then $\boldsymbol{C}$ is of shape $m \times p$.

We can write the matrix product by just placing two or more matrices together, for example,

$\quad \quad \quad \quad \quad \boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$

The product operation is defined by

$\quad \quad \quad \quad \quad C_{i, j} = \sum_{k} A_{i, k}B_{k, j}$

### The Dot Product

The dot product between two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ of the same dimensionality (i.e. the same number of elements) is the matrix product $\boldsymbol{x}^\top\boldsymbol{y}$. The result is a scalar: the sum of the products of corresponding elements, i.e. $\boldsymbol{x} \cdot \boldsymbol{y} = \sum\limits_{i}x_{i}y_{i}$.

In [10]:
# recall v = tf.Variable([1, 2, 1], tf.int32, name="v")
w = tf.Variable([3, 2, 1], dtype=tf.int32, name="w")  # dtype must be given by keyword
x = tf.tensordot(v, w, 1, name="x")

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A vector, v:", session.run(v), "\n")
    print("A vector, w:", session.run(w), "\n")
    print("The dot product of v and w:", session.run(x))

A vector, v: [1 2 1] 

A vector, w: [3 2 1] 

The dot product of v and w: 8


### Matrix Multiplication

One of the most important operations involving matrices is the multiplication of two matrices. The matrix product of matrices $\bf{A}$ and $\bf{B}$ is a third matrix, $\bf{C}$. In order for this product to be defined, $\bf{A}$ must have the same number of columns as $\bf{B}$ has rows. If $\bf{A}$ is of shape $\it{m \times n}$ and $\bf{B}$ is of shape $\it{n \times p}$ then $\bf{C}$ is of shape $\it{m \times p}$.

We can write the matrix product by just placing two or more matrices together, for example, $ \bf{C} = \bf{AB} $.

The product operation is defined by $C_{i, j} = \sum\limits_{k} A_{i, k}B_{k, j} $.

**Remark**: The standard product of two matrices is NOT simply the multiplication of corresponding elements.

In [11]:
# recall m = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]], tf.int32, name="m")
n = tf.Variable([[1, 2, 1], [1, 0, 2], [2, 1, 0]])
p = tf.matmul(m, n, name="p")

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A matrix, m:\n", session.run(m), "\n")
    print("A matrix n:\n", session.run(n), "\n")
    print("A matrix p, i.e. the matrix product mn:\n", session.run(p))

A matrix, m:
 [[1 2 3]
 [4 5 6]
 [7 8 9]] 

A matrix n:
 [[1 2 1]
 [1 0 2]
 [2 1 0]] 

A matrix p, i.e. the matrix product mn:
 [[ 9  5  5]
 [21 14 14]
 [33 23 23]]
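
To make the remark above concrete, here is a small sketch in plain Python that implements $C_{i, j} = \sum_{k} A_{i, k}B_{k, j}$ directly and contrasts it with the element-wise product. The function names `matmul` and `elementwise` are chosen for this example only; they are not the TensorFlow functions used in the cell above.

```python
def matmul(a, b):
    """Matrix product: C[i][j] = sum over k of a[i][k] * b[k][j]."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must equal rows of b")
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def elementwise(a, b):
    """Element-wise product, shown only for contrast."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n = [[1, 2, 1], [1, 0, 2], [2, 1, 0]]
print(matmul(m, n))       # [[9, 5, 5], [21, 14, 14], [33, 23, 23]]
print(elementwise(m, n))  # [[1, 4, 3], [4, 0, 12], [14, 8, 0]]
```

The first result matches the output of `tf.matmul` above, while the second is a different matrix altogether.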


### Properties of Matrix Multiplication
1. Distributive, i.e. $\boldsymbol{A}(\boldsymbol{B}+\boldsymbol{C}) = \boldsymbol{AB} + \boldsymbol{AC}$
2. Associative, i.e. $\boldsymbol{A}(\boldsymbol{B}\boldsymbol{C}) = (\boldsymbol{AB})\boldsymbol{C}$
3. NOT Commutative, i.e. $\boldsymbol{AB} = \boldsymbol{BA}$ does NOT always hold.

While matrix multiplication is not commutative, the dot product of vectors is, i.e. $\bf{x}^\top\bf{y} = \bf{y}^\top \bf{x}$. 

The transpose of a matrix product has the simple form $(\bf{AB})^\top = \bf{B}^\top\bf{A}^\top$.

This allows us to demonstrate the commutativity of the dot product by exploiting the fact that the value of such a product is a scalar and is therefore equal to its own transpose, i.e. $\bf{x}^\top\bf{y} = (\bf{x}^\top\bf{y})^\top = \bf{y}^\top \bf{x} $.
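
These properties can be spot-checked numerically. The sketch below uses small integer matrices chosen arbitrarily for illustration, with helper functions written in plain Python. A check on one example does not prove the identities, but it does show a concrete pair for which $\boldsymbol{AB} \neq \boldsymbol{BA}$.

```python
def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

distributive = matmul(A, add(B, C)) == add(matmul(A, B), matmul(A, C))
associative = matmul(A, matmul(B, C)) == matmul(matmul(A, B), C)
commutes = matmul(A, B) == matmul(B, A)
transpose_rule = transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
print(distributive, associative, commutes, transpose_rule)  # True True False True
```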

### Systems of Linear Equations

We now know enough linear algebra to write down a system of linear equations, $\bf{Ax} = \bf{b}$, where $\bf{A} \in \mathbb{R}^{m \times n} $ is a known matrix, $\bf{b} \in \mathbb{R}^m$ is a known vector, and $\bf{x} \in \mathbb{R}^n$ is a vector of unknown variables we'd like to solve for.

We can rewrite the above equation as:

$\quad \quad \quad \quad \quad A_{1, 1}x_{1} + A_{1, 2}x_{2} + \dots + A_{1, n}x_{n} = b_{1} \\
\quad \quad \quad \quad \quad A_{2, 1}x_{1} + A_{2, 2}x_{2} + \dots + A_{2, n}x_{n} = b_{2} \\
\quad \quad \quad \quad \quad \dots \\
\quad \quad \quad \quad \quad A_{m, 1}x_{1} + A_{m, 2}x_{2} + \dots + A_{m, n}x_{n} = b_{m}$
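
As a small worked instance of the row-by-row form, the sketch below evaluates the left-hand side of each equation for a $2 \times 2$ system. The numbers in `A`, `x`, and `b` are made up for illustration.

```python
A = [[2, 1], [1, 3]]   # known coefficients (made-up values)
x = [1, 2]             # candidate solution
b = [4, 7]             # known right-hand side

# Row i of the system reads A[i][0]*x[0] + A[i][1]*x[1] = b[i].
left_hand_sides = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
print(left_hand_sides)       # [4, 7]
print(left_hand_sides == b)  # True, so x solves the system
```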

## 2.3 Identity and Inverse Matrices

Linear algebra offers a powerful tool called **matrix inversion** that enables us to analytically solve systems of linear equations for many values of $\bf{A}$.

To describe matrix inversion, we first need to define the concept of an **identity matrix**. An identity matrix, characterized by zero entries except those occupying the main diagonal (these are all 1's), has the property that it does not change any vector when we multiply that vector by the matrix. We denote the identity matrix that preserves $n$-dimensional vectors as $\boldsymbol{I}_n$, where $\boldsymbol{I}_n \in \mathbb{R}^{n \times n}$ and $\forall \boldsymbol{x} \in \mathbb{R}^{n}, \boldsymbol{I}_n\boldsymbol{x} = \boldsymbol{x}$.

The **matrix inverse** of $\boldsymbol{A}$ is denoted $\boldsymbol{A}^{-1}$, and it is defined as the matrix such that $\boldsymbol{A}^{-1}\boldsymbol{A} = \boldsymbol{I}_n$.

We can now solve the equation $\bf{Ax} = \bf{b}$ with the following steps:

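$\quad \quad \quad \quad \quad \boldsymbol{A}^{-1}\boldsymbol{A}\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b} \\
\quad \quad \quad \quad \quad \boldsymbol{I}_n\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b} \\
\quad \quad \quad \quad \quad \boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$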

Of course this process relies on it being possible to find $\bf{A}^{-1}$. The conditions for the existence of $\bf{A}^{-1}$ are discussed in the following section. 

When $\boldsymbol{A}^{-1}$ exists, several different algorithms can find it in closed form. In theory, the same inverse matrix can be used to solve the equation many times for different values of $\boldsymbol{b}$. While useful as a theoretical tool, $\boldsymbol{A}^{-1}$ should not actually be used in practice for most software applications. Because $\boldsymbol{A}^{-1}$ can be represented with only limited precision on a digital computer, algorithms that make use of the value of $\boldsymbol{b}$ can usually obtain more accurate estimates of $\boldsymbol{x}$.

In [12]:
i = tf.eye(3, name="i")

model = tf.global_variables_initializer()  # initialize variables

with tf.Session() as session:  # create a session
    session.run(model)  # run model (tf.global_variables_initializer())
    print("A 3x3 identity matrix, i:\n", session.run(i), "\n")

A 3x3 identity matrix, i:
 [[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]] 
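
The solution steps $\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$ can be traced on a tiny example. The sketch below is for illustration only: it uses the well-known closed-form inverse of a $2 \times 2$ matrix and exact fractions to sidestep rounding, and the matrix and vector values are made up. As noted above, practical software usually solves the system without forming the inverse.

```python
from fractions import Fraction


def inverse_2x2(a):
    """Closed-form inverse of a 2x2 matrix; fails if the matrix is singular."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular, no inverse exists")
    return [[Fraction(s, det), Fraction(-q, det)],
            [Fraction(-r, det), Fraction(p, det)]]


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


A = [[2, 1], [1, 3]]
b = [4, 7]
A_inv = inverse_2x2(A)
print(matmul(A_inv, A) == [[1, 0], [0, 1]])  # True: A_inv times A is the identity
x = matvec(A_inv, b)
print(x == [1, 2])  # True: x solves A x = b
```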



## 2.4 Linear Dependence and Span

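For $\boldsymbol{A}^{-1}$ to exist, the equation $\boldsymbol{Ax} = \boldsymbol{b}$ must have exactly one solution for every value of $\boldsymbol{b}$. It is also possible for the system of equations to have **no solutions** or **infinitely many solutions** for some values of $\boldsymbol{b}$. It is not possible, however, to have more than one but fewer than infinitely many solutions for a particular $\boldsymbol{b}$; if both $\boldsymbol{x}$ and $\boldsymbol{y}$ are solutions, then

$\quad \quad \quad \quad \quad \boldsymbol{z} = \alpha\boldsymbol{x} + (1-\alpha)\boldsymbol{y}$

is also a solution for any real $\alpha$, because $\boldsymbol{Az} = \alpha\boldsymbol{Ax} + (1-\alpha)\boldsymbol{Ay} = \alpha\boldsymbol{b} + (1-\alpha)\boldsymbol{b} = \boldsymbol{b}$.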
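
The point of this construction is that every such $\boldsymbol{z}$ solves the same system, so two distinct solutions imply infinitely many. A quick numerical check, using a made-up singular matrix with two known solutions and exact fractions for $\alpha$:

```python
from fractions import Fraction


def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


A = [[1, 2], [2, 4]]  # singular: the second row is twice the first
b = [3, 6]
x = [1, 1]            # one solution of A x = b
y = [3, 0]            # a second, different solution

alphas = [Fraction(-2), Fraction(1, 3), Fraction(5)]
combos = [[a * xi + (1 - a) * yi for xi, yi in zip(x, y)] for a in alphas]
print(all(matvec(A, z) == b for z in combos))  # True
```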

To analyze how many solutions the equation has, think of the columns of $\boldsymbol{A}$ as specifying the different directions we can travel in from the **origin** (the point specified by the vector of all zeros), then determine how many ways there are of reaching $\boldsymbol{b}$. In this view, each element of $\boldsymbol{x}$ specifies how far we should travel in each of these directions, with $x_i$ specifying how far to move in the direction of column $i$:

$\quad \quad \quad \quad \quad \boldsymbol{A}\boldsymbol{x} = \sum_i x_i \boldsymbol{A}_{:, i}$

In general, this kind of operation is called a **linear combination**. Formally, a linear combination of some set of vectors is obtained by multiplying each vector $\boldsymbol{v}^{(i)}$ by a corresponding scalar coefficient and adding the results:

$\quad \quad \quad \quad \quad \sum_i c_i\boldsymbol{v}^{(i)}$

The **span** of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
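
The column view of a matrix-vector product can be checked against the usual row view. In the sketch below, the function names and the values of `A` and `x` are chosen for illustration; both functions should return the same vector.

```python
def matvec(a, v):
    """Row view: entry i is the dot product of row i with v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def column_combination(a, v):
    """Column view: sum over i of v[i] * (column i of a)."""
    result = [0] * len(a)
    for coeff, column in zip(v, zip(*a)):
        result = [r + coeff * c for r, c in zip(result, column)]
    return result


A = [[1, 2, 3], [4, 5, 6]]  # 2 x 3, so the columns live in R^2
x = [2, -1, 1]
print(matvec(A, x))              # [3, 9]
print(column_combination(A, x))  # [3, 9]
```

Here the result `[3, 9]` is a point in the span of the three columns of `A`, reached with coefficients 2, -1, and 1.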

Determining whether $\boldsymbol{Ax} = \boldsymbol{b}$ has a solution thus amounts to testing whether $\boldsymbol{b}$ is in the span of the columns of $\boldsymbol{A}$. This particular span is known as the **column space**, or the **range** of $\boldsymbol{A}$.

In order for the system $\boldsymbol{Ax}=\boldsymbol{b}$ to have a solution for all values of $\boldsymbol{b} \in \mathbb{R}^m$, we therefore require that the column space of $\boldsymbol{A}$ be all of $\mathbb{R}^m$. If any point in $\mathbb{R}^m$ is excluded from the column space, that point is a potential value of $\boldsymbol{b}$ that has no solution. The requirement that the column space of $\boldsymbol{A}$ be all of $\mathbb{R}^m$ implies immediately that $\boldsymbol{A}$ must have at least $m$ columns, that is, $n \geq m$. Otherwise the dimensionality of the column space would be less than $m$. For example, consider a $3 \times 2$ matrix. The target $\boldsymbol{b}$ is 3-D, but $\boldsymbol{x}$ is only 2-D, so modifying the value of $\boldsymbol{x}$ at best enables us to trace out a 2-D plane within $\mathbb{R}^3$. The equation has a solution if and only if $\boldsymbol{b}$ lies on that plane.

Having $n \geq m$ is only a necessary condition for every point to have a solution. It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a $2 \times 2$ matrix where both of the columns are identical. This has the same column space as a $2 \times 1$ matrix containing only one copy of the replicated column. In other words, the column space is still just a line and fails to encompass all of $\mathbb{R}^2$, even though there are two columns.
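
The redundant-column case is easy to see numerically. The sketch below uses a made-up $2 \times 2$ matrix whose columns are both $(1, 2)$ and samples a grid of integer inputs; every output lands on the line through the origin in the direction $(1, 2)$.

```python
def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


A = [[1, 1], [2, 2]]  # both columns equal (1, 2)

# Every product A x equals (x1 + x2) * (1, 2), so it lies on a single line.
outputs = [matvec(A, [x1, x2]) for x1 in range(-2, 3) for x2 in range(-2, 3)]
on_line = all(second == 2 * first for first, second in outputs)
print(on_line)            # True
print([1, 0] in outputs)  # False: (1, 0) is off the line
```

The sample is finite, so it illustrates rather than proves the claim, but it is consistent with $\boldsymbol{Ax} = (1, 0)^\top$ having no solution for this matrix.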

Formally, this kind of redundancy is known as **linear dependence**. A set of vectors is **linearly independent** if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of $\mathbb{R}^m$, the matrix must contain at least one set of $m$ linearly independent columns. This condition is both necessary and sufficient for $\boldsymbol{Ax} = \boldsymbol{b}$ to have a solution for every value of $\boldsymbol{b}$. Note that this requirement is for a set to have exactly $m$ linearly independent columns, not at least $m$. No set of $m$-dimensional vectors can have more than $m$ mutually linearly independent columns, but a matrix with more than $m$ columns may have more than one such set.

For the matrix to have an inverse, we additionally need to ensure that $\boldsymbol{Ax} = \boldsymbol{b}$ has *at most* one solution for each value of $\boldsymbol{b}$. To do so, we need to make certain that the matrix has at most $m$ columns. Otherwise, there is more than one way of parametrizing each solution.

Together, this means that the matrix must be **square**, that is, we require that $m = n$ and that all the columns be linearly independent. A square matrix with linearly dependent columns is known as **singular**.

If $\boldsymbol{A}$ is not square, or is square but singular, solving the equation is still possible, but we cannot use the method of matrix inversion to find the solution.

So far, we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

$ \quad \quad \quad \quad \quad \boldsymbol{AA}^{-1} = \boldsymbol{I}$

For square matrices, the left inverse and right inverse are equal.

## 2.5 Norms

Sometimes we need to measure the size of a vector. In machine learning, we measure the size of vectors using a function called a **norm**. Formally, the $L^p$ norm is given by

$\quad \quad \quad \quad \quad \| \boldsymbol{x} \|_p = \left( \sum_i |x_i|^{p} \right)^{1/p}$

for $p \in \mathbb{R}, p \geq 1$.

Norms, including the $L^p$ norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector $\boldsymbol{x}$ measures the distance from the origin to the point $\boldsymbol{x}$. More rigorously, a norm is any function $f$ that satisfies the following properties:

- $f(\boldsymbol{x}) = 0 \implies \boldsymbol{x} = \boldsymbol{0}$
- $f(\boldsymbol{x} + \boldsymbol{y}) \leq f(\boldsymbol{x}) + f(\boldsymbol{y})$ (the **triangle inequality**)
- $\forall \alpha \in \mathbb{R}, \: f(\alpha \boldsymbol{x}) = \lvert \alpha \rvert \: f(\boldsymbol{x})$
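
The definition and the three properties can be spot-checked numerically. The sketch below is a plain-Python illustration with made-up vectors; `lp_norm` is a helper written for this example, and a check on a few values does not replace a proof.

```python
import math


def lp_norm(x, p):
    """L^p norm of a sequence of numbers, defined here for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [3.0, -4.0]
y = [1.0, 2.0]
alpha = -2.5
print(lp_norm(x, 2))  # 5.0

results = []
for p in (1, 2, 3):
    x_plus_y = [xi + yi for xi, yi in zip(x, y)]
    scaled = [alpha * xi for xi in x]
    positive = lp_norm(x, p) > 0  # x is nonzero, so its norm must be nonzero
    triangle = lp_norm(x_plus_y, p) <= lp_norm(x, p) + lp_norm(y, p) + 1e-12
    homogeneous = math.isclose(lp_norm(scaled, p), abs(alpha) * lp_norm(x, p))
    results.append((positive, triangle, homogeneous))
print(results)  # every entry is (True, True, True)
```

The small tolerance in the triangle check and the use of `math.isclose` are there only to absorb floating-point rounding.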

The $L^2$ norm, with $p=2$, is known as the **Euclidean norm**, which is simply the Euclidean distance from the origin to the point specified by $\boldsymbol{x}$. The $L^2$ norm is used so frequently in machine learning that it is often denoted simply as $\| \boldsymbol{x} \|$, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared $L^2$ norm, which can be calculated simply as $\boldsymbol{x}^\top \boldsymbol{x}$.

The squared $L^2$ norm is more convenient to work with mathematically and computationally than the $L^2$ norm itself. For example, the derivative of the squared $L^2$ norm with respect to each element of $\boldsymbol{x}$ depends only on the corresponding element of $\boldsymbol{x}$, while all the derivatives of the $L^2$ norm depend on the entire vector. In many contexts, the squared $L^2$ norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but that retains mathematical simplicity: the $L^1$ norm. The $L^1$ norm may be simplified to:

$\quad \quad \quad \quad \quad \| \boldsymbol{x} \|_{1} = \sum_i \lvert x_i \rvert$.

The $L^1$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of $\boldsymbol{x}$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$.

We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "$L^0$ norm", but this is incorrect terminology. The number of nonzero entries in a vector is not a norm, because scaling the vector by $\alpha$ does not change the number of nonzero entries. The $L^1$ norm is often used as a substitute for the number of nonzero entries.
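
A two-line experiment shows why the count of nonzero entries fails the scaling property while the $L^1$ norm satisfies it. The helper names and the vector below are made up for illustration.

```python
def count_nonzero(x):
    """Number of nonzero entries (the so-called "L0 norm")."""
    return sum(1 for xi in x if xi != 0)


def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(xi) for xi in x)


x = [0, 3, 0, -4]
scaled = [2 * xi for xi in x]
print(count_nonzero(x), count_nonzero(scaled))  # 2 2
print(l1_norm(x), l1_norm(scaled))              # 7 14
```

Doubling the vector doubles the $L^1$ norm, as a norm must, but leaves the nonzero count unchanged.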

One other norm that commonly arises in machine learning is the $L^\infty$ norm, also known as the **max norm**. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

$\quad \quad \quad \quad \quad \| \boldsymbol{x} \|_{\infty} = \max_i \lvert x_i \rvert$
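
One way to see where the name $L^\infty$ comes from is to compute the $L^p$ norm for increasing $p$ and watch it approach the largest absolute entry. The sketch below uses a made-up vector and helper functions written for this example.

```python
def linf_norm(x):
    """Max norm: the largest absolute value among the entries."""
    return max(abs(xi) for xi in x)


def lp_norm(x, p):
    """L^p norm for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [1.0, -7.0, 3.0]
print(linf_norm(x))  # 7.0
for p in (1, 2, 10, 100):
    print(p, lp_norm(x, p))  # the values shrink toward 7.0 as p grows
```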

Sometimes we also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure **Frobenius norm**:

$\quad \quad \quad \quad \quad \| \boldsymbol{A} \|_F = \sqrt{\sum_{i, j} A^2_{i, j}}$

which is analogous to the $L^2$ norm of a vector.
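
The analogy can be made literal: flattening the matrix into one long vector and taking its $L^2$ norm gives the same number. A minimal sketch with a made-up matrix:

```python
import math


def frobenius_norm(a):
    """Square root of the sum of squared entries of a matrix."""
    return math.sqrt(sum(entry ** 2 for row in a for entry in row))


def l2_norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(xi ** 2 for xi in x))


A = [[1, 2], [2, 4]]
flattened = [entry for row in A for entry in row]
print(frobenius_norm(A))                        # 5.0
print(frobenius_norm(A) == l2_norm(flattened))  # True
```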

The dot product of two vectors can be rewritten in terms of norms. Specifically, 

$\quad \quad \quad \quad \quad \boldsymbol{x}^\top \boldsymbol{y} = \| \boldsymbol{x} \|_2 \| \boldsymbol{y} \|_2 \cos \theta$,

where $\theta$ is the angle between $\boldsymbol{x}$ and $\boldsymbol{y}$.
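
Rearranging this identity gives a way to compute the angle between two vectors. The sketch below assumes both vectors are nonzero (otherwise the division is undefined); the function names and example vectors are made up for illustration.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def l2_norm(x):
    """Euclidean norm, computed as the square root of x . x."""
    return math.sqrt(dot(x, x))


def angle_between(x, y):
    """Angle in radians between two nonzero vectors."""
    cosine = dot(x, y) / (l2_norm(x) * l2_norm(y))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding just outside [-1, 1]
    return math.acos(cosine)


x = [1.0, 0.0]
y = [1.0, 1.0]
theta = angle_between(x, y)
print(math.degrees(theta))  # approximately 45
```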

## 2.6 Special Kinds of Matrices and Vectors

Some special kinds of matrices and vectors are particularly useful.

**Diagonal** matrices consist mostly of zeros and have nonzero entries only along the main diagonal. Formally, a matrix $\boldsymbol{D}$ is diagonal if and only if 