# Linear Algebra Review

In [1]:
import numpy as np

## Linear Algebra and Machine Learning
* Ranking web pages in order of importance
    * Solved as the problem of finding the eigenvector of the page score matrix
* Dimensionality reduction - Principal Component Analysis
* Movie recommendation
    * Use singular value decomposition (SVD) to break down user-movie into user-feature and movie-feature matrices, keeping only the top $k$-ranks to identify the best matches
* Topic modeling
    * Extensive use of SVD and matrix factorization can be found in Natural Language Processing, specifically in topic modeling and semantic analysis
* All of the compute-intensive operations in deep learning are matrix manipulations
    * Forward inference and backward propagation of gradients
    * Convolution and de-convolution can be understood as matrix ops
    * The recurrence function in an RNN is a non-linearity applied element-by-element to a matrix-vector op

## Vectors

A vector can be represented by an array of real numbers

$$\boldsymbol{x} = [x_1, x_2, \ldots, x_n]$$

Geometrically, a vector specifies the coordinates of the tip of the vector if the tail were placed at the origin

The norm of a vector $\boldsymbol{x}$ is defined by

$$||\boldsymbol{x}|| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

In [2]:
x = np.array([1,2,3,4])
print(np.sqrt(np.sum(x**2)))
print(np.linalg.norm(x))

5.47722557505
5.47722557505


If we have two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ of the same length $(n)$, then the _dot product_ is given by

$$\boldsymbol{x} \cdot \boldsymbol{y} = x_1y_1 + x_2y_2 + \cdots + x_ny_n$$

If $\boldsymbol{x} \cdot \boldsymbol{y} = 0$ then $\boldsymbol{x}$ and $\boldsymbol{y}$ are *orthogonal* (this aligns with the intuitive notion of perpendicular)

In [3]:
w = np.array([1, 2])
v = np.array([-2, 1])
np.dot(w,v)

0

The norm squared of a vector is just the vector dot product with itself
$$
||x||^2 = x \cdot x
$$

In [4]:
print(np.linalg.norm(x)**2)
print(np.dot(x,x))

30.0
30


If $\boldsymbol{x}$ is centered at zero, $\frac{||\boldsymbol{x}||^2}{n}$ is the _variance_ of $\boldsymbol{x}$

In [5]:
x_centered = x - np.mean(x)
print(np.linalg.norm(x_centered)**2/len(x_centered))
print(np.var(x_centered))

1.25
1.25


The distance between two vectors is the norm of the difference.
$$
d(x,y) = ||x-y||
$$

In [6]:
y = np.array([4,5,6,7])
print(np.linalg.norm(x-y))

6.0


_Cosine similarity_ is the cosine of the angle between two vectors, given by

$$\cos(\theta) = \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{||\boldsymbol{x}|| \, ||\boldsymbol{y}||}$$

In [7]:
x = np.array([1,2,3,4])
y = np.array([5,6,7,8])
np.dot(x,y)/(np.linalg.norm(x)*np.linalg.norm(y))

0.96886393162696616

If both $\boldsymbol{x}$ and $\boldsymbol{y}$ are zero-centered, this calculation is the _correlation_ between $\boldsymbol{x}$ and $\boldsymbol{y}$

In [8]:
x_centered = x - np.mean(x)
print(x_centered)
y_centered = y - np.mean(y)
print(y_centered)
np.dot(x_centered,y_centered)/(np.linalg.norm(x_centered)*np.linalg.norm(y_centered))

[-1.5 -0.5  0.5  1.5]
[-1.5 -0.5  0.5  1.5]


0.99999999999999978

### Linear Combinations of Vectors

A _linear combination_ of a collection of vectors $(\boldsymbol{x}_1,
                                                    \boldsymbol{x}_2, \ldots,
                                                    \boldsymbol{x}_m)$ 
is a vector of the form

$$a_1 \cdot \boldsymbol{x}_1 + a_2 \cdot \boldsymbol{x}_2 + 
\cdots + a_m \cdot \boldsymbol{x}_m$$
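
As a small sketch of this definition, the vectors and coefficients below are made-up values chosen only for illustration. The second form, stacking the vectors as columns and multiplying by the coefficient vector, gives the same result.

```python
import numpy as np

# Three example vectors of length 2 (illustrative values)
x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])
x3 = np.array([1.0, 1.0])

# Coefficients a_1, a_2, a_3 (illustrative values)
a = np.array([2.0, -1.0, 0.5])

# Linear combination written out term by term
combo = a[0] * x1 + a[1] * x2 + a[2] * x3

# Same thing: stack the vectors as columns, then multiply by the coefficients
V = np.column_stack([x1, x2, x3])   # shape (2, 3)
combo_matrix = V.dot(a)

print(combo)          # [ 2.5 -0.5]
print(combo_matrix)   # [ 2.5 -0.5]
```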
                                                

## Matrices

An $n \times p$ matrix is an array of numbers with $n$ rows and $p$ columns:

$$
X =
  \begin{bmatrix}
    x_{11} & x_{12} & \cdots & x_{1p} \\
    x_{21} & x_{22} & \cdots & x_{2p} \\
    \vdots & \vdots & \ddots & \vdots \\
    x_{n1} & x_{n2} & \cdots & x_{np} 
  \end{bmatrix}
$$

$n$ = the number of subjects  
$p$ = the number of features

### Matrix multiplication

In order to multiply two matrices, they must be _conformable_: the number of columns of the first matrix must equal the number of rows of the second matrix.

Let $X$ be a matrix of dimension $n \times k$ and let $Y$ be a matrix of dimension $k \times p$, then the product $XY$ will be a matrix of dimension $n \times p$ whose $(i,j)^{th}$ element is given by the dot product of the $i^{th}$ row of $X$ and the $j^{th}$ column of $Y$

$$\sum_{s=1}^k x_{is}y_{sj} = x_{i1}y_{1j} + \cdots + x_{ik}y_{kj}$$
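
The element-wise definition can be checked against numpy's built-in product. This is a minimal sketch with made-up matrices; the explicit double loop is there only to mirror the formula and is not how you would multiply matrices in practice.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3): n = 2, k = 3
Y = np.array([[1, 0],
              [0, 1],
              [2, 2]])           # shape (3, 2): k = 3, p = 2

n, k = X.shape
p = Y.shape[1]

# Element (i, j) is the dot product of row i of X and column j of Y
XY_manual = np.zeros((n, p))
for i in range(n):
    for j in range(p):
        XY_manual[i, j] = np.dot(X[i, :], Y[:, j])

# numpy's matrix product gives the same n x p result
XY = X.dot(Y)
print(XY.shape)   # (2, 2)
print(XY)
```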



### Note: 

$$XY \neq YX$$

If $X$ and $Y$ are square matrices of the same dimension, then both products $XY$ and $YX$ exist; however, there is no guarantee the two products will be the same
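
A quick illustration with two made-up $2 \times 2$ matrices: multiplying by $Y$ on the right swaps the columns of $X$, while multiplying by $Y$ on the left swaps its rows, so the two products differ.

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4]])
Y = np.array([[0, 1],
              [1, 0]])   # swaps columns (on the right) or rows (on the left)

XY = X.dot(Y)   # columns of X swapped
YX = Y.dot(X)   # rows of X swapped

print(XY)
print(YX)
print(np.array_equal(XY, YX))   # False
```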


### Additional Properties of Matrices
1. If $X$ and $Y$ are both $n \times p$ matrices,
then $$X+Y = Y+X$$
2. If $X$, $Y$, and $Z$ are all $n \times p$ matrices,
then $$X+(Y+Z) = (X+Y)+Z$$
3. If $X$, $Y$, and $Z$ are all conformable,
then $$X(YZ) = (XY)Z$$
4. If $X$ is of dimension $n \times k$ and $Y$ and $Z$ are of dimension $k \times p$, then $$X(Y+Z) = XY + XZ$$
5. If $X$ is of dimension $p \times n$ and $Y$ and $Z$ are of dimension $k \times p$, then $$(Y+Z)X = YX + ZX$$
6. If $a$ and $b$ are real numbers, and $X$ is an $n \times p$ matrix,
then $$(a+b)X = aX+bX$$
7. If $a$ is a real number, and $X$ and $Y$ are both $n \times p$ matrices,
then $$a(X+Y) = aX+aY$$
8. If $a$ is a real number, and $X$ and $Y$ are conformable, then
$$X(aY) = a(XY)$$
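
A few of these properties can be spot-checked numerically. The sketch below uses random matrices with arbitrary shapes (the extra matrix `W` and the scalar `a` are illustrative names, not from the text) and compares with `np.allclose` because floating point results are only equal up to rounding. A numerical check is a sanity test, not a proof.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
X = rng.rand(2, 3)               # n x k
Y = rng.rand(3, 4)               # k x p
Z = rng.rand(3, 4)               # k x p
W = rng.rand(4, 2)               # conformable with Y on the right
a = 2.5

# Property 3: X(YW) = (XY)W
associative = np.allclose(X.dot(Y.dot(W)), X.dot(Y).dot(W))

# Property 4: X(Y + Z) = XY + XZ
distributive = np.allclose(X.dot(Y + Z), X.dot(Y) + X.dot(Z))

# Property 8: X(aY) = a(XY)
scalar = np.allclose(X.dot(a * Y), a * X.dot(Y))

print(associative)
print(distributive)
print(scalar)
```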

### Matrix Transpose

The transpose of an $n \times p$ matrix is a $p \times n$ matrix with rows and columns interchanged

$$
X^T =
  \begin{bmatrix}
    x_{11} & x_{21} & \cdots & x_{n1} \\
    x_{12} & x_{22} & \cdots & x_{n2} \\
    \vdots & \vdots & \ddots & \vdots \\
    x_{1p} & x_{2p} & \cdots & x_{np} 
  \end{bmatrix}
$$



### Properties of Transpose
1. Let $X$ be an $n \times p$ matrix and $a$ a real number, then 
$$(aX)^T = aX^T$$
2. Let $X$ and $Y$ be $n \times p$ matrices, then
$$(X \pm Y)^T = X^T \pm Y^T$$
3. Let $X$ be an $n \times k$ matrix and $Y$ be a $k \times p$ matrix, then
$$(XY)^T = Y^TX^T$$
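
In numpy the transpose is available as the `.T` attribute. The following sketch, using made-up matrices, checks the shape change and the product rule; note the reversed order on the right-hand side.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3
Y = np.array([[1, 0],
              [0, 1],
              [2, 2]])           # 3 x 2

print(X.T.shape)                 # (3, 2): rows and columns interchanged

# Element (i, j) of X^T is element (j, i) of X
print(X.T[2, 0] == X[0, 2])      # True

# (XY)^T = Y^T X^T
lhs = X.dot(Y).T
rhs = Y.T.dot(X.T)
print(np.array_equal(lhs, rhs))  # True
```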

### Vector in Matrix Form
A column vector is a matrix with $n$ rows and 1 column. To distinguish it from a general matrix $X$, it can be denoted by a bold lower case $\boldsymbol{x}$

$$
\boldsymbol{x} =
  \begin{bmatrix}
    x_{1}\\
    x_{2}\\
    \vdots\\
    x_{n}
  \end{bmatrix}
$$

In numpy, when we enter a vector, it will not normally have the second dimension, so we can reshape it

In [11]:
x = np.array([1,2,3,4])
print(x)
print(x.shape)
x = x.reshape(4,1)
print(x)
print(x.shape)

[1 2 3 4]
(4,)
[[1]
 [2]
 [3]
 [4]]
(4, 1)


and a row vector is generally written as the transpose

$$\boldsymbol{x}^T = [x_1, x_2, \ldots, x_n]$$

In [12]:
x_T = x.transpose()
print(x_T)
print(x_T.shape)

[[1 2 3 4]]
(1, 4)


If we have two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ of the same length $(n)$, then the _dot product_ is given by matrix multiplication

$$\boldsymbol{x}^T \boldsymbol{y} =   
    \begin{bmatrix} x_1& x_2 & \ldots & x_n \end{bmatrix}
    \begin{bmatrix}
    y_{1}\\
    y_{2}\\
    \vdots\\
    y_{n}
  \end{bmatrix}  =
  x_1y_1 + x_2y_2 + \cdots + x_ny_n$$
  
## In-class exercise - Differences between numpy vectors and arrays
The distinction between a vector (aka oriented vector) and an array (aka unoriented vector) can be important.  Numpy will let you get away with using arrays and will try to determine what you intended.  Most of the time it will do what you intended, sometimes it won't.  Tensorflow will not let you use arrays, but will demand that you specify two (or more) indices for vectors.  The following sequence of exercises will help you understand why these distinctions are important.  
1.  Form two arrays or unoriented vectors.  These are numpy arrays whose shape is something like (n,).  Take their dot product.  
2.  Form two vectors or oriented vectors.  These will have shapes like (n,1) or (1,n).  Take their dot product.  If you used the same numeric values as in 1, the answers will be the same.  
3.  Reverse the order of the multiplication in 2 (reverse the order of the arguments in the dot() function).  Are the two vectors still conformable?  What shape do you expect the answer to have?  What shape does it have?  
4.  Generate the same result as in 3, using unoriented vectors (arrays).

## Inverse of a Matrix

The inverse of a square $n \times n$ matrix $X$ is an $n \times n$ matrix $X^{-1}$ such that 

$$X^{-1}X = XX^{-1} = I$$

Where $I$ is the identity matrix, an $n \times n$ diagonal matrix with 1's along the diagonal. 

If such a matrix exists, then $X$ is said to be _invertible_ or _nonsingular_, otherwise $X$ is said to be _noninvertible_ or _singular_.

In [13]:
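# A random integer matrix is usually invertible; np.linalg.inv raises
# LinAlgError if A happens to be singular
A = np.random.randint(0, 10, size=(3, 3))
A_inv = np.linalg.inv(A)
print(A)
print(A_inv)
print(A.dot(A_inv))  # identity, up to floating point rounding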

[[7 1 6]
 [7 3 3]
 [1 5 0]]
[[-0.16666667  0.33333333 -0.16666667]
 [ 0.03333333 -0.06666667  0.23333333]
 [ 0.35555556 -0.37777778  0.15555556]]
[[  1.00000000e+00   0.00000000e+00  -5.55111512e-17]
 [  5.55111512e-17   1.00000000e+00  -1.94289029e-16]
 [  0.00000000e+00   5.55111512e-17   1.00000000e+00]]


### Properties of Inverse
1. If $X$ is invertible, then $X^{-1}$ is invertible and
$$(X^{-1})^{-1} = X$$
2. If $X$ and $Y$ are both $n \times n$ invertible matrices, then $XY$ is invertible and
$$(XY)^{-1} = Y^{-1}X^{-1}$$
3. If $X$ is invertible, then $X^T$ is invertible and
$$(X^T)^{-1} = (X^{-1})^T$$
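
These three properties can be spot-checked with `np.linalg.inv`. The matrices below are made-up invertible examples (determinants 1 and -2), and `np.allclose` is used because the computed inverses carry rounding error.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 1.0]])   # determinant 1, so invertible
Y = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # determinant -2, so invertible

X_inv = np.linalg.inv(X)
Y_inv = np.linalg.inv(Y)

# Property 1: (X^{-1})^{-1} = X
prop1 = np.allclose(np.linalg.inv(X_inv), X)

# Property 2: (XY)^{-1} = Y^{-1} X^{-1}  (note the reversed order)
prop2 = np.allclose(np.linalg.inv(X.dot(Y)), Y_inv.dot(X_inv))

# Property 3: (X^T)^{-1} = (X^{-1})^T
prop3 = np.allclose(np.linalg.inv(X.T), X_inv.T)

print(prop1)
print(prop2)
print(prop3)
```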

### Orthogonal Matrices

Let $X$ be an $n \times n$ matrix such that $X^TX = I$, then $X$ is said to be orthogonal, which implies that $X^T=X^{-1}$

In addition, two $n \times 1$ vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ are said to be orthogonal if

$$\boldsymbol{x}^T \boldsymbol{y} = 0 $$

An orthogonal matrix does not change the length of a vector that it multiplies.  It only changes the orientation of the vector: it rotates (or reflects) the vector.  The proof is simple:  http://www.cse.psu.edu/~b58/cse456/lecture3.pdf 
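
The length-preserving property follows in one line from the definition: $||X\boldsymbol{v}||^2 = (X\boldsymbol{v})^T(X\boldsymbol{v}) = \boldsymbol{v}^TX^TX\boldsymbol{v} = \boldsymbol{v}^T\boldsymbol{v} = ||\boldsymbol{v}||^2$. As a numerical sketch, the code below uses a 2-D rotation matrix as the example orthogonal matrix; the angle and the vector are arbitrary illustrative choices.

```python
import numpy as np

theta = np.pi / 3   # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Q is orthogonal: Q^T Q = I (up to rounding)
print(np.allclose(Q.T.dot(Q), np.eye(2)))   # True

v = np.array([3.0, 4.0])
Qv = Q.dot(v)

# The direction changes but the length does not
norm_before = np.linalg.norm(v)
norm_after = np.linalg.norm(Qv)
print(np.isclose(norm_before, norm_after))  # True
```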

### Exercise: Mental floss 1:  
Prove that the rows (or columns) of an orthogonal matrix are orthogonal to one another. 

### Symmetric and Anti-Symmetric Matrices

A matrix is called symmetric if $X^T = X$ and it's called anti-symmetric if $X^T = -X$.  

### Exercise: Mental floss 2
Prove that every square matrix can be decomposed as the sum of a symmetric and an anti-symmetric matrix.  Ask for a hint if you need one.

### Positive (and negative) definite (and semi-definite) matrices (and quadratic forms).
These concepts and properties are important in optimization problems.  You'll learn that much of machine learning is about optimization.  With deep networks the optimization problems become wildly more intense and reinforcement learning dials the problems up several notches from there.  

A quadratic form is an expression of the form $v^TXv$ where $v$ is a column vector and $X$ is a square matrix.  A matrix $X$ is positive definite if $v^TXv\, >\,0 \,\forall v\ne0$.  $X$ is negative definite if $-X$ is positive definite.  $X$ is positive semi-definite if $v^TXv\, \geq\,0 \,\forall v$.  Sometimes you will see the notation $X\, >\, 0$ with $X$ a matrix.  That is a common shorthand for saying that $X$ is positive definite.  
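
To make the definition concrete, the sketch below evaluates the quadratic form directly. The helper `quadratic_form` and both matrices are hypothetical examples. For the first matrix, $v^TXv = v_1^2 + v_2^2 + (v_1 - v_2)^2$, which is positive for every nonzero $v$; sampling random vectors only illustrates this and does not prove it. For the second matrix a single vector with a negative value is enough to show it is not positive definite.

```python
import numpy as np

def quadratic_form(X, v):
    """Evaluate v^T X v for a square matrix X and a 1-D array v."""
    return v.dot(X).dot(v)

# Example positive definite matrix (illustrative choice)
X_pd = np.array([[ 2.0, -1.0],
                 [-1.0,  2.0]])

rng = np.random.RandomState(0)
samples = rng.randn(1000, 2)                 # 1000 random test vectors
values = np.array([quadratic_form(X_pd, v) for v in samples])
print(values.min() > 0)                      # True for these samples

# Example matrix that is NOT positive definite
X_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])
v = np.array([1.0, -1.0])
print(quadratic_form(X_indef, v))            # -2.0
```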

## Matrix Equations

A system of equations of the form:
\begin{align*}
    a_{11}x_1 + \cdots + a_{1n}x_n &= b_1 \\
    \vdots \hspace{1in} \vdots \\
    a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m 
\end{align*}
can be written as a matrix equation:
$$
A\mathbf{x} = \mathbf{b}
$$
When $A$ is square ($m = n$) and invertible, the system has the unique solution
$$
\mathbf{x} = A^{-1}\mathbf{b}
$$
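
A minimal sketch of solving a small square system, with made-up values for `A` and `b`. In practice `np.linalg.solve` is usually preferred over forming the inverse explicitly, since it avoids computing the full inverse; both are shown here only to confirm they agree.

```python
import numpy as np

# The system  2*x1 + 1*x2 = 3
#             1*x1 + 3*x2 = 5
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x_solve = np.linalg.solve(A, b)      # solves A x = b directly
x_inv = np.linalg.inv(A).dot(b)      # x = A^{-1} b, as in the formula

print(x_solve)                       # [0.8 1.4]
print(np.allclose(x_solve, x_inv))   # True
print(np.allclose(A.dot(x_solve), b))  # True: the solution satisfies the system
```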

### Eigenvalue and Eigenvectors


Let $\bf A$ be a given nonzero square matrix of dimension $n \times n$. Consider the following equation:

$${\bf A}{\bf x} = \lambda {\bf x}$$

This equation is called an _eigenvalue equation_. Here $\bf A$ is a given square matrix, $\bf x$ is an unknown vector, and $\lambda$ is an unknown scalar. The problem of finding  $\lambda$'s and  nonzero ${\bf x}$'s that satisfy the eigenvalue equation is called the _eigenvalue problem_.

### Eigenvalues of symmetric matrices
Symmetric matrices have a relatively simple structure that can sometimes be useful for doing calculations and for visualizing the properties and effects of operations and equations.  The basic notion is that by changing your coordinate frame you can view a symmetric matrix as being diagonal.  Multiplication by a diagonal matrix is much easier to contemplate than multiplication by a more general matrix.  In mathematical terms, every square symmetric matrix $X$ can be decomposed as $X\, =\, U^T\Lambda U$, where $U$ and $\Lambda$ are square and of the same dimension as $X$, $U$ is orthogonal, and $\Lambda$ is diagonal with the eigenvalues of $X$ along the diagonal.  

As discussed above, an orthogonal matrix rotates vectors.  A diagonal matrix stretches or contracts the elements of a vector independently of one another. 
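
The decomposition can be reproduced with `np.linalg.eigh`, numpy's eigen-solver for symmetric matrices. It returns the eigenvectors as the columns of a matrix (called `V` below), so the $U$ in $X = U^T\Lambda U$ corresponds to `V.T`. The matrix is a made-up example.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])       # symmetric example matrix

vals, V = np.linalg.eigh(X)      # eigenvalues ascending, eigenvectors in columns
U = V.T                          # rows of U are the eigenvectors
Lam = np.diag(vals)              # diagonal matrix of eigenvalues

print(vals)                                   # [1. 3.]
print(np.allclose(U.T.dot(U), np.eye(2)))     # True: U is orthogonal
print(np.allclose(U.T.dot(Lam).dot(U), X))    # True: X = U^T Lam U
```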

### Exercise: Mental floss 3
1.  Suppose you're given a diagonal matrix $D$.  Write out the expression for the quadratic form $v^TDv$ in terms of the components of $v$ ($v_1,\,v_2,\,\ldots$) and the diagonal elements of $D$ ($d_{1,1},\, d_{2,2},\,\ldots$).  What is the shape of the curve satisfying $v^TDv = c$ for some positive number $c$? 
2.  (More difficult) Suppose you're given a symmetric matrix $X$.  What requirements on the eigenvalues of $X$ are necessary and sufficient for $X > 0$?

In [14]:
A = np.array([[1, 1], [1, 2]])
vals, vecs = np.linalg.eig(A)
print(vals)
print(vecs)
 

[ 0.38196601  2.61803399]
[[-0.85065081 -0.52573111]
 [ 0.52573111 -0.85065081]]


In [15]:
lam = vals[0]
vec = vecs[:,0]
print(A.dot(vec))
print(lam * vec)

[-0.3249197   0.20081142]
[-0.3249197   0.20081142]


### Basis

If a set of linearly independent vectors $B = \{{\bf b}_1, {\bf b}_2, \ldots, {\bf b}_n\}$ spans a vector space $V$, then the set $B$ is said to be a __basis set__, or __basis vectors__, or simply a __basis__, for the space. Thus, any vector $\bf v$ in $V$ can be expressed as a linear combination of the elements of $B$. 
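
Finding the coefficients of a vector in a given basis amounts to solving a linear system: with the basis vectors as the columns of a matrix (called `B_mat` here), we solve `B_mat c = v` for the coefficient vector `c`. The basis and the vector below are made-up examples in two dimensions.

```python
import numpy as np

# Two linearly independent vectors forming a basis of the plane
b1 = np.array([1.0,  1.0])
b2 = np.array([1.0, -1.0])
B_mat = np.column_stack([b1, b2])   # basis vectors as columns

v = np.array([3.0, 1.0])

# Coefficients c such that v = c[0]*b1 + c[1]*b2
c = np.linalg.solve(B_mat, v)
print(c)                                        # [2. 1.]
print(np.allclose(c[0] * b1 + c[1] * b2, v))    # True
```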

### Null Space (Kernel)

The **Null Space**, or more often **Kernel**, of an $m \times n$ matrix $A$ is the set of all $n$-dimensional vectors $\vec{x}$ such that:

$A\vec{x} = 0$
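
One common way to compute a basis for the null space numerically is through the SVD: the rows of $V^T$ that correspond to zero singular values span the null space. The sketch below assumes a made-up rank-1 matrix and an illustrative threshold `tol` for deciding which singular values count as zero.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # second row is twice the first, so rank 1

U, s, Vt = np.linalg.svd(A)       # Vt is 3 x 3 (full_matrices=True by default)

tol = 1e-10                       # illustrative threshold for "zero"
rank = int(np.sum(s > tol))

# Rows of Vt beyond the rank span the null space; store them as columns
null_basis = Vt[rank:].T          # shape (3, 2)

print(rank)                                   # 1
print(null_basis.shape)                       # (3, 2)
print(np.allclose(A.dot(null_basis), 0.0))    # True: A x = 0 for each column
```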

## Review of a few numpy matrix and array functions.  

### Summary functions
Calculate row-wise (or column-wise) means for a matrix (or higher rank tensor).  
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
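
As a short sketch of the `axis` argument of `np.mean` (the matrix values are made up): `axis=0` averages down each column, `axis=1` averages across each row, and omitting `axis` averages over every element.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 rows, 3 columns

col_means = np.mean(X, axis=0)    # one mean per column, shape (3,)
row_means = np.mean(X, axis=1)    # one mean per row, shape (2,)
overall = np.mean(X)              # single number

print(col_means)   # [2.5 3.5 4.5]
print(row_means)   # [2. 5.]
print(overall)     # 3.5
```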

### Scalar functions applied to arrays (vector, matrix, tensor)
Scalar functions are functions that take a single real number as input and give a single real number as output.  There are several things you might have in mind by an expression like $X^2$ where $X$ is a matrix.  Supposing $X$ is square, you might mean $XX$ where the two matrices are multiplied as matrices.  In this class and other machine learning classes $X^2$ means a matrix where each $i,j$ element is the square of the corresponding element of $X$.  Something like $X^2_{i,j}\,=\,X_{i,j}*X_{i,j}$.  You'll see this notation used very frequently in deep learning literature.  A very common operation in a neural net is something like $\tanh(W_xX + W_hh)$ where $W_x$ and $W_h$ are weight matrices, $X$ is a vector of inputs and $h$ is the output of a previous layer in the network.  In this example the hyperbolic tangent function is applied element by element to the vectors that result from the matrix multiplications.  

You can use ordinary Python math library functions like math.tanh() to do these calculations, but you'd have to build a double for loop to do it, making your code uglier and wasting compute cycles.  Numpy provides array versions of the household functions that you can use instead.  Here's the exponential, for example.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
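
The sketch below contrasts the element-wise square with the matrix product, and compares `np.tanh` with an explicit double loop over `math.tanh`. The last lines mimic the $\tanh(W_xX + W_hh)$ expression; the weights, input and shapes are made-up values for illustration only (the input is named `x_in` in the code).

```python
import math
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(X**2)       # element-wise square: [[1, 4], [9, 16]]
print(X.dot(X))   # matrix product:      [[7, 10], [15, 22]]

# Element-wise tanh with an explicit double loop ...
looped = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        looped[i, j] = math.tanh(X[i, j])

# ... and the numpy version, which does the same thing in one call
vectorized = np.tanh(X)
print(np.allclose(looped, vectorized))   # True

# A layer-style computation: tanh applied element by element
W_x = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])   # 2 x 3 weights (illustrative)
W_h = np.array([[0.5, 0.0],
                [0.0, 0.5]])        # 2 x 2 weights (illustrative)
x_in = np.array([1.0, 0.0, -1.0])   # input vector
h = np.array([0.2, -0.2])           # output of a previous layer
out = np.tanh(W_x.dot(x_in) + W_h.dot(h))
print(out.shape)                    # (2,)
```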

Not coincidentally, there are also tensor versions of household scalar functions in neural net languages like Theano and Tensorflow.
