# Kira Novitchkova-Burbank

# Section 3: Projections, Subspaces, Orthogonality, and QR decomposition

In lecture you have been discussing subspaces and the notion of orthogonality. Generating orthogonal subspaces or an orthonormal basis for matrices can be a powerful numerical tool.

In this section we will explore this idea of orthogonality and how to use it to describe matrices and solve least squares.

## Using QR for least squares

We can use least squares and QR to attempt to classify handwritten digits from the MNIST dataset. This is essentially a single-layer perceptron with no activation function. We will use TensorFlow to load the data since it has a convenient loader that returns NumPy arrays.

In [None]:
import tensorflow as tf
import numpy as np
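import scipy.linalg  # import the submodule explicitly; older SciPy versions do not load it with a bare `import scipy`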
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print(x_train.shape)
print(y_train.shape)

(60000, 28, 28)
(60000,)


We see here that each training example is a 28x28 image. We want each example as a single vector, so let's flatten each image into a row of length 784 to build one large data matrix.

In [None]:
x_train = x_train.reshape((60000, 784))
# prepend a column of ones so the model has a bias (intercept) term
bias = np.ones((60000, 1))
x_train = np.concatenate((bias, x_train), axis=1)  # shape (60000, 785)

Let's call the data matrix $A$ and the labels $b$. There likely isn't an exact solution to the system $Ax = b$ since $A$ has far fewer columns (785) than rows (60000), so the system is overdetermined. This means we want to solve $\min_{x}||Ax - b||_2$, which is the least squares problem. Let's think about how we can do this using $QR$.

Fact:
1. Because $Q$ is orthogonal (its columns are orthonormal, so $Q^TQ = I$), it doesn't change the norm of any vector.

Proof:
$$||Qy||^2_2 = (Qy)^T (Qy) = y^TQ^TQy = y^Ty = ||y||_2^2$$
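
The norm-preservation fact can be checked numerically. The following is a minimal sketch, assuming a random $5 \times 5$ matrix whose QR factorization supplies an orthogonal $Q$; the sizes and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# The QR factorization of a random square matrix gives an orthogonal Q.
Q_demo, _ = np.linalg.qr(rng.standard_normal((5, 5)))
y_demo = rng.standard_normal(5)

norm_y = np.linalg.norm(y_demo)
norm_Qy = np.linalg.norm(Q_demo @ y_demo)

# The two norms agree up to floating point rounding.
print(norm_y, norm_Qy)
```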

2. So that means we can transform our minimization problem, using $A = QR$, to:
\begin{align}
\min_x ||Ax - b||_2 &=\min_x ||Q^T(Ax - b)||_2\\
&=\min_x ||Q^T(QRx - b)||_2\\
&=\min_x ||Rx - Q^T b||_2
\end{align}

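But since $R$ is an upper triangular matrix, as long as its diagonal entries are nonzero we can solve the following by back substitution:
$$Rx = Q^T b$$
or
$$x = R^{-1} Q^T b$$
which makes the part of the residual that depends on $x$ equal to $0$. Whatever remains is the component of $b$ outside the column space of $A$, which no choice of $x$ can remove.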
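
As a small illustration of this recipe, the sketch below solves a full-rank least squares problem through $QR$ and compares the answer with `np.linalg.lstsq`. The matrix `A_small` and vector `b_small` are made-up random data, not the MNIST matrices. Note that `residual` measures $\|Ax - b\|_2$ itself, which is usually nonzero for an overdetermined system even though $Rx = Q^Tb$ holds exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
A_small = rng.standard_normal((20, 3))   # tall matrix, full column rank for random data
b_small = rng.standard_normal(20)

# Reduced QR: Q_s is 20x3 with orthonormal columns, R_s is 3x3 upper triangular.
Q_s, R_s = np.linalg.qr(A_small)

# Solve R x = Q^T b (a general solver is used here for brevity;
# scipy.linalg.solve_triangular would exploit the triangular structure).
x_qr = np.linalg.solve(R_s, Q_s.T @ b_small)

# Reference solution from NumPy's least squares routine.
x_ref, *_ = np.linalg.lstsq(A_small, b_small, rcond=None)

residual = np.linalg.norm(A_small @ x_qr - b_small)
print(np.allclose(x_qr, x_ref), residual)
```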

In [None]:
Q, R = scipy.linalg.qr(x_train, mode='economic')

np.linalg.matrix_rank(R)

713

Notice the above technique only works if $R$ is invertible, i.e. full rank. $R$ is full rank only if the original data, $A$, has linearly independent columns. Here the rank is 713 while $A$ has 785 columns, so $R$ is singular. Datasets often contain linear dependencies like this, meaning we have to take a slightly different approach to $QR$ for least squares.

This process is known as *rank deficient least squares* and requires a modified $QR$ which permutes the columns of $A$ so that the diagonal of $R$ is non-increasing in magnitude. If you're interested, [here is a formal description of this algorithm using Householder transformations](https://www.math.usm.edu/lambers/mat610/sum10/lecture11.pdf).

For our case, `scipy.linalg.qr` offers a `pivoting` flag that does this for us.

In [None]:
Q, R_p, p = scipy.linalg.qr(x_train, mode='economic', pivoting=True)

This version gives us a $Q$ like before, but $R_p$ is provided in the form:
$$R_p = \begin{bmatrix}
R & S\\
0 & 0
\end{bmatrix} $$
where $R$ is now an invertible upper triangular block. The returned $p$ holds the column pivots applied to $A$ so that:
$$A \Pi = Q R_p$$
with $\Pi$ being the permutation matrix created from $p$. We can create that permutation matrix with `np.eye(size)[:,p]`.
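
To see the pivoted factorization on something small, here is a sketch using a made-up $8 \times 4$ matrix `A_demo` whose last column repeats the first, so its rank is 3. It checks $A\Pi = QR_p$ and prints the magnitudes of the diagonal of $R_p$.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(2)
B_demo = rng.standard_normal((8, 3))
# Hypothetical rank-deficient matrix: the last column duplicates the first.
A_demo = np.column_stack([B_demo, B_demo[:, 0]])

Q_d, R_d, p_d = scipy.linalg.qr(A_demo, mode='economic', pivoting=True)
Pi_d = np.eye(4)[:, p_d]                 # permutation matrix built from the pivots

# A Pi = Q R_p, and multiplying by Pi is the same as reordering columns by p.
print(np.allclose(A_demo @ Pi_d, Q_d @ R_d))
print(np.allclose(A_demo @ Pi_d, A_demo[:, p_d]))

# Diagonal magnitudes are non-increasing; the last one is numerically zero.
diag_mag = np.abs(np.diag(R_d))
print(diag_mag)
```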

We checked previously that the rank of our R matrix is 713, but let's check again with the pivoted R by looking for the first diagonal entry that is numerically zero (smaller than `1e-6` in magnitude).

In [None]:
rank = np.argmax(np.absolute(np.diag(R_p)) < 1e-6)
rank

713

Now, one way to solve for $x$ is to slice off the rows of $R_p$ that are all zeros and cut the bottom portion off of $Q^T b$ so that we can solve the triangular system. The resulting vector needs to be permuted back using the pivots to get an actual solution $x$ to the least squares problem.

\begin{align}
||b - Ax||_2^2 &= \left\|b - Q
\begin{bmatrix}
R & S\\
0 & 0
\end{bmatrix}
\Pi^T x \right\|_2^2\\
&= \left \|Q^T b -
\begin{bmatrix}
R & S\\
0 & 0
\end{bmatrix}
\begin{bmatrix}
u\\
v
\end{bmatrix}
\right\|_2^2\\
&=\left\|
\begin{bmatrix}
c\\
d
\end{bmatrix} -
\begin{bmatrix}
Ru + Sv\\
0
\end{bmatrix} \right\|_2^2\\
&= \|c - Ru - Sv\|_2^2 + \|d\|_2^2\\
\text{where} \quad
Q^Tb &=
\begin{bmatrix}
c\\
d
\end{bmatrix}, \quad \Pi^Tx =
\begin{bmatrix}
u\\
v
\end{bmatrix}
\end{align}
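
The final identity in this derivation can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: it uses a made-up $10 \times 4$ matrix of rank 3, an arbitrary candidate vector `x_any`, and the full (not economic) pivoted $QR$ so that $Q$ is square and preserves norms. All names ending in `_f` are hypothetical and chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(3)
B_f = rng.standard_normal((10, 3))
A_def = np.column_stack([B_f, B_f[:, 0] + B_f[:, 1]])   # 10x4 matrix with rank 3
b_def = rng.standard_normal(10)
x_any = rng.standard_normal(4)                          # any x, not the minimizer

# Full-mode pivoted QR: Q_f is 10x10, R_f is 10x4.
Q_f, R_f, p_f = scipy.linalg.qr(A_def, pivoting=True)
r = 3                                                   # known rank of A_def

R11, S12 = R_f[:r, :r], R_f[:r, r:]                     # the blocks R and S
Qtb = Q_f.T @ b_def
c_f, d_f = Qtb[:r], Qtb[r:]                             # Q^T b = [c; d]
xp = x_any[p_f]                                         # Pi^T x reorders x by the pivots
u_f, v_f = xp[:r], xp[r:]

lhs = np.linalg.norm(b_def - A_def @ x_any) ** 2
rhs = np.linalg.norm(c_f - R11 @ u_f - S12 @ v_f) ** 2 + np.linalg.norm(d_f) ** 2
print(lhs, rhs)   # the two values agree up to rounding
```

Only the first term depends on $x$, so the best achievable residual is $\|d\|_2^2$.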

In the implementation below, I choose the simplest solution with $v=0$, which leaves the triangular system $Ru = c$ to solve.

In [None]:
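R = R_p[:rank, :rank]                  # invertible upper triangular block
c = Q.T[:rank,:] @ y_train             # top part of Q^T b
u = scipy.linalg.solve_triangular(R, c, lower=False)  # back substitution for Ru = c
v = np.zeros(785 - rank)               # free variables set to zero
uv = np.concatenate((u,v))
x = np.eye(785)[:,p] @ uv              # undo the column permutation: x = Pi [u; v]
pred = x_train @ x
print("R:", R)
print("pred[:10]:", pred[:10])
print("y_train[:10]:", y_train[:10])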

R: [[-4.34535904e+04 -2.50719164e+04 -2.38967781e+04 ... -2.34180879e-01
  -1.12533854e-02 -9.27886503e-02]
 [ 0.00000000e+00 -3.38851685e+04 -1.04617855e+04 ... -5.33754638e-02
  -3.96001001e-02  6.86550896e-02]
 [ 0.00000000e+00  0.00000000e+00  3.08895655e+04 ... -1.99244343e-01
   2.81905096e-02 -4.85307884e-02]
 ...
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  7.95197288e+00
  -2.00474285e-03 -2.36763863e-03]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   6.98071953e+00 -4.08163581e-04]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00 -1.87255493e+00]]
pred[:10]: [4.19480069 1.20585109 3.19619527 2.3125187  7.70189204 4.07685415
 1.62278595 3.92588536 1.84626138 4.82906635]
y_train[:10]: [5 0 4 1 9 2 1 3 1 4]


We see that this is *sort of* working...

The predicted numbers are approximately the same as the labels, but with categorical labels like *digit*, getting a real number that is around the correct value isn't going to make a very good classifier.

We can really improve this model by making 10 different binary classifiers: for each digit, we change the right hand side, $y_{train}$, to be $1$ if the observation is the digit we are trying to classify and $0$ otherwise.

## Exercise 3

Here we will complete the first step to improve the least squares classifier by implementing it as a binary classifier.

Modify `y_train` so that it *one hot encodes* the dataset. One hot encoding is where you take a vector of categorical labels and translate it to a vector for each category. Each category vector should have a $1$ if the observation is that category and a $0$ otherwise.

### One hot encoding example:
$$
\begin{bmatrix} 3\\ 2\\ 3\\ 1\\ 0 \end{bmatrix} \to
\begin{bmatrix} 0\\ 0\\ 0\\ 0\\ 1 \end{bmatrix},
\begin{bmatrix} 0\\ 0\\ 0\\ 1\\ 0 \end{bmatrix},
\begin{bmatrix} 0\\ 1\\ 0\\ 0\\ 0 \end{bmatrix},
\begin{bmatrix} 1\\ 0\\ 1\\ 0\\ 0 \end{bmatrix}
\quad \text{or in matrix form:} \quad
\begin{bmatrix}
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}
$$

These new one hot encoded vectors will become the new $b$ in the least squares formulation (formerly `y_train`).
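
As a quick check of the example above, here is a minimal sketch that one hot encodes the same label vector by indexing an identity matrix. The names `labels_demo` and `one_hot_demo` are made up for this illustration, and this vectorized form is just one possible way to do it.

```python
import numpy as np

labels_demo = np.array([3, 2, 3, 1, 0])
num_classes = labels_demo.max() + 1       # 4 categories: 0, 1, 2, 3

# Row i of the identity is the one hot vector for category i, so indexing
# the identity by the labels encodes every observation at once.
one_hot_demo = np.eye(num_classes, dtype=int)[labels_demo]
print(one_hot_demo)
# [[0 0 0 1]
#  [0 0 1 0]
#  [0 0 0 1]
#  [0 1 0 0]
#  [1 0 0 0]]
```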

In [None]:
import numpy as np


labels = y_train  # vector of digit labels, e.g. [5, 0, 4, ...], with shape (60000,)
print(labels)
n = len(labels)
num_categories = max(labels) + 1

print(n)  # 60000
print(num_categories)  # 10

# creating one_hot: one row per observation, one column per category
one_hot = np.zeros((n, num_categories))

for i, label in enumerate(labels):
  one_hot[i, label] = 1

# testing how to pull one row or column from one_hot
print("0th row:", one_hot[0,:])  # the next row would be one_hot[1,:]
print("0th col:", one_hot[:,0])  # the next column would be one_hot[:,1]

print("\n one_hot:\n",one_hot)


[5 0 4 ... 5 6 8]
60000
10
0th row: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
0th col: [0. 1. 0. ... 0. 0. 0.]

 one_hot:
 [[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]]


## Homework 3

After creating the one hot encoding of the labels, use the $Q$ and $R$ from the code above to solve the least squares problem for $x$ for each of these 10 one hot encoded column vectors.

Concatenate the solutions $x_i$ as columns of a matrix, $X = \begin{bmatrix} x_1, & x_2, & \dots, & x_{10} \end{bmatrix}$

And get the resulting prediction matrix, $Y = A X$.

For each row (observation) of this prediction matrix, you will have 10 values that correspond to each label. Whichever value is highest is the prediction. Extract the index of the highest value for each row in this matrix. Compare the predicted labels with the actual labels to calculate your prediction accuracy.

### Bonus / Extra Credit

Create a confusion matrix of the result and make a short comment about your observations of this matrix.
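
One possible way to build such a matrix is sketched below on toy data. The helper `confusion_matrix` and the arrays `actual_demo` and `predicted_demo` are hypothetical names introduced here, not part of the notebook; for the homework, the inputs would be the true labels and the predicted labels.

```python
import numpy as np

def confusion_matrix(actual, predicted, num_classes):
    """Entry [i, j] counts observations with true label i and predicted label j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (actual, predicted), 1)
    return cm

actual_demo = np.array([0, 1, 2, 2, 1, 0])
predicted_demo = np.array([0, 2, 2, 2, 1, 1])

cm_demo = confusion_matrix(actual_demo, predicted_demo, 3)
print(cm_demo)
# [[1 1 0]
#  [0 1 1]
#  [0 0 2]]

# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
```

Large off-diagonal entries point to pairs of classes that the classifier tends to confuse.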

In [None]:
#creating empty matrix X where columns x1, x2, x3, ... xnum_categories will be stored

#num_categories = length of rows aka how many columns you want X to have

X = np.zeros((785, num_categories)) #785x10

#getting cols (x1, x2, x3, ... xnum_categories) to go inside X
for i in range(num_categories):

  #turning cols from one_hot to cols of X

  c = Q.T[:rank,:] @ one_hot[:,i]
  u = scipy.linalg.solve_triangular(R, c, lower=False)
  v = np.zeros(785 - rank)
  uv = np.concatenate((u,v))
  x = np.eye(785)[:,p] @ uv

  X[:,i] = x


#Y=AX equivalent to Y=x_train @ X
Y = x_train @ X

#finding index of max value in pred
pred = np.argmax(Y, axis=1)

#print first 10 labels of both pred and y_train
print("pred[:10]",pred[:10])
print("y_train[:10]",y_train[:10])

#counting how many predicted labels match the true labels
num_correct = np.sum(pred == y_train)
print("labels that are the same in pred and y_train:", num_correct)
print("amount of labels in y_train:", len(y_train))

#proportion of correct predictions (training accuracy)
proportion = num_correct / n    #n = 60000
percent_proportion = proportion * 100
print("proportion of correct predicted labels / original labels:", num_correct, "/", n, "=", proportion)
print("proportion as a percentage %:", percent_proportion)

pred[:10] [5 0 4 1 9 2 1 3 1 4]
y_train[:10] [5 0 4 1 9 2 1 3 1 4]
labels that are the same in pred and y_train: 51464
amount of labels in y_train: 60000
proportion of correct predicted labels / original labels: 51464 / 60000 = 0.8577333333333333
proportion as a percentage %: 85.77333333333334


## A note about least squares in practice

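Please don't implement least squares on your own like this in practice. `numpy` already has least squares implemented for you in `np.linalg.lstsq`, which also handles the rank deficient case through its `rcond` cutoff.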

In [None]:
result = np.linalg.lstsq(x_train, y_train, rcond=1e-6)
pred = x_train @ result[0]

print(pred[:10])
print(y_train[:10])

[4.19480069 1.20585109 3.19619527 2.3125187  7.70189204 4.07685415
 1.62278595 3.92588536 1.84626138 4.82906635]
[5 0 4 1 9 2 1 3 1 4]
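
`np.linalg.lstsq` also accepts a matrix of right-hand sides, so several least squares problems that share the same $A$ can be solved in one call. The sketch below illustrates this on made-up toy data (`A_toy`, `labels_toy`, and `B_toy` are hypothetical names, with 3 classes rather than 10), and confirms that the result matches solving each column separately.

```python
import numpy as np

rng = np.random.default_rng(4)
A_toy = rng.standard_normal((30, 4))
labels_toy = rng.integers(0, 3, size=30)
B_toy = np.eye(3)[labels_toy]             # 30x3 one hot right-hand sides

# One call solves for all columns at once: X_all is 4x3, one column per class.
X_all, *_ = np.linalg.lstsq(A_toy, B_toy, rcond=None)

# Solving column by column gives the same coefficients.
X_cols = np.column_stack([
    np.linalg.lstsq(A_toy, B_toy[:, k], rcond=None)[0] for k in range(3)
])
print(X_all.shape, np.allclose(X_all, X_cols))

# Predicted class for each row is the column with the largest score.
pred_toy = np.argmax(A_toy @ X_all, axis=1)
```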


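QR decomposition code using Gram-Schmidt. This computes the same kind of factorization that `scipy.linalg.qr()` returns, although SciPy itself calls LAPACK routines based on Householder reflections rather than Gram-Schmidt: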

In [None]:
import numpy as np

#Gram-Schmidt orthogonalization

A = np.random.rand(4,4)
print("A:\n", A)

n = 4
Q = np.zeros((n,n))

#projection of v onto u
def proj(u, v):
  inner_uv = u.T @ v   #inner product <u, v>
  inner_u = u.T @ u    #squared norm of u
  return (inner_uv / inner_u) * u

for j in range(n):
  v = A[:,j].copy()
  for i in range(j):
    u = Q[:,i]
    projection = proj(u, v)
    v -= projection
  v /= (v.T @ v) ** 0.5
  Q[:,j] = v[:]

print("Q:\n", Q)
eye = Q.T @ Q
eye[abs(eye) < 1e-10] = 0.0
print("hopefully identity:\n", eye)

R = Q.T @ A
R[abs(R) < 1e-10] = 0.0
print("R:\n", R)

A:
 [[0.55260458 0.97908672 0.24273866 0.15001341]
 [0.31734006 0.77715622 0.00967177 0.77849668]
 [0.37092713 0.39320829 0.86332207 0.99480554]
 [0.09192235 0.7434321  0.93948786 0.79528849]]
Q:
 [[ 0.74370452 -0.03520303 -0.29236892 -0.60015394]
 [ 0.42708158  0.29946255 -0.44453284  0.72822665]
 [ 0.49919996 -0.41632358  0.69660626  0.30366724]
 [ 0.12371064  0.85776314  0.48130103 -0.13148154]]
hopefully identity:
 [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
R:
 [[0.7430432  1.34832034 0.73185145 1.03903977]
 [0.         0.67224913 0.44078791 0.49585782]
 [0.         0.         0.97830338 0.68583434]
 [0.         0.         0.         0.674415  ]]
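
For comparison with a library routine, here is a sketch that wraps the same idea in a function which also builds $R$ as it goes, then checks the result against `np.linalg.qr`. The function name `gram_schmidt_qr` is made up for this illustration, and it assumes the input has linearly independent columns. The two factorizations can differ by the signs of matching columns of $Q$ and rows of $R$, so the comparison flips signs to make the diagonal of $R$ positive.

```python
import numpy as np

def gram_schmidt_qr(A):
    """QR via modified Gram-Schmidt; assumes linearly independent columns."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ v       # coefficient along the i-th orthonormal direction
            v -= R[i, j] * Q[:, i]      # remove that component from v
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

rng = np.random.default_rng(5)
A_gs = rng.random((4, 4))
Q_gs, R_gs = gram_schmidt_qr(A_gs)
Q_np, R_np = np.linalg.qr(A_gs)

# Flip signs so that the library's R has a positive diagonal, like Gram-Schmidt's.
signs = np.sign(np.diag(R_np))
print(np.allclose(Q_gs @ R_gs, A_gs))
print(np.allclose(Q_gs, Q_np * signs))
print(np.allclose(R_gs, R_np * signs[:, None]))
```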
