## Chapter 8
# The Inner Product

**Problem 8.3.4:** Demonstrate using a numerical example that Lemma 8.3.3 would not be true if we remove the requirement that $\boldsymbol{u}$ and $\boldsymbol{v}$ are orthogonal.

Lemma 8.3.3: If $\boldsymbol{u}$ is orthogonal to $\boldsymbol{v}$ then, for any scalars $\alpha$, $\beta$,

$\|\alpha\boldsymbol{u} + \beta\boldsymbol{v}\|^2 = \alpha^2\|\boldsymbol{u}\|^2 + \beta^2\|\boldsymbol{v}\|^2$

Consider the non-orthogonal 2-vectors $\boldsymbol{v} = [1,0]$, $\boldsymbol{u} = [3,4]$ and the scalars $\alpha = 2, \beta = 2$.  Then  
$\begin{align}
\|\alpha\boldsymbol{u} + \beta\boldsymbol{v}\|^2 &= (\alpha\boldsymbol{u}  + \beta\boldsymbol{v}) \cdot(\alpha\boldsymbol{u}  + \beta\boldsymbol{v})\\
&= 2^2([3,4]\cdot[3,4]) + 2^2([1,0]\cdot[1,0]) + 2 \cdot 2([3,4]\cdot[1,0])+2\cdot 2([1,0]\cdot[3,4])\\
&= 4 \cdot 25 + 4 \cdot 1 + 4 \cdot 3 + 4 \cdot 3 = 128\\
&\neq \alpha^2\|\boldsymbol{u}\|^2 + \beta^2\|\boldsymbol{v}\|^2 = 4 \cdot 25 + 4 \cdot 1 = 104
\end{align}$

$QED$
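
The arithmetic above can be checked mechanically. The following is a small sketch that uses plain Python lists rather than the book's `Vec` class; the helper names `dot`, `scale` and `add` are introduced here for illustration only.

```python
# Numerical check of the counterexample, using plain lists (not the book's Vec class).
def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def scale(alpha, x):
    return [alpha * xi for xi in x]

def add(x, y):
    return [xi + yi for xi, yi in zip(x, y)]

u, v = [3, 4], [1, 0]
alpha, beta = 2, 2

combo = add(scale(alpha, u), scale(beta, v))   # alpha*u + beta*v = [8, 8]
lhs = dot(combo, combo)                        # squared norm of the combination
rhs = alpha**2 * dot(u, u) + beta**2 * dot(v, v)
cross = 2 * alpha * beta * dot(u, v)           # the term that orthogonality would remove

print(lhs, rhs, cross)  # 128 104 24
```

The two sides differ by exactly the cross term $2\alpha\beta(\boldsymbol{u}\cdot\boldsymbol{v})$, which vanishes only when $\boldsymbol{u}$ and $\boldsymbol{v}$ are orthogonal (or one of the scalars is zero).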

**Problem 8.3.5:** Using induction and Lemma 8.3.3, prove the following generalization: Suppose $\boldsymbol{v}_1, ..., \boldsymbol{v}_n$ are mutually orthogonal. For any coefficients $\alpha_1, ..., \alpha_n$,

$\|\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_n\boldsymbol{v}_n\|^2 = \alpha_1^2\|\boldsymbol{v}_1\|^2 + \cdots + \alpha_n^2\|\boldsymbol{v}_n\|^2$

Using induction on $n$,

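consider the base case $n = 0$.  The sum on the left is empty, so it is the zero vector, and the sum on the right is empty as well:

$\|\boldsymbol{0}\|^2 = 0$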

Assume the generalization holds for $n = k$ and we prove for the case $n = k + 1$.

$\begin{align}
\|\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_k\boldsymbol{v}_k + \alpha_{k+1}\boldsymbol{v}_{k+1}\|^2 &=(\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_k\boldsymbol{v}_k + \alpha_{k+1}\boldsymbol{v}_{k+1}) \cdot (\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_k\boldsymbol{v}_k + \alpha_{k+1}\boldsymbol{v}_{k+1})\\
&= (\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_k\boldsymbol{v}_k) \cdot (\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_k\boldsymbol{v}_k) + \alpha_{k+1}\boldsymbol{v}_{k+1} \cdot \alpha_{k+1}\boldsymbol{v}_{k+1} \quad\text{(since } \boldsymbol{v}_1, ..., \boldsymbol{v}_{k+1} \text{ are mutually orthogonal, the cross terms vanish)}\\
&= \|\alpha_1\boldsymbol{v}_1 + \cdots + \alpha_k\boldsymbol{v}_k\|^2 + \alpha_{k+1}\boldsymbol{v}_{k+1} \cdot \alpha_{k+1}\boldsymbol{v}_{k+1}\\
&= \alpha_1^2\|\boldsymbol{v}_1\|^2 + \cdots + \alpha_k^2\|\boldsymbol{v}_k\|^2 + \alpha_{k+1}\boldsymbol{v}_{k+1} \cdot \alpha_{k+1}\boldsymbol{v}_{k+1} \quad\text{(using the induction hypothesis for } k\text{)}\\
&= \alpha_1^2\|\boldsymbol{v}_1\|^2 + \cdots + \alpha_k^2\|\boldsymbol{v}_k\|^2 + \alpha_{k+1}^2\|\boldsymbol{v}_{k+1}\|^2
\end{align}$

$QED$
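
As a sanity check on the statement (not a substitute for the proof), the identity can be evaluated on a concrete set of mutually orthogonal vectors. The vectors and coefficients below are illustrative values chosen here, and plain lists stand in for the book's `Vec` class.

```python
def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

# Three mutually orthogonal 3-vectors and arbitrary coefficients (illustrative values).
vs = [[1, 1, 0], [1, -1, 0], [0, 0, 2]]
alphas = [2, -3, 5]

# Confirm pairwise orthogonality before relying on the identity.
assert all(dot(vs[i], vs[j]) == 0 for i in range(3) for j in range(i + 1, 3))

# combo = alpha_1 v_1 + alpha_2 v_2 + alpha_3 v_3, computed entry by entry
combo = [sum(a * v[k] for a, v in zip(alphas, vs)) for k in range(3)]
lhs = dot(combo, combo)
rhs = sum(a**2 * dot(v, v) for a, v in zip(alphas, vs))

print(lhs, rhs)  # 126 126
```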

In [1]:
import sys
sys.path.append('../')

from vec import Vec
from vecutil import list2vec

In [2]:
def project_along(b, v):
    sigma = ((b * v) / (v * v)) if v * v > 1e-20 else 0
    return sigma * v

In [3]:
project_along(list2vec([2, 4]), list2vec([6, 2]))

Vec({0, 1},{0: 3.0, 1: 1.0})

**Problem 8.3.15:** Write a Python procedure `projection_matrix(v)` that, given a vector $\boldsymbol{v}$, returns the matrix $M$ such that $\pi_\boldsymbol{v}(\boldsymbol{x}) = M\boldsymbol{x}$. Your procedure should be correct even if $\|\boldsymbol{v}\| \neq 1$.

In [4]:

from mat import Mat
from matutil import coldict2mat

def outer_product(x, v):
    return coldict2mat([x]) * coldict2mat([v]).transpose()

def projection_matrix(v):
    return outer_product(v, v) * (1 / (v * v)) # element-wise division by squared norm (dot-product)

In [5]:
projection_matrix(list2vec([6, 2])) * list2vec([2, 4]) # same result as above!

Vec({0, 1},{0: 3.0, 1: 1.0})

**Problem 8.3.16:** Suppose $\boldsymbol{v}$ is a nonzero $n$-vector. What is the rank of the matrix $M$ such that $\pi_\boldsymbol{v}(\boldsymbol{x}) = M\boldsymbol{x}$? Explain your answer using appropriate interpretations of matrix-vector or matrix-matrix multiplication.

Since $M$ is a scalar multiple of the outer product $\boldsymbol{v}\boldsymbol{v}^T$ (the scalar is $1/\|\boldsymbol{v}\|^2$, which is nonzero), it has the same rank as $\boldsymbol{v}\boldsymbol{v}^T$.  The outer product is the matrix-matrix product of the column vector $\boldsymbol{v}$ with the row vector $\boldsymbol{v}^T$, so by the linear-combinations definition of matrix-matrix multiplication, row $i$ of the product is a linear combination of the rows of the right-hand factor, with coefficients taken from row $i$ of the left-hand factor.  The right-hand factor has the single row $\boldsymbol{v}^T$ and row $i$ of the left-hand factor is the single entry $\boldsymbol{v}[i]$, so every row of $M$ is a scalar multiple of $\boldsymbol{v}^T$.  The row space of $M$ is therefore $\text{Span}\space\{\boldsymbol{v}\}$, and because $\boldsymbol{v}$ is nonzero, $\text{rank}(M) = 1$.
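
The row-by-row argument can be seen on the vector used earlier, $\boldsymbol{v} = [6, 2]$. This is a minimal sketch with nested lists and exact fractions instead of the book's `Mat` class; `projection_matrix_rows` is a name made up for this example, and it assumes a nonzero integer vector.

```python
from fractions import Fraction

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def projection_matrix_rows(v):
    # Row i of M = v v^T / (v . v) is the scalar v[i] / (v . v) times v.
    vv = dot(v, v)
    return [[Fraction(vi * vj, vv) for vj in v] for vi in v]

v = [6, 2]
M = projection_matrix_rows(v)

# Every row is a scalar multiple of v, so the row space is Span {v}.
for i, row in enumerate(M):
    coeff = Fraction(v[i], dot(v, v))
    assert row == [coeff * vj for vj in v]

# M x reproduces the projection of x = [2, 4] along v computed earlier.
x = [2, 4]
Mx = [dot(row, x) for row in M]
print(Mx)  # [Fraction(3, 1), Fraction(1, 1)]
```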

## 8.4 Lab: machine learning

**Task 8.4.1:** Use `read_training_data` to read the data in the file `train.data` into the variables `A`, `b`.

In [43]:
from cancer_data import read_training_data

A, b = read_training_data('train.data')

**Task 8.4.2:** Write a procedure `signum(u)` with the following spec:

* _input:_ a `Vec` `u`
* _output:_ the `Vec` `v` with the same domain as `u` such that `v[d] = 1 if u[d] >= 0 else -1`.

For example, `signum(Vec({'A', 'B'}, {'A': 3, 'B': -2}))` is `Vec({'A', 'B'}, {'A': 1, 'B': -1})`

In [7]:
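def signum(u):
    # Iterate over the whole domain: Vec is sparse, so zero entries may be
    # missing from u.f, and the spec maps them to 1 (since 0 >= 0).
    return Vec(u.D, {d: 1 if u[d] >= 0 else -1 for d in u.D})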

In [8]:
signum(Vec({'A', 'B'}, {'A': 3, 'B': -2}))

Vec({'A', 'B'},{'A': 1, 'B': -1})

**Task 8.4.3:** Write the procedure `fraction_wrong(A, b, w)` with the following spec:

* _input:_ An $R \times C$ matrix $A$ whose rows are feature vectors, an $R$-vector $\boldsymbol{b}$ whose entries are $+1$ and $-1$, and a $C$-vector $\boldsymbol{w}$
* _output:_ The fraction of row labels $r$ of $A$ such that the sign of (row $r$ of $A$) $\cdot \boldsymbol{w}$ differs from that of $\boldsymbol{b}[r]$.

In [9]:
# the book hints that there's a "clever" way to write this
# without explicit loops using matrix-vector multiplication, 
# dot-product and the `signum` procedure.
# I'm taking "no loops" to also mean no comprehensions.
# This is the purest way I could find, but it's not very intuitive.
# Here's the algebra:
#     num_right + num_wrong = len
#     num_right - num_wrong = signum(pred) * actual
#     => num_wrong = (len - signum(pred) * actual) / 2
#     => frac_wrong = num_wrong/len = 1/2 * (1 - (signum(pred) * actual) / len)
def fraction_wrong(A, b, w):
    return 0.5 * (1 - (signum(A * w) * b) / len(b.D))

In [10]:
fraction_wrong(A, b, Vec(A.D[1], {k: 1 for k in A.D[1]}))

0.5133333333333333

In [11]:
len([x for x in b.f.values() if x == -1]) / len(b.f) # just checking :)

0.5133333333333333
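
The counting identity behind `fraction_wrong` can also be checked on a tiny made-up example. The scores and labels below are hypothetical stand-ins for the entries of `A * w` and `b`, and the scalar `signum` here is a simplified version of the `Vec` procedure above.

```python
def signum(x):
    return 1 if x >= 0 else -1

# Hypothetical scores (the entries of A*w) and true labels, for illustration only.
scores = [2.5, -1.0, 0.0, 3.0, -4.0]
labels = [1, 1, 1, -1, -1]

n = len(labels)
# Each agreeing entry contributes +1 and each disagreeing entry -1,
# so this dot-product equals num_right - num_wrong.
agreement = sum(signum(s) * y for s, y in zip(scores, labels))
num_wrong_formula = (n - agreement) // 2
num_wrong_counted = sum(1 for s, y in zip(scores, labels) if signum(s) != y)

print(num_wrong_formula, num_wrong_counted, 0.5 * (1 - agreement / n))
```

Both counts agree, and the last printed value is the same fraction computed the way `fraction_wrong` computes it.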

**Task 8.4.4:** Write a procedure `loss(A, b, w)` that takes as input the training data $A$, $\boldsymbol{b}$ and a hypothesis vector $\boldsymbol{w}$, and returns the value $L(\boldsymbol{w})$ of the loss function for input $\boldsymbol{w}$.

Find the value of the loss function at a simple hypothesis vector such as the all-ones vector or a random vector of +1's and -1's.

In [12]:
def loss(A, b, w):
    error = (A * w - b)
    return error * error

In [13]:
loss(A, b, Vec(A.D[1], {k: 1 for k in A.D[1]}))

1461169191.1916513

**Task 8.4.9:** Write a procedure `find_grad(A, b, w)` that takes as input the training data $A$, $\boldsymbol{b}$ and a hypothesis vector $\boldsymbol{w}$ and returns the value of the gradient of $L$ at the point $\boldsymbol{w}$, using the following equation:

$\sum\limits_{i=1}^{m}2(\boldsymbol{a}_i \cdot \boldsymbol{w} - b_i)\boldsymbol{a}_i$

In [14]:
def find_grad(A, b, w):
    return 2 * (A * w - b) * A

In [15]:
find_grad(A, b, Vec(A.D[1], {k: 1 for k in A.D[1]}))

Vec({'compactness(stderr)', 'texture(mean)', 'radius(mean)', 'smoothness(stderr)', 'concave points(mean)', 'symmetry(worst)', 'area(stderr)', 'concave points(stderr)', 'perimeter(mean)', 'symmetry(mean)', 'concavity(stderr)', 'compactness(worst)', 'fractal dimension(stderr)', 'symmetry(stderr)', 'area(mean)', 'smoothness(mean)', 'area(worst)', 'fractal dimension(worst)', 'perimeter(worst)', 'compactness(mean)', 'radius(stderr)', 'texture(worst)', 'concavity(worst)', 'concave points(worst)', 'concavity(mean)', 'fractal dimension(mean)', 'smoothness(worst)', 'radius(worst)', 'perimeter(stderr)', 'texture(stderr)'},{'compactness(stderr)': 34491.53912591444, 'texture(mean)': 23824902.07755112, 'radius(mean)': 19061204.41971144, 'smoothness(stderr)': 8048.720134830324, 'concave points(mean)': 83000.61553163151, 'symmetry(worst)': 359563.8861937369, 'area(stderr)': 74089316.99183664, 'concave points(stderr)': 15786.076442543796, 'perimeter(mean)': 125171204.03328276, 'symmetry(mean)': 220058
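
One way to gain confidence in a gradient formula is to compare it against finite differences of the loss. The sketch below does this on a small made-up data set using plain lists; `numeric_grad` and the sample values are assumptions introduced for this check and are not part of the lab.

```python
def loss(A, b, w):
    # sum over rows of (a_i . w - b_i)^2
    return sum((sum(aij * wj for aij, wj in zip(row, w)) - bi) ** 2
               for row, bi in zip(A, b))

def find_grad(A, b, w):
    # sum over rows of 2 (a_i . w - b_i) a_i
    grad = [0.0] * len(w)
    for row, bi in zip(A, b):
        r = sum(aij * wj for aij, wj in zip(row, w)) - bi
        for j, aij in enumerate(row):
            grad[j] += 2 * r * aij
    return grad

def numeric_grad(A, b, w, h=1e-5):
    # central differences, one coordinate at a time
    g = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        g.append((loss(A, b, up) - loss(A, b, down)) / (2 * h))
    return g

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [1.0, -1.0, 1.0]
w = [0.5, -0.25]

analytic = find_grad(A, b, w)
numeric = numeric_grad(A, b, w)
print(analytic, numeric)
```

The two vectors should agree up to rounding error, since the loss is quadratic in $\boldsymbol{w}$.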

**Task 8.4.10:** Write a procedure `gradient_descent_step(A, b, w, sigma)` that, given the training data $A$, $\boldsymbol{b}$ and the current hypothesis vector $\boldsymbol{w}$, returns the next hypothesis vector.

The next hypothesis vector is obtained by computing the gradient, multiplying the gradient by the step size, and subtracting the result from the current hypothesis vector.

In [16]:
def gradient_descent_step(A, b, w, sigma):
    return w - sigma * find_grad(A, b, w)

**Task 8.4.11:** Write a procedure `gradient_descent(A, b, w, sigma, T)` that takes as input the training data $A$, $\boldsymbol{b}$, and initial value $\boldsymbol{w}$ for the hypothesis vector, a step size $\sigma$, and a number $T$ of iterations. The procedure should implement gradient descent as described above for $T$ iterations, and return the final value of $\boldsymbol{w}$.

Every thirty iterations or so, the procedure should print out the value of the loss function and the fraction wrong for the current hypothesis vector.

In [17]:
def gradient_descent(A, b, w, sigma, T):
    for i in range(T):
        if i % 30 == 0:
            print('Loss:\t', loss(A, b, w))
            print('Fraction wrong:\t', fraction_wrong(A, b, w))
        w = gradient_descent_step(A, b, w, sigma)
    return w

**Task 8.4.12:** Try out your gradient descent code on the training data! Notice that the fraction wrong might go up even while the value of the loss function goes down. Eventually, as the value of the loss function continues to decrease, the fraction wrong should also decrease (up to a point).

Try a step size of $\sigma = 2 \cdot 10^{-9}$, then try a step size of $\sigma = 10^{-9}$.

Try starting with the all-ones vector. Then try starting with the zero vector.

In [18]:
w = Vec(A.D[1], {k: 1 for k in A.D[1]})
sigma = 2e-9 # sigma too large
T = 300
gradient_descent(A, b, w, sigma, T)

Loss:	 1461169191.1916513
Fraction wrong:	 0.5133333333333333
Loss:	 31801400738349.836
Fraction wrong:	 0.5133333333333333
Loss:	 6.932905202112942e+17
Fraction wrong:	 0.5133333333333333
Loss:	 1.5114169719634027e+22
Fraction wrong:	 0.5133333333333333
Loss:	 3.294984132260201e+26
Fraction wrong:	 0.5133333333333333
Loss:	 7.183272805083588e+30
Fraction wrong:	 0.5133333333333333
Loss:	 1.5659986853065471e+35
Fraction wrong:	 0.5133333333333333
Loss:	 3.4139757028945487e+39
Fraction wrong:	 0.5133333333333333
Loss:	 7.442681918773589e+43
Fraction wrong:	 0.5133333333333333
Loss:	 1.6225515048942348e+48
Fraction wrong:	 0.5133333333333333


Vec({'compactness(stderr)', 'texture(mean)', 'radius(mean)', 'smoothness(stderr)', 'concave points(mean)', 'symmetry(worst)', 'area(stderr)', 'concave points(stderr)', 'perimeter(mean)', 'symmetry(mean)', 'concavity(stderr)', 'compactness(worst)', 'fractal dimension(stderr)', 'symmetry(stderr)', 'area(mean)', 'smoothness(mean)', 'area(worst)', 'fractal dimension(worst)', 'perimeter(worst)', 'compactness(mean)', 'radius(stderr)', 'texture(worst)', 'concavity(worst)', 'concave points(worst)', 'concavity(mean)', 'fractal dimension(mean)', 'smoothness(worst)', 'radius(worst)', 'perimeter(stderr)', 'texture(stderr)'},{'compactness(stderr)': 1.528369278607165e+17, 'texture(mean)': 1.0585263085778515e+20, 'radius(mean)': 8.511769465095394e+19, 'smoothness(stderr)': 3.5395650399635012e+16, 'concave points(mean)': 3.737791849292671e+17, 'symmetry(worst)': 1.5954234188339487e+18, 'area(stderr)': 3.340090771814811e+20, 'concave points(stderr)': 7.004450798487749e+16, 'perimeter(mean)': 5.59141084

Note that with this value of sigma, we keep overstepping and end up doing worse as the iterations continue.
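
The overstepping can be reproduced on a toy one-dimensional problem, which is an assumption made here for illustration and not the lab data. For $L(w) = \sum_i (a_i w - b_i)^2$ with scalar $w$, each step multiplies the distance to the minimizer by $1 - 2\sigma\sum_i a_i^2$, so the iteration converges only when $\sigma < 1/\sum_i a_i^2$.

```python
def descend(a, b, w, sigma, steps):
    # Gradient descent on L(w) = sum_i (a_i * w - b_i)^2 for a scalar w.
    for _ in range(steps):
        grad = sum(2 * (ai * w - bi) * ai for ai, bi in zip(a, b))
        w = w - sigma * grad
    return w

def loss(a, b, w):
    return sum((ai * w - bi) ** 2 for ai, bi in zip(a, b))

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # minimizer is w = 2
threshold = 1 / sum(ai * ai for ai in a)  # 1/14: larger step sizes diverge

w_small = descend(a, b, 0.0, 0.04, 50)   # below the threshold: converges
w_large = descend(a, b, 0.0, 0.08, 50)   # above the threshold: blows up
print(loss(a, b, w_small), loss(a, b, w_large))
```

With the larger step the loss grows geometrically from one iteration to the next, which is the same qualitative behavior as in the output above.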

In [19]:
sigma = 1e-9
T = 2000
w = gradient_descent(A, b, w, sigma, T)

Loss:	 1461169191.1916513
Fraction wrong:	 0.5133333333333333
Loss:	 1867476.164994827
Fraction wrong:	 0.73
Loss:	 1501858.2431936574
Fraction wrong:	 0.7133333333333334
Loss:	 1261341.0368251363
Fraction wrong:	 0.7133333333333334
Loss:	 1099845.1036712618
Fraction wrong:	 0.7166666666666667
Loss:	 988408.6940531735
Fraction wrong:	 0.7133333333333334
Loss:	 908821.9038339362
Fraction wrong:	 0.71
Loss:	 849628.557730042
Fraction wrong:	 0.7166666666666667
Loss:	 803615.1441978435
Fraction wrong:	 0.7033333333333334
Loss:	 766233.6931558683
Fraction wrong:	 0.7066666666666667
Loss:	 734611.217512566
Fraction wrong:	 0.7066666666666667
Loss:	 706927.5476219974
Fraction wrong:	 0.7033333333333334
Loss:	 682024.5367277957
Fraction wrong:	 0.6966666666666667
Loss:	 659160.5808964253
Fraction wrong:	 0.6933333333333334
Loss:	 637856.4056941518
Fraction wrong:	 0.6933333333333334
Loss:	 617798.174904544
Fraction wrong:	 0.69
Loss:	 598776.6022666126
Fraction wrong:	 0.69
Loss:	 580648.6767

In [20]:
print('Final loss:\t', loss(A, b, w))
print('Final fraction wrong:\t', fraction_wrong(A, b, w))

Final loss:	 165685.39935656055
Final fraction wrong:	 0.6166666666666667


Note that the fraction wrong is worse than what we started with (the all-ones guess), even though the loss is far lower!

In [25]:
w = Vec(A.D[1], {k: 0 for k in A.D[1]}) # try again with 0's hypothesis
w = gradient_descent(A, b, w, sigma, T)

Loss:	 300.0
Fraction wrong:	 0.5133333333333333
Loss:	 251.01076079619205
Fraction wrong:	 0.5133333333333333
Loss:	 239.28149884833175
Fraction wrong:	 0.5133333333333333
Loss:	 231.18126788323136
Fraction wrong:	 0.43
Loss:	 225.39735401131654
Fraction wrong:	 0.26666666666666666
Loss:	 221.10335403821537
Fraction wrong:	 0.18333333333333335
Loss:	 217.77839578627712
Fraction wrong:	 0.13666666666666666
Loss:	 215.09359415175376
Fraction wrong:	 0.09999999999999998
Loss:	 212.84073473769675
Fraction wrong:	 0.08000000000000002
Loss:	 210.8874787199391
Fraction wrong:	 0.08333333333333331
Loss:	 209.14922459274135
Fraction wrong:	 0.08000000000000002
Loss:	 207.57143149149456
Fraction wrong:	 0.08000000000000002
Loss:	 206.11851313090224
Fraction wrong:	 0.08666666666666667
Loss:	 204.76685863470976
Fraction wrong:	 0.08333333333333331
Loss:	 203.5004454685678
Fraction wrong:	 0.09000000000000002
Loss:	 202.30808054345363
Fraction wrong:	 0.08666666666666667
Loss:	 201.1816640826094


In [26]:
print('Final loss:\t', loss(A, b, w))
print('Final fraction wrong:\t', fraction_wrong(A, b, w))

Final loss:	 180.35992492345744
Final fraction wrong:	 0.13333333333333336


This is _dramatically better_!

In [27]:
# holding onto these nice results:
best_w = w

In [32]:
from random import randint

w = Vec(A.D[1], {k: randint(0, 1) for k in A.D[1]})
w = gradient_descent(A, b, w, sigma, T)

Loss:	 790140453.9175893
Fraction wrong:	 0.7692307692307692
Loss:	 122759.50090194705
Fraction wrong:	 0.676923076923077
Loss:	 120317.9601365443
Fraction wrong:	 0.6730769230769231
Loss:	 118156.79449082051
Fraction wrong:	 0.6653846153846154
Loss:	 116211.02127333778
Fraction wrong:	 0.6615384615384615
Loss:	 114431.5759626456
Fraction wrong:	 0.6576923076923077
Loss:	 112781.40121804012
Fraction wrong:	 0.6576923076923077
Loss:	 111232.49695748888
Fraction wrong:	 0.6461538461538462
Loss:	 109763.6953322771
Fraction wrong:	 0.6423076923076922
Loss:	 108358.98246396436
Fraction wrong:	 0.6346153846153846
Loss:	 107006.23258360621
Fraction wrong:	 0.6307692307692307
Loss:	 105696.25323056377
Fraction wrong:	 0.6269230769230769
Loss:	 104422.0650719458
Fraction wrong:	 0.6269230769230769
Loss:	 103178.35868768365
Fraction wrong:	 0.6269230769230769
Loss:	 101961.08483424848
Fraction wrong:	 0.6269230769230769
Loss:	 100767.14538642662
Fraction wrong:	 0.6230769230769231
Loss:	 99594.1

In [44]:
print('Final loss:\t', loss(A, b, w))
print('Final fraction wrong:\t', fraction_wrong(A, b, w))

Final loss:	 198424.73800185614
Final fraction wrong:	 0.6233333333333333


**Task 8.4.13:** After you have used your gradient descent code to find a hypothesis vector $\boldsymbol{w}$, see how well this hypothesis works for the data in the file `validate.data`. What is the percentage of samples that are incorrectly classified? Is it greater or smaller than the success rate on the training data? Can you explain the difference in performance?

In [45]:
validation_A, validation_b = read_training_data('validate.data')

In [46]:
print('Validation loss:\t', loss(validation_A, validation_b, best_w))
print('Validation fraction wrong:\t', fraction_wrong(validation_A, validation_b, best_w))

Validation loss:	 152.62023375699934
Validation fraction wrong:	 0.04999999999999999


I am surprised to see that the validation error is actually _lower_ than the training error!  Normally this is not the case.  This tells us two things: we definitely did not overfit the data, _and_ we got a little lucky :)

Normally, a function that fits one data set very well is not guaranteed to generalize to new data.  I think what this tells us at a higher level is that the function might be a relatively simple one to learn (the positive examples are easier to discern from the feature data).

In [47]:
# Trying with the hypothesis learned from our last trial (random starting hypothesis)
print('Validation loss:\t', loss(validation_A, validation_b, w))
print('Validation fraction wrong:\t', fraction_wrong(validation_A, validation_b, w)) 

Validation loss:	 56198.51556405085
Validation fraction wrong:	 0.6


## 8.5 Review questions

**What is an inner product for vectors over $\mathbb{R}$**?

One inner product for vectors over $\mathbb{R}$ is the dot-product.

**How is norm defined in terms of dot-product?**

Norm is defined as the square-root of the dot-product of a vector with itself.

**What does it mean for two vectors to be orthogonal?**

For two vectors to be orthogonal, it means their inner product is 0.  Geometrically, orthogonality between 2-vectors over $\mathbb{R}$ means that they are perpendicular.

**What is the Pythagorean Theorem for vectors?**

The Pythagorean Theorem for vectors states that for orthogonal vectors, the square of the norm of the sum of the vectors is equal to the sum of the squares of the norms of the vectors.

**What is parallel-perpendicular decomposition of a vector?**

For every pair of vectors $\boldsymbol{v}$ and $\boldsymbol{a}$, $\boldsymbol{v}$ can be written as the sum $\boldsymbol{v} = \boldsymbol{v}^{\|\boldsymbol{a}} + \boldsymbol{v}^{\perp\boldsymbol{a}}$ of a component that is _parallel to_ $\boldsymbol{a}$ (the point on the line spanned by $\boldsymbol{a}$ that is closest to $\boldsymbol{v}$, i.e. the projection of $\boldsymbol{v}$ along $\boldsymbol{a}$) and a component that is _perpendicular to_ $\boldsymbol{a}$ (a vector orthogonal to $\boldsymbol{a}$).

**How does one find the projection of a vector $\boldsymbol{b}$ orthogonal to another vector $\boldsymbol{v}$?**

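The projection of $\boldsymbol{b}$ orthogonal to a nonzero vector $\boldsymbol{v}$ is what remains of $\boldsymbol{b}$ after subtracting its projection along $\boldsymbol{v}$:

$\boldsymbol{b}^{\perp\boldsymbol{v}} = \boldsymbol{b} - \boldsymbol{v} \cdot \frac{\langle\boldsymbol{b},\boldsymbol{v}\rangle}{\langle\boldsymbol{v},\boldsymbol{v}\rangle}$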
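
In code, the projection orthogonal to $\boldsymbol{v}$ is the projection along $\boldsymbol{v}$ subtracted from $\boldsymbol{b}$. The sketch below uses plain lists of integers and exact fractions rather than the book's `Vec` class; `project_orthogonal` is a name chosen here for illustration.

```python
from fractions import Fraction

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def project_along(b, v):
    # sigma = <b, v> / <v, v>, or 0 when v is the zero vector
    vv = dot(v, v)
    sigma = Fraction(dot(b, v), vv) if vv != 0 else 0
    return [sigma * vi for vi in v]

def project_orthogonal(b, v):
    # subtract the component of b along v
    return [bi - pi for bi, pi in zip(b, project_along(b, v))]

b, v = [2, 4], [6, 2]
b_perp = project_orthogonal(b, v)
# b_perp is [-1, 3], and its dot-product with v is 0
print(b_perp, dot(b_perp, v))
```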

**How can linear algebra help in optimizing a nonlinear function?**

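Linear algebra can be used to compute the _gradient_ of a nonlinear function at a given point, that is, the vector of its partial derivatives at that point.  Gradient descent then repeatedly steps in the direction opposite the gradient.  For the least-squares loss used in the lab, the gradient can be computed with matrix-vector and vector-matrix multiplication.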

## 8.6 Problems

**Problem 8.6.1:** For each of the following problems, compute the norm of the given vector $\boldsymbol{v}$:

**a)** $\boldsymbol{v} = [2,2,1]$

$\|\boldsymbol{v}\| = \sqrt{9} = 3$

**b)** $\boldsymbol{v} = [\sqrt{2},\sqrt{3},\sqrt{5},\sqrt{6}]$

$\|\boldsymbol{v}\| = \sqrt{16} = 4$

**c)** $\boldsymbol{v} = [1,1,1,1,1,1,1,1,1]$

$\|\boldsymbol{v}\| = \sqrt{9} = 3$

**Problem 8.6.2:** For each of the following $\boldsymbol{a}$, $\boldsymbol{b}$, find the vector in $\text{Span}\space\{\boldsymbol{a}\}$ that is closest to $\boldsymbol{b}$:

1. $\boldsymbol{a} = [1,2]$, $\boldsymbol{b} = [2,3]$

  $\boldsymbol{a} \cdot \frac{\langle\boldsymbol{b},\boldsymbol{a}\rangle}{\langle\boldsymbol{a},\boldsymbol{a}\rangle} = [1,2] \cdot \frac{8}{5} = [\frac{8}{5}, \frac{16}{5}]$

2. $\boldsymbol{a} = [0,1,0]$, $\boldsymbol{b} = [1.414,1,1.732]$

  (Since the norm of $\boldsymbol{a}$ is 1), $\boldsymbol{a} \cdot \langle\boldsymbol{b},\boldsymbol{a}\rangle = [0,1,0] \cdot 1 = [0,1,0]$

3. $\boldsymbol{a} = [-3,-2,-1,4]$, $\boldsymbol{b} = [7,2,5,0]$

  $\boldsymbol{a} \cdot \frac{\langle\boldsymbol{b},\boldsymbol{a}\rangle}{\langle\boldsymbol{a},\boldsymbol{a}\rangle} = [-3,-2,-1,4] \cdot \frac{-30}{30} = [-3,-2,-1,4] \cdot (-1) = [3,2,1,-4]$

**Problem 8.6.3:** For each of the following $\boldsymbol{a}$, $\boldsymbol{b}$, find $\boldsymbol{b}^{\perp\boldsymbol{a}}$ and $\boldsymbol{b}^{\|\boldsymbol{a}}$

1. $\boldsymbol{a} = [3,0], \boldsymbol{b} = [2,1]$

  $\boldsymbol{b}^{\|\boldsymbol{a}} = \boldsymbol{a} \cdot \frac{\langle\boldsymbol{b},\boldsymbol{a}\rangle}{\langle\boldsymbol{a},\boldsymbol{a}\rangle} = [3,0] \frac{6}{9} = [2,0]$  
  $\boldsymbol{b}^{\perp\boldsymbol{a}} = \boldsymbol{b} - \boldsymbol{b}^{\|\boldsymbol{a}} = [2,1] - [2,0] = [0,1]$
  
  (Of course, this could be found much more simply with geometrical reasoning since $\boldsymbol{a}$ lies along the $x$-axis.)

1. $\boldsymbol{a} = [1,2,-1], \boldsymbol{b} = [1,1,4]$

  $\boldsymbol{b}^{\|\boldsymbol{a}} = \boldsymbol{a} \cdot \frac{\langle\boldsymbol{b},\boldsymbol{a}\rangle}{\langle\boldsymbol{a},\boldsymbol{a}\rangle} = [1,2,-1] \frac{-1}{6} = [\frac{-1}{6},\frac{-1}{3},\frac{1}{6}]$  
  $\boldsymbol{b}^{\perp\boldsymbol{a}} = \boldsymbol{b} - \boldsymbol{b}^{\|\boldsymbol{a}} = [1,1,4] - [\frac{-1}{6},\frac{-1}{3},\frac{1}{6}] = [\frac{7}{6},\frac{4}{3},\frac{23}{6}]$

1. $\boldsymbol{a} = [3,3,12], \boldsymbol{b} = [1,1,4]$

  $\boldsymbol{b}^{\|\boldsymbol{a}} = \boldsymbol{a} \cdot \frac{\langle\boldsymbol{b},\boldsymbol{a}\rangle}{\langle\boldsymbol{a},\boldsymbol{a}\rangle} = [3,3,12] \frac{54}{162} = [3,3,12] \frac{1}{3} = [1,1,4]$  
  $\boldsymbol{b}^{\perp\boldsymbol{a}} = \boldsymbol{b} - \boldsymbol{b}^{\|\boldsymbol{a}} = [1,1,4] - [1,1,4] = 0$
  
  (Note that this also could have been done quickly by seeing that $\boldsymbol{b}$ is a scalar multiple of $\boldsymbol{a}$, and thus lies on the line $\text{Span}\space\{\boldsymbol{a}\}$.)
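
Decompositions like these are easy to verify by machine: the perpendicular part must have zero dot-product with $\boldsymbol{a}$, and the two parts must sum back to $\boldsymbol{b}$. The following is a sketch with plain integer lists and exact fractions; `decompose` is a helper name introduced here, not one of the book's procedures.

```python
from fractions import Fraction

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def decompose(b, a):
    # Returns (b_parallel, b_perp) with respect to a nonzero integer vector a.
    sigma = Fraction(dot(b, a), dot(a, a))
    par = [sigma * ai for ai in a]
    perp = [bi - pi for bi, pi in zip(b, par)]
    return par, perp

# The (a, b) pairs from Problem 8.6.3.
pairs = [([3, 0], [2, 1]), ([1, 2, -1], [1, 1, 4]), ([3, 3, 12], [1, 1, 4])]
results = [decompose(b, a) for a, b in pairs]

for (a, b), (par, perp) in zip(pairs, results):
    assert dot(perp, a) == 0                        # perpendicular part is orthogonal to a
    assert [p + q for p, q in zip(par, perp)] == b  # the two parts sum back to b
    print([str(x) for x in par], [str(x) for x in perp])
```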