In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

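def graph(lambda_arr, xmin, xmax):
    """Plot each function in lambda_arr over the integers xmin..xmax."""
    xplot = range(xmin, xmax+1)
    # Track the overall y range so every curve fits inside the axes.
    ymin = 0
    ymax = 0
    for f in lambda_arr:
        yplot = []
        for x in xplot:
            y = f(x)
            if y < ymin:
                ymin = y
            if y > ymax:
                ymax = y
            yplot.append(y)  # reuse y instead of calling f(x) a second time
        plt.plot(xplot, yplot)
    plt.axis([xmin, xmax, ymin, ymax])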

# linear algebra

**What is linear algebra for?**

Solving systems of linear equations.  Seeing what values can satisfy the constraints of these equations.  Now when you 'solve' a system of equations, you get the answers for each variable.  

One way you could think about it is solving for a bunch of individual values at once.
But it's more than that.
Rather than thinking of x1, x2, x3 as being individual values that you're solving for at the same time, think of it as solving for a single answer, where that answer's components are x1, x2, x3, etc.  Syntactically, a vector is just a list of numbers.  But semantically, that list of numbers comes together to represent the whole, something greater than its parts.  Think of 5d apple space.  What kinds of different apples are there?  You want to get information on all of these apples.  You aren't trying to get just the vitamin C content, just the calorie content, or just the size, etc.  You want to mathematically describe the object that is an apple, which can't be done with just a single number.

**Linear Equation**

A linear equation means you have a bunch of variables, $x_1$, $x_2$, $x_3$, etc, and they are all clumped together in 1 equation.

$ax_1 + bx_2+cx_3+...=y$

If $x_1$ gets increased by 1, the other x's have to change too for the equation to stay true.  But by how much?

It depends on how many other variables there are.  If there's 1 other variable, it's a line, and we know.  If there's 2 or more, then the equation describes a plane or something else, and we don't know.  We need more than 1 equation.

**Systems of linear equations**

A bunch of linear equations.  You want the set of variable values that make all the equations true simultaneously.

If you have 2 lines in 2d space, like so:

$$
x + y = 2 \\
2x + y = 1
$$

The solution is (x, y) = (-1, 3).

Visually, the equations are 2 lines.

If the lines are parallel, there are 0 solutions.

If the lines cross, there is 1 solution.

If the lines are actually the same line, there are infinite solutions.
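Here's a minimal sketch of solving the 2-line example above numerically. It assumes NumPy is available; the names `A`, `b`, and `solution` are just illustrative choices, not anything from the notes.

```python
import numpy as np

# Coefficients and constants for:  x + y = 2,  2x + y = 1
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([2.0, 1.0])

# np.linalg.solve only works when the lines cross exactly once.
# For parallel or identical lines it raises LinAlgError instead.
solution = np.linalg.solve(A, b)
print(solution)  # approximately (x, y) = (-1, 3)
```

Plugging the result back in (`A @ solution`) should give back `b`, which is a quick way to check an answer.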

**Free variables, etc TODO**

If x+y+z=d, and we know d, then we have 2 free variables: pick any 2 of the variables and the third is forced.  This means we could write the equation as z = F(x, y), y=F(x, z), x=F(y, z), or whatever.

A linear equation relates all the variables to a constant.  This is the same as saying 

If you're in N-space, a single linear equation makes a shape of dimensions N-1.



**Matrix**

How do you turn x+y+z=d into a row in a matrix?

Where does the d go?

A function is a transformation on points.

An equation is a set of points.

Perhaps a function is also an equation?

Just replace the function notation with another variable.

What does it look like with multiple free variables?

x1+x2+x3=1

How do you write this in function notation?

x3 = 1 - x1 - x2

F(x1, x2) = 1 - x1 - x2

Is there a difference between rows and columns?

If it was just columns to columns, then matrix matrix multiplication would be way easier.

Matrix multiplication is function stuff, not actual multiplication.

All the functions you've seen so far are basically 1d to 1d vector functions.

If you've seen multivariable functions, that's Nd to 1d.

For each equation, you have a bunch of input variables, and 1 output variable.

The 1 output variable is the dependent variable.

Now what if we wanted a function that had multiple output variables?
What would that even mean?

It would really just be 2 functions.
F(w, x) = 3w+4x
G(w, x) = 2w+3x

Here, F and G are basically y and z.  We have 4 variables.  2 of the variables are dependent.

Before, we had 1 or 2 input variables.  We basically took those variables, transformed them into some other thing, and plotted the untransformed vs the transformed variables.  Here, the input and output aren't going to be graphed at the same time.

A data point, a vector _is_ a function, and vice versa.



`So each row is a linear equation.  What is the thing on the right of the equals sign called?  How does it affect the line?  How is it different from the numbers attached to the x's?`

Look at this equation:

$3x + y = 2$

This is just another form of the equation

$y = -3x + 2$

Without the '+2', this line would go through the origin with a '-3' slope.  With the '+2', it crosses the y-axis at '+2', and the x-axis at '2/3'.

In the latter equation, the '+2' is called the y-intercept.  It shifts the lines away from the origin.  In this case, it shifts the line up 2 spaces on the y axis.

I'm just going to call it the constant term.  Constant for short.  The other things aren't constants, they're coefficients.

`If solving systems of linear equations is all linear algebra is for, why do we even need more complicated stuff like matrix multiplication and determinants?`

Good question.  Here's an example question that would require us to use matrix multiplication.

`Why does adding 2 shift the y-axis?  Why is the x-axis reversed?`

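Because we're adding it to the x side of the equation.  These two equations are equivalent:

$$
y = -3x + 2 \\
y - 2 = -3x
$$

Shifting the line up 2 is the same as shifting the line right 2/3, since $y = -3x + 2$ can be rewritten as $y = -3(x - \frac{2}{3})$.  The sideways shift is the upward shift divided by the slope, and the negative slope is what reverses the direction.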

Now look at these 2 equations:

$$
y = -3x - 2 \\
y + 2 = -3x
$$

These two equations are also equivalent.  Shifting the line 2 down is the same as shifting the line 2/3 left, since $y = -3x - 2$ can be rewritten as $y = -3(x + \frac{2}{3})$.

`If each row in a matrix is a linear equation, why does 3blue1brown talk about the columns all the time?`

`What exactly is a vector?  Is it a point?  Is it an arrow?  A list of numbers?`

`So an augmented matrix represents the coefficients of a bunch of variables, and the thing on the right of the bar is what they equal.  What does a non-augmented matrix represent?  It's just a bunch of variable coefficients without saying what they're all equal to.`

`Why is matrix multiplication defined like it is?`

`What is the difference between an Nx1 matrix and an N-length column vector?`

`What does multiplying a row vector by a matrix on the left mean?  We want to solve linear equations, and those equations are the rows of the matrix.  But you're multiplying by columns, so you multiply by a bunch of x_1's, then a bunch of x_2's, etc.  It doesn't make sense.`

`Would you ever actually want to multiply a row vector by a matrix to its right?  Or is it just a consequence of how we've defined matrix multiplication?`

`Why upper triangular form specifically?  Is there anything better?`
Yes.  The more 0's the better, because that means it's an equation with fewer variables.  If you can get it to be diagonal, you have a solution for each of your variables.

`Can I get a summary?`

You have a bunch of different equations.  These equations represent things that are 'possible', rather than things that necessarily 'are'.  These equations are lines, planes, hyperplanes, etc.

You want to know under what conditions are all of these equations (these things that are possible), possible at the same time.  What vector values (values of our variables $x_1$, $x_2$, etc) will validate all of our requirements (our equations), all at the same time?

Perhaps you should think about the 3 lines on the graph in a different way.  You subtract 1 from the other.  But what if you just added that third thing to the matrix?  Rather than replacing rows, you just add new rows.  It would be the same thing.  The system would be 'overdetermined', but these new equations would be dependent on the old ones, so they wouldn't add any new information.

Equations are descriptions of things.  You have a whole bunch of equations describing things.  You want to find out 'what things in the world fit the description of all of these equations?'  Think of 2 cars going at different speeds. 

$$
car\ A:\ y=2x+1 \\
car\ B:\ y=3x
$$

Both of these equations describe cars.  They describe the relationships of the time the car has been driving (x), and their distance (y).  The 2 and 3 are the respective speeds.  You could ask 'when will these cars be at the same spot?'  In essence, what you are asking is 'when will the y values be the same?'  Since these are linear equations, there is only 1 x value for every y value, so when you find out when the cars are at the same place, you also find out the time that they are at the same place, since there's only 1 possible time they could be at the same place.

`vectors` are lists of numbers.  Here's an example:  

[1, 3, 5, 3.423].  

In regular algebra, you used `scalar` numbers to describe things.  4 apples.  1 person.  3 slices of pie.  In linear algebra, you use vectors to describe things.  Each number in the list describes some characteristic of the thing you're describing.  We could say an apple is described by a vector where the first number describes the apple's weight, the second describes its volume, the third describes its caloric content, and the fourth describes its acid content.

In physics, vectors are often thought of as arrows in space, usually 2d or 3d space.  Each number in these vectors describes one of the d's (dimensions) in those spaces.  Really, it's the same either way.  You could interpret a physics vector as just a list of numbers, and you could interpret an apple vector as an arrow in 4-dimensional apple space.

A scalar is really just a 1d vector.  4 apples in 1d apple space.  1 person in 1d person space.  3 slices of pie in 1d pie space.

`linear combinations` are combinations of vectors.  Vectors can be stretched, squished, and added together.  

You can stretch / squish a vector by multiplying it by a scalar.  

You can add vectors together by adding their individual components together.

The `span` of a set of vectors is all the new vectors you can make through linear combinations of the vectors in your set.
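A small sketch of stretching, adding, and combining vectors, assuming NumPy arrays as the vector type. The vectors `v` and `w` are made-up examples.

```python
import numpy as np

v = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])

stretched = 2 * v            # stretch v by a scalar: [2, 4]
summed = v + w               # add component by component: [4, 1]
combo = 2 * v + 0.5 * w      # a linear combination: [3.5, 3.5]

print(stretched, summed, combo)
```

Every vector you can reach by changing the two scalars in `combo` is in the span of `v` and `w`.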

`i hat, j hat, k hat` are simple vectors that are commonly used to make linear combinations.  i hat is for the x dimension, j hat for y, and k hat for z.  i hat and j hat are commonly used as the basis vectors of $R^2$, meaning that they span $R^2$.  i, j, and k are commonly used as the basis vectors of $R^3$.

$$
\hat{i} = [1, 0, 0] \\
\hat{j} = [0, 1, 0] \\
\hat{k} = [0, 0, 1] \\
$$

Think of a robot that can only move by deciding to move either $\hat{i},\hat{j},$ or $\hat{k}$, or some multiple of any of them.  It could go anywhere in 3-dimensional space.  If you think of a set of vectors as the decisions of a robot, then all the places the robot can go is the span of that set of vectors.

`basis vectors` It turns out that $\hat{i},\hat{j},$ and $\hat{k}$ are special vectors, because they are `linearly independent`: none of them can be built out of the others.  They are also all `orthogonal` to each other, so they have nothing in common in terms of where they can point.  Consider the vector $\hat{g}$ = [0, 1, 1].  $\hat{g}$ is orthogonal to $\hat{i}$, but not to $\hat{j}$ or $\hat{k}$.  The three vectors $\hat{i}$,$\hat{j}$,$\hat{g}$ are still linearly independent and span $R^3$, so technically they are a basis for $R^3$, but they are not an orthogonal basis, because $\hat{g}$ overlaps with $\hat{j}$.  Orthogonal basis vectors are easier to make linear combinations with.

Any vector in $R^2$ can be made with $\hat{i}$ and $\hat{j}$.  For instance, v=[-1, 2] is the same thing as $v=-1*\hat{i}+2*\hat{j}$.

Imagine if your robot wanted to move 1 space in the z direction.  The only way to do that would be to select $\hat{g}$.  But then it would also have to move -1*$\hat{j}$.  This robot is basically z-disabled.  Look at what you've done.  You've crippled your robot.

<details><summary>But this robot is more efficient when it's moving in $\hat{g}$-like directions.  How is this bad?</summary>
We want to be equally efficient in moving in any direction, because we're equally likely to move in any direction.  Orthogonal basis vectors of the same length are equally efficient no matter which direction you're going.
</details>
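To make the robot example concrete, here's a sketch that asks: with the moves $\hat{i}$, $\hat{j}$, and [0, 1, 1], how much of each does the robot need to go 1 step in the z direction? It assumes NumPy; `moves`, `target`, and `coeffs` are illustrative names.

```python
import numpy as np

i_hat = np.array([1.0, 0.0, 0.0])
j_hat = np.array([0.0, 1.0, 0.0])
g = np.array([0.0, 1.0, 1.0])

# Each column is one of the robot's allowed moves.
moves = np.column_stack([i_hat, j_hat, g])
target = np.array([0.0, 0.0, 1.0])  # one step straight up in z

# How much of each move adds up to the target?
coeffs = np.linalg.solve(moves, target)
print(coeffs)      # approximately [0, -1, 1]: one g, minus one j_hat
print(g @ j_hat)   # 1.0, so g and j_hat are not orthogonal
```

The robot can still get there, it just has to undo the sideways part of the slanted move with a negative $\hat{j}$ step.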

`linear transformations` A transformation on a vector just means to change that vector into another vector.  A linear transformation is a transformation that only rotates/flips/stretches/squishes/shears space, keeping the origin in place and straight lines straight.  No curving.

In linear algebra, all vectors are some combination of basis vectors.  In $R^2$, it's $\hat{i},\hat{j}$.  In $R^3$, it's $\hat{i},\hat{j},\hat{k}$.

To linearly transform a vector, just transform its $\hat{i}$ and $\hat{j}$ components, then combine them.

Say we have a vector $\vec{p}=x*\hat{i}+y*\hat{j}$.  A linear transformation changes $\hat{i}$ and $\hat{j}$.  Now the new $\vec{p}=x*(transformed\ \ \hat{i})+y*(transformed\ \ \hat{j})$.  

`matrix` A matrix is a function that does linear transformations on vectors.  They look like this:

$$
\begin{bmatrix}
i_1 & j_1 \\
i_2 & j_2 
\end{bmatrix}
$$

The first column represents the transformed $\hat{i}$, and the second represents the transformed $\hat{j}$.  

This matrix says "turn $\hat{i}=[1, 0]$ into $\hat{i}=[1, 2]$ and turn $\hat{j}=[0, 1]$ into $\hat{j}=[3, 4]$":

$$
\begin{bmatrix}
1 & 3 \\
2 & 4 
\end{bmatrix}
$$
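A quick numerical check of the 'columns are the transformed basis vectors' idea, assuming NumPy. The input vector `p` is an arbitrary example.

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [2.0, 4.0]])
p = np.array([5.0, 6.0])   # p = 5*i_hat + 6*j_hat

transformed_i = A[:, 0]    # first column: where i_hat lands, [1, 2]
transformed_j = A[:, 1]    # second column: where j_hat lands, [3, 4]

by_matrix = A @ p
by_columns = p[0] * transformed_i + p[1] * transformed_j

print(by_matrix, by_columns)  # both are [23, 34]
```

Both routes give the same vector, which is all that matrix-times-vector is doing here.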

What if you wanted to make a matrix for $R^2$ that rotates all vectors 90 degrees left?  Just rotate $\hat{i}$ and $\hat{j}$ by 90 degrees each.  Now you have a matrix that describes how to rotate any vector in $R^2$ 90 degrees left.

$$
\begin{bmatrix}
0 & -1 \\
1 & 0 
\end{bmatrix}
$$

Since all vectors in $R^2$ are just some combo of $\hat{i}$ and $\hat{j}$, by rotating these 2 vectors, you've essentially rotated all the other vectors too.  Note that you could do this equally well for some other set of basis vectors.
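Here's a sketch that applies the 90-degree-left rotation matrix to an arbitrary vector and checks that it really is a quarter turn. NumPy is assumed, and `v` is just an example input.

```python
import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # columns: rotated i_hat, rotated j_hat
v = np.array([2.0, 1.0])

rotated = R @ v
print(rotated)                # [-1, 2]

# A 90 degree turn leaves the length alone and makes the result
# perpendicular to the original (dot product of 0).
print(v @ rotated)
print(np.linalg.norm(v), np.linalg.norm(rotated))
```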

Could you do the same thing with some other set of basis vectors?  What about a spanning set that isn't a basis set?  What does it mean to have a matrix of vectors that doesn't consist of basis vectors?  What about basis vectors that aren't $\hat{i}$ and $\hat{j}$?  Wouldn't that be a transformation that actually does stuff?

<details><summary>Why is matrix multiplication rows to columns?</summary>
If we look at how A*x should be formatted, the first thing we think is that x should be vertical, not horizontal.
    
$$
\begin{bmatrix}
1 & 3 \\
2 & 4 
\end{bmatrix}
*
\begin{bmatrix}
5 \\
6 
\end{bmatrix}
$$

This seems fine.  But what if x was horizontal?

$$
\begin{bmatrix}
1 & 3 \\
2 & 4 
\end{bmatrix}
*
\begin{bmatrix}
5 & 6
\end{bmatrix}
$$

This version of matrix multiplication wastes whitespace.  Specifically the extra space above and below x.  So the quintessential vector is vertical, not horizontal.  So each item in the vector x will either be multiplied by an entire row in the matrix, or an entire column in the matrix.

While it seems to me that multiplying each item in the vector by a row is more intuitive, there is something to be said about multiplying by columns.  If our x vector is a column, it is more thematically consistent to have $\hat{i}$ and $\hat{j}$ be columns in A, so that they are similar to x.

The takeaway from this is that we could have defined it some other way.  There is no deep meaning to how we have defined matrix multiplication other than 'it looks pretty'.

</details>

`determinants` are how much a matrix 'expands' or 'contracts' the space that it transforms.  It doesn't really make sense to think about on a single vector.  Let's look at the below matrix:

$$
\begin{bmatrix}
3 & 0 \\
0 & 2 
\end{bmatrix}
$$

Here, $\hat{i}$ gets stretched by a factor of 3, and $\hat{j}$ gets stretched by a factor of 2.  If you think of $\hat{i}$ and $\hat{j}$ as making a little 1x1 box before, now they make a 2x3 (j comes before i) box, so the determinant is 6.  Again, the determinant is how much the matrix changes space.  It is the ratio of the new area to the old area of the box made by i and j.  Since the old area is always 1, the stretching is always (new area)/(old area) = (new area) / 1 = (new area).  So the determinant is the area of the new box.

But it's more complicated than that.  With the matrix: 

$$
\begin{bmatrix}
3 & 0 \\
0 & 0 
\end{bmatrix}
$$

Our 2d space is now a 1d space.  Any time you get rid of dimensions, the determinant is 0.

The determinant can also be negative.  

$$
\begin{bmatrix}
-1 & 0 \\
0 & 1 
\end{bmatrix}
$$

An area can't be negative, and that's kind of where our 'area' analogy breaks down.  Think of a negative determinant as our square flipping over onto its backside, since j is now 'on the right' of i instead of 'on the left'.

Imagine you had a 2d picture of mario's face, completely symmetric except for a single mole on his right cheek.  If the determinant is negative, mario's face flips, and the mole is now on his left cheek.

Only square matrices have determinants.  What would it mean to 'expand' a 2 dimensional square into 3 dimensions via a 3x2 matrix?  What is its area now?  The square becomes a tilted parallelogram floating in 3d space.  It's still flat, so it has no volume, and there's no 3d box to compare it against, so a single 'how much did space expand' number doesn't really make sense.

Speaking of volume, the determinant for volume can also be negative using the same 'mario mole' example.  But it is slightly more complicated.  If any 2 of i, j, and k switch places, the determinant will be negative.  If all 3 rotate places (i goes to j's spot, j to k's, k to i's), that's the same as 2 swaps, so the mole will still be on mario's right cheek and the determinant will be positive.  The generalization is that every swap flips the sign, so an odd number of swaps gives a negative determinant and an even number gives a positive one.

Determinant formula:

$$
det(\begin{bmatrix}
a & b \\
c & d 
\end{bmatrix}) = ad-bc
$$

Consider the case where b and c are 0.  In this case, i and j are just scaled along their respective axes.  If you add a positive b and c, both i and j are now pointing up and to the right.  The closer i and j get, the more the determinant parallelogram gets squeezed, and the less area it has.  So b and c combine to form a kind of 'negative area' that subtracts from the total area of the determinant.  Of course, if b\*c turns out to be negative, the determinant will increase in value.

The determinant tells us how much areas change, not how much distances change.  If the determinant of a matrix is 6, every region that originally had an area of 1 will now have an area of 6.  Every region that had an area of 2.5 will now have an area of 15.  Distances between vectors can stretch by different amounts in different directions, so the determinant alone doesn't tell you those.

For now, skip trying to reason out why this formula works exactly.  Also skip for 3d and onward.  Maybe come back to it.  But you can also decompose into triangular matrices for high dimension matrices.  Maybe that formula will be easier to explain.
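As a sanity check on the $ad-bc$ formula, here's a sketch that compares it with NumPy's determinant on the matrices from this section, plus one shear I made up. `det2` is a hypothetical helper name.

```python
import numpy as np

def det2(a, b, c, d):
    """Determinant of [[a, b], [c, d]] using ad - bc."""
    return a * d - b * c

examples = {
    "stretch": (3, 0, 0, 2),    # 1x1 box becomes a box of area 6
    "collapse": (3, 0, 0, 0),   # 2d squished onto a line
    "flip": (-1, 0, 0, 1),      # mirror image, so the sign goes negative
    "shear": (1, 1, 0, 1),      # slanted box, same area as before
}

for name, (a, b, c, d) in examples.items():
    by_formula = det2(a, b, c, d)
    by_numpy = np.linalg.det(np.array([[a, b], [c, d]], dtype=float))
    print(name, by_formula, round(by_numpy, 6))
```

The two columns of numbers should agree up to floating-point rounding.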

`rank` the number of dimensions in the matrix output.  A 3x2 matrix takes in 2-vectors and outputs 3-vectors, but the outputs are all combinations of its 2 columns, so they fill at most a plane in 3-space and the rank is at most 2.  If it squishes everything onto a line, its rank is 1.  Or if it turns every vector into [0 0 0], its rank is 0.

`column space` is the span of all the vectors that your matrix can output.  So it's the span of the columns, since your matrix outputs linear combinations of its columns.

`null space` all the vectors your matrix turns into 0.  The zero vector is always in there.  If it's any more than just the zero vector, your matrix is squishing space.

Also need inverse matrices, row space, whatever.
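Here's a sketch of rank and null space for a few 3x2 matrices, assuming NumPy's `np.linalg.matrix_rank`. The three matrices are made-up examples whose outputs fill a plane, a line, and a single point.

```python
import numpy as np

plane = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])   # columns point in different directions
line = np.array([[1.0, 2.0],
                 [2.0, 4.0],
                 [3.0, 6.0]])    # second column is 2x the first
zero = np.zeros((3, 2))          # everything goes to [0, 0, 0]

for name, M in [("plane", plane), ("line", line), ("zero", zero)]:
    print(name, np.linalg.matrix_rank(M))

# [2, -1] is in the null space of `line`:
# 2*(column 1) - 1*(column 2) = [0, 0, 0]
null_vec = np.array([2.0, -1.0])
print(line @ null_vec)
```

Note that none of the ranks can exceed 2, since a 3x2 matrix only has 2 columns to combine.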

`dot product` is a special kind of linear matrix transformation.  It's used to discover the relationship between 2 vectors v and w.  If the dot product is 0, they're perpendicular.  If they're facing roughly the same direction (less than 90 degrees apart), it's positive.  If they're facing roughly opposite directions (more than 90 degrees apart), it's negative.  In general, the dot product is used to measure 'similarity' between 2 vectors.

Remember all the vectors from 3blue1brown being folded onto a plane?  It's like that.  Notice that the arrow pointing perpendicular to the plane gets 'folded' by being squished down into nothing.  So here you're squishing a vector into another vector.  If that squished vector is 0, you know the 2 vectors are perpendicular.

You take 2 vectors, say v and w, of the same dimension and matrix multiply them into a single number.  In this instance, either v or w could be considered the matrix, and the other the vector.  Really what this shows is that vectors are just matrices.  A vector is really an nx1 matrix, and when you lay it on its side as a 1xn matrix it becomes a transformation, where the n is each individual actual dimension of whatever, and the 1 output is a single measure whatever it is.  It's kind of weird to think about for a lot of things.  I guess if you took the dot product between 2 apple vectors and they were perpendicular, they would be like, apples with nothing in common or something?  But with an apple you can't have any negative attributes, so that could only happen if they never had the same attribute at all.  So that doesn't really make sense.

It is kind of weird that the dot product is so 'asymmetric'.  That's covered in the video.  Maybe explain that later.

Perhaps also clear up that the dot product isn't some operation between 2 lines, but an operation between 2 points.  I think you might be getting a little confused in your intuition of vectors, thinking they're lines instead of points.

`What is the significance of 2 vectors being perpendicular?`

If we have a set of vectors that span a space, those vectors are a basis if they are all linearly independent.  If all of their dot products with each other are 0 (and none of them is the zero vector), they are guaranteed to be independent.  It doesn't work the other way around though: vectors can be independent without being perpendicular.  If we have a basis of vectors, that's better than a regular spanning set of vectors, because it's simpler.  An orthogonal basis is simpler still.  I think it's less computationally expensive.  That's at least 1 reason.  Remember the example with the robot.

`cross product` take 2 vectors v and w in 3d, and you can use the cross product to find a third vector p that is perpendicular to both v and w at the same time.  Also p's length will be equal to the area of the parallelogram created by v and w (as though v and w were i and j).  It's pretty simple to find a vector orthogonal to v, and simple to find a vector orthogonal to w, because in 3d there are infinitely many directions orthogonal to just one of them.  So how do you pick out the one direction that is orthogonal to both v and w?  With the cross product.
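A sketch of the cross product using NumPy's `np.cross`, with two made-up vectors lying in the x-y plane so the answer is easy to eyeball.

```python
import numpy as np

v = np.array([1.0, 0.0, 0.0])
w = np.array([1.0, 2.0, 0.0])

p = np.cross(v, w)
print(p)                    # [0, 0, 2]: straight up the z axis

# p is perpendicular to both inputs
print(p @ v, p @ w)

# The parallelogram made by v and w has base 1 and height 2,
# so its area is 2, which matches the length of p.
print(np.linalg.norm(p))
```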



`eigenvectors` look at the 'change of basis' video maybe.  Only nxn (square) matrices have eigenvectors.  A vector is an eigenvector for a matrix if that matrix stretches/squishes (or flips) the vector but doesn't rotate it off its own line.

`Eigenvalues` each eigenvector is stretched/squished by a certain amount; its eigenvalue.

`diagonal matrix` are easy to work with.  Can easily compute things many times.  All 0's except the diagonal.  The basis vectors are all eigenvectors of a diagonal matrix.

`eigenbasis` how do I compute the 100th power of this matrix?  Easiest way is to do change of basis to some eigenbasis, do the 100th power, then change of basis back.  Not all matrices can be diagonalized (translated).  Things like a rotation, which doesn't have any (real) eigenvectors.  But there are other ways to do those.  Maybe break it up into its rotational part and its scaling part.  Then you can change of basis the scaling part, then do that modulo thing for the rotation.
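Here's a sketch of the 'change to an eigenbasis, take the power, change back' recipe, assuming NumPy's `np.linalg.eig`. The matrix `A` is a made-up symmetric example (eigenvalues 1 and 3) chosen so that it definitely can be diagonalized.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of P are eigenvectors; eigenvalues holds their stretch factors.
eigenvalues, P = np.linalg.eig(A)

# In the eigenbasis the matrix is diagonal, so the 100th power is
# just each eigenvalue raised to the 100th power.
D_100 = np.diag(eigenvalues ** 100)
A_100 = P @ D_100 @ np.linalg.inv(P)   # change of basis back

# Compare with repeated multiplication.
direct = np.linalg.matrix_power(A, 100)
print(np.allclose(A_100, direct))
```

The two results should agree up to floating-point rounding. For a matrix without a full set of real eigenvectors (like a rotation), this sketch would not apply as written.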

`why is matrix multiplication defined like it is?`

I think this might be your answer:
https://math.stackexchange.com/questions/31725/intuition-behind-matrix-multiplication

I don't know where to put this, but linear combinations of your equations don't create a new 'constraint', as you put it earlier.  I don't know where this will go, but I feel like it will go somewhere.


Remember that one paper on the nearest-neighbor algorithm, and how they figured out how to make it faster by not using the euclidean definition of ‘nearest’?  That makes me wonder what other problems we could solve by discarding this notion that euclidean distance is the ‘best’ (or ‘optimal’, haha) measure of optimality.  Like, sometimes manhattan distance is best.  When is variance the best indicator of spread, and when is it not?  Hm…. I feel like there’s something here I’m not quite getting.  Can you optimize your notion of optimality?  Obviously not for simple problems like 'maximize the money you're making', but for more complicated stuff.

On page 209 of the CS170 textbook, they talk about duality.  Notice that the dual for 
$$Ax \leq b$$ 
is 
$$y^TA\geq c^T$$

This might be your answer to 'what's the point of being able to multiply by a matrix on the left?'

`EE127`

In EE127, a `function` is a function f: $R^n \rightarrow R$

A `map` is a function g:$R^n \rightarrow R^m$

So a function has only scalar outputs, and a map has vector outputs.

A `graph` for a function f is exactly what you would expect it to be.

An `epigraph` for a function f is the graph, plus everything above the possible values of f.  Maybe it's to show feasible solutions?

A `contour`, or contour line, is something for functions with multiple input variables.  A contour line is a squiggly oval thing (a closed curve) with a value.  If a contour has a value of 3, the contour line is saying 'all of the x, y values that I have will all output a z value of 3'.  So the contour line would be a squiggle looping around the z-axis at a height of 3.

You can see an example of contour lines in the CS170 textbook on linear programming.

A `level set` is a contour.

A `sublevel set` is a contour plus everything where the function is lower: all the inputs whose output is less than or equal to the contour's value.  It's a bunch of contours with different values, all at or below that value, on the same graph.

`What does it mean if we have a bunch of constraints and no objective function?`

Then we're just trying to find any feasible solution.  No need for maximizing/minimizing.

`If there are feasible solutions, is there always an optimal solution?`

Not always.  Sometimes the best value can only be approached, never actually reached.  Look at this problem:

min $e^{-x}$

It just gets smaller and smaller the larger you make x.  It gets as close to 0 as you like but never reaches it, so you can never really get the optimal solution, because the optimal solution would need $x=\infty$.

`Why would we ever go for a suboptimal solution?`

Calculating the exact optimal solution sometimes takes a really long time.  We sometimes instead go for the $\epsilon$-suboptimal set of solutions.  $\epsilon$ is just some value that says 'if you're $\epsilon$ away from the optimal solution, you're good enough.'

`If we're going for suboptimal solutions with this epsilon thing, we have to know the optimal solution in the first place.  How is this helpful at all?`

Remember you want to calculate some vector x that will give you an optimal solution.  Maybe there's some easy way to figure out the bounds on an optimal solution without actually calculating x in the first place.

`What exactly is a convex function?  What makes them easy to solve?`

A function is convex if you can pick any 2 points on the function's graph and draw a line segment between them without that segment ever dipping below the graph.  So a line or an upward-opening parabola are convex.  The S thing generated by $f(x)=x^3$ is not convex, because if you draw a segment between, say (-1, -1) and (1, 1), the left half of the segment runs below the curve (at $x=-0.5$ the segment is at $-0.5$, but $f(-0.5)=-0.125$).

Convex functions are easy to solve because any local minimum is also a global minimum.  With no extra local minimums to throw you off, it's much easier to find the actual minimum, and therefore the optimal answer.

Concave functions are the upside-down version: any local maximum is a global maximum.  Maximizing a concave function is equivalent to minimizing a convex one.
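One way to poke at convexity numerically is the usual textbook test: for every pair of sample points, the midpoint of the connecting segment must sit on or above the graph. The sketch below is only a spot check on sampled points, not a proof, and `looks_convex` is a hypothetical helper name.

```python
def looks_convex(f, xs, tol=1e-12):
    """Return True if no sampled chord dips below the graph at its midpoint."""
    for a in xs:
        for b in xs:
            mid = (a + b) / 2
            # Convexity needs f(mid) <= average of f(a) and f(b).
            if f(mid) > (f(a) + f(b)) / 2 + tol:
                return False
    return True

xs = [i / 10 for i in range(-20, 21)]   # samples from -2 to 2

print(looks_convex(lambda x: x ** 2, xs))      # True
print(looks_convex(lambda x: 3 * x + 1, xs))   # True (a line just barely passes)
print(looks_convex(lambda x: x ** 3, xs))      # False
```

The cubic fails on the negative side, for example with the chord from $x=-2$ to $x=0$.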

`Why is the book definition of convex optimization so formal and obtuse?`

Because this definition works for more than just linear programming, which is what you learned before.

It works for least squares, linear programming, quadratic programming, nonlinear optimization, convex optimization, and combinatorial optimization.

Maybe you could take your linear programming example and convert it to this more formal notation?

Maybe try to do the discussion before you do the homework?

`What is an affine set?  Why are we even making a distinction?`

First, look at these two vectors in $R^3$: 
$$
v_1=[1, 0, 0]\\
v_2=[0, 1, 0]
$$

They span the subspace of the x-y plane in $R^3$.  So you can describe this plane with $v_1$ and $v_2$.  It describes every vector in $R^3$ that has a z component of 0.

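Now try to think of two vectors in $R^3$ that together span every vector in $R^3$ with a z component of 1.  Essentially, the x-y plane lifted up 1.  It's not possible with 2 vectors, or any number of vectors, because every span includes the zero vector and this plane doesn't.

However, it is possible with 3 vectors, if one of them is used as a fixed offset instead of as part of the span.  To describe this boosted x-y plane, we use notation something like this: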

$$A=S+v_0$$

S is the subspace of the x-y plane, and we'll say that $v_0$ is the vector (0, 0, 1).  So $v_0$ isn't part of the span of the other 2 vectors, it's a vector that is added to every single vector in the span, boosting them all by 1 in the z direction.

This boosted plane, A, is called an 'affine set'.  It is distinguished from subspaces because it can't be described by a set of vectors that 'span' a space.

`What is a norm?`

Here's the general formula:

$$
\sqrt[n]{|x_1|^n+|x_2|^n+...}
$$

If n is 1, it's manhattan distance.  If n is 2, it's euclidean distance.  If we take n to its limit of infinity, it's the biggest (in absolute value) of x's components.

As n gets bigger, the magnitude gets smaller (or stays the same).

`Why is the infinity norm the maximum of the vector's components?`

As n gets bigger and bigger, the biggest (in absolute value) of all the $x_i$'s will start to dwarf the others.  So all the other x's become insignificant, basically nothing.  So when n is really really big, the equation might as well be:

$$
\sqrt[\infty]{|x_{biggest}|^\infty}
$$

The infinite power and infinite root cancel out, and we are left with $|x_{biggest}|$.
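To watch the norm settle onto the biggest component, here's a plain-Python sketch. `p_norm` is a hypothetical helper, the exponent is called `p` here to match the common name 'p-norm' (it plays the role of n above), and the vector is made up.

```python
def p_norm(x, p):
    """(|x_1|^p + |x_2|^p + ...)^(1/p)"""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

x = [3, -4, 1]

for p in (1, 2, 10, 100):
    print(p, p_norm(x, p))

# The value the norms are shrinking toward: the biggest absolute component.
print("max", max(abs(xi) for xi in x))
```

With p = 1 you get 8 (manhattan), with p = 2 you get $\sqrt{26}$ (euclidean), and by p = 100 the result is already very close to 4.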

`What is the Cauchy-Schwarz inequality?`

The absolute value of the dot product of 2 vectors is less than or equal to their euclidean magnitudes multiplied together.

They're equal when the size of the dot product is maximized.  That happens when the two vectors point in the exact same or exact opposite direction.

The formula is:

$$
|x^T*y| \leq \sqrt[2]{x_1^2+x_2^2+...} * \sqrt[2]{y_1^2+y_2^2+...}
$$

It's a little more obvious if you consider the case where x == y, since the euclidean norm is just a dot product which you take the square root of:

$$
y^T*y \leq \sqrt[2]{y_1^2+y_2^2+...} * \sqrt[2]{y_1^2+y_2^2+...} \\
y^T*y \leq y_1^2+y_2^2+... \\
y^T*y == y_1^2+y_2^2+...
$$

If you're still not convinced, consider taking one of the y's and stretching /shrinking it.  The equality will still hold.  Rotate the vector, and the dot product will be less than the norm-multiplying.

Perhaps there's some tip to tail analogy here?  If the vectors are pointing in the exact same direction, the total distance is maximized.  But when you 'bend' or push the 2 vectors so that they're not pointing in the same direction, that can only make the total combined distance less.

Cauchy-Schwarz also implies the following:

$$
\frac{x^T*y}{\sqrt[2]{x_1^2+x_2^2+...} * \sqrt[2]{y_1^2+y_2^2+...}} \leq 1 \\
$$
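Since that ratio always lands between -1 and 1, it can be read as the cosine of the angle between the two vectors. Here's a sketch assuming NumPy, with two made-up vectors that are 45 degrees apart.

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

dot = x @ y
bound = np.linalg.norm(x) * np.linalg.norm(y)   # product of the lengths

print(abs(dot) <= bound)          # the inequality itself

cos_angle = dot / bound           # always between -1 and 1
angle_degrees = np.degrees(np.arccos(cos_angle))
print(angle_degrees)              # approximately 45
```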

Segue to the 'Angles between vectors' thing here.

Imagine a crazy bug that only wants to move in 1 specific vector direction.  If someone picked up the bug and moved it along a vector perpendicular to the direction it wanted to go, it would not care.  Imagine the perpendicular vector as drawing an infinite starting line.  If you move the bug ahead of the starting line (positive dot product) it's happy, and if you move it behind the starting line (negative dot product) it's angry.

You have x, the vector you're projecting onto, and y, the vector that's getting projected.  Slide y along the line perpendicular to x until x and y are pointing in the same direction.

`Ok, yeah, orthogonality is when the dot product is 0.  But then the way we explain the dot product is that when it's 0 it means the 2 vectors are orthogonal.  When do you actually use the dot product?  What is it useful for?`

The projection of a vector onto a line, or a plane, or a whatever, calculates the answer to the question 'I have a vector and this other thing.  What point on this other thing is closest to this vector?'

By 'closest' we mean distance, and by distance, we mean euclidean distance, usually.

You wouldn't calculate a vector's projection onto another vector as a point.  That doesn't make any sense.  'Which point on x is closest to this point y?'  x is only 1 point, so of course the answer is x.  Instead, we figure out a line that goes in the same direction as x, then project y onto that.
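Here's a sketch of projecting y onto the line through x using the standard formula $\frac{x^Ty}{x^Tx}x$, assuming NumPy. The vectors are made up, and the brute-force loop at the end is only there to check the 'closest point' claim.

```python
import numpy as np

x = np.array([2.0, 0.0])   # direction of the line we project onto
y = np.array([3.0, 4.0])   # the point being projected

scale = (x @ y) / (x @ x)  # how far along x to go: 6 / 4 = 1.5
projection = scale * x     # closest point on the line: [3, 0]
residual = y - projection  # what's left over: [0, 4]

print(projection, residual)
print(residual @ x)        # 0.0, the leftover part is perpendicular to x

# Brute-force check: no other sampled point t*x on the line is closer to y.
best = np.linalg.norm(residual)
others = [np.linalg.norm(y - t * x) for t in np.linspace(-5, 5, 101)]
print(best <= min(others) + 1e-12)
```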

Ok, trig functions.  An angle is measured in radians.
Radians are some constant times pi.
Pi is the ratio of the circle's circumference to its diameter.
Tau is the ratio of the circle's circumference to its radius.
A circle's area is pi*r*r
which is the same as (C/2r)*r*r
which is the same as Cr/2
So using pi is good because then you don't have to measure circumference.
Since circumference is just a function of the radius / diameter.
But perhaps diameter is easier to measure?  Just start at any point on the circle,
and sweep your ruler or whatever across the circle.  The point at which this sweep
is maximized is the diameter.  But then I guess radius is just as easily gotten.
Just 1 extra step.
We say that an angle of 2pi is the same as 0, because why?  Why does it stop at 2pi?
Why not pi?
An angle is measured as some slice of a circle of radius 1, and area pi.
Why not just measure angles directly, with some fraction of 1?
Well, I guess it would only ever go up to 1/2.
Once you got past 1/2 it would start going down again.
So I guess you only need a half circle to measure an angle.
So I guess it makes sense.  Kind of.
But why pi/2 for a 90 degree angle?  Why not 1/2?  Just have the radius be root(pi).
Maybe it has something to do with the triangle stuff.

Why don't cos and sin always add up to 1?  Why do we square them before adding them?
Maybe it has to do with euclidean norm.

So cos measures angles equalness/oppositeness, and sin measures their perpendicularness.
A helpful way to think about perpendicularness is that since it's a right angle,
all you have to do is rotate it and you get vectors in the x/y direction.
So they, as you thought of earlier, 'have nothing in common'.
Their components are completely distinct.

Any 2 vectors, when put tail to tail, form a right triangle.  
Just put one of them on the x axis.
Now if you actually want a right triangle the 2 vectors will have to be the same length.
Let's shrink the vectors so they're both size 1.
This way they'll definitely make a right triangle.
Hm...but how exactly do we shrink the vector?
Obviously we want it to be size 1, but what exactly is the size of a vector?
The euclidean distance is the most natural choice.
Wait.  But is it?
It's the best for physical phenomena.  It's just like how we measure things with a ruler.
But what about other stuff?  Like your 'apple vector' thing earlier?
What if your idea of 'best' was just the total of the vitamins?
If apple X had 30, 30, 30, and apple Y had 50, 25, 0, then X has a larger L1 norm, but Y has a larger L2 norm.
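
The apple comparison can be checked with a few lines of plain Python.  This is just a minimal sketch; the helper names `l1_norm` and `l2_norm` are made up for illustration.

```python
import math

def l1_norm(v):
    # Manhattan size: sum of the absolute values of the components
    return sum(abs(c) for c in v)

def l2_norm(v):
    # Euclidean size: square root of the sum of the squared components
    return math.sqrt(sum(c * c for c in v))

apple_x = [30, 30, 30]
apple_y = [50, 25, 0]

print(l1_norm(apple_x), l1_norm(apple_y))  # 90 75
print(l2_norm(apple_x), l2_norm(apple_y))  # roughly 51.96 and 55.90
```

So which apple is 'bigger' really does depend on which norm you pick.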

What is the dot product's connection with cos?

Wait, let's go back to the whole 'how do we measure an angle' thing.
You need a function that tells you 'these 2 vectors are going in the same direction'
it also needs to say 'these 2 vectors are perpendicular'
it also needs to say 'these 2 vectors are going opposite of each other'
and everything in between.
what should the range of this function be?
well, you need the range to be symmetric.  
xMax for same direction, 0 for perpendicular, xMin for opposite direction.
What's the simplest xMax and xMin you can think of?  1 and -1.
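
A function with exactly that range is the dot product divided by both lengths.  Here is a minimal sketch in plain Python; the name `cosine` is my own, and it assumes neither vector is the 0 vector.

```python
import math

def cosine(u, v):
    # dot product divided by the product of the two Euclidean lengths
    d = sum(a * b for a, b in zip(u, v))
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    return d / (len_u * len_v)

print(cosine([1, 0], [2, 0]))   # 1.0, same direction
print(cosine([1, 0], [0, 3]))   # 0.0, perpendicular
print(cosine([1, 0], [-4, 0]))  # -1.0, opposite direction
```

Note that the lengths of the inputs don't matter, only their directions.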


---------------

a spanning set is a set of vectors that can be combined to create anything in a vector space.  It is a finite set of vectors.
A vector space gets spanned by a spanning set.  It is an infinite set of vectors.
A basis is a special kind of spanning set.  All of its vectors are independent.

A={v0=(1, 0), v1=(0, 1), v2=(0, 2)} is a spanning set for R2
A is not a basis of R2, since v1 and v2 are dependent on each other (v2 = 2 * v1).

B={v0=(1, 0), v1=(0, 1)} is also a spanning set for R2.
B is a basis of R2, since v0 and v1 are independent.

R2 is spanned by A.
R2 is the span of A.
These statements are equivalent.
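
One way to check these claims numerically is with the rank of the matrix built from the vectors.  This is a sketch assuming NumPy is available; `is_spanning_set` and `is_basis` are made-up helper names.

```python
import numpy as np

def is_spanning_set(vectors, dim):
    # the vectors span R^dim when the matrix built from them has rank dim
    return bool(np.linalg.matrix_rank(np.array(vectors, dtype=float)) == dim)

def is_basis(vectors, dim):
    # a basis of R^dim is a spanning set with exactly dim vectors
    return len(vectors) == dim and is_spanning_set(vectors, dim)

A = [(1, 0), (0, 1), (0, 2)]
B = [(1, 0), (0, 1)]

print(is_spanning_set(A, 2), is_basis(A, 2))  # True False
print(is_spanning_set(B, 2), is_basis(B, 2))  # True True
```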

# Template

`Motivation`

`Formula`

`Intuition`

`Proof`

`Questions`

the dot product of 2 orthogonal vectors is 0.  X and Y.
The definition of orthogonality comes from the dot product of X and Y.
Orthogonality means X and Y are at right angles to each other.  It's stronger than independence: non-zero orthogonal vectors are always independent, but independent vectors don't have to be orthogonal.
Right angles means X and Y form a right triangle.
Right triangle means a whole bunch of interesting properties.
    Like you can easily compute the length of the sum of X and Y.
    Wait, can you do this easily even if they're not orthogonal.
    Means the projection of X onto Y is 0, and Y onto X is 0.
    
Given X and Y, you can break X into any number of components.
But what is interesting is when you break X into 2 components:
    O, The component of X orthogonal to Y
    P, The component of X parallel to Y
    
    
X = O + P
<X, Y> = ???
<O, Y> = 0
<P, Y> = |P| * <1 in the direction of Y, Y> = |P| * ????
Maybe Y should be of size 1?
After all, we're trying to project onto a line, not a vector, really.
A projection has nothing to do with Y's length.
Only its direction.
In this case, the dot product would be P.

<x=(2, 2), y=(4, 4)> = 2\*4+2\*4 = 16
|x| = root(8)
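|y| = root(32)
|x| * |y| = root(256) = 16, which equals the dot product because x and y point in the same direction.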

It's not the X, Y, and sine that make the triangle.  It's X, O, and P that make the triangle.

proj x onto y = cY
c = <x, y> / <y, y>

So to project X onto Y:
Find Y_1 = Y / |Y|, which is a 1 vector in the direction of Y
proj X onto Y = <X, Y_1> * Y_1 (since it's a vector)

We can get this same thing by doing:
<X, Y> / <Y, Y> * Y, since Y = |Y| * Y_1, so
<X, Y> / <Y, Y> = <X, Y_1> * |Y| / (<Y_1, Y_1> * |Y| * |Y|) = <X, Y_1> / (<Y_1, Y_1> * |Y|) = <X, Y_1> / |Y|.
The <Y_1, Y_1> = 1 thing is because the dot product with yourself is length squared, and Y_1's length is 1.
Multiplying by Y = |Y| * Y_1 cancels the |Y|, which leaves <X, Y_1> * Y_1.

<X, Y> = <P + O, Y> = <P, Y> + <O, Y> = <P, Y> + 0 = <P, Y>.
You would never actually compute P, you would just do X.
This is just to show that the dot product using X or P is the same.
And P is the thing you want to get, so yeah.
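
Here is a small sketch of the X = O + P split in plain Python, to check the claims above on made-up numbers.  The helper names `dot` and `project` are my own.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, y):
    # P = (<x, y> / <y, y>) * y, the component of x parallel to y
    c = dot(x, y) / dot(y, y)
    return [c * yi for yi in y]

x = [3.0, 4.0]
y = [2.0, 0.0]

p = project(x, y)                      # parallel component P
o = [xi - pi for xi, pi in zip(x, p)]  # orthogonal component O = X - P

print(p, o)                  # [3.0, 0.0] [0.0, 4.0]
print(dot(o, y))             # 0.0, so O really is orthogonal to Y
print(dot(x, y), dot(p, y))  # 6.0 6.0, so <X, Y> = <P, Y>
```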


for the last part of question 2, they give you the hint y->min of x-prime (x').  Mong says the prime doesn't matter.  You can think of it as just x.
Wait could you just use duality for this question and show that they're equivalent?  That would show something stronger than what they're asking for.


An angle is the fraction of a circle that 2 vectors create when you join them tail to tail.  It's a slice of pie.


# Vector Projection

`Motivation`

Say we have 2 vectors, x and y.  The projection of x onto y is kind of like the portion of x that is going in the y direction.  

If you have a heavy box, and x is the direction you're pushing the box in, and y is the inclined plane you're pushing the box up, then the projection of x onto y is the part of x that 'counts', because directing the force any other way is wasted.

`Formula`

To understand the formula, you'll need to know that the dot product of 2 orthogonal vectors is 0.

$$
proj(a\ on\ b) = \frac{\langle a, b \rangle}{\langle b, b \rangle} b
$$

`Intuition`

`Proof`

`Questions`

# Dot product

`Motivation`



In addition to being used for vector projection, you might also take the dot product between any 2 vectors to see if they're orthogonal.

The dot product of $\vec{x}$ and $\vec{y}$ is the magnitude of x projected onto y.  Equivalently the magnitude of y projected onto x.

Wait, this is wrong.  Y would need to be of size 1 in order for this to be true.

The only time you would bother taking a dot product where neither vector is size 1 is when you're testing for orthogonality.

`Formula`

Dot notation:

$\vec{x} \cdot \vec{y} =  x_1 * y_1 + x_2 * y_2 + ...$

Angle bracket notation:

$\langle \vec{x}, \vec{y} \rangle = x_1 * y_1 + x_2 * y_2 + ...$

`Intuition`

`Proof`

`Questions`

# Norms

`Motivation`

Norms tell you the size / length of a vector, in different senses of the word 'size'.

The L1-norm is Manhattan distance.

The L2-norm is Euclidean distance.

The L-Infinity-norm is the largest absolute value among a vector's components.

`Formula`

`Intuition`

`Proof`

`Questions`

# Cauchy-Schwarz Inequality

`Motivation`

Proves that the dot product of x and y is less than or equal to their lengths multiplied together.

$$
\langle a, b \rangle \leq \sqrt{\langle a, a \rangle} \, \sqrt{\langle b, b \rangle}
$$

Basically says 'the dot product of a and b is less than or equal to their lengths multiplied.'

It's equal if they're collinear and pointing the same way.

Cauchy-Schwarz is equivalent to (for non-zero a and b):

$$
\frac{\langle a, b \rangle}{\sqrt{\langle a, a \rangle} \, \sqrt{\langle b, b \rangle}} \leq 1
$$

It turns out that this ratio is the cosine of the angle between the two vectors, like so:

$$
\frac{\langle a, b \rangle}{\sqrt{\langle a, a \rangle} \, \sqrt{\langle b, b \rangle}} = \cos(\theta)
$$



Not sure why this is true exactly.  I think they try and explain it.  Maybe come back to it later.
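
One way to see where the cos comes from, assuming the law of cosines from plain geometry.  Put a and b tail to tail, so the third side of the triangle is a - b, and $\theta$ is the angle between a and b.  Writing $|a|$ for the length $\sqrt{\langle a, a \rangle}$:

$$
\begin{aligned}
|a - b|^2 &= |a|^2 + |b|^2 - 2 |a| |b| \cos(\theta) && \text{(law of cosines)} \\
|a - b|^2 &= \langle a - b, a - b \rangle = \langle a, a \rangle - 2 \langle a, b \rangle + \langle b, b \rangle && \text{(expand the dot product)} \\
\langle a, b \rangle &= |a| |b| \cos(\theta) && \text{(compare the two lines)}
\end{aligned}
$$

Since cos never goes above 1, this also gives the inequality itself.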

Let's go back to the 'dot product of a and b is less than or equal to their lengths multiplied.'  I'm internally thinking of <b, b> as a square made by 2 vectors, one being b and the other being perpendicular to b, and the same length.  And <a, b> is some kind of rhombus.  But if you normalize, wouldn't those be the same area?  Maybe you're thinking of it wrong.

`Formula`

`Intuition`

`Proof`

`Questions`

# Orthonormal basis

`Motivation`

An orthonormal basis is the best basis.  It's a spanning set where there are no dependent vectors, all the vectors are at right angles to each other, and all the vectors have length 1.  So for $R^3$, the typical orthonormal basis is (1, 0, 0) (0, 1, 0) (0, 0, 1).

I think this is important for change of basis.

Gram-Schmidt procedure: used to turn a basis into an orthonormal basis.

Take first vector, make it have length 1.  Take second vector, find its projection onto first vector.  Subtract projection from second vector.  Second vector - projection of second vector onto first vector = new vector that is orthogonal to vector 1.  Now vector 1 and 2 have a 90 degree angle between them.  Now take the third vector.  Do this for _both_ of the previous vectors.  So for the third one you do 2 subtractions.  Also remember that you have to normalize all of them.

Remember, subtract the projection, _then_ normalize.  If you normalize, then get the projection, then subtract the projection, it won't be length 1 anymore, and you'll have to normalize it again.

If vector a and b are dependent, then when you do this procedure you'll end up with either a or b being the 0 vector, depending on which one you do first.
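
The procedure can be sketched in plain Python like this.  It's a minimal sketch, not a robust implementation: the names `gram_schmidt` and `tol` are my own, it subtracts from the running remainder (the variant usually called modified Gram-Schmidt), and it drops a vector that collapses to (nearly) 0 instead of keeping a 0 vector around.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-12):
    basis = []
    for v in vectors:
        w = list(v)
        # subtract the projection onto every vector already in the basis
        for q in basis:
            c = dot(w, q)  # q has length 1, so no division by <q, q> is needed
            w = [wi - c * qi for wi, qi in zip(w, q)]
        length = math.sqrt(dot(w, w))
        if length > tol:
            # subtract first, then normalize
            basis.append([wi / length for wi in w])
        # otherwise v was dependent on the earlier vectors and is skipped
    return basis

vectors = [[3.0, 4.0, 0.0], [1.0, 0.0, 0.0], [6.0, 8.0, 0.0], [0.0, 0.0, 2.0]]
qs = gram_schmidt(vectors)
for q in qs:
    print(q)
```

The third input is 2 times the first one, so it collapses to (nearly) 0 and only 3 orthonormal vectors come out.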

`Formula`

`Intuition`

`Proof`

`Questions`

`Duality` Thanatcha says that every problem has a dual, but that dual isn't necessarily the exact same problem.

But, if it is the exact same problem, then you can turn your optimization problem into an optimization problem without constraints, which will mean you can solve it a lot quicker and easier.

Is matrix multiplying like x * A like saying 4 + 2 * x?  Like, you can do it either way but we have a standard way to do it?

Linear algebra.  If a matrix is just a function on vectors, how come we always do multiplication with vectors?  Isn't the matrix supposed to be like f?  So the matrix is like a function?  What is the difference between a dot product and a matrix multiplication?  Is a vector a matrix?  Is a matrix a bunch of vectors?  Is a vector a 'transformation', or is it representative of some 'object'?  Are those 2 things necessarily different?  Is matrix vector multiplication just a bunch of dot products?

Ghaoi said that decomposition is the crowning achievement of linear algebra.  The spectral norm was invented in the context of physics.  All the important stuff was invented by application people.

Identity matrix:  1's on the diagonal, 0's elsewhere.

Diagonal matrix:  x y z whatever on the diagonal, 0's elsewhere.

Symmetric matrix: matrix[i][j] == matrix[j][i] for all i, j.  So the upper triangle is the same as the lower triangle.

Upper triangular matrix:  if i > j, matrix[i][j] == 0.  So the lower triangle is all 0's.

Orthogonal matrix:  A is an orthogonal matrix if A * A^T = I.  There are some pretty simple examples, but I can't really think of a pattern for them.

Dyad:  A matrix that can be written as the product of 2 vectors.  A = u * v^T.
Since u = a * u_length1, and v = b * v_length1, then A = (a * b) u_length1 * v_length1^T.

QR decomposition:  We have matrix A, we want matrices QR == A.
The first thing we do is Gram-Schmidt on the columns of A.  This will give us Q.

For each column vector a_i in A, there is a q_i that corresponds to it.  This q_i is the 'orthonormal' version of a_i.  For every column except the first one (i=1, where there's nothing to subtract), the formula is like this:

a_i - (a_i^T * q_1) * q_1 - ... - (a_i^T * q_(i-1)) * q_(i-1) = L2-norm(Q_i) * q_i.

Each - (a_i^T * q_?) * q_? is a step in the Gram Schmidt normalization process.  So once we do all the subtractions, we'll be left with Q_i, the part of a_i that is orthogonal to q_1, ...q_(i-1).  Then we normalize Q_i to get q_i.  

Q_i is q_i, but not normalized.  So L2-norm(Q_i) * q_i = Q_i.  I don't know why they have this silly notation in the livebook, but whatever.

Now, what if we put all the subtraction stuff on the other side?

a_i = (a_i^T * q_1) * q_1 + ... + (a_i^T * q_(i-1)) * q_(i-1) + L2-norm(Q_i) * q_i.

So each (a_i^T * q_?) * q_? = r_?i * q_?, and the last coefficient is r_ii = L2-norm(Q_i).  So since the first thing, a_1, has nothing before it, it only has 1 non-zero value in its column of R.  a_2 has 2 non-zero values in its column, and so on.  Column i of R holds the coefficients r_1i, ..., r_ii that rebuild a_i from the q's, with 0's below them, and R goes on the right of Q.

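So Q is a matrix of orthonormal vectors.  If Q is square, then it's an orthogonal matrix, so Q * Q^T = I, and Q^T * Q = I as well.  Q^T * Q = I is the easy one: entry (i, j) is the dot product of q_i and q_j, which is 1 when i = j and 0 otherwise.  Q * Q^T = I then follows because for a square matrix a left inverse is also a right inverse.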

And R is an upper triangular matrix (because r1 has 1 non-zero, r2 has 2 non-zeros, etc).
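
To see these properties on an actual matrix, here is a sketch using NumPy's built-in `np.linalg.qr` on a made-up 3x2 matrix.  As far as I know NumPy does not compute this with Gram-Schmidt, so the signs of the columns may differ from a hand calculation, but the properties should still hold.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

Q, R = np.linalg.qr(A)  # default 'reduced' mode: Q is 3x2, R is 2x2

print(np.allclose(Q @ R, A))            # Q times R rebuilds A
print(np.allclose(Q.T @ Q, np.eye(2)))  # the columns of Q are orthonormal
print(np.allclose(R, np.triu(R)))       # R is upper triangular
```

Q is not square here, so only Q^T * Q = I holds; Q * Q^T is a 3x3 matrix that is not the identity.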


Frobenius norm:  A norm for a matrix.  Just square all the entries, add them together, and take the square root.

Ordinary Least Squares (OLS):  choose the x vector that minimizes the length of the vector Ax-y.

This is the same as minimizing L2-norm(Ax-y)^2, the length squared, since squaring doesn't change which x gives the minimum.

We don't bother square rooting it because it's more work and there's no point.

Think of A as a plane (or the range of A), and y as a vector not on A's range.  We want to find the vector in A's range that is closest to y.  So we want to find the projection of y onto the plane that is A's range.

The formula for this is:

x\* = (A^T * A)^-1 * A^T * y.

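This gives the same answer as solving with the QR decomposition of A:  if A = QR, then x\* = R^-1 * Q^T * y.  Plug A = QR into the formula above and the Q^T * Q in the middle turns into I.

Also, this assumes that A is 'tall', meaning m $\geq$ n, with independent columns (so A^T * A can be inverted).  So there are more equations than unknowns, which is called overdetermined.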
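
A quick numerical check of the least squares formula, as a sketch with made-up data (4 equations, 2 unknowns).  It compares the normal-equations formula, a QR-based solve, and NumPy's own `np.linalg.lstsq`.

```python
import numpy as np

# made-up tall system: 4 rows, 2 columns
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 4.0, 8.0])

# normal equations: x* = (A^T A)^-1 A^T y, done with solve instead of an explicit inverse
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# QR route: A = QR, so R x* = Q^T y
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ y)

# library least squares, for comparison
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(x_normal, x_qr, x_lstsq)  # all three should be about [0.7, 2.2]

# the leftover error A x* - y is orthogonal to every column of A
print(A.T @ (A @ x_normal - y))  # about [0, 0]
```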

There's also a bunch of special cases of least squares that I'm not going to go into.

Least squares is for drawing a line that approximates a bunch of points.  L2-norm(Xw-y).  If the points are all in a row and can be perfectly fit to a line, then Xw=y.  The rows of X are the data points, w is the weights, and y is the labels.

Spectral Theorem:  If A is a symmetric matrix, (symmetric matrices are always square), and it is n by n in dimension, then A has n real eigenvalues (counting repeats), and you can pick n eigenvectors, one for each, that are all orthogonal to each other.

Cool thing:  For _any_ matrix A, B = A * A^T and C = A^T * A are both symmetric.
Also, the non-zero eigenvalues of B and C will be the same.
Say A is 500x3 in dimension.
Then B has 500 eigenvalues, and C has 3.
Since B and C's non-zero eigenvalues are the same, we just have to find C's 3 eigenvalues.  Then they will be the same as B's non-zero eigenvalues.  So then we know the other 497 eigenvalues of B are 0.
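
Here is a sketch that checks this on a random 500x3 matrix, assuming NumPy.  The random data and the tolerance are made up for illustration, and 'zero' here means zero up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3))  # made-up 500x3 matrix

B = A @ A.T  # 500x500, symmetric
C = A.T @ A  # 3x3, symmetric

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order
eig_B = np.linalg.eigvalsh(B)
eig_C = np.linalg.eigvalsh(C)

print(eig_C)       # the 3 eigenvalues of C
print(eig_B[-3:])  # the 3 largest eigenvalues of B, which should match
print(np.allclose(eig_B[-3:], eig_C))
print(np.allclose(eig_B[:-3], 0.0, atol=1e-8))  # the other 497 are numerically 0
```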

**PCA**  

If you have 1000 dimension data, doing any calculations on that data will take a long time.  

PCA orders your dimensions in terms of how important they are.

Then you can keep the K most important dimensions and get rid of the rest.

This makes your calculations take much less time.



In your data matrix, each column is a dimension, and each row is a data point.

The first thing you do is get the 'sample mean vector', or just sample mean.  We'll call it $\hat{u}$

Make it by averaging each column.  Then it's the 'average row'.

Now do ($\hat{x}$-$\hat{u}$)($\hat{x}$-$\hat{u}$)^T where each $\hat{x}$ is a row of the data matrix, written as a column vector so the product is a matrix (an outer product) and not a scalar.  Average all the matrices this creates.

This is the covariance matrix.

Find all the eigenvalues of the covariance matrix, and each eigenvector associated with them.

Now choose the K largest eigenvalues, and each eigenvector associated with them.

Remember that an eigenvector $\hat{w}$ of the covariance matrix C means C\*$\hat{w}$=a$\hat{w}$, where a is a scalar.  So if a=8, then the eigenvalue associated with $\hat{w}$ is 8, and $\hat{w}$ will get 8 times bigger whenever you multiply it with C.

So you take each of the K best eigenvectors, and you normalize them.  That is, you make them all have length 1.

Each eigenvector has the same number of elements as each row of X.

So then for each row of X, you take the dot product with each eigenvector, starting with the one that has the largest eigenvalue, then the second largest, etc, until you have K dot products.

Put all of these dot product scalars into a vector.

This is your new transformed data point, which is a row vector.  It has K dimensions, instead of 1000.

This is a change of basis.  For each row, you dot it with K eigenvectors to produce a new row that has only K elements.

Since you're just doing a bunch of dot products, you can stack the eigenvectors into a matrix and do it all at once for each of your data points (row vectors in X).

How do you know what to choose for K?

With this fraction:

$\frac{e_1+\cdots+e_K}{e_1+\cdots+e_N}$

The denominator is the sum of all the eigenvalues.  Just choose eigenvalues until you have some fraction that you think is good enough.  Most people stop around 90-95% and say that's good enough.  It's basically a percentage saying 'I have x% of the original information of my data points and that's good enough.'
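
The whole PCA recipe can be sketched with NumPy as below.  Some assumptions of mine: the function name `pca` and the random data are made up, the covariance is averaged over N (not N-1), and the mean is subtracted from each row before taking the dot products, which is the usual convention.

```python
import numpy as np

def pca(X, k):
    # rows of X are data points, columns are dimensions
    mean = X.mean(axis=0)  # the 'average row'
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]  # average of the outer products
    # eigh is for symmetric matrices: ascending eigenvalues, unit-length eigenvectors as columns
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]  # the k best eigenvectors stacked into a matrix
    explained = eigvals[:k].sum() / eigvals.sum()  # the 'good enough' fraction
    return centered @ W, explained  # all the dot products at once

rng = np.random.default_rng(0)
# made-up data: 200 points, 5 dimensions, with very different spreads per dimension
X = rng.standard_normal((200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

Z, explained = pca(X, k=2)
print(Z.shape)    # (200, 2)
print(explained)  # fraction of the eigenvalue total kept by the 2 best dimensions
```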

**Kernel Trick**

You don't have enough dimensions in your data, and you want more.  You can make fake dimensions by doing stuff like taking 1 of your dimensions and squaring it, then appending it to each data point.  Now each data point has 2 dimensions.  

A kernel function takes in 2 raw (unlifted) data points and returns the dot product of the lifted versions of those data points, without actually lifting them.  So you could do something with the gaussian distribution or whatever (which has infinite dimensions), and do the kernel trick on data points with the gaussian distribution as the kernel function.  You can't actually give data points infinite dimension, but the kernel trick can output a number that is the same as the dot product of 2 infinite dimensional vectors.  So that's what the kernel trick is.
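
A small finite example makes this concrete.  This sketch swaps in the degree-2 polynomial kernel instead of the gaussian one, because its lifted vectors are small enough to write out; the names `lift` and `kernel` are my own.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lift(v):
    # explicit lift of a 2-dimensional point into 3 dimensions
    x1, x2 = v
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def kernel(u, v):
    # works on the raw 2-dimensional points and never builds the lifted vectors
    return dot(u, v) ** 2

u = [1.0, 2.0]
v = [3.0, 4.0]

print(dot(lift(u), lift(v)))  # dot product computed in the lifted space
print(kernel(u, v))           # 121.0, the same number up to rounding
```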

This is compatible with PCA.  Get rid of useless dimensions.  Oh no, you only had like 2 useful dimensions.  Now we use the kernel trick to add more dimensions that actually mean something.