# Linear algebra and gradient review

In this section, we will review linear algebra and gradient concepts, using Ridge regression as an example.

Material to cover:
1. Rank of the matrix
2. Inverse of the matrix
3. Properties of $X^TX$
4. Inner and outer product


### Rank
[Rank] is the maximum number of linearly ***independent*** rows (equivalently, columns) in a matrix. 
[Rank]: https://en.wikipedia.org/wiki/Rank_(linear_algebra)

In [None]:
import numpy as np
a = [[1,1],
     [2,3]]
b = [[1,1],
     [2,2]]
c = [[1,0],
     [0,0]]
np.linalg.matrix_rank(a),np.linalg.matrix_rank(b),np.linalg.matrix_rank(c)

### Inverse of the matrix
$XX^{-1} = I$

For the inverse to exist, $X$ must be square and full-rank.

In [None]:
np.linalg.inv(a)  
# np.linalg.inv(b) would raise LinAlgError because b is singular (rank 1)

In [None]:
np.array(a).dot(np.linalg.inv(a))

In [None]:
diag = [[1,0,0],
        [0,3,0],
        [0,0,5]]
np.linalg.inv(diag)

In [None]:
diag = [[1,0,0],
        [0,3,0],
        [0,0,0]]
# np.linalg.inv(diag) would fail: the zero on the diagonal makes the matrix singular

The inverse of a diagonal matrix is obtained by taking the reciprocal of each diagonal entry. A zero on the diagonal has no reciprocal, which is why the second matrix above, which is not full-rank, has no inverse.
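
As a quick check of the rule for diagonal matrices, here is a minimal sketch that builds the inverse entry by entry and compares it with `np.linalg.inv`. The names `diag_entries` and `D_inv_by_hand` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Full-rank diagonal matrix: every diagonal entry is nonzero.
diag_entries = np.array([1.0, 3.0, 5.0])
D = np.diag(diag_entries)

# Build the inverse by taking the reciprocal of each diagonal entry.
D_inv_by_hand = np.diag(1.0 / diag_entries)

# Compare against the general-purpose inverse.
inverse_matches = np.allclose(D_inv_by_hand, np.linalg.inv(D))
print(inverse_matches)
```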

`np.linalg.solve` gives the same result as multiplying by the inverse, but it solves the linear system directly without forming the inverse explicitly.

$Ax=\beta$

$x=A^{-1}\beta$


In [None]:
beta=[2,3]
np.linalg.solve(a,beta)

In [None]:
np.linalg.inv(a).dot(beta)

### $X^TX$
We use $X^TX$ a lot in linear regression because it is always a square matrix and it has the same rank as $X$.

Why?
Because $Xv=0$ if and only if $X^TXv=0$, so $X$ and $X^TX$ have the same [null space], and the [Rank–nullity theorem](https://en.wikipedia.org/wiki/Rank%E2%80%93nullity_theorem) then gives them the same rank.

[null space]: https://en.wikipedia.org/wiki/Kernel_(linear_algebra)
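
The "if and only if" above can be filled in with a short argument; this is a sketch of the standard proof. One direction is immediate, since $Xv=0$ gives $X^TXv=X^T0=0$. For the other direction:

$$\large
X^TXv = 0 \;\Rightarrow\; v^TX^TXv = 0 \;\Rightarrow\; ||Xv||_2^2 = 0 \;\Rightarrow\; Xv = 0
$$

So $X$ and $X^TX$ have the same null space. Both have $p$ columns (writing $p$ for the number of columns of $X$, a symbol introduced here for illustration), so the rank-nullity theorem gives $\text{rank}(X^TX) = p - \dim \text{null}(X) = \text{rank}(X)$.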

In [None]:
d=[[1,2],
  [2,4],
  [5,6]]
e=[[1,1],
  [2,2],
  [3,3]]
np.linalg.matrix_rank(d),np.linalg.matrix_rank(e)

In [None]:
# np.linalg.inv(d) would fail: d is 3x2, and only square matrices can be inverted
dd=np.array(d).transpose().dot(d)
ee=np.array(e).transpose().dot(e)
np.linalg.matrix_rank(dd),np.linalg.matrix_rank(ee)

Therefore $X^TX$ is not always invertible: it is singular whenever the columns of $X$ are linearly dependent, as with `e` above.

### Outer and inner product
Suppose $X$ is a row vector.
Then $X^TX$ is the outer product and $XX^T$ is the inner product. In numpy, however, 1-D and 2-D arrays behave differently.

In [None]:
v=np.array([1,2,3])

In [None]:
np.outer(v,v)

In [None]:
np.dot(v,v)

In [None]:
v.transpose().dot(v)  # transpose is a no-op on a 1-D array, so this is still the inner product

In [None]:
vv=np.array([[1,2,3]])
vv

In [None]:
vv.transpose()

In [None]:
vv.transpose().dot(vv)

In [None]:
vv.dot(vv.transpose())


### Ridge Regression

There are many forms for the regularization function $\textbf{R}(\theta)$, but a common form is the squared **$L^2$** norm of $\theta$.

$$\large
\large \textbf{R}_{L^2}(\theta) = 
\large||\theta||_2^2 = \theta^T \theta  = \sum_{k=1}^p \theta_k^2
$$

In the context of least squares regression this is often referred to as **Ridge Regression** with the objective:

$$ \large
\hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \left(y_i - f_\theta(x_i)\right)^2 + \lambda ||\theta||_2^2
$$

This is also sometimes called [Tikhonov Regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization).  

## Deriving the optimal $\hat{\theta}$ with $L^2$ Regularization

We return to our linear model formulation:

$$ \large
f_\theta(x) = x^T \theta
$$

Using the standard matrix notation:

<img src="images/matrix_dot.png" width="400px">

We can rewrite the objective:


\begin{align}\large
\hat{\theta}_{\text{L2}} = \arg\min_\theta \frac{1}{n}\left(Y -  X \theta \right)^T \left(Y -  X \theta \right)  + \lambda \theta^T \theta
\end{align}

Expanding the objective term:

\begin{align}\large
L_\lambda(\theta) = \frac{1}{n}\left(Y -  X \theta \right)^T \left(Y -  X \theta \right)  + \lambda \theta^T \theta = 
\frac{1}{n} \left( 
 Y^T Y -  2 Y^T X \theta + \theta^T  X^T  X \theta 
\right) + \lambda \theta^T \theta
\end{align}

Taking the **gradient** with respect to $\theta$:


\begin{align} \large
\nabla_\theta L_\lambda(\theta)
& \large =
\frac{1}{n} \left( 
 \nabla_\theta Y^T Y -  \nabla_\theta 2 Y^T X \theta + \nabla_\theta \theta^T  X^T  X \theta 
\right) + \nabla_\theta  \lambda \theta^T \theta \\
& \large =
\frac{1}{n} \left( 
 0 -  2 X^T Y  +  2 X^T  X \theta 
\right) + 2\lambda \theta
\end{align} 

The above gradient derivation uses the following identities:
1. $\large \nabla_\theta \left( A \theta  \right) = A^T$
1. $\large \nabla_\theta \left( \theta^T A \theta \right) = A\theta + A^T \theta$ and $\large A = X^T X$ is symmetric
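
Both identities can be sanity-checked numerically. The sketch below compares them against central finite differences on random inputs; the helper `numerical_gradient`, the vector `a`, and the random matrix `A` are illustrative names introduced here, and `A` is deliberately not symmetric so that the general form $A\theta + A^T\theta$ is exercised.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
a = rng.normal(size=p)        # plays the role of a row vector A in identity 1
A = rng.normal(size=(p, p))   # general (non-symmetric) matrix for identity 2
theta = rng.normal(size=p)

def numerical_gradient(f, t, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at t."""
    grad = np.zeros_like(t)
    for k in range(t.size):
        step = np.zeros_like(t)
        step[k] = eps
        grad[k] = (f(t + step) - f(t - step)) / (2 * eps)
    return grad

# Identity 1: the gradient of a . theta is a.
linear_ok = np.allclose(numerical_gradient(lambda t: a @ t, theta), a, atol=1e-5)

# Identity 2: the gradient of theta^T A theta is A theta + A^T theta.
quadratic_ok = np.allclose(
    numerical_gradient(lambda t: t @ A @ t, theta),
    A @ theta + A.T @ theta,
    atol=1e-5,
)
print(linear_ok, quadratic_ok)
```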

Setting the gradient equal to zero we get a **regularized** version of the **normal equations**:

$$\large
(X^T  X  + n \lambda I) \theta =  X^T Y
$$

$$\large
 \theta = \left(X^T  X  + n \lambda I \right)^{-1} X^T Y
$$
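
As a check on the derivation, the sketch below plugs the closed-form solution back into the gradient expression and confirms that it vanishes. The data are random and the value of `lam` is arbitrary; both are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
lam = 0.5  # arbitrary regularization strength

# Regularized normal equations: (X^T X + n * lam * I) theta = X^T Y
theta_hat = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# Gradient of the objective: (1/n)(-2 X^T Y + 2 X^T X theta) + 2 lam theta
gradient = (-2 * X.T @ Y + 2 * X.T @ X @ theta_hat) / n + 2 * lam * theta_hat
gradient_is_zero = np.allclose(gradient, 0.0)
print(gradient_is_zero)
```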




## Optimal $\theta$ under $L^2$ regularization


Because $\lambda$ is a tuning parameter, we often absorb the $n$ into $\lambda$ and rewrite the above equations as:



$$\large
(X^T  X  + \lambda I) \theta =  X^T Y
$$

$$\large
 \theta = \left(X^T  X  + \lambda I \right)^{-1} X^T Y
$$

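**Notice:** For any $\lambda > 0$, the addition of $\lambda I$ ensures that $X^T  X  + \lambda I$ is **full rank**, because $X^TX$ is positive semidefinite and adding $\lambda I$ shifts every eigenvalue up by $\lambda$.  This addresses the earlier issue in least-squares regression when we had collinear features.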
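
To make this concrete, the sketch below uses a design matrix with two identical columns (the same values as `e` earlier), so $X^TX$ is singular, and shows that adding $\lambda I$ restores full rank and lets the system be solved. The targets `Y` and the value of `lam` are made up for illustration.

```python
import numpy as np

# Second column duplicates the first, so the features are collinear.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
Y = np.array([2.0, 4.0, 6.0])  # made-up targets
lam = 0.1                      # illustrative regularization strength

gram = X.T @ X
ridge_matrix = gram + lam * np.eye(2)

rank_plain = np.linalg.matrix_rank(gram)          # rank-deficient
rank_ridge = np.linalg.matrix_rank(ridge_matrix)  # full rank

# The regularized system has a unique solution.
theta = np.linalg.solve(ridge_matrix, X.T @ Y)
print(rank_plain, rank_ridge, theta)
```

Because the two columns are identical, the solution assigns them the same weight.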




## How does $L^2$ Regularization Help

The $L^2$ penalty helps in several ways:

**Manages Model Complexity**
1. It ensures that the weights of uninformative features are relatively small (near zero), mitigating the effect of those features.  
1. It evenly distributes weight over similar features to reduce variance.

**Practical Concerns**
1. It removes the degeneracy created by collinear features.
1. It improves the numerical stability of
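
The shrinkage effect on the weights can be seen directly from the closed-form solution. The following sketch uses synthetic data (an assumption, not the dataset used later) and a hypothetical helper `ridge_solution`, and records the norm of $\theta$ for increasing values of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
# Synthetic targets: two of the five features carry no signal.
Y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=n)

def ridge_solution(X, Y, lam):
    """Solve (X^T X + lam * I) theta = X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge_solution(X, Y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(lam, norm)
```

The norm of the solution decreases as $\lambda$ grows, which is the sense in which the penalty keeps the weights small.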

---


## Normalization and the Intercept

Before we proceed it is important that we appropriately normalize the data.  Because the standard $L^2$ regularization methods treat each dimension equivalently, it is important that all dimensions are in the same range of values.  

However, the distribution of values can be quite different for each dimension, as in the following example:

In [None]:
import pandas as pd
df=pd.read_csv("diamonds.csv")
df.head()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
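# One histogram per column; `_` discards the return value so the matrix `a` defined earlier is not overwritten.
plt.figure(figsize=(10,10))
plt.subplot(2,2,1)
_ = plt.hist(df["carat"],bins=30)
plt.title("carat")
plt.subplot(2,2,2)
_ = plt.hist(df["depth"],bins=30)
plt.title("depth")
plt.subplot(2,2,3)
_ = plt.hist(df["table"],bins=30)
plt.title("table")
plt.subplot(2,2,4)
_ = plt.hist(df["price"],bins=30)
plt.title("price")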

In [None]:
X = np.array(df[["carat","depth","table"]])
Y = np.array(df["price"])
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(X,Y)
lm.coef_,lm.intercept_

In [None]:
r=linear_model.Ridge(alpha=1)
r.fit(X,Y)
r.coef_

### Questions in the homework
Q: Can we say carat is the dominating feature?

A: We can't make that judgement from the magnitude of the coefficient when the data is not normalized, because the size of each coefficient depends on the units of its feature.
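
The dependence on units is easy to demonstrate. In the sketch below, the same synthetic feature is expressed in two units (grams and milligrams, both hypothetical) and fitted by ordinary least squares with a hypothetical helper `ols_slope`; the fitted relationship is identical, but the coefficient changes by a factor of 1000.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x_grams = rng.uniform(0.1, 1.0, size=n)  # hypothetical feature in grams
y = 5.0 * x_grams + rng.normal(scale=0.1, size=n)

def ols_slope(x, y):
    """Fit y = intercept + slope * x by least squares and return the slope."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

slope_grams = ols_slope(x_grams, y)
slope_milligrams = ols_slope(x_grams * 1000.0, y)  # same feature, different unit
print(slope_grams, slope_milligrams)
```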


## Standardizing the Data

A common transformation is to center and scale the features to zero mean and unit variance:

$$\large
z = \frac{x - \mu}{\sigma}
$$

This can be accomplished by applying scikit-learn's `StandardScaler` preprocessor.

In [None]:
from sklearn.preprocessing import StandardScaler
normalizer = StandardScaler()
normalizer.fit(X)
X_norm=normalizer.transform(X)
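
For reference, the same transformation can be written by hand with numpy. This is a minimal sketch on a small made-up array `X_demo` (not rows from `diamonds.csv`); it uses the population standard deviation (`ddof=0`), which, as far as the scikit-learn documentation describes, is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up values standing in for three feature columns.
X_demo = np.array([[0.2, 60.0, 55.0],
                   [0.5, 62.0, 57.0],
                   [1.0, 61.0, 58.0],
                   [1.5, 63.0, 54.0]])

mu = X_demo.mean(axis=0)     # per-column mean
sigma = X_demo.std(axis=0)   # per-column population standard deviation (ddof=0)
Z = (X_demo - mu) / sigma    # z = (x - mu) / sigma, applied column by column

print(Z.mean(axis=0), Z.std(axis=0))
```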

In [None]:
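# Same histograms after standardization; price is the target and is left unscaled.
plt.figure(figsize=(10,10))
plt.subplot(2,2,1)
_ = plt.hist(X_norm[:,0],bins=30)
plt.title("carat (standardized)")
plt.subplot(2,2,2)
_ = plt.hist(X_norm[:,1],bins=30)
plt.title("depth (standardized)")
plt.subplot(2,2,3)
_ = plt.hist(X_norm[:,2],bins=30)
plt.title("table (standardized)")
plt.subplot(2,2,4)
_ = plt.hist(df["price"],bins=30)
plt.title("price")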

In [None]:
r=linear_model.Ridge(alpha=1)
r.fit(normalizer.transform(X),Y)
r.coef_,r.intercept_

In [None]:
plt.plot(df["carat"],df["price"],".")

Now what can we say about carat?