# Multivariate linear regression

$X \in R^{N \times d}, y \in R^{N}, w \in R^{d}$

Model: $a(X) = Xw$

MSE: $ L(w) = \frac{1}{N}\|Xw - y\|^2 $

Goal: $w^* = \arg\min_w L(w)$

Solve:

$\nabla_w L(w) = 0$

$\nabla_w L(w) = \frac{1}{N}\nabla_w\| Xw-y \|^2 = \frac{2}{N}X^T(Xw-y)=0$

$X^T (Xw^*-y)=0$

$X^TXw^*-X^Ty=0$

$X^TXw^* = X^Ty$

If $X^TX$ is invertible, multiply both sides by $(X^TX)^{-1}$ on the left:

$(X^TX)^{-1}X^TX w^* = (X^TX)^{-1}X^Ty$

$w^* = (X^TX)^{-1} X^Ty$
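
The closed-form solution can be tried out on synthetic data. The following is a minimal sketch, assuming NumPy; the function name `fit_normal_equation` and the data are hypothetical. It solves the linear system $X^TXw = X^Ty$ with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same $w^*$ when $X^TX$ is invertible.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve X^T X w = X^T y for w without forming the inverse explicitly."""
    gram = X.T @ X        # X^T X, shape (d, d)
    moment = X.T @ y      # X^T y, shape (d,)
    return np.linalg.solve(gram, moment)

# Hypothetical synthetic data: y is an exact linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true

w_star = fit_normal_equation(X, y)

# The gradient (2/N) X^T (X w - y) should vanish at the solution.
grad_at_solution = 2.0 / len(y) * X.T @ (X @ w_star - y)
```

Because the data here is noise-free, `w_star` should match `w_true` up to rounding error.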


---
Euclidean norm:

1. $a \in R^d\;\;\;\; \|a\|^2 = \sum^d_{i=1} a^2_i$

The gradient of this function is

$\nabla_a \| a\|^2 = \begin{pmatrix}
\frac{\partial \| a\|^2}{\partial a_1} \\
\frac{\partial \| a\|^2}{\partial a_2} \\
\vdots \\
\frac{\partial \| a\|^2}{\partial a_d} \\
\end{pmatrix}=\begin{pmatrix}
2 a_1 \\
2 a_2 \\
\vdots \\
2 a_d \\
\end{pmatrix}=2a$

(each $a_i$ appears in only one term of $\sum^d_{i=1} a^2_i$).

---

2. $\nabla_w(Xw) = X^T$, so by the chain rule $\nabla_w \|Xw-y\|^2 = X^T \cdot 2(Xw-y)$
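
The analytic gradient $\frac{2}{N}X^T(Xw-y)$ can be sanity-checked against a finite-difference approximation. This is a sketch under assumptions: the helper names (`mse_loss`, `mse_grad`, `numerical_grad`), the step size `h`, and the random data are all hypothetical choices for illustration.

```python
import numpy as np

def mse_loss(w, X, y):
    """L(w) = (1/N) * ||Xw - y||^2."""
    return np.sum((X @ w - y) ** 2) / len(y)

def mse_grad(w, X, y):
    """Analytic gradient (2/N) * X^T (Xw - y)."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

def numerical_grad(f, w, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)

analytic = mse_grad(w, X, y)
numeric = numerical_grad(lambda v: mse_loss(v, X, y), w)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the derivation is right, `max_abs_diff` should be tiny, limited only by floating-point rounding.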



# Gradient Descent for Linear Regression

$a(x) = \langle w, x \rangle$

$L(w) = \frac{1}{N} \| Xw-y \|^2, X \in R^{N\times k}, y \in R^{N}$

$w^* = (X^TX)^{-1}X^Ty$

It is not always easy, or even possible, to use this formula to find the optimal parameters, because inverting a matrix is not a trivial task.

$X^TX \in R^{k\times k}$

1. Complexity. Inverting a $k \times k$ matrix takes $O(k^3)$ operations.

2. Stability. Some matrices have no inverse at all. This happens, for example, when the matrix has linearly dependent rows or columns. In practice, that means the features are linearly dependent: one feature can be obtained as a linear combination of the others. Even when no features are exactly linearly dependent, highly correlated features can make computing the inverse a very unstable operation, because the inverse is computed with approximate numerical algorithms.
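
The stability issue can be observed through the condition number of $X^TX$. The sketch below assumes NumPy and uses hypothetical synthetic features (`X_indep`, `X_corr`, `X_dep`); the noise scale `1e-6` is an arbitrary choice to make the second feature almost a copy of the first.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)

# Case 1: the second feature is independent of the first.
X_indep = np.column_stack([x1, rng.normal(size=n)])
# Case 2: the second feature is almost a copy of the first (highly correlated).
X_corr = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=n)])
# Case 3: the second feature is an exact multiple of the first (linearly dependent).
X_dep = np.column_stack([x1, 2.0 * x1])

# A large condition number means small input errors are strongly amplified
# when solving with (or inverting) the matrix.
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
cond_corr = np.linalg.cond(X_corr.T @ X_corr)

# X and X^T X have the same rank; rank 1 < 2 means X^T X is singular.
rank_dep = np.linalg.matrix_rank(X_dep)
```

In this setup `cond_corr` should be many orders of magnitude larger than `cond_indep`, and `rank_dep` is 1, so $X^TX$ has no inverse in the third case.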

Gradient descent avoids the inversion:

- Initialize $w^0 \sim N(0, I_{k\times k})$

- For $t$ in $1 \dots$ max_iter:

    $w^t = w^{t-1} - \eta \nabla_w L(w^{t-1})$

    if $\| w^t - w^{t-1} \|_2 < \epsilon$: break
    
$\nabla_w L(w) = \frac{2}{N} X^T(Xw-y) = \frac{2}{N}(X^TXw-X^Ty)$

$w^t = w^{t-1} - \eta \frac{2}{N}(X^TX w^{t-1} - X^Ty)$

Calculate $X^TX$ and $X^Ty$ once at the beginning and reuse them throughout the entire process.

The only costly part of each step is the product $X^TX w^{t-1}$, which takes $O(k^2)$ operations. This is much cheaper than computing the inverse matrix, so gradient descent is a good fit for models with correlated features or a large number of features.
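
The update rule and stopping criterion can be sketched in NumPy as follows. The function name `gradient_descent`, the defaults for `eta`, `max_iter`, and `eps`, and the synthetic data are assumptions made for illustration; a suitable learning rate depends on the data.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, max_iter=10_000, eps=1e-10, seed=0):
    """Minimize (1/N) * ||Xw - y||^2 with plain gradient descent."""
    n_samples, n_features = X.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n_features)      # w^0 ~ N(0, I)

    # Computed once and reused at every step.
    gram = X.T @ X
    moment = X.T @ y

    for _ in range(max_iter):
        grad = 2.0 / n_samples * (gram @ w - moment)
        w_next = w - eta * grad
        converged = np.linalg.norm(w_next - w) < eps
        w = w_next
        if converged:
            break
    return w

# Hypothetical noise-free data for a quick comparison with the closed form.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

w_gd = gradient_descent(X, y)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
```

On this well-conditioned example, `w_gd` should agree with `w_closed` to within the tolerance implied by `eps`.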

# Losses in Linear Regression

## Mean squared error

One of the most popular loss functions. It performs poorly when there are outliers, because squaring amplifies large errors.

$L(a, X) = \frac{1}{N} \sum^{N}_{n=1}(a(x_n) - y_n)^2$

## Mean absolute error

Uses absolute values instead of squares, so it is more robust to outliers.

$L(a, X) = \frac{1}{N} \sum^{N}_{n=1} |a(x_n)-y_n|$

## Huber loss

Behaves like MSE for small errors and like MAE on the tails.

$l_H(y,a)= \begin{cases}
\frac{1}{2}(y-a)^2, & |y-a| < \delta \\
\delta (|y-a|-\frac{1}{2}\delta), & |y-a| \ge \delta \\
\end{cases}$

$L(a,X) = \frac{1}{N}\sum^{N}_{n=1}l_H(y_n, a(x_n))$

The loss functions above are absolute: they are measured in the units of the target (or its square). The ones below are relative.
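
To see how the three losses react to an outlier, here is a small NumPy sketch. The function names, the default `delta=1.0`, and the toy arrays are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def huber(y_true, y_pred, delta=1.0):
    residual = np.abs(y_true - y_pred)
    quadratic = 0.5 * residual ** 2               # used where |y - a| < delta
    linear = delta * (residual - 0.5 * delta)     # used where |y - a| >= delta
    return np.mean(np.where(residual < delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.5, 2.0, 2.5, 4.0])      # errors: 0.5, 0, 0.5, 0
y_outlier = np.array([1.5, 2.0, 2.5, 14.0])   # same, but the last error is 10

clean = (mse(y_true, y_clean), mae(y_true, y_clean), huber(y_true, y_clean))
# -> (0.125, 0.25, 0.0625)
outlier = (mse(y_true, y_outlier), mae(y_true, y_outlier), huber(y_true, y_outlier))
# -> (25.125, 2.75, 2.4375)
```

The single outlier multiplies MSE by about 200, while MAE and Huber grow only linearly with the size of the error.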

---

## Mean absolute percentage error

A loss similar to MAE, but measured as a percentage of the true value.

$L(a,X)=\frac{100\%}{N}\sum^{N}_{n=1}\left|\frac{a(x_n)-y_n}{y_n}\right|$

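- Non-symmetric function

- Favors under-prediction (negative errors $a(x_n) - y_n < 0$): for positive targets and non-negative predictions, an under-prediction contributes at most 100%, while an over-prediction is unbounded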

## Symmetric MAPE

Adds some symmetry by normalizing the error by the average of $|y_n|$ and $|a(x_n)|$ instead of $|y_n|$ alone.

$L(a,X) = \frac{100\%}{N}\sum^N_{n=1}\frac{|y_n-a(x_n)|}{(|y_n|+|a(x_n)|)/2}$
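
The asymmetry of MAPE, and what SMAPE changes, can be illustrated on a one-point example. This is a sketch assuming NumPy; the function names and the values (a target of 100 predicted as 50 or 200) are hypothetical.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Requires y_true != 0."""
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

y_true = np.array([100.0])
under = np.array([50.0])    # prediction is half the target
over = np.array([200.0])    # prediction is twice the target

mape_under, mape_over = mape(y_true, under), mape(y_true, over)      # 50.0, 100.0
smape_under, smape_over = smape(y_true, under), smape(y_true, over)  # both 66.67 (approx.)
```

MAPE penalizes doubling the target twice as much as halving it, whereas SMAPE assigns both the same value. Note that SMAPE is symmetric in this ratio sense only; errors of equal absolute size in opposite directions still get different values.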

# Interpretation and Feature Importance

- Comparing weights is meaningful only if the features are on the same scale.

- The importance of a feature can be estimated by removing it from the model and measuring how much the loss increases.
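
Both points can be illustrated with a small experiment. The sketch below assumes NumPy and entirely hypothetical data: two features on very different scales, a noise-free target, and helper names `fit` and `mse` introduced only for this example.

```python
import numpy as np

def fit(X, y):
    """Least-squares weights (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(4)
n = 500
X_raw = np.column_stack([
    rng.normal(scale=1000.0, size=n),   # feature 0: large scale
    rng.normal(scale=1.0, size=n),      # feature 1: small scale
])
y = 0.003 * X_raw[:, 0] + 0.5 * X_raw[:, 1]

# Raw weights suggest feature 1 matters more (0.5 vs 0.003).
w_raw = fit(X_raw, y)

# After standardizing the features, the ordering of the weights flips.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_centered = y - y.mean()
w_scaled = fit(X_scaled, y_centered)

# Importance by removal: refit without feature j and record the MSE increase.
base = mse(X_scaled, y_centered, w_scaled)
increase = []
for j in range(X_scaled.shape[1]):
    X_drop = np.delete(X_scaled, j, axis=1)
    w_drop = fit(X_drop, y_centered)
    increase.append(mse(X_drop, y_centered, w_drop) - base)
```

Here the scaled weights and the removal-based importances agree that feature 0 contributes more, while the raw weights point the other way.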