* Table of contents:
    * [Linear regression](#linear)
    * [Logistic regression](#logistic)

# Linear regression  <a name="linear"></a>

The linear regression model assumes the following relation between the outcome $y$ and the independent variables $\boldsymbol{x}$:
$$
y=\boldsymbol{\theta}\cdot \boldsymbol{x}+\epsilon,
$$
where $\epsilon$ is the noise term and $\boldsymbol{\theta}$ is the vector of linear coefficients.

The essential difference between the frequentist view and the Bayesian view is whether the coefficient vector $\boldsymbol{\theta}$ is treated as a random variable.

## Frequentist Perspective

The objective function is the mean squared error (MSE), where summation over the repeated feature index $j$ is implied:
$$
MSE=\frac{1}{N}\sum_{i}(\theta_{j} x_{i}^{j}-y_i)^2.
$$
As a function of the model parameters $\boldsymbol{\theta}$, the MSE is always a second-order polynomial of the form:
$$
MSE=\frac{1}{N}[a^{mn}\theta_{m}\theta_{n}-2 b^{k}\theta_{k}+c],
$$
with the parameters $a^{mn},b^{k},c$ given by
\begin{align}
a^{mn}=& \sum_i x_{i}^{m} x_{i}^{n},\\
b^{k}=&\sum_i x_{i}^{k} y_{i},\\
c=&\sum_i y_{i} y_{i}.
\end{align}

**Normal Equation**

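Setting the gradient of the MSE with respect to $\boldsymbol{\theta}$ to zero gives the analytic minimizer, provided $\boldsymbol{a}$ is invertible:
\begin{align}
\hat{\boldsymbol{\theta}}=\boldsymbol{a}^{-1}\boldsymbol{b}.
\end{align}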

The minimum of the MSE follows by substitution, using the symmetry of $\boldsymbol{a}$:
\begin{align}
\frac{1}{N}(\hat{\boldsymbol{\theta}}^{T} \cdot \boldsymbol{a} \cdot \hat{\boldsymbol{\theta}}-2\boldsymbol{b}^T \cdot \hat{\boldsymbol{\theta}} +c) =& \frac{1}{N}[(\boldsymbol{a}^{-1}\boldsymbol{b})^{T}\boldsymbol{a}(\boldsymbol{a}^{-1}\boldsymbol{b})-2\boldsymbol{b}^T(\boldsymbol{a}^{-1}\boldsymbol{b})+c] \\
=&\frac{1}{N}(-\boldsymbol{b}^T\boldsymbol{a}^{-1}\boldsymbol{b}+c)
\end{align}
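
The closed-form results above can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the sample size, the coefficients `theta_true`, and the noise level are arbitrary assumptions chosen for illustration. It builds $\boldsymbol{a}$, $\boldsymbol{b}$ and $c$, solves the normal equation, and compares the directly computed MSE with the closed-form minimum.

```python
import numpy as np

# Synthetic data (illustrative assumption): N samples, d features.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))              # row i holds the features x_i^j
theta_true = np.array([1.5, -2.0, 0.5])  # hypothetical ground truth
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Coefficients of the quadratic form of the MSE.
a = X.T @ X   # a^{mn} = sum_i x_i^m x_i^n
b = X.T @ y   # b^k    = sum_i x_i^k y_i
c = y @ y     # c      = sum_i y_i y_i

# Normal equation: solve a theta = b rather than forming a^{-1} explicitly.
theta_hat = np.linalg.solve(a, b)

# MSE evaluated directly and through the closed form (c - b^T a^{-1} b) / N.
mse_direct = np.mean((X @ theta_hat - y) ** 2)
mse_closed = (c - b @ theta_hat) / N

print(theta_hat)
print(mse_direct, mse_closed)
```

Solving the linear system is used here instead of an explicit inverse because it is usually cheaper and numerically more stable; the result is the same $\hat{\boldsymbol{\theta}}$.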

#### Computational Complexity
If we use the normal equation to compute the optimal solution, forming the matrix $xx^{T}$ (the matrix $\boldsymbol{a}$ above) costs $O(nd^{2})$, with $n$ the sample size and $d$ the feature dimension. Inverting the matrix typically costs between $O(d^{2.4})$ and $O(d^{3})$, depending on the implementation. Obtaining the inverse matrix therefore takes $O(nd^{2}+d^{3})$ in total.

Given the inverse, together with $\boldsymbol{b}$ and $c$ (which cost $O(nd)$ and $O(n)$ to form), computing $\hat{\theta}$ takes $O(d^2)$ and evaluating the minimal MSE takes $O(d^2+d)$.

## Bayesian Perspective

As before, the linear regression model assumes
$$
y=\boldsymbol{\theta}\cdot \boldsymbol{x}+\epsilon,
$$
where $\epsilon$ is the noise term and $\boldsymbol{\theta}$ is the vector of linear coefficients. In the Bayesian view, $\boldsymbol{\theta}$ is treated as a random variable, as is the noise variance $\sigma^2$.

In general, Bayesian methods start from a proper factorization of the joint distribution of both hidden variables and observables. In the example of linear regression:
$$
p(y,x,\theta,\sigma^2)=p(y|x,\theta,\sigma^2)p(x|\theta,\sigma^2)p(\theta|\sigma^2)p(\sigma^2).
$$
Because $x$ is assumed to be independent of $\theta$ and $\sigma^2$, we have $p(x|\theta,\sigma^2)=p(x)$, and therefore
$$
p(y,x,\theta,\sigma^2)=p(y|x,\theta,\sigma^2)p(x)p(\theta|\sigma^2)p(\sigma^2).
$$
To obtain the posterior $p(\theta,\sigma^2|y,x)$, we can either substitute the observed values of $x,y$ into the above expression and run MCMC over the variables $\theta,\sigma^2$, or, if possible, analytically integrate out $\theta,\sigma^2$ to get the marginal distribution $p(x,y)$ and divide the above expression by it.

The analytic method, although not always feasible, can expose some properties of $p(\theta,\sigma^2|y,x)$. For this example, we have
\begin{align}
p(y,x)=&\int d\theta d\sigma^2p(y|x,\theta,\sigma^2)p(x)p(\theta|\sigma^2)p(\sigma^2)\\
      =& p(x)\int d\theta d\sigma^2p(y|x,\theta,\sigma^2)p(\theta|\sigma^2)p(\sigma^2).
\end{align}
Thus, when $p(y,x,\theta,\sigma^2)$ is divided by $p(y,x)$, the factor $p(x)$ cancels, and we conclude that
$$
p(\theta,\sigma^2|y,x)\propto p(y|x,\theta,\sigma^2)p(\theta|\sigma^2)p(\sigma^2),
$$
where the distribution $p(x)$ can be neglected.

In the linear regression model, the noise is assumed to be Gaussian with variance $\sigma^2$, so that
$$
p(y|x,\theta,\sigma^2)=\frac{1}{\sigma \sqrt{2 \pi}}\exp\left[-\frac{(y-\theta\cdot x)^2}{2\sigma^2}\right].
$$
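
To make the proportionality relation concrete, here is a hedged sketch of the unnormalised log posterior that an MCMC sampler would evaluate. The text does not specify the priors, so the choices below are assumptions made only for illustration: a zero-mean Gaussian prior on $\theta$ with variance $\sigma^2\tau^2$ and an inverse-gamma prior on $\sigma^2$. The function name `log_posterior` and the hyperparameters `tau2`, `alpha`, `beta` are hypothetical.

```python
import numpy as np

def log_posterior(theta, sigma2, X, y, tau2=10.0, alpha=2.0, beta=1.0):
    """Unnormalised log p(theta, sigma^2 | y, x).

    Assumed priors (illustrative, not taken from the text):
      theta | sigma^2 ~ Normal(0, sigma^2 * tau2 * I)
      sigma^2         ~ Inverse-Gamma(alpha, beta)
    """
    if sigma2 <= 0:
        return -np.inf  # the variance must be positive
    n, d = X.shape
    resid = y - X @ theta
    # Gaussian likelihood: sum_i log N(y_i | theta . x_i, sigma^2)
    log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2
    # log p(theta | sigma^2)
    log_prior_theta = (-0.5 * d * np.log(2 * np.pi * sigma2 * tau2)
                       - 0.5 * (theta @ theta) / (sigma2 * tau2))
    # log p(sigma^2), dropping the constant normaliser
    log_prior_sigma2 = -(alpha + 1) * np.log(sigma2) - beta / sigma2
    # p(x) is omitted: it does not depend on theta or sigma^2
    return log_lik + log_prior_theta + log_prior_sigma2

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
theta_true = np.array([1.0, -1.0])
y = X @ theta_true + 0.5 * rng.normal(size=100)

print(log_posterior(theta_true, 0.25, X, y))
print(log_posterior(theta_true + 1.0, 0.25, X, y))
```

The posterior density is higher near the coefficients that generated the data than at a shifted point, which is the property a sampler exploits when exploring $\theta,\sigma^2$.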

# Logistic regression <a name="logistic"></a>

## Binary classification case

In the binary case, the probability of belonging to the first class given predictors $X$ reads:
\begin{align}
\hat{p}=h_{\theta}(X)=\sigma(\theta^T \cdot X)
\end{align}
where the sigmoid function reads $\sigma(t)=\frac{1}{1+\exp(-t)}$.

The cost function is the average cross entropy over the $m$ training samples:
\begin{align}
J(\theta)=&-\frac{1}{m}\sum_{i}[y_{i}\ln(\hat{p}_i)+(1-y_{i})\ln(1-\hat{p}_i)]\\
\hat{p}_{i}=&\frac{1}{1+\exp(-\theta_{l}x_{i}^{l})}.
\end{align}

The derivative of the cost function with respect to $\theta_{j}$ reads:
\begin{align}
\partial_{\theta_{j}}J(\theta)=&-\frac{1}{m}\sum_{i}[y_{i}\frac{1}{\hat{p}_{i}}\partial_{\theta_{j}}\hat{p}_{i}+(1-y_{i})\frac{1}{1-\hat{p}_{i}}(-\partial_{\theta_{j}}\hat{p}_{i})].
\end{align}

The derivative of the sigmoid function is
\begin{align}
\partial_{t}\sigma(t)=&\frac{\exp(-t)}{(1+\exp(-t))^{2}}\\
=&\sigma(t)(1-\sigma(t))
\end{align}

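By the chain rule, $\partial_{\theta_{j}}\hat{p}_{i}=\hat{p}_{i}(1-\hat{p}_{i})x_{i}^{j}$, so the derivative of the cost function becomes: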
\begin{align}
\partial_{\theta_{j}}J(\theta)=&-\frac{1}{m}\sum_{i}[y_{i}\frac{1}{\hat{p}_{i}}-(1-y_{i})\frac{1}{1-\hat{p}_{i}}]\partial_{\theta_{j}}\hat{p}_{i}\\
=&-\frac{1}{m}\sum_{i}[y_{i}\frac{1}{\hat{p}_{i}}-(1-y_{i})\frac{1}{1-\hat{p}_{i}}]\hat{p}_{i}(1-\hat{p}_{i})x_{i}^{j}\\
=&-\frac{1}{m}\sum_{i}[y_{i}(1-\hat{p}_{i})-(1-y_{i})\hat{p}_{i}]x_{i}^{j}\\
=&\frac{1}{m}\sum_{i}[\hat{p}_{i}-y_{i}]x_{i}^{j}
\end{align}
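
The final expression for the gradient can be verified against finite differences. This is a minimal NumPy sketch on random data; the data, the labels, and the helper names `sigmoid`, `cost` and `grad` are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(theta, X, y):
    # J(theta) = -(1/m) sum_i [y_i ln p_i + (1 - y_i) ln(1 - p_i)]
    p_hat = sigmoid(X @ theta)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def grad(theta, X, y):
    # dJ/dtheta_j = (1/m) sum_i (p_i - y_i) x_i^j
    p_hat = sigmoid(X @ theta)
    return X.T @ (p_hat - y) / len(y)

# Random data and labels (illustrative assumption).
rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.normal(size=(m, d))
y = (rng.random(m) < 0.5).astype(float)
theta = rng.normal(size=d)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(d)
])

print(np.max(np.abs(numeric - grad(theta, X, y))))
```

The printed value is the largest discrepancy between the analytic and numerical gradients; it should be close to zero.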

## Logistic regression: general cases

The logistic regression model is a linear discriminative model that directly models the conditional probability $p(y|x)$. With $M$ possible outcomes, the probability is given by:

\begin{align}
z_i(x)&=\beta_{ij}\phi_{j}(x)+\beta_{i0},\\
\hat{p}(y_i|x)&=\frac{e^{z_i(x)}}{\sum_{k=1}^{M} e^{z_k(x)}},
\end{align}

where $i$ runs from $1$ to $M$, summation over the repeated index $j$ is implied, and $\phi_{j}(x)$ are the features.

### Softmax Regression
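For each class $k$, the softmax regression model first computes a score:
\begin{align}
S^{k}(X)=\theta^{k} \cdot X
\end{align}
The parameter vectors $\theta^{k}$ are typically stored as rows in a parameter matrix $\Theta$.

After computing the scores of all $K$ classes, the class probabilities follow from the softmax function $\sigma$:
\begin{align}
\hat{p}^{k}=\sigma(S(X))^{k}=\frac{\exp(S^{k})}{\sum_{j=1}^{K} \exp(S^{j})}
\end{align}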

The cost function is the cross entropy:
\begin{align}
J(\Theta)=-\frac{1}{m}\sum_{i}\sum_{k=1}^{K}y^{k}_{i}\ln(\hat{p}^{k}_{i})
\end{align}

By analogy with the binary case, the gradient of the cross entropy reads:
\begin{align}
\partial_{\theta^{k}_{j}}J(\Theta)=&\frac{1}{m}\sum_{i}[\hat{p}^{k}_{i}-y^{k}_{i}]x_{i}^{j}
\end{align}
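
Since the gradient is stated without derivation, a numerical check is a useful sanity test. The sketch below assumes one-hot targets $y^{k}_{i}$ and random data; the helper names `softmax`, `cross_entropy` and `grad` are hypothetical. It stores the vectors $\theta^{k}$ as rows of `Theta` and compares the analytic gradient with central finite differences.

```python
import numpy as np

def softmax(S):
    # Subtracting the row maximum does not change the result but avoids overflow.
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Theta, X, Y):
    P = softmax(X @ Theta.T)  # scores S^k = theta^k . x for every sample
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def grad(Theta, X, Y):
    # dJ/dtheta^k_j = (1/m) sum_i (p_i^k - y_i^k) x_i^j, shape (K, d)
    P = softmax(X @ Theta.T)
    return (P - Y).T @ X / X.shape[0]

# Random data with one-hot targets (illustrative assumption).
rng = np.random.default_rng(0)
m, d, K = 40, 3, 4
X = rng.normal(size=(m, d))
labels = rng.integers(0, K, size=m)
Y = np.eye(K)[labels]
Theta = rng.normal(size=(K, d))

# Central finite differences over every entry of Theta.
eps = 1e-6
numeric = np.zeros_like(Theta)
for k in range(K):
    for j in range(d):
        step = np.zeros_like(Theta)
        step[k, j] = eps
        numeric[k, j] = (cross_entropy(Theta + step, X, Y)
                         - cross_entropy(Theta - step, X, Y)) / (2 * eps)

print(np.max(np.abs(numeric - grad(Theta, X, Y))))
```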

The loss function in this case can be interpreted as the cross entropy between the true conditional distribution $p(y|x)$ and the model $\hat{p}(y|x)$:
\begin{align}
l=&-\int dx p(x)\int dy p(y|x)\log\hat{p}(y|x).
\end{align}
For a sample of $N$ instances denoted by $(x^i,y^i)$, the loss function is simply:
\begin{align}
l(\{x,y\})=&-\frac{1}{N}\sum_{i=1}^{N}\log\hat{p}(y^i|x^i),
\end{align}
where, for each instance, only the predicted probability of the observed class $y^i$ contributes to the loss function.
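
As a small worked example of the sample loss, the sketch below averages the negative log-probability of the observed class over three instances and confirms that it equals the one-hot cross-entropy form. The probability table `P` and the labels are made-up values for illustration only.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 instances and 3 classes.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])  # observed classes y^i

# Pick p_hat(y^i | x^i) for each instance and average the negative logs.
nll = -np.mean(np.log(P[np.arange(len(labels)), labels]))

# The same value from the one-hot cross-entropy form.
Y = np.eye(P.shape[1])[labels]
ce = -np.mean(np.sum(Y * np.log(P), axis=1))

print(nll, ce)
```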