# Regression and Classification

## Linear Regression

notation

* **data/design matrix** - $\Phi \in \mathbb{R}^{n \times p}$
    * typically $n > p$
* **response vector** - $t \in \mathbb{R}^n$
* $\phi(x_i)^\top$ is the $i^{th}$ row/observation
* $\phi_j(X)$ is the $j^{th}$ column/feature

assumptions

* there is a **true function** $f(x_i) = w^\top \phi(x_i) = \sum_k w_k \phi_k(x_i)$ where $w$ is a vector of (unknown) weights/coefficients
* we observe noisy responses $t_i = t(x_i) \stackrel{indep}{\sim} \mathcal{N}(f(x_i), \beta^{-1})$, where $\beta$ is the noise precision (inverse variance)

likelihood

* $L(w, \beta) = \prod_i \mathcal{N}(t_i \mid w^\top \phi(x_i), \beta^{-1})$  
$= \left(\frac{\beta}{2 \pi}\right)^{n/2} e^{-\frac{\beta}{2} \sum_i (t_i - w^\top \phi(x_i))^2}$
* $\ell(w, \beta) = -\frac{n}{2} \log 2 \pi + \frac{n}{2} \log \beta - \frac{\beta}{2} \sum_i (t_i - w^\top \phi(x_i))^2$

to solve for $\hat{w}$, maximize $\ell$ w.r.t. $w$ $\iff$ minimize sum of squares $\sum_i (t_i - w^\top \phi(x_i))^2$  
this is equivalent to minimizing $|t - \Phi w|^2 = (t - \Phi w)^\top (t - \Phi w) = t^\top t - t^\top \Phi w - w^\top \Phi^\top t + w^\top \Phi^\top \Phi w$  
taking the derivative w.r.t. $w$ and setting to 0 yields  
$0 = -2 \Phi^\top t + 2 \Phi^\top \Phi w$
$\implies \hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top t$
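
a minimal NumPy sketch of the closed-form solution above. The synthetic data, the weights `w_true`, and the precision `beta` are illustrative assumptions, not values from these notes. The normal equations are solved with `np.linalg.solve` rather than by forming the inverse explicitly, and the result is cross-checked against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): n = 200 observations, p = 3 features.
n, p = 200, 3
Phi = rng.normal(size=(n, p))           # design matrix
w_true = np.array([1.5, -2.0, 0.5])     # hypothetical "unknown" weights
beta = 25.0                             # noise precision, so noise std = beta ** -0.5
t = Phi @ w_true + rng.normal(scale=beta ** -0.5, size=n)

# Normal equations: solve (Phi^T Phi) w = Phi^T t instead of inverting Phi^T Phi.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(w_hat)
print(np.allclose(w_hat, w_lstsq))
```

in practice a least-squares solver such as `np.linalg.lstsq` is usually the safer choice when $\Phi^\top \Phi$ is close to singular.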

note that $\Phi^\top \Phi$ is invertible iff the columns of $\Phi$ are linearly independent (which requires $n \geq p$)

geometric intuition

* we want to minimize $|t - \Phi \hat{w}|^2$ w.r.t. $\hat{w}$
* defining $\hat{t} = \Phi \hat{w}$ implies $\hat{t}$ is in the column span of $\Phi$
* the closest such point to $t$ is its orthogonal projection onto that span, so $t - \hat{t}$ is orthogonal to every column of $\Phi$
* then we have $\Phi^\top (t - \hat{t}) = 0$  
$\implies \Phi^\top (t - \Phi \hat{w}) = 0$  
$\implies \hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top t$
* $\Phi (\Phi^\top \Phi)^{-1} \Phi^\top$ is the projection matrix onto the column span of $\Phi$, since $\hat{t} = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top t$
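
the geometric picture can be checked numerically. The sketch below uses random, purely illustrative data to build the projection matrix, then verifies that the residual is orthogonal to every column of $\Phi$ and that the matrix is symmetric and idempotent, as a projection should be.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))   # illustrative design matrix (n = 50, p = 4)
t = rng.normal(size=50)          # illustrative response vector

# Projection matrix onto the column span of Phi: Phi (Phi^T Phi)^{-1} Phi^T.
P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

t_hat = P @ t                    # fitted values
residual = t - t_hat

ortho = Phi.T @ residual         # should be numerically zero
idempotent = np.allclose(P @ P, P)
symmetric = np.allclose(P, P.T)

print(np.max(np.abs(ortho)), idempotent, symmetric)
```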

**regularized linear regression** - minimize $|t - \Phi w|^2 + \lambda |w|^2$

* expanding the objective gives $t^\top t - 2 w^\top \Phi^\top t + w^\top \Phi^\top \Phi w + \lambda w^\top w$
* to solve for $w$, take derivative and set to 0
* $0 = -2 \Phi^\top t + 2 \Phi^\top \Phi w + 2 \lambda w$  
$\implies \hat{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top t$
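
a short sketch of the regularized solution. The helper name `ridge_fit` and the random data are assumptions made for illustration. It compares the unregularized fit ($\lambda = 0$) with a regularized one and checks that the gradient of the penalized objective vanishes at the solution.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T t for w."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ t)

rng = np.random.default_rng(2)
Phi = rng.normal(size=(30, 5))   # illustrative design matrix
t = rng.normal(size=30)          # illustrative response vector

lam = 10.0
w_ols = ridge_fit(Phi, t, lam=0.0)
w_ridge = ridge_fit(Phi, t, lam=lam)

# Gradient of |t - Phi w|^2 + lam |w|^2 at the ridge solution (should be ~0).
grad = -2 * Phi.T @ t + 2 * Phi.T @ Phi @ w_ridge + 2 * lam * w_ridge

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

the penalty shrinks the weight vector toward zero. For $\lambda > 0$ the matrix $\Phi^\top \Phi + \lambda I$ is positive definite, so the solve step also works when the columns of $\Phi$ are not independent.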

## Classification

now suppose that $t \in \{0, 1\}^n$  
i.e., $t_i \in \{0, 1\}$

a plausible model for this might be:  
$t_i \stackrel{indep}{\sim} Bernoulli \big(g(\phi(x_i))\big)$

to keep the linear structure, a plausible choice is  
$g(\phi(x_i)) = \sigma(f_i)$ where $f_i = w^\top \phi(x_i)$ as before and $\sigma$ is the logistic sigmoid

$\sigma(y) = \frac{1}{1 + e^{-y}}$

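to find $\hat{w}_{MLE}$, we again set the gradient $\nabla_w \ell(w) = 0$ and solve, where $\ell(w) = \sum_i \big[ t_i \log \sigma(f_i) + (1 - t_i) \log (1 - \sigma(f_i)) \big]$

there is no closed-form solution, so solve via numerical optimization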
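
one possible numerical approach is plain gradient ascent on the Bernoulli log-likelihood, whose gradient is $\Phi^\top (t - \sigma(\Phi w))$. The sketch below is an illustration under assumptions: the data are synthetic, and the function names, step size, and iteration count are hypothetical choices. Newton's method (IRLS) is a common alternative.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def log_likelihood(w, Phi, t):
    f = Phi @ w
    # sum_i [t_i f_i - log(1 + e^{f_i})], an equivalent, numerically stable form
    return np.sum(t * f - np.logaddexp(0.0, f))

def fit_logistic(Phi, t, step=0.5, n_iter=2000):
    """Gradient ascent on the average log-likelihood (fixed step size)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (t - sigmoid(Phi @ w))
        w += step * grad / len(t)
    return w

rng = np.random.default_rng(3)
n = 500
Phi = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
w_true = np.array([-0.5, 2.0])                           # hypothetical weights
t = rng.binomial(1, sigmoid(Phi @ w_true))               # Bernoulli labels

w_hat = fit_logistic(Phi, t)
grad_at_hat = Phi.T @ (t - sigmoid(Phi @ w_hat))

print(w_hat, np.linalg.norm(grad_at_hat))
```

the gradient at `w_hat` should be close to zero, which is the condition $\nabla_w \ell(w) = 0$ stated above.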

### generative model for continuous features

for each example $i$

* pick the label $y_i$ for example $i$ (the response, written $t_i$ above) using $P(y_i = 1) = p$
* if $y_i = 0$ then pick $\phi(x_i) \sim P(x_i \mid y_i = 0)$
    * e.g., $\phi(x_i) \sim \mathcal{N}(\mu_0, \sigma_0^2)$
* if $y_i = 1$ then pick $\phi(x_i) \sim P(x_i \mid y_i = 1)$
    * e.g., $\phi(x_i) \sim \mathcal{N}(\mu_1, \sigma_1^2)$
    
model has parameters $\{p, \mu_0, \sigma_0^2, \mu_1, \sigma_1^2\}$
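
a sketch of this generative process for a single scalar feature, followed by the maximum-likelihood estimates of the five parameters (class frequency, per-class sample mean, per-class sample variance). The parameter values below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical parameter values (not from the notes).
p, mu0, sigma0, mu1, sigma1 = 0.3, -1.0, 1.0, 2.0, 0.5

# Generate: first the label, then the feature from the class-conditional Gaussian.
n = 5000
y = rng.binomial(1, p, size=n)
x = np.where(
    y == 1,
    rng.normal(mu1, sigma1, size=n),
    rng.normal(mu0, sigma0, size=n),
)

# Maximum-likelihood estimates of {p, mu_0, sigma_0^2, mu_1, sigma_1^2}.
p_hat = y.mean()
mu0_hat, var0_hat = x[y == 0].mean(), x[y == 0].var()
mu1_hat, var1_hat = x[y == 1].mean(), x[y == 1].var()

print(p_hat, mu0_hat, var0_hat, mu1_hat, var1_hat)
```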



### generative model for discrete features

suppose two features, each with three possible values (e.g., S, M, L)

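then there are $3^2 = 9$ possible combinations of the two feature values

so the joint distribution of the features within each class has $3^2 - 1 = 8$ degrees of freedom (the probabilities must sum to 1)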

to keep the number of parameters manageable, assume that the features are conditionally independent given the response (Naive Bayes); here this reduces the count from 8 to $2 \cdot (3 - 1) = 4$ per class

$P(X_i \mid y_i) = \prod_{l=1}^L P(X_{il} \mid y_i)$

for each example $i$

* pick $y_i \in \{1, 2, ..., C\}$ from $Discrete(\{p_1, ..., p_{C}\})$
* for $l = 1$ to $L$
    * let $j = y_i$
    * pick $x_{il}$ from $Discrete(\{q_{jl1}, ..., q_{jlD}\})$
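
the sampling procedure above can be sketched as follows. The function name `sample_naive_bayes` and the array layout `q[j, l, k]` are assumptions made for illustration, and classes and feature values are zero-indexed in the code. The probabilities are the ones used in the example that follows.

```python
import numpy as np

def sample_naive_bayes(p, q, n, rng):
    """Draw n examples.

    p: shape (C,), class probabilities.
    q: shape (C, L, D), q[j, l, k] = P(x_l = k | y = j).
    """
    C, L, D = q.shape
    y = rng.choice(C, size=n, p=p)          # pick the class of each example
    X = np.empty((n, L), dtype=int)
    for i in range(n):
        j = y[i]
        for l in range(L):
            X[i, l] = rng.choice(D, p=q[j, l])   # pick feature l given class j
    return y, X

p = np.array([0.9, 0.1])
q = np.array([
    [[0.1, 0.8, 0.1], [0.2, 0.4, 0.4]],   # class 1: feature 1, feature 2
    [[0.3, 0.4, 0.3], [0.2, 0.7, 0.1]],   # class 2: feature 1, feature 2
])

rng = np.random.default_rng(5)
y, X = sample_naive_bayes(p, q, n=2000, rng=rng)

print((y == 0).mean())            # empirical frequency of the first class
print((X[y == 0, 0] == 1).mean()) # empirical P(feature 1 = second value | first class)
```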

**e.g.** 

* let 
    * $C = 2$, $L = 2$, $D = 3$
    * each $X$ can take on values $\{a, b, c\}$
    * $p_1 = .9$, $p_2 = .1$
    * $j = 1$, $l = 1$ $\implies$ $q = \{.1, .8, .1\}$
    * $j = 1$, $l = 2$ $\implies$ $q = \{.2, .4, .4\}$
    * $j = 2$, $l = 1$ $\implies$ $q = \{.3, .4, .3\}$
    * $j = 2$, $l = 2$ $\implies$ $q = \{.2, .7, .1\}$
    * $i$ is the observation index
    * $j$ is the class index
    * $l$ is the feature index
    * $k$ is the feature value index

* suppose we observe
    * $y = 1$ and $X = [a, b]$
    * $y = 1$ and $X = [c, a]$
    * $y = 2$ and $X = [a, a]$

* then
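    * the likelihood is $\prod_i \prod_j \big(p_j \prod_l \prod_k q_{jlk}^{x_{ilk}}\big)^{y_{ij}}$, where $y_{ij} = 1$ if observation $i$ has class $j$ (else 0) and $x_{ilk} = 1$ if feature $l$ of observation $i$ takes value $k$ (else 0)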
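
as a worked check, the sketch below evaluates this likelihood for the three observations in two ways: by direct lookup, and through the one-hot product formula. It assumes each $q$ vector is ordered as $(a, b, c)$, which the notes do not state explicitly. The variable names are illustrative.

```python
import numpy as np

p = np.array([0.9, 0.1])
q = np.array([
    [[0.1, 0.8, 0.1], [0.2, 0.4, 0.4]],   # class 1: feature 1, feature 2
    [[0.3, 0.4, 0.3], [0.2, 0.7, 0.1]],   # class 2: feature 1, feature 2
])
value = {"a": 0, "b": 1, "c": 2}          # assumed ordering of feature values
data = [(1, ["a", "b"]), (1, ["c", "a"]), (2, ["a", "a"])]

# Direct lookup: p_j * prod_l q[j, l, x_il] for each observation.
likelihood = 1.0
for label, features in data:
    j = label - 1
    term = p[j]
    for l, v in enumerate(features):
        term *= q[j, l, value[v]]
    likelihood *= term

# One-hot form: y_ij and x_ilk are indicators, as in the product formula.
Y = np.eye(2)[[label - 1 for label, _ in data]]                          # (n, C)
X = np.eye(3)[[[value[v] for v in features] for _, features in data]]    # (n, L, D)
log_lik = np.einsum("ij,j->", Y, np.log(p)) + np.einsum("ij,ilk,jlk->", Y, X, np.log(q))

print(likelihood)        # ≈ 3.888e-06
print(np.exp(log_lik))   # same value via the one-hot formula
```

the three per-observation terms are $0.9 \cdot 0.1 \cdot 0.4 = 0.036$, $0.9 \cdot 0.1 \cdot 0.2 = 0.018$, and $0.1 \cdot 0.3 \cdot 0.2 = 0.006$, whose product is about $3.888 \times 10^{-6}$.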