#### Review of Linear Regression

- Can extend a linear model to fit nonlinear functions
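
As a minimal sketch of this idea, the model below stays linear in the parameters while using polynomial features of the input. The data, the quadratic ground truth, and the noise level are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
# Hypothetical nonlinear target: a quadratic plus a little noise.
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.05, size=x.shape)

# Design matrix with columns [1, x, x^2]: nonlinear in x, linear in theta.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares fit; lstsq avoids forming the inverse of X^T X explicitly.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)
```

The fitted coefficients should land close to the assumed values `[1, 2, -3]`.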

#### Estimator bias and variance
- $\hat{\theta} = (X^TX)^{-1}X^TY$
- Training on different datasets drawn from the same distribution gives a different estimate $\hat{\theta}$ each time
- The true underlying parameter is $\theta$, and our average estimate is $E[\hat{\theta}]$
- The bias is $E[\hat{\theta}] - \theta$
- The variance is $E[(\hat{\theta} - E[\hat{\theta}])^2]$
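
These definitions can be checked by simulation. The sketch below assumes a hypothetical true parameter vector, a fixed design matrix, and Gaussian noise, then refits the least-squares estimator on many freshly sampled datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])   # hypothetical true parameters
X = rng.normal(size=(30, 2))         # fixed design matrix (assumed)
noise_std = 0.5                      # assumed noise level

estimates = []
for _ in range(2000):
    # New noisy targets each round, same X: a fresh "training set".
    y = X @ theta_true + rng.normal(scale=noise_std, size=30)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
    estimates.append(theta_hat)
estimates = np.array(estimates)

mean_estimate = estimates.mean(axis=0)                        # approximates E[theta_hat]
bias = mean_estimate - theta_true                             # E[theta_hat] - theta
variance = ((estimates - mean_estimate) ** 2).mean(axis=0)    # E[(theta_hat - E[theta_hat])^2]
print("bias:", bias)
print("variance:", variance)
```

Under these assumptions the bias should be near zero, while the variance stays positive and depends on the noise level and on $X$.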

#### Example of bias and variance
- Expectation is a linear operator
- $E[aX] = aE[X]$ and $E[X + Y] = E[X] + E[Y]$
- $m$, the number of coin flips, doesn't affect the bias of the estimator, but it does affect the variance
- e.g. if you flip a coin twice, the estimate varies a lot from one experiment to the next, but if you flip it 3000 times, the estimate is much more consistent
- With fewer data samples, the estimate is more likely to have high variability
- Show that the sample mean estimator has bias 0
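
A short derivation of the zero-bias claim, assuming the $m$ coin flips $x_1, \dots, x_m$ each have mean $E[x_i] = \theta$ and that the estimator is the sample mean $\hat{\theta} = \frac{1}{m}\sum_{i=1}^{m} x_i$ (this notation is an assumption, not taken from the notes):

$$
E[\hat{\theta}] = E\left[\frac{1}{m}\sum_{i=1}^{m} x_i\right] = \frac{1}{m}\sum_{i=1}^{m} E[x_i] = \frac{1}{m} \cdot m\theta = \theta
$$

So the bias $E[\hat{\theta}] - \theta$ is $0$ for any $m$, using only the linearity of expectation.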

#### Variance of sample mean estimator
- Properties of variance: $var(aX) = a^2 var(X)$ and $var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)$; if $X$ and $Y$ are independent, $cov(X, Y) = 0$ and the variances simply add
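
Applying these two properties to the sample mean gives its variance. The derivation assumes $m$ independent samples $x_i$, each with variance $\sigma^2$ (for a coin with heads probability $\theta$, $\sigma^2 = \theta(1-\theta)$):

$$
var(\hat{\theta}) = var\left(\frac{1}{m}\sum_{i=1}^{m} x_i\right) = \frac{1}{m^2}\sum_{i=1}^{m} var(x_i) = \frac{\sigma^2}{m}
$$

The variance shrinks as $1/m$, while the bias does not depend on $m$.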

#### Standard Error
- More samples -> less variability in the mean
- More samples -> smaller and smaller variance of your estimator
- SE of the sample mean = $\sqrt{var(\hat{\theta})}$
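
A small simulation can illustrate how the standard error shrinks with more samples. It assumes a fair coin and compares the empirical spread of the sample mean against $\sqrt{\theta(1-\theta)/m}$; the sample sizes and repeat count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5          # assumed probability of heads
n_repeats = 2000     # independent experiments per sample size

results = {}
for m in (10, 100, 1000):
    flips = rng.binomial(1, theta, size=(n_repeats, m))
    theta_hat = flips.mean(axis=1)                 # one sample mean per experiment
    empirical_se = theta_hat.std()                 # spread of the estimator
    theoretical_se = np.sqrt(theta * (1 - theta) / m)
    results[m] = (empirical_se, theoretical_se)
    print(f"m={m:5d}  empirical SE={empirical_se:.4f}  theory={theoretical_se:.4f}")
```

Each tenfold increase in $m$ should cut the standard error by roughly $\sqrt{10}$.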

#### MSE
- Mean squared error: $MSE(\hat{\theta}) = var(\hat{\theta}) + bias(\hat{\theta})^2$
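
The decomposition can be verified numerically. This sketch uses a deliberately biased estimator (a sample mean shrunk by a factor of 0.9), which is a made-up example rather than anything from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 3.0                                         # assumed true mean
samples = rng.normal(loc=theta, scale=2.0, size=(10000, 20))

# Hypothetical biased estimator: shrink the sample mean toward zero.
theta_hat = 0.9 * samples.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
bias = theta_hat.mean() - theta
variance = theta_hat.var()

print(mse, variance + bias**2)   # the two numbers agree
```

The shrinkage lowers the variance but introduces a bias of about $-0.1\theta$, and both terms show up in the MSE.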

- Higher model complexity -> fits the training data better
- More complex models have more parameters/degrees of freedom
- e.g. if you have $n$ points and are allowed to use $n$ parameters, you can just use a polynomial of degree $n-1$ and fit the whole dataset exactly
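
To illustrate the interpolation point, the sketch below fits a polynomial of degree $n-1$ through $n$ points whose targets are pure noise. The data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = np.linspace(0.0, 1.0, n)     # n distinct inputs
y = rng.normal(size=n)           # arbitrary targets (pure noise)

# Vandermonde matrix with columns [1, x, ..., x^(n-1)]: n parameters for n points.
X = np.vander(x, N=n, increasing=True)
theta = np.linalg.solve(X, y)    # square system, so the fit is exact

train_residual = np.max(np.abs(X @ theta - y))
print(train_residual)            # numerically zero
```

A zero training error here says nothing about generalization, since the targets carry no signal at all.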


#### Training, Validation, and testing data
- Training data: used to learn the parameters of your model
- Validation data: used to optimize the hyperparameters
- Testing data: used to score your model. Use it only once; it gives the final accuracy
- One method of selecting hyperparameters is to pick several candidate settings, train on the training data with each, evaluate on the validation data, and select the hyperparameters that perform best on the validation data
- This risks selecting hyperparameters that only work well on that particular validation set, though. This is why we can also do k-fold CV
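
The selection procedure above might look like the following sketch, where the hyperparameter is the degree of a polynomial fit. The synthetic data, the 60/20/20 split, and the helper names `fit` and `mse` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.shape)   # hypothetical data

# Shuffle once, then split 60/20/20 into train, validation, and test sets.
idx = rng.permutation(len(x))
train, val, test = idx[:180], idx[180:240], idx[240:]

def fit(x, y, degree):
    """Learn the parameters (polynomial coefficients) on the given data."""
    return np.polyfit(x, y, degree)

def mse(coeffs, x, y):
    """Mean squared prediction error of a fitted polynomial."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Train each candidate on the training set, score it on the validation set.
degrees = range(1, 10)
val_errors = {d: mse(fit(x[train], y[train], d), x[val], y[val]) for d in degrees}
best_degree = min(val_errors, key=val_errors.get)

# Touch the test set exactly once, with the chosen hyperparameter.
test_error = mse(fit(x[train], y[train], best_degree), x[test], y[test])
print(best_degree, test_error)
```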

#### Cross-Validation
- Take your training data and divide it into $k$ folds
- Train on $k-1$ folds and validate on the remaining fold
- Repeat this $k$ times, so that each fold serves as the validation set once, and take the average error
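
A minimal k-fold sketch, again using polynomial degree as the hyperparameter. The function name `k_fold_cv_error` and the synthetic data are hypothetical.

```python
import numpy as np

def k_fold_cv_error(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k disjoint index sets
    errors = []
    for i in range(k):
        val_idx = folds[i]                               # held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.shape)   # hypothetical data

cv_errors = {d: k_fold_cv_error(x, y, d) for d in (1, 3, 5)}
print(cv_errors)
```

Every point is used for validation exactly once, so the averaged error depends less on one lucky or unlucky split.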

#### Maximum Likelihood
- Find the params that yield the highest probability of having observed your dataset
- Say we have 3 clusters and want to model each cluster as being generated by a multivariate Gaussian. Each of the 3 Gaussians describes the distribution of the points in its class.
- We can use ML estimation to find the parameters for each cluster that maximize our probability of observing the data
- Basically we have 3 classes: $y_1, y_2, y_3$ and a bunch of data points $(x_i, y_i)$.
- Each cluster is multivariate Gaussian, so it can be modelled by some mean and covariance matrix. 
- Write down the likelihood using independence assumption: $L = \prod_{i=1}^{N} p(x_i,y_i)$ because we observe the pairs as coming from a joint model, and then use the chain rule of probability: $L = \prod_{i=1}^{N} p(y_i) p(x_i | y_i)$. 
- Crucial assumption: $x_i | y_i \sim N(\mu_{y_i}, \Sigma_{y_i})$, i.e. each point is drawn from the Gaussian of its class
- $p(y_i)$ are the probabilities that we assign to our class labels, which in MLE we take to be constant (i.e. we don't impose a prior on the class labels). If we did impose a prior on the class labels, this would be a MAP estimation.
- And then we can substitute this in to expand the likelihood, take the log, and maximize over all 6 parameters (the mean and covariance matrix for each of the 3 classes).
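
For Gaussian class-conditionals, maximizing the log-likelihood has a well-known closed form: the per-class sample mean and the per-class sample covariance (dividing by the class count). The sketch below applies it to synthetic data; the cluster centers and the function name `fit_gaussian_mle` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # hypothetical clusters
X = np.concatenate([rng.normal(loc=m, scale=1.0, size=(200, 2)) for m in true_means])
y = np.repeat(np.arange(3), 200)                              # class labels 0, 1, 2

def fit_gaussian_mle(X, y, n_classes):
    """Closed-form ML estimates of each class's mean and covariance."""
    means, covs = [], []
    for c in range(n_classes):
        Xc = X[y == c]                       # points belonging to class c
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        means.append(mu)
        covs.append(centered.T @ centered / len(Xc))   # MLE divides by N_c, not N_c - 1
    return np.array(means), np.array(covs)

means, covs = fit_gaussian_mle(X, y, 3)
print(means)
```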

#### Inference after training with maximum likelihood
- Scenario: We have estimated $\theta$, the parameters corresponding to the mean and covariance matrices of the 3 Gaussian clusters. Now we are given a new data point $x_i$. We want to predict a label. 
- Solution: $y = argmax_{y_i} p(y_i | x_i)$, i.e. pick the class label with the highest posterior probability given the new point. We can use Bayes' rule to compute this: $p(y_i | x_i) = \frac{p(x_i | y_i)p(y_i)}{p(x_i)}$ where $p(x_i | y_i)$ is the likelihood (of our data given the label), $p(y_i)$ is our prior belief of the label, and $p(x_i)$ is known as the evidence.
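
A sketch of this prediction step, working in log space for numerical stability. The parameter values stand in for already-estimated means and covariances, and the helper names `log_gaussian` and `predict` are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian at x."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def predict(x, priors, means, covs):
    """Return the argmax label and the full posterior p(y | x)."""
    # log p(x | y) + log p(y) for each class
    log_joint = np.array([
        np.log(p) + log_gaussian(x, m, S) for p, m, S in zip(priors, means, covs)
    ])
    # Subtract log p(x), computed with the log-sum-exp trick.
    shift = log_joint.max()
    log_evidence = shift + np.log(np.exp(log_joint - shift).sum())
    posterior = np.exp(log_joint - log_evidence)
    return int(np.argmax(posterior)), posterior

# Assumed parameters for 3 classes (placeholders for trained values).
priors = np.array([1 / 3, 1 / 3, 1 / 3])
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
covs = np.array([np.eye(2)] * 3)

label, posterior = predict(np.array([3.5, 0.2]), priors, means, covs)
print(label, posterior)
```

Because the evidence $p(x_i)$ is the same for every class, it does not change the argmax; it only matters if the normalized posterior itself is needed.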

#### Summary of Bayes' Rule
- In Bayesian inference, we want to calculate $p(\theta | D)$, i.e. a posterior probability distribution of our parameters given the data that we observe. We can then establish parameters for our model either by sampling from this distribution or by taking its mean or covariance.
- In the maximum likelihood (or frequentist) setting, we want to calculate $p(D | \theta)$, or the probability of observing our dataset given some parameters (this probability is known as the likelihood). 
    - Crucially, here, we **do not** treat $\theta$ as a random variable. We instead compute a point estimate: the $\theta$ that makes the data most likely to have been observed.
    - We just want to estimate the probability of the data given some setting of the parameters, and maximize it over those parameters. We're not concerned about our prior knowledge on those parameters or the dataset (i.e. the prior on the parameters and the evidence of our dataset are taken to be constants).

- Instead of MLE, we could use MAP estimation, where we place a prior on the parameters which we wish to estimate. In particular, we maximize the posterior $p(\theta | D) \propto p(D | \theta)p(\theta)$ (again taking the evidence to be constant). For example, we could put a Gaussian prior on our variable $\theta$ and assume that our data are drawn from a Bernoulli parametrized by $\theta$. Then the maximization problem would be $argmax_{\theta} \prod_{i=1}^{N}\theta^{x_i}(1-\theta)^{1-x_i} N(\theta; 0, \sigma^2)$.
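
The Bernoulli example with a zero-mean Gaussian prior can be solved approximately by a grid search over $\theta$. The data, the prior width `sigma`, and the grid resolution are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)     # hypothetical coin-flip data
sigma = 0.5                           # assumed prior standard deviation

# Candidate values of theta, kept strictly inside (0, 1) so the logs are finite.
thetas = np.linspace(0.001, 0.999, 999)

heads = x.sum()
tails = len(x) - heads
log_lik = heads * np.log(thetas) + tails * np.log(1.0 - thetas)
log_prior = -0.5 * (thetas / sigma) ** 2      # log N(theta; 0, sigma^2), constants dropped

theta_mle = thetas[np.argmax(log_lik)]
theta_map = thetas[np.argmax(log_lik + log_prior)]
print(theta_mle, theta_map)
```

Since the prior is centered at 0, the MAP estimate is pulled below the MLE; a larger `sigma` weakens that pull.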


#### Chain Rule for probability
- $p(a, b) = p(a | b) p(b) = p(a)p(b | a)$
- $p(a, b, c) = p(a | b, c) p(b, c) = p(a | b, c) p(b | c) p(c)$
- $p(b, c, d, e) = p(b, c | d, e) p(d, e) = p(b, c | d, e) p(e | d)p(d)$, so $p(b, c | d, e) = \frac{p(b,c,d,e)}{p(d)p(e|d)}$
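
The three-variable factorization can be checked on a small discrete joint distribution. The table below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 3, 4))
joint /= joint.sum()                  # p(a, b, c), axes ordered (a, b, c)

p_c = joint.sum(axis=(0, 1))          # p(c), shape (4,)
p_bc = joint.sum(axis=0)              # p(b, c), shape (3, 4)
p_b_given_c = p_bc / p_c              # p(b | c) = p(b, c) / p(c)
p_a_given_bc = joint / p_bc           # p(a | b, c) = p(a, b, c) / p(b, c)

# Chain rule: p(a, b, c) = p(a | b, c) p(b | c) p(c)
reconstructed = p_a_given_bc * p_b_given_c * p_c
print(np.allclose(reconstructed, joint))
```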

