#### Setting: Supervised Machine Learning, Regression
- Learn a function that predicts real-valued, continuous outputs $y$ from $x$, i.e. $y = f(x)$
- The model $f$ is parametrized by a vector of parameters $\theta$
- Find parameters $\theta$ to optimize the objective function

#### Regression Example
- Consider a dataset with only one feature, e.g. the square footage of a house, from which we want to predict the price.
- We want to fit a model $\hat{y} = ax + b$ or $\hat{y} = \theta^T \hat{x}$ where $\theta = (a, b)$ and $\hat{x} = (x, 1)$
- Use training examples to find the "best" $\theta$ that minimizes some cost function, such as the squared error: $L = \frac{1}{2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
- Solution: set $\frac{dL}{d\theta} = 0$, solve for $\theta$, and verify that the solution is a minimum (the squared error is convex in $\theta$, so any stationary point is a global minimum).
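
The fit described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the square footages, prices, and noise level are made-up synthetic values, and `fit_line` is a hypothetical helper name. It uses NumPy's `np.linalg.lstsq` to minimize the squared error.

```python
import numpy as np

def fit_line(x, y):
    """Fit y_hat = a * x + b by least squares; returns (a, b)."""
    # Augment each input with a constant 1 so that theta = (a, b).
    X_hat = np.column_stack([x, np.ones_like(x)])
    theta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return theta[0], theta[1]

# Assumption: synthetic data, price = 150 * sqft + 20000 plus Gaussian noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(500.0, 3000.0, size=200)
price = 150.0 * sqft + 20000.0 + rng.normal(0.0, 10000.0, size=200)

a, b = fit_line(sqft, price)
print(f"slope a = {a:.1f}, intercept b = {b:.1f}")
```

The recovered slope and intercept should land near the values used to generate the data, with the gap shrinking as more examples are added.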

#### Solving Least Squares
 - We have $L = \frac{1}{2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 = \frac{1}{2}(Y - X\theta)^T(Y - X\theta)$
 - Show that setting $\frac{dL}{d\theta} = 0$ leads to finding the least squares estimator: $\theta = (X^TX)^{-1} X^TY$
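
One way to carry out this derivation, assuming $X$ is the matrix whose rows are the augmented inputs $\hat{x}_i^T$, $Y$ is the vector of targets, and $X^TX$ is invertible:

$$
\begin{aligned}
L &= \frac{1}{2}(Y - X\theta)^T(Y - X\theta) = \frac{1}{2}\left(Y^TY - 2\theta^TX^TY + \theta^TX^TX\theta\right) \\
\frac{dL}{d\theta} &= -X^TY + X^TX\theta = 0 \\
X^TX\theta &= X^TY \quad\Rightarrow\quad \theta = (X^TX)^{-1}X^TY
\end{aligned}
$$

The second derivative is $X^TX$, which is positive semidefinite, so this stationary point is a minimum.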
 
#### Higher Degree Polynomials
- Can generalize to higher degree polynomials
- i.e. $y = b + a_1x + a_2x^2 + \dots + a_nx^n$
- We can fit $n$ data points (with distinct $x$ values) exactly with a polynomial of degree $n-1$
- Polynomial models have more capacity than linear models, so they can fit the data better, and they can be made arbitrarily complex
- However, this can easily lead to overfitting since a high degree polynomial may not capture the general trend in the overall data generating distribution, and instead just fit the noise in the data. 
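
A small experiment can make this concrete. The sketch below assumes a made-up data-generating process (a linear trend plus Gaussian noise) and uses NumPy's `Polynomial.fit`; `fit_and_score` and `sample` are hypothetical helper names. It compares a degree-1 fit with a degree-9 fit on 10 training points.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    return train_mse, test_mse

# Assumption: the true relationship is linear with additive Gaussian noise.
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=n)

x_train, y_train = sample(10)
x_test, y_test = sample(1000)

results = {}
for degree in (1, 9):  # degree 9 = n - 1 can pass through all 10 training points
    results[degree] = fit_and_score(x_train, y_train, x_test, y_test, degree)
    train_mse, test_mse = results[degree]
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The higher-degree fit should show a lower training error but a higher test error, which is the signature of fitting noise rather than the underlying trend.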

#### Overfitting
- Models with low training error but high test error are overfit
- This means that parameters that don't generalize well were learned
- Simpler models, regularization, early stopping, and larger datasets are ways to prevent overfitting; a holdout/validation dataset lets us detect it.

#### Bias-Variance Tradeoff
- Overfitting/underfitting are closely related to the bias-variance tradeoff
- Generally, overfit models have low bias and high variance, while underfit models have low variance and high bias. 
- Imagine that there exists a true function $f(x)$ that perfectly maps inputs to outputs (i.e. this is the function for the data generating distribution, which we want to approximate as closely as possible). We can sample the data several times and fit a model to each sample, giving $\hat{f}_1(x), \dots, \hat{f}_n(x)$, whose expectation is $E[\hat{f}(x)]$
- The bias is how much our average prediction differs from the true value: $E[\hat{f}(x)] - f(x)$
- The variance of our model is the variability across the models that we've trained (i.e., how much do the models differ from each other? If this is large, it intuitively means that the model we learn is highly dependent on the specific data sample that we train on, which indicates poor generalization capability): $E[(\hat{f}(x) - E[\hat{f}(x)])^2]$
- An **estimator** $\hat{\theta}$ is a function of the data that produces a single best estimate of a true underlying parameter $\theta$
- For example, if we have $N$ iid samples from a Bernoulli distribution with parameter $\theta$, we can calculate a point estimate $\hat{\theta}$ (such as the sample mean)
- The estimator can be thought of as a random variable which we can take an expectation/variance over, since it is generated by random variables that we sample from an (unknown) probability distribution. 
- Machine learning is mostly trying to do parameter estimation of a probability distribution that we don't know about, given a bunch of data samples from the probability distribution.
- An estimator is called unbiased if $E[\hat{\theta}] - \theta = 0$.
- Even if an estimator is unbiased, any single $\hat{\theta}$ may deviate from the true parameter $\theta$. This variability is due to the data samples that we used to approximate $\theta$. 
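
The bias and variance of a model can also be estimated empirically by training many models on independent samples. The sketch below is an illustration under stated assumptions: the "true" function $\sin(2\pi x)$, the noise level, and the helper name `bias_and_variance` are all invented for this example, and polynomial degree stands in for model capacity.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    # Assumption: a made-up "true" function, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

def bias_and_variance(degree, x0, n_models=500, n_points=20, noise_std=0.3, seed=0):
    """Estimate bias and variance at x0 of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_models)
    for k in range(n_models):
        # Each model is trained on a fresh sample from the same distribution.
        x = rng.uniform(0.0, 1.0, size=n_points)
        y = true_f(x) + rng.normal(0.0, noise_std, size=n_points)
        preds[k] = Polynomial.fit(x, y, degree)(x0)
    bias = float(preds.mean() - true_f(x0))  # E[f_hat(x0)] - f(x0)
    variance = float(preds.var())            # E[(f_hat(x0) - E[f_hat(x0)])^2]
    return bias, variance

for degree in (0, 3, 9):
    bias, variance = bias_and_variance(degree, x0=0.25)
    print(f"degree {degree}: bias = {bias:+.3f}, variance = {variance:.3f}")
```

The constant (degree-0) model should show a large bias with little variance, while the degree-9 model should show the opposite pattern.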

#### Bias and Variance examples
- Let $x_1, \dots, x_m$ be iid samples from a Bernoulli distribution with mean $\theta$. Show that the sample mean estimator given by $\hat{\theta} = \frac{1}{m} \sum_{i=1}^{m} x_i$ is unbiased.
- Remember that in a Bernoulli distribution we have $x \in \{0, 1\}$ and $p(x | \theta) = \theta^x(1 - \theta)^{(1-x)}$.
- Show that the variance of the estimator is $\frac{\theta(1-\theta)}{m}$ (use the property that $Var[x] = E[x^2] - E[x]^2$).
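
A sketch of both results, using $E[x_i] = \theta$, the fact that $x_i^2 = x_i$ because $x_i$ only takes the values 0 and 1, and independence of the samples (so the variance of the sum is the sum of the variances):

$$
\begin{aligned}
E[\hat{\theta}] &= \frac{1}{m}\sum_{i=1}^{m} E[x_i] = \frac{1}{m} \cdot m\theta = \theta \\
Var[x_i] &= E[x_i^2] - E[x_i]^2 = \theta - \theta^2 = \theta(1-\theta) \\
Var[\hat{\theta}] &= \frac{1}{m^2}\sum_{i=1}^{m} Var[x_i] = \frac{\theta(1-\theta)}{m}
\end{aligned}
$$

Since $E[\hat{\theta}] - \theta = 0$, the sample mean is unbiased, and its variance shrinks as $m$ grows.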

#### Mean-Squared Error
- The MSE of an estimator, $E[(\hat{\theta} - \theta)^2]$, decomposes into variance plus squared bias: $Var(\hat{\theta}) + Bias(\hat{\theta})^2$. When predicting noisy targets, the expected squared prediction error additionally contains an irreducible error term $\sigma^2$, the variance of the noise (see slide 24 [here](https://seas.ucla.edu/~kao/nndl/lectures/ml-basics.pdf) for a full derivation)
- Bias and variance therefore usually trade off against each other: reducing one tends to increase the other. For example, we could have 0 variance if we just predicted the same output all of the time, but then our bias would be very large. On the other hand, a deep neural network could fit all of the training data perfectly, but be prone to overfitting (especially without the use of regularization techniques such as L2, dropout, or batch norm).
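
The decomposition of an estimator's MSE into variance plus squared bias can be checked numerically. The sketch below reuses the Bernoulli setting from the previous section. The values of `theta`, `m`, and `n_trials` are arbitrary assumptions, and `mse_decomposition` is a hypothetical helper name. It compares the sample mean with an estimator that always predicts 0.5.

```python
import numpy as np

def mse_decomposition(estimates, theta):
    """Return (mse, bias, variance) computed from repeated estimates of theta."""
    bias = float(estimates.mean() - theta)
    variance = float(estimates.var())
    mse = float(np.mean((estimates - theta) ** 2))
    return mse, bias, variance

# Assumption: arbitrary illustrative values for theta, m, and the trial count.
theta, m, n_trials = 0.2, 10, 100_000
rng = np.random.default_rng(0)
samples = rng.binomial(1, theta, size=(n_trials, m))

estimators = {
    "sample mean": samples.mean(axis=1),     # unbiased, nonzero variance
    "constant 0.5": np.full(n_trials, 0.5),  # zero variance, large bias
}
for name, estimates in estimators.items():
    mse, bias, variance = mse_decomposition(estimates, theta)
    print(f"{name}: mse = {mse:.4f}, bias^2 + var = {bias**2 + variance:.4f}")
```

For both estimators the two printed quantities should agree, and the constant estimator shows how zero variance can still give a large MSE through bias alone.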

#### Choosing a Model
- There are a few different information criteria that can be used to penalize model complexity when we don't leave data out to test on: the Akaike information criterion, the Bayesian information criterion, and the deviance information criterion.
- More commonly, we set aside a validation dataset and a test dataset.
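
Validation-based model selection can be sketched as follows. This is a minimal example under assumptions: the data are synthetic (a cubic trend plus noise), the 100/100/100 split is arbitrary, and `select_degree` is a hypothetical helper name. The polynomial degree plays the role of the model being chosen.

```python
import numpy as np
from numpy.polynomial import Polynomial

def select_degree(x_train, y_train, x_val, y_val, degrees):
    """Return the degree with the lowest validation MSE, plus all scores."""
    scores = {}
    for degree in degrees:
        poly = Polynomial.fit(x_train, y_train, degree)
        scores[degree] = float(np.mean((poly(x_val) - y_val) ** 2))
    best = min(scores, key=scores.get)
    return best, scores

# Assumption: synthetic data from a cubic trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
y = x**3 - 0.5 * x + rng.normal(0.0, 0.1, size=300)
x_train, y_train = x[:100], y[:100]
x_val, y_val = x[100:200], y[100:200]
x_test, y_test = x[200:], y[200:]

best, scores = select_degree(x_train, y_train, x_val, y_val, range(0, 10))
# The test set is used only once, after the degree has been chosen.
final = Polynomial.fit(x_train, y_train, best)
test_mse = float(np.mean((final(x_test) - y_test) ** 2))
print(f"selected degree {best}, test MSE = {test_mse:.4f}")
```

Keeping the test set out of the selection loop means the final test MSE remains an honest estimate of how the chosen model generalizes.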


