# Linear Regression

In regression, we aim to find a function $f$ that maps inputs $x \in \mathbb{R}^D$ to corresponding function values $f(x) \in \mathbb{R}$. We assume we are given a set of training inputs $x_n$ and corresponding noisy observations $y_n = f(x_n) + \epsilon$, where $\epsilon$ is an i.i.d. random variable that describes measurement/observation noise and potentially unmodeled processes. Our task is to find a function that not only models the training data, but also generalizes well to predicting function values at input locations that are not part of the training data. An illustration of such a regression problem is shown in the figure below:

<img src="attachment:5b55346c-ade1-4f76-a154-3abf84bca8d0.png" style="width: 600px">

_Left: Regression problem: observed noisy function values from which we wish to infer the underlying function that generated the data_

_Right: Regression solution: possible function that could have generated the data (blue) with indication of the measurement noise of the function value at the corresponding inputs (orange distributions)_

For some input values $x_n$, we observe (noisy) function values $y_n = f(x_n) + \epsilon$. The task is to infer the function $f$ that generated the data and generalizes well to function values at new input locations.

Finding a regression function requires solving a variety of problems, including the following:

- **Choice of the model (type) and the parametrization** of the regression function. Given a dataset, what function classes (e.g., polynomials) are good candidates for modeling the data, and what particular parametrisation (e.g., degree of the polynomial) should we choose? Model selection, as discussed previously, allows us to compare various models to find the simplest model that explains the training data reasonably well.
- **Finding good parameters.** Having chosen a model of the regression function, how do we find good model parameters? Here, we will need to look at different loss/objective functions (they determine what a "good" fit is) and optimisation algorithms that allow us to minimise this loss.
- **Overfitting and model selection.** Overfitting is a problem when the regression function fits the training data "too well" but does not generalise to unseen test data. Overfitting typically occurs if the underlying model (or its parametrization) is overly flexible and expressive. We will look at the underlying reasons and discuss ways to mitigate the effect of overfitting in the context of linear regression.
- **Relationship between loss functions and parameter priors.** Loss functions (optimisation objectives) are often motivated and induced by probabilistic models. We will look at the connection between loss functions and the underlying prior assumptions that induce these losses.
- **Uncertainty modeling.** In any practical setting, we have access to only a finite, potentially large, amount of (training) data for selecting the model class and the corresponding parameters. Given that this finite amount of training data does not cover all possible scenarios, we may want to describe the remaining parameter uncertainty to obtain a measure of confidence of the model’s prediction at test time; the smaller the training set, the more important uncertainty modeling becomes. Consistent modeling of uncertainty equips model predictions with confidence bounds.
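The overfitting item in the list above can be made concrete with a small experiment. The following is a minimal sketch, not a definitive implementation: the ground-truth function `f`, the noise level `sigma`, the sample sizes, and the polynomial degrees are all assumptions chosen purely for illustration. It fits polynomials of increasing degree by least squares and reports the error on the training set and on a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

sigma = 0.2  # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 10)
y_train = f(x_train) + rng.normal(0.0, sigma, size=x_train.shape)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = f(x_test) + rng.normal(0.0, sigma, size=x_test.shape)

def rmse(y_true, y_pred):
    # Root-mean-square error between targets and predictions.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_err[degree] = rmse(y_train, poly(x_train))
    test_err[degree] = rmse(y_test, poly(x_test))
    print(degree, train_err[degree], test_err[degree])
```

With 10 training points, the degree-9 polynomial can pass through every training point, so its training error is essentially zero. A low training error of this kind says little about the error on unseen inputs, which is the quantity we actually care about.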

## Problem Formulation

Because of the presence of observation noise, we will adopt a probabilistic approach and explicitly model the noise using a likelihood function. More specifically, throughout this chapter, we consider a regression problem with the likelihood function

\begin{equation}
p(y|x) = \mathcal{N}(y|f(x), \sigma^2)
\end{equation}

Here, $x \in \mathbb{R}^D$ are inputs and $y \in \mathbb{R}$ are noisy function values (targets). With the above likelihood, the functional relationship between $x$ and $y$ is given as

\begin{equation}
y = f(x) + \epsilon
\end{equation}

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent, identically distributed (i.i.d.) Gaussian measurement noise with mean 0 and variance $\sigma^2$. Our objective is to find a function that is close (similar) to the unknown function $f$ that generated the data and that generalizes well.
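To make the noise model concrete, the following minimal sketch draws observations according to $y = f(x) + \epsilon$. The function `f`, the value of `sigma`, the input range, and the sample size `N` are assumptions made only for illustration; in a real regression problem $f$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def f(x):
    # Hypothetical "true" function; in practice f is unknown.
    return np.sin(x)

sigma = 0.2                            # assumed known noise standard deviation
N = 1000                               # number of training points
x = rng.uniform(-3.0, 3.0, size=N)     # training inputs x_n
eps = rng.normal(0.0, sigma, size=N)   # i.i.d. noise, eps ~ N(0, sigma^2)
y = f(x) + eps                         # noisy observations y_n

# The residuals y - f(x) are exactly the noise samples, so their mean
# should be close to 0 and their variance close to sigma^2.
residuals = y - f(x)
print(residuals.mean(), residuals.var())
```

Note that `rng.normal` takes the standard deviation $\sigma$, not the variance $\sigma^2$, as its scale argument.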

In this chapter, we focus on parametric models, i.e., we choose a parametrized function and find parameters $\theta$ that "work well" for modeling the data. For the time being, we assume that the noise variance $\sigma^2$ is known and focus on learning the model parameters $\theta$. In linear regression, we consider the special case that the parameters $\theta$ appear linearly in our model:

\begin{equation}
p(y|x, \theta) = \mathcal{N}(y|x^\top \theta, \sigma^2) \iff y = x^\top \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\end{equation}

For $x, \theta \in \mathbb{R}$, the linear regression model describes straight lines (linear functions), and the parameter $\theta$ is the slope of the line.

<img src="attachment:de72e7fd-bb45-4b56-a196-fbbefcfee2a0.png" style="width: 800px">

_Left: Example functions (straight lines) that can be described using the linear model_

_Middle: Training Set_

_Right: Maximum likelihood estimate_
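The three panels can be mimicked with a short sketch for scalar $x$ and $\theta$. The slopes, the true parameter `theta_true`, the noise level `sigma`, and the training set below are hypothetical values chosen for illustration and are not the data behind the figure. The last step uses the fact that, under Gaussian noise, maximizing the likelihood is equivalent to least squares, which for this one-parameter model has the closed-form minimizer $\sum_n x_n y_n / \sum_n x_n^2$.

```python
import numpy as np

def predict(x, theta):
    # Linear model with scalar input and parameter: f(x) = x * theta.
    return x * theta

x_grid = np.linspace(-5.0, 5.0, 11)

# Each slope theta gives one straight line through the origin.
for theta in (-1.0, 0.5, 2.0):
    print(theta, predict(x_grid, theta)[:3])

# Linearity in the parameter: f(x; a*t1 + b*t2) = a*f(x; t1) + b*f(x; t2).
a, b, t1, t2 = 2.0, -3.0, 0.5, 1.5
lhs = predict(x_grid, a * t1 + b * t2)
rhs = a * predict(x_grid, t1) + b * predict(x_grid, t2)

# Illustrative training set generated with a hypothetical true slope.
rng = np.random.default_rng(1)
theta_true, sigma = 0.7, 0.5
x_train = rng.uniform(-5.0, 5.0, size=200)
y_train = predict(x_train, theta_true) + rng.normal(0.0, sigma, size=200)

# Maximum likelihood (least-squares) estimate of the slope.
theta_ml = np.sum(x_train * y_train) / np.sum(x_train ** 2)
print(theta_ml)
```

The estimate `theta_ml` should land close to `theta_true`, with the gap shrinking as the number of training points grows or the noise level drops.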

## Parameter Estimation
