# Linear Regression

In [2]:
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l

## Basics

Regression covers any problem in which we want to predict a numerical value from some number of features. In the following example, assume we wish to predict the price (in dollars) of houses based on their size (in sq. feet) and age (in years). First we have to get our hands on some training data. Each row of this training set is an example. The house prices in this training set are the labels/targets. The size and age are the features/covariates.

Linear regression is one of the simplest and most popular standard tools for regression. Assume the relationship between the features $\mathbf{x}$ and the target $y$ is approximately linear, but allow for some random noise, which we assume to be Gaussian.

Use $n$ to represent the number of _examples_ in our training dataset. Use superscripts to enumerate samples and targets, and subscripts to index coordinates. $\mathbf{x}^{(i)}$ indicates the vector of features for the $i$th sample/example. $x^{(i)}_j$ indicates its $j$th coordinate/feature. 
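As a small illustration of this notation, the sketch below stores a toy dataset in a NumPy array and picks out one example and one coordinate. The array names and all numbers are made up for illustration. Note that the code is 0-indexed while the text counts from 1.

```python
import numpy as np

# Hypothetical toy dataset: n = 3 houses, d = 2 features
# (column 0: size in sq. feet, column 1: age in years).
X = np.array([[1500.0, 10.0],
              [2000.0, 5.0],
              [900.0, 30.0]])
# Made-up prices in dollars, one label per example.
y = np.array([240000.0, 317500.0, 140000.0])

n, d = X.shape   # number of examples, number of features
x_i = X[1]       # feature vector of the 2nd example, i.e. x^(2) in the text
x_ij = X[1, 0]   # its 1st coordinate (size), i.e. x^(2)_1 in the text
y_i = y[1]       # the matching target y^(2)
```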

### Model

The assumption of linearity means that the expected value of the target may be expressed as a weighted sum of the features, that is:

$$ \hat{y}^{(i)} = w_{age}x^{(i)}_{age} + w_{sqft}x^{(i)}_{sqft} + b$$

Here the $w$ terms are the weights, which are shared across all examples, and $b$ is the bias, or offset: the value of the prediction when all features are zero, i.e. the intercept of the linear model with the $y$ axis. Although there is no "real" house with a square footage of 0, the bias is still needed so that the model can fit the data correctly; without it the fitted line would be forced through the origin. Presumably our assumption of linearity would break down outside the range of normal house sizes.
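To make the weighted sum concrete, here is a worked example for a single house. The weights, bias, and feature values are hypothetical numbers chosen only so the arithmetic is easy to follow; they are not fitted to any data.

```python
# Hypothetical parameters (not learned from data).
w_age = -500.0    # dollars per year of age
w_sqft = 150.0    # dollars per square foot
b = 20000.0       # bias: the prediction when both features are zero

# Hypothetical features of one house.
x_age = 10.0      # years
x_sqft = 1500.0   # sq. feet

# Weighted sum of the features plus the bias.
predicted_price = w_age * x_age + w_sqft * x_sqft + b
print(predicted_price)  # -5000 + 225000 + 20000 = 240000.0
```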

Our task, given a particular dataset, is to find the weights $\mathbf{w}$ and bias $b$ such that the difference between the predicted values $\hat{y}$ and the true/observed values $y$ is minimised.

In ML we typically use a more compact notation. When our inputs consist of $d$ features, we assign each an index between 1 and $d$ rather than a name, and express our prediction as $\hat{y}$. The equation above then becomes

$$ \hat{y} = w_1x_1 + w_2x_2 + \dots + w_dx_d + b $$

By collecting all the features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all the weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express this more compactly

$$ \hat{y} = \mathbf{w}^{\intercal}\mathbf{x} + b$$  

Here $\mathbf{x}$ refers to the features of a single example. In general it is more convenient to express the model in terms of the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, which has one row per example and one column per feature.

In this case, the vector of predictions $\mathbf{\hat{y}} \in \mathbb{R}^{n}$ for all our examples can be expressed as a matrix-vector product

$$ \mathbf{\hat{y}} = \mathbf{X}\mathbf{w} + b$$

Here broadcasting adds the scalar $b$ to each of the $n$ entries of $\mathbf{X}\mathbf{w}$.
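The single-example and matrix forms can be checked against each other numerically. The following is a minimal sketch using NumPy with made-up weights and features (the variable names are assumptions, not part of the original notebook). It computes one prediction as a dot product, then all predictions as a matrix-vector product, and compares the result with an explicit loop over rows.

```python
import numpy as np

# Hypothetical data: n = 3 examples, d = 2 features (columns: age, sqft).
X = np.array([[10.0, 1500.0],
              [5.0, 2000.0],
              [30.0, 900.0]])
w = np.array([-500.0, 150.0])   # hypothetical weights, shape (d,)
b = 20000.0                     # hypothetical scalar bias

# Single example: y_hat = w^T x + b
y_hat_single = w @ X[0] + b

# All examples at once: y_hat = X w + b, with b broadcast over the n entries.
y_hat = X @ w + b

# Reference: the same computation done one row at a time.
y_hat_loop = np.array([w @ x + b for x in X])

print(y_hat)  # [240000. 317500. 140000.]
```

The vectorised form avoids the Python-level loop, which is the main reason for writing the model in terms of $\mathbf{X}$.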

The goal, given some observed features $\mathbf{X}$ with associated labels $\mathbf{y}$, is to find the weights $\mathbf{w}$ and bias $b$ such that the predicted labels $\mathbf{\hat{y}}$ for new data are as close as possible to the real labels, or, in other words, such that the error of the predicted values is minimised.

Even in an idealised case where the underlying relation truly is linear, we still would not expect the prediction error for every example to be 0, due to measurement error and other sources of noise inherent in real-world data. Therefore, it is important to incorporate a noise term to account for this.
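The effect of the noise term can be seen by generating synthetic data from a known linear model. The sketch below assumes arbitrary "true" parameters and a small Gaussian noise scale, all chosen for illustration. Even when predictions use the exact parameters that generated the data, the residuals are not zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 1000
true_w = np.array([2.0, -3.4])  # arbitrary "true" weights (assumption)
true_b = 4.2                    # arbitrary "true" bias (assumption)

X = rng.normal(size=(n, 2))                       # random features
noise = rng.normal(loc=0.0, scale=0.01, size=n)   # Gaussian noise term
y = X @ true_w + true_b + noise                   # noisy labels

# Predictions from the exact generating parameters still miss by the noise.
y_hat = X @ true_w + true_b
residual = y - y_hat
```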

Before we can go about trying to find the best _parameters_ $\mathbf{w}$ and $b$, we need two things:
1. A way to measure the performance of the model.
2. Some way for updating the parameters to improve the model accuracy.

### Loss function